Information Organization & Retrieval | Automatic Classification Schemes


Pasquale J. Festa
INF384C – Organizing and Providing Access to Information
Prof. Efron
4.5.2007
Automatic Classification Assignment

1. Rule: Classify all articles with 4 or more authors as medical.

The histogram shows that medical and non-medical articles start to mix in the 4 to 5 author range. While my rule will allow a number of non-medical articles to creep into the medical class, this is necessary because one cannot have fractions of an author (the histogram makes it seem as if one can, but only because the numbers we see are [1] made up and [2] most likely averages). Approximating where documents lie in the histogram, it appears that a small proportion of medical articles lie just below the 5 author mark. If I made 5 my cutoff point, I would be misclassifying those medical documents as non-medical documents. By making my cutoff point 4, I am instead misclassifying a number of non-medical documents as medical documents. As my job is to index medical documents, I feel that the lesser of the two evils is to misclassify non-medical documents as medical, since having non-applicable information would do less harm to my company than missing much-needed medical information.

2.

Medical documents: Classified correctly 100% (100 out of 100)
Non-Medical documents: Classified correctly 83% (83 out of 100)
Overall percentage of correct classification: 91.5% (183 out of 200)
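As a rough sketch of the arithmetic behind these figures, the rule and the tally can be written in Python. The function names and the `cutoff` parameter are my own illustration and not part of the assignment; the counts are the ones reported above, since the underlying author counts are not reproduced here.

```python
def classify(num_authors, cutoff=4):
    """Label an article 'medical' if it has at least `cutoff` authors.

    `cutoff` is a hypothetical parameter name; 4 is the rule chosen above.
    """
    return "medical" if num_authors >= cutoff else "non-medical"


def percent_correct(correct, total):
    """Percentage of documents classified correctly."""
    return 100.0 * correct / total


# Counts reported above (not recomputed from raw data).
medical_correct, medical_total = 100, 100
non_medical_correct, non_medical_total = 83, 100

medical_pct = percent_correct(medical_correct, medical_total)
non_medical_pct = percent_correct(non_medical_correct, non_medical_total)
overall_pct = percent_correct(
    medical_correct + non_medical_correct,
    medical_total + non_medical_total,
)

print(classify(3), classify(4))
print(medical_pct, non_medical_pct, overall_pct)
```

Moving `cutoff` to 5 would trade the non-medical errors for medical ones, which is the choice weighed in part 1.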

3.

If we assume that the distribution of data we have is an unbiased estimator of unforeseen data, then we would be assuming that every set of 200 new documents will follow this same pattern. However, because there is a section where documents overlap (the 4 to 6 author range), our model runs the risk of adding too many non-medical documents to the medical class. In the next group of 200, 17 more would be misclassified, and we would then have 34 non-medical documents in our medical class. With 34 of the 234 documents in the medical class actually being non-medical, we start to run into trouble. Despite our model being completely accurate in classifying medical documents, it adds unnecessary information to our document set. Our new class would end up containing 34 non-
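The projection over further batches can be sketched as follows. This is only an illustration under the assumption stated in this answer, namely that every new batch of 200 repeats the same pattern (100 medical documents classified correctly and 17 non-medical documents misclassified as medical); the function and parameter names are hypothetical.

```python
def medical_class_after(batches, medical_per_batch=100, false_positives_per_batch=17):
    """Project the contents of the medical class after `batches` sets of 200.

    Assumes each batch repeats the pattern reported in part 2.
    Returns (non-medical count, class size, non-medical share in percent).
    """
    true_medical = batches * medical_per_batch
    non_medical = batches * false_positives_per_batch
    class_size = true_medical + non_medical
    share = 100.0 * non_medical / class_size
    return non_medical, class_size, share


for batches in (1, 2, 3):
    non_medical, class_size, share = medical_class_after(batches)
    print(f"{batches} batch(es): {non_medical} of {class_size} "
          f"non-medical ({share:.1f}%)")
```

Under this assumption the number of non-medical documents in the medical class grows with every batch, while their share of the class stays the same.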

