Document Categorization. Problem: given a collection of documents and a taxonomy of subject areas. Classification: determine the subject area(s) most pertinent to each document.


1 Document Categorization. Problem: given a collection of documents and a taxonomy of subject areas. Classification: determine the subject area(s) most pertinent to each document. Indexing: select a set of keywords (index terms) appropriate to each document.

2 Classification Techniques. Manual (a.k.a. knowledge engineering): typically rule-based expert systems. Machine learning: probabilistic (e.g., naïve Bayes); decision structures (e.g., decision trees); profile-based, which compares a document to the profile(s) of the subject classes using similarity rules like those employed in information retrieval (I.R.); support vector machines (SVM).

3 Machine Learning Procedures. Usually train-and-test: exploit an existing collection in which documents have already been classified, using one portion as the training set and another portion as the test set. This permits measurement of classifier effectiveness and allows tuning of classifier parameters to yield maximum effectiveness. Single- vs. multi-label: can one document be assigned to multiple categories?
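The train-and-test procedure on this slide can be sketched in a few lines of Python. This is only an illustration: the `train_fraction` value, the toy collection, and the helper names are assumptions, not details from the presentation.

```python
import random

def train_test_split(labeled_docs, train_fraction=0.67, seed=0):
    """Shuffle already-classified documents and split them into two portions."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)          # fixed seed keeps the split repeatable
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]              # (training set, test set)

def is_multi_label(labeled_docs):
    """True if any document is assigned to more than one category."""
    return any(len(labels) > 1 for _, labels in labeled_docs)

# Hypothetical toy collection: (document id, set of subject-area codes).
collection = [(f"doc{i}", {"07"} if i % 2 else {"07", "08"}) for i in range(12)]
train, test = train_test_split(collection)
print(len(train), len(test), is_multi_label(collection))
```

Keeping the test portion disjoint from the training portion is what makes the measured effectiveness meaningful for new documents.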

4 Automatic Indexing. Assign to each document up to k terms drawn from a controlled vocabulary. Typically reduced to a multi-label classification problem: each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor.
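One way to picture the reduction described on this slide: every keyword in the controlled vocabulary gets its own binary training problem, in which a document is a positive example if it already carries that keyword. The sketch below assumes a small hypothetical vocabulary and document set; the function name is illustrative.

```python
def binary_problems(indexed_docs, vocabulary):
    """Build one binary labeling (+1 / -1) of the collection per keyword."""
    return {
        term: [(doc_id, 1 if term in terms else -1) for doc_id, terms in indexed_docs]
        for term in vocabulary
    }

# Hypothetical manually indexed documents: (document id, assigned keywords).
indexed_docs = [
    ("d1", {"helicopters", "aerodynamics"}),
    ("d2", {"bombers"}),
    ("d3", {"aerodynamics"}),
]
vocabulary = ["aerodynamics", "bombers", "helicopters"]

problems = binary_problems(indexed_docs, vocabulary)
print(problems["aerodynamics"])
```

Each of these labelings can then be handed to any binary classifier, and a document's index terms are the keywords whose classifiers accept it (capped at k).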

5 Case Study: SVM Categorization. Document collection from DTIC: 10,000 documents previously classified manually; taxonomy of 25 broad subject fields, divided into a total of 251 narrower groups; document lengths average 2705 ± 1464 words and 623 ± 274 significant unique terms; the collection has 32,457 significant unique terms.

6 Document Collection

8 Sample: Broad Subject Fields. 01 Aviation Technology; 02 Agriculture; 03 Astronomy and Astrophysics; 04 Atmospheric Sciences; 05 Behavioral and Social Sciences; 06 Biological and Medical Sciences; 07 Chemistry; 08 Earth Sciences and Oceanography.

9 Sample: Narrow Subject Groups. Aviation Technology: 01 Aerodynamics; 02 Military Aircraft Operations; 03 Aircraft (0301 Helicopters; 0302 Bombers; 0303 Attack and Fighter Aircraft; 0304 Patrol and Reconnaissance Aircraft).

10 Distribution among Categories

12 Baseline. Establish a baseline for conventional techniques: classification, training an SVM for each subject area, using “off-the-shelf” document modelling and SVM libraries.

13 Why SVM? Prior studies have suggested good results with SVM. SVMs are relatively immune to “overfitting” (fitting to coincidental relations encountered during training). Low dimensionality of model parameters.

14 Machine Learning: Support Vector Machines. Binary classifier: finds the hyperplane with the largest margin separating the two classes of training samples, then classifies new items according to which side of the hyperplane they fall on. [Figure: axes "Font size" and "Line number"; labels: hyperplane, margin.]
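The classification step on this slide amounts to checking the sign of a linear function. The sketch below covers only that step for an already-trained linear SVM; it does not perform training, and the weight vector `w`, offset `b`, and sample points are made-up values. The margin formula assumes the usual canonical scaling, in which the closest training samples satisfy |w·x + b| = 1.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign says which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 for the positive side of the hyperplane, -1 for the negative side."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the two margin boundaries under canonical scaling: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical trained parameters in a two-feature space.
w, b = [1.0, -1.0], 0.0
print(classify(w, b, [3.0, 1.0]), classify(w, b, [1.0, 3.0]), margin_width(w))
```

A larger margin corresponds to a smaller norm of `w`, which is why maximizing the margin is posed as minimizing that norm during training.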

15 SVM Evaluation

16 Baseline SVM Evaluation. Training and testing process repeated for multiple subject categories. Determine accuracy: overall; positive (ability to recognize new documents that belong in the class the SVM was trained for); negative (ability to reject new documents that belong to other classes). Explore training issues.
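The three accuracy figures named on this slide can be computed from a list of true memberships and a list of predictions. The following is a minimal sketch; the function name and the sample data are illustrative, not taken from the study.

```python
def accuracies(truth, predicted):
    """Return overall, positive, and negative accuracy for one binary classifier."""
    pairs = list(zip(truth, predicted))
    pos = [p for t, p in pairs if t]        # predictions for documents in the class
    neg = [p for t, p in pairs if not t]    # predictions for documents outside it

    def ratio(hits, total):
        return hits / total if total else float("nan")

    return {
        "overall": ratio(sum(t == p for t, p in pairs), len(pairs)),
        "positive": ratio(sum(pos), len(pos)),                  # members recognized
        "negative": ratio(sum(not p for p in neg), len(neg)),   # non-members rejected
    }

# Hypothetical test set: 4 members of the class followed by 4 non-members.
truth     = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, False, False, True, True]
result = accuracies(truth, predicted)
print(result)
```

Reporting positive and negative accuracy separately matters because a single overall figure can hide a classifier that rejects almost everything.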

17 SVM “Out of the Box”. 16 broad categories with 150 or more documents. Lucene library for model preparation; LibSVM for SVM training and testing, with no normalization or parameter tuning. Training set of 100/100 (positive/negative samples); test set of 50/50.
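LibSVM reads training and test data as sparse text lines of the form `<label> <index>:<value> ...`, with feature indices in ascending order. The sketch below shows how a document's term counts might be written in that format. It assumes raw, unnormalized term counts as feature values (the slides do not say how the vectors were weighted), and the `term_index` mapping is hypothetical.

```python
def to_libsvm_line(label, term_counts, term_index):
    """Format one document as '<label> <index>:<value> ...' with ascending indices."""
    features = sorted(
        (term_index[term], count)
        for term, count in term_counts.items()
        if term in term_index                  # skip terms outside the document model
    )
    return " ".join([f"{label:+d}"] + [f"{i}:{v}" for i, v in features])

# Hypothetical mapping from significant terms to 1-based feature indices.
term_index = {"aircraft": 1, "rotor": 2, "soil": 3}

line = to_libsvm_line(1, {"rotor": 4, "aircraft": 2, "the": 9}, term_index)
print(line)
```

A file of such lines, one per document, would hold the 100/100 training mix, and a second file the 50/50 test mix.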

19 “OotB” Interpretation. Reasonable performance on broad categories, given the modest training set size. A related experiment showed that normalization and optimized parameter selection could improve accuracy by as much as an additional 10%.

20 Training Set Size

21 Accuracy plateaus at training set sizes well under the number of terms in the document model.

22 Training Issues. Training set size. Concern: detailed subject groups may have too few known examples for effective SVM training in that subject. Possible solution: the collection may have few positive examples, but it has many, many negative examples. Positive/negative training mixes: effects on accuracy.

23 Increased Negative Training

24 Training Set Composition. Experiment performed with 50 positive training examples and OotB SVM training. Increasing the number of negative training examples has little effect on overall accuracy, but positive accuracy is reduced.
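The experimental setup on this slide, a fixed number of positive examples with a growing number of negatives, could be assembled along these lines. Only the count of 50 positives comes from the slide; the negative counts, pool sizes, and document ids below are placeholders.

```python
import random

def make_training_set(positives, negatives, n_pos, n_neg, seed=0):
    """Sample n_pos positive and n_neg negative documents, labeled +1 / -1."""
    rng = random.Random(seed)                  # fixed seed keeps runs comparable
    chosen_pos = rng.sample(positives, n_pos)
    chosen_neg = rng.sample(negatives, n_neg)
    return [(d, 1) for d in chosen_pos] + [(d, -1) for d in chosen_neg]

# Hypothetical pools of document ids inside and outside the target category.
positives = [f"pos{i}" for i in range(60)]
negatives = [f"neg{i}" for i in range(500)]

# Hold the positives at 50 and vary the negatives (these counts are assumed).
mixes = {n: make_training_set(positives, negatives, 50, n) for n in (50, 100, 200, 400)}
for n_neg, training in mixes.items():
    print(n_neg, len(training))
```

Training one classifier per mix and scoring each on the same test set is what allows the positive and negative accuracies to be compared across mixes.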

25 Interpretation. May indicate a weakness in SVM, or may simply be further evidence of the importance of optimizing SVM parameters. May also indicate that SVM output is unsuitable for treatment as a simple boolean decision; it might do better as a “best fit” in a multi-label classifier.
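The contrast between a boolean decision and a “best fit” can be illustrated with per-category decision values. In the sketch below, the scores are invented and the two helper functions are only one possible reading of “best fit”: rank the categories by score instead of thresholding each one independently.

```python
def boolean_assignment(scores):
    """Accept every category whose classifier score is non-negative."""
    return sorted(c for c, s in scores.items() if s >= 0)

def best_fit(scores, top_n=1):
    """Rank categories by score and keep the top_n, regardless of sign."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ranked[:top_n]]

# Hypothetical decision values from three per-category SVMs for one document.
scores = {
    "Chemistry": -0.2,
    "Earth Sciences and Oceanography": -0.6,
    "Astronomy and Astrophysics": -1.3,
}
print(boolean_assignment(scores))   # every classifier rejects the document
print(best_fit(scores))             # the least-negative category is still chosen
```

Under the boolean reading this document is left unclassified, whereas the ranking still assigns it to the closest category, which is one way reduced positive accuracy might be offset.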

