
1 Evaluation of Decision Forests on Text Categorization
Hao Chen
School of Information Mgmt. & Systems, Univ. of California, Berkeley, CA 94720
Tin Kam Ho
Bell Labs, Lucent Technologies, 700 Mountain Avenue, Murray Hill, NJ 07974

2 Text Categorization
• Text Collection
• Feature Extraction
• Classification
• Evaluation

3 Text Collection
• Reuters
  – Newswires from Reuters in 1987
  – Training set: 9603
  – Test set: 3299
  – Categories: 95
• OHSUMED
  – Abstracts from medical journals
  – Training set:
  – Test set: 3616
  – Categories: 75 (within the Heart Disease subtree)

4 Feature Extraction
• Stop word removal
  – 430 stop words
• Stemming
  – Porter's stemmer
• Term selection
  – By document frequency
  – Category-independent selection
  – Category-dependent selection
• Feature extraction (see the sketch below)
  – TF × IDF
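As a concrete illustration, here is a minimal Python sketch of this pipeline. NLTK's Porter stemmer and stop list and scikit-learn's TfidfVectorizer are stand-ins for the paper's exact tooling: NLTK's English stop list is not the 430-word list used here, the min_df threshold is illustrative, and train_docs/test_docs are assumed to be lists of raw document strings.

```python
# Requires: nltk.download("stopwords") has been run once.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))  # stand-in for the paper's 430-word list

def analyze(doc):
    # Lowercase, drop stop words, then apply Porter stemming.
    return [stemmer.stem(t) for t in doc.lower().split() if t not in stop]

# min_df does category-independent term selection by document frequency;
# the threshold 3 is illustrative, not the paper's value.
vectorizer = TfidfVectorizer(analyzer=analyze, min_df=3)
X_train = vectorizer.fit_transform(train_docs)  # TF x IDF weighted matrix
X_test = vectorizer.transform(test_docs)
```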

5 Classification
• Method
  – Each document may belong to multiple categories
  – Treat each category as a separate classification problem
  – Binary classification per category (see the sketch below)
• Classifiers
  – kNN (k nearest neighbors)
  – C4.5 (Quinlan)
  – Decision Forest
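A minimal sketch of the one-binary-classifier-per-category setup, with scikit-learn's kNN as the example classifier. The 0/1 indicator matrix Y_train (documents × categories) and the value of k are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def train_per_category(X_train, Y_train, k=30):
    # One binary classifier per category: "in category c" vs. "not".
    models = []
    for c in range(Y_train.shape[1]):
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X_train, Y_train[:, c])
        models.append(clf)
    return models

def predict_per_category(models, X_test):
    # One YES/NO decision per (document, category) pair.
    return np.column_stack([m.predict(X_test) for m in models])
```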

6 C4.5
• A method for building decision trees
• Training
  – Grow the tree by splitting the data set
  – Prune the tree back to prevent over-fitting
• Testing
  – A test vector goes down the tree and arrives at a leaf
  – The probability that the vector belongs to each category is estimated at that leaf
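C4.5 itself is Quinlan's standalone program; as a rough analogue only, the sketch below uses a scikit-learn decision tree with entropy-based splitting and cost-complexity pruning to mimic the grow-then-prune procedure. The ccp_alpha value is illustrative, C4.5's actual pruning method differs, and y_train_binary is the assumed 0/1 label vector for one category.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",  # information-gain-style splitting, as in C4.5
    ccp_alpha=1e-3,       # prune the grown tree back to limit over-fitting
)
tree.fit(X_train, y_train_binary)

# Each test vector descends to a leaf; the leaf's class proportions give
# the estimated probability of membership (assumes both classes were
# present in training).
proba = tree.predict_proba(X_test)[:, 1]
```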

7 Decision Forest
• Consists of many decision trees combined by averaging the class-probability estimates at the leaves
• Each tree is constructed in a randomly chosen (coordinate) subspace of the feature space
• An oblique hyperplane is used as the discriminator at each internal node of the trees
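A simplified sketch of the random-subspace construction: each tree is trained on a randomly chosen half of the feature coordinates, and the per-tree class-probability estimates are averaged. Note this is not the full method: scikit-learn's trees split on axis-parallel thresholds, so the oblique-hyperplane discriminators are not reproduced, and the n_trees and subspace sizes are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def train_subspace_forest(X, y, n_trees=100, subspace_frac=0.5):
    # Grow each tree in a randomly chosen coordinate subspace.
    d = X.shape[1]
    forest = []
    for _ in range(n_trees):
        dims = rng.choice(d, size=int(d * subspace_frac), replace=False)
        tree = DecisionTreeClassifier().fit(X[:, dims], y)
        forest.append((dims, tree))
    return forest

def forest_proba(forest, X):
    # Combine the trees by averaging the class-probability estimates
    # from the leaves each test vector reaches.
    return np.mean([tree.predict_proba(X[:, dims]) for dims, tree in forest],
                   axis=0)
```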

8 Why choose these three classifiers?
• We do not have a parametric model for the problem (we cannot assume Gaussian distributions, etc.)
• kNN and the decision tree (C4.5) are among the most popular nonparametric classifiers; we use them as the baselines for comparison
• We expect the decision forest to do well, since this is a high-dimensional problem of the kind on which it is known to do well from previous studies

9 Evaluation
• Measurements (helper functions sketched below)
  – Precision: p = a / (a + b)
  – Recall: r = a / (a + c)
  – F1 value: F1 = 2rp / (r + p)
• Contingency table for one category:

                   YES is correct   NO is correct
    Assigned YES         a                b
    Assigned NO          c                d

• Tradeoff between precision and recall
  – kNN tends to have higher precision than recall, especially as k becomes larger
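These definitions translate directly into code; a small sketch, with a, b, c being the counts from the contingency table above:

```python
def precision(a, b):
    # Fraction of assigned-YES decisions that were correct.
    return a / (a + b) if a + b else 0.0

def recall(a, c):
    # Fraction of true-YES documents that were assigned YES.
    return a / (a + c) if a + c else 0.0

def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * r * p / (r + p) if r + p else 0.0
```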

10 Averaging scores
• Macro-averaging
  – Calculate precision/recall for each category
  – Average all the precision/recall values
  – Assigns equal weight to each category
• Micro-averaging (see the sketch below)
  – Sum up the classification decisions over all documents
  – Calculate precision/recall from the summed counts
  – Assigns equal weight to each document
  – Micro-averaging was used in the experiments because the number of documents per category varies considerably
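A sketch of the two averaging schemes, taking `counts` as a hypothetical list of per-category (a, b, c) triples as defined on slide 9:

```python
def macro_precision(counts):
    # Equal weight per category: average the per-category precisions.
    vals = [a / (a + b) if a + b else 0.0 for a, b, c in counts]
    return sum(vals) / len(vals)

def micro_precision(counts):
    # Equal weight per document: sum the decisions first, then divide,
    # so large categories contribute proportionally more.
    A = sum(a for a, b, c in counts)
    B = sum(b for a, b, c in counts)
    return A / (A + B) if A + B else 0.0
```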

11 Performance in F1 Value

12 Comparison between Classifiers
• Decision Forest is better than C4.5 and kNN
• In the category-dependent case, C4.5 is better than kNN
• In the category-independent case, kNN is better than C4.5

13 Category-Dependent vs. Category-Independent Term Selection
• For Decision Forest and C4.5, category-dependent selection is better than category-independent
• But for kNN, category-independent selection is better than category-dependent
• No obvious explanation was found

14 Reuters vs. OHSUMED
• All classifiers degrade from Reuters to OHSUMED
• kNN degrades more (26%) than C4.5 (12%) and Decision Forest (12%)

15 Reuters vs. OHSUMED
• OHSUMED is a harder problem because:
  – Documents are more evenly distributed across categories
  – This even distribution hurts kNN's recall more than the other classifiers', because more competing categories fall within the fixed-size neighborhood

16 Conclusion
• Decision Forest is substantially better than C4.5 and kNN for text categorization
• It is difficult to compare these results with those reported for other classifiers outside this experiment, because of:
  – Different ways of splitting the training/test sets
  – Different term selection methods

