
**1. Evaluation of Decision Forests on Text Categorization**

Hao Chen, School of Information Mgmt. & Systems, Univ. of California, Berkeley, CA 94720
Tin Kam Ho, Bell Labs, Lucent Technologies, 700 Mountain Avenue, Murray Hill, NJ 07974

**2. Text Categorization**

- Text Collection
- Feature Extraction
- Classification
- Evaluation

**3. Text Collection**

- Reuters: newswires from Reuters in 1987
  - Training set: 9603; test set: 3299; categories: 95
- OHSUMED: abstracts from medical journals
  - Training set: 12327; test set: 3616; categories: 75 (within the Heart Disease subtree)

**4. Feature Extraction**

- Stop word removal: 430 stop words
- Stemming: Porter's stemmer
- Term selection by document frequency
  - Category-independent selection
  - Category-dependent selection
- Feature weighting: TF * IDF
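The TF * IDF weighting step can be sketched as follows. The slides do not state which TF*IDF variant was used, so this assumes the common tf × log(N / df) form:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight terms by TF * IDF over a list of tokenized documents.

    Assumes the common tf * log(N / df) variant; the slides do not
    specify the exact formula used in the experiments.
    """
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # term frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["heart", "disease", "studi"],  # tokens after stopping and stemming
        ["heart", "rate", "studi"],
        ["rate", "of", "disease"]]
w = tf_idf(docs)
```

Terms appearing in every document get weight zero (log 1 = 0), which is why stop-word removal and term selection precede weighting.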

**5. Classification Method**

- Each document may belong to multiple categories
- Each category is treated as a separate binary classification problem
- Classifiers:
  - kNN (k-Nearest Neighbor)
  - C4.5 (Quinlan)
  - Decision Forest
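The decomposition into one binary problem per category can be sketched with a small helper (hypothetical, not from the slides):

```python
def one_vs_rest_labels(doc_categories, categories):
    """Decompose a multi-label task into one binary problem per category.

    doc_categories: one set of category labels per document.
    Returns {category: list of YES/NO labels, one per document},
    ready to train an independent binary classifier per category.
    """
    return {c: [c in cats for cats in doc_categories] for c in categories}

labels = one_vs_rest_labels(
    [{"grain", "wheat"}, {"trade"}, {"wheat"}],
    ["grain", "wheat", "trade"],
)
# labels["wheat"] is [True, False, True]
```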

**6. C4.5**

A method to build decision trees.

- Training: grow the tree by splitting the data set, then prune it back to prevent over-fitting
- Testing: a test vector goes down the tree and arrives at a leaf, where the probability that the vector belongs to each category is estimated
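The testing step amounts to walking a test vector to a leaf and reading off class frequencies there. The dict-based node layout below is a hypothetical simplification, not C4.5's actual data structure:

```python
def classify(node, x):
    """Send a test vector down a decision tree and return the
    class-probability estimate at the leaf it reaches.

    Hypothetical node layout: internal nodes are dicts with
    'feature', 'threshold', 'left', 'right'; leaves hold 'counts',
    the class frequencies of training vectors that reached the leaf.
    """
    while "counts" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    total = sum(node["counts"].values())
    return {c: n / total for c, n in node["counts"].items()}

tree = {"feature": 0, "threshold": 0.5,
        "left":  {"counts": {"grain": 3, "trade": 1}},
        "right": {"counts": {"grain": 0, "trade": 2}}}
probs = classify(tree, [0.2])
# probs is {"grain": 0.75, "trade": 0.25}
```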

**7. Decision Forest**

- Consists of many decision trees, combined by averaging the class-probability estimates at the leaves
- Each tree is constructed in a randomly chosen (coordinate) subspace of the feature space
- An oblique hyperplane is used as the discriminator at each internal node of the trees
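The combining step can be sketched as follows. The `(subspace, predict)` pairing is a hypothetical interface: tree construction itself (including the oblique splits) is abstracted away, and only the projection onto each tree's random coordinate subspace and the probability averaging are shown:

```python
def forest_predict(trees, x):
    """Average the leaf class-probability estimates of many trees.

    Each entry in `trees` is (subspace, predict): `subspace` lists the
    randomly chosen feature indices the tree was built on, and
    `predict` maps the projected vector to {class: probability}.
    """
    avg = {}
    for subspace, predict in trees:
        probs = predict([x[i] for i in subspace])  # project, then classify
        for c, p in probs.items():
            avg[c] = avg.get(c, 0.0) + p / len(trees)
    return avg

# Two stub "trees" built on different coordinate subspaces
trees = [
    ([0, 2], lambda v: {"yes": 0.9, "no": 0.1}),
    ([1, 3], lambda v: {"yes": 0.5, "no": 0.5}),
]
avg = forest_predict(trees, [1.0, 0.0, 2.0, 3.0])
# avg["yes"] averages to 0.7, avg["no"] to 0.3
```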

**8. Why Choose These Three Classifiers?**

- We do not have a parametric model for the problem (we cannot assume Gaussian distributions, etc.)
- kNN and the decision tree (C4.5) are the most popular nonparametric classifiers; we use them as baselines for comparison
- We expect the decision forest to do well, since previous studies show it performs well on high-dimensional problems like this one

**9. Evaluation**

|              | YES is correct | NO is correct |
|--------------|----------------|---------------|
| Assigned YES | a              | b             |
| Assigned NO  | c              | d             |

- Precision: p = a / (a + b)
- Recall: r = a / (a + c)
- F1 value: F1 = 2rp / (r + p)
- Tradeoff between precision and recall: kNN tends to have higher precision than recall, especially as k becomes larger
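The three measures follow directly from the contingency counts in the table; the counts below are made up for illustration:

```python
def prf1(a, b, c):
    """Precision, recall, and F1 from the contingency counts:
    a = assigned YES, YES correct; b = assigned YES, NO correct;
    c = assigned NO, YES correct. (d is not needed.)"""
    p = a / (a + b)
    r = a / (a + c)
    return p, r, 2 * r * p / (r + p)

p, r, f1 = prf1(a=8, b=2, c=4)
# p = 0.8, r = 8/12, f1 = 8/11
```

F1 is the harmonic mean of precision and recall, so it penalizes a classifier that trades one off too aggressively against the other.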

**10. Averaging Scores**

- Macro-averaging: calculate precision/recall for each category, then average all the precision/recall values; assigns equal weight to each category
- Micro-averaging: sum up the classification decisions of each document, then calculate precision/recall from the summations; assigns equal weight to each document
- Micro-averaging was used in the experiment because the number of documents in each category varies considerably
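The difference between the two schemes can be sketched for precision (recall works the same way with c in place of b); the counts are made up for illustration:

```python
def macro_micro_precision(per_category):
    """Macro- and micro-averaged precision from per-category counts.

    per_category: one (a, b) pair per category, where a = correct
    YES assignments and b = incorrect YES assignments.
    """
    # Macro: average the per-category precisions (equal weight per category)
    macro = sum(a / (a + b) for a, b in per_category) / len(per_category)
    # Micro: sum the counts first, then divide (equal weight per decision)
    total_a = sum(a for a, b in per_category)
    total_b = sum(b for a, b in per_category)
    micro = total_a / (total_a + total_b)
    return macro, micro

# A large easy category and a small hard one: micro follows the large one
macro, micro = macro_micro_precision([(90, 10), (1, 3)])
# macro = (0.9 + 0.25) / 2 = 0.575; micro = 91 / 104 = 0.875
```

This is why micro-averaging suits collections whose category sizes vary widely: rare categories do not dominate the score.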

**11. Performance in F1 Value**

**12. Comparison Between Classifiers**

- Decision Forest is better than C4.5 and kNN
- In the category-dependent case, C4.5 is better than kNN
- In the category-independent case, kNN is better than C4.5

**13. Category-Dependent vs. Category-Independent Selection**

- For Decision Forest and C4.5, category-dependent selection is better than category-independent
- For kNN, category-independent selection is better than category-dependent
- No obvious explanation was found

**14. Reuters vs. OHSUMED**

- All classifiers degrade from Reuters to OHSUMED
- kNN degrades more (26%) than C4.5 (12%) and Decision Forest (12%)

**15. Reuters vs. OHSUMED (continued)**

OHSUMED is a harder problem because:

- Documents are more evenly distributed across categories
- This even distribution hurts kNN's recall more than the other classifiers', because more confusable classes appear within the fixed-size neighborhood

**16. Conclusion**

- Decision Forest is substantially better than C4.5 and kNN at text categorization
- It is difficult to compare with results of other classifiers outside this experiment, because of:
  - Different ways of splitting the training/test set
  - Different term selection methods
