Evaluation of Decision Forests on Text Categorization — presentation transcript
1 Evaluation of Decision Forests on Text Categorization
Hao Chen, School of Information Mgmt. & Systems, Univ. of California, Berkeley, CA 94720
Tin Kam Ho, Bell Labs, Lucent Technologies, 700 Mountain Avenue, Murray Hill, NJ 07974
2 Text Categorization
Text Collection
Feature Extraction
Classification
Evaluation
3 Text Collection
Reuters: newswires from Reuters in 1987
Training set: 9603; test set: 3299; categories: 95
OHSUMED: abstracts from medical journals
Training set: 12327; test set: 3616; categories: 75 (within the Heart Disease subtree)
5 Classification Method
Each document may belong to multiple categories, so each category is treated as a separate binary classification problem.
Classifiers: kNN (k-Nearest Neighbor), C4.5 (Quinlan), Decision Forest
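The per-category binary setup on this slide can be sketched as follows. This is a minimal illustration, not the paper's code; the category names and documents are hypothetical.

```python
# One-vs-rest setup: each category becomes its own binary (yes/no)
# classification problem over the same set of documents.
def one_vs_rest_labels(doc_categories, all_categories):
    """For each category, build a binary label vector over the documents."""
    return {
        cat: [1 if cat in cats else 0 for cats in doc_categories]
        for cat in all_categories
    }

# Hypothetical documents, each tagged with its set of categories.
docs = [{"grain", "wheat"}, {"trade"}, {"wheat"}]
labels = one_vs_rest_labels(docs, ["grain", "trade", "wheat"])
# labels["wheat"] == [1, 0, 1]: a separate binary problem for "wheat"
```

Each label vector then trains one binary classifier, so a 95-category collection like Reuters yields 95 independent classification problems.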
6 C4.5
A method to build decision trees.
Training: grow the tree by splitting the data set, then prune it back to prevent over-fitting.
Testing: a test vector goes down the tree and arrives at a leaf, where the probability that the vector belongs to each category is estimated.
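The testing step above can be sketched as a simple tree walk. The tree here is hand-built and hypothetical (C4.5 would learn the splits and leaf proportions from training data); it only illustrates how a leaf's stored class proportions serve as the probability estimate.

```python
# A node is either ("split", feature_index, threshold, left, right)
# or ("leaf", class_probabilities). Thresholds and probabilities are
# made-up values for illustration.
tree = ("split", 0, 0.5,
        ("leaf", {"yes": 0.9, "no": 0.1}),   # feature 0 <= 0.5
        ("leaf", {"yes": 0.2, "no": 0.8}))   # feature 0 >  0.5

def classify(node, x):
    """Send a test vector down the tree; return the leaf's probabilities."""
    if node[0] == "leaf":
        return node[1]
    _, feat, thresh, left, right = node
    return classify(left if x[feat] <= thresh else right, x)

probs = classify(tree, [0.3])   # reaches the left leaf
```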
7 Decision Forest
Consists of many decision trees combined by averaging the class probability estimates at the leaves.
Each tree is constructed in a randomly chosen (coordinate) subspace of the feature space.
An oblique hyperplane is used as a discriminator at each internal node of the trees.
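The combination rule on this slide, averaging the trees' leaf probability estimates, can be sketched in a few lines. The per-tree estimates below are made up for illustration; in the actual method each tree is built in a random coordinate subspace with oblique splits.

```python
# Combine a forest's predictions by averaging each tree's
# class-probability estimate for the same test document.
def forest_predict(tree_probs):
    """Average a list of class-probability dicts, one per tree."""
    classes = tree_probs[0].keys()
    n = len(tree_probs)
    return {c: sum(p[c] for p in tree_probs) / n for c in classes}

estimates = [
    {"yes": 0.9, "no": 0.1},   # tree 1's leaf estimate (hypothetical)
    {"yes": 0.6, "no": 0.4},   # tree 2
    {"yes": 0.3, "no": 0.7},   # tree 3
]
avg = forest_predict(estimates)   # roughly {"yes": 0.6, "no": 0.4}
```

Averaging smooths out the variance of individual trees, which is why the forest can outperform any single C4.5 tree.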
8 Why choose these 3 classifiers?
We do not have a parametric model for the problem (we cannot assume Gaussian distributions, etc.).
kNN and the decision tree (C4.5) are the most popular nonparametric classifiers; we use them as baselines for comparison.
We expect the decision forest to do well, since this is a high-dimensional problem of the kind it is known from previous studies to handle well.
9 Evaluation Measurements
Contingency table per category:
              YES is correct   NO is correct
Assigned YES        a                b
Assigned NO         c                d
Precision p = a / (a + b)
Recall r = a / (a + c)
F1 value F1 = 2rp / (r + p)
Tradeoff between precision and recall: kNN tends to have higher precision than recall, especially as k becomes larger.
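The three measures follow directly from the contingency-table counts. A small helper, using the slide's a/b/c naming (the counts in the example call are hypothetical):

```python
# a = assigned YES and YES is correct (true positives)
# b = assigned YES but NO is correct  (false positives)
# c = assigned NO  but YES is correct (false negatives)
def prf1(a, b, c):
    """Precision, recall, and F1 from contingency-table counts."""
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * recall * precision / (recall + precision)
    return precision, recall, f1

p, r, f1 = prf1(a=8, b=2, c=8)   # p = 0.8, r = 0.5
```

F1 is the harmonic mean of precision and recall, so a classifier cannot score well by inflating one at the expense of the other.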
10 Averaging Scores
Macro-averaging: calculate precision/recall for each category, then average all the precision/recall values; this assigns equal weight to each category.
Micro-averaging: sum up the classification decisions for each document, then calculate precision/recall from the summations; this assigns equal weight to each document.
Micro-averaging was used in the experiment because the number of documents in each category varies considerably.
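The difference between the two schemes shows up when category sizes are skewed. A sketch with hypothetical per-category counts (a = true positives, b = false positives, c = false negatives), shown here for precision; recall works the same way on a and c:

```python
# Macro: average the per-category precisions (equal weight per category).
def macro_precision(counts):
    return sum(a / (a + b) for a, b, c in counts) / len(counts)

# Micro: pool the counts first, then compute one precision
# (equal weight per document decision).
def micro_precision(counts):
    A = sum(a for a, b, c in counts)
    B = sum(b for a, b, c in counts)
    return A / (A + B)

counts = [(90, 10, 5), (1, 1, 0)]   # one large, one tiny category
macro = macro_precision(counts)     # (0.9 + 0.5) / 2 = 0.7
micro = micro_precision(counts)     # 91 / 102, about 0.89
```

The tiny category drags the macro score down but barely affects the micro score, which is why micro-averaging suits collections with very uneven category sizes.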
12 Comparison between Classifiers
Decision Forest performs better than C4.5 and kNN.
In the category-dependent case, C4.5 is better than kNN.
In the category-independent case, kNN is better than C4.5.
13 Category-Dependent vs. Category-Independent Methods
For Decision Forest and C4.5, the category-dependent method is better than the independent one.
But for kNN, the category-independent method is better than the dependent one.
No obvious explanation was found.
14 Reuters vs. OHSUMED
All classifiers degrade from Reuters to OHSUMED.
kNN degrades faster (26%) than C4.5 (12%) and DF (12%).
15 Reuters vs. OHSUMED
OHSUMED is a harder problem because its documents are more evenly distributed across categories.
This even distribution hurts kNN's recall more than the other classifiers', because more confusion classes appear within the fixed-size neighborhood.
16 Conclusion
Decision Forest is substantially better than C4.5 and kNN in text categorization.
It is difficult to compare with results of classifiers outside this experiment, because of:
Different ways of splitting the training/test set
Different term selection methods