
1 Text categorization Updated 11/1/2006

2 Performance measures – binary classification
Contingency table:

                            Ground truth
                            True    False
Classifier assigned True      a       b
Classifier assigned False     c       d

Accuracy: acc = (a+d)/(a+b+c+d)
Precision: p = a/(a+b)
Recall: r = a/(a+c)
F_β = (β²+1)pr / (β²p + r); usually one uses F_1 = 2pr/(p+r)
Break-even point: the point at which precision equals recall.
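A minimal Python sketch of these measures, computed directly from the four contingency counts (the example counts are made up for illustration):

def binary_measures(a, b, c, d, beta=1.0):
    # a = true positives, b = false positives, c = false negatives, d = true negatives
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    if precision + recall == 0:
        f_beta = 0.0
    else:
        f_beta = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return accuracy, precision, recall, f_beta

print(binary_measures(a=40, b=10, c=20, d=930))  # beta=1 gives the usual F1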

3 Performance measures – multiple categories
Micro averaging: pool the per-category contingency tables into a single table, then compute the measures from the pooled counts.
Macro averaging: compute the measures for each category separately, then average them across categories.
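A sketch of the difference in Python, using made-up per-category counts (a = true positives, b = false positives); the same idea applies to recall and F1:

tables = {
    "earnings": {"a": 90, "b": 10},
    "wheat":    {"a": 5,  "b": 5},
    "silver":   {"a": 2,  "b": 8},
}

# Micro averaging: pool the counts across categories, then compute precision once.
A = sum(t["a"] for t in tables.values())
B = sum(t["b"] for t in tables.values())
micro_precision = A / (A + B)

# Macro averaging: compute precision per category, then take the unweighted mean.
per_category = [t["a"] / (t["a"] + t["b"]) for t in tables.values()]
macro_precision = sum(per_category) / len(per_category)

print(micro_precision, macro_precision)  # micro is dominated by the frequent categories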

4 Reuters 21578 The Reuters collection contains 9603 training articles and 3299 test articles, which were sent over the Reuters newswire in 1987. It contains about 100 categories such as ‘mergers and acquisitions’, ‘interest rates’, ‘wheat’, ‘silver’, etc. The distribution of articles among categories is highly non-uniform: ‘earning’ contains 2709 docs, while 75 categories contain fewer than 10 docs each.

5 Example of a Reuters news story from category ‘earning’ 26-FEB-1987 15:18:59.34 earn COBANCO INC <CBCO> YEAR NET SANTA CRUZ, Calif., Feb 26 - Shr 34 cts vs 1.19 dlrs Net 807,000 vs 2,858,000 Assets 510.2 mln vs 479.7 mln Deposits 472.3 mln vs 440.3 mln Loans 299.2 mln vs 327.2 mln Note: 4th qtr not available. Year includes 1985 extraordinary gain from tax carry forward of 132,000 dlrs, or five cts per shr. Reuter

6 Categorization methods Decision trees Naïve Bayes K-nearest neighbors (KNN) Neural networks Support Vector Machines (SVM)

7 Representation of documents The most popular representation is ‘Bag of Words’, which ignores all structure of the document. Document i is represented by a vector x_i ∈ R^n (n is the number of word types), where the j-th coordinate is the number of times word w_j appears in the document (the so-called term frequency, tf_j).
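A minimal sketch of this mapping in Python; the vocabulary below is a toy assumption (in practice it is collected from the training corpus):

from collections import Counter

vocabulary = ["net", "cents", "versus", "profit", "wheat"]

def tf_vector(document, vocab=vocabulary):
    counts = Counter(document.lower().split())
    # j-th coordinate = number of times word w_j occurs in the document (tf_j)
    return [counts[w] for w in vocab]

print(tf_vector("net profit rose as net interest income grew"))  # [2, 0, 0, 1, 0]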

8 Decision trees
Example tree for the ‘earnings’ category; each node shows how many of its documents belong to ‘earnings’ (the root covers 2301/7681 = 0.30 of all docs).
Root: split on "cents"
  contains "cents" < 2 times: 694/5977 = 0.116, split on "net"
    contains "net" < 1 time: 272/5436 = 0.050, assign "no"
    contains "net" ≥ 1 time: 422/541 = 0.780
  contains "cents" ≥ 2 times: 1607/1704 = 0.943, split on "versus"
    contains "versus" < 2 times: 209/301 = 0.694, assign "yes"
    contains "versus" ≥ 2 times: 1398/1403 = 0.996
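Read as code, the tree is just nested term-frequency tests. The sketch below assumes tf is a dict of word counts for the document and returns the leaf estimate of P(earnings | document); the branch-to-leaf assignment follows the node counts above:

def p_earnings(tf):
    if tf.get("cents", 0) < 2:
        if tf.get("net", 0) < 1:
            return 272 / 5436    # ≈ 0.050, assign "no"
        else:
            return 422 / 541     # ≈ 0.780
    else:
        if tf.get("versus", 0) < 2:
            return 209 / 301     # ≈ 0.694, assign "yes"
        else:
            return 1398 / 1403   # ≈ 0.996

print(p_earnings({"cents": 3, "net": 2, "versus": 0}))  # ≈ 0.694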

9 Building decision trees Information gain
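The criterion can be made concrete with a short sketch: information gain is the drop in class entropy when the documents at a node are split by a candidate test. The counts below are taken from the ‘earnings’ tree on the previous slide (the "cents" split at the root):

import math

def entropy(pos, neg):
    h = 0.0
    for k in (pos, neg):
        if k:
            p = k / (pos + neg)
            h -= p * math.log2(p)
    return h

def information_gain(parent, left, right):
    # parent, left, right are (in-category, out-of-category) document counts
    n = sum(parent)
    expected = sum(sum(child) / n * entropy(*child) for child in (left, right))
    return entropy(*parent) - expected

# root: 2301 of 7681 docs are 'earnings'; split on "cents" < 2 vs. >= 2
print(information_gain((2301, 5380), (694, 5283), (1607, 97)))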

10 Decision Tree Pruning

11 Naïve Bayes Multivariate Bernoulli model Multinomial model
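A minimal sketch of the multinomial model in Python with add-one smoothing; the training snippets are toy data, not Reuters articles:

import math
from collections import Counter, defaultdict

train = [
    ("earn",  "net profit rose cents per share"),
    ("earn",  "qtr net loss cents vs profit"),
    ("wheat", "wheat harvest tonnes export"),
]

class_counts = Counter(c for c, _ in train)
word_counts = defaultdict(Counter)
for c, text in train:
    word_counts[c].update(text.split())
vocab = {w for c in word_counts for w in word_counts[c]}

def predict(text):
    scores = {}
    for c in class_counts:
        # log prior + sum of smoothed log likelihoods over the document's tokens
        score = math.log(class_counts[c] / len(train))
        total = sum(word_counts[c].values())
        for w in text.split():
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("net profit cents"))  # -> 'earn'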

12 Precision recall curve
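The curve is traced by sweeping the classifier's score threshold and recomputing precision and recall at each setting; the break-even point from slide 2 is where the two are equal. A sketch with made-up scores (label 1 = document is in the category):

data = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0), (0.50, 1), (0.30, 0)]
total_positives = sum(label for _, label in data)

curve = []
for threshold, _ in data:
    predicted = [(s, y) for s, y in data if s >= threshold]
    tp = sum(y for _, y in predicted)
    curve.append((tp / total_positives, tp / len(predicted)))  # (recall, precision)

print(curve)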

13 K-nearest neighbor
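A sketch of the kNN rule for text: represent documents as term-frequency vectors, find the k training documents most similar to the query (cosine similarity here), and let them vote on the category. Data and k are illustrative:

import math
from collections import Counter

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def knn_predict(query, training, k=3):
    # training: list of (tf_vector, category) pairs
    neighbours = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(category for _, category in neighbours)
    return votes.most_common(1)[0][0]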

14 Neural network Perceptrons Multi-layer perceptrons
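A minimal perceptron sketch for the binary "in category / not in category" decision over term-frequency vectors; learning rate and epoch count are illustrative. A multi-layer perceptron stacks such units with nonlinear activations and is trained by backpropagation:

def train_perceptron(data, n_features, epochs=10, lr=1.0):
    # data: list of (tf_vector, label) pairs with label in {+1, -1}
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:              # misclassified: nudge the weights
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b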

15 SVM

16 Reuters 21578 – comparison* *Yiming Yang & Xin Liu, “A re-examination of text categorization methods”, SIGIR 1999

