
1 A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM Lan Man 3 Nov, 2004

2 Synopsis
Purpose of this work
Experiment Design
Results and Discussions
Conclusions

3 Purpose of this work
Text categorization: the task of assigning unlabelled documents to predefined categories.
Many methods have been applied: kNN, Decision Tree, Neural Network, Naïve Bayes, Linear Regression, SVM, Perceptron, Rocchio, etc., as well as classifier committees, bagging and boosting.
SVM has shown rather good performance.

4 Purpose of this work (Cont.)
Biblio           Term Weighting   Kernel of SVM      Data Collection                Performance Evaluation
Dumais, 1998     Binary           Linear             Reuters-21578 top 10           .92 (microaveraged breakeven point)
Joachims, 1998   tf.idf           Polynomial & RBF   Reuters-21578 top 90           .86 & .864 (microaveraged breakeven point)
Dai, 2003        logtf.idf        Linear             Part of Reuters-21578 top 10   .9402 (F1)
...

5 Purpose of this work (Cont.)
Does the difference in performance come from the different text representations or from the different kernel functions of SVM?
[Leopold, 2002] points out that it is the text representation scheme, rather than the kernel function of SVM, that dominates text categorization performance.

6 Purpose of this work (Cont.)
Therefore, choosing an appropriate term weighting scheme is more important than choosing and tuning SVM kernel functions for the text categorization task.
However, previous work is not sufficient to draw a definite conclusion about which term weighting scheme is better for SVM.

7 Purpose of this work (Cont.)
Previous studies are hard to compare because they differ in:
Data preparation: stemming, stop-words, feature selection, term weighting schemes
Data collection: Reuters (whole, top 10, top 90, partial top 10)
Classifiers with various parameters
Performance evaluation

8 Purpose of this work (Cont.)
Our study focuses on the various term weighting schemes for SVM.
The reasons for choosing the linear kernel function (a training sketch follows below):
It is simple and fast.
Based on our preliminary experiments and previous studies, linear kernels perform better than non-linear models even when handling high-dimensional data.
Our current work is the comparison of term weighting schemes, rather than the choosing and tuning of kernel functions.
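A minimal sketch of this setup, assuming scikit-learn's LinearSVC as a stand-in for whatever SVM implementation was actually used in 2004 (likely SVM-light); X_train and y_train are hypothetical term-weighted document vectors and category labels:

```python
# Sketch only: one-vs-rest linear-kernel SVM over term-weighted vectors.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def train_linear_svm(X_train, y_train):
    # C = 1.0 is the library default, not a value tuned for this task.
    clf = OneVsRestClassifier(LinearSVC(C=1.0))
    return clf.fit(X_train, y_train)
```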

9 Term Weighting Schemes
10 different term weighting schemes were selected, due either to their reported superior classification results or to their typical representation when used with SVM.
They are: binary, tf, logtf, ITF, idf, tf.idf, logtf.idf, tf.idf-prob, tf.chi, tf.rf.

10 Term Weighting Schemes
The following four are related to term frequency alone (a sketch follows below):
binary: 1 if the term is present in the vector, 0 if absent
tf: # of times a term occurs in a document
logtf: 1 + log(tf), where the log mends the unfavorable linearity of raw tf
ITF: 1 - r/(r + tf), usually with r = 1 (inverse term frequency, presented by Leopold)
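A minimal sketch of these four weights, assuming tf is the raw in-document count:

```python
import math

def binary_weight(tf):
    return 1.0 if tf > 0 else 0.0

def tf_weight(tf):
    return float(tf)

def logtf_weight(tf):
    # 1 + log(tf) for present terms; 0 when the term is absent.
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def itf_weight(tf, r=1.0):
    # Leopold's inverse term frequency: 1 - r/(r + tf), usually r = 1.
    return 1.0 - r / (r + tf)
```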

11 Term Weighting Schemes
The following four are related to the idf factor (a sketch follows below):
idf: log(N/ni), where N is the # of docs and ni is the # of docs which contain term ti
tf.idf: the widely used term representation
logtf.idf: (1 + log(tf)) · idf
tf.idf-prob: uses idf-prob = log((N - ni)/ni), an approximate representation of the term relevance weight, also called probabilistic idf
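A matching sketch of the idf-based weights, using the slide's definitions of N and ni:

```python
import math

def idf(N, ni):
    # N = total # of docs; ni = # of docs containing term ti.
    return math.log(N / ni)

def idf_prob(N, ni):
    # Probabilistic idf: log((N - ni) / ni).
    return math.log((N - ni) / ni)

def tf_idf(tf, N, ni):
    return tf * idf(N, ni)

def logtf_idf(tf, N, ni):
    return (1.0 + math.log(tf)) * idf(N, ni) if tf > 0 else 0.0

def tf_idf_prob(tf, N, ni):
    return tf * idf_prob(N, ni)
```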

12 Term Weighting Schemes
tf.chi: a representative of schemes combining tf with feature selection measures (chi^2, information gain, odds ratio, gain ratio, etc.)
tf.rf: newly proposed by us; relevance frequency rf = log(1 + ni/ni_), where ni is the # of docs which contain term ti and ni_ is the # of negative docs which contain term ti (a sketch follows below)
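A sketch of tf.rf in the guarded form given on the next slide, rf = log(2 + a/max(1, c)); this equals log(1 + ni/ni_) because ni = a + c counts all documents containing the term and ni_ = c counts the negative ones:

```python
import math

def rf(a, c):
    # a = # of positive docs containing the term,
    # c = # of negative docs containing the term.
    # max(1, c) guards against c = 0, as on the next slide.
    return math.log(2.0 + a / max(1, c))

def tf_rf(tf, a, c):
    return tf * rf(a, c)
```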

13 Analysis of Discriminating Power
The formulas differ, given each term's contingency table with N = a + b + c + d and d >> a, b, c (a sketch follows below):
idf = log(N/(a + c))
chi^2 = N(ad - bc)^2 / ((a + c)(b + d)(a + b)(c + d))
idf-prob = log((b + d)/(a + c))
rf = log(2 + a/c); to avoid c = 0, we set rf = log(2 + a/max(1, c))
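A sketch computing all four quantities from one term's contingency table, with the cell meanings spelled out:

```python
import math

def term_statistics(a, b, c, d):
    # a = positive docs containing the term, b = positive docs without it,
    # c = negative docs containing the term, d = negative docs without it.
    N = a + b + c + d
    idf = math.log(N / (a + c))
    chi2 = N * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))
    idf_prob = math.log((b + d) / (a + c))
    rf = math.log(2.0 + a / max(1, c))  # guarded against c = 0
    return idf, chi2, idf_prob, rf
```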

14 Analysis of Discriminating Power
Assume six terms have the same tf value, the first three terms share the same idf1, and the last three share the same idf2.
idf = log(N/(a + c)), with N = a + b + c + d ⇒ idf1 > idf2

15 Analysis of Discriminating Power
Given idf1 > idf2, the classical tf.idf gives more weight to the first three terms than to the last three. But t1 has more discriminating power than t2 and t3 for the positive category, so the tf.idf representation may lose its discriminating power. We propose the new relevance frequency factor rf = log(1 + (a + c)/c), equal to the log(2 + a/c) above; a numeric illustration follows below.
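A numeric illustration of this point with invented counts (not taken from the paper): three terms share the same document frequency, hence the same idf, yet split very differently between the positive and negative categories:

```python
import math

N = 1000  # hypothetical collection size
# (a, c) = # of positive / negative docs containing the term (invented)
terms = {
    "t1": (9, 1),  # concentrated in the positive category
    "t2": (5, 5),  # evenly spread
    "t3": (1, 9),  # concentrated in the negative category
}
for name, (a, c) in terms.items():
    idf = math.log(N / (a + c))         # identical for t1, t2, t3
    rf = math.log(2.0 + a / max(1, c))  # tracks discriminating power
    print(f"{name}: idf = {idf:.3f}, rf = {rf:.3f}")
```

idf assigns all three terms the same weight, while rf gives t1 roughly three times the weight of t3.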

16 Benchmark Data Collection 1
Data Collection 1: Reuters-21578
Top 10 categories, 7193 training and 2787 test documents
Stop words (a list of 292), punctuation and numbers removed
Porter stemming performed
Minimal term length is 4
Top p features per category selected using the chi-square metric, p = {50, 150, 300, 600, 900, 1200, 1800, 2400, All}
Null vectors removed
15959 terms (a preprocessing sketch follows below)
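A rough sketch of such a pipeline, assuming NLTK's Porter stemmer and scikit-learn stand in for the 2004 tooling; the stop list and the feature selection are simplified (scikit-learn's built-in list is not the 292-word list used here, and SelectKBest selects globally rather than top p per category):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.feature_selection import SelectKBest, chi2

stemmer = PorterStemmer()

def analyze(doc):
    # Drop stop words, punctuation/numbers and tokens shorter than 4
    # characters, then apply Porter stemming.
    tokens = [t for t in doc.lower().split()
              if t.isalpha() and len(t) >= 4 and t not in ENGLISH_STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]

# train_docs and y_train are hypothetical: raw texts and category labels.
vectorizer = CountVectorizer(analyzer=analyze)
X_counts = vectorizer.fit_transform(train_docs)
selector = SelectKBest(chi2, k=300)  # one of the slide's p values
X_selected = selector.fit_transform(X_counts, y_train)
```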

17 Summary of Reuters-21578 Data Set
p      #-Features   #-Trains   #-Tests
50     405          6123       2381
150    1217         6240       2422
300    2425         6318       2452
600    4938         6364       2468
900    7007         6410       2479
1200   9045         6423       2486
1800   11142        6456       2510
2400   12741        6469       2512
All    15937        6489       2519

18 Benchmark Data Collection 2
Data Collection 2: 20 Newsgroups
200 training and 100 test documents per category, 20 categories: 4000 training and 2000 test documents
Stop words, punctuation and numbers removed
Minimal term length is 4
Top p features per category selected using the chi-square metric, p = {5, 25, 50, 75, 100, 150, 200, 250, 300, 400, 500}
Null vectors removed
50088 terms

19 Summary of 20 Newsgroups Data Set
p     #-Features   #-Trains   #-Tests
50    991          3813       1861
75    1483         3886       1918
100   1973         3933       1940
150   2966         3961       1961
200   3955         3974       1973
250   4938         3981       1980
300   5901         3987       1985
400   7856         3992       1994
500   9803         3996       1995

20 Two Data Sets Comparison
Reuters: skewed category distribution. Among the 7193 training documents, the most common category (earn) contains 2877 documents (40%), while 80% of the categories have less than 7.5% of the training samples.
20 Newsgroups: uniform distribution. We selected the first 200 training and the first 100 test documents per category based on the 20news-bydate partition, giving 200 positive and 3800 negative samples for each chosen category (a selection sketch follows below).
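A sketch of that selection, assuming scikit-learn's fetch_20newsgroups loader (which serves the bydate partition) rather than the authors' original tooling:

```python
from sklearn.datasets import fetch_20newsgroups

def first_n_per_category(bunch, n):
    # Keep the first n documents encountered for each category.
    picked, counts = [], {}
    for text, label in zip(bunch.data, bunch.target):
        if counts.get(label, 0) < n:
            picked.append((text, label))
            counts[label] = counts.get(label, 0) + 1
    return picked

train_sample = first_n_per_category(fetch_20newsgroups(subset="train"), 200)
test_sample = first_n_per_category(fetch_20newsgroups(subset="test"), 100)
# 200 x 20 = 4000 training docs and 100 x 20 = 2000 test docs, as above.
```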

21 Performance Measure
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
Precision/recall breakeven point: tune the classifier parameter to yield the hypothetical point at which precision and recall are equal (a sketch follows below).
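A minimal sketch of the breakeven-point computation: sweep a decision threshold over the classifier's scores and report the point where precision and recall come closest to equal (y_true and scores are hypothetical gold labels and SVM decision values):

```python
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def breakeven_point(y_true, scores):
    # Try each distinct score as a threshold; keep the precision/recall
    # pair with the smallest gap and report its mean as the BEP.
    best_gap, best_bep = float("inf"), 0.0
    for threshold in sorted(set(scores)):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        p, r = precision_recall(y_true, y_pred)
        if abs(p - r) < best_gap:
            best_gap, best_bep = abs(p - r), (p + r) / 2.0
    return best_bep
```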

22 McNemar's Significance Test
Two classifiers f1 and f2 are based on two term weighting schemes.
Contingency table:
n00 (# of examples misclassified by both f1 and f2)      n01 (# of examples misclassified by f1 but not by f2)
n10 (# of examples misclassified by f2 but not by f1)    n11 (# of examples correctly classified by both f1 and f2)

23 McNemar's Significance Test
If the two classifiers have the same error rate, then n10 = n01.
chi = (|n10 - n01| - 1)^2 / (n01 + n10)
is approximately distributed as chi^2 with 1 degree of freedom.
If the null hypothesis is correct, the probability that this quantity is greater than chi^2(1, 0.99) = 6.64 is less than 0.01 (the significance level α). A sketch follows below.
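A sketch of the test as defined above; the disagreement counts passed in are invented for illustration:

```python
def mcnemar_chi(n01, n10):
    # n01 / n10 = examples misclassified by exactly one of the two classifiers.
    if n01 + n10 == 0:
        return 0.0
    return (abs(n10 - n01) - 1.0) ** 2 / (n01 + n10)

# Reject the null hypothesis of equal error rates at alpha = 0.01 when the
# statistic exceeds chi^2(1, 0.99) = 6.64, as stated above.
chi = mcnemar_chi(n01=40, n10=12)  # invented counts
significant = chi > 6.64
```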

24 Results on the Reuters Data Set

25 Observation: The breakeven point increases as the #-features grows. All schemes reach a maximum value at the full vocabulary, and the best BEP, 0.9272, is achieved by the tf.rf scheme.

26 Significance Tests Results on Reuters
#-features   McNemar's Test
200          {tf.chi} << {all the others}
400-1500     {binary, tf.chi} << {all the others}
2500         {binary, tf.chi} << {idf, tf.idf, tf.idf-prob} < {all the others}
5000+        {binary, idf, tf.chi} << {tf.idf, logtf.idf, tf.idf-prob} << {tf, logtf, ITF} < {tf.rf}
'<' and '<<' denote that the schemes on the right perform better than those on the left at significance levels 0.01 and 0.001 respectively; '{}' denotes no significant difference within a group.

27 Results on the 20 Newsgroups Data Set

28 Observation: The trends are not monotonically increasing. All schemes reach a maximum value at a small vocabulary size, ranging from 1000 to 3000 features. The best BEP, 0.6743, is achieved by the tf.rf scheme.

29 Significance Tests on 20 Newsgroups
#-features    McNemar's Test
100-500       {tf.chi} << {all the others}
1000          {tf.chi} << {binary} << {all the others}
1500          {tf.chi} << {binary} < {all the others} < {ITF, idf, tf.rf}
2000          {tf.chi, binary} << {all the others} < {ITF, tf.rf}
3000-5000     {binary, tf.chi} << {all the others} < {tf.rf}
6000-10000    {binary} << {all the others} << {tf.rf}

30 Discussion
To achieve a high breakeven point, different vocabulary sizes are required for the two data sets.
Reuters: diverse subject matter per category with overlapping vocabularies; large vocabularies are required.
20 Newsgroups: a single narrow subject per category with a limited vocabulary; 50-100 features per category are sufficient.

31 Discussion
tf.rf shows significantly better performance than the other schemes on the two different data sets.
Both of the best breakeven points are achieved by the tf.rf scheme, whether the category distribution is skewed or uniform.
The significance tests support this observation.

32 Discussion
There is no evidence that the idf factor adds to a term's discriminating power for text categorization when combined with the tf factor.
Reuters: tf, logtf and ITF achieve higher breakeven points than the schemes combined with idf (tf.idf, logtf.idf and tf.idf-prob).
20 Newsgroups: the differences between tf alone, idf alone and their combination are not significant.
Hence, the idf factor adds no discriminating power, and may even decrease a term's discriminating power.

33 Discussion
binary and tf.chi show consistently worse performance than the other schemes.
The binary scheme ignores the frequency information, which is crucial to representing the content of a document.
Feature selection metrics such as chi^2 involve the d value, where d >> a, b, c; the d value dominates the chi^2 value and may not appropriately express a term's discriminating power.

34 Discussion
In particular, the ITF scheme shows comparably good performance on the two data sets, but is still worse than the tf.rf scheme.

35 Conclusions
Our newly proposed tf.rf shows significantly better performance than the other schemes on two widely used data sets with different category distributions.
The schemes related to tf alone (tf, logtf, ITF) show rather good performance, though still worse than the tf.rf scheme.

36 Conclusions
The idf and chi factors, which take the collection distribution into consideration, do not improve, and may even decrease, a term's discriminating power for categorization.
binary and tf.chi significantly underperform the other schemes.

37 Thanks for your time and participation!

