A Comprehensive Comparative Study on Term Weighting Schemes for Text Categorization with SVM. Lan Man, 3 Nov 2004.


Synopsis: Purpose of this work; Experiment Design; Results and Discussions; Conclusions.

Purpose of this work
Text categorization is the task of assigning unlabelled documents to predefined categories.
Common classifiers: kNN, Decision Tree, Neural Network, Naïve Bayes, Linear Regression, SVM, Perceptron, Rocchio, etc.
Classifier committees: Bagging, Boosting.
SVM has shown rather good performance.

Purpose of this work (Cont.)
Biblio | Term Weighting | Kernel of SVM | Data Collection | Performance Evaluation
Dumais, 1998 | binary | linear | Reuters top … | … (microaveraged breakeven point)
Joachims, 1998 | tf.idf | polynomial & RBF | Reuters top 90 | .86 & .864 (microaveraged breakeven point)
Dai, 2003 | logtf.idf | linear | part of Reuters top … | … (F1)
… | … | … | … | …

Purpose of this work (Cont.)
Does the difference in performance come from different text representations or from different kernel functions of SVM?
[Leopold, 2002] points out that it is the text representation scheme, rather than the kernel function of SVM, that dominates text categorization performance.

Purpose of this work (Cont.)
Therefore, choosing an appropriate term weighting scheme is more important than choosing and tuning the kernel function of SVM for the text categorization task.
However, previous work is not enough to draw a definite conclusion about which term weighting scheme is better for SVM.

Purpose of this work (Cont.)
Different data preparation: stemming, stop-words, feature selection, term weighting schemes
Different data collections: Reuters (whole, top 10, top 90, partial top 10)
Different classifiers with various parameters
Different performance evaluation

Purpose of this work (Cont.)
Our study focuses on the various term weighting schemes for SVM.
Reasons for choosing the linear kernel function:
It is simple and fast.
Based on our preliminary experiments and previous studies, the linear kernel performs better than non-linear models, even when handling high-dimensional data.
Our current work is a comparison of term weighting schemes, rather than the choosing and tuning of kernel functions.

Term Weighting Schemes
10 term weighting schemes were selected, either for their reported superior classification results or as typical representations used with SVM.
They are: binary, tf, logtf, ITF, idf, tf.idf, logtf.idf, tf.idf-prob, tf.chi, tf.rf.

Term Weighting Schemes
The following four are based on term frequency alone:
binary: 1 if the term is present in the vector, 0 if absent
tf: the number of times a term occurs in a document
logtf: 1 + log(tf), where the log mends unfavorable linearity
ITF: 1 - r/(r + tf), usually with r = 1 (inverse term frequency, introduced by Leopold)
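As a minimal illustrative sketch (not from the slides; the helper name, its r default, and the example count are ours), the four tf-based weights can be computed directly from a raw in-document count:

```python
import math

def tf_based_weights(tf, r=1.0):
    """Compute the four tf-based schemes for one raw term count tf."""
    return {
        "binary": 1.0 if tf > 0 else 0.0,                 # term present or absent
        "tf": float(tf),                                  # raw term frequency
        "logtf": 1.0 + math.log(tf) if tf > 0 else 0.0,   # dampens large counts
        "ITF": 1.0 - r / (r + tf),                        # inverse term frequency (Leopold)
    }

# Example: a term occurring 5 times in a document
print(tf_based_weights(5))
```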

Term Weighting Schemes
The following four involve the idf factor:
idf: log(N/ni), where N is the number of documents and ni is the number of documents containing term ti
tf.idf: the widely used term representation
logtf.idf: (1 + log(tf)) * idf
tf.idf-prob: uses idf-prob = log((N - ni)/ni), an approximate representation of the term relevance weight, also called probabilistic idf
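A companion sketch for the idf-based schemes (again illustrative only; the collection counts in the example are hypothetical):

```python
import math

def idf_based_weights(tf, N, n_i):
    """Weights combining tf with a collection factor: N documents, n_i of them contain the term."""
    idf = math.log(N / n_i)                  # standard idf
    idf_prob = math.log((N - n_i) / n_i)     # probabilistic idf (idf-prob)
    return {
        "idf": idf,
        "tf.idf": tf * idf,
        "logtf.idf": (1.0 + math.log(tf)) * idf if tf > 0 else 0.0,
        "tf.idf-prob": tf * idf_prob,
    }

# Example: tf = 3 in a collection of 10,000 documents, 50 of which contain the term
print(idf_based_weights(3, 10000, 50))
```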

Term Weighting Schemes
tf.chi: a representative of schemes that combine feature selection measures (chi^2, information gain, odds-ratio, gain ratio, etc.)
tf.rf: newly proposed by us; relevance frequency rf = log(1 + ni/ni_), where ni is the number of documents containing term ti and ni_ is the number of negative documents containing term ti
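A sketch of the proposed tf.rf weight, using the c = 0 guard the slides introduce later (a = positive-class documents containing the term, c = negative-class documents containing it; the example counts are hypothetical):

```python
import math

def rf(a, c):
    """Relevance frequency; max(1, c) guards against division by zero."""
    return math.log(2.0 + a / max(1, c))

def tf_rf(tf, a, c):
    return tf * rf(a, c)

# Example: tf = 3; the term occurs in 40 positive and 4 negative training documents
print(tf_rf(3, 40, 4))
```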

Analysis of Discriminating Power
The different formulas, written with the contingency counts a, b, c, d (a/b = number of positive-category documents containing / not containing the term; c/d = number of negative-category documents containing / not containing it), where N = a+b+c+d and d >> a, b, c:
idf = log(N/(a+c))
chi^2 = N * (ad - bc)^2 / ((a+c)(b+d)(a+b)(c+d))
idf-prob = log((b+d)/(a+c))
rf = log(2 + a/c); to avoid c = 0, we set rf = log(2 + a/max(1, c))

Analysis of Discriminating Power
Assume six terms have the same tf value. The first three terms share the same idf1, and the last three share the same idf2. With idf = log(N/(a+c)) and N = a+b+c+d, the assumed counts give idf1 > idf2.

Analysis of Discriminating Power Given idf1<idf2, the classical tf.idf gives more weight to the first three terms than the last three terms. But t1 has more discriminating power than t2 and t3 in positive category. tf.idf representation may lose its discriminating power. We propose new factor relevance frequency rf = log (1+(a+c)/c).

Benchmark Data Collection 1
Data Collection 1 – Reuters, top 10 categories: 7193 training and 2787 test documents.
Stop words (292), punctuation and numbers removed.
Porter stemming performed.
Minimal term length is 4.
Top p features per category selected using the chi-square metric, p = {50, 150, 300, 600, 900, 1200, 1800, 2400, All}.
Null vectors removed.
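A rough scikit-learn sketch of a comparable preprocessing pipeline (an assumption on our part, not the authors' code): it applies stop-word removal, a minimum token length of 4, and chi-square feature selection, but omits Porter stemming and performs one global selection rather than the per-category top-p selection described above; the toy documents and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["grain prices rose sharply", "interest rates fell again", "wheat exports increased"]
labels = [1, 0, 1]  # toy binary labels, e.g. grain vs. not-grain

# Tokens must be at least 4 letters long; English stop words are dropped.
vectorizer = CountVectorizer(stop_words="english", token_pattern=r"(?u)\b[a-zA-Z]{4,}\b")
X = vectorizer.fit_transform(docs)

# Keep the top-k features ranked by the chi-square statistic.
selector = SelectKBest(chi2, k=min(5, X.shape[1]))
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)
```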

Summary of the Reuters data set (table: for each p, the number of features, training documents and test documents; p up to All).

Benchmark Data Collection 2
Data Collection 2 – 20 Newsgroups: 200 training and 100 test documents per category, 20 categories; 4000 training and 2000 test documents in total.
Stop words, punctuation and numbers removed.
Minimal term length is 4.
Top p features per category selected using the chi-square metric, p = {5, 25, 50, 75, 100, 150, 200, 250, 300, 400, 500}.
Null vectors removed.

Summary of the 20 Newsgroups data set (table: for each p, the number of features, training documents and test documents).

Two Data Sets Comparison
Reuters: skewed category distribution. Among the 7193 training documents, the most common category (earn) contains 2877 documents (40%), while 80% of the categories have less than 7.5% of the training samples.
20 Newsgroups: uniform distribution. We selected the first 200 training and the first 100 test documents per category, based on the news-bydate partition, giving 200 positive and 3800 negative samples for each chosen category.

Performance Measure
Precision = true positives / (true positives + false positives).
Recall = true positives / (true positives + false negatives).
Precision/recall breakeven point: tune the classifier parameter to find the hypothetical point at which precision and recall are equal.
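A small sketch of the two measures from raw counts (the example numbers are hypothetical; the breakeven point itself is found by sweeping the classifier's threshold until the two values meet):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall of a binary classifier from its error counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: 80 true positives, 20 false positives, 40 false negatives
print(precision_recall(80, 20, 40))  # -> (0.8, 0.666...)
```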

McNemar's significance test
Two classifiers f1 and f2 are built on two term weighting schemes. Contingency table:
n00 = # of examples misclassified by both f1 and f2
n01 = # of examples misclassified by f1 but not by f2
n10 = # of examples misclassified by f2 but not by f1
n11 = # of examples correctly classified by both f1 and f2

McNemar's Significance Test
If the two classifiers have the same error rate, then n10 = n01.
chi = (|n10 - n01| - 1)^2 / (n01 + n10) is approximately distributed as chi^2 with 1 degree of freedom.
If the null hypothesis is correct, the probability that this quantity is greater than chi^2(1, 0.99) = 6.64 is less than 0.01 (the significance level alpha).
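A one-line sketch of the continuity-corrected statistic (the disagreement counts in the example are hypothetical):

```python
def mcnemar_statistic(n01, n10):
    """Continuity-corrected McNemar statistic; compare against chi^2(1, 0.99) = 6.64."""
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Example: f1 alone misclassifies 35 documents, f2 alone misclassifies 12
stat = mcnemar_statistic(35, 12)
print(stat, stat > 6.64)  # significant at the 0.01 level if True
```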

Results on the Reuters data set

Observation: the break-even point increases as the number of features grows. All schemes reach a maximum value at the full vocabulary, and the best BEP is achieved by the tf.rf scheme.

Significance Tests Results on Reuters
#-features | McNemar's test
200 | {tf.chi} << {all the others}
… | {binary, tf.chi} << {all the others}
2500 | {binary, tf.chi} << {idf, tf.idf, tf.idf-prob} < {all the others}
… | {binary, idf, tf.chi} << {tf.idf, logtf.idf, tf.idf-prob} << {tf, logtf, ITF} < {tf.rf}
'<' and '<<' denote "better than" at significance levels 0.01 and …, respectively; '{}' denotes no significant difference.

Results on the 20 Newsgroups data set

Observation: the trends are not a monotonic increase. All schemes reach a maximum value within a small vocabulary range starting at about 1000. The best BEP is achieved by the tf.rf scheme.

Significance Tests on 20 Newsgroups
#-features | McNemar's test
… | {tf.chi} << {all the others}
1000 | {tf.chi} << {binary} << {all the others}
1500 | {tf.chi} << {binary} < {all the others} < {ITF, idf, tf.rf}
2000 | {tf.chi, binary} << {all the others} < {ITF, tf.rf}
… | {binary, tf.chi} << {all the others} < {tf.rf}
… | {binary} << {all the others} << {tf.rf}

Discussion
To achieve a high break-even point, different vocabulary sizes are required for the two data sets.
Reuters: diverse subject matter per category with overlapping vocabularies; large vocabularies are required.
20 Newsgroups: a single narrow subject per category with a limited vocabulary; a limited vocabulary per category is sufficient.

Discussion
tf.rf shows significantly better performance than the other schemes on the two different data sets.
Both of the best break-even points are achieved by the tf.rf scheme, whether the category distribution is skewed or uniform.
The significance tests support this observation.

Discussion
There is no evidence that the idf factor adds to a term's discriminating power for text categorization when combined with the tf factor.
Reuters: tf, logtf and ITF achieve higher break-even points than the schemes combined with idf (tf.idf, logtf.idf and tf.idf-prob).
20 Newsgroups: the differences between tf alone, idf alone, or both combined are not significant.
Hence, the idf factor adds no discriminating power, and may even decrease a term's discriminating power.

Discussion
Binary and tf.chi show consistently worse performance than the other schemes.
The binary scheme ignores frequency information, which is crucial to representing the content of a document.
Feature selection metrics such as chi^2 involve the d value, where d >> a, b and c; d dominates the chi^2 value and may not appropriately express a term's discriminating power.

Discussion
Notably, the ITF scheme shows comparably good performance on the two data sets, but still worse than the tf.rf scheme.

Conclusions
Our newly proposed tf.rf shows significantly better performance than the other schemes on the two widely used data sets with different category distributions.
Schemes based on tf alone (tf, logtf, ITF) show rather good performance, but are still worse than the tf.rf scheme.

Conclusions
The idf and chi factors, which take the collection distribution into consideration, do not improve, and may even decrease, a term's discriminating power for categorization.
Binary and tf.chi significantly underperform the other schemes.

Thanks for your time and participation!