1 Sparsity Analysis of Term Weighting Schemes and Application to Text Classification
Nataša Milić-Frayling (1), Dunja Mladenić (2), Janez Brank (2), Marko Grobelnik (2)
(1) Microsoft Research, Cambridge, UK
(2) Jožef Stefan Institute, Ljubljana, Slovenia

2 Introduction
Feature selection in the context of text categorization
Comparing different feature ranking schemes
Characterizing feature rankings based on their sparsity behavior
Sparsity is defined as the average number of different words in a document (after feature selection has removed some words)

3 Feature Weighting Schemes
Odds ratio: OR(t) = log[ odds(t|c) / odds(t|¬c) ]
Information gain: IG(t; c) = entropy(c) − entropy(c|t)
χ²-statistic: χ²(t) = N (N_tc N_¬t¬c − N_t¬c N_¬tc)² / [N_c N_¬c N_t N_¬t]
N = number of all documents; N_tc = number of documents from class c containing term t, etc. The numerator equals 0 if t and c are independent.
Robertson-Sparck Jones weighting: RSJ(t) = log[ (N_tc + 0.5)(N_¬t¬c + 0.5) / ((N_t¬c + 0.5)(N_¬tc + 0.5)) ] (very similar to odds ratio)
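The following is a minimal sketch (my illustration, not the authors' code) of how these count-based scores can be computed for a single (term t, category c) pair; the function and variable names are hypothetical, and the add-0.5 smoothing in the odds ratio is my addition to avoid division by zero.

```python
# A minimal sketch (not the authors' code) of the count-based scores above,
# for a single (term t, category c) pair.  Hypothetical inputs:
# n = all documents, n_t = documents containing t, n_c = documents in class c,
# n_tc = documents in class c containing t.  Assumes 0 < n_t < n and 0 < n_c < n.
import math

def _h(p):
    """Binary entropy in bits; h(0) = h(1) = 0."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def term_scores(n, n_t, n_c, n_tc, smooth=0.5):
    # Remaining cells of the 2x2 contingency table.
    n_tnc = n_t - n_tc             # t present, class != c
    n_ntc = n_c - n_tc             # t absent,  class == c
    n_ntnc = n - n_t - n_c + n_tc  # t absent,  class != c

    # Odds ratio: log[odds(t|c) / odds(t|not c)], with add-0.5 smoothing (my addition).
    oddsratio = math.log(((n_tc + smooth) * (n_ntnc + smooth)) /
                         ((n_ntc + smooth) * (n_tnc + smooth)))

    # Information gain: entropy(c) - entropy(c|t).
    ig = _h(n_c / n) - (n_t / n) * _h(n_tc / n_t) - ((n - n_t) / n) * _h(n_ntc / (n - n_t))

    # Chi-square statistic; the numerator is 0 if t and c are independent.
    chi2 = n * (n_tc * n_ntnc - n_tnc * n_ntc) ** 2 / (n_t * (n - n_t) * n_c * (n - n_c))

    # Robertson-Sparck Jones weight (very similar to the smoothed odds ratio).
    rsj = math.log(((n_tc + 0.5) * (n_ntnc + 0.5)) / ((n_tnc + 0.5) * (n_ntc + 0.5)))

    return {"OR": oddsratio, "IG": ig, "chi2": chi2, "RSJ": rsj}
```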

4 Feature Weighting Schemes
Weights based on word frequency:
DF = document frequency (the number of documents containing the word; this ranking suggests using the most common words)
IDF = inverse document frequency (use the least common words)
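As a quick illustration (hypothetical helper names, not from the slides), both rankings can be derived from a table of document frequencies:

```python
# A minimal sketch of the two frequency-based rankings.
import math

def df_ranking(doc_freq):
    """doc_freq: dict mapping term -> number of documents containing it."""
    return sorted(doc_freq, key=doc_freq.get, reverse=True)       # most common words first

def idf_ranking(doc_freq, n_docs):
    idf = {t: math.log(n_docs / f) for t, f in doc_freq.items()}  # rare words get high IDF
    return sorted(idf, key=idf.get, reverse=True)                 # least common words first
```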

5 Feature Weighting Schemes
Weights based on a linear classifier (w, b):
prediction(d) = sgn[ b + Σ_i w_i TF(t_i, d) ], where the sum runs over terms t_i
If a weight w_i is close to 0, the term t_i has little influence on the predictions. If it is not important for predictions, it is probably not important for learning either.
Thus, use |w_i| as the score of the term t_i.
We use linear models trained using SVM and the perceptron.
It might be practical to train the model on only a subset of the full training set (e.g. ½ or ¼ of the full training set, etc.).
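A minimal sketch of such a classifier-based ranking, assuming scikit-learn (the slides do not specify an implementation); the texts and binary labels are hypothetical inputs:

```python
# A minimal sketch (assumption: scikit-learn) of SVM-based feature ranking:
# train a linear model on TF vectors and score each term by |w_i|.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def svm_feature_ranking(texts, labels):
    vectorizer = CountVectorizer()            # TF representation of the documents
    X = vectorizer.fit_transform(texts)       # optionally use only a subset to save time
    clf = LinearSVC().fit(X, labels)          # linear model (w, b)
    scores = np.abs(clf.coef_.ravel())        # score(t_i) = |w_i|
    order = np.argsort(-scores)               # best-scoring terms first
    terms = np.array(vectorizer.get_feature_names_out())
    return terms[order], scores[order]
```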

6 Characterization of Feature Rankings in terms of Sparsity
We have a relatively good understanding of feature rankings based on odds ratio, information gain, etc., because they are based on explicit formulas for feature scores.
How can we better understand the rankings based on linear classifiers?
Let "sparsity" be the average number of different words per document after some feature selection has been applied.
Equivalently: the average number of nonzero components per vector representing a document.
This has direct ties to memory consumption, as well as to CPU time consumption for computing norms, dot products, etc.
We can plot the "sparsity curve" showing how sparsity grows as we add more and more features from a given ranking.
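A minimal sketch of computing such a curve from a sparse term-document matrix (my illustration; the matrix and ranking are assumed inputs):

```python
# Sparsity curve: average number of nonzero components per document
# when only the top-k features of a given ranking are kept, for k = 1..n.
import numpy as np
from scipy.sparse import csc_matrix

def sparsity_curve(X, ranking):
    """X: (n_docs, n_terms) sparse term-document matrix; ranking: term indices, best first."""
    X = csc_matrix(X)
    df = np.diff(X.indptr)                  # nonzeros per column = document frequency of each term
    n_docs = X.shape[0]
    cumulative_nonzeros = np.cumsum(df[np.asarray(ranking)])
    return cumulative_nonzeros / n_docs     # sparsity after keeping the top 1, 2, ..., n features
```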

7 Sparsity Curves

8 Sparsity as the independent variable
When discussing and comparing feature rankings, we often use the number of features as the independent variable: "What is the performance when using the first 100 features?", etc.
This is somewhat unfair towards rankings that prefer (at least initially) less frequent features, such as odds ratio.
Sparsity is much more directly connected to memory and CPU time requirements.
Thus, we propose the use of sparsity as the independent variable when comparing feature rankings.

9 Performance as a function of the number of features (Naïve Bayes, 16 categories of RCV2)

10 Performance as a function of sparsity

11 Sparsity as a cutoff criterion
Each category is treated as a binary classification problem (does the document belong to category c or not?).
Thus, a feature ranking method produces one ranking per category.
We must choose how many of the top-ranked features to use for learning and classification.
Alternatively, we can define the cutoff in terms of sparsity, as in the sketch below.
The best number of features can vary greatly from one category to another. Does the best sparsity vary less between categories?
Suppose we want a constant number of features for each category. Is it better to use a constant sparsity for each category?
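A minimal sketch of a sparsity-based cutoff (my illustration; it reuses the hypothetical sparsity_curve() sketched earlier):

```python
# Keep the smallest number of top-ranked features whose resulting sparsity
# reaches a chosen target value.
import numpy as np

def cutoff_by_sparsity(X, ranking, target_sparsity):
    curve = sparsity_curve(X, ranking)                     # nondecreasing in k
    k = int(np.searchsorted(curve, target_sparsity)) + 1   # first k with curve[k-1] >= target
    return list(ranking[:min(k, len(ranking))])            # indices of the selected features
```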

12 Results

13 Conclusions
Sparsity is an interesting and useful concept.
As a cutoff criterion, it is no worse than, and often a little better than, the number of features.
It offers more direct control over memory and CPU time consumption.
When comparing feature selection methods, it is not biased in favour of methods which prefer more common features.

14 Future work
Characterize feature ranking schemes in terms of other characteristics besides sparsity curves.
E.g. cumulative information gain: how the sum of IG(t; c) over the first k terms t of the feature ranking grows with k.
The goal: define a set of characteristic curves that would explain why some feature rankings (e.g. SVM-based) are better than others.
If we know the characteristic curves of a good feature ranking, we can synthesize new rankings with approximately the same characteristic curves. Would they also perform comparably well?
With a good set of feature characteristics, we might be able to take the approximate characteristics of a good feature ranking and then synthesize comparably good rankings on other classes or datasets. (Otherwise it can be expensive to get a really good feature ranking, such as the SVM-based one.)
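A minimal sketch of such a cumulative information gain curve (my illustration; the IG scores could come from a function like the term_scores() sketched earlier):

```python
# Cumulative information gain: how the sum of IG(t; c) over the first k terms
# of a feature ranking grows with k.
import numpy as np

def cumulative_ig_curve(ig_scores, ranking):
    """ig_scores[i] = IG(t_i; c); ranking: term indices ordered from best to worst."""
    return np.cumsum(np.asarray(ig_scores)[np.asarray(ranking)])
```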


Download ppt "Sparsity Analysis of Term Weighting Schemes and Application to Text Classification Nataša Milić-Frayling,1 Dunja Mladenić,2 Janez Brank,2 Marko Grobelnik2."

Similar presentations


Ads by Google