Sparsity Analysis of Term Weighting Schemes and Application to Text Classification
Nataša Milić-Frayling (1), Dunja Mladenić (2), Janez Brank (2), Marko Grobelnik (2)
(1) Microsoft Research, Cambridge, UK
(2) Jožef Stefan Institute, Ljubljana, Slovenia

Introduction
- Feature selection in the context of text categorization
- Comparing different feature ranking schemes
- Characterizing feature rankings based on their sparsity behavior
- Sparsity is defined as the average number of different words per document (after feature selection has removed some words)

Feature Weighting Schemes
- Odds ratio: OR(t) = log[ odds(t|c) / odds(t|c̄) ]
- Information gain: IG(t; c) = entropy(c) − entropy(c|t)
- χ²-statistic: χ²(t) = N (N_tc N_t̄c̄ − N_t̄c N_tc̄)² / [ N_c N_c̄ N_t N_t̄ ]
  where N = number of all documents, N_tc = number of documents from class c containing term t, etc. The numerator equals 0 if t and c are independent.
- Robertson–Sparck-Jones weighting: RSJ(t) = log[ (N_tc + 0.5)(N_t̄c̄ + 0.5) / ((N_t̄c + 0.5)(N_tc̄ + 0.5)) ]  (very similar to odds ratio)
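A minimal sketch of these formulas in Python/NumPy (not the authors' code; function and variable names are mine). It scores one term against one category from the four cells of the term/category contingency table and, for simplicity, assumes all four cell counts are nonzero except where 0.5 smoothing is applied.

```python
import numpy as np

def entropy(p):
    """Entropy (in bits) of a Bernoulli variable with P(1) = p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def term_scores(n11, n10, n01, n00):
    """Score one term t against one category c from its 2x2 contingency counts.

    n11: docs in c containing t        n10: docs not in c containing t
    n01: docs in c not containing t    n00: docs not in c, not containing t
    This sketch assumes all four counts are nonzero (otherwise smooth as in RSJ).
    """
    n = n11 + n10 + n01 + n00              # all documents
    n_c, n_notc = n11 + n01, n10 + n00     # docs in / not in class c
    n_t, n_nott = n11 + n10, n01 + n00     # docs containing / not containing t

    # Odds ratio: log[ odds(t|c) / odds(t|not c) ]
    odds_ratio = np.log((n11 * n00) / (n01 * n10))

    # Information gain: entropy(c) - entropy(c|t)
    ig = entropy(n_c / n) - ((n_t / n) * entropy(n11 / n_t)
                             + (n_nott / n) * entropy(n01 / n_nott))

    # Chi-squared statistic: the numerator is 0 when t and c are independent
    chi2 = n * (n11 * n00 - n01 * n10) ** 2 / (n_c * n_notc * n_t * n_nott)

    # Robertson-Sparck-Jones weight: the odds ratio with 0.5 added to each count
    rsj = np.log((n11 + 0.5) * (n00 + 0.5) / ((n01 + 0.5) * (n10 + 0.5)))

    return {"OR": odds_ratio, "IG": ig, "chi2": chi2, "RSJ": rsj}
```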

Feature Weighting Schemes
Weights based on word frequency:
- DF = document frequency (number of documents containing the word; this ranking suggests using the most common words)
- IDF = inverse document frequency (use the least common words)
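For completeness, a small sketch (names are mine) of the two frequency-based rankings computed directly from a sparse document-term matrix:

```python
import numpy as np
import scipy.sparse as sp

def df_idf_rankings(X):
    """Rank terms by document frequency.

    X : sparse document-term matrix (documents x terms).
    Returns (df_ranking, idf_ranking): term indices ordered with the most
    common terms first (DF) and with the least common terms first (IDF).
    """
    df = np.diff(sp.csc_matrix(X).indptr)   # number of docs containing each term
    return np.argsort(-df), np.argsort(df)
```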

Feature Weighting Schemes
Weights based on a linear classifier (w, b):
  prediction(d) = sgn[ b + Σ_i w_i TF(t_i, d) ]  (sum over all terms t_i)
- If a weight w_i is close to 0, the term t_i has little influence on the predictions. If it is not important for predictions, it is probably not important for learning either.
- Thus, use |w_i| as the score of the term t_i.
- We use linear models trained with SVM and the perceptron.
- It might be practical to train the model on only a subset of the full training set (e.g. ½ or ¼ of it).
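A sketch of this ranking using scikit-learn (an assumption for illustration; the slide does not name a particular SVM implementation). It trains a linear SVM on term-frequency vectors for one category and ranks terms by |w_i|; the train_fraction parameter mirrors the suggestion to train on only part of the training set.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def svm_feature_ranking(docs, labels, train_fraction=1.0, seed=0):
    """Rank features by |w_i| from a linear SVM trained on (part of) the data.

    docs   : list of training documents (strings)
    labels : binary labels for one category (1 = in category, 0 = not)
    train_fraction lets the model be trained on only a subset of the training
    set, e.g. 0.5 or 0.25 (for small fractions you may want to stratify).
    """
    vectorizer = CountVectorizer()            # TF (term frequency) representation
    X = vectorizer.fit_transform(docs)
    y = np.asarray(labels)

    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subset = rng.choice(n, size=max(1, int(train_fraction * n)), replace=False)

    model = LinearSVC()                       # linear model (w, b)
    model.fit(X[subset], y[subset])

    scores = np.abs(model.coef_.ravel())      # |w_i| as the score of term t_i
    order = np.argsort(-scores)               # best features first
    terms = np.array(vectorizer.get_feature_names_out())
    return terms[order], scores[order]
```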

Characterization of Feature Rankings in terms of Sparsity
- We have a relatively good understanding of feature rankings based on odds ratio, information gain, etc., because they are based on explicit formulas for feature scores.
- How can we better understand the rankings based on linear classifiers?
- Let "sparsity" be the average number of different words per document after some feature selection has been applied. Equivalently: the average number of nonzero components per vector representing a document.
- Sparsity has direct ties to memory consumption, as well as to CPU time consumption for computing norms, dot products, etc.
- We can plot a "sparsity curve" showing how sparsity grows as we add more and more features from a given ranking.
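A sketch (NumPy/SciPy; the helper name is mine) of how such a sparsity curve can be computed from a sparse document-term matrix and a feature ranking: keeping only the top k features, sparsity is the average number of nonzero components per document.

```python
import numpy as np
import scipy.sparse as sp

def sparsity_curve(X, ranking):
    """Average number of distinct terms per document as features are added.

    X       : sparse document-term matrix (documents x terms)
    ranking : term indices ordered from best to worst by some weighting scheme
    Returns an array s where s[k-1] is the sparsity when the top k features
    are kept.
    """
    X = sp.csc_matrix(X)                  # column-oriented layout
    n_docs = X.shape[0]
    # Number of documents containing each term = nonzeros in that term's column.
    docs_per_term = np.diff(X.indptr)
    # Adding the k-th ranked term adds docs_per_term[ranking[k]] nonzero
    # components in total, i.e. that many / n_docs per document on average.
    return np.cumsum(docs_per_term[ranking]) / n_docs
```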

Sparsity Curves

Sparsity as the independent variable
- When discussing and comparing feature rankings, we often use the number of features as the independent variable: "What is the performance when using the first 100 features?", etc.
- This is somewhat unfair towards rankings that prefer (at least initially) less frequent features, such as odds ratio.
- Sparsity is much more directly connected to memory and CPU time requirements.
- Thus, we propose using sparsity as the independent variable when comparing feature rankings.

Performance as a function of the number of features (Naïve Bayes, 16 categories of RCV2)

Performance as a function of sparsity

Sparsity as a cutoff criterion
- Each category is treated as a binary classification problem (does the document belong to category c or not?). Thus, a feature ranking method produces one ranking per category.
- We must choose how many of the top-ranked features to use for learning and classification. Alternatively, we can define the cutoff in terms of sparsity.
- The best number of features can vary greatly from one category to another. Does the best sparsity vary less between categories?
- Suppose we want a constant number of features for each category. Is it better to use a constant sparsity for each category instead?
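A sketch of a sparsity-based cutoff, reusing the sparsity_curve helper from the earlier sketch: for each category's ranking, keep the smallest number of top features whose sparsity reaches a chosen target value, so the number of features may differ across categories while the density of the vectors stays roughly constant.

```python
import numpy as np

def cutoff_for_sparsity(X, ranking, target_sparsity):
    """Smallest number k of top-ranked features whose sparsity (average
    nonzero components per document) reaches target_sparsity."""
    s = sparsity_curve(X, ranking)          # helper from the earlier sketch
    reached = np.flatnonzero(s >= target_sparsity)
    return int(reached[0]) + 1 if reached.size else len(ranking)
```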

Results

Conclusions
- Sparsity is an interesting and useful concept.
- As a cutoff criterion, it is no worse, and often a little better, than the number of features.
- It offers more direct control over memory and CPU time consumption.
- When comparing feature selection methods, it is not biased in favour of methods that prefer more common features.

Future work
- Characterize feature ranking schemes in terms of other characteristics besides sparsity curves. E.g. cumulative information gain: how the sum of IG(t; c) over the first k terms t of the ranking grows with k.
- The goal: define a set of characteristic curves that would explain why some feature rankings (e.g. SVM-based) are better than others.
- If we know the characteristic curves of a good feature ranking, we could synthesize new rankings with approximately the same characteristic curves. Would they also perform comparably well?
- With a good set of feature characteristics, we might be able to take the approximate characteristics of a good feature ranking and then synthesize comparably good rankings on other classes or datasets. (Otherwise it can be expensive to obtain a really good feature ranking, such as the SVM-based one.)
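A sketch of the proposed cumulative information gain curve (names are mine), assuming per-term IG scores for the category are already available, e.g. from the term_scores sketch earlier:

```python
import numpy as np

def cumulative_ig_curve(ig_scores, ranking):
    """Cumulative sum of IG(t; c) over the first k terms of a feature ranking.

    ig_scores : per-term information gain scores, indexed by term id
    ranking   : term indices in the order produced by some weighting scheme
                (e.g. an SVM-based ranking, not necessarily the IG ranking)
    Returns an array whose (k-1)-th entry is the sum over the top k terms.
    """
    return np.cumsum(np.asarray(ig_scores)[ranking])
```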