On Feature Distributional Clustering for Text Categorization. Bekkerman, El-Yaniv, Tishby and Winter. The Technion. June 27, 2001.


Plan of talk: A presentation of a new text categorization technique based on distributional clustering and Support Vector Machines (SVM), and a comparative evaluation of the new technique with respect to previous work (Dumais et al.) that used Mutual Information (MI) feature selection.

Main results: The evaluation is performed on two benchmark corpora, Reuters and 20 Newsgroups (20NG). The new technique outperforms the known one on 20NG, but it is not better on Reuters. Possible reasons for this behavior will be discussed.

Text categorization: the fundamental problem of splitting a large text corpus into a number of predefined semantic categories. We deal with its supervised version. The problem has many real-world applications, e.g. search engines and helpdesks.

Text representation: The standard approach is Bag-of-Words, where a document is represented by the list of words it contains. A more sophisticated method uses distributional clusters: each word is represented as a distribution over the categories, and the words are then clustered into k clusters. Details follow later.
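As an illustration of the distributional representation (a minimal Python sketch; the data and category names below are made up, not from the paper), each word's empirical distribution over categories can be estimated directly from labeled documents:

```python
from collections import Counter, defaultdict

def word_category_distributions(docs):
    """docs: iterable of (list_of_words, category) pairs.
    Returns {word: {category: P(category | word)}} estimated from counts."""
    counts = defaultdict(Counter)
    for words, category in docs:
        for w in words:
            counts[w][category] += 1
    dists = {}
    for w, cat_counts in counts.items():
        total = sum(cat_counts.values())
        dists[w] = {c: n / total for c, n in cat_counts.items()}
    return dists

# Toy usage (illustrative data only):
docs = [(["car", "engine", "oil"], "rec.autos"),
        (["voltage", "circuit", "car"], "sci.electronics")]
print(word_category_distributions(docs)["car"])
# {'rec.autos': 0.5, 'sci.electronics': 0.5}
```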

Support Vector Machines: a modern inductive classification method proposed by Vapnik. It usually shows an advantage over other learning schemes such as K Nearest Neighbors and Naïve Bayes.

Corpora: A corpus is a large collection of documents. We evaluated our algorithms on two well-known corpora: Reuters (ModApte split), with 7063 articles in the training set, 2742 articles in the test set, and 118 categories; and 20 Newsgroups, with approximately 20,000 articles and 20 categories.

Multi-labeling vs. uni-labeling: In a multi-labeled corpus, many articles belong to several categories; for example, 15.5% of Reuters documents are multi-labeled. In a uni-labeled corpus, each article belongs to only one category. 20 Newsgroups was thought to be uni-labeled, but in fact it contains 4.5% multi-labeled documents.

Related results: Dumais et al. (1998): SVM with simple feature selection on Reuters; best known result: 92.0% break-even over the 10 largest categories. Baker and McCallum (1998): distributional clustering + Naïve Bayes on 20NG; 85.7% accuracy (uni-labeled scheme).

Related results (contd.): Joachims (1996): Rocchio algorithm and Naïve Bayes; best known result on 20NG (uni-labeled approach): 90.3% accuracy. Slonim and Tishby (2000): the Information Bottleneck method, used in our work.

Related results (contd.): Zhang and Oles (2001): a comparative study of linear classification techniques for text categorization over different corpora; SVM consistently performs best.

The case of our study: corpus → (MI feature selection vs. Distributional Clustering) → Support Vector Machine → result.

Feature selection via Mutual Information: On the training set, choose the N words that contribute most to separating the categories. The contribution is measured by the Mutual Information MI(w, c) between the occurrence of each word w and membership in each category c.
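A hedged sketch of this score (assuming binary word occurrence per document and probabilities estimated from document counts; the exact estimator used in the paper may differ):

```python
import math

def mutual_information(n_wc, n_w, n_c, n):
    """MI between occurrence of word w and membership in category c.
    n_wc: docs of category c containing w; n_w: docs containing w;
    n_c: docs of category c; n: total number of docs."""
    mi = 0.0
    for has_w, in_c in [(1, 1), (1, 0), (0, 1), (0, 0)]:
        # joint document count for this (word occurrence, category membership) pattern
        joint = {(1, 1): n_wc,
                 (1, 0): n_w - n_wc,
                 (0, 1): n_c - n_wc,
                 (0, 0): n - n_w - n_c + n_wc}[(has_w, in_c)]
        if joint == 0:
            continue
        p_joint = joint / n
        p_w = (n_w if has_w else n - n_w) / n
        p_c = (n_c if in_c else n - n_c) / n
        mi += p_joint * math.log(p_joint / (p_w * p_c))
    return mi

# Example: a word appears in 30 of the 50 docs of category c, and in 40 of 100 docs overall
print(mutual_information(n_wc=30, n_w=40, n_c=50, n=100))
```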

Feature selection via MI (contd.): For each category we build a list of the N most contributing words. For example (on 20 Newsgroups): sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, circuits, ...; rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, ...

Distributional Clustering was proposed by Pereira, Tishby and Lee (1993); its generalization is called the Information Bottleneck (Tishby, Pereira, Bialek 1999). In our case, each word in the training set is represented as a distribution over the categories it appears in. Each word w is then assigned to a pseudo-word (a word cluster).

Distributional Clustering (contd.): The idea is to maximize the Mutual Information between the pseudo-words and the categories with respect to the partition, under a constraint on the compression (the Mutual Information between words and pseudo-words). The solution takes the form p(w̃ | w) = [p(w̃) / Z(w, β)] · exp(−β · KL(p(c | w) || p(c | w̃))), where Z is the normalization factor and β is an annealing parameter.
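A minimal NumPy sketch of the resulting soft-assignment step (array names and shapes are my assumptions: rows of p_c_given_w are word distributions over categories, rows of p_c_given_cluster are cluster centroids):

```python
import numpy as np

def soft_assign(p_c_given_w, p_c_given_cluster, p_cluster, beta, eps=1e-12):
    """p(cluster | word) proportional to p(cluster) * exp(-beta * KL(p(c|word) || p(c|cluster)))."""
    # KL divergence of every word distribution from every cluster centroid
    kl = np.sum(p_c_given_w[:, None, :] *
                (np.log(p_c_given_w[:, None, :] + eps) -
                 np.log(p_c_given_cluster[None, :, :] + eps)),
                axis=2)
    unnorm = p_cluster[None, :] * np.exp(-beta * kl)
    return unnorm / unnorm.sum(axis=1, keepdims=True)  # rows: words, columns: clusters
```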

Deterministic Annealing: a powerful clustering method proposed by Rose et al. (1998). The approach is top-down: start with one cluster at low β ("high temperature"), then split clusters while lowering the "temperature" (increasing β) until reaching a stable stage.
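A rough skeleton of this top-down schedule (split_heaviest_cluster and refit_clusters are hypothetical helper callables, and the schedule constants are assumptions, not the paper's exact procedure):

```python
def deterministic_annealing(words, target_clusters, split_heaviest_cluster, refit_clusters,
                            beta0=0.1, rate=1.5):
    """Top-down annealing: start with one cluster, repeatedly raise beta (lower the
    'temperature'), split a cluster and re-fit the soft assignments until the target
    number of clusters is reached."""
    clusters = [list(words)]          # a single cluster containing every word
    beta = beta0
    while len(clusters) < target_clusters:
        beta *= rate                               # lowering the temperature
        clusters = split_heaviest_cluster(clusters)
        clusters = refit_clusters(clusters, beta)  # e.g. iterate the soft-assignment rule above
    return clusters
```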

Deterministic Annealing (contd.): [figure]

Vector space in our experiment: In the MI feature selection technique, documents are projected onto the N most contributing words. In the Information Bottleneck technique, words are first grouped into clusters, and then documents are projected onto the pseudo-words. So documents are vectors whose elements are the numbers of occurrences of "best" words (in the first case) or pseudo-words (in the second).
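A sketch of the projection step for the clustered representation (assuming a hard word-to-cluster map; with soft clustering one would spread each count according to p(cluster | word)):

```python
def project_document(words, word_to_cluster, num_clusters):
    """Represent a document as a vector of pseudo-word (cluster) occurrence counts."""
    vec = [0] * num_clusters
    for w in words:
        if w in word_to_cluster:      # words outside the selected vocabulary are ignored
            vec[word_to_cluster[w]] += 1
    return vec

# Toy usage with a made-up mapping:
word_to_cluster = {"car": 0, "engine": 0, "voltage": 1}
print(project_document(["car", "engine", "car", "voltage"], word_to_cluster, 2))  # [3, 1]
```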

Support Vector Machines: a modern classification technique in which the classification is based only on the border examples (the support vectors). We used a linear SVM (the SVMlight package by Joachims).
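The experiments used the SVMlight package; as a hedged stand-in (not the original setup), a linear SVM over the document vectors can be sketched with scikit-learn:

```python
from sklearn.svm import LinearSVC

def train_binary_svm(X_train, y_train, C=1.0):
    """Train a linear SVM for one category; y_train holds 0/1 labels for that category."""
    clf = LinearSVC(C=C)        # C plays the role of the soft-margin cost parameter
    clf.fit(X_train, y_train)
    return clf
```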

Multi-labeled setting: 1. Apply MI feature selection (or distributional clustering) on the training and test sets. 2. For each category, train a binary classifier on the training set. 3. Run all the classifiers on each document in the test set. 4. Assign the document to all the categories whose classifiers accepted it.

Uni-labeled setting: Steps 1-3 are the same as in the multi-labeled setting. 4. Assign the document to the single category whose classifier accepted it with the maximal score.
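A sketch of both decision rules, assuming per-category classifiers exposing a real-valued score such as the signed SVM margin (the classifier objects and their decision_function method follow the scikit-learn stand-in above, which is an assumption, not the original setup):

```python
def multi_label_decision(classifiers, x):
    """Assign every category whose binary classifier accepts the document (positive score)."""
    return [cat for cat, clf in classifiers.items()
            if clf.decision_function([x])[0] > 0]

def uni_label_decision(classifiers, x):
    """Assign the single category whose classifier gives the maximal score."""
    scores = {cat: clf.decision_function([x])[0] for cat, clf in classifiers.items()}
    return max(scores, key=scores.get)
```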

Evaluating the results: Multi-labeled: each document's labels should be identical to the classification results; evaluated with the Precision/Recall scheme. Uni-labeled: the classification result should be included in the set of the document's labels; evaluated with the accuracy measure (number of hits).
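A hedged sketch of a precision/recall break-even computation, obtained by sweeping a global threshold over all (document, category) scores (the exact averaging used in the paper may differ):

```python
def breakeven_point(scores_and_labels):
    """scores_and_labels: list of (score, true_label) over all (document, category) pairs,
    with true_label in {0, 1}. Returns the value where precision and recall cross."""
    total_positive = sum(label for _, label in scores_and_labels)
    if total_positive == 0:
        return 0.0
    ranked = sorted(scores_and_labels, key=lambda pair: -pair[0])
    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / rank
        recall = true_positives / total_positive
        if precision <= recall:          # precision falls, recall rises: this is the crossing
            return (precision + recall) / 2
    return 0.0
```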

The setup of our experiment: To reproduce the results achieved by Dumais et al., we choose k = 300 (the number of "best" words and the number of clusters). Since we wanted to compare 20NG with Reuters (ModApte split: ¾ training set and ¼ test set), we used 4-fold cross-validation on 20NG.

Parameter tuning: We have two major parameters: the number of clusters or "best" words (k), and the SVM parameters (C and J in SVMlight). For each experiment, k is fixed. To perform a "fair" experiment, we tune C and J on the training set by splitting it into train-train and train-validation sets, then run the experiment with the best parameters found at the previous stage.
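A sketch of this "fair" tuning protocol; the parameter names C and J follow SVMlight's cost and cost-factor options, while the grid values and the train_fn/eval_fn callables are placeholders:

```python
from itertools import product

def tune_parameters(train_fn, eval_fn, train_train, train_valid,
                    C_grid=(0.1, 1.0, 10.0), J_grid=(1.0, 2.0, 5.0)):
    """Pick the (C, J) pair that maximizes performance on a held-out part of the training set."""
    best_params, best_score = None, float("-inf")
    for C, J in product(C_grid, J_grid):
        model = train_fn(train_train, C=C, J=J)
        score = eval_fn(model, train_valid)
        if score > best_score:
            best_params, best_score = (C, J), score
    return best_params
```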

Unfair parameter tuning: Suppose we want to compare the results of two experiments A and B, and we see that the result of A is better than that of B. We then re-run B with unfair parameter tuning, i.e. with parameters tuned directly on the test set. If A still wins, this assures us that it is impossible to achieve the result of A with the setting of B.

Our results on 20 Newsgroups. Multi-labeled setting (break-even point): Clustering 88.6±0.3% (k = 300); MI feature selection 77.7±0.5% (k = 300); MI feature selection 86.3±0.4% (k = 15000). Uni-labeled setting (accuracy measure): Clustering 91.0±0.3% (k = 300); MI feature selection 85.1±0.5% (k = 300); MI feature selection 91.2±0.4% (k = 15000). Parameter tuning of the MI-based experiments is unfair.

Our results on Reuters: It makes no sense to speak about the uni-labeled setting on Reuters, because it is a multi-labeled corpus. Multi-labeled setting (break-even point): Clustering 91.6% (k = 300, unfair tuning); MI feature selection 92.0% (k = 300). The results are achieved on the 10 largest categories of Reuters.

Discussion of the results: We see that our technique (clustering) works better than MI on 20NG and almost the same (slightly worse) on Reuters. What can explain this? Reuters is manually labeled, while 20NG is "naturally" labeled. Hypothesis: Reuters was labeled according to only a few keywords that appeared in the documents.

Confirmation of our hypothesis: We tried decreasing the number of features selected by the MI technique on both Reuters and 20NG. On 20NG the results dropped sharply, while on Reuters the results remained the same. So just a few words are enough to categorize the documents of Reuters, while on 20NG many more words are needed.

Dependence of break-even on the number of features: [figure]

Conclusion: There are corpora for which simple methods work well, such as Reuters, where selecting just a few features solves the text categorization problem. For other corpora (such as 20NG), the sophisticated method of distributional clustering helps a lot. Future work: evaluate our technique on other corpora.