1 SIMS 290-2: Applied Natural Language Processing Preslav Nakov Sept 29, 2004

2 Today
Feature selection
TF.IDF Term Weighting
Term Normalization

3 Features for Text Categorization
Linguistic features
– Words: lowercase? (should we convert to lowercase?) normalized? (e.g. “texts” → “text”)
– Phrases
– Word-level n-grams
– Character-level n-grams
– Punctuation
– Part of speech
Non-linguistic features
– document formatting
– informative character sequences (e.g. “&lt;”)
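
A minimal sketch of how word-level and character-level n-gram features might be extracted; the tokenizer, the n-gram sizes and the example sentence below are illustrative assumptions, not part of the slides.

```python
import re
from collections import Counter

def word_ngrams(text, n=2):
    """Lowercase, tokenize on word characters, and count word-level n-grams."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def char_ngrams(text, n=3):
    """Count character-level n-grams (useful e.g. for language identification)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

doc = "The texts were normalized before classification."
print(word_ngrams(doc, n=2).most_common(3))
print(char_ngrams(doc, n=3).most_common(3))
```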

4 When Do We Need Feature Selection?
If the algorithm cannot handle all possible features
– e.g. language identification for 100 languages using all words
– e.g. text classification using n-grams
Good features can result in higher accuracy
But why feature selection? What if we just keep all features?
– Even the unreliable features can be helpful.
– But we need to weight them:
– in the extreme case, the bad features can have a weight of 0 (or very close), which is… a form of feature selection!

5 Why Feature Selection?
Not all features are equally good!
Bad features: best to remove
– Infrequent: unlikely to be met again; co-occurrence with a class can be due to chance
– Too frequent: mostly function words
– Uniform across all categories
Good features: should be kept
– Co-occur with a particular category
– Do not co-occur with other categories
The rest: good to keep

6 Types of Feature Selection
Feature selection reduces the number of features
Usually by:
– Eliminating features
– Weighting features
– Normalizing features
Sometimes by transforming parameters
– e.g. Latent Semantic Indexing using Singular Value Decomposition
The method may depend on the problem type
– For classification and filtering, may use information from example documents to guide selection

7 Feature Selection
Task-independent methods
– Document Frequency (DF)
– Term Strength (TS)
Task-dependent methods
– Information Gain (IG)
– Mutual Information (MI)
– χ² statistic (CHI)
Empirically compared by Yang & Pedersen (1997)

8 Yang & Pedersen Experiments
Compared feature selection methods for text categorization
5 feature selection methods:
– DF, MI, CHI, (IG, TS)
– Features were just words
2 classifiers:
– kNN: k-Nearest Neighbor (to be covered next week)
– LLSF: Linear Least Squares Fit
2 data collections:
– Reuters
– OHSUMED: subset of MEDLINE (1990 & 1991 used)

9 Document Frequency (DF)
DF: the number of documents a term appears in
Based on Zipf’s Law
Remove the rare terms (met 1–2 times):
– Non-informative
– Unreliable – can be just noise
– Not influential in the final decision
– Unlikely to appear in new documents
Plus:
– Easy to compute
– Task independent: do not need to know the classes
Minus:
– Ad hoc criterion
– Rare terms can be good discriminators (e.g., in IR)
What about the frequent terms? What is a “rare” term?
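
A small sketch of DF-based selection, assuming the documents are already tokenized; the exact frequency threshold is a choice, and the one below simply mirrors the “met 1–2 times” rule on the slide.

```python
from collections import Counter

def select_by_df(tokenized_docs, min_df=3):
    """Keep only terms whose document frequency is at least min_df
    (i.e. drop terms met in fewer documents than that)."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))          # count each term at most once per document
    return {term for term, count in df.items() if count >= min_df}

docs = [["jaguar", "car", "engine"],
        ["car", "price", "engine"],
        ["car", "engine", "zoo"]]
print(select_by_df(docs, min_df=3))  # {'car', 'engine'}
```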

10 Examples of Frequent Words: the Most Frequent Words in the Brown Corpus

11 Stop Word Removal
Common words from a predefined list
Mostly from closed-class categories:
– unlikely to have a new word added
– include: auxiliaries, conjunctions, determiners, prepositions, pronouns, articles
But also some open-class words, like numerals
Bad discriminators:
– uniformly spread across all classes
– can be safely removed from the vocabulary
Is this always a good idea? (e.g. author identification)
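
A sketch of list-based removal; the tiny stop list below is only a stand-in for a real predefined list.

```python
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # illustrative stand-in

def remove_stop_words(tokens):
    """Drop tokens that appear in the predefined stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "price", "of", "the", "jaguar"]))  # ['price', 'jaguar']
```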

12 χ² Statistic (CHI)
χ² (pronounced “kai square”) is the most commonly used method of comparing proportions.
It checks whether there is a relationship between being in one of two groups and a characteristic under study.
Example: let us measure the dependency between a term t and a category c.
– The groups would be:
  1) the documents from a category c_i
  2) all other documents
– The characteristic would be: “document contains term t”

13 χ² Statistic (CHI)
Is “jaguar” a good predictor for the “auto” class?

Observed counts:
              Term = jaguar   Term ≠ jaguar
Class = auto        2              500
Class ≠ auto        3             9500

We want to compare:
– the observed distribution above, and
– the null hypothesis: that jaguar and auto are independent

14 χ² Statistic (CHI)
Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?
We would have: Pr(j,a) = Pr(j) · Pr(a)
So there would be: N · Pr(j,a), i.e. N · Pr(j) · Pr(a)
Pr(j) = (2+3)/N;  Pr(a) = (2+500)/N;  N = 2+500+3+9500 = 10005
Which is: N · (5/N) · (502/N) = 2510/N = 2510/10005 ≈ 0.25
So the expected count f_e for the (jaguar, auto) cell is about 0.25, versus an observed count f_o of 2.

17 χ² Statistic (CHI)
χ² sums (f_o – f_e)²/f_e over all table entries:

              Term = jaguar              Term ≠ jaguar
Class = auto    f_o = 2    (f_e = 0.25)    f_o = 500  (f_e = 502)
Class ≠ auto    f_o = 3    (f_e = 4.75)    f_o = 9500 (f_e = 9498)

χ² = (2–0.25)²/0.25 + (500–502)²/502 + (3–4.75)²/4.75 + (9500–9498)²/9498 ≈ 12.9

The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the critical value of the χ² distribution with 1 degree of freedom at the .999 confidence level).

18 χ² Statistic (CHI)
There is a simpler, equivalent formula for χ²:

χ²(t,c) = N · (AD – CB)² / ( (A+C)(B+D)(A+B)(C+D) )

where:
A = #(t,c)    C = #(¬t,c)
B = #(t,¬c)   D = #(¬t,¬c)
N = A + B + C + D
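
A sketch of this formula in code, checked against the jaguar/auto table above; the tiny discrepancy with the 12.9 obtained earlier comes only from the rounding of the expected counts on that slide.

```python
def chi_square(A, B, C, D):
    """chi^2(t,c) from the 2x2 contingency counts:
    A = #(t,c), B = #(t,~c), C = #(~t,c), D = #(~t,~c)."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

# jaguar/auto example: observed counts 2, 500, 3, 9500
print(chi_square(A=2, B=500, C=3, D=9500))  # ~12.8
```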

19 χ² Statistic (CHI)
How to use χ² for multiple categories? Compute χ² for each category and then combine:
– to discriminate well across all categories, take the expected value of χ²:
  χ²_avg(t) = Σ_i Pr(c_i) · χ²(t, c_i)
– or, to discriminate well for a single category, take the maximum:
  χ²_max(t) = max_i χ²(t, c_i)
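
A small sketch of the two combination rules, assuming per-category χ² scores and category priors have already been computed (the numbers below are made up).

```python
def chi_square_avg(scores, priors):
    """Expected value over categories: sum_i Pr(c_i) * chi^2(t, c_i)."""
    return sum(priors[c] * scores[c] for c in scores)

def chi_square_max(scores):
    """Maximum chi^2 score over all categories."""
    return max(scores.values())

scores = {"auto": 12.8, "sports": 0.4}   # chi^2(t, c) per category
priors = {"auto": 0.3, "sports": 0.7}    # Pr(c)
print(chi_square_avg(scores, priors), chi_square_max(scores))
```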

20 χ² Statistic (CHI)
Plus:
– normalized and thus comparable across terms
– χ²(t,c) is 0 when t and c are independent
– can be compared to the χ² distribution with 1 degree of freedom
Minus:
– unreliable for low-frequency terms
– computationally expensive

21 Information Gain
A measure of the importance of the feature for predicting the presence of the class.
Defined as: the number of “bits of information” gained by knowing the term is present or absent.
Based on Information Theory
– We won’t go into this in detail here.
Plus: sound information-theoretic justification
Minus: computationally expensive

22 Information Gain (IG)
IG: the number of bits of information gained by knowing the term is present or absent
t is the term being scored, c_i is a class variable

IG(t) = H(c) – [ Pr(t) · H(c|t) + Pr(¬t) · H(c|¬t) ]

where H(c) is the entropy of the class variable and H(c|t), H(c|¬t) are the specific conditional entropies given that the term is present or absent.
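
A sketch of this quantity computed from the same 2x2 counts used for χ², for a single (binary) class variable; log base 2 is assumed so that the result is in bits.

```python
from math import log2

def entropy(probs):
    """Entropy in bits of a discrete distribution (zero probabilities are skipped)."""
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(A, B, C, D):
    """IG(t) = H(c) - [Pr(t)*H(c|t) + Pr(~t)*H(c|~t)] for one class c,
    with A = #(t,c), B = #(t,~c), C = #(~t,c), D = #(~t,~c)."""
    N = A + B + C + D
    p_t = (A + B) / N
    h_c = entropy([(A + C) / N, (B + D) / N])
    h_c_given_t = entropy([A / (A + B), B / (A + B)]) if A + B else 0.0
    h_c_given_not_t = entropy([C / (C + D), D / (C + D)]) if C + D else 0.0
    return h_c - (p_t * h_c_given_t + (1 - p_t) * h_c_given_not_t)

print(information_gain(A=2, B=500, C=3, D=9500))  # small but positive
```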

23 Mutual Information (MI)
The probability of seeing x followed by y vs. the probability of seeing x anywhere times the probability of seeing y anywhere:

MI(x,y) = log ( P(x,y) / (P(x) · P(y)) )

24 Mutual Information (MI)
Approximation in terms of the contingency-table counts:

MI(t,c) ≈ log ( A · N / ((A + C)(A + B)) )

A = #(t,c)    C = #(¬t,c)
B = #(t,¬c)   D = #(¬t,¬c)

– rare terms are scored higher
– does not use term absence
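
A sketch of this approximation; note that it is undefined when A = 0 (the term never co-occurs with the class), so real implementations typically smooth or skip such terms.

```python
from math import log

def mutual_information(A, B, C, N):
    """MI(t,c) ~= log( A*N / ((A+C)*(A+B)) ); undefined when A == 0."""
    return log(A * N / ((A + C) * (A + B)))

# jaguar/auto example: A=2, B=500, C=3, N=10005
print(mutual_information(A=2, B=500, C=3, N=10005))  # > 0: more co-occurrence than chance
```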

25 Using Mutual Information
Compute MI for each category and then combine:
– If we want to discriminate well across all categories, take the expected value of MI:
  MI_avg(t) = Σ_i Pr(c_i) · MI(t, c_i)
– To discriminate well for a single category, take the maximum:
  MI_max(t) = max_i MI(t, c_i)

26 Mutual Information
Plus:
– MI(t,c) is 0 when t and c are independent
– sound information-theoretic interpretation
Minus:
– small numbers produce unreliable results
– computationally expensive
– does not use term absence

27 Mutual Information and Term Strength (formulas)

28 Comparison: DF, TS, IG, CHI, MI
DF, IG and CHI are good and strongly correlated
– thus using DF is good, cheap and task independent
– DF can be used when IG and CHI are too expensive
MI is bad
– it favors rare terms (which are typically bad)
– MI vs. IG (plot: mutual information vs. information gain)

29 Term Weighting
In the study just shown, terms were (mainly) treated as binary features:
– if a term occurred in a document, it was assigned 1, else 0
Often it is useful to weight the selected features
Standard technique: tf.idf

30 TF.IDF Term Weighting
TF: term frequency
– definition: TF = t_ij, the frequency of term i in document j
– purpose: makes the words that are frequent within a document more important
IDF: inverted document frequency
– definition: IDF = log(N/n_i)
  n_i: number of documents containing term i
  N: total number of documents
– purpose: makes words that are rare across documents more important
TF.IDF
– definition: t_ij · log(N/n_i)
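
A small sketch of these definitions over a toy collection; raw term counts and the natural logarithm are assumptions, since the slide does not pin down the exact tf variant or log base.

```python
from collections import Counter
from math import log

def tf_idf(tokenized_docs):
    """Weight term i in document j as t_ij * log(N / n_i)."""
    N = len(tokenized_docs)
    df = Counter()                       # n_i: document frequency of each term
    for doc in tokenized_docs:
        df.update(set(doc))
    weights = []
    for doc in tokenized_docs:
        tf = Counter(doc)                # t_ij: term frequency in this document
        weights.append({t: tf[t] * log(N / df[t]) for t in tf})
    return weights

docs = [["car", "car", "engine"], ["car", "price"], ["jaguar", "zoo"]]
for w in tf_idf(docs):
    print(w)
```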

31 Term Normalization
Combine different words into a single representation
Stemming/morphological analysis
– bought, buy, buys -> buy
General word categories
– $23.45, 5.30 Yen -> MONEY
– 1984, 10,000 -> DATE, NUM
– PERSON
– ORGANIZATION
– (Covered in the Information Extraction segment)
Generalize with lexical hierarchies
– WordNet, MeSH
– (Covered later in the semester)
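
A sketch of mapping tokens to general word categories with regular expressions; the patterns and category names below are illustrative assumptions (real systems use full named-entity recognition, as covered in the Information Extraction segment).

```python
import re

CATEGORY_PATTERNS = [
    (re.compile(r"^(\$\d+(\.\d+)?|\d+(\.\d+)?\s*(yen|usd))$", re.IGNORECASE), "MONEY"),
    (re.compile(r"^(19|20)\d{2}$"), "DATE"),
    (re.compile(r"^\d[\d,]*(\.\d+)?$"), "NUM"),
]

def normalize_token(token):
    """Replace a token by its general category if any pattern matches; keep it otherwise."""
    for pattern, category in CATEGORY_PATTERNS:
        if pattern.match(token):
            return category
    return token

print([normalize_token(t) for t in ["$23.45", "1984", "10,000", "jaguar"]])
# ['MONEY', 'DATE', 'NUM', 'jaguar']
```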

32 Stemming & Lemmatization
Purpose: conflate morphological variants of a word to a single index term
Stemming: normalize to a pseudoword
– e.g. “more” and “morals” become “mor” (Porter stemmer)
Lemmatization: convert to the root form
– e.g. “more” and “morals” become “more” and “moral”
Plus:
– vocabulary size reduction
– data sparseness reduction
Minus:
– loses important features (even to_lowercase() can be bad!)
– questionable utility (maybe just strip “-s”, “-ing” and “-ed”?)
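
A sketch using NLTK’s Porter stemmer and WordNet lemmatizer, assuming nltk is installed and the WordNet data has been downloaded; exact outputs can differ by NLTK version, so none are hard-coded here.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # one-time download of the WordNet data

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["buys", "buying", "bought", "morals"]:
    stem = stemmer.stem(word)                      # rule-based pseudoword
    lemma = lemmatizer.lemmatize(word, pos="v")    # dictionary-based root form (as a verb)
    print(f"{word} -> stem: {stem}, lemma: {lemma}")
```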

33 What Do People Do In Practice?
1. Feature selection
– infrequent term removal
  – infrequent across the whole collection (i.e. DF)
  – met in only a single document
– most frequent term removal (i.e. stop words)
2. Normalization
– stemming (often)
– word classes (sometimes)
3. Feature weighting: TF.IDF or IDF
4. Dimensionality reduction (occasionally)
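
A sketch of this recipe with scikit-learn, which is an assumption (the library is not mentioned in the slides): the vectorizer handles stop word removal, rare-term removal and tf.idf weighting, and a χ²-based selector keeps the top-scoring features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap cars and auto parts", "football match report", "new auto engine review"]
labels = ["auto", "sports", "auto"]

# stop_words drops frequent function words, min_df drops rare terms,
# and the tf.idf weighting follows the definition on the earlier slide.
vectorizer = TfidfVectorizer(stop_words="english", min_df=1)
X = vectorizer.fit_transform(docs)

selector = SelectKBest(chi2, k=3)            # keep the 3 highest-scoring features
X_selected = selector.fit_transform(X, labels)

kept = selector.get_support(indices=True)
print([vectorizer.get_feature_names_out()[i] for i in kept])
```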

34 Summary
Feature selection
– task-independent methods: DF, TS
– task-dependent methods: IG, MI, χ² statistic
Term weighting
– IDF
– TF.IDF
Term normalization

35 References
Feature Selection
– Yang, Y. and J. Pedersen. A comparative study on feature selection in text categorization. In D. H. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97), pages 412–420. Morgan Kaufmann, 1997.
Term Weighting
– Salton, G. and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.
– Salton, G. Automatic Text Processing. 1989. Chapter 9.