Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A


1 Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A
Ralph Grishman, NYU

2 Bag of Words Models
Do we really need elaborate linguistic analysis? Look at three text mining applications (document retrieval, opinion mining, association mining), see how far we can get with document-level bag-of-words models, and introduce some of our mathematical approaches.

3 document retrieval, opinion mining, association mining

4 Information Retrieval
Task: given a query (a list of keywords), identify and rank relevant documents from a collection.
Basic idea: find documents whose set of words most closely matches the words in the query.

5 Topic Vector
Suppose the document collection has n distinct words, w1, …, wn. Each document is characterized by an n-dimensional vector whose ith component is the frequency of word wi in the document.
Example:
D1 = [The cat chased the mouse.]
D2 = [The dog chased the cat.]
W = [the, chased, dog, cat, mouse] (n = 5)
V1 = [ 2, 1, 0, 1, 1 ]
V2 = [ 2, 1, 1, 1, 0 ]
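A minimal sketch of how these term-frequency vectors could be computed (plain Python; the tokenization and the function name are illustrative, not from the lecture):

    from collections import Counter

    def term_frequency_vector(text, vocabulary):
        # count how often each vocabulary word occurs in the document (case-insensitive)
        counts = Counter(text.lower().replace('.', '').split())
        return [counts[w] for w in vocabulary]

    vocabulary = ["the", "chased", "dog", "cat", "mouse"]
    print(term_frequency_vector("The cat chased the mouse.", vocabulary))  # [2, 1, 0, 1, 1]
    print(term_frequency_vector("The dog chased the cat.", vocabulary))    # [2, 1, 1, 1, 0]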

6 Weighting the components
Unusual words like "elephant" determine the topic much more than common words such as "the" or "have". We can ignore words on a stop list, or weight each term frequency tfi by its inverse document frequency idfi = log( N / ni ), where N = size of the collection and ni = number of documents containing term i.
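A short sketch of this weighting under the definition above (a natural-log idf; real systems vary in the log base and in how they handle terms that appear in every document):

    import math
    from collections import Counter

    def tfidf_vectors(docs, vocabulary):
        # weight each term frequency tf_i by idf_i = log(N / n_i)
        N = len(docs)
        counts = [Counter(d.lower().replace('.', '').split()) for d in docs]
        n = [sum(1 for c in counts if c[w] > 0) for w in vocabulary]  # docs containing term i
        idf = [math.log(N / n_i) if n_i else 0.0 for n_i in n]
        return [[c[w] * idf[i] for i, w in enumerate(vocabulary)] for c in counts]

    docs = ["The cat chased the mouse.", "The dog chased the cat."]
    print(tfidf_vectors(docs, ["the", "chased", "dog", "cat", "mouse"]))
    # terms shared by every document ("the", "chased", "cat") get weight 0 in this 2-document collection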

7 Cosine similarity metric
Define a similarity metric between topic vectors. A common choice is cosine similarity (a normalized dot product):
sim( A, B ) = ( A · B ) / ( |A| |B| )

8 Cosine similarity metric
The cosine similarity metric is the cosine of the angle between the term vectors.
[Figure: two term vectors plotted on axes w1 and w2; the similarity is the cosine of the angle between them.]
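A minimal sketch of the metric in plain Python (the example vectors are V1 and V2 from the topic-vector slide):

    import math

    def cosine_similarity(a, b):
        # cos(angle between a and b) = dot(a, b) / (|a| * |b|)
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    v1 = [2, 1, 0, 1, 1]
    v2 = [2, 1, 1, 1, 0]
    print(cosine_similarity(v1, v2))  # 6/7 ≈ 0.857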

9 Verdict: a success
For heterogeneous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years.

10 document retrieval, opinion mining, association mining

11 Opinion Mining
Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic. This is a classification task, valuable for producers and marketers of all sorts of products.
Simple strategy: bag of words. Make lists of positive and negative words and see which predominate in a given document (marking it 'no opinion' if there are few words of either type); a sketch of this strategy follows below.
Problem: it is hard to make such lists, and the lists will differ for different topics.
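A minimal sketch of the word-list strategy (the tiny positive and negative lexicons here are illustrative, not from the lecture):

    POSITIVE = {"great", "excellent", "wonderful", "good"}
    NEGATIVE = {"poor", "terrible", "bad", "awful"}

    def simple_opinion(text):
        # see which word list predominates; few or balanced hits -> no opinion
        words = text.lower().split()
        pos = sum(w in POSITIVE for w in words)
        neg = sum(w in NEGATIVE for w in words)
        if pos > neg:
            return "positive"
        if neg > pos:
            return "negative"
        return "no opinion"

    print(simple_opinion("a great camera with an excellent lens"))  # positive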

12 Training a Model
Instead, label the documents in a corpus and then train a classifier from the corpus. Labeling is easier than thinking up words; in effect, we learn the positive and negative words from the corpus.
Probabilistic classifier: identify the most likely class
s = argmax over t ∈ {pos, neg} of P( t | W )
where W is the set of words in the document.

13 Using Bayes' Rule
P( t | W ) = P( t ) P( W | t ) / P( W ), and P( W ) is the same for both classes, so
argmax over t of P( t | W ) = argmax over t of P( t ) P( W | t ) ≈ argmax over t of P( t ) ∏i P( wi | t )
The last step is based on the naïve assumption of independence of the word probabilities.

14 Training
We now estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators:
P( t ) = count( docs labeled t ) / N
P( wi | t ) = count( docs labeled t containing wi ) / count( docs labeled t )
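A small sketch of these counts (documents are represented as (set of words, label) pairs; the names are illustrative):

    from collections import Counter

    def train_mle(labeled_docs):
        # labeled_docs: list of (set_of_words, label) pairs
        N = len(labeled_docs)
        label_count = Counter(label for _, label in labeled_docs)
        word_label_count = Counter((w, label) for words, label in labeled_docs for w in set(words))
        p_label = {t: label_count[t] / N for t in label_count}
        p_word_given_label = {(w, t): c / label_count[t] for (w, t), c in word_label_count.items()}
        return p_label, p_word_given_label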

15 A Problem
Suppose a glowing review GR (with lots of positive words) includes one word, "mathematical", previously seen only in negative reviews.
P( positive | GR ) = ?

16 P( positive | GR ) = 0, because P( "mathematical" | positive ) = 0.
The maximum likelihood estimate is poor when there is very little data. We need to 'smooth' the probabilities to avoid this problem.

17 Laplace Smoothing
A simple remedy is to add 1 to each count. To keep these values as probabilities, we increase the denominator N by the number of outcomes (values of t): 2, for 'positive' and 'negative'. For the conditional probabilities P( w | t ) there are similarly two outcomes (w is present or absent), so that denominator is also increased by 2.
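A sketch of the smoothed estimates (same (set of words, label) representation as above; the add-1 adjustments follow the slide, the variable names are illustrative):

    from collections import Counter

    def train_laplace(labeled_docs):
        # add 1 to every count; grow each denominator by the number of outcomes (2)
        N = len(labeled_docs)
        labels = sorted({label for _, label in labeled_docs})      # e.g. ['negative', 'positive']
        vocab = {w for words, _ in labeled_docs for w in words}
        label_count = Counter(label for _, label in labeled_docs)
        word_label_count = Counter((w, label) for words, label in labeled_docs for w in set(words))
        p_label = {t: (label_count[t] + 1) / (N + len(labels)) for t in labels}
        p_word_given_label = {(w, t): (word_label_count[(w, t)] + 1) / (label_count[t] + 2)
                              for w in vocab for t in labels}
        return p_label, p_word_given_label

With this smoothing, P( "mathematical" | positive ) is small but no longer zero, so a single unseen word cannot force P( positive | GR ) to 0.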

18 An NLTK Demo
Sentiment Analysis with Python NLTK Text Classification
NLTK Code (simplified classifier)
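A hedged sketch of the kind of simplified NLTK classifier the demo refers to (the training documents are made up; NLTK's NaiveBayesClassifier expects (feature dict, label) pairs):

    import nltk

    def word_features(text):
        # bag-of-words features: each word present in the document maps to True
        return {w: True for w in text.lower().split()}

    train = [
        (word_features("a great wonderful movie"), "pos"),
        (word_features("excellent acting and a good story"), "pos"),
        (word_features("a terrible boring mess"), "neg"),
        (word_features("awful script and poor acting"), "neg"),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train)
    print(classifier.classify(word_features("good story great acting")))  # 'pos'
    classifier.show_most_informative_features(5)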

19 Is “low” a positive or a negative term?

20 Ambiguous terms
"low" can be positive ("low price") or negative ("low quality").

21 Verdict: mixed
A simple bag-of-words strategy with a naïve Bayes model works quite well for simple reviews referring to a single item, but it fails for ambiguous terms and for comparative reviews, and it cannot reveal the aspects of an opinion: "the car looked great and handled well, but the wheels kept falling off".

22 document retrieval, opinion mining, association mining

23 Association Mining
Goal: find interesting relationships among attributes of an object in a large collection … objects with attribute A also have attribute B (e.g., "people who bought A also bought B").
For text: documents with term A also have term B; widely used in the scientific and medical literature.

24 Bag-of-words
Simplest approach: look for words A and B for which
frequency( A and B in the same document ) >> frequency of A × frequency of B
(a sketch of this co-occurrence test follows below). But this doesn't work well:
we want to find names (of companies, products, genes), not individual words
we are interested in specific types of terms
we want to learn from a few examples
we need contexts to avoid noise
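A minimal sketch of the co-occurrence test (comparing the joint document frequency of A and B with the product of their individual document frequencies; the lift threshold and tokenization are illustrative):

    from collections import Counter
    from itertools import combinations

    def associated_pairs(docs, threshold=2.0):
        # return word pairs that co-occur in documents far more often than chance
        n = len(docs)
        doc_words = [set(d.lower().split()) for d in docs]
        word_freq = Counter(w for words in doc_words for w in words)
        pair_freq = Counter(p for words in doc_words for p in combinations(sorted(words), 2))
        pairs = []
        for (a, b), c in pair_freq.items():
            # P(A and B) >> P(A) * P(B)  <=>  lift well above 1
            lift = (c / n) / ((word_freq[a] / n) * (word_freq[b] / n))
            if c > 1 and lift >= threshold:
                pairs.append(((a, b), lift))
        return sorted(pairs, key=lambda x: -x[1])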

25 Needed Ingredients
Effective text association mining needs:
name recognition
term classification
preferably: the ability to learn patterns
preferably: a good GUI
demo:

26 Conclusion
Some tasks can be handled effectively (and very simply) by bag-of-words models, but most benefit from an analysis of language structure.

