Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A Ralph Grishman NYU.

Presentation transcript:

1 Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A Ralph Grishman NYU

2 Bag of Words Models
Do we really need elaborate linguistic analysis? We will look at three text mining applications:
– document retrieval
– opinion mining
– association mining
and see how far we can get with document-level bag-of-words models – and introduce some of our mathematical approaches along the way.

3
– document retrieval
– opinion mining
– association mining

4 Information Retrieval
Task: given a query (a list of keywords), identify and rank relevant documents from a collection.
Basic idea: find the documents whose set of words most closely matches the words in the query.

5 Topic Vector
Suppose the document collection has n distinct words, w_1, …, w_n. Each document is characterized by an n-dimensional vector whose i-th component is the frequency of word w_i in the document.
Example:
D1 = [The cat chased the mouse.]
D2 = [The dog chased the cat.]
W  = [The, chased, dog, cat, mouse]  (n = 5)
V1 = [ 2, 1, 0, 1, 1 ]
V2 = [ 2, 1, 1, 1, 0 ]
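A minimal sketch of constructing these term-frequency vectors in Python (the whitespace tokenization and period stripping are deliberately naïve):

```python
from collections import Counter

def topic_vector(doc, vocab):
    """Term-frequency vector for one document over a fixed word list."""
    counts = Counter(w.lower().strip(".") for w in doc.split())
    return [counts[w] for w in vocab]

vocab = ["the", "chased", "dog", "cat", "mouse"]
print(topic_vector("The cat chased the mouse.", vocab))  # [2, 1, 0, 1, 1]
print(topic_vector("The dog chased the cat.", vocab))    # [2, 1, 1, 1, 0]
```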

6 Weighting the components
Unusual words like "elephant" determine the topic much more than common words such as "the" or "have". We can ignore words on a stop list, or weight each term frequency tf_i by its inverse document frequency:
idf_i = log( N / n_i )
where N = size of the collection and n_i = number of documents containing term i.
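A minimal sketch of this weighting; the natural log and the guard against unseen terms are implementation choices, not from the slide:

```python
import math
from collections import Counter

def idf(term, docs):
    """idf_i = log(N / n_i), where docs is a list of token lists."""
    n_i = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_i) if n_i else 0.0

def tfidf_vector(doc, vocab, docs):
    """Weight each term frequency tf_i by its inverse document frequency."""
    counts = Counter(doc)
    return [counts[w] * idf(w, docs) for w in vocab]
```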

7 Cosine similarity metric
Define a similarity metric between topic vectors. A common choice is cosine similarity (a normalized dot product):
sim(A, B) = (A · B) / (|A| |B|) = Σ_i a_i b_i / ( √(Σ_i a_i²) · √(Σ_i b_i²) )

8 Cosine similarity metric
The cosine similarity metric is the cosine of the angle between the term vectors.
[Figure: two term vectors plotted on axes w_1 and w_2, with the angle between them.]
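A direct translation of the formula, applied to the two example vectors from slide 5:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors: A·B / (|A| |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

print(cosine_similarity([2, 1, 0, 1, 1], [2, 1, 1, 1, 0]))  # 6/7 ≈ 0.857
```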

9 Verdict: a success
For heterogeneous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years.
– stemming is required for inflected languages
– limited resolution: it returns documents, not answers

10
– document retrieval
– opinion mining
– association mining

11 Opinion Mining
Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic
– a classification task
– valuable for producers/marketers of all sorts of products
Simple strategy: bag of words
– make lists of positive and negative words
– see which predominate in a given document (and mark as 'no opinion' if there are few words of either type)
– problem: it is hard to make such lists, and the lists will differ for different topics

12 Training a Model
Instead, label the documents in a corpus and then train a classifier from the corpus
– labeling is easier than thinking up words
– in effect, we learn the positive and negative words from the corpus
Probabilistic classifier: identify the most likely class
s = argmax_{t ∈ {pos, neg}} P( t | W )

13 Using Bayes' Rule
P( t | W ) = P( W | t ) P( t ) / P( W )
so
argmax_t P( t | W ) = argmax_t P( W | t ) P( t ) ≈ argmax_t P( t ) Π_i P( w_i | t )
The last step is based on the naïve assumption of independence of the word probabilities.

14 Training
We now estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators:
P( t ) = count( docs labeled t ) / N
P( w_i | t ) = count( docs labeled t containing w_i ) / count( docs labeled t )
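A minimal sketch of this training step plus the argmax classification from slide 12; representing each document as a set of tokens is my assumption:

```python
from collections import Counter

def train_mle(docs, labels):
    """MLE estimates per the slide. docs: token sets; labels: 'pos'/'neg'."""
    n = len(docs)
    prior = {t: c / n for t, c in Counter(labels).items()}           # P(t)
    cond = {}
    for t in prior:
        t_docs = [d for d, lab in zip(docs, labels) if lab == t]
        doc_freq = Counter(w for d in t_docs for w in d)             # docs containing w_i
        cond[t] = {w: c / len(t_docs) for w, c in doc_freq.items()}  # P(w_i | t)
    return prior, cond

def classify(doc, prior, cond):
    """s = argmax_t P(t) * prod_i P(w_i | t)."""
    def score(t):
        p = prior[t]
        for w in doc:
            p *= cond[t].get(w, 0.0)  # 0 for unseen words: the problem on the next slide
        return p
    return max(prior, key=score)
```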

15 A Problem
Suppose a glowing review GR (with lots of positive words) includes one word, "mathematical", previously seen only in negative reviews.
P( positive | GR ) = ?

16 P( positive | GR ) = 0, because P( "mathematical" | positive ) = 0.
The maximum likelihood estimate is poor when there is very little data. We need to 'smooth' the probabilities to avoid this problem.

17 Laplace Smoothing
A simple remedy is to add 1 to each count
– to keep them as probabilities, we increase the denominator N by the number of outcomes (values of t): 2, for 'positive' and 'negative'
– for the conditional probabilities P( w | t ) there are similarly two outcomes (w is present or absent)
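A sketch of the smoothed estimates, following the slide's recipe (add 1 to each count and enlarge each denominator by its number of outcomes); the token-set document representation is assumed as above:

```python
def smoothed_estimates(docs, labels, t, w, classes=("pos", "neg")):
    """Laplace (add-one) smoothed P(t) and P(w|t)."""
    n_t = sum(lab == t for lab in labels)
    p_t = (n_t + 1) / (len(docs) + len(classes))     # outcomes: pos, neg
    n_wt = sum(w in d for d, lab in zip(docs, labels) if lab == t)
    p_w_t = (n_wt + 1) / (n_t + 2)                   # outcomes: w present / absent
    return p_t, p_w_t
```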

18 An NLTK Demo
Sentiment Analysis with Python NLTK Text Classification: http://text-processing.com/demo/sentiment/
NLTK Code (simplified classifier): http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier

19 Is "low" a positive or a negative term?

20 Ambiguous terms
"low" can be positive ("low price") or negative ("low quality").

21 Negation
How do we handle "the equipment never failed"?

22 Negation
Modify the words following a negation ("the equipment never NOT_failed") and treat them as a separate 'negated' vocabulary.
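A sketch of this marking step; the fixed three-token window is an illustrative assumption (the real difficulty, deciding where the negation's scope ends, is the subject of the next slide):

```python
NEGATORS = {"not", "never", "no", "n't"}

def mark_negation(tokens, scope=3):
    """Prefix words after a negator with NOT_, building a separate 'negated' vocabulary."""
    out, remaining = [], 0
    for tok in tokens:
        if tok.lower() in NEGATORS:
            out.append(tok)
            remaining = scope
        elif remaining:
            out.append("NOT_" + tok)
            remaining -= 1
        else:
            out.append(tok)
    return out

print(mark_negation("the equipment never failed".split()))
# ['the', 'equipment', 'never', 'NOT_failed']
```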

23 Negation: how far to go?
"the equipment never failed and was cheap to run" →
"the equipment never NOT_failed NOT_and NOT_was NOT_cheap NOT_to NOT_run"
We have to determine the scope of the negation.

24 Verdict: mixed
A simple bag-of-words strategy with a naïve Bayes model works quite well for simple reviews referring to a single item, but fails
– for ambiguous terms
– for negation
– for comparative reviews
– to reveal the aspects of an opinion ("the car looked great and handled well, but the wheels kept falling off")

25
– document retrieval
– opinion mining
– association mining

26 Association Mining
Goal: find interesting relationships among the attributes of objects in a large collection … objects with attribute A also have attribute B
– e.g., "people who bought A also bought B"
For text: documents with term A also have term B
– widely used in the scientific and medical literature

27 Bag-of-words
Simplest approach: look for words A and B for which
frequency( A and B in the same document ) >> frequency( A ) × frequency( B )
This doesn't work well:
– we want to find names (of companies, products, genes), not individual words
– we are interested in specific types of terms
– we want to learn from a few examples, which requires contexts to avoid noise
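One standard way to make "pair frequency >> product of individual frequencies" precise is pointwise mutual information; that name is my gloss, not the slide's. A minimal sketch over documents represented as token sets:

```python
import math

def pmi(a, b, docs):
    """log( P(A,B) / (P(A) P(B)) ) over document frequencies.
    Positive values: A and B co-occur more often than independence predicts."""
    n = len(docs)
    p_a = sum(a in d for d in docs) / n
    p_b = sum(b in d for d in docs) / n
    p_ab = sum(a in d and b in d for d in docs) / n
    if not (p_a and p_b and p_ab):
        return float("-inf")
    return math.log(p_ab / (p_a * p_b))
```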

28 Needed Ingredients
Effective text association mining needs
– name recognition
– term classification
– preferably: the ability to learn patterns
– preferably: a good GUI
– demo: www.coremine.com

29 Conclusion
Some tasks can be handled effectively (and very simply) by bag-of-words models, but most benefit from an analysis of language structure.

