Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A Ralph Grishman NYU
Bag of Words Models
Do we really need elaborate linguistic analysis? Look at text mining applications
– document retrieval
– opinion mining
– association mining
and see how far we can get with document-level bag-of-words models – and introduce some of our mathematical approaches
1/16/14 NYU
document retrieval – opinion mining – association mining
Information Retrieval
Task: given a query = a list of keywords, identify and rank relevant documents from a collection.
Basic idea: find the documents whose set of words most closely matches the words in the query.
Topic Vector
Suppose the document collection has n distinct words, w1, …, wn. Each document is characterized by an n-dimensional vector whose i-th component is the frequency of word wi in the document.
Example:
D1 = [The cat chased the mouse.]
D2 = [The dog chased the cat.]
W = [the, chased, dog, cat, mouse] (n = 5)
V1 = [2, 1, 0, 1, 1]
V2 = [2, 1, 1, 1, 0]
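The term-vector construction above can be sketched in a few lines of Python; the tokenization here (lowercasing and stripping the final period) is an illustrative assumption, not something the slides specify:

```python
# Build document-level term-frequency vectors over a fixed vocabulary.
from collections import Counter

W = ["the", "chased", "dog", "cat", "mouse"]  # vocabulary (lowercased)

def topic_vector(doc, vocab):
    """n-dimensional vector whose i-th slot is the frequency of vocab[i]."""
    counts = Counter(doc.lower().replace(".", "").split())
    return [counts[w] for w in vocab]

V1 = topic_vector("The cat chased the mouse.", W)
V2 = topic_vector("The dog chased the cat.", W)
print(V1)  # [2, 1, 0, 1, 1]
print(V2)  # [2, 1, 1, 1, 0]
```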
Weighting the components
Unusual words like “elephant” determine the topic much more than common words such as “the” or “have”.
We can ignore words on a stop list, or weight each term frequency tf_i by its inverse document frequency:
idf_i = log(N / n_i)
where N = size of the collection and n_i = number of documents containing term i.
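A minimal sketch of the idf weight idf_i = log(N / n_i), treating each document as a set of words; the toy three-document collection is invented for illustration:

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(N / n_i)."""
    N = len(docs)                              # size of the collection
    n_i = sum(1 for d in docs if term in d)    # docs containing the term
    return math.log(N / n_i)

docs = [{"the", "cat"}, {"the", "dog"}, {"the", "elephant"}]
print(idf("the", docs))       # occurs in all 3 docs -> log(3/3) = 0.0
print(idf("elephant", docs))  # occurs in 1 doc -> log(3/1) ≈ 1.10
```

A stop word like “the” gets weight zero automatically, which is exactly the effect the slide describes.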
Cosine similarity metric
Define a similarity metric between topic vectors. A common choice is cosine similarity (a normalized dot product):
sim(A, B) = (A · B) / (|A| |B|) = Σ_i a_i b_i / ( sqrt(Σ_i a_i²) sqrt(Σ_i b_i²) )
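The cosine measure can be computed directly from the two example vectors V1 and V2 (a pure-Python sketch, no library assumed):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

V1 = [2, 1, 0, 1, 1]  # "The cat chased the mouse."
V2 = [2, 1, 1, 1, 0]  # "The dog chased the cat."
print(cosine(V1, V2))  # 6/7 ≈ 0.857: the documents share most of their words
```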
Cosine similarity metric
The cosine similarity metric is the cosine of the angle between the term vectors.
[figure: two term vectors plotted along axes w1 and w2, with the angle between them]
Verdict: a success
For heterogeneous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years.
– stemming required for inflected languages
– limited resolution: returns documents, not answers
document retrieval – opinion mining – association mining
Opinion Mining
– Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic
a classification task
valuable for producers/marketers of all sorts of products
– Simple strategy: bag of words
make lists of positive and negative words
see which predominate in a given document (and mark as ‘no opinion’ if there are few words of either type)
problem: it is hard to make such lists – the lists will differ for different topics
Training a Model
– Instead, label the documents in a corpus and then train a classifier from the corpus
labeling is easier than thinking up words
in effect, we learn the positive and negative words from the corpus
– probabilistic classifier: identify the most likely class
t* = argmax_{t ∈ {pos, neg}} P(t | W)
Using Bayes’ Rule
P(t | W) = P(W | t) P(t) / P(W)
∝ P(t) P(W | t)
≈ P(t) Π_i P(w_i | t)
The last step is based on the naïve assumption of independence of the word probabilities.
Training
We now estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators:
P(t) = count(docs labeled t) / N
P(w_i | t) = count(docs labeled t containing w_i) / count(docs labeled t)
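These maximum likelihood estimators can be sketched as follows. The tiny labeled corpus is invented for illustration, and documents are represented as sets of words since the counts here are at the document level:

```python
from collections import Counter

def train_nb(labeled_docs):
    """MLE estimates from a list of (set_of_words, label) pairs."""
    N = len(labeled_docs)
    class_count = Counter(label for _, label in labeled_docs)
    word_count = {t: Counter() for t in class_count}
    for words, t in labeled_docs:
        for w in set(words):          # count each word at most once per doc
            word_count[t][w] += 1
    prior = {t: class_count[t] / N for t in class_count}          # P(t)
    cond = {t: {w: word_count[t][w] / class_count[t]              # P(w|t)
                for w in word_count[t]}
            for t in word_count}
    return prior, cond

corpus = [({"great", "fun"}, "pos"), ({"boring", "slow"}, "neg"),
          ({"great", "plot"}, "pos")]
prior, cond = train_nb(corpus)
print(prior["pos"])          # 2/3 of the documents are positive
print(cond["pos"]["great"])  # "great" appears in 2 of 2 positive docs -> 1.0
```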
A Problem
Suppose a glowing review GR (with lots of positive words) includes one word, “mathematical”, previously seen only in negative reviews.
P(positive | GR) = ?
P(positive | GR) = 0, because P(“mathematical” | positive) = 0.
The maximum likelihood estimate is poor when there is very little data.
We need to ‘smooth’ the probabilities to avoid this problem.
Laplace Smoothing
A simple remedy is to add 1 to each count.
– To keep the estimates as probabilities, we increase the denominator by the number of possible outcomes (2 values of t, ‘positive’ and ‘negative’):
P(t) = (count(docs labeled t) + 1) / (N + 2)
– For the conditional probabilities P(w | t) there are similarly two outcomes (w present or absent):
P(w_i | t) = (count(docs labeled t containing w_i) + 1) / (count(docs labeled t) + 2)
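A sketch of the add-one estimate; the figure of 10 positive reviews is an invented example:

```python
def smoothed(count, total, outcomes=2):
    """Add-one (Laplace) estimate: (count + 1) / (total + outcomes)."""
    return (count + 1) / (total + outcomes)

# "mathematical" seen in 0 of 10 positive reviews:
# the MLE gives 0 and zeroes out the whole product; smoothing does not.
print(smoothed(0, 10))   # 1/12 ≈ 0.083, not 0
print(smoothed(10, 10))  # 11/12, never exactly 1 either
```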
An NLTK Demo
Sentiment Analysis with Python NLTK Text Classification (simplified classifier):
classification-sentiment-analysis-naive-bayes-classifier
Is “low” a positive or a negative term?
Ambiguous terms
“low” can be positive (“low price”) or negative (“low quality”)
Negation
How do we handle “the equipment never failed”?
Negation
Modify the words following a negation: “the equipment never NOT_failed”
Treat them as a separate ‘negated’ vocabulary.
Negation: how far to go?
“the equipment never failed and was cheap to run”
→ “the equipment never NOT_failed NOT_and NOT_was NOT_cheap NOT_to NOT_run”?
We have to determine the scope of the negation.
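One common heuristic (assumed here, not prescribed by the slides) is to extend the NOT_ marking only up to the next punctuation mark or coordinating conjunction:

```python
NEGATORS = {"not", "never", "no", "n't"}
END_OF_SCOPE = {".", ",", ";", "and", "but"}  # simple scope boundary heuristic

def mark_negation(tokens):
    """Prefix tokens after a negator with NOT_ until a scope boundary."""
    out, negating = [], False
    for tok in tokens:
        if tok.lower() in NEGATORS:
            negating = True
            out.append(tok)
        elif tok.lower() in END_OF_SCOPE:
            negating = False
            out.append(tok)
        else:
            out.append("NOT_" + tok if negating else tok)
    return out

print(mark_negation("the equipment never failed and was cheap to run".split()))
# ['the', 'equipment', 'never', 'NOT_failed', 'and', 'was', 'cheap', 'to', 'run']
```

With this rule, “failed” is negated but “cheap” is not, matching the intended reading of the example.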
Verdict: mixed
A simple bag-of-words strategy with a Naïve Bayes model works quite well for simple reviews referring to a single item, but fails
– for ambiguous terms
– for negation
– for comparative reviews
– to reveal the aspects of an opinion: “the car looked great and handled well, but the wheels kept falling off”
document retrieval – opinion mining – association mining
Association Mining
Goal: find interesting relationships among attributes of an object in a large collection … objects with attribute A also have attribute B
– e.g., “people who bought A also bought B”
For text: documents with term A also have term B
– widely used in the scientific and medical literature
Bag-of-words
Simplest approach: look for words A and B for which
frequency(A and B in the same document) >> frequency(A) × frequency(B)
This doesn’t work well:
– we want to find names (of companies, products, genes), not individual words
– we are interested in specific types of terms
– we want to learn from a few examples, and need contexts to avoid noise
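The frequency comparison above is essentially the ‘lift’ measure from association mining; a sketch over a toy document collection (all data invented for illustration):

```python
def lift(docs, a, b):
    """Ratio of the joint doc frequency of a, b to what independence predicts."""
    N = len(docs)
    p_a = sum(a in d for d in docs) / N
    p_b = sum(b in d for d in docs) / N
    p_ab = sum(a in d and b in d for d in docs) / N
    return p_ab / (p_a * p_b)   # values >> 1 suggest an association

docs = [{"aspirin", "headache"}, {"aspirin", "headache"},
        {"aspirin", "fever"}, {"insulin", "diabetes"}]
print(lift(docs, "aspirin", "headache"))  # (2/4) / (3/4 * 2/4) = 4/3 ≈ 1.33
```

On real text the slide’s caveats apply: without name recognition and term classification, high-lift pairs are dominated by noise.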
Needed Ingredients
Effective text association mining needs
– name recognition
– term classification
– preferably: the ability to learn patterns
– preferably: a good GUI
– demo:
Conclusion
Some tasks can be handled effectively (and very simply) by bag-of-words models, but most benefit from an analysis of language structure.