Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A


1 Bag-of-Words Methods for Text Mining CSCI-GA.2590 – Lecture 2A
Ralph Grishman, NYU

2 Bag of Words Models
Do we really need elaborate linguistic analysis? Look at three text mining applications (document retrieval, opinion mining, association mining), see how far we can get with document-level bag-of-words models, and introduce some of our mathematical approaches.

3 document retrieval, opinion mining, association mining

4 Information Retrieval
Task: given a query (a list of keywords), identify and rank relevant documents from a collection.
Basic idea: find documents whose set of words most closely matches the words in the query.

5 Topic Vector
Suppose the document collection has n distinct words, w1, …, wn. Each document is characterized by an n-dimensional vector whose ith component is the frequency of word wi in the document.
Example:
D1 = [The cat chased the mouse.]
D2 = [The dog chased the cat.]
W = [the, chased, dog, cat, mouse] (n = 5)
V1 = [ 2, 1, 0, 1, 1 ]
V2 = [ 2, 1, 1, 1, 0 ]
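A minimal sketch of how these term-frequency vectors could be computed (plain Python; the tokenization and the function name are illustrative, not from the lecture):

    from collections import Counter

    def term_frequency_vector(text, vocabulary):
        # count how often each vocabulary word occurs in the document (case-insensitive)
        counts = Counter(text.lower().replace('.', '').split())
        return [counts[w] for w in vocabulary]

    vocabulary = ["the", "chased", "dog", "cat", "mouse"]
    print(term_frequency_vector("The cat chased the mouse.", vocabulary))  # [2, 1, 0, 1, 1]
    print(term_frequency_vector("The dog chased the cat.", vocabulary))    # [2, 1, 1, 1, 0]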

6 Weighting the components
Unusual words like "elephant" determine the topic much more than common words such as "the" or "have". We can ignore words on a stop list, or weight each term frequency tfi by its inverse document frequency idfi = log( N / ni ), where N = size of the collection and ni = number of documents containing term i.
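A short sketch of this weighting under the definition above (a natural-log idf; real systems vary in the log base and in how they handle terms that appear in every document):

    import math
    from collections import Counter

    def tfidf_vectors(docs, vocabulary):
        # weight each term frequency tf_i by idf_i = log(N / n_i)
        N = len(docs)
        counts = [Counter(d.lower().replace('.', '').split()) for d in docs]
        n = [sum(1 for c in counts if c[w] > 0) for w in vocabulary]  # docs containing term i
        idf = [math.log(N / n_i) if n_i else 0.0 for n_i in n]
        return [[c[w] * idf[i] for i, w in enumerate(vocabulary)] for c in counts]

    docs = ["The cat chased the mouse.", "The dog chased the cat."]
    print(tfidf_vectors(docs, ["the", "chased", "dog", "cat", "mouse"]))
    # terms shared by every document ("the", "chased", "cat") get weight 0 in this 2-document collection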

7 Cosine similarity metric
Define a similarity metric between topic vectors. A common choice is cosine similarity (a normalized dot product):
sim( A, B ) = ( A · B ) / ( |A| |B| )

8 Cosine similarity metric
The cosine similarity metric is the cosine of the angle between the term vectors.
[Figure: two term vectors plotted on axes w1 and w2; the similarity is the cosine of the angle between them.]
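A minimal sketch of the metric in plain Python (the example vectors are V1 and V2 from the topic-vector slide):

    import math

    def cosine_similarity(a, b):
        # cos(angle between a and b) = dot(a, b) / (|a| * |b|)
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    v1 = [2, 1, 0, 1, 1]
    v2 = [2, 1, 1, 1, 0]
    print(cosine_similarity(v1, v2))  # 6/7 ≈ 0.857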

9 Verdict: a success
For heterogeneous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years.

10 document retrieval, opinion mining, association mining

11 Opinion Mining
Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic. This is a classification task, valuable for producers and marketers of all sorts of products.
Simple strategy: bag of words. Make lists of positive and negative words and see which predominate in a given document (marking it 'no opinion' if there are few words of either type); a sketch of this strategy follows below.
Problem: it is hard to make such lists, and the lists will differ for different topics.
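A minimal sketch of the word-list strategy (the tiny positive and negative lexicons here are illustrative, not from the lecture):

    POSITIVE = {"great", "excellent", "wonderful", "good"}
    NEGATIVE = {"poor", "terrible", "bad", "awful"}

    def simple_opinion(text):
        # see which word list predominates; few or balanced hits -> no opinion
        words = text.lower().split()
        pos = sum(w in POSITIVE for w in words)
        neg = sum(w in NEGATIVE for w in words)
        if pos > neg:
            return "positive"
        if neg > pos:
            return "negative"
        return "no opinion"

    print(simple_opinion("a great camera with an excellent lens"))  # positive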

12 Training a Model
Instead, label the documents in a corpus and then train a classifier from the corpus. Labeling is easier than thinking up words; in effect, we learn the positive and negative words from the corpus.
Probabilistic classifier: identify the most likely class
s = argmax over t ∈ {pos, neg} of P( t | W )
where W is the set of words in the document.

13 Using Bayes' Rule
P( t | W ) = P( t ) P( W | t ) / P( W ), and P( W ) is the same for both classes, so
argmax over t of P( t | W ) = argmax over t of P( t ) P( W | t ) ≈ argmax over t of P( t ) ∏i P( wi | t )
The last step is based on the naïve assumption of independence of the word probabilities.

14 Training
We now estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators:
P( t ) = count( docs labeled t ) / N
P( wi | t ) = count( docs labeled t containing wi ) / count( docs labeled t )
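A small sketch of these counts (documents are represented as (set of words, label) pairs; the names are illustrative):

    from collections import Counter

    def train_mle(labeled_docs):
        # labeled_docs: list of (set_of_words, label) pairs
        N = len(labeled_docs)
        label_count = Counter(label for _, label in labeled_docs)
        word_label_count = Counter((w, label) for words, label in labeled_docs for w in set(words))
        p_label = {t: label_count[t] / N for t in label_count}
        p_word_given_label = {(w, t): c / label_count[t] for (w, t), c in word_label_count.items()}
        return p_label, p_word_given_label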

15 A Problem
Suppose a glowing review GR (with lots of positive words) includes one word, "mathematical", previously seen only in negative reviews.
P( positive | GR ) = ?

16 P( positive | GR ) = 0, because P( "mathematical" | positive ) = 0.
The maximum likelihood estimate is poor when there is very little data. We need to 'smooth' the probabilities to avoid this problem.

17 Laplace Smoothing
A simple remedy is to add 1 to each count. To keep these values as probabilities, we increase the denominator N by the number of outcomes (values of t): 2, for 'positive' and 'negative'. For the conditional probabilities P( w | t ) there are similarly two outcomes (w is present or absent), so that denominator is also increased by 2.
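A sketch of the smoothed estimates (same (set of words, label) representation as above; the add-1 adjustments follow the slide, the variable names are illustrative):

    from collections import Counter

    def train_laplace(labeled_docs):
        # add 1 to every count; grow each denominator by the number of outcomes (2)
        N = len(labeled_docs)
        labels = sorted({label for _, label in labeled_docs})      # e.g. ['negative', 'positive']
        vocab = {w for words, _ in labeled_docs for w in words}
        label_count = Counter(label for _, label in labeled_docs)
        word_label_count = Counter((w, label) for words, label in labeled_docs for w in set(words))
        p_label = {t: (label_count[t] + 1) / (N + len(labels)) for t in labels}
        p_word_given_label = {(w, t): (word_label_count[(w, t)] + 1) / (label_count[t] + 2)
                              for w in vocab for t in labels}
        return p_label, p_word_given_label

With this smoothing, P( "mathematical" | positive ) is small but no longer zero, so a single unseen word cannot force P( positive | GR ) to 0.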

18 An NLTK Demo
Sentiment Analysis with Python NLTK Text Classification
NLTK Code (simplified classifier)
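A hedged sketch of the kind of simplified NLTK classifier the demo refers to (the training documents are made up; NLTK's NaiveBayesClassifier expects (feature dict, label) pairs):

    import nltk

    def word_features(text):
        # bag-of-words features: each word present in the document maps to True
        return {w: True for w in text.lower().split()}

    train = [
        (word_features("a great wonderful movie"), "pos"),
        (word_features("excellent acting and a good story"), "pos"),
        (word_features("a terrible boring mess"), "neg"),
        (word_features("awful script and poor acting"), "neg"),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train)
    print(classifier.classify(word_features("good story great acting")))  # 'pos'
    classifier.show_most_informative_features(5)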

19 Is “low” a positive or a negative term?

20 Ambiguous terms
"low" can be positive ("low price") or negative ("low quality").

21 Verdict: mixed
A simple bag-of-words strategy with a naïve Bayes model works quite well for simple reviews referring to a single item, but it fails for ambiguous terms and for comparative reviews, and it cannot reveal the aspects of an opinion: "the car looked great and handled well, but the wheels kept falling off".

22 document retrieval, opinion mining, association mining

23 Association Mining
Goal: find interesting relationships among attributes of an object in a large collection … objects with attribute A also have attribute B (e.g., "people who bought A also bought B").
For text: documents with term A also have term B; widely used in the scientific and medical literature.

24 Bag-of-words
Simplest approach: look for words A and B for which
frequency( A and B in the same document ) >> frequency of A × frequency of B
(a sketch of this co-occurrence test follows below). But this doesn't work well:
we want to find names (of companies, products, genes), not individual words
we are interested in specific types of terms
we want to learn from a few examples
we need contexts to avoid noise
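A minimal sketch of the co-occurrence test (comparing the joint document frequency of A and B with the product of their individual document frequencies; the lift threshold and tokenization are illustrative):

    from collections import Counter
    from itertools import combinations

    def associated_pairs(docs, threshold=2.0):
        # return word pairs that co-occur in documents far more often than chance
        n = len(docs)
        doc_words = [set(d.lower().split()) for d in docs]
        word_freq = Counter(w for words in doc_words for w in words)
        pair_freq = Counter(p for words in doc_words for p in combinations(sorted(words), 2))
        pairs = []
        for (a, b), c in pair_freq.items():
            # P(A and B) >> P(A) * P(B)  <=>  lift well above 1
            lift = (c / n) / ((word_freq[a] / n) * (word_freq[b] / n))
            if c > 1 and lift >= threshold:
                pairs.append(((a, b), lift))
        return sorted(pairs, key=lambda x: -x[1])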

25 Needed Ingredients
Effective text association mining needs:
name recognition
term classification
preferably: the ability to learn patterns
preferably: a good GUI
demo:

26 Conclusion
Some tasks can be handled effectively (and very simply) by bag-of-words models, but most benefit from an analysis of language structure.

