Bag-of-Words Methods for Text Mining
CSCI-GA.2590 – Lecture 2A
Ralph Grishman, NYU


Bag of Words Models
Do we really need elaborate linguistic analysis? We will look at three text mining applications:
– document retrieval
– opinion mining
– association mining
and see how far we can get with document-level bag-of-words models, introducing some of our mathematical approaches along the way.

– document retrieval
– opinion mining
– association mining

Information Retrieval
Task: given a query (a list of keywords), identify and rank relevant documents from a collection.
Basic idea: find the documents whose set of words most closely matches the words in the query.

Topic Vector
Suppose the document collection has n distinct words, w_1, …, w_n. Each document is characterized by an n-dimensional vector whose i-th component is the frequency of word w_i in the document.
Example:
D1 = "The cat chased the mouse."
D2 = "The dog chased the cat."
W = [The, chased, dog, cat, mouse] (n = 5)
V1 = [2, 1, 0, 1, 1]
V2 = [2, 1, 1, 1, 0]
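A minimal sketch of this construction in Python (the tokenizer and the first-seen vocabulary ordering are assumptions for illustration; the slide lists the vocabulary in a different order):

    import re

    def tokenize(text):
        # crude tokenizer: lowercase, keep alphabetic runs (an assumption;
        # the lecture does not specify a tokenizer)
        return re.findall(r"[a-z]+", text.lower())

    docs = ["The cat chased the mouse.", "The dog chased the cat."]
    token_lists = [tokenize(d) for d in docs]

    # vocabulary = all distinct words, in order of first appearance
    vocab = []
    for toks in token_lists:
        for w in toks:
            if w not in vocab:
                vocab.append(w)

    # i-th component of each vector = frequency of vocab[i] in that document
    vectors = [[toks.count(w) for w in vocab] for toks in token_lists]
    print(vocab)    # ['the', 'cat', 'chased', 'mouse', 'dog']
    print(vectors)  # [[2, 1, 1, 1, 0], [2, 1, 1, 0, 1]]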

Weighting the components
Unusual words like "elephant" determine the topic much more than common words such as "the" or "have". We can ignore words on a stop list, or weight each term frequency tf_i by its inverse document frequency idf_i:
idf_i = log(N / n_i)
where N = size of collection and n_i = number of documents containing term i.
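A sketch of this weighting, assuming the standard log(N / n_i) form of idf (other variants add smoothing or change the log base):

    import math

    def idf(term, token_lists):
        N = len(token_lists)                                  # size of collection
        n_i = sum(1 for toks in token_lists if term in toks)  # docs containing term
        return math.log(N / n_i)  # assumes the term occurs in at least one document

    def tfidf_vector(toks, vocab, token_lists):
        return [toks.count(w) * idf(w, token_lists) for w in vocab]

On the two-document example above, idf("the") = log(2/2) = 0, so "the" contributes nothing, while idf("mouse") = log 2: exactly the intended downweighting of common words.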

Cosine similarity metric
Define a similarity metric between topic vectors. A common choice is cosine similarity (the normalized dot product):
sim(V1, V2) = (V1 · V2) / (|V1| |V2|)

Cosine similarity metric
The cosine similarity metric is the cosine of the angle between the term vectors.
[figure: two term vectors drawn on axes w_1 and w_2, with the angle between them]
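A direct transcription of the formula (a minimal sketch; real IR systems use sparse representations rather than dense lists):

    import math

    def cosine(v1, v2):
        dot = sum(a * b for a, b in zip(v1, v2))
        n1 = math.sqrt(sum(a * a for a in v1))
        n2 = math.sqrt(sum(b * b for b in v2))
        # guard against zero vectors, for which the angle is undefined
        return dot / (n1 * n2) if n1 and n2 else 0.0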

Verdict: a success
For heterogeneous text collections, the vector space model, tf-idf weighting, and cosine similarity have been the basis for successful document retrieval for over 50 years.
– stemming is required for inflected languages
– limited resolution: it returns documents, not answers

– document retrieval
– opinion mining
– association mining

Opinion Mining
Task: judge whether a document expresses a positive or negative opinion (or no opinion) about an object or topic.
– a classification task
– valuable for producers/marketers of all sorts of products
Simple strategy: bag-of-words.
– make lists of positive and negative words, and see which predominate in a given document (marking the document 'no opinion' if there are few words of either type)
– problem: such lists are hard to make, and they will differ for different topics

Training a Model
Instead, label the documents in a corpus and then train a classifier from the corpus:
– labeling is easier than thinking up words
– in effect, we learn the positive and negative words from the corpus
Probabilistic classifier: identify the most likely class
s = argmax_{t ∈ {pos, neg}} P(t | W)

Using Bayes' Rule
P(t | W) = P(t) P(W | t) / P(W)
         ∝ P(t) P(w_1, …, w_n | t)
         ≈ P(t) ∏_i P(w_i | t)
The last step is based on the naïve assumption of independence of the word probabilities.

Training
We now estimate these probabilities from the training corpus (of N documents) using maximum likelihood estimators:
P(t) = count(docs labeled t) / N
P(w_i | t) = count(docs labeled t containing w_i) / count(docs labeled t)
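A sketch of these estimators, assuming the training corpus is given as (words, label) pairs (the data format and function names are illustrative, not from the lecture):

    from collections import Counter, defaultdict

    def train_mle(corpus):
        """corpus: list of (words, label) pairs, label in {'pos', 'neg'}."""
        N = len(corpus)
        class_count = Counter(label for _, label in corpus)
        # word_count[t][w] = number of docs labeled t that contain w
        word_count = defaultdict(Counter)
        for words, label in corpus:
            for w in set(words):        # document-level: each word counts once per doc
                word_count[label][w] += 1
        p_class = {t: class_count[t] / N for t in class_count}
        p_word = {t: {w: word_count[t][w] / class_count[t] for w in word_count[t]}
                  for t in word_count}
        return p_class, p_word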

A Problem
Suppose a glowing review GR (with lots of positive words) includes one word, "mathematical", previously seen only in negative reviews.
P(positive | GR) = ?

P(positive | GR) = 0, because P("mathematical" | positive) = 0.
The maximum likelihood estimate is poor when there is very little data. We need to 'smooth' the probabilities to avoid this problem.

Laplace Smoothing
A simple remedy is to add 1 to each count.
– to keep the estimates as probabilities, we increase the denominator by the number of possible outcomes: for P(t) the denominator N becomes N + 2 (two values of t, 'positive' and 'negative')
– for the conditional probabilities P(w | t) there are similarly two outcomes (w is present or absent), so that denominator also grows by 2
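In code, continuing the counts from train_mle above (a sketch of the slide's add-one scheme):

    def p_class_smoothed(class_count, N, t):
        # add 1 to the count; add 2 (the number of classes) to the denominator
        return (class_count[t] + 1) / (N + 2)

    def p_word_smoothed(word_count, class_count, t, w):
        # two outcomes per word (present / absent), so the denominator grows by 2
        return (word_count[t][w] + 1) / (class_count[t] + 2)

Now a word never seen with a class, like "mathematical" in positive reviews, gets the small probability 1 / (count(docs labeled positive) + 2) rather than 0, so a single unseen word can no longer zero out the whole product.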

An NLTK Demo
Sentiment Analysis with Python: NLTK Text Classification
NLTK Code (simplified classifier): classification-sentiment-analysis-naive-bayes-classifier

Is "low" a positive or a negative term?

Ambiguous terms
"low" can be positive ("low price") or negative ("low quality")

Negation
How do we handle "the equipment never failed"?

Negation
Modify the words following a negation ("the equipment never NOT_failed") and treat them as a separate 'negated' vocabulary.

Negation: how far to go?
"the equipment never failed and was cheap to run"
→ "the equipment never NOT_failed NOT_and NOT_was NOT_cheap NOT_to NOT_run"
We have to determine the scope of the negation.
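A sketch of the NOT_ transformation with one simple scope rule: negation extends to the next clause boundary. The trigger list and boundary set are assumptions; choosing them well is exactly the question this slide raises.

    NEGATORS = {"not", "never", "no"}
    BOUNDARIES = {".", ",", ";", "but"}

    def mark_negation(tokens):
        out, negated = [], False
        for tok in tokens:
            if tok in NEGATORS:
                negated = True
                out.append(tok)
            elif tok in BOUNDARIES:    # clause boundary ends the scope
                negated = False
                out.append(tok)
            else:
                out.append("NOT_" + tok if negated else tok)
        return out

    # mark_negation("the equipment never failed and was cheap to run".split())
    # -> ['the', 'equipment', 'never', 'NOT_failed', 'NOT_and', 'NOT_was',
    #     'NOT_cheap', 'NOT_to', 'NOT_run']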

Verdict: mixed
A simple bag-of-words strategy with a Naïve Bayes model works quite well for simple reviews referring to a single item, but it fails
– for ambiguous terms
– for negation
– for comparative reviews
– to reveal the aspects of an opinion: "the car looked great and handled well, but the wheels kept falling off"

– document retrieval
– opinion mining
– association mining

Association Mining
Goal: find interesting relationships among attributes of an object in a large collection … objects with attribute A also have attribute B.
– e.g., "people who bought A also bought B"
For text: documents with term A also have term B.
– widely used in scientific and medical literature

Bag-of-words
Simplest approach: look for words A and B for which
frequency(A and B in the same document) >> frequency(A) × frequency(B)
This doesn't work well:
– we want to find names (of companies, products, genes), not individual words
– we are interested in specific types of terms
– we want to learn from a few examples, and need contexts to avoid noise
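A sketch of the co-occurrence test, scored with pointwise mutual information (PMI is one standard way to operationalize the ">>" comparison; the scoring choice and threshold here are assumptions):

    import math
    from collections import Counter
    from itertools import combinations

    def associated_pairs(doc_word_sets, threshold=2.0):
        N = len(doc_word_sets)
        df = Counter()        # document frequency per word
        pair_df = Counter()   # document frequency per unordered word pair
        for words in doc_word_sets:
            df.update(words)
            pair_df.update(combinations(sorted(words), 2))
        pairs = []
        for (a, b), n_ab in pair_df.items():
            # PMI = log [ P(A, B) / (P(A) P(B)) ]; large values mean the pair
            # co-occurs far more often than independence would predict
            pmi = math.log((n_ab / N) / ((df[a] / N) * (df[b] / N)))
            if pmi > threshold:
                pairs.append((a, b, pmi))
        return sorted(pairs, key=lambda p: -p[2])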

Needed Ingredients
Effective text association mining needs:
– name recognition
– term classification
– preferably: the ability to learn patterns
– preferably: a good GUI
– demo:

Conclusion
Some tasks can be handled effectively (and very simply) by bag-of-words models, but most benefit from an analysis of language structure.