Text Categorization Moshe Koppel Lecture 2: Naïve Bayes Slides based on Manning, Raghavan and Schutze

Naïve Bayes: Why Bother? Tightly tied to text categorization. Interesting theoretical properties. A simple example of an important class of learners based on generative models that approximate how data is produced. For certain special cases, NB is the best thing you can do.

Bayes’ Rule
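Stated explicitly, with h a hypothesis and D the observed data (the notation used in the slides that follow):

```latex
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}
```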

Maximum a Posteriori Hypothesis. As P(D) is constant across hypotheses, it can be dropped from the maximization.
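Written out, this is the standard derivation from Bayes' rule in the notation above:

```latex
\begin{aligned}
h_{\mathrm{MAP}} &= \arg\max_{h \in H} P(h \mid D) \\
                 &= \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} \\
                 &= \arg\max_{h \in H} P(D \mid h)\,P(h)
\end{aligned}
```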

Maximum likelihood Hypothesis If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:
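That is, in the notation above:

```latex
h_{\mathrm{ML}} = \arg\max_{h \in H} P(D \mid h)
```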

Naive Bayes Classifiers. Task: classify a new instance D, described by a tuple of attribute values (x_1, x_2, …, x_n), into one of the classes c_j ∈ C.
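Applying the MAP rule above, with classes as hypotheses and the attribute tuple as the data, gives:

```latex
\begin{aligned}
c_{\mathrm{MAP}} &= \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n) \\
                 &= \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)
\end{aligned}
```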

Naïve Bayes Classifier: Naïve Bayes Assumption
P(c_j)
– Can be estimated from the frequency of classes in the training examples.
P(x_1, x_2, …, x_n | c_j)
– O(|X|^n |C|) parameters.
– Could only be estimated if a very, very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(x_i | c_j).
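In symbols, the independence assumption reads:

```latex
P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_{i=1}^{n} P(x_i \mid c_j)
```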

Smoothing to Avoid Overfitting
Add pseudo-counts so that attribute values unseen in the training data do not receive zero probability. A somewhat more subtle version spreads the added mass according to the overall fraction of the data where X_i = x_{i,k}, with a parameter controlling the extent of "smoothing".
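A sketch of the two estimators this slide refers to, reconstructed from its annotations; the symbols N, k, m and p_{i,k} are my labels, where k is the number of values of X_i, p_{i,k} is the overall fraction of the data with X_i = x_{i,k}, and m is the extent of smoothing:

```latex
\hat{P}(x_{i,k} \mid c_j) = \frac{N(X_i = x_{i,k},\, C = c_j) + 1}{N(C = c_j) + k} \quad \text{(add-one)}

\hat{P}(x_{i,k} \mid c_j) = \frac{N(X_i = x_{i,k},\, C = c_j) + m\,p_{i,k}}{N(C = c_j) + m} \quad \text{(m-estimate)}
```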

Naive Bayes for Text Categorization. Attributes are text positions, values are words. Still too many possibilities. Assume that classification is independent of the positions of the words: use the same parameters for each position; the result is a bag-of-words model (over tokens, not types).

Naïve Bayes: Learning
From the training corpus, extract Vocabulary.
Calculate the required P(c_j) and P(x_k | c_j) terms:
– For each c_j in C do:
  – docs_j ← subset of documents for which the target class is c_j
  – estimate P(c_j) from the fraction of documents that fall in docs_j
  – Text_j ← single document containing all docs_j
  – for each word x_k in Vocabulary:
    – n_k ← number of occurrences of x_k in Text_j
    – estimate P(x_k | c_j) from n_k, with smoothing (a concrete sketch follows below)
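A minimal Python sketch of this training loop, assuming add-one smoothing over the per-class mega-document; the function and variable names (train_multinomial_nb, docs, labels) are illustrative, not from the slides.

```python
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class names."""
    vocab = {w for doc in docs for w in doc}                    # extract Vocabulary
    prior, cond = {}, defaultdict(dict)
    for c in set(labels):
        docs_c = [d for d, y in zip(docs, labels) if y == c]    # docs_j
        prior[c] = len(docs_c) / len(docs)                      # P(c_j)
        text_c = Counter(w for d in docs_c for w in d)          # Text_j: one mega-document
        n = sum(text_c.values())                                # total tokens in Text_j
        for w in vocab:                                         # add-one smoothed P(x_k | c_j)
            cond[c][w] = (text_c[w] + 1) / (n + len(vocab))
    return vocab, prior, cond
```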

Naïve Bayes: Classifying
positions ← all word positions in the current document which contain tokens found in Vocabulary.
Return c_NB, where:
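This is the standard multinomial NB decision rule:

```latex
c_{\mathrm{NB}} = \arg\max_{c_j \in C} \; P(c_j) \prod_{i \in \text{positions}} P(x_i \mid c_j)
```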

Underflow Prevention Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. Class with highest final un-normalized log probability score is still the most probable.
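A matching classification sketch in log space, continuing the hypothetical train_multinomial_nb example above (words outside the vocabulary are simply skipped here, one of several reasonable choices):

```python
import math

def classify_nb(doc, vocab, prior, cond):
    """doc: list of tokens. Returns the class with the highest log score."""
    best_class, best_score = None, float("-inf")
    for c in prior:
        # Sum logs of probabilities instead of multiplying probabilities to avoid underflow.
        score = math.log(prior[c])
        for w in doc:
            if w in vocab:                      # positions with tokens found in Vocabulary
                score += math.log(cond[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```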

Naïve Bayes as Stochastic Language Models
Model the probability of generating strings (each word in turn) in the language (commonly all strings over ∑). E.g., a unigram model M:
the 0.2, a 0.1, man 0.01, woman 0.01, said 0.03, likes 0.02, …
For s = "the man likes the woman": P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008 (multiply the word probabilities).

Naïve Bayes as Stochastic Language Models
Model the probability of generating any string. Example with two unigram models:
Model M1: the 0.2, class 0.01, woman 0.01, plus entries for sayst, pleaseth, yon, maiden.
Model M2: the 0.2, sayst 0.03, pleaseth 0.02, yon 0.1, maiden 0.01, plus entries for class and woman.
For s = "the class pleaseth yon maiden": P(s | M2) > P(s | M1).

Unigram and higher-order models
Unigram Language Models: P(w_1 w_2 w_3 w_4) ≈ P(w_1) P(w_2) P(w_3) P(w_4)
Bigram (generally, n-gram) Language Models: P(w_1 w_2 w_3 w_4) ≈ P(w_1) P(w_2 | w_1) P(w_3 | w_2) P(w_4 | w_3)
Easy. Effective!

Smoothing and Backoff. Suppose we're using a trigram model. We need to estimate P(w_3 | w_1, w_2). It will often be the case that the trigram w_1 w_2 w_3 is rare or non-existent in the training corpus. (Similar to the problem we saw above with unigrams.) First resort: backoff, i.e., estimate P(w_3 | w_1, w_2) using P(w_3 | w_2). Alternatively, use some very large backup corpus. Various combinations have been tried.
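One common concrete way of combining the estimates is simple linear interpolation; the λ weights are an assumed example, not something given on the slide:

```latex
\hat{P}(w_3 \mid w_1, w_2) = \lambda_3\, P(w_3 \mid w_1, w_2) + \lambda_2\, P(w_3 \mid w_2) + \lambda_1\, P(w_3),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```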

Multinomial Naïve Bayes = class-conditional language model. Think of w_i as the i-th word in the document. Effectively, the probability of each class is computed with a class-specific unigram language model. (Figure: a class node Cat generating the word nodes w_1 … w_6.)

But Wait! Another Approach. Now think of w_i as the i-th word in the dictionary (not the document). Each value is either 1 (the word is in the doc) or 0 (it is not). This is very different from the multinomial method. McCallum and Nigam (1998) observed that the two were often confused. (Figure: the same class node Cat, but with one binary node w_i per dictionary word.)

Binomial (multivariate Bernoulli) Naïve Bayes. One feature X_w for each word in the dictionary; X_w = true in document d if w appears in d. Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears.
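Under this assumption the document likelihood factorizes over the whole dictionary; a standard way to write it (V is the dictionary and x_w ∈ {0, 1} indicates whether w occurs in d):

```latex
P(d \mid c_j) = \prod_{w \in V} P(X_w = 1 \mid c_j)^{x_w}\,\bigl(1 - P(X_w = 1 \mid c_j)\bigr)^{1 - x_w}
```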

Parameter Estimation
Binomial model: estimate P(X_w = 1 | c_j) as the fraction of documents of topic c_j in which word w appears.
Multinomial model: estimate P(x_i = w | c_j) as the fraction of times in which word w appears across all documents of topic c_j.
– Can create a mega-document for topic j by concatenating all documents in this topic.
– Use frequency of w in mega-document.
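A small sketch contrasting the two estimates (illustrative function names; smoothing omitted for clarity):

```python
from collections import Counter

def bernoulli_estimate(docs_in_class, w):
    """Fraction of the topic's documents in which word w appears."""
    return sum(1 for d in docs_in_class if w in d) / len(docs_in_class)

def multinomial_estimate(docs_in_class, w):
    """Fraction of token positions, across the topic's documents, that are word w."""
    mega_document = [tok for d in docs_in_class for tok in d]   # concatenate all docs
    return Counter(mega_document)[w] / len(mega_document)
```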

Experiment: Multinomial vs Binomial. M&N (1998) did some experiments to see which is better. Task: determine whether a university web page is {student, faculty, other_stuff}. Train on ~5,000 hand-labeled web pages (Cornell, Washington, U. Texas, Wisconsin). Crawl and classify a new site (CMU).

Multinomial vs. Binomial

Conclusions. Multinomial is better. For Binomial, it's really important to do feature filtering. Other experiments bear out these conclusions.

Feature Filtering. If irrelevant words mess up the results, let's try to use only words that might help. In the training set, choose the k words which best discriminate the categories. Best way to choose: for each category, build a list of the j most discriminating terms.

Infogain. Use terms with maximal mutual information with the classes: compute I(w, c) for each word w and each category c. (This is equivalent to the usual two-class Infogain formula.)
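A standard form of this criterion, with e_w, e_c ∈ {0, 1} indicating presence of the word and membership in the category (written as expected mutual information, which is what reduces to the two-class Infogain):

```latex
I(w, c) = \sum_{e_w \in \{0,1\}} \; \sum_{e_c \in \{0,1\}} P(e_w, e_c)\,\log \frac{P(e_w, e_c)}{P(e_w)\,P(e_c)}
```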

Chi-Square Feature Selection

                                        Term present   Term absent
Document belongs to category                 A              B
Document does not belong to category         C              D

X² = N (AD - BC)² / ( (A+B) (A+C) (B+D) (C+D) ), where N = A + B + C + D.
For complete independence of term and category: AD = BC.
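A tiny helper, as a sketch, for computing this statistic from the four cell counts (the function name is illustrative):

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category contingency table.

    a: term present, doc in category      b: term absent, doc in category
    c: term present, doc not in category  d: term absent, doc not in category
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))
```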

Feature Selection. Many other measures of differentiation have been tried. Empirical tests suggest Infogain works best. Simply eliminating rare terms is easy and usually doesn't do much harm. Be sure not to use test data when you do feature selection. (This is tricky when you're using k-fold cross-validation.)

Naïve Bayes: Conclusions. Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate, though not nearly as good as, say, SVM. However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not: output probabilities are generally very close to 0 or 1.

Some Good Things about NB. Theoretically optimal if the independence assumptions hold. Fast. Sort of robust to irrelevant features (but not really). Very good in domains with many equally important features. Probably the only method useful for very short test documents. (Why?)