Naïve Bayes Text Classification

Text Classification and the Idea of Relevance
Text classification is the process of assigning tags or categories to text based on its content. Applications include sentiment analysis, topic labeling, spam detection, and relevance. Relevance: the search engine returns a set of documents relevant to the user's query; the user manually marks each result as relevant or not, and the search engine re-computes the results based on this input. The new results then (hopefully) have better recall.

Rocchio classification
Consider three classes, China, UK and Kenya, in a 2D space; documents are shown as circles, diamonds and X's. The boundaries that separate the three classes are called decision boundaries. A new document (the 'star') is assigned to the class of the region it falls into, China in this case. Rocchio classification computes these boundaries in order to classify unseen data. It uses centroids to define the boundaries: the centroid of a class is the center of mass (vector average) of its members.
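A minimal Python sketch of the nearest-centroid decision rule Rocchio uses, assuming documents have already been mapped to vectors (the function names and the toy data are illustrative, not from the slides):

import numpy as np

def train_rocchio(X, y):
    # One centroid per class: the vector average of that class's documents.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_rocchio(centroids, x):
    # Assign x to the class whose centroid is closest (Euclidean distance).
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Toy 2D example with two training documents per class.
X = np.array([[1.0, 0.1], [0.9, 0.2],   # "China" documents
              [0.1, 1.0], [0.2, 0.9]])  # "UK" documents
y = np.array(["China", "China", "UK", "UK"])
centroids = train_rocchio(X, y)
print(classify_rocchio(centroids, np.array([0.8, 0.3])))  # -> China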

Query expansion at search engines
Example 1: after searching for 'herbs', users frequently search for 'herbal remedies'; 'herbal remedies' is therefore a potential expansion of 'herbs'. Example 2: users who search for 'flower pic' frequently click on a specific URL, and users searching for 'flower clipart' also frequently click the same URL; 'flower clipart' and 'flower pic' are therefore potential expansions of each other.

Naïve Bayes – Text Classification – Idea
Given: a document space X, in which documents are represented (typically some high-dimensional space); a fixed set of classes C = {c1, c2, ..., cn}, where the classes are human-defined for the needs of an application (e.g. spam vs. non-spam); and a training set D of labeled documents, each belonging to a class. Using a learning method, we wish to learn a classifier y that maps documents to classes, y : X → C.

Naïve Bayes classifier
The Naïve Bayes classifier is a probabilistic classifier. The probability of a document d being in class c is computed from the class prior and the per-term probabilities, as shown below. Nd is the length of the document (number of tokens), P(tk | c) is the probability of term tk occurring in a document of class c and serves as a measure of how much evidence tk contributes that c is the correct class, and P(c) is the prior probability of c.
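The slide's formula itself did not survive the transcript; the standard multinomial Naïve Bayes form being described is

P(c \mid d) \;\propto\; P(c) \prod_{1 \le k \le N_d} P(t_k \mid c)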

Maximum a posteriori class
Our goal in Naïve Bayes classification is to find the 'best' class for a document. The best class is the most likely, or maximum a posteriori (MAP), class.
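Written out (the slide's equation is missing from the transcript; hats denote estimates from the training data, since the true parameter values are unknown):

c_{\text{map}} = \arg\max_{c \in C} \hat{P}(c \mid d) = \arg\max_{c \in C} \hat{P}(c) \prod_{1 \le k \le N_d} \hat{P}(t_k \mid c)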

Taking the log
Multiplying many small probabilities can cause floating point underflow. Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities. Since log is a monotonic function, the class with the highest score does not change. Thus our MAP formula becomes the log-space version shown below. Simple interpretation: each conditional parameter log P̂(tk | c) is a weight that indicates how good an indicator tk is for c, and the prior log P̂(c) is a weight that indicates the relative frequency of c. The sum of the log prior and the term weights is then a measure of how much evidence there is for the document being in the class, and we select the class with the most evidence.
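The log-space MAP rule (reconstructed, as the slide's equation is not in the transcript):

c_{\text{map}} = \arg\max_{c \in C} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le N_d} \log \hat{P}(t_k \mid c) \Big]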

Parameter Estimation
Estimate the parameters P(c) and P(tk | c) from the training data. Prior: P̂(c) = Nc / N, where Nc is the number of documents in class c and N is the total number of documents. Conditional probability: P̂(t | c) = Tct / Σt' Tct', where Tct is the number of occurrences of token t in training documents from class c and the sum in the denominator runs over all terms t' in the vocabulary.
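As a quick numeric illustration using the training set from the worked example later in these slides (three of the four training documents are in class China, and the term Chinese occurs 5 times among the 8 tokens of those three documents):

\hat{P}(\text{China}) = \frac{3}{4}, \qquad \hat{P}(\text{Chinese} \mid \text{China}) = \frac{5}{8} \;\text{(before smoothing)}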

The problem with zeros
Consider a document d with class c = China and tokens X1 = Beijing, X2 = and, X3 = Taipei, X4 = join, X5 = WTO, so that P(China | d) ∝ P(China) · P(Beijing | China) · P(and | China) · P(Taipei | China) · P(join | China) · P(WTO | China). If WTO never occurs in class China in the training set, then P(WTO | China) = 0 / Σt' TChina,t' = 0, and we therefore get P(China | d) = 0 for any document that contains WTO.

To avoid zeros: add-one smoothing
Before smoothing: P(t | c) = Tct / Σt' Tct'. After smoothing: P(t | c) = (Tct + 1) / Σt' (Tct' + 1) = (Tct + 1) / (Σt' Tct' + B), where B is the number of bins, in this case the number of different words, i.e. the size of the vocabulary.
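Continuing the numeric illustration (with the later worked example's training set, where class China has 8 tokens and the vocabulary has B = 6 terms):

\hat{P}(\text{Chinese} \mid \text{China}) = \frac{5 + 1}{8 + 6} = \frac{6}{14} = \frac{3}{7}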

Naïve Bayes: Summary
Estimate the parameters from the training corpus using add-one smoothing. For a new document, for each class, compute the sum of (i) the log of the prior and (ii) the logs of the conditional probabilities of the document's terms. Assign the document to the class with the largest score.

Naïve Bayes: Training
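The training pseudocode from this slide is not in the transcript. A minimal Python sketch of multinomial Naïve Bayes training with add-one smoothing, following the parameter estimates above (the function and variable names are mine, not from the slides):

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    # docs: list of (tokens, class_label) pairs.
    # Returns the vocabulary, log priors, and log conditional probabilities.
    vocab = {t for tokens, _ in docs for t in tokens}
    classes = {c for _, c in docs}
    log_prior, log_condprob = {}, defaultdict(dict)
    for c in classes:
        class_docs = [tokens for tokens, label in docs if label == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))        # P(c) = Nc / N
        counts = Counter(t for tokens in class_docs for t in tokens)
        total = sum(counts.values())
        for t in vocab:                                             # add-one smoothing
            log_condprob[t][c] = math.log((counts[t] + 1) / (total + len(vocab)))
    return vocab, log_prior, log_condprob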

Naïve Bayes: Testing
ApplyMultinomialNB(C, V, prior, condprob, d)
  W ← ExtractTokensFromDoc(V, d)
  for each c ∈ C do
    score[c] ← log prior[c]
    for each t ∈ W do
      score[c] += log condprob[t][c]
  return arg max_{c ∈ C} score[c]
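A Python sketch of the apply step, matching the train_multinomial_nb sketch above (the logs are already stored, and out-of-vocabulary tokens are skipped, playing the role of ExtractTokensFromDoc), followed by a usage example on the data from the next slide:

def apply_multinomial_nb(vocab, log_prior, log_condprob, tokens):
    # Return the highest-scoring class for a tokenized test document.
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c]
        for t in tokens:
            if t in vocab:                      # ignore out-of-vocabulary tokens
                scores[c] += log_condprob[t][c]
    return max(scores, key=scores.get)

docs = [("Chinese Beijing Chinese".split(), "China"),
        ("Chinese Chinese Shanghai".split(), "China"),
        ("Chinese Macao".split(), "China"),
        ("Tokyo Japan Chinese".split(), "not China")]
vocab, log_prior, log_condprob = train_multinomial_nb(docs)
print(apply_multinomial_nb(vocab, log_prior, log_condprob,
                           "Chinese Chinese Chinese Tokyo Japan".split()))  # -> China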

Example:
               Doc ID   Words in document                      In c = China?
Training set   1        Chinese Beijing Chinese                Yes
               2        Chinese Chinese Shanghai               Yes
               3        Chinese Macao                          Yes
               4        Tokyo Japan Chinese                    No
Test set       5        Chinese Chinese Chinese Tokyo Japan    ???
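Working through the numbers with the estimates above (add-one smoothing; class China has 8 training tokens, the complement class has 3, and the vocabulary has B = 6 terms):

\hat{P}(\text{China}) = \frac{3}{4}, \quad \hat{P}(\overline{\text{China}}) = \frac{1}{4}

\hat{P}(\text{Chinese} \mid \text{China}) = \frac{5+1}{8+6} = \frac{3}{7}, \quad \hat{P}(\text{Tokyo} \mid \text{China}) = \hat{P}(\text{Japan} \mid \text{China}) = \frac{0+1}{8+6} = \frac{1}{14}

\hat{P}(\text{Chinese} \mid \overline{\text{China}}) = \hat{P}(\text{Tokyo} \mid \overline{\text{China}}) = \hat{P}(\text{Japan} \mid \overline{\text{China}}) = \frac{1+1}{3+6} = \frac{2}{9}

\hat{P}(\text{China} \mid d_5) \propto \frac{3}{4} \cdot \left(\frac{3}{7}\right)^3 \cdot \frac{1}{14} \cdot \frac{1}{14} \approx 0.0003, \quad \hat{P}(\overline{\text{China}} \mid d_5) \propto \frac{1}{4} \cdot \left(\frac{2}{9}\right)^3 \cdot \frac{2}{9} \cdot \frac{2}{9} \approx 0.0001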

Thus, the classifier assigns the test document to c = China. The reason for this classification decision is that the three occurrences of the positive indicator Chinese in document 5 outweigh the occurrences of the two negative indicators Japan and Tokyo.

Time complexity of Naïve Bayes
Notation: Lave is the average length of a training document, La the length of the test document, Ma the number of distinct terms in the test document, D the training set, V the vocabulary, and C the set of classes. Θ(|D| Lave) is the time it takes to compute all counts. Θ(|C| |V|) is the time it takes to compute the parameters from the counts. Generally, |C| |V| < |D| Lave. Test time is also linear (in the length of the test document). Thus, Naive Bayes is linear in the size of the training set (training) and in the size of the test document (testing). This is optimal.