Naïve Bayes Classifier April 25 th, 2006. Classification Methods (1) Manual classification Used by Yahoo!, Looksmart, about.com, ODP Very accurate when.

Slides:

Advertisements

Similar presentations

Naïve Bayes. Bayesian Reasoning Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of.

Advertisements

Naïve-Bayes Classifiers Business Intelligence for Managers.

Bayesian Learning Provides practical learning algorithms

Data Mining Classification: Alternative Techniques

Probability: Review The state of the world is described using random variables Probabilities are defined over events –Sets of world states characterized.

What is Statistical Modeling

Ai in game programming it university of copenhagen Statistical Learning Methods Marco Loog.

Data Mining Classification: Naïve Bayes Classifier

Overview Full Bayesian Learning MAP learning

Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 13: Text Classification & Naive.

1 Chapter 12 Probabilistic Reasoning and Bayesian Belief Networks.

Lecture 13-1: Text Classification & Naive Bayes

TEXT CLASSIFICATION CC437 (Includes some original material by Chris Manning)

Introduction to Bayesian Learning Bob Durrant School of Computer Science University of Birmingham (Slides: Dr Ata Kabán)

1er. Escuela Red ProTIC - Tandil, de Abril, Bayesian Learning 5.1 Introduction –Bayesian learning algorithms calculate explicit probabilities.

Introduction to Bayesian Learning Ata Kaban School of Computer Science University of Birmingham.

Naïve Bayes Classification Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata August 14, 2014.

Text Categorization Moshe Koppel Lecture 2: Naïve Bayes Slides based on Manning, Raghavan and Schutze.

Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)

CS Bayesian Learning1 Bayesian Learning. CS Bayesian Learning2 States, causes, hypotheses. Observations, effect, data. We need to reconcile.

Crash Course on Machine Learning

Naïve Bayes for Text Classification: Spam Detection

DATA MINING : CLASSIFICATION. Classification : Definition  Classification is a supervised learning.  Uses training sets which has correct answers (class.

Information Retrieval and Web Search Introduction to Text Classification (Note: slides in this set have been adapted from the course taught by Chris Manning.

Bayesian Networks. Male brain wiring Female brain wiring.

Midterm Review Rao Vemuri 16 Oct Posing a Machine Learning Problem Experience Table – Each row is an instance – Each column is an attribute/feature.

Text Classification, Active/Interactive learning.

How to classify reading passages into predefined categories ASH.

1 Data Mining Lecture 5: KNN and Bayes Classifiers.

Naive Bayes Classifier

Kansas State University Department of Computing and Information Sciences CIS 730: Introduction to Artificial Intelligence Lecture 25 Wednesday, 20 October.

Mining di dati web Classificazioni di Documenti Web Fondamenti di Classificazione e Naive Bayes A.A 2006/2007.

1 CS 391L: Machine Learning: Bayesian Learning: Naïve Bayes Raymond J. Mooney University of Texas at Austin.

Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.6: Linear Models Rodney Nielsen Many of.

CS464 Introduction to Machine Learning1 Bayesian Learning Features of Bayesian learning methods: Each observed training example can incrementally decrease.

WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 15.

Computing & Information Sciences Kansas State University Wednesday, 22 Oct 2008CIS 530 / 730: Artificial Intelligence Lecture 22 of 42 Wednesday, 22 October.

Kansas State University Department of Computing and Information Sciences CIS 730: Introduction to Artificial Intelligence Lecture 25 of 41 Monday, 25 October.

Bayesian Classification. Bayesian Classification: Why? A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.

Classification Techniques: Bayesian Classification

Mehdi Ghayoumi MSB rm 132 Ofc hr: Thur, a Machine Learning.

1 Chapter 12 Probabilistic Reasoning and Bayesian Belief Networks.

Chapter 6 Bayesian Learning

CHAPTER 6 Naive Bayes Models for Classification. QUESTION????

Information Retrieval Lecture 4 Introduction to Information Retrieval (Manning et al. 2007) Chapter 13 For the MSc Computer Science Programme Dell Zhang.

Bayesian Learning Provides practical learning algorithms

Data Mining and Decision Support

Classification Today: Basic Problem Decision Trees.

1 Machine Learning: Lecture 6 Bayesian Learning (Based on Chapter 6 of Mitchell T.., Machine Learning, 1997)

Bayesian Learning Bayes Theorem MAP, ML hypotheses MAP learners

Chapter 6. Classification and Prediction Classification by decision tree induction Bayesian classification Rule-based classification Classification by.

BAYESIAN LEARNING. 2 Bayesian Classifiers Bayesian classifiers are statistical classifiers, and are based on Bayes theorem They can calculate the probability.

CS Ensembles and Bayes1 Ensembles, Model Combination and Bayesian Combination.

Bayesian Learning Evgueni Smirnov Overview Bayesian Theorem Maximum A Posteriori Hypothesis Naïve Bayes Classifier Learning Text Classifiers.

Introduction to Information Retrieval Introduction to Information Retrieval Lecture 15: Text Classification & Naive Bayes 1.

Naive Bayes Classifier. REVIEW: Bayesian Methods Our focus this lecture: – Learning and classification methods based on probability theory. Bayes theorem.

Bayesian Classification 1. 2 Bayesian Classification: Why? A statistical classifier: performs probabilistic prediction, i.e., predicts class membership.

Lecture 1.31 Criteria for optimal reception of radio signals.

Naive Bayes Classifier

Lecture 15: Text Classification & Naive Bayes

Data Mining Lecture 11.

Classification Techniques: Bayesian Classification

Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.

Data Mining: Concepts and Techniques (3rd ed.) — Chapter 8 —

Information Retrieval

Machine Learning: Lecture 6

Machine Learning: UNIT-3 CHAPTER-1

Naive Bayes Classifier

NAÏVE BAYES CLASSIFICATION

Naïve Bayes Classifier

Presentation transcript:

Naïve Bayes Classifier April 25 th, 2006

Classification Methods (1) Manual classification Used by Yahoo!, Looksmart, about.com, ODP Very accurate when job is done by experts Consistent when the problem size and team is small Difficult and expensive to scale

Classification Methods (2) Automatic classification Hand-coded rule-based systems One technique used by CS dept’s spam filter, Reuters, Snort IDS … E.g., assign category if the instance matches the rules Accuracy is often very high if a rule has been carefully refined over time by a subject expert Building and maintaining these rules is expensive

Classification Methods (3) Supervised learning of a document-label assignment function Many systems partly rely on machine learning (Google, MSN, Yahoo!, …) Naive Bayes (simple, common method) k-Nearest Neighbors (simple, powerful) Support-vector machines (new, more powerful) … plus many other methods No free lunch: requires hand-classified training data But data can be built up (and refined) by amateurs Note that many commercial systems use a mixture of methods

Decision Tree Strength Decision trees are able to generate understandable rules. Decision trees perform classification without requiring much computation. Decision trees are able to handle both continuous and categorical variables. Decision trees provide a clear indication of which fields are most important for prediction or classification Weakness Error-prone with many classes Computationally expensive to train, hard to update Simple true/false decision, nothing in between

Does patient have cancer or not? A patient takes a lab test and the result comes back positive. It is known that the test returns a correct positive result in only 99% of the cases and a correct negative result in only 95% of the cases. Furthermore, only 0.03 of the entire population has this disease. How likely that this patient has cancer?

Bayesian Methods Our focus this lecture Learning and classification methods based on probability theory. Bayes theorem plays a critical role in probabilistic learning and classification. Uses prior probability of each category given no information about an item. Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Basic Probability Formulas Product rule Sum rule Bayes theorem Theorem of total probability, if event Ai is mutually exclusive and probability sum to 1

Bayes Theorem Given a hypothesis h and data D which bears on the hypothesis: P(h): independent probability of h: prior probability P(D): independent probability of D P(D|h): conditional probability of D given h: likelihood P(h|D): conditional probability of h given D: posterior probability

Does patient have cancer or not? A patient takes a lab test and the result comes back positive. It is known that the test returns a correct positive result in only 99% of the cases and a correct negative result in only 95% of the cases. Furthermore, only 0.03 of the entire population has this disease. 1. What is the probability that this patient has cancer? 2. What is the probability that he does not have cancer? 3. What is the diagnosis?

Maximum A Posterior Based on Bayes Theorem, we can compute the Maximum A Posterior (MAP) hypothesis for the data We are interested in the best hypothesis for some space H given observed training data D. H: set of all hypothesis. Note that we can drop P(D) as the probability of the data is constant (and independent of the hypothesis).

Maximum Likelihood Now assume that all hypotheses are equally probable a priori, i.e., P(hi ) = P(hj ) for all hi, hj belong to H. This is called assuming a uniform prior. It simplifies computing the posterior: This hypothesis is called the maximum likelihood hypothesis.

Desirable Properties of Bayes Classifier Incrementality: with each training example, the prior and the likelihood can be updated dynamically: flexible and robust to errors. Combines prior knowledge and observed data: prior probability of a hypothesis multiplied with probability of the hypothesis given the training data Probabilistic hypothesis: outputs not only a classification, but a probability distribution over all classes

Bayes Classifiers Assumption: training set consists of instances of different classes described cj as conjunctions of attributes values Task: Classify a new instance d based on a tuple of attribute values into one of the classes cj  C Key idea: assign the most probable class using Bayes Theorem.

Parameters estimation P(c j ) Can be estimated from the frequency of classes in the training examples. P(x 1,x 2,…,x n |c j ) O(|X| n|C|) parameters Could only be estimated if a very, very large number of training examples was available. Independence Assumption : attribute values are conditionally independent given the target value: naïve Bayes.

Properties Estimating instead of greatly reduces the number of parameters (and the data sparseness). The learning step in Naïve Bayes consists of estimating and based on the frequencies in the training data An unseen instance is classified by computing the class that maximizes the posterior When conditioned independence is satisfied, Naïve Bayes corresponds to MAP classification.

Question: For the day, what’s the play prediction?

Underflow Prevention Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. Class with highest final un-normalized log probability score is still the most probable.

Smoothing to Avoid Overfitting Somewhat more subtle version # of values of X i overall fraction in data where X i =x i,k extent of “smoothing”