Naïve Bayes

Bayesian Reasoning Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of interest are governed by probability distributions and that optimal decisions can be made by reasoning about these probabilities together with observed data.

Probabilistic Learning In ML, we are often interested in determining the best hypothesis from some space H, given the observed training data D. One way to specify what is meant by the best hypothesis is to say that we demand the most probable hypothesis, given the data D together with any initial knowledge about the prior probabilities of the various hypotheses in H.

Bayes Theorem Bayes theorem is the cornerstone of Bayesian learning methods. It provides a way of calculating the posterior probability P(h | D) from the prior probability P(h), the likelihood P(D | h), and the evidence P(D), as follows:
P(h | D) = P(D | h) P(h) / P(D)

Using Bayes Theorem (I) Suppose I wish to know whether someone is telling the truth or lying about some issue X
o The available data is from a lie detector with two possible outcomes: truthful and liar
o I also have prior knowledge that, over the entire population, 21% lie about X
o Finally, I know the lie detector is imperfect: it returns truthful in only 85% of the cases where people actually told the truth, and liar in only 93% of the cases where people were actually lying

Using Bayes Theorem (II)
o P(lies about X) = 0.21, P(tells the truth about X) = 0.79
o P(liar | lies about X) = 0.93, P(truthful | lies about X) = 0.07
o P(liar | tells the truth about X) = 0.15, P(truthful | tells the truth about X) = 0.85

Using Bayes Theorem (III) Suppose a new person is asked about X and the lie detector returns liar. Should we conclude that the person is indeed lying about X or not? What we need is to compare:
o P(lies about X | liar)
o P(tells the truth about X | liar)

Using Bayes Theorem (IV) By Bayes theorem:
o P(lies about X | liar) = P(liar | lies about X) × P(lies about X) / P(liar)
o P(tells the truth about X | liar) = P(liar | tells the truth about X) × P(tells the truth about X) / P(liar)
All probabilities are given explicitly, except for P(liar), which is easily computed by the theorem of total probability:
o P(liar) = P(liar | lies about X) × P(lies about X) + P(liar | tells the truth about X) × P(tells the truth about X)

Using Bayes Theorem (V) Computing, we get:
o P(liar) = 0.93 × 0.21 + 0.15 × 0.79 ≈ 0.314
o P(lies about X | liar) = (0.93 × 0.21) / 0.314 ≈ 0.62
o P(tells the truth about X | liar) = (0.15 × 0.79) / 0.314 ≈ 0.38
And we would conclude that the person was indeed lying about X.
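The arithmetic above can be double-checked with a few lines of Python (not part of the original slides; the variable names are mine):

```python
# Lie-detector example: plug the given probabilities into Bayes theorem.
p_lies = 0.21                       # P(lies about X)
p_truth = 1 - p_lies                # P(tells the truth about X) = 0.79
p_liar_given_lies = 0.93            # P(liar | lies about X)
p_liar_given_truth = 0.15           # P(liar | tells the truth about X)

# Theorem of total probability
p_liar = p_liar_given_lies * p_lies + p_liar_given_truth * p_truth

# Bayes theorem
p_lies_given_liar = p_liar_given_lies * p_lies / p_liar
p_truth_given_liar = p_liar_given_truth * p_truth / p_liar

print(f"P(liar) = {p_liar:.3f}")                                        # ~0.314
print(f"P(lies about X | liar) = {p_lies_given_liar:.3f}")              # ~0.622
print(f"P(tells the truth about X | liar) = {p_truth_given_liar:.3f}")  # ~0.378
```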

Intuition How did we make our decision?
o We chose the maximally probable, or maximum a posteriori (MAP), hypothesis, namely:
h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h) / P(D) = argmax_{h ∈ H} P(D | h) P(h)

Brute-force MAP Learning
o For each hypothesis h ∈ H, calculate P(h | D) using Bayes theorem
o Return h_MAP = argmax_{h ∈ H} P(h | D)
Guaranteed "best", but often impractical for large hypothesis spaces: mainly used as a standard to gauge the performance of other learners.
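As a sketch of what the brute-force learner looks like in code, assuming the caller supplies the hypothesis space, a prior P(h), and a likelihood P(D | h) as callables (these names are hypothetical):

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the MAP hypothesis h_MAP = argmax_h P(D | h) P(h).

    prior(h) gives P(h); likelihood(data, h) gives P(D | h).
    P(D) is the same for every h, so it can be dropped from the argmax.
    """
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))
```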

Remarks The brute-force MAP learning algorithm answers the question: "Which is the most probable hypothesis given the training data?" Often, it is the related question, "Which is the most probable classification of the new query instance given the training data?", that is most significant. In general, the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.

Bayes Optimal Classification (I) If the possible classification of the new instance can take on any value v_j from some set V, then the probability P(v_j | D) that the correct classification for the new instance is v_j is just:
P(v_j | D) = Σ_{h_i ∈ H} P(v_j | h_i) P(h_i | D)
Clearly, the optimal classification of the new instance is the value v_j for which P(v_j | D) is maximum, which gives rise to the following algorithm to classify query instances.

Bayes Optimal Classification (II) Return
argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j | h_i) P(h_i | D)
No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average, since it maximizes the probability that the new instance is classified correctly, given the available data, hypothesis space and prior probabilities over the hypotheses. The algorithm, however, is impractical for large hypothesis spaces.
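A minimal sketch of the Bayes optimal classifier, assuming the caller can supply the posterior P(h | D) and the per-hypothesis prediction P(v | h) as callables (hypothetical names):

```python
def bayes_optimal_classify(values, hypotheses, posterior, p_value_given_h):
    """Return argmax_{v in V} sum_h P(v | h) P(h | D)."""
    def score(v):
        return sum(p_value_given_h(v, h) * posterior(h) for h in hypotheses)
    return max(values, key=score)
```

The sum over all of H is exactly what makes this impractical for large hypothesis spaces.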

Naïve Bayes Learning (I) The naive Bayes learner is a practical Bayesian learning method. It applies to learning tasks where instances are conjunctions of attribute values and the target function takes its values from some finite set V. The Bayesian approach consists in assigning to a new query instance the most probable target value, v_MAP, given the attribute values a_1, …, a_n that describe the instance, i.e.,
v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, …, a_n)

Naïve Bayes Learning (II) Using Bayes theorem, this can be reformulated as:
v_MAP = argmax_{v_j ∈ V} P(a_1, …, a_n | v_j) P(v_j) / P(a_1, …, a_n) = argmax_{v_j ∈ V} P(a_1, …, a_n | v_j) P(v_j)
Finally, we make the further simplifying assumption that the attribute values are conditionally independent given the target value. Hence, one can write the conjunctive conditional probability as a product of simple conditional probabilities.

Naïve Bayes Learning (III) Return
v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)
The naive Bayes learning method involves a learning step in which the various P(v_j) and P(a_i | v_j) terms are estimated, based on their frequencies over the training data. These estimates are then used in the above formula to classify each new query instance. Whenever the assumption of conditional independence is satisfied, the naive Bayes classification is identical to the MAP classification.
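A minimal sketch of the naive Bayes learner for discrete attributes, using the raw frequency estimates described above (no smoothing; the m-estimate slide later addresses that). The function and variable names are my own:

```python
from collections import Counter

def train_nb(instances, labels):
    """Estimate P(v_j) and P(a_i | v_j) from frequencies over the training data."""
    class_counts = Counter(labels)
    attr_counts = Counter()   # attr_counts[(i, a, v)] = # class-v instances with attribute i equal to a
    for x, v in zip(instances, labels):
        for i, a in enumerate(x):
            attr_counts[(i, a, v)] += 1
    n = len(labels)
    priors = {v: c / n for v, c in class_counts.items()}
    def cond(i, a, v):        # P(a_i = a | v)
        return attr_counts[(i, a, v)] / class_counts[v]
    return priors, cond

def classify_nb(x, priors, cond):
    """Return v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    def score(v):
        p = priors[v]
        for i, a in enumerate(x):
            p *= cond(i, a, v)
        return p
    return max(priors, key=score)
```

Usage would be along the lines of `priors, cond = train_nb(X, y)` followed by `classify_nb(x_new, priors, cond)`.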

Illustration (I) and (II): worked example applying the naive Bayes classifier (figures not reproduced in this transcript).

How is NB Incremental?
o No training instances are stored
o The model consists of summary statistics that are sufficient to compute predictions
o Adding a new training instance only affects these summary statistics, which may be updated incrementally
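A sketch of the incremental update, reusing the hypothetical count structures from the earlier training sketch: when a new instance (x, v) arrives, only the counts change.

```python
def update_counts(class_counts, attr_counts, x, v):
    """Fold one new training instance into the sufficient statistics."""
    class_counts[v] = class_counts.get(v, 0) + 1
    for i, a in enumerate(x):
        attr_counts[(i, a, v)] = attr_counts.get((i, a, v), 0) + 1
```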

Estimating Probabilities We have so far estimated P(X=x | Y=y) by the fraction n_{x|y} / n_y, where n_y is the number of instances for which Y=y and n_{x|y} is the number of these for which X=x. This is a problem when n_{x|y} is small:
o E.g., assume P(X=x | Y=y) = 0.05 and the training set is such that n_y = 5. Then it is highly probable that n_{x|y} = 0
o The fraction is thus an underestimate of the actual probability
o It will dominate the Bayes classifier for all new queries with X=x, since a zero factor zeroes out the entire product

m-estimate Replace n_{x|y} / n_y by:
(n_{x|y} + m·p) / (n_y + m)
where p is our prior estimate of the probability we wish to determine and m is a constant. Typically, p = 1/k (where k is the number of possible values of X). m acts as a weight (similar to adding m virtual instances distributed according to p).
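A one-function sketch of the m-estimate (my own naming), with the uniform prior p = 1/k as the default choice:

```python
def m_estimate(n_xy, n_y, k, m=1.0):
    """m-estimate of P(X=x | Y=y): (n_{x|y} + m*p) / (n_y + m), with p = 1/k."""
    p = 1.0 / k                      # uniform prior over the k possible values of X
    return (n_xy + m * p) / (n_y + m)
```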

Revisiting Conditional Independence Definition: X is conditionally independent of Y given Z iff P(X | Y, Z) = P(X | Z). NB assumes that all attributes are conditionally independent, given the class. Hence,
P(a_1, …, a_n | v_j) = P(a_1 | v_j) × P(a_2 | v_j) × … × P(a_n | v_j) = ∏_i P(a_i | v_j)

What if the Assumption Does Not Hold? In many cases, the NB assumption is overly restrictive. What we need is a way of handling independence or dependence over subsets of attributes:
o Joint probability distribution: defined over Y_1 × Y_2 × … × Y_n; specifies the probability of each variable binding

Bayesian Belief Network Directed acyclic graph:
o Nodes represent variables in the joint space
o Arcs represent the assertion that a variable is conditionally independent of its non-descendants in the network, given its immediate predecessors in the network
o A conditional probability table is also given for each variable: P(V | immediate predecessors)
Refer to section
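As an illustration only (not from the slides), a tiny network A → C ← B can be written down directly as one table per node, P(node | immediate predecessors); every name and number below is made up:

```python
# Priors for the two root nodes (no predecessors)
p_a = {True: 0.3, False: 0.7}
p_b = {True: 0.6, False: 0.4}
# Conditional probability table for C given its parents A and B: P(C=True | A, B)
p_c_given_ab = {
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.1,
}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) P(b) P(c | a, b): the joint factorises over the network."""
    p_c = p_c_given_ab[(a, b)] if c else 1 - p_c_given_ab[(a, b)]
    return p_a[a] * p_b[b] * p_c
```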