1 Machine Learning: Lecture 6 Bayesian Learning (Based on Chapter 6 of Mitchell T., Machine Learning, 1997)

2 An Introduction §Bayesian Decision Theory came long before Version Spaces, Decision Tree Learning and Neural Networks. It was studied in the field of Statistical Theory and, more specifically, in the field of Pattern Recognition. §Bayesian Decision Theory underlies important learning schemes such as the Naïve Bayes Classifier, learning Bayesian Belief Networks, and the EM Algorithm. §Bayesian Decision Theory is also useful because it provides a framework within which many non-Bayesian classifiers can be studied (see [Mitchell, Sections 6.3-6.6]).

3 Bayes Theorem §Goal: To determine the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H. §Prior probability of h, P(h): it reflects any background knowledge we have about the chance that h is a correct hypothesis (before having observed the data). §Prior probability of D, P(D): it reflects the probability that training data D will be observed given no knowledge about which hypothesis h holds. §Conditional Probability of observation D, P(D|h): it denotes the probability of observing data D given some world in which hypothesis h holds.

4 Bayes Theorem (Cont’d) §Posterior probability of h, P(h|D): it represents the probability that h holds given the observed training data D. It reflects our confidence that h holds after we have seen the training data D and it is the quantity that Machine Learning researchers are interested in. §Bayes Theorem allows us to compute P(h|D): P(h|D)=P(D|h)P(h)/P(D)

5 Maximum A Posteriori (MAP) Hypothesis and Maximum Likelihood §Goal: To find the most probable hypothesis h from a set of candidate hypotheses H given the observed data D. §MAP Hypothesis, h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h)P(h)/P(D) = argmax_{h ∈ H} P(D|h)P(h) §If every hypothesis in H is equally probable a priori, we only need to consider the likelihood of the data D given h, P(D|h). Then, h_MAP becomes the Maximum Likelihood hypothesis, h_ML = argmax_{h ∈ H} P(D|h)
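A minimal sketch (not from the lecture) of how these definitions translate into code for a small, discrete hypothesis space; the hypothesis names, priors, and likelihood values below are illustrative only:

```python
# Hedged sketch: MAP vs. ML over a small discrete hypothesis space.
# Hypothesis names, priors, and likelihoods are illustrative placeholders.
priors = {"h1": 0.80, "h2": 0.15, "h3": 0.05}        # P(h)
likelihoods = {"h1": 0.05, "h2": 0.30, "h3": 0.60}   # P(D|h) for the observed D

# h_MAP maximizes P(D|h) * P(h); the normalizer P(D) is constant over h.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# h_ML maximizes P(D|h) alone (equivalent to MAP under a uniform prior).
h_ml = max(priors, key=lambda h: likelihoods[h])

print(h_map, h_ml)  # -> h2 h3 with these illustrative numbers
```

With these numbers the two criteria disagree: the strong prior on h1 and h2 pulls the MAP choice away from the hypothesis that merely fits the data best.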

6 An Example §Two alternative hypotheses l The patient has a particular form of cancer l The patient does not §Observed data l A positive or negative result of a particular lab test §Prior knowledge l Only a small fraction of the population has this disease
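This example can be completed numerically; the figures below (a prior of .008, test sensitivity .98, false-positive rate .03) are the ones used in Mitchell's presentation of this example, quoted from memory and best treated as illustrative:

```python
# Bayes-theorem computation for the cancer/lab-test example.
# Numbers follow Mitchell's version of the example (quoted from memory).
p_cancer = 0.008                 # prior P(cancer)
p_pos_given_cancer = 0.98        # P(+ | cancer)
p_pos_given_not = 0.03           # P(+ | not cancer)

# Unnormalized posteriors after observing a positive test result.
score_cancer = p_pos_given_cancer * p_cancer        # ~0.0078
score_not = p_pos_given_not * (1 - p_cancer)        # ~0.0298

# Normalizing gives P(cancer | +) ~ 0.21: even after a positive test,
# the MAP hypothesis is still "no cancer" because the prior is so small.
p_cancer_given_pos = score_cancer / (score_cancer + score_not)
print(round(p_cancer_given_pos, 3))
```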

7 Bayes Theorem and Concept Learning §Assumptions l The training data D is noise-free. l The target concept c is contained in H. l There is no a priori reason to believe that any hypothesis is more probable than any other.

8 Bayes Theorem and Concept Learning See Fig. 6.1 in text If P(h)=1/|H| and if P(D|h)=1 if D is consistent with h, and 0 otherwise, then every hypothesis in the version space resulting from D is a MAP hypothesis.
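Filling in the step the slide leaves implicit: for any hypothesis h that is consistent with D, Bayes theorem gives P(h|D) = P(D|h)P(h)/P(D) = (1 · 1/|H|) / (|VS_H,D|/|H|) = 1/|VS_H,D|, where VS_H,D is the version space of H with respect to D and P(D) = Σ_h P(D|h)P(h) = |VS_H,D|/|H|. Any hypothesis inconsistent with D has P(h|D) = 0, so all consistent hypotheses share the same (maximal) posterior, and each one is a MAP hypothesis.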

9 MAP Hypotheses and Consistent Learners §Every hypothesis consistent with D is a MAP hypothesis. §Consistent Learner l It outputs a hypothesis that commits zero errors over the training examples. l Every consistent learner outputs a MAP hypothesis if we assume a uniform prior probability distribution over H and deterministic, noise-free data. l Example: the Find-S algorithm outputs a MAP hypothesis, namely a maximally specific member of the version space. l Are there other probability distributions for P(h) and P(D|h) under which Find-S outputs MAP hypotheses? Yes.

10 MAP Hypotheses and Consistent Learners §The Bayesian framework provides one way to characterize the behavior of learning algorithms (e.g., Find-S), even when the learning algorithm does not explicitly manipulate probabilities.

11 ML and Least-Squared Error Hypotheses §Under certain assumptions regarding noise in the data, minimizing the mean squared error (what common neural nets do) corresponds to computing the maximum likelihood hypothesis.

12 ML and Least-Squared Error Hypotheses

13 ML and Least-Squared Error Hypotheses §Assumptions l MAP reduces to ML when all hypotheses are equally probable a priori. l The error (noise) is Normally distributed. l Noise affects only the target values; the attribute values are noise-free (e.g., target value: Weight; attributes: Age, Height).
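The derivation behind slides 11-13, reproduced here in outline from the standard argument (Mitchell, Section 6.4): assume each training value is d_i = f(x_i) + e_i, where the noise e_i is drawn from a zero-mean Normal distribution with known variance σ². Then
h_ML = argmax_{h ∈ H} Π_i p(d_i|h)
     = argmax_{h ∈ H} Π_i (1/√(2πσ²)) exp(−(d_i − h(x_i))²/(2σ²))
     = argmax_{h ∈ H} Σ_i −(d_i − h(x_i))²   (taking logarithms and dropping terms independent of h)
     = argmin_{h ∈ H} Σ_i (d_i − h(x_i))²,
so the maximum likelihood hypothesis is exactly the one that minimizes the sum of squared errors.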

14 ML Hypotheses for Predicting Probability §Setting: a neural network binary classifier with a single output node l Training data with Boolean (0/1) target values l Goal: learn to output the probability that the target value is 1, rather than the 0/1 value itself

15 ML Hypotheses for Predicting Probability

16 ML Hypotheses for Predicting Probability §Gradient search in a single-layer NN §Summary: l Minimizing the sum of squared errors seeks the ML hypothesis under the assumption that the training data can be modeled as Normally distributed noise added to the target function value. l The rule that minimizes cross entropy seeks the ML hypothesis under the assumption that the observed Boolean value is a probabilistic function of the input instance.
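For concreteness, the cross-entropy criterion referred to in the summary, in the form used in the standard treatment (Mitchell, Section 6.6), is
h_ML = argmax_{h ∈ H} Σ_i [ d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)) ],
and gradient ascent on this criterion for a single sigmoid output unit gives the weight update Δw_j = η Σ_i (d_i − h(x_i)) x_{ij}, where η is the learning rate.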

17

18 Bayes Optimal Classifier §One great advantage of Bayesian Decision Theory is that it gives us a lower bound on the classification error that can be obtained for a given problem. §Bayes Optimal Classification: The most probable classification of a new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities: argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j|h_i)P(h_i|D), where V is the set of all the values a classification can take and v_j is one possible such classification. §Unfortunately, the Bayes Optimal Classifier is usually too costly to apply! ==> Naïve Bayes Classifier
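A small sketch of this weighted vote; the three hypotheses, their posteriors, and their 0/1 predictions are illustrative numbers, chosen so that the individually most probable hypothesis and the Bayes optimal classification disagree (the same kind of contrast Mitchell uses):

```python
# Bayes optimal classification: weight each hypothesis's prediction by its
# posterior. Hypotheses, posteriors, and predictions are illustrative.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(h_i | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}     # deterministic h_i(x)

def bayes_optimal(values=("+", "-")):
    # For each candidate value v_j, sum P(v_j | h_i) * P(h_i | D);
    # with deterministic hypotheses, P(v_j | h_i) is either 1 or 0.
    score = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
             for v in values}
    return max(score, key=score.get)

print(bayes_optimal())  # -> "-": the MAP hypothesis h1 alone would have said "+"
```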

19

20 Gibbs Algorithm §Algorithm l 1. Choose a hypothesis h from H at random, according to the posterior probability distribution over H l 2. Use h to predict the classification of the next instance x §Validity of the Gibbs algorithm l Haussler et al., 1994 l Under certain conditions, the expected error satisfies E[Error(Gibbs)] ≤ 2 × E[Error(Bayes optimal classifier)]
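A minimal sketch of the two steps; hypothesis names and all numbers are illustrative, not taken from the lecture:

```python
import random

# Gibbs algorithm sketch: sample one hypothesis according to the posterior,
# then classify with it. Hypotheses and probabilities are illustrative.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}  # h(x) for the instance x

def gibbs_classify():
    hs = list(posteriors)
    h = random.choices(hs, weights=[posteriors[h] for h in hs], k=1)[0]  # step 1
    return predictions[h]                                                # step 2

print(gibbs_classify())  # "+" with probability 0.4, "-" with probability 0.6
```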

21 Naïve Bayes Classifier §Let each instance x of a training set D be described by a conjunction of n attribute values and let f(x), the target function, be such that f(x) ∈ V, a finite set. §Bayesian Approach: v_MAP = argmax_{v_j ∈ V} P(v_j|a_1,a_2,..,a_n) = argmax_{v_j ∈ V} [P(a_1,a_2,..,a_n|v_j) P(v_j)/P(a_1,a_2,..,a_n)] = argmax_{v_j ∈ V} [P(a_1,a_2,..,a_n|v_j) P(v_j)] §Naïve Bayesian Approach: We assume that the attribute values are conditionally independent given the target value, so that P(a_1,a_2,..,a_n|v_j) = ∏_i P(a_i|v_j) [so only the individual terms P(a_i|v_j) need to be estimated, which does not require an overly large data set]. §Naïve Bayes Classifier: v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i|v_j)

22 An Illustrative Example §New instance: (Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong) §P(Wind=strong|PlayTennis=yes)=3/9=.33 §P(Wind=strong|PlayTennis=no)=3/5=.60 §P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes)=.0053 §P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no)=.0206 §v_NB = no
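A sketch that reproduces this calculation; the conditional probabilities below are the counts usually quoted for the standard 14-example PlayTennis table (Mitchell, Table 3.2), so treat them as assumed inputs rather than part of the slide:

```python
# Naive Bayes decision for the PlayTennis instance, using probabilities
# estimated from the standard 14-example table (assumed counts).
cond = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}
prior = {"yes": 9/14, "no": 5/14}
instance = ["sunny", "cool", "high", "strong"]

def naive_bayes(instance):
    # v_NB = argmax_v P(v) * prod_i P(a_i | v)
    score = {}
    for v in prior:
        p = prior[v]
        for a in instance:
            p *= cond[v][a]
        score[v] = p
    return max(score, key=score.get), score

label, score = naive_bayes(instance)
print(label)                                           # -> "no"
print(round(score["yes"], 4), round(score["no"], 4))   # ~0.0053 and ~0.0206
```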

23 Naïve Bayes Classifier: Estimating Probabilities P(a_i|v_j) m-estimate of probability = (n_c + m·p) / (n + m) –m : equivalent sample size, p : prior estimate of the probability –n_c : number of training examples with target value v_j for which attribute value a_i is observed –n : number of training examples with target value v_j Example: Table 3.2
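A one-function sketch of the m-estimate; the example call redoes the P(Wind=strong|PlayTennis=yes) estimate from the previous slide with an assumed prior p = 0.5 and equivalent sample size m = 1 (illustrative choices, not values from the lecture):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m).

    n_c: count of examples with the attribute value among those with target v_j
    n:   count of examples with target value v_j
    p:   prior estimate of the probability (e.g., 1/k for k attribute values)
    m:   equivalent sample size controlling how strongly p is weighted
    """
    return (n_c + m * p) / (n + m)

# P(Wind=strong | PlayTennis=yes): raw estimate 3/9 = .33; with m=1, p=0.5
# the estimate is smoothed slightly toward the prior.
print(round(m_estimate(3, 9, 0.5, 1), 3))  # -> 0.35
```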

24 An Example: Learning to Classify Text §Independence assumption l The conditional independence assumption is clearly violated by natural language, but there is no practical alternative l Example task: classifying text documents (e.g., whether a document discusses machine learning) l Without further assumptions: 2 target values × 111 word positions × 50,000 words gives over 10 million probability terms to estimate, which is impractical l Position independence assumption: P(a_i = w_k|v_j) is the same for every position i, leaving 2 target values × 50,000 words ≈ 100,000 terms §Learn_Naive_Bayes_Text / Classify_Naive_Bayes_Text: see the algorithm on page 183 of Mitchell
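A compact sketch in the spirit of that algorithm (not a transcription of Mitchell's pseudocode): estimate P(v_j) and position-independent P(w_k|v_j) with add-one smoothing over the vocabulary, then classify by summing log probabilities. The documents, labels, and class names below are placeholders:

```python
import math
from collections import Counter, defaultdict

# Sketch of a position-independent naive Bayes text classifier.
# Training documents and labels are illustrative placeholders.
train = [("fun playoff game tonight", "sports"),
         ("new model beats benchmark", "tech"),
         ("team wins the final game", "sports")]

vocab = {w for doc, _ in train for w in doc.split()}
class_words = defaultdict(list)
for doc, label in train:
    class_words[label].extend(doc.split())

priors = {c: sum(1 for _, l in train if l == c) / len(train) for c in class_words}
word_counts = {c: Counter(words) for c, words in class_words.items()}

def p_word_given_class(w, c):
    # Add-one smoothing over the vocabulary: an m-estimate with
    # p = 1/|vocab| and m = |vocab|.
    return (word_counts[c][w] + 1) / (len(class_words[c]) + len(vocab))

def classify(doc):
    words = [w for w in doc.split() if w in vocab]
    score = {c: math.log(priors[c]) +
                sum(math.log(p_word_given_class(w, c)) for w in words)
             for c in priors}
    return max(score, key=score.get)

print(classify("game tonight"))  # -> "sports" with these toy documents
```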

25 Bayesian Belief Networks §The Bayes Optimal Classifier is often too costly to apply. §The Naïve Bayes Classifier uses the conditional independence assumption to defray these costs. However, in many cases, such an assumption is overly restrictive. §Bayesian belief networks provide an intermediate approach which allows stating conditional independence assumptions that apply to subsets of the variables.

26 Conditional Independence §We say that X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z. §i.e., (∀ x_i, y_j, z_k) P(X=x_i|Y=y_j, Z=z_k) = P(X=x_i|Z=z_k) §or, more compactly, P(X|Y,Z) = P(X|Z) §This definition can be extended to sets of variables as well: we say that the set of variables X_1…X_l is conditionally independent of the set of variables Y_1…Y_m given the set of variables Z_1…Z_n if P(X_1…X_l|Y_1…Y_m, Z_1…Z_n) = P(X_1…X_l|Z_1…Z_n)

27 Representation in Bayesian Belief Networks [Figure: an example belief network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire] §Each node is asserted to be conditionally independent of its non-descendants, given its immediate parents §Associated with each node is a conditional probability table, which specifies the conditional distribution for the variable given its immediate parents in the graph
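This representation implies that the joint probability factors as P(y_1,…,y_n) = ∏_i P(y_i | Parents(Y_i)). A minimal sketch for the Campfire fragment of the example network (Campfire with parents Storm and BusTourGroup); all numeric probability values are made up for illustration:

```python
# Sketch: a conditional probability table for Campfire given its two parents
# (Storm, BusTourGroup), and the joint factorization it participates in.
# All numeric values are illustrative, not taken from the lecture.
p_storm = 0.1
p_bus_tour_group = 0.4
p_campfire_given = {            # P(Campfire=True | Storm, BusTourGroup)
    (True, True): 0.4,
    (True, False): 0.1,
    (False, True): 0.8,
    (False, False): 0.2,
}

def joint(storm: bool, bus: bool, campfire: bool) -> float:
    """P(Storm, BusTourGroup, Campfire) = P(S) * P(B) * P(C | S, B),
    using the network assumption that Storm and BusTourGroup have no parents."""
    p = (p_storm if storm else 1 - p_storm)
    p *= (p_bus_tour_group if bus else 1 - p_bus_tour_group)
    p_c = p_campfire_given[(storm, bus)]
    return p * (p_c if campfire else 1 - p_c)

print(joint(False, True, True))  # 0.9 * 0.4 * 0.8 = 0.288
```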

28 Inference in Bayesian Belief Networks §A Bayesian Network can be used to compute the probability distribution for any subset of network variables given the values or distributions for any subset of the remaining variables. §Unfortunately, exact inference of probabilities for an arbitrary Bayesian Network is known to be NP-hard. §In theory, approximate techniques (such as Monte Carlo methods) can also be NP-hard, though in practice many such methods have been shown to be useful.

29 Learning Bayesian Belief Networks 3 Cases: 1. The network structure is given in advance and all the variables are fully observable in the training examples. ==> Trivial Case: just estimate the conditional probabilities. 2. The network structure is given in advance but only some of the variables are observable in the training data. ==> Similar to learning the weights for the hidden units of a Neural Net: Gradient Ascent Procedure 3. The network structure is not known in advance. ==> Use a heuristic search or constraint-based technique to search through potential structures.

30 The EM Algorithm: Learning with Unobservable Relevant Variables §Example: Assume that data points have been generated uniformly at random from k distinct Gaussians with the same known variance. The problem is to output a hypothesis h = ⟨μ_1,…,μ_k⟩ that describes the mean of each of the k distributions. In particular, we are looking for a maximum likelihood hypothesis for these means. §We extend the problem description as follows: for each point x_i, there are k hidden variables z_i1,..,z_ik such that z_il = 1 if x_i was generated by normal distribution l, and z_iq = 0 for all q ≠ l.

31 The EM Algorithm (Cont'd) §An arbitrary initial hypothesis h = ⟨μ_1,…,μ_k⟩ is chosen. §The EM Algorithm iterates over two steps: l Step 1 (Estimation, E): Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming that the current hypothesis h = ⟨μ_1,…,μ_k⟩ holds. l Step 2 (Maximization, M): Calculate a new maximum likelihood hypothesis h′ = ⟨μ′_1,…,μ′_k⟩, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated in Step 1. Then replace the hypothesis h = ⟨μ_1,…,μ_k⟩ by the new hypothesis h′ = ⟨μ′_1,…,μ′_k⟩ and iterate. §The EM Algorithm can be applied to more general problems.
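A minimal sketch of these two steps for the k-Gaussians case (equal, known variance, equal mixing weights), using the standard E- and M-step formulas for this model; the data points, initial means, and variance are placeholders:

```python
import math

# EM sketch for a mixture of k one-dimensional Gaussians with equal, known
# variance and equal mixing weights. Data and initial means are placeholders.
data = [1.2, 0.8, 1.1, 5.9, 6.2, 6.0]
means = [0.0, 10.0]          # initial hypothesis h = <mu_1, mu_2>
sigma2 = 1.0                 # known variance

for _ in range(20):
    # E-step: expected value of each hidden indicator z_ij under the
    # current means (posterior responsibility of Gaussian j for point i).
    resp = []
    for x in data:
        w = [math.exp(-(x - m) ** 2 / (2 * sigma2)) for m in means]
        total = sum(w)
        resp.append([wi / total for wi in w])
    # M-step: new ML estimate of each mean, weighting points by E[z_ij].
    means = [sum(resp[i][j] * data[i] for i in range(len(data))) /
             sum(resp[i][j] for i in range(len(data)))
             for j in range(len(means))]

print([round(m, 2) for m in means])  # converges to roughly [1.03, 6.03]
```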