Presentation transcript:

CS 478 - Bayesian Learning

States, causes, hypotheses. Observations, effect, data.
We need to reconcile several different notations that encode the same concepts:
– States: the thing in the world that dictates what happens
– Observations: the thing that we get to see
– Likelihoods:
  – States yield observations, p(o|s)
  – States are causes of effects, which we observe, p(e|c)
  – States are hypotheses or explanations of the data we observe, p(D|h)
– Naïve Bayes is an approach for inferring causes from data, assuming a particular structure in the data

Bayesian Learning
– P(h|D): posterior probability of h; this is what we usually want to know in machine learning
– P(h): prior probability of the hypothesis, independent of D. Do we usually know it?
  – Could assign equal probabilities
  – Could assign probability based on inductive bias (e.g. simpler hypotheses have higher probability)
– P(D): prior probability of the data
– P(D|h): probability ("likelihood") of the data given the hypothesis
– Bayes Rule: P(h|D) = P(D|h)P(h)/P(D)
– P(h|D) increases with P(D|h) and P(h). When learning the best h for a particular D, P(D) is the same in all cases and thus is not needed
– A good approach when P(D|h)P(h) is more reasonable to calculate than P(h|D)
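A minimal Python sketch of Bayes rule in action; the two hypotheses (spam/ham) and all of the numbers are made up for illustration, not taken from the slides:

```python
# Minimal sketch of Bayes rule with made-up numbers.
# Two hypotheses: "spam" and "ham"; D is some observed evidence about a message.
priors = {"spam": 0.3, "ham": 0.7}        # P(h), assumed
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(D|h), assumed

# P(D) = sum over h of P(D|h) P(h)
p_D = sum(likelihoods[h] * priors[h] for h in priors)

# Bayes rule: P(h|D) = P(D|h) P(h) / P(D)
posteriors = {h: likelihoods[h] * priors[h] / p_D for h in priors}
print(posteriors)  # {'spam': ~0.774, 'ham': ~0.226} -- spam wins despite its lower prior
```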

Bayesian Learning
– Maximum a posteriori (MAP) hypothesis: h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h)P(h)/P(D) = argmax_{h ∈ H} P(D|h)P(h)
– Maximum likelihood (ML) hypothesis: h_ML = argmax_{h ∈ H} P(D|h)
– MAP = ML if all priors P(h) are equally likely
– Note that the prior can act like an inductive bias (i.e. simpler hypotheses are more probable)
– Example (assume only 3 possible hypotheses)
– For a consistent learner (e.g. version space), all h which match D are MAP hypotheses if we assume P(h) = 1/|H|; P(h) can then be used to bias which one you really want
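A small sketch contrasting ML and MAP selection over three hypothetical hypotheses; the priors and likelihoods are invented, with the prior skewed toward h1 to mimic a simplicity bias:

```python
# Sketch: MAP vs. ML hypothesis selection over a tiny hypothesis space.
# The hypotheses, priors, and likelihoods below are all hypothetical.
hypotheses = ["h1", "h2", "h3"]
prior = {"h1": 0.70, "h2": 0.25, "h3": 0.05}        # P(h), e.g. a simplicity bias
likelihood = {"h1": 0.05, "h2": 0.30, "h3": 0.60}   # P(D|h)

h_ml = max(hypotheses, key=lambda h: likelihood[h])               # argmax P(D|h)
h_map = max(hypotheses, key=lambda h: likelihood[h] * prior[h])   # argmax P(D|h)P(h)
print(h_ml, h_map)  # ML picks h3; the prior pulls MAP over to h2
```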

Bayesian Learning (cont.)
– The brute force approach is to test each h ∈ H to see which maximizes P(h|D)
– Note that the argmax score is not the real probability, since P(D) is unknown
– Can still get the real probability (if desired) by normalization if there is a limited number of hypotheses
  – Assume only two possible hypotheses, h1 and h2
  – The true posterior probability of h1 would be P(h1|D) = P(D|h1)P(h1) / (P(D|h1)P(h1) + P(D|h2)P(h2))

Bayes Optimal Classifiers
– The better question is: what is the most probable classification for a given instance, rather than what is the most probable hypothesis for a data set
– Let all possible hypotheses vote for the instance in question, weighted by their posterior (an ensemble approach); usually better than the single best MAP hypothesis
– Bayes optimal classification: argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D)
– Example (see the sketch below)
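A sketch of the posterior-weighted vote, using the classic hypothetical 0.4/0.3/0.3 posteriors to show how the vote can overturn the single MAP hypothesis (the class labels and votes are invented):

```python
# Sketch: Bayes optimal classification as a posterior-weighted vote over hypotheses.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h|D), hypothetical
prediction = {"h1": "+", "h2": "-", "h3": "-"}    # each h's (deterministic) vote for instance x

votes = {}
for h, p_h in posterior.items():
    votes[prediction[h]] = votes.get(prediction[h], 0.0) + p_h

print(votes)                      # {'+': 0.4, '-': 0.6}
print(max(votes, key=votes.get))  # '-' : beats the single MAP hypothesis h1, which says '+'
```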

Bayes Optimal Classifiers (cont.)
– No other classification method using the same hypothesis space can outperform a Bayes optimal classifier on average, given the available data and the prior probabilities over the hypotheses
– Large or infinite hypothesis spaces make this impractical in general, but it is an important theoretical concept
– Also, this is only as accurate as our knowledge of the priors for the hypotheses, which we usually do not know. If our priors are bad, then Bayes optimal will not be optimal. For example, if we just assumed uniform priors, then the many low-posterior hypotheses could dominate the few high-posterior ones
– Note that the prior probabilities over a hypothesis space are an inductive bias (e.g. simplest is most probable, etc.)

Naïve Bayes Classifier
– Given a training set, P(v_j) is easy to calculate
– How about P(a_1,…,a_n|v_j)? Most counts would be either 0 or 1; it would require a huge training set to get reasonable values
– Key leap: assume conditional independence of the attributes, giving v_NB = argmax_{v_j ∈ V} P(v_j) Π_i P(a_i|v_j)
– While conditional independence is not typically a reasonable assumption…
  – Low-complexity, simple approach: need only store all P(v_j) and P(a_i|v_j) terms, which are easy to calculate, and with only |attributes| × |attribute values| × |classes| terms there is often enough data to make them accurate at a 1st-order level
  – Effective for many applications
– Example (see the sketch below)
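A minimal naive Bayes sketch on nominal features to make the counting concrete; the four-instance color/shape dataset is invented, and there is no smoothing yet (that comes two slides down):

```python
from collections import Counter, defaultdict

# Minimal naive Bayes on nominal features (no smoothing yet; see the Laplacian slide below).
# The four-instance color/shape dataset is invented.
data = [(["red", "round"], "A"), (["red", "square"], "A"),
        (["blue", "round"], "B"), (["blue", "square"], "B")]

class_counts = Counter(label for _, label in data)          # counts for P(v_j)
counts = defaultdict(lambda: defaultdict(Counter))          # counts[v][i][value] for P(a_i|v_j)
for features, label in data:
    for i, value in enumerate(features):
        counts[label][i][value] += 1

def classify(features):
    best, best_score = None, -1.0
    for v, n_v in class_counts.items():
        score = n_v / len(data)                             # P(v_j)
        for i, value in enumerate(features):
            score *= counts[v][i][value] / n_v              # P(a_i|v_j), 0 if unseen
        if score > best_score:
            best, best_score = v, score
    return best

print(classify(["red", "round"]))  # -> 'A'
```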

Naïve Bayes (cont.)
– Can normalize the scores to get the actual naïve Bayes probability
– Continuous data? Can discretize a continuous feature into bins, thus changing it into a nominal feature, and then gather statistics normally
  – How many bins? More bins is good, but you need sufficient data to make the bins statistically significant. Thus, base the number of bins on the data available
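A sketch of equal-width binning as one way to do this discretization (the helper name, bin count, and sample values are arbitrary choices):

```python
# Sketch: discretize a continuous feature into equal-width bins.
# The bin count is a design choice; base it on how much data you have.
def to_bins(values, n_bins=5):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0          # guard against an all-constant feature
    # map each value to a bin index 0..n_bins-1, clamping the maximum into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(to_bins([0.1, 0.4, 0.5, 0.9, 1.0], n_bins=3))  # [0, 1, 1, 2, 2]
```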

Infrequent Data Combinations
– What if there are 0 or very few cases of a particular a_i|v_j (n_c/n)? n_c is the number of instances with output v_j where a_i = attribute value c; n is the total number of instances with output v_j
– Should usually allow every case at least some finite probability, since it could occur in the test set; else the 0 terms will dominate the product (speech example)
– Replace n_c/n with the Laplacian: (n_c + 1)/(n + 1/p), where p is a prior probability of the attribute value, usually set to 1/(# of attribute values) for that attribute (thus 1/p is just the number of possible attribute values). Thus if n_c/n is 0/10 and the attribute has three possible values, the Laplacian would be 1/13
– Another approach, the m-estimate of probability: (n_c + mp)/(n + m). This acts as if we augmented the observed set with m "virtual" examples distributed according to p. If m is set to 1/p, it is the Laplacian; if m is 0, it defaults to n_c/n
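A small sketch of the m-estimate, with the Laplacian as the m = 1/p special case; the 0-out-of-10 counts reproduce the slide's 1/13 example:

```python
# Sketch: the m-estimate (n_c + m*p) / (n + m); the Laplacian is the special case m = 1/p.
def m_estimate(n_c, n, p, m):
    return (n_c + m * p) / (n + m)

n_c, n = 0, 10        # the slide's hypothetical case: value never seen with this class
p = 1.0 / 3           # uniform prior over 3 possible attribute values
print(n_c / n)                         # 0.0 -- would zero out the whole product
print(m_estimate(n_c, n, p, m=1 / p))  # Laplacian: 1/13 ~= 0.0769
print(m_estimate(n_c, n, p, m=0))      # m = 0 falls back to n_c/n = 0.0
```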

Naïve Bayes (cont.)
– No training per se: just gather the statistics from your data set and then apply the naïve Bayes classification equation to any new instance
– Easier to have many attributes, since we are not building a net, etc., and the amount of statistics gathered grows linearly with the number of attributes (# attributes × # attribute values × # classes). Thus it is natural for applications like text classification, which can easily be represented with huge numbers of input attributes
– Mitchell's text classification approach (sketched below)
  – Just calculate P(word|class) for every word/token in the language and each output class, based on the training data. Words that occur in testing but do not occur in the training data are ignored
  – Good empirical results. Can drop filler words (the, and, etc.) and words found fewer than z times in the training set
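A sketch of that bag-of-words scheme on an invented four-document corpus; log probabilities avoid underflow, and Laplace smoothing over the training vocabulary stands in for the m-estimate (both are assumptions, not prescribed by the slides):

```python
import math
from collections import Counter

# Sketch of the bag-of-words approach above on a tiny invented corpus.
docs = [("win money now", "spam"), ("meeting schedule today", "ham"),
        ("win a free prize", "spam"), ("project meeting notes", "ham")]

vocab = {w for text, _ in docs for w in text.split()}
doc_counts = Counter(label for _, label in docs)
word_counts = {c: Counter() for c in doc_counts}
for text, label in docs:
    word_counts[label].update(text.split())

def classify(text):
    best, best_lp = None, float("-inf")
    for c in doc_counts:
        lp = math.log(doc_counts[c] / len(docs))     # log P(class)
        total = sum(word_counts[c].values())
        for w in text.split():
            if w not in vocab:                       # words unseen in training are ignored
                continue
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(classify("free money prize"))      # -> 'spam'
print(classify("schedule the meeting"))  # 'the' is unseen and ignored -> 'ham'
```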

Less Naïve Bayes
– NB uses just 1st-order features and assumes conditional independence
  – Calculate statistics for all P(a_i|v_j)
  – |attributes| × |attribute values| × |output classes| terms
– nth order, P(a_1,…,a_n|v_j), assumes full conditional dependence
  – |attributes|^n × |attribute values| × |output classes| terms
  – Too computationally expensive (exponential)
  – Not enough data to get reasonable statistics; most cases occur 0 or 1 times
– 2nd order, P(a_i a_k|v_j), is a compromise: assume only low-order dependencies
  – |attributes|^2 × |attribute values| × |output classes| terms
  – Still may have cases where the number of a_i a_k|v_j occurrences is 0 or few; might be all right (just use the features which occur often in the data)
– How might you test whether a problem is conditionally independent?
  – Could compare with nth order, but that is difficult because of time complexity and insufficient data
  – Could just compare against 2nd order: how far off, on average, is our assumption P(a_i a_k|v_j) = P(a_i|v_j) P(a_k|v_j)? (see the sketch below)
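One way to make that check concrete is the average absolute gap between the empirical pairwise joint and the product of the marginals, per class; the function name and toy data below are illustrative assumptions:

```python
from itertools import combinations

# Sketch: average absolute gap between the empirical pairwise joint P(a_i, a_k | v)
# and the independence product P(a_i | v) * P(a_k | v). The toy data is invented.
data = [((1, 0, 1), "A"), ((1, 1, 1), "A"), ((0, 0, 0), "B"), ((0, 1, 0), "B")]

def independence_gap(data):
    gaps = []
    for v in {label for _, label in data}:
        rows = [x for x, label in data if label == v]
        n = len(rows)
        for i, k in combinations(range(len(rows[0])), 2):
            for ai in {r[i] for r in rows}:
                for ak in {r[k] for r in rows}:
                    p_joint = sum(r[i] == ai and r[k] == ak for r in rows) / n
                    p_i = sum(r[i] == ai for r in rows) / n
                    p_k = sum(r[k] == ak for r in rows) / n
                    gaps.append(abs(p_joint - p_i * p_k))
    return sum(gaps) / len(gaps)

print(independence_gap(data))  # 0.0 -- this particular toy data is conditionally independent
```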

Bayesian Belief Nets
– Can explicitly specify where there is significant conditional dependence: an intermediate ground (modeling all dependencies would be too complex, and not all are truly dependent). If you can get both the structure and the probabilities correct (or close), it can be a powerful representation; a growing research area
– Specify causality in a DAG and give conditional probabilities of each node given its immediate (causal) parents
– Belief networks represent the full joint probability function for a set of random variables in a compact space: a product of recursively derived conditional probabilities
– Given a subset of observable variables, you can infer probabilities on the unobserved variables; the general approach is NP-complete, so approximation methods are used
– Gradient descent learning approaches for the conditionals; greedy approaches to find the network structure
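A sketch of the "product of conditionals" idea on a tiny hypothetical chain-structured network, plus inference by brute-force enumeration; the variables, numbers, and structure are all invented:

```python
# Sketch: a tiny belief network's joint probability as a product of each node's
# conditional probability given its parents. Structure and numbers are hypothetical.
# Network: Cloudy -> Rain -> WetGrass (a chain, so the CPTs stay small).
p_cloudy = {True: 0.5, False: 0.5}                      # P(Cloudy)
p_rain = {True: {True: 0.8, False: 0.2},                # P(Rain | Cloudy)
          False: {True: 0.1, False: 0.9}}
p_wet = {True: {True: 0.9, False: 0.1},                 # P(WetGrass | Rain)
         False: {True: 0.2, False: 0.8}}

def joint(cloudy, rain, wet):
    # P(C, R, W) = P(C) * P(R | C) * P(W | R)
    return p_cloudy[cloudy] * p_rain[cloudy][rain] * p_wet[rain][wet]

# Inference by enumeration over the unobserved variable, e.g. P(Rain | WetGrass=True):
num = sum(joint(c, True, True) for c in (True, False))
den = sum(joint(c, r, True) for c in (True, False) for r in (True, False))
print(num / den)  # ~0.786
```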

Naïve Bayes Assignment: see the linked assignment page.