Bayesian Learning (Chapter 20.1-20.4). Some material adapted from lecture notes by Lise Getoor and Ron Parr.


Naïve Bayes

Use Bayesian modeling. Make the simplest possible independence assumption:
– Each attribute is independent of the values of the other attributes, given the class variable
– In our restaurant domain: Cuisine is independent of Patrons, given a decision to stay (or not)

Bayesian Formulation
p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn) = α p(C) p(F1, ..., Fn | C)
Assume each feature Fi is conditionally independent of the others given the class C. Then:
p(C | F1, ..., Fn) = α p(C) Πi p(Fi | C)
Estimate each of these conditional probabilities from the observed counts in the training data:
p(Fi | C) = N(Fi ∧ C) / N(C)
– One subtlety of using the algorithm in practice: when an estimated probability is zero, ugly things happen (a single zero factor wipes out the whole product)
– The fix: add one to every count, aka "Laplacian smoothing" (they have a different name for everything!)
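A minimal sketch of this counting scheme in Python, with add-one (Laplacian) smoothing. The function names, the dictionary-based feature representation, and the default smoothing value are all illustrative choices, not part of the chapter:

```python
import math
from collections import defaultdict

def train_naive_bayes(examples, smoothing=1.0):
    """Estimate p(C) and p(Fi | C) by counting, with add-one (Laplacian) smoothing.

    examples: list of (feature_dict, class_label) pairs.
    """
    class_counts = defaultdict(float)     # N(C)
    feature_counts = defaultdict(float)   # N(Fi = v ∧ C), keyed by (feature, value, class)
    feature_values = defaultdict(set)     # observed values of each feature

    for features, label in examples:
        class_counts[label] += 1
        for f, v in features.items():
            feature_counts[(f, v, label)] += 1
            feature_values[f].add(v)

    n = len(examples)
    priors = {c: cnt / n for c, cnt in class_counts.items()}

    def likelihood(f, v, c):
        # p(Fi = v | C = c) = (N(Fi = v ∧ C = c) + 1) / (N(C = c) + |values of Fi|)
        num = feature_counts[(f, v, c)] + smoothing
        den = class_counts[c] + smoothing * len(feature_values[f])
        return num / den

    return priors, likelihood

def classify(features, priors, likelihood):
    """Return the class maximizing p(C) * prod_i p(Fi | C); the alpha term cancels."""
    def log_score(c):
        return math.log(priors[c]) + sum(math.log(likelihood(f, v, c))
                                         for f, v in features.items())
    return max(priors, key=log_score)
```

Working in log space avoids underflow when there are many features; the normalizing constant α is never needed because it is the same for every class.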

Naive Bayes: Example
p(Wait | Cuisine, Patrons, Rainy?)
= α p(Wait) p(Cuisine | Wait) p(Patrons | Wait) p(Rainy? | Wait)
= p(Wait) p(Cuisine | Wait) p(Patrons | Wait) p(Rainy? | Wait) / [p(Cuisine) p(Patrons) p(Rainy?)]
We can estimate all of the parameters (the p(Fi | C) and p(C) terms) just by counting from the training examples
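Continuing the sketch above, a toy restaurant-domain run; the training tuples here are invented for illustration, not taken from the chapter's data set:

```python
# Hypothetical training tuples for the restaurant domain
train = [
    ({"Cuisine": "Thai",   "Patrons": "Full", "Rainy?": True},  "Wait"),
    ({"Cuisine": "Burger", "Patrons": "Some", "Rainy?": False}, "Wait"),
    ({"Cuisine": "Thai",   "Patrons": "None", "Rainy?": False}, "Leave"),
    ({"Cuisine": "French", "Patrons": "Full", "Rainy?": True},  "Leave"),
]
priors, likelihood = train_naive_bayes(train)
print(classify({"Cuisine": "Thai", "Patrons": "Full", "Rainy?": False},
               priors, likelihood))   # prints the class with the higher smoothed score
```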

Naive Bayes: Analysis
– Naive Bayes is amazingly easy to implement (once you understand the bit of math behind it)
– Remarkably, naive Bayes can outperform many much more complex algorithms; it's a baseline that should pretty much always be used for comparison
– Naive Bayes can't capture interdependencies between variables (obviously); for that, we need Bayes nets!

Learning Bayesian Networks

Learning Bayesian networks
Given a training set D, find the network B that best matches D:
– model selection
– parameter estimation
[Figure: an Inducer takes the data D and produces a network over the variables A, B, C, E]

Learning Bayesian Networks
Describe a BN by specifying its (1) structure and (2) conditional probability tables (CPTs). Both can be learned from data, but
– learning structure is much harder than learning parameters
– learning when some nodes are hidden, or with missing data, is harder still
We have four cases:

Structure | Observability | Method
Known     | Full          | Maximum likelihood estimation
Known     | Partial       | EM (or gradient ascent)
Unknown   | Full          | Search through model space
Unknown   | Partial       | EM + search through model space

Parameter estimation
Assume known structure. Goal: estimate the BN parameters θ
– the entries in the local probability models, P(X | Parents(X))
A parameterization θ is good if it is likely to generate the observed data:
L(θ : D) = P(D | θ) = Πm P(x[m] | θ)   (i.i.d. samples)
Maximum Likelihood Estimation (MLE) principle: choose θ* so as to maximize L(θ : D)

Parameter estimation II
The likelihood decomposes according to the structure of the network → we get a separate estimation task for each CPT entry
The MLE (maximum likelihood estimate) solution:
– for each value x of a node X and each instantiation u of Parents(X):
  θ*x|u = N(x, u) / N(u)
  where the counts N(x, u) and N(u) are the sufficient statistics
– just need to collect the counts for every combination of parents and children observed in the data
– MLE is equivalent to an assumption of a uniform prior over parameter values
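A small sketch of that counting step in Python. The function name, the dict-per-row data format, and the tiny E/B/A example network are assumptions made for illustration; the data assignments are invented:

```python
from collections import defaultdict

def mle_cpt(data, child, parents):
    """MLE for P(child | parents): theta*_{x|u} = N(x, u) / N(u).

    data: list of dicts mapping variable name -> observed value (fully observed).
    """
    joint = defaultdict(float)    # N(x, u)
    parent = defaultdict(float)   # N(u)
    for row in data:
        u = tuple(row[p] for p in parents)
        joint[(row[child], u)] += 1
        parent[u] += 1
    return {(x, u): n / parent[u] for (x, u), n in joint.items()}

# Hypothetical fully observed data for a network with E, B -> A
data = [
    {"E": 0, "B": 1, "A": 1},
    {"E": 0, "B": 0, "A": 0},
    {"E": 1, "B": 0, "A": 1},
    {"E": 0, "B": 1, "A": 1},
]
print(mle_cpt(data, child="A", parents=["E", "B"]))
```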

Model selection
Goal: select the best network structure, given the data
Input:
– training data
– a scoring function
Output:
– a network that maximizes the score

Structure selection: Scoring
Bayesian: place a prior over parameters and structure
– a balance between model complexity and fit to the data falls out as a byproduct
Score(G : D) = log P(G | D) ∝ log [P(D | G) P(G)]
where P(D | G) is the marginal likelihood and P(G) is the prior on structure
The marginal likelihood just comes from our parameter estimates
The prior on structure can be any measure we want; typically a function of the network complexity
Same key property: decomposability
Score(structure) = Σi Score(family of Xi)
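To make decomposability concrete, here is a hedged sketch of a family-wise score built on the mle_cpt helper above. A real scoring function would use the Bayesian marginal likelihood; the BIC-style log-likelihood-minus-penalty used here is just one common stand-in, and the `values` argument (each variable's possible values) is an assumption of this sketch:

```python
import math

def family_score(data, child, parents, values):
    """Decomposable score for one family: log-likelihood minus a BIC-style penalty.

    values: dict mapping each variable to the list of its possible values.
    """
    cpt = mle_cpt(data, child, parents)
    loglik = sum(math.log(cpt[(row[child], tuple(row[p] for p in parents))])
                 for row in data)
    # number of free parameters in this family's CPT
    n_params = (len(values[child]) - 1) * math.prod(len(values[p]) for p in parents)
    return loglik - 0.5 * n_params * math.log(len(data))

def score(data, structure, values):
    """Score(structure) = sum over families; structure maps child -> list of parents."""
    return sum(family_score(data, x, pa, values) for x, pa in structure.items())
```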

Heuristic search
[Figure: local search over network structures on the variables B, E, A, C; each candidate move changes one edge and is evaluated by the score change over the affected families]
– Add E → C: Δscore(C)
– Delete E → A: Δscore(A)
– Reverse E → A: Δscore(A, E)

Exploiting decomposability
[Figure: the same B, E, A, C search, annotated with the families each move touches]
– Add E → C: Δscore(C)
– Delete E → A: Δscore(A)
– Reverse E → A: Δscore(A)
To recompute scores, we only need to re-score the families that changed in the last move
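A hedged sketch of this greedy loop, reusing the hypothetical family_score helper above and caching per-family scores so that only the changed family is re-scored after each move. Edge reversal and the acyclicity check are omitted to keep the sketch short, so this is an illustration of the caching idea rather than a complete structure learner:

```python
def greedy_search(data, variables, values, max_iters=100):
    """Greedy hill-climbing over structures: try adding/deleting single edges."""
    structure = {x: [] for x in variables}                              # start empty
    fam = {x: family_score(data, x, [], values) for x in variables}     # cached family scores
    for _ in range(max_iters):
        best_delta, best_move = 0.0, None
        for child in variables:
            for parent in variables:
                if parent == child:
                    continue
                if parent in structure[child]:
                    new_parents = [p for p in structure[child] if p != parent]  # delete edge
                else:
                    new_parents = structure[child] + [parent]                   # add edge
                # decomposability: only the child's family score changes
                delta = family_score(data, child, new_parents, values) - fam[child]
                if delta > best_delta:
                    best_delta, best_move = delta, (child, new_parents)
        if best_move is None:
            break                                   # no improving move: local optimum
        child, new_parents = best_move
        structure[child] = new_parents
        fam[child] = family_score(data, child, new_parents, values)
    return structure
```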

Variations on a theme
– Known structure, fully observable: only need to do parameter estimation
– Unknown structure, fully observable: do heuristic search through structure space, then parameter estimation
– Known structure, missing values: use expectation maximization (EM) to estimate parameters
– Known structure, hidden variables: apply adaptive probabilistic network (APN) techniques
– Unknown structure, hidden variables: too hard to solve!
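For the "known structure, missing values" case, here is a rough, self-contained EM sketch for the naive Bayes network from earlier in the deck, treating missing class labels (None) as the unobserved values. The function name, the uniform initialization, and the requirement that feature_values lists every value appearing in the data are all choices of this sketch, not the chapter's algorithm:

```python
import math
from collections import defaultdict

def em_naive_bayes(examples, classes, feature_values, n_iters=20, smoothing=1.0):
    """EM for naive Bayes when some class labels are missing (label is None).

    examples: list of (feature_dict, label_or_None)
    feature_values: dict mapping each feature to the list of its possible values
    """
    # Initialize with uniform parameters (one of many possible starting points)
    priors = {c: 1.0 / len(classes) for c in classes}
    cond = {(f, v, c): 1.0 / len(vals)
            for f, vals in feature_values.items() for v in vals for c in classes}

    for _ in range(n_iters):
        exp_class = defaultdict(float)   # expected N(C)
        exp_feat = defaultdict(float)    # expected N(Fi = v ∧ C)

        # E-step: posterior over the class for each example (a point mass if observed)
        for features, label in examples:
            if label is not None:
                post = {c: 1.0 if c == label else 0.0 for c in classes}
            else:
                logp = {c: math.log(priors[c]) +
                           sum(math.log(cond[(f, v, c)]) for f, v in features.items())
                        for c in classes}
                m = max(logp.values())
                z = sum(math.exp(l - m) for l in logp.values())
                post = {c: math.exp(l - m) / z for c, l in logp.items()}
            for c, w in post.items():
                exp_class[c] += w
                for f, v in features.items():
                    exp_feat[(f, v, c)] += w

        # M-step: re-estimate the parameters from the expected (fractional) counts
        n = len(examples)
        priors = {c: (exp_class[c] + smoothing) / (n + smoothing * len(classes))
                  for c in classes}
        cond = {(f, v, c): (exp_feat[(f, v, c)] + smoothing) /
                           (exp_class[c] + smoothing * len(vals))
                for f, vals in feature_values.items() for v in vals for c in classes}
    return priors, cond
```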