Bayesian Param. Learning, Bayesian Structure Learning. Graphical Models – 10708, Carlos Guestrin, Carnegie Mellon University, October 6th, 2008. Readings: K&F: 16.3, 16.4

Decomposable score
Log data likelihood. Decomposable score:
- Decomposes over families in the BN (a node and its parents)
- Will lead to significant computational efficiency!
- Score(G : D) = Σ_i FamScore(X_i | Pa_X_i : D)
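A minimal sketch of how a decomposable (maximum-likelihood) score could be computed from counts; the data layout, helper names (fam_score, structure_score), and the toy dataset are illustrative assumptions, not the course code.

```python
import numpy as np
from collections import Counter

def fam_score(data, child, parents):
    """ML family score: sum over samples of log P_hat(x_child | x_parents)."""
    joint, parent = Counter(), Counter()
    for row in data:
        u = tuple(row[p] for p in parents)
        joint[(u, row[child])] += 1
        parent[u] += 1
    return sum(c * np.log(c / parent[u]) for (u, _), c in joint.items())

def structure_score(data, parent_sets):
    """Score(G : D) = sum_i FamScore(X_i | Pa_X_i : D) -- decomposes over families."""
    return sum(fam_score(data, i, pa) for i, pa in parent_sets.items())

# Toy usage: 3 binary variables; G has the single edge X0 -> X1.
data = np.array([[0, 0, 1], [1, 1, 0], [1, 1, 1], [0, 0, 0]])
print(structure_score(data, {0: (), 1: (0,), 2: ()}))
```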

Chow-Liu tree learning algorithm 1
For each pair of variables X_i, X_j:
- Compute the empirical distribution: P̂(x_i, x_j) = Count(x_i, x_j) / m
- Compute the mutual information: Î(X_i, X_j) = Σ_{x_i, x_j} P̂(x_i, x_j) log [ P̂(x_i, x_j) / (P̂(x_i) P̂(x_j)) ]
Define a graph:
- Nodes X_1, …, X_n
- Edge (i, j) gets weight Î(X_i, X_j)

Chow-Liu tree learning algorithm 2
Optimal tree BN:
- Compute the maximum-weight spanning tree
- Directions in the BN: pick any node as root; breadth-first search defines the edge directions
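A minimal sketch of the two steps above under simple assumptions: discrete data in a NumPy array (rows are samples, columns are variables), my own helper names, and Prim's algorithm for the maximum-weight spanning tree, with edges directed away from an arbitrary root as they are added.

```python
import numpy as np

def empirical_mi(data, i, j):
    """Empirical mutual information I_hat(X_i; X_j) between two discrete columns."""
    mi = 0.0
    for xi in np.unique(data[:, i]):
        for xj in np.unique(data[:, j]):
            p_ij = np.mean((data[:, i] == xi) & (data[:, j] == xj))
            p_i = np.mean(data[:, i] == xi)
            p_j = np.mean(data[:, j] == xj)
            if p_ij > 0:
                mi += p_ij * np.log(p_ij / (p_i * p_j))
    return mi

def chow_liu_edges(data):
    """Directed edges of the Chow-Liu tree (root = variable 0)."""
    n = data.shape[1]
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            w[i, j] = w[j, i] = empirical_mi(data, i, j)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # Add the maximum-weight edge crossing the cut (Prim's algorithm).
        i, j = max(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: w[e])
        edges.append((i, j))  # directed away from the root
        in_tree.add(j)
    return edges
```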

Can we extend Chow-Liu? 1
Tree-augmented naïve Bayes (TAN) [Friedman et al. ’97]:
- The naïve Bayes model overcounts, because correlation between features is not considered
- Same as Chow-Liu, but score edges with the conditional mutual information given the class: Î(X_i, X_j | C)
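A hedged sketch of that edge weight, reusing empirical_mi from the sketch above; the class-column index c and the function name are my own choices.

```python
import numpy as np

def conditional_mi(data, i, j, c):
    """I_hat(X_i; X_j | C) = sum over classes of P_hat(C=c) * I_hat(X_i; X_j | C=c)."""
    cmi = 0.0
    for cls in np.unique(data[:, c]):
        mask = data[:, c] == cls
        cmi += mask.mean() * empirical_mi(data[mask], i, j)
    return cmi
```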

Can we extend Chow-Liu? 2
(Approximately) learning models with tree-width up to k [Chechetka & Guestrin ’07]
- But: O(n^(2k+6))

What you need to know about learning BN structures so far
- Decomposable scores: maximum likelihood; information-theoretic interpretation
- Best tree (Chow-Liu)
- Best TAN
- Nearly best k-treewidth (in O(n^(2k+6)))

Maximum likelihood score overfits!
Information never hurts: adding a parent always increases the score!

Bayesian score
Prior distributions:
- Over structures: P(G)
- Over parameters of a structure: P(θ_G | G)
Posterior over structures given data: P(G | D) ∝ P(G) P(D | G), where P(D | G) = ∫ P(D | θ_G, G) P(θ_G | G) dθ_G is the marginal likelihood

Can we really trust MLE?
What is better?
- 3 heads, 2 tails
- 30 heads, 20 tails
- 3×10^23 heads, 2×10^23 tails
Many possible answers; we need distributions over possible parameters

Bayesian Learning
Use Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D)
Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)

Bayesian Learning for Thumbtack
Likelihood function is simply binomial: P(D | θ) = θ^(m_H) (1 − θ)^(m_T)
What about the prior?
- Represents expert knowledge
- Simple posterior form
Conjugate priors:
- Closed-form representation of the posterior (more details soon)
- For the binomial, the conjugate prior is the Beta distribution

Beta prior distribution – P(θ)
Prior: P(θ) = Beta(α_H, α_T) ∝ θ^(α_H − 1) (1 − θ)^(α_T − 1)
Likelihood function: P(D | θ) = θ^(m_H) (1 − θ)^(m_T)
Posterior: P(θ | D) ∝ θ^(m_H + α_H − 1) (1 − θ)^(m_T + α_T − 1)

Posterior distribution
Prior: Beta(α_H, α_T)
Data: m_H heads and m_T tails
Posterior distribution: P(θ | D) = Beta(α_H + m_H, α_T + m_T)
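A minimal sketch of this conjugate update; the class name, its fields, and the Beta(1, 1) example below are my own illustrative choices.

```python
from dataclasses import dataclass

@dataclass
class BetaPrior:
    alpha_h: float  # prior "heads" pseudo-counts
    alpha_t: float  # prior "tails" pseudo-counts

    def posterior(self, m_h, m_t):
        """Observing m_H heads and m_T tails gives Beta(alpha_H + m_H, alpha_T + m_T)."""
        return BetaPrior(self.alpha_h + m_h, self.alpha_t + m_t)

    def mean(self):
        """Mean of the Beta distribution (estimate of theta, probability of heads)."""
        return self.alpha_h / (self.alpha_h + self.alpha_t)

# Usage: Beta(1, 1) prior, observe 3 heads and 2 tails.
print(BetaPrior(1, 1).posterior(3, 2).mean())  # (3 + 1) / (5 + 2) ≈ 0.571
```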

Conjugate prior
Given a likelihood function P(D | θ), a (parametric) prior of the form P(θ | α) is conjugate to the likelihood function if the posterior is of the same parametric family, and can be written as P(θ | α′), for some new set of hyperparameters α′.
Prior: Beta(α_H, α_T)
Data: m_H heads and m_T tails (binomial likelihood)
Posterior distribution: Beta(α_H + m_H, α_T + m_T)

Using the Bayesian posterior
Posterior distribution: P(θ | D)
Bayesian inference:
- No longer a single parameter: predictions average over θ, e.g. E[f(θ) | D] = ∫ f(θ) P(θ | D) dθ
- The integral is often hard to compute

Bayesian prediction of a new coin flip
Prior: Beta(α_H, α_T)
Observed m_H heads and m_T tails; what is the probability that flip m+1 is heads?
P(x_(m+1) = H | D) = ∫ θ P(θ | D) dθ = (m_H + α_H) / (m + α_H + α_T), where m = m_H + m_T
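A hedged one-liner for this posterior-predictive rule (the function name and the default Beta(1, 1) prior are mine), applied to the small and huge samples from the MLE slide above:

```python
def predict_heads(m_h, m_t, alpha_h=1.0, alpha_t=1.0):
    """P(next flip = heads | D) = (m_H + a_H) / (m_H + m_T + a_H + a_T)."""
    return (m_h + alpha_h) / (m_h + m_t + alpha_h + alpha_t)

print(predict_heads(3, 2))                    # ≈ 0.571 -- the prior still matters
print(predict_heads(3 * 10**23, 2 * 10**23))  # ≈ 0.6   -- essentially the MLE
```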

Asymptotic behavior and equivalent sample size
Beta prior is equivalent to extra thumbtack flips:
- As m → ∞, the prior is “forgotten”
- But for small sample sizes, the prior is important!
Equivalent sample size:
- Prior parameterized by α_H, α_T, or by an equivalent sample size m′ and a prior mean θ_0 (so that α_H = m′·θ_0, α_T = m′·(1 − θ_0))
- Fix m′, change θ_0
- Fix θ_0, change m′
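A hedged illustration of the equivalent-sample-size parameterization, reusing BetaPrior from the earlier sketch; m_prime and theta0 are my names for the two quantities above.

```python
def beta_from_ess(m_prime, theta0):
    """Beta prior with m_prime pseudo-flips and prior mean theta0 (my notation)."""
    return BetaPrior(m_prime * theta0, m_prime * (1 - theta0))

# Fixing theta0 = 0.5 and growing m_prime makes the prior harder to "forget"
# on the same small sample (3 heads, 2 tails):
for m_prime in (1, 10, 100):
    print(m_prime, beta_from_ess(m_prime, 0.5).posterior(3, 2).mean())
```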

Bayesian learning corresponds to smoothing
m = 0 ⇒ prior parameter; m → ∞ ⇒ MLE

Bayesian learning for multinomials
What if you have a k-sided coin?
- Likelihood function is multinomial: P(D | θ) = Π_i θ_i^(m_i)
- Conjugate prior for the multinomial is Dirichlet: P(θ) = Dirichlet(α_1, …, α_k) ∝ Π_i θ_i^(α_i − 1)
- Observe m data points, m_i from assignment i; posterior: Dirichlet(α_1 + m_1, …, α_k + m_k)
- Prediction: P(x_(m+1) = i | D) = (m_i + α_i) / (m + Σ_j α_j)
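A short, array-based sketch of this update and prediction (the helper names are mine):

```python
import numpy as np

def dirichlet_posterior(alpha, counts):
    """Dirichlet(alpha) prior + multinomial counts -> Dirichlet(alpha + counts)."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

def predict_next(alpha, counts):
    """P(next outcome = i | D) = (m_i + alpha_i) / (m + sum_j alpha_j)."""
    post = dirichlet_posterior(alpha, counts)
    return post / post.sum()

print(predict_next([1, 1, 1], [5, 0, 2]))  # 3-sided "coin" example
```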

Bayesian learning for a two-node BN
Parameters: θ_X, θ_(Y|X)
Priors: P(θ_X), P(θ_(Y|X))

Very important assumption on the prior: Global parameter independence
Global parameter independence:
- The prior over parameters is a product of priors over the CPTs: P(θ_G) = Π_i P(θ_(X_i | Pa_X_i))

Global parameter independence, d-separation and local prediction
(Example BN: Flu, Allergy, Sinus, Headache, Nose.) Independencies in the meta-BN:
Proposition: For fully observable data D, if the prior satisfies global parameter independence, then so does the posterior: the parameters of different CPTs remain independent given D

Within a CPT
Meta-BN including the CPT parameters:
- Are θ_(Y|X=t) and θ_(Y|X=f) d-separated given D?
- Are θ_(Y|X=t) and θ_(Y|X=f) independent given D? Context-specific independence!
- The posterior decomposes: P(θ_(Y|X) | D) = P(θ_(Y|X=t) | D) · P(θ_(Y|X=f) | D)

Priors for BN CPTs (more when we talk about structure learning)
Consider each CPT: P(X | U = u)
Conjugate prior: Dirichlet(α_(X=1|U=u), …, α_(X=k|U=u))
More intuitive:
- “prior data set” D′ with equivalent sample size m′
- “prior counts”: the α’s play the role of counts of (x, u) in D′
- prediction: P(X = x | U = u, D) = (Count_D(x, u) + α_(x|u)) / (Count_D(u) + Σ_x′ α_(x′|u))
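A hedged sketch of that smoothed prediction rule for a single CPT, with a uniform prior count per cell; the data layout and names follow the earlier sketches and are my assumptions.

```python
from collections import Counter

def cpt_with_prior(data, child, parents, prior_count=1.0, child_values=(0, 1)):
    """P(X=x | U=u, D) = (count(x, u) + prior_count) / (count(u) + k * prior_count)."""
    joint, parent = Counter(), Counter()
    for row in data:
        u = tuple(row[p] for p in parents)
        joint[(u, row[child])] += 1
        parent[u] += 1
    k = len(child_values)
    return {(u, x): (joint[(u, x)] + prior_count) / (parent[u] + k * prior_count)
            for u in parent for x in child_values}
```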

An example

What you need to know about parameter learning
Bayesian parameter learning:
- motivation for the Bayesian approach
- Bayesian prediction
- conjugate priors, equivalent sample size
- Bayesian learning ⇒ smoothing
Bayesian learning for BN parameters:
- Global parameter independence
- Decomposition of prediction according to CPTs
- Decomposition within a CPT

Announcements
The project description is out on the class website:
- Individual or groups of two only
- Suggested projects are on the class website, or do something related to your research (preferable). Must be something you started this semester. The semester goes really quickly, so be realistic (and ambitious)
- Must be related to graphical models!
Project deliverables:
- one-page proposal due Wednesday (10/8)
- 5-page milestone report due Nov 3rd in class
- poster presentation on Dec 1st, 3–6pm in the NSH Atrium
- write-up, 8 pages, due Dec 3rd by 3pm by email to the instructors (no late days)
- all write-ups in NIPS format (see class website); page limits are strict
Objective:
- Explore and apply concepts in probabilistic graphical models
- Do a fun project!

Bayesian score and model complexity
True model: X → Y, with P(Y=t | X=t) and P(Y=t | X=f) differing by an amount controlled by a parameter α
Structure 1: X and Y independent
- Score doesn't depend on α
Structure 2: X → Y
- Data points are split between estimating P(Y=t | X=t) and P(Y=t | X=f)
- For fixed M, only worth it for large α, because the posterior over the parameters will be more diffuse with less data

Bayesian score: a decomposable score
As with last lecture, assume:
- Local and global parameter independence
Also, the prior satisfies parameter modularity:
- If X_i has the same parents in G and G′, then its parameters have the same prior
Finally, the structure prior P(G) satisfies structure modularity:
- Product of terms over families
- E.g., P(G) ∝ c^|G|
Then the Bayesian score decomposes along families!

BIC approximation of the Bayesian score
The Bayesian score has difficult integrals
For a Dirichlet prior, we can use the simple Bayesian information criterion (BIC) approximation:
- In the limit, we can forget the prior!
- Theorem: for a Dirichlet prior, and a BN with Dim(G) independent parameters, as m → ∞:
  log P(D | G) = ℓ(θ̂_G : D) − (log m / 2)·Dim(G) + O(1)

BIC approximation, a decomposable score
BIC: Score_BIC(G : D) = ℓ(θ̂_G : D) − (log m / 2)·Dim(G)
Using the information-theoretic formulation:
Score_BIC(G : D) = m Σ_i Î(X_i; Pa_X_i) − m Σ_i Ĥ(X_i) − (log m / 2) Σ_i Dim(θ_(X_i|Pa_X_i))
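A hedged sketch of computing this score, reusing fam_score from the earlier decomposable-score sketch; the dictionary-based interface and cardinality bookkeeping are my own choices.

```python
import numpy as np

def bic_score(data, parent_sets, cardinalities):
    """BIC(G : D) = ML log-likelihood - (log m / 2) * number of free parameters."""
    m = len(data)
    loglik = sum(fam_score(data, i, pa) for i, pa in parent_sets.items())
    dim = sum((cardinalities[i] - 1) * int(np.prod([cardinalities[p] for p in pa]))
              for i, pa in parent_sets.items())
    return loglik - 0.5 * np.log(m) * dim

# Usage with the toy data from the decomposable-score sketch:
# bic_score(data, {0: (), 1: (0,), 2: ()}, {0: 2, 1: 2, 2: 2})
```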

Consistency of BIC and Bayesian scores
A scoring function is consistent if, for the true model G*, as m → ∞, with probability 1:
- G* maximizes the score
- All structures not I-equivalent to G* have a strictly lower score
Theorem: the BIC score is consistent
Corollary: the Bayesian score is consistent
What about the maximum likelihood score?
Consistency is a limiting behavior; it says nothing about finite sample sizes!

Priors for general graphs
For finite datasets, the prior is important!
Prior over structures satisfying prior modularity
What about the prior over parameters; how do we represent it?
- K2 prior: fix an α, P(θ_(X_i|Pa_X_i)) = Dirichlet(α, …, α)
- K2 is “inconsistent”

BDe prior
Remember that Dirichlet parameters are analogous to “fictitious samples”
- Pick a fictitious sample size m′
- For each possible family, define a prior distribution P(X_i, Pa_X_i)
  - Represent it with a BN
  - Usually independent (product of marginals)
- BDe prior: α_(x_i | pa_X_i) = m′ · P(x_i, pa_X_i)
- Has the “consistency property”: I-equivalent structures receive the same score
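A hedged sketch of BDe-style prior counts (notation mine): hyperparameters are a fictitious sample size times a prior joint distribution over the family, here taken as uniform, which is a common special case (often called BDeu).

```python
from itertools import product

def bde_prior_counts(m_prime, child_card, parent_cards):
    """alpha(x, u) = m_prime * P'(x, u), with P' uniform over joint assignments."""
    n_joint = child_card
    for c in parent_cards:
        n_joint *= c
    alpha = m_prime / n_joint
    return {
        (x, u): alpha
        for x in range(child_card)
        for u in product(*[range(c) for c in parent_cards])
    }

# E.g. a binary child with two binary parents and m' = 4 pseudo-samples:
# every (x, u) cell gets alpha = 4 / 8 = 0.5.
print(bde_prior_counts(4, 2, [2, 2]))
```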