Important Distinctions in Learning BNs

Presentation transcript:

White Parts from: Technical overview for machine-learning researcher – slides from UAI 1999 tutorial

Important Distinctions in Learning BNs
- Complete data versus incomplete data
- Observed variables versus hidden variables
- Learning parameters versus learning structure
- Scoring methods versus conditional-independence-test methods
- Exact scores versus asymptotic scores
- Search strategies versus optimal learning of trees/polytrees/TANs

Overview of today's lecture
- Introduction to Bayesian statistics
- Learning parameters of Bayesian networks
Today's lecture assumes complete data, observable variables, and no hidden variables.

Maximum Likelihood approach: maximize the probability of the data with respect to the unknown parameter(s). Likelihood function: P(hhth…ttth | θ) = θ^#h (1-θ)^#t = θ^A (1-θ)^B, where A = #heads and B = #tails. An equivalent method is to maximize the log-likelihood function. The maximum for such sampled data is obtained at θ = A/(A + B).
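As a concrete illustration (my addition, not from the original slides), here is a minimal Python sketch of the maximum-likelihood estimate for the coin; the function names and the short flip sequence are invented for the example.

```python
import math

def log_likelihood(theta, A, B):
    """Log-likelihood of seeing A heads and B tails when the bias is theta."""
    return A * math.log(theta) + B * math.log(1.0 - theta)

def mle(flips):
    """Maximum-likelihood estimate theta = A / (A + B) from a list of 'h'/'t' outcomes."""
    A = flips.count("h")
    B = flips.count("t")
    return A / (A + B)

flips = list("hhthttth")                 # 4 heads, 4 tails
theta_hat = mle(flips)
print(theta_hat)                         # 0.5
print(log_likelihood(theta_hat, 4, 4))   # the log-likelihood is maximized at theta = A/(A+B)
```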

What if #heads is zero? The maximum for such data is obtained at θ = A/(A+B) = 0. A "regularized" maximum-likelihood solution is needed when the data set is small: θ = (A+ε)/(A+B+2ε), where ε is some small number. What is the "meaning" of ε, aside from being a needed correction?

What if we have some idea a priori? If our a priori idea is equivalent to seeing a heads and b tails, and we now see data with A heads and B tails, then what is a natural estimate for θ? If we use the estimate θ = a/(a+b) as a prior guess, then after seeing the data our natural estimate changes to θ = (A+a)/(N+N'), where N = a+b and N' = A+B. So what is a good choice for {a, b}? For a random coin, maybe (100,100)? For a random thumbtack, maybe (7,3)? Here a and b are imaginary counts, N = a+b is the equivalent sample size, while A, B are the data counts and A+B is the data size. What is the meaning of ε in the previous slide?
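The following sketch (my addition, not part of the tutorial) ties the ε of the previous slide to the imaginary counts of this one: the ε-regularized estimate is exactly the imaginary-count estimate with a = b = ε. The function names and example numbers are invented for illustration.

```python
def regularized_estimate(A, B, eps=1.0):
    """theta = (A + eps) / (A + B + 2*eps): never exactly 0 or 1, even for tiny data."""
    return (A + eps) / (A + B + 2 * eps)

def prior_count_estimate(A, B, a, b):
    """theta = (A + a) / (N + N'), with imaginary counts N = a + b and data size N' = A + B."""
    return (A + a) / ((a + b) + (A + B))

print(regularized_estimate(A=0, B=10))             # ~0.083 instead of 0
print(prior_count_estimate(A=0, B=10, a=1, b=1))   # the same value: eps plays the role of a = b
print(prior_count_estimate(A=3, B=7, a=7, b=3))    # thumbtack prior (7,3) pulls the data's 0.3 toward 0.7
```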

 =( ( a1+A1)/(N+N’), … ,(an+An)/(N+N’)) Generalizing to a Dice with 6 sides If our a priori idea is equivalent to seeing a1,…,a6 counts from each state and we now see data A1 …A6 counts from each state, then what is a natural estimate for  ? If we use a point estimate  = (a1/N,…,an/N) then after seeing the data our point estimate changes to  =( ( a1+A1)/(N+N’), … ,(an+An)/(N+N’)) Note that posterior estimate can be updated after each data point (namely in sequel or “online”) and that the posterior can serve as prior for future data points.

Another view of the update: we just saw that from θ = (a1/N,…,an/N) we move, after seeing the data, to θi = (ai+Ai)/(N+N') for each i. This update can be viewed as a weighted mixture of prior knowledge and data estimates: θi = N/(N+N') · (ai/N) + N'/(N+N') · (Ai/N').
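A small sketch (my addition) that checks the two forms of the update agree: the direct count form (ai+Ai)/(N+N') and the weighted mixture of prior and data estimates. The prior and data counts below are arbitrary.

```python
def posterior_estimate(prior_counts, data_counts):
    """Direct form: theta_i = (a_i + A_i) / (N + N')."""
    N, Np = sum(prior_counts), sum(data_counts)
    return [(a + A) / (N + Np) for a, A in zip(prior_counts, data_counts)]

def mixture_estimate(prior_counts, data_counts):
    """Mixture form: theta_i = N/(N+N') * (a_i/N) + N'/(N+N') * (A_i/N')."""
    N, Np = sum(prior_counts), sum(data_counts)
    w = N / (N + Np)                     # weight given to prior knowledge
    return [w * (a / N) + (1 - w) * (A / Np)
            for a, A in zip(prior_counts, data_counts)]

prior = [2, 2, 2, 2, 2, 2]               # imaginary counts for a 6-sided die (N = 12)
data  = [10, 3, 5, 7, 4, 1]              # observed counts (N' = 30)
print(posterior_estimate(prior, data))
print(mixture_estimate(prior, data))     # identical, up to floating-point rounding
```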

The Full Bayesian Approach
- Use a priori knowledge about the parameters.
- Encode uncertainty about the parameters via a distribution.
- Update the distribution after the data is seen.
- Choose the expected value as the estimate.

(As before) p(θ | data) = α p(θ) p(data | θ), where α is a normalizing constant.

p(θ | data) = p(θ | hhth…ttth) = α p(θ) p(hhth…ttth | θ). If we had 100 more heads, the peak would move much further to the right; if we had 50 heads and 50 tails, the peak would just sharpen considerably. p(θ | data) = α p(θ) θ^#h (1-θ)^#t = p(θ | #h, #t), so (#h, #t) are sufficient statistics for binomial sampling.
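To make the sufficient-statistics point concrete, here is a minimal sketch (my addition) that evaluates the unnormalized posterior p(θ) θ^#h (1-θ)^#t on a grid with a uniform prior; only the counts enter, never the order of the flips.

```python
def unnormalized_posterior(theta, n_heads, n_tails, prior=lambda t: 1.0):
    """p(theta | data) up to a constant: prior(theta) * theta^#h * (1 - theta)^#t."""
    return prior(theta) * theta ** n_heads * (1.0 - theta) ** n_tails

grid = [i / 100 for i in range(1, 100)]
post = [unnormalized_posterior(t, n_heads=4, n_tails=4) for t in grid]
print(grid[post.index(max(post))])   # peak at 0.5; 100 extra heads would push it far to the right
```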

The Beta distribution as a prior for θ

The Beta prior distribution for θ. Example: ht … htthh

From Prior to Posterior. Observation 1: If the prior is Beta(θ; a, b) and we have seen A heads and B tails, then the posterior is Beta(θ; A+a, B+b). Consequence: If the prior is Beta(θ; a, b) and we use the point estimate θ = a/N, then after seeing the data our point estimate changes to θ = (A+a)/(N+N'), where N = a+b and N' = A+B. So what is a good choice for the hyper-parameters {a, b}? For a random coin, maybe (100,100)? For a random thumbtack, maybe (7,3)? Here a and b are imaginary counts, N = a+b is the equivalent sample size, while A, B are the data counts and A+B is the data size.
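Observation 1 and its Consequence as code, in a minimal sketch of my own (not the tutorial's): the Beta prior is updated by adding the observed counts to the hyper-parameters, and the point estimate moves from a/N to (A+a)/(N+N').

```python
def beta_update(a, b, A, B):
    """Return the posterior hyper-parameters and the prior/posterior point estimates."""
    post_a, post_b = a + A, b + B            # Beta(theta; a, b) -> Beta(theta; A+a, B+b)
    prior_mean = a / (a + b)                 # theta = a / N
    post_mean = post_a / (post_a + post_b)   # theta = (A + a) / (N + N')
    return (post_a, post_b), prior_mean, post_mean

print(beta_update(a=100, b=100, A=7, B=1))   # "random coin" prior: estimate barely moves from 0.5
print(beta_update(a=7, b=3, A=7, B=1))       # "thumbtack" prior: the data shifts it noticeably
```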

From Prior to Posterior in Blocks. Observation 2: If the prior is Dir(a1,…,an) and we have seen A1,…,An counts from each state, then the posterior is Dir(a1+A1,…, an+An). Consequence: If the prior is Dir(a1,…,an) and we use the point estimate θ = (a1/N,…,an/N), then after seeing the data our point estimate changes to θ = ((a1+A1)/(N+N'), … , (an+An)/(N+N')). Note that the posterior distribution can be updated after each data point (that is, sequentially or "online") and that the posterior can serve as the prior for future data points.
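A short sketch (my addition) of the online reading of Observation 2: feed the data points one at a time, letting each posterior serve as the prior for the next; the final counts equal the batch update Dir(a1+A1,…, an+An). The state indices and counts are invented for the example.

```python
def online_dirichlet(prior_counts, observations):
    """Update Dirichlet counts one observation at a time (each observation is a state index)."""
    counts = list(prior_counts)
    for state in observations:
        counts[state] += 1                 # today's posterior is tomorrow's prior
    return counts

prior = [1, 1, 1, 1, 1, 1]                 # Dir(1,...,1) over a 6-sided die
data_points = [0, 2, 2, 5, 1, 2, 0]        # observed states, in order
posterior = online_dirichlet(prior, data_points)
print(posterior)                           # same counts as the batch update
print([c / sum(posterior) for c in posterior])   # point estimate (a_i + A_i) / (N + N')
```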

Another view of the update: recall from the Consequence that from θ = (a1/N,…,an/N) we move, after seeing the data, to θi = (ai+Ai)/(N+N') for each i. This update can be viewed as a mixture of prior and data estimates: θi = N/(N+N') · (ai/N) + N'/(N+N') · (Ai/N').

Learning Bayes Net parameters with complete data – no hidden variables

Learning Bayes Net parameters: p(θ), with p(h | θ) = θ and p(t | θ) = 1 - θ.

Learning Bayes Net parameters

Global and local parameter independence: P(X=x | θx) = θx, P(Y=y | X=x, θy|x, θy|~x) = θy|x, and P(Y=y | X=~x, θy|x, θy|~x) = θy|~x. This gives three separate, independent thumbtack estimation tasks, assuming complete data.
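A closing sketch (my addition, with a hypothetical helper and invented data): under global and local parameter independence, each parameter of the two-node network X → Y is estimated from its own counts, exactly like three separate thumbtack problems, here using the ε-regularized estimate from earlier.

```python
def learn_two_node_bn(samples, eps=1.0):
    """samples: list of (x, y) booleans for the network X -> Y.
    Returns (theta_x, theta_y|x, theta_y|~x) using epsilon-regularized ML."""
    n = len(samples)
    n_x = sum(1 for x, _ in samples if x)
    n_y_and_x = sum(1 for x, y in samples if x and y)
    n_y_and_not_x = sum(1 for x, y in samples if (not x) and y)
    theta_x = (n_x + eps) / (n + 2 * eps)
    theta_y_given_x = (n_y_and_x + eps) / (n_x + 2 * eps)
    theta_y_given_not_x = (n_y_and_not_x + eps) / ((n - n_x) + 2 * eps)
    return theta_x, theta_y_given_x, theta_y_given_not_x

samples = [(True, True), (True, False), (False, False),
           (True, True), (False, True), (False, False)]
print(learn_two_node_bn(samples))   # each estimate uses only its own counts
```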