Presentation transcript:

PGM: Tirgul 10 — Parameter Learning and Priors

2. Why learning?
Knowledge acquisition bottleneck:
- Knowledge acquisition is an expensive process
- Often we don't have an expert
Data is cheap:
- Vast amounts of data become available to us
- Learning allows us to build systems based on the data

3. Learning Bayesian networks
The inducer takes Data + Prior information and outputs a Bayesian network.
[Figure: an example network over E, R, B, A, C together with a CPT for P(A | E, B)]

4. Known Structure, Complete Data
- Network structure is specified; the inducer needs to estimate the parameters
- Data does not contain missing values
[Figure: the inducer takes complete records over E, B, A and a fixed structure with unknown CPT entries, and fills in P(A | E, B)]

5. Unknown Structure, Complete Data
- Network structure is not specified; the inducer needs to select the arcs and estimate the parameters
- Data does not contain missing values
[Figure: the inducer takes complete records over E, B, A and outputs both the arcs and the CPT P(A | E, B)]

6. Known Structure, Incomplete Data
- Network structure is specified
- Data contains missing values; we consider assignments to the missing values
[Figure: the inducer takes records over E, B, A with missing entries and fills in P(A | E, B)]

7. Known Structure / Complete Data
Given a network structure G and a choice of parametric family for P(X_i | Pa_i), learn the parameters of the network.
Goal: construct a network that is "closest" to the distribution that generated the data.

8. Example: Binomial Experiment (Statistics 101)
- When tossed, a thumbtack can land in one of two positions: Head or Tail.
- We denote by θ the (unknown) probability P(H).
Estimation task: given a sequence of toss samples x[1], x[2], ..., x[M], estimate the probabilities P(H) = θ and P(T) = 1 − θ.

9. Statistical Parameter Fitting
Consider instances x[1], x[2], ..., x[M] such that:
- the set of values that x can take is known
- each is sampled from the same distribution
- each is sampled independently of the rest
(i.i.d. samples)
The task is to find a parameter Θ so that the data can be summarized by a probability P(x[j] | Θ).
- This depends on the given family of probability distributions: multinomial, Gaussian, Poisson, etc.
- For now, we focus on multinomial distributions.

10. The Likelihood Function
- How good is a particular θ? It depends on how likely it is to generate the observed data:
  $L(\theta : D) = P(D \mid \theta) = \prod_m P(x[m] \mid \theta)$
- The likelihood for the sequence H, T, T, H, H is
  $L(\theta : D) = \theta \cdot (1 - \theta) \cdot (1 - \theta) \cdot \theta \cdot \theta = \theta^3 (1 - \theta)^2$
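As a concrete illustration (not part of the original slides), here is a minimal Python sketch that evaluates L(θ : D) for the toss sequence H, T, T, H, H at a few values of θ; the function name `likelihood` is just illustrative.

```python
def likelihood(theta, tosses):
    """L(theta : D) = product over tosses of P(x | theta) for a binary outcome."""
    p = 1.0
    for x in tosses:
        p *= theta if x == "H" else (1.0 - theta)
    return p

data = ["H", "T", "T", "H", "H"]
for theta in (0.2, 0.5, 0.6, 0.8):
    print(theta, likelihood(theta, data))  # the value peaks near theta = 0.6
```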

11. Sufficient Statistics
- To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):
  $L(\theta : D) = \theta^{N_H} (1 - \theta)^{N_T}$
- N_H and N_T are sufficient statistics for the binomial distribution.

12. Sufficient Statistics
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
- Formally, s(D) is a sufficient statistic if for any two datasets D and D':
  s(D) = s(D')  ⟹  L(θ | D) = L(θ | D')
[Figure: a many-to-one mapping from the space of datasets to the space of statistics]

13. Maximum Likelihood Estimation
MLE Principle: choose the parameters that maximize the likelihood function.
- This is one of the most commonly used estimators in statistics
- It is intuitively appealing

14. Example: MLE in Binomial Data
- Applying the MLE principle to the thumbtack likelihood we get
  $\hat{\theta} = \frac{N_H}{N_H + N_T}$
  (which coincides with what one would expect).
- Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6.
[Figure: the likelihood function L(θ : D) for this data, peaking at θ = 0.6]
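A short companion sketch (again illustrative, not from the slides) that computes the MLE N_H / (N_H + N_T) directly from a toss sequence; it reproduces the 3/5 = 0.6 example above.

```python
def mle_binomial(tosses):
    """MLE for P(H): the fraction of heads among all tosses."""
    n_h = sum(1 for x in tosses if x == "H")
    return n_h / len(tosses)

print(mle_binomial(["H", "T", "T", "H", "H"]))  # (N_H, N_T) = (3, 2) -> 0.6
```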

15. Learning Parameters for a Bayesian Network
[Example network over E, B, A, C]
- Training data has the form of complete assignments to all variables:
  D = { (E[1], B[1], A[1], C[1]), ..., (E[M], B[M], A[M], C[M]) }

16. Learning Parameters for a Bayesian Network
- Since we assume i.i.d. samples, the likelihood function is
  $L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)$

17. Learning Parameters for a Bayesian Network
- By the definition of the network (factorization into local CPDs), we get
  $L(\Theta : D) = \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\, P(A[m] \mid B[m], E[m] : \Theta)\, P(C[m] \mid A[m] : \Theta)$

18. Learning Parameters for a Bayesian Network
- Rewriting (grouping) the terms, we get
  $L(\Theta : D) = \left[\prod_m P(E[m] : \Theta)\right] \left[\prod_m P(B[m] : \Theta)\right] \left[\prod_m P(A[m] \mid B[m], E[m] : \Theta)\right] \left[\prod_m P(C[m] \mid A[m] : \Theta)\right]$
  i.e., a separate likelihood term for each CPD.

19. General Bayesian Networks
Generalizing to any Bayesian network:
  $L(\Theta : D) = \prod_m P(x_1[m], \ldots, x_n[m] : \Theta) = \prod_i \prod_m P(x_i[m] \mid Pa_i[m] : \Theta_i) = \prod_i L_i(\Theta_i : D)$
(the first step uses the i.i.d. samples assumption, the second uses the network factorization)
- The likelihood decomposes according to the structure of the network.

20. General Bayesian Networks (cont.)
Decomposition ⟹ independent estimation problems.
If the parameters of each family are not related, then they can be estimated independently of each other.

21. From Binomial to Multinomial
- For example, suppose X can take the values 1, 2, ..., K.
- We want to learn the parameters θ_1, θ_2, ..., θ_K.
- Sufficient statistics: N_1, N_2, ..., N_K, the number of times each outcome is observed.
- Likelihood function:  $L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$
- MLE:  $\hat{\theta}_k = \dfrac{N_k}{\sum_l N_l}$
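A brief sketch of the multinomial case (the helper name `mle_multinomial` is assumed, not from the slides): accumulate the counts N_1, ..., N_K and normalize.

```python
from collections import Counter

def mle_multinomial(samples, values):
    """MLE theta_k = N_k / sum_l N_l for each value k of a discrete variable."""
    counts = Counter(samples)
    total = len(samples)
    return {k: counts.get(k, 0) / total for k in values}

print(mle_multinomial([1, 3, 2, 1, 1, 3], values=[1, 2, 3]))
# {1: 0.5, 2: 0.1667, 3: 0.3333}
```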

22. Likelihood for Multinomial Networks
- When we assume that P(X_i | Pa_i) is multinomial, we get a further decomposition:
  $L_i(\Theta_i : D) = \prod_m P(x_i[m] \mid pa_i[m] : \Theta_i) = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{\,N(x_i, pa_i)}$

23. Likelihood for Multinomial Networks (cont.)
- For each value pa_i of the parents of X_i we get an independent multinomial problem.
- The MLE is
  $\hat{\theta}_{x_i \mid pa_i} = \dfrac{N(x_i, pa_i)}{N(pa_i)}$
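Because the likelihood decomposes per family and per parent configuration, a CPT can be estimated by counting. A sketch under the toy structure used earlier (A with parents B and E); the data and the helper name are made up for illustration.

```python
from collections import Counter

def mle_cpt(records, child, parents):
    """MLE for P(child | parents): N(child, parents) / N(parents)."""
    joint = Counter(tuple(r[v] for v in parents + [child]) for r in records)
    parent_counts = Counter(tuple(r[v] for v in parents) for r in records)
    return {key: n / parent_counts[key[:-1]] for key, n in joint.items()}

data = [
    {"B": 0, "E": 0, "A": 0},
    {"B": 0, "E": 0, "A": 0},
    {"B": 1, "E": 0, "A": 1},
    {"B": 0, "E": 1, "A": 1},
    {"B": 0, "E": 1, "A": 0},
]
# keys are (b, e, a); e.g. P(A=1 | B=0, E=1) = 0.5
print(mle_cpt(data, child="A", parents=["B", "E"]))
```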

24. Maximum Likelihood Estimation: Consistency
- The estimate converges to the best possible value as the number of examples grows.
- To make this formal, we need to introduce some definitions.

25. KL-Divergence
- Let P and Q be two distributions over X.
- A measure of distance between P and Q is the Kullback-Leibler divergence:
  $KL(P \| Q) = \sum_x P(x) \log \dfrac{P(x)}{Q(x)}$
- KL(P||Q) = 1 (when logs are in base 2) means that, on average, Q assigns an instance half the probability that P assigns to it.
- KL(P||Q) ≥ 0, and KL(P||Q) = 0 iff P and Q are equal.
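A small sketch (illustrative) of the base-2 KL divergence between two discrete distributions given as dictionaries; it assumes Q(x) > 0 wherever P(x) > 0.

```python
from math import log2

def kl_divergence(p, q):
    """KL(P || Q) = sum_x P(x) * log2(P(x) / Q(x)), assuming q[x] > 0 where p[x] > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl_divergence(p, q))  # > 0
print(kl_divergence(p, p))  # 0.0, since KL(P || P) = 0
```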

26. Consistency
- Let P(X | Θ) be a parametric family (we need various regularity conditions that we won't go into now).
- Let P*(X) be the distribution that generates the data.
- Let $\hat{\Theta}(D)$ be the MLE estimate given a dataset D.
Theorem: as $N \to \infty$, $\hat{\Theta}(D) \to \Theta^*$ with probability 1, where $\Theta^* = \arg\min_\Theta KL\big(P^*(X) \,\|\, P(X \mid \Theta)\big)$.

27. Consistency: Geometric Interpretation
[Figure: in the space of probability distributions, P(X | Θ*) is the point closest to P* among the distributions that can be represented by P(X | Θ)]

28. Is MLE all we need?
- Suppose that after 10 observations, the ML estimate is P(H) = 0.7 for the thumbtack. Would you bet on heads for the next toss?
- Suppose now that after 10 observations, the ML estimate is P(H) = 0.7 for a coin. Would you place the same bet?

29. Bayesian Inference
Frequentist approach:
- Assumes there is an unknown but fixed parameter θ
- Estimates θ with some confidence
- Predicts using the estimated parameter value
Bayesian approach:
- Represents uncertainty about the unknown parameter
- Uses probability to quantify this uncertainty: unknown parameters are treated as random variables
- Prediction follows from the rules of probability: expectation over the unknown parameters

30. Bayesian Inference (cont.)
- We can represent our uncertainty about the sampling process using a Bayesian network:
[Figure: θ is a parent of X[1], X[2], ..., X[m] (the observed data) and of X[m+1] (the query)]
- The values of X are independent given θ
- The conditional probabilities P(x[m] | θ) are the parameters in the model
- Prediction is now inference in this network

31. Bayesian Inference (cont.)
Prediction as inference in this network:
  $P(x[M+1] \mid x[1], \ldots, x[M]) = \int P(x[M+1] \mid \theta)\, P(\theta \mid x[1], \ldots, x[M])\, d\theta$
where
  $P(\theta \mid x[1], \ldots, x[M]) = \dfrac{P(x[1], \ldots, x[M] \mid \theta)\, P(\theta)}{P(x[1], \ldots, x[M])}$
  (posterior = likelihood × prior / probability of the data)

32. Example: Binomial Data Revisited
- Prior: uniform for θ in [0, 1], i.e., P(θ) = 1.
- Then P(θ | D) is proportional to the likelihood L(θ : D).
- Suppose (N_H, N_T) = (4, 1).
- MLE for P(X = H) is 4/5 = 0.8.
- The Bayesian prediction is
  $P(x[M+1] = H \mid D) = \int \theta\, P(\theta \mid D)\, d\theta = \dfrac{N_H + 1}{N_H + N_T + 2} = \dfrac{5}{7} \approx 0.714$
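The closed-form numbers above are easy to reproduce; a tiny sketch (illustrative) comparing the MLE with the Bayesian prediction under the uniform prior for (N_H, N_T) = (4, 1).

```python
def mle(n_h, n_t):
    """MLE: N_H / (N_H + N_T)."""
    return n_h / (n_h + n_t)

def bayes_uniform(n_h, n_t):
    """Posterior-mean prediction under a uniform prior: (N_H + 1) / (N_H + N_T + 2)."""
    return (n_h + 1) / (n_h + n_t + 2)

print(mle(4, 1))            # 0.8
print(bayes_uniform(4, 1))  # 5/7 ~ 0.714
```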

33. Bayesian Inference and MLE
- In our example, the MLE and the Bayesian prediction differ.
But... if the prior is well behaved, i.e., it does not assign 0 density to any "feasible" parameter value, then both the MLE and the Bayesian prediction converge to the same value.
- Both are consistent.

34. Dirichlet Priors
- Recall that the likelihood function is
  $L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$
- A Dirichlet prior with hyperparameters α_1, ..., α_K is defined as
  $P(\Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$   for legal θ_1, ..., θ_K (θ_k ≥ 0, ∑_k θ_k = 1)
- Then the posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K:
  $P(\Theta \mid D) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}$

35. Dirichlet Priors (cont.)
- We can compute the prediction on a new event in closed form: if P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K, then
  $P(X[1] = k) = \int \theta_k\, P(\Theta)\, d\Theta = \dfrac{\alpha_k}{\sum_l \alpha_l}$
- Since the posterior is also Dirichlet, we get
  $P(X[M+1] = k \mid D) = \dfrac{\alpha_k + N_k}{\sum_l (\alpha_l + N_l)}$
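Because the Dirichlet is conjugate to the multinomial, both the posterior update and the predictive distribution reduce to additions and one normalization. A sketch with illustrative names:

```python
def dirichlet_posterior(alphas, counts):
    """Posterior hyperparameters: alpha_k + N_k for each outcome k."""
    return {k: alphas[k] + counts.get(k, 0) for k in alphas}

def dirichlet_predictive(alphas):
    """P(X = k) = alpha_k / sum_l alpha_l."""
    total = sum(alphas.values())
    return {k: a / total for k, a in alphas.items()}

prior = {"H": 1.0, "T": 1.0}                       # Dirichlet(1, 1), i.e., uniform
posterior = dirichlet_posterior(prior, {"H": 4, "T": 1})
print(dirichlet_predictive(posterior))             # {'H': 5/7, 'T': 2/7}
```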

36. Dirichlet Priors: Example
[Figure: density plots of the Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5) priors over θ]

37. Prior Knowledge
- The hyperparameters α_1, ..., α_K can be thought of as "imaginary" counts from our prior experience.
- Equivalent sample size = α_1 + ... + α_K
- The larger the equivalent sample size, the more confident we are in our prior.

38. Effect of Priors
Prediction of P(X = H) after seeing data with N_H = 0.25·N_T, for different sample sizes.
[Figure, left panel: different strength α_H + α_T with a fixed ratio α_H / α_T; right panel: fixed strength α_H + α_T with different ratios α_H / α_T]

39. Effect of Priors (cont.)
- In real data, Bayesian estimates are less sensitive to noise in the data.
[Figure: P(X = 1 | D) as a function of N for the MLE and for Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(5, 5), and Dirichlet(10, 10) priors, shown together with the sequence of toss results]

40. Conjugate Families
- The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy. The Dirichlet prior is a conjugate family for the multinomial likelihood.
- Conjugate families are useful since:
  - for many distributions we can represent the prior and posterior with hyperparameters
  - they allow sequential updating within the same representation
  - in many cases we have a closed-form solution for prediction

41. Bayesian Networks and Bayesian Prediction
- Priors for each parameter group are independent
- Data instances are independent given the unknown parameters
[Figure: the meta-network with parameter nodes θ_X and θ_Y|X as parents of the observed data X[1..M], Y[1..M] and the query X[M+1], Y[M+1], plus the equivalent plate notation]

42. Bayesian Networks and Bayesian Prediction (cont.)
- We can also "read" from the network: complete data ⟹ posteriors on the parameters are independent
[Figure: the same meta-network and plate notation as on the previous slide]

43. Bayesian Prediction (cont.)
- Since the posteriors on the parameters of each family are independent, we can compute them separately.
- Posteriors for parameters within a family are also independent: complete data ⟹ independent posteriors on θ_Y|X=0 and θ_Y|X=1
[Figure: the plate model refined so that θ_Y|X is split into θ_Y|X=0 and θ_Y|X=1]

44. Bayesian Prediction (cont.)
- Given these observations, we can compute the posterior for each multinomial θ_{X_i | pa_i} independently.
- The posterior is Dirichlet with parameters α(X_i = 1 | pa_i) + N(X_i = 1 | pa_i), ..., α(X_i = k | pa_i) + N(X_i = k | pa_i).
- The predictive distribution is then represented by the parameters
  $\tilde{\theta}_{x_i \mid pa_i} = \dfrac{\alpha(x_i \mid pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$

45. Assessing Priors for Bayesian Networks
- We need α(x_i, pa_i) for each node X_i.
- We can use initial parameters Θ_0 as prior information, together with an equivalent sample size parameter M_0.
- Then we let α(x_i, pa_i) = M_0 · P(x_i, pa_i | Θ_0).
- This allows us to update a network using new data.
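A sketch of turning an initial network Θ_0 into Dirichlet hyperparameters via α(x_i, pa_i) = M_0 · P(x_i, pa_i | Θ_0). Computing the joint P(x_i, pa_i | Θ_0) in general requires inference in Θ_0, so here it is passed in as a precomputed dictionary; all names and numbers are illustrative.

```python
def assess_priors(joint_over_family, m0):
    """alpha(x_i, pa_i) = M0 * P(x_i, pa_i | Theta_0) for one family (X_i, Pa_i)."""
    return {key: m0 * p for key, p in joint_over_family.items()}

# Hypothetical joint P(A, B, E | Theta_0) for the family of A with parents {B, E}
joint = {("a1", "b0", "e0"): 0.05,  ("a0", "b0", "e0"): 0.75,
         ("a1", "b1", "e0"): 0.09,  ("a0", "b1", "e0"): 0.01,
         ("a1", "b0", "e1"): 0.08,  ("a0", "b0", "e1"): 0.01,
         ("a1", "b1", "e1"): 0.009, ("a0", "b1", "e1"): 0.001}
print(assess_priors(joint, m0=10))  # imaginary counts with equivalent sample size 10
```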

46. Learning Parameters: Case Study (cont.)
Experiment:
- Sample a stream of instances from the alarm network
- Learn the parameters using:
  - the MLE estimator
  - a Bayesian estimator with a uniform prior of different strengths

47. Learning Parameters: Case Study (cont.)
[Figure: KL divergence to the true distribution as a function of the number of samples M, for the MLE and for Bayesian estimation with a uniform prior of strength M' = 5, 10, 20, 50]

48. Learning Parameters: Summary
- Estimation relies on sufficient statistics; for multinomials these are of the form N(x_i, pa_i).
- Parameter estimates:
  MLE:  $\hat{\theta}_{x_i \mid pa_i} = \dfrac{N(x_i, pa_i)}{N(pa_i)}$
  Bayesian (Dirichlet):  $\tilde{\theta}_{x_i \mid pa_i} = \dfrac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$
- Bayesian methods also require a choice of priors.
- Both MLE and Bayesian estimation are asymptotically equivalent and consistent.
- Both can be implemented in an on-line manner by accumulating sufficient statistics.
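To close, a sketch (illustrative, not from the slides) of on-line learning for a single family: observations only update the counts N(x_i, pa_i), and either estimate can be read off at any time.

```python
from collections import Counter

class FamilyEstimator:
    """Accumulates counts N(x_i, pa_i) for one family; alpha is a per-cell Dirichlet hyperparameter."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.counts = Counter()

    def observe(self, x_i, pa_i):
        self.counts[(x_i, pa_i)] += 1

    def estimate(self, x_i, pa_i, values, bayesian=True):
        a = self.alpha if bayesian else 0.0
        n_xi = self.counts[(x_i, pa_i)] + a
        n_pa = sum(self.counts[(v, pa_i)] + a for v in values)
        return n_xi / n_pa

est = FamilyEstimator(alpha=1.0)
for x, pa in [(1, (0, 1)), (0, (0, 1)), (1, (0, 1))]:
    est.observe(x, pa)
print(est.estimate(1, (0, 1), values=[0, 1], bayesian=False))  # MLE: 2/3
print(est.estimate(1, (0, 1), values=[0, 1], bayesian=True))   # (2+1)/(3+2) = 0.6
```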