Learning: Parameter Estimation. Eran Segal, Weizmann Institute.


1 Learning: Parameter Estimation Eran Segal Weizmann Institute

2 Learning Introduction
So far, we assumed that the networks were given. Where do the networks come from?
- Knowledge engineering with the aid of experts
- Automated construction of networks: learning from examples (instances)

3 Learning Introduction
Input: dataset of instances D = {d[1], ..., d[m]}
Output: Bayesian network
Measures of success:
- How close is the learned network to the original distribution?
  - Use distance measures between distributions
  - Often hard because we do not have the true underlying distribution
  - Instead, evaluate performance by how well the network predicts new unseen examples ("test data"), e.g., classification accuracy
- How close is the structure of the network to the true one?
  - Use a distance metric between structures
  - Hard because we do not know the true structure
  - Instead, ask whether the independencies learned hold in test data

4 Prior Knowledge
- Prespecified structure: learn only CPDs
- Prespecified variables: learn network structure and CPDs
- Hidden variables: learn hidden variables, structure, and CPDs
- Complete/incomplete data: missing data, unobserved variables

5 Learning Bayesian Networks
An inducer takes data and prior information as input, and outputs a Bayesian network: a structure (here X1 -> Y <- X2) together with its CPDs, e.g.:

P(Y|X1,X2):
  X1    X2    | y0    y1
  x1=0  x2=0  | 1     0
  x1=0  x2=1  | 0.2   0.8
  x1=1  x2=0  | 0.1   0.9
  x1=1  x2=1  | 0.02  0.98

6 Known Structure, Complete Data
- Goal: parameter estimation
- Data does not contain missing values
[Figure: the inducer takes the input data below and the initial network X1 -> Y <- X2, and outputs the network with estimated CPD P(Y|X1,X2).]

Input data (X1, X2, Y):
  (x1=0, x2=1, y=0)
  (x1=1, x2=0, y=0)
  (x1=0, x2=1, y=1)
  (x1=0, x2=0, y=0)
  (x1=1, x2=1, y=1)
  (x1=0, x2=1, y=1)
  (x1=1, x2=0, y=0)

7 Unknown Structure, Complete Data
- Goal: structure learning & parameter estimation
- Data does not contain missing values
[Figure: the same complete input data, but the inducer must now also recover the structure X1 -> Y <- X2 along with the CPD P(Y|X1,X2).]

8 Known Structure, Incomplete Data
- Goal: parameter estimation
- Data contains missing values (marked '?')
- Example: Naive Bayes
[Figure: input data with missing entries, e.g. (?, x2=1, y=0), (x1=1, ?, y=0), (?, x2=1, ?), (x1=0, x2=0, y=0), (?, x2=1, y=1), (x1=0, x2=1, ?), (x1=1, ?, y=0); the inducer estimates the CPDs of the initial network X1 -> Y <- X2.]

9 Unknown Structure, Incomplete Data
- Goal: structure learning & parameter estimation
- Data contains missing values
[Figure: the same incomplete data; the inducer must learn both the structure and the CPDs.]

10 Parameter Estimation
Input:
- Network structure
- Choice of parametric family for each CPD P(Xi | Pa(Xi))
Goal: learn CPD parameters
Two main approaches:
- Maximum likelihood estimation
- Bayesian approaches

11 Biased Coin Toss Example
- The coin can land in two positions: head or tail
- Estimation task: given toss examples x[1], ..., x[m], estimate P(H) = θ and P(T) = 1 - θ
- Assumption: i.i.d. samples
  - Tosses are controlled by an (unknown) parameter θ
  - Tosses are sampled from the same distribution
  - Tosses are independent of each other

12 Biased Coin Toss Example
- Goal: find θ ∈ [0,1] that predicts the data well
- "Predicts the data well" = likelihood of the data given θ: L(D:θ) = P(D|θ)
- Example: the probability of the sequence H,T,T,H,H is L(D:θ) = θ(1-θ)(1-θ)θθ = θ^3 (1-θ)^2
[Plot: L(D:θ) as a function of θ over [0,1]]

13 Maximum Likelihood Estimator
- The MLE is the parameter θ that maximizes L(D:θ)
- In our example, θ = 0.6 maximizes the likelihood of the sequence H,T,T,H,H
[Plot: L(D:θ) over θ ∈ [0,1], peaking at θ = 0.6]

14 Maximum Likelihood Estimator
General case:
- Observations: M_H heads and M_T tails
- Find θ maximizing the likelihood L(D:θ) = θ^M_H (1-θ)^M_T
- Equivalent to maximizing the log-likelihood log L(D:θ) = M_H log θ + M_T log(1-θ)
- Differentiating the log-likelihood and solving for θ, the maximum likelihood parameter is θ_MLE = M_H / (M_H + M_T)
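The closed-form estimator above can be sketched in a few lines; this is an illustrative snippet, not code from the lecture:

```python
# Sketch: maximum likelihood estimate for a biased coin.
def coin_mle(tosses):
    """Return the MLE of P(H) given a sequence of 'H'/'T' outcomes."""
    m_h = sum(1 for t in tosses if t == "H")  # sufficient statistic M_H
    m_t = len(tosses) - m_h                   # sufficient statistic M_T
    return m_h / (m_h + m_t)

# The slides' example sequence H,T,T,H,H gives 3/5 = 0.6.
print(coin_mle(["H", "T", "T", "H", "H"]))  # 0.6
```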

15 Sufficient Statistics
For computing the parameter θ of the coin toss example, we only needed M_H and M_T, since θ_MLE = M_H / (M_H + M_T). M_H and M_T are sufficient statistics.

16 Sufficient Statistics
A function s(D) mapping instances to a vector in R^k is a sufficient statistic if, for any two datasets D and D' and any θ, s(D) = s(D') implies L(D:θ) = L(D':θ).
[Diagram: datasets mapping to statistics; datasets with equal statistics have equal likelihoods]

17 Sufficient Statistics for Multinomial
A sufficient statistic for a dataset D over a variable Y with k values is the tuple of counts (M_1, ..., M_k), where M_i is the number of times that Y = y_i in D.
Per-instance statistic: define s(x[i]) as a tuple of dimension k with a 1 in position i and 0 in positions 1, ..., i-1 and i+1, ..., k:
s(x[i]) = (0, ..., 0, 1, 0, ..., 0)
Summing these one-hot vectors over the dataset gives s(D) = (M_1, ..., M_k).
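As a minimal sketch of this construction (illustrative, not from the slides), the counts vector is just the sum of the one-hot per-instance statistics:

```python
# Sketch: sufficient statistics for a k-valued multinomial as a sum of one-hot vectors.
def one_hot(i, k):
    """s(x) for an observation Y = y_i: 1 in position i, 0 elsewhere."""
    return [1 if j == i else 0 for j in range(k)]

def sufficient_stats(data, k):
    """s(D) = sum over instances of s(x[m]) = (M_1, ..., M_k), the value counts."""
    counts = [0] * k
    for i in data:
        for j, v in enumerate(one_hot(i, k)):
            counts[j] += v
    return counts

# Five observations of a 3-valued variable (values encoded 0, 1, 2).
print(sufficient_stats([0, 2, 2, 1, 2], k=3))  # [1, 1, 3]
```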

18 Sufficient Statistic for Gaussian
Gaussian distribution: p(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))
Rewrite as p(x) = exp(-x²/(2σ²) + xμ/σ² - μ²/(2σ²) - log(√(2π)σ)), an exponential of a quadratic in x.
⇒ Sufficient statistic for the Gaussian: s(x[i]) = (1, x[i], x[i]²), so s(D) = (M, Σ_i x[i], Σ_i x[i]²).

19 Maximum Likelihood Estimation
MLE principle: choose θ that maximizes L(D:θ)
- Multinomial MLE: θ_i = M_i / (M_1 + ... + M_k)
- Gaussian MLE: μ = (1/M) Σ_i x[i], σ² = (1/M) Σ_i (x[i] - μ)²
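A sketch of the Gaussian MLE computed directly from the sufficient statistics (M, Σx, Σx²); note it uses the biased variance estimator (dividing by M), matching the MLE formula above:

```python
# Sketch: Gaussian MLE from the sufficient statistics (M, sum x, sum x^2).
def gaussian_mle(xs):
    m = len(xs)
    s1 = sum(xs)                  # sum of x[i]
    s2 = sum(x * x for x in xs)   # sum of x[i]^2
    mu = s1 / m                   # MLE mean
    var = s2 / m - mu ** 2        # MLE variance (biased: divides by M, not M-1)
    return mu, var

print(gaussian_mle([1.0, 3.0]))  # (2.0, 1.0)
```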

20 MLE for Bayesian Networks
Consider the network X -> Y with parameters θ_x0, θ_x1, θ_y0|x0, θ_y1|x0, θ_y0|x1, θ_y1|x1, e.g.:

P(X):    x0: 0.7, x1: 0.3
P(Y|X):  X=x0: y0 0.95, y1 0.05;  X=x1: y0 0.2, y1 0.8

Each data instance is a tuple <x[m], y[m]>. The likelihood is
L(D:θ) = Π_m P(x[m], y[m] : θ) = Π_m P(x[m]:θ) P(y[m]|x[m]:θ) = [Π_m P(x[m]:θ)] · [Π_m P(y[m]|x[m]:θ)]
⇒ The likelihood decomposes into two separate terms, one for each variable.

21 MLE for Bayesian Networks
The terms further decompose by CPD entries:
Π_m P(y[m]|x[m]:θ) = Π_{m: x[m]=x0} P(y[m]|x0:θ) · Π_{m: x[m]=x1} P(y[m]|x1:θ)
By sufficient statistics:
Π_m P(y[m]|x[m]:θ) = θ_{y0|x0}^M[x0,y0] θ_{y1|x0}^M[x0,y1] θ_{y0|x1}^M[x1,y0] θ_{y1|x1}^M[x1,y1]
where M[x0,y0] is the number of data instances in which X takes the value x0 and Y takes the value y0.
MLE: θ_{y0|x0} = M[x0,y0] / (M[x0,y0] + M[x0,y1]) = M[x0,y0] / M[x0], and similarly for the other entries.

22 MLE for Bayesian Networks
The likelihood for a general Bayesian network decomposes:
L(D:θ) = Π_i L_i(D : θ_{Xi|Pa(Xi)}), where L_i(D : θ_{Xi|Pa(Xi)}) = Π_m P(x_i[m] | pa_i[m] : θ_{Xi|Pa(Xi)})
⇒ If the parameter sets θ_{Xi|Pa(Xi)} are disjoint, the MLE can be computed by maximizing each local likelihood separately.

23 MLE for Table CPD BayesNets
Multinomial CPD: for each assignment u to the parents Pa(X) we get an independent multinomial problem, where the MLE is θ_{x|u} = M[u,x] / M[u].
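A minimal sketch of this per-parent-assignment estimator (an illustrative implementation, not the lecture's code): count joint occurrences M[u,x] and parent occurrences M[u], then divide.

```python
from collections import Counter

# Sketch: MLE for a table CPD P(X | U), one independent multinomial per parent assignment u.
def table_cpd_mle(data):
    """data: list of (u, x) pairs; returns {(u, x): M[u,x] / M[u]}."""
    joint = Counter(data)                  # M[u, x]
    parent = Counter(u for u, _ in data)   # M[u]
    return {(u, x): joint[(u, x)] / parent[u] for (u, x) in joint}

d = [("u0", "x0"), ("u0", "x0"), ("u0", "x1"), ("u1", "x1")]
cpd = table_cpd_mle(d)
# theta(x0|u0) = 2/3, theta(x1|u0) = 1/3, theta(x1|u1) = 1
```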

24 MLE for Tree CPDs
Assume a tree CPD for P(X|Y,Z) with known tree structure: the root splits on Y; the y0 branch is a leaf with parameters θ_{x0|y0}, θ_{x1|y0}; the y1 branch splits on Z, with the z1 leaf holding θ_{x0|y1,z1}, θ_{x1|y1,z1} (and similarly for z0).
- Terms for (y0, z0) and (y0, z1) can be combined, since the y0 leaf does not depend on Z
- Optimization can be done leaf by leaf

25 MLE for Tree CPD BayesNets
Tree CPD T with leaves Leaves(T): for each leaf l ∈ Leaves(T) we get an independent multinomial problem, where the MLE is θ_{x|l} = M[l,x] / M[l], with M[l] counting the instances whose parent assignment reaches leaf l.

26 Limitations of MLE
- Two teams play 10 times, and the first wins 7 of the 10 matches ⇒ probability of the first team winning = 0.7
- A coin is tossed 10 times and comes out heads in 7 of the 10 tosses ⇒ probability of heads = 0.7
- Would you place the same bet on the next game as you would on the next coin toss?
- We need to incorporate prior knowledge
- Prior knowledge should only be used as a guide

27 Bayesian Inference
Assumptions:
- Given a fixed θ, tosses are independent
- If θ is unknown, tosses are not marginally independent: each toss tells us something about θ
The following network captures our assumptions: θ → X[1], X[2], ..., X[M]

28 Bayesian Inference
Joint probabilistic model: P(D, θ) = P(D|θ) P(θ)
Posterior probability over θ:
P(θ|D) = P(D|θ) P(θ) / P(D)  (likelihood × prior, divided by a normalizing factor)
For a uniform prior, the posterior is the normalized likelihood.

29 Bayesian Prediction
Predict the next data instance from the previous ones:
P(x[M+1] | D) = ∫ P(x[M+1] | θ) P(θ | D) dθ
Solving for a uniform prior P(θ) = 1 and a binomial variable gives
P(X[M+1] = H | D) = (M_H + 1) / (M + 2).

30 Example: Binomial Data
- Prior: uniform for θ in [0,1], P(θ) = 1
- ⇒ P(θ|D) is proportional to the likelihood L(D:θ)
- With (M_H, M_T) = (4, 1): the MLE for P(X=H) is 4/5 = 0.8, while the Bayesian prediction is (4+1)/(5+2) = 5/7
[Plot: posterior over θ for (M_H, M_T) = (4, 1)]
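The contrast between the two predictors on this example can be sketched as follows (illustrative code, not from the slides):

```python
# Sketch: Bayesian prediction for a binary variable under a uniform prior,
# compared with the MLE, on the slides' (M_H, M_T) = (4, 1) example.
def mle_predict(m_h, m_t):
    """MLE: M_H / M."""
    return m_h / (m_h + m_t)

def bayes_predict_uniform(m_h, m_t):
    """P(X[M+1] = H | D) = (M_H + 1) / (M + 2) for uniform prior P(theta) = 1."""
    return (m_h + 1) / (m_h + m_t + 2)

print(mle_predict(4, 1))             # 0.8
print(bayes_predict_uniform(4, 1))   # 5/7, approximately 0.714
```

The Bayesian estimate is pulled toward 1/2 by the prior, which is exactly the smoothing effect the slide illustrates.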

31 Dirichlet Priors
A Dirichlet prior is specified by a set of non-negative hyperparameters α_1, ..., α_k, so that θ ~ Dirichlet(α_1, ..., α_k) if
P(θ) ∝ Π_i θ_i^(α_i - 1), where Σ_i θ_i = 1 and α = Σ_i α_i.
Intuitively, the hyperparameters correspond to the number of imaginary examples we saw before starting the experiment.

32 Dirichlet Priors – Example
[Plot: densities of Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2), and Dirichlet(5,5) over θ ∈ [0,1]]

33 Dirichlet Priors
Dirichlet priors have the property that the posterior is also Dirichlet:
- Data counts are M_1, ..., M_k
- Prior is Dir(α_1, ..., α_k)
- ⇒ Posterior is Dir(α_1 + M_1, ..., α_k + M_k)
The hyperparameters α_1, ..., α_k can be thought of as "imaginary" counts from our prior experience.
- Equivalent sample size = α_1 + ... + α_k
- The larger the equivalent sample size, the more confident we are in our prior
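The conjugate update and the resulting predictive probabilities can be sketched directly (illustrative code; the predictive formula α_i / Σα is the standard Dirichlet-multinomial result):

```python
# Sketch: Dirichlet-multinomial conjugate update and predictive probabilities.
def dirichlet_posterior(alphas, counts):
    """Prior Dir(alpha_1..alpha_k) plus counts (M_1..M_k) -> posterior Dir(alpha_i + M_i)."""
    return [a + m for a, m in zip(alphas, counts)]

def predictive(alphas):
    """P(next outcome = i) = alpha_i / sum(alphas) under Dir(alphas)."""
    total = sum(alphas)
    return [a / total for a in alphas]

# Uniform prior Dir(1,1) plus (M_H, M_T) = (4, 1) gives Dir(5, 2),
# recovering the 5/7 prediction from the binomial example.
post = dirichlet_posterior([1, 1], [4, 1])
print(post, predictive(post))
```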

34 Effect of Priors
Prediction of P(X=H) after seeing data with M_H = (1/4) M_T, as a function of the sample size:
[Plots: (left) fixed ratio α_H/α_T with different strengths α_H + α_T; (right) fixed strength α_H + α_T with different ratios α_H/α_T]

35 Effect of Priors (cont.)
[Plot: P(X=1|D) as a function of the number of tosses N, comparing the MLE with Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(5,5), and Dirichlet(10,10) priors]
In real data, Bayesian estimates are less sensitive to noise in the data.

36 General Formulation
- Joint distribution over D, θ: P(D, θ) = P(D|θ) P(θ)
- Posterior distribution over parameters: P(θ|D) = P(D|θ) P(θ) / P(D)
- P(D) is the marginal likelihood of the data: P(D) = ∫ P(D|θ) P(θ) dθ
- As we saw, the likelihood can be described compactly using sufficient statistics
- We want conditions under which the posterior is also compact

37 Conjugate Families
A family of priors P(θ:α) is conjugate to a model P(X|θ) if, for any possible dataset D of i.i.d. samples from P(X|θ) and any choice of hyperparameters α for the prior over θ, there are hyperparameters α' that describe the posterior, i.e., P(θ:α') ∝ P(D|θ) P(θ:α).
- ⇒ The posterior has the same parametric form as the prior
- The Dirichlet prior is a conjugate family for the multinomial likelihood
Conjugate families are useful since:
- Many distributions can be represented with hyperparameters
- They allow sequential updating within the same representation
- In many cases we have closed-form solutions for prediction

38 Bayesian Estimation in BayesNets
- Instances are independent given the parameters: (x[m'], y[m']) are d-separated from (x[m], y[m]) given θ
- Priors for individual variables are a priori independent: global independence of parameters
[Figure: Bayesian network X -> Y, and the corresponding network for parameter estimation, with parameter nodes θ_X and θ_Y|X feeding X[1], ..., X[M] and Y[1], ..., Y[M]]

39 Bayesian Estimation in BayesNets
- Posteriors of θ are independent given complete data: complete data d-separates the parameters of different CPDs
- As in MLE, we can solve each estimation problem separately
[Figure: the same parameter-estimation network as the previous slide]

40 Bayesian Estimation in BayesNets
- Posteriors of θ are independent given complete data
- This also holds for parameters within families
- Note the context-specific independence between θ_Y|X=0 and θ_Y|X=1 when both X and Y are given
[Figure: the parameter-estimation network with separate nodes θ_Y|X=0 and θ_Y|X=1]

41 Bayesian Estimation in BayesNets
- Posteriors of θ can be computed independently
- For a multinomial CPD θ_{Xi|pa_i}, the posterior is Dirichlet with parameters α_{Xi=1|pa_i} + M[Xi=1, pa_i], ..., α_{Xi=k|pa_i} + M[Xi=k, pa_i]
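Combining this with the table-CPD decomposition, a hedged sketch of the posterior-mean estimate for one CPD (using a symmetric Dirichlet with pseudo-count `alpha` per value; the function and data names are illustrative):

```python
from collections import Counter

# Sketch: Bayesian (posterior-mean) estimation of a table CPD with a Dirichlet prior
# per parent assignment: theta(x|u) = (alpha + M[u,x]) / (k*alpha + M[u]).
def bayes_cpd(data, values, alpha=1.0):
    """data: list of (u, x) pairs; values: the k possible values of X."""
    joint = Counter(data)                  # M[u, x]
    parent = Counter(u for u, _ in data)   # M[u]
    k = len(values)
    return {(u, x): (alpha + joint[(u, x)]) / (k * alpha + parent[u])
            for u in set(parent) for x in values}

# Two observations of x0 under parent u0, with alpha = 1:
# theta(x0|u0) = (1+2)/(2+2) = 0.75, theta(x1|u0) = (1+0)/(2+2) = 0.25.
est = bayes_cpd([("u0", "x0"), ("u0", "x0")], ["x0", "x1"], alpha=1.0)
```

Unlike the MLE, unseen values get nonzero probability, which is the smoothing behavior the earlier slides motivated.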

42 Assessing Priors for BayesNets
- We need α(x_i, pa_i) for each node X_i
- We can use the initial parameters θ_0 as prior information, together with an equivalent sample size parameter M'
- Then we let α(x_i, pa_i) = M' · P(x_i, pa_i | θ_0)
- This allows updating a network using new data
Example network for priors: X and Y with P(X=0) = P(X=1) = 0.5 and P(Y=0) = P(Y=1) = 0.5, M' = 1.
Note: α(x0) = 0.5 and α(x0, y0) = 0.25.
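The prior-assessment rule α(x_i, pa_i) = M' · P(x_i, pa_i | θ_0) is a one-liner; this sketch assumes we can already evaluate the joint probabilities under the prior network θ_0 (the dictionary of joints below is the slides' uniform example, and the function name is illustrative):

```python
# Sketch: deriving Dirichlet hyperparameters from a prior network theta_0
# and an equivalent sample size M': alpha(x_i, pa_i) = M' * P(x_i, pa_i | theta_0).
def assess_priors(joint_prior, m_prime):
    """joint_prior: {(x_i, pa_i): P(x_i, pa_i | theta_0)}."""
    return {key: m_prime * p for key, p in joint_prior.items()}

# Slides' example: X and Y independent and uniform, M' = 1,
# so every joint assignment has probability 0.25 and alpha(x0, y0) = 0.25.
uniform_joint = {(x, y): 0.25 for x in ("x0", "x1") for y in ("y0", "y1")}
alphas = assess_priors(uniform_joint, m_prime=1)
```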

43 Case Study: ICU Alarm Network
The "Alarm" network: 37 variables.
Experiment:
- Sample instances
- Learn parameters by MLE and by Bayesian estimation
[Figure: the Alarm network diagram, with nodes such as HR, BP, CVP, SAO2, FIO2, INTUBATION, PULMEMBOLUS, and others]

44 Case Study: ICU Alarm Network
[Plot: KL divergence from the true distribution as a function of the number of samples M, for MLE and for Bayesian estimation with uniform priors of strength M' = 5, 10, 20, 50]
- MLE performs worst
- The prior with M' = 5 provides the best smoothing

45 Parameter Estimation Summary
- Estimation relies on sufficient statistics; for multinomials these are of the form M[x_i, pa_i]
- Parameter estimates:
  - MLE: θ_{x_i|pa_i} = M[x_i, pa_i] / M[pa_i]
  - Bayesian (Dirichlet): θ_{x_i|pa_i} = (α(x_i, pa_i) + M[x_i, pa_i]) / (α(pa_i) + M[pa_i])
- Bayesian methods also require a choice of priors
- MLE and Bayesian estimates are asymptotically equivalent
- Both can be implemented in an online manner by accumulating sufficient statistics

