# PGM Tirgul 10: Parameter Learning and Priors


1 PGM Tirgul 10: Parameter Learning and Priors

2 Why learning?

The knowledge acquisition bottleneck:
- Knowledge acquisition is an expensive process
- Often we don't have an expert

Data is cheap:
- Vast amounts of data become available to us
- Learning allows us to build systems based on the data

3 Learning Bayesian networks

Data + prior information → Inducer → a network over E, B, A, R, C with its CPTs, e.g. P(A | E, B):

| B | E | P(A=1) | P(A=0) |
|---|---|--------|--------|
| b | e | .9 | .1 |
| b | ¬e | .7 | .3 |
| ¬b | e | .99 | .01 |
| ¬b | ¬e | .8 | .2 |

4 Known Structure, Complete Data

- Network structure is specified
  - Inducer needs to estimate parameters
- Data does not contain missing values

The inducer takes fully observed records over E, B, A together with the fixed structure, and fills in the unknown entries of the CPT P(A | E, B).

5 Unknown Structure, Complete Data

- Network structure is not specified
  - Inducer needs to select arcs & estimate parameters
- Data does not contain missing values

The inducer takes fully observed records over E, B, A and must recover both the arcs and the CPT entries.

6 Known Structure, Incomplete Data

- Network structure is specified
- Data contains missing values
  - We consider assignments to missing values

7 Known Structure / Complete Data

Given:
- a network structure G
- a choice of parametric family for P(X_i | Pa_i)

learn the parameters for the network.

Goal: construct a network that is "closest" to the distribution that generated the data.

8 Example: Binomial Experiment (Statistics 101)

When tossed, a thumbtack can land in one of two positions: Head or Tail. We denote by θ the (unknown) probability P(H).

Estimation task: given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ.

9 Statistical Parameter Fitting

Consider i.i.d. instances x[1], x[2], …, x[M] such that:
- The set of values that x can take is known
- Each is sampled from the same distribution
- Each is sampled independently of the rest

The task is to find a parameter θ so that the data can be summarized by a probability P(x[j] | θ).
- This depends on the given family of probability distributions: multinomial, Gaussian, Poisson, etc.
- For now, we focus on multinomial distributions

10 The Likelihood Function

How good is a particular θ? It depends on how likely it is to generate the observed data. The likelihood for the sequence H, T, T, H, H is

L(θ : D) = θ · (1 − θ) · (1 − θ) · θ · θ = θ³ (1 − θ)²

[plot: L(θ : D) as a function of θ ∈ [0, 1]]
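As an illustrative sketch (Python, not part of the original slides), the likelihood of a toss sequence can be computed directly by multiplying θ for each head and 1 − θ for each tail:

```python
def likelihood(theta, sequence):
    """L(theta : D): product of theta for each 'H' and (1 - theta) for each 'T'."""
    L = 1.0
    for toss in sequence:
        L *= theta if toss == "H" else 1.0 - theta
    return L

D = ["H", "T", "T", "H", "H"]
for theta in [0.2, 0.5, 0.6, 0.8]:
    print(f"L({theta}:D) = {likelihood(theta, D):.4f}")
```

For this sequence the product is θ³(1 − θ)², which is largest at θ = 3/5.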

11 Sufficient Statistics

To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

L(θ : D) = θ^(N_H) · (1 − θ)^(N_T)

N_H and N_T are sufficient statistics for the binomial distribution.

12 Sufficient Statistics

A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood. Formally, s(D) is a sufficient statistic if for any two datasets D and D′:

s(D) = s(D′) ⟹ L(θ | D) = L(θ | D′)

13 Maximum Likelihood Estimation

MLE Principle: choose parameters that maximize the likelihood function.
- This is one of the most commonly used estimators in statistics
- Intuitively appealing

14 Example: MLE in Binomial Data

Applying the MLE principle we get

θ̂ = N_H / (N_H + N_T)

(which coincides with what one would expect).

Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6.
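A one-line Python sketch of this estimator (illustrative, not from the slides):

```python
def mle(n_heads, n_tails):
    """MLE for a binomial sample: theta_hat = N_H / (N_H + N_T)."""
    return n_heads / (n_heads + n_tails)

print(mle(3, 2))  # the slide's example: 3/5 = 0.6
```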

15 Learning Parameters for a Bayesian Network

For the network E → A ← B, A → C, training data has the form

D = { ⟨E[1], B[1], A[1], C[1]⟩, …, ⟨E[M], B[M], A[M], C[M]⟩ }

16 Learning Parameters for a Bayesian Network

Since we assume i.i.d. samples, the likelihood function is

L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)

17 Learning Parameters for a Bayesian Network

By definition of the network, we get

L(Θ : D) = ∏_m P(E[m] : Θ) · P(B[m] : Θ) · P(A[m] | E[m], B[m] : Θ) · P(C[m] | A[m] : Θ)

18 Learning Parameters for a Bayesian Network

Rewriting terms, we get

L(Θ : D) = [∏_m P(E[m] : Θ)] · [∏_m P(B[m] : Θ)] · [∏_m P(A[m] | E[m], B[m] : Θ)] · [∏_m P(C[m] | A[m] : Θ)]

19 General Bayesian Networks

Generalizing to any Bayesian network:

L(Θ : D) = ∏_m ∏_i P(x_i[m] | pa_i[m] : Θ_i)   (i.i.d. samples, network factorization)
         = ∏_i L_i(Θ_i : D)

The likelihood decomposes according to the structure of the network.

20 General Bayesian Networks (Cont.)

Decomposition ⟹ independent estimation problems.

If the parameters Θ_i for each family are not related, then they can be estimated independently of each other.

21 From Binomial to Multinomial

For example, suppose X can take the values 1, 2, …, K. We want to learn the parameters θ_1, θ_2, …, θ_K.

Sufficient statistics: N_1, N_2, …, N_K, the number of times each outcome is observed.

Likelihood function: L(θ : D) = ∏_k θ_k^(N_k)

MLE: θ̂_k = N_k / ∑_ℓ N_ℓ
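The multinomial MLE is just normalized counts; a small Python sketch (illustrative, not from the slides):

```python
from collections import Counter

def multinomial_mle(data):
    """MLE for a multinomial: theta_k = N_k / sum_l N_l."""
    counts = Counter(data)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}

sample = [1, 2, 2, 3, 3, 3]
print(multinomial_mle(sample))
```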

22 Likelihood for Multinomial Networks

When we assume that P(X_i | Pa_i) is multinomial, we get the further decomposition

L_i(Θ_i : D) = ∏_(pa_i) ∏_(x_i) θ_(x_i | pa_i)^N(x_i, pa_i)

23 Likelihood for Multinomial Networks (Cont.)

For each value pa_i of the parents of X_i we get an independent multinomial problem. The MLE is

θ̂_(x_i | pa_i) = N(x_i, pa_i) / N(pa_i)
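A Python sketch of this per-parent-assignment estimation (the list-of-dicts data format is a hypothetical choice for illustration, not from the slides):

```python
from collections import Counter

def cpt_mle(data, child, parents):
    """MLE of P(child | parents): one independent multinomial per parent assignment.
    `data` is a list of dicts mapping variable names to values."""
    joint = Counter()    # N(x_i, pa_i)
    per_pa = Counter()   # N(pa_i)
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(row[child], pa)] += 1
        per_pa[pa] += 1
    return {(x, pa): n / per_pa[pa] for (x, pa), n in joint.items()}

records = [
    {"E": 0, "B": 1, "A": 1},
    {"E": 0, "B": 1, "A": 0},
    {"E": 1, "B": 1, "A": 1},
    {"E": 0, "B": 0, "A": 0},
]
print(cpt_mle(records, "A", ["E", "B"]))
```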

24 Maximum Likelihood Estimation: Consistency

- The estimate converges to the best possible value as the number of examples grows
- To make this formal, we need to introduce some definitions

25 KL-Divergence

Let P and Q be two distributions over X. A measure of distance between P and Q is the Kullback-Leibler divergence:

KL(P‖Q) = ∑_x P(x) log( P(x) / Q(x) )

- KL(P‖Q) = 1 (when logs are in base 2) means that, on average, Q assigns to an instance half the probability that P assigns to it
- KL(P‖Q) ≥ 0
- KL(P‖Q) = 0 iff P and Q are equal
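A Python sketch (illustrative; it assumes Q assigns positive probability wherever P does), including a pair of distributions for which the divergence is exactly 1 bit:

```python
from math import log2

def kl(p, q):
    """KL(P || Q) = sum_x P(x) * log2(P(x) / Q(x)), distributions given as dicts."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.25, "b": 0.25, "c": 0.5}  # Q halves the probability of every P-instance
print(kl(P, Q))  # 1.0 bit
print(kl(P, P))  # 0.0
```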

26 Consistency

Let P(X | θ) be a parametric family (under various regularity conditions we won't go into now). Let P*(X) be the distribution that generates the data, and let θ̂ be the MLE estimate given a dataset D of N samples.

Theorem: As N → ∞, θ̂ → θ* with probability 1, where θ* minimizes KL(P*(X) ‖ P(X | θ)).

27 Consistency: Geometric Interpretation

[figure: the space of probability distributions, with the subfamily of distributions that can be represented as P(X | θ); P* lies outside the subfamily, and P(X | θ*) is the closest point in the subfamily to P*]

28 Is MLE all we need?

- Suppose that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack. Would you bet on heads for the next toss?
- Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin. Would you place the same bet?

29 Bayesian Inference

Frequentist approach:
- Assumes there is an unknown but fixed parameter θ
- Estimates θ with some confidence
- Predicts using the estimated parameter value

Bayesian approach:
- Represents uncertainty about the unknown parameter
- Uses probability to quantify this uncertainty: unknown parameters are random variables
- Prediction follows from the rules of probability: expectation over the unknown parameters

30 Bayesian Inference (cont.)

We can represent our uncertainty about the sampling process using a Bayesian network:
- The values of X are independent given θ
- The conditional probabilities P(x[m] | θ) are the parameters in the model
- Prediction is now inference in this network

[diagram: θ with arrows to the observed data X[1], X[2], …, X[M] and to the query X[M+1]]

31 Bayesian Inference (cont.)

Prediction as inference in this network:

P(x[M+1] | x[1], …, x[M]) = ∫ P(x[M+1] | θ) P(θ | x[1], …, x[M]) dθ

where the posterior is given by Bayes' rule:

P(θ | x[1], …, x[M]) = P(x[1], …, x[M] | θ) P(θ) / P(x[1], …, x[M])   (Posterior = Likelihood × Prior / Probability of data)

32 Example: Binomial Data Revisited

Prior: uniform for θ in [0, 1], i.e. P(θ) = 1. Then P(θ | D) is proportional to the likelihood L(θ : D).

With (N_H, N_T) = (4, 1):
- MLE for P(X = H) is 4/5 = 0.8
- Bayesian prediction is P(x[M+1] = H | D) = ∫ θ P(θ | D) dθ = (N_H + 1) / (N_H + N_T + 2) = 5/7 ≈ 0.714
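The uniform-prior prediction above (a Beta(1,1) prior, i.e. Laplace's rule of succession) can be sketched in Python:

```python
def bayes_predict_heads(n_h, n_t):
    """Posterior-mean prediction under a uniform prior:
    P(x[M+1] = H | D) = (N_H + 1) / (N_H + N_T + 2)."""
    return (n_h + 1) / (n_h + n_t + 2)

print(bayes_predict_heads(4, 1))  # 5/7, vs. the MLE 4/5
```

Note how the prediction is pulled toward 1/2 relative to the MLE, and the pull shrinks as the sample grows.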

33 Bayesian Inference and MLE

In our example, MLE and Bayesian prediction differ. But if the prior is well-behaved, i.e. it does not assign 0 density to any "feasible" parameter value, then both MLE and Bayesian prediction converge to the same value: both are consistent.

34 Dirichlet Priors

Recall that the likelihood function is L(θ : D) = ∏_k θ_k^(N_k).

A Dirichlet prior with hyperparameters α_1, …, α_K is defined as

P(θ) ∝ ∏_k θ_k^(α_k − 1)   for legal θ_1, …, θ_K (θ_k ≥ 0, ∑_k θ_k = 1)

Then the posterior has the same form, with hyperparameters α_1 + N_1, …, α_K + N_K:

P(θ | D) ∝ P(θ) · L(θ : D) ∝ ∏_k θ_k^(α_k + N_k − 1)

35 Dirichlet Priors (cont.)

We can compute the prediction on a new event in closed form. If P(θ) is Dirichlet with hyperparameters α_1, …, α_K, then

P(x[1] = k) = ∫ θ_k P(θ) dθ = α_k / ∑_ℓ α_ℓ

Since the posterior is also Dirichlet, we get

P(x[M+1] = k | D) = (α_k + N_k) / ∑_ℓ (α_ℓ + N_ℓ)
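The closed-form prediction is again just normalized counts, with the hyperparameters added in as "imaginary" counts; a Python sketch (illustrative values):

```python
def dirichlet_predict(alphas, counts):
    """P(x[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(alphas) + sum(counts)
    return [(a + n) / total for a, n in zip(alphas, counts)]

# Dirichlet(1, 1, 1) prior after observing counts (2, 0, 1):
print(dirichlet_predict([1, 1, 1], [2, 0, 1]))
```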

36 Dirichlet Priors: Example

[plot: the densities of Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5) over θ ∈ [0, 1]]

37 Prior Knowledge

- The hyperparameters α_1, …, α_K can be thought of as "imaginary" counts from our prior experience
- Equivalent sample size = α_1 + … + α_K
- The larger the equivalent sample size, the more confident we are in our prior

38 Effect of Priors

Prediction of P(X = H) after seeing data with N_H = 0.25 · N_T, for different sample sizes.

[two plots of the prediction vs. sample size: one with a fixed ratio α_H / α_T and different strengths α_H + α_T, one with a fixed strength α_H + α_T and different ratios α_H / α_T]
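The curves in those plots can be reproduced numerically; a Python sketch (the prior strengths 2 and 50 are illustrative choices, not the slide's exact settings). With N_H = 0.25 · N_T the heads fraction in the data is 0.2, and both predictions converge there, the strong prior more slowly:

```python
def predict(alpha_h, alpha_t, n_h, n_t):
    """Dirichlet posterior-mean prediction of P(X = H)."""
    return (alpha_h + n_h) / (alpha_h + alpha_t + n_h + n_t)

# Data with N_H = 0.25 * N_T, i.e. a heads fraction of 0.2
for n in [10, 50, 250, 1000]:
    n_h, n_t = n // 5, 4 * n // 5
    weak = predict(1, 1, n_h, n_t)      # equivalent sample size 2
    strong = predict(25, 25, n_h, n_t)  # equivalent sample size 50
    print(n, round(weak, 3), round(strong, 3))
```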

39 Effect of Priors (cont.)

In real data, Bayesian estimates are less sensitive to noise in the data.

[plot: P(X = 1 | D) vs. N for the MLE and for Dirichlet(.5, .5), Dirichlet(1, 1), Dirichlet(5, 5), and Dirichlet(10, 10) priors, with the 0/1 toss results shown underneath]

40 Conjugate Families

The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy; the Dirichlet prior is a conjugate family for the multinomial likelihood.

Conjugate families are useful since:
- For many distributions we can represent them with hyperparameters
- They allow sequential update within the same representation
- In many cases we have a closed-form solution for prediction

41 Bayesian Networks and Bayesian Prediction

- Priors for each parameter group are independent
- Data instances are independent given the unknown parameters

[diagram, also in plate notation: parameters θ_X and θ_Y|X with arrows into the observed data X[1], Y[1], …, X[M], Y[M] and into the query X[M+1], Y[M+1]]

42 Bayesian Networks and Bayesian Prediction (Cont.)

We can also "read" from the network: complete data ⟹ posteriors on parameters are independent.

43 Bayesian Prediction (cont.)

- Since posteriors on parameters for each family are independent, we can compute them separately
- Posteriors for parameters within families are also independent: complete data ⟹ independent posteriors on θ_Y|X=0 and θ_Y|X=1

[diagram: the plate model refined so that θ_Y|X splits into θ_Y|X=0 and θ_Y|X=1]

44 Bayesian Prediction (cont.)

Given these observations, we can compute the posterior for each multinomial θ_(X_i | pa_i) independently:
- The posterior is Dirichlet with parameters α(X_i = 1, pa_i) + N(X_i = 1, pa_i), …, α(X_i = k, pa_i) + N(X_i = k, pa_i)
- The predictive distribution is then represented by the parameters

θ̃_(x_i | pa_i) = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))

where α(pa_i) = ∑_(x_i) α(x_i, pa_i).

45 Assessing Priors for Bayesian Networks

We need the α(x_i, pa_i) for each node X_i:
- We can use an initial network θ_0 as prior information, together with an equivalent sample size parameter M_0
- Then we let α(x_i, pa_i) = M_0 · P(x_i, pa_i | θ_0)

This allows us to update a network using new data.
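A Python sketch of this assignment of imaginary counts (the joint probabilities in `p0` are hypothetical values for illustration, not from the slides):

```python
def prior_counts(m0, p_joint):
    """alpha(x_i, pa_i) = M0 * P(x_i, pa_i | theta_0):
    imaginary counts derived from a prior network and an equivalent sample size M0."""
    return {event: m0 * p for event, p in p_joint.items()}

# Hypothetical prior joint over (A, E) assignments from an initial network theta_0:
p0 = {("a", "e"): 0.09, ("a", "~e"): 0.01, ("~a", "e"): 0.21, ("~a", "~e"): 0.69}
print(prior_counts(10, p0))
```

By construction the imaginary counts sum to M_0, the equivalent sample size.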

46 Learning Parameters: Case Study

Experiment:
- Sample a stream of instances from the alarm network
- Learn parameters using:
  - the MLE estimator
  - the Bayesian estimator with a uniform prior of different strengths

47 Learning Parameters: Case Study (cont.)

[plot: KL divergence between the learned and true distributions vs. the number of samples M, for the MLE and for Bayes with a uniform prior of strength M′ = 5, 10, 20, 50]

48 Learning Parameters: Summary

- Estimation relies on sufficient statistics; for multinomials these are of the form N(x_i, pa_i)
- Parameter estimation:
  - MLE: θ̂_(x_i | pa_i) = N(x_i, pa_i) / N(pa_i)
  - Bayesian (Dirichlet): θ̃_(x_i | pa_i) = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))
- Bayesian methods also require a choice of priors
- MLE and Bayesian estimates are asymptotically equivalent and consistent
- Both can be implemented in an on-line manner by accumulating sufficient statistics
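The on-line implementation mentioned above can be sketched in Python: both estimators read off the same accumulated counts, so updating is just incrementing a counter (the symmetric Dirichlet prior and two-valued child are illustrative simplifications):

```python
from collections import Counter

class OnlineEstimator:
    """Accumulate sufficient statistics N(x, pa) one instance at a time;
    the MLE and the Dirichlet posterior mean both read off the same counts."""
    def __init__(self, alpha=1.0, k=2):
        self.alpha, self.k = alpha, k  # symmetric Dirichlet hyperparameter, arity of X
        self.counts = Counter()

    def update(self, x, pa):
        self.counts[(x, pa)] += 1

    def _n_pa(self, pa):
        return sum(n for (_, p), n in self.counts.items() if p == pa)

    def mle(self, x, pa):
        return self.counts[(x, pa)] / self._n_pa(pa)

    def bayes(self, x, pa):
        return (self.alpha + self.counts[(x, pa)]) / (self.k * self.alpha + self._n_pa(pa))

est = OnlineEstimator()
for x in [1, 1, 0, 1]:
    est.update(x, pa=("e",))
print(est.mle(1, ("e",)), est.bayes(1, ("e",)))
```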
