Published by Lillian Glass. Modified about 1 year ago.

1
© 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved.
Learning Bayesian Networks from Data
Nir Friedman, U.C. Berkeley
Moises Goldszmidt, SRI International
For current slides, additional material, and reading list see

2
Outline
» Introduction
- Bayesian networks: a review
- Parameter learning: Complete data
- Parameter learning: Incomplete data
- Structure learning: Complete data
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion

3
Learning (in this context)
- Process
  - Input: dataset and prior information
  - Output: Bayesian network
- Prior information: background knowledge
  - a Bayesian network (or fragments of it)
  - time ordering
  - prior probabilities
  - ...
- The output network represents P(E,B,R,S,C), independence statements, and causality
[Figure: Data + Prior information fed to an Inducer, which outputs a network over E, R, B, A, C]

4
Bayesian Networks
Computer-efficient representation of probability distributions via conditional independence
[Figure: network over Earthquake, Radio, Burglary, Alarm, Call, with the conditional probability table P(A | E,B)]

5
Bayesian Networks
Qualitative part: statistical independence statements (causality!)
- Directed acyclic graph (DAG)
  - Nodes: random variables of interest (exhaustive and mutually exclusive states)
  - Edges: direct (causal) influence
Quantitative part: local probability models, given as a set of conditional probability distributions
[Figure: network over Earthquake, Radio, Burglary, Alarm, Call, with the conditional probability table P(A | E,B)]
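To make the saving from local probability models concrete, here is a minimal sketch that counts free parameters for the five-variable example. The parent sets below are an assumption (the usual Earthquake/Burglary/Radio/Alarm/Call layout), since the graph itself survives only as a figure; the function names are illustrative.

```python
# Assumed parent sets for the burglary example (not spelled out in the text).
PARENTS = {"E": (), "B": (), "R": ("E",), "A": ("E", "B"), "C": ("A",)}


def network_parameters(parents, arity=2):
    """Free parameters of a discrete network: (arity - 1) per parent configuration."""
    return sum((arity - 1) * arity ** len(pa) for pa in parents.values())


def full_joint_parameters(n_vars, arity=2):
    """Free parameters of an unrestricted joint table over n_vars variables."""
    return arity ** n_vars - 1


print(network_parameters(PARENTS))          # factored representation
print(full_joint_parameters(len(PARENTS)))  # explicit joint table
```

Under these assumptions the factored form needs 1 + 1 + 2 + 4 + 2 parameters, compared with one fewer than 2^5 for the explicit joint.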

6
Monitoring Intensive-Care Patients
The "alarm" network: 37 variables, 509 parameters (instead of $2^{37}$)
[Figure: graph of the alarm network]

7
Qualitative part
- Nodes are independent of non-descendants given their parents
  - P(R|E=y,A) = P(R|E=y) for all values of R, A, E
  - Given that there is an earthquake, I can predict a radio announcement regardless of whether the alarm sounds
- d-separation: a graph-theoretic criterion for reading independence statements
  - Can be computed in time linear in the number of edges
[Figure: network over Earthquake, Radio, Burglary, Alarm, Call]

8
d-separation
- Two variables are independent if all paths between them are blocked by evidence
- Three cases:
  - Common cause
  - Intermediate cause
  - Common effect
[Figure: blocked and unblocked versions of each case, drawn over the nodes E, B, A, C, R]

9
Example
- I(X,Y|Z) denotes that X and Y are independent given Z
  - I(R,B)
  - ~I(R,B|A)
  - I(R,B|E,A)
  - ~I(R,C|B)
[Figure: network over E, B, A, C, R]
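The four statements above can be checked mechanically. The sketch below does not trace blocked paths directly. It uses an equivalent test instead: X and Y are d-separated by Z exactly when they are disconnected in the moralized ancestral graph once Z is removed. The edge list is an assumption (E -> R, E -> A, B -> A, A -> C), and the function names are illustrative.

```python
# Assumed parent sets for the example network.
PARENTS = {"E": (), "B": (), "R": ("E",), "A": ("E", "B"), "C": ("A",)}


def ancestral_set(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(parents, x, y, given=()):
    """True if x and y are d-separated by `given` (x, y not in `given`)."""
    keep = ancestral_set(parents, {x, y, *given})
    adj = {node: set() for node in keep}
    for child in keep:
        pas = list(parents.get(child, ()))
        for p in pas:                      # undirected parent-child edges
            adj[child].add(p)
            adj[p].add(child)
        for i, p in enumerate(pas):        # "marry" co-parents (moralization)
            for q in pas[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    blocked = set(given)
    stack, seen = [x], {x}
    while stack:                           # reachability avoiding evidence nodes
        node = stack.pop()
        if node == y:
            return False
        for nxt in adj[node]:
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                stack.append(nxt)
    return True


print(d_separated(PARENTS, "R", "B"))              # I(R,B)
print(d_separated(PARENTS, "R", "B", ["A"]))       # ~I(R,B|A)
print(d_separated(PARENTS, "R", "B", ["E", "A"]))  # I(R,B|E,A)
print(d_separated(PARENTS, "R", "C", ["B"]))       # ~I(R,C|B)
```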

10
Quantitative Part
- Associated with each node $X_i$ there is a set of conditional probability distributions $P(X_i \mid Pa_i : \Theta)$
  - If variables are discrete, $\Theta$ is usually multinomial
  - Variables can be continuous; $\Theta$ can be a linear Gaussian
  - Combinations of discrete and continuous are only constrained by available inference mechanisms
[Figure: Earthquake and Burglary as parents of Alarm, with the conditional probability table P(A | E,B)]

11
What Can We Do with Bayesian Networks?
- Probabilistic inference: belief update
  - P(E=Y | R=Y, C=Y)
- Probabilistic inference: belief revision
  - $\arg\max_{e,b} P(e, b \mid C=Y)$
- Qualitative inference
  - I(R,C | A)
- Complex inference
  - rational decision making (influence diagrams)
  - value of information
  - sensitivity analysis
- Causality (analysis under interventions)
[Figure: network over Earthquake, Radio, Burglary, Alarm, Call]

12
Bayesian Networks: Summary
- Bayesian networks: an efficient and effective representation of probability distributions
- Efficient:
  - Local models
  - Independence (d-separation)
- Effective: algorithms take advantage of structure to
  - Compute posterior probabilities
  - Compute most probable instantiation
  - Support decision making
- But there is more: statistical induction, i.e. LEARNING

13
Learning Bayesian networks (reminder)
[Figure: Data + Prior information fed to an Inducer, which outputs a network over E, R, B, A, C together with the table P(A | E,B)]

14
The Learning Problem

15
Learning Problem
[Figure: data over E, B, A fed to an Inducer; the input network over E, B, A has unknown entries ("?") in the table P(A | E,B), and the output network has the table filled in]

19
Outline
- Introduction
- Bayesian networks: a review
» Parameter learning: Complete data
  - Statistical parametric fitting
  - Maximum likelihood estimation
  - Bayesian inference
- Parameter learning: Incomplete data
- Structure learning: Complete data
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion
[Figure: grid of learning settings, Known Structure / Unknown Structure by Complete data / Incomplete data]

20
Example: Binomial Experiment (Statistics 101)
- When tossed, a thumbtack can land in one of two positions: Head or Tail
- We denote by $\theta$ the (unknown) probability P(H)
Estimation task:
- Given a sequence of toss samples x[1], x[2], …, x[M] we want to estimate the probabilities $P(H) = \theta$ and $P(T) = 1 - \theta$
[Figure: thumbtack in the Head and Tail positions]

21
Statistical parameter fitting
- Consider instances x[1], x[2], …, x[M] that are i.i.d. samples, i.e. such that
  - The set of values that x can take is known
  - Each is sampled from the same distribution
  - Each is sampled independently of the rest
- The task is to find a parameter $\Theta$ so that the data can be summarized by a probability $P(x[j] \mid \Theta)$
  - The parameters depend on the given family of probability distributions: multinomial, Gaussian, Poisson, etc.
  - We will focus on multinomial distributions
  - The main ideas generalize to other distribution families

22
The Likelihood Function
- How good is a particular $\theta$? It depends on how likely it is to generate the observed data: $L(\theta : D) = P(D \mid \theta) = \prod_m P(x[m] \mid \theta)$
- Thus, the likelihood for the sequence H, T, T, H, H is $L(\theta : D) = \theta \cdot (1-\theta) \cdot (1-\theta) \cdot \theta \cdot \theta$
[Figure: plot of $L(\theta : D)$ against $\theta$]

23
Sufficient Statistics
- To compute the likelihood in the thumbtack example we only require $N_H$ and $N_T$ (the number of heads and the number of tails)
- $N_H$ and $N_T$ are sufficient statistics for the binomial distribution
- A sufficient statistic is a function that summarizes, from the data, the relevant information for the likelihood
  - If $s(D) = s(D')$, then $L(\theta \mid D) = L(\theta \mid D')$

24
Maximum Likelihood Estimation
MLE Principle: learn parameters that maximize the likelihood function
- This is one of the most commonly used estimators in statistics
- Intuitively appealing

25
Maximum Likelihood Estimation (Cont.)
- Consistent
  - Estimate converges to best possible value as the number of examples grows
- Asymptotic efficiency
  - Estimate is as close to the true value as possible given a particular training set
- Representation invariant
  - A transformation in the parameter representation does not change the estimated probability distribution

26
Example: MLE in Binomial Data
- Applying the MLE principle we get $\hat{\theta} = \frac{N_H}{N_H + N_T}$ (which coincides with what one would expect)
- Example: $(N_H, N_T) = (3, 2)$; the MLE estimate is 3/5 = 0.6
[Figure: plot of $L(\theta : D)$ against $\theta$]
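As a quick numerical check of the example, the sketch below evaluates the binomial likelihood from the sufficient statistics alone and confirms on a grid that the count ratio maximizes it. The function names are illustrative, not from the tutorial.

```python
def likelihood(theta, n_heads, n_tails):
    """Binomial likelihood L(theta : D), computed from the counts only."""
    return theta ** n_heads * (1 - theta) ** n_tails


def mle(n_heads, n_tails):
    """Maximum likelihood estimate of P(H)."""
    return n_heads / (n_heads + n_tails)


theta_hat = mle(3, 2)
grid = [i / 100 for i in range(101)]
best_on_grid = max(grid, key=lambda t: likelihood(t, 3, 2))
print(theta_hat, best_on_grid)
```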

27
Learning Parameters for the Burglary Story
- i.i.d. samples: $L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)$
- Network factorization: $L(\Theta : D) = \prod_m P(E[m] : \Theta)\, P(B[m] : \Theta)\, P(A[m] \mid B[m], E[m] : \Theta)\, P(C[m] \mid A[m] : \Theta)$
- We have 4 independent estimation problems
[Figure: network over E, B, A, C]

28
General Bayesian Networks
We can define the likelihood for a Bayesian network:
- i.i.d. samples: $L(\Theta : D) = \prod_m P(x_1[m], \ldots, x_n[m] : \Theta)$
- Network factorization: $L(\Theta : D) = \prod_m \prod_i P(x_i[m] \mid Pa_i[m] : \Theta_i) = \prod_i L_i(\Theta_i : D)$
The likelihood decomposes according to the structure of the network.

29
General Bayesian Networks (Cont.)
Decomposition ⇒ Independent Estimation Problems
If the parameters for each family are not related, then they can be estimated independently of each other.

30
From Binomial to Multinomial
- For example, suppose X can have the values 1, 2, …, K
- We want to learn the parameters $\theta_1, \theta_2, \ldots, \theta_K$
- Sufficient statistics: $N_1, N_2, \ldots, N_K$, the number of times each outcome is observed
- Likelihood function: $L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$
- MLE: $\hat{\theta}_k = \frac{N_k}{\sum_l N_l}$

31
Likelihood for Multinomial Networks
- When we assume that $P(X_i \mid Pa_i)$ is multinomial, we get further decomposition: $L_i(\Theta_i : D) = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{N(x_i, pa_i)}$
- For each value $pa_i$ of the parents of $X_i$ we get an independent multinomial problem
- The MLE is $\hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}$
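The count ratio above translates directly into code. This is a minimal sketch under stated assumptions: complete data is a list of dicts, and `mle_cpt` and the toy records are made up for illustration.

```python
from collections import Counter


def mle_cpt(data, child, parents):
    """MLE of P(child | parents) as N(x_i, pa_i) / N(pa_i) from complete data."""
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marginal = Counter(tuple(row[p] for p in parents) for row in data)
    return {(cfg, x): n / marginal[cfg] for (cfg, x), n in joint.items()}


# Hypothetical complete records over E, B, A (values are illustrative only).
records = [
    {"E": 0, "B": 0, "A": 0},
    {"E": 0, "B": 0, "A": 0},
    {"E": 0, "B": 1, "A": 1},
    {"E": 0, "B": 1, "A": 0},
    {"E": 1, "B": 0, "A": 1},
]
cpt = mle_cpt(records, "A", ("E", "B"))
print(cpt[((0, 1), 1)])  # estimate of P(A=1 | E=0, B=1)
```

Each parent configuration is normalized on its own, which is the "independent multinomial problem" per value of the parents. Configurations that never occur in the data simply get no entry.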

32
Is MLE all we need?
- Suppose that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack
  - Would you bet on heads for the next toss?
- Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin
  - Would you place the same bet?

33
Bayesian Inference
- MLE commits to a specific value of the unknown parameter(s)
- MLE is the same in both cases
- Confidence in prediction is clearly different
[Figure: Coin vs. Thumbtack]

34
Bayesian Inference (cont.)
Frequentist Approach:
- Assumes there is an unknown but fixed parameter $\theta$
- Estimates $\theta$ with some confidence
- Prediction by using the estimated parameter value
Bayesian Approach:
- Represents uncertainty about the unknown parameter
- Uses probability to quantify this uncertainty:
  - Unknown parameters as random variables
- Prediction follows from the rules of probability:
  - Expectation over the unknown parameters

35
Bayesian Inference (cont.)
- We can represent our uncertainty about the sampling process using a Bayesian network
  - The observed values of X are independent given $\theta$
  - The conditional probabilities, $P(x[m] \mid \theta)$, are the parameters in the model
  - Prediction is now inference in this network
[Figure: $\theta$ as parent of X[1], X[2], …, X[m] (observed data) and X[m+1] (query)]

36
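Bayesian Inference (cont.)
- Prediction as inference in this network: $P(x[M+1] \mid x[1], \ldots, x[M]) = \int P(x[M+1] \mid \theta)\, P(\theta \mid x[1], \ldots, x[M])\, d\theta$
- where the posterior is $P(\theta \mid x[1], \ldots, x[M]) = \frac{P(x[1], \ldots, x[M] \mid \theta)\, P(\theta)}{P(x[1], \ldots, x[M])}$, i.e. likelihood times prior, divided by the probability of the data
[Figure: $\theta$ as parent of X[1], X[2], …, X[m], X[m+1]]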

37
Example: Binomial Data Revisited
- Suppose that we choose a uniform prior $P(\theta) = 1$ for $\theta$ in [0,1]
- Then $P(\theta \mid D)$ is proportional to the likelihood $L(\theta : D)$
- Example: $(N_H, N_T) = (4, 1)$
  - MLE for P(X = H) is 4/5 = 0.8
  - Bayesian prediction is $P(x[M+1] = H \mid D) = \int \theta\, P(\theta \mid D)\, d\theta = \frac{5}{7} \approx 0.714$

38
Bayesian Inference and MLE
- In our example, MLE and Bayesian prediction differ
- But if the prior is well-behaved
  - i.e. it does not assign 0 density to any "feasible" parameter value
- then both MLE and Bayesian prediction converge to the same value
  - Both converge to the "true" underlying distribution (almost surely)

39
Dirichlet Priors
- Recall that the likelihood function is $L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$
- A Dirichlet prior with hyperparameters $\alpha_1, \ldots, \alpha_K$ is defined as $P(\Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$ for legal $\theta_1, \ldots, \theta_K$
- Then the posterior has the same form, with hyperparameters $\alpha_1 + N_1, \ldots, \alpha_K + N_K$

40
Dirichlet Priors (cont.)
- We can compute the prediction on a new event in closed form:
  - If $P(\Theta)$ is Dirichlet with hyperparameters $\alpha_1, \ldots, \alpha_K$ then $P(X[1] = k) = \int \theta_k\, P(\Theta)\, d\Theta = \frac{\alpha_k}{\sum_l \alpha_l}$
  - Since the posterior is also Dirichlet, we get $P(X[M+1] = k \mid D) = \frac{\alpha_k + N_k}{\sum_l (\alpha_l + N_l)}$
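The closed-form prediction is easy to sanity-check. A minimal sketch, with an illustrative function name, that reproduces the uniform-prior thumbtack example (hyperparameters (1, 1), counts (4, 1)):

```python
def dirichlet_predictive(alphas, counts):
    """P(X[M+1] = k | D) under a Dirichlet prior: (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(alphas) + sum(counts)
    return [(a + n) / total for a, n in zip(alphas, counts)]


uniform = dirichlet_predictive([1, 1], [4, 1])
strong = dirichlet_predictive([10, 10], [4, 1])
print(uniform[0])  # 5/7 for the uniform prior
print(strong[0])   # pulled toward 0.5 by the larger equivalent sample size
```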

41
Priors Intuition
- The hyperparameters $\alpha_1, \ldots, \alpha_K$ can be thought of as "imaginary" counts from our prior experience
- Equivalent sample size = $\alpha_1 + \ldots + \alpha_K$
- The larger the equivalent sample size, the more confident we are in our prior

42
Effect of Priors
Prediction of P(X=H) after seeing data with $N_H = 0.25\, N_T$ for different sample sizes
[Figure: two panels. Left: different strength $\alpha_H + \alpha_T$, fixed ratio $\alpha_H / \alpha_T$. Right: fixed strength $\alpha_H + \alpha_T$, different ratio $\alpha_H / \alpha_T$]

43
Effect of Priors (cont.)
- In real data, Bayesian estimates are less sensitive to noise in the data
[Figure: P(X = 1|D) against N for MLE, Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5), Dirichlet(10,10); second panel: toss result (0 or 1) against N]

44
Conjugate Families
- The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy
  - The Dirichlet prior is a conjugate family for the multinomial likelihood
- Conjugate families are useful since:
  - For many distributions we can represent them with hyperparameters
  - They allow for sequential update within the same representation
  - In many cases we have a closed-form solution for prediction

45
Bayesian Networks and Bayesian Prediction
- Priors for each parameter group are independent
- Data instances are independent given the unknown parameters
[Figure: parameter nodes $\theta_X$ and $\theta_{Y|X}$ as parents of X[1], …, X[M], Y[1], …, Y[M] (observed data) and X[M+1], Y[M+1] (query); the same model in plate notation with plate index m over X[m], Y[m]]

46
Bayesian Networks and Bayesian Prediction (Cont.)
- We can also "read" from the network: Complete data ⇒ posteriors on parameters are independent
[Figure: parameter nodes $\theta_X$ and $\theta_{Y|X}$ as parents of X[1], …, X[M], Y[1], …, Y[M] (observed data) and X[M+1], Y[M+1] (query); the same model in plate notation with plate index m over X[m], Y[m]]

47
Bayesian Prediction (cont.)
- Since posteriors on parameters for each family are independent, we can compute them separately
- Posteriors for parameters within families are also independent: Complete data ⇒ the posteriors on $\theta_{Y|X=0}$ and $\theta_{Y|X=1}$ are independent
[Figure: plate model with $\theta_X$, $\theta_{Y|X}$ over X[m], Y[m]; refined model with $\theta_X$, $\theta_{Y|X=0}$, $\theta_{Y|X=1}$]

48
Bayesian Prediction (cont.)
- Given these observations, we can compute the posterior for each multinomial $\theta_{X_i \mid pa_i}$ independently
  - The posterior is Dirichlet with parameters $\alpha(X_i = 1 \mid pa_i) + N(X_i = 1 \mid pa_i), \ldots, \alpha(X_i = k \mid pa_i) + N(X_i = k \mid pa_i)$
- The predictive distribution is then represented by the parameters $\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$, which is what we expected!
- The Bayesian analysis just made the assumptions explicit

49
Assessing Priors for Bayesian Networks
- We need the $\alpha(x_i, pa_i)$ for each node $x_i$
- We can use initial parameters $\Theta_0$ as prior information
  - We also need an equivalent sample size parameter $M_0$
  - Then, we let $\alpha(x_i, pa_i) = M_0 \cdot P(x_i, pa_i \mid \Theta_0)$
- This allows us to update a network using new data

50
Learning Parameters: Case Study
- Experiment:
  - Sample a stream of instances from the alarm network
  - Learn parameters using
    - the MLE estimator
    - a Bayesian estimator with uniform priors of different strengths

51
Learning Parameters: Case Study (cont.)
- Comparing two distributions, P(x) (true model) vs. Q(x) (learned distribution): measure their KL divergence $KL(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
  - A KL divergence of 1 (when logs are in base 2) means that the probability Q assigns to an instance is, on (geometric) average under P, half the probability P assigns to it
  - $KL(P \| Q) \geq 0$
  - $KL(P \| Q) = 0$ iff P and Q are equal
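A minimal sketch of the divergence for discrete distributions given as aligned probability lists. The function name and the toy distributions are illustrative, and the code assumes Q is positive wherever P is.

```python
import math


def kl_divergence(p, q, base=2):
    """KL(P||Q) = sum_x P(x) log(P(x)/Q(x)); terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


same = kl_divergence([0.5, 0.5], [0.5, 0.5])
one_bit = kl_divergence([1.0, 0.0], [0.5, 0.5])
print(same, one_bit)
```

In the second call P puts all its mass on an outcome to which Q assigns half as much probability, which gives a divergence of one bit.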

52
Learning Parameters: Case Study (cont.)
[Figure: KL divergence against number of instances M for MLE and for Bayes with uniform prior, M' = 5, 10, 20, 50]

53
Learning Parameters: Summary
- Estimation relies on sufficient statistics
  - For multinomials these are of the form $N(x_i, pa_i)$
  - Parameter estimation:
    - MLE: $\hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}$
    - Bayesian (Dirichlet): $\tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}$
- Bayesian methods also require a choice of priors
- MLE and Bayesian estimation are asymptotically equivalent and consistent
- Both can be implemented in an on-line manner by accumulating sufficient statistics

54
Outline
- Introduction
- Bayesian networks: a review
- Parameter learning: Complete data
» Parameter learning: Incomplete data
- Structure learning: Complete data
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion
[Figure: grid of learning settings, Known Structure / Unknown Structure by Complete data / Incomplete data]

55
Incomplete Data
Data is often incomplete
- Some variables of interest are not assigned a value
This phenomenon happens when we have
- Missing values
- Hidden variables

56
Missing Values
- Examples:
  - Survey data
  - Medical records
    - Not all patients undergo all possible tests

57
Missing Values (cont.)
Complicating issue:
- The fact that a value is missing might be indicative of its value
  - The patient did not undergo an X-ray since she complained about fever and not about broken bones
To learn from incomplete data we need the following assumption:
- Missing at Random (MAR): the probability that the value of $X_i$ is missing is independent of its actual value given other observed values

58
Missing Values (cont.)
- If the MAR assumption does not hold, we can create new variables that ensure that it does
- We can now predict new examples (with their pattern of omissions)
- We might not be able to learn about the underlying process

Data:
X: H T H H T H T H H T
Y: ? ? H T T ? ? H T T
Z: T ? ? T H T ? ? T H

Augmented data (same X, Y, Z rows plus observation indicators):
Obs-X: Y Y Y Y Y Y Y Y Y Y
Obs-Y: N N Y Y Y N N Y Y Y
Obs-Z: Y N N Y Y Y N N Y Y

[Figure: network over X, Y, Z and the augmented network over X, Y, Z, Obs-X, Obs-Y, Obs-Z]

59
Hidden (Latent) Variables
- Attempt to learn a model with variables we never observe
  - In this case, MAR always holds
- Why should we care about unobserved variables?
[Figure: network over X1, X2, X3, Y1, Y2, Y3 with a hidden variable H (17 parameters) versus a network over the same observed variables without H (59 parameters)]

60
Hidden Variables (cont.)
- Hidden variables also appear in clustering
- Autoclass model:
  - The hidden variable assigns class labels
  - Observed attributes are independent given the class
[Figure: hidden Cluster node as parent of observed X1, X2, ..., Xn, which may have missing values]

61
Learning Parameters from Incomplete Data
Complete data:
- Independent posteriors for $\theta_X$, $\theta_{Y|X=H}$ and $\theta_{Y|X=T}$
Incomplete data:
- Posteriors can be interdependent
- Consequence:
  - ML parameters cannot be computed separately for each multinomial
  - Posterior is not a product of independent posteriors
[Figure: plate model with $\theta_X$, $\theta_{Y|X=H}$, $\theta_{Y|X=T}$ over X[m], Y[m]]

62
Example
- Simple network: X → Y
  - P(X) assumed to be known
- Likelihood is a function of 2 parameters: P(Y=H|X=H), P(Y=H|X=T)
- Contour plots of log likelihood for different numbers of missing values of X (M = 8)
[Figure: three contour plots with axes P(Y=H|X=H) and P(Y=H|X=T); panels: no missing values, 2 missing values, 3 missing values]

63
Learning Parameters from Incomplete Data (cont.)
- In the presence of incomplete data, the likelihood can have multiple global maxima
- Example:
  - We can rename the values of hidden variable H
  - If H has two values, the likelihood has two global maxima
- Similarly, local maxima are also replicated
- Many hidden variables ⇒ a serious problem
[Figure: network H → Y]

64
MLE from Incomplete Data
- Finding MLE parameters: nonlinear optimization problem
Gradient Ascent:
- Follow the gradient of the likelihood w.r.t. the parameters
- Add line search and conjugate gradient methods to get fast convergence
Expectation Maximization (EM):
- Use the "current point" to construct an alternative function (which is "nice")
- Guarantee: the maximum of the new function scores better than the current point
Both:
- Find local maxima
- Require multiple restarts to find an approximation to the global maximum
- Require computations in each iteration
[Figure: plot of $L(\Theta \mid D)$ against $\Theta$]

65
Gradient Ascent
- Main result: $\frac{\partial \log P(D \mid \Theta)}{\partial \theta_{x_i, pa_i}} = \frac{1}{\theta_{x_i, pa_i}} \sum_m P(x_i, pa_i \mid o[m], \Theta)$
- Requires computation of $P(x_i, Pa_i \mid o[m], \Theta)$ for all i, m
- Pros:
  - Flexible
  - Closely related to methods in neural network training
- Cons:
  - Need to project the gradient onto the space of legal parameters
  - To get reasonable convergence we need to combine with "smart" optimization techniques

66
Expectation Maximization (EM)
- A general purpose method for learning from incomplete data
Intuition:
- If we had access to counts, then we could estimate parameters
- However, missing values do not allow us to perform counts
- "Complete" the counts using the current parameter assignment
[Figure: data table over X, Y, Z with missing entries marked "?"; current model, e.g. $P(Y=H \mid X=T, \Theta) = 0.4$ and $P(Y=H \mid X=H, Z=T, \Theta) = 0.3$; resulting table of expected counts N(X,Y)]

67
EM (cont.)
- Start from the training data and an initial network $(G, \Theta_0)$
- Computation (E-Step): expected counts $N(X_1)$, $N(X_2)$, $N(X_3)$, $N(H, X_1, X_2, X_3)$, $N(Y_1, H)$, $N(Y_2, H)$, $N(Y_3, H)$
- Reparameterize (M-Step): updated network $(G, \Theta_1)$
- Reiterate
[Figure: network over X1, X2, X3, H, Y1, Y2, Y3 before and after the update]

68
EM (cont.)
Formal Guarantees:
- $L(\Theta_1 : D) \geq L(\Theta_0 : D)$
  - Each iteration improves the likelihood
- If $\Theta_1 = \Theta_0$, then $\Theta_0$ is a stationary point of $L(\Theta : D)$
  - Usually, this means a local maximum
Main cost:
- Computations of expected counts in E-Step
- Requires a computation pass for each instance in the training set
  - These are exactly the same as for gradient ascent!
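The E-step / M-step loop and the monotonicity guarantee can be illustrated on the two-node network X → Y with some values of X missing. This is a minimal sketch under stated assumptions: both variables take values "H"/"T", missing X is encoded as `None`, and the data and initial parameters are made up for illustration.

```python
import math

VALUES = "HT"


def em_step(data, p_x, p_y_given_x):
    """One EM iteration for X -> Y; `data` is a list of (x, y), x may be None."""
    n_x = {v: 0.0 for v in VALUES}
    n_xy = {(v, y): 0.0 for v in VALUES for y in VALUES}
    for x, y in data:
        if x is not None:
            weights = {x: 1.0}                      # observed: a full count
        else:                                       # E-step: P(X | y, current params)
            joint = {v: p_x[v] * p_y_given_x[v][y] for v in VALUES}
            z = sum(joint.values())
            weights = {v: joint[v] / z for v in VALUES}
        for v, w in weights.items():
            n_x[v] += w
            n_xy[(v, y)] += w
    total = sum(n_x.values())                       # M-step: normalize expected counts
    new_p_x = {v: n_x[v] / total for v in VALUES}
    new_p_y = {v: {y: n_xy[(v, y)] / n_x[v] for y in VALUES} for v in VALUES}
    return new_p_x, new_p_y


def log_likelihood(data, p_x, p_y_given_x):
    """Log-likelihood of the observed part of the data (missing X summed out)."""
    ll = 0.0
    for x, y in data:
        xs = VALUES if x is None else x
        ll += math.log(sum(p_x[v] * p_y_given_x[v][y] for v in xs))
    return ll


# Illustrative data and starting point (assumptions, not from the tutorial).
data = [("H", "H"), ("T", "T"), ("H", "T"), (None, "H"),
        (None, "T"), ("T", "T"), ("H", "H"), (None, "H")]
p_x = {"H": 0.5, "T": 0.5}
p_y = {"H": {"H": 0.6, "T": 0.4}, "T": {"H": 0.3, "T": 0.7}}

history = [log_likelihood(data, p_x, p_y)]
for _ in range(20):
    p_x, p_y = em_step(data, p_x, p_y)
    history.append(log_likelihood(data, p_x, p_y))
print(history[0], history[-1])
```

The recorded log-likelihoods should never decrease from one iteration to the next, which is the first guarantee listed above.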

69
Example: EM in clustering
- Consider the clustering example
- E-Step:
  - Compute $P(C[m] \mid X_1[m], \ldots, X_n[m], \Theta)$
  - This corresponds to a "soft" assignment to clusters
  - Compute expected statistics, e.g. $E[N(x_i, c)] = \sum_{m : X_i[m] = x_i} P(c \mid X_1[m], \ldots, X_n[m], \Theta)$
- M-Step:
  - Re-estimate $P(X_i \mid C)$, $P(C)$
[Figure: Cluster node as parent of X1, X2, ..., Xn]

70
EM in Practice
Initial parameters:
- Random parameter settings
- "Best" guess from another source
Stopping criteria:
- Small change in likelihood of data
- Small change in parameter values
Avoiding bad local maxima:
- Multiple restarts
- Early "pruning" of unpromising ones
Speed up:
- Various methods to speed convergence

71
Error on training set (Alarm)
Experiment by Bauer, Koller and Singer [UAI97]
[Figure: error on training set]

72
Test set error (Alarm)
[Figure: test set error]

73
Parameter value (Alarm)
[Figure: parameter value]

76
Bayesian Inference with Incomplete Data
Recall, Bayesian estimation: $P(x[M+1] \mid D) = \int P(x[M+1] \mid \Theta)\, P(\Theta \mid D)\, d\Theta$
Complete data: closed form solution for the integral
Incomplete data:
- No sufficient statistics (except the data)
- Posterior does not decompose
- No closed form solution
⇒ Need to use approximations

77
MAP Approximation
- Simplest approximation: MAP parameters
  - MAP: Maximum A-posteriori Probability
  - $P(x[M+1] \mid D) \approx P(x[M+1] \mid \tilde{\Theta})$ where $\tilde{\Theta} = \arg\max_{\Theta} P(\Theta \mid D)$
Assumption:
- Posterior mass is dominated by the MAP parameters
Finding MAP parameters:
- Same techniques as finding ML parameters
- Maximize $P(\Theta \mid D)$ instead of $L(\Theta : D)$

78
Stochastic Approximations
Stochastic approximation:
- Sample $\Theta_1, \ldots, \Theta_k$ from $P(\Theta \mid D)$
- Approximate $P(x[M+1] \mid D) \approx \frac{1}{k} \sum_{i=1}^{k} P(x[M+1] \mid \Theta_i)$

79
Stochastic Approximations (cont.)
How do we sample from $P(\Theta \mid D)$?
Markov Chain Monte Carlo (MCMC) methods:
- Find a Markov chain whose stationary probability is $P(\Theta \mid D)$
- Simulate the chain until convergence to stationary behavior
- Collect samples from the "stationary" regions
Pros:
- Very flexible method: when other methods fail, this one usually works
- The more samples collected, the better the approximation
Cons:
- Can be computationally expensive
- How do we know when we are converging on the stationary distribution?

80
Stochastic Approximations: Gibbs Sampling
Gibbs Sampler:
- A simple method to construct an MCMC sampling process
Start:
- Choose (random) values for all unknown variables
Iteration:
- Choose an unknown variable
  - A missing data variable or unknown parameter
  - Either a random choice or round-robin visits
- Sample a value for the variable given the current values of all other variables
[Figure: plate model with $\theta_X$, $\theta_{Y|X=H}$, $\theta_{Y|X=T}$ over X[m], Y[m]]
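A rough sketch of such a sampler for the X → Y model in the figure, with some values of X missing. Several choices here are assumptions made for illustration: uniform Beta(1, 1) priors on every parameter, round-robin visits, parameters resampled as a block before the missing values, and made-up data and function names. It is a sketch, not a tuned implementation.

```python
import random


def gibbs_mean_p_x(data, n_iter=2000, burn_in=500, seed=0):
    """Estimate the posterior mean of P(X=H) for X -> Y with missing X (None)."""
    rng = random.Random(seed)
    # Start: choose random values for all missing X.
    xs = [x if x is not None else rng.choice("HT") for x, _ in data]
    missing = [i for i, (x, _) in enumerate(data) if x is None]
    samples = []
    for it in range(n_iter):
        # Sample parameters given the completed data (Beta posteriors).
        n_h = xs.count("H")
        theta_x = rng.betavariate(1 + n_h, 1 + len(xs) - n_h)
        theta_y = {}
        for v in "HT":
            heads = sum(1 for x, (_, y) in zip(xs, data) if x == v and y == "H")
            tails = sum(1 for x, (_, y) in zip(xs, data) if x == v and y == "T")
            theta_y[v] = rng.betavariate(1 + heads, 1 + tails)
        # Sample each missing X given the parameters and its observed Y.
        for i in missing:
            y = data[i][1]
            w_h = theta_x * (theta_y["H"] if y == "H" else 1 - theta_y["H"])
            w_t = (1 - theta_x) * (theta_y["T"] if y == "H" else 1 - theta_y["T"])
            xs[i] = "H" if rng.random() < w_h / (w_h + w_t) else "T"
        if it >= burn_in:
            samples.append(theta_x)
    return sum(samples) / len(samples)


data = [("H", "H"), ("T", "T"), ("H", "T"), (None, "H"),
        (None, "T"), ("T", "T"), ("H", "H"), (None, "H")]
estimate = gibbs_mean_p_x(data)
print(estimate)
```

The burn-in length and iteration count are arbitrary here. As the slides note, deciding when the chain has reached stationary behavior is the hard part in practice.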

81
Parameter Learning from Incomplete Data: Summary
- Non-linear optimization problem
- Methods for learning: EM and Gradient Ascent
  - Exploit inference for learning
Difficulties:
- Exploration of a complex likelihood/posterior
  - More missing data ⇒ many more local maxima
  - Cannot represent posterior ⇒ must resort to approximations
- Inference
  - Main computational bottleneck for learning
  - Learning large networks ⇒ exact inference is infeasible ⇒ resort to stochastic simulation or approximate inference (e.g., see Jordan's tutorial)

82
Outline
- Introduction
- Bayesian networks: a review
- Parameter learning: Complete data
- Parameter learning: Incomplete data
» Structure learning: Complete data
  » Scoring metrics
  - Maximizing the score
  - Learning local structure
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion
[Figure: grid of learning settings, Known Structure / Unknown Structure by Complete data / Incomplete data]

83
Benefits of Learning Structure
- Efficient learning: more accurate models with less data
  - Compare: P(A) and P(B) vs. joint P(A,B); the former requires less data!
- Discover structural properties of the domain
  - Identifying independencies in the domain helps to
    - Order events that occur sequentially
    - Perform sensitivity analysis and inference
- Predict effect of actions
  - Involves learning causal relationships among variables (deferred to a later part of the tutorial)

84
Approaches to Learning Structure
- Constraint based
  - Perform tests of conditional independence
  - Search for a network that is consistent with the observed dependencies and independencies
- Score based
  - Define a score that evaluates how well the (in)dependencies in a structure match the observations
  - Search for a structure that maximizes the score

85
Likelihood Score for Structures
First cut approach:
- Use the likelihood function
- Recall, the likelihood score for a network structure and parameters is $L(G, \Theta_G : D) = \prod_m \prod_i P(x_i[m] \mid Pa_i^G[m] : G, \Theta_{G,i})$
- Since we know how to maximize over parameters, from now on we assume $L(G : D) = \max_{\Theta_G} L(G, \Theta_G : D)$

86
Avoiding Overfitting (cont.)
Other approaches include:
- Holdout / cross-validation / leave-one-out
  - Validate generalization on data withheld during training
- Structural Risk Minimization
  - Penalize hypothesis subclasses based on their VC dimension

87
Learning Bayesian Networks
Known graph: learn parameters
- Complete data: parameter estimation (ML, MAP)
- Incomplete data: non-linear parametric optimization (gradient descent, EM)
Unknown graph: learn graph and parameters
- Complete data: optimization (search in space of graphs)
- Incomplete data: structural EM, mixture models
[Figure: network over C, S, B, D, X with the distributions P(S), P(B|S), P(X|C,S), P(C|S), P(D|C,B)]

88
Learning Parameters: complete data
- ML-estimate: decomposable!
- MAP-estimate (Bayesian statistics)
  - Conjugate priors: Dirichlet
  - Combines multinomial counts with an equivalent sample size (prior knowledge)
[Figure: node X with parents C and B]

89
Learning Parameters: incomplete data
- Non-decomposable marginal likelihood (hidden nodes)
- EM-algorithm: iterate until convergence
  - Start from initial parameters (the current model)
  - Expectation: inference, e.g. P(S|X=0,D=1,C=0,B=1), to turn the data into expected counts
  - Maximization: update parameters (ML, MAP)
[Figure: data table over S, X, D, C, B with missing entries, and the corresponding table of expected counts]

90
Learning graph structure
- NP-hard optimization: find the best-scoring graph
- Heuristic search:
  - Greedy local search over single-edge changes, e.g. on C, S, B: add S->B, delete S->B, reverse S->B
  - Best-first search
  - Simulated annealing
- Complete data: local computations
- Incomplete data (score non-decomposable): Structural EM
- Constraint-based methods
  - Data impose independence relations (constraints)

91
Scoring functions: Minimum Description Length (MDL)
- Learning ⇔ data compression
- MDL score: DL(Model) + DL(Data|model)
- Other: MDL = -BIC (Bayesian Information Criterion)
- Bayesian score (BDe): asymptotically equivalent to MDL
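To illustrate how such a score trades fit against model size, here is a minimal sketch of the BIC score in its common form: maximized log-likelihood minus (log M / 2) times the number of free parameters. That form is an assumption, since the slide's own formula did not survive extraction. The dataset and function name are also made up, and each variable's arity is taken from the values seen in the data.

```python
import math
from collections import Counter


def bic_score(data, parents):
    """BIC of a structure on complete data: max log-likelihood minus complexity penalty."""
    m = len(data)
    values = {v: {row[v] for row in data} for v in parents}
    loglik, dim = 0.0, 0
    for var, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        # Log-likelihood at the MLE: sum of N(x, pa) * log(N(x, pa) / N(pa)).
        loglik += sum(n * math.log(n / marginal[cfg]) for (cfg, _), n in joint.items())
        # Free parameters: (arity - 1) per parent configuration.
        dim += (len(values[var]) - 1) * math.prod(len(values[p]) for p in pa)
    return loglik - 0.5 * math.log(m) * dim


# Toy data in which Y mostly copies X (illustrative only).
rows = ([{"X": 0, "Y": 0}] * 18 + [{"X": 1, "Y": 1}] * 18
        + [{"X": 0, "Y": 1}] * 2 + [{"X": 1, "Y": 0}] * 2)
score_independent = bic_score(rows, {"X": (), "Y": ()})
score_dependent = bic_score(rows, {"X": (), "Y": ("X",)})
print(score_independent, score_dependent)
```

On this data the extra parameter of X → Y is paid for by a much better fit, so the dependent structure gets the higher score. With weakly related variables the penalty term would favor the empty graph instead.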

92
Summary
- Bayesian Networks: graphical probabilistic models
- Efficient representation and inference
- Expert knowledge + learning from data
- Learning:
  - parameters (parameter estimation, EM)
  - structure (optimization with score functions, e.g. MDL)
- Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
- Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.
