
Slide 1: Learning Bayesian Networks from Data
© 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved.
Nir Friedman, U.C. Berkeley (www.cs.berkeley.edu/~nir)
Moises Goldszmidt, SRI International (www.erg.sri.com/people/moises)
See http://www.cs.berkeley.edu/~nir/Tutorial for additional material and a reading list.

Slide 2: Outline
» Introduction
- Bayesian networks: a review
- Parameter learning: Complete data
- Parameter learning: Incomplete data
- Structure learning: Complete data
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion
(The » marker indicates the current section.)

Slide 3: Learning (in this context)
- Process
  - Input: dataset and prior information
  - Output: Bayesian network
- Prior information: background knowledge
  - a Bayesian network (or fragments of it)
  - time ordering
  - prior probabilities
  - ...
[Figure: an Inducer maps Data + Prior information to a network over E, B, R, A, C, which represents the joint distribution, independence statements, and causality.]

Slide 4: Why learning?
- Feasibility of learning
  - Availability of data and computational power
- Need for learning: characteristics of current systems and processes
  - They defy closed-form analysis ⇒ need a data-driven approach for characterization
  - They scale and change fast ⇒ need continuous automatic adaptation
- Examples: communication networks, economic markets, illegal activities, the brain, ...

Slide 5: Why learn a Bayesian network?
- Combine knowledge engineering and statistical induction
  - Covers the whole spectrum, from knowledge-intensive model construction to data-intensive model induction
- More than a learning black box
  - Explanation of outputs
  - Interpretability and modifiability
  - Algorithms for decision making, value of information, diagnosis and repair
- Causal representation, reasoning, and discovery
  - Does smoking cause cancer?

Slide 6: What will I get out of this tutorial?
An understanding of the basic concepts behind the process of learning Bayesian networks from data, so that you can:
- Read advanced papers on the subject
- Jump-start possible applications
- Implement the necessary algorithms
- Advance the state of the art

Slide 7: Outline
- Introduction
» Bayesian networks: a review
  - Probability 101
  - What are Bayesian networks?
  - What can we do with Bayesian networks?
  - The learning problem...
- Parameter learning: Complete data
- Parameter learning: Incomplete data
- Structure learning: Complete data
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion

Slide 8: Probability 101
- Bayes rule: P(X | Y) = P(Y | X) P(X) / P(Y)
- Chain rule: P(X1, ..., Xn) = P(X1) P(X2 | X1) ... P(Xn | X1, ..., Xn-1)
- Introduction of a variable (reasoning by cases): P(X) = Σ_y P(X | Y = y) P(Y = y)

Slide 9: Representing the Uncertainty in a Domain
- A story with five random variables: Burglary, Earthquake, Alarm, Neighbor Call, Radio Announcement
  - Specify a joint distribution with 2^5 - 1 = 31 parameters: maybe...
- An expert system for monitoring intensive-care patients
  - Specify a joint distribution over 37 variables with (at least) 2^37 parameters: no way!!!

Slide 10: Probabilistic Independence: a Key for Representation and Reasoning
- Recall that if X and Y are independent given Z, then P(X | Y, Z) = P(X | Z)
- In our story... if
  - burglary and earthquake are independent
  - burglary and radio are independent given earthquake
- then we can reduce the number of probabilities needed

Slide 11: Probabilistic Independence: a Key for Representation and Reasoning
- In our story... if
  - burglary and earthquake are independent
  - burglary and radio are independent given earthquake
- then instead of 15 parameters we need 8:
  - P(B, E, A, R) as an explicit joint table: 2^4 - 1 = 15 parameters, versus
  - P(B) P(E) P(R | E) P(A | B, E): 1 + 1 + 2 + 4 = 8 parameters
- Need a language to represent independence statements

Slide 12: Bayesian Networks
Compact, efficient representation of probability distributions via conditional independence.
[Figure: the network Burglary, Earthquake → Alarm → Call, with Earthquake → Radio, annotated with the CPT P(A | E,B); its rows cover the four (E,B) configurations, with entries 0.9/0.1, 0.2/0.8, 0.9/0.1, 0.01/0.99.]

Slide 13: Bayesian Networks
- Qualitative part: statistical independence statements (causality!)
  - Directed acyclic graph (DAG)
  - Nodes: random variables of interest (exhaustive and mutually exclusive states)
  - Edges: direct (causal) influence
- Quantitative part: local probability models; a set of conditional probability distributions
[Figure: the same Earthquake/Burglary/Alarm/Radio/Call network with the CPT P(A | E,B).]

Slide 14: Bayesian Network Semantics
- Compact & efficient representation:
  - nodes have ≤ k parents ⇒ O(2^k · n) vs. O(2^n) parameters
  - parameters pertain to local interactions
- Qualitative part (conditional independence statements in the BN structure) + quantitative part (local probability models) = unique joint distribution over the domain:
  P(C,A,R,E,B) = P(B) P(E|B) P(R|E,B) P(A|R,B,E) P(C|A,R,B,E)   [full chain rule]
  versus
  P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)   [using the structure]
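To make the factorization concrete, here is a minimal Python sketch of the five-variable story. The CPT numbers below are illustrative placeholders, not the tutorial's values:

```python
# A sketch of the factored joint P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A).
# All CPT values are made-up assumptions for illustration.
from itertools import product

P_B = {1: 0.01, 0: 0.99}                      # P(Burglary)
P_E = {1: 0.02, 0: 0.98}                      # P(Earthquake)
P_R = {1: {1: 0.9, 0: 0.1},                   # P(Radio=r | Earthquake=e)
       0: {1: 0.01, 0: 0.99}}
P_A1 = {(1, 1): 0.95, (1, 0): 0.9,            # P(Alarm=1 | Burglary, Earthquake)
        (0, 1): 0.2,  (0, 0): 0.001}
P_C = {1: {1: 0.7, 0: 0.3},                   # P(Call=c | Alarm=a)
       0: {1: 0.05, 0: 0.95}}

def joint(b, e, r, a, c):
    """Ten free parameters instead of the 2^5 - 1 = 31 of an explicit table."""
    p_a = P_A1[(b, e)] if a == 1 else 1 - P_A1[(b, e)]
    return P_B[b] * P_E[e] * P_R[e][r] * p_a * P_C[a][c]

# Sanity check: the factored joint sums to 1 over all 32 assignments.
assert abs(sum(joint(*v) for v in product([0, 1], repeat=5)) - 1) < 1e-12
```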

Slide 15: Monitoring Intensive-Care Patients
The "alarm" network: 37 variables, 509 parameters (instead of 2^37).
[Figure: the full alarm network; nodes include PCWP, CO, HRBP, HREKG, HRSAT, HR, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, INTUBATION, PULMEMBOLUS, ANAPHYLAXIS, LVFAILURE, HYPOVOLEMIA, CVP, BP, and others.]

Slide 16: Qualitative Part
- Nodes are independent of non-descendants given their parents:
  P(R | E=y, A) = P(R | E=y) for all values of R, A, E
  Given that there is an earthquake, I can predict a radio announcement regardless of whether the alarm sounds.
- d-separation: a graph-theoretic criterion for reading independence statements
  - Can be computed in linear time (in the number of edges)
[Figure: the Earthquake/Burglary/Alarm/Radio/Call network.]

Slide 17: d-separation
- Two variables are independent if all paths between them are blocked by evidence.
- Three cases:
  - Common cause
  - Intermediate cause
  - Common effect
[Figure: for each case, small example graphs showing when the path is blocked and when it is unblocked.]

Slide 18: Example
I(X,Y | Z) denotes that X and Y are independent given Z.
- I(R,B)
- ~I(R,B | A)
- I(R,B | E,A)
- ~I(R,C | B)
[Figure: the network E → R, E → A ← B, A → C.]

Slide 19: I-Equivalent Bayesian Networks
- Networks are I-equivalent if their structures encode the same independence statements.
- Theorem: networks are I-equivalent iff they have the same skeleton and the same "V" structures.
[Figure: three networks over E, A, R that all encode I(R,A | E); and two networks over S, C, E, D, F, G that are NOT I-equivalent.]

Slide 20: Quantitative Part
- Associated with each node X_i there is a set of conditional probability distributions P(X_i | Pa_i : θ)
  - If variables are discrete, θ is usually multinomial
  - Variables can be continuous; θ can be a linear Gaussian
  - Combinations of discrete and continuous are constrained only by the available inference mechanisms
[Figure: Earthquake, Burglary → Alarm with the CPT P(A | E,B).]

Slide 21: What Can We Do with Bayesian Networks?
- Probabilistic inference: belief update
  - P(E=Y | R=Y, C=Y)
- Probabilistic inference: belief revision
  - argmax_{e,b} P(e, b | C=Y)
- Qualitative inference
  - I(R,C | A)
- Complex inference
  - rational decision making (influence diagrams)
  - value of information
  - sensitivity analysis
- Causality (analysis under interventions)

Slide 22: Bayesian Networks: Summary
- Bayesian networks: an efficient and effective representation of probability distributions
- Efficient:
  - Local models
  - Independence (d-separation)
- Effective: algorithms take advantage of structure to
  - Compute posterior probabilities
  - Compute the most probable instantiation
  - Support decision making
- But there is more: statistical induction ⇒ LEARNING

Slide 23: Learning Bayesian Networks (reminder)
[Figure: an Inducer maps Data + Prior information to the network over E, B, R, A, C together with its CPT P(A | E,B).]

Slide 24: The Learning Problem
[Table: the four learning settings, known vs. unknown structure × complete vs. incomplete data.]

Slide 25: Learning Problem
[Figure: given data over E, B, A and a structure E, B → A whose CPT entries P(A | E,B) are unknown ("?"), the Inducer outputs the network with estimated CPT values.]


Slide 29: Outline
- Introduction
- Bayesian networks: a review
» Parameter learning: Complete data
  - Statistical parametric fitting
  - Maximum likelihood estimation
  - Bayesian inference
- Parameter learning: Incomplete data
- Structure learning: Complete data
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion
[Table: known vs. unknown structure × complete vs. incomplete data, with the current setting highlighted.]

Slide 30: Example: Binomial Experiment (Statistics 101)
- When tossed, a thumbtack can land in one of two positions: Head or Tail.
- We denote by θ the (unknown) probability P(H).
- Estimation task: given a sequence of toss samples x[1], x[2], ..., x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ.

Slide 31: Statistical Parameter Fitting
- Consider instances x[1], x[2], ..., x[M] such that (i.i.d. samples)
  - The set of values that x can take is known
  - Each is sampled from the same distribution
  - Each is sampled independently of the rest
- The task is to find a parameter θ so that the data can be summarized by a probability P(x[j] | θ).
  - The parameters depend on the given family of probability distributions: multinomial, Gaussian, Poisson, etc.
  - We will focus on multinomial distributions
  - The main ideas generalize to other distribution families

Slide 32: The Likelihood Function
- How good is a particular θ? It depends on how likely it is to generate the observed data:
  L(θ : D) = P(D | θ) = Π_m P(x[m] | θ)
- Thus, the likelihood for the sequence H, T, T, H, H is
  L(θ : D) = θ (1 - θ) (1 - θ) θ θ = θ^3 (1 - θ)^2
[Plot: L(θ : D) as a function of θ over [0, 1].]

Slide 33: Sufficient Statistics
- To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails): L(θ : D) = θ^N_H (1 - θ)^N_T
- N_H and N_T are sufficient statistics for the binomial distribution.
- A sufficient statistic is a function that summarizes, from the data, the relevant information for the likelihood:
  if s(D) = s(D'), then L(θ : D) = L(θ : D')

Slide 34: Maximum Likelihood Estimation
MLE principle: learn the parameters that maximize the likelihood function.
- One of the most commonly used estimators in statistics
- Intuitively appealing

Slide 35: Maximum Likelihood Estimation (cont.)
- Consistent
  - The estimate converges to the best possible value as the number of examples grows
- Asymptotically efficient
  - The estimate is as close to the true value as possible given a particular training set
- Representation invariant
  - A transformation in the parameter representation does not change the estimated probability distribution

Slide 36: Example: MLE in Binomial Data
- Applying the MLE principle, we get
  θ-hat = N_H / (N_H + N_T)
  (which coincides with what one would expect).
- Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6.
[Plot: L(θ : D) over θ in [0, 1], peaking at 0.6.]
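A small sketch of the likelihood and its closed-form maximizer, using the slide's (N_H, N_T) = (3, 2) example; the grid check is only there to confirm the closed form:

```python
import math

def binomial_log_likelihood(theta, n_h, n_t):
    """log L(theta : D) for i.i.d. tosses; (n_h, n_t) are the sufficient statistics."""
    return n_h * math.log(theta) + n_t * math.log(1 - theta)

n_h, n_t = 3, 2
theta_mle = n_h / (n_h + n_t)        # closed-form MLE: N_H / (N_H + N_T)
print(theta_mle)                     # 0.6, as on the slide

# The closed form indeed maximizes the likelihood over a fine grid.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda t: binomial_log_likelihood(t, n_h, n_t))
assert abs(best - theta_mle) < 0.01
```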

Slide 37: Learning Parameters for the Burglary Story
For i.i.d. samples, the likelihood factors according to the network (E, B → A → C):
L(θ : D) = Π_m P(E[m], B[m], A[m], C[m] : θ)
         = Π_m P(E[m] : θ) P(B[m] : θ) P(A[m] | B[m], E[m] : θ) P(C[m] | A[m] : θ)
We have 4 independent estimation problems.

Slide 38: General Bayesian Networks
We can define the likelihood for a Bayesian network: for i.i.d. samples, by the network factorization,
L(θ : D) = Π_m Π_i P(x_i[m] | Pa_i[m] : θ_i) = Π_i L_i(θ_i : D)
The likelihood decomposes according to the structure of the network.

Slide 39: General Bayesian Networks (cont.)
Decomposition ⇒ independent estimation problems.
If the parameters for each family are not related, then they can be estimated independently of each other.

Slide 40: From Binomial to Multinomial
- Suppose X can take the values 1, 2, ..., K.
- We want to learn the parameters θ_1, θ_2, ..., θ_K.
- Sufficient statistics: N_1, N_2, ..., N_K, the number of times each outcome is observed.
- Likelihood function: L(θ : D) = Π_k θ_k^N_k
- MLE: θ-hat_k = N_k / Σ_l N_l

Slide 41: Likelihood for Multinomial Networks
- When we assume that P(X_i | Pa_i) is multinomial, we get further decomposition:
  L_i(θ_i : D) = Π_{pa_i} Π_{x_i} θ_{x_i | pa_i}^{N(x_i, pa_i)}
- For each value pa_i of the parents of X_i we get an independent multinomial problem.
- The MLE is θ-hat_{x_i | pa_i} = N(x_i, pa_i) / N(pa_i)
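A minimal sketch of this per-family estimation, counting N(x_i, pa_i) and N(pa_i) over a toy dataset; the variable names and data are made up for illustration:

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """MLE for a multinomial CPT: theta[x | pa] = N(x, pa) / N(pa).
    `data` is a list of dicts mapping variable names to discrete values."""
    joint_counts = Counter()
    parent_counts = Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint_counts[(row[child], pa)] += 1
        parent_counts[pa] += 1
    return {(x, pa): n / parent_counts[pa]
            for (x, pa), n in joint_counts.items()}

data = [{"E": 0, "B": 0, "A": 0}, {"E": 0, "B": 1, "A": 1},
        {"E": 1, "B": 0, "A": 1}, {"E": 0, "B": 1, "A": 1}]
print(mle_cpt(data, "A", ["E", "B"]))   # e.g. theta[A=1 | E=0, B=1] = 1.0
```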

Slide 42: Is MLE all we need?
- Suppose that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack.
  - Would you bet on heads for the next toss?
- Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin.
  - Would you place the same bet?

Slide 43: Bayesian Inference
- MLE commits to a specific value of the unknown parameter(s).
- The MLE is the same in both cases.
- The confidence in the prediction is clearly different.
[Plot: likelihood curves for the thumbtack vs. the coin over θ in [0, 1].]

Slide 44: Bayesian Inference (cont.)
Frequentist approach:
- Assumes there is an unknown but fixed parameter θ
- Estimates θ with some confidence
- Predicts using the estimated parameter value
Bayesian approach:
- Represents uncertainty about the unknown parameter
- Uses probability to quantify this uncertainty: unknown parameters are treated as random variables
- Prediction follows from the rules of probability: expectation over the unknown parameters

Slide 45: Bayesian Inference (cont.)
- We can represent our uncertainty about the sampling process using a Bayesian network:
  - The observed values of X are independent given θ
  - The conditional probabilities, P(x[m] | θ), are the parameters in the model
  - Prediction is now inference in this network
[Figure: θ with children X[1], X[2], ..., X[m] (observed data) and X[m+1] (query).]

Slide 46: Bayesian Inference (cont.)
Prediction as inference in this network:
P(x[m+1] | x[1], ..., x[m]) = ∫ P(x[m+1] | θ) P(θ | x[1], ..., x[m]) dθ
where, by Bayes rule,
P(θ | x[1], ..., x[m]) = P(x[1], ..., x[m] | θ) P(θ) / P(x[1], ..., x[m])
(posterior = likelihood × prior / probability of the data).

Slide 47: Example: Binomial Data Revisited
- Suppose that we choose a uniform prior P(θ) = 1 for θ in [0, 1].
- Then P(θ | D) is proportional to the likelihood L(θ : D).
- With (N_H, N_T) = (4, 1):
  - The MLE for P(X=H) is 4/5 = 0.8.
  - The Bayesian prediction is (N_H + 1) / (N_H + N_T + 2) = 5/7 ≈ 0.71.
[Plot: the posterior over θ in [0, 1].]

Slide 48: Bayesian Inference and MLE
- In our example, MLE and Bayesian prediction differ.
But... if the prior is well behaved, i.e.,
- it does not assign 0 density to any "feasible" parameter value,
then:
- both MLE and Bayesian prediction converge to the same value
- both converge to the "true" underlying distribution (almost surely)

Slide 49: Dirichlet Priors
- Recall that the likelihood function is L(θ : D) = Π_k θ_k^N_k
- A Dirichlet prior with hyperparameters α_1, ..., α_K is defined as
  P(θ) ∝ Π_k θ_k^(α_k - 1)   for legal θ_1, ..., θ_K (non-negative, summing to 1)
- Then the posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K:
  P(θ | D) ∝ Π_k θ_k^(α_k + N_k - 1)

Slide 50: Dirichlet Priors (cont.)
- We can compute the prediction on a new event in closed form.
- If P(θ) is Dirichlet with hyperparameters α_1, ..., α_K, then
  P(X[1] = k) = α_k / Σ_l α_l
- Since the posterior is also Dirichlet, we get
  P(X[M+1] = k | D) = (α_k + N_k) / Σ_l (α_l + N_l)
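The closed-form prediction is a one-liner; a small sketch, reusing the earlier slide's (N_H, N_T) = (4, 1) example with a uniform Dirichlet(1, 1) prior:

```python
def dirichlet_predict(alphas, counts):
    """Closed-form predictive P(X[M+1]=k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l);
    the posterior under a Dirichlet prior is again Dirichlet."""
    total = sum(a + n for a, n in zip(alphas, counts))
    return [(a + n) / total for a, n in zip(alphas, counts)]

# Uniform prior Dirichlet(1, 1) and (N_H, N_T) = (4, 1) gives 5/7 for heads,
# matching the Bayesian prediction on Slide 47.
print(dirichlet_predict([1, 1], [4, 1]))   # [0.714..., 0.285...]
```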

Slide 51: Priors: Intuition
- The hyperparameters α_1, ..., α_K can be thought of as "imaginary" counts from our prior experience.
- Equivalent sample size = α_1 + ... + α_K.
- The larger the equivalent sample size, the more confident we are in our prior.

Slide 52: Effect of Priors
Prediction of P(X=H) after seeing data with N_H = 0.25 N_T, for different sample sizes.
[Plots: predicted P(X=H) vs. sample size; one panel with fixed ratio α_H/α_T and different strengths α_H + α_T, one panel with fixed strength α_H + α_T and different ratios α_H/α_T.]

Slide 53: Effect of Priors (cont.)
- In real data, Bayesian estimates are less sensitive to noise in the data.
[Plot: P(X=1 | D) vs. N for the MLE and for Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5), and Dirichlet(10,10) priors, together with the toss results.]

Slide 54: Conjugate Families
- The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy.
  - The Dirichlet prior is a conjugate family for the multinomial likelihood.
- Conjugate families are useful since:
  - For many distributions, the posterior can be represented compactly by hyperparameters
  - They allow sequential updating within the same representation
  - In many cases, there is a closed-form solution for prediction

Slide 55: Bayesian Networks and Bayesian Prediction
- Priors for each parameter group are independent.
- Data instances are independent given the unknown parameters.
[Figure: θ_X and θ_Y|X with children X[1..M], Y[1..M] (observed data) and X[M+1], Y[M+1] (query), shown both unrolled and in plate notation.]

Slide 56: Bayesian Networks and Bayesian Prediction (cont.)
- We can also "read" from the network:
  complete data ⇒ posteriors on parameters are independent.
[Figure: the same unrolled and plate-notation networks as on the previous slide.]

Slide 57: Bayesian Prediction (cont.)
- Since the posteriors on the parameters of each family are independent, we can compute them separately.
- Posteriors for parameters within families are also independent:
  complete data ⇒ the posteriors on θ_Y|X=0 and θ_Y|X=1 are independent.
[Figure: the refined plate model with separate nodes θ_X, θ_Y|X=0, and θ_Y|X=1.]

Slide 58: Bayesian Prediction (cont.)
- Given these observations, we can compute the posterior for each multinomial θ_{X_i | pa_i} independently.
  - The posterior is Dirichlet with parameters α(X_i=1 | pa_i) + N(X_i=1 | pa_i), ..., α(X_i=k | pa_i) + N(X_i=k | pa_i)
- The predictive distribution is then represented by the parameters
  θ-tilde_{x_i | pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))
which is what we expected! The Bayesian analysis just made the assumptions explicit.

Slide 59: Assessing Priors for Bayesian Networks
- We need the α(x_i, pa_i) for each node X_i.
- We can use the initial parameters θ_0 as prior information.
- We also need an equivalent sample size parameter M_0.
- Then we let α(x_i, pa_i) = M_0 · P(x_i, pa_i | θ_0).
- This allows us to update a network using new data.

Slide 60: Learning Parameters: Case Study
- Experiment:
  - Sample a stream of instances from the alarm network
  - Learn parameters using
    - the MLE estimator
    - the Bayesian estimator with uniform priors of different strengths

Slide 61: Learning Parameters: Case Study (cont.)
Comparing the true distribution P(x) (the generating model) with the learned distribution Q(x): measure their KL divergence, KL(P||Q) = Σ_x P(x) log (P(x)/Q(x)).
- A KL divergence of 1 (when logs are in base 2) means that, on average, Q assigns an instance half the probability that P assigns to it.
- KL(P||Q) ≥ 0
- KL(P||Q) = 0 iff P and Q are equal
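A direct transcription of this measure into Python; distributions are represented as plain dicts, and the example values are made up:

```python
import math

def kl_divergence(p, q):
    """KL(P||Q) = sum_x P(x) log2(P(x)/Q(x)); >= 0, and 0 iff P = Q.
    `p` and `q` map each instance x to its probability."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl_divergence(p, q))   # 0.5*log2(2) + 0.5*log2(2/3) ~ 0.2075
print(kl_divergence(p, p))   # 0.0
```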

Slide 62: Learning Parameters: Case Study (cont.)
[Plot: KL divergence vs. number of samples M (0 to 5000) for the MLE and for Bayesian estimation with uniform priors of strength M' = 5, 10, 20, and 50.]

Slide 63: Learning Parameters: Summary
- Estimation relies on sufficient statistics; for multinomials these are of the form N(x_i, pa_i).
- Parameter estimation:
  - MLE: θ-hat_{x_i | pa_i} = N(x_i, pa_i) / N(pa_i)
  - Bayesian (Dirichlet): θ-tilde_{x_i | pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))
- Bayesian methods also require a choice of priors.
- MLE and Bayesian estimation are asymptotically equivalent and consistent.
- Both can be implemented in an online manner by accumulating sufficient statistics.

Slide 64: Outline
- Introduction
- Bayesian networks: a review
- Parameter learning: Complete data
» Parameter learning: Incomplete data
- Structure learning: Complete data
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion
[Table: known vs. unknown structure × complete vs. incomplete data, with the current setting highlighted.]

Slide 65: Incomplete Data
Data is often incomplete:
- Some variables of interest are not assigned values.
This phenomenon happens when we have
- Missing values
- Hidden variables

Slide 66: Missing Values
Examples:
- Survey data
- Medical records
  - Not all patients undergo all possible tests

Slide 67: Missing Values (cont.)
Complicating issue:
- The fact that a value is missing might be indicative of its value.
  - The patient did not undergo an X-ray since she complained about fever and not about broken bones...
To learn from incomplete data we need the following assumption:
Missing at Random (MAR):
- The probability that the value of X_i is missing is independent of its actual value given the other observed values.

Slide 68: Missing Values (cont.)
- If the MAR assumption does not hold, we can create new variables that ensure that it does:
  Data                        Augmented data (adds observability indicators)
  X: H T H H T H T H H T      Obs-X: Y Y Y Y Y Y Y Y Y Y
  Y: ? ? H T T ? ? H T T      Obs-Y: N N Y Y Y N N Y Y Y
  Z: T ? ? T H T ? ? T H      Obs-Z: Y N N Y Y Y N N Y Y
- We can now predict new examples (with patterns of omissions).
- We might not be able to learn about the underlying process.
[Figure: the original network over X, Y, Z, and the augmented network with the added nodes Obs-X, Obs-Y, Obs-Z.]

Slide 69: Hidden (Latent) Variables
- Attempt to learn a model with variables we never observe.
  - In this case, MAR always holds.
- Why should we care about unobserved variables?
[Figure: X1, X2, X3 → H → Y1, Y2, Y3 uses 17 parameters, versus 59 parameters when H is removed and the X's are connected directly to the Y's.]

Slide 70: Hidden Variables (cont.)
- Hidden variables also appear in clustering.
- Autoclass model:
  - The hidden variable assigns class labels
  - Observed attributes are independent given the class
[Figure: hidden Cluster node with observed children X1, X2, ..., Xn (possibly with missing values).]

Slide 71: Learning Parameters from Incomplete Data
Complete data:
- Independent posteriors for θ_X, θ_Y|X=H and θ_Y|X=T
Incomplete data:
- Posteriors can be interdependent
- Consequences:
  - ML parameters cannot be computed separately for each multinomial
  - The posterior is not a product of independent posteriors
[Figure: plate model with θ_X, θ_Y|X=H, θ_Y|X=T and instances X[m], Y[m].]

Slide 72: Example
- Simple network: X → Y, with P(X) assumed to be known.
- The likelihood is a function of 2 parameters: P(Y=H | X=H) and P(Y=H | X=T).
[Contour plots of the log-likelihood over (P(Y=H | X=H), P(Y=H | X=T)) for no missing values, 2 missing values, and 3 missing values of X (M = 8).]

Slide 73: Learning Parameters from Incomplete Data (cont.)
- In the presence of incomplete data, the likelihood can have multiple global maxima.
- Example:
  - We can rename the values of a hidden variable H
  - If H has two values, the likelihood has two global maxima
- Similarly, local maxima are also replicated.
- Many hidden variables ⇒ a serious problem.
[Figure: the network H → Y with H hidden.]

Slide 74: MLE from Incomplete Data
- Finding MLE parameters: a nonlinear optimization problem.
Gradient ascent:
- Follow the gradient of the likelihood w.r.t. the parameters
- Add line search and conjugate gradient methods to get fast convergence
Expectation Maximization (EM):
- Use the "current point" to construct an alternative function (which is "nice")
- Guarantee: the maximum of the new function scores better than the current point
Both:
- Find local maxima
- Require multiple restarts to approximate the global maximum
- Require inference computations in each iteration
[Plot: L(θ | D) with local and global maxima.]

Slide 75: Gradient Ascent
- Main result:
  ∂ log L(θ : D) / ∂ θ_{x_i, pa_i} = (1 / θ_{x_i, pa_i}) Σ_m P(x_i, pa_i | o[m], θ)
- Requires computing P(x_i, Pa_i | o[m], θ) for all i, m.
- Pros:
  - Flexible
  - Closely related to methods in neural network training
- Cons:
  - Need to project the gradient onto the space of legal parameters
  - To get reasonable convergence we need to combine it with "smart" optimization techniques

Slide 76: Expectation Maximization (EM)
- A general-purpose method for learning from incomplete data.
Intuition:
- If we had access to counts, then we could estimate the parameters.
- However, missing values do not allow us to perform the counts.
- So "complete" the counts using the current parameter assignment.
[Figure: data with missing X and Y values plus the current model, e.g. P(Y=H | X=T, θ) = 0.4 and P(Y=H | X=H, Z=T, θ) = 0.3, yield expected (fractional) counts N(X,Y), e.g. 1.3, 0.4, 1.7, 1.6.]

Slide 77: EM (cont.)
- Start from an initial network (G, θ_0) and the training data.
- E-step: compute the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H).
- M-step: reparameterize to obtain the updated network (G, θ_1).
- Reiterate.
[Figure: the E-step/M-step loop over the network X1, X2, X3 → H → Y1, Y2, Y3.]

Slide 78: EM (cont.)
Formal guarantees:
- L(θ_1 : D) ≥ L(θ_0 : D)
  - Each iteration improves the likelihood
- If θ_1 = θ_0, then θ_0 is a stationary point of L(θ : D)
  - Usually, this means a local maximum
Main cost:
- Computation of the expected counts in the E-step
- Requires a computation pass for each instance in the training set
  - Exactly the same computations as for gradient ascent!

Slide 79: Example: EM in Clustering
- Consider the clustering example.
- E-step: compute P(C[m] | X1[m], ..., Xn[m], θ)
  - This corresponds to a "soft" assignment to clusters
- M-step: re-estimate P(Xi | C)
  - For each cluster, this is a weighted sum over examples
[Figure: hidden Cluster node with observed children X1, X2, ..., Xn.]
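A bare-bones EM sketch for this AutoClass-style clustering model, under the assumptions of binary attributes and a fixed number of clusters; all names and data are illustrative:

```python
import random

def em_naive_bayes(data, k, iters=50, seed=0):
    """EM for a hidden Cluster variable with conditionally independent
    binary attributes. `data` is a list of tuples of 0/1 values."""
    rng = random.Random(seed)
    n = len(data[0])
    prior = [1.0 / k] * k                               # P(C)
    theta = [[rng.uniform(0.3, 0.7) for _ in range(n)]  # P(X_i = 1 | C)
             for _ in range(k)]
    for _ in range(iters):
        # E-step: soft assignment P(C | x) for every instance.
        resp = []
        for x in data:
            w = []
            for c in range(k):
                p = prior[c]
                for i, xi in enumerate(x):
                    p *= theta[c][i] if xi else 1 - theta[c][i]
                w.append(p)
            z = sum(w)
            resp.append([wc / z for wc in w])
        # M-step: re-estimate parameters from expected counts.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            prior[c] = nc / len(data)
            for i in range(n):
                theta[c][i] = sum(r[c] * x[i] for r, x in zip(resp, data)) / nc
    return prior, theta

data = [(1, 1, 0), (1, 1, 1), (0, 0, 1), (0, 0, 0), (1, 1, 0), (0, 1, 0)]
print(em_naive_bayes(data, k=2))
```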

Slide 80: EM in Practice
Initial parameters:
- Random parameter settings
- "Best" guess from another source
Stopping criteria:
- Small change in the likelihood of the data
- Small change in the parameter values
Avoiding bad local maxima:
- Multiple restarts
- Early "pruning" of unpromising ones

Slide 81: Bayesian Inference with Incomplete Data
Recall Bayesian estimation:
P(x[M+1] | D) = ∫ P(x[M+1] | θ) P(θ | D) dθ
- Complete data: closed-form solution for the integral.
- Incomplete data:
  - No sufficient statistics (except the data itself)
  - The posterior does not decompose
  - No closed-form solution
  ⇒ Need to use approximations

Slide 82: MAP Approximation
- Simplest approximation: MAP (maximum a posteriori) parameters:
  P(x[M+1] | D) ≈ P(x[M+1] | θ_MAP), where θ_MAP = argmax_θ P(θ | D)
Assumption:
- The posterior mass is dominated by the MAP parameters.
Finding MAP parameters:
- Same techniques as finding ML parameters
- Maximize P(θ | D) instead of L(θ : D)

Slide 83: Stochastic Approximations
Stochastic approximation:
- Sample θ_1, ..., θ_k from P(θ | D)
- Approximate P(x[M+1] | D) ≈ (1/k) Σ_i P(x[M+1] | θ_i)

Slide 84: Stochastic Approximations (cont.)
How do we sample from P(θ | D)? Markov chain Monte Carlo (MCMC) methods:
- Find a Markov chain whose stationary distribution is P(θ | D)
- Simulate the chain until it converges to its stationary behavior
- Collect samples from the "stationary" regions
Pros:
- Very flexible: when other methods fail, this one usually works
- The more samples collected, the better the approximation
Cons:
- Can be computationally expensive
- How do we know when we have converged to the stationary distribution?

Slide 85: Stochastic Approximations: Gibbs Sampling
Gibbs sampler: a simple method to construct an MCMC sampling process.
Start:
- Choose (random) values for all unknown variables
Iteration:
- Choose an unknown variable
  - A missing data variable or an unknown parameter
  - Either a random choice or round-robin visits
- Sample a value for the variable given the current values of all other variables
[Figure: plate model with θ_X, θ_Y|X=H, θ_Y|X=T and instances X[m], Y[m].]
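A small Gibbs-sampler sketch for the two-variable network X → Y used on the earlier slides, alternating between imputing missing X values and resampling the parameters from their Beta posteriors; this is a hypothetical setup, with Beta(1, 1) priors assumed throughout:

```python
import random

def gibbs_xy(data, iters=2000, burn=500, seed=0):
    """Gibbs sampling for X -> Y with some X values missing. Unknowns:
    theta_x = P(X=1), theta_y[x] = P(Y=1 | X=x), and the missing X[m]'s.
    `data` is a list of (x_or_None, y) pairs with 0/1 values."""
    rng = random.Random(seed)
    xs = [x if x is not None else rng.randint(0, 1) for x, _ in data]
    theta_x, theta_y = 0.5, [0.5, 0.5]
    samples = []
    for it in range(iters):
        # Resample each missing X[m] from P(X | Y[m], current parameters).
        for m, (x_obs, y) in enumerate(data):
            if x_obs is None:
                w1 = theta_x * (theta_y[1] if y else 1 - theta_y[1])
                w0 = (1 - theta_x) * (theta_y[0] if y else 1 - theta_y[0])
                xs[m] = 1 if rng.random() < w1 / (w0 + w1) else 0
        # Resample parameters from their Beta posteriors given completed data.
        n1 = sum(xs)
        theta_x = rng.betavariate(1 + n1, 1 + len(xs) - n1)
        for v in (0, 1):
            ny = sum(1 for x, (_, y) in zip(xs, data) if x == v and y == 1)
            nv = sum(1 for x in xs if x == v)
            theta_y[v] = rng.betavariate(1 + ny, 1 + nv - ny)
        if it >= burn:
            samples.append((theta_x, theta_y[0], theta_y[1]))
    return samples

data = [(1, 1), (None, 1), (0, 0), (1, 1), (None, 0), (0, 0), (1, 0), (None, 1)]
s = gibbs_xy(data)
print([sum(col) / len(s) for col in zip(*s)])   # posterior means of the parameters
```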

Slide 86: Parameter Learning from Incomplete Data: Summary
- A nonlinear optimization problem.
- Methods for learning: EM and gradient ascent
  - Exploit inference for learning
Difficulties:
- Exploration of a complex likelihood/posterior
  - More missing data ⇒ many more local maxima
  - Cannot represent the posterior ⇒ must resort to approximations
- Inference
  - The main computational bottleneck for learning
  - Learning large networks ⇒ exact inference is infeasible ⇒ resort to stochastic simulation or approximate inference (e.g., see Jordan's tutorial)

Slide 87: Outline
- Introduction
- Bayesian networks: a review
- Parameter learning: Complete data
- Parameter learning: Incomplete data
» Structure learning: Complete data
  » Scoring metrics
  - Maximizing the score
  - Learning local structure
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion
[Table: known vs. unknown structure × complete vs. incomplete data, with the current setting highlighted.]

Slide 88: Benefits of Learning Structure
- Efficient learning: more accurate models with less data.
  - Compare estimating P(A) and P(B) with estimating the joint P(A,B); the former requires less data!
- Discover structural properties of the domain.
  - Identifying independencies in the domain helps to order events that occur sequentially, and supports sensitivity analysis and inference
- Predict the effect of actions.
  - Involves learning causal relationships among variables (deferred to a later part of the tutorial)

Slide 89: Approaches to Learning Structure
- Constraint based
  - Perform tests of conditional independence
  - Search for a network that is consistent with the observed dependencies and independencies
- Score based
  - Define a score that evaluates how well the (in)dependencies in a structure match the observations
  - Search for a structure that maximizes the score

Slide 90: Constraints versus Scores
- Constraint based
  - Intuitive; follows closely the definition of BNs
  - Separates structure construction from the form of the independence tests
  - Sensitive to errors in individual tests
- Score based
  - Statistically motivated
  - Can make compromises
- Both
  - Consistent: with sufficient amounts of data and computation, they learn the correct structure

Slide 91: Likelihood Score for Structures
First-cut approach: use the likelihood function.
- Recall that the likelihood score for a network structure G and parameters θ_G is L(G, θ_G : D).
- Since we know how to maximize the parameters, from now on we assume
  l(G : D) = max_{θ_G} log L(G, θ_G : D)

Slide 92: Likelihood Score for Structure (cont.)
Rearranging terms:
l(G : D) = M Σ_i I(X_i ; Pa_i) - M Σ_i H(X_i)
where
- H(X) is the entropy of X
- I(X;Y) is the mutual information between X and Y
  - I(X;Y) measures how much "information" each variable provides about the other
  - I(X;Y) ≥ 0
  - I(X;Y) = 0 iff X and Y are independent
  - I(X;Y) = H(X) iff X is totally predictable given Y

Slide 93: Likelihood Score for Structure (cont.)
Good news:
- Intuitive explanation of the likelihood score:
  - The larger the dependency of each variable on its parents, the higher the score
  - Likelihood is a compromise among dependencies, based on their strength
Bad news:
- Adding arcs always helps:
  - I(X;Y) ≤ I(X;{Y,Z})
  - The maximal score is attained by "complete" networks
  - Such networks can overfit the data: the parameters they learn capture the noise in the data

Slide 94: Avoiding Overfitting
A "classic" issue in learning. Standard approaches:
- Restricted hypotheses
  - Limits the overfitting capability of the learner
  - Example: restrict the number of parents or the number of parameters
- Minimum description length
  - Description length measures complexity
  - Choose the model that most compactly describes the training data
- Bayesian methods
  - Average over all possible parameter values
  - Use prior knowledge

Slide 95: Avoiding Overfitting (cont.)
Other approaches include:
- Holdout / cross-validation / leave-one-out
  - Validate generalization on data withheld during training
- Structural risk minimization
  - Penalize hypothesis subclasses based on their VC dimension

Slide 96: Minimum Description Length
Rationale:
- Prefer networks that facilitate compression of the data.
- Compression ⇒ summarization ⇒ generalization.
[Figure: a database is compressed with an encoding scheme based on a learned network and recovered with the corresponding decoding scheme.]

Slide 97: Minimum Description Length (cont.)
- Computing the description length of the data, we get three terms: the # of bits to encode G, the # of bits to encode θ_G, and the # of bits to encode D using (G, θ_G).
- Minimizing this total is equivalent to maximizing
  MDL(G : D) = l(G : D) - (log M / 2) dim(G) - DL(G)
where dim(G) is the number of parameters in G and DL(G) is the length of the structure encoding.

Slide 98: Minimum Description: Complexity Penalization
- The likelihood term is (roughly) linear in M.
- The penalty term is logarithmic in M.
- As we get more data, the penalty for complex structure becomes less harsh.
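A sketch of the resulting penalized family score in its decomposed MDL/BIC-style form, under the simplifying assumption of binary variables; the dataset and names are made up:

```python
import math
from collections import Counter

def family_mdl_score(data, child, parents):
    """Decomposed family score: maximized log-likelihood minus
    (log2 M / 2) * (number of free parameters). Binary-variable sketch."""
    M = len(data)
    joint, par = Counter(), Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(row[child], pa)] += 1
        par[pa] += 1
    loglik = sum(n * math.log2(n / par[pa]) for (x, pa), n in joint.items())
    dim = 2 ** len(parents)     # one free parameter per parent configuration
    return loglik - 0.5 * math.log2(M) * dim

data = [{"E": 0, "B": 0, "A": 0}, {"E": 0, "B": 1, "A": 1},
        {"E": 1, "B": 0, "A": 1}, {"E": 0, "B": 1, "A": 1}]
print(family_mdl_score(data, "A", ["E", "B"]))   # compare candidate parent sets
print(family_mdl_score(data, "A", []))
```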

Slide 99: Minimum Description: Example
Idealized behavior of three networks:
- L(G:D) = -15.12·M, dim(G) = 509
- L(G:D) = -15.70·M, dim(G) = 359
- L(G:D) = -17.04·M, dim(G) = 214
[Plot: Score/M vs. M (0 to 3000) for the three idealized networks.]

Slide 100: Minimum Description: Example (cont.)
Real-data illustration with three networks:
- the "true" alarm network (509 parameters), a simplified network (359 parameters), and a tree (214 parameters)
[Plot: Score/M vs. M (0 to 3000) for the true network, the simplified network, and the tree network.]

Slide 101: Consistency of the MDL Score
The MDL score is consistent:
- As M → ∞, the "true" structure G* maximizes the score (almost surely).
- For sufficiently large M, the maximal scoring structures are equivalent to G*.
Proof (outline):
- Suppose G implies an independence statement not in G*. Then as M → ∞, l(G:D) ≤ l(G*:D) - εM (where ε depends on G), so
  MDL(G*:D) - MDL(G:D) ≥ εM - ((dim(G*) - dim(G))/2) log M → ∞
- Now suppose G* implies an independence statement not in G. Then as M → ∞, l(G:D) → l(G*:D), so
  MDL(G:D) - MDL(G*:D) ≈ -((dim(G) - dim(G*))/2) log M < 0

Slide 102: Bayesian Inference
- Bayesian reasoning: compute the expectation over the unknown G:
  P(x[M+1] | D) = Σ_G P(G | D) P(x[M+1] | D, G)
- Assumption: the G's are mutually exclusive and exhaustive.
- The posterior score is P(G | D) ∝ P(D | G) P(G), where P(G) is the prior over structures and the marginal likelihood is
  P(D | G) = ∫ P(D | G, θ_G) P(θ_G | G) dθ_G
(likelihood × prior over parameters).

Slide 103: Marginal Likelihood: Binomial Case
- Assume we observe a sequence of coin tosses.
- By the chain rule we have:
  P(x[1], ..., x[M]) = Π_m P(x[m] | x[1], ..., x[m-1])
- Recall that
  P(x[m+1] = H | x[1], ..., x[m]) = (N_m^H + α_H) / (m + α_H + α_T)
where N_m^H is the number of heads in the first m examples.

Slide 104: Marginal Likelihood: Binomials (cont.)
We simplify this by using Γ(x + 1) = x Γ(x). Thus
P(D) = [Γ(α) / Γ(α + M)] · [Γ(α_H + N_H) / Γ(α_H)] · [Γ(α_T + N_T) / Γ(α_T)]
where α = α_H + α_T and M = N_H + N_T.
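This closed form is easy to compute stably in log space with the log-gamma function; a small sketch:

```python
from math import lgamma, exp

def log_marginal_binomial(n_h, n_t, a_h=1.0, a_t=1.0):
    """Closed-form log P(D) for a binomial with a Beta/Dirichlet prior:
    log Gamma(a)/Gamma(a+M) + log Gamma(a_h+N_H)/Gamma(a_h)
                            + log Gamma(a_t+N_T)/Gamma(a_t),
    where a = a_h + a_t and M = N_H + N_T."""
    a, M = a_h + a_t, n_h + n_t
    return (lgamma(a) - lgamma(a + M)
            + lgamma(a_h + n_h) - lgamma(a_h)
            + lgamma(a_t + n_t) - lgamma(a_t))

# With a uniform prior, any specific sequence of 4 heads and 1 tail has
# marginal probability 4! * 1! / 6! = 1/30.
print(exp(log_marginal_binomial(4, 1)))   # 0.0333...
```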

Slide 105: Binomial Likelihood: Example
Idealized experiment with P(H) = 0.25.
[Plot: (log P(D))/M vs. M (0 to 50) for MDL and for Dirichlet(.5,.5), Dirichlet(1,1), and Dirichlet(5,5) priors.]

Slide 106: Marginal Likelihood: Example (cont.)
Actual experiment with P(H) = 0.25.
[Plot: (log P(D))/M vs. M (0 to 50) for MDL and for Dirichlet(.5,.5), Dirichlet(1,1), and Dirichlet(5,5) priors.]

Slide 107: Marginal Likelihood: Multinomials
The same argument generalizes to multinomials with a Dirichlet prior:
- P(θ) is Dirichlet with hyperparameters α_1, ..., α_K
- D is a dataset with sufficient statistics N_1, ..., N_K
Then
P(D) = [Γ(Σ_k α_k) / Γ(Σ_k (α_k + N_k))] · Π_k [Γ(α_k + N_k) / Γ(α_k)]

Slide 108: Marginal Likelihood: Bayesian Networks
- The network structure determines the form of the marginal likelihood.
Example data over X and Y (instances 1-7): X = H, T, T, H, T, H, H and Y = H, T, H, H, T, T, H.
Network 1 (X and Y disconnected): two Dirichlet marginal likelihoods,
- P(X[1], ..., X[7])
- P(Y[1], ..., Y[7])
Network 2 (X → Y): three Dirichlet marginal likelihoods,
- P(X[1], ..., X[7])
- P(Y[1], Y[4], Y[6], Y[7])   (the cases with X = H)
- P(Y[2], Y[3], Y[5])   (the cases with X = T)

Slide 109: Marginal Likelihood (cont.)
In general networks, the marginal likelihood has the form:
P(D | G) = Π_i Π_{pa_i} [Γ(α(pa_i)) / Γ(α(pa_i) + N(pa_i))] · Π_{x_i} [Γ(α(x_i, pa_i) + N(x_i, pa_i)) / Γ(α(x_i, pa_i))]
i.e., a product of Dirichlet marginal likelihoods, one for the sequence of values of X_i for each particular value of X_i's parents, where
- N(..) are the counts from the data
- α(..) are the hyperparameters for each family, given G

Slide 110: Priors and the BDe Score
- We need prior counts α(..) for each network structure G.
- This can be a formidable task: there are exponentially many structures...
Possible solution: the BDe prior.
- Use a prior of the form M_0, B_0 = (G_0, θ_0), corresponding to M_0 prior examples distributed according to B_0.
- Set α(x_i, pa_i^G) = M_0 · P(x_i, pa_i^G | G_0, θ_0)
  - Note that pa_i^G are, in general, not the same as the parents of X_i in G_0; we can compute this using standard BN tools.
- This choice also has desirable theoretical properties:
  - Equivalent networks are assigned the same score.

Slide 111: Bayesian Score: Asymptotic Behavior
- The Bayesian score seems quite different from the MDL score.
- However, the two scores are asymptotically equivalent.
Theorem: if the prior P(θ | G) is "well behaved", then
log P(D | G) = l(G : D) - (log M / 2) dim(G) + O(1)
Proof:
- (Simple) Use Stirling's approximation to Γ(·); applies to Bayesian networks with Dirichlet priors.
- (General) Use properties of exponential models and Laplace's method for approximating integrals; applies to Bayesian networks with other parametric families.

Slide 112: Bayesian Score: Asymptotic Behavior
Consequences:
- The Bayesian score is asymptotically equivalent to the MDL score.
  - The terms log P(G) and the description length of G are constant, and thus negligible when M is large.
- The Bayesian score is consistent.
  - Follows immediately from the consistency of the MDL score.
- The observed data eventually overrides the prior information.
  - Assuming that the prior does not assign probability 0 to some parameter settings.

Slide 113: Scores -- Summary
- Likelihood, MDL and (log) BDe all decompose into a sum of family scores:
  Score(G : D) = Σ_i Score(X_i | Pa_i : D)
- BDe requires assessing a prior network. It can naturally incorporate prior knowledge and previous experience.
- Both MDL and BDe are consistent and asymptotically equivalent (up to a constant).
- All three are score equivalent: they assign the same score to equivalent networks.

Slide 114: Outline
- Introduction
- Bayesian networks: a review
- Parameter learning: Complete data
- Parameter learning: Incomplete data
» Structure learning: Complete data
  - Scoring metrics
  » Maximizing the score
  - Learning local structure
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion
[Table: known vs. unknown structure × complete vs. incomplete data, with the current setting highlighted.]

Slide 115: Optimization Problem
Input:
- Training data
- Scoring function (including priors, if needed)
- Set of possible structures
  - Including prior knowledge about structure
Output:
- A network (or networks) that maximize the score.
Key property:
- Decomposability: the score of a network is a sum of terms, one per family.

Slide 116: Learning Trees
- Trees: at most one parent per variable.
- Why trees?
  - Elegant math ⇒ we can solve the optimization problem
  - Sparse parameterization ⇒ avoid overfitting

Slide 117: Learning Trees (cont.)
- Let p(i) denote the parent of X_i, or 0 if X_i has no parents.
- We can write the score as
  Score(G : D) = Σ_{i : p(i) > 0} (Score(X_i | X_p(i)) - Score(X_i)) + Σ_i Score(X_i)
- Score = sum of edge scores + constant: the constant is the score of the "empty" network, and each edge score is the improvement over the "empty" network.

Slide 118: Learning Trees (cont.)
Algorithm:
- Construct a graph with vertices 1, 2, ...
- Set w(i→j) = Score(X_j | X_i) - Score(X_j)
- Find the tree (or forest) with maximal weight.
  - This can be done in low-order polynomial time by building the tree greedily with a standard algorithm (Kruskal's maximum spanning tree algorithm).
Theorem: this procedure finds the tree with maximal score.
When the score is the likelihood, w(i→j) is proportional to I(X_i ; X_j); this is known as the Chow & Liu method.
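A compact sketch of the Chow & Liu procedure: empirical mutual information as edge weights plus a maximum-weight spanning tree (Kruskal with union-find). The toy data is made up:

```python
import math
from collections import Counter
from itertools import combinations

def chow_liu_edges(data):
    """Return the edges of a maximum-weight spanning tree, where each pair
    (i, j) is weighted by empirical mutual information I(X_i ; X_j).
    `data` is a list of tuples of discrete values."""
    M, n = len(data), len(data[0])
    def mi(i, j):
        pij = Counter((row[i], row[j]) for row in data)
        pi = Counter(row[i] for row in data)
        pj = Counter(row[j] for row in data)
        return sum((nij / M) * math.log2(nij * M / (pi[a] * pj[b]))
                   for (a, b), nij in pij.items())
    parent = list(range(n))                 # union-find forest
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in sorted(((mi(i, j), i, j)
                           for i, j in combinations(range(n), 2)), reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:                        # adding the edge keeps it acyclic
            parent[ri] = rj
            tree.append((i, j, w))
    return tree                             # undirected: edge direction is arbitrary

data = [(0, 0, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1), (1, 1, 1), (0, 0, 0)]
print(chow_liu_edges(data))                 # (0, 1) has maximal mutual information
```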

Slide 119: Learning Trees: Example
Tree learned from alarm data (green: correct arcs; red: spurious arcs).
[Figure: the learned tree over the 37 alarm variables.]
- Not every edge in the tree is in the original network.
- Tree edge direction is arbitrary: we cannot learn about arc direction.

Slide 120: Beyond Trees
When we consider more complex networks, the problem is not as easy.
- Suppose we allow two parents per variable.
- A greedy algorithm is no longer guaranteed to find the optimal network.
- In fact, no efficient algorithm exists:
Theorem: finding the maximal scoring network structure with at most k parents per variable is NP-hard for k > 1.

Slide 121: Heuristic Search
- We address the problem by using heuristic search.
- Define a search space:
  - nodes are possible structures
  - edges denote adjacency of structures
- Traverse this space looking for high-scoring structures.
Search techniques:
- Greedy hill-climbing
- Best-first search
- Simulated annealing
- ...

Slide 122: Heuristic Search (cont.)
Typical operations on a structure over S, C, E, D:
- Add C → D
- Reverse C → E
- Remove C → E
[Figure: the current network and the three neighboring networks produced by these operations.]

Slide 123: Exploiting Decomposability in Local Search
- Caching: to update the score after a local change, we only need to re-score the families that were changed in the last move.
[Figure: neighboring structures over S, C, E, D sharing all but the changed family scores.]

Slide 124: Greedy Hill-Climbing
The simplest heuristic local search:
- Start with a given network
  - empty network
  - best tree
  - a random network
- At each iteration:
  - Evaluate all possible changes
  - Apply the change that leads to the best improvement in score
  - Reiterate
- Stop when no modification improves the score.
- Each step requires evaluating approximately n new changes.

Slide 125: Greedy Hill-Climbing (cont.)
Greedy hill-climbing can get stuck in:
- Local maxima: all one-edge changes reduce the score
- Plateaus: some one-edge changes leave the score unchanged
Both occur in the search space.

Slide 126: Greedy Hill-Climbing (cont.)
To avoid these problems, we can use:
- TABU search
  - Keep a list of the K most recently visited structures
  - Apply the best move that does not lead to a structure in the list
  - This escapes plateaus and local maxima whose "basin" is smaller than K structures
- Random restarts
  - Once stuck, apply some fixed number of random edge changes and restart the search
  - This can escape from the basin of one maximum to another

Slide 127: Greedy Hill-Climbing
Greedy hill-climbing with a TABU list and random restarts on alarm.
[Plot: Score/M vs. step number (0 to 700).]

Slide 128: Other Local Search Heuristics
- Stochastic first-ascent hill-climbing
  - Evaluate possible changes at random
  - Apply the first one that leads "uphill"
  - Stop after a fixed number of "unsuccessful" attempts to change the current candidate
- Simulated annealing
  - Similar idea, but also apply "downhill" changes with a probability that is proportional to the change in score
  - Use a temperature to control the amount of random downhill steps
  - Slowly "cool" the temperature to reach a regime where only strictly uphill moves are performed

Slide 129: I-Equivalence Class Search
So far, we have seen generic search methods...
- Can we exploit the structure of our domain?
Idea:
- Search the space of I-equivalence classes.
- Each I-equivalence class is represented by a PDAG (partially directed acyclic graph): skeleton + v-structures.
Benefits:
- The space of PDAGs has fewer local maxima and plateaus.
- There are fewer PDAGs than DAGs.

Slide 130: I-Equivalence Class Search (cont.)
- Evaluating changes is more expensive.
- These algorithms are more complex to implement.
[Figure: adding the edge Y---Z to a PDAG over X, Y, Z; the new PDAG is converted to a consistent DAG, which is then scored.]

Slide 131: Search and Statistics
- Evaluating the score of a structure requires the corresponding counts (sufficient statistics).
- Significant computation is spent in collecting these counts.
  - Requires a pass over the training data
- Reduce the overhead by caching previously computed counts.
  - Avoid duplicated effort
  - Marginalize counts: N(X,Y) → N(X)
[Figure: Search + Score sits between the training data and a statistics cache.]

Slide 132: Learning in Practice: Time & Statistics
Using greedy hill-climbing on 10000 instances from alarm.
[Plots: cache size vs. seconds, and score vs. seconds.]

Slide 133: Learning in Practice: Alarm Domain
[Plot: KL divergence vs. M (0 to 5000) for true structure/BDe M' = 10, unknown structure/BDe M' = 10, true structure/MDL, and unknown structure/MDL.]

Slide 134: Model Averaging
- Recall that Bayesian analysis started with
  P(x[M+1] | D) = Σ_G P(G | D) P(x[M+1] | D, G)
- This requires us to average over all possible models.

Slide 135: Model Averaging (cont.)
- So far, we focused on a single model:
  - Find the best scoring model
  - Use it to predict the next example
- Implicit assumption: the best scoring model dominates the weighted sum.
- Pros:
  - We get a single structure
  - Allows for efficient use in our tasks
- Cons:
  - We are committing to the independencies of a particular structure
  - Other structures might be as probable given the data

Slide 136: Model Averaging (cont.)
Can we do better?
- Full averaging
  - Sum over all structures
  - Usually intractable: there are exponentially many structures
- Approximate averaging
  - Find the K largest scoring structures
  - Approximate the sum by averaging over their predictions
  - The weight of each structure is determined by the Bayes factor, computed from the actual scores we compute

Slide 137: Search: Summary
- A discrete optimization problem.
- In general, NP-hard.
  - Need to resort to heuristic search
  - In practice, search is relatively fast (~100 vars in ~10 min), thanks to:
    - Decomposability
    - Sufficient statistics
- In some cases, we can reduce the search problem to an easy optimization problem.
  - Example: learning trees

Slide 138: Outline
- Introduction
- Bayesian networks: a review
- Parameter learning: Complete data
- Parameter learning: Incomplete data
» Structure learning: Complete data
  - Scoring metrics
  - Maximizing the score
  » Learning local structure
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion
[Table: known vs. unknown structure × complete vs. incomplete data, with the current setting highlighted.]

139 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-139 Learning Loop: Summary Model selection (greedy approach; a runnable sketch follows below):
  LearnNetwork(G0)
    Gc := G0
    do
      Generate successors S of Gc
      DiffScore = max_{G in S} Score(G) - Score(Gc)
      if DiffScore > 0 then Gc := G* such that Score(G*) is max
    while DiffScore > 0
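A runnable Python rendering of this greedy loop. This is a sketch: the successor and score functions are toy stand-ins to be replaced by real structure operators and a decomposable score.

    def learn_network(g0, successors, score):
        gc = g0
        while True:
            best = max(successors(gc), key=score, default=None)
            if best is None or score(best) <= score(gc):
                return gc                    # local maximum: no improving successor
            gc = best                        # greedily take the best successor

    # Toy stand-ins: "structures" are integers, successors are +/-1 moves,
    # and the score peaks at 7.
    successors = lambda g: [g - 1, g + 1]
    score = lambda g: -(g - 7) ** 2
    print(learn_network(0, successors, score))   # -> 7

In a real learner the score of a successor is computed incrementally, rescoring only the families changed by an arc addition, deletion, or reversal.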

140 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-140 Why Struggle for Accurate Structure u Adding an arc l Increases the number of parameters to be fitted l Wrong assumptions about causality and domain structure u Missing an arc l Cannot be compensated for by accurate fitting of parameters l Also misses causality and domain structure [Figure: the true network over Earthquake, Alarm Set, Sound, Burglary, next to versions with an added arc and a missing arc]

141 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-141 Local and Global Structure u Global structure: the graph; explicitly represents, e.g., P(E|A) = P(E). The full table for P(S=1|A,B,E) has 8 parameters:
  A B E : P(S=1|A,B,E)
  1 1 1 : 0.95
  1 1 0 : 0.95
  1 0 1 : 0.20
  1 0 0 : 0.05
  0 1 1 : 0.001
  0 1 0 : 0.001
  0 0 1 : 0.001
  0 0 0 : 0.001
u Local structure: a decision tree over A, B, E with only 4 parameters (A=0: 0.001; A=1, B=1: 0.95; A=1, B=0, E=1: 0.20; A=1, B=0, E=0: 0.05); explicitly represents P(S|A=0) = P(S|B,E,A=0)

142 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-142 Context Specific Independence u Variable independence: structure of the graph l P(A|B,C) = P(A|C) for all values of A, B, C l Knowledge acquisition, probabilistic inference, and learning u CSI: local structure on the conditional probability distributions l P(A|B,C=c) = P(A|C=c) for all values of A and B: in the context C=c, A is independent of B l Knowledge acquisition, probabilistic inference, and learning
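A minimal sketch of how a tree-structured CPD encodes CSI, using the 4-parameter alarm example from the previous slide (the function name is ours; the numbers are the slide's):

    def p_sound(a, b, e):
        # P(S=1 | A=a, B=b, E=e) with 4 parameters instead of 8.
        if a == 0:
            return 0.001                 # context A=0: S is independent of B and E
        if b == 1:
            return 0.95                  # context A=1, B=1: S is independent of E
        return 0.20 if e == 1 else 0.05  # context A=1, B=0: split on E

    # The full table has 8 rows but only 4 distinct values:
    for a in (0, 1):
        for b in (0, 1):
            for e in (0, 1):
                print(a, b, e, p_sound(a, b, e))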

143 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-143 Effects on learning u Global structure: l Enables decomposability of the score, so search is feasible u Local structure: l Reduces the number of parameters to be fitted, giving better estimates and, in turn, a more accurate global structure!

144 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-144 Local Structure -> More Accurate Global Structure u The score is a balancing act: fitness versus model complexity u Without local structure, adding an arc may imply an exponential increase in the number of parameters to fit, independently of the relation between the variables [Figure: two candidate networks over Earthquake, Alarm Set, Sound, Burglary; the sparser one is the preferred model]

145 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-145 Learning loop with local structure u What is needed: l A Bayesian and/or MDL score for the local structures l A control loop (search) for inducing the local structure u By now we know how to generate both...
  LearnNetwork(G0)
    Gc := G0
    do
      Generate successors S of Gc
      Learn a local structure for the modified families
      DiffScore = max_{G in S} Score(G) - Score(Gc)
      if DiffScore > 0 then Gc := G* such that Score(G*) is max
    while DiffScore > 0

146 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-146 Scores u MDL l Go through the exercise of building the description length, plus maximum likelihood parameter fitting u BDe l Priors: from M' and P'(x_i, pa_i), compute priors for the parameters of any structure l Need two assumptions, besides parameter independence: 1. Semantic coherence 2. Composition

147 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-147 Experimental Evaluation u Set-up l Use Bayesian networks to generate data l Measures of error based on cross-entropy u Results l Tree-based representations learn better networks given the same amount of data: for M > 8k in the alarm experiment, BDe tree-error is 70% of tabular-error and MDL tree-error is 50% of tabular-error l Tree-based representations learn better global structure given the same amount of data: check the cross-entropy of the found structure, with optimal parameter fit, against the generating Bayesian network

148 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-148 Other Types of Local Structure u Graphs u Noisy-or, noisy-max (causal independence) u Regression u Neural nets u Continuous representations, such as Gaussians u Any type of representation that reduces the number of parameters to fit can be used. To “plug in” a different representation, we need the following: l Sufficient statistics l Estimation of parameters l Marginal likelihood
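For instance, a noisy-or CPD needs one parameter per parent plus a leak term, instead of a table exponential in the number of parents. A minimal sketch (hypothetical function name and numbers):

    def noisy_or(q, leak, parent_values):
        # P(X=1 | parents): q[i] is the probability that parent i alone turns X on.
        p_off = 1.0 - leak
        for qi, on in zip(q, parent_values):
            if on:
                p_off *= 1.0 - qi        # each active parent independently fails to fire
        return 1.0 - p_off

    # Three parents with activation probabilities 0.9, 0.5, 0.3 and a 0.01 leak:
    print(noisy_or([0.9, 0.5, 0.3], 0.01, [1, 0, 1]))   # 1 - 0.99*0.1*0.7 = 0.9307

Its sufficient statistics, parameter estimates, and marginal likelihood are what must be supplied to the learning loop in place of the tabular versions.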

149 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-149 Learning with Complete Data: Summary Search for Structure (Score + Search Strategy + Representation) Parameter Learning (Statistical Parameter Fitting)

150 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-150 Outline u Introduction u Bayesian networks: a review u Parameter learning: Complete data u Parameter learning: Incomplete data u Structure learning: Complete data »Application: classification u Learning causal relationships u Structure learning: Incomplete data u Conclusion

151 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-151 The Classification Problem u From a data set describing objects by vectors of features and a class u Find a function F: features -> class to classify a new object [Example data: vectors 1, 2, 3, and 6 are labeled Presence; vectors 4, 5, and 7 are labeled Absence]

152 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-152 Examples u Predicting heart disease l Features: cholesterol, chest pain, angina, age, etc. l Class: {present, absent} u Finding lemons in cars l Features: make, brand, miles per gallon, acceleration, etc. l Class: {normal, lemon} u Digit recognition l Features: matrix of pixel descriptors l Class: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0} u Speech recognition l Features: signal characteristics, language model l Class: {pause/hesitation, retraction}

153 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-153 Approaches u Memory based l Define a distance between samples l Nearest neighbor, support vector machines u Decision surface l Find best partition of the space l CART, decision trees u Generative models l Induce a model and impose a decision rule l Bayesian networks

154 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-154 Generative Models u Bayesian classifiers l Induce a probability distribution describing the data, P(F_1,…,F_n,C) l Impose a decision rule: given a new object with features f_1,…,f_n, choose c = argmax_c P(C = c | f_1,…,f_n) u We have shifted the problem to learning P(F_1,…,F_n,C) u Learn a Bayesian network representation for P(F_1,…,F_n,C)

155 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-155 Optimality of the decision rule Minimizing the error rate... u Let c_i be the true class, and let l_j be the class returned by the classifier. A decision by the classifier is correct if c_i = l_j, and in error if c_i ≠ l_j. u The error incurred by choosing label l_j is the total probability of the other classes, 1 - P(l_j | f_1,…,f_n) u Thus, had we had access to P, we would minimize the error rate by choosing l_i whenever P(l_i | f_1,…,f_n) ≥ P(l_j | f_1,…,f_n) for all j, which is the decision rule for the Bayesian classifier

156 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-156 Advantages of the Generative Model Approach u Output: a ranking over the outcomes, e.g., the likelihood of present vs. absent u Explanation: what is the profile of a “typical” person with a heart disease? u Missing values: handled both in training and in testing u Value of information: if the person has high cholesterol and blood sugar, which other test should be conducted? u Validation: confidence measures over the model and its parameters u Background knowledge: priors and structure

157 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-157 Advantages of Using a Bayesian Network u Efficiency in learning and query answering l Combine knowledge engineering and statistical induction l Algorithms for decision making, value of information, diagnosis and repair [Figure: heart-disease network over Age, Sex, ChestPain, RestBP, Cholesterol, BloodSugar, ECG, MaxHeartRate, Angina, OldPeak, STSlope, Vessels, Thal, and Outcome; accuracy = 85%; data source: UCI repository]

158 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-158 The Naïve Bayesian Classifier u Fixed structure encoding the assumption that features are independent of each other given the class u Learning amounts to estimating the parameters of P(F_i|C) for each F_i [Figure: class node C with an arc to each feature F_1,…,F_6 (pregnant, age, insulin, dpf, mass, glucose); diabetes in Pima Indians, from the UCI repository]

159 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-159 The Naïve Bayesian Classifier (cont.) u Common practice is to estimate the parameters from the counts, P(f_i | c) = N(f_i, c) / N(c) u These estimates are identical to the MLE for multinomials u The estimates are robust, consisting of low-order statistics that require few instances u Has proven to be a powerful classifier
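A self-contained sketch of training and prediction. This is our own illustration: the function names are hypothetical, and we add a Dirichlet-style smoothing constant alpha, which the plain MLE above omits.

    import math
    from collections import Counter, defaultdict

    def train_nb(rows, labels, alpha=1.0):
        # Estimate P(C) and each P(F_i|C) from counts, with add-alpha smoothing.
        n_feats = len(rows[0])
        class_counts = Counter(labels)
        feat_counts = [defaultdict(Counter) for _ in range(n_feats)]  # i -> c -> value -> N
        feat_values = [set() for _ in range(n_feats)]
        for row, c in zip(rows, labels):
            for i, v in enumerate(row):
                feat_counts[i][c][v] += 1
                feat_values[i].add(v)
        return class_counts, feat_counts, feat_values, alpha

    def predict_nb(model, row):
        # argmax_c P(c) * prod_i P(f_i | c), computed in log space.
        class_counts, feat_counts, feat_values, alpha = model
        total = sum(class_counts.values())

        def log_post(c):
            lp = math.log(class_counts[c] / total)
            for i, v in enumerate(row):
                n_cv = feat_counts[i][c][v]
                n_c = sum(feat_counts[i][c].values())
                lp += math.log((n_cv + alpha) / (n_c + alpha * len(feat_values[i])))
            return lp

        return max(class_counts, key=log_post)

    rows = [(1, 0), (1, 1), (0, 1), (0, 0)]
    labels = ['pos', 'pos', 'neg', 'neg']
    print(predict_nb(train_nb(rows, labels), (1, 0)))   # -> 'pos'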

160 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-160 Improving Naïve Bayes u Naïve Bayes encodes assumptions of independence that may be unreasonable: are pregnancy and age independent given diabetes? l Problem: the same evidence may be incorporated multiple times u The success of naïve Bayes is attributed to l Robust estimation l Decisions may be correct even if the probabilities are inaccurate u Idea: improve on naïve Bayes by weakening the independence assumptions l Bayesian networks provide the appropriate mathematical language for this task

161 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-161 Tree Augmented Naïve Bayes (TAN) u Approximate the dependence among features with a tree Bayes net (a sketch of the tree-induction step follows below) u Tree induction algorithm l Optimality: maximum likelihood tree l Efficiency: polynomial algorithm u Robust parameter estimation [Figure: class node C points to every feature; tree edges link pregnant, age, insulin, dpf, mass, glucose]
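The tree-induction step is the Chow-Liu construction applied conditionally on the class: weight each feature pair by the conditional mutual information I(F_i; F_j | C) estimated from counts, and keep a maximum-weight spanning tree. A sketch under those assumptions, on toy data, using Prim's algorithm for the tree:

    import math
    from collections import Counter

    def cond_mi(data, i, j, c_idx):
        # Empirical conditional mutual information I(F_i; F_j | C).
        n = len(data)
        nijc = Counter((r[i], r[j], r[c_idx]) for r in data)
        nic = Counter((r[i], r[c_idx]) for r in data)
        njc = Counter((r[j], r[c_idx]) for r in data)
        nc = Counter(r[c_idx] for r in data)
        return sum((cnt / n) * math.log(cnt * nc[vc] / (nic[(vi, vc)] * njc[(vj, vc)]))
                   for (vi, vj, vc), cnt in nijc.items())

    def max_spanning_tree(n_feats, weight):
        # Prim's algorithm over the complete graph on features 0..n_feats-1.
        in_tree, edges = {0}, []
        while len(in_tree) < n_feats:
            u, v = max(((a, b) for a in in_tree
                        for b in range(n_feats) if b not in in_tree),
                       key=lambda e: weight[e])
            in_tree.add(v)
            edges.append((u, v))      # feature v gets feature u as its tree parent
        return edges

    # Toy rows (F0, F1, F2, C): F1 mirrors F0 within each class, F2 is noise.
    data = [(0, 0, 0, 0), (0, 0, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0),
            (0, 0, 0, 1), (0, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 1)]
    w = {(a, b): cond_mi(data, a, b, 3)
         for a in range(3) for b in range(3) if a != b}
    print(max_spanning_tree(3, w))    # the edge (0, 1) carries all the weight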

162 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-162 Evaluating the performance of a classifier: n-fold cross validation u Partition the data set into n segments D_1,…,D_n u Do n times: l Train the classifier on all segments but one l Test accuracy on the held-out segment u Compute statistics on the n runs: mean accuracy and variance u Accuracy on test data of size m: Acc = (number of correctly classified test instances) / m (see the sketch below)
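A minimal sketch of the procedure for any classifier exposing train(rows, labels) and predict(model, row) functions (a hypothetical interface; the dummy classifier below is only for illustration):

    def cross_validate(rows, labels, train, predict, n=5):
        folds = [list(range(i, len(rows), n)) for i in range(n)]   # n disjoint segments
        accs = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_rows = [r for i, r in enumerate(rows) if i not in held_out]
            train_labels = [l for i, l in enumerate(labels) if i not in held_out]
            model = train(train_rows, train_labels)
            correct = sum(predict(model, rows[i]) == labels[i] for i in test_idx)
            accs.append(correct / len(test_idx))
        mean = sum(accs) / n
        var = sum((a - mean) ** 2 for a in accs) / n
        return mean, var

    # Dummy classifier: always predict the majority training label.
    from collections import Counter
    train = lambda rows, labels: Counter(labels).most_common(1)[0][0]
    predict = lambda model, row: model
    rows = [(0,), (1,), (0,), (1,), (0,), (1,)]
    labels = ['a', 'a', 'a', 'b', 'b', 'b']
    print(cross_validate(rows, labels, train, predict, n=3))

Plugging in a real learner, such as the naïve Bayes sketch above, would yield mean-accuracy estimates of the kind plotted on the following slides.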

163 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-163 Performance: TAN vs. Naïve Bayes u 25 data sets from the UCI repository: medical, signal processing, financial, games u Accuracy based on 5-fold cross-validation u No parameter tuning [Scatter plot: accuracy of TAN vs. naïve Bayes on each data set, 65-100%]

164 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-164 Performance: TAN vs. C4.5 u 25 data sets from the UCI repository: medical, signal processing, financial, games u Accuracy based on 5-fold cross-validation u No parameter tuning [Scatter plot: accuracy of TAN vs. C4.5 on each data set, 65-100%]

165 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-165 Beyond TAN u Can we do better by learning a more flexible structure? u Experiment: learn a Bayesian network without restrictions on the structure

166 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-166 Performance: TAN vs. Bayesian Networks u 25 data sets from the UCI repository: medical, signal processing, financial, games u Accuracy based on 5-fold cross-validation u No parameter tuning [Scatter plot: accuracy of TAN vs. unrestricted Bayesian networks on each data set, 65-100%]

167 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-167 What is the problem? u Objective function l Learning of arbitrary Bayesian networks optimizes the joint P(C, F_1,...,F_n) l It may learn a network that does a great job on P(F_i,...,F_j) but a poor job on P(C | F_1,...,F_n) (given enough data... no problem...) l We want to optimize classification accuracy, or at least the conditional likelihood P(C | F_1,...,F_n) l Scores based on this likelihood do not decompose, so learning is computationally expensive! l There is controversy as to the correct form for these scores u Naïve Bayes, TAN, etc. circumvent the problem by forcing a structure in which all features are connected to the class

168 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-168 Classification: Summary u Bayesian networks provide a useful language to improve Bayesian classifiers l Lesson: we need to be aware of the task at hand, the amount of training data vs. the dimensionality of the problem, etc. u Additional benefits l Missing values l Compute the tradeoffs involved in finding out feature values l Compute misclassification costs u Recent progress: l Combine generative probabilistic models, such as Bayesian networks, with decision-surface approaches such as support vector machines

169 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-169 Outline u Introduction u Bayesian networks: a review u Parameter learning: Complete data u Parameter learning: Incomplete data u Structure learning: Complete data u Application: classification »Learning causal relationships l Causality and Bayesian networks l Constraint-based approach l Bayesian approach u Structure learning: Incomplete data u Conclusion

170 MP1-170 Learning Causal Relations (Thanks to David Heckerman and Peter Spirtes for the slides) u Does smoking cause cancer? u Does ingestion of lead paint decrease IQ? u Do school vouchers improve education? u Do Microsoft business practices harm customers?

171 MP1-171 Causal Discovery by Experiment [Diagram: randomize subjects into smoke and don't-smoke groups, then measure the rate of lung cancer in each] Can we discover causality from observational data alone?

172 MP1-172 What is “Cause” Anyway? Probabilistic question: What is p( lung cancer | yellow fingers ) ? Causal question: What is p( lung cancer | set(yellow fingers) ) ?

173 MP1-173 Probabilistic vs. Causal Models Probabilistic question: What is p( lung cancer | yellow fingers ) ? Causal question: What is p( lung cancer | set(yellow fingers) ) ? [Diagrams: two models over Smoking, Yellow Fingers, Lung Cancer, one read probabilistically and one read causally]

174 MP1-174 To Predict the Effects of Actions: Modify the Causal Graph [Diagram: setting yellow fingers removes the arc from Smoking to Yellow Fingers] p( lung cancer | set(yellow fingers) ) = p(lung cancer)
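A worked numeric sketch of the graph-surgery rule, with all numbers hypothetical: in the model Smoking -> Yellow Fingers and Smoking -> Lung Cancer, observing yellow fingers changes our belief about cancer, while setting yellow fingers does not.

    p_smoke = 0.3
    p_yellow = {True: 0.9, False: 0.1}    # P(yellow | smoking)
    p_cancer = {True: 0.2, False: 0.01}   # P(cancer | smoking)

    # Observational: P(cancer | yellow), reasoning through Smoking by Bayes rule.
    p_y = p_smoke * p_yellow[True] + (1 - p_smoke) * p_yellow[False]
    p_s_given_y = p_smoke * p_yellow[True] / p_y
    p_cancer_given_yellow = (p_s_given_y * p_cancer[True]
                             + (1 - p_s_given_y) * p_cancer[False])

    # Interventional: set(yellow) removes the arc Smoking -> Yellow Fingers,
    # so P(cancer | set(yellow)) = P(cancer).
    p_cancer_marginal = p_smoke * p_cancer[True] + (1 - p_smoke) * p_cancer[False]

    print(p_cancer_given_yellow)   # about 0.161: the observation is informative
    print(p_cancer_marginal)       # 0.067: the intervention leaves cancer unchanged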

175 MP1-175 Causal Model [Diagram: a node X_i with parents Pa_i; the intervention set(X_i) cuts the arcs from Pa_i into X_i] A definition of causal model … ?

176 MP1-176 Ideal Interventions [Diagram: setting yellow fingers removes the arc from Smoking to Yellow Fingers] u Pearl: ideal intervention is primitive, and defines cause u Spirtes et al.: cause is primitive, and defines ideal intervention u Heckerman and Shachter: from decision theory one could define both

177 MP1-177 How Can We Learn Cause and Effect from Observational Data? An observed dependence between A and B is consistent with several structures: A -> B, B -> A, or a hidden common cause H (A <- H -> B), etc.

178 MP1-178 Learning Cause from Observations: Constraint-Based Approach Bridging assumptions: u Causal Markov assumption u Faithfulness [Diagram: from conditional independencies in the data to causal assertions about Smoking, Yellow Fingers, Lung Cancer]

179 Causal Markov assumption u We can interpret the causal graph as a probabilistic one, i.e.: absence of cause implies conditional independence [Diagram: the causal graph over Smoking, Yellow Fingers, Lung Cancer read as a probabilistic graph]

180 Faithfulness u There are no accidental independencies, i.e.: conditional independence implies absence of cause u E.g., we cannot have both Smoking -> Lung Cancer and I(smoking, lung cancer)

181 MP1-181 Other assumptions u All models under consideration are causal u All models are acyclic

182 MP1-182 All models under consideration are causal u No unexplained correlations [Diagrams: a correlation between coffee consumption and lung cancer is explained either by a direct arc or by a common cause, smoking]

183 Learning Cause from Observations: Constraint-based method u Assumption: these are all the possible models [Diagram: the candidate causal structures over X, Y, Z]

184 Learning Cause from Observations: Constraint-based method u Data: the only independence is I(X,Y)

185 Learning Cause from Observations: Constraint-based method u CMA: absence of cause implies conditional independence, so structures that entail independencies other than the observed I(X,Y) are eliminated

186 Learning Cause from Observations: Constraint-based method u Faithfulness: conditional independence implies absence of cause, so structures that fail to reflect the observed I(X,Y) are eliminated

187 Learning Cause from Observations: Constraint-Based Method u Conclusion: X and Y are causes of Z; the only structure consistent with exactly the independence I(X,Y) is the v-structure X -> Z <- Y

188 MP1-188 Cannot Always Learn Cause u An observed dependence between A and B can also be explained by a hidden common cause (A <- H -> B) or by selection bias (conditioning on a common effect O), etc.

189 MP1-189 But with four (or more) variables… Suppose we observe the independencies and dependencies I(X,Y), I({X,Y}, W | Z), not I(X,Y | Z), etc., that is, those consistent with the structure X -> Z <- Y, Z -> W. Then, in every acyclic causal structure not excluded by CMA and faithfulness, there is a directed path from Z to W: Z causes W.

190 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-190 Constraint-Based Approach u Algorithm based on the systematic application of l Independence tests l Discovery of “Y” and “V” structures u Difficulties: l We need infinite data to learn independence with certainty l What significance level should we use for the independence tests? l Learned structures are susceptible to errors in the independence tests
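The core inference on the three-variable example can be sketched as follows. The "test" here is a thresholded empirical mutual information, a crude stand-in for a proper significance test; choosing the threshold is exactly the significance-level difficulty noted above. We declare the v-structure X -> Z <- Y when X and Y test independent marginally but dependent given Z.

    import math, random
    from collections import Counter

    def mi(pairs):
        # Empirical mutual information of a list of (x, y) samples.
        n = len(pairs)
        pxy = Counter(pairs)
        px = Counter(x for x, _ in pairs)
        py = Counter(y for _, y in pairs)
        return sum((c / n) * math.log(c * n / (px[x] * py[y]))
                   for (x, y), c in pxy.items())

    def independent(xs, ys, zs=None, thresh=0.01):
        if zs is None:
            return mi(list(zip(xs, ys))) < thresh
        # Conditional MI: per-context MI weighted by context frequency.
        total = 0.0
        for z in set(zs):
            idx = [i for i, v in enumerate(zs) if v == z]
            total += (len(idx) / len(zs)) * mi([(xs[i], ys[i]) for i in idx])
        return total < thresh

    random.seed(0)
    X = [random.randint(0, 1) for _ in range(5000)]
    Y = [random.randint(0, 1) for _ in range(5000)]
    Z = [x ^ y for x, y in zip(X, Y)]          # Z = X xor Y: a collider

    if independent(X, Y) and not independent(X, Y, Z):
        print('orient the v-structure X -> Z <- Y')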

191 The Bayesian Approach [Diagram: candidate structures over X, Y, Z with their posterior probabilities given data d] One conclusion: p(X and Y cause Z | d) = 0.01 + 0.8 = 0.81

192 The Bayesian approach [Diagram: the posterior over structures on X, Y, Z given data d]

193 MP1-193 Assumptions u Causal Markov assumption u Faithfulness u All models under consideration are causal u etc. [Diagram: from data to causal assertions about Smoking, Yellow Fingers, Lung Cancer; one of the assumptions is flagged as new]

194 MP1-194 Definition of Model Hypothesis G u The hypothesis corresponding to DAG model G: l m is a causal model l (+CMA) the true distribution has the independencies implied by m [Diagram: the DAG model G over Smoking, Yellow Fingers, Lung Cancer and the causal graphs making up hypothesis G]

195 MP1-195 Faithfulness u If the prior over parameters is a probability density function for every G, then the probability that faithfulness is violated is 0 u Example: DAG model G: X->Y

196 MP1-196 Causes of publishing productivity (Rodgers and Maranto 1989) u Variables: ABILITY = measure of ability (undergraduate); GPQ = graduate program quality; PREPROD = measure of productivity; QFJ = quality of first job; SEX = sex; CITES = citation rate; PUBS = publication rate u Data: 86 men, 76 women

197 MP1-197 Causes of publishing productivity u Assumptions: l No hidden variables l Time ordering: SEX and ABILITY, then GPQ, then PREPROD, then QFJ, then PUBS and CITES l Otherwise uniform distribution on structures l Node likelihood: linear regression on parents

198 MP1-198 Results of Greedy Search... [Diagram: the network found by greedy search over GPQ, ABILITY, QFJ, PREPROD, SEX, CITES, PUBS]

199 MP1-199 Other Models [Diagrams: four networks over GPQ, ABILITY, QFJ, PREPROD, SEX, CITES, PUBS: the Bayesian model, PC (0.1), PC (0.05), and the Rodgers & Maranto model]

200 MP1-200 Bayesian Model Averaging [Diagram: the highest-posterior network, p = 0.31]
  parents of PUBS : prob
  SEX, QFJ, ABILITY, PREPROD : 0.09
  SEX, QFJ, ABILITY : 0.37
  SEX, QFJ, PREPROD : 0.24
  SEX, QFJ : 0.30

201 MP1-201 Challenges for the Bayesian Approach u Efficient selective model averaging / model selection u Hidden variables and selection bias l Prior assessment l Computation of the score (posterior of model) l Structure search u Extend simple discrete and linear models

202 MP1-202 Benefits of the Two Approaches Bayesian approach: u Encode uncertainty in answers u Not susceptible to errors in independence tests Constraint-based approach: u More efficient search u Identify possible hidden variables

203 MP1-203 Summary u The concepts of l Ideal manipulation l Causal Markov and faithfulness assumptions enable us to use Bayesian networks as causal graphs for causal reasoning and causal discovery u Under certain conditions and assumptions, we can discover causal relationships from observational data u The constraint-based and Bayesian approaches have different strengths and weaknesses

204 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-204 Outline u Introduction u Bayesian networks: a review u Parameter learning: Complete data u Parameter learning: Incomplete data u Structure learning: Complete data u Application: classification u Learning causal relationships »Structure learning: Incomplete data u Conclusion [Table: known vs. unknown structure by complete vs. incomplete data]

205 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-205 Learning Structure for Incomplete Data u Distinguish

206 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-206 Learning Structure From Incomplete Data u Simple approaches: perform search over possible network structures. For each candidate structure G: l Find optimal parameters Θ* for G, using non-linear optimization methods: gradient ascent or Expectation Maximization (EM) l Score G using these parameters u There is a body of work on efficient approximations for the score, in particular for the Bayesian score

207 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-207 [Figure: candidate structures G_1,…,G_n, each requiring its own parametric optimization over a parameter space with local maxima]

208 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-208 Problem u Such procedures are computationally expensive! u Computing optimal parameters, per candidate, requires a non-trivial optimization step u We spend non-negligible computation on a candidate, even if it is a low-scoring one u In practice, such learning procedures are feasible only when we consider small sets of candidate structures

209 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-209 [Figure: the structural EM step moves directly between structures G_1,…,G_n, without a full parametric optimization per candidate]

210 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-210 Structural EM u Idea: use parameters found for previous structures to help evaluate new structures u Scope: searching over structures over the same set of random variables u Outline: l Perform search in (structure, parameters) space l Use EM-like iterations, using the previous best solution as a basis for finding either: better scoring parameters (a “parametric” EM step) or a better scoring structure (a “structural” EM step)

211 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-211 Structural EM Step u Assume B0 = (G0, Θ0) is the “current” hypothesis u Goal: maximize the expected score E[Score(B : D+) | D, B0], where D+ denotes completed data sets u Theorem: if E[Score(B : D+) | D, B0] > E[Score(B0 : D+) | D, B0], then Score(B : D) > Score(B0 : D) u This implies that by improving the expected score, we find networks that have a higher objective score u Corollary: repeated applications of this EM step will converge

212 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-212 Structural EM for MDL u For the MDL score, the expected score decomposes like the complete-data score, with expected counts E[N(.) | D, B0] in place of actual counts u Consequence: we can use complete-data methods, where we use expected counts instead of actual counts

213 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-213 [Figure: expected counts such as N(X1), N(X2, X1), N(H, X1, X3), N(Y1, H), N(Y2, Y1, H) are computed from the training data under the current network over X1, X2, X3, H, Y1, Y2, Y3; candidate structures are then scored and parameterized from these expected counts]

214 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-214 Structural EM in Practice u In theory: l E-step: compute expected counts for all candidate structures l M-step: choose the structure that maximizes the expected score u Problem: there are (exponentially) many structures l We cannot compute expected counts for all of them in advance u Solution: l M-step: search over network structures (e.g., hill-climbing) l E-step: on demand, for each structure G examined by the M-step, compute expected counts l Use smart caching schemes to minimize overall computation

215 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-215 The Structural EM Procedure (a code skeleton follows below)
  Input: B0 = (G0, Θ0)
  loop for n = 0, 1, … until convergence
    Improve parameters: Θ'n := Parametric-EM(Gn, Θn); let B'n = (Gn, Θ'n)
    Improve structure: search for a network Bn+1 = (Gn+1, Θn+1) s.t. E[Score(Bn+1 : D) | B'n] > E[Score(B'n : D) | B'n]
u Parametric-EM() can be replaced by gradient ascent, Newton-Raphson methods, or accelerated EM u Early stopping of the parameter-optimization stage avoids “entrenchment” in the current structure
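A high-level Python skeleton of the loop. All components are stubs: parametric_em, neighbors, and expected_score stand for real implementations, with the expected score of a candidate computed from expected counts under the current network, as described above.

    def structural_em(g0, theta0, data, parametric_em, neighbors, expected_score,
                      max_iters=50):
        g, theta = g0, theta0
        for _ in range(max_iters):
            theta = parametric_em(g, theta, data)        # the "parametric" EM step
            # The "structural" EM step: evaluate candidates by their expected
            # score given the current hypothesis (g, theta) and keep the best.
            candidates = [(g, theta)] + list(neighbors(g, theta, data))
            best_g, best_theta = max(
                candidates,
                key=lambda c: expected_score(c[0], c[1], g, theta, data))
            if best_g is g:                              # no structural improvement
                return g, theta
            g, theta = best_g, best_theta
        return g, theta

By the theorem on the Structural EM step above, accepting a candidate only when its expected score beats the current one guarantees that the objective score also improves, which is what drives convergence.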

216 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-216 Structural EM: Convergence Properties u Theorem: the SEM procedure converges in score: the limit of Score(Bn : D) exists u Theorem: the convergence point is a local maximum: if Gn = G infinitely often, then Θ_G, the limit of the parameters in the subsequence with structure G, is a stationary point in the parameter space of G

217 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-217 Outline u Introduction u Bayesian networks: a review u Parameter learning: Complete data u Parameter learning: Incomplete data u Structure learning: Complete data u Application: classification u Learning causal relationships u Structure learning: Incomplete data »Conclusion

218 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-218 Summary: Learning Bayesian Networks u Parameters: statistical estimation u Structure: search / optimization u Incomplete data: EM [Diagram: the inducer combines a structure search, a sufficient-statistics generator, and parameter fitting, mapping data + prior information to a network]

219 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-219 Untouched issues u Feature engineering l From measurements to features u Feature selection l Discovering the relevant features u Smoothing and priors l Not enough data to compute robust statistics u Representing selection bias l All the subjects from the same population

220 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-220 Untouched Issues (Cont.) u Unsupervised learning l Clustering and exploratory data analysis u Incorporating time l Learning DBNs u Sampling, approximate inference and learning l Non-conjugate families, and time constraints u On-line learning, relation to other graphical models

221 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-221 Some Applications u Biostatistics -- Medical Research Council (BUGS) u Data Analysis -- NASA (AutoClass) u Collaborative filtering -- Microsoft (MSBN) u Fraud detection -- AT&T u Classification -- SRI (TAN-BLT) u Speech recognition -- UC Berkeley

222 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-222 Systems u BUGS - Bayesian inference Using Gibbs Sampling l Assumes fixed structure l No restrictions on the distribution families l Relies on Markov chain Monte Carlo methods for inference l www.mrc-bsu.cam.ac.uk/bugs u AutoClass - unsupervised Bayesian classification l Assumes a naïve Bayes structure with a hidden variable at the root representing the classes l Extensive library of distribution families l ack.arc.nasa.gov/ic/projects/bayes-group/group/autoclass/

223 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-223 Systems (Cont.) u MSBN - Microsoft Belief Networks l Learns both parameters and structure, various search methods l Restrictions on the family of distributions l www.research.microsoft.com/dtas/ u TAN-BLT - Tree Augmented Naïve Bayes for supervised classification l Correlations among features restricted to forests of trees l Multinomials, Gaussians, mixtures of Gaussians, and linear Gaussians l www.erg.sri.com/projects/LAS u Many more - look into AUAI, the Web, etc.

224 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-224 Current Topics u Time l Beyond discrete time and beyond fixed rate u Causality l Removing the assumptions u Hidden variables l Where to place them and how many? u Model evaluation and active learning l What parts of the model are suspect and what and how much data is needed?

225 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-225 Perspective: What’s Old and What’s New u Old: Statistics and probability theory l Provide the enabling concepts and tools for parameter estimation and fitting, and for testing the results u New: Representation and exploitation of domain structure l Decomposability Enabling scalability and computation-oriented methods l Discovery of causal relations and statistical independence Enabling explanation generation and interpretability l Prior knowledge Enabling a mixture of knowledge-engineering and induction

226 © 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved. MP1-226 The Future... u Parallel extensions of Bayesian network modeling l More expressive representation languages l Better hybrid (continuous/discrete) models l Increased cross-fertilization with neural networks u Range of applications l Biology - DNA l Control l Financial l Perception u Beyond the current learning model l Feature discovery l Model decisions about the process: distributions, feature selection




