Learning Bayesian Networks from Data
Nir Friedman, U.C. Berkeley
Moises Goldszmidt, SRI International
For current slides, additional material, and reading list see
© 1998, Nir Friedman, U.C. Berkeley, and Moises Goldszmidt, SRI International. All rights reserved.

Outline
» Introduction
- Bayesian networks: a review
- Parameter learning: Complete data
- Parameter learning: Incomplete data
- Structure learning: Complete data
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion

Learning (in this context)
- Process
  - Input: dataset and prior information
  - Output: Bayesian network
- Prior information: background knowledge
  - a Bayesian network (or fragments of it)
  - time ordering
  - prior probabilities
  - ...
[Figure: an Inducer maps Data + Prior information to a Bayesian network over E, R, B, A, C, representing the joint distribution, independence statements, and causality]

Bayesian Networks
Computationally efficient representation of probability distributions via conditional independence
[Figure: the Earthquake/Burglary/Alarm/Radio/Call network, with the conditional probability table P(A | E,B) attached to the Alarm node]

Bayesian Networks
Qualitative part: statistical independence statements (causality!)
- Directed acyclic graph (DAG)
  - Nodes: random variables of interest (exhaustive and mutually exclusive states)
  - Edges: direct (causal) influence
Quantitative part: local probability models, a set of conditional probability distributions
[Figure: the Earthquake/Burglary/Alarm/Radio/Call network with the table P(A | E,B)]

Monitoring Intensive-Care Patients
The "alarm" network: 37 variables, 509 parameters (instead of 2^37)
[Figure: the full ALARM network over variables such as PCWP, CO, HRBP, HREKG, HRSAT, HR, CATECHOL, SAO2, EXPCO2, ARTCO2, VENTALV, VENTLUNG, INTUBATION, PULMEMBOLUS, ANAPHYLAXIS, HYPOVOLEMIA, LVFAILURE, CVP, BP, ...]

Qualitative part
- Nodes are independent of non-descendants given their parents
  - P(R | E=y, A) = P(R | E=y) for all values of R, A, E
  - Given that there is an earthquake, I can predict a radio announcement regardless of whether the alarm sounds
- d-separation: a graph-theoretic criterion for reading off independence statements
  - Can be computed in linear time (in the number of edges)
[Figure: the Earthquake/Burglary/Alarm/Radio/Call network]

d-separation
- Two variables are independent if all paths between them are blocked by evidence
- Three cases:
  - Common cause
  - Intermediate cause
  - Common effect
[Figure: blocked vs. unblocked path patterns for the three cases, illustrated on small networks over E, B, A, C, R]

Example
- I(X,Y|Z) denotes that X and Y are independent given Z
  - I(R,B)
  - ~I(R,B|A)
  - I(R,B|E,A)
  - ~I(R,C|B)
[Figure: the five-node network over E, B, R, A, C]

Quantitative Part
- Associated with each node X_i there is a set of conditional probability distributions P(X_i | Pa_i : θ)
  - If variables are discrete, θ is usually multinomial
  - Variables can be continuous; θ can be a linear Gaussian
  - Combinations of discrete and continuous are constrained only by the available inference mechanisms
[Figure: Earthquake/Burglary/Alarm fragment with the table P(A | E,B)]

What Can We Do with Bayesian Networks?
- Probabilistic inference: belief update
  - P(E=Y | R=Y, C=Y)
- Probabilistic inference: belief revision
  - argmax_{E,B} P(e, b | C=Y)
- Qualitative inference
  - I(R,C | A)
- Complex inference
  - rational decision making (influence diagrams)
  - value of information
  - sensitivity analysis
- Causality (analysis under interventions)
[Figure: the Earthquake/Burglary/Alarm/Radio/Call network]

Bayesian Networks: Summary
- Bayesian networks: an efficient and effective representation of probability distributions
- Efficient:
  - Local models
  - Independence (d-separation)
- Effective: algorithms take advantage of structure to
  - Compute posterior probabilities
  - Compute the most probable instantiation
  - Support decision making
- But there is more: statistical induction → LEARNING

Learning Bayesian networks (reminder)
[Figure: an Inducer maps Data + Prior information to a Bayesian network over E, R, B, A, C together with its conditional probability table P(A | E,B)]

The Learning Problem

Learning Problem
[Figure: from a dataset of (E, B, A) records and a candidate structure with unknown parameters ("?"), the Inducer produces a network over E, B, A with a filled-in conditional probability table P(A | E,B)]

Outline
- Introduction
- Bayesian networks: a review
» Parameter learning: Complete data
  - Statistical parametric fitting
  - Maximum likelihood estimation
  - Bayesian inference
- Parameter learning: Incomplete data
- Structure learning: Complete data
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion
[Figure: 2x2 grid of learning problems: known vs. unknown structure × complete vs. incomplete data]

Example: Binomial Experiment (Statistics 101)
- When tossed, a thumbtack can land in one of two positions: Head or Tail
- We denote by θ the (unknown) probability P(H)
Estimation task:
- Given a sequence of toss samples x[1], x[2], …, x[M] we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ

Statistical parameter fitting
- Consider instances x[1], x[2], …, x[M] such that
  - The set of values that x can take is known
  - Each is sampled from the same distribution
  - Each is sampled independently of the rest
  (i.e., the samples are i.i.d.)
- The task is to find a parameter θ so that the data can be summarized by a probability P(x[j] | θ)
  - The parameters depend on the given family of probability distributions: multinomial, Gaussian, Poisson, etc.
  - We will focus on multinomial distributions
  - The main ideas generalize to other distribution families

The Likelihood Function
- How good is a particular θ? It depends on how likely it makes the observed data:
  L(θ : D) = P(D | θ) = ∏_m P(x[m] | θ)
- Thus, the likelihood for the sequence H, T, T, H, H is
  L(θ : D) = θ · (1-θ) · (1-θ) · θ · θ
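To make this concrete, here is a small Python sketch (not part of the original slides) that evaluates L(θ : D) for the H, T, T, H, H sequence on a grid of θ values; the helper name `likelihood` and the grid resolution are illustrative choices.

```python
# Sketch: evaluating the likelihood L(theta : D) for a sequence of tosses.
def likelihood(theta, data):
    """L(theta : D) = product over tosses of P(x | theta)."""
    p = 1.0
    for x in data:
        p *= theta if x == "H" else (1.0 - theta)
    return p

data = ["H", "T", "T", "H", "H"]                 # the sequence from the slide
grid = [i / 100 for i in range(101)]             # theta = 0.00, 0.01, ..., 1.00
values = [(theta, likelihood(theta, data)) for theta in grid]
best_theta, best_L = max(values, key=lambda tv: tv[1])
print(best_theta, best_L)                        # the curve peaks at theta = 0.6
```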

Sufficient Statistics
- To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):
  L(θ : D) = θ^{N_H} · (1-θ)^{N_T}
- N_H and N_T are sufficient statistics for the binomial distribution
- A sufficient statistic is a function of the data that summarizes the relevant information for computing the likelihood:
  If s(D) = s(D'), then L(θ | D) = L(θ | D')

Maximum Likelihood Estimation
MLE Principle: learn the parameters that maximize the likelihood function
- This is one of the most commonly used estimators in statistics
- Intuitively appealing

Maximum Likelihood Estimation (cont.)
- Consistent
  - The estimate converges to the best possible value as the number of examples grows
- Asymptotically efficient
  - The estimate is as close to the true value as possible given a particular training set
- Representation invariant
  - A transformation of the parameter representation does not change the estimated probability distribution

Example: MLE in Binomial Data
- Applying the MLE principle we get
  θ̂ = N_H / (N_H + N_T)
  (which coincides with what one would expect)
- Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6
[Figure: the likelihood function L(θ : D), peaking at θ = 0.6]

Learning Parameters for the Burglary Story
- i.i.d. samples: L(θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : θ)
- Network factorization: P(E, B, A, C : θ) = P(E : θ_E) · P(B : θ_B) · P(A | E,B : θ_{A|E,B}) · P(C | A : θ_{C|A})
- The likelihood factors accordingly, so we have 4 independent estimation problems
[Figure: the network E → A ← B, A → C]

General Bayesian Networks
We can define the likelihood for data D over a Bayesian network:
  L(θ : D) = ∏_m P(x_1[m], …, x_n[m] : θ)   (i.i.d. samples)
           = ∏_m ∏_i P(x_i[m] | pa_i[m] : θ_i)   (network factorization)
           = ∏_i L_i(θ_i : D)
The likelihood decomposes according to the structure of the network.

General Bayesian Networks (cont.)
Decomposition ⇒ independent estimation problems:
If the parameters for each family are not related, then they can be estimated independently of each other.

From Binomial to Multinomial
- For example, suppose X can take the values 1, 2, …, K
- We want to learn the parameters θ_1, θ_2, …, θ_K
- Sufficient statistics: N_1, N_2, …, N_K, the number of times each outcome is observed
- Likelihood function: L(θ : D) = ∏_k θ_k^{N_k}
- MLE: θ̂_k = N_k / Σ_ℓ N_ℓ
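A minimal sketch of the multinomial MLE (illustrative, not from the slides): count each outcome and normalize.

```python
# Sketch: multinomial MLE, theta_k = N_k / sum_l N_l, from a list of discrete outcomes.
from collections import Counter

def multinomial_mle(samples):
    counts = Counter(samples)                    # sufficient statistics N_1, ..., N_K
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}

print(multinomial_mle([1, 2, 2, 3, 1, 1]))       # {1: 0.5, 2: 0.333..., 3: 0.166...}
```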

Likelihood for Multinomial Networks
- When we assume that P(X_i | Pa_i) is multinomial, we get further decomposition:
  L_i(θ_i : D) = ∏_{pa_i} ∏_{x_i} θ_{x_i|pa_i}^{N(x_i, pa_i)}
- For each value pa_i of the parents of X_i we get an independent multinomial problem
- The MLE is θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
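The same counting idea, applied per family, gives the MLE for a whole multinomial network. The sketch below (hypothetical helper and variable names, tiny made-up dataset) estimates P(A | E, B) from complete data by computing N(x, pa) and N(pa).

```python
# Sketch: MLE for one family P(X | Pa) in a multinomial Bayesian network,
# estimated from complete data by counting N(x, pa) and N(pa).
from collections import Counter

def family_mle(records, child, parents):
    N_xpa, N_pa = Counter(), Counter()
    for r in records:
        pa = tuple(r[p] for p in parents)
        N_xpa[(r[child], pa)] += 1
        N_pa[pa] += 1
    return {(x, pa): n / N_pa[pa] for (x, pa), n in N_xpa.items()}

data = [{"E": 0, "B": 0, "A": 0}, {"E": 0, "B": 0, "A": 0},
        {"E": 0, "B": 1, "A": 1}, {"E": 1, "B": 0, "A": 1},
        {"E": 0, "B": 0, "A": 1}]
print(family_mle(data, "A", ["E", "B"]))         # e.g. P(A=1 | E=0, B=0) = 1/3
```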

Is MLE all we need?
- Suppose that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack
  - Would you bet on heads for the next toss?
- Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin
  - Would you place the same bet?

Bayesian Inference
- MLE commits to a specific value of the unknown parameter(s)
- MLE is the same in both cases
- Confidence in the prediction is clearly different
[Figure: likelihood curves for the coin vs. the thumbtack]

Bayesian Inference (cont.)
Frequentist approach:
- Assumes there is an unknown but fixed parameter θ
- Estimates θ with some confidence
- Predicts using the estimated parameter value
Bayesian approach:
- Represents uncertainty about the unknown parameter
- Uses probability to quantify this uncertainty: unknown parameters are treated as random variables
- Prediction follows from the rules of probability: expectation over the unknown parameters

Bayesian Inference (cont.)
- We can represent our uncertainty about the sampling process using a Bayesian network
  - The observed values of X are independent given θ
  - The conditional probabilities P(x[m] | θ) are the parameters in the model
  - Prediction is now inference in this network
[Figure: θ as a parent of X[1], X[2], …, X[M] (observed data) and of X[M+1] (query)]

Bayesian Inference (cont.)
- Prediction as inference in this network:
  P(x[M+1] | x[1], …, x[M]) = ∫ P(x[M+1] | θ) P(θ | x[1], …, x[M]) dθ
  where, by Bayes' rule,
  P(θ | x[1], …, x[M]) = P(x[1], …, x[M] | θ) P(θ) / P(x[1], …, x[M])
  (posterior = likelihood × prior / probability of the data)
[Figure: θ as a parent of X[1], …, X[M], X[M+1]]

Example: Binomial Data Revisited
- Suppose that we choose a uniform prior P(θ) = 1 for θ in [0,1]
- Then P(θ | D) is proportional to the likelihood L(θ : D)
- With (N_H, N_T) = (4, 1):
  - The MLE for P(X = H) is 4/5 = 0.8
  - The Bayesian prediction is P(X[M+1] = H | D) = ∫ θ P(θ | D) dθ = (N_H + 1) / (N_H + N_T + 2) = 5/7 ≈ 0.714
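As a sanity check on these numbers, the sketch below (illustrative, plain Python) approximates the Bayesian prediction by integrating θ against the normalized likelihood on a grid and compares it with the MLE.

```python
# Sketch: MLE vs. Bayesian prediction under a uniform prior, for (N_H, N_T) = (4, 1).
N_H, N_T = 4, 1
mle = N_H / (N_H + N_T)                                   # 0.8

grid = [(i + 0.5) / 10000 for i in range(10000)]          # theta midpoints in (0, 1)
post = [t ** N_H * (1 - t) ** N_T for t in grid]          # posterior proportional to likelihood
Z = sum(post)
bayes = sum(t * p for t, p in zip(grid, post)) / Z        # posterior mean of theta
print(mle, bayes)                                         # 0.8 vs. approximately 5/7 = 0.714
```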

Bayesian Inference and MLE
- In our example, MLE and Bayesian prediction differ
- But… if the prior is well-behaved (does not assign 0 density to any "feasible" parameter value), then both MLE and Bayesian prediction converge to the same value
- Both converge to the "true" underlying distribution (almost surely)

Dirichlet Priors
- Recall that the likelihood function is L(θ : D) = ∏_k θ_k^{N_k}
- A Dirichlet prior with hyperparameters α_1, …, α_K is defined as
  P(θ) ∝ ∏_k θ_k^{α_k - 1}   for legal θ_1, …, θ_K (θ_k ≥ 0, Σ_k θ_k = 1)
- Then the posterior has the same form, with hyperparameters α_1 + N_1, …, α_K + N_K:
  P(θ | D) ∝ ∏_k θ_k^{α_k + N_k - 1}

Dirichlet Priors (cont.)
- We can compute the prediction on a new event in closed form:
  If P(θ) is Dirichlet with hyperparameters α_1, …, α_K then
  P(X[1] = k) = ∫ θ_k P(θ) dθ = α_k / Σ_ℓ α_ℓ
- Since the posterior is also Dirichlet, we get
  P(X[M+1] = k | D) = (α_k + N_k) / Σ_ℓ (α_ℓ + N_ℓ)
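The closed-form Dirichlet prediction is easy to apply directly; the sketch below (illustrative die-roll data and hyperparameters, hypothetical helper name) implements (α_k + N_k) / Σ_ℓ (α_ℓ + N_ℓ).

```python
# Sketch: closed-form prediction under a Dirichlet prior,
# P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l).
from collections import Counter

def dirichlet_predict(samples, alpha):
    counts = Counter(samples)
    total = sum(alpha.values()) + len(samples)
    return {k: (a + counts.get(k, 0)) / total for k, a in alpha.items()}

rolls = [1, 3, 3, 2, 3, 1]                        # N_1 = 2, N_2 = 1, N_3 = 3
alpha = {1: 2.0, 2: 2.0, 3: 2.0}                  # equivalent sample size 6
print(dirichlet_predict(rolls, alpha))            # {1: 4/12, 2: 3/12, 3: 5/12}
```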

Priors Intuition
- The hyperparameters α_1, …, α_K can be thought of as "imaginary" counts from our prior experience
- Equivalent sample size = α_1 + … + α_K
- The larger the equivalent sample size, the more confident we are in our prior

Effect of Priors
Prediction of P(X=H) after seeing data with N_H = 0.25·N_T, for different sample sizes
[Figure: two panels; left: different strength α_H + α_T with fixed ratio α_H/α_T; right: fixed strength α_H + α_T with different ratio α_H/α_T]

Effect of Priors (cont.)
- In real data, Bayesian estimates are less sensitive to noise in the data
[Figure: P(X=1|D) as a function of N for the MLE and for Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5), and Dirichlet(10,10) priors, shown together with the observed toss results]

Conjugate Families
- The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy
  - The Dirichlet prior is a conjugate family for the multinomial likelihood
- Conjugate families are useful since:
  - For many distributions we can represent the posterior compactly with hyperparameters
  - They allow sequential updating within the same representation
  - In many cases we have a closed-form solution for prediction

Bayesian Networks and Bayesian Prediction
- Priors for each parameter group are independent
- Data instances are independent given the unknown parameters
[Figure: the X → Y network unrolled over instances X[1..M+1], Y[1..M+1] with parameter nodes θ_X and θ_Y|X, shown both explicitly and in plate notation; X[M+1], Y[M+1] form the query]

Bayesian Networks and Bayesian Prediction (cont.)
- We can also "read" from the network: complete data ⇒ posteriors on parameters are independent
[Figure: the same unrolled network / plate model as on the previous slide]

Bayesian Prediction (cont.)
- Since posteriors on parameters for each family are independent, we can compute them separately
- Posteriors for parameters within families are also independent:
  Complete data ⇒ the posteriors on θ_Y|X=0 and θ_Y|X=1 are independent
[Figure: plate model with a single θ_Y|X node, and a refined model in which θ_Y|X=0 and θ_Y|X=1 are separate parameter nodes]

Bayesian Prediction (cont.)
- Given these observations, we can compute the posterior for each multinomial θ_{X_i | pa_i} independently
  - The posterior is Dirichlet with parameters α(X_i=1|pa_i)+N(X_i=1|pa_i), …, α(X_i=k|pa_i)+N(X_i=k|pa_i)
- The predictive distribution is then represented by the parameters
  θ̃_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))
- …which is what we expected! The Bayesian analysis just made the assumptions explicit

Assessing Priors for Bayesian Networks
- We need the α(x_i, pa_i) for each node X_i
- We can use an initial network with parameters θ_0 as prior information
  - We also need an equivalent sample size parameter M_0
  - Then we let α(x_i, pa_i) = M_0 · P(x_i, pa_i | θ_0)
- This allows us to update a network using new data
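A small sketch of this recipe for a single family, assuming a hypothetical two-node prior network X → Y with made-up parameters θ_0 and equivalent sample size M_0:

```python
# Sketch: deriving Dirichlet hyperparameters from a prior network theta_0 and an
# equivalent sample size M_0, via alpha(x, pa) = M_0 * P(x, pa | theta_0).
M0 = 10.0
P_X = {0: 0.7, 1: 0.3}                            # P(X | theta_0), illustrative
P_Y_given_X = {0: {0: 0.9, 1: 0.1},               # P(Y | X, theta_0), illustrative
               1: {0: 0.2, 1: 0.8}}

# Hyperparameters for the family Y | X:
alpha = {(y, x): M0 * P_X[x] * P_Y_given_X[x][y]
         for x in P_X for y in (0, 1)}
print(alpha)                                      # e.g. alpha(Y=0, X=0) = 10 * 0.7 * 0.9 = 6.3
```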

Learning Parameters: Case Study (cont.)
- Experiment:
  - Sample a stream of instances from the alarm network
  - Learn parameters using
    - the MLE estimator
    - a Bayesian estimator with a uniform prior of different strengths

Learning Parameters: Case Study (cont.)
Comparing two distributions, P(x) (true model) vs. Q(x) (learned distribution): measure their KL divergence
  KL(P||Q) = Σ_x P(x) log ( P(x) / Q(x) )
- A KL divergence of 1 (when logs are in base 2) means that, on average, the probability Q assigns to an instance is half the probability P assigns to it
- KL(P||Q) ≥ 0
- KL(P||Q) = 0 iff P and Q are equal
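For reference, a minimal sketch of this comparison for two distributions over a small finite set (the distributions below are illustrative, with base-2 logs as on the slide):

```python
# Sketch: KL divergence KL(P || Q) = sum_x P(x) * log2( P(x) / Q(x) ).
import math

def kl_divergence(P, Q):
    return sum(p * math.log2(p / Q[x]) for x, p in P.items() if p > 0)

P = {"a": 0.5, "b": 0.25, "c": 0.25}              # "true" model
Q = {"a": 0.25, "b": 0.5, "c": 0.25}              # "learned" model
print(kl_divergence(P, P))                        # 0.0
print(kl_divergence(P, Q))                        # 0.25
```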

Learning Parameters: Case Study (cont.)
[Figure: KL divergence to the true alarm network as a function of the number of samples M, for the MLE and for Bayesian estimates with uniform priors of strength M' = 5, 10, 20, 50]

Learning Parameters: Summary
- Estimation relies on sufficient statistics
  - For multinomials these are of the form N(x_i, pa_i)
  - Parameter estimation:
    MLE: θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
    Bayesian (Dirichlet): θ̃_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))
- Bayesian methods also require a choice of priors
- Both MLE and Bayesian estimates are asymptotically equivalent and consistent
- Both can be implemented in an on-line manner by accumulating sufficient statistics

Outline
- Introduction
- Bayesian networks: a review
- Parameter learning: Complete data
» Parameter learning: Incomplete data
- Structure learning: Complete data
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion
[Figure: 2x2 grid of learning problems: known vs. unknown structure × complete vs. incomplete data]

Incomplete Data
Data is often incomplete
- Some variables of interest are not assigned a value
This phenomenon happens when we have
- Missing values
- Hidden variables

Missing Values
- Examples:
  - Survey data
  - Medical records: not all patients undergo all possible tests

Missing Values (cont.)
Complicating issue:
- The fact that a value is missing might be indicative of its value
  - The patient did not undergo an X-ray since she complained about fever and not about broken bones…
To learn from incomplete data we need the following assumption:
- Missing at Random (MAR): the probability that the value of X_i is missing is independent of its actual value, given the other observed values

Missing Values (cont.)
- If the MAR assumption does not hold, we can create new variables that ensure that it does
- We can then predict new examples (with the same pattern of omissions)
- We might not be able to learn about the underlying process
[Figure: a data table over X, Y, Z with missing entries ("?"), and the augmented data with indicator variables Obs-X, Obs-Y, Obs-Z added to the network]

Hidden (Latent) Variables
- Attempt to learn a model with variables we never observe
  - In this case, MAR always holds
- Why should we care about unobserved variables?
[Figure: a network X1, X2, X3 → H → Y1, Y2, Y3 with 17 parameters, vs. the same variables without H requiring 59 parameters]

Hidden Variables (cont.)
- Hidden variables also appear in clustering
- Autoclass model:
  - A hidden variable assigns class labels
  - Observed attributes are independent given the class
[Figure: a naive Bayes clustering model: hidden Cluster node with observed children X1, X2, …, Xn (possibly with missing values)]

Learning Parameters from Incomplete Data
Complete data:
- Independent posteriors for θ_X, θ_Y|X=H and θ_Y|X=T
Incomplete data:
- Posteriors can be interdependent
- Consequence:
  - ML parameters cannot be computed separately for each multinomial
  - The posterior is not a product of independent posteriors
[Figure: plate model with parameter nodes θ_X, θ_Y|X=H, θ_Y|X=T]

Example
- Simple network X → Y:
  - P(X) assumed to be known
  - The likelihood is a function of 2 parameters: P(Y=H|X=H), P(Y=H|X=T)
- Contour plots of the log likelihood for different numbers of missing values of X (M = 8)
[Figure: three contour plots over P(Y=H|X=H) and P(Y=H|X=T), for no missing values, 2 missing values, and 3 missing values]

Learning Parameters from Incomplete Data (cont.)
- In the presence of incomplete data, the likelihood can have multiple global maxima
- Example:
  - We can rename the values of a hidden variable H
  - If H has two values, the likelihood has two global maxima
- Similarly, local maxima are also replicated
- Many hidden variables ⇒ a serious problem
[Figure: a two-node network H → Y with H hidden]

MLE from Incomplete Data
- Finding MLE parameters: a nonlinear optimization problem
- Gradient Ascent:
  - Follow the gradient of the likelihood w.r.t. the parameters
  - Add line search and conjugate gradient methods to get fast convergence
- Expectation Maximization (EM):
  - Use the "current point" to construct an alternative function (which is "nice")
  - Guarantee: the maximum of the new function scores better than the current point
- Both:
  - Find local maxima
  - Require multiple restarts to find an approximation to the global maximum
  - Require computations in each iteration
[Figure: the likelihood surface L(θ | D)]

Gradient Ascent
- Main result:
  ∂ log P(D | θ) / ∂ θ_{x_i, pa_i} = (1 / θ_{x_i, pa_i}) · Σ_m P(x_i, pa_i | o[m], θ)
- Requires computation of P(x_i, Pa_i | o[m], θ) for all i, m
- Pros:
  - Flexible
  - Closely related to methods in neural network training
- Cons:
  - Need to project the gradient onto the space of legal parameters
  - To get reasonable convergence we need to combine it with "smart" optimization techniques

Expectation Maximization (EM)
- A general-purpose method for learning from incomplete data
Intuition:
- If we had access to counts, we could estimate parameters
- However, missing values do not allow us to perform such counts
- "Complete" the counts using the current parameter assignment
[Figure: a data table over X, Y, Z with missing Y values, the current model giving e.g. P(Y=H|X=H,Z=T,θ) = 0.3 and P(Y=H|X=T,θ) = 0.4, and the resulting table of expected counts N(X,Y)]
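The sketch below illustrates this "completing the counts" step on a reduced two-variable version of the slide's example: Y is sometimes missing, and each missing value contributes fractional counts P(y | x, θ_current) instead of a hard count. All numbers are illustrative.

```python
# Sketch: E-step expected counts N(X, Y) when Y may be missing (None).
from collections import defaultdict

theta_Y_given_X = {"H": {"H": 0.3, "T": 0.7},     # current model P(Y | X), illustrative
                   "T": {"H": 0.4, "T": 0.6}}

data = [("H", "T"), ("T", None), ("H", "H"), ("T", "T"), ("H", None)]

expected_N = defaultdict(float)
for x, y in data:
    if y is not None:
        expected_N[(x, y)] += 1.0                 # observed: full count
    else:
        for y_val, p in theta_Y_given_X[x].items():
            expected_N[(x, y_val)] += p           # missing: fractional count
print(dict(expected_N))                           # e.g. N(H, H) = 1 + 0.3 = 1.3
```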

EM (cont.)
[Figure: the EM loop. From the initial network (G, θ_0) and the training data, compute the expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H) (E-step); reparameterize to obtain the updated network (G, θ_1) (M-step); reiterate]

EM (cont.)
Formal guarantees:
- L(θ_1 : D) ≥ L(θ_0 : D)
  - Each iteration improves the likelihood
- If θ_1 = θ_0, then θ_0 is a stationary point of L(θ : D)
  - Usually, this means a local maximum
Main cost:
- Computation of expected counts in the E-step
- Requires a computation pass for each instance in the training set
  - These are exactly the same computations as for gradient ascent!

Example: EM in Clustering
- Consider the clustering (Autoclass) model
- E-step: compute P(C[m] | X_1[m], …, X_n[m], θ)
  - This corresponds to a "soft" assignment of instances to clusters
  - Compute the expected sufficient statistics from these soft assignments
- M-step: re-estimate P(X_i | C) and P(C) from the expected counts
[Figure: the naive Bayes clustering model, Cluster → X1, X2, …, Xn]
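A compact sketch of the full E/M loop for this clustering model, assuming binary attributes, a hidden cluster variable, random initialization, and no smoothing or restarts (all simplifications relative to a real Autoclass-style system):

```python
# Sketch: EM for a naive Bayes clustering model with K clusters and binary attributes.
import random
from math import prod

def em_cluster(data, K, iters=20, seed=0):
    rng = random.Random(seed)
    n = len(data[0])
    pi = [1.0 / K] * K                                     # P(C = k)
    theta = [[rng.uniform(0.3, 0.7) for _ in range(n)]     # P(X_i = 1 | C = k)
             for _ in range(K)]
    for _ in range(iters):
        # E-step: responsibilities P(C = k | x[m], theta) -- "soft" assignments.
        resp = []
        for x in data:
            w = [pi[k] * prod(theta[k][i] if x[i] else 1 - theta[k][i]
                              for i in range(n))
                 for k in range(K)]
            z = sum(w)
            resp.append([wk / z for wk in w])
        # M-step: re-estimate P(C) and P(X_i | C) from the expected counts.
        for k in range(K):
            Nk = sum(r[k] for r in resp)
            pi[k] = Nk / len(data)
            theta[k] = [sum(r[k] * x[i] for r, x in zip(resp, data)) / Nk
                        for i in range(n)]
    return pi, theta

data = [(1, 1, 0), (1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1), (0, 0, 1)]
print(em_cluster(data, K=2))
```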

EM in Practice
Initial parameters:
- Random parameter setting
- "Best" guess from another source
Stopping criteria:
- Small change in the likelihood of the data
- Small change in the parameter values
Avoiding bad local maxima:
- Multiple restarts
- Early "pruning" of unpromising ones
Speed-up:
- Various methods to accelerate convergence

Error on training set (Alarm)
[Figure: training-set error curves; experiment by Bauer, Koller and Singer [UAI97]]

Test set error (Alarm)

Parameter value (Alarm)

Bayesian Inference with Incomplete Data
- Recall Bayesian estimation: P(x[M+1] | D) = ∫ P(x[M+1] | θ) P(θ | D) dθ
- Complete data: closed-form solution for the integral
- Incomplete data:
  - No sufficient statistics (except the data itself)
  - The posterior does not decompose
  - No closed-form solution
  ⇒ Need to use approximations

MAP Approximation
- Simplest approximation: use the MAP parameters
  - MAP: Maximum A-Posteriori probability
  P(x[M+1] | D) ≈ P(x[M+1] | θ*), where θ* = argmax_θ P(θ | D)
- Assumption: the posterior mass is dominated by the MAP parameters
- Finding MAP parameters:
  - Same techniques as finding ML parameters
  - Maximize P(θ | D) instead of L(θ : D)

Stochastic Approximations
Stochastic approximation:
- Sample θ_1, …, θ_k from P(θ | D)
- Approximate
  P(x[M+1] | D) ≈ (1/k) Σ_i P(x[M+1] | θ_i)

Stochastic Approximations (cont.)
How do we sample from P(θ | D)?
Markov Chain Monte Carlo (MCMC) methods:
- Find a Markov chain whose stationary distribution is P(θ | D)
- Simulate the chain until it converges to its stationary behavior
- Collect samples from the "stationary" regions
Pros:
- Very flexible method: when other methods fail, this one usually works
- The more samples collected, the better the approximation
Cons:
- Can be computationally expensive
- How do we know when we have converged to the stationary distribution?

Stochastic Approximations: Gibbs Sampling
Gibbs sampler:
- A simple method to construct an MCMC sampling process
Start:
- Choose (random) values for all unknown variables
Iteration:
- Choose an unknown variable
  - A missing-data variable or an unknown parameter
  - Either by random choice or by round-robin visits
- Sample a value for the variable given the current values of all other variables
[Figure: plate model with parameter nodes θ_X, θ_Y|X=H, θ_Y|X=T]
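The sketch below shows the mechanics on a deliberately simple instance: the thumbtack model with a Beta prior on θ and a few unobserved tosses. It alternates sampling θ given the currently completed data and re-sampling each missing toss given θ; the prior, data, and iteration counts are illustrative.

```python
# Sketch: Gibbs sampling over (theta, missing tosses) for the thumbtack model.
import random

def gibbs(observed, n_missing, alpha_H=1.0, alpha_T=1.0, iters=2000, seed=0):
    rng = random.Random(seed)
    missing = [rng.random() < 0.5 for _ in range(n_missing)]     # initial fill-in
    draws = []
    for _ in range(iters):
        # Sample theta | completed data ~ Beta(alpha_H + N_H, alpha_T + N_T)
        N_H = sum(observed) + sum(missing)
        N_T = len(observed) + len(missing) - N_H
        theta = rng.betavariate(alpha_H + N_H, alpha_T + N_T)
        # Sample each missing toss | theta ~ Bernoulli(theta)
        missing = [rng.random() < theta for _ in range(n_missing)]
        draws.append(theta)
    return draws

observed = [1, 1, 1, 0, 1, 0, 1]                  # observed tosses (1 = heads)
draws = gibbs(observed, n_missing=3)
burned = draws[500:]                              # discard burn-in
print(sum(burned) / len(burned))                  # posterior-mean estimate of theta
```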

Parameter Learning from Incomplete Data: Summary
- Non-linear optimization problem
- Methods for learning: EM and gradient ascent
  - Exploit inference for learning
Difficulties:
- Exploration of a complex likelihood/posterior
  - More missing data ⇒ many more local maxima
  - Cannot represent the posterior ⇒ must resort to approximations
- Inference
  - Main computational bottleneck for learning
  - Learning large networks ⇒ exact inference is infeasible ⇒ resort to stochastic simulation or approximate inference (e.g., see Jordan's tutorial)

Outline
- Introduction
- Bayesian networks: a review
- Parameter learning: Complete data
- Parameter learning: Incomplete data
» Structure learning: Complete data
  - Scoring metrics
  - Maximizing the score
  - Learning local structure
- Application: classification
- Learning causal relationships
- Structure learning: Incomplete data
- Conclusion
[Figure: 2x2 grid of learning problems: known vs. unknown structure × complete vs. incomplete data]

Benefits of Learning Structure
- Efficient learning: more accurate models with less data
  - Compare estimating P(A) and P(B) vs. the joint P(A,B); the former requires less data!
- Discover structural properties of the domain
  - Identifying independencies in the domain helps to order events that occur sequentially, and supports sensitivity analysis and inference
- Predict the effect of actions
  - Involves learning causal relationships among variables (deferred to a later part of the tutorial)

Approaches to Learning Structure
- Constraint-based
  - Perform tests of conditional independence
  - Search for a network that is consistent with the observed dependencies and independencies
- Score-based
  - Define a score that evaluates how well the (in)dependencies in a structure match the observations
  - Search for a structure that maximizes the score

Likelihood Score for Structures
First-cut approach: use the likelihood function
- Recall, the likelihood score for a network structure G and parameters θ is
  L(G, θ : D) = ∏_m P(x[m] : G, θ)
- Since we know how to maximize the parameters, from now on we assume the score uses the maximal-likelihood parameters:
  Score_L(G : D) = max_θ L(G, θ : D) = L(G, θ̂_G : D)

Avoiding Overfitting (cont.)
Other approaches include:
- Holdout / cross-validation / leave-one-out
  - Validate generalization on data withheld during training
- Structural risk minimization
  - Penalize hypothesis subclasses based on their VC dimension

Learning Bayesian Networks
Known graph (learn parameters):
- Complete data: parameter estimation (ML, MAP)
- Incomplete data: non-linear parametric optimization (gradient descent, EM)
Unknown graph (learn graph and parameters):
- Complete data: optimization (search in the space of graphs)
- Incomplete data: structural EM, mixture models
[Figure: an example network over C, S, B, D, X with factorization P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)]

Learning Parameters: complete data
- ML estimate: maximize log P(D | θ), which is decomposable!
  θ̂_{x|pa} = N(x, pa) / N(pa)
- MAP estimate (Bayesian statistics), with conjugate Dirichlet priors:
  θ̃_{x|pa} = (N(x, pa) + α(x, pa)) / (N(pa) + α(pa))
  - N(x, pa): multinomial counts
  - α: equivalent sample size (prior knowledge)
[Figure: a node X with parents C and B]

Learning Parameters: incomplete data
- Non-decomposable marginal likelihood (hidden nodes)
- EM algorithm: iterate until convergence
  - Start from initial parameters (the current model)
  - Expectation: run inference, e.g. P(S | X=0, D=1, C=0, B=1), to fill in expected counts for the incomplete records
  - Maximization: update the parameters (ML, MAP) from the expected counts
[Figure: a data table over S, X, D, C, B with missing entries, and the expected-counts table produced by the E-step]

Learning graph structure
- Find the graph G that maximizes the score: an NP-hard optimization problem
- Heuristic search:
  - Greedy local search: local moves Add S→B, Delete S→B, Reverse S→B
  - Best-first search
  - Simulated annealing
- Complete data: local computations (the score decomposes)
- Incomplete data (score non-decomposable): structural EM
- Constraint-based methods
  - The data impose independence relations (constraints)
[Figure: a current graph over C, S, B and its neighbors obtained by adding, deleting, or reversing the edge S→B]
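A self-contained sketch of greedy local search with a decomposable BIC/MDL-style score, assuming binary variables and complete data; for clarity it rescores the whole network for each candidate move, whereas a real implementation would exploit decomposability and recompute only the affected families. All names and the toy dataset are illustrative.

```python
# Sketch: greedy hill-climbing over structures using Add / Delete / Reverse edge moves
# and a BIC-penalized log-likelihood score (binary variables, complete data).
import math
from collections import Counter
from itertools import product

def family_score(data, child, parents):
    """Maximized log-likelihood of one family minus a BIC penalty."""
    N_xpa, N_pa = Counter(), Counter()
    for r in data:
        pa = tuple(r[p] for p in parents)
        N_xpa[(r[child], pa)] += 1
        N_pa[pa] += 1
    loglik = sum(n * math.log(n / N_pa[pa]) for (x, pa), n in N_xpa.items())
    n_params = 2 ** len(parents)                        # free parameters, binary child
    return loglik - 0.5 * math.log(len(data)) * n_params

def score(data, variables, parents):
    return sum(family_score(data, v, sorted(parents[v])) for v in variables)

def is_acyclic(variables, parents):
    seen, stack = set(), set()
    def visit(v):
        if v in stack: return False
        if v in seen: return True
        stack.add(v)
        ok = all(visit(p) for p in parents[v])
        stack.discard(v); seen.add(v)
        return ok
    return all(visit(v) for v in variables)

def neighbors(variables, parents):
    """All graphs reachable by adding, deleting, or reversing one edge."""
    for x, y in product(variables, repeat=2):
        if x == y:
            continue
        new = {v: set(ps) for v, ps in parents.items()}
        if x in parents[y]:
            new[y].discard(x); yield new                # Delete x -> y
            rev = {v: set(ps) for v, ps in new.items()}
            rev[x].add(y); yield rev                    # Reverse x -> y
        else:
            new[y].add(x); yield new                    # Add x -> y

def greedy_search(data, variables):
    parents = {v: set() for v in variables}             # start from the empty graph
    best = score(data, variables, parents)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(variables, parents):
            if not is_acyclic(variables, cand):
                continue
            s = score(data, variables, cand)
            if s > best + 1e-9:
                parents, best, improved = cand, s, True
                break
    return parents, best

base = [(0, 0)] * 6 + [(0, 1)] * 2 + [(1, 0)] * 2 + [(1, 1)]
data = [{"E": e, "B": b, "A": int(e or b)} for e, b in base * 3]
print(greedy_search(data, ["E", "B", "A"]))             # recovers A's parents {E, B}
```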

Scoring functions: Minimum Description Length (MDL)
- Learning ⇔ data compression:
  MDL(G : D) = DL(Model) + DL(Data | Model) = (log M / 2) · |θ_G| − log L(G, θ̂_G : D)
- Other scores: MDL = −BIC (Bayesian Information Criterion)
- Bayesian score (BDe): asymptotically equivalent to MDL

Summary
- Bayesian networks: graphical probabilistic models
- Efficient representation and inference
- Expert knowledge + learning from data
- Learning:
  - parameters (parameter estimation, EM)
  - structure (optimization with score functions, e.g., MDL)
- Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
- Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.