1 Graphical Models - Learning - Wolfram Burgard, Luc De Raedt, Kristian Kersting, Bernhard Nebel, Albert-Ludwigs University Freiburg, Germany. Advanced I WS 06/07. (Title figure: the ALARM monitoring network, with nodes such as PCWP, CO, HRBP, HREKG, HRSAT, HR, CATECHOL, SAO2, VENTMACH, KINKEDTUBE, INTUBATION, PULMEMBOLUS, ANAPHYLAXIS, LVFAILURE, HYPOVOLEMIA, CVP, BP.) Based on: J. A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models", TR-97-021, U.C. Berkeley, April 1998; G. J. McLachlan, T. Krishnan, "The EM Algorithm and Extensions", John Wiley & Sons, Inc., 1997; D. Koller, course CS-228 handouts, Stanford University, 2001; N. Friedman & D. Koller's NIPS'99 tutorial. Parameter Estimation

2 Bayesian Networks Advanced I WS 06/07 Outline Introduction Reminder: Probability theory Basics of Bayesian Networks Modeling Bayesian networks Inference (VE, Junction tree) [Excursus: Markov Networks] Learning Bayesian networks Relational Models

3 Bayesian Networks Advanced I WS 06/07 What is Learning? Agents are said to learn if they improve their performance over time based on experience. "The problem of understanding intelligence is said to be the greatest problem in science today and 'the' problem for this century – as deciphering the genetic code was for the second half of the last one … the problem of learning represents a gateway to understanding intelligence in man and machines." -- Tomaso Poggio and Steve Smale, 2003. - Learning

4 Bayesian Networks Advanced I WS 06/07 Why bother with learning? Bottleneck of knowledge acquisition: expensive, difficult; normally, no expert is around. Data is cheap! Huge amounts of data are available, e.g. clinical tests, web mining (log files), ... - Learning

5 Bayesian Networks Advanced I WS 06/07 Why Learning Bayesian Networks? Conditional independencies & graphical language capture structure of many real-world distributions Graph structure provides much insight into domain –Allows “knowledge discovery” Learned model can be used for many tasks Supports all the features of probabilistic learning –Model selection criteria –Dealing with missing data & hidden variables

6 Bayesian Networks Advanced I WS 06/07 Learning With Bayesian Networks Data + Prior Info → network over E, B, A with CPT P(A | E, B) (one row of probabilities per joint state of E and B, e.g. .9/.1, .7/.3, .99/.01, .8/.2) - Learning

7 Bayesian Networks Advanced I WS 06/07 Learning With Bayesian Networks Data + Prior Info → Learning algorithm → network over E, B, A with CPT P(A | E, B) (one row of probabilities per joint state of E and B, e.g. .9/.1, .7/.3, .99/.01, .8/.2) - Learning

8 Bayesian Networks Advanced I WS 06/07 What does the data look like? A complete data set: a table with one column per attribute/variable (A1, ..., A6) and one row per data case (X1, X2, ..., XM); every entry is observed (true/false). - Learning

9 Bayesian Networks Advanced I WS 06/07 What does the data look like? An incomplete data set: the same table, but some entries are missing (marked "?"). Real-world data: the states of some random variables are missing, e.g. in medical diagnosis not all patients are subjected to all tests. Also used for parameter reduction, e.g. clustering, ... - Learning

10 Bayesian Networks Advanced I WS 06/07 What does the data look like? Incomplete data set, as before; a "?" entry is a missing value. Real-world data: the states of some random variables are missing, e.g. in medical diagnosis not all patients are subjected to all tests. Also used for parameter reduction, e.g. clustering, ... - Learning

11 Bayesian Networks Advanced I WS 06/07 What does the data look like? Incomplete data set, as before; a variable whose value is never observed is hidden/latent. Real-world data: the states of some random variables are missing, e.g. in medical diagnosis not all patients are subjected to all tests. Also used for parameter reduction, e.g. clustering, ... - Learning

12 Bayesian Networks Advanced I WS 06/07 Hidden variable - Examples 1. Parameter reduction: two networks over X1, X2, X3 and Y1, Y2, Y3, one with a hidden variable H between the X's and the Y's and one without. With H (each Yi depends only on H): 2+2+2+16+4+4+4 = 34 CPT entries; without H (the Y's depend directly on the X's and on each other): 2+2+2+16+32+64 = 118 CPT entries. - Learning
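As a small cross-check of the figure (not from the slides), the sketch below recomputes the two table sizes, assuming all variables are binary and that, without H, each Yi also depends on the preceding Y's:

```python
# Count CPT entries for the two structures in the figure (binary variables assumed).

def cpt_size(n_states, parent_states):
    """Number of entries in a CPT: child states times parent configurations."""
    size = n_states
    for p in parent_states:
        size *= p
    return size

# With hidden variable H: X1, X2, X3 -> H -> Y1, Y2, Y3
with_h = 3 * cpt_size(2, []) + cpt_size(2, [2, 2, 2]) + 3 * cpt_size(2, [2])

# Without H: each Yi depends on X1, X2, X3 and the preceding Y's
without_h = (3 * cpt_size(2, [])
             + cpt_size(2, [2, 2, 2])           # Y1 | X1, X2, X3
             + cpt_size(2, [2, 2, 2, 2])        # Y2 | X1, X2, X3, Y1
             + cpt_size(2, [2, 2, 2, 2, 2]))    # Y3 | X1, X2, X3, Y1, Y2

print(with_h, without_h)  # 34 118
```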

13 Bayesian Networks Advanced I WS 06/07 Hidden variable - Examples 2. Clustering: hidden variables also appear in clustering. AutoClass model: a hidden Cluster variable assigns class labels, and the observed attributes X1, ..., Xn are independent given the class (naive Bayes structure Cluster → X1, ..., Xn; the cluster assignment is hidden). - Learning

14 Bayesian Networks Advanced I WS 06/07 [Slides due to Eamonn Keogh] - Learning

15 Bayesian Networks Advanced I WS 06/07 Iteration 1 The cluster means are randomly assigned [Slides due to Eamonn Keogh] - Learning

16 Bayesian Networks Advanced I WS 06/07 Iteration 2 [Slides due to Eamonn Keogh] - Learning

17 Bayesian Networks Advanced I WS 06/07 Iteration 5 [Slides due to Eamonn Keogh] - Learning

18 Bayesian Networks Advanced I WS 06/07 Iteration 25 [Slides due to Eamonn Keogh] - Learning

19 What is a natural grouping among these objects? [Slides due to Eamonn Keogh]

20 Clustering is subjective. What is a natural grouping among these objects? (Possible groupings of the pictured people: school employees, Simpson's family, males, females.) [Slides due to Eamonn Keogh]

21 Bayesian Networks Advanced I WS 06/07 Learning With Bayesian Networks - overview of the learning problems, by what is given (fixed structure / fixed variables, i.e. structure to be selected / hidden variables) and by how much is observed: fixed structure, fully observed: easiest problem, counting; fixed variables, fully observed: selection of arcs, new domain with no domain expert, data mining; fixed structure, partially observed: numerical, nonlinear optimization, multiple calls to BN inference, difficult for large networks; hidden variables, partially observed: encompasses the difficult subproblems, "only" Structural EM is known, scientific discovery. - Learning

22 Bayesian Networks Advanced I WS 06/07 Parameter Estimation Let D = {x_1, ..., x_M} be a set of data over m random variables; each x_l is called a data case. i.i.d. assumption: all data cases are independently sampled from identical distributions. Find: parameters of the CPDs which match the data best. - Learning
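The notation on this slide did not survive the transcript; a standard reconstruction is:

```latex
% Data over m random variables; each x_l is a data case
D = \{\mathbf{x}_1, \dots, \mathbf{x}_M\}, \qquad
\mathbf{x}_l = (x_l^1, \dots, x_l^m)
% i.i.d. assumption: the data cases are independent samples from P(\cdot \mid \theta),
% so P(D \mid \theta) = \prod_{l=1}^{M} P(\mathbf{x}_l \mid \theta)
```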

23 Bayesian Networks Advanced I WS 06/07 Maximum Likelihood - Parameter Estimation What does „best matching“ mean ? - Learning

24 Bayesian Networks Advanced I WS 06/07 Maximum Likelihood - Parameter Estimation What does "best matching" mean? Find parameters which most likely produced the data. - Learning

25 Bayesian Networks Advanced I WS 06/07 Maximum Likelihood - Parameter Estimation What does „best matching“ mean ? 1.MAP parameters - Learning

26 Bayesian Networks Advanced I WS 06/07 Maximum Likelihood - Parameter Estimation What does „best matching“ mean ? 1.MAP parameters 2.Data is equally likely for all parameters. - Learning

27 Bayesian Networks Advanced I WS 06/07 Maximum Likelihood - Parameter Estimation What does "best matching" mean? 1. MAP parameters 2. Data is equally likely for all parameters 3. All parameters are a priori equally likely - Learning

28 Bayesian Networks Advanced I WS 06/07 Maximum Likelihood - Parameter Estimation What does „best matching“ mean ? Find: ML parameters - Learning

29 Bayesian Networks Advanced I WS 06/07 Maximum Likelihood - Parameter Estimation What does "best matching" mean? Find: ML parameters. Likelihood of the parameters given the data. - Learning

30 Bayesian Networks Advanced I WS 06/07 Maximum Likelihood - Parameter Estimation What does "best matching" mean? Find: ML parameters. Likelihood of the parameters given the data; log-likelihood of the parameters given the data. - Learning
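The likelihood formulas referenced on slides 29 and 30 are missing from the transcript; a standard reconstruction is:

```latex
% Likelihood and log-likelihood of the parameters \theta given data D
L(\theta : D) = P(D \mid \theta) = \prod_{l=1}^{M} P(\mathbf{x}_l \mid \theta),
\qquad
LL(\theta : D) = \sum_{l=1}^{M} \log P(\mathbf{x}_l \mid \theta)
% ML parameters
\theta^{*} = \arg\max_{\theta} L(\theta : D) = \arg\max_{\theta} LL(\theta : D)
```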

31 Bayesian Networks Advanced I WS 06/07 Maximum Likelihood This is one of the most commonly used estimators in statistics. Intuitively appealing. Consistent: the estimate converges to the best possible value as the number of examples grows. Asymptotic efficiency: the estimate is as close to the true value as possible given a particular training set. - Learning

32 Bayesian Networks Advanced I WS 06/07 Learning With Bayesian Networks - overview as before: fixed structure, fully observed: easiest problem, counting; fixed variables, fully observed: selection of arcs, new domain with no domain expert, data mining; fixed structure, partially observed: numerical, nonlinear optimization, multiple calls to BN inference, difficult for large networks; hidden variables, partially observed: encompasses the difficult subproblems, "only" Structural EM is known, scientific discovery. The highlighted case is the one treated next: fixed structure, fully observed. - Learning

33 Bayesian Networks Advanced I WS 06/07 Known Structure, Complete Data E, B, A network. The learning algorithm is given the structure with unknown CPT entries (P(A | E, B) = ?) and the data, and outputs the estimated CPT (e.g. .9/.1, .7/.3, .99/.01, .8/.2). Network structure is specified - the learner needs to estimate the parameters. Data does not contain missing values. - Learning

34 Bayesian Networks Advanced I WS 06/07 ML Parameter Estimation (complete data table over A1, ..., A6, as before) - Learning

35 Bayesian Networks Advanced I WS 06/07 ML Parameter Estimation (complete data table over A1, ..., A6, as before) - Learning (iid)

36 Bayesian Networks Advanced I WS 06/07 ML Parameter Estimation (complete data table over A1, ..., A6, as before) - Learning (iid)

37 Bayesian Networks Advanced I WS 06/07 ML Parameter Estimation (complete data table over A1, ..., A6, as before) - Learning (iid) (BN semantics)

38 Bayesian Networks Advanced I WS 06/07 ML Parameter Estimation (complete data table over A1, ..., A6, as before) - Learning (iid) (BN semantics)

39 Bayesian Networks Advanced I WS 06/07 ML Parameter Estimation (complete data table over A1, ..., A6, as before) - Learning (iid) (BN semantics)

40 Bayesian Networks Advanced I WS 06/07 ML Parameter Estimation (complete data table over A1, ..., A6, as before) - Learning (iid) Only local parameters of the family of Aj are involved (BN semantics)

41 Bayesian Networks Advanced I WS 06/07 ML Parameter Estimation (complete data table over A1, ..., A6, as before) - Learning (iid) Each factor individually! Only local parameters of the family of Aj are involved (BN semantics)

42 Bayesian Networks Advanced I WS 06/07 ML Parameter Estimation (complete data table over A1, ..., A6, as before) - Learning (iid) Each factor individually! Only local parameters of the family of Aj are involved (BN semantics) Decomposability of the likelihood

43 Bayesian Networks Advanced I WS 06/07 Decomposability of Likelihood If the data set is complete (no question marks), we can maximize each local likelihood function independently, and then combine the solutions to get an MLE solution. This is a decomposition of the global problem into independent, local sub-problems, which allows efficient solutions to the MLE problem.
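The derivation built up on slides 34-42 is missing from the transcript; a standard reconstruction of the decomposition is:

```latex
% Likelihood decomposition for a Bayesian network over variables A_1, ..., A_m
% with data cases x_1, ..., x_M (x_l[A_j] denotes the value of A_j in case l)
L(\theta : D)
  = \prod_{l=1}^{M} P(\mathbf{x}_l \mid \theta)                                        % iid
  = \prod_{l=1}^{M} \prod_{j=1}^{m} P\big(x_l[A_j] \mid x_l[Pa_j], \theta_j\big)        % BN semantics
  = \prod_{j=1}^{m} \underbrace{\prod_{l=1}^{M} P\big(x_l[A_j] \mid x_l[Pa_j], \theta_j\big)}_{\text{local likelihood of the family of } A_j}
```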

44 Bayesian Networks Advanced I WS 06/07 Likelihood for Multinomials Random variable V with values 1, ..., K; the likelihood is a product over states, where N_k is the count of state k in the data, subject to the constraint that the parameters sum to one. This constraint implies that the choice of θ_i influences the choice of θ_j (i ≠ j). - Learning
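The likelihood formula itself is missing from the transcript; its standard form is:

```latex
% Multinomial likelihood with counts N_k for state k
L(\theta : D) = \prod_{k=1}^{K} \theta_k^{N_k},
\qquad \text{subject to } \sum_{k=1}^{K} \theta_k = 1,\ \theta_k \ge 0
```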

45 Bayesian Networks Advanced I WS 06/07 Likelihood for Binomials (2 states only) Using the constraint θ_1 + θ_2 = 1: compute the partial derivative of the (log-)likelihood and set it to zero => MLE. - Learning

46 Bayesian Networks Advanced I WS 06/07 Likelihood for Binomials (2 states only) Using the constraint θ_1 + θ_2 = 1: compute the partial derivative of the (log-)likelihood and set it to zero => MLE. In general, for multinomials (>2 states), the MLE is the relative frequency of each state (see below). - Learning
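The derivation and the general formula are missing from the transcript; a standard reconstruction is:

```latex
% Binomial case: substitute \theta_2 = 1 - \theta_1 and maximize
LL(\theta_1) = N_1 \log \theta_1 + N_2 \log(1 - \theta_1)
\quad\Rightarrow\quad
\frac{\partial LL}{\partial \theta_1} = \frac{N_1}{\theta_1} - \frac{N_2}{1 - \theta_1} = 0
\quad\Rightarrow\quad
\hat{\theta}_1 = \frac{N_1}{N_1 + N_2}
% General multinomial case (K states):
\hat{\theta}_k = \frac{N_k}{\sum_{l=1}^{K} N_l}
```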

47 Bayesian Networks Advanced I WS 06/07 Likelihood for Conditional Multinomials One multinomial for each joint state pa of the parents of V; the MLE is obtained by counting within each parent configuration, θ_{v|pa} = N(v, pa) / N(pa). - Learning
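As an illustration (not part of the slides), a minimal Python sketch of complete-data ML estimation of a CPT by counting; the data cases and variable names below are made up:

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """ML estimate of P(child | parents) from complete data cases (dicts)."""
    joint = Counter()   # N(v, pa): joint counts of child value and parent state
    parent = Counter()  # N(pa): counts of each parent state
    for case in data:
        pa = tuple(case[p] for p in parents)
        joint[(case[child], pa)] += 1
        parent[pa] += 1
    # theta_{v|pa} = N(v, pa) / N(pa)
    return {(v, pa): n / parent[pa] for (v, pa), n in joint.items()}

data = [
    {"E": False, "B": True,  "A": True},
    {"E": False, "B": False, "A": False},
    {"E": True,  "B": False, "A": True},
    {"E": False, "B": False, "A": True},
]
print(mle_cpt(data, "A", ["E", "B"]))
```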

48 Bayesian Networks Advanced I WS 06/07 Learning With Bayesian Networks - overview as before: fixed structure, fully observed: easiest problem, counting; fixed variables, fully observed: selection of arcs, new domain with no domain expert, data mining; fixed structure, partially observed: numerical, nonlinear optimization, multiple calls to BN inference, difficult for large networks; hidden variables, partially observed: encompasses the difficult subproblems, "only" Structural EM is known, scientific discovery. The highlighted case is the one treated next: fixed structure, partially observed. - Learning

49 Bayesian Networks Advanced I WS 06/07 Known Structure, Incomplete Data E, B, A network. The learning algorithm is given the structure with unknown CPT entries (P(A | E, B) = ?) and the data, and outputs the estimated CPT (e.g. .9/.1, .7/.3, .99/.01, .8/.2). Network structure is specified. Data contains missing values - we need to consider assignments to the missing values. - Learning

50 Bayesian Networks Advanced I WS 06/07 EM Idea In the case of complete data, ML parameter estimation is easy: simply counting (one iteration). What about incomplete data? 1. Complete the data (imputation): with the most probable value? the average value? ... 2. Count. 3. Iterate. - Learning

51 Bayesian Networks Advanced I WS 06/07 EM Idea: complete the data. Incomplete data set over A and B (network A → B); some values of B are missing ("?"). - Learning

52 Bayesian Networks Advanced I WS 06/07 EM Idea: complete the data. Complete the incomplete data: each case with a missing value is split over the possible completions, weighted by its probability under the current model, which yields expected counts (entries such as 1.5 and 0.5 in the count column N). - Learning

53 Bayesian Networks Advanced I WS 06/07 EM Idea: complete the data. Completed data with expected counts N: A=true, B=true: 1.0; A=true, B=?=true: 0.5; A=true, B=?=false: 0.5; A=false, B=true: 1.0; A=true, B=false: 1.0; A=false, B=?=true: 0.5; A=false, B=?=false: 0.5. - Learning

54 Bayesian Networks Advanced I WS 06/07 EM Idea: complete the data. From the expected counts, maximize: re-estimate the parameters as if the completed data were real. - Learning

55 Bayesian Networks Advanced I WS 06/07 EM Idea: complete the data. Maximize, then iterate: complete the data again with the new parameters. - Learning

56 Bayesian Networks Advanced I WS 06/07 EM Idea: complete the data. Incomplete data set over A and B, as before. - Learning

57 Bayesian Networks Advanced I WS 06/07 EM Idea: complete the data. After the next iteration the expected counts change (entries such as 1.75, 0.25, 1.5). Maximize and iterate. - Learning

58 Bayesian Networks Advanced I WS 06/07 EM Idea: complete the data. Incomplete data set over A and B, as before. - Learning

59 Bayesian Networks Advanced I WS 06/07 EM Idea: complete the data. After a further iteration the expected counts are, e.g., 1.875, 0.125, 1.5: the completions concentrate on the more probable values. Maximize and iterate. - Learning
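A minimal, self-contained sketch of this idea in code (not the slides' exact example or numbers): EM for a two-variable network A → B where some values of B are missing; the data set and the initial parameters are made up for illustration:

```python
# EM for P(B | A) in a tiny network A -> B with some B values missing.
# Each data case is (a, b) with b possibly None.
data = [(True, True), (True, True), (True, None), (False, False), (False, None)]

# initial guess for theta[a] = P(B=True | A=a)
theta = {True: 0.5, False: 0.5}

for iteration in range(10):
    counts = {(a, b): 0.0 for a in (True, False) for b in (True, False)}
    for a, b in data:
        if b is None:
            # E-step: split the case over both completions of B,
            # weighted by the current model P(B | A=a)
            counts[(a, True)] += theta[a]
            counts[(a, False)] += 1.0 - theta[a]
        else:
            counts[(a, b)] += 1.0
    # M-step: re-estimate P(B=True | A=a) from the expected counts
    theta = {a: counts[(a, True)] / (counts[(a, True)] + counts[(a, False)])
             for a in (True, False)}
    print(iteration, theta)
```

With this made-up data, the estimate of P(B=true | A=true) climbs towards 1 and P(B=true | A=false) drops towards 0, mirroring how the expected counts on the slides sharpen from iteration to iteration.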

60 Bayesian Networks Advanced I WS 06/07 Complete-data likelihood The observed, incomplete data (table over A1, ..., A6 with "?" entries) defines the incomplete-data likelihood. Assume that complete data exists, with an associated complete-data likelihood. - Learning
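The two likelihoods referred to here are related as follows (standard reconstruction; Y denotes the observed part of the data and Z the missing part):

```latex
% Complete-data likelihood: P(Y, Z \mid \theta)
% Incomplete-data likelihood: marginalize the complete-data likelihood over Z
L(\theta : Y) = P(Y \mid \theta) = \sum_{Z} P(Y, Z \mid \theta)
```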

61 Bayesian Networks Advanced I WS 06/07 EM Algorithm - Abstract - Learning Expectation Step Maximization Step

62 Bayesian Networks Advanced I WS 06/07 EM Algorithm - Principle (Figure: the incomplete-data likelihood P(Y | θ) as a function of θ.) Expectation Maximization (EM): construct a new function based on the "current point" (which "behaves well"). Property: the maximum of the new function has a better score than the current point. - Learning

63 Bayesian Networks Advanced I WS 06/07 EM Algorithm - Principle (Figure, as before, with the current point θ^k marked.) Expectation Maximization (EM): construct a new function based on the "current point" (which "behaves well"). Property: the maximum of the new function has a better score than the current point. - Learning

64 Bayesian Networks Advanced I WS 06/07 EM Algorithm - Principle (Figure, as before, with the current point θ^k and the new function Q(θ, θ^k).) Expectation Maximization (EM): construct a new function based on the "current point" (which "behaves well"). Property: the maximum of the new function has a better score than the current point. - Learning

65 Bayesian Networks Advanced I WS 06/07 EM Algorithm - Principle (Figure, as before, with the current point θ^k, the new function Q(θ, θ^k), and its maximum θ^{k+1}.) Expectation Maximization (EM): construct a new function based on the "current point" (which "behaves well"). Property: the maximum of the new function has a better score than the current point. - Learning
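The definition of the function Q in the figure is missing from the transcript; in standard EM notation (Y observed, Z missing) it is:

```latex
% E-step: expected complete-data log-likelihood under the current parameters \theta^k
Q(\theta, \theta^k) = \mathbb{E}_{Z \sim P(\,\cdot \mid Y, \theta^k)}\big[\log P(Y, Z \mid \theta)\big]
% M-step: \theta^{k+1} = \arg\max_{\theta} Q(\theta, \theta^k)
% Monotonicity: \log P(Y \mid \theta^{k+1}) \ge \log P(Y \mid \theta^k)
```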

66 Bayesian Networks Advanced I WS 06/07 EM for Multinomials Random variable V with values 1, ..., K, where EN_k are the expected counts of state k in the data; the "MLE" has the same form as in the complete-data case, with expected counts in place of observed counts. - Learning
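The formulas are missing from the transcript; a standard reconstruction is:

```latex
% Expected counts of state k, computed by inference with the current parameters \theta
EN_k = \sum_{l=1}^{M} P(V = k \mid \mathbf{x}_l, \theta)
% "MLE" using expected counts
\hat{\theta}_k = \frac{EN_k}{\sum_{j=1}^{K} EN_j}
```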

67 Bayesian Networks Advanced I WS 06/07 EM for Conditional Multinomials One multinomial for each joint state pa of the parents of V; the "MLE" uses the expected counts EN(v, pa) and EN(pa) in the same way. - Learning

68 Bayesian Networks Advanced I WS 06/07 Learning Parameters: incomplete data Initial parameters → current model. The likelihood is non-decomposable (missing values, hidden nodes). (Figure: a network over S, X, D, C, B and a data table with missing entries.) - Learning

69 Bayesian Networks Advanced I WS 06/07 Learning Parameters: incomplete data Initial parameters → current model; non-decomposable likelihood (missing values, hidden nodes). Expectation: inference, e.g. P(S | X=0, D=1, C=0, B=1), turns the data into expected counts (a completed table over S, X, D, C, B). - Learning

70 Bayesian Networks Advanced I WS 06/07 Learning Parameters: incomplete data Initial parameters → current model; non-decomposable likelihood (missing values, hidden nodes). Expectation: inference, e.g. P(S | X=0, D=1, C=0, B=1), yields expected counts. Maximization: update the parameters (ML, MAP). - Learning

71 Bayesian Networks Advanced I WS 06/07 Learning Parameters: incomplete data EM algorithm: iterate until convergence. Initial parameters → current model; Expectation: inference, e.g. P(S | X=0, D=1, C=0, B=1), yields expected counts; Maximization: update the parameters (ML, MAP). - Learning

72 Bayesian Networks Advanced I WS 06/07 Learning Parameters: incomplete data 1. Initialize the parameters. 2. Compute pseudo (expected) counts for each variable, using the junction tree algorithm for inference. 3. Set the parameters to the (completed-data) ML estimates. 4. If not converged, go to 2. - Learning

73 Bayesian Networks Advanced I WS 06/07 Monotonicity (Dempster, Laird, Rubin '77): the incomplete-data likelihood function is not decreased after an EM iteration. (Discrete) Bayesian networks: for any initial, non-uniform value the EM algorithm converges to a (local or global) maximum. - Learning

74 Bayesian Networks Advanced I WS 06/07 LL on training set (Alarm) Experiment by Bauer, Koller and Singer [UAI97] - Learning

75 Bayesian Networks Advanced I WS 06/07 Parameter value (Alarm) Experiment by Bauer, Koller and Singer [UAI97] - Learning

76 Bayesian Networks Advanced I WS 06/07 EM in Practice Initial parameters: random parameter setting, or a "best" guess from another source. Stopping criteria: small change in the likelihood of the data, or small change in the parameter values. Avoiding bad local maxima: multiple restarts, early "pruning" of unpromising ones. Speed-up: various methods to speed convergence. - Learning

77 Bayesian Networks Advanced I WS 06/07 Gradient Ascent Main result: the gradient of the log-likelihood requires the same BN inference computations as EM. Pros: flexible; closely related to methods in neural network training. Cons: need to project the gradient onto the space of legal parameters; to get reasonable convergence we need to combine it with "smart" optimization techniques. - Learning
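The "main result" formula is missing from the transcript; a standard form of the log-likelihood gradient for a CPT entry θ_{v|pa} = P(V = v | Pa = pa) is:

```latex
\frac{\partial\, LL(\theta : D)}{\partial\, \theta_{v \mid pa}}
  = \sum_{l=1}^{M} \frac{P(V = v, Pa = pa \mid \mathbf{x}_l, \theta)}{\theta_{v \mid pa}}
% each term is a posterior family marginal, i.e. the same quantity
% from which EM's expected counts are built
```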

78 Bayesian Networks Advanced I WS 06/07 Parameter Estimation: Summary Parameter estimation is a basic task for learning with Bayesian networks. Due to missing values it becomes a non-linear optimization problem (EM, gradient, ...). EM for multinomial random variables: fully observed data - counting; partially observed data - pseudo (expected) counts. The junction tree algorithm is used for the many required inference calls. - Learning

