Presentation is loading. Please wait.

Presentation is loading. Please wait.

. Learning Bayesian Networks from Data Nir Friedman Daphne Koller Hebrew U. Stanford.

Similar presentations


Presentation on theme: ". Learning Bayesian Networks from Data Nir Friedman Daphne Koller Hebrew U. Stanford."— Presentation transcript:

1 . Learning Bayesian Networks from Data Nir Friedman Daphne Koller Hebrew U. Stanford

2 2 Overview u Introduction u Parameter Estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data

3 3 Family of Alarm Bayesian Networks Qualitative part: Directed acyclic graph (DAG) u Nodes - random variables u Edges - direct influence Quantitative part: Set of conditional probability distributions 0.90.1 e b e 0.20.8 0.01 0.99 0.90.1 be b b e BE P(A | E,B) Earthquake Radio Burglary Alarm Call Compact representation of probability distributions via conditional independence Together: Define a unique distribution in a factored form

4 4 Example: “ICU Alarm” network Domain: Monitoring Intensive-Care Patients u 37 variables u 509 parameters …instead of 2 54 PCWP CO HRBP HREKG HRSAT ERRCAUTER HR HISTORY CATECHOL SAO2 EXPCO2 ARTCO2 VENTALV VENTLUNG VENITUBE DISCONNECT MINVOLSET VENTMACH KINKEDTUBE INTUBATIONPULMEMBOLUS PAPSHUNT ANAPHYLAXIS MINOVL PVSAT FIO2 PRESS INSUFFANESTHTPR LVFAILURE ERRBLOWOUTPUT STROEVOLUMELVEDVOLUME HYPOVOLEMIA CVP BP

5 5 Inference u Posterior probabilities l Probability of any event given any evidence u Most likely explanation l Scenario that explains evidence u Rational decision making l Maximize expected utility l Value of Information u Effect of intervention Earthquake Radio Burglary Alarm Call Radio Call

6 6 Why learning? Knowledge acquisition bottleneck u Knowledge acquisition is an expensive process u Often we don’t have an expert Data is cheap u Amount of available information growing rapidly u Learning allows us to construct models from raw data

7 7 Why Learn Bayesian Networks? u Conditional independencies & graphical language capture structure of many real-world distributions u Graph structure provides much insight into domain l Allows “knowledge discovery” u Learned model can be used for many tasks u Supports all the features of probabilistic learning l Model selection criteria l Dealing with missing data & hidden variables

8 8 Learning Bayesian networks E R B A C.9.1 e b e.7.3.99.01.8.2 be b b e BEP(A | E,B) Data + Prior Information Learner

9 9 Known Structure, Complete Data E B A.9.1 e b e.7.3.99.01.8.2 be b b e BEP(A | E,B) ?? e b e ?? ? ? ?? be b b e BE E B A u Network structure is specified l Inducer needs to estimate parameters u Data does not contain missing values Learner E, B, A.

10 10 Unknown Structure, Complete Data E B A.9.1 e b e.7.3.99.01.8.2 be b b e BEP(A | E,B) ?? e b e ?? ? ? ?? be b b e BE E B A u Network structure is not specified l Inducer needs to select arcs & estimate parameters u Data does not contain missing values E, B, A. Learner

11 11 Known Structure, Incomplete Data E B A.9.1 e b e.7.3.99.01.8.2 be b b e BEP(A | E,B) ?? e b e ?? ? ? ?? be b b e BE E B A u Network structure is specified u Data contains missing values l Need to consider assignments to missing values E, B, A. Learner

12 12 Unknown Structure, Incomplete Data E B A.9.1 e b e.7.3.99.01.8.2 be b b e BEP(A | E,B) ?? e b e ?? ? ? ?? be b b e BE E B A u Network structure is not specified u Data contains missing values l Need to consider assignments to missing values E, B, A. Learner

13 13 Overview u Introduction u Parameter Estimation l Likelihood function l Bayesian estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data

14 14 Learning Parameters E B A C u Training data has the form:

15 15 Likelihood Function E B A C u Assume i.i.d. samples u Likelihood function is

16 16 Likelihood Function E B A C u By definition of network, we get

17 17 Likelihood Function E B A C u Rewriting terms, we get =

18 18 General Bayesian Networks Generalizing for any Bayesian network: Decomposition  Independent estimation problems

19 19 Likelihood Function: Multinomials u The likelihood for the sequence H,T, T, H, H is 00.20.40.60.81  L(  :D) Probability of k th outcome Count of k th outcome in D General case:

20 20 Bayesian Inference u Represent uncertainty about parameters using a probability distribution over parameters, data u Learning using Bayes rule Posterior Likelihood Prior Probability of data

21 21 Bayesian Inference u Represent Bayesian distribution as Bayes net  The values of X are independent given  u P(x[m] |  ) =  u Bayesian prediction is inference in this network  X[1]X[2]X[m] Observed data

22 22 u Priors for each parameter group are independent u Data instances are independent given the unknown parameters XX X[1]X[2] X[M] X[M+1] Observed data Plate notation Y[1]Y[2] Y[M] Y[M+1]  Y|X XX m X[m] Y[m] Query Bayesian Nets & Bayesian Prediction

23 23 u We can also “read” from the network: Complete data  posteriors on parameters are independent u Can compute posterior over parameters separately! Bayesian Nets & Bayesian Prediction XX X[1]X[2] X[M] Observed data Y[1]Y[2] Y[M]  Y|X

24 24 Learning Parameters: Summary u Estimation relies on sufficient statistics For multinomials: counts N(x i,pa i ) l Parameter estimation u Both are asymptotically equivalent and consistent u Both can be implemented in an on-line manner by accumulating sufficient statistics MLE Bayesian (Dirichlet)

25 25 Overview u Introduction u Parameter Learning u Model Selection l Scoring function l Structure search u Structure Discovery u Incomplete Data u Learning from Structured Data

26 26 Why Struggle for Accurate Structure? u Increases the number of parameters to be estimated u Wrong assumptions about domain structure u Cannot be compensated for by fitting parameters u Wrong assumptions about domain structure EarthquakeAlarm Set Sound Burglary EarthquakeAlarm Set Sound Burglary Earthquake Alarm Set Sound Burglary Adding an arc Missing an arc

27 27 Score­based Learning E, B, A. E B A E B A E B A Search for a structure that maximizes the score Define scoring function that evaluates how well a structure matches the data

28 28 Likelihood Score for Structure  Larger dependence of X i on Pa i  higher score u Adding arcs always helps l I(X; Y)  I(X; {Y,Z}) l Max score attained by fully connected network l Overfitting: A bad idea… Mutual information between X i and its parents

29 29 Max likelihood params Bayesian Score Likelihood score: Bayesian approach: l Deal with uncertainty by assigning probability to all possibilities Likelihood Prior over parameters Marginal Likelihood

30 30 Heuristic Search u Define a search space: l search states are possible structures l operators make small changes to structure u Traverse space looking for high-scoring structures u Search techniques: l Greedy hill-climbing l Best first search l Simulated Annealing l...

31 31 Local Search u Start with a given network l empty network l best tree l a random network u At each iteration l Evaluate all possible changes l Apply change based on score u Stop when no modification improves score

32 32 Heuristic Search u Typical operations: S C E D Reverse C  E Delete C  E Add C  D S C E D S C E D S C E D  score = S({C,E}  D) - S({E}  D) To update score after local change, only re-score families that changed

33 33 Learning in Practice: Alarm domain 0 0.5 1 1.5 2 0500100015002000250030003500400045005000 KL Divergence to true distribution #samples Structure known, fit params Learn both structure & params

34 34 Local Search: Possible Pitfalls u Local search can get stuck in: l Local Maxima:  All one-edge changes reduce the score l Plateaux:  Some one-edge changes leave the score unchanged u Standard heuristics can escape both l Random restarts l TABU search l Simulated annealing

35 35 Improved Search: Weight Annealing u Standard annealing process: l Take bad steps with probability  exp(  score/t) l Probability increases with temperature u Weight annealing: l Take uphill steps relative to perturbed score l Perturbation increases with temperature Score(G|D) G

36 36 Perturbing the Score u Perturb the score by reweighting instances u Each weight sampled from distribution: l Mean = 1 l Variance  temperature u Instances sampled from “original” distribution u … but perturbation changes emphasis Benefit: u Allows global moves in the search space

37 37 Weight Annealing: ICU Alarm network Cumulative performance of 100 runs of annealed structure search Greedy hill-climbing True structure Learned params Annealed search

38 38 Structure Search: Summary u Discrete optimization problem u In some cases, optimization problem is easy l Example: learning trees u In general, NP-Hard l Need to resort to heuristic search l In practice, search is relatively fast (~100 vars in ~2-5 min):  Decomposability  Sufficient statistics l Adding randomness to search is critical

39 39 Overview u Introduction u Parameter Estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data

40 40 Structure Discovery Task: Discover structural properties l Is there a direct connection between X & Y l Does X separate between two “subsystems” l Does X causally effect Y Example: scientific data mining l Disease properties and symptoms l Interactions between the expression of genes

41 41 Discovering Structure u Current practice: model selection l Pick a single high-scoring model l Use that model to infer domain structure E R B A C P(G|D)

42 42 Discovering Structure Problem Small sample size  many high scoring models l Answer based on one model often useless l Want features common to many models E R B A C E R B A C E R B A C E R B A C E R B A C P(G|D)

43 43 Bayesian Approach u Posterior distribution over structures u Estimate probability of features Edge X  Y Path X  …  Y l … Feature of G, e.g., X  Y Indicator function for feature f Bayesian score for G

44 44 MCMC over Networks u Cannot enumerate structures, so sample structures u MCMC Sampling l Define Markov chain over BNs Run chain to get samples from posterior P(G | D) Possible pitfalls: l Huge (superexponential) number of networks l Time for chain to converge to posterior is unknown l Islands of high posterior, connected by low bridges

45 45 ICU Alarm BN: No Mixing u 500 instances: u The runs clearly do not mix MCMC Iteration Score of cuurent sample

46 46 Effects of Non-Mixing u Two MCMC runs over same 500 instances u Probability estimates for edges for two runs Probability estimates highly variable, nonrobust True BN Random start True BN

47 47 Fixed Ordering Suppose that u We know the ordering of variables say, X 1 > X 2 > X 3 > X 4 > … > X n parents for X i must be in X 1,…,X i-1  Limit number of parents per nodes to k Intuition: Order decouples choice of parents  Choice of Pa(X 7 ) does not restrict choice of Pa(X 12 ) Upshot: Can compute efficiently in closed form  Likelihood P(D |  )  Feature probability P(f | D,  ) 2 knlog n networks

48 48 Our Approach: Sample Orderings We can write Sample orderings and approximate u MCMC Sampling l Define Markov chain over orderings Run chain to get samples from posterior P(  | D)

49 49 Mixing with MCMC-Orderings u 4 runs on ICU-Alarm with 500 instances l fewer iterations than MCMC-Nets l approximately same amount of computation Process appears to be mixing! MCMC Iteration Score of cuurent sample

50 50 Mixing of MCMC runs u Two MCMC runs over same instances u Probability estimates for edges 500 instances 1000 instances Probability estimates very robust

51 51 Application to Gene Array Analysis

52 52 Chips and Features

53 53 Application to Gene Array Analysis u See www.cs.tau.ac.il/~nin/BioInfo04www.cs.tau.ac.il/~nin/BioInfo04 u Bayesian Inference : http://www.cs.huji.ac.il/~nir/Papers/FLNP1Full.pdf

54 54 Application: Gene expression Input: u Measurement of gene expression under different conditions l Thousands of genes l Hundreds of experiments Output: u Models of gene interaction l Uncover pathways

55 55 Map of Feature Confidence Yeast data [Hughes et al 2000] u 600 genes u 300 experiments

56 56 “Mating response” Substructure u Automatically constructed sub-network of high- confidence edges u Almost exact reconstruction of yeast mating pathway KAR4 AGA1PRM1 TEC1 SST2 STE6 KSS1 NDJ1 FUS3 AGA2 YEL059W TOM6 FIG1 YLR343W YLR334C MFA1 FUS1

57 57 Summary of the course u Bayesian learning l The idea of considering model parameters as variables with prior distribution u PAC learning l Assigning confidence and accepted error to the learning problem and analyzing polynomial L T u Boosting and Bagging l Use of the data for estimating multiple models and fusion between them

58 58 Summary of the course u Hidden Markov Models l Markov property is widely assumed, hidden Markov model very powerful & easily estimated u Model selection and validation l Crucial for any type of modeling! u (Artificial) neural networks l The brain performs computations radically different than modern computers & often much better! we need to learn how? l ANN powerful modeling tool (BP, RBF)

59 59 Summary of the course u Evolutionary learning l Very different learning rule, has its merits u VC dimensionality l Powerful theoretical tool in defining solvable problems, difficult for practical use u Support Vector Machine l Clean theory, different than classical statistics as it looks for simple estimators in high dim, rather than reducing dim u Bayesian networks: compact way to represent conditional dependencies between variables

60 60 Final project

61 61 Probabilistic Relational Models u Universals: Probabilistic patterns hold for all objects in class u Locality: Represent direct probabilistic dependencies l Links give us potential interactions! Student Intelligence Reg Grade Satisfaction Course Difficulty Professor Teaching-Ability Key ideas: ABC

62 62 Prof. SmithProf. Jones Forrest Gump Jane Doe Welcome to CS101 Welcome to Geo101 PRM Semantics Teaching-ability Difficulty Grade Satisfac Intelligence Instantiated PRM  BN  variables: attributes of all objects  dependencies: determined by links & PRM  Grade|Intell,Diffic

63 63 Welcome to CS101 weak / smart The Web of Influence Welcome to Geo101 A C weak smart easy / hard u Objects are all correlated u Need to perform inference over entire model u For large databases, use approximate inference: l Loopy belief propagation

64 64 PRM Learning: Complete Data Prof. SmithProf. Jones Welcome to CS101 Welcome to Geo101 Teaching-ability Difficulty Grade Satisfac Intelligence High Low Easy A B C Like Hate Like Smart Weak  Grade|Intell,Diffic u Entire database is single “instance” u Parameters used many times in instance u Introduce prior over parameters u Update prior with sufficient statistics: Count(Reg.Grade=A,Reg.Course.Diff=lo,Reg.Student.Intel=hi)

65 65 PRM Learning: Incomplete Data Welcome to CS101 Welcome to Geo101 ??? B A C Hi Low Hi ??? u Use expected sufficient statistics u But, everything is correlated:  E-step uses (approx) inference over entire model

66 66 Example: Binomial Data  Prior: uniform for  in [0,1]  P(  |D)  the likelihood L(  :D) (N H,N T ) = (4,1)  MLE for P(X = H ) is 4/5 = 0.8  Bayesian prediction is 00.20.40.60.81

67 67 Dirichlet Priors u Recall that the likelihood function is  Dirichlet prior with hyperparameters  1,…,  K  the posterior has the same form, with hyperparameters  1 +N 1,…,  K +N K

68 68 Dirichlet Priors - Example 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 00.20.40.60.81 Dirichlet(1,1) Dirichlet(2,2) Dirichlet(5,5) Dirichlet(0.5,0.5) Dirichlet(  heads,  tails )  heads P(  heads )

69 69 Dirichlet Priors (cont.)  If P(  ) is Dirichlet with hyperparameters  1,…,  K u Since the posterior is also Dirichlet, we get

70 70 Learning Parameters: Case Study 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0500100015002000250030003500400045005000 KL Divergence to true distribution # instances MLE Bayes; M'=50 Bayes; M'=20 Bayes; M'=5 Instances sampled from ICU Alarm network M’ — strength of prior

71 71 Marginal Likelihood: Multinomials Fortunately, in many cases integral has closed form  P(  ) is Dirichlet with hyperparameters  1,…,  K  D is a dataset with sufficient statistics N 1,…,N K Then

72 72 Marginal Likelihood: Bayesian Networks X Y u Network structure determines form of marginal likelihood 1234567 Network 1: Two Dirichlet marginal likelihoods P( ) XY Integral over  X Integral over  Y HTTHTHH HTTHTHH HTHHTTH HTHHTTH

73 73 Marginal Likelihood: Bayesian Networks X Y u Network structure determines form of marginal likelihood 1234567 Network 2: Three Dirichlet marginal likelihoods P( ) XY Integral over  X Integral over  Y|X=H HTTHTHH HTTHTHH HTHHTTH THTHHTH Integral over  Y|X=T

74 74 Marginal Likelihood for Networks The marginal likelihood has the form: Dirichlet marginal likelihood for multinomial P(X i | pa i ) N(..) are counts from the data  (..) are hyperparameters for each family given G

75 75  As M (amount of data) grows, l Increasing pressure to fit dependencies in distribution l Complexity term avoids fitting noise u Asymptotic equivalence to MDL score u Bayesian score is consistent l Observed data eventually overrides prior Fit dependencies in empirical distribution Complexity penalty Bayesian Score: Asymptotic Behavior

76 76 Structure Search as Optimization Input: l Training data l Scoring function l Set of possible structures Output: l A network that maximizes the score Key Computational Property: Decomposability: score(G) =  score ( family of X in G )

77 77 Tree-Structured Networks Trees: u At most one parent per variable Why trees? u Elegant math  we can solve the optimization problem u Sparse parameterization  avoid overfitting PCWP CO HRBP HREKG HRSAT ERRCAUTER HR HISTORY CATECHOL SAO2 EXPCO2 ARTCO2 VENTALV VENTLUNG VENITUBE DISCONNECT MINVOLSET VENTMACH KINKEDTUBE INTUBATIONPULMEMBOLUS PAPSHUNT MINOVL PVSAT PRESS INSUFFANESTHTPR LVFAILURE ERRBLOWOUTPUT STROEVOLUMELVEDVOLUME HYPOVOLEMIA CVP BP

78 78 Learning Trees  Let p(i) denote parent of X i u We can write the Bayesian score as Score = sum of edge scores + constant Score of “empty” network Improvement over “empty” network

79 79 Learning Trees  Set w(j  i) = Score( X j  X i ) - Score(X i ) u Find tree (or forest) with maximal weight l Standard max spanning tree algorithm — O(n 2 log n) Theorem: This procedure finds tree with max score

80 80 Beyond Trees When we consider more complex network, the problem is not as easy u Suppose we allow at most two parents per node u A greedy algorithm is no longer guaranteed to find the optimal network u In fact, no efficient algorithm exists Theorem: Finding maximal scoring structure with at most k parents per node is NP-hard for k > 1


Download ppt ". Learning Bayesian Networks from Data Nir Friedman Daphne Koller Hebrew U. Stanford."

Similar presentations


Ads by Google