
Slide 1: Learning Bayesian Networks from Data
Nir Friedman (Hebrew U.), Daphne Koller (Stanford)

Slide 2: Overview
- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Slide 3: Family of Alarm Bayesian Networks
Qualitative part: a directed acyclic graph (DAG)
- Nodes: random variables
- Edges: direct influence
Quantitative part: a set of conditional probability distributions, one per node, e.g. a table for P(A | E, B).
Together they define a unique distribution in factored form: a compact representation of a probability distribution via conditional independence.
[Figure: the five-node network Earthquake, Burglary, Alarm, Radio, Call, with the CPT for P(A | E, B)]
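To make "factored form" concrete, here is a minimal Python sketch of this five-node network. Only the structure comes from the slide; the CPT values are illustrative assumptions (the slide's own P(A | E, B) table did not survive extraction), and brute-force enumeration is used only because the network is tiny.

```python
import itertools

# Illustrative CPTs (assumed values, not the slide's).
P_E = {True: 0.01, False: 0.99}                 # P(Earthquake)
P_B = {True: 0.02, False: 0.98}                 # P(Burglary)
P_A = {(True, True): 0.9, (True, False): 0.3,   # P(Alarm=T | E, B)
       (False, True): 0.8, (False, False): 0.01}
P_R = {True: 0.6, False: 0.01}                  # P(Radio=T | E)
P_C = {True: 0.7, False: 0.05}                  # P(Call=T | A)

def joint(e, b, a, r, c):
    """Factored form: P(E,B,A,R,C) = P(E) P(B) P(A|E,B) P(R|E) P(C|A)."""
    pa, pr, pc = P_A[(e, b)], P_R[e], P_C[a]
    return (P_E[e] * P_B[b] * (pa if a else 1 - pa)
            * (pr if r else 1 - pr) * (pc if c else 1 - pc))

# 10 free parameters instead of 2**5 - 1 = 31 joint entries; any query can
# be answered by enumeration, e.g. P(Burglary = true | Call = true):
num = den = 0.0
for e, b, a, r in itertools.product([True, False], repeat=4):
    p = joint(e, b, a, r, True)
    den += p
    num += p if b else 0.0
print(num / den)
```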

Slide 4: Example: the "ICU Alarm" Network
Domain: monitoring intensive-care patients
- 37 variables
- 509 parameters ... instead of ~2^54 entries for an explicit joint distribution
[Figure: the ICU Alarm network, a DAG over the 37 monitoring variables (HR, BP, CVP, SAO2, VENTLUNG, etc.)]

Slide 5: Inference
- Posterior probabilities: probability of any event given any evidence
- Most likely explanation: the scenario that best explains the evidence
- Rational decision making: maximize expected utility; value of information
- Effect of intervention
[Figure: the Earthquake/Burglary network with Radio and Call observed as evidence]

Slide 6: Why Learning?
Knowledge acquisition bottleneck:
- Knowledge acquisition is an expensive process
- Often we don't have an expert
Data is cheap:
- The amount of available data is growing rapidly
- Learning lets us construct models from raw data

Slide 7: Why Learn Bayesian Networks?
- Conditional independencies & the graphical language capture the structure of many real-world distributions
- The graph structure provides much insight into the domain
  - Allows "knowledge discovery"
- The learned model can be used for many tasks
- Supports all the features of probabilistic learning
  - Model selection criteria
  - Dealing with missing data & hidden variables

Slide 8: Learning Bayesian Networks
Data + prior information → Learner → a network: structure (over E, R, B, A, C) plus CPDs such as P(A | E, B)

Slide 9: Known Structure, Complete Data
- Network structure is specified
  - The inducer needs to estimate the parameters
- Data contains no missing values
[Figure: from records (E, B, A) and the fixed graph, the learner fills in the '?' entries of P(A | E, B)]

Slide 10: Unknown Structure, Complete Data
- Network structure is not specified
  - The inducer needs to select the arcs and estimate the parameters
- Data contains no missing values
[Figure: from records (E, B, A), the learner recovers both the graph and the CPT entries]

Slide 11: Known Structure, Incomplete Data
- Network structure is specified
- Data contains missing values
  - Need to consider assignments to the missing values
[Figure: records (E, B, A) with missing entries; the learner estimates P(A | E, B) for the fixed graph]

Slide 12: Unknown Structure, Incomplete Data
- Network structure is not specified
- Data contains missing values
  - Need to consider assignments to the missing values
[Figure: records (E, B, A) with missing entries; the learner recovers both graph and parameters]

Slide 13: Overview
- Introduction
- Parameter Estimation
  - Likelihood function
  - Bayesian estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Slide 14: Learning Parameters
For the network E → A ← B, A → C, training data has the form
D = { (E[1], B[1], A[1], C[1]), ..., (E[M], B[M], A[M], C[M]) }

Slide 15: Likelihood Function
- Assume i.i.d. samples
- The likelihood function is
L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)

Slide 16: Likelihood Function
By the definition of the network, we get
L(Θ : D) = ∏_m P(E[m] : Θ) · P(B[m] : Θ) · P(A[m] | B[m], E[m] : Θ) · P(C[m] | A[m] : Θ)

Slide 17: Likelihood Function
Rewriting (grouping) the terms, we get
L(Θ : D) = [∏_m P(E[m] : Θ_E)] · [∏_m P(B[m] : Θ_B)] · [∏_m P(A[m] | B[m], E[m] : Θ_{A|B,E})] · [∏_m P(C[m] | A[m] : Θ_{C|A})]

Slide 18: General Bayesian Networks
Generalizing to any Bayesian network:
L(Θ : D) = ∏_m ∏_i P(x_i[m] | pa_i[m] : Θ_i) = ∏_i L_i(Θ_i : D)
Decomposition ⇒ independent estimation problems, one per family.
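The decomposition can be read straight off code: the maximized log-likelihood is a sum of per-family terms, each computable from that family's counts alone. A small sketch; the structure matches the running example but the data is made up.

```python
import math
from collections import Counter

# Hypothetical data for the E -> A <- B, A -> C example.
parents = {"E": (), "B": (), "A": ("E", "B"), "C": ("A",)}
data = [{"E": 0, "B": 0, "A": 0, "C": 0},
        {"E": 0, "B": 1, "A": 1, "C": 1},
        {"E": 1, "B": 0, "A": 1, "C": 1},
        {"E": 0, "B": 0, "A": 0, "C": 1}]

def family_log_likelihood(var):
    """max over theta_i of sum_m log P(x_i[m] | pa_i[m] : theta_i),
    i.e. the MLE plug-in log-likelihood of one family."""
    n_joint = Counter((row[var], tuple(row[p] for p in parents[var]))
                      for row in data)
    n_pa = Counter(tuple(row[p] for p in parents[var]) for row in data)
    return sum(n * math.log(n / n_pa[pa]) for (x, pa), n in n_joint.items())

# Decomposition: total log-likelihood = sum of independent family terms.
print(sum(family_log_likelihood(v) for v in parents))
```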

Slide 19: Likelihood Function: Multinomials
The likelihood for the sequence H, T, T, H, H is
L(θ : D) = θ_H · (1−θ_H) · (1−θ_H) · θ_H · θ_H = θ_H^3 (1−θ_H)^2
General case:
L(θ : D) = ∏_k θ_k^{N_k}
where θ_k is the probability of the k-th outcome and N_k is the count of the k-th outcome in D.

Slide 20: Bayesian Inference
- Represent uncertainty about parameters using a probability distribution over parameters and data
- Learning using Bayes' rule:
P(θ | D) = P(D | θ) P(θ) / P(D)
(posterior = likelihood × prior / probability of data)

Slide 21: Bayesian Inference
- Represent the Bayesian distribution as a Bayes net: θ is a parent of the observed data X[1], X[2], ..., X[M]
- The values of X are independent given θ
- P(x[m] | θ) = θ_{x[m]}
- Bayesian prediction is inference in this network

Slide 22: Example: Binomial Data
- Prior: uniform for θ in [0, 1]
- Then P(θ | D) ∝ the likelihood L(θ : D)
- With (N_H, N_T) = (4, 1):
  - MLE for P(X = H) is 4/5 = 0.8
  - Bayesian prediction is P(X[M+1] = H | D) = (N_H + 1) / (M + 2) = 5/7 ≈ 0.714
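The slide's two numbers are easy to verify; with a uniform prior (Dirichlet(1, 1)) the Bayesian prediction is Laplace's rule of succession:

```python
n_h, n_t = 4, 1
print(n_h / (n_h + n_t))            # MLE: 0.8
print((n_h + 1) / (n_h + n_t + 2))  # Bayesian prediction: 5/7 = 0.714...
```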

Slide 23: Dirichlet Priors
- Recall that the likelihood function is L(θ : D) = ∏_k θ_k^{N_k}
- A Dirichlet prior with hyperparameters α_1, ..., α_K has P(θ) ∝ ∏_k θ_k^{α_k − 1}
- Then the posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K

Slide 24: Dirichlet Priors: Example
[Figure: densities P(θ_heads) of Dirichlet(α_heads, α_tails) priors for (0.5, 0.5), (1, 1), (2, 2), and (5, 5); larger hyperparameters concentrate the prior more tightly around 0.5]

Slide 25: Dirichlet Priors (cont.)
- If P(θ) is Dirichlet with hyperparameters α_1, ..., α_K, then P(X[1] = x_k) = α_k / Σ_l α_l
- Since the posterior is also Dirichlet, we get
P(X[M+1] = x_k | D) = (α_k + N_k) / Σ_l (α_l + N_l)

Slide 26: Bayesian Nets & Bayesian Prediction
- Priors for the different parameter groups (θ_X, θ_Y|X, ...) are independent
- Data instances are independent given the unknown parameters
[Figure: plate notation; observed data X[1..M], Y[1..M] with parameter nodes θ_X and θ_Y|X, and a query instance X[M+1], Y[M+1]]

Slide 27: Bayesian Nets & Bayesian Prediction
- We can also "read" from the network: given complete data, the posteriors on the parameters are independent
- So we can compute the posterior over each parameter group separately!

Slide 28: Learning Parameters: Summary
- Estimation relies on sufficient statistics; for multinomials these are the counts N(x_i, pa_i)
- Parameter estimation:
  - MLE: θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
  - Bayesian (Dirichlet): θ̃_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))
- Both are asymptotically equivalent and consistent
- Both can be implemented online by accumulating sufficient statistics
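A sketch of the two estimators sharing one online accumulator of sufficient statistics. The class name and the uniform per-entry hyperparameter are assumptions for illustration, not the talk's code:

```python
from collections import Counter

class FamilyEstimator:
    """Accumulates N(x_i, pa_i) online; MLE and the Dirichlet (Bayesian)
    estimate both read off the same counts."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha           # per-entry hyperparameter (assumed uniform)
        self.n_joint = Counter()     # N(x_i, pa_i)
        self.n_pa = Counter()        # N(pa_i)
        self.values = set()          # values of X_i seen so far

    def observe(self, x, pa):
        self.n_joint[(x, pa)] += 1
        self.n_pa[pa] += 1
        self.values.add(x)

    def mle(self, x, pa):
        return self.n_joint[(x, pa)] / self.n_pa[pa]

    def bayes(self, x, pa):
        k = len(self.values)
        return ((self.n_joint[(x, pa)] + self.alpha)
                / (self.n_pa[pa] + k * self.alpha))

est = FamilyEstimator()
for x, pa in [("a1", "e1b0"), ("a1", "e1b0"), ("a0", "e1b0")]:
    est.observe(x, pa)
print(est.mle("a1", "e1b0"), est.bayes("a1", "e1b0"))  # 0.666..., 0.6
```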

Slide 29: Learning Parameters: Case Study
Instances sampled from the ICU Alarm network; M' is the strength of the prior.
[Figure: KL divergence to the true distribution vs. number of instances, for MLE and for Bayesian estimation with M' = 5, 20, 50]

Slide 30: Overview
- Introduction
- Parameter Estimation
- Model Selection
  - Scoring function
  - Structure search
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Slide 31: Why Struggle for Accurate Structure?
Adding an arc:
- Increases the number of parameters to be estimated
- Wrong assumptions about domain structure
Missing an arc:
- Cannot be compensated for by fitting parameters
- Wrong assumptions about domain structure
[Figure: the true Earthquake/Alarm-Set/Sound/Burglary network flanked by a version with an added arc and a version with a missing arc]

Slide 32: Score-based Learning
- Define a scoring function that evaluates how well a structure matches the data
- Search for the structure that maximizes the score
[Figure: data records (E, B, A) and candidate structures over E, B, A]

Slide 33: Likelihood Score for Structure
score_L(G : D) = ℓ(θ̂_G : D) = M Σ_i I(X_i ; Pa_i^G) − M Σ_i H(X_i)
where I(X_i ; Pa_i^G) is the mutual information between X_i and its parents (the entropy term does not depend on G).
- Larger dependence of X_i on Pa_i ⇒ higher score
- Adding arcs always helps:
  - I(X ; Y) ≤ I(X ; {Y, Z})
  - The maximum score is attained by the fully connected network
  - Overfitting: a bad idea...

Slide 34: Bayesian Score
Likelihood score: ℓ(θ̂_G : D), evaluated at the maximum-likelihood parameters θ̂_G.
Bayesian approach: deal with uncertainty by assigning probability to all possibilities,
P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ
(marginal likelihood = integral of likelihood × prior over parameters).

Slide 35: Marginal Likelihood: Multinomials
Fortunately, in many cases the integral has a closed form:
- P(θ) is Dirichlet with hyperparameters α_1, ..., α_K
- D is a dataset with sufficient statistics N_1, ..., N_K
Then
P(D) = [Γ(Σ_k α_k) / Γ(Σ_k (α_k + N_k))] · ∏_k [Γ(α_k + N_k) / Γ(α_k)]
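Since the closed form is a ratio of Gamma functions, it is a few lines with lgamma; the check below uses a sequence whose marginal likelihood can be computed by hand:

```python
from math import exp, lgamma

def log_marginal_likelihood(alphas, counts):
    """log P(D) = log G(sum_k a_k) - log G(sum_k (a_k + N_k))
                  + sum_k [log G(a_k + N_k) - log G(a_k)],  G = Gamma."""
    res = lgamma(sum(alphas)) - lgamma(sum(alphas) + sum(counts))
    for a, n in zip(alphas, counts):
        res += lgamma(a + n) - lgamma(a)
    return res

# Sanity check: H,T,T,H,H with a uniform Dirichlet(1,1) prior gives
# P(D) = 4! 1! / 6! = 1/30.
print(exp(log_marginal_likelihood([1.0, 1.0], [4, 1])))  # 0.0333...
```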

Slide 36: Marginal Likelihood: Bayesian Networks
The network structure determines the form of the marginal likelihood.
Network 1: X and Y disconnected ⇒ two Dirichlet marginal likelihoods: an integral over θ_X and an integral over θ_Y.
[Figure: the H/T sequences for X and Y scored independently]

Slide 37: Marginal Likelihood: Bayesian Networks
Network 2: X → Y ⇒ three Dirichlet marginal likelihoods: an integral over θ_X, one over θ_{Y|X=H}, and one over θ_{Y|X=T}.
[Figure: the Y outcomes split into the X = H and X = T cases before scoring]

Slide 38: Marginal Likelihood for Networks
The marginal likelihood has the form
P(D | G) = ∏_i ∏_{pa_i} [Γ(α(pa_i)) / Γ(α(pa_i) + N(pa_i))] · ∏_{x_i} [Γ(α(x_i, pa_i) + N(x_i, pa_i)) / Γ(α(x_i, pa_i))]
i.e., a Dirichlet marginal likelihood for each multinomial P(X_i | pa_i), where N(..) are counts from the data, α(..) are hyperparameters for each family given G, and α(pa_i) = Σ_{x_i} α(x_i, pa_i).
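Per family this is the single-multinomial closed form, multiplied over parent configurations. A sketch with uniform hyperparameters (an assumption; proper BDe priors scale the hyperparameters by an equivalent sample size):

```python
from math import lgamma
from collections import Counter, defaultdict

def dirichlet_ml(alphas, counts):
    """log Dirichlet marginal likelihood for a single multinomial."""
    return (lgamma(sum(alphas)) - lgamma(sum(alphas) + sum(counts))
            + sum(lgamma(a + n) - lgamma(a) for a, n in zip(alphas, counts)))

def log_family_score(rows, values, alpha=1.0):
    """log of the family term in P(D | G): one Dirichlet marginal
    likelihood per parent configuration. rows = [(x_i, pa_i), ...]."""
    by_pa = defaultdict(Counter)
    for x, pa in rows:
        by_pa[pa][x] += 1
    return sum(dirichlet_ml([alpha] * len(values),
                            [cnt[v] for v in values])
               for cnt in by_pa.values())

# log P(D | G) is the sum of log_family_score over all families i.
rows = [("H", "e1"), ("T", "e1"), ("H", "e0"), ("H", "e0"), ("T", "e1")]
print(log_family_score(rows, ["H", "T"]))
```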

Slide 39: Bayesian Score: Asymptotic Behavior
log P(D | G) ≈ ℓ(θ̂_G : D) − (log M / 2) dim(G)
(fit of the dependencies in the empirical distribution, minus a complexity penalty)
- As M (the amount of data) grows:
  - Increasing pressure to fit the dependencies in the distribution
  - The complexity term avoids fitting noise
- Asymptotically equivalent to the MDL score
- The Bayesian score is consistent: observed data eventually overrides the prior

Slide 40: Structure Search as Optimization
Input:
- Training data
- Scoring function
- Set of possible structures
Output:
- A network that maximizes the score
Key computational property, decomposability:
score(G) = Σ_i score(family of X_i in G)

Slide 41: Tree-Structured Networks
Trees: at most one parent per variable.
Why trees?
- Elegant math: we can solve the optimization problem efficiently
- Sparse parameterization: avoids overfitting
[Figure: a tree-structured network over the ICU Alarm variables]

Slide 42: Learning Trees
Let p(i) denote the parent of X_i. We can write the Bayesian score as
Score(G : D) = Σ_i Score(X_i | X_{p(i)}) = Σ_i [Score(X_i | X_{p(i)}) − Score(X_i)] + Σ_i Score(X_i)
i.e., score = sum of edge scores + a constant, where each edge score is the improvement over the "empty" network and the constant is the score of the "empty" network.

Slide 43: Learning Trees
- Set w(j → i) = Score(X_i | X_j) − Score(X_i)
- Find the tree (or forest) with maximal weight
  - Standard maximum spanning tree algorithm: O(n² log n)
Theorem: this procedure finds the tree with the maximum score.
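A sketch of the procedure for the likelihood score, where the edge weight reduces to the empirical mutual information; for the Bayesian score one substitutes w(j → i) as defined above. The toy data is made up.

```python
from collections import Counter
from math import log

def mutual_information(data, i, j):
    """Empirical I(X_i; X_j). For the likelihood score this is (up to a
    factor M) the edge weight; for the Bayesian score, substitute
    w(j -> i) = Score(X_i | X_j) - Score(X_i) instead."""
    m = len(data)
    ci, cj, cij = Counter(), Counter(), Counter()
    for row in data:
        ci[row[i]] += 1
        cj[row[j]] += 1
        cij[(row[i], row[j])] += 1
    return sum((n / m) * log(n * m / (ci[x] * cj[y]))
               for (x, y), n in cij.items())

def max_spanning_tree(n_vars, weight):
    """Kruskal's algorithm with a union-find; returns undirected tree edges.
    Rooting the result anywhere and orienting edges away from the root
    yields the parent function p(i)."""
    parent = list(range(n_vars))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = sorted(((weight(i, j), i, j) for i in range(n_vars)
                    for j in range(i + 1, n_vars)), reverse=True)
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy data: each row is a tuple of variable values (made up).
data = [(0, 0, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1), (1, 1, 1)]
print(max_spanning_tree(3, lambda i, j: mutual_information(data, i, j)))
```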

Slide 44: Beyond Trees
When we consider more complex networks, the problem is not as easy:
- Suppose we allow at most two parents per node
- A greedy algorithm is no longer guaranteed to find the optimal network
- In fact, no efficient algorithm exists
Theorem: finding the maximal-scoring structure with at most k parents per node is NP-hard for k > 1.

Slide 45: Heuristic Search
- Define a search space:
  - Search states are possible structures
  - Operators make small changes to the structure
- Traverse the space looking for high-scoring structures
- Search techniques:
  - Greedy hill-climbing
  - Best-first search
  - Simulated annealing
  - ...

Slide 46: Local Search
- Start with a given network:
  - the empty network,
  - the best tree, or
  - a random network
- At each iteration:
  - Evaluate all possible changes
  - Apply the change that most improves the score
- Stop when no modification improves the score

Slide 47: Heuristic Search
Typical operations (illustrated on a network over S, C, E, D): add C → D, delete C → E, reverse C → E.
Adding C → D changes only D's family, so Δscore = S({C, E} → D) − S({E} → D).
To update the score after a local change, only re-score the families that changed.
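A compact hill-climbing sketch over parent sets. It uses a BIC family score for concreteness (any decomposable score, such as the Bayesian score from earlier slides, plugs in unchanged); decomposability shows up as each move re-scoring exactly one family. Data and sizes are toy values.

```python
from collections import Counter
from math import log

def bic_family(data, child, pars):
    """Decomposable family score: max log-likelihood - (log M / 2) * #params."""
    m = len(data)
    cxp = Counter((row[child], tuple(row[p] for p in pars)) for row in data)
    cp = Counter(tuple(row[p] for p in pars) for row in data)
    vals = {row[child] for row in data}
    ll = sum(n * log(n / cp[pa]) for (x, pa), n in cxp.items())
    return ll - 0.5 * log(m) * (len(vals) - 1) * len(cp)

def would_cycle(pa, child, new_parent):
    """Adding new_parent -> child cycles iff child is an ancestor of new_parent."""
    stack, seen = [new_parent], set()
    while stack:
        u = stack.pop()
        if u == child:
            return True
        if u not in seen:
            seen.add(u)
            stack.extend(pa[u])
    return False

def hill_climb(data, n_vars):
    pa = {i: frozenset() for i in range(n_vars)}
    fam = {i: bic_family(data, i, ()) for i in range(n_vars)}
    while True:
        best_delta, best_move = 0.0, None
        for i in range(n_vars):
            for j in range(n_vars):         # try flipping the edge j -> i
                if i == j:                  # (reversal = delete + add; two
                    continue                #  families change; omitted here)
                if j in pa[i]:
                    new = pa[i] - {j}
                elif not would_cycle(pa, i, j):
                    new = pa[i] | {j}
                else:
                    continue
                # Decomposability: only X_i's family is re-scored.
                delta = bic_family(data, i, tuple(sorted(new))) - fam[i]
                if delta > best_delta:
                    best_delta, best_move = delta, (i, new)
        if best_move is None:               # no modification improves score
            return pa
        i, new = best_move
        pa[i] = new
        fam[i] = bic_family(data, i, tuple(sorted(new)))

data = [(0, 0, 0), (1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1), (0, 0, 1)]
print(hill_climb(data, 3))
```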

Slide 48: Learning in Practice: Alarm Domain
[Figure: KL divergence to the true distribution vs. number of samples, comparing "structure known, fit parameters" against "learn both structure & parameters"]

Slide 49: Local Search: Possible Pitfalls
- Local search can get stuck in:
  - Local maxima: all one-edge changes reduce the score
  - Plateaux: some one-edge changes leave the score unchanged
- Standard heuristics can escape both:
  - Random restarts
  - TABU search
  - Simulated annealing

Slide 50: Improved Search: Weight Annealing
- Standard annealing process:
  - Take bad steps with probability ∝ exp(Δscore / t)
  - The probability increases with the temperature t
- Weight annealing:
  - Take uphill steps relative to a perturbed score
  - The perturbation increases with the temperature

Slide 51: Perturbing the Score
- Perturb the score by reweighting the instances
- Each weight is sampled from a distribution with:
  - Mean = 1
  - Variance ∝ temperature
- Instances are still sampled from the "original" distribution, but the perturbation changes the emphasis
Benefit: allows global moves in the search space.
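The talk specifies only the mean and the variance trend of the weights, so the concrete distribution below (Gamma with mean 1 and variance t) is an assumption. The perturbed score is then the ordinary score computed from weighted counts N_w(x, pa) = Σ_m w_m · 1[x_i[m] = x, pa_i[m] = pa]:

```python
import random

def instance_weights(m, temperature, rng=random.Random(0)):
    """One mean-1 weight per instance; spread grows with temperature.
    Gamma(shape=1/t, scale=t) has mean 1 and variance t. This particular
    distribution is an assumption; the talk only fixes mean and variance."""
    t = max(temperature, 1e-9)
    return [rng.gammavariate(1.0 / t, t) for _ in range(m)]

# High temperature reweights aggressively; as t -> 0 the perturbation vanishes:
for t in (2.0, 0.5, 0.01):
    print(t, [round(w, 2) for w in instance_weights(5, t)])
```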

Slide 52: Weight Annealing: ICU Alarm Network
[Figure: cumulative performance of 100 runs of annealed structure search, compared against greedy hill-climbing and against the true structure with learned parameters]

Slide 53: Structure Search: Summary
- A discrete optimization problem
- In some cases the optimization problem is easy, e.g. learning trees
- In general it is NP-hard:
  - Need to resort to heuristic search
  - In practice search is relatively fast (~100 variables in ~2-5 minutes), thanks to decomposability and sufficient statistics
  - Adding randomness to the search is critical

Slide 54: Overview
- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Slide 55: Structure Discovery
Task: discover structural properties
- Is there a direct connection between X and Y?
- Does X separate two "subsystems"?
- Does X causally affect Y?
Example: scientific data mining
- Disease properties and symptoms
- Interactions between the expression of genes

Slide 56: Discovering Structure
Current practice: model selection
- Pick a single high-scoring model
- Use that model to infer domain structure
[Figure: the posterior P(G | D) with one selected network over E, R, B, A, C]

Slide 57: Discovering Structure
Problem: with a small sample size there are many high-scoring models
- An answer based on one model is often useless
- We want features common to many models
[Figure: the posterior P(G | D) spread over several distinct networks over E, R, B, A, C]

Slide 58: Bayesian Approach
- Posterior distribution over structures
- Estimate the probability of features:
P(f | D) = Σ_G f(G) P(G | D)
where f(G) is the indicator function for a feature of G (e.g., the edge X → Y, or a path X → ... → Y) and P(G | D) is the Bayesian score of G.

Slide 59: MCMC over Networks
- We cannot enumerate structures, so we sample structures
- MCMC sampling:
  - Define a Markov chain over Bayesian networks
  - Run the chain to get samples from the posterior P(G | D)
Possible pitfalls:
- Huge (superexponential) number of networks
- The time for the chain to converge to the posterior is unknown
- Islands of high posterior, connected by low bridges
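A Metropolis sketch of the chain mechanics only. To stay short it proposes single-edge flips consistent with one fixed variable order (so every state is a DAG), which is a simplification of the talk's sampler over all networks; the toy score merely penalizes edges:

```python
import math, random
from itertools import combinations

def mcmc_networks(n_vars, log_score, n_steps, rng=random.Random(0)):
    """Metropolis over structures: propose flipping one edge, accept with
    probability min(1, exp(score' - score)). Restricting edges to i -> j
    with i < j sidesteps acyclicity checks (a simplification)."""
    candidates = list(combinations(range(n_vars), 2))
    edges, cur = frozenset(), log_score(frozenset())
    samples = []
    for _ in range(n_steps):
        new = edges ^ {rng.choice(candidates)}
        new_score = log_score(new)
        if math.log(rng.random()) < new_score - cur:
            edges, cur = new, new_score
        samples.append(edges)
    return samples

# Feature probabilities are sample averages, e.g. P(edge 0 -> 1 | D):
samples = mcmc_networks(4, lambda es: -0.5 * len(es), 2000)
print(sum((0, 1) in s for s in samples) / len(samples))
```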

Slide 60: ICU Alarm BN: No Mixing
- 500 instances
- The runs clearly do not mix
[Figure: score of the current sample vs. MCMC iteration for several runs]

Slide 61: Effects of Non-Mixing
- Two MCMC runs over the same 500 instances, started from the true BN and from a random network
- Probability estimates for edges differ between the two runs
- The probability estimates are highly variable and non-robust
[Figure: scatter of edge-probability estimates from the two runs]

Slide 62: Fixed Ordering
Suppose that we know the ordering of the variables, say X_1 > X_2 > X_3 > X_4 > ... > X_n, so the parents of X_i must come from X_1, ..., X_{i-1}, and we limit the number of parents per node to k. That still leaves about 2^{kn log n} networks.
Intuition: the order decouples the choices of parents
- The choice of Pa(X_7) does not restrict the choice of Pa(X_12)
Upshot: we can compute efficiently, in closed form:
- The likelihood P(D | ≺)
- The feature probability P(f | D, ≺)

Slide 63: Our Approach: Sample Orderings
We can write
P(f | D) = Σ_≺ P(f | ≺, D) P(≺ | D)
Sample orderings and approximate
P(f | D) ≈ (1/K) Σ_{k=1..K} P(f | ≺_k, D)
MCMC sampling: define a Markov chain over orderings and run it to get samples from the posterior P(≺ | D).

Slide 64: Mixing with MCMC-Orderings
- 4 runs on ICU Alarm with 500 instances
  - Fewer iterations than MCMC over networks
  - Approximately the same amount of computation
- The process appears to be mixing!
[Figure: score of the current sample vs. MCMC iteration for the 4 runs]

Slide 65: Mixing of MCMC Runs
- Two MCMC runs over the same instances
- Edge-probability estimates from the two runs agree closely, at both 500 and 1000 instances
- The probability estimates are very robust
[Figure: scatter of edge-probability estimates from the two runs]

Slide 66: Application: Gene Expression
Input:
- Measurements of gene expression under different conditions
  - Thousands of genes
  - Hundreds of experiments
Output:
- Models of gene interaction
  - Uncover pathways

Slide 67: Map of Feature Confidence
Yeast data [Hughes et al. 2000]
- 600 genes
- 300 experiments

Slide 68: "Mating Response" Substructure
- Automatically constructed sub-network of high-confidence edges
- Almost exact reconstruction of the yeast mating pathway
[Figure: the recovered sub-network over genes including KAR4, AGA1, PRM1, TEC1, SST2, STE6, KSS1, NDJ1, FUS3, AGA2, YEL059W, TOM6, FIG1, YLR343W, YLR334C, MFA1, FUS1]

Slide 69: Overview
- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
  - Parameter estimation
  - Structure search
- Learning from Structured Data

Slide 70: Incomplete Data
Data is often incomplete: some variables of interest are not assigned values.
This phenomenon happens when we have
- Missing values:
  - Some variables are unobserved in some instances
- Hidden variables:
  - Some variables are never observed
  - We might not even know they exist

Slide 71: Hidden (Latent) Variables
Why should we care about unobserved variables?
[Figure: with a hidden variable H between X_1, X_2, X_3 and Y_1, Y_2, Y_3 the model needs 17 parameters; connecting the two layers directly needs 59]

Slide 72: Example
- The network is X → Y, and P(X) is assumed to be known
- Likelihood function of θ_{Y|X=T}, θ_{Y|X=H}
[Figure: contour plots of the log-likelihood over (θ_{Y|X=T}, θ_{Y|X=H}) for 0, 2, and 3 missing values of X (M = 8)]
In general: the likelihood function has multiple modes.

Slide 73: Incomplete Data
- In the presence of incomplete data, the likelihood can have multiple maxima
- Example: a hidden variable H with two values (network H → Y)
  - We can rename the values of H, so the likelihood has two equivalent maxima
- In practice, there are many local maxima

Slide 74: EM: MLE from Incomplete Data
- Use the current point to construct a "nice" alternative function
- The maximum of the new function scores ≥ the current point
[Figure: L(θ | D) with the surrogate function touching it at the current point]

Slide 75: Expectation Maximization (EM)
A general-purpose method for learning from incomplete data.
Intuition:
- If we had true counts, we could estimate parameters
- But with missing values, the counts are unknown
- So we "complete" the counts using probabilistic inference based on the current parameter assignment
- We then use the completed counts as if they were real to re-estimate the parameters

Slide 76: Expectation Maximization (EM)
[Figure: data over X, Y, Z with missing ('?') entries; the current model, e.g. P(Y=H | X=T, θ) = 0.4 and P(Y=H | X=H, Z=T, θ) = 0.3, is used to fill in the expected counts N(X, Y)]
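A minimal runnable version of this E-step/M-step loop for the simplest case: X observed, Y sometimes missing, estimating θ_x = P(Y = H | X = x). Data and the starting point are made up; the point is the fractional completion of counts:

```python
# Made-up data: (x, y) pairs with some y missing (None).
data = [("H", "H"), ("T", None), ("H", None), ("T", "T"),
        ("H", "H"), ("T", "H"), ("H", None), ("T", "T")]

theta = {"H": 0.5, "T": 0.5}   # current model: P(Y=H | X=x)
for _ in range(50):
    # E-step: expected counts; a missing Y contributes P(Y=H | x) fractionally.
    n_h = {"H": 0.0, "T": 0.0}
    n = {"H": 0.0, "T": 0.0}
    for x, y in data:
        n[x] += 1
        if y == "H":
            n_h[x] += 1
        elif y is None:
            n_h[x] += theta[x]
    # M-step: re-estimate parameters as if the completed counts were real.
    theta = {x: n_h[x] / n[x] for x in theta}
print(theta)  # converges to the MLE on the observed Y's in this simple case
```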

Slide 77: Expectation Maximization (EM)
The loop: start from an initial network (G, Θ_0); from the training data, compute expected counts such as N(X_1), N(X_2), N(X_3), N(H, X_1, X_2, X_3), N(Y_1, H), N(Y_2, H), N(Y_3, H) (E-step); reparameterize to get the updated network (G, Θ_1) (M-step); reiterate.

Slide 78: Expectation Maximization (EM)
Formal guarantees:
- L(Θ_1 : D) ≥ L(Θ_0 : D): each iteration improves the likelihood
- If Θ_1 = Θ_0, then Θ_0 is a stationary point of L(Θ : D)
  - Usually, this means a local maximum

Slide 79: Expectation Maximization (EM)
Computational bottleneck: computing the expected counts in the E-step
- Need to compute a posterior for each unobserved variable in each instance of the training set
- All the posteriors for an instance can be derived from one pass of standard BN inference

Slide 80: Summary: Parameter Learning with Incomplete Data
- Incomplete data makes parameter estimation hard
- The likelihood function:
  - Does not have a closed form
  - Is multimodal
- Finding max-likelihood parameters:
  - EM
  - Gradient ascent
- Both exploit inference procedures for Bayesian networks to compute expected sufficient statistics

Slide 81: Incomplete Data: Structure Scores
Recall the Bayesian score:
P(G | D) ∝ P(G) P(D | G) = P(G) ∫ P(D | G, θ) P(θ | G) dθ
With incomplete data:
- We cannot evaluate the marginal likelihood in closed form
- We have to resort to approximations:
  - Evaluate the score around the MAP parameters
  - Need to find the MAP parameters (e.g., by EM)

Slide 82: Naive Approach
- Perform EM (parametric optimization to a local maximum) for each candidate graph G_1, G_2, G_3, ..., G_n
- Computationally expensive:
  - Parameter optimization via EM is non-trivial
  - Need to perform EM for all candidate structures
  - We spend time even on poor candidates
- In practice, this considers only a few candidates

Slide 83: Structural EM
Recall that with complete data we had decomposition ⇒ efficient search.
Idea:
- Instead of optimizing the real score...
- Find a decomposable alternative score
- Such that maximizing the new score ⇒ improvement in the real score

Slide 84: Structural EM
Idea: use the current model to help evaluate new structures.
Outline:
- Perform search in (structure, parameters) space
- At each iteration, use the current model to find either:
  - Better-scoring parameters: a "parametric" EM step, or
  - A better-scoring structure: a "structural" EM step

Slide 85: Structural EM (cont.)
As in parametric EM, expected counts (N(X_1), N(X_2), N(X_3), N(H, X_1, X_2, X_3), N(Y_1, H), N(Y_2, H), N(Y_3, H)) are computed from the training data under the current network; but they are also used to score and parameterize candidate structures with different families (e.g., N(X_2, X_1), N(H, X_1, X_3), N(Y_1, X_2), N(Y_2, Y_1, H)); reiterate.

Slide 86: Example: Phylogenetic Reconstruction
Input: biological sequences, e.g.
  Human  CGTTGC...
  Chimp  CCTAGG...
  Orang  CGAACG...
  ...
Output: a phylogeny, a tree whose leaves are the observed species; an "instance" of the evolutionary process.
Assumption: positions are independent.
[Figure: a phylogenetic tree spanning 10 billion years]

Slide 87: Phylogenetic Model
- Topology: bifurcating tree
  - Observed species: 1...N (leaves)
  - Ancestral species: N+1...2N−2 (internal nodes)
  - Lengths t = {t_{i,j}} for each branch (i, j)
- Evolutionary model:
  - e.g., P(A changes to T | 10 billion yrs)

Slide 88: Phylogenetic Tree as a Bayes Net
- Variables: the letter at each position, for each species
  - Current-day species: observed
  - Ancestral species: hidden
- BN structure: the tree topology
- BN parameters: branch lengths (time spans)
Main problem: learn the topology. If the ancestral species were observed, this would be an easy learning problem (learning trees).

Slide 89: Algorithm Outline
From the original tree (T_0, t_0), compute expected pairwise statistics and turn them into weights (branch scores).

Slide 90: Algorithm Outline
Find the best configuration from the pairwise weights: O(N²) pairwise statistics suffice to evaluate all trees.

Slide 91: Algorithm Outline
Find a maximum spanning tree on the pairwise weights and construct a bifurcation T_1 from it.

Slide 92: Algorithm Outline
New tree. Theorem: L(T_1, t_1) ≥ L(T_0, t_0). Repeat until convergence...

Slide 93: Real-Life Data
Datasets: Lysozyme c (43 sequences, 122 positions) and mitochondrial genomes (34 sequences, 3,578 positions).
Log-likelihoods (some digits were lost in extraction): traditional approach -74,... / ...,916.2; Structural EM approach -70,... / ...,892.1.
Difference per position: each position is about twice as likely under the Structural EM trees.

Slide 94: Overview
- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Slide 95: Bayesian Networks: Problem
- Bayesian nets use a propositional representation
- The real world has objects, related to each other
[Figure: variables Intelligence, Difficulty, Grade]

Slide 96: Bayesian Networks: Problem
- Bayesian nets use a propositional representation
- The real world has objects, related to each other
[Figure: ground variables Intell_JDoe, Diffic_CS101, Grade_JDoe_CS101; Intell_FGump, Diffic_Geo101, Grade_FGump_Geo101; Intell_FGump, Diffic_CS101, Grade_FGump_CS101]
These "instances" are not independent! (FGump's two grades share Intell_FGump.)

Slide 97: St. Nordaf University
[Figure: a relational skeleton: professors with a Teaching-ability teach the courses "Welcome to CS101" and "Welcome to Geo101", each with a Difficulty; students Forrest Gump and Jane Doe, each with an Intelligence, are registered in courses, and each registration has a Grade and a Satisfaction]

Slide 98: Relational Schema
Specifies the types of objects in the domain, the attributes of each type of object, and the types of links between objects.
- Classes and attributes: Professor (Teaching-Ability), Course (Difficulty), Student (Intelligence), Registration (Grade, Satisfaction)
- Links: Professor teaches Course; Student takes Registration; Registration is in Course

Slide 99: Representing the Distribution
- Many possible worlds for a given university
  - All possible assignments of all attributes of all objects
- Infinitely many potential universities
  - Each associated with a very different set of worlds
Need to represent an infinite set of complex distributions.

Slide 100: Possible Worlds
A world is an assignment to all attributes of all objects in the domain.
[Figure: two worlds over Prof. Smith, Prof. Jones, the two courses, and the two students, differing in values such as Teaching-ability (High/Low), Difficulty (Easy/Hard), Grade (A/B/C), Satisfaction (Like/Hate), and Intelligence (Smart/Weak)]

Slide 101: Probabilistic Relational Models
Key ideas:
- Universals: probabilistic patterns hold for all objects in a class
- Locality: represent direct probabilistic dependencies
  - Links give us the potential interactions!
[Figure: schema-level dependency model over Professor.Teaching-Ability, Course.Difficulty, Student.Intelligence, Reg.Grade, Reg.Satisfaction, with a CPD over grades A/B/C]

Slide 102: PRM Semantics
An instantiated PRM defines a ground Bayesian network:
- Variables: the attributes of all objects
- Dependencies: determined by the links & the PRM; e.g., every registration's Grade uses the shared CPD θ_{Grade|Intell,Diffic}
[Figure: the ground network over the St. Nordaf objects]

Slide 103: The Web of Influence
- The objects are all correlated; e.g., observing grades in CS101 updates beliefs about a student (weak/smart), which in turn updates beliefs about Geo101's difficulty (easy/hard)
- Need to perform inference over the entire model
- For large databases, use approximate inference:
  - Loopy belief propagation

Slide 104: PRM Learning: Complete Data
- The entire database is a single "instance"
- Parameters are used many times within that instance
- Introduce a prior over the parameters
- Update the prior with sufficient statistics such as:
Count(Reg.Grade = A, Reg.Course.Diff = lo, Reg.Student.Intel = hi)
[Figure: the ground network with all attribute values observed; all registrations share θ_{Grade|Intell,Diffic}]

Slide 105: PRM Learning: Incomplete Data
- Use expected sufficient statistics
- But everything is correlated:
  - The E-step uses (approximate) inference over the entire model
[Figure: the ground network with some attribute values hidden ('???')]

Slide 106: A Web of Data [Craven et al.]
[Figure: WebKB-style linked data: pages for Tom Mitchell (Professor), the WebKB Project, Sean Slattery (Student), and CMU CS Faculty, connected by Contains, Advisee-of, Project-of, and Works-on links]

Slide 107: Standard Approach
[Figure: a naive-Bayes-style model: Page Category is the parent of Word_1 ... Word_N, with words such as "professor", "department", "extract", "information", "computer science", "machine learning"]

Slide 108: What's in a Link
[Figure: the model extended across links: the From-page's Category and the To-page's Category are both parents of Link-Exists, with each page keeping its own Word_1 ... Word_N]

Slide 109: Discovering Hidden Concepts
Internet Movie Database

Slide 110: Discovering Hidden Concepts
Internet Movie Database
[Figure: a model over Actor (Gender, hidden Type), Director (hidden Type), and Movie (Genre, Rating, Year, #Votes, MPAA Rating, hidden Type), linked by the Appeared relation with Credit-Order]

Slide 111: Web of Influence, Yet Again
Groupings discovered via the hidden Type variables:
- Movies: Terminator 2, Batman, Batman Forever, Mission: Impossible, GoldenEye, Starship Troopers, Hunt for Red October; Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson, ...
- Actors: Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman, Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger, ...
- Directors: Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher, Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola, ...

Slide 112: Conclusion
- Many distributions have combinatorial dependency structure
- Utilizing this structure is good
- Discovering this structure has implications:
  - For density estimation
  - For knowledge discovery
- Many applications:
  - Medicine
  - Biology
  - The Web

Slide 113: The End
Thanks to: Gal Elidan, Lise Getoor, Moises Goldszmidt, Matan Ninio, Dana Pe'er, Eran Segal, Ben Taskar
Slides will be available from:

