Download presentation

Presentation is loading. Please wait.

Published byLatrell Bolson Modified over 2 years ago

1
. Learning Bayesian Networks from Data Nir Friedman Daphne Koller Hebrew U. Stanford

2
2 Overview u Introduction u Parameter Estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data

3
3 Family of Alarm Bayesian Networks Qualitative part: Directed acyclic graph (DAG) u Nodes - random variables u Edges - direct influence Quantitative part: Set of conditional probability distributions 0.90.1 e b e 0.20.8 0.01 0.99 0.90.1 be b b e BE P(A | E,B) Earthquake Radio Burglary Alarm Call Compact representation of probability distributions via conditional independence Together: Define a unique distribution in a factored form

4
4 Example: “ICU Alarm” network Domain: Monitoring Intensive-Care Patients u 37 variables u 509 parameters …instead of 2 54 PCWP CO HRBP HREKG HRSAT ERRCAUTER HR HISTORY CATECHOL SAO2 EXPCO2 ARTCO2 VENTALV VENTLUNG VENITUBE DISCONNECT MINVOLSET VENTMACH KINKEDTUBE INTUBATIONPULMEMBOLUS PAPSHUNT ANAPHYLAXIS MINOVL PVSAT FIO2 PRESS INSUFFANESTHTPR LVFAILURE ERRBLOWOUTPUT STROEVOLUMELVEDVOLUME HYPOVOLEMIA CVP BP

5
5 Inference u Posterior probabilities l Probability of any event given any evidence u Most likely explanation l Scenario that explains evidence u Rational decision making l Maximize expected utility l Value of Information u Effect of intervention Earthquake Radio Burglary Alarm Call Radio Call

6
6 Why learning? Knowledge acquisition bottleneck u Knowledge acquisition is an expensive process u Often we don’t have an expert Data is cheap u Amount of available information growing rapidly u Learning allows us to construct models from raw data

7
7 Why Learn Bayesian Networks? u Conditional independencies & graphical language capture structure of many real-world distributions u Graph structure provides much insight into domain l Allows “knowledge discovery” u Learned model can be used for many tasks u Supports all the features of probabilistic learning l Model selection criteria l Dealing with missing data & hidden variables

8
8 Learning Bayesian networks E R B A C.9.1 e b e.7.3.99.01.8.2 be b b e BEP(A | E,B) Data + Prior Information Learner

9
9 Known Structure, Complete Data E B A.9.1 e b e.7.3.99.01.8.2 be b b e BEP(A | E,B) ?? e b e ?? ? ? ?? be b b e BE E B A u Network structure is specified l Inducer needs to estimate parameters u Data does not contain missing values Learner E, B, A.

10
10 Unknown Structure, Complete Data E B A.9.1 e b e.7.3.99.01.8.2 be b b e BEP(A | E,B) ?? e b e ?? ? ? ?? be b b e BE E B A u Network structure is not specified l Inducer needs to select arcs & estimate parameters u Data does not contain missing values E, B, A. Learner

11
11 Known Structure, Incomplete Data E B A.9.1 e b e.7.3.99.01.8.2 be b b e BEP(A | E,B) ?? e b e ?? ? ? ?? be b b e BE E B A u Network structure is specified u Data contains missing values l Need to consider assignments to missing values E, B, A. Learner

12
12 Unknown Structure, Incomplete Data E B A.9.1 e b e.7.3.99.01.8.2 be b b e BEP(A | E,B) ?? e b e ?? ? ? ?? be b b e BE E B A u Network structure is not specified u Data contains missing values l Need to consider assignments to missing values E, B, A. Learner

13
13 Overview u Introduction u Parameter Estimation l Likelihood function l Bayesian estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data

14
14 Learning Parameters E B A C u Training data has the form:

15
15 Likelihood Function E B A C u Assume i.i.d. samples u Likelihood function is

16
16 Likelihood Function E B A C u By definition of network, we get

17
17 Likelihood Function E B A C u Rewriting terms, we get =

18
18 General Bayesian Networks Generalizing for any Bayesian network: Decomposition Independent estimation problems

19
19 Likelihood Function: Multinomials u The likelihood for the sequence H,T, T, H, H is 00.20.40.60.81 L( :D) Probability of k th outcome Count of k th outcome in D General case:

20
20 Bayesian Inference u Represent uncertainty about parameters using a probability distribution over parameters, data u Learning using Bayes rule Posterior Likelihood Prior Probability of data

21
21 Bayesian Inference u Represent Bayesian distribution as Bayes net The values of X are independent given u P(x[m] | ) = u Bayesian prediction is inference in this network X[1]X[2]X[m] Observed data

22
22 Example: Binomial Data Prior: uniform for in [0,1] P( |D) the likelihood L( :D) (N H,N T ) = (4,1) MLE for P(X = H ) is 4/5 = 0.8 Bayesian prediction is 00.20.40.60.81

23
23 Dirichlet Priors u Recall that the likelihood function is Dirichlet prior with hyperparameters 1,…, K the posterior has the same form, with hyperparameters 1 +N 1,…, K +N K

24
24 Dirichlet Priors - Example 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 00.20.40.60.81 Dirichlet(1,1) Dirichlet(2,2) Dirichlet(5,5) Dirichlet(0.5,0.5) Dirichlet( heads, tails ) heads P( heads )

25
25 Dirichlet Priors (cont.) If P( ) is Dirichlet with hyperparameters 1,…, K u Since the posterior is also Dirichlet, we get

26
26 u Priors for each parameter group are independent u Data instances are independent given the unknown parameters XX X[1]X[2] X[M] X[M+1] Observed data Plate notation Y[1]Y[2] Y[M] Y[M+1] Y|X XX m X[m] Y[m] Query Bayesian Nets & Bayesian Prediction

27
27 u We can also “read” from the network: Complete data posteriors on parameters are independent u Can compute posterior over parameters separately! Bayesian Nets & Bayesian Prediction XX X[1]X[2] X[M] Observed data Y[1]Y[2] Y[M] Y|X

28
28 Learning Parameters: Summary u Estimation relies on sufficient statistics For multinomials: counts N(x i,pa i ) l Parameter estimation u Both are asymptotically equivalent and consistent u Both can be implemented in an on-line manner by accumulating sufficient statistics MLE Bayesian (Dirichlet)

29
29 Learning Parameters: Case Study 0 0.2 0.4 0.6 0.8 1 1.2 1.4 0500100015002000250030003500400045005000 KL Divergence to true distribution # instances MLE Bayes; M'=50 Bayes; M'=20 Bayes; M'=5 Instances sampled from ICU Alarm network M’ — strength of prior

30
30 Overview u Introduction u Parameter Learning u Model Selection l Scoring function l Structure search u Structure Discovery u Incomplete Data u Learning from Structured Data

31
31 Why Struggle for Accurate Structure? u Increases the number of parameters to be estimated u Wrong assumptions about domain structure u Cannot be compensated for by fitting parameters u Wrong assumptions about domain structure EarthquakeAlarm Set Sound Burglary EarthquakeAlarm Set Sound Burglary Earthquake Alarm Set Sound Burglary Adding an arc Missing an arc

32
32 Scorebased Learning E, B, A. E B A E B A E B A Search for a structure that maximizes the score Define scoring function that evaluates how well a structure matches the data

33
33 Likelihood Score for Structure Larger dependence of X i on Pa i higher score u Adding arcs always helps l I(X; Y) I(X; {Y,Z}) l Max score attained by fully connected network l Overfitting: A bad idea… Mutual information between X i and its parents

34
34 Max likelihood params Bayesian Score Likelihood score: Bayesian approach: l Deal with uncertainty by assigning probability to all possibilities Likelihood Prior over parameters Marginal Likelihood

35
35 Marginal Likelihood: Multinomials Fortunately, in many cases integral has closed form P( ) is Dirichlet with hyperparameters 1,…, K D is a dataset with sufficient statistics N 1,…,N K Then

36
36 Marginal Likelihood: Bayesian Networks X Y u Network structure determines form of marginal likelihood 1234567 Network 1: Two Dirichlet marginal likelihoods P( ) XY Integral over X Integral over Y HTTHTHH HTTHTHH HTHHTTH HTHHTTH

37
37 Marginal Likelihood: Bayesian Networks X Y u Network structure determines form of marginal likelihood 1234567 Network 2: Three Dirichlet marginal likelihoods P( ) XY Integral over X Integral over Y|X=H HTTHTHH HTTHTHH HTHHTTH THTHHTH Integral over Y|X=T

38
38 Marginal Likelihood for Networks The marginal likelihood has the form: Dirichlet marginal likelihood for multinomial P(X i | pa i ) N(..) are counts from the data (..) are hyperparameters for each family given G

39
39 As M (amount of data) grows, l Increasing pressure to fit dependencies in distribution l Complexity term avoids fitting noise u Asymptotic equivalence to MDL score u Bayesian score is consistent l Observed data eventually overrides prior Fit dependencies in empirical distribution Complexity penalty Bayesian Score: Asymptotic Behavior

40
40 Structure Search as Optimization Input: l Training data l Scoring function l Set of possible structures Output: l A network that maximizes the score Key Computational Property: Decomposability: score(G) = score ( family of X in G )

41
41 Tree-Structured Networks Trees: u At most one parent per variable Why trees? u Elegant math we can solve the optimization problem u Sparse parameterization avoid overfitting PCWP CO HRBP HREKG HRSAT ERRCAUTER HR HISTORY CATECHOL SAO2 EXPCO2 ARTCO2 VENTALV VENTLUNG VENITUBE DISCONNECT MINVOLSET VENTMACH KINKEDTUBE INTUBATIONPULMEMBOLUS PAPSHUNT MINOVL PVSAT PRESS INSUFFANESTHTPR LVFAILURE ERRBLOWOUTPUT STROEVOLUMELVEDVOLUME HYPOVOLEMIA CVP BP

42
42 Learning Trees Let p(i) denote parent of X i u We can write the Bayesian score as Score = sum of edge scores + constant Score of “empty” network Improvement over “empty” network

43
43 Learning Trees Set w(j i) = Score( X j X i ) - Score(X i ) u Find tree (or forest) with maximal weight l Standard max spanning tree algorithm — O(n 2 log n) Theorem: This procedure finds tree with max score

44
44 Beyond Trees When we consider more complex network, the problem is not as easy u Suppose we allow at most two parents per node u A greedy algorithm is no longer guaranteed to find the optimal network u In fact, no efficient algorithm exists Theorem: Finding maximal scoring structure with at most k parents per node is NP-hard for k > 1

45
45 Heuristic Search u Define a search space: l search states are possible structures l operators make small changes to structure u Traverse space looking for high-scoring structures u Search techniques: l Greedy hill-climbing l Best first search l Simulated Annealing l...

46
46 Local Search u Start with a given network l empty network l best tree l a random network u At each iteration l Evaluate all possible changes l Apply change based on score u Stop when no modification improves score

47
47 Heuristic Search u Typical operations: S C E D Reverse C E Delete C E Add C D S C E D S C E D S C E D score = S({C,E} D) - S({E} D) To update score after local change, only re-score families that changed

48
48 Learning in Practice: Alarm domain 0 0.5 1 1.5 2 0500100015002000250030003500400045005000 KL Divergence to true distribution #samples Structure known, fit params Learn both structure & params

49
49 Local Search: Possible Pitfalls u Local search can get stuck in: l Local Maxima: All one-edge changes reduce the score l Plateaux: Some one-edge changes leave the score unchanged u Standard heuristics can escape both l Random restarts l TABU search l Simulated annealing

50
50 Improved Search: Weight Annealing u Standard annealing process: l Take bad steps with probability exp( score/t) l Probability increases with temperature u Weight annealing: l Take uphill steps relative to perturbed score l Perturbation increases with temperature Score(G|D) G

51
51 Perturbing the Score u Perturb the score by reweighting instances u Each weight sampled from distribution: l Mean = 1 l Variance temperature u Instances sampled from “original” distribution u … but perturbation changes emphasis Benefit: u Allows global moves in the search space

52
52 Weight Annealing: ICU Alarm network Cumulative performance of 100 runs of annealed structure search Greedy hill-climbing True structure Learned params Annealed search

53
53 Structure Search: Summary u Discrete optimization problem u In some cases, optimization problem is easy l Example: learning trees u In general, NP-Hard l Need to resort to heuristic search l In practice, search is relatively fast (~100 vars in ~2-5 min): Decomposability Sufficient statistics l Adding randomness to search is critical

54
54 Overview u Introduction u Parameter Estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data

55
55 Structure Discovery Task: Discover structural properties l Is there a direct connection between X & Y l Does X separate between two “subsystems” l Does X causally effect Y Example: scientific data mining l Disease properties and symptoms l Interactions between the expression of genes

56
56 Discovering Structure u Current practice: model selection l Pick a single high-scoring model l Use that model to infer domain structure E R B A C P(G|D)

57
57 Discovering Structure Problem Small sample size many high scoring models l Answer based on one model often useless l Want features common to many models E R B A C E R B A C E R B A C E R B A C E R B A C P(G|D)

58
58 Bayesian Approach u Posterior distribution over structures u Estimate probability of features Edge X Y Path X … Y l … Feature of G, e.g., X Y Indicator function for feature f Bayesian score for G

59
59 MCMC over Networks u Cannot enumerate structures, so sample structures u MCMC Sampling l Define Markov chain over BNs Run chain to get samples from posterior P(G | D) Possible pitfalls: l Huge (superexponential) number of networks l Time for chain to converge to posterior is unknown l Islands of high posterior, connected by low bridges

60
60 ICU Alarm BN: No Mixing u 500 instances: u The runs clearly do not mix MCMC Iteration Score of cuurent sample

61
61 Effects of Non-Mixing u Two MCMC runs over same 500 instances u Probability estimates for edges for two runs Probability estimates highly variable, nonrobust True BN Random start True BN

62
62 Fixed Ordering Suppose that u We know the ordering of variables say, X 1 > X 2 > X 3 > X 4 > … > X n parents for X i must be in X 1,…,X i-1 Limit number of parents per nodes to k Intuition: Order decouples choice of parents Choice of Pa(X 7 ) does not restrict choice of Pa(X 12 ) Upshot: Can compute efficiently in closed form Likelihood P(D | ) Feature probability P(f | D, ) 2 knlog n networks

63
63 Our Approach: Sample Orderings We can write Sample orderings and approximate u MCMC Sampling l Define Markov chain over orderings Run chain to get samples from posterior P( | D)

64
64 Mixing with MCMC-Orderings u 4 runs on ICU-Alarm with 500 instances l fewer iterations than MCMC-Nets l approximately same amount of computation Process appears to be mixing! MCMC Iteration Score of cuurent sample

65
65 Mixing of MCMC runs u Two MCMC runs over same instances u Probability estimates for edges 500 instances 1000 instances Probability estimates very robust

66
66 Application: Gene expression Input: u Measurement of gene expression under different conditions l Thousands of genes l Hundreds of experiments Output: u Models of gene interaction l Uncover pathways

67
67 Map of Feature Confidence Yeast data [Hughes et al 2000] u 600 genes u 300 experiments

68
68 “Mating response” Substructure u Automatically constructed sub-network of high- confidence edges u Almost exact reconstruction of yeast mating pathway KAR4 AGA1PRM1 TEC1 SST2 STE6 KSS1 NDJ1 FUS3 AGA2 YEL059W TOM6 FIG1 YLR343W YLR334C MFA1 FUS1

69
69 Overview u Introduction u Parameter Estimation u Model Selection u Structure Discovery u Incomplete Data l Parameter estimation l Structure search u Learning from Structured Data

70
70 Incomplete Data Data is often incomplete u Some variables of interest are not assigned values This phenomenon happens when we have u Missing values: l Some variables unobserved in some instances u Hidden variables: l Some variables are never observed l We might not even know they exist

71
71 Hidden (Latent) Variables Why should we care about unobserved variables? X1X1 X2X2 X3X3 H Y1Y1 Y2Y2 Y3Y3 X1X1 X2X2 X3X3 Y1Y1 Y2Y2 Y3Y3 17 parameters 59 parameters

72
72 Example P(X) assumed to be known u Likelihood function of: Y|X=T, Y|X=H Contour plots of log likelihood for different number of missing values of X (M = 8): no missing values Y|X=H Y|X=T 2 missing value Y|X=T 3 missing values Y|X=T XY In general: likelihood function has multiple modes

73
73 Incomplete Data u In the presence of incomplete data, the likelihood can have multiple maxima Example: u We can rename the values of hidden variable H u If H has two values, likelihood has two maxima u In practice, many local maxima HY

74
74 L( |D) EM: MLE from Incomplete Data u Use current point to construct “nice” alternative function u Max of new function scores ≥ than current point

75
75 Expectation Maximization (EM) u A general purpose method for learning from incomplete data Intuition: u If we had true counts, we could estimate parameters u But with missing values, counts are unknown u We “complete” counts using probabilistic inference based on current parameter assignment u We use completed counts as if real to re-estimate parameters

76
76 Expectation Maximization (EM) 1.3 0.4 1.7 1.6 N (X,Y ) XY # HTHTHTHT HHTTHHTT Expected Counts X Z HTHHTHTHHT Y ??HTT??HTT T??THT??TH Data P(Y=H|X=T, ) = 0.4 P(Y=H|X=H,Z=T, ) = 0.3 Current model

77
77 Expectation Maximization (EM) Training Data X1X1 X2X2 X3X3 H Y1Y1 Y2Y2 Y3Y3 Initial network (G, 0 ) Expected Counts N(X 1 ) N(X 2 ) N(X 3 ) N(H, X 1, X 1, X 3 ) N(Y 1, H) N(Y 2, H) N(Y 3, H) Computation (E-Step) X1X1 X2X2 X3X3 H Y1Y1 Y2Y2 Y3Y3 Updated network (G, 1 ) Reparameterize (M-Step) Reiterate X1X1 X2X2 X3X3 H Y1Y1 Y2Y2 Y3Y3

78
78 Expectation Maximization (EM) Formal Guarantees: L( 1 :D) L( 0 :D) l Each iteration improves the likelihood If 1 = 0, then 0 is a stationary point of L( :D) l Usually, this means a local maximum

79
79 Expectation Maximization (EM) Computational bottleneck: u Computation of expected counts in E-Step l Need to compute posterior for each unobserved variable in each instance of training set l All posteriors for an instance can be derived from one pass of standard BN inference

80
80 Summary: Parameter Learning with Incomplete Data u Incomplete data makes parameter estimation hard u Likelihood function l Does not have closed form l Is multimodal u Finding max likelihood parameters: l EM l Gradient ascent u Both exploit inference procedures for Bayesian networks to compute expected sufficient statistics

81
81 Incomplete Data: Structure Scores Recall, Bayesian score: With incomplete data: u Cannot evaluate marginal likelihood in closed form u We have to resort to approximations: l Evaluate score around MAP parameters l Need to find MAP parameters (e.g., EM)

82
82 Naive Approach u Perform EM for each candidate graph G1G1 G3G3 G2G2 Parametric optimization (EM) Parameter space Local Maximum G4G4 GnGn u Computationally expensive: l Parameter optimization via EM — non-trivial l Need to perform EM for all candidate structures l Spend time even on poor candidates u In practice, considers only a few candidates

83
83 Structural EM Recall, in complete data we had l Decomposition efficient search Idea: u Instead of optimizing the real score… u Find decomposable alternative score u Such that maximizing new score improvement in real score

84
84 Structural EM Idea: u Use current model to help evaluate new structures Outline: u Perform search in (Structure, Parameters) space u At each iteration, use current model for finding either: l Better scoring parameters: “parametric” EM step or l Better scoring structure: “structural” EM step

85
85 Training Data Expected Counts N(X 1 ) N(X 2 ) N(X 3 ) N(H, X 1, X 1, X 3 ) N(Y 1, H) N(Y 2, H) N(Y 3, H) Computation X1X1 X2X2 X3X3 H Y1Y1 Y2Y2 Y3Y3 X1X1 X2X2 X3X3 H Y1Y1 Y2Y2 Y3Y3 Score & Parameterize X1X1 X2X2 X3X3 H Y1Y1 Y2Y2 Y3Y3 Reiterate N(X 2, X 1 ) N(H, X 1, X 3 ) N(Y 1, X 2 ) N(Y 2, Y 1, H) X1X1 X2X2 X3X3 H Y1Y1 Y2Y2 Y3Y3

86
86 Example: Phylogenetic Reconstruction Input: Biological sequences Human CGTTGC… Chimp CCTAGG… Orang CGAACG… …. Output: a phylogeny leaf An “instance” of evolutionary process Assumption: positions are independent 10 billion years

87
87 Phylogenetic Model u Topology: bifurcating Observed species – 1…N Ancestral species – N+1…2N-2 Lengths t = {t i,j } for each branch (i,j) u Evolutionary model: l P(A changes to T| 10 billion yrs ) internal node 12 34567 8 12 10 11 9 branch (8,9) leaf

88
88 Phylogenetic Tree as a Bayes Net u Variables: Letter at each position for each species l Current day species – observed l Ancestral species - hidden u BN Structure: Tree topology u BN Parameters: Branch lengths (time spans) Main problem: Learn topology If ancestral were observed easy learning problem (learning trees)

89
89 Algorithm Outline Original Tree (T 0,t 0 ) Weights: Branch scores Compute expected pairwise stats

90
90 Pairwise weights Algorithm Outline Find: Weights: Branch scores Compute expected pairwise stats O(N 2 ) pairwise statistics suffice to evaluate all trees

91
91 Max. Spanning Tree Algorithm Outline Find: Construct bifurcation T 1 Weights: Branch scores Compute expected pairwise stats

92
92 New Tree Theorem: L(T 1,t 1 ) L(T 0,t 0 ) Algorithm Outline Construct bifurcation T 1 Find: Repeat until convergence… Weights: Branch scores Compute expected pairwise stats

93
93 Real Life Data -74,227.9-2,916.2Traditional approach 3,578122# pos 3443# sequences Mitochondrial genomes Lysozyme c 1.030.19Difference per position -70,533.5-2,892.1Structural EM Approach Log- likelihood Each position twice as likely

94
94 Overview u Introduction u Parameter Estimation u Model Selection u Structure Discovery u Incomplete Data u Learning from Structured Data

95
95 Bayesian Networks: Problem u Bayesian nets use propositional representation u Real world has objects, related to each other Intelligence Difficulty Grade

96
96 Bayesian Networks: Problem u Bayesian nets use propositional representation u Real world has objects, related to each other Intell_J.Doe Diffic_CS101 Grade_JDoe_CS101 Intell_FGump Diffic_Geo101 Grade_FGump_Geo101 Intell_FGump Diffic_CS101 Grade_FGump_CS101 These “instances” are not independent! A C

97
97 St. Nordaf University Teaches In-course Registered In-course Forrest Gump Jane Doe Welcome to CS101 Welcome to Geo101 Difficulty Registered Grade Satisfac Intelligence Teaching-ability

98
98 Relational Schema u Specifies types of objects in domain, attributes of each type of object, & types of links between objects Teach Student Intelligence Registration Grade Satisfaction Course Difficulty Professor Teaching-Ability In TakeClasses Links Attributes

99
99 Representing the Distribution u Many possible worlds for a given university l All possible assignments of all attributes of all objects u Infinitely many potential universities l Each associated with a very different set of worlds Need to represent infinite set of complex distributions

100
100 Possible Worlds Prof. SmithProf. Jones Forrest Gump Jane Doe Welcome to CS101 Welcome to Geo101 Teaching-ability Difficulty Grade Satisfac Intelligence High Low Easy A B C Like Hate Like Smart Weak World: assignment to all attributes of all objects in domain High Hard Easy A C B Hate Smart Weak

101
101 Probabilistic Relational Models u Universals: Probabilistic patterns hold for all objects in class u Locality: Represent direct probabilistic dependencies l Links give us potential interactions! Student Intelligence Reg Grade Satisfaction Course Difficulty Professor Teaching-Ability Key ideas: ABC

102
102 Prof. SmithProf. Jones Forrest Gump Jane Doe Welcome to CS101 Welcome to Geo101 PRM Semantics Teaching-ability Difficulty Grade Satisfac Intelligence Instantiated PRM BN variables: attributes of all objects dependencies: determined by links & PRM Grade|Intell,Diffic

103
103 Welcome to CS101 weak / smart The Web of Influence Welcome to Geo101 A C weak smart easy / hard u Objects are all correlated u Need to perform inference over entire model u For large databases, use approximate inference: l Loopy belief propagation

104
104 PRM Learning: Complete Data Prof. SmithProf. Jones Welcome to CS101 Welcome to Geo101 Teaching-ability Difficulty Grade Satisfac Intelligence High Low Easy A B C Like Hate Like Smart Weak Grade|Intell,Diffic u Entire database is single “instance” u Parameters used many times in instance u Introduce prior over parameters u Update prior with sufficient statistics: Count(Reg.Grade=A,Reg.Course.Diff=lo,Reg.Student.Intel=hi)

105
105 PRM Learning: Incomplete Data Welcome to CS101 Welcome to Geo101 ??? B A C Hi Low Hi ??? u Use expected sufficient statistics u But, everything is correlated: E-step uses (approx) inference over entire model

106
106 A Web of Data Tom Mitchell Professor WebKB Project Sean Slattery Student CMU CS Faculty Contains Advisee-of Project-of Works-on [Craven et al.]

107
107 Professor department extract information computer science machine learning … Standard Approach 0.520.540.560.580.60.620.640.660.68... Page Category Word 1 Word N

108
108 What’s in a Link 0.520.540.560.580.60.620.640.660.68... Page Category Word 1 Word N From-... Page Category Word 1 Word N Link Exists To-

109
109 Discovering Hidden Concepts Internet Movie Database http://www.imdb.com

110
110 Discovering Hidden Concepts Type Internet Movie Database http://www.imdb.com Gender Actor Director Movie Genre Rating Year #Votes Credit-Order MPAA Rating Appeared

111
111 Web of Influence, Yet Again Movies Terminator 2 Batman Batman Forever Mission: Impossible GoldenEye Starship Troopers Hunt for Red October Wizard of Oz Cinderella Sound of Music The Love Bug Pollyanna The Parent Trap Mary Poppins Swiss Family Robinson … Actors Anthony Hopkins Robert De Niro Tommy Lee Jones Harvey Keitel Morgan Freeman Gary Oldman Sylvester Stallone Bruce Willis Harrison Ford Steven Seagal Kurt Russell Kevin Costner Jean-Claude Van Damme Arnold Schwarzenegger … Directors Steven Spielberg Tim Burton Tony Scott James Cameron John McTiernan Joel Schumacher Alfred Hitchcock Stanley Kubrick David Lean Milos Forman Terry Gilliam Francis Coppola …

112
112 Conclusion u Many distributions have combinatorial dependency structure u Utilizing this structure is good u Discovering this structure has implications: l To density estimation l To knowledge discovery u Many applications l Medicine l Biology l Web

113
113 The END u Gal Elidan u Lise Getoor u Moises Goldszmidt u Matan Ninio u Dana Pe’er u Eran Segal u Ben Taskar http://robotics.stanford.edu/~koller/ Thanks to Slides will be available from: http://www.cs.huji.ac.il/~nir/

Similar presentations

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google