
1 PGM Tirgul 10: Learning Structure I

2 Benefits of Learning Structure
- Efficient learning: more accurate models with less data
  - Compare: learning P(A) and P(B) separately vs. the joint P(A,B)
- Discover structural properties of the domain
  - Ordering of events
  - Relevance
- Identifying independencies enables faster inference
- Predict the effect of actions
  - Involves learning causal relationships among variables

3 Why Struggle for Accurate Structure?
Adding an arc:
- Increases the number of parameters to be fitted
- Implies wrong assumptions about causality and domain structure
Missing an arc:
- Cannot be compensated for by accurate fitting of parameters
- Also misses causality and domain structure
[Figure: three networks over Earthquake, Burglary, Alarm Set, and Sound, illustrating the true structure, an added arc, and a missing arc]

4 Approaches to Learning Structure
- Constraint based
  - Perform tests of conditional independence
  - Search for a network that is consistent with the observed dependencies and independencies
- Pros: intuitive, follows closely the construction of BNs; separates structure learning from the form of the independence tests
- Cons: sensitive to errors in individual tests

5 Approaches to Learning Structure
- Score based
  - Define a score that evaluates how well the (in)dependencies in a structure match the observations
  - Search for a structure that maximizes the score
- Pros: statistically motivated; can make compromises; takes the structure of conditional probabilities into account
- Cons: computationally hard

6 Likelihood Score for Structures
First cut approach:
- Use the likelihood function
- Recall, the likelihood score for a network structure G and parameters Θ_G is
      ℓ(G, Θ_G : D) = log P(D | G, Θ_G)
- Since we know how to maximize the parameters, from now on we assume Θ_G = Θ̂_G, the maximum-likelihood parameters, giving
      Score_L(G : D) = ℓ(G, Θ̂_G : D) = max_{Θ_G} ℓ(G, Θ_G : D)

7 Likelihood Score for Structure (cont.)
Rearranging terms:
      (1/M) ℓ(G, Θ̂_G : D) = Σ_i I(X_i ; Pa_i^G) − Σ_i H(X_i)
where
- H(X) is the entropy of X
- I(X;Y) is the mutual information between X and Y
  - I(X;Y) measures how much "information" each variable provides about the other
  - I(X;Y) ≥ 0
  - I(X;Y) = 0 iff X and Y are independent
  - I(X;Y) = H(X) iff X is totally predictable given Y
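
The quantities in this decomposition are easy to verify numerically. Below is a minimal sketch, assuming binary data given as Python lists; the helper names (empirical_entropy, mutual_information) are illustrative, not from the lecture.

```python
import math
from collections import Counter

def empirical_entropy(values):
    """H(X) = -sum_x p(x) log2 p(x), with p(x) estimated from counts."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), all from empirical frequencies."""
    return (empirical_entropy(xs) + empirical_entropy(ys)
            - empirical_entropy(list(zip(xs, ys))))

# Toy data: Y is a noisy copy of X, so I(X;Y) > 0; per the decomposition,
# adding the edge X -> Y raises the per-sample log-likelihood by I(X;Y)
# (in the same log base).
xs = list('HTTHTHH')
ys = list('HTHHTTH')
print(mutual_information(xs, ys))  # >= 0; 0 iff empirically independent
```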

8 Likelihood Score for Structure (cont.)
Good news:
- Intuitive explanation of the likelihood score: the larger the dependency of each variable on its parents, the higher the score
- Likelihood as a compromise among dependencies, based on their strength

9 Likelihood Score for Structure (cont.)
Bad news:
- Adding arcs always helps
  - I(X;Y) ≤ I(X;{Y,Z})
  - Maximal score is attained by fully connected networks
  - Such networks can overfit the data: the parameters capture the noise in the data

10 Avoiding Overfitting
A "classic" issue in learning. Approaches:
- Restricting the hypothesis space
  - Limits the overfitting capability of the learner
  - Example: restrict the number of parents or the number of parameters
- Minimum description length
  - Description length measures complexity
  - Prefer models that compactly describe the training data
- Bayesian methods
  - Average over all possible parameter values
  - Use prior knowledge

11 Bayesian Inference
- Bayesian reasoning: compute the expectation over the unknown structure G
      P(x[M+1] | D) = Σ_G P(x[M+1] | G, D) P(G | D)
- Assumption: the structures G are mutually exclusive and exhaustive
- We know how to compute P(x[M+1] | G, D): same as prediction with a fixed structure
- How do we compute P(G | D)?

12 Marginal Likelihood
Using Bayes rule:
      P(G | D) = P(D | G) P(G) / P(D)
where P(D | G) is the marginal likelihood, P(G) is the prior over structures, and P(D) is the probability of the data.
- P(D) is the same for all structures G
- It can therefore be ignored when comparing structures
The posterior score is thus proportional to P(D | G) P(G).

13 Marginal Likelihood
- By introduction of variables, we have that
      P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ
  where P(D | G, Θ) is the likelihood and P(Θ | G) is the prior over parameters
- This integral measures sensitivity to the choice of parameters

14 Marginal Likelihood: Binomial Case
Assume we observe a sequence of coin tosses…
By the chain rule we have:
      P(x[1], …, x[M]) = Π_{m=1}^{M} P(x[m] | x[1], …, x[m−1])
Recall that
      P(x[m+1] = H | x[1], …, x[m]) = (N_H^m + α_H) / (m + α_H + α_T)
where N_H^m is the number of heads in the first m examples.

15 Marginal Likelihood: Binomials (cont.)
We simplify this by using the identity Γ(x+1) = x·Γ(x). Thus
      P(x[1], …, x[M]) = [Γ(α_H + α_T) / Γ(α_H + α_T + M)] · [Γ(α_H + N_H) / Γ(α_H)] · [Γ(α_T + N_T) / Γ(α_T)]
where N_H and N_T are the total numbers of heads and tails in D.
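
The closed form above is straightforward to implement with log-gamma functions. A sketch, assuming a Beta(α_H, α_T) prior; log-space arithmetic avoids overflow for large M.

```python
from math import lgamma, exp

def log_marginal_binomial(n_heads, n_tails, alpha_h=1.0, alpha_t=1.0):
    """log P(D) from the closed form: ratios of Gamma functions."""
    alpha, m = alpha_h + alpha_t, n_heads + n_tails
    return (lgamma(alpha) - lgamma(alpha + m)
            + lgamma(alpha_h + n_heads) - lgamma(alpha_h)
            + lgamma(alpha_t + n_tails) - lgamma(alpha_t))

# Sanity check: one head then one tail under a uniform Beta(1,1) prior.
# Sequentially: P(H) = 1/2, then P(T | H) = 1/3, so P(D) = 1/6.
print(exp(log_marginal_binomial(1, 1)))  # ~0.1667
```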

16 Binomial Likelihood: Example
Idealized experiment with P(H) = 0.25.
[Figure: (log P(D))/M as a function of M (0 to 50) for Dirichlet(.5,.5), Dirichlet(1,1), and Dirichlet(5,5) priors; the y-axis runs from about -1.3 to -0.6]

17 Marginal Likelihood: Example (cont.)
Actual experiment with P(H) = 0.25.
[Figure: (log P(D))/M as a function of M (0 to 50) for Dirichlet(.5,.5), Dirichlet(1,1), and Dirichlet(5,5) priors; the y-axis runs from about -1.3 to -0.6]

18 Marginal Likelihood: Multinomials
The same argument generalizes to multinomials with a Dirichlet prior:
- P(Θ) is Dirichlet with hyperparameters α_1, …, α_K
- D is a dataset with sufficient statistics N_1, …, N_K
Then
      P(D) = [Γ(Σ_k α_k) / Γ(Σ_k α_k + Σ_k N_k)] · Π_k [Γ(α_k + N_k) / Γ(α_k)]
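
A direct generalization of the binomial sketch above, assuming the hyperparameters and sufficient statistics are given as parallel lists.

```python
from math import lgamma

def log_marginal_multinomial(counts, alphas):
    """log P(D) for a multinomial with a Dirichlet(alphas) prior."""
    a, n = sum(alphas), sum(counts)
    return (lgamma(a) - lgamma(a + n)
            + sum(lgamma(ak + nk) - lgamma(ak)
                  for ak, nk in zip(alphas, counts)))

# Nine observations of a three-valued variable, three of each outcome,
# under a uniform Dirichlet(1,1,1) prior:
print(log_marginal_multinomial([3, 3, 3], [1.0, 1.0, 1.0]))
```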

19 Marginal Likelihood: Bayesian Networks
- Network structure determines the form of the marginal likelihood
Example data (7 samples):
      m: 1 2 3 4 5 6 7
      X: H T T H T H H
      Y: H T H H T T H
Network 1 (X and Y disconnected): two Dirichlet marginal likelihoods
- P(X[1], …, X[7])
- P(Y[1], …, Y[7])

20 Marginal Likelihood: Bayesian Networks (cont.)
Same data, Network 2 (X → Y): three Dirichlet marginal likelihoods
- P(X[1], …, X[7])
- P(Y[1], Y[4], Y[6], Y[7]): the Y values observed where X = H
- P(Y[2], Y[3], Y[5]): the Y values observed where X = T

21 Idealized Experiment
- P(X = H) = 0.5
- P(Y = H | X = H) = 0.5 + p
- P(Y = H | X = T) = 0.5 − p
[Figure: (log P(D))/M vs. M (1 to 1000, log scale) for the independent structure and for p = 0.05, 0.10, 0.15, 0.20; the y-axis runs from about -1.8 to -1.3]

22 Marginal Likelihood for General Networks
The marginal likelihood has the form:
      P(D | G) = Π_i Π_{pa_i^G} [ Γ(α(pa_i^G)) / Γ(α(pa_i^G) + N(pa_i^G)) ] · Π_{x_i} [ Γ(α(x_i, pa_i^G) + N(x_i, pa_i^G)) / Γ(α(x_i, pa_i^G)) ]
i.e., one Dirichlet marginal likelihood for the sequence of values of X_i when X_i's parents take a particular value, where
- N(..) are the counts from the data
- α(..) are the hyperparameters for each family, given G
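
A sketch of this family-by-family computation on the toy X/Y data from the earlier slides, assuming discrete data stored as a list of {variable: value} dicts and a single shared hyperparameter per cell (an illustrative simplification of the prior).

```python
from collections import defaultdict
from math import lgamma

def log_marginal_multinomial(counts, alphas):
    a, n = sum(alphas), sum(counts)
    return (lgamma(a) - lgamma(a + n)
            + sum(lgamma(ak + nk) - lgamma(ak) for ak, nk in zip(alphas, counts)))

def family_score(data, child, parents, values, alpha=1.0):
    """One Dirichlet marginal likelihood per parent configuration, summed."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in data:
        counts[tuple(row[p] for p in parents)][row[child]] += 1
    return sum(
        log_marginal_multinomial([cc[v] for v in values], [alpha] * len(values))
        for cc in counts.values())

data = [{'X': x, 'Y': y} for x, y in zip('HTTHTHH', 'HTHHTTH')]
# Network 2 from the slides (X -> Y): three Dirichlet terms in total,
# one for X and one for Y under each value of X.
print(family_score(data, 'X', [], 'HT') + family_score(data, 'Y', ['X'], 'HT'))
```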

23 Priors
- We need prior counts α(..) for each network structure G
- This can be a formidable task
  - There are exponentially many structures…

24 BDe Score
Possible solution: the BDe prior
- Represent the prior using two elements, M0 and B0
  - M0: the equivalent sample size
  - B0: a network representing the prior probability of events

25 BDe Score
Intuition: M0 prior examples distributed according to B0
- Set α(x_i, pa_i^G) = M0 · P(x_i, pa_i^G | B0)
  - Note that pa_i^G are not necessarily the same as the parents of X_i in B0
  - Compute P(x_i, pa_i^G | B0) using standard inference procedures
- Such priors have desirable theoretical properties
  - Equivalent networks are assigned the same score
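
A sketch of how these prior counts could be assembled, assuming some routine joint_prob(assignment) that answers joint queries against B0 by standard inference; joint_prob, m0, and the toy domains below are illustrative assumptions, not part of the lecture.

```python
from itertools import product

def bde_prior_counts(child, parents, domains, joint_prob, m0=10.0):
    """alpha(x_i, pa_i) = M0 * P(x_i, pa_i | B0), for every configuration."""
    alphas = {}
    for pa in product(*(domains[p] for p in parents)):
        for x in domains[child]:
            query = dict(zip(parents, pa))
            query[child] = x
            alphas[(x, pa)] = m0 * joint_prob(query)
    return alphas

# Toy B0 in which all variables are independent fair coins:
domains = {'X': 'HT', 'Y': 'HT'}
joint_prob = lambda query: 0.5 ** len(query)
print(bde_prior_counts('Y', ['X'], domains, joint_prob))  # every count = 2.5
```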

26 Bayesian Score: Asymptotic Behavior
Theorem: If the prior P(Θ | G) is "well-behaved", then
      log P(D | G) = ℓ(G, Θ̂_G : D) − (log M / 2) dim(G) + O(1)
where dim(G) is the number of independent parameters in G.
Proof:
- For the case of Dirichlet priors, use Stirling's approximation to Γ(·)
- General case: deferred to the incomplete-data section
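
The right-hand side is the familiar BIC/MDL score. A one-function sketch, assuming the maximized log-likelihood and the parameter count dim(G) are already computed.

```python
from math import log

def bic_score(max_loglik, dim_g, m):
    """log P(D|G) ~ l(G:D) - (log M / 2) dim(G), up to an O(1) term."""
    return max_loglik - (log(m) / 2.0) * dim_g

# E.g. 1000 tosses of a fair coin fit by a single-parameter model:
print(bic_score(1000 * log(0.5), dim_g=1, m=1000))
```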

27 Asymptotic Behavior: Consequences
- The Bayesian score is consistent
  - As M → ∞, the "true" structure G* maximizes the score (almost surely)
  - For sufficiently large M, the maximal-scoring structures are equivalent to G*
- Observed data eventually overrides prior information
  - Assuming that the prior assigns positive probability to all cases

28 Asymptotic Behavior
- This score can also be justified by the Minimum Description Length (MDL) principle
- The equation above explicitly shows the tradeoff between
  - Fit to the data: the likelihood term
  - Penalty for complexity: the regularization term

29 Scores: Summary
- Likelihood, MDL, and (log) BDe all have the decomposable form
      Score(G : D) = Σ_i Score(X_i | Pa_i^G : D)
- BDe requires assessing a prior network; it can naturally incorporate prior knowledge and previous experience
- BDe is consistent and asymptotically equivalent (up to a constant) to MDL
- All are score-equivalent: G equivalent to G' ⇒ Score(G) = Score(G')

30 Optimization Problem
Input:
- Training data
- Scoring function (including priors, if needed)
- Set of possible structures H
  - Including prior knowledge about structure
Output:
- A network (or networks) that maximizes the score
Key property:
- Decomposability: the score of a network is a sum of terms, one per family

31 Learning Trees
- Trees: at most one parent per variable
- Why trees?
  - Elegant math: we can solve the optimization problem efficiently (with a greedy algorithm)
  - Sparse parameterization: avoids overfitting while adapting to the data

32 Learning Trees (cont.)
- Let p(i) denote the parent of X_i, or 0 if X_i has no parent
- We can write the score as
      Score(G : D) = Σ_{i : p(i) > 0} [Score(X_i | X_{p(i)}) − Score(X_i)] + Σ_i Score(X_i)
- Score = sum of edge scores + constant
  - The second sum is the score of the "empty" network
  - Each term in the first sum is the improvement over the "empty" network from adding the edge X_{p(i)} → X_i

33 Learning Trees (cont.)
Algorithm:
- Construct a graph with vertices 1, 2, …
- Set w(i → j) = Score(X_j | X_i) − Score(X_j)
- Find the tree (or forest) with maximal weight
  - This can be done using standard algorithms in low-order polynomial time, by building the tree in a greedy fashion (Kruskal's maximum-spanning-tree algorithm)
Theorem: This procedure finds the tree with maximal score.
When the score is likelihood, w(i → j) is proportional to I(X_i ; X_j); this is known as the Chow & Liu method.
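
A sketch of the Chow & Liu procedure as described above: weight every pair by empirical mutual information, then keep the heaviest acyclic edges (Kruskal with union-find). The data format (a list of {variable: value} dicts) is an assumption for illustration.

```python
import math
from collections import Counter

def H(values):
    """Empirical entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def MI(xs, ys):
    """Empirical mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return H(xs) + H(ys) - H(list(zip(xs, ys)))

def chow_liu_tree(data, variables):
    cols = {v: [row[v] for row in data] for v in variables}
    edges = sorted(((MI(cols[a], cols[b]), a, b)
                    for i, a in enumerate(variables)
                    for b in variables[i + 1:]), reverse=True)
    parent = {v: v for v in variables}          # union-find forest
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    tree = []
    for w, a, b in edges:                       # heaviest edges first
        ra, rb = find(a), find(b)
        if ra != rb:                            # skip edges that close a cycle
            parent[ra] = rb
            tree.append((a, b, round(w, 3)))
    return tree                                 # undirected; root arbitrarily

data = [{'X': x, 'Y': y, 'Z': z}
        for x, y, z in zip('HTTHTHH', 'HTHHTTH', 'TTTHHHH')]
print(chow_liu_tree(data, ['X', 'Y', 'Z']))
```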

34 Learning Trees: Example
Tree learned from the ALARM network data:
- Not every edge in the tree is in the original network
- Tree edge direction is arbitrary: we can't learn about arc direction
[Figure: the learned tree over the ALARM variables (PCWP, CO, HRBP, HREKG, HRSAT, …), with correct arcs and spurious arcs marked]

35 Beyond Trees
When we consider more complex networks, the problem is not as easy
- Suppose we allow two parents per variable
- A greedy algorithm is no longer guaranteed to find the optimal network
- In fact, no efficient algorithm exists
Theorem: Finding the maximal-scoring network structure with at most k parents per variable is NP-hard for k > 1

36 Heuristic Search
We address the problem by using heuristic search
- Define a search space:
  - Nodes are possible structures
  - Edges denote adjacency of structures
- Traverse this space looking for high-scoring structures
Search techniques:
- Greedy hill-climbing
- Best-first search
- Simulated annealing
- ...

37 Heuristic Search (cont.)
Typical operations on the current structure:
- Add an arc (e.g., add C → D)
- Delete an arc (e.g., delete C → E)
- Reverse an arc (e.g., reverse C → E)
[Figure: a network over S, C, E, D and the three neighboring structures produced by these operations]

38 Exploiting Decomposability in Local Search
- Caching: to update the score after a local change, we only need to re-score the families that were changed in the last move
[Figure: a network over S, C, E, D and its neighbors; each move touches only one or two families]
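
A sketch of the cached delta computation: scoring a move touches only the one or two families whose parent sets change. Here score_fn stands for any decomposable family score (e.g., the family_score sketched earlier); the move encoding is an illustrative assumption.

```python
def delta_score(score_fn, data, parents_of, move):
    """Score change of a move, re-scoring only the affected families."""
    op, u, v = move                  # ('add' | 'delete' | 'reverse', u, v)
    new = {x: set(ps) for x, ps in parents_of.items()}
    if op == 'add':
        new[v].add(u)
    elif op == 'delete':
        new[v].discard(u)
    elif op == 'reverse':            # u -> v becomes v -> u
        new[v].discard(u)
        new[u].add(v)
    touched = {u, v} if op == 'reverse' else {v}
    return sum(score_fn(data, x, sorted(new[x]))
               - score_fn(data, x, sorted(parents_of[x])) for x in touched)

# Toy check with a dummy score that just penalizes parents:
dummy = lambda data, x, ps: -len(ps)
pas = {'S': set(), 'C': {'S'}, 'E': {'C'}, 'D': {'E'}}
print(delta_score(dummy, [], pas, ('add', 'C', 'D')))  # re-scores only D: -1
```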

39 Greedy Hill-Climbing
- Simplest heuristic local search
  - Start with a given network
    - the empty network
    - the best tree
    - a random network
  - At each iteration
    - Evaluate all possible changes
    - Apply the change that leads to the best improvement in score
    - Reiterate
  - Stop when no modification improves the score
- Each step requires evaluating approximately n new changes
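
A minimal sketch of this loop, assuming only add/delete moves, a decomposable BIC family score, and data as a list of {variable: value} dicts; all helper names are illustrative.

```python
import math
from collections import Counter
from itertools import permutations

def family_bic(data, child, parents):
    """Max log-likelihood of child given parents, minus the BIC penalty."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    pa_counts = Counter(tuple(r[p] for p in parents) for r in data)
    ll = sum(n * math.log(n / pa_counts[pa]) for (pa, x), n in joint.items())
    dim = (len({r[child] for r in data}) - 1) * max(1, len(pa_counts))
    return ll - 0.5 * math.log(len(data)) * dim

def creates_cycle(parents_of, u, v):
    """Adding u -> v closes a cycle iff v is already an ancestor of u."""
    stack, seen = [u], set()
    while stack:
        x = stack.pop()
        if x == v:
            return True
        if x not in seen:
            seen.add(x)
            stack.extend(parents_of[x])
    return False

def hill_climb(data, variables):
    parents_of = {v: set() for v in variables}             # empty network
    fam = {v: family_bic(data, v, sorted(parents_of[v])) for v in variables}
    while True:
        best_delta, best_move = 0.0, None
        for u, v in permutations(variables, 2):
            if u in parents_of[v]:                         # try delete u -> v
                new_pa = sorted(parents_of[v] - {u})
            elif not creates_cycle(parents_of, u, v):      # try add u -> v
                new_pa = sorted(parents_of[v] | {u})
            else:
                continue
            delta = family_bic(data, v, new_pa) - fam[v]   # one family only
            if delta > best_delta + 1e-12:
                best_delta, best_move = delta, (v, new_pa)
        if best_move is None:                              # no improvement
            return parents_of
        v, new_pa = best_move
        parents_of[v] = set(new_pa)
        fam[v] = family_bic(data, v, new_pa)

data = [{'X': x, 'Y': y} for x, y in zip('HHHHTTTT', 'HHHTTTTT')]
print(hill_climb(data, ['X', 'Y']))   # learns a single edge between X and Y
```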

40 Greedy Hill-Climbing: Possible Pitfalls
- Greedy hill-climbing can get stuck in:
  - Local maxima: all one-edge changes reduce the score
  - Plateaus: some one-edge changes leave the score unchanged
    - Happens because equivalent networks receive the same score and are neighbors in the search space
- Both occur during structure search
- Standard heuristics can escape both
  - Random restarts
  - TABU search

41 Equivalence Class Search
Idea:
- Search the space of equivalence classes
- Equivalence classes can be represented by PDAGs (partially directed acyclic graphs)
Benefits:
- The space of PDAGs has fewer local maxima and plateaus
- There are fewer PDAGs than DAGs

42 Equivalence Class Search (cont.)
Drawbacks:
- Evaluating changes is more expensive
- These algorithms are more complex to implement
[Figure: adding the undirected edge Y-Z to a PDAG over X, Y, Z; the new PDAG is scored via a consistent DAG]

43 Learning in Practice: Alarm Domain
[Figure: KL divergence from the true distribution vs. number of training samples M (0 to 5000), comparing True Structure/BDe (M' = 10) with Unknown Structure/BDe (M' = 10)]

44 Model Selection
- So far, we focused on a single model
  - Find the best-scoring model
  - Use it to predict the next example
- Implicit assumption: the best-scoring model dominates the weighted sum
- Pros:
  - We get a single structure
  - Allows for efficient use in our tasks
- Cons:
  - We are committing to the independencies of a particular structure
  - Other structures might be as probable given the data

45 Model Averaging
- Recall, Bayesian analysis started with
      P(x[M+1] | D) = Σ_G P(x[M+1] | G, D) P(G | D)
- This requires us to average over all possible models

46 Model Averaging (cont.)
- Full averaging
  - Sum over all structures
  - Usually intractable: there are exponentially many structures
- Approximate averaging
  - Find the K largest-scoring structures
  - Approximate the sum by averaging over their predictions
  - The weight of each structure is determined by the Bayes factor, i.e., by the posterior score P(D | G) P(G) that we actually compute
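
A sketch of the approximate average, assuming the per-structure predictions P(x[M+1] | G_k, D) and posterior log-scores are already computed; the log-sum-exp shift keeps the weights numerically stable.

```python
import math

def average_predictions(log_scores, predictions):
    """Weight each structure's prediction by its renormalized posterior score."""
    mx = max(log_scores)
    weights = [math.exp(s - mx) for s in log_scores]   # shift before exp
    z = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / z

# Two structures; the first is e^2 (~7.4x) more probable a posteriori:
print(average_predictions([-10.0, -12.0], [0.8, 0.3]))  # ~0.74, close to 0.8
```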

47 Search: Summary
- A discrete optimization problem
- In general, NP-hard
  - Need to resort to heuristic search
  - In practice, search is relatively fast (~100 vars in ~10 min), thanks to:
    - Decomposability
    - Sufficient statistics
- In some cases, we can reduce the search problem to an easy optimization problem
  - Example: learning trees

