1
Approximate Inference
Slides by Nir Friedman
2
When can we hope to approximate?
Two situations:
• Highly stochastic distributions: “far” evidence is discarded
• “Peaked” distributions: improbable values are ignored
3
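Stochasticity & Approximations
• Consider a chain: X_1 → X_2 → X_3 → … → X_{n+1}
• $P(X_{i+1} = t \mid X_i = t) = 1 - \epsilon$ and $P(X_{i+1} = f \mid X_i = f) = 1 - \epsilon$
Computing the probability of X_{n+1} given X_1, we get
Even # of flips: $P(X_{n+1} = t \mid X_1 = t) = \sum_{k \text{ even}} \binom{n}{k} \epsilon^{k} (1 - \epsilon)^{n-k}$
Odd # of flips: $P(X_{n+1} = f \mid X_1 = t) = \sum_{k \text{ odd}} \binom{n}{k} \epsilon^{k} (1 - \epsilon)^{n-k}$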
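The even/odd flip count can be checked numerically. The sketch below is illustrative only: it writes `eps` for the per-step flip probability (the symbol and the function names are assumptions, not taken from the slides) and compares the sum over even numbers of flips with the closed form 1/2 + 1/2 (1 - 2 eps)^n, which follows from the binomial theorem.

```python
from math import comb


def p_same_by_counting(n: int, eps: float) -> float:
    """P(X_{n+1} = t | X_1 = t): total probability of an even number of flips in n steps."""
    return sum(comb(n, k) * eps**k * (1 - eps) ** (n - k) for k in range(0, n + 1, 2))


def p_same_closed_form(n: int, eps: float) -> float:
    """Closed form of the same quantity: 1/2 + 1/2 * (1 - 2*eps)^n."""
    return 0.5 + 0.5 * (1 - 2 * eps) ** n


# As n grows, the value approaches 0.5 for any eps strictly between 0 and 1 (mixing).
for n in (5, 10, 20):
    row = [round(p_same_by_counting(n, eps), 4) for eps in (0.05, 0.1, 0.25, 0.5)]
    print(f"n = {n:2d}: {row}")
```

The closer `eps` is to 0.5, the faster the influence of X_1 on X_{n+1} disappears.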
4
[Plot of P(X_n = t | X_1 = t): vertical axis from 0.5 to 1, horizontal axis from 0 to 0.5, one curve each for n = 5, n = 10, and n = 20.]
5
Stochastic Processes
• This behavior of a chain (a Markov process) is called mixing.
• General Bayes nets behave similarly: if probabilities are far from 0 and 1, then the effect of “far” evidence vanishes (and so it can be discarded in approximations).
6
“Peaked” distributions
• If the distribution is “peaked”, then most of the mass is on a few instances
• If we can focus on these instances, we can ignore the rest
[Figure: a distribution plotted over instances.]
7
Global conditioning
[Figure: a network over the nodes A, B, C, D, E, I, J, K, L, M, and the simpler network that remains after fixing the values of A & B.]
Fixing values at the beginning of the summation can shrink the tables formed by variable elimination. This way space is traded for time.
Special case: choose to fix a set of nodes that “break all loops”. This method is called cutset conditioning.
8
Bounded conditioning
[Figure: the network with the values of A & B fixed.]
By examining only the probable assignments of A & B, we perform several simple computations instead of one complex one.
9
Bounded conditioning
• Choose A and B so that P(Y, e | a, b) can be computed easily, e.g., a cycle cutset.
• Search for highly probable assignments to A, B.
  - Option 1: select a, b with high P(a, b).
  - Option 2: select a, b with high P(a, b | e).
• We need to search for such high-mass values, and that can be hard.
10
Bounded Conditioning
Advantages:
• Combines exact inference with approximation
• Continuous: more time can be used to examine more cases
• Bounds: the unexamined mass is used to compute error bars
Possible problems:
• P(a, b) is the prior mass, not the posterior. If the posterior P(a, b | e) is significantly different, computation can be wasted on irrelevant assignments.
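To make the error bars concrete, here is a minimal sketch of how bounds on P(e) could be computed from the examined assignments. It assumes each examined assignment (a, b) comes with its prior P(a, b) and an exactly computed P(e | a, b); the function name and the numbers are hypothetical. Since P(e) is the sum of P(e | a, b) P(a, b) over all assignments and each unexamined P(e | a, b) lies between 0 and 1, the unexamined prior mass bounds the error.

```python
def bounded_conditioning_bounds(examined):
    """Return (lower, upper) bounds on P(e).

    examined: iterable of (prior, likelihood) pairs, one per examined
    assignment (a, b), with prior = P(a, b) and likelihood = P(e | a, b).
    """
    examined = list(examined)
    # Mass of the evidence accounted for by the examined assignments.
    lower = sum(prior * likelihood for prior, likelihood in examined)
    # Prior mass of assignments that were never examined.
    unexamined_mass = max(0.0, 1.0 - sum(prior for prior, _ in examined))
    # In the worst case every unexamined assignment has P(e | a, b) = 1.
    upper = lower + unexamined_mass
    return lower, upper


# Hypothetical numbers: two assignments covering 85% of the prior mass.
lower, upper = bounded_conditioning_bounds([(0.60, 0.10), (0.25, 0.40)])
print(f"{lower:.2f} <= P(e) <= {upper:.2f}")
```

Examining more assignments shrinks the unexamined mass, so the interval tightens as more time is spent.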
11
Network Simplifications
• In these approaches, we try to replace the original network with a simpler one
  - The resulting network allows fast exact methods
12
Network Simplifications
Typical simplifications:
  - Remove parts of the network
  - Remove edges
  - Reduce the number of values (value abstraction)
  - Replace a sub-network with a simpler one (model abstraction)
• These simplifications are often chosen with respect to the particular evidence and query
13
Stochastic Simulation
Suppose our goal is to compute the likelihood of evidence P(e), where e is an assignment to some variables in {X_1, …, X_n}.
Assume that we can sample instances according to the distribution P(x_1, …, x_n).
What is then the probability that a random sample satisfies e?
Answer: simply P(e), which is what we wish to compute.
Each sample simulates the tossing of a biased coin with probability P(e) of “Heads”.
14
Stochastic Sampling
Intuition: given a sufficient number of samples x[1], …, x[N], we can estimate
$P(e) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{x[i] \text{ satisfies } e\}$
where each term of the sum is zero or one.
• The law of large numbers implies that as N grows, our estimate will converge to p = P(e) with high probability
• How many samples do we need to get a reliable estimate? We will not discuss this issue here.
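The estimate is just the fraction of samples that satisfy e. The sketch below illustrates this on a toy distribution of two independent fair coins, which is an assumption made only for the example; `estimate_probability`, `sample_two_coins`, and `both_heads` are hypothetical names.

```python
import random


def estimate_probability(sample_fn, satisfies_e, num_samples, rng):
    """Fraction of samples that satisfy the evidence (an average of zeros and ones)."""
    hits = sum(1 for _ in range(num_samples) if satisfies_e(sample_fn(rng)))
    return hits / num_samples


def sample_two_coins(rng):
    """Toy distribution: two independent fair coins."""
    return {"X1": rng.random() < 0.5, "X2": rng.random() < 0.5}


def both_heads(sample):
    """Evidence e: X1 = heads and X2 = heads, so P(e) = 0.25 exactly."""
    return sample["X1"] and sample["X2"]


rng = random.Random(0)
for n in (100, 10_000):
    print(n, estimate_probability(sample_two_coins, both_heads, n, rng))
```

With more samples the printed estimate should settle near 0.25.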
15
Sampling a Bayesian Network
If P(X_1, …, X_n) is represented by a Bayesian network, can we efficiently sample from it?
• Idea: sample according to the structure of the network
  - Write the distribution using the chain rule, and then sample each variable given its parents
16
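Logic sampling
Network: Burglary (B) → Alarm (A) ← Earthquake (E); Earthquake (E) → Radio (R); Alarm (A) → Call (C)
P(b) = 0.03
P(e) = 0.001
P(a | B, E): 0.98, 0.4, 0.7, 0.01 (one entry per combination of values of B and E)
P(c | a) = 0.8, P(c | ¬a) = 0.05
P(r | e) = 0.3, P(r | ¬e) = 0.001
Samples (columns B E A C R): B is sampled first, using P(b) = 0.03.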
17
Logic sampling (same network and tables as on the previous slide)
Samples (columns B E A C R): with B assigned, E is sampled next, using P(e) = 0.001.
18
Logic sampling (same network and tables)
Samples (columns B E A C R): A is sampled given the sampled values of B and E; the table entry used is 0.4.
19
Logic sampling (same network and tables)
Samples (columns B E A C R): C is sampled given the sampled value of A; the table entry used is 0.8.
20
Logic sampling (same network and tables)
Samples (columns B E A C R): R is sampled given the sampled value of E; the table entry used is 0.3.
21
Logic sampling (same network and tables)
Samples (columns B E A C R): all five variables have been assigned, which completes one sample.
22
Logic Sampling
Let X_1, …, X_n be an order of the variables consistent with arc direction
for i = 1, …, n do
    sample x_i from P(X_i | pa_i)
    (Note: since Pa_i ⊆ {X_1, …, X_{i-1}}, we have already assigned values to them)
return x_1, …, x_n
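The procedure above can be sketched in a few lines of Python. This is an illustration rather than the slides' own code: the list-of-tuples network encoding and the name `logic_sample` are assumptions, and so is the assignment of the two middle entries of the alarm table (0.4 and 0.7) to parent combinations, since the transcript does not show which combination each belongs to.

```python
import random

# Each entry: (variable, parents, CPT mapping parent values -> P(variable = True)).
# Entries are listed in an order consistent with arc direction (parents first).
NETWORK = [
    ("B", (), {(): 0.03}),
    ("E", (), {(): 0.001}),
    # ASSUMPTION: which of 0.4 / 0.7 goes with which mixed (B, E) combination.
    ("A", ("B", "E"), {(True, True): 0.98, (False, True): 0.4,
                       (True, False): 0.7, (False, False): 0.01}),
    ("C", ("A",), {(True,): 0.8, (False,): 0.05}),
    ("R", ("E",), {(True,): 0.3, (False,): 0.001}),
]


def logic_sample(network, rng):
    """Sample every variable in order, each given its already-sampled parents."""
    sample = {}
    for var, parents, cpt in network:
        p_true = cpt[tuple(sample[p] for p in parents)]
        sample[var] = rng.random() < p_true
    return sample


rng = random.Random(0)
samples = [logic_sample(NETWORK, rng) for _ in range(50_000)]
# Estimate P(Call = true) as the fraction of samples in which C is true.
p_call = sum(s["C"] for s in samples) / len(samples)
print(f"estimated P(c) = {p_call:.4f}")
```

Each sample costs one table lookup and one random draw per variable, regardless of how the network is connected.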
23
Logic Sampling
• Sampling a complete instance is linear in the number of variables
  - Regardless of the structure of the network
• However, if P(e) is small, we need many samples to get a decent estimate
24
Can we sample from P(X_i | e)?
• If the evidence e is in the roots of the Bayes network, this is easy
• If the evidence is in the leaves of the network, we have a problem:
  - Our sampling method proceeds according to the order of nodes in the network.
[Figure: a network over the nodes Z, R, B, X with the evidence A = a.]
25
Likelihood Weighting
Can we ensure that all of our samples satisfy e?
• One simple (but wrong) solution: when we need to sample a variable Y that is assigned a value by e, use its specified value.
• For example, in the network X → Y, we know Y = 1:
  - Sample X from P(X)
  - Then take Y = 1
• Is this a sample from P(X, Y | Y = 1)? No.
26
Likelihood Weighting
• Problem: these samples of X are from P(X)
• Solution: penalize samples in which P(Y = 1 | X) is small
• We now sample as follows:
  - Let x_i be a sample from P(X)
  - Let w_i = P(Y = 1 | X = x_i)
27
Likelihood Weighting
Let X_1, …, X_n be an order of the variables consistent with arc direction
w = 1
for i = 1, …, n do
    if X_i = x_i has been observed
        w ← w · P(X_i = x_i | pa_i)
    else
        sample x_i from P(X_i | pa_i)
return x_1, …, x_n, and w
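A minimal Python sketch of this loop follows, under stated assumptions: the network encoding, the names `weighted_sample` and `estimate_posterior`, and the choice of evidence (A false, R true) are illustrative, and the placement of the 0.4 and 0.7 entries in the alarm table is assumed because the transcript does not show it. The posterior of a query variable is estimated as the weight of the samples in which it is true divided by the total weight.

```python
import random

# (variable, parents, CPT mapping parent values -> P(variable = True)), parents first.
NETWORK = [
    ("B", (), {(): 0.03}),
    ("E", (), {(): 0.001}),
    # ASSUMPTION: which of 0.4 / 0.7 goes with which mixed (B, E) combination.
    ("A", ("B", "E"), {(True, True): 0.98, (False, True): 0.4,
                       (True, False): 0.7, (False, False): 0.01}),
    ("C", ("A",), {(True,): 0.8, (False,): 0.05}),
    ("R", ("E",), {(True,): 0.3, (False,): 0.001}),
]


def weighted_sample(network, evidence, rng):
    """Return (sample, weight): observed variables are clamped and weighted."""
    sample, weight = {}, 1.0
    for var, parents, cpt in network:
        p_true = cpt[tuple(sample[p] for p in parents)]
        if var in evidence:
            sample[var] = evidence[var]
            # w <- w * P(X_i = x_i | pa_i) for the observed value x_i.
            weight *= p_true if evidence[var] else 1.0 - p_true
        else:
            sample[var] = rng.random() < p_true
    return sample, weight


def estimate_posterior(network, query, evidence, num_samples, rng):
    """Weighted fraction of samples with query = True, i.e. P(query | evidence)."""
    total = hit = 0.0
    for _ in range(num_samples):
        sample, weight = weighted_sample(network, evidence, rng)
        total += weight
        if sample[query]:
            hit += weight
    return hit / total


rng = random.Random(0)
evidence = {"A": False, "R": True}
p_quake = estimate_posterior(NETWORK, "E", evidence, 200_000, rng)
print(f"estimated P(e | not a, r) = {p_quake:.3f}")
```

Every sample is consistent with the evidence, but E is still drawn from its prior of 0.001, so only a small share of the samples carries most of the weight and the estimate remains noisy.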
28
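Likelihood Weighting
Network: Burglary (B) → Alarm (A) ← Earthquake (E); Earthquake (E) → Radio (R); Alarm (A) → Call (C)
Evidence: A = ¬a, R = r
P(b) = 0.03
P(e) = 0.001
P(a | B, E): 0.98, 0.4, 0.7, 0.01 (one entry per combination of values of B and E)
P(c | a) = 0.8, P(c | ¬a) = 0.05
P(r | e) = 0.3, P(r | ¬e) = 0.001
Samples (columns B E A C R): B is sampled first, using P(b) = 0.03. Weight = 1 (no evidence variable reached yet).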
29
Likelihood Weighting (same network, tables, and evidence as on the previous slide)
Samples (columns B E A C R): with B assigned, E is sampled next, using P(e) = 0.001. Weight = 1.
30
Likelihood Weighting (same network, tables, and evidence)
Samples (columns B E A C R): A is observed (A = ¬a), so it is not sampled. The table entry for the sampled B and E is 0.4, so P(¬a | B, E) = 0.6 and Weight = 0.6.
31
Likelihood Weighting (same network, tables, and evidence)
Samples (columns B E A C R): C is sampled given A = ¬a, using P(c | ¬a) = 0.05. Weight = 0.6.
32
Likelihood Weighting (same network, tables, and evidence)
Samples (columns B E A C R): R is observed (R = r), so it is not sampled. The table entry used is 0.3, so Weight = 0.6 * 0.3.
33
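Likelihood Weighting
• Why does this make sense?
• When N is large, we expect to sample N P(X = x) samples with x[i] = x
• Thus, the total weight of the samples with x[i] = x is approximately
$N \, P(X = x) \, P(Y = 1 \mid X = x) = N \, P(X = x, Y = 1)$
and normalizing the weights gives an estimate of P(X = x | Y = 1).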
34
Summary
Approximate inference is needed for large pedigrees. We have seen a few methods today. Some could fit genetic linkage analysis and some could not. There are many other approximation algorithms: variational methods, MCMC, and others.
In next semester’s project course in Bioinformatics (236524), we will offer projects that seek to implement some approximation methods and embed them in the Superlink software.