
1 CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Monte Carlo Methods for Probabilistic Inference

2 AGENDA
- Monte Carlo methods; O(1/sqrt(N)) standard deviation
- Monte Carlo methods for Bayesian inference: likelihood weighting, Gibbs sampling

3 MONTE CARLO INTEGRATION
Estimate large integrals/sums: I = ∫ f(x) p(x) dx or I = Σ_x f(x) p(x)
Using N i.i.d. samples x^(1),…,x^(N) drawn from p(x): I ≈ (1/N) Σ_i f(x^(i))
Examples:
- ∫_[a,b] f(x) dx ≈ ((b-a)/N) Σ_i f(x^(i)), with the x^(i) drawn uniformly from [a,b]
- E[X] = ∫ x p(x) dx ≈ (1/N) Σ_i x^(i)
- Volume of a set in R^n
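A minimal sketch of this estimator (the choice of f, p, and the sample size are illustrative assumptions, not from the slides):

```python
import random

def mc_estimate(f, sample_p, n):
    """Estimate I = E_p[f(X)] as the average of f over n i.i.d. draws from p."""
    return sum(f(sample_p()) for _ in range(n)) / n

# Example: integral of x^2 over [0,1] with p(x) uniform on [0,1].
# Here (b - a) = 1, so the estimate is just the sample mean of f.
est = mc_estimate(lambda x: x * x, lambda: random.uniform(0.0, 1.0), 100_000)
print(est, "vs exact", 1.0 / 3.0)
```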

4 MEAN & VARIANCE OF ESTIMATE
Let I_N be the random variable denoting the estimate of the integral with N samples.
What is the bias (mean error) E[I - I_N]?

5 MEAN & VARIANCE OF ESTIMATE
Let I_N be the random variable denoting the estimate of the integral with N samples.
What is the bias (mean error) E[I - I_N]?
E[I - I_N] = I - E[I_N] (linearity of expectation)

6 MEAN & VARIANCE OF ESTIMATE
Let I_N be the random variable denoting the estimate of the integral with N samples.
What is the bias (mean error) E[I - I_N]?
E[I - I_N] = I - E[I_N] (linearity of expectation)
= E[f(x)] - (1/N) Σ_i E[f(x^(i))] (definition of I and I_N)

7 MEAN & VARIANCE OF ESTIMATE
Let I_N be the random variable denoting the estimate of the integral with N samples.
What is the bias (mean error) E[I - I_N]?
E[I - I_N] = I - E[I_N] (linearity of expectation)
= E[f(x)] - (1/N) Σ_i E[f(x^(i))] (definition of I and I_N)
= (1/N) Σ_i (E[f(x)] - E[f(x^(i))]) = (1/N) Σ_i 0 = 0 (each x^(i) is distributed according to p(x))

8 MEAN & VARIANCE OF ESTIMATE
Let I_N be the random variable denoting the estimate of the integral with N samples.
What is the bias (mean error) E[I - I_N]? Zero: I_N is an unbiased estimator.
What is the variance Var[I_N]?

9 MEAN & VARIANCE OF ESTIMATE
Let I_N be the random variable denoting the estimate of the integral with N samples.
What is the bias (mean error) E[I - I_N]? Zero: I_N is an unbiased estimator.
What is the variance Var[I_N]?
Var[I_N] = Var[(1/N) Σ_i f(x^(i))] (definition)

10 MEAN & VARIANCE OF ESTIMATE
Let I_N be the random variable denoting the estimate of the integral with N samples.
What is the bias (mean error) E[I - I_N]? Zero: I_N is an unbiased estimator.
What is the variance Var[I_N]?
Var[I_N] = Var[(1/N) Σ_i f(x^(i))] (definition)
= (1/N^2) Var[Σ_i f(x^(i))] (scaling of variance)

11 MEAN & VARIANCE OF ESTIMATE
Let I_N be the random variable denoting the estimate of the integral with N samples.
What is the bias (mean error) E[I - I_N]? Zero: I_N is an unbiased estimator.
What is the variance Var[I_N]?
Var[I_N] = Var[(1/N) Σ_i f(x^(i))] (definition)
= (1/N^2) Var[Σ_i f(x^(i))] (scaling of variance)
= (1/N^2) Σ_i Var[f(x^(i))] (variance of a sum of independent variables)

12 MEAN & VARIANCE OF ESTIMATE
Let I_N be the random variable denoting the estimate of the integral with N samples.
What is the bias (mean error) E[I - I_N]? Zero: I_N is an unbiased estimator.
What is the variance Var[I_N]?
Var[I_N] = Var[(1/N) Σ_i f(x^(i))] (definition)
= (1/N^2) Var[Σ_i f(x^(i))] (scaling of variance)
= (1/N^2) Σ_i Var[f(x^(i))]
= (1/N) Var[f(x)] (i.i.d. sample)

13 MEAN & VARIANCE OF ESTIMATE
Let I_N be the random variable denoting the estimate of the integral with N samples.
Bias E[I - I_N]: zero (unbiased estimator)
Variance: Var[I_N] = (1/N) Var[f(x)]
Standard deviation: O(1/sqrt(N))
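A quick empirical check of the O(1/sqrt(N)) behavior (a sketch; the particular f and the sample sizes are arbitrary assumptions). Quadrupling N should roughly halve the standard deviation of the estimate:

```python
import random
import statistics

def estimate(n):
    # I_N = (1/N) * sum_i f(x^(i)) with f(x) = x^2 and x ~ Uniform[0,1]
    return sum(random.random() ** 2 for _ in range(n)) / n

for n in (100, 400, 1600):
    runs = [estimate(n) for _ in range(500)]   # 500 independent estimates
    print(n, statistics.stdev(runs))           # shrinks like 1/sqrt(n)
```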

14 APPROXIMATE INFERENCE THROUGH SAMPLING
Unconditional simulation: to estimate the probability that a coin flip lands heads, flip the coin a huge number of times and count the fraction of heads observed.

15 APPROXIMATE INFERENCE THROUGH SAMPLING
Unconditional simulation: to estimate the probability that a coin flip lands heads, flip the coin a huge number of times and count the fraction of heads observed.
Conditional simulation: to estimate the probability P(H) that a coin picked out of a bucket flips heads, repeat for i = 1,…,N:
1. Pick a random bucket b^(i) with probability P(B) and take a coin C from it
2. Flip C to get h^(i), distributed according to P(H | b^(i))
3. The pair (h^(i), b^(i)) is then a sample from the joint distribution P(H, B)
The resulting samples approximate P(H, B). (A sketch follows below.)
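A sketch of this two-stage simulation; the bucket distribution and per-bucket head probabilities below are made-up numbers for illustration:

```python
import random

P_B = {"b1": 0.7, "b2": 0.3}           # assumed P(B)
P_H_given_B = {"b1": 0.5, "b2": 0.9}   # assumed P(H=heads | B)

def sample_joint():
    b = random.choices(list(P_B), weights=list(P_B.values()))[0]  # 1. bucket ~ P(B)
    h = 1 if random.random() < P_H_given_B[b] else 0              # 2. flip ~ P(H | b)
    return h, b                                                   # 3. (h, b) ~ P(H, B)

samples = [sample_joint() for _ in range(100_000)]
print("P(H) ~", sum(h for h, _ in samples) / len(samples))
```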

16 MONTE CARLO INFERENCE IN BAYES NETS
Given a BN over variables X, repeat for i = 1,…,N:
In top-down (topological) order, generate x^(i) as follows: sample x_j^(i) ~ P(X_j | pa_Xj^(i)), where the right-hand side is obtained by plugging the already-sampled parent values into the CPT for X_j.
The samples x^(1),…,x^(N) approximate the joint distribution over X. (A sketch appears after the next slide.)

17 APPROXIMATE INFERENCE: MONTE CARLO SIMULATION
Sample from the joint distribution of the burglary network:
Priors: P(B) = 0.001, P(E) = 0.002

B E | P(A|B,E)
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001

A | P(J|A)    A | P(M|A)
T | 0.90      T | 0.70
F | 0.05      F | 0.01

Example sample: B=0 E=0 A=0 J=1 M=0
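A sketch of this top-down sampler for the network above, using the CPTs from the slide (the dictionary layout and function names are my own):

```python
import random

def bern(p):
    """Return 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)

def sample_alarm_net():
    """One top-down (ancestral) sample from the burglary network."""
    b = bern(0.001)                    # P(B=1)
    e = bern(0.002)                    # P(E=1)
    a = bern(P_A[(b, e)])              # P(A=1 | b, e)
    j = bern(0.90 if a else 0.05)      # P(J=1 | a)
    m = bern(0.70 if a else 0.01)      # P(M=1 | a)
    return {"B": b, "E": e, "A": a, "J": j, "M": m}
```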

18 APPROXIMATE INFERENCE: MONTE CARLO SIMULATION
As more samples are generated, the distribution of the samples approaches the joint distribution:
B=0 E=0 A=0 J=1 M=0
B=0 E=0 A=0 J=0 M=0
B=0 E=0 A=0 J=0 M=0
B=1 E=0 A=1 J=1 M=0

19 BASIC METHOD FOR HANDLING EVIDENCE
Inference: given evidence E = e (e.g., J=1), approximate P(X \ E | E = e)
Remove the samples that conflict with the evidence:
B=0 E=0 A=0 J=1 M=0 (kept)
B=0 E=0 A=0 J=0 M=0 (discarded)
B=0 E=0 A=0 J=0 M=0 (discarded)
B=1 E=0 A=1 J=1 M=0 (kept)
The distribution of the remaining samples approximates the conditional distribution. (A sketch follows below.)
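A sketch of this reject-and-count scheme, reusing the sample_alarm_net function from the sketch after slide 17 (a name I introduced there, not from the slides):

```python
def rejection_query(var, evidence, n=100_000):
    """Estimate P(var=1 | evidence) by discarding conflicting samples."""
    kept = [s for s in (sample_alarm_net() for _ in range(n))
            if all(s[k] == v for k, v in evidence.items())]
    return sum(s[var] for s in kept) / len(kept)  # undefined if nothing survives

# e.g. P(B=1 | J=1); with rare evidence very few samples survive,
# which motivates the next slide:
print(rejection_query("B", {"J": 1}))
```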

20 RARE-EVENT PROBLEM
What if some events are really rare (e.g., burglary & earthquake)? The number of samples must then be huge to get a reasonable estimate.
Solution: likelihood weighting. Enforce that each sample agrees with the evidence, and while generating a sample keep track of the ratio
w = (how likely the sampled values are to occur in the real world) / (how likely you were to generate the sampled values)

21 LIKELIHOOD WEIGHTING
Suppose the evidence is Alarm=1 & MaryCalls=1. Sample B and E each with P=0.5 (rather than from their priors) and correct for this with a weight w. (Network and CPTs as on slide 17.)
Start: w = 1

22 LIKELIHOOD WEIGHTING
Sample B=0, E=1: w = P(B=0) P(E=1) / (0.5 · 0.5) = (0.999 · 0.002) / 0.25 ≈ 0.008

23 LIKELIHOOD WEIGHTING
A=1 is enforced (it is evidence), and the weight is updated to reflect the likelihood that this occurs:
w ≈ 0.008 · P(A=1|B=0,E=1) = 0.008 · 0.29 ≈ 0.0023

24 LIKELIHOOD WEIGHTING
M=1 is enforced: w ≈ 0.0023 · P(M=1|A=1) = 0.0023 · 0.70 ≈ 0.0016
J=1 is sampled from P(J|A=1) (J is not evidence, so it does not change w)
First sample: B=0 E=1 A=1 J=1 M=1, w ≈ 0.0016

25 LIKELIHOOD WEIGHTING
Second sample: B=0, E=0: w = (0.999 · 0.998) / 0.25 ≈ 3.988

26 LIKELIHOOD WEIGHTING
A=1 enforced: w ≈ 3.988 · P(A=1|B=0,E=0) = 3.988 · 0.001 ≈ 0.004

27 LIKELIHOOD WEIGHTING
M=1 enforced and J=1 sampled: w ≈ 0.004 · 0.70 = 0.0028
Second sample: B=0 E=0 A=1 J=1 M=1, w ≈ 0.0028

28 LIKELIHOOD WEIGHTING
Third sample: B=1, E=0, then A=1 enforced:
w = (0.001 · 0.998) / 0.25 · P(A=1|B=1,E=0) ≈ 0.00399 · 0.94 ≈ 0.00375

29 LIKELIHOOD WEIGHTING
M=1 enforced and J=1 sampled: w ≈ 0.00375 · 0.70 ≈ 0.0026
Third sample: B=1 E=0 A=1 J=1 M=1, w ≈ 0.0026

30 LIKELIHOOD WEIGHTING
Fourth sample: B=1, E=1, with A=1 and M=1 enforced and J=1 sampled:
w = (0.001 · 0.002) / 0.25 · 0.95 · 0.70 ≈ 5e-6, essentially 0

31 LIKELIHOOD WEIGHTING
The four weighted samples:
B=0 E=1 A=1 J=1 M=1, w ≈ 0.0016
B=0 E=0 A=1 J=1 M=1, w ≈ 0.0028
B=1 E=0 A=1 J=1 M=1, w ≈ 0.0026
B=1 E=1 A=1 J=1 M=1, w ≈ 5e-6 (≈ 0)
With N=4, the weighted estimate is P(B=1|A=1,M=1) ≈ 0.0026 / (0.0016 + 0.0028 + 0.0026) ≈ 0.371
Exact inference gives P(B=1|A=1,M=1) ≈ 0.374
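Here is a sketch of likelihood weighting for this query. One assumption to flag: this version samples the non-evidence variables from their own CPTs (the standard algorithm), whereas the slides sample B and E with P=0.5 and fold the correction into the weight; both converge to the same answer.

```python
import random

P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)

def weighted_sample(evidence):
    """One likelihood-weighted sample: evidence variables are clamped and
    multiply their likelihood into w; the rest are sampled from their CPTs."""
    s, w = {}, 1.0
    def visit(var, p1):
        nonlocal w
        if var in evidence:
            s[var] = evidence[var]
            w *= p1 if evidence[var] == 1 else 1.0 - p1
        else:
            s[var] = 1 if random.random() < p1 else 0
    visit("B", 0.001)
    visit("E", 0.002)
    visit("A", P_A[(s["B"], s["E"])])
    visit("J", 0.90 if s["A"] else 0.05)
    visit("M", 0.70 if s["A"] else 0.01)
    return s, w

def lw_query(var, evidence, n=200_000):
    """Weighted fraction of samples with var=1: sum of their w over total w."""
    num = den = 0.0
    for _ in range(n):
        s, w = weighted_sample(evidence)
        num += w * s[var]
        den += w
    return num / den

print(lw_query("B", {"A": 1, "M": 1}))   # approaches ~0.374
```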

32 ANOTHER RARE-EVENT PROBLEM
[Figure: a network over variables A_1,…,A_10 and B_1,…,B_10]
B = b is given as evidence, and each b_i is rare given all but one setting of A_i (say, A_i = 1)
The chance of sampling all 1's is very low => most likelihood weights will be far too low
Problem: the evidence is not being used to sample the A's effectively (i.e., near P(A_i | b))

33 GIBBS SAMPLING
Idea: reduce the computational burden of sampling from a multidimensional distribution P(x) = P(x_1,…,x_n) by doing repeated draws of individual variables:
Cycle through j = 1,…,n, sampling x_j ~ P(x_j | x_1,…,x_{j-1}, x_{j+1},…,x_n)
Over the long run, the random walk taken by x approaches the true distribution P(x)

34 GIBBS SAMPLING IN BNS
Each Gibbs sampling step: 1) pick a variable X_i, 2) sample x_i ~ P(X_i | X \ X_i)
Only the values in the "Markov blanket" of X_i matter:
- its parents Pa_Xi
- its children Y_1,…,Y_k
- the parents of its children (excluding X_i): Pa_Y1 \ X_i, …, Pa_Yk \ X_i
X_i is independent of the rest of the network given its Markov blanket, so sample
x_i ~ P(X_i | Pa_Xi, Y_1, Pa_Y1 \ X_i, …, Y_k, Pa_Yk \ X_i) = (1/Z) P(X_i | Pa_Xi) P(Y_1 | Pa_Y1) · … · P(Y_k | Pa_Yk)
i.e., the product of X_i's own factor and the factors of its children. (A sketch follows below.)
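A sketch of a Gibbs sampler for the burglary network with evidence A=1, M=1 clamped (the query from the likelihood-weighting slides). The update for each free variable uses exactly this Markov-blanket product; the burn-in length, sweep order, and function names are arbitrary choices of mine:

```python
import random

P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1 | B, E)
PRIOR = {"B": 0.001, "E": 0.002}

def gibbs_burglary(n_steps=100_000, burn_in=5_000):
    """Estimate P(B=1 | A=1, M=1). Evidence stays clamped; each sweep
    resamples every free variable from P(X_i | Markov blanket)."""
    s = {"B": 0, "E": 0, "A": 1, "J": 0, "M": 1}   # arbitrary initial state
    hits = total = 0
    for t in range(n_steps):
        for var in ("B", "E"):                     # MB(B) = {E, A}; MB(E) = {B, A}
            other = s["E"] if var == "B" else s["B"]
            def score(x):
                # X_i's own factor times its child A's factor, as on the slide
                b, e = (x, other) if var == "B" else (other, x)
                p_a1 = P_A[(b, e)]
                p_a = p_a1 if s["A"] == 1 else 1.0 - p_a1
                return (PRIOR[var] if x == 1 else 1.0 - PRIOR[var]) * p_a
            p1 = score(1) / (score(0) + score(1))  # normalization is the 1/Z
            s[var] = 1 if random.random() < p1 else 0
        # J's Markov blanket is just its parent A
        s["J"] = 1 if random.random() < (0.90 if s["A"] else 0.05) else 0
        if t >= burn_in:
            hits += s["B"]
            total += 1
    return hits / total

print(gibbs_burglary())   # approaches P(B=1 | A=1, M=1) ~ 0.374
```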

35 HANDLING EVIDENCE
Simply set each evidence variable to its observed value and never resample it
The resulting walk approximates the distribution P(X \ E | E = e)
This uses evidence more efficiently than likelihood weighting

36 GIBBS SAMPLING ISSUES
Demonstrating correctness & convergence requires examining the underlying Markov chain random walk (more later)
Many steps may be needed before the effects of a poor initialization wear off (the mixing time), and it is difficult to tell a priori how many are required
Numerous variants exist; these methods are known as Markov chain Monte Carlo (MCMC) techniques

37 NEXT TIME
Continuous and hybrid distributions

