
1 Bayesian Methods with Monte Carlo Markov Chains I
Henry Horng-Shing Lu
Institute of Statistics, National Chiao Tung University
hslu@stat.nctu.edu.tw
http://tigpbp.iis.sinica.edu.tw/courses.htm

2 Part 1: Introduction to Bayesian Methods

3 Bayes' Theorem
- Conditional probability: P(A|B) = P(A ∩ B) / P(B).
- One derivation: P(A|B) = P(B|A) P(A) / P(B).
- Alternative derivation: P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|Ac) P(Ac)], where Ac is the complement of A.
- http://en.wikipedia.org/wiki/Bayes'_theorem

4 False Positive and Negative
- Medical diagnosis: Type I and II errors in hypothesis testing (statistical inference).

  Diagnosis \ Actual Status    Disease (H1)                         Normal (H0)
  Positive (Reject H0)         True Positive (Power, 1-β)           False Positive (Type I Error, α)
  Negative (Accept H0)         False Negative (Type II Error, β)    True Negative (Confidence Level, 1-α)

- http://en.wikipedia.org/wiki/False_positive

5 Bayesian Inference (1)
- False positives in a medical test.
- Test accuracy by conditional probabilities:
  P(Test Positive | Disease) = P(R|H1) = 1-β = 0.99
  P(Test Negative | Normal) = P(A|H0) = 1-α = 0.95.
- Prior probabilities:
  P(Disease) = P(H1) = 0.001
  P(Normal) = P(H0) = 0.999.

6 Bayesian Inference (2)
- Posterior probabilities by Bayes' theorem:
  True Positive Probability = P(Disease | Test Positive) = P(H1|R)
  = P(R|H1) P(H1) / [P(R|H1) P(H1) + P(R|H0) P(H0)]
  = (0.99 × 0.001) / (0.99 × 0.001 + 0.05 × 0.999) ≈ 0.019.
  False Positive Probability = P(Normal | Test Positive) = P(H0|R) = 1 − 0.019 = 0.981.
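A minimal Python check of these numbers (the variable names are mine, not from the slides):

```python
# Medical-test example: P(Disease | Test Positive) via Bayes' theorem.
sens = 0.99        # P(R | H1) = 1 - beta
false_pos = 0.05   # P(R | H0) = alpha
p_h1 = 0.001       # prior P(Disease)
p_h0 = 0.999       # prior P(Normal)

p_r = sens * p_h1 + false_pos * p_h0          # P(Test Positive)
p_h1_given_r = sens * p_h1 / p_r
print(round(p_h1_given_r, 3))                 # 0.019
print(round(1 - p_h1_given_r, 3))             # 0.981
```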

7 Bayesian Inference (3)
- Equal prior probabilities: P(Disease) = P(H1) = P(Normal) = P(H0) = 0.5.
- Posterior probabilities by Bayes' theorem:
  True Positive Probability = P(Disease | Test Positive) = P(H1|R)
  = P(R|H1) / [P(R|H1) + P(R|H0)] = 0.99 / (0.99 + 0.05) ≈ 0.952;
  with equal priors the posterior is simply the normalized likelihood, so it is determined by P(R|H1) = 1-β and P(R|H0) = α alone!
- http://en.wikipedia.org/wiki/Bayesian_inference

8 Bayesian Inference (4)
- In the courtroom:
- P(Evidence of DNA Match | Guilty) = 1 and P(Evidence of DNA Match | Innocent) = 10^-6.
- Based on the evidence other than the DNA match, P(Guilty) = 0.3 and P(Innocent) = 0.7.
- By Bayes' theorem,
  P(Guilty | Evidence of DNA Match) = (1 × 0.3) / (1 × 0.3 + 10^-6 × 0.7) = 0.99999766667.
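The courtroom figure can be reproduced the same way (again a sketch with made-up variable names):

```python
p_match_given_guilty = 1.0
p_match_given_innocent = 1e-6
p_guilty, p_innocent = 0.3, 0.7

posterior = (p_match_given_guilty * p_guilty /
             (p_match_given_guilty * p_guilty + p_match_given_innocent * p_innocent))
print(posterior)   # 0.99999766667...
```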

9 Naive Bayes Classifier
- The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions.
- http://en.wikipedia.org/wiki/Naive_Bayes_classifier

10 Naive Bayes Probabilistic Model (1)
- The probability model for a classifier is a conditional model P(C|F1, …, Fn), where C is a dependent class variable and F1, …, Fn are several feature variables.
- By Bayes' theorem,
  P(C|F1, …, Fn) = P(C) P(F1, …, Fn|C) / P(F1, …, Fn).

11 Naive Bayes Probabilistic Model (2)
- Use repeated applications of the definition of conditional probability:
  P(C, F1, …, Fn) = P(C) P(F1, …, Fn|C)
  = P(C) P(F1|C) P(F2, …, Fn|C, F1)
  = P(C) P(F1|C) P(F2|C, F1) P(F3, …, Fn|C, F1, F2)
  and so forth.
- Assume that each Fi is conditionally independent of every other Fj for i ≠ j given C; this means that P(Fi|C, Fj) = P(Fi|C).
- So P(C, F1, …, Fn) can be expressed as
  P(C, F1, …, Fn) = P(C) P(F1|C) P(F2|C) ⋯ P(Fn|C) = P(C) ∏i P(Fi|C).

12 Naive Bayes Probabilistic Model (3)
- So P(C|F1, …, Fn) can be expressed as
  P(C|F1, …, Fn) = (1/Z) P(C) ∏i P(Fi|C),
  where Z = P(F1, …, Fn) is constant if the values of the feature variables are known.
- Constructing a classifier from the probability model:
  classify(f1, …, fn) = argmax_c P(C = c) ∏i P(Fi = fi | C = c).
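A minimal sketch of such a classifier for discrete features, assuming fully observed training pairs; the helper names and the toy data are illustrative, and smoothing for unseen feature values is deliberately left out:

```python
from collections import defaultdict

def train(samples):
    """samples: list of (class_label, {feature: value}) pairs."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(int)              # (class, feature, value) -> count
    for label, feats in samples:
        class_counts[label] += 1
        for f, v in feats.items():
            feat_counts[(label, f, v)] += 1
    return class_counts, feat_counts

def classify(class_counts, feat_counts, feats):
    """argmax_c P(C=c) * prod_i P(F_i = f_i | C = c), estimated by counting."""
    n = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / n                       # P(C = c)
        for f, v in feats.items():
            score *= feat_counts[(label, f, v)] / count   # P(F_i = f_i | C = c)
        scores[label] = score
    return max(scores, key=scores.get)

data = [("spam", {"has_viagra": True}), ("spam", {"has_viagra": True}),
        ("ham", {"has_viagra": False}), ("ham", {"has_viagra": False})]
print(classify(*train(data), {"has_viagra": True}))   # spam
```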

13 Bayesian Spam Filtering (1)
- Bayesian spam filtering, a form of e-mail filtering, is the process of using a Naive Bayes classifier to identify spam e-mail.
- References:
  http://en.wikipedia.org/wiki/Spam_%28e-mail%29
  http://en.wikipedia.org/wiki/Bayesian_spam_filtering
  http://www.gfi.com/whitepapers/why-bayesian-filtering.pdf

14 Bayesian Spam Filtering (2)
- Probabilistic model:
  P(spam | words) = P(words | spam) P(spam) / P(words),
  where {words} means {certain words appearing in the e-mail}.
- Particular words have particular probabilities of occurring in spam e-mails and in legitimate e-mails. For instance, most e-mail users will frequently encounter the word "Viagra" in spam e-mails, but will seldom see it in other e-mails.

15 Bayesian Spam Filtering (3)
- Before mail can be filtered with this method, the user needs to generate a database of words and tokens (such as the $ sign, IP addresses, domains, and so on) collected from a sample of spam mails and valid mails.
- After the database has been generated, each word in an incoming e-mail contributes to that e-mail's spam probability. This contribution is called the posterior probability and is computed using Bayes' theorem.
- Then the e-mail's spam probability is computed over all words in the e-mail, and if the total exceeds a certain threshold (say 95%), the filter marks the e-mail as spam.
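A sketch of this scoring step, assuming the per-word probabilities P(word | spam) and P(word | legitimate) have already been estimated from the user's database; all names and numbers below are illustrative:

```python
def spam_score(words, p_word_spam, p_word_ham, p_spam=0.5):
    """Naive Bayes combination of per-word evidence, then Bayes' theorem."""
    like_spam, like_ham = p_spam, 1.0 - p_spam
    for w in words:
        like_spam *= p_word_spam.get(w, 0.5)    # unseen words treated as neutral
        like_ham *= p_word_ham.get(w, 0.5)
    return like_spam / (like_spam + like_ham)

p_word_spam = {"viagra": 0.90, "meeting": 0.10}   # made-up estimates
p_word_ham = {"viagra": 0.01, "meeting": 0.60}

score = spam_score(["viagra", "viagra"], p_word_spam, p_word_ham)
print(score, score > 0.95)   # mark as spam only if the score exceeds the threshold
```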

16 Bayesian Network (1)
- A Bayesian network is a compact representation of a probability distribution via conditional independence.
- For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms.
- http://en.wikipedia.org/wiki/Bayesian_network
  http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
  http://www.cs.huji.ac.il/~nirf/Nips01-Tutorial/index.html

17 Bayesian Network (2)
- Conditional independencies and a graphical language capture the structure of many real-world distributions.
- The graph structure provides much insight into the domain and allows "knowledge discovery": Data + Prior Information → Learner.
- Figure: the "sprinkler" network (Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → Wet Grass) with the conditional probability table

  S  R   P(W=T | S,R)   P(W=F | S,R)
  F  F   0.0            1.0
  T  F   0.9            0.1
  F  T   0.9            0.1
  T  T   0.99           0.01

18 Bayesian Network (3)
- Qualitative part: a directed acyclic graph (DAG)
  Nodes - random variables
  Edges - direct influence
- Quantitative part: a set of conditional probability distributions (e.g., the table for P(W | S,R) on the previous slide).
- Together: they define a unique distribution in a factored form.

19 Inference
- Posterior probabilities: probability of any event given any evidence.
- Most likely explanation: scenario that explains the evidence.
- Rational decision making: maximize expected utility; value of information.
- Effect of intervention.
- Figure: the burglary network (Burglary → Alarm, Earthquake → Alarm, Earthquake → Radio, Alarm → Call).

20 Example 1 (1)
- Figure: the sprinkler network with nodes Cloudy, Sprinkler, Rain, and Wet Grass (Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → Wet Grass).

21 Example 1 (2)
- By the chain rule of probability, the joint probability of all the nodes in the graph above is
  P(C, S, R, W) = P(C) * P(S|C) * P(R|C, S) * P(W|C, S, R).
- By using conditional independence relationships, we can rewrite this as
  P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S, R),
  where we were allowed to simplify the third term because R is independent of S given its parent C, and the last term because W is independent of C given its parents S and R.

22 Example 1 (3)
- Bayes' theorem gives the posterior probability of each explanation of the wet grass:
  P(S=1 | W=1) = P(S=1, W=1) / P(W=1) = 0.2781 / 0.6471 ≈ 0.430
  P(R=1 | W=1) = P(R=1, W=1) / P(W=1) = 0.4581 / 0.6471 ≈ 0.708,
  where P(W=1) is a normalizing constant, equal to the probability (likelihood) of the data, and each numerator is obtained by summing the joint distribution over the remaining variables.

23 Example 1 (4)
- So we see that it is more likely that the grass is wet because it is raining: the likelihood ratio is 0.708/0.430 = 1.647.
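A short sketch that reproduces these numbers by brute-force enumeration over the joint. The table for P(W|S,R) is the one shown on slide 17; the remaining CPTs are the ones used in Murphy's tutorial (cited on slide 16) and should be read as assumed values here:

```python
from itertools import product

P_C = {True: 0.5, False: 0.5}                      # P(Cloudy)
P_S = {True: 0.1, False: 0.5}                      # P(Sprinkler=T | Cloudy)
P_R = {True: 0.8, False: 0.2}                      # P(Rain=T | Cloudy)
P_W = {(False, False): 0.0, (True, False): 0.9,
       (False, True): 0.9, (True, True): 0.99}     # P(WetGrass=T | S, R)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(c, s, r, w):
    # P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S,R)
    return P_C[c] * bern(P_S[c], s) * bern(P_R[c], r) * bern(P_W[(s, r)], w)

# Condition on W = True and marginalize out the other variables.
p_w = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
p_s_w = sum(joint(c, True, r, True) for c, r in product([True, False], repeat=2)) / p_w
p_r_w = sum(joint(c, s, True, True) for c, s in product([True, False], repeat=2)) / p_w
print(round(p_s_w, 3), round(p_r_w, 3), round(p_r_w / p_s_w, 3))   # 0.43 0.708 1.647
```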

24 Part 2: MLE vs. Bayesian Methods

25 Maximum Likelihood Estimates (MLEs) vs. Bayesian Methods
- Binomial experiments: http://www.math.tau.ac.il/~nin/Courses/ML04/ml2.ppt
- More explanations and examples: http://www.dina.dk/phd/s/s6/learning2.pdf

26 MLE (1)
- Binomial experiments: suppose we toss a coin N times and record the random variables X1, …, XN, where Xi = 1 if the i-th toss lands heads and Xi = 0 otherwise.
- We denote by θ the (unknown) probability P(Head).
- Estimation task: given a sequence of toss samples x1, x2, …, xN, we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ.

27 MLE (2)
- The number of heads we see, k, has a binomial distribution, and thus
  P(k | θ) = C(N, k) θ^k (1-θ)^(N-k).
- Clearly, the MLE of θ is k/N, which is also equal to the MME (method of moments estimate) of θ.

28 MLE (3)
- Suppose we observe the sequence H, H.
- The MLE estimate is P(H) = 1, P(T) = 0.
- Should we really believe that tails are impossible at this stage?
- Such an estimate can have a disastrous effect. If we assume that P(T) = 0, then we are willing to act as though this outcome is impossible.

29 Bayesian Reasoning
- In Bayesian reasoning we represent our uncertainty about the unknown parameter θ by a probability distribution.
- This probability distribution can be viewed as a subjective probability: a personal judgment of uncertainty.

30 Bayesian Inference
- P(θ): prior distribution over the values of θ.
- P(x1, …, xN | θ): likelihood of the binomial experiment given a known value θ.
- Given x1, …, xN, we can compute the posterior distribution of θ:
  P(θ | x1, …, xN) = P(x1, …, xN | θ) P(θ) / P(x1, …, xN).
- The marginal likelihood is
  P(x1, …, xN) = ∫ P(x1, …, xN | θ) P(θ) dθ.
- http://www.dina.dk/phd/s/s6/learning2.pdf

31 Binomial Example (1)
- In the binomial experiment, the unknown parameter is θ = P(H).
- Simplest prior: P(θ) = 1 for 0 < θ < 1 (uniform prior).
- Likelihood: P(x1, …, xN | θ) = θ^k (1-θ)^(N-k), where k is the number of heads in the sequence.
- Marginal likelihood: P(x1, …, xN) = ∫_0^1 θ^k (1-θ)^(N-k) dθ.

32 Binomial Example (2)
- Using integration by parts, we have
  ∫_0^1 θ^k (1-θ)^(N-k) dθ = [k / (N-k+1)] ∫_0^1 θ^(k-1) (1-θ)^(N-k+1) dθ.
- Multiplying both sides by C(N, k), we have
  C(N, k) ∫_0^1 θ^k (1-θ)^(N-k) dθ = C(N, k-1) ∫_0^1 θ^(k-1) (1-θ)^(N-k+1) dθ.

33 Binomial Example (3)
- The recursion terminates when k = N:
  C(N, N) ∫_0^1 θ^N dθ = 1 / (N+1).
- Thus, for every k,
  C(N, k) ∫_0^1 θ^k (1-θ)^(N-k) dθ = 1 / (N+1), i.e., P(x1, …, xN) = 1 / [(N+1) C(N, k)].
- We conclude that the posterior is
  P(θ | x1, …, xN) = (N+1) C(N, k) θ^k (1-θ)^(N-k), i.e., a Beta(k+1, N-k+1) distribution.

34 Binomial Example (4)
- How do we predict (estimate θ) using the posterior?
- We can think of this as computing the probability of the next element in the sequence:
  P(X_(N+1) = H | x1, …, xN) = ∫ P(X_(N+1) = H | θ) P(θ | x1, …, xN) dθ.
- Assumption: if we know θ, the probability of X_(N+1) is independent of x1, …, xN.

35 Binomial Example (5)
- Thus, we conclude that
  P(X_(N+1) = H | x1, …, xN) = ∫_0^1 θ (N+1) C(N, k) θ^k (1-θ)^(N-k) dθ = (k+1) / (N+2),
  which is Laplace's rule of succession.
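A quick numerical check of this prediction rule using a crude Riemann sum, so no external libraries are needed (N, k, and the grid size are arbitrary choices):

```python
N, k = 2, 2                                   # e.g. the sequence H, H from slide 28
M = 100_000                                   # grid points for the integral over theta
grid = [(i + 0.5) / M for i in range(M)]
post = [t ** k * (1 - t) ** (N - k) for t in grid]    # unnormalized posterior
pred = sum(t * p for t, p in zip(grid, post)) / sum(post)
print(pred, (k + 1) / (N + 2))                # both ~0.75, unlike the MLE of 1.0
```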

36 Beta Prior (1)
- The uniform prior distribution is a particular case of the Beta distribution. Its general form is
  P(θ) = [Γ(a+b) / (Γ(a) Γ(b))] θ^(a-1) (1-θ)^(b-1),
  where s = a + b, and we write it as Beta(a, b).
- The expected value of the parameter is E[θ] = a / (a + b) = a / s.
- The uniform distribution is Beta(1, 1).

37 Beta Prior (2)
- There are important theoretical reasons for using the Beta prior distribution.
- One of them also has important practical consequences: it is the conjugate distribution for binomial sampling.
- If the prior is Beta(a, b) and we have observed some data with N1 and N0 cases for the two possible values of the variable, then the posterior is also Beta, with parameters (a + N1, b + N0).
- The expected value under the posterior distribution is (a + N1) / (a + b + N1 + N0).
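A minimal sketch of the conjugate update; the prior parameters a and b are placeholders to be chosen from prior knowledge:

```python
def beta_update(a, b, n1, n0):
    """Beta(a, b) prior + data with N1 successes and N0 failures -> Beta(a+N1, b+N0)."""
    return a + n1, b + n0

def posterior_mean(a, b):
    return a / (a + b)

# A uniform Beta(1, 1) prior and the sequence H, H from slide 28:
a_post, b_post = beta_update(1, 1, n1=2, n0=0)
print(posterior_mean(a_post, b_post))   # 0.75 = (k+1)/(N+2), Laplace's rule again
```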

38 Beta Prior (3)
- The values a/s and b/s represent the prior probabilities for the values of the variable, based on our past experience.
- The value s = a + b is called the equivalent sample size; it measures the importance of our past experience.
- Larger values of s give the prior probabilities more weight.

39 Beta Prior (4)
- When s → 0, the posterior expectation (a + N1) / (s + N1 + N0) tends to N1 / (N1 + N0), and we recover the maximum likelihood estimate.

40 Multinomial Experiments
- Now assume that we have a variable X taking values in a finite set {a1, …, an}, and we have a series of independent observations from this distribution, (x1, x2, …, xm); we want to estimate the values θi = P(ai), i = 1, …, n.
- Let Ni be the number of cases in the sample in which we obtained the value ai (i = 1, …, n).
- The MLE of θi is Ni / m.
- The problems with small samples are completely analogous to the binomial case.

41 Dirichlet Prior (1)
- We can also follow the Bayesian approach, but now the prior distribution is the Dirichlet distribution, a generalization of the Beta distribution to more than 2 cases: (θ1, …, θn).
- The density of D(α1, …, αn) is
  P(θ1, …, θn) = [Γ(s) / (Γ(α1) ⋯ Γ(αn))] θ1^(α1-1) ⋯ θn^(αn-1),
  where s = α1 + ⋯ + αn is the equivalent sample size.

42 Dirichlet Prior (2)
- The expected vector is E[θi] = αi / s, i = 1, …, n.
- A greater value of s makes this distribution more concentrated around the mean vector.

43 Dirichlet Posterior
- If we have a set of data with counts (N1, …, Nn), then the posterior distribution is also Dirichlet, with parameters (α1 + N1, …, αn + Nn).
- The Bayesian estimates of the probabilities are
  θ̂i = (αi + Ni) / (s + N), where N = N1 + ⋯ + Nn.

44 Multinomial Example
- Imagine that we have an urn with balls of different colors, red (R), blue (B), and green (G), but in unknown quantities.
- Assume that we pick up balls with replacement, obtaining the following sequence: (B, B, R, R, B).
- If we assume a Dirichlet prior distribution with parameters D(1, 1, 1), then the estimated frequencies for red, blue, and green are (3/8, 4/8, 1/8).
- Observe that green has a positive probability, even though it never appears in the sequence.
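The urn example as a few lines of Python (just counting, no special libraries):

```python
from collections import Counter

prior = {"R": 1, "B": 1, "G": 1}       # Dirichlet(1, 1, 1)
draws = ["B", "B", "R", "R", "B"]

counts = Counter(draws)
posterior = {c: prior[c] + counts[c] for c in prior}
s = sum(posterior.values())
print({c: posterior[c] / s for c in prior})
# {'R': 0.375, 'B': 0.5, 'G': 0.125}  ->  (3/8, 4/8, 1/8); green stays positive
```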

45 Part 3: An Example in Genetics

46 Example 1 in Genetics (1)
- Two linked loci with alleles A and a, and B and b:
  A, B: dominant; a, b: recessive.
- A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB, ab.
- In the male (M), the parental gametes AB and ab each occur with frequency (1-r)/2 and the recombinant gametes aB and Ab each occur with frequency r/2, where r is the male recombination fraction; in the female (F), the corresponding frequencies are (1-r')/2 and r'/2, where r' is the female recombination fraction.

47 Example 1 in Genetics (2)
- r and r' are the recombination rates for the male and female, respectively.
- Suppose the parental origin of this heterozygote is from the mating AABB × aabb. The problem is to estimate r and r' from the offspring of selfed heterozygotes.
- Fisher, R. A. and Balmukand, B. (1928). The estimation of linkage from the offspring of selfed heterozygotes. Journal of Genetics, 20, 79-92.
- http://en.wikipedia.org/wiki/Genetics
  http://www2.isye.gatech.edu/~brani/isyebayes/bank/handout12.pdf

48 Example 1 in Genetics (3)
- Offspring genotypes and probabilities (male gametes in columns, female gametes in rows):

  Female \ Male     AB (1-r)/2           ab (1-r)/2           aB r/2           Ab r/2
  AB (1-r')/2       AABB (1-r)(1-r')/4   aABb (1-r)(1-r')/4   aABB r(1-r')/4   AABb r(1-r')/4
  ab (1-r')/2       AaBb (1-r)(1-r')/4   aabb (1-r)(1-r')/4   aaBb r(1-r')/4   Aabb r(1-r')/4
  aB r'/2           AaBB (1-r)r'/4       aabB (1-r)r'/4       aaBB r r'/4      AabB r r'/4
  Ab r'/2           AABb (1-r)r'/4       aAbb (1-r)r'/4       aABb r r'/4      AAbb r r'/4

49 Example 1 in Genetics (4)
- Four distinct phenotypes: A*B*, A*b*, a*B*, and a*b*.
- A*: the dominant phenotype from (Aa, AA, aA); a*: the recessive phenotype from aa.
- B*: the dominant phenotype from (Bb, BB, bB); b*: the recessive phenotype from bb.
- A*B*: 9 gametic combinations; A*b*: 3 gametic combinations; a*B*: 3 gametic combinations; a*b*: 1 gametic combination.
- Total: 16 combinations.

50 Example 1 in Genetics (5)
- Collecting the 16 cells of the table by phenotype and letting ψ = (1-r)(1-r'), we obtain
  P(a*b*) = P(aabb) = ψ/4,
  P(A*b*) = P(a*B*) = 1/4 - ψ/4 = (1-ψ)/4,
  P(A*B*) = 1 - 2(1-ψ)/4 - ψ/4 = (2+ψ)/4.

51 Example 1 in Genetics (6)
- Hence, a random sample of n offspring of selfed heterozygotes, with phenotype counts y = (y1, y2, y3, y4) for (A*B*, A*b*, a*B*, a*b*), will follow a multinomial distribution:
  P(y | r, r') = [n! / (y1! y2! y3! y4!)] [(2+ψ)/4]^y1 [(1-ψ)/4]^y2 [(1-ψ)/4]^y3 [ψ/4]^y4, with ψ = (1-r)(1-r').

52 Bayesian for Example 1 in Genetics (1)
- To simplify computation, we let θ1, θ2, θ3, θ4 denote the probabilities of the four phenotypes A*B*, A*b*, a*B*, a*b* and treat them as free parameters (θi ≥ 0, θ1 + θ2 + θ3 + θ4 = 1).
- The random sample of n from the offspring of selfed heterozygotes then follows a multinomial distribution:
  P(y | θ) = [n! / (y1! y2! y3! y4!)] θ1^y1 θ2^y2 θ3^y3 θ4^y4.

53 Bayesian for Example 1 in Genetics (2)
- We assume a Dirichlet prior distribution with parameters D(α1, α2, α3, α4) to estimate the probabilities of A*B*, A*b*, a*B*, and a*b*.
- Recall that A*B*: 9 gametic combinations; A*b*: 3; a*B*: 3; a*b*: 1.
- We therefore consider the prior D(9, 3, 3, 1).

54 Bayesian for Example 1 in Genetics (3)
- Suppose that we observe the data y = (y1, y2, y3, y4) = (125, 18, 20, 24).
- So the posterior distribution is also Dirichlet, with parameters D(134, 21, 23, 25).
- The Bayesian estimates of the probabilities are
  (134, 21, 23, 25) / 203 = (0.660, 0.103, 0.113, 0.123).
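The same conjugate update for the genetics data, with the D(9, 3, 3, 1) prior from the previous slide:

```python
prior = [9, 3, 3, 1]                    # for A*B*, A*b*, a*B*, a*b*
y = [125, 18, 20, 24]

posterior = [a + n for a, n in zip(prior, y)]    # Dirichlet(134, 21, 23, 25)
total = sum(posterior)                           # 203
print([round(a / total, 3) for a in posterior])  # [0.66, 0.103, 0.113, 0.123]
```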

55 Bayesian for Example 1 in Genetics (4)
- Consider the original model, in which the phenotype probabilities are ((2+ψ)/4, (1-ψ)/4, (1-ψ)/4, ψ/4) with ψ = (1-r)(1-r').
- The random sample of n then also follows a multinomial distribution:
  P(y | ψ) = [n! / (y1! y2! y3! y4!)] [(2+ψ)/4]^y1 [(1-ψ)/4]^(y2+y3) [ψ/4]^y4.
- We will assume a Beta prior distribution for ψ: ψ ~ Beta(a, b).

56 Bayesian for Example 1 in Genetics (5)
- The posterior distribution becomes
  P(ψ | y) = (2+ψ)^y1 (1-ψ)^(y2+y3) ψ^y4 ψ^(a-1) (1-ψ)^(b-1) / ∫_0^1 (2+t)^y1 (1-t)^(y2+y3) t^y4 t^(a-1) (1-t)^(b-1) dt.
- The integration in the denominator does not have a closed form.
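Under the parametrization sketched on the last two slides (a Beta(a, b) prior on ψ, which is my reading of the original slides and should be treated as an assumption), the normalizing integral can still be approximated numerically; this brute-force grid approximation is exactly the kind of work MCMC will let us avoid:

```python
y = [125, 18, 20, 24]
a, b = 1, 1                                     # placeholder Beta prior parameters

def unnorm(psi):
    # likelihood (up to constants) times the Beta(a, b) prior density
    return ((2 + psi) ** y[0] * (1 - psi) ** (y[1] + y[2]) * psi ** y[3]
            * psi ** (a - 1) * (1 - psi) ** (b - 1))

M = 10_000
grid = [(i + 0.5) / M for i in range(M)]
Z = sum(unnorm(p) for p in grid) / M            # approximate normalizing constant
post_mean = sum(p * unnorm(p) for p in grid) / (M * Z)
print(post_mean)                                # posterior mean of psi under these assumptions
```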

57 Bayesian for Example 1 in Genetics (6)
- How do we solve this problem? The Monte Carlo Markov Chain (MCMC) method!
- What values are appropriate for the parameters of the Beta prior?

58 Part 4: Monte Carlo Methods

59 Monte Carlo Methods (1)
- Consider the game of solitaire: what's the chance of winning with a properly shuffled deck?
- http://en.wikipedia.org/wiki/Monte_Carlo_method
  http://nlp.stanford.edu/local/talks/mcmc_2004_07_01.ppt
- Figure: four simulated hands (one unresolved, then Lose, Win, Lose) with the caption "Chance of winning is 1 in 4!"

60 Monte Carlo Methods (2)
- Hard to compute analytically because winning or losing depends on a complex procedure of reorganizing cards.
- Insight: why not just play a few hands, and see empirically how many do in fact win?
- More generally, we can approximate a probability density function using only samples from that density.

61 Monte Carlo Methods (3)
- Given a very large set X and a distribution f(x) over it, we draw a set of N i.i.d. random samples x^(1), …, x^(N) from f.
- We can then approximate the distribution using these samples:
  f(x) ≈ (1/N) Σ_(i=1)^N 1(x^(i) = x).

62 Monte Carlo Methods (4)
- We can also use these samples to compute expectations:
  E[g(X)] ≈ (1/N) Σ_(i=1)^N g(x^(i)).
- And even use them to find a maximum:
  x̂ = argmax_(x ∈ {x^(1), …, x^(N)}) f(x).

63 Monte Carlo Example
- Given an integral (or expectation) to evaluate, the solution is to use the Monte Carlo method to approximate it by an average over random samples.
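The specific integral on the original slide did not survive the conversion, so here is an illustrative Monte Carlo approximation of ∫_0^1 exp(-x^2) dx, i.e. E[g(X)] with g(x) = exp(-x^2) and X ~ Uniform(0, 1):

```python
import math
import random

random.seed(0)
N = 100_000
total = sum(math.exp(-random.random() ** 2) for _ in range(N))
print(total / N)    # ~0.7468, close to the exact value (sqrt(pi)/2) * erf(1)
```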

64 Exercises
- Write your own programs similar to those examples presented in this talk.
- Write programs for the examples mentioned at the reference web pages.
- Write programs for other examples that you know.

