
1 Bayesian Learning

2 Probability; Bayes Rule; Choosing Hypotheses: Maximum a Posteriori, Maximum Likelihood; Bayes Concept Learning; Maximum Likelihood of a Real-Valued Function; Bayes Optimal Classifier; Joint Distributions; Naive Bayes Classifier

3 Variables. Boolean random variables: cavity might be true or false. Discrete random variables: weather might be sunny, rainy, cloudy, or snow – P(Weather=sunny), P(Weather=rainy), P(Weather=cloudy), P(Weather=snow). Continuous random variables: the temperature has continuous values.

4 Basic Concepts. Sample space: the set of possible outcomes of an experiment. Event: a subset of the sample space. If S is a finite sample space of equally likely outcomes and E is an event (a subset of S), then the probability of E is P(E) = |E| / |S|.

5 Probability. Prior probability (before the evidence is obtained): P(a) is the prior probability that proposition a is true, e.g. P(cavity)=0.1. Posterior probability (after the evidence is obtained): P(a|b) is the probability of a given that all we know is b, e.g. P(cavity|toothache)=0.8.

6 Examples (figure only: worked examples with probabilities 4/9 and 5/9).

7 Axioms of Probability. All probabilities are between 0 and 1: for any proposition a, 0 ≤ P(a) ≤ 1. The probability of a disjunction is given by P(a ∨ b) = P(a) + P(b) − P(a ∧ b).

8 Axioms of Probability. Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a).

9 Theorem of total probability. If events A_1,...,A_n are mutually exclusive with Σ_{i=1..n} P(A_i) = 1, then P(B) = Σ_{i=1..n} P(B|A_i) P(A_i).

10 Bayes' rule: P(a|b) = P(b|a) P(a) / P(b).

11 Example. The first box contains 7 red balls and 2 blue balls; the second box contains 3 red balls and 4 blue balls. 1. Choose one of the two boxes at random. 2. Select one of the balls in this box at random. If a red ball is selected, what is the probability that this ball is from the first box?

12 Solution. E: a red ball is selected; Ē: a blue ball is selected. F: the ball comes from the first box; F̄: the ball comes from the second box. We want p(F|E) = p(F∩E)/p(E). Since p(E|F) = 7/9 and p(F) = p(F̄) = 1/2, we get p(E∩F) = p(E|F) p(F) = 7/18. Since p(E|F̄) = 3/7, we get p(E∩F̄) = 3/14. Because E = (E∩F) ∪ (E∩F̄), p(E) = p(E∩F) + p(E∩F̄) = 7/18 + 3/14 = 38/63. Therefore p(F|E) = (7/18)/(38/63) = 49/76 ≈ 0.645.
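
As a quick check, the same numbers can be reproduced with a few lines of Python; this is a minimal sketch (not part of the slides), assuming the box contents stated above.

    # Verify the box example numerically with exact fractions.
    from fractions import Fraction

    p_F = Fraction(1, 2)                 # P(first box chosen)
    p_E_given_F = Fraction(7, 9)         # P(red | first box)
    p_E_given_notF = Fraction(3, 7)      # P(red | second box)

    # Total probability of drawing a red ball
    p_E = p_E_given_F * p_F + p_E_given_notF * (1 - p_F)

    # Bayes' rule: P(first box | red)
    p_F_given_E = p_E_given_F * p_F / p_E

    print(p_E)          # 38/63
    print(p_F_given_E)  # 49/76, about 0.645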

13 Bayes Theorem. Suppose that E and F are events from a sample space S such that p(E)≠0 and p(F)≠0. Then p(F|E) = p(E|F) p(F) / [ p(E|F) p(F) + p(E|F̄) p(F̄) ]. Note: F and F̄ are mutually exclusive and cover the entire sample space S, that is, F ∪ F̄ = S.

14 Example. Suppose one person in 100,000 has a particular rare disease for which there is a fairly accurate diagnostic test. This test is correct 99% of the time when given to someone with the disease; it is correct 99.5% of the time when given to someone who does not have the disease. a) What is p(someone who tests positive has the disease)? b) What is p(someone who tests negative does not have the disease)? Only about 0.2% of people who test positive actually have the disease, while 99.99999% of people who test negative really do not have the disease.
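
A minimal sketch (not part of the slides) that plugs these numbers into Bayes' rule; the variable names are ours.

    prior = 1 / 100_000          # P(disease)
    sens  = 0.99                 # P(positive | disease)
    spec  = 0.995                # P(negative | no disease)

    # Part a: P(disease | positive) via total probability of a positive test
    p_pos = sens * prior + (1 - spec) * (1 - prior)
    p_disease_given_pos = sens * prior / p_pos           # about 0.002, i.e. 0.2%

    # Part b: P(no disease | negative)
    p_neg = (1 - sens) * prior + spec * (1 - prior)
    p_healthy_given_neg = spec * (1 - prior) / p_neg     # about 0.9999999

    print(p_disease_given_pos, p_healthy_given_neg)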

15 Bayes Theorem. P(h|D) = P(D|h) P(h) / P(D), where P(h) = prior probability of hypothesis h, P(D) = prior probability of training data D, P(h|D) = probability of h given D, and P(D|h) = probability of D given h.

16 Choosing Hypotheses. Generally we want the most probable hypothesis given the training data, the maximum a posteriori hypothesis h_MAP: h_MAP = argmax_{h∈H} P(h|D).

17 Choosing Hypotheses. h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h), since P(D) is constant with respect to h.

18 Maximum Likelihood (ML). If we assume P(h_i) = P(h_j) for all h_i and h_j, we can simplify further and choose the maximum likelihood (ML) hypothesis h_ML = argmax_{h∈H} P(D|h).

19 Example. Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result (+) in only 98% of the cases in which the disease is actually present, and a correct negative result (−) in only 97% of the cases in which the disease is not present. Furthermore, only 0.008 of the entire population has this cancer.

20 Example. P(+|cancer) P(cancer) = 0.98 × 0.008 = 0.0078 and P(+|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298, so h_MAP = ¬cancer.
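
The same MAP comparison can be written out directly; a minimal sketch, using the 0.98 / 0.97 / 0.008 figures from slide 19.

    p_cancer = 0.008
    p_pos_given_cancer = 0.98        # test sensitivity
    p_pos_given_no_cancer = 0.03     # 1 - specificity (0.97 correct negatives)

    score_cancer    = p_pos_given_cancer * p_cancer              # 0.0078
    score_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)     # 0.0298

    h_map = "cancer" if score_cancer > score_no_cancer else "no cancer"
    print(h_map)                                              # 'no cancer'
    print(score_cancer / (score_cancer + score_no_cancer))    # normalized posterior, about 0.21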

21 Normalization. The exact posterior probabilities are obtained by normalizing the quantities above so that they sum to 1: P(cancer|+) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21. The result of Bayesian inference depends strongly on the prior probabilities, which must be available in order to apply the method.

22 Brute-Force Bayes Concept Learning. For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h) P(h) / P(D). Output the hypothesis h_MAP with the highest posterior probability, h_MAP = argmax_{h∈H} P(h|D).

23 Brute-Force Bayes Concept Learning Given no prior knowledge that one hypothesis is more likely than another, what values should we specify for P(h)? What choice shall we make for P(D|h) ? The algorithm may require significant computation, because it applies Bayes theorem to each hypothesis in H to calculate P(h|D)

24 Assumptions of Special MAP. Assumptions: – The training data D is noise free (i.e., d_i = c(x_i)) – The target concept c is contained in the hypothesis space H – No hypothesis is a priori more probable than any other. Outcome: – Choose P(h) to be the uniform distribution, P(h) = 1/|H| for all h in H – P(D|h) = 1 if h is consistent with D – P(D|h) = 0 otherwise.

25 Assumptions of Special MAP. The version space VS_{H,D} is the subset of hypotheses from H that are consistent with the training examples in D.

26 Assumptions of Special MAP. P(h|D) = 1 / |VS_{H,D}| if h is consistent with D, and P(h|D) = 0 if h is inconsistent with D.
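
To make the brute-force learner concrete, here is a minimal sketch under the slide's assumptions (noise-free data, target in H, uniform prior); the toy hypothesis space of integer thresholds is purely illustrative and not from the slides.

    # Brute-force MAP learning: posterior is 1/|VS| for consistent hypotheses, 0 otherwise.
    def brute_force_map(H, D):
        """H: list of hypotheses (callables x -> bool); D: list of (x, label) pairs."""
        consistent = [h for h in H if all(h(x) == label for x, label in D)]
        vs_size = len(consistent)
        posterior = {h: (1.0 / vs_size if h in consistent else 0.0) for h in H}
        return posterior, consistent

    # Hypothetical toy hypothesis space: threshold concepts on integers, h_t(x) = (x >= t)
    H = [lambda x, t=t: x >= t for t in range(5)]
    D = [(1, False), (3, True), (4, True)]
    posterior, version_space = brute_force_map(H, D)
    print(len(version_space))        # size of the version space |VS_{H,D}|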

27 Maximum Likelihood. The maximum likelihood (ML) hypothesis: h_ML = argmax_{h∈H} P(D|h).

28 Maximum Likelihood Suppose one wishes to determine just how biased an unfair coin is. Call the probability of tossing a HEAD p. The goal then becomes to determine p. Suppose the coin is tossed 80 times: i.e., the sample might be something like x 1 = H, x 2 = T,..., x 80 = T, and the count of the number of HEADS "H" is observed. http://en.wikipedia.org/wiki/Maximum_likelihood

29 Maximum Likelihood. The probability of tossing TAILS is 1 − p (here p plays the role of the unknown parameter θ). Suppose the outcome is 49 HEADS and 31 TAILS, and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p = 1/3, one which gives HEADS with probability p = 1/2, and another which gives HEADS with probability p = 2/3.

30 Maximum Likelihood. The likelihood of observing 49 HEADS in 80 tosses is P(H = 49 | p) = C(80, 49) p^49 (1 − p)^31. Among the three candidate coins this likelihood is maximized when p = 2/3, and so this is the maximum likelihood estimate for p.
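
A minimal sketch that evaluates the binomial likelihood for the three candidate coins; the helper name and output format are ours.

    from math import comb

    def likelihood(p, heads=49, tails=31):
        # Binomial likelihood of the observed counts given HEADS probability p
        return comb(heads + tails, heads) * p**heads * (1 - p)**tails

    for p in (1/3, 1/2, 2/3):
        print(f"p = {p:.3f}  likelihood = {likelihood(p):.4f}")
    # The likelihood is largest at p = 2/3, the ML estimate among the three candidates.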

31 Continuous Distribution. The normal distribution has probability density function f(x | μ, σ²) = (1 / (σ √(2π))) exp( −(x − μ)² / (2σ²) ). The corresponding probability density for a sample of n independent identically distributed normal random variables (the likelihood) is Π_{i=1..n} f(x_i | μ, σ²) = (2πσ²)^(−n/2) exp( −Σ_{i=1..n} (x_i − μ)² / (2σ²) ). http://en.wikipedia.org/wiki/Maximum_likelihood

32 Continuous Distribution. Since the logarithm is a continuous, strictly increasing function over the range of the likelihood, the values which maximize the likelihood also maximize its logarithm. Differentiating the log likelihood with respect to μ and equating it to zero gives the estimate μ̂ = (1/n) Σ_{i=1..n} x_i, the sample mean.

33 Continuous Distribution. Similarly, we differentiate the log likelihood with respect to σ and equate to zero, which gives σ̂² = (1/n) Σ_{i=1..n} (x_i − μ̂)².
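
A minimal sketch of these closed-form estimates applied to a made-up sample; note that the ML variance estimate divides by n, not n − 1.

    import math

    x = [4.9, 5.1, 5.3, 4.7, 5.0, 5.2]    # illustrative sample
    n = len(x)

    mu_hat = sum(x) / n                                    # ML estimate of the mean
    sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n   # ML estimate of the variance
    sigma_hat = math.sqrt(sigma2_hat)

    print(mu_hat, sigma_hat)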

34 Maximum Likelihood of Real Valued Function. When learning a continuous-valued target function (as in neural networks or linear regression), many learners attempt to minimize the sum of squared errors over the training data.

35 Maximum Likelihood of Real Valued Function. Learning a real-valued function: the target function f corresponds to the solid line. The training examples are assumed to have normally distributed noise e_i with zero mean added to the true value f(x_i), that is, d_i = f(x_i) + e_i. The dashed line corresponds to the linear function that minimizes the sum of squared errors; therefore, it is the maximum likelihood hypothesis, given the five training examples.

36 Maximum Likelihood of real valued function. Probability densities are defined over continuous variables, such as e. The probability density p(x_0) is the limit, as ε goes to zero, of 1/ε times the probability that x will take on a value in the interval [x_0, x_0 + ε): p(x_0) ≡ lim_{ε→0} (1/ε) P(x_0 ≤ x < x_0 + ε).

37 Maximum Likelihood of real valued function. Starting with our earlier definition, and using lower case p to refer to the probability density, h_ML = argmax_{h∈H} p(D|h). We assume a fixed set of training instances <x_1,...,x_m> and therefore consider the data D to be the corresponding sequence of target values D = <d_1,...,d_m>, where d_i = f(x_i) + e_i. Assuming the training examples are mutually independent given h, h_ML = argmax_{h∈H} Π_{i=1..m} p(d_i|h).

38 Maximum Likelihood of real valued function. If e_i obeys a normal distribution with zero mean and unknown standard deviation σ, then d_i obeys a normal distribution with mean f(x_i) and standard deviation σ. Thus p(d_i|h) is a normal density with mean f(x_i) and standard deviation σ. Because we are writing the probability of d_i given that h is the correct description of the target function f, we also substitute μ = f(x_i) = h(x_i).

39 Maximum Likelihood of real valued function. h_ML is the hypothesis that minimizes the sum of squared errors between the observed training values d_i and the hypothesis predictions h(x_i): h_ML = argmin_{h∈H} Σ_{i=1..m} (d_i − h(x_i))². This result rests on the assumption of normally distributed noise.
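
As an illustration of this equivalence, a minimal sketch that fits a line by least squares to made-up data; under the Gaussian-noise assumption this line is the ML hypothesis within the class of linear functions.

    def least_squares_line(xs, ds):
        # Closed-form least-squares fit of d = w1*x + w0
        n = len(xs)
        mean_x, mean_d = sum(xs) / n, sum(ds) / n
        slope = sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, ds)) / \
                sum((x - mean_x) ** 2 for x in xs)
        intercept = mean_d - slope * mean_x
        return slope, intercept

    xs = [1.0, 2.0, 3.0, 4.0, 5.0]          # five training inputs (illustrative)
    ds = [1.2, 1.9, 3.2, 3.8, 5.1]          # observed targets d_i = f(x_i) + e_i
    w1, w0 = least_squares_line(xs, ds)
    sse = sum((d - (w1 * x + w0)) ** 2 for x, d in zip(xs, ds))   # sum of squared errors
    print(w1, w0, sse)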

40 Why the normal distribution? It permits a mathematically straightforward analysis. It is a good approximation to many types of noise in physical systems. The sum of a large number of independent, identically distributed random variables is itself approximately normally distributed. Minimizing the sum of squared errors is a common approach in many neural network, curve fitting, and other approaches to approximating real-valued functions.

41 Bayes Optimal Classifier. A weighted majority classifier. What is the most probable classification of the new instance given the training data? The most probable classification is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities. If the classification of the new example can take any value v_j from some set V, then the probability P(v_j|D) that the correct classification is v_j is P(v_j|D) = Σ_{h_i∈H} P(v_j|h_i) P(h_i|D). Bayes optimal classification: argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D).

42 Bayes Optimal Classifier Example. For a new instance with V = {+, −}: P(h_1|D)=0.4, P(−|h_1)=0, P(+|h_1)=1; P(h_2|D)=0.3, P(−|h_2)=1, P(+|h_2)=0; P(h_3|D)=0.3, P(−|h_3)=1, P(+|h_3)=0. Output: P(+|D) = 0.4 and P(−|D) = 0.6, so the Bayes optimal classification is −.
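
A minimal sketch that reproduces this weighted vote; the dictionary layout is ours.

    posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}          # P(h_i | D)
    predictions = {                                          # P(v | h_i) for v in {+, -}
        "h1": {"+": 1.0, "-": 0.0},
        "h2": {"+": 0.0, "-": 1.0},
        "h3": {"+": 0.0, "-": 1.0},
    }

    # P(v|D) = sum over hypotheses of P(v|h_i) * P(h_i|D)
    p_v = {v: sum(posteriors[h] * predictions[h][v] for h in posteriors) for v in ("+", "-")}
    print(p_v)                         # {'+': 0.4, '-': 0.6}
    print(max(p_v, key=p_v.get))       # '-' : the Bayes optimal classification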

43 Naive Bayes Classifier. A new instance is described by a tuple of attribute values <a_1, a_2,...,a_n>. The learner is asked to predict the target value, or classification, for the new instance. The Bayesian approach is to assign the most probable target value, v_MAP = argmax_{v_j∈V} P(v_j | a_1,...,a_n). Using Bayes theorem to rewrite this: v_MAP = argmax_{v_j∈V} P(a_1,...,a_n | v_j) P(v_j) / P(a_1,...,a_n) = argmax_{v_j∈V} P(a_1,...,a_n | v_j) P(v_j).

44 Naive Bayes Classifier. Estimating P(a_i|v_j) is much easier than estimating P(a_1,...,a_n|v_j). Assuming the attribute values are conditionally independent given the target value, v_NB = argmax_{v_j∈V} P(v_j) Π_i P(a_i|v_j). Whenever this naive Bayes assumption of conditional independence is satisfied, the naive Bayes classification v_NB is identical to the MAP classification.

45 PlayTennis training examples
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

46 Naive Bayes Classifier. The play tennis example provides 14 instances; each instance has four attributes. Classify the new instance <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>: – P(yes)=9/14=0.64 – P(no)=5/14=0.36 – P(strong|yes)=3/9=0.33 – P(strong|no)=3/5=0.60 – P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes)=0.0053 – P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no)=0.0206 – v_NB = no.
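
A minimal sketch that recomputes these numbers directly from the table on slide 45; the data is hard-coded and the new instance is <sunny, cool, high, strong>.

    data = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis)
        ("Sunny","Hot","High","Weak","No"),    ("Sunny","Hot","High","Strong","No"),
        ("Overcast","Hot","High","Weak","Yes"),("Rain","Mild","High","Weak","Yes"),
        ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
        ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
        ("Sunny","Cool","Normal","Weak","Yes"),("Rain","Mild","Normal","Weak","Yes"),
        ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
        ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
    ]
    new = ("Sunny", "Cool", "High", "Strong")

    scores = {}
    for label in ("Yes", "No"):
        rows = [r for r in data if r[-1] == label]
        score = len(rows) / len(data)                 # P(v_j)
        for i, value in enumerate(new):               # product of P(a_i | v_j)
            score *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[label] = score

    print(scores)                       # roughly {'Yes': 0.0053, 'No': 0.0206}
    print(max(scores, key=scores.get))  # 'No'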

47 Estimating Probabilities. We estimated P(Wind=strong|PlayTennis=no) by the fraction n_c/n, which gives a poor estimate when n_c is small. The m-estimate of probability is (n_c + m·p) / (n + m), where n is the number of training examples for which v=v_j, n_c is the number of examples for which v=v_j and a=a_i, p is a prior estimate of the probability, and m is a constant called the equivalent sample size, which determines how heavily the prior is weighted.
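
A minimal sketch of the m-estimate; the choice of prior p = 0.5 and equivalent sample size m = 3 below is an assumption for illustration, applied to the Wind=strong / PlayTennis=no counts.

    def m_estimate(n_c, n, p, m):
        """(n_c + m*p) / (n + m): smoothed estimate of a conditional probability."""
        return (n_c + m * p) / (n + m)

    # Compare with the raw fraction 3/5 = 0.60
    print(m_estimate(n_c=3, n=5, p=0.5, m=3))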

48 Learning to Classify Text Target concept: Electronic news articles that I find interesting Consider an instance space X consisting of all possible text documents. We are given training examples of some unknown target function f(x), which can take on any value from some finite set V. V={like,dislike}

49 Learning to Classify Text. Two main design issues: – How to represent a text document in terms of attribute values – How to estimate the probabilities. Representation of an arbitrary text document: we define an attribute for each word position in the document and define the value of that attribute to be the English word found in that position. Suppose we have 1000 text documents, of which 700 are marked as dislike and 300 are marked as like, and we must determine whether we like or dislike a new document: "This is an example document for the naive Bayes classifier. This document contains only one paragraph, or two sentences."

50 Learning to Classify Text. Is the independence assumption reasonable? Is it practical?

51 Learning to Classify Text. We must estimate P(v_j) and P(a_i=w_k|v_j). The class conditional probabilities are more problematic because we must estimate one such term for each combination of text position, English word, and target value. An additional reasonable assumption reduces the number of probabilities that must be estimated: assume the attributes are independent and identically distributed, so that P(a_i=w_k|v_j) = P(w_k|v_j) for every position i (position independence). We then estimate P(w_k|v_j) as (n_k + 1) / (n + |Vocabulary|).

52 Learning to Classify Text. Learn_Naive_Bayes_Text(Examples, V). Examples is a set of text documents along with their target values. V is the set of all possible target values. This function learns the probability terms P(w_k|v_j) and P(v_j); P(w_k|v_j) is the probability that a randomly drawn word from a document in class v_j will be the English word w_k.
– Collect all words, punctuation, and other tokens that occur in Examples
– Vocabulary ← the set of distinct words and other tokens occurring in any text document from Examples
– Calculate the required P(v_j) and P(w_k|v_j) probability terms. For each target value v_j in V do
  – docs_j ← the subset of documents from Examples for which the target value is v_j
  – P(v_j) ← |docs_j| / |Examples|
  – Text_j ← a single document created by concatenating all members of docs_j
  – n ← total number of distinct word positions in Text_j
  – For each word w_k in Vocabulary
    » n_k ← number of times word w_k occurs in Text_j
    » P(w_k|v_j) ← (n_k + 1) / (n + |Vocabulary|)

53 Learning to Classify Text. Classify_Naive_Bayes_Text(Doc). Returns the estimated target value for the document Doc; a_i denotes the word found in the ith position within Doc.
– positions ← all word positions in Doc that contain tokens found in Vocabulary
– Return v_NB = argmax_{v_j∈V} P(v_j) Π_{i∈positions} P(a_i|v_j)
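
A minimal sketch of the two procedures above; the two training documents and their labels are invented for illustration, and logarithms are used to avoid underflow when multiplying many small probabilities.

    import math
    from collections import Counter

    def learn_naive_bayes_text(examples, V):
        """examples: list of (document_string, target_value); V: set of target values."""
        vocabulary = set(w for doc, _ in examples for w in doc.lower().split())
        priors, word_probs = {}, {}
        for v in V:
            docs_v = [doc for doc, label in examples if label == v]
            priors[v] = len(docs_v) / len(examples)
            text_v = " ".join(docs_v).lower().split()    # concatenate all docs of class v
            counts = Counter(text_v)
            n = len(text_v)
            # P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)
            word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
        return vocabulary, priors, word_probs

    def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
        positions = [w for w in doc.lower().split() if w in vocabulary]
        scores = {v: math.log(priors[v]) + sum(math.log(word_probs[v][w]) for w in positions)
                  for v in priors}
        return max(scores, key=scores.get)

    examples = [("I really like this paper", "like"),
                ("boring and badly written", "dislike")]
    model = learn_naive_bayes_text(examples, {"like", "dislike"})
    print(classify_naive_bayes_text("a really boring paper", *model))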

54 Bayesian Belief Networks. A Bayesian belief network is built on the notion of conditional independence. Let X, Y, and Z be three discrete-valued random variables. The set of variables X_1...X_l is conditionally independent of the set of variables Y_1...Y_m given the set of variables Z_1...Z_n if P(X_1...X_l | Y_1...Y_m, Z_1...Z_n) = P(X_1...X_l | Z_1...Z_n). This is what allows the naive Bayes classifier to calculate, for example, P(A_1, A_2 | V) = P(A_1 | V) P(A_2 | V).

55 Representation. A Bayesian belief network represents the joint probability distribution for a set of variables. For example, the figure represents the joint probability distribution over the Boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup.

56 Representation. The joint probability for any desired assignment of values (y_1,...,y_n) to the tuple of network variables (Y_1,...,Y_n) can be computed by the formula P(y_1,...,y_n) = Π_{i=1..n} P(y_i | Parents(Y_i)), where Parents(Y_i) denotes the immediate predecessors of Y_i in the network. The set of local conditional probability tables for all the variables, together with the set of conditional independence assumptions described by the network, therefore specifies the full joint distribution.
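
A minimal sketch of this product-of-local-tables computation; the three-variable network and its probabilities are illustrative, not the values from the figure.

    # Each variable stores its parents and P(var=True | parent values).
    network = {
        "Storm":        ((), {(): 0.2}),
        "BusTourGroup": ((), {(): 0.5}),
        "Campfire":     (("Storm", "BusTourGroup"),
                         {(True, True): 0.4, (True, False): 0.1,
                          (False, True): 0.8, (False, False): 0.2}),
    }

    def joint_probability(assignment, network):
        """P(y1,...,yn) = product over i of P(y_i | Parents(Y_i))."""
        prob = 1.0
        for var, (parents, table) in network.items():
            parent_values = tuple(assignment[par] for par in parents)
            p_true = table[parent_values]
            prob *= p_true if assignment[var] else (1.0 - p_true)
        return prob

    print(joint_probability({"Storm": True, "BusTourGroup": False, "Campfire": True}, network))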

57 Gradient Ascent Training of Bayesian Networks. Let w_ijk denote a single entry in one of the conditional probability tables. That is, w_ijk denotes the conditional probability that the network variable Y_i will take on the value y_ij given that its immediate parents U_i take on the values given by u_ik. For example, if w_ijk is the top right entry in the conditional probability table for Campfire, then Y_i is the variable Campfire, U_i is its parents <Storm, BusTourGroup>, y_ij = True, and u_ik = <Storm=False, BusTourGroup=False>.

58 Gradient Ascent Training of Bayesian Networks. The gradient of ln P(D|h) is given by the derivatives ∂ ln P(D|h) / ∂w_ijk for each of the w_ijk, which take the form ∂ ln P(D|h)/∂w_ijk = Σ_{d∈D} P(Y_i=y_ij, U_i=u_ik | d) / w_ijk. For example, to calculate this derivative for the table entry above, we need to calculate P(Campfire=True, Storm=False, BusTourGroup=False | d) for each training example d in D.

59 Gradient Ascent Training of Bayesian Networks. Using the abbreviation P_h(D) to represent P(D|h), and assuming the training examples d in the data set D are drawn independently, we can write ln P_h(D) = Σ_{d∈D} ln P_h(d), so the gradient can be computed one training example at a time.

60 Gradient Ascent Training of Bayesian Networks

61 Update the weight. Perform gradient ascent by updating w_ijk ← w_ijk + η Σ_{d∈D} P_h(y_ij, u_ik | d) / w_ijk, where η is a small learning rate. We then renormalize the weights to ensure that the constraints above are satisfied, that is, that each w_ijk stays in [0,1] and Σ_j w_ijk = 1. Another method for learning in Bayesian networks is the EM algorithm.

62 EM Algorithm. Background: – Only a subset of the relevant instance features might be observable. – The EM algorithm can be used even for variables whose values are never directly observed, provided the general form of the probability distribution governing these variables is known.

63 Estimating Means of k Gaussians. Consider a problem in which the data D is a set of instances generated by a probability distribution that is a mixture of k distinct normal distributions. Simple version: assume the variances of the k distinct normal distributions are the same.

64 Estimating Means of k Gaussians. Input: observed data instances x_1,...,x_m. Output: the mean values of the k Gaussian distributions, h = <μ_1,...,μ_k>. Find the hypothesis h that maximizes p(D|h).

65 Estimating Means of k Gaussians. Describe each instance as the triple <x_i, z_i1, z_i2>, where x_i is the observed value of the ith instance and z_i1 and z_i2 indicate which of the two normal distributions was used to generate x_i. EM searches for a maximum likelihood hypothesis by repeatedly re-estimating the expected values of the hidden variables z_ij given its current hypothesis. Example: – Initialize h = <μ_1, μ_2>. – Step 1: calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = <μ_1, μ_2> holds. – Step 2: calculate a new maximum likelihood hypothesis h' = <μ'_1, μ'_2>, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated in Step 1. Then replace the hypothesis h = <μ_1, μ_2> by the new hypothesis h' = <μ'_1, μ'_2> and iterate.

66 Estimating Means of k Gaussians. E[z_ij] is the probability that instance x_i was generated by the jth normal distribution: E[z_ij] = p(x=x_i | μ=μ_j) / Σ_{n=1..k} p(x=x_i | μ=μ_n) = exp(−(x_i−μ_j)²/(2σ²)) / Σ_{n=1..k} exp(−(x_i−μ_n)²/(2σ²)). Then derive a new maximum likelihood hypothesis h' = <μ'_1,...,μ'_k> with μ_j ← Σ_{i=1..m} E[z_ij] x_i / Σ_{i=1..m} E[z_ij].
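
A minimal sketch of this two-step EM loop for a mixture of Gaussians with a shared, known variance; the data, initial means, variance, and iteration count are assumptions for illustration.

    import math, random

    def em_k_means(x, k, sigma2=1.0, iterations=20):
        random.seed(0)
        mu = random.sample(x, k)                       # initialize h = <mu_1,...,mu_k>
        for _ in range(iterations):
            # E step: E[z_ij] = p(x_i | mu_j) / sum_n p(x_i | mu_n)
            E = []
            for xi in x:
                w = [math.exp(-(xi - m) ** 2 / (2 * sigma2)) for m in mu]
                total = sum(w)
                E.append([wj / total for wj in w])
            # M step: mu_j <- sum_i E[z_ij] * x_i / sum_i E[z_ij]
            mu = [sum(E[i][j] * x[i] for i in range(len(x))) /
                  sum(E[i][j] for i in range(len(x))) for j in range(k)]
        return mu

    data = [1.1, 0.8, 1.3, 4.9, 5.2, 5.0, 4.7]
    print(em_k_means(data, k=2))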

67 General Statement of EM Algorithm. We wish to estimate some set of parameters θ (for example, θ = <μ_1, μ_2> for the mixture of two Gaussians). Let X denote the observed data, let Z denote the unobserved data in these instances, and let Y = X ∪ Z denote the full data. Y is a random variable because it is defined in terms of the random variable Z.

68 General Statement of EM Algorithm. EM searches for the maximum likelihood hypothesis h' by seeking the h' that maximizes E[ln P(Y|h')]. – P(Y|h') is the likelihood of the full data Y given hypothesis h'. – Maximizing ln P(Y|h') also maximizes P(Y|h'). – We take the expected value E[ln P(Y|h')] over the probability distribution governing the random variable Y. The EM algorithm uses its current hypothesis h in place of the actual parameters θ to estimate the distribution governing Y. Define the function Q(h'|h) = E[ln P(Y|h') | h, X].

69 General Statement of EM Algorithm. The EM algorithm: – Estimation step: calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y: Q(h'|h) ← E[ln P(Y|h') | h, X]. – Maximization step: replace hypothesis h by the hypothesis h' that maximizes this Q function: h ← argmax_{h'} Q(h'|h). When the function Q is continuous, the EM algorithm converges to a stationary point of the likelihood function P(Y|h').

70 Derivation of the k Means Algorithm. Problem: – Estimate the means of a mixture of k normal distributions, θ = <μ_1,...,μ_k>. – Observed data X = {<x_i>}. – The hidden variables Z = {<z_i1,...,z_ik>} indicate which of the k normal distributions was used to generate each x_i. Derive an expression for Q(h'|h) that applies to this k-means problem. For a single instance, p(y_i|h') = p(x_i, z_i1,...,z_ik | h') = (1/√(2πσ²)) exp( −Σ_{j=1..k} z_ij (x_i − μ'_j)² / (2σ²) ).

71 Derivation of the k Means Algorithm. Given this probability p(y_i|h') for a single instance, the logarithm of the probability ln P(Y|h') for the m instances in the data is ln P(Y|h') = Σ_{i=1..m} ln p(y_i|h') = Σ_{i=1..m} ( ln(1/√(2πσ²)) − Σ_{j=1..k} z_ij (x_i − μ'_j)² / (2σ²) ). Because this expression is linear in the z_ij, taking the expected value gives Q(h'|h) = E[ln P(Y|h')] = Σ_{i=1..m} ( ln(1/√(2πσ²)) − Σ_{j=1..k} E[z_ij] (x_i − μ'_j)² / (2σ²) ).

72 Derivation of the k Means Algorithm. Find the values μ'_1,...,μ'_k that maximize this Q function: argmax_{h'} Q(h'|h) reduces to argmin_{h'} Σ_{i=1..m} Σ_{j=1..k} E[z_ij] (x_i − μ'_j)². Setting the derivatives to zero, we have μ_j ← Σ_{i=1..m} E[z_ij] x_i / Σ_{i=1..m} E[z_ij], with the E[z_ij] computed in the estimation step.

