
1 Bayesian Learning

2 Outline Uncertainty and probability – Bayes' rule – Choosing hypotheses: maximum a posteriori, maximum likelihood – Bayes concept learning – Maximum likelihood of real-valued functions – Bayes optimal classifier – Joint distributions – Naive Bayes classifier

3 Uncertainty Our main tool is probability theory, which assigns to each sentence a numerical degree of belief between 0 and 1. It provides a way of summarizing uncertainty.

4 Variables Boolean random variables: cavity might be true or false. Discrete random variables: weather might be sunny, rainy, cloudy, or snow – P(Weather=sunny) – P(Weather=rainy) – P(Weather=cloudy) – P(Weather=snow). Continuous random variables: the temperature takes continuous values.

5 Where do probabilities come from? Frequentist: – from experiments: from any finite sample we can estimate the true fraction and also calculate how accurate our estimate is likely to be. Subjectivist: – an agent's degree of belief. Objectivist: – part of the true nature of the universe; e.g., that a coin comes up heads with probability 0.5 is a property of the coin itself.

6 Axioms of Probability Before the evidence is obtained: prior probability – P(a) is the prior probability that proposition a is true – P(cavity)=0.1. After the evidence is obtained: posterior probability – P(a|b), the probability of a given that all we know is b – P(cavity|toothache)=0.8.

7 Axioms of Probability All probabilities are between 0 and 1: for any proposition a, 0 ≤ P(a) ≤ 1. P(true)=1, P(false)=0. The probability of a disjunction is given by P(a ∨ b) = P(a) + P(b) − P(a ∧ b).

8 Axioms of Probability Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a).

9 Theorem of total probability If events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then P(B) = Σ_{i=1}^{n} P(B|A_i) P(A_i).

10 Bayes' rule P(b|a) = P(a|b) P(b) / P(a).

11 Bayes Theorem P(h|D) = P(D|h) P(h) / P(D), where P(h) = prior probability of hypothesis h, P(D) = prior probability of training data D, P(h|D) = probability of h given D, and P(D|h) = probability of D given h.

12 Choosing Hypotheses Generally we want the most probable hypothesis given the training data: the maximum a posteriori hypothesis h_MAP = argmax_{h∈H} P(h|D).

13 Choosing Hypotheses Applying Bayes theorem, and dropping P(D) because it does not depend on h: h_MAP = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h).

14 Maximum Likelihood (ML) If we assume P(h_i) = P(h_j) for all h_i and h_j, we can simplify further and choose the maximum likelihood (ML) hypothesis h_ML = argmax_{h∈H} P(D|h).

15 Example Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result (+) in only 98% of the cases in which the disease is actually present, and a correct negative result (−) in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population has this cancer.

16 Example P(cancer)=0.008, P(¬cancer)=0.992, P(+|cancer)=0.98, P(−|cancer)=0.02, P(+|¬cancer)=0.03, P(−|¬cancer)=0.97. For a positive test result: P(+|cancer)P(cancer) = 0.98 × 0.008 ≈ 0.0078 and P(+|¬cancer)P(¬cancer) = 0.03 × 0.992 ≈ 0.0298, so h_MAP = ¬cancer.
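
To make the arithmetic concrete, here is a minimal Python sketch of the same MAP comparison; all numbers are taken directly from the example above.

```python
p_cancer = 0.008
p_not_cancer = 1 - p_cancer          # 0.992
p_pos_given_cancer = 0.98            # correct positive rate
p_pos_given_not_cancer = 1 - 0.97    # false positive rate = 0.03

# Unnormalized posteriors P(+|h) * P(h) for each hypothesis.
score_cancer = p_pos_given_cancer * p_cancer              # 0.98 * 0.008 ≈ 0.0078
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.03 * 0.992 ≈ 0.0298

h_map = "cancer" if score_cancer > score_not_cancer else "not cancer"
print(h_map)  # -> not cancer: the MAP hypothesis despite the positive test
```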

17 Normalization The two quantities above can be normalized so they sum to 1, giving P(cancer|+) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21. The result of Bayesian inference depends strongly on the prior probabilities, which must be available in order to apply the method.

18 Brute-Force Bayes Concept Learning For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h) P(h) / P(D). Output the hypothesis h_MAP with the highest posterior probability.

19 Brute-Force Bayes Concept Learning Given no prior knowledge that one hypothesis is more likely than another, what values should we specify for P(h)? What choice shall we make for P(D|h)? The algorithm may require significant computation, because it applies Bayes theorem to every hypothesis in H to calculate P(h|D).
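
A minimal Python sketch of this brute-force procedure, assuming the caller supplies the hypothesis list and the prior() and likelihood() functions (both hypothetical placeholders):

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the hypothesis maximizing the unnormalized posterior P(D|h) * P(h)."""
    best_h, best_score = None, float("-inf")
    for h in hypotheses:
        score = likelihood(data, h) * prior(h)  # Bayes theorem numerator
        if score > best_score:
            best_h, best_score = h, score
    return best_h  # h_MAP; P(D) is the same for every h, so it can be ignored
```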

20 Assumptions of Special MAP Assumptions – The training data D is noise free (i.e., d_i = c(x_i)) – The target concept c is contained in the hypothesis space H – We have no a priori reason to believe that any hypothesis is more probable than any other. Outcome – Choose P(h) to be the uniform distribution over H – P(D|h) = 1 if h is consistent with D – P(D|h) = 0 otherwise.

21 Assumptions of Special MAP The version space VS_{H,D} is the subset of hypotheses from H that are consistent with the training examples in D.

22 Assumptions of Special MAP P(h|D) = 1 / |VS_{H,D}| if h is consistent with D; P(h|D) = 0 if h is inconsistent with D.

23 MAP Hypotheses and Consistent Learner A consistent learner is one that outputs a hypothesis that commits zero errors over the training examples. Every consistent learner outputs a MAP hypothesis, provided – a uniform prior probability distribution over H (P(h_i) = P(h_j) for all i and j), and – noise-free training data.

24 MAP Hypotheses and Consistent Learner Because Find-S outputs a maximally specific hypothesis from the version space, its output hypothesis will be a MAP hypothesis relative to any prior probability distribution that favors more specific hypotheses, i.e., any distribution P(h) over H that assigns P(h_i) ≥ P(h_j) whenever h_i is more specific than h_j.

25 Maximum Likelihood of a Real-Valued Function Learners of continuous-valued target functions (e.g., neural networks, linear regression) attempt to minimize the sum of squared errors over the training data.

26 Maximum Likelihood of a Real-Valued Function Learning a real-valued function: the target function f corresponds to the solid line. The training examples are assumed to have Normally distributed noise e_i with zero mean added to the true value f(x_i), that is, d_i = f(x_i) + e_i. The dashed line corresponds to the linear function that minimizes the sum of squared errors; therefore, it is the maximum likelihood hypothesis, given these five training examples.

27 Maximum Likelihood of a Real-Valued Function Probability densities are defined over continuous variables such as e. The probability density p(x_0) is the limit, as ε goes to zero, of 1/ε times the probability that x will take on a value in the interval [x_0, x_0 + ε).

28 Maximum Likelihood of a Real-Valued Function Starting with our earlier definition, and using lower-case p to refer to a probability density: h_ML = argmax_{h∈H} p(D|h). We assume a fixed set of training instances x_1, ..., x_m and therefore consider the data D to be the corresponding sequence of target values D = <d_1, ..., d_m>, where d_i = f(x_i) + e_i. Assuming the training examples are mutually independent given h, h_ML = argmax_{h∈H} Π_{i=1}^{m} p(d_i|h).

29 Maximum Likelihood of a Real-Valued Function If e_i obeys a Normal distribution with zero mean and unknown standard deviation σ, then each d_i obeys a Normal distribution with mean f(x_i) and standard deviation σ, and so does p(d_i|h). Because we are writing the probability of d_i given that h is the correct description of the target function f, we substitute μ = f(x_i) = h(x_i).

30 Maximum Likelihood of a Real-Valued Function h_ML is the hypothesis that minimizes the sum of squared errors between the observed training values d_i and the hypothesis predictions h(x_i): h_ML = argmin_{h∈H} Σ_{i=1}^{m} (d_i − h(x_i))². This result rests on the Normal-distribution noise assumption.
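
As an illustration of this result, the following sketch fits a straight line by least squares, which under the Gaussian-noise assumption above is the maximum likelihood hypothesis; the data points are invented for the example.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([0.1, 0.9, 2.2, 2.8, 4.1])      # noisy targets d_i = f(x_i) + e_i

A = np.column_stack([x, np.ones_like(x)])    # design matrix for h(x) = w*x + b
(w, b), *_ = np.linalg.lstsq(A, d, rcond=None)

sum_squared_error = np.sum((d - (w * x + b)) ** 2)  # the quantity h_ML minimizes
print(w, b, sum_squared_error)
```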

31 Why the Normal distribution? Mathematically straightforward analysis. Good approximation to many types of noise in physical systems. The sum of a large number of independent, identically distributed random variables is itself approximately Normally distributed. Minimizing the sum of squared errors is a common approach in neural networks, curve fitting, and other approaches to approximating real-valued functions.

32 Maximum Likelihood Hypothesis for Predicting Probability Consider the setting in which we wish to learn a nondeterministic function f: X → {0,1}. We might expect f to be probabilistic; for example, in neural network learning we might want the network to output the probability that f(x)=1 – the target is then f': X → [0,1] with f'(x) = P(f(x)=1).

33 Maximum Likelihood Hypothesis for Predicting Probability Brute-force approach – collect d_i, the observed 0 or 1 value for f(x_i) – assume the training data D is of the form D = {<x_1, d_1>, ..., <x_m, d_m>}.

34 Maximum Likelihood Hypothesis for Predicting Probability h_ML = argmax_{h∈H} Π_{i=1}^{m} h(x_i)^{d_i} (1 − h(x_i))^{1−d_i}, or equivalently, taking logarithms, h_ML = argmax_{h∈H} Σ_{i=1}^{m} d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)).

35 Gradient Search to Maximize Likelihood in a Neural Net Let G(h,D) denote the quantity Σ_{i=1}^{m} d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i)). To keep our analysis simple, suppose the neural network is constructed from a single layer of sigmoid units. In this case the partial derivative of G(h,D) with respect to w_jk, the weight from input k to unit j, is ∂G(h,D)/∂w_jk = Σ_{i=1}^{m} (d_i − h(x_i)) x_{ijk}, where x_{ijk} is the kth input to unit j for the ith training example.

36 Gradient Search to Maximize Likelihood in a Neural Net Because we seek to maximize P(D|h) rather than minimize it, we perform gradient ascent rather than gradient descent: w_jk ← w_jk + η Σ_{i=1}^{m} (d_i − h(x_i)) x_{ijk}. This is similar to the gradient search used by Backpropagation.
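
A small sketch of this gradient-ascent rule for a single sigmoid unit; the toy data, learning rate, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 0.2], [1.0, 0.8], [1.0, 0.5]])  # first column acts as a bias input
d = np.array([0.0, 1.0, 1.0])                       # observed 0/1 values of f(x_i)
w = np.zeros(2)
eta = 0.1

for _ in range(1000):
    h = sigmoid(X @ w)
    w += eta * X.T @ (d - h)   # ascent on G(h,D): delta_w_k = eta * sum_i (d_i - h(x_i)) x_ik

print(w, sigmoid(X @ w))       # learned weights and the predicted probabilities
```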

37 Bayes Optimal Classifier A weighted majority classifier. What is the most probable classification of a new instance given the training data? The most probable classification is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities. If the classification of the new example can take any value v_j from some set V, then the probability P(v_j|D) that the correct classification is v_j is P(v_j|D) = Σ_{h_i∈H} P(v_j|h_i) P(h_i|D), and the Bayes optimal classification is argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D).

38 Bayes Optimal Classifier Example Given a new instance with possible classifications V = {+, −}: P(h_1|D)=0.4, P(−|h_1)=0, P(+|h_1)=1; P(h_2|D)=0.3, P(−|h_2)=1, P(+|h_2)=0; P(h_3|D)=0.3, P(−|h_3)=1, P(+|h_3)=0. Therefore Σ_i P(+|h_i)P(h_i|D) = 0.4 and Σ_i P(−|h_i)P(h_i|D) = 0.6, so the output is −.
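
The weighted vote from this example, written out as a short sketch:

```python
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}       # P(h_i | D) from the slide
predictions = {"h1": {"+": 1.0, "-": 0.0},           # P(v | h_i) from the slide
               "h2": {"+": 0.0, "-": 1.0},
               "h3": {"+": 0.0, "-": 1.0}}

def bayes_optimal(values, posteriors, predictions):
    # argmax over v of sum_i P(v|h_i) * P(h_i|D)
    return max(values, key=lambda v: sum(predictions[h][v] * posteriors[h]
                                         for h in posteriors))

print(bayes_optimal(["+", "-"], posteriors, predictions))  # -> "-" (weight 0.6 vs 0.4)
```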

39 Gibbs Algorithm Bayes optimal classifier provides best result, but can be expensive if many hypotheses Gibbs algorithm: – Choose one hypothesis at random, according to P(h|D) – Use this to classify new instance

40 Gibbs Algorithm Suppose the correct target concept is drawn at random according to a uniform prior distribution over H. Then – pick any hypothesis from the version space at random, with uniform probability – its expected error is no worse than twice that of the Bayes optimal classifier.

41 Naive Bayes Classifier A new instance is described by a tuple of attribute values <a_1, a_2, ..., a_n>. The learner is asked to predict the target value, or classification, for the new instance. The Bayesian approach is to assign the most probable target value, v_MAP = argmax_{v_j∈V} P(v_j|a_1,...,a_n), which Bayes theorem lets us rewrite as v_MAP = argmax_{v_j∈V} P(a_1,...,a_n|v_j) P(v_j).

42 Naive Bayes Classifier Estimating P(a_i|v_j) is much easier than estimating P(a_1,...,a_n|v_j). The naive Bayes assumption is that the attribute values are conditionally independent given the target value, giving v_NB = argmax_{v_j∈V} P(v_j) Π_i P(a_i|v_j). Whenever this conditional independence assumption is satisfied, the naive Bayes classification v_NB is identical to the MAP classification.

43 Naive Bayes Classifier The PlayTennis example provides 14 instances, each with four attributes – P(yes)=9/14=0.64 – P(no)=5/14=0.36 – P(strong|yes)=3/9=0.33 – P(strong|no)=3/5=0.60 – P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes)=0.0053 – P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no)=0.0206 – v_NB = no
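
A compact sketch of the naive Bayes computation on data like the PlayTennis table. The training list and new instance are whatever the caller supplies; smoothing is omitted here for brevity (see the m-estimate on the next slide).

```python
from collections import defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_dict, label). Returns P(v) and P(a_i=value | v)."""
    label_counts = defaultdict(int)
    pair_counts = defaultdict(lambda: defaultdict(int))
    for attrs, label in examples:
        label_counts[label] += 1
        for name_value in attrs.items():
            pair_counts[label][name_value] += 1
    total = sum(label_counts.values())
    priors = {v: c / total for v, c in label_counts.items()}
    conditionals = {v: {nv: c / label_counts[v] for nv, c in d.items()}
                    for v, d in pair_counts.items()}
    return priors, conditionals

def classify_nb(attrs, priors, conditionals):
    # v_NB = argmax_v P(v) * product_i P(a_i | v); unseen attribute values get 0 here
    def score(v):
        p = priors[v]
        for name_value in attrs.items():
            p *= conditionals[v].get(name_value, 0.0)
        return p
    return max(priors, key=score)
```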

44 Estimating Probabilities We estimate P(Wind=strong|PlayTennis=no) by the fraction n_c/n, which gives a poor estimate when n_c is small. The m-estimate of probability is (n_c + m·p) / (n + m), where n is the number of training examples for which v = v_j, n_c is the number of examples for which v = v_j and a = a_i, p is a prior estimate of the probability, and m is the weight given to that prior.

45 Learning to Classify Text Target concept: Electronic news articles that I find interesting Consider an instance space X consisting of all possible text documents. We are given training examples of some unknown target function f(x), which can take on any value from some finite set V. V={like,dislike}

46 Learning to Classify Text Two main design issues – how to represent an arbitrary text document in terms of attribute values – how to estimate the required probabilities. Representation of an arbitrary text document: we define an attribute for each word position in the document and define the value of that attribute to be the English word found in that position. Example: assume we have 1000 text documents, of which 700 are marked dislike and 300 are marked like, and we must determine whether to like or dislike a new document such as: "This is an example document for the naive Bayes classifier. This document contains only one paragraph, or two sentences."

47 Learning to Classify Text Is the independence assumption reasonable? Is it practical?

48 Learning to Classify Text We must estimate P(v_j) and P(a_i=w_k|v_j). The class-conditional probabilities are the more problematic, because we would have to estimate one such term for each combination of text position, English word, and target value. An additional reasonable assumption reduces the number of probabilities that must be estimated: assume the attributes are independent and identically distributed, so P(a_i=w_k|v_j) = P(w_k|v_j) for every position i (position independence). We then estimate P(w_k|v_j) as in the algorithm on the next slide.

49 Learning to Classify Text Learn_Naive_Bayes_Text(Examples, V): Examples is a set of text documents along with their target values; V is the set of all possible target values. This function learns the probability terms P(w_k|v_j) and P(v_j), where P(w_k|v_j) is the probability that a randomly drawn word from a document in class v_j will be the English word w_k – collect all words, punctuation, and other tokens that occur in Examples – Vocabulary ← the set of distinct words and other tokens occurring in any text document from Examples – calculate the required P(v_j) and P(w_k|v_j) probability terms: for each target value v_j in V do – docs_j ← the subset of documents from Examples for which the target value is v_j – P(v_j) ← |docs_j| / |Examples| – Text_j ← a single document created by concatenating all members of docs_j – n ← total number of distinct word positions in Text_j – for each word w_k in Vocabulary » n_k ← number of times word w_k occurs in Text_j » P(w_k|v_j) ← (n_k + 1) / (n + |Vocabulary|)

50 Learning to Classify Text Classify_Naive_Bayes_Text(Doc): return the estimated target value for the document Doc, where a_i denotes the word found in the ith position within Doc – positions ← all word positions in Doc that contain tokens found in Vocabulary – return v_NB = argmax_{v_j∈V} P(v_j) Π_{i∈positions} P(a_i|v_j)
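
A condensed Python sketch of Learn_Naive_Bayes_Text and Classify_Naive_Bayes_Text; tokenization is reduced to whitespace splitting, which simplifies the slide's word/token collection step.

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (document_string, target_value). Returns P(v_j), P(w_k|v_j), Vocabulary."""
    vocabulary = {w for doc, _ in examples for w in doc.split()}
    priors, word_probs = {}, {}
    for v in {label for _, label in examples}:
        docs_v = [doc for doc, label in examples if label == v]
        priors[v] = len(docs_v) / len(examples)
        counts = Counter(w for doc in docs_v for w in doc.split())
        n = sum(counts.values())                       # word positions in Text_j
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return priors, word_probs, vocabulary

def classify_naive_bayes_text(doc, priors, word_probs, vocabulary):
    words = [w for w in doc.split() if w in vocabulary]
    # Summing logs avoids underflow when multiplying many small probabilities.
    return max(priors, key=lambda v: math.log(priors[v]) +
               sum(math.log(word_probs[v][w]) for w in words))
```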

51 Bayesian Belief Networks A Bayesian belief network is built on the notion of conditional independence. Let X, Y, and Z be discrete-valued random variables. The set of variables X_1...X_l is conditionally independent of the set of variables Y_1...Y_m given the set of variables Z_1...Z_n if P(X_1...X_l | Y_1...Y_m, Z_1...Z_n) = P(X_1...X_l | Z_1...Z_n). This is what allows the naive Bayes classifier to calculate P(a_1, a_2|v_j) = P(a_1|v_j) P(a_2|v_j).

52 Representation A Bayesian belief network represents the joint probability distribution for a set of variables. For example, the figure represents the joint probability distribution over the boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup.

53 Representation The joint probability for any desired assignment of values (y_1,...,y_n) to the tuple of network variables (Y_1,...,Y_n) can be computed by P(y_1,...,y_n) = Π_{i=1}^{n} P(y_i | Parents(Y_i)). The distribution is fully specified by the set of local conditional probability tables for all the variables, together with the set of conditional independence assumptions described by the network.
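
A tiny sketch of this factored computation. The two-node network (Storm → Campfire) and its numbers are invented for illustration and are not the figure from the slide.

```python
# Each node stores its parent list and P(node=True) for every parent configuration.
network = {
    "Storm":    {"parents": [], "cpt": {(): 0.4}},
    "Campfire": {"parents": ["Storm"], "cpt": {(True,): 0.1, (False,): 0.6}},
}

def prob_true(var, assignment):
    parent_values = tuple(assignment[p] for p in network[var]["parents"])
    return network[var]["cpt"][parent_values]

def joint(assignment):
    # P(y_1,...,y_n) = product_i P(y_i | Parents(Y_i))
    p = 1.0
    for var, value in assignment.items():
        pt = prob_true(var, assignment)
        p *= pt if value else (1.0 - pt)
    return p

print(joint({"Storm": True, "Campfire": False}))  # 0.4 * (1 - 0.1) = 0.36
```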

54 Gradient Ascent Training of Bayesian Networks Let w_ijk denote a single entry in one of the conditional probability tables; that is, w_ijk denotes the conditional probability that the network variable Y_i will take on the value y_ij given that its immediate parents U_i take on the values given by u_ik. For example, if w_ijk is the top-right entry in the Campfire conditional probability table, then Y_i is the variable Campfire, U_i is the tuple of its parents, y_ij = True, and u_ik is the assignment of parent values for that column.

55 Gradient Ascent Training of Bayesian Networks The gradient of ln P(D|h) is given by the derivatives ∂ln P(D|h)/∂w_ijk = Σ_{d∈D} P(y_ij, u_ik | d) / w_ijk for each of the w_ijk. For the Campfire example, this means calculating P(Campfire=True, Storm=False, BusTourGroup=False | d) for each training example d in D.

56 Gradient Ascent Training of Bayesian Networks Use the abbreviation P_h(D) to represent P(D|h). Assuming the training examples d in the data set D are drawn independently, ln P_h(D) = Σ_{d∈D} ln P_h(d).

57 Gradient Ascent Training of Bayesian Networks

58 Update the Weights Perform a gradient ascent step on each w_ijk, then renormalize the weights to ensure the constraints above remain satisfied, that is, each w_ijk stays in [0,1] and the entries for each parent configuration sum to 1. An alternative training method is the EM algorithm.
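
A minimal sketch of one update-and-renormalize step for a single row of a conditional probability table; the learning rate and gradient values are placeholders.

```python
import numpy as np

def update_cpt_row(w_row, grad_row, eta=0.01):
    """w_row holds P(Y_i = y_ij | u_ik) over the values y_ij for one parent setting u_ik."""
    w_row = w_row + eta * grad_row        # gradient ascent step on ln P(D|h)
    w_row = np.clip(w_row, 1e-12, None)   # keep every entry positive
    return w_row / w_row.sum()            # renormalize so the entries sum to 1

print(update_cpt_row(np.array([0.4, 0.6]), np.array([2.0, -1.0])))
```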

59 EM Algorithm Background – only a subset of the relevant instance features might be observable – the EM algorithm can be used even for variables whose values are never directly observed, provided the general form of the probability distribution governing those variables is known.

60 Estimating Means of k Gaussians Consider a problem in which the data D is a set of instances generated by a probability distribution that is a mixture of k distinct Normal distributions. Simple version: assume the k Normal distributions all share the same variance.

61 Estimating Means of k Gaussians Input: observed data instances x_1,...,x_m. Output: the mean values of the k Gaussian distributions, h = <μ_1,...,μ_k>. Goal: find the hypothesis h that maximizes p(D|h).

62 Estimating Means of k Gaussians Describe each instance as the triple <x_i, z_i1, z_i2>, where x_i is the observed value of the ith instance and the hidden variables z_i1 and z_i2 indicate which of the two Normal distributions was used to generate x_i. EM searches for a maximum likelihood hypothesis by repeatedly re-estimating the expected values of the hidden variables z_ij given its current hypothesis. For k = 2: – initialize h = <μ_1, μ_2> – Step 1: calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = <μ_1, μ_2> holds – Step 2: calculate a new maximum likelihood hypothesis h' = <μ'_1, μ'_2>, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated in Step 1; then replace h by h' and iterate.

63 Estimating Means of k Gaussians E[z_ij] is the probability that instance x_i was generated by the jth Normal distribution: E[z_ij] = exp(−(x_i − μ_j)² / 2σ²) / Σ_{n=1}^{2} exp(−(x_i − μ_n)² / 2σ²). The new maximum likelihood hypothesis h' = <μ'_1, μ'_2> is given by μ'_j = Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij].
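
A short sketch of this two-step EM loop for two equal-variance Gaussians; sigma, the iteration count, and the synthetic data are illustrative choices.

```python
import numpy as np

def em_two_gaussians(x, sigma=1.0, iterations=50):
    mu = np.array([x.min(), x.max()], dtype=float)       # crude initial h = <mu_1, mu_2>
    for _ in range(iterations):
        # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
        resp = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mu_j = sum_i E[z_ij] * x_i / sum_i E[z_ij]
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

data = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_two_gaussians(data))   # means near 0 and 5
```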

64 General Statement of EM Algorithm We wish to estimate some set of parameters θ (for example, θ = <μ_1, μ_2> in the two-Gaussian problem above). Let X denote the observed data, let Z denote the unobserved data in these instances, and let Y = X ∪ Z denote the full data. Y is a random variable because it is defined in terms of the random variable Z.

65 General Statement of EM Algorithm EM searches for the maximum likelihood hypothesis h' by seeking the h' that maximizes E[ln P(Y|h')] – P(Y|h') is the likelihood of the full data Y given hypothesis h' – maximizing ln P(Y|h') also maximizes P(Y|h') – the expected value E[ln P(Y|h')] is taken over the probability distribution governing the random variable Y. Since the actual parameters θ are unknown, the EM algorithm uses its current hypothesis h in their place to estimate the distribution governing Y. Define the function Q(h'|h) = E[ln P(Y|h') | h, X].

66 General Statement of EM Algorithm EM algorithm – Estimation (E) step: calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y: Q(h'|h) ← E[ln P(Y|h') | h, X] – Maximization (M) step: replace hypothesis h by the hypothesis h' that maximizes this Q function: h ← argmax_{h'} Q(h'|h). When the function Q is continuous, the EM algorithm converges to a stationary point of the likelihood function P(Y|h').

67 Derivation of the k-Means Algorithm Problem – estimate the means of a mixture of k Normal distributions, θ = <μ_1,...,μ_k> – observed data X = {<x_i>} – hidden variables Z = {<z_i1,...,z_ik>} indicating which of the k Normal distributions was used to generate each x_i. To derive an expression for Q(h'|h) that applies to this problem, first write the probability of a single full-data instance y_i = <x_i, z_i1,...,z_ik>: p(y_i|h') = (1/√(2πσ²)) exp(−Σ_{j=1}^{k} z_ij (x_i − μ'_j)² / 2σ²).

68 Derivation of the k-Means Algorithm Given this probability p(y_i|h') for a single instance, the logarithm of the probability ln P(Y|h') for the m instances in the data is ln P(Y|h') = Σ_{i=1}^{m} ln p(y_i|h') = Σ_{i=1}^{m} (ln(1/√(2πσ²)) − Σ_j z_ij (x_i − μ'_j)² / 2σ²). Taking the expected value over Z, which enters this expression linearly, gives E[ln P(Y|h')] = Σ_{i=1}^{m} (ln(1/√(2πσ²)) − Σ_j E[z_ij] (x_i − μ'_j)² / 2σ²).

69 Derivation of the k-Means Algorithm Finding the values μ'_1,...,μ'_k that maximize this Q function reduces to minimizing Σ_{i=1}^{m} Σ_j E[z_ij] (x_i − μ'_j)², and we have E[z_ij] = exp(−(x_i − μ_j)² / 2σ²) / Σ_{n=1}^{k} exp(−(x_i − μ_n)² / 2σ²) and μ_j ← Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij].

