Bayesian Learning

Uncertainty & Probability – Bayes' rule – Choosing Hypotheses: Maximum a Posteriori and Maximum Likelihood – Bayes Concept Learning – Maximum Likelihood of Real-Valued Functions – Bayes Optimal Classifier – Joint Distributions – Naive Bayes Classifier

Uncertainty Our main tool is probability theory, which assigns to each sentence a numerical degree of belief between 0 and 1. It provides a way of summarizing uncertainty.

Variables Boolean random variables: Cavity might be true or false. Discrete random variables: Weather might be sunny, rainy, cloudy, or snow – P(Weather=sunny), P(Weather=rainy), P(Weather=cloudy), P(Weather=snow). Continuous random variables: the temperature takes continuous values.

Where do probabilities come from? Frequentist: from experiments – from any finite sample we can estimate the true fraction and also calculate how accurate our estimate is likely to be. Subjectivist: an agent's degree of belief. Objectivist: probabilities are part of the true nature of the universe – e.g., that a coin comes up heads with probability 0.5 is a property of the coin itself.

Prior and Posterior Probability Before the evidence is obtained: prior probability – P(a), the prior probability that proposition a is true – e.g., P(cavity)=0.1. After the evidence is obtained: posterior probability – P(a|b), the probability of a given that all we know is b – e.g., P(cavity|toothache)=0.8.

Axioms of Probability All probabilities are between 0 and 1: for any proposition a, 0 ≤ P(a) ≤ 1. P(true)=1, P(false)=0. The probability of a disjunction is given by P(a ∨ b) = P(a) + P(b) − P(a ∧ b).

Axioms of Probability Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a).

Theorem of total probability If events A_1, ..., A_n are mutually exclusive with Σ_{i=1..n} P(A_i) = 1, then P(B) = Σ_{i=1..n} P(B|A_i) P(A_i).

Bayes' rule P(a|b) = P(b|a) P(a) / P(b).

Bayes Theorem P(h|D) = P(D|h) P(h) / P(D), where P(h) = prior probability of hypothesis h, P(D) = prior probability of training data D, P(h|D) = probability of h given D, and P(D|h) = probability of D given h.

Choosing Hypotheses Generally we want the most probable hypothesis given the training data – the maximum a posteriori hypothesis h_MAP: h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h), since P(D) is constant over hypotheses.

Maximum Likelihood (ML) If we assume P(h_i) = P(h_j) for all h_i and h_j, we can simplify further and choose the maximum likelihood (ML) hypothesis h_ML = argmax_{h∈H} P(D|h).

Example Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result (+) in only 98% of the cases in which the disease is actually present, and a correct negative result (−) in only 97% of the cases in which the disease is not present. Furthermore, only 0.008 of the entire population has this cancer.

Example P(+|cancer) P(cancer) = 0.98 × 0.008 = 0.0078, while P(+|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298. Therefore h_MAP = ¬cancer: even after a positive test, the more probable hypothesis is that the patient does not have cancer.
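A minimal Python sketch of this calculation, using only the numbers given in the example:

```python
# Cancer-test example: compare P(+|h)P(h) for the two hypotheses
# h = cancer and h = ~cancer, and pick the MAP hypothesis.
p_cancer = 0.008                      # prior P(cancer)
p_pos_given_cancer = 0.98             # P(+|cancer)
p_pos_given_not = 1 - 0.97            # false-positive rate P(+|~cancer)

score_cancer = p_pos_given_cancer * p_cancer       # ~0.0078
score_not = p_pos_given_not * (1 - p_cancer)       # ~0.0298

h_map = "cancer" if score_cancer > score_not else "no cancer"
posterior_cancer = score_cancer / (score_cancer + score_not)  # ~0.21 after normalization

print(h_map, round(posterior_cancer, 2))
```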

Normalization Normalizing the two quantities above so that they sum to one gives the posterior probabilities: P(cancer|+) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21 and P(¬cancer|+) ≈ 0.79. The result of Bayesian inference depends strongly on the prior probabilities, which must be available in order to apply the method.

Brute-Force Bayes Concept Learning For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h) P(h) / P(D). Output the hypothesis h_MAP with the highest posterior probability, h_MAP = argmax_{h∈H} P(h|D).

Brute-Force Bayes Concept Learning Given no prior knowledge that one hypothesis is more likely than another, what values should we specify for P(h)? What choice shall we make for P(D|h)? The algorithm may require significant computation, because it applies Bayes theorem to each hypothesis in H to calculate P(h|D).

Assumptions of Special MAP Assumptions – The training data D is noise free (i.e., d_i = c(x_i)) – The target concept c is contained in the hypothesis space H – We have no a priori reason to believe that any hypothesis is more probable than any other. Outcome – Choose P(h) to be the uniform distribution, P(h) = 1/|H| for all h in H – P(D|h) = 1 if h is consistent with D – P(D|h) = 0 otherwise

Assumptions of Special MAP The version space VS_{H,D} is the subset of hypotheses from H that are consistent with the training examples in D.

Assumptions of Special MAP P(h|D) = 1 / |VS_{H,D}| if h is consistent with D, and P(h|D) = 0 if h is inconsistent with D. Every consistent hypothesis therefore has the same posterior probability.
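A minimal sketch of brute-force MAP concept learning under these assumptions. The hypothesis representation (plain Python predicates over integer instances) and the toy data are illustrative choices, not from the slides:

```python
# Brute-force MAP concept learning: uniform prior over H, P(D|h)=1 if h is
# consistent with every training example, else 0.
def posterior(H, D):
    """Return P(h|D) for every h in H under the noise-free, uniform-prior assumptions."""
    consistent = [h for h in H if all(h(x) == label for x, label in D)]
    vs_size = len(consistent)                      # |VS_{H,D}|
    return {h: (1.0 / vs_size if h in consistent else 0.0) for h in H}

# Toy example: instances are integers, each hypothesis is "x >= threshold".
H = [lambda x, t=t: x >= t for t in range(6)]      # candidate thresholds 0..5
D = [(1, False), (3, True), (5, True)]             # noise-free labels
post = posterior(H, D)
print([round(p, 2) for p in post.values()])        # mass split evenly over the version space
```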

MAP Hypotheses and Consistent Learners A learner is a consistent learner provided it outputs a hypothesis that commits zero errors over the training examples. Every consistent learner outputs a MAP hypothesis, if we assume – a uniform prior probability distribution over H (P(h_i) = P(h_j) for all i and j) – noise-free training data

MAP Hypotheses and Consistent Learners Because Find-S outputs a maximally specific hypothesis from the version space, its output hypothesis will be a MAP hypothesis relative to any prior probability distribution that favors more specific hypotheses – that is, any probability distribution P(H) over H that assigns P(h_i) > P(h_j) whenever h_i is more specific than h_j.

Maximum Likelihood of a Real-Valued Function Many approaches to learning a continuous-valued target function (neural networks, linear regression) attempt to minimize the sum of squared errors over the training data. Under certain assumptions about the noise, any such learner outputs a maximum likelihood hypothesis.

Maximum Likelihood of a Real-Valued Function Learning a real-valued function: the target function f corresponds to the solid line. The training examples are assumed to have Normally distributed noise e_i with zero mean added to the true value f(x_i), that is, d_i = f(x_i) + e_i. The dashed line corresponds to the linear function that minimizes the sum of squared errors; it is therefore the maximum likelihood hypothesis, given the five training examples.

Maximum Likelihood of a Real-Valued Function Probability densities are defined over continuous variables such as e. The probability density p(x_0) is defined as the limit, as ε goes to zero, of 1/ε times the probability that x will take on a value in the interval [x_0, x_0+ε).

Maximum Likelihood of a Real-Valued Function Starting with our earlier definition of h_ML, and using lower-case p to refer to the probability density, h_ML = argmax_{h∈H} p(D|h). We assume a fixed set of training instances <x_1, ..., x_m> and therefore consider the data D to be the corresponding sequence of target values D = <d_1, ..., d_m>, where d_i = f(x_i) + e_i. If the training examples are mutually independent given h, then h_ML = argmax_{h∈H} ∏_{i=1..m} p(d_i|h).

Maximum Likelihood of a Real-Valued Function If e_i obeys a Normal distribution with zero mean and unknown standard deviation σ, then each d_i obeys a Normal distribution with mean f(x_i) and standard deviation σ. Because we are writing the expression for p(d_i|h), the probability of d_i given that h is the correct description of the target function f, we substitute μ = f(x_i) = h(x_i).

Maximum Likelihood of a Real-Valued Function h_ML = argmax_{h∈H} ∏_i (1/√(2πσ²)) exp(−(d_i − h(x_i))² / (2σ²)) = argmin_{h∈H} Σ_i (d_i − h(x_i))². The maximum likelihood hypothesis is the one that minimizes the sum of squared errors between the observed training values d_i and the hypothesis predictions h(x_i). This result rests on the Normal distribution assumption for the noise.
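A short sketch illustrating this equivalence: under the Normal-noise assumption, the least-squares line fit is the maximum likelihood hypothesis. The data-generating line, the noise level, and the use of numpy's polyfit are assumptions for illustration:

```python
# With zero-mean Gaussian noise, the maximum likelihood linear hypothesis
# is exactly the least-squares fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 5)                    # five training inputs, as in the figure
f = 2.0 * x + 1.0                           # assumed true target function f
d = f + rng.normal(0.0, 0.3, size=x.shape)  # d_i = f(x_i) + e_i, e_i ~ N(0, sigma^2)

# Least-squares fit == h_ML under the Normal-noise assumption.
slope, intercept = np.polyfit(x, d, deg=1)
sse = np.sum((d - (slope * x + intercept)) ** 2)   # sum of squared errors being minimized
print(round(slope, 2), round(intercept, 2), round(sse, 3))
```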

Why a Normal distribution? It permits a mathematically straightforward analysis. It is a good approximation to many types of noise in physical systems: the sum of a large number of independent, identically distributed random variables itself tends toward a Normal distribution. Minimizing the sum of squared errors is a common approach in neural networks, curve fitting, and other approaches to approximating real-valued functions.

Maximum Likelihood Hypotheses for Predicting Probabilities Consider the setting in which we wish to learn a nondeterministic function f: X → {0,1}. Because f is probabilistic, we might want the learner to output the probability that f(x)=1 rather than a hard 0/1 prediction; for example, a neural network might output that f(x)=1 with probability 92%. That is, we want to learn f': X → [0,1], where f'(x) = P(f(x)=1).

Maximum Likelihood Hypotheses for Predicting Probabilities Brute-force approach: collect the observed frequencies of 1s and 0s for each possible value of x. Instead, we train directly from the data, assuming the training data D is of the form D = {<x_1,d_1>, ..., <x_m,d_m>}, where d_i is the observed 0 or 1 value for f(x_i).

Maximum Likelihood Hypotheses for Predicting Probabilities h_ML = argmax_{h∈H} Σ_{i=1..m} [d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i))]. Maximizing this sum is equivalent to minimizing the cross entropy between the observed values d_i and the predicted probabilities h(x_i).

Gradient Search to Maximize Likelihood in a Neural Net Let us use G(h,D) to denote the quantity Σ_i [d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i))]. The partial derivative of G(h,D) with respect to the weight w_jk from input k to unit j is ∂G(h,D)/∂w_jk = Σ_{i=1..m} ∂G/∂h(x_i) · ∂h(x_i)/∂w_jk. To keep our analysis simple, suppose the neural network is constructed from a single layer of sigmoid units. In this case we have ∂G(h,D)/∂w_jk = Σ_{i=1..m} (d_i − h(x_i)) x_ijk, where x_ijk is the k-th input to unit j for the i-th training example.

Gradient Search to Maximize Likelihood in a Neural Net Because we seek to maximize rather than minimize P(D|h), we perform gradient ascent rather than gradient descent: w_jk ← w_jk + η Σ_{i=1..m} (d_i − h(x_i)) x_ijk. This weight-update rule is similar to the gradient search used by Backpropagation.
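A minimal sketch of this gradient ascent for a single sigmoid unit. The synthetic data and the "true" weight vector are invented for illustration; only the update rule follows the derivation above:

```python
# Gradient ascent on G(h,D) = sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))
# for a single sigmoid unit, using dG/dw_k = sum_i (d_i - h(x_i)) x_ik.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))                 # synthetic inputs (illustrative)
X = np.hstack([X, np.ones((100, 1))])         # bias column
true_w = np.array([2.0, -1.0, 0.5])           # assumed data-generating probability model
d = (rng.random(100) < sigmoid(X @ true_w)).astype(float)   # observed d_i in {0,1}

w = np.zeros(3)
eta = 0.1
for _ in range(500):
    h = sigmoid(X @ w)
    w += eta * X.T @ (d - h) / len(d)         # gradient ASCENT step on G(h,D)

print(np.round(w, 2))                          # weights move toward the data-generating model
```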

Bayes Optimal Classifier A weighted majority classifier. What is the most probable classification of the new instance given the training data? The most probable classification is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities. If the classification of the new example can take any value v_j from some set V, then the probability P(v_j|D) that the correct classification is v_j is P(v_j|D) = Σ_{h_i∈H} P(v_j|h_i) P(h_i|D). Bayes optimal classification: argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D).

Bayes Optimal Classifier Example New instance, V = {+, −}: P(h_1|D)=0.4, P(−|h_1)=0, P(+|h_1)=1; P(h_2|D)=0.3, P(−|h_2)=1, P(+|h_2)=0; P(h_3|D)=0.3, P(−|h_3)=1, P(+|h_3)=0. Then Σ_i P(+|h_i) P(h_i|D) = 0.4 and Σ_i P(−|h_i) P(h_i|D) = 0.6, so the output is −.
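The weighted vote for this example, sketched in Python (the hypothesis names h1–h3 mirror the slide):

```python
# Bayes optimal classification: weight each hypothesis's vote by its
# posterior P(h|D) and pick the value with the largest total.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {                         # P(v|h) for v in {+, -}
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

scores = {v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
          for v in ("+", "-")}
print(scores)                           # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))      # '-' is the Bayes optimal classification
```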

Gibbs Algorithm The Bayes optimal classifier provides the best result, but it can be expensive if there are many hypotheses. Gibbs algorithm: – Choose one hypothesis at random, according to P(h|D) – Use this hypothesis to classify the new instance

Gibbs Algorithm Suppose the target concept is drawn at random according to a correct, uniform prior distribution over H. Then picking any hypothesis at random according to the posterior gives an expected error no worse than twice that of the Bayes optimal classifier.
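A sketch of the Gibbs classifier on the same toy posteriors; drawing the hypothesis with random.choices is an implementation choice, not part of the slides:

```python
# Gibbs algorithm: instead of the weighted vote above, draw ONE hypothesis
# at random according to P(h|D) and classify with it alone.
import random

def gibbs_classify(posteriors, predictions, rng=random.Random(0)):
    hs = list(posteriors)
    h = rng.choices(hs, weights=[posteriors[name] for name in hs], k=1)[0]
    # Classify the new instance with the single sampled hypothesis.
    return max(predictions[h], key=predictions[h].get)

print(gibbs_classify({"h1": 0.4, "h2": 0.3, "h3": 0.3},
                     {"h1": {"+": 1.0, "-": 0.0},
                      "h2": {"+": 0.0, "-": 1.0},
                      "h3": {"+": 0.0, "-": 1.0}}))
```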

Naive Bayes Classifier A new instance is described by a tuple of attribute values <a_1, a_2, ..., a_n>. The learner is asked to predict the target value, or classification, for the new instance. The Bayesian approach is to assign the most probable target value, v_MAP = argmax_{v_j∈V} P(v_j|a_1,...,a_n), which Bayes theorem lets us rewrite as v_MAP = argmax_{v_j∈V} P(a_1,...,a_n|v_j) P(v_j).

Naive Bayes Classifier Estimating each P(a_i|v_j) is much easier than estimating P(a_1,...,a_n|v_j). The naive Bayes classifier assumes the attribute values are conditionally independent given the target value: v_NB = argmax_{v_j∈V} P(v_j) ∏_i P(a_i|v_j). Whenever this conditional independence assumption is satisfied, the naive Bayes classification v_NB is identical to the MAP classification.

Naive Bayes Classifier The play-tennis example provides 14 instances, each with four attributes. For the new instance <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>: – P(yes)=9/14=0.64, P(no)=5/14=0.36 – P(strong|yes)=3/9=0.33, P(strong|no)=3/5=0.60 – P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053 – P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206 – v_NB = no
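A sketch reproducing this computation. The strong/yes/no numbers are from the slide; the sunny, cool, and high conditionals are the usual values from the standard play-tennis table and should be treated as assumptions:

```python
# Naive Bayes on the play-tennis example: multiply the class prior by the
# per-attribute conditionals and take the larger product.
p = {
    "yes": 9/14, "no": 5/14,
    ("sunny", "yes"): 2/9,  ("sunny", "no"): 3/5,
    ("cool", "yes"): 3/9,   ("cool", "no"): 1/5,
    ("high", "yes"): 3/9,   ("high", "no"): 4/5,
    ("strong", "yes"): 3/9, ("strong", "no"): 3/5,
}

instance = ["sunny", "cool", "high", "strong"]
scores = {}
for v in ("yes", "no"):
    score = p[v]
    for a in instance:
        score *= p[(a, v)]
    scores[v] = score

print({v: round(s, 4) for v, s in scores.items()})   # {'yes': 0.0053, 'no': 0.0206}
print(max(scores, key=scores.get))                   # v_NB = 'no'
```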

Estimating Probabilities We estimate P(Wind=strong|PlayTennis=no) by the fraction n_c/n, which gives a poor estimate when n_c is small. The m-estimate of probability is (n_c + m·p) / (n + m), where n is the number of training examples for which v=v_j, n_c is the number of examples for which v=v_j and a=a_i, p is a prior estimate of the probability, and m is the weight given to the prior.
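A one-function sketch of the m-estimate; the choice p=0.5 and m=1 below is illustrative:

```python
# m-estimate: (n_c + m*p) / (n + m). With n_c = 3, n = 5 (as in P(strong|no)
# above), a uniform prior p = 0.5 and weight m = 1 pulls the raw estimate
# 0.60 slightly toward the prior.
def m_estimate(n_c, n, p, m):
    return (n_c + m * p) / (n + m)

print(round(m_estimate(3, 5, p=0.5, m=1), 3))   # 0.583 instead of 3/5 = 0.600
```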

Learning to Classify Text Target concept: electronic news articles that I find interesting. Consider an instance space X consisting of all possible text documents. We are given training examples of some unknown target function f(x), which can take on any value from some finite set V, here V = {like, dislike}.

Learning to Classify Text Two main design issues: – How to represent an arbitrary text document in terms of attribute values – How to estimate the required probabilities. Representation of an arbitrary text document – We define an attribute for each word position in the document and define the value of that attribute to be the English word found in that position. Example: assume we have 1000 text documents, of which 700 are marked dislike and 300 are marked like, and we must determine whether we like or dislike a new document such as: "This is an example document for the naive Bayes classifier. This document contains only one paragraph, or two sentences."

Learning to Classify Text Is the independence assumption reasonable? Is it practical?

Learning to Classify Text We must estimate P(v_j) and P(a_i=w_k|v_j). The class-conditional probabilities are more problematic, because we would have to estimate one such term for each combination of text position, English word, and target value. An additional reasonable assumption reduces the number of probabilities that must be estimated: assume the attributes are independent and identically distributed, so that P(a_i=w_k|v_j) = P(w_k|v_j) regardless of position i (position independence). We estimate P(w_k|v_j) as (n_k + 1) / (n + |Vocabulary|), as in the procedure below.

Learning to Classify Text Learn_Naive_Bayes_Text(Examples, V). Examples is a set of text documents along with their target values; V is the set of all possible target values. This function learns the probability terms P(w_k|v_j) and P(v_j), where P(w_k|v_j) is the probability that a randomly drawn word from a document in class v_j will be the English word w_k.
– Collect all words, punctuation, and other tokens that occur in Examples
  Vocabulary ← the set of distinct words and other tokens occurring in any text document from Examples
– Calculate the required P(v_j) and P(w_k|v_j) probability terms
  For each target value v_j in V do
    docs_j ← the subset of documents from Examples for which the target value is v_j
    P(v_j) ← |docs_j| / |Examples|
    Text_j ← a single document created by concatenating all members of docs_j
    n ← total number of word positions in Text_j
    For each word w_k in Vocabulary
      n_k ← number of times word w_k occurs in Text_j
      P(w_k|v_j) ← (n_k + 1) / (n + |Vocabulary|)

Learning to Classify Text Classify_Naive_Bayes_Text(Doc). Return the estimated target value for the document Doc; a_i denotes the word found in the i-th position within Doc.
– positions ← all word positions in Doc that contain tokens found in Vocabulary
– Return v_NB, where v_NB = argmax_{v_j in V} P(v_j) ∏_{i in positions} P(a_i|v_j)
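A compact Python sketch of both procedures. Documents are plain whitespace-separated strings and the three-document training set is invented; the probability estimates follow the (n_k + 1) / (n + |Vocabulary|) rule above:

```python
# Minimal sketch of Learn_Naive_Bayes_Text / Classify_Naive_Bayes_Text.
import math
from collections import Counter

def learn_naive_bayes_text(examples, V):
    """examples: list of (document_string, target_value); V: possible target values."""
    vocab = {w for doc, _ in examples for w in doc.split()}
    priors, cond = {}, {}
    for v in V:
        docs_v = [doc for doc, t in examples if t == v]
        priors[v] = len(docs_v) / len(examples)
        text_v = " ".join(docs_v).split()                 # concatenate all docs of class v
        counts = Counter(text_v)
        n = len(text_v)
        cond[v] = {w: (counts[w] + 1) / (n + len(vocab))  # (n_k + 1) / (n + |Vocabulary|)
                   for w in vocab}
    return vocab, priors, cond

def classify_naive_bayes_text(doc, vocab, priors, cond):
    words = [w for w in doc.split() if w in vocab]        # positions with known tokens
    scores = {v: math.log(priors[v]) + sum(math.log(cond[v][w]) for w in words)
              for v in priors}
    return max(scores, key=scores.get)

examples = [("great article loved it", "like"),
            ("boring article waste of time", "dislike"),
            ("loved the writing", "like")]
model = learn_naive_bayes_text(examples, ["like", "dislike"])
print(classify_naive_bayes_text("loved this article", *model))
```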

Bayesian Belief Networks A Bayesian belief network is built on the notion of conditional independence. Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if P(X|Y,Z) = P(X|Z). More generally, the set of variables X_1...X_l is conditionally independent of the set of variables Y_1...Y_m given the set of variables Z_1...Z_n if P(X_1...X_l | Y_1...Y_m, Z_1...Z_n) = P(X_1...X_l | Z_1...Z_n). This is the assumption that allows the naive Bayes classifier to calculate P(a_1,a_2|v_j) = P(a_1|v_j) P(a_2|v_j).

Representation A Bayesian belief network represents the joint probability distribution for a set of variables. For example, the network in the figure represents the joint probability distribution over the Boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup.

Representation The joint probability for any desired assignment of values (y_1, ..., y_n) to the tuple of network variables (Y_1, ..., Y_n) can be computed by the formula P(y_1, ..., y_n) = ∏_{i=1..n} P(y_i | Parents(Y_i)). The network is fully specified by the set of local conditional probability tables for all the variables, together with the set of conditional independence assumptions described by the network.
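A sketch of this product formula on a slice of the network (Storm, BusTourGroup, Campfire). The conditional probability table values below are placeholders, since the figure with the actual numbers is not reproduced here:

```python
# Joint probability via P(y_1,...,y_n) = prod_i P(y_i | Parents(Y_i)) on a
# three-variable slice of the Storm / BusTourGroup / Campfire network.
p_storm = {True: 0.3, False: 0.7}                 # assumed prior P(Storm)
p_bus = {True: 0.5, False: 0.5}                   # assumed prior P(BusTourGroup)
p_campfire = {                                    # assumed P(Campfire=True | Storm, BusTourGroup)
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    """P(Storm=storm, BusTourGroup=bus, Campfire=campfire) as a product over parents."""
    p_c = p_campfire[(storm, bus)]
    p_c = p_c if campfire else 1.0 - p_c
    return p_storm[storm] * p_bus[bus] * p_c

print(joint(storm=True, bus=False, campfire=True))   # 0.3 * 0.5 * 0.1 = 0.015
```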

Gradient Ascent Training of Bayesian Networks Let w_ijk denote a single entry in one of the conditional probability tables: w_ijk is the conditional probability that the network variable Y_i will take on the value y_ij given that its immediate parents U_i take on the values given by u_ik. For example, w_ijk might be the top-right entry in the conditional probability table for Campfire: Y_i is the variable Campfire, U_i is the tuple of its parents <Storm, BusTourGroup>, y_ij = True, and u_ik = <Storm=False, BusTourGroup=False>.

Gradient Ascent Training of Bayesian Networks The gradient of ln P(D|h) is given by the derivatives ∂ln P(D|h)/∂w_ijk for each of the w_ijk, where ∂ln P(D|h)/∂w_ijk = Σ_{d∈D} P(Y_i=y_ij, U_i=u_ik | d) / w_ijk. For example, to calculate the derivative with respect to the w_ijk above, we need to calculate P(Campfire=True, Storm=False, BusTourGroup=False | d) for each training example d in D.

Gradient Ascent Training of Bayesian Networks Use the abbreviation P_h(D) to represent P(D|h). Assuming the training examples d in the data set D are drawn independently, ln P_h(D) = Σ_{d∈D} ln P_h(d), so the derivative of the log likelihood decomposes into a sum over training examples.

Update the weights Gradient ascent updates each weight by w_ijk ← w_ijk + η Σ_{d∈D} P_h(Y_i=y_ij, U_i=u_ik | d) / w_ijk, where η is a small learning rate. We then renormalize the weights to ensure that the probability-table constraints remain satisfied: each w_ijk stays in [0,1] and Σ_j w_ijk = 1 for every i, k. An alternative training method is the EM algorithm.

EM Algorithm Background – Only a subset of the relevant instance features might be observable – The EM algorithm can be used even for variables whose values are never directly observed, provided the general form of the probability distribution governing these variables is known

Estimating Means of k Gaussians Consider a problem in which the data D is a set of instances generated by a probability distribution that is a mixture of k distinct Normal distributions. Simplified version: assume all k Normal distributions have the same (known) variance.

Estimating Means of k Gaussians Input: observed data instances x_1, ..., x_m. Output: the mean values of the k Gaussian distributions, h = <μ_1, ..., μ_k>. Find the hypothesis h that maximizes p(D|h).

Estimating Means of k Gaussians Describe each instance as the triple <x_i, z_i1, z_i2>, where x_i is the observed value of the i-th instance and z_i1 and z_i2 indicate which of the two Normal distributions was used to generate x_i. EM searches for a maximum likelihood hypothesis by repeatedly re-estimating the expected values of the hidden variables z_ij given its current hypothesis. Example – Initialize h = <μ_1, μ_2> – Step 1: Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = <μ_1, μ_2> holds – Step 2: Calculate a new maximum likelihood hypothesis h' = <μ'_1, μ'_2>, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated in Step 1. Then replace the hypothesis h = <μ_1, μ_2> by the new hypothesis h' = <μ'_1, μ'_2> and iterate.

Estimating Means of k Gaussians E[z_ij] is the probability that instance x_i was generated by the j-th Normal distribution: E[z_ij] = exp(−(x_i − μ_j)² / 2σ²) / Σ_{n=1..k} exp(−(x_i − μ_n)² / 2σ²). The new maximum likelihood hypothesis h' = <μ'_1, ..., μ'_k> is obtained from μ'_j = Σ_{i=1..m} E[z_ij] x_i / Σ_{i=1..m} E[z_ij].
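A sketch of these two EM steps for k=2 Gaussians with a known shared variance; the synthetic data, the true means 0 and 4, and σ=1 are assumptions for illustration:

```python
# EM for estimating the means of a mixture of two Gaussians with known,
# shared variance sigma^2.
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0
x = np.concatenate([rng.normal(0.0, sigma, 100),    # generated by mu_1
                    rng.normal(4.0, sigma, 100)])   # generated by mu_2

mu = np.array([-1.0, 1.0])                          # initial hypothesis h = <mu_1, mu_2>
for _ in range(50):
    # E-step: E[z_ij] = probability that x_i was generated by the j-th Gaussian.
    dens = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
    e_z = dens / dens.sum(axis=1, keepdims=True)
    # M-step: mu_j <- weighted mean of the x_i, with weights E[z_ij].
    mu = (e_z * x[:, None]).sum(axis=0) / e_z.sum(axis=0)

print(np.round(mu, 2))                               # close to the true means 0 and 4
```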

General Statement of EM Algorithm We wish to estimate some set of parameters θ that describe an underlying probability distribution (for example, θ = <μ_1, μ_2> in the previous problem). Let X denote the observed data in the instances, let Z denote the unobserved data, and let Y = X ∪ Z denote the full data. Y is a random variable because it is defined in terms of the random variable Z.

General Statement of EM Algorithm EM searches for the maximum likelihood hypothesis h' by seeking the h' that maximizes E[ln P(Y|h')] – P(Y|h') is the likelihood of the full data Y given hypothesis h' – Maximizing ln P(Y|h') also maximizes P(Y|h') – The expected value E[ln P(Y|h')] is taken over the probability distribution governing the random variable Y. The EM algorithm uses its current hypothesis h in place of the actual parameters θ to estimate the distribution governing Y, and defines the function Q(h'|h) = E[ln P(Y|h') | h, X].

General Statement of EM Algorithm EM algorithm – Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y: Q(h'|h) ← E[ln P(Y|h') | h, X] – Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function: h ← argmax_{h'} Q(h'|h). When the function Q is continuous, the EM algorithm converges to a stationary point of the likelihood function P(Y|h').

Derivation of the k-Means Algorithm Problem – Estimate the means of a mixture of k Normal distributions, θ = <μ_1, ..., μ_k> – Observed data X = {<x_i>} – Hidden variables Z = {<z_i1, ..., z_ik>} indicate which of the k Normal distributions was used to generate each x_i. To derive an expression for Q(h'|h) that applies to this problem, first write the probability of a single instance y_i = <x_i, z_i1, ..., z_ik> of the full data: p(y_i|h') = (1/√(2πσ²)) exp(−Σ_{j=1..k} z_ij (x_i − μ'_j)² / (2σ²)).

Derivation of the k-Means Algorithm Given this probability p(y_i|h') for a single instance, the logarithm of the probability ln P(Y|h') for the m instances in the data is ln P(Y|h') = Σ_{i=1..m} ln p(y_i|h') = Σ_{i=1..m} (ln(1/√(2πσ²)) − Σ_j z_ij (x_i − μ'_j)² / (2σ²)). Taking the expected value over Z (which is linear in the z_ij) gives Q(h'|h) = E[ln P(Y|h')] = Σ_{i=1..m} (ln(1/√(2πσ²)) − Σ_j E[z_ij] (x_i − μ'_j)² / (2σ²)).

Derivation of the k-Means Algorithm Finding the values μ'_1, ..., μ'_k that maximize this Q function amounts to minimizing Σ_{i=1..m} Σ_{j=1..k} E[z_ij] (x_i − μ'_j)². We have μ_j ← Σ_{i=1..m} E[z_ij] x_i / Σ_{i=1..m} E[z_ij] and E[z_ij] = exp(−(x_i − μ_j)² / 2σ²) / Σ_{n=1..k} exp(−(x_i − μ_n)² / 2σ²).