2D1431 Machine Learning Bayesian Learning

Outline: Bayes theorem, maximum likelihood (ML) hypothesis, maximum a posteriori (MAP) hypothesis, Naïve Bayes classifier, Bayes optimal classifier, Bayesian belief networks, expectation maximization (EM) algorithm

Handwritten character classification

Gray level pictures: object classification

Gray level pictures: human action classification

Literature & Software T. Mitchell: chapter 6 S. Russell & P. Norvig, “Artificial Intelligence – A Modern Approach” : chapters R.O. Duda, P.E. Hart, D.G. Stork, “Pattern Classification 2 nd ed.” : chapters 2+3 David Heckerman: “A Tutorial on Learning with Bayesian Belief Networks” Bayes Net Toolbox for Matlab (free), Kevin Murphy

Bayes Theorem P(h|D) = P(D|h) P(h) / P(D). P(D): prior probability of the data D (the evidence). P(h): prior probability of the hypothesis h (the prior). P(h|D): posterior probability of the hypothesis h given the data D (the posterior). P(D|h): probability of the data D given the hypothesis h (the likelihood of the data).

Bayes Theorem P(h|D) = P(D|h) P(h) / P(D), i.e. posterior = likelihood × prior / evidence. By observing the data D we convert the prior probability P(h) into the a posteriori probability (posterior) P(h|D). The posterior is the probability that h holds after the data D has been observed. The evidence P(D) can be viewed merely as a scale factor that guarantees that the posterior probabilities sum to one.

Choosing Hypotheses P(h|D) = P(D|h) P(h) / P(D). Generally we want the most probable hypothesis given the training data, the maximum a posteriori hypothesis h_MAP: h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h). If all hypothesis priors are equally likely, P(h_i) = P(h_j) for all i, j, then one can instead choose the maximum likelihood (ML) hypothesis h_ML = argmax_{h∈H} P(D|h).

Bayes Theorem Example A patient takes a lab test and the result is positive. The test returns a correct positive (+) result in 98% of the cases in which the disease is actually present, and a correct negative (−) result in 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have the disease. Hypotheses: disease, ¬disease. Priors P(h): P(disease) = 0.008, P(¬disease) = 0.992. Likelihoods P(D|h): P(+|disease) = 0.98, P(−|disease) = 0.02, P(+|¬disease) = 0.03, P(−|¬disease) = 0.97. Maximum posterior argmax P(h|D): P(disease|+) ∝ P(+|disease) P(disease) = 0.0078, P(¬disease|+) ∝ P(+|¬disease) P(¬disease) = 0.0298. Normalizing: P(disease|+) = 0.0078/(0.0078 + 0.0298) = 0.21, P(¬disease|+) = 0.0298/(0.0078 + 0.0298) = 0.79.
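
To make the arithmetic concrete, here is a minimal Python sketch (not from the original slides; variable names are illustrative) that reproduces these posteriors with Bayes theorem:

# Minimal sketch: posteriors for the lab-test example via Bayes theorem
prior_disease = 0.008                 # P(disease)
prior_healthy = 1.0 - prior_disease   # P(not disease) = 0.992
p_pos_given_disease = 0.98            # P(+|disease)
p_pos_given_healthy = 0.03            # P(+|not disease)

# Unnormalized posteriors: likelihood * prior
joint_disease = p_pos_given_disease * prior_disease   # ~0.0078
joint_healthy = p_pos_given_healthy * prior_healthy   # ~0.0298

# The evidence P(+) rescales the posteriors so they sum to one
evidence = joint_disease + joint_healthy
print(joint_disease / evidence)   # P(disease|+) ~ 0.21
print(joint_healthy / evidence)   # P(not disease|+) ~ 0.79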

Basic Formulas for Probabilities Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A). Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B). Theorem of total probability: if A_1, A_2, …, A_n are mutually exclusive events with Σ_i P(A_i) = 1, then P(B) = Σ_i P(B|A_i) P(A_i).

Bayes Theorem Example P(x_1, x_2 | μ_1, μ_2, σ) = 1/(2πσ²) exp(−Σ_i (x_i − μ_i)²/(2σ²)), with hypothesis h = {μ_1, μ_2, σ} and data D = {x_1, …, x_m}.

Gaussian Probability Function P(D | μ_1, μ_2, σ) = Π_m P(x^m | μ_1, μ_2, σ). Maximum likelihood hypothesis h_ML = argmax_{μ_1,μ_2,σ} P(D | μ_1, μ_2, σ). Trick: maximize the log-likelihood instead: log P(D | μ_1, μ_2, σ) = Σ_m log P(x^m | μ_1, μ_2, σ) = Σ_m log( 1/(2πσ²) exp(−Σ_i (x_i^m − μ_i)²/(2σ²)) ) = −M log(2πσ²) − Σ_m Σ_i (x_i^m − μ_i)²/(2σ²)

Gaussian Probability Function ∂ log P(D | μ_1, μ_2, σ)/∂μ_i = 0 ⇒ Σ_m (x_i^m − μ_i) = 0 ⇒ μ_i^ML = (1/M) Σ_m x_i^m = E[x_i^m]. ∂ log P(D | μ_1, μ_2, σ)/∂σ = 0 ⇒ σ_ML² = Σ_m Σ_i (x_i^m − μ_i)² / (2M) = E[Σ_i (x_i^m − μ_i)²] / 2. Maximum likelihood hypothesis h_ML = {μ_i^ML, σ_ML}.
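
As an illustration, a short NumPy sketch of these closed-form ML estimates for an isotropic 2-D Gaussian (the data array below is invented for the example):

import numpy as np

# Hypothetical data: one row per example x^m, two components per example
X = np.array([[0.5, -0.3], [1.2, 0.1], [-0.8, -0.6], [-0.1, 0.2]])
M, d = X.shape   # M examples, d = 2 dimensions

# ML mean: mu_i = (1/M) * sum_m x_i^m
mu_ml = X.mean(axis=0)

# ML variance of the isotropic Gaussian: sigma^2 = sum_m sum_i (x_i^m - mu_i)^2 / (d*M)
sigma2_ml = ((X - mu_ml) ** 2).sum() / (d * M)
sigma_ml = np.sqrt(sigma2_ml)

print(mu_ml, sigma_ml)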

Maximum Likelihood Hypothesis μ_ML = (0.20, −0.14), σ_ML = 1.42

Bayes Decision Rule (figure) x = examples of class c_1, o = examples of class c_2; the two clusters correspond to Gaussians with parameters {μ_1, σ_1} and {μ_2, σ_2}.

Bayes Decision Rule Assume we have two Gaussian distributions associated with two separate classes c_1, c_2: P(x|c_i) = P(x | μ_i, σ_i) = 1/(2πσ_i²) exp(−Σ_j (x_j − μ_ij)²/(2σ_i²)). Bayes decision rule (maximum posterior probability): decide c_1 if P(c_1|x) > P(c_2|x), otherwise decide c_2. If P(c_1) = P(c_2), use the maximum likelihood P(x|c_i); otherwise use the maximum posterior P(c_i|x) ∝ P(x|c_i) P(c_i).

Bayes Decision Rule (figure: decision regions for classes c_1 and c_2)

Two-Category Case Discriminant function: if g(x) > 0 decide c_1, else c_2. g(x) = P(c_1|x) − P(c_2|x) ∝ P(x|c_1) P(c_1) − P(x|c_2) P(c_2), or equivalently g(x) = log P(c_1|x) − log P(c_2|x) = log P(x|c_1)/P(x|c_2) + log P(c_1)/P(c_2). For Gaussian probability functions with identical σ_i: g(x) = (x − μ_2)²/(2σ²) − (x − μ_1)²/(2σ²) + log P(c_1) − log P(c_2); the decision surface is a line/hyperplane.
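
A minimal sketch of this two-category decision rule with equal-variance Gaussian class-conditionals, following the discriminant g(x) above (the means, variance, and priors are invented for the example):

import numpy as np

# Assumed class models: identical sigma, different means (illustrative values)
mu1, mu2 = np.array([1.0, 1.0]), np.array([2.0, 2.0])
sigma2 = 1.42 ** 2
p_c1, p_c2 = 0.5, 0.5

def g(x):
    # g(x) = (x - mu2)^2/(2 sigma^2) - (x - mu1)^2/(2 sigma^2) + log P(c1) - log P(c2)
    x = np.asarray(x, dtype=float)
    return (((x - mu2) ** 2).sum() - ((x - mu1) ** 2).sum()) / (2 * sigma2) \
           + np.log(p_c1) - np.log(p_c2)

def classify(x):
    # Decide c1 if g(x) > 0, otherwise c2 (maximum posterior probability)
    return "c1" if g(x) > 0 else "c2"

print(classify([0.9, 1.1]), classify([2.2, 1.8]))   # c1 c2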

Learning a Real Valued Function Consider a real-valued target function f and noisy training examples d_i = f(x_i) + e_i, where e_i is a random variable drawn from a Gaussian distribution with zero mean. The maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors: h_ML = argmin_{h∈H} Σ_i (d_i − h(x_i))². (Figure: the target f, the noise e, and the fitted hypothesis h_ML.)

Learning a Real Valued Function h_ML = argmax_{h∈H} P(D|h) = argmax_{h∈H} Π_i P(d_i|h) = argmax_{h∈H} Π_i (2πσ²)^(−1/2) exp(−(d_i − h(x_i))²/(2σ²)). Maximizing the logarithm log P(D|h): h_ML = argmax_{h∈H} Σ_i [−½ log(2πσ²) − (d_i − h(x_i))²/(2σ²)] = argmax_{h∈H} Σ_i −(d_i − h(x_i))² = argmin_{h∈H} Σ_i (d_i − h(x_i))²
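
For example, under this Gaussian-noise assumption, fitting a straight line by least squares is exactly the maximum likelihood hypothesis; a small sketch with synthetic data (the true slope, intercept, and noise level are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
d = 2.0 * x + 0.5 + rng.normal(0.0, 0.1, size=x.shape)  # d_i = f(x_i) + e_i, zero-mean Gaussian noise

# Minimizing the sum of squared errors = maximizing the likelihood under Gaussian noise
slope, intercept = np.polyfit(x, d, deg=1)
print(slope, intercept)   # close to the true values 2.0 and 0.5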

Learning to Predict Probabilities Example: predicting the survival probability of a patient. Training examples (x_i, d_i), where d_i is 0 or 1. Objective: train a neural network to output the probability h(x_i) = p(d_i = 1) given x_i. Maximum likelihood hypothesis: h_ML = argmax_{h∈H} Σ_i [d_i ln h(x_i) + (1 − d_i) ln(1 − h(x_i))], i.e. maximize the cross entropy between d_i and h(x_i). Weight update rule for the synapses w_k of the output neuron h(x_i): w_k = w_k + η Σ_i (d_i − h(x_i)) x_ik. Compare to the standard BP weight update rule: w_k = w_k + η Σ_i h(x_i)(1 − h(x_i)) (d_i − h(x_i)) x_ik.
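
A hedged sketch of the cross-entropy update for a single sigmoid output unit, following the rule above (the data, learning rate, and initialization are assumptions, and the unit is a plain logistic model rather than a full network):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical training data: rows x_i (first column is a bias feature), binary targets d_i
X = np.array([[1.0, 0.2], [1.0, 0.9], [1.0, 0.5], [1.0, 1.5]])
d = np.array([0.0, 1.0, 0.0, 1.0])

w = np.zeros(X.shape[1])
eta = 0.5
for _ in range(1000):
    h = sigmoid(X @ w)            # h(x_i) = p(d_i = 1)
    # Cross-entropy ML update: w_k <- w_k + eta * sum_i (d_i - h(x_i)) * x_ik
    w += eta * X.T @ (d - h)

print(w, sigmoid(X @ w))          # outputs approach the 0/1 targets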

Most Probable Classification So far we sought the most probable hypothesis h_MAP. But what is the most probable classification of a new instance x given the data D? h_MAP(x) is not necessarily the most probable classification, although it is often a sufficiently good approximation of it. Consider three possible hypotheses: P(h_1|D) = 0.4, P(h_2|D) = 0.3, P(h_3|D) = 0.3. Given a new instance x, h_1(x) = +, h_2(x) = −, h_3(x) = −. Then h_MAP(x) = h_1(x) = +, but the most probable classification is −: P(+) = P(h_1|D) = 0.4, P(−) = P(h_2|D) + P(h_3|D) = 0.6.

Bayes Optimal Classifier c_max = argmax_{c_j∈C} Σ_{h_i∈H} P(c_j|h_i) P(h_i|D). Example: P(h_1|D) = 0.4, P(h_2|D) = 0.3, P(h_3|D) = 0.3; P(+|h_1) = 1, P(−|h_1) = 0; P(+|h_2) = 0, P(−|h_2) = 1; P(+|h_3) = 0, P(−|h_3) = 1. Therefore Σ_{h_i∈H} P(+|h_i) P(h_i|D) = 0.4, Σ_{h_i∈H} P(−|h_i) P(h_i|D) = 0.6, and argmax_{c_j∈C} Σ_{h_i∈H} P(c_j|h_i) P(h_i|D) = −.
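
The same toy computation in a few lines of Python (a sketch reproducing the numbers above):

# Posterior over hypotheses and each hypothesis's class prediction (from the example above)
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}

# Bayes optimal classification: sum P(c|h_i) P(h_i|D) over hypotheses for each class
scores = {"+": 0.0, "-": 0.0}
for h, p_h in posteriors.items():
    scores[predictions[h]] += p_h   # P(c|h_i) is 1 for the predicted class, 0 otherwise

print(max(scores, key=scores.get))  # "-" wins with weight 0.6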

MAP vs. Bayes Method The maximum a posteriori hypothesis estimates a single point h_MAP in the hypothesis space H. The Bayes method instead estimates and uses the complete distribution P(h|D). The difference appears when either method is used for inference on unseen instances and one compares the resulting distributions P(x|D). MAP: P(x|D) ≈ P(x|h_MAP) with h_MAP = argmax_{h∈H} P(h|D). Bayes: P(x|D) = Σ_{h_i∈H} P(x|h_i) P(h_i|D). For reasonable prior distributions P(h), the MAP and Bayes solutions are equivalent in the asymptotic limit of infinite training data D.

Naïve Bayes Classifier A popular, simple learning algorithm. When to use: a moderate or large training set is available, and the attributes that describe instances are conditionally independent given the classification (in practice it works surprisingly well even when this assumption is violated). Applications: diagnosis; text classification (newsgroup articles: 20 newsgroups, 1000 documents per newsgroup, classification accuracy 89%).

Naïve Bayes Classifier Assume a discrete target function f: X → C, where each instance x is described by attribute values (a_1, a_2, …, a_n). The most probable value of f(x) is: c_MAP = argmax_{c_j∈C} P(c_j | a_1, …, a_n) = argmax_{c_j∈C} P(a_1, …, a_n | c_j) P(c_j) / P(a_1, …, a_n) = argmax_{c_j∈C} P(a_1, …, a_n | c_j) P(c_j). Naïve Bayes assumption: P(a_1, …, a_n | c_j) = Π_i P(a_i|c_j), so c_NB = argmax_{c_j∈C} P(c_j) Π_i P(a_i|c_j).

Naïve Bayes Learning Algorithm Naïve_Bayes_Learn(examples): for each target value c_j, estimate P(c_j); for each value a_i of each attribute a, estimate P(a_i|c_j). Classify_New_Instance(x): c_NB = argmax_{c_j∈C} P(c_j) Π_{a_i∈x} P(a_i|c_j)
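
A compact Python sketch of this learner and classifier for discrete attributes (the data layout and function names are assumptions; probabilities are plain frequency counts with no smoothing):

from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_dict, class_label) pairs."""
    class_counts = Counter(c for _, c in examples)
    # attr_counts[c][attribute][value] = number of examples with class c and that attribute value
    attr_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, c in examples:
        for a, v in attrs.items():
            attr_counts[c][a][v] += 1
    return class_counts, attr_counts, len(examples)

def naive_bayes_classify(model, x):
    class_counts, attr_counts, n = model
    best_c, best_score = None, 0.0
    for c, n_c in class_counts.items():
        score = n_c / n                      # P(c_j)
        for a, v in x.items():               # product of P(a_i | c_j)
            score *= attr_counts[c][a][v] / n_c
        if best_c is None or score > best_score:
            best_c, best_score = c, score
    return best_c

# Tiny usage example (hypothetical data)
model = naive_bayes_learn([({"wind": "strong", "outlook": "sunny"}, "no"),
                           ({"wind": "weak", "outlook": "overcast"}, "yes")])
print(naive_bayes_classify(model, {"wind": "strong", "outlook": "sunny"}))  # "no"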

Naïve Bayes Example Consider PlayTennis and the new instance (Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong). Compute c_NB = argmax_{c_j∈C} P(c_j) Π_{a_i∈x} P(a_i|c_j). PlayTennis (9+, 5−): P(yes) = 9/14, P(no) = 5/14. Wind=strong (3+, 3−): P(strong|yes) = 3/9, P(strong|no) = 3/5, … P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053, P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.021, so c_NB = no.

Estimating Probabilities What if none (n_c = 0) of the training instances with target value c_j have attribute value a_i? Then P(a_i|c_j) = n_c/n = 0 and P(c_j) Π_{a_i∈x} P(a_i|c_j) = 0. Solution: a Bayesian (m-)estimate for P(a_i|c_j): P(a_i|c_j) = (n_c + mp)/(n + m), where n is the number of training examples for which c = c_j, n_c is the number of examples for which c = c_j and a = a_i, p is a prior estimate of P(a_i|c_j), and m is the weight given to the prior (the number of “virtual” examples).
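
A one-line sketch of the m-estimate as a function (names are illustrative); note that with p = 1/k for k attribute values and m = k it reduces to Laplace smoothing:

def m_estimate(n_c, n, p, m):
    # P(a_i|c_j) ~ (n_c + m*p) / (n + m): pulls the raw frequency n_c/n toward the prior p
    return (n_c + m * p) / (n + m)

# Example: attribute value never seen with this class (n_c = 0), 9 class examples,
# uniform prior p = 1/3 over three attribute values, prior weight m = 3 virtual examples
print(m_estimate(0, 9, 1/3, 3))   # 0.0833... instead of 0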

Bayesian Belief Networks The naïve assumption of conditional independence is too restrictive, while the full joint probability distribution is intractable to estimate due to lack of data. Bayesian belief networks describe conditional independence among subsets of variables, which allows combining prior knowledge about causal relationships among variables with observed data.

Conditional Independence Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z, that is, if ∀ x_i, y_j, z_k: P(X=x_i | Y=y_j, Z=z_k) = P(X=x_i | Z=z_k), or more compactly P(X|Y,Z) = P(X|Z). Example: Thunder is conditionally independent of Rain given Lightning: P(Thunder | Rain, Lightning) = P(Thunder | Lightning). Notice: P(Thunder | Rain) ≠ P(Thunder). Naïve Bayes uses conditional independence to justify: P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z).

Bayesian Belief Network The network represents a set of conditional independence assertions: each node is conditionally independent of its non-descendants, given its immediate predecessors (directed acyclic graph). (Figure: network over Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire, with a conditional probability table for Campfire given Storm and BusTourGroup.)

Bayesian Belief Network The network represents the joint probability distribution over all variables, e.g. P(Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire): P(y_1, …, y_n) = Π_{i=1}^n P(y_i | Parents(Y_i)). The joint distribution is fully defined by the graph plus the conditional probabilities P(y_i | Parents(Y_i)), e.g. the table P(C|S,B) for Campfire.
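
To illustrate the factorization, a small sketch that multiplies conditional probabilities along the graph; the graph structure follows the slide, but every number below is an invented placeholder, not the slide's CPT values:

# Joint probability from the belief-network factorization
# P(S, B, L, C, T, F) = P(S) P(B) P(L|S) P(C|S,B) P(T|L) P(F|S,L,C)
P_S = {True: 0.2, False: 0.8}                               # P(Storm=True)
P_B = {True: 0.5, False: 0.5}                               # P(BusTourGroup=True)
P_L_given_S = {True: 0.6, False: 0.05}                      # P(Lightning=True | Storm)
P_C_given_SB = {(True, True): 0.4, (True, False): 0.1,
                (False, True): 0.8, (False, False): 0.2}    # P(Campfire=True | S, B)
P_T_given_L = {True: 0.9, False: 0.01}                      # P(Thunder=True | Lightning)
P_F_given_SLC = lambda s, l, c: 0.5 if c else (0.3 if (s and l) else 0.01)  # P(ForestFire=True | S, L, C)

def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(s, b, l, c, t, f):
    return (bernoulli(P_S[True], s) * bernoulli(P_B[True], b)
            * bernoulli(P_L_given_S[s], l) * bernoulli(P_C_given_SB[(s, b)], c)
            * bernoulli(P_T_given_L[l], t) * bernoulli(P_F_given_SLC(s, l, c), f))

print(joint(True, True, True, True, True, False))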

Expectation Maximization (EM) When to use: the data are only partially observable; unsupervised clustering: the target value is unobservable; supervised learning: some instance attributes are unobservable. Applications: training Bayesian belief networks, unsupervised clustering, learning hidden Markov models.

Generating Data from a Mixture of Gaussians Each instance x is generated by choosing one of the k Gaussians at random and then generating the instance according to that Gaussian.

EM for Estimating k Means Given: instances from X generated by a mixture of k Gaussians; the means of the k Gaussians are unknown; we don't know which instance x_i was generated by which Gaussian. Determine: maximum likelihood estimates of the means <μ_1, …, μ_k>. Think of the full description of each instance as y_i = <x_i, z_i1, …, z_ik>, where z_ij is 1 if x_i was generated by the j-th Gaussian; x_i is observable, z_ij is unobservable.

EM for Estimating k Means EM algorithm: pick a random initial h = <μ_1, μ_2>, then iterate. E step: calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = <μ_1, μ_2> holds: E[z_ij] = p(x = x_i | μ = μ_j) / Σ_{n=1}^2 p(x = x_i | μ = μ_n) = exp(−(x_i − μ_j)²/(2σ²)) / Σ_{n=1}^2 exp(−(x_i − μ_n)²/(2σ²)). M step: calculate a new maximum likelihood hypothesis h' = <μ'_1, μ'_2>, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated in the E step; replace h = <μ_1, μ_2> by h' = <μ'_1, μ'_2>, where μ'_j = Σ_{i=1}^m E[z_ij] x_i / Σ_{i=1}^m E[z_ij].
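
A hedged NumPy sketch of this two-mean EM loop with a known, shared σ (the synthetic data, σ = 1, and the iteration count are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
# Synthetic 1-D data drawn from two Gaussians; the assignments are treated as hidden
x = np.concatenate([rng.normal(-2.0, sigma, 100), rng.normal(3.0, sigma, 100)])

mu = rng.normal(0.0, 1.0, size=2)        # random initial h = <mu_1, mu_2>
for _ in range(50):
    # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)), normalized over j
    resp = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
    resp /= resp.sum(axis=1, keepdims=True)
    # M step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)   # close to (-2, 3), up to ordering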

EM Algorithm Converges to a local maximum of the likelihood and provides estimates of the hidden variables z_ij. In fact, it finds a local maximum of E[ln P(Y|h)], where Y is the complete data (observable plus unobservable variables); the expected value is taken over the possible values of the unobserved variables in Y.

General EM Problem Given: observed data X = {x_1, …, x_m}, unobserved data Z = {z_1, …, z_m}, and a parameterized probability distribution P(Y|h), where Y = {y_1, …, y_m} is the full data, y_i = <x_i, z_i>, and h are the parameters. Determine: h that (locally) maximizes E[ln P(Y|h)]. Applications: training Bayesian belief networks, unsupervised clustering, hidden Markov models.

General EM Method Define the likelihood function Q(h'|h), which calculates Y = X ∪ Z using the observed X and the current parameters h to estimate Z: Q(h'|h) = E[ln P(Y|h') | h, X]. EM algorithm: Estimation (E) step: calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y: Q(h'|h) = E[ln P(Y|h') | h, X]. Maximization (M) step: replace hypothesis h by the hypothesis h' that maximizes this Q function: h ← argmax_{h'∈H} Q(h'|h).