1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
5. Bayesian Learning

5.1 Introduction
– Bayesian learning algorithms calculate explicit probabilities for hypotheses
– Practical approach to certain learning problems
– Provide a useful perspective for understanding many learning algorithms

Drawbacks:
– Typically requires initial knowledge of many probabilities
– In some cases, significant computational cost is required to determine the Bayes optimal hypothesis (linear in the number of candidate hypotheses)

5.2 Bayes Theorem
Best hypothesis ≡ most probable hypothesis
Notation:
P(h): prior probability of hypothesis h
P(D): prior probability that training data D will be observed
P(D|h): probability of observing D given that h holds
P(h|D): posterior probability of h given D

Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
Maximum a posteriori (MAP) hypothesis:
h_MAP ≡ argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)
Maximum likelihood (ML) hypothesis:
h_ML = argmax_{h∈H} P(D|h) = h_MAP if we assume P(h) = constant

Example
P(cancer) = 0.008        P(¬cancer) = 0.992
P(+|cancer) = 0.98       P(−|cancer) = 0.02
P(+|¬cancer) = 0.03      P(−|¬cancer) = 0.97
For a new patient the lab test returns a positive result. Should we diagnose cancer or not?
P(+|cancer) P(cancer) = 0.0078
P(+|¬cancer) P(¬cancer) = 0.0298
⇒ h_MAP = ¬cancer
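A minimal Python sketch of this calculation (the unnormalized posteriors are enough to pick h_MAP, since P(+) cancels in the argmax):

```python
# Unnormalized posteriors for the lab-test example above.
p_cancer, p_not_cancer = 0.008, 0.992            # priors
p_pos_given_cancer = 0.98                        # P(+ | cancer)
p_pos_given_not_cancer = 0.03                    # P(+ | not cancer)

# P(h | +) is proportional to P(+ | h) * P(h); P(+) cancels in the argmax.
score_cancer = p_pos_given_cancer * p_cancer               # 0.0078
score_not_cancer = p_pos_given_not_cancer * p_not_cancer   # 0.0298

h_map = "cancer" if score_cancer > score_not_cancer else "not cancer"
print(h_map)   # -> "not cancer"
```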

5.3 Bayes Theorem and Concept Learning
What is the relationship between Bayes theorem and concept learning?
Brute-Force Bayes Concept Learning:
1. For each hypothesis h ∈ H, calculate P(h|D)
2. Output h_MAP ≡ argmax_{h∈H} P(h|D)

We must choose P(h) and P(D|h) from prior knowledge. Let's assume:
1. The training data D is noise free
2. The target concept c is contained in H
3. A priori, all hypotheses are equally probable ⇒ P(h) = 1/|H| for every h ∈ H

Since the data is assumed noise free:
P(D|h) = 1 if d_i = h(x_i) for every d_i ∈ D
P(D|h) = 0 otherwise
Brute-force MAP learning:
– If h is inconsistent with D: P(h|D) = P(D|h) P(h) / P(D) = 0 · P(h) / P(D) = 0
– If h is consistent with D: P(h|D) = 1 · (1/|H|) / (|VS_{H,D}| / |H|) = 1 / |VS_{H,D}|

⇒ P(h|D) = 1 / |VS_{H,D}| if h is consistent with D
P(h|D) = 0 otherwise
Every consistent hypothesis is a MAP hypothesis
Consistent learners: learning algorithms whose outputs are hypotheses that commit zero errors over the training examples (consistent hypotheses)

Under the assumed conditions, Find-S is a consistent learner and therefore outputs a MAP hypothesis. The Bayesian framework allows us to characterize the behavior of learning algorithms by identifying the priors P(h) and likelihoods P(D|h) under which they output optimal (MAP) hypotheses.
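To make the brute-force procedure concrete, here is a small Python sketch under the three assumptions above; the hypothesis space (threshold functions on 0-9) and the training data are made up for illustration:

```python
# Brute-force Bayes concept learning over a tiny hypothesis space:
# h_t(x) = 1 iff x >= t, for thresholds t = 0..10 (a hypothetical H).
H = [("h_%d" % t, (lambda t: lambda x: int(x >= t))(t)) for t in range(11)]

# Noise-free training data assumed consistent with some h in H.
D = [(2, 0), (5, 1), (7, 1)]

prior = 1.0 / len(H)                                            # P(h) = 1/|H|
likelihood = {name: float(all(h(x) == d for x, d in D)) for name, h in H}
evidence = sum(likelihood[name] * prior for name, _ in H)       # P(D) = |VS|/|H|

posterior = {name: likelihood[name] * prior / evidence for name, _ in H}
# Every consistent hypothesis gets posterior 1/|VS_{H,D}|; the rest get 0.
print({name: round(p, 3) for name, p in posterior.items() if p > 0})
```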

5.4 Maximum Likelihood and Least-Squared-Error (LSE) Hypotheses
Learning a continuous-valued target function (regression, or curve fitting)
H = class of real-valued functions defined over X, h : X → ℝ
The learner L must learn f : X → ℝ from training examples (x_i, d_i) ∈ D, where d_i = f(x_i) + e_i, i = 1,…,m
f: noise-free target function
e_i: white noise, drawn from N(0, σ)

Under these assumptions, any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood hypothesis:
h_ML = argmax_{h∈H} p(D|h)
     = argmax_{h∈H} ∏_{i=1,m} p(d_i|h)
     = argmax_{h∈H} ∏_{i=1,m} exp{−[d_i − h(x_i)]² / 2σ²}
     = argmin_{h∈H} ∑_{i=1,m} [d_i − h(x_i)]²
     = h_LSE
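A short numpy sketch of this equivalence under the Gaussian-noise assumption; the linear hypothesis class, the true function 2x + 1 and the noise level are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 50, 0.3
x = rng.uniform(0, 1, m)
d = 2.0 * x + 1.0 + rng.normal(0.0, sigma, m)   # d_i = f(x_i) + e_i, e_i ~ N(0, sigma)

# For hypotheses h(x) = w*x + b, minimizing sum_i (d_i - h(x_i))^2
# yields the maximum likelihood hypothesis under Gaussian noise.
A = np.column_stack([x, np.ones(m)])
(w, b), *_ = np.linalg.lstsq(A, d, rcond=None)
print(w, b)   # close to the true parameters 2.0 and 1.0
```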

5.5 ML Hypotheses for Predicting Probabilities
– We wish to learn a nondeterministic function f : X → {0,1}, that is, the probabilities that f(x) = 0 and f(x) = 1
– Training data D = {(x_i, d_i)}
– We assume that any particular instance x_i is independent of the hypothesis h

Then
P(D|h) = ∏_{i=1,m} P(x_i, d_i|h) = ∏_{i=1,m} P(d_i|h, x_i) P(x_i)
P(d_i|h, x_i) = h(x_i) if d_i = 1
P(d_i|h, x_i) = 1 − h(x_i) if d_i = 0
⇒ P(d_i|h, x_i) = h(x_i)^{d_i} [1 − h(x_i)]^{1−d_i}

h_ML = argmax_{h∈H} ∏_{i=1,m} h(x_i)^{d_i} [1 − h(x_i)]^{1−d_i}
     = argmax_{h∈H} ∑_{i=1,m} d_i log h(x_i) + (1 − d_i) log[1 − h(x_i)]
     = argmin_{h∈H} [Cross Entropy]
Cross Entropy ≡ − ∑_{i=1,m} { d_i log h(x_i) + (1 − d_i) log[1 − h(x_i)] }
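A minimal Python sketch of the cross-entropy criterion for candidate hypotheses that output P(f(x)=1); the labels and predicted probabilities are illustrative only:

```python
import math

def cross_entropy(d, p):
    """Cross entropy of predicted probabilities p = [h(x_i)] w.r.t. labels d = [d_i]."""
    return -sum(di * math.log(pi) + (1 - di) * math.log(1 - pi)
                for di, pi in zip(d, p))

d = [1, 0, 1, 1]                 # observed d_i
h_good = [0.9, 0.2, 0.8, 0.7]    # hypothesis close to the data
h_bad = [0.4, 0.6, 0.5, 0.3]     # hypothesis far from the data

# The ML hypothesis is the one with the smaller cross entropy.
print(cross_entropy(d, h_good) < cross_entropy(d, h_bad))   # True
```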

5.6 Minimum Description Length Principle
h_MAP = argmax_{h∈H} P(D|h) P(h) = argmin_{h∈H} { −log₂ P(D|h) − log₂ P(h) }
⇒ short hypotheses are preferred
Description length L_C(h): number of bits required to encode message h using code C

– −log₂ P(h) ≡ L_{C_H}(h): description length of h under the optimal (most compact) encoding of H
– −log₂ P(D|h) ≡ L_{C_{D|h}}(D|h): description length of the training data D given hypothesis h, under its optimal encoding
⇒ h_MAP = argmin_{h∈H} { L_{C_H}(h) + L_{C_{D|h}}(D|h) }
MDL Principle: choose h_MDL = argmin_{h∈H} { L_{C_1}(h) + L_{C_2}(D|h) }
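A toy Python sketch of the MDL selection rule; the priors P(h) and likelihoods P(D|h) standing in for the two coding schemes are hypothetical:

```python
import math

# Hypothetical priors P(h) and likelihoods P(D|h) for three candidate hypotheses.
candidates = {
    "h1": (0.10, 0.50),   # (P(h), P(D|h))
    "h2": (0.30, 0.20),
    "h3": (0.60, 0.05),
}

def description_length(p_h, p_d_given_h):
    # -log2 P(h) bits to encode h, plus -log2 P(D|h) bits to encode D given h.
    return -math.log2(p_h) - math.log2(p_d_given_h)

h_mdl = min(candidates, key=lambda h: description_length(*candidates[h]))
print(h_mdl)   # same hypothesis as argmax_h P(D|h) P(h)
```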

5.7 Bayes Optimal Classifier
What is the most probable classification of a new instance, given the training data?
Answer: argmax_{v_j∈V} ∑_{h∈H} P(v_j|h) P(h|D), where v_j ∈ V are the possible classifications
⇒ Bayes Optimal Classifier
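A small Python sketch of this rule; the posteriors P(h|D) and the per-hypothesis predictions are hypothetical:

```python
# Hypothetical posteriors P(h|D) and each hypothesis's (deterministic) prediction
# for the new instance, so P(v|h) is 1 for that class and 0 otherwise.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}
classes = ["+", "-"]

def bayes_optimal(classes, posteriors, predictions):
    # argmax_v sum_h P(v|h) P(h|D)
    def weight(v):
        return sum(p for h, p in posteriors.items() if predictions[h] == v)
    return max(classes, key=weight)

print(bayes_optimal(classes, posteriors, predictions))   # "-" (weight 0.6 vs 0.4)
```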

5.9 Naïve Bayes Classifier
Given the instance x = (a_1, a_2, ..., a_n):
v_MAP = argmax_{v_j∈V} P(x|v_j) P(v_j)
The Naïve Bayes classifier assumes conditional independence of the attribute values:
v_NB = argmax_{v_j∈V} P(v_j) ∏_{i=1,n} P(a_i|v_j)
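A compact Python sketch of the v_NB decision rule over categorical attributes, with probabilities estimated by counting; the tiny weather-style dataset is made up:

```python
from collections import Counter, defaultdict

# Tiny illustrative dataset: each row is (attribute tuple, class).
data = [
    (("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
    (("rain", "mild"), "yes"), (("overcast", "hot"), "yes"),
    (("rain", "cool"), "yes"),
]

class_counts = Counter(v for _, v in data)
attr_counts = defaultdict(Counter)          # (position, class) -> value counts
for x, v in data:
    for i, a in enumerate(x):
        attr_counts[(i, v)][a] += 1

def naive_bayes(x):
    def score(v):
        p = class_counts[v] / len(data)                       # P(v_j)
        for i, a in enumerate(x):
            p *= attr_counts[(i, v)][a] / class_counts[v]     # P(a_i|v_j)
        return p
    return max(class_counts, key=score)

print(naive_bayes(("rain", "hot")))   # class with largest P(v_j) * prod_i P(a_i|v_j)
```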

5.10 An Example: Learning to Classify Text
Task: "Filter WWW pages that discuss ML topics"
The instance space X contains all possible text documents
Training examples are classified as "like" or "dislike"
How do we represent an arbitrary document?
– Define an attribute for each word position
– Define the value of that attribute to be the English word found in that position

v_NB = argmax_{v_j∈V} P(v_j) ∏_{i=1,N_words} P(a_i|v_j)
V = {like, dislike}; the a_i range over the distinct words of English
⇒ We must estimate roughly 2 × |Vocabulary| × N_words conditional probabilities P(a_i|v_j)
This can be reduced to 2 × |Vocabulary| terms by assuming position independence: P(a_i = w_k|v_j) = P(a_m = w_k|v_j) for all i, j, k, m

How do we estimate the conditional probabilities? Use the m-estimate:
P(w_k|v_j) = (n_k + 1) / (n + |Vocabulary|)
n: total number of word positions in the training examples of class v_j
n_k: number of times word w_k appears in those positions
|Vocabulary|: total number of distinct words
Concrete example: assigning articles to 20 Usenet newsgroups ⇒ accuracy: 89%
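A short Python sketch of these smoothed estimates; the two tiny training "documents" per class are placeholders for real text:

```python
# Hypothetical training documents per class.
docs = {
    "like":    ["bayes theorem gives posterior probabilities".split(),
                "naive bayes works well for text".split()],
    "dislike": ["the weather report for tomorrow".split()],
}

vocabulary = {w for ds in docs.values() for d in ds for w in d}

def word_prob(word, v):
    words_v = [w for d in docs[v] for w in d]      # all word positions of class v
    n, n_k = len(words_v), words_v.count(word)
    return (n_k + 1) / (n + len(vocabulary))       # m-estimate with uniform prior

print(word_prob("bayes", "like"), word_prob("bayes", "dislike"))
```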

5.11 Bayesian Belief Networks
Bayesian belief networks assume conditional independence only between subsets of the attributes
Conditional independence: for discrete-valued random variables X, Y, Z,
X is conditionally independent of Y given Z if P(X|Y,Z) = P(X|Z)

Representation
A Bayesian network represents the joint probability distribution of a set of variables
– Each variable is represented by a node
– Conditional independence assumptions are indicated by a directed acyclic graph
– Each variable is conditionally independent of its nondescendants in the network, given its immediate predecessors (parents)

The joint probabilities are calculated as
P(Y_1, Y_2, ..., Y_n) = ∏_{i=1,n} P[Y_i | Parents(Y_i)]
The values P[Y_i | Parents(Y_i)] are stored in conditional probability tables associated with the nodes Y_i
Example: P(Campfire=True | Storm=True, BusTourGroup=True) = 0.4
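As a minimal illustration of reading the joint off this factorization, here is a Python sketch for a three-node Storm/BusTourGroup/Campfire network; every CPT entry except the 0.4 quoted above is hypothetical:

```python
# Factorization P(S, B, C) = P(S) * P(B) * P(C | S, B) for a three-node network.
# Priors and most CPT entries below are hypothetical; 0.4 is the value quoted above.
p_storm = {True: 0.2, False: 0.8}
p_bus = {True: 0.5, False: 0.5}
p_campfire_given = {                     # P(Campfire=True | Storm, BusTourGroup)
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    p_c = p_campfire_given[(storm, bus)]
    return p_storm[storm] * p_bus[bus] * (p_c if campfire else 1 - p_c)

print(joint(True, True, True))   # P(Storm=T, BusTourGroup=T, Campfire=T) = 0.2*0.5*0.4
```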

Inference
We wish to infer the probability distribution of some variable given observed values for (a subset of) the other variables
Exact (and sometimes even approximate) inference of probabilities in an arbitrary Bayesian network is NP-hard
There are numerous methods for approximate probabilistic inference in Bayesian networks (for instance, Monte Carlo methods), which have been shown to be useful in many cases
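To illustrate one simple Monte Carlo approach, here is a rejection-sampling sketch that estimates P(Campfire=True | Storm=True) in the same toy network (same hypothetical CPTs as above):

```python
import random

# Same toy network as the previous sketch (hypothetical CPTs).
p_storm = {True: 0.2, False: 0.8}
p_bus = {True: 0.5, False: 0.5}
p_campfire_given = {(True, True): 0.4, (True, False): 0.1,
                    (False, True): 0.8, (False, False): 0.2}

def sample():
    s = random.random() < p_storm[True]
    b = random.random() < p_bus[True]
    c = random.random() < p_campfire_given[(s, b)]
    return s, b, c

# Rejection sampling: keep only samples that match the evidence Storm=True.
kept = [c for s, b, c in (sample() for _ in range(100_000)) if s]
print(sum(kept) / len(kept))   # approximates P(Campfire=True | Storm=True) = 0.25 here
```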

Learning Bayesian Belief Networks
Task: devising effective algorithms for learning Bayesian belief networks from training data
– This is a focus of much current research interest
– For a given network structure, gradient ascent can be used to learn the entries of the conditional probability tables
– Learning the structure of the network is much more difficult, although there are successful approaches for some particular problems