Probabilistic inference

Probabilistic inference

Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence E = e (a partially observable, stochastic, episodic environment).

Examples:
- X = {spam, not spam}, e = email message
- X = {zebra, giraffe, hippo}, e = image features

Bayes decision theory:
- The agent has a loss function, which is 0 if the value of X is guessed correctly and 1 otherwise.
- The estimate of X that minimizes expected loss is the one that has the greatest posterior probability P(X = x | e). This is the Maximum a Posteriori (MAP) decision.
- Expected loss of a decision: P(decision is correct) * 0 + P(decision is wrong) * 1.
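
Since the loss is 0 when the guess is right and 1 when it is wrong, the expected loss of deciding X = x reduces to P(decision is wrong) = 1 − P(X = x | e), so minimizing expected loss means picking the value with the highest posterior. A minimal sketch (the posterior values below are made up purely for illustration):

```python
# Expected 0-1 loss of deciding X = x is 1 - P(X = x | e),
# so the minimizing decision is the one with the highest posterior.
posterior = {"spam": 0.7, "not spam": 0.3}   # hypothetical P(X = x | e)

expected_loss = {x: 1.0 - p for x, p in posterior.items()}
map_decision = min(expected_loss, key=expected_loss.get)  # same as argmax of posterior

print(expected_loss)   # {'spam': 0.3, 'not spam': 0.7}
print(map_decision)    # 'spam'
```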

MAP decision

Value of x that has the highest posterior probability given the evidence e:

x_MAP = argmax_x P(x | e) = argmax_x P(e | x) P(x) / P(e) = argmax_x P(e | x) P(x)

(the last step drops P(e), which does not depend on x; posterior ∝ likelihood × prior)

Maximum likelihood (ML) decision: ignore the prior and pick the value of x with the highest likelihood alone,

x_ML = argmax_x P(e | x)
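
The only difference between the two rules is the prior term. A small sketch with hypothetical numbers, chosen just to show that MAP and ML can disagree when the prior is far from uniform:

```python
# MAP picks argmax_x P(e | x) P(x); ML drops the prior and picks argmax_x P(e | x).
# The numbers are hypothetical, chosen so the two decisions differ.
prior      = {"zebra": 0.01, "giraffe": 0.99}   # P(x)
likelihood = {"zebra": 0.90, "giraffe": 0.05}   # P(e | x) for the observed e

ml_decision  = max(likelihood, key=likelihood.get)
map_decision = max(prior, key=lambda x: likelihood[x] * prior[x])

print(ml_decision)   # 'zebra'   (largest likelihood)
print(map_decision)  # 'giraffe' (the prior outweighs the likelihood gap)
```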

Example: Naïve Bayes model

Suppose we have many different types of observations (symptoms, features) F1, ..., Fn that we want to use to obtain evidence about an underlying hypothesis H. The MAP decision involves estimating

P(H | F1, ..., Fn) ∝ P(H) P(F1, ..., Fn | H)

If each feature can take on k values, the joint probability table P(F1, ..., Fn | H) has k^n entries for every value of H.

We can make the simplifying assumption that the different features are conditionally independent given the hypothesis:

P(F1, ..., Fn | H) = P(F1 | H) · ... · P(Fn | H)

Storing the resulting distributions then takes only n tables of k entries each per value of H, i.e. O(nk) instead of O(k^n).
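
To make the storage argument concrete, here is a back-of-the-envelope comparison (the values of n and k are arbitrary):

```python
# Entries needed to store P(F1, ..., Fn | H) for a single value of H.
n, k = 20, 10                 # hypothetical: 20 features with 10 values each

full_joint  = k ** n          # one entry per joint assignment of F1..Fn: 10**20
naive_bayes = n * k           # n separate tables of k entries each: 200

print(f"{full_joint:.1e} vs {naive_bayes}")
```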

Naïve Bayes Spam Filter

MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message).

By Bayes' rule, P(spam | message) ∝ P(message | spam) P(spam) and P(¬spam | message) ∝ P(message | ¬spam) P(¬spam).

Naïve Bayes Spam Filter

We need to find P(message | spam) P(spam) and P(message | ¬spam) P(¬spam). The message is a sequence of words (w1, ..., wn).

Bag of words representation:
- The order of the words in the message is not important.
- Each word is conditionally independent of the others given the message class (spam or not spam).

Under these assumptions, P(message | spam) = P(w1, ..., wn | spam) = ∏_i P(wi | spam), and similarly for ¬spam. Our filter will classify the message as spam if

P(spam) ∏_i P(wi | spam) > P(¬spam) ∏_i P(wi | ¬spam)
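
A minimal sketch of this decision rule, assuming the prior and the per-word likelihoods have already been estimated; the function name and dictionary layout are illustrative, not from the slides. Working in log space avoids numerical underflow when multiplying many small probabilities:

```python
import math

def classify(words, prior, likelihood):
    """Return the class with the highest log P(class) + sum_i log P(w_i | class).
    Logs are used because the product of many small probabilities underflows."""
    scores = {c: math.log(prior[c]) + sum(math.log(likelihood[c][w]) for w in words)
              for c in prior}
    return max(scores, key=scores.get)

# Illustrative parameters (only the words used below are listed).
prior = {"spam": 0.33, "not spam": 0.67}
likelihood = {"spam":     {"free": 0.05,  "meeting": 0.001},
              "not spam": {"free": 0.002, "meeting": 0.03}}

print(classify(["free", "free", "meeting"], prior, likelihood))  # 'spam'
```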

Bag of words illustration: US Presidential Speeches Tag Cloud, http://chir.ag/projects/preztags/
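
The tag clouds are simply word-count histograms. A bag-of-words representation can be computed directly from a message; for example (illustrative message text):

```python
from collections import Counter

# Bag of words: keep only how often each word occurs, discard the order.
message = "limited time offer act now offer expires now"
bag = Counter(message.lower().split())
# counts: offer 2, now 2, limited 1, time 1, act 1, expires 1
print(bag)
```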

Naïve Bayes Spam Filter

P(spam | w1, ..., wn) ∝ P(spam) ∏_i P(wi | spam)

(posterior ∝ prior × likelihood)

Parameter estimation

In order to classify a message, we need to know the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam). These are the parameters of the probabilistic model. How do we obtain the values of these parameters?

(Example values: prior P(spam) = 0.33, P(¬spam) = 0.67, alongside tables of P(word | spam) and P(word | ¬spam).)

Parameter estimation

How do we obtain the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)? Empirically: use training data.

P(word | spam) = (# of occurrences of this word in spam messages) / (total # of words in spam messages)

This is the maximum likelihood (ML) estimate, i.e. the estimate that maximizes the likelihood of the training data

∏_d ∏_i P(w_{d,i} | class_d)

where d is the index of a training document and i is the index of a word within it.
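
A small sketch of these counts computed from labeled training data; representing the training set as (word_list, label) pairs is an assumption made here for illustration:

```python
from collections import Counter

def ml_estimates(training_set):
    """training_set: list of (word_list, label) pairs, label in {'spam', 'not spam'}.
    Returns ML estimates of the prior and the per-class word likelihoods."""
    word_counts = {"spam": Counter(), "not spam": Counter()}
    class_counts = Counter()
    for words, label in training_set:
        class_counts[label] += 1
        word_counts[label].update(words)

    n_docs = sum(class_counts.values())
    prior = {c: class_counts[c] / n_docs for c in word_counts}
    likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())            # total # of words in class c
        likelihood[c] = {w: n / total for w, n in counts.items()}
    return prior, likelihood

prior, likelihood = ml_estimates([
    (["free", "offer", "now"], "spam"),
    (["meeting", "at", "noon"], "not spam"),
])
print(prior["spam"], likelihood["spam"]["free"])  # 0.5 and 1/3
```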

Parameter estimation

How do we obtain the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)? Empirically: use training data.

Parameter smoothing: dealing with words that were never seen, or were seen too few times.

Laplacian smoothing: pretend you have seen every vocabulary word one more time than you actually did:

P(word | spam) = (# of occurrences of this word in spam messages + 1) / (total # of words in spam messages + V)

where V is the size of the vocabulary.
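
A sketch of the add-one estimate; the denominator grows by the vocabulary size V because every vocabulary word contributes one extra pretend occurrence (the helper name and example counts are illustrative):

```python
def laplace_smoothed(word_counts, vocabulary):
    """Add-one smoothing: every vocabulary word is counted once more than it
    was actually seen, so unseen words get a small nonzero probability."""
    total = sum(word_counts.values()) + len(vocabulary)
    return {w: (word_counts.get(w, 0) + 1) / total for w in vocabulary}

counts = {"free": 30, "offer": 20}                     # hypothetical spam counts
vocabulary = {"free", "offer", "meeting", "schedule"}
probs = laplace_smoothed(counts, vocabulary)

print(probs["meeting"])        # 1/54: never seen in spam, but not zero
print(sum(probs.values()))     # ~1.0: still a valid distribution
```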

Summary of model and parameters

Naïve Bayes model:

P(spam | w1, ..., wn) ∝ P(spam) ∏_i P(wi | spam)
P(¬spam | w1, ..., wn) ∝ P(¬spam) ∏_i P(wi | ¬spam)

Model parameters:
- Prior: P(spam), P(¬spam)
- Likelihood of spam: P(w1 | spam), P(w2 | spam), ..., P(wn | spam)
- Likelihood of ¬spam: P(w1 | ¬spam), P(w2 | ¬spam), ..., P(wn | ¬spam)

Bag-of-word models for images Csurka et al. (2004), Willamowski et al. (2005), Grauman & Darrell (2005), Sivic et al. (2003, 2005)

Bag-of-word models for images

1. Extract image features
2. Learn "visual vocabulary"
3. Map image features to visual words

A sketch of the vocabulary-learning and mapping steps is given below.
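
A rough sketch of steps 2 and 3, assuming local descriptors have already been extracted in step 1 (random arrays stand in for real descriptors here), and using k-means clustering via scikit-learn, which is one common choice rather than the specific method of any of the cited papers:

```python
import numpy as np
from sklearn.cluster import KMeans   # assumes scikit-learn is installed

# Stand-in for local descriptors (e.g. 128-dimensional) extracted from
# many training images in step 1.
train_descriptors = np.random.rand(10000, 128)

# Step 2: learn the "visual vocabulary" by clustering the descriptors;
# each cluster center plays the role of a visual word.
vocabulary = KMeans(n_clusters=200, n_init=4, random_state=0).fit(train_descriptors)

# Step 3: map the features of a new image to their nearest visual words
# and summarize the image as a histogram of visual-word counts.
image_descriptors = np.random.rand(500, 128)
words = vocabulary.predict(image_descriptors)
histogram = np.bincount(words, minlength=200)
```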

Bayesian decision making: Summary

Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E.

- Inference problem: given some evidence E = e, what is P(X | e)?
- Learning problem: estimate the parameters of the probabilistic model P(X | E) given a training sample {(x1, e1), ..., (xn, en)}.