Bayesian Parameter Estimation Liad Serruya
Agenda Introduction Bayesian decision theory Scale-Invariant Learning Bayesian “One-Shot” Learning
Computer Vision Learning a new category typically requires 1,000, if not 10,000, training images. The number of training examples has to be 5 to 10 times the number of object parameters => large training sets. The penalty for using small training sets is overfitting. The images have to be collected, and sometimes manually segmented and aligned – a tedious and expensive task.
Humans It is believed that humans can recognize between 5,000 and 30,000 object categories. Learning a new category is both fast and easy, sometimes requiring very few training examples. When learning a new category we take advantage of prior experience. The appearance of the categories we know and, more importantly, the variability in their appearance, gives us important information on what to expect in a new category.
Why is it hard? Given an image, decide whether or not it contains an object of a specific class. Difficulties: Size and intra-class variation Background clutter Occlusion Scale and lighting variations
Minimum of supervision Learn to recognize an object class given a set of class and background pictures, without preprocessing: no labeling, segmentation, alignment, or scale normalization. Images may contain clutter.
Agenda Introduction Bayesian decision theory Maximum Likelihood (ML) Bayesian Estimation (BE) ML vs. BE Expectation Maximization Algorithm (EM) Scale-Invariant Learning Bayesian “One-Shot” Learning
Bayesian Decision Theory We are given a training set X of samples of class c. Given a query image x, we want to know the probability p(x | X) that it belongs to the class. We know that the class has some fixed distribution with unknown parameters θ, that is, p(x | θ) is known. Bayes' rule tells us: p(θ | X) = p(X | θ) p(θ) / p(X). What can we do about p(θ | X)?
Maximum Likelihood Estimation Concept of likelihood: P(data | θ) – the probability of the data given the model parameters; L(θ | data) – the likelihood of the parameters given the data.
MLE cont. The aim of maximum likelihood estimation is to find the parameter values that make the observed data most likely. This works because the likelihood of the parameters given the data is defined to be equal to the probability of the data given the parameters: L(θ | data) = P(data | θ).
Simple Example of MLE Toss a coin n times, where p is the unknown probability of heads (is it the fair value 0.5?). We wish to find the MLE of p given a specific dataset. This test is essentially asking: is there evidence that the coin is biased? How do we do this? We find the value of p that makes the observed data most likely.
Example cont. n = 100 (total number of tosses), h = 56 (total number of heads), t = 44 (total number of tails). If p were 0.5, the likelihood of the data would be L(0.5) = C(100, 56)·0.5^56·0.5^44. We can tabulate the likelihood L(p) = C(100, 56)·p^56·(1 − p)^44 for different parameter values to find the maximum likelihood estimate of p; the maximum is at p = 56/100 = 0.56.
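The tabulation above can be sketched in a few lines of Python (not from the slides; the grid of candidate values is an assumption for illustration):

```python
from math import comb

# Likelihood of heads-probability p for the observed coin data:
# L(p | data) = C(n, h) * p^h * (1 - p)^(n - h)
def likelihood(p, n=100, h=56):
    return comb(n, h) * p**h * (1 - p) ** (n - h)

# Tabulate L(p) on a grid of candidate values and pick the maximizer.
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=likelihood)

print(p_hat)  # 0.56 = h / n, matching the closed-form MLE
print(likelihood(0.5) < likelihood(0.56))  # True: p = 0.5 is less likely
```

The grid search is only for illustration; setting the derivative of log L(p) to zero gives the same answer, p̂ = h / n, in closed form.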
Maximum Likelihood Estimation What can we do about p(θ | X)? Choose the parameter value θ* that makes the training data most probable: θ* = argmax_θ p(X | θ).
Bayesian Estimation The Bayesian estimation approach considers θ as a random variable. Before we observe the training data, the parameters are described by a prior p(θ), which is typically very broad. Once the data is observed, we can make use of Bayes' formula to find the posterior p(θ | X). Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior.
Bayesian Estimation Unlike ML, Bayesian estimation does not choose a specific value for θ, but instead performs a weighted average over all possible values of θ: p(x | X) = ∫ p(x | θ) p(θ | X) dθ. Why is it more accurate than ML?
Maximum Likelihood vs. Bayesian ML and Bayesian estimation are asymptotically equivalent and “consistent”. ML is typically computationally easier. ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian gives a weighted average of models. But for finite training data (and given a reliable prior) Bayesian is more accurate (it uses more of the information). Bayesian with a “flat” prior is essentially ML; with asymmetric and broad priors the methods lead to different solutions.
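The contrast above can be made concrete with a coin-toss sketch (not from the slides; the Beta(2, 2) prior is an assumption chosen for illustration):

```python
n, h = 10, 7          # a small dataset: 7 heads in 10 coin tosses
a, b = 2, 2           # assumed Beta(2, 2) prior: broad, centered on p = 0.5

# Maximum likelihood: commit to the single best parameter value.
p_ml = h / n                        # 0.7

# Bayesian: the Beta prior is conjugate to the binomial likelihood, so the
# posterior is Beta(a + h, b + n - h); its mean averages over all values of p.
p_bayes = (a + h) / (a + b + n)     # 9 / 14, about 0.643

# With little data the prior pulls the estimate toward 0.5; as n grows,
# the two estimates converge (the methods are asymptotically equivalent).
print(p_ml, p_bayes)
```

With a flat Beta(1, 1) prior the posterior mean moves to (1 + h)/(2 + n), and the posterior mode equals the ML estimate exactly, illustrating the "flat prior is essentially ML" point.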
Expectation Maximization (EM) EM is an iterative optimization method for estimating some unknown parameters θ, given measurement data U, but not given some “hidden” variables J. We want to maximize the posterior probability of the parameters θ given the data U, marginalizing over J: θ* = argmax_θ Σ_J P(θ, J | U).
Expectation Maximization (EM) Choose an initial parameter guess. E-Step: estimate the unobserved hidden data using the current guess of the parameters and the observed data. M-Step: compute the maximum likelihood estimate of the parameters using the estimated hidden data. Repeat, alternating between the guess of the hidden data and the guess of the parameters.
Expectation Maximization (EM) Alternate between estimating the unknown parameters θ and the hidden variables J. The EM algorithm converges to a local maximum of the likelihood.
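As a concrete toy illustration of the E/M alternation above (not from the slides), here is EM for a two-component 1-D Gaussian mixture, where the hidden variables are the component labels; equal weights and unit variances are simplifying assumptions:

```python
import math, random

random.seed(0)
# Synthetic 1-D data from two Gaussians; the component labels are the
# hidden variables J that EM must marginalize over.
data = [random.gauss(0, 1) for _ in range(200)] + \
       [random.gauss(5, 1) for _ in range(200)]

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu1, mu2 = -1.0, 1.0      # initial parameter guess (means only, for brevity)
for _ in range(50):
    # E-step: responsibility of component 1 for each point under the
    # current parameter guess (equal weights, unit variances assumed).
    r = [normal_pdf(x, mu1, 1) / (normal_pdf(x, mu1, 1) + normal_pdf(x, mu2, 1))
         for x in data]
    # M-step: maximum-likelihood update of the means from the soft assignments.
    mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)

print(round(mu1, 2), round(mu2, 2))   # recovered means, near the true 0 and 5
```

A different initialization can land in a different local maximum, which is the "converges to a local maximum" caveat on the slide.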
Agenda Introduction Bayesian decision theory Scale-Invariant Learning Overview Model Structure Results Bayesian “One-Shot” Learning
Scale-Invariant Learning “Object Class Recognition by Scale-Invariant Learning” – Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2003. “Weakly Supervised Scale-Invariant Learning of Models for Visual Recognition” – International Journal of Computer Vision, in print, 2006. Rob Fergus, Pietro Perona, A. Zisserman
Overview The goal of the research is to try and get computers to recognize different categories of objects in images. The computer must be capable of learning what a new category looks like, and of identifying new instances in a query image.
How to do it? There are three main issues we need to consider: 1. Representation – how to represent the object. 2. Learning – using this representation, how to learn a particular object category. 3. Recognition – how to use the learned model to find further instances in query images.
Representation We choose to model objects as a constellation of parts. By modeling the location and appearance of these consistent regions across a set of training images for a category, we obtain a model of the category itself.
Notation X – Shape: locations of the features. A – Appearance: representations of the features. S – Scale: vector of feature scales. h – Hypothesis: which part is represented by which observed feature.
Bayesian Decision Given a query image, we have identified N interesting features with locations X, scales S, and appearances A. We now make a Bayesian decision by computing the ratio R = p(Object | X, S, A) / p(No object | X, S, A). We apply a threshold to R to decide whether the input image belongs to the class.
Object vs. no object Bayes' rule: the posterior ratio equals the likelihood ratio times the prior ratio: p(Object | X, S, A) / p(No object | X, S, A) = [p(X, S, A | Object) / p(X, S, A | No object)] · [p(Object) / p(No object)].
Zebra vs. non-zebra: decision boundary (illustration)
The Likelihoods The likelihood term p(X, S, A | Object) can be factored (summing over all hypotheses h) into appearance, shape, relative scale, and detection terms. We apply a threshold to the likelihood ratio R to decide whether the input image belongs to the class.
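A toy sketch of the thresholded decision (the per-term log-ratio values below are made-up numbers for illustration, and the sum over hypotheses h is omitted for brevity):

```python
# Hypothetical per-term log-likelihood ratios (foreground model vs. clutter
# model) for one query image; the numbers are invented for illustration.
log_ratios = {
    "appearance": 3.1,
    "shape":      1.7,
    "rel_scale":  0.4,
    "detection": -0.6,
}

# Because the likelihood factorizes, the log of the ratio is a sum of terms.
log_R = sum(log_ratios.values())
threshold = 0.0   # log-threshold: log R > 0 means R > 1, i.e. "object"

decision = "object" if log_R > threshold else "clutter"
print(log_R, decision)
```

Working in log space avoids numerical underflow, since each individual density can be extremely small.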
Appearance Foreground model: a Gaussian density per part. Clutter model: a Gaussian density for background features.
Shape Foreground model: Gaussian (joint density over part locations). Clutter model: uniform over the image.
Relative Scale Foreground model: Gaussian in log(scale) per part. Clutter model: uniform in log(scale).
Detection Probability Foreground model: probability of detection for each part. Clutter model: Poisson probability density function on the number of detections.
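The Poisson clutter term can be sketched directly from its definition (the mean detection rate M below is an assumed value, not a parameter from the paper):

```python
from math import exp, factorial

# Poisson pmf on the number of background (clutter) detections:
# P(n | M) = exp(-M) * M^n / n!, where M is the mean number of detections.
def poisson_pmf(n, M):
    return exp(-M) * M**n / factorial(n)

M = 10.5   # assumed mean number of clutter detections per image
probs = [poisson_pmf(n, M) for n in range(100)]
mode = max(range(100), key=lambda n: probs[n])

print(mode)                          # 10: the most likely detection count
print(abs(sum(probs) - 1.0) < 1e-9)  # the pmf sums to ~1 over this range
```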
Feature Detection Kadir-Brady feature detector: detects salient regions over different scales and locations. Choose the N most salient regions. Each feature contains scale and location information.
Feature Representation Normalization: each feature's contents are rescaled to an 11x11 pixel patch. Reduce the data dimension from 121 to 15 dimensions using PCA. The result is the appearance vector for the part.
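The normalization-plus-PCA step can be sketched as follows (not from the paper: the random patches stand in for real detected regions, and PCA is computed directly via SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for detected regions already rescaled to 11x11 grayscale patches
# (the feature detector and real image data are outside this sketch).
patches = rng.random((500, 11, 11))
X = patches.reshape(500, -1)           # 500 patches x 121 pixels

# PCA via SVD of the mean-centered data: keep the top 15 principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
basis = Vt[:15]                        # 15 x 121 projection matrix

appearance = Xc @ basis.T              # 500 appearance vectors of dimension 15
print(appearance.shape)                # (500, 15)
```

In the actual system the PCA basis is fixed once from training patches and then reused to project features from query images.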
Learning We want to estimate the model parameters θ (the shape, appearance, relative-scale, and detection densities). Using the EM method, find the θ* that best explains the training set images, i.e. maximizes the likelihood: θ* = argmax_θ p(X, S, A | θ).
Learning procedure Find regions, their location, scale & appearance over all training images. Initialize the model parameters. Use EM and iterate to convergence: E-step: compute assignments for which regions are foreground / background; M-step: update the model parameters. This maximizes the likelihood – consistency in shape & appearance.
Recognition Recognition proceeds in the same manner as learning: take a query image and find its salient regions.
Recognition (continued) Take the model learned in training (e.g., from lots of motorbike images) and find the assignment of regions that fits the model best.
Experiments Some samples
Sample Model – Motorbikes
Background images evaluated
Sample Model – Faces
Sample Model – Spotted Cats
Sample Model – Airplanes
Sample Model – Cars from rear
Confusion Table How good is a model for object class A at distinguishing images of class B from background images?
Comparison of Results Results for scale-invariant learning/recognition, and a comparison to other methods:
Agenda Introduction Bayesian decision theory Scale-Invariant Learning Bayesian “One-Shot” Learning Bayesian Framework Experiments Results Summary
Bayesian “One-Shot” Learning A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories – Proc. ICCV 2003. Li Fei-Fei, Rob Fergus, Pietro Perona
Prior knowledge about objects
Incorporating prior knowledge Bayesian methods allow us to use “prior” information p(θ) about the nature of objects. Given new observations x, we can update our knowledge into a “posterior” p(θ | x).
Bayesian Framework Given a test image, we want to make a Bayesian decision by comparing P(object | test, train) vs. P(clutter | test, train). Bayes' rule: P(object | test, train) ∝ P(test | object, train) p(object). Expansion by parametrization: P(test | object, train) = ∫ P(test | θ, object) p(θ | object, train) dθ. Previous work: p(θ | object, train) is replaced by a single ML point estimate of θ.
Bayesian Framework Given a test image, we want to make a Bayesian decision by comparing P(object | test, train) vs. P(clutter | test, train). Bayes' rule: P(object | test, train) ∝ P(test | object, train) p(object). Expansion by parametrization: P(test | object, train) = ∫ P(test | θ, object) p(θ | object, train) dθ. One-shot learning: keep the full posterior over θ, p(θ | object, train) ∝ P(train | θ, object) p(θ).
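The integral above can be approximated by averaging the test likelihood over samples of θ drawn from the posterior. A 1-D toy sketch (the Gaussian posterior and all numerical values are assumptions for illustration, not from the paper):

```python
import math, random

random.seed(1)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Toy stand-in: theta is a 1-D class mean whose posterior, given the training
# images, is N(2.0, 0.5^2); all values here are assumed for illustration.
posterior_samples = [random.gauss(2.0, 0.5) for _ in range(5000)]

# P(test | object, train) = integral of P(test | theta) p(theta | train) dtheta,
# approximated by averaging the test likelihood over the posterior samples.
test = 2.3
p_bayes = sum(normal_pdf(test, th, 1.0) for th in posterior_samples) / 5000

# ML would instead plug a single point estimate of theta into the likelihood.
p_ml = normal_pdf(test, 2.0, 1.0)
print(round(p_bayes, 3), round(p_ml, 3))
```

The Bayesian predictive density is broader than the ML one because it also carries the uncertainty in θ, which is exactly what makes it better behaved when the training set is tiny.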
Maximum Likelihood vs. Bayesian Learning
Experimental setup Learn models for three object categories using both the Bayesian and ML approaches, and evaluate their performance on the test set. Estimate the prior hyper-parameters. Use variational Bayesian EM (VBEM) to learn a new object category from a few images.
Dataset Images Graphs from “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”
Experiments Prior distribution. Learning a new category using the EM method.
Prior Hyper-Parameters Graphs from “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”
Performance Results – Motorbikes 1 training image 5 training images Graphs from “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”
Performance Results – Motorbikes Graphs from “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”
Performance Results – Face Model 1 training image 5 training images Graphs from “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”
Performance Results – Face Model Graphs from “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”
Results Comparison
Algorithm | # training images | Learning speed | Error rate
Burl et al. / Weber et al. / Fergus et al. | 200 ~ 400 | Hours | 5.6% – 10%
Bayesian One-Shot | 1 ~ 5 | < 1 min | 8% – 15%
Summary Learning categories with one example is possible. The number of training examples decreased from ~300 to 1~5. Bayesian treatment. Priors from unrelated categories are useful.
References Object Class Recognition by Scale-Invariant Learning – Fergus, Perona, Zisserman – 2003. Weakly Supervised Scale-Invariant Learning of Models for Visual Recognition – Fergus, Perona, Zisserman – 2005. A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories – Fei-Fei, Fergus, Perona. Unsupervised Learning of Models for Recognition – Weber, Welling, Perona – 2000. Moshe Blank, Ita Lifshitz (Weizmann) – slides. Pattern Classification / Duda & Hart, Chapter 3.
Binomial Probability Distribution n = total number of coin tosses, h = number of heads obtained, p = probability of obtaining a head on any one toss. P(h | n, p) = C(n, h) · p^h · (1 − p)^(n − h).