Bayesian Parameter Estimation Liad Serruya
Agenda Introduction Bayesian decision theory Scale-Invariant Learning Bayesian “One-Shot” Learning
Computer Vision Learning a new category typically requires 1,000, if not 10,000, training images. The number of training examples has to be 5 to 10 times the number of object parameters => large training sets. The penalty for using small training sets is overfitting. The images have to be collected, and sometimes manually segmented and aligned – a tedious and expensive task.
Humans It is believed that humans can recognize between 5,000 and 30,000 object categories. Learning a new category is both fast and easy, sometimes requiring very few training examples. When learning a new category we take advantage of prior experience. The appearance of the categories we know and, more importantly, the variability in their appearance, gives us important information on what to expect in a new category.
Why is it hard? Given an image, decide whether or not it contains an object of a specific class. Difficulties: Size and intra-class variation Background clutter Occlusion Scale and lighting variations
Minimum of supervision Learn to recognize an object class given a set of class and background pictures, without preprocessing: no labeling, segmentation, alignment, or scale normalization. Images may contain clutter.
Agenda Introduction Bayesian decision theory Maximum Likelihood (ML) Bayesian Estimation (BE) ML vs. BE Expectation Maximization Algorithm (EM) Scale-Invariant Learning Bayesian “One-Shot” Learning
Bayesian Decision Theory We are given a training set X of samples of class c. Given a query image x, we want to know the probability p(x | X) that it belongs to the class. We know that the class has some fixed distribution with unknown parameters θ, that is, p(x | θ) is known. Bayes' rule tells us: p(θ | X) = p(X | θ) p(θ) / p(X). What can we do about p(θ | X)?
Maximum Likelihood Estimation Concept of likelihood: P(data | θ) – the probability of the data given the model parameters; L(θ | data) – the likelihood of the parameters given the data.
MLE cont. The aim of maximum likelihood estimation is to find the parameter values that make the observed data most likely. This works because the likelihood of the parameters given the data is defined to be equal to the probability of the data given the parameters: L(θ | data) = P(data | θ).
Simple Example of MLE Toss a coin n times, where p is the unknown probability of heads (is it the fair value 0.5?). We wish to find the MLE of p given a specific dataset. This test is essentially asking: is there evidence that the coin is biased? How do we do this? We find the value of p that makes the observed data most likely.
Example cont. n = 100 (total number of tosses), h = 56 (total number of heads), t = 44 (total number of tails). If p were 0.5, the likelihood of the data would be L(0.5) = C(100, 56)·0.5^56·0.5^44. We can tabulate the likelihood L(p) = C(100, 56)·p^56·(1 − p)^44 for different parameter values to find the maximum likelihood estimate of p; the maximum is at p = 56/100 = 0.56.
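The tabulation above can be sketched in a few lines of Python (not from the slides; the grid of candidate values is an assumption for illustration):

```python
from math import comb

# Likelihood of heads-probability p for the observed coin data:
# L(p | data) = C(n, h) * p^h * (1 - p)^(n - h)
def likelihood(p, n=100, h=56):
    return comb(n, h) * p**h * (1 - p) ** (n - h)

# Tabulate L(p) on a grid of candidate values and pick the maximizer.
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=likelihood)

print(p_hat)  # 0.56 = h / n, matching the closed-form MLE
print(likelihood(0.5) < likelihood(0.56))  # True: p = 0.5 is less likely
```

The grid search is only for illustration; setting the derivative of log L(p) to zero gives the same answer, p̂ = h / n, in closed form.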
Maximum Likelihood Estimation What can we do about p(θ | X)? Choose the parameter value θ* that makes the training data most probable: θ* = argmax_θ p(X | θ).
Bayesian Estimation The Bayesian estimation approach considers θ as a random variable. Before we observe the training data, the parameters are described by a prior p(θ), which is typically very broad. Once the data is observed, we can make use of Bayes' formula to find the posterior p(θ | X). Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior.
Bayesian Estimation Unlike ML, Bayesian estimation does not choose a specific value for θ, but instead performs a weighted average over all possible values of θ: p(x | X) = ∫ p(x | θ) p(θ | X) dθ. Why is it more accurate than ML?
Maximum Likelihood vs. Bayesian ML and Bayesian estimation are asymptotically equivalent and “consistent”. ML is typically computationally easier. ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian gives a weighted average of models. But for finite training data (and given a reliable prior) Bayesian is more accurate (it uses more of the information). Bayesian with a “flat” prior is essentially ML; with asymmetric and broad priors the methods lead to different solutions.
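The contrast above can be made concrete with a coin-toss sketch (not from the slides; the Beta(2, 2) prior is an assumption chosen for illustration):

```python
n, h = 10, 7          # a small dataset: 7 heads in 10 coin tosses
a, b = 2, 2           # assumed Beta(2, 2) prior: broad, centered on p = 0.5

# Maximum likelihood: commit to the single best parameter value.
p_ml = h / n                        # 0.7

# Bayesian: the Beta prior is conjugate to the binomial likelihood, so the
# posterior is Beta(a + h, b + n - h); its mean averages over all values of p.
p_bayes = (a + h) / (a + b + n)     # 9 / 14, about 0.643

# With little data the prior pulls the estimate toward 0.5; as n grows,
# the two estimates converge (the methods are asymptotically equivalent).
print(p_ml, p_bayes)
```

With a flat Beta(1, 1) prior the posterior mean moves to (1 + h)/(2 + n), and the posterior mode equals the ML estimate exactly, illustrating the "flat prior is essentially ML" point.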
Expectation Maximization (EM) EM is an iterative optimization method for estimating some unknown parameters θ, given measurement data U, but not given some “hidden” variables J. We want to maximize the posterior probability of the parameters θ given the data U, marginalizing over J: θ* = argmax_θ Σ_J P(θ, J | U).
Expectation Maximization (EM) Choose an initial parameter guess. E-Step: estimate the unobserved hidden data using the current guess of the parameters and the observed data. M-Step: compute the maximum likelihood estimate of the parameters using the estimated hidden data. Repeat, alternating between the guess of the hidden data and the guess of the parameters.
Expectation Maximization (EM) Alternate between estimating the unknown parameters θ and the hidden variables J. The EM algorithm converges to a local maximum of the likelihood.
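As a concrete toy illustration of the E/M alternation above (not from the slides), here is EM for a two-component 1-D Gaussian mixture, where the hidden variables are the component labels; equal weights and unit variances are simplifying assumptions:

```python
import math, random

random.seed(0)
# Synthetic 1-D data from two Gaussians; the component labels are the
# hidden variables J that EM must marginalize over.
data = [random.gauss(0, 1) for _ in range(200)] + \
       [random.gauss(5, 1) for _ in range(200)]

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu1, mu2 = -1.0, 1.0      # initial parameter guess (means only, for brevity)
for _ in range(50):
    # E-step: responsibility of component 1 for each point under the
    # current parameter guess (equal weights, unit variances assumed).
    r = [normal_pdf(x, mu1, 1) / (normal_pdf(x, mu1, 1) + normal_pdf(x, mu2, 1))
         for x in data]
    # M-step: maximum-likelihood update of the means from the soft assignments.
    mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)

print(round(mu1, 2), round(mu2, 2))   # recovered means, near the true 0 and 5
```

A different initialization can land in a different local maximum, which is the "converges to a local maximum" caveat on the slide.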
Agenda Introduction Bayesian decision theory Scale-Invariant Learning Overview Model Structure Results Bayesian “One-Shot” Learning
Scale-Invariant Learning “Object Class Recognition by Scale-Invariant Learning” – Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2003. “Weakly Supervised Scale-Invariant Learning of Models for Visual Recognition” – International Journal of Computer Vision, in print, 2006. Rob Fergus, Pietro Perona, A. Zisserman
Overview The goal of the research is to try and get computers to recognize different categories of objects in images. The computer must be capable of learning what a new category looks like, and of identifying new instances in a query image.
How to do it? There are three main issues we need to consider: 1. Representation – how to represent the object. 2. Learning – using this representation, how to learn a particular object category. 3. Recognition – how to use the learned model to find further instances in query images.
Representation We choose to model objects as a constellation of parts. By modeling the location and appearance of these consistent regions across a set of training images for a category, we obtain a model of the category itself.
Notation X – Shape: locations of the features. A – Appearance: representations of the features. S – Scale: vector of feature scales. h – Hypothesis: which part is represented by which observed feature.
Bayesian Decision Given a query image, we have identified N interesting features with locations X, scales S, and appearances A. We now make a Bayesian decision by computing the ratio R = p(Object | X, S, A) / p(No object | X, S, A). We apply a threshold to R to decide whether the input image belongs to the class.
Object vs. no object Bayes' rule: the posterior ratio equals the likelihood ratio times the prior ratio: p(Object | X, S, A) / p(No object | X, S, A) = [p(X, S, A | Object) / p(X, S, A | No object)] · [p(Object) / p(No object)].
Zebra vs. non-zebra: decision boundary (illustration)
The Likelihoods The likelihood term p(X, S, A | Object) can be factored (summing over all hypotheses h) into appearance, shape, relative scale, and detection terms. We apply a threshold to the likelihood ratio R to decide whether the input image belongs to the class.
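A toy sketch of the thresholded decision (the per-term log-ratio values below are made-up numbers for illustration, and the sum over hypotheses h is omitted for brevity):

```python
# Hypothetical per-term log-likelihood ratios (foreground model vs. clutter
# model) for one query image; the numbers are invented for illustration.
log_ratios = {
    "appearance": 3.1,
    "shape":      1.7,
    "rel_scale":  0.4,
    "detection": -0.6,
}

# Because the likelihood factorizes, the log of the ratio is a sum of terms.
log_R = sum(log_ratios.values())
threshold = 0.0   # log-threshold: log R > 0 means R > 1, i.e. "object"

decision = "object" if log_R > threshold else "clutter"
print(log_R, decision)
```

Working in log space avoids numerical underflow, since each individual density can be extremely small.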
Appearance Foreground model: a Gaussian density per part. Clutter model: a Gaussian density for background features.
Shape Foreground model: Gaussian (joint density over part locations). Clutter model: uniform over the image.
Relative Scale Foreground model: Gaussian in log(scale) per part. Clutter model: uniform in log(scale).
Detection Probability Foreground model: probability of detection for each part. Clutter model: Poisson probability density function on the number of detections.
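The Poisson clutter term can be sketched directly from its definition (the mean detection rate M below is an assumed value, not a parameter from the paper):

```python
from math import exp, factorial

# Poisson pmf on the number of background (clutter) detections:
# P(n | M) = exp(-M) * M^n / n!, where M is the mean number of detections.
def poisson_pmf(n, M):
    return exp(-M) * M**n / factorial(n)

M = 10.5   # assumed mean number of clutter detections per image
probs = [poisson_pmf(n, M) for n in range(100)]
mode = max(range(100), key=lambda n: probs[n])

print(mode)                          # 10: the most likely detection count
print(abs(sum(probs) - 1.0) < 1e-9)  # the pmf sums to ~1 over this range
```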
Feature Detection Kadir-Brady feature detector: detects salient regions over different scales and locations. Choose the N most salient regions. Each feature contains scale and location information.
Feature Representation Normalization: each feature's contents are rescaled to an 11x11 pixel patch. Reduce the data dimension from 121 to 15 dimensions using PCA. The result is the appearance vector for the part.
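The normalization-plus-PCA step can be sketched as follows (not from the paper: the random patches stand in for real detected regions, and PCA is computed directly via SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for detected regions already rescaled to 11x11 grayscale patches
# (the feature detector and real image data are outside this sketch).
patches = rng.random((500, 11, 11))
X = patches.reshape(500, -1)           # 500 patches x 121 pixels

# PCA via SVD of the mean-centered data: keep the top 15 principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
basis = Vt[:15]                        # 15 x 121 projection matrix

appearance = Xc @ basis.T              # 500 appearance vectors of dimension 15
print(appearance.shape)                # (500, 15)
```

In the actual system the PCA basis is fixed once from training patches and then reused to project features from query images.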
Learning We want to estimate the model parameters θ (the shape, appearance, relative-scale, and detection densities). Using the EM method, find the θ* that best explains the training set images, i.e. maximizes the likelihood: θ* = argmax_θ p(X, S, A | θ).
Learning procedure Find regions, their location, scale & appearance over all training images. Initialize the model parameters. Use EM and iterate to convergence: E-step: compute assignments for which regions are foreground / background; M-step: update the model parameters. This maximizes the likelihood – consistency in shape & appearance.
Recognition Recognition proceeds in the same manner as learning: take a query image and find its salient regions.
Recognition (continued) Take the model learned in training (e.g., from lots of motorbike images) and find the assignment of regions that fits the model best.
Experiments Some samples
Sample Model – Motorbikes
Background images evaluated
Sample Model – Faces
Sample Model – Spotted Cats
Sample Model – Airplanes
Sample Model – Cars from rear
Confusion Table How good is a model for object class A at distinguishing images of class B from background images?
Comparison of Results Results for scale-invariant learning/recognition, and a comparison to other methods:
Agenda Introduction Bayesian decision theory Scale-Invariant Learning Bayesian “One-Shot” Learning Bayesian Framework Experiments Results Summary
Bayesian “One-Shot” Learning A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories – Proc. ICCV 2003. Li Fei-Fei, Rob Fergus, Pietro Perona
Prior knowledge about objects
Incorporating prior knowledge Bayesian methods allow us to use “prior” information p(θ) about the nature of objects. Given new observations x, we can update our knowledge into a “posterior” p(θ | x).
Bayesian Framework Given a test image, we want to make a Bayesian decision by comparing P(object | test, train) vs. P(clutter | test, train). Bayes' rule: P(object | test, train) ∝ P(test | object, train) p(object). Expansion by parametrization: P(test | object, train) = ∫ P(test | θ, object) p(θ | object, train) dθ. Previous work: p(θ | object, train) is replaced by a single ML point estimate of θ.
Bayesian Framework Given a test image, we want to make a Bayesian decision by comparing P(object | test, train) vs. P(clutter | test, train). Bayes' rule: P(object | test, train) ∝ P(test | object, train) p(object). Expansion by parametrization: P(test | object, train) = ∫ P(test | θ, object) p(θ | object, train) dθ. One-shot learning: keep the full posterior over θ, p(θ | object, train) ∝ P(train | θ, object) p(θ).
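The integral above can be approximated by averaging the test likelihood over samples of θ drawn from the posterior. A 1-D toy sketch (the Gaussian posterior and all numerical values are assumptions for illustration, not from the paper):

```python
import math, random

random.seed(1)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Toy stand-in: theta is a 1-D class mean whose posterior, given the training
# images, is N(2.0, 0.5^2); all values here are assumed for illustration.
posterior_samples = [random.gauss(2.0, 0.5) for _ in range(5000)]

# P(test | object, train) = integral of P(test | theta) p(theta | train) dtheta,
# approximated by averaging the test likelihood over the posterior samples.
test = 2.3
p_bayes = sum(normal_pdf(test, th, 1.0) for th in posterior_samples) / 5000

# ML would instead plug a single point estimate of theta into the likelihood.
p_ml = normal_pdf(test, 2.0, 1.0)
print(round(p_bayes, 3), round(p_ml, 3))
```

The Bayesian predictive density is broader than the ML one because it also carries the uncertainty in θ, which is exactly what makes it better behaved when the training set is tiny.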
Maximum Likelihood vs. Bayesian Learning
Experimental setup Learn models for three object categories using both the Bayesian and ML approaches, and evaluate their performance on the test set. Estimate the prior hyper-parameters. Use variational Bayesian EM (VBEM) to learn a new object category from a few images.
Dataset Images Graphs from “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”
Experiments Prior distribution. Learning a new category using the EM method.
Prior Hyper-Parameters Graphs from “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”
Performance Results – Motorbikes 1 training image 5 training images Graphs from “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”
Performance Results – Motorbikes Graphs from “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”
Performance Results – Face Model 1 training image 5 training images Graphs from “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”
Performance Results – Face Model Graphs from “A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories”
Results Comparison
Algorithm | # training images | Learning speed | Error rate
Burl et al. / Weber et al. / Fergus et al. | 200 ~ 400 | Hours | 5.6% – 10%
Bayesian One-Shot | 1 ~ 5 | < 1 min | 8% – 15%
Summary Learning categories with one example is possible. The number of training examples decreased from ~300 to 1~5. Bayesian treatment. Priors from unrelated categories are useful.
References Object Class Recognition by Scale-Invariant Learning – Fergus, Perona, Zisserman – 2003. Weakly Supervised Scale-Invariant Learning of Models for Visual Recognition – Fergus, Perona, Zisserman – 2005. A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories – Fei-Fei, Fergus, Perona. Unsupervised Learning of Models for Recognition – Weber, Welling, Perona – 2000. Moshe Blank, Ita Lifshitz (Weizmann) – slides. Pattern Classification / Duda & Hart, Chapter 3.
Binomial Probability Distribution n = total number of coin tosses, h = number of heads obtained, p = probability of obtaining a head on any one toss. P(h | n, p) = C(n, h) · p^h · (1 − p)^(n − h).