# A Bayesian Approach to Recognition

Moshe Blank, Ita Lifshitz

Reverend Thomas Bayes, 1702-1761


Agenda:
- Bayesian decision theory
  - Maximum Likelihood
  - Bayesian Estimation
- Recognition
- Simple probabilistic model
- Mixture model
- More advanced probabilistic model
- “One-Shot” Learning

Bayesian Decision Theory: We are given a training set T of samples of class c. Given a query image x, we want to know the probability that it belongs to the class, p(x|T). We assume the class has some fixed distribution with unknown parameters θ, so the form of p(x|θ) is known. Marginalizing over θ, and noting that x is independent of T once θ is given, we get: p(x|T) = ∫ p(x,θ|T) dθ = ∫ p(x|θ) p(θ|T) dθ. What can we do about p(θ|T)?

Maximum Likelihood Estimation: What can we do about p(θ|T)? Choose the parameter value θ_ML that makes the training data most probable: θ_ML = arg max_θ p(T|θ). Approximating the posterior by a delta function, p(θ|T) ≈ δ(θ − θ_ML), the integral collapses to ∫ p(x|θ) p(θ|T) dθ = p(x|θ_ML).

ML Illustration Assume that the points of T are drawn from some normal distribution with known variance and unknown mean
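As a minimal sketch of this illustration: when the samples of T come from a normal distribution with known variance, the ML estimate of the mean is the sample mean, and the query density is the plug-in value p(x|θ_ML). The sample values, the variance and the function names below are illustrative assumptions, not taken from the slides.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(samples, mean, var):
    """log p(T | theta) for i.i.d. normal samples with mean theta and known variance."""
    return sum(math.log(normal_pdf(s, mean, var)) for s in samples)

def ml_mean(samples):
    """ML estimate of the mean: for known variance it is the sample mean."""
    return sum(samples) / len(samples)

T = [4.1, 5.3, 4.7, 5.9]      # hypothetical training samples
sigma2 = 1.0                  # known variance (assumed value)
theta_ml = ml_mean(T)
p_query = normal_pdf(5.0, theta_ml, sigma2)   # plug-in p(x | theta_ML) at x = 5.0
```

Evaluating `log_likelihood` at values slightly above or below `theta_ml` gives a lower value, which is what "most probable training data" means here.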

Bayesian Estimation: The Bayesian estimation approach considers θ as a random variable. Before we observe the training data, the parameters are described by a prior p(θ), which is typically very broad. Once the data is observed, we can use Bayes’ formula to find the posterior p(θ|T). Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior.

Bayesian Estimation: Unlike ML, Bayesian estimation does not choose a specific value for θ, but instead performs a weighted average over all possible values of θ. Why is it more accurate than ML?

Maximum Likelihood vs. Bayesian:
- ML and Bayesian estimation are asymptotically equivalent and “consistent”.
- ML is typically computationally easier.
- ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian estimation gives a weighted average of models.
- But for finite training data (and given a reliable prior), Bayesian estimation is more accurate, because it uses more of the information.
- Bayesian estimation with a “flat” prior is essentially ML; with asymmetric and broad priors the two methods lead to different solutions.
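The comparison can be made concrete with the same toy setting as the ML illustration: a normal distribution with known variance and unknown mean, now with a normal prior on the mean. This is a sketch under that conjugate-prior assumption; the prior values and function names are hypothetical. The posterior is again normal, and the predictive density ∫ p(x|θ) p(θ|T) dθ is a normal whose variance is the known variance plus the posterior variance.

```python
def posterior_for_mean(samples, sigma2, prior_mean, prior_var):
    """Posterior N(post_mean, post_var) of the mean, given a normal prior
    N(prior_mean, prior_var) and i.i.d. normal samples with known variance sigma2."""
    n = len(samples)
    post_var = 1.0 / (1.0 / prior_var + n / sigma2)
    post_mean = post_var * (prior_mean / prior_var + sum(samples) / sigma2)
    return post_mean, post_var

def predictive_params(post_mean, post_var, sigma2):
    """Mean and variance of p(x | T), the average of p(x | theta) over the posterior."""
    return post_mean, sigma2 + post_var

T = [6.0]                                   # a single hypothetical training sample
post_mean, post_var = posterior_for_mean(T, sigma2=1.0, prior_mean=0.0, prior_var=1.0)
pred_mean, pred_var = predictive_params(post_mean, post_var, sigma2=1.0)

# A very broad ("flat") prior pulls the posterior mean to the ML estimate (the sample mean).
flat_mean, flat_var = posterior_for_mean(T, sigma2=1.0, prior_mean=0.0, prior_var=1e12)
```

With one sample the informative prior pulls the estimate halfway toward the prior mean and widens the predictive density, while the flat prior reproduces the ML answer.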

Agenda:
- Bayesian decision theory
- Recognition
- Simple probabilistic model
- Mixture model
- More advanced probabilistic model
- “One-Shot” Learning

Objective Given an image, decide whether or not it contains an object of a specific class.

Main Issues: Representation, Learning, Recognition

Approaches to Recognition:
- Photometric properties: filter subspaces, neural networks, principal component analysis…
- Geometric constraints between low-level object features: alignment, geometric invariance, geometric hashing…
- Object model

Model: constellation of parts
- Fischler & Elschlager, 1973
- Yuille, ‘91
- Brunelli & Poggio, ‘93
- Lades, v.d. Malsburg et al., ‘93
- Cootes, Lanitis, Taylor et al., ‘95
- Amit & Geman, ‘95, ‘99
- Perona et al., ‘95, ‘96, ‘98, ‘00, ‘02

Perona’s Approach Objects are represented as a probabilistic constellation of rigid parts (features). The variability within a class is represented by a joint probability density function on the shape of the constellation and the appearance of the parts.

Agenda:
- Bayesian decision theory
- Recognition
- Simple probabilistic model
  - Model parameterization
  - Feature Selection
  - Learning
- Mixture model
- More advanced probabilistic model
- “One-Shot” Learning

Weber, Welling, Perona, 2000:
- Unsupervised Learning of Models for Recognition
- Towards Automatic Discovery of Object Categories

Unsupervised Learning Learn to recognize object class given a set of class and background pictures, without preprocessing – labeling, segmentation, alignment.

Model Description Each object is constructed of F parts, each of a certain type. Relations between the part locations define the shape of the object.

Image Model: The image is transformed into a collection of parts. Objects are modeled as sub-collections.

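Model Parameterization: Given an image, we detect potential object parts to obtain the observable Xo: for each part type t = 1, …, T, the positions x_t1, …, x_tNt of the N_t candidate detections of that type.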

Hypothesis: When presented with an unsegmented and unlabeled image, we do not know which parts correspond to the foreground. Assuming the image contains the object, we use a vector of indices h to indicate which of the observables correspond to foreground points (i.e. real parts of the object). We call h a hypothesis, since it is a guess about the structure of the object. h = (h_1, …, h_T) is not observable.

Additional Hidden Variables:
- Xm: the locations of the unobserved object parts
- b = sign(h): a binary vector indicating which parts were detected
- n: the number of background detections of each type
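A small sketch of how b and n follow from a hypothesis h. It assumes a convention the slides do not spell out: one part per type, h_f is the 1-based index of the detection assigned to part f, and h_f = 0 means the part was not detected. Function and variable names are illustrative.

```python
def hidden_variables(h, num_detections):
    """Derive b and n from a hypothesis h.

    h[f] is the 1-based index of the detection assigned to part f,
    or 0 if part f is taken to be undetected (assumed convention).
    num_detections[f] is the number of candidate detections of type f.
    """
    b = [1 if index > 0 else 0 for index in h]                     # b = sign(h)
    n = [count - detected for count, detected in zip(num_detections, b)]
    return b, n

def count_hypotheses(num_detections):
    """Each part is either undetected or matched to one of its candidates."""
    total = 1
    for count in num_detections:
        total *= count + 1
    return total

h = [2, 0, 1]          # hypothetical hypothesis for a 3-part model
N = [3, 2, 1]          # hypothetical detection counts per part type
b, n = hidden_variables(h, N)
num_hypotheses = count_hypotheses(N)
```

The hypothesis count grows multiplicatively with the number of detections per type, which is one reason the number of parts F is kept small.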

Probabilistic Model: We can now define a generative probabilistic model for the object class using the joint probability density function p(Xo, Xm, h, n, b).

Model Details: Since n and b are determined by Xo and h, we have p(Xo, Xm, h) = p(Xo, Xm, h, n, b). Applying Bayes’ rule (the product rule) together with the model’s independence assumptions:

p(Xo, Xm, h, n, b) = p(Xo, Xm | h, n) p(h | n, b) p(n) p(b)

- p(b): a full table of joint probabilities (for small F), or F independent detection-rate probabilities for large F.
- p(n): a Poisson distribution with parameter M_t for the number of background detections of type t, p(n) = Π_t (M_t^n_t / n_t!) e^(−M_t).
- p(h | n, b): uniform probability over all hypotheses consistent with n and b.
- p(Xo, Xm | h, n) = p_fg(z) p_bg(x_bg), where z holds the coordinates of all foreground detections and x_bg the coordinates of all background detections.
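The count and hypothesis terms can be sketched in a few lines. This assumes the usual Poisson form for the background counts, Π_t (M_t^n_t / n_t!) e^(−M_t), and that the number of hypotheses consistent with n and b is Π_f N_f^(b_f) with N_f = n_f + b_f. The function names and example numbers are hypothetical.

```python
import math

def log_poisson(count, rate):
    """log of the Poisson probability of `count` events with mean `rate`."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)

def log_p_n(n, M):
    """log p(n): independent Poisson counts of background detections per type."""
    return sum(log_poisson(n_t, M_t) for n_t, M_t in zip(n, M))

def log_p_h_given_n_b(num_detections, b):
    """log p(h | n, b): uniform over hypotheses consistent with n and b.
    Each detected part f could be any one of its N_f candidate detections."""
    return -sum(math.log(N_f) for N_f, b_f in zip(num_detections, b) if b_f)

N = [3, 2, 1]          # detections per type (hypothetical)
b = [1, 0, 1]          # parts 1 and 3 detected
n = [2, 2, 0]          # remaining detections are background
M = [2.0, 1.5, 0.5]    # assumed average background counts

log_counts = log_p_n(n, M)
log_hypothesis = log_p_h_given_n_b(N, b)
```

Working in log space keeps the product of many small factors numerically stable when these terms are combined with the shape densities.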

Sample object classes

Invariance to Translation, Rotation and Scale: There is no use in modeling the shape of the object in terms of absolute pixel positions of the features. We apply a transformation to the features’ coordinates to make the shape invariant to translation, rotation and scale. But the feature detector must be invariant to these transformations as well!
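One common way to obtain such invariance, shown here only as a sketch and not necessarily the transformation used by the authors, is to express all feature positions relative to two reference features: the first is mapped to 0 and the second to 1 in the complex plane. The shape and transform parameters are made up for the example.

```python
import cmath

def normalize_shape(points):
    """Map feature 0 to 0 and feature 1 to 1 (as complex numbers).
    The result is unchanged by translation, rotation and uniform scaling."""
    z = [complex(x, y) for x, y in points]
    origin, reference = z[0], z[1]
    if reference == origin:
        raise ValueError("the two reference features must be distinct")
    return [(p - origin) / (reference - origin) for p in z]

def similarity_transform(points, scale, angle, shift):
    """Apply scaling, rotation (radians) and translation to 2-D points."""
    a = scale * cmath.exp(1j * angle)
    moved = [a * complex(x, y) + complex(*shift) for x, y in points]
    return [(w.real, w.imag) for w in moved]

shape = [(10.0, 20.0), (30.0, 20.0), (20.0, 35.0), (25.0, 5.0)]   # hypothetical features
moved = similarity_transform(shape, scale=2.5, angle=0.7, shift=(-4.0, 9.0))

canonical = normalize_shape(shape)
canonical_moved = normalize_shape(moved)    # matches `canonical` up to rounding
```

The cost of this choice is that the two reference features must always be detected, and noise in their positions propagates to every other coordinate.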

Automatic Part Selection Find points of interest in all training images Apply Vector Quantization and clustering to get 100 total candidate patterns.

Automatic Part Selection (figure): points of interest → patterns

Method Scheme: Part Selection → Model Learning → Test

Automatic Part Selection: Find the subset of candidate parts of (small) size F, to be used in the model, that gives the best performance in the learning phase. (Figure labels: 57%, 87%, 51%.)

Learning: The goal is to find θ = {μ, Σ, p(b), M} that best explains the observed (training) data, i.e. to maximize the likelihood: θ* = arg max_θ p(Xo | θ). This is done using the EM method.
- μ, Σ: mean and covariance of the joint Gaussian modeling the shape of the foreground
- b: random variable denoting whether each part of the model is detected or not
- M: average number of background detections for each of the parts

Expectation Maximization (EM): EM is an iterative optimization method for estimating unknown parameters θ given measurement data U, when some “hidden” variables J are not given. We want to maximize the posterior probability of the parameters θ given the data U, marginalizing over J: θ* = arg max_θ Σ_J P(θ, J | U).

Expectation Maximization (EM):
- Choose an initial parameter guess θ_0.
- E-step: estimate the unobserved (hidden) data using the observed data and the current parameter guess θ_k.
- M-step: compute the maximum likelihood estimate of the parameters, θ_(k+1), using the estimated data.

Expectation Maximization (EM): EM alternates between estimating the unknowns θ and the hidden variables J. The EM algorithm converges to a local maximum.
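The alternation can be illustrated on a much simpler stand-in problem than the constellation model: a two-component 1-D Gaussian mixture, where the hidden variable J is the component that generated each point. The data, the starting values and the small variance floor are assumptions made for this sketch.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """log p(data | theta) with the hidden component labels summed out."""
    return sum(math.log(sum(w * normal_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in data)

def em_step(data, weights, means, variances):
    # E-step: posterior probability of each component for each point
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: weighted maximum likelihood estimates of the parameters
    new_w, new_m, new_v = [], [], []
    for j in range(len(weights)):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_m.append(mean)
        new_v.append(max(var, 1e-6))   # floor keeps a component from collapsing
    return new_w, new_m, new_v

data = [0.8, 1.1, 0.9, 1.3, 4.9, 5.2, 5.0, 4.7]   # hypothetical measurements
params = ([0.5, 0.5], [0.0, 6.0], [1.0, 1.0])      # initial guess theta_0
history = []
for _ in range(20):
    history.append(log_likelihood(data, *params))
    params = em_step(data, *params)
```

The recorded log-likelihood in `history` never decreases from one iteration to the next, and the result depends on the initial guess, which is the practical meaning of convergence to a local maximum.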

Method Scheme: Part Selection → Model Learning → Test

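Recognition: Using the maximum a posteriori approach, we consider the ratio R = p(C1 | Xo) / p(C0 | Xo) ∝ Σ_h p(Xo, h | C1) / p(Xo, h_0 | C0), where C1 and C0 denote the classes “object present” and “object absent”, and h_0 is the null hypothesis, which explains all parts as background noise. We accept the image as belonging to the class if R is above a certain threshold.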
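A sketch of the decision step, assuming the ratio takes the form used by Weber et al.: the joint probability summed over all foreground hypotheses, divided by the probability of the null hypothesis h_0. The inputs are log-probabilities that the model is assumed to have computed already; the numbers and function names are made up.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_ratio(log_joint_per_hypothesis, log_null):
    """log R: sum over hypotheses in the numerator, null hypothesis in the denominator."""
    return log_sum_exp(log_joint_per_hypothesis) - log_null

def accept(log_joint_per_hypothesis, log_null, log_threshold=0.0):
    """Accept the image as containing the object if R exceeds the threshold."""
    return log_ratio(log_joint_per_hypothesis, log_null) > log_threshold

# Hypothetical values of log p(Xo, h | C1) for two hypotheses and log p(Xo, h_0 | C0)
decision = accept([-1000.0, -1001.0], log_null=-1005.0)
```

Sweeping `log_threshold` trades false positives against false negatives and traces out the ROC curve used in the evaluation.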

Database:
- Two classes: faces and cars
- 100 training images for each class
- 100 test images for each class
- Images vary in scale, location of the object and lighting conditions
- Images have cluttered backgrounds
- No manual preprocessing

Learning Results

Model Performance: Average training and testing errors are measured as 1 − Area(ROC). The results suggest a 4-part model for faces and a 5-part model for cars as optimal.
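The error measure 1 − Area(ROC) can be computed directly from classifier scores, since the area under the ROC curve equals the fraction of (object image, background image) pairs that are ranked correctly, with ties counting one half. The scores below are invented for illustration.

```python
def roc_area(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise ranking."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

def roc_error(positive_scores, negative_scores):
    """The error measure 1 - Area(ROC)."""
    return 1.0 - roc_area(positive_scores, negative_scores)

object_scores = [0.9, 0.8, 0.4]       # hypothetical scores for images with the object
background_scores = [0.7, 0.3, 0.2]   # hypothetical scores for background images
error = roc_error(object_scores, background_scores)
```

A value of 0 means every object image scores above every background image, and 0.5 corresponds to chance.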

Multiple use of parts: Part ‘C’ has high variance along the vertical direction. It can be detected in several locations: bumper, license plate or roof.

Recognition Results: Average success rate (at equal false positive and false negative rates):
- Faces: 93.5%
- Cars: 86.5%
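The operating point with equal false positive and false negative rates can be found by sweeping the decision threshold over the observed scores. This is a minimal sketch with made-up scores; on small samples the two rates may only be approximately equal, so the sketch picks the threshold where they are closest and averages them.

```python
def equal_error_rate(positive_scores, negative_scores):
    """Error rate at the threshold where the false negative rate and the
    false positive rate are closest to each other."""
    best = None
    for t in sorted(set(positive_scores) | set(negative_scores)):
        fnr = sum(1 for s in positive_scores if s < t) / len(positive_scores)
        fpr = sum(1 for s in negative_scores if s >= t) / len(negative_scores)
        gap = abs(fnr - fpr)
        if best is None or gap < best[0]:
            best = (gap, (fnr + fpr) / 2.0)
    return best[1]

object_scores = [0.9, 0.8, 0.4]       # hypothetical scores for images with the object
background_scores = [0.7, 0.3, 0.2]   # hypothetical scores for background images
eer = equal_error_rate(object_scores, background_scores)
success_rate = 1.0 - eer
```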

Agenda:
- Bayesian decision theory
- Recognition
- Simple probabilistic model
- Mixture model
- More advanced probabilistic model
- “One-Shot” Learning

Mixture Model: The Gaussian model works well for homogeneous classes, but real-life objects can be far from homogeneous. Can we extend our approach to multi-modal classes?

Mixture Model: An object is modeled using Ω different components, each of which is a probabilistic model: p(Xo) = Σ_ω p(Xo | ω) p(ω), with ω = 1, …, Ω. Each component “sees the whole picture”. The components are trained together.

Database:
- Faces with different viewing angles: 0°, 15°, …, 90°
- Cars: rear view and side view
- Tree leaves of several types

Tuning of the mixture components: Each training image was assigned to the component that responds to it most strongly.

Results: Two measures are reported:
- Misclassification error at equal false positive and false negative rates, for the training and test sets
- Zero false alarm detection rate (ZFA-DR)
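The zero false alarm detection rate can be read as the fraction of object images that are still detected when the threshold is set just high enough to reject every background image. The sketch below follows that reading; the scores and names are hypothetical.

```python
def zero_false_alarm_detection_rate(positive_scores, negative_scores):
    """Detection rate at the lowest threshold that produces no false alarms,
    i.e. the fraction of positives scoring above the best-scoring negative."""
    highest_negative = max(negative_scores)
    detected = sum(1 for s in positive_scores if s > highest_negative)
    return detected / len(positive_scores)

object_scores = [0.9, 0.8, 0.4]       # hypothetical scores for images with the object
background_scores = [0.7, 0.3, 0.2]   # hypothetical scores for background images
zfa_dr = zero_false_alarm_detection_rate(object_scores, background_scores)
```

Unlike the equal-error measure, this one is determined by the single highest-scoring background image, so it is sensitive to outliers in the clutter set.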

Separately trained components: Two components were trained independently on two subclasses of the cars class. When merged into a mixture model with p(ω) = 0.5, they gave worse results than a two-component model trained on both subclasses simultaneously.

Agenda:
- Bayesian decision theory
- Recognition
- Simple probabilistic model
- Mixture model
- More advanced probabilistic model
  - Feature Selection
  - Model parameterization
  - Results
- “One-Shot” Learning

Fergus, Perona, Zisserman: Object Class Recognition By Scale Invariant Learning. Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, 2003

Object Class Recognition By Scale Invariant Learning:
- Extended version of the previous model (by Weber et al.)
- New feature detector
- Probabilistic model for appearance instead of feature types

Feature Detection:
- Kadir-Brady feature detector
- Detects salient regions over different scales and locations
- Choose the N most salient regions
- Each feature contains scale and location information

Notation:
- X – Shape: locations of the features
- A – Appearance: representations of the features
- S – Scale: vector of feature scales
- h – Hypothesis: which part is represented by which observed feature

Feature Appearance:
- The feature content is rescaled to an 11x11 pixel patch.
- The patch is normalized.
- The data dimension is reduced from 121 to 15 using PCA (projection onto the PCA basis).
- The result is the appearance vector for the part, (c_1, c_2, …, c_15).
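The steps after rescaling can be sketched as follows. The exact normalization is not given on the slide, so zero mean and unit variance is an assumption here, and the PCA basis is a toy placeholder: a real basis would be learned from training patches.

```python
import math

def normalize_patch(pixels):
    """Zero-mean, unit-variance normalization of a flattened patch (assumed form)."""
    count = len(pixels)
    mean = sum(pixels) / count
    var = sum((p - mean) ** 2 for p in pixels) / count
    std = math.sqrt(var) or 1.0        # a flat patch would otherwise divide by zero
    return [(p - mean) / std for p in pixels]

def project(patch, basis):
    """Coefficients of the patch in a basis given as a list of row vectors."""
    return [sum(b * p for b, p in zip(row, patch)) for row in basis]

# Hypothetical 11x11 patch, flattened to 121 values
patch = [float(i % 7) for i in range(121)]

# Placeholder 15 x 121 basis (first 15 unit vectors), standing in for a learned PCA basis
basis = [[1.0 if j == i else 0.0 for j in range(121)] for i in range(15)]

normalized = normalize_patch(patch)
appearance = project(normalized, basis)   # 15-dimensional appearance vector
```

Reducing 121 pixel values to 15 coefficients keeps the per-part appearance Gaussian small enough to estimate from a few hundred training images.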

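Recognition: Assume we have learned the model parameters θ. Given an image, we extract X, S, A and make a Bayesian decision: R = p(Object | X, S, A) / p(No object | X, S, A) ≈ p(X, S, A | θ) p(Object) / [p(X, S, A | θ_bg) p(No object)], where θ_bg denotes the parameters of the background (clutter) model. We apply a threshold to the likelihood ratio R to decide whether the input image belongs to the class.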

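Recognition: The term p(X, S, A | θ) can be factored into p(X, S, A | θ) = Σ_h p(A | X, S, h, θ) p(X | S, h, θ) p(S | h, θ) p(h | θ), i.e. appearance, shape, relative scale and detection terms, summed over all hypotheses h. Each of the terms has a closed (computable) form given the model parameters θ.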

Each pdf has a foreground model and a clutter model:
- Part appearance pdf: Gaussian foreground model; Gaussian clutter model.
- Shape pdf: Gaussian foreground model; uniform clutter model.
- Relative scale pdf: foreground model Gaussian in log(scale); clutter model uniform in log(scale).
- Detection probability pdf: foreground model given by the probability of detection of each part (0.8, 0.75, 0.9 in the slide’s example); clutter model given by a Poisson distribution on the number of detections.

Learning: We want to estimate the model parameters θ. Using the EM method, we find the θ that best explains the training set images, i.e. maximizes the likelihood: θ_ML = arg max_θ p(X, S, A | θ).

Sample Model

Confusion Table: How good is a model for object class A at distinguishing images of class B from background images?

Comparison of Results Average performance of the models at ROC equal error rates: Scale invariant learning:

Agenda:
- Bayesian decision theory
- Recognition
- Simple probabilistic model
- Mixture model
- More advanced probabilistic model
- “One-Shot” Learning

Fei-Fei, Fergus, Perona: A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories. Proc. ICCV, 2003

Small Training Set: Humans can learn a new category from very few training examples. A rule of thumb in machine learning says that the number of training examples should be 5-10 times the number of model parameters. Can computers do better?

Incorporating prior knowledge: Bayesian methods allow us to use “prior” information p(θ) about the nature of objects. Given new observations x, we can update our knowledge into a “posterior” p(θ|x).

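Bayesian Decision: Given a test image, we want to make a Bayesian decision by comparing P(object | test, train) vs. P(clutter | test, train). By Bayes’ rule, P(object | test, train) ∝ P(test | object, train) p(object), and expanding over the model parameters gives P(test | object, train) = ∫ P(test | θ, object) p(θ | object, train) dθ.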

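Bayesian Decision: ∫ P(test | θ, object) p(θ | object, train) dθ. Until now we used the ML approach, approximating the posterior p(θ | object, train) by a delta function centered at θ_ML = arg max_θ p(train | θ). This will not work for a small training set, where the posterior is broad and a single point estimate is unreliable.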

Maximum Likelihood vs. Bayesian Learning (figure panels: Maximum Likelihood; Bayesian Learning)

Experimental setup:
- Learn three object categories using the ML approach.
- Estimate the prior hyper-parameters.
- Use VBEM (variational Bayesian EM) to learn a new object category from a few images.

Prior Hyper-Parameters

Performance Results – Motorbikes (figure panels: 1 training image; 5 training images)

Performance Results – Motorbikes

Performance Results – Face Model (figure panels: 1 training image; 5 training images)

Performance Results – Face Model

Results Comparison:

| Algorithm | # training images | Learning speed | Error rate |
|---|---|---|---|
| Burl et al.; Weber et al.; Fergus et al. | 200 ~ 400 | Hours | 5.6 – 10 % |
| Bayesian One-Shot | 1 ~ 5 | < 1 min | 8 – 15 % |

References:
- Object Class Recognition By Scale Invariant Learning – Fergus, Perona, Zisserman – 2003
- A Bayesian Approach to Unsupervised One-Shot Learning of Object Categories – Fei-Fei, Fergus, Perona – 2003
- Towards Automatic Discovery of Object Categories – Weber, Welling, Perona – 2000
- Unsupervised Learning of Models for Recognition – Weber, Welling, Perona – 2000
- Recognition of Planar Object Classes – Burl, Perona – 1996
- Pattern Classification and Scene Analysis – Duda, Hart – 1973
