# Learning and Vision: Discriminative Models

Chris Bishop and Paul Viola

Part I: Fundamentals
Part II: Algorithms and Applications
- Support Vector Machines: face and pedestrian detection
- AdaBoost: faces
- Building fast classifiers (trading off speed for accuracy): face and object detection
- Memory-based learning: Simard; Moghaddam

History Lesson
- 1950s: Perceptrons are cool. Very simple learning rule; can learn "complex" concepts. Generalized perceptrons are better, but have too many weights.
- 1960s: Perceptrons stink (Minsky & Papert). Some simple concepts require an exponential number of features. Can't possibly learn that, right?
- 1980s: MLPs are cool (Rumelhart & McClelland / PDP). Sort-of-simple learning rule; can learn anything (?). Create just the features you need.
- Then: MLPs stink. Hard to train: slow, local minima. Perceptrons are cool again.

Why did we need multi-layer perceptrons?
Problems like this seem to require very complex non-linearities. Minsky and Papert showed that an exponential number of features is necessary to solve generic problems.

Why an exponential number of features?
A 14th-order term? That alone is 120 features. With N = 21 inputs and order k = 5: about 65,000 features.

MLPs vs. Perceptrons
MLPs are hard to train: it takes a long time (unpredictably long), and they can converge to poor minima. MLPs are also hard to understand: what are they really doing?
Perceptrons are easy to train: a type of linear programming, polynomial time, one minimum which is global. Generalized perceptrons are easier to understand: polynomial functions.

Perceptron Training is Linear Programming
Polynomial time in the number of variables and in the number of constraints. But what about linearly inseparable data?
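
The LP view can be made concrete. A minimal sketch (not code from the talk) that phrases separability as a pure feasibility problem with SciPy's `linprog`, on a made-up four-point dataset; the constant 1 in y_i (w · x_i) >= 1 just fixes the scale of w:

```python
import numpy as np
from scipy.optimize import linprog

# Toy linearly separable data (illustrative only).
X = np.array([[2., 1.], [1., 2.], [-1., -2.], [-2., -1.]])
y = np.array([1., 1., -1., -1.])

# Constraints y_i (w . x_i) >= 1, rewritten as  -y_i x_i . w <= -1.
A_ub = -(y[:, None] * X)
b_ub = -np.ones(len(X))
res = linprog(c=[0.0, 0.0],                   # nothing to minimize: feasibility only
              A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None), (None, None)])
w = res.x
```

Any feasible w separates the data; which of the many feasible vectors to prefer is exactly the uniqueness question the next slides raise.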

Rebirth of Perceptrons
- How to train effectively? Linear programming (and later, quadratic programming), though on-line training works great too.
- How to get so many features inexpensively?!? The kernel trick.
- How to generalize with so many features? VC dimension. (Or is it regularization?)
Together: Support Vector Machines.

Lemma 1: Weight vectors are simple
The weight vector lives in a subspace spanned by the examples; its dimensionality is determined by the number of examples, not by the complexity of the feature space.

Lemma 2: Only need to compare examples
The matrix of pairwise comparisons between examples is called the kernel matrix.
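
Both lemmas show up in a tiny kernel perceptron. This NumPy sketch (my illustration, not code from the talk) keeps one dual coefficient per example and touches the data only through the kernel matrix; the XOR dataset and degree-2 polynomial kernel are my choices:

```python
import numpy as np

def poly_kernel(X, Z, degree=2):
    # k(x, z) = (1 + x.z)^degree: implicit quadratic feature space
    return (1.0 + X @ Z.T) ** degree

def train_kernel_perceptron(X, y, epochs=50):
    # Lemma 1: w lives in the span of the examples, so we store alpha_i,
    # one dual coefficient per training point, instead of w itself.
    K = poly_kernel(X, X)
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for i in range(len(X)):
            # Lemma 2: prediction only compares examples via the kernel matrix
            f = np.sum(alpha * y * K[:, i])
            if y[i] * f <= 0:            # mistake (or on the boundary)
                alpha[i] += 1.0
    return alpha

# XOR: not linearly separable in input space, but the degree-2 kernel
# supplies the x1*x2 feature that makes it separable.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
alpha = train_kernel_perceptron(X, y)
pred = np.sign(poly_kernel(X, X) @ (alpha * y))
```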

Simple Kernels yield Complex Features

But Kernel Perceptrons Can Generalize Poorly

Perceptron Rebirth: Generalization
Too many features… Occam is unhappy. Perhaps we should encourage smoothness: the perceptron iteration alone does not converge to the smoother of the consistent solutions.

Linear Program is not unique
The linear program can return any positive multiple of a correct weight vector. Slack variables and a prior on the weights force the solution toward zero.

Definition of the Margin
Geometric margin: the gap between negatives and positives, measured perpendicular to a hyperplane. Classifier margin: the width by which a classifier's boundary could be grown before hitting a data point.

Require a non-zero margin
The original constraint allows solutions with zero margin. Requiring y_i (w · x_i) >= 1 instead enforces a non-zero margin between the examples and the decision boundary.

Constrained Optimization
Find the smoothest function that separates the data. This is quadratic programming (similar to linear programming): a single minimum, which is global, and polynomial-time algorithms.

Constrained Optimization 2

SVM: examples

SVM: Key Ideas
- Augment the inputs with a very large feature set (polynomials, etc.).
- Use the kernel trick to do this efficiently.
- Enforce/encourage smoothness with a weight penalty.
- Introduce the margin.
- Find the best solution using quadratic programming.
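
A rough illustration of the "smoothness plus margin" objective. The talk solves it exactly as a QP; this sketch instead runs batch subgradient descent on the same hinge-loss-plus-weight-penalty objective, on made-up toy data (no bias term, so the data is separable through the origin):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=200):
    # minimize  lam/2 ||w||^2 + mean_i max(0, 1 - y_i w.x_i)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                       # points violating the margin
        hinge_grad = -(y[active, None] * X[active]).sum(axis=0) / len(X)
        w -= eta * (lam * w + hinge_grad)          # weight penalty + hinge term
    return w

# Toy linearly separable data (illustrative only).
X = np.array([[2., 2.], [1., 3.], [-2., -2.], [-1., -3.]])
y = np.array([1., 1., -1., -1.])
w = train_linear_svm(X, y)
```

The weight-penalty term is the "encourage smoothness" idea; the margin constraint appears as the hinge loss being active whenever y_i w·x_i < 1.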

SVM: Zip Code recognition
Data dimension: 256. Feature space: 4th order, roughly 100,000,000 dimensions.
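
One way to arrive at a number of this magnitude (an assumption about how the features are counted): the number of monomials of total degree at most 4 in 256 inputs is C(256 + 4, 4):

```python
from math import comb

# monomials of total degree <= 4 in 256 variables
n_features = comb(256 + 4, 4)   # 186,043,585: roughly 10^8 dimensions
```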

The Classical Face Detection Process
Scan the image with a sliding window, from the smallest scale up to larger scales: about 50,000 locations/scales per image.

Classifier is Learned from Labeled Data
Training data: 5,000 faces (all frontal) and 10^8 non-faces. The faces are normalized for scale and translation, with many variations: across individuals, illumination, and pose (rotation both in plane and out of plane). This situation is actually quite common: negative examples are essentially free.

Key Properties of Face Detection
- Each image contains thousands of locations/scales; as noted above, the classifier is evaluated about 50,000 times per image.
- Faces are rare: perhaps 1 or 2 per image, so non-faces outnumber faces roughly 1,000 to 1.
- We therefore need an extremely small false-positive rate: about 10^-6.
- A reasonable goal is to make the false-positive rate less than the true-positive rate…

Sung and Poggio

Rowley, Baluja & Kanade: the first fast system (scanning from low resolution to high)

Osuna, Freund, and Girosi

Support Vectors

P, O, & G: First Pedestrian Work
Flaws: a very poor model for feature selection, and little real motivation for the features at all. Why not use the original image?

On to AdaBoost
Given a set of weak classifiers, none much better than random:
- Iteratively combine the classifiers to form a linear combination.
- The training error converges to zero quickly.
- The test error is related to the training margin.

AdaBoost (Freund & Schapire)
Each round fits a weak classifier and increases the weights of the examples it misclassifies. The final classifier is a linear combination of the weak classifiers.
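
A compact NumPy sketch of this loop, with decision stumps as the weak classifiers (toy 1-D data of my own, not the talk's code):

```python
import numpy as np

def stump_predict(x, theta, p):
    # weak classifier: threshold on x with polarity p in {+1, -1}
    return p * np.where(x > theta, 1.0, -1.0)

def adaboost(x, y, rounds=5):
    n = len(x)
    D = np.full(n, 1.0 / n)                       # example weights
    xs = np.unique(x)
    thetas = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    ensemble = []
    for _ in range(rounds):
        best = None                               # exhaustive weak-learner search
        for theta in thetas:
            for p in (1.0, -1.0):
                err = D[stump_predict(x, theta, p) != y].sum()
                if best is None or err < best[0]:
                    best = (err, theta, p)
        err, theta, p = best
        err = max(err, 1e-12)                     # guard against a perfect stump
        alpha = 0.5 * np.log((1.0 - err) / err)
        h = stump_predict(x, theta, p)
        D *= np.exp(-alpha * y * h)               # increase weights of mistakes
        D /= D.sum()
        ensemble.append((alpha, theta, p))
    return ensemble

def predict(ensemble, x):
    # final classifier: sign of the linear combination of weak classifiers
    F = sum(a * stump_predict(x, th, p) for a, th, p in ensemble)
    return np.sign(F)

# No single stump separates this "interval" labeling, but a few rounds do.
x = np.arange(6.0)
y = np.array([1., 1., -1., -1., 1., 1.])
acc = (predict(adaboost(x, y), x) == y).mean()
```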

Features = weak classifiers. Each round selects the optimal feature given the previously selected features and the exponential loss.

Boosted Face Detection: Image Features
"Rectangle filters", similar to Haar wavelets (Papageorgiou et al.), computed at multiple scales. For real problems, results are only as good as the features used; this is the main piece of ad-hoc (domain) knowledge. Rather than raw pixels, we select from a very large set of simple functions that are sensitive to edges and other critical features of the image. Since the final classifier is a perceptron, it is important that the features be non-linear; otherwise the final classifier would itself be a simple perceptron. We therefore introduce a threshold to yield unique binary features.
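
These rectangle filters are cheap to evaluate because of the integral image, a standard part of the Viola-Jones detector though not spelled out on this slide. A sketch:

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img over all rows < y and cols < x (zero-padded border)
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    # any rectangle sum in four lookups, independent of the rectangle's size
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, y, x, h, w):
    # left half minus right half: a vertical-edge "rectangle filter"
    return rect_sum(ii, y, x, h, w // 2) - rect_sum(ii, y, x + w // 2, h, w // 2)

img = np.arange(16.0).reshape(4, 4)
center_sum = rect_sum(integral_image(img), 1, 1, 2, 2)   # sums img[1:3, 1:3]

edge_img = np.zeros((4, 4))
edge_img[:, :2] = 1.0                                    # bright left half
edge_score = two_rect_feature(integral_image(edge_img), 0, 0, 4, 4)
```

Thresholding `edge_score` then gives the binary feature the slide describes.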

Feature Selection
For each round of boosting:
- Evaluate each rectangle filter on each example.
- Sort the examples by filter value.
- Select the best threshold for each filter (minimizing Z).
- Select the best filter/threshold pair (= the feature).
- Reweight the examples.
With M filters, T thresholds, N examples, and learning time L, the naive wrapper method costs O(MT · L(MTN)); the AdaBoost feature selector costs O(MN) per round.
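
The "sort examples by filter value" step is what makes per-filter threshold selection fast: after one sort, a single cumulative-sum pass scores every candidate split for both stump polarities. A NumPy sketch (my own illustration):

```python
import numpy as np

def best_stump_error(v, y, w):
    """Minimum weighted error over all thresholds on feature values v."""
    order = np.argsort(v)
    y, w = y[order], w[order]
    total_pos = w[y > 0].sum()
    total_neg = w[y < 0].sum()
    # weight of positives / negatives strictly below each split point
    below_pos = np.concatenate(([0.0], np.cumsum(w * (y > 0))))
    below_neg = np.concatenate(([0.0], np.cumsum(w * (y < 0))))
    # polarity A: predict -1 below the split, +1 above; polarity B: the reverse
    err_a = below_pos + (total_neg - below_neg)
    err_b = below_neg + (total_pos - below_pos)
    return float(np.minimum(err_a, err_b).min())

v = np.array([3., 1., 2., 5., 4.])       # one filter's values (made up)
y = np.array([1., -1., 1., -1., -1.])
w = np.full(5, 0.2)                      # current boosting weights
e = best_stump_error(v, y, w)
```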

Example Classifier for Face Detection
A classifier with 200 rectangle features was learned using AdaBoost: 95% correct detection on the test set, with 1 false positive in 14,084 windows. Not quite competitive… (ROC curve for the 200-feature classifier.)

Building Fast Classifiers
Given a nested set of classifier hypothesis classes, we can trade computation against false negatives: a computational risk hierarchy, in analogy with structural risk minimization. In general simple classifiers are more efficient, but they are also weaker. The training process is reminiscent of boosting: previous classifiers reweight the examples used to train subsequent classifiers. The goal, however, is different: instead of minimizing errors, minimize false positives. Each image sub-window passes through a chain of classifiers; a rejection at any stage labels it NON-FACE, and only survivors reach the next classifier.

Other Fast Classification Work
Simard; Rowley (faces); Fleuret & Geman (faces).

Cascaded Classifier
Each image sub-window passes through classifiers of increasing complexity: 1 feature, then 5 features, then 20 features; a rejection at any stage means NON-FACE. A 1-feature classifier achieves a 100% detection rate with about a 50% false-positive rate. A 5-feature classifier achieves a 100% detection rate and a 40% false-positive rate (20% cumulative), using data from the previous stage. A 20-feature classifier achieves a 100% detection rate with a 10% false-positive rate (2% cumulative).
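
The control flow is just early rejection. A sketch with made-up stage scores and thresholds (the feature names here are hypothetical, not the detector's real features):

```python
def cascade(stages, window):
    # each stage: (score_fn, threshold); reject as soon as one says no
    for score_fn, thresh in stages:
        if score_fn(window) < thresh:
            return False          # NON-FACE: cheap, early rejection
    return True                   # survived every stage: FACE

# toy stages standing in for the 1-, 5-, and 20-feature classifiers
stages = [
    (lambda w: w["brightness_diff"], 0.2),       # hypothetical feature names
    (lambda w: w["eye_band_contrast"], 0.5),
    (lambda w: w["mouth_band_contrast"], 0.4),
]
face = {"brightness_diff": 0.9, "eye_band_contrast": 0.8, "mouth_band_contrast": 0.7}
sky  = {"brightness_diff": 0.1, "eye_band_contrast": 0.0, "mouth_band_contrast": 0.0}
```

Most windows look like `sky` and are rejected by the first, cheapest stage; the cumulative false-positive rate is the product of the stage rates (0.5 × 0.4 × 0.1 = 2%).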

Comparison to Other Systems
Detection rates (%) at increasing numbers of false detections (10, 31, 50, 65, 78, 95, 110, 167, 422):
- Viola-Jones: 78.3, 85.2, 88.8, 90.0, 90.8, 91.1, 91.8, 93.7
- Rowley-Baluja-Kanade: 83.2, 86.0, 89.2, 90.1, 89.9
- Schneiderman-Kanade: 94.4
- Roth-Yang-Ahuja: (94.8)

Output of Face Detector on Test Images

Extensions: profile detection, facial feature localization, demographic analysis

Feature Localization
Surprising properties of our framework:
- The cost of detection is not a function of image size, just of the number of features.
- Learning automatically focuses attention on key regions.
Conclusion: the "feature" detector can include a large contextual region around the feature.

Feature Localization Features

Profile Detection

More Results

Profile Features

One-Nearest Neighbor (thanks to Andrew Moore)
One-nearest neighbor for fitting is similar to join-the-dots, with two pros and one con. PRO: it is easy to implement with multivariate inputs. CON: it no longer interpolates locally. PRO: it is an excellent introduction to instance-based learning…

1-Nearest Neighbor is an example of… instance-based learning (thanks to Andrew Moore)
Store the table (x1, y1), (x2, y2), …, (xn, yn): a function approximator that has been around since about 1910. To make a prediction, search the database for similar datapoints and fit with the local points. Four things make a memory-based learner: a distance metric; how many nearby neighbors to look at; a weighting function (optional); how to fit with the local points.

Nearest Neighbor (thanks to Andrew Moore)
Four things make a memory-based learner:
1. A distance metric: Euclidean.
2. How many nearby neighbors to look at? One.
3. A weighting function (optional): unused.
4. How to fit with the local points? Just predict the same output as the nearest neighbor.
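
With those four choices fixed, the whole learner is a few lines. A NumPy sketch on made-up data:

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_query):
    # all four choices hard-coded: Euclidean metric, one neighbor,
    # no weighting, and "predict the nearest neighbor's output"
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

X = np.array([[0., 0.], [1., 1.], [4., 4.]])
y = np.array([10., 20., 30.])
pred = one_nn_predict(X, y, np.array([[0.2, 0.1], [3.5, 3.9]]))
```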

Multivariate Distance Metrics (thanks to Andrew Moore)
Suppose the input vectors x1, x2, …, xN are two-dimensional: x1 = (x11, x12), x2 = (x21, x22), …, xN = (xN1, xN2). One can draw the nearest-neighbor regions in input space.
Dist(xi, xj) = (xi1 − xj1)² + (xi2 − xj2)²   versus   Dist(xi, xj) = (xi1 − xj1)² + (3·xi2 − 3·xj2)²
The relative scalings in the distance metric affect the region shapes.
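
A tiny NumPy illustration of how rescaling one input (as in the second metric above) changes which neighbor is nearest; the data points are made up:

```python
import numpy as np

def nearest_index(X, q, scales):
    # scaled squared Euclidean distance: the scales reshape the NN regions
    d2 = (((X - q) * scales) ** 2).sum(axis=1)
    return int(d2.argmin())

X = np.array([[2., 0.], [0., 1.]])
q = np.array([0., 0.])
i_plain  = nearest_index(X, q, np.array([1., 1.]))  # both inputs weighted equally
i_scaled = nearest_index(X, q, np.array([1., 3.]))  # x2 tripled, as in the 2nd metric
```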

Euclidean Distance Metric (thanks to Andrew Moore)
D(x, x′)² = σ1²(x1 − x′1)² + σ2²(x2 − x′2)² + … or, equivalently, D(x, x′)² = (x − x′)ᵀ Σ (x − x′), where Σ = diag(σ1², …, σN²).
Other metrics: Mahalanobis, rank-based, correlation-based (Stanfill & Waltz, Maes' Ringo system…).

Notable Distance Metrics (thanks to Andrew Moore)

Simard: Tangent Distance


FERET Photobook Moghaddam & Pentland (1995)

Euclidean (Standard) "Eigenfaces" (thanks to Baback Moghaddam)
Turk & Pentland (1992); Moghaddam & Pentland (1995). Projects all the training faces onto a universal eigenspace to "encode" variations ("modes") via principal components (PCA). Uses inverse distance as a similarity measure for matching and recognition.

Euclidean Similarity Measures (thanks to Baback Moghaddam)
Metric (distance-based) similarity measures: template matching, normalized correlation, etc. Disadvantages:
- They assume isotropic variation (that all variations are equi-probable).
- They cannot distinguish incidental changes from critical ones.
- They are particularly bad for face recognition, in which so many variations are incidental: for example, lighting and expression.

PCA-Based Density Estimation (Moghaddam & Pentland, ICCV'95; thanks to Baback Moghaddam)
Perform PCA and factorize the density into (orthogonal) Gaussian subspaces, solving for the minimal KL-divergence residual for the orthogonal subspace. See Tipping & Bishop (1997) for an ML derivation within a more general factor-analysis framework (PPCA).
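
A minimal sketch of the subspace-plus-residual decomposition behind this (my own illustration: plain PCA plus the squared residual off the subspace, often called the "distance from feature space"; the full method additionally fits Gaussian densities within and off the subspace):

```python
import numpy as np

def fit_pca(X, k):
    # principal subspace of the data: mean + top-k right singular vectors
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def dffs(x, mu, comps):
    # squared residual off the subspace: the orthogonal-complement term
    coeffs = comps @ (x - mu)
    recon = mu + comps.T @ coeffs
    return float(((x - recon) ** 2).sum())

# toy data lying exactly on a 1-D subspace
direction = np.array([1., 2., 2.]) / 3.0
t = np.array([-1., 0., 1., 2.])
X = np.outer(t, direction)
mu, comps = fit_pca(X, k=1)
on_subspace = dffs(X[0], mu, comps)                              # ~0
off_subspace = dffs(X[0] + np.array([2., -1., 0.]), mu, comps)   # orthogonal offset
```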

Bayesian Face Recognition (Moghaddam et al., ICPR'96, FG'98, NIPS'99, ICCV'99; thanks to Baback Moghaddam)
Build dual subspaces for dyads (image pairs): an intrapersonal subspace and an extrapersonal subspace, each modeled with PCA-based density estimation (Moghaddam, ICCV'95). Equate "similarity" with the posterior probability of the intrapersonal class.

Intra-Extra (Dual) Subspaces (thanks to Baback Moghaddam)
Example intra eigenvectors encode variations such as specs, lighting, mouth, and smile. Note that the extra and standard PCA subspaces are qualitatively the same (e.g., both have beard-encoders, not present in the intra space). In fact, their first eigenvectors are almost exactly the same (bangs on the forehead).

Intra-Extra Subspace Geometry (thanks to Baback Moghaddam)
The lower-order eigenvectors/eigenvalues let us "visualize" these two distributions and judge at least their relative size and orientation. The key point is that we are effectively talking about two "pancake" subspaces with different orientations, intersecting near the origin (of delta space). If each is in fact Gaussian, the optimal discriminant is hyperquadratic. Discriminating between these two subspaces (a binary classification task) is often much easier than discriminating between N faces (N-ary classification), and in fact very often much more effective.

Bayesian Similarity Measure (thanks to Baback Moghaddam)
Bayesian (MAP) similarity: the priors can be adjusted to reflect operational settings, or used for Bayesian fusion (evidential "belief" from another level of inference). Likelihood (ML) similarity: intra-only (ML) recognition is only slightly inferior to MAP (by a few percent). Therefore, if you had to pick only one subspace to work in, you should pick intra, and not standard eigenfaces!

FERET Identification: Pre-Test (thanks to Baback Moghaddam)
Bayesian (intra-extra) vs. standard (eigenfaces).

Official 1996 FERET Test (thanks to Baback Moghaddam)
Bayesian (intra-extra) vs. standard (eigenfaces).

…let's leave distance metrics for now, and go back to…
One-Nearest Neighbor (thanks to Andrew Moore)
Objection: that noise-fitting is really objectionable. What's the most obvious way of dealing with it?

k-Nearest Neighbor (thanks to Andrew Moore)
Four things make a memory-based learner:
1. A distance metric: Euclidean.
2. How many nearby neighbors to look at? k.
3. A weighting function (optional): unused.
4. How to fit with the local points? Just predict the average output among the k nearest neighbors.

k-Nearest Neighbor, here k = 9 (thanks to Andrew Moore)
On the first test case, a magnificent job of noise-smoothing: three cheers for 9-nearest-neighbor. But the lack of gradients and the jerkiness isn't good. On the second, appalling behavior: it loses all the detail that join-the-dots and 1-nearest-neighbor gave us, yet smears the ends. On the third, it fits much less of the noise and captures trends, but is still, frankly, pathetic compared with linear regression. k-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies. What can we do about all the discontinuities that k-NN gives us?

Kernel Regression (thanks to Andrew Moore)
Four things make a memory-based learner:
1. A distance metric: scaled Euclidean.
2. How many nearby neighbors to look at? All of them.
3. A weighting function (optional): wi = exp(−D(xi, query)² / KW²). Nearby points are weighted strongly, far points weakly. The KW parameter is the kernel width; it is very important.
4. How to fit with the local points? Predict the weighted average of the outputs: predict = Σ wi yi / Σ wi.
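
The predictor in one function (NumPy sketch; the toy data is mine). Note the two limiting cases: a narrow kernel reproduces the nearby data, a huge kernel returns the global average:

```python
import numpy as np

def kernel_regress(x_train, y_train, x_query, kw):
    # w_i = exp(-D(x_i, query)^2 / KW^2), then a weighted average of outputs
    w = np.exp(-((x_train - x_query) ** 2) / kw ** 2)
    return float((w * y_train).sum() / w.sum())

x = np.array([0., 1., 2., 3.])
y = np.array([0., 1., 2., 3.])
local_pred  = kernel_regress(x, y, 1.0, 0.3)   # narrow kernel: nearly y at x = 1
global_pred = kernel_regress(x, y, 1.0, 1e6)   # huge KW: tends to the global mean
```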

Kernel Regression in Pictures (thanks to Andrew Moore)
Take this dataset… and do a kernel prediction with xquery = 310, KW = 50.

Varying the Query (thanks to Andrew Moore): xq = 150 and xq = 395

Varying the Kernel Width (thanks to Andrew Moore)
With xq = 310 fixed, compare KW = 50 (the double arrow at the top of the diagram), KW = 100, and KW = 150. Increasing the kernel width KW means further-away points get an opportunity to influence you. As KW → ∞, the prediction tends to the global average.

Kernel Regression Predictions (thanks to Andrew Moore)
KW = 10, 20, and 80. Again, increasing the kernel width KW means further-away points get an opportunity to influence you; as KW → ∞, the prediction tends to the global average.

Kernel Regression on Our Test Cases (thanks to Andrew Moore)
With KW = 1/32 of the x-axis width, it's nice to see a smooth curve at last, but it's rather bumpy; if KW gets any higher, the fit is poor. With KW = 1/16 of the axis width, the first case is quite splendid (well done, kernel regression, though the author needed to choose the right KW to achieve this); the others are nice and smooth, but are the bumps justified, or is this overfitting? Choosing a good KW is important: not just for kernel regression, but for all the locally weighted learners we're about to see.

Weighting Functions (thanks to Andrew Moore)
Let d = D(xi, xquery)/KW. Then there are several commonly used weighting functions of d… (we use a Gaussian).

Kernel Regression Can Look Bad (thanks to Andrew Moore)
Even with the best KW, it is clearly not capturing the simple structure of the data: note the complete failure to extrapolate at the edges, and the fit is also much too local. Why wouldn't increasing KW help? Because then everything would be "smeared". On three noisy linear segments, the best kernel regression gives poor gradients. Time to try something more powerful…

Locally Weighted Regression (thanks to Andrew Moore)
Kernel regression: take a very, very conservative function approximator called AVERAGING, and locally weight it. Locally weighted regression: take a conservative function approximator called LINEAR REGRESSION, and locally weight it. Let's review linear regression…

Unweighted Linear Regression (thanks to Andrew Moore)
You're lying asleep in bed. Then Nature wakes you.
YOU: "Oh. Hello, Nature!"
NATURE: "I have a coefficient β in mind. I took a bunch of real numbers called x1, x2, … xN, thus: x1 = 3.1, x2 = 2, … xN = 4.5. For each of them (k = 1, 2, … N), I generated yk = βxk + εk, where εk is a Gaussian (i.e., normal) random variable with mean 0 and standard deviation σ. The εk's were generated independently of each other. Here are the resulting yk's: y1 = 5.1, y2 = 4.2, … yN = 10.2."
YOU: "Uh-huh."
NATURE: "So what do you reckon β is, then, eh?"
WHAT IS YOUR RESPONSE?

Locally Weighted Regression (thanks to Andrew Moore)
Four things make a memory-based learner:
1. A distance metric: scaled Euclidean.
2. How many nearby neighbors to look at? All of them.
3. A weighting function (optional): wk = exp(−D(xk, xquery)² / KW²). Nearby points are weighted strongly, far points weakly. The KW parameter is the kernel width.
4. How to fit with the local points? First form a local linear model: find the β that minimizes the locally weighted sum of squared residuals, Σk wk (yk − βᵀxk)². Then predict ypredict = βᵀ xquery.
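
A sketch of one LWR prediction (my own code; a bias column is appended so the local line need not pass through the origin, which the slide's βᵀxquery form leaves implicit):

```python
import numpy as np

def lwr_predict(X, y, x_query, kw):
    # Weighted least squares: min_b sum_k w_k (y_k - b.x_k)^2,
    # with Gaussian weights centered on the query point.
    Xb = np.column_stack([X, np.ones(len(X))])      # append bias column
    q = np.append(x_query, 1.0)
    w = np.exp(-((X - x_query) ** 2).sum(axis=1) / kw ** 2)
    W = np.diag(w)
    beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return float(q @ beta)                          # evaluate local line at query

# toy data on the exact line y = 2x + 1: LWR recovers it at any query
X = np.array([[0.], [1.], [2.], [3.]])
y = 2.0 * X[:, 0] + 1.0
pred = lwr_predict(X, y, np.array([1.5]), kw=1.0)
```

Refitting β at every query is what makes LWR flexible; each individual fit is just a small linear solve, which is why it stays fast.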

How LWR Works (thanks to Andrew Moore)
For each query, fit a linear model to the locally weighted data and read off the prediction at the query point. Linear regression is not flexible, but trains like lightning. Locally weighted regression is very flexible and fast to train.

LWR on Our Test Cases (thanks to Andrew Moore)
KW = 1/16, 1/32, and 1/8 of the x-axis width. Nicer and smoother, but even now, are the bumps justified, or is this overfitting?

Features, Features, Features
In almost every case: good features beat good learning, and learning beats no learning. Critical classifier ratio: AdaBoost >> SVM. This is not to say that a