Viola 2003 Learning and Vision: Discriminative Models Chris Bishop and Paul Viola.


1 Viola 2003 Learning and Vision: Discriminative Models Chris Bishop and Paul Viola

2 Viola 2003 Part II: Algorithms and Applications
(Part I: Fundamentals)
– Support Vector Machines: face and pedestrian detection
– AdaBoost: faces
– Building Fast Classifiers: trading off speed for accuracy; face and object detection
– Memory Based Learning: Simard; Moghaddam

3 Viola 2003 History Lesson
1950s: Perceptrons are cool – very simple learning rule, can learn "complex" concepts; generalized perceptrons are better, but need too many weights
1960s: Perceptrons stink (Minsky & Papert) – some simple concepts require an exponential number of features. Can't possibly learn that, right?
1980s: MLPs are cool (Rumelhart & McClelland / PDP) – sort-of-simple learning rule, can learn anything (?); create just the features you need
1990: MLPs stink – hard to train: slow, local minima
1996: Perceptrons are cool again

4 Viola 2003 Why did we need multi-layer perceptrons? Problems like this seem to require very complex non-linearities. Minsky and Papert showed that some seemingly simple concepts require an exponential number of features for a perceptron to solve.

5 Viola 2003 Why an exponential number of features? A 14th-order decision surface? 120 features. With N=21 inputs and order k=5: roughly 65,000 features.

6 Viola 2003 MLPs vs. Perceptrons
MLPs are hard to train – it takes a long time (unpredictably long), and they can converge to poor minima.
MLPs are hard to understand – what are they really doing?
Perceptrons are easy to train – a type of linear programming, polynomial time; one minimum, which is global.
Generalized perceptrons are easier to understand – polynomial functions.

7 Viola 2003 Perceptron Training is Linear Programming Polynomial time in the number of variables and in the number of constraints. What about linearly inseparable data?
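The claim on this slide can be made concrete. The LP formulation needs a solver, so as a hedged stand-in here is the classic online perceptron update on a tiny, made-up, linearly separable dataset; on separable data it too finds a separating weight vector in finite time.

```python
# Classic online perceptron rule (a sketch standing in for the slide's LP
# formulation). The data and epoch budget are made up for illustration.

def perceptron_train(points, labels, epochs=100):
    """Learn w, b such that sign(w.x + b) matches the +1/-1 labels."""
    n_dims = len(points[0])
    w = [0.0] * n_dims
    b = 0.0
    for _ in range(epochs):
        updated = False
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:          # misclassified: nudge toward the example
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                updated = True
        if not updated:                      # a full clean pass: converged
            break
    return w, b

# Toy linearly separable data: +1 above the line x1 + x2 = 1, -1 below it.
X = [(2.0, 2.0), (1.5, 1.0), (-1.0, -1.0), (0.0, -2.0)]
y = [1, 1, -1, -1]
w, b = perceptron_train(X, y)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in X]
```

On inseparable data the inner loop never makes a clean pass, which is exactly the failure mode the slide's question points at.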

8 Viola 2003 Rebirth of Perceptrons: Support Vector Machines
How to train effectively? Linear programming (later, quadratic programming) – though on-line training works great too.
How to get so many features inexpensively? The kernel trick.
How to generalize with so many features? VC dimension. (Or is it regularization?)

9 Viola 2003 Lemma 1: Weight vectors are simple The weight vector lives in a sub-space spanned by the examples… –Dimensionality is determined by the number of examples not the complexity of the space.

10 Viola 2003 Lemma 2: Only need to compare examples

11 Viola 2003 Simple Kernels yield Complex Features
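A minimal numeric illustration of this slide's point, with made-up vectors: the polynomial kernel K(x, z) = (x·z)² evaluated in the original 2-D space equals the inner product of an explicit 3-D quadratic feature map, so the "complex features" never have to be built.

```python
import math

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without expanding any features."""
    return sum(xi * zi for xi, zi in zip(x, z)) ** 2

def phi(x):
    """Explicit degree-2 feature map for a 2-D input: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

x, z = (1.0, 2.0), (3.0, 0.5)
# Inner product taken explicitly in the expanded feature space...
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
# ...matches the kernel evaluated in the original space.
assert abs(poly_kernel(x, z) - explicit) < 1e-9
```

For degree-d polynomials on n inputs the explicit map has O(n^d) coordinates, while the kernel stays an O(n) dot product; that is the inexpensive-features trick from slide 8.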

12 Viola 2003 But Kernel Perceptrons Can Generalize Poorly

13 Viola 2003 Perceptron Rebirth: Generalization Too many features … Occam is unhappy. Perhaps we should encourage smoothness?

14 Viola 2003 Linear Program is not unique The linear program can return any multiple of the correct weight vector. Slack variables and a weight prior force the solution toward zero.

15 Viola 2003 Definition of the Margin Classifier margin. Geometric margin: the gap between negatives and positives, measured perpendicular to a hyperplane.

16 Viola 2003 Require non-zero margin The basic formulation allows solutions with zero margin; adding the margin constraint enforces a non-zero gap between the examples and the decision boundary.

17 Viola 2003 Constrained Optimization Find the smoothest function that separates the data – quadratic programming (similar to linear programming): a single global minimum and a polynomial-time algorithm.
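The QP on this slide is normally handed to a dedicated solver. As a hedged stand-in that needs no solver, the sketch below minimizes the same soft-margin objective, (λ/2)‖w‖² plus average hinge loss, by plain subgradient descent; the data and hyperparameters are made up.

```python
# Subgradient descent on the soft-margin SVM objective
#   (lam/2)*||w||^2 + mean(max(0, 1 - y*(w.x + b)))
# This is a sketch, not the QP solver the slide refers to.

def svm_sgd(points, labels, lam=0.01, lr=0.1, epochs=200):
    n = len(points[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:   # inside the margin: hinge-loss subgradient step
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:            # outside the margin: only the smoothness penalty acts
                w = [wi - lr * lam * wi for wi in w]
    return w, b

# Made-up separable data, two points per class.
X = [(2.0, 2.0), (2.0, 1.0), (-1.0, -1.0), (-2.0, -1.5)]
y = [1, 1, -1, -1]
w, b = svm_sgd(X, y)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in X]
```

The weight-decay branch is the "encourage smoothness" term from slide 13; the hinge branch is the margin requirement from slide 16.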

18 Viola 2003 Constrained Optimization 2

19 Viola 2003 SVM: examples

20 Viola 2003 SVM: Key Ideas
– Augment inputs with a very large feature set (polynomials, etc.)
– Use the kernel trick to do this efficiently
– Enforce/encourage smoothness with a weight penalty
– Introduce the margin
– Find the best solution using quadratic programming

21 Viola 2003 SVM: Zip Code Recognition Data dimension: 256. Feature space: 4th order – roughly 100,000,000 dimensions.

22 Viola 2003 The Classical Face Detection Process Scan the image from the smallest scale up to larger scales: about 50,000 locations/scales per image.

23 Viola 2003 Classifier is Learned from Labeled Data Training data: 5,000 faces (all frontal, normalized for scale and translation) and 10^8 non-faces. The faces span many variations – across individuals, illumination, and pose (rotation both in plane and out).

24 Viola 2003 Key Properties of Face Detection Each image contains 10,000 to 50,000 locations/scales. Faces are rare: 0 to 50 per image – roughly 1,000 times as many non-faces as faces. So an extremely small false-positive rate is required: about 10^-6 per scanned window.
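A back-of-the-envelope check of the slide's numbers: with roughly 50,000 scanned windows per image and a tolerated false-positive rate of about 10^-6 per window, the detector emits about 0.05 false positives per image, i.e. one false detection every ~20 images.

```python
# Illustrative arithmetic using the slide's order-of-magnitude figures.
windows_per_image = 50_000
fp_rate_per_window = 1e-6

fp_per_image = windows_per_image * fp_rate_per_window      # ~0.05 per image
images_per_false_positive = 1 / fp_per_image               # ~one in 20 images

assert abs(fp_per_image - 0.05) < 1e-9
assert abs(images_per_false_positive - 20.0) < 1e-6
```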

25 Viola 2003 Sung and Poggio

26 Viola 2003 Rowley, Baluja & Kanade First Fast System - Low Res to Hi

27 Viola 2003 Osuna, Freund, and Girosi

28 Viola 2003 Support Vectors

29 Viola 2003 P, O, & G: First Pedestrian Work

30 Viola 2003 On to AdaBoost Given a set of weak classifiers –None much better than random Iteratively combine classifiers –Form a linear combination –Training error converges to 0 quickly –Test error is related to training margin

31 Viola 2003 AdaBoost (Freund & Schapire) Weak classifier 1 → weights of misclassified examples increased → weak classifier 2 → reweight → weak classifier 3. The final classifier is a linear combination of the weak classifiers.

32 Viola 2003 AdaBoost Properties

33 Viola 2003 AdaBoost: Super Efficient Feature Selector Features = weak classifiers. Each round selects the optimal feature given the previously selected features and the exponential loss.
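To make the feature-selection view concrete, here is a minimal AdaBoost sketch whose weak classifiers are one-dimensional threshold "stumps" (each stump plays the role of one feature). The dataset is made up and deliberately not separable by any single threshold, so boosting must combine several stumps.

```python
import math

def best_stump(xs, ys, weights):
    """Select the threshold/polarity with minimum weighted error (one 'feature')."""
    best = None
    for thresh in xs:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (polarity if x >= thresh else -polarity) != y)
            if best is None or err < best[0]:
                best = (err, thresh, polarity)
    return best

def adaboost(xs, ys, rounds=6):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, thresh, pol = best_stump(xs, ys, weights)
        err = max(err, 1e-10)                       # guard against zero error
        alpha = 0.5 * math.log((1 - err) / err)     # classifier weight
        ensemble.append((alpha, thresh, pol))
        # Reweight: misclassified examples get more weight next round.
        weights = [w * math.exp(-alpha * y * (pol if x >= thresh else -pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score > 0 else -1

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [-1, -1, 1, 1, 1, -1]          # no single threshold gets this right
model = adaboost(xs, ys)
preds = [predict(model, x) for x in xs]
```

Each round is exactly the slide's loop: pick the best feature given the current weights, then reweight so the next round focuses on the remaining mistakes.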

34 Viola 2003 Boosted Face Detection: Image Features "Rectangle filters", similar to Haar wavelets (Papageorgiou et al.); each thresholded filter gives a unique binary feature.
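A sketch of the integral-image trick that makes rectangle filters cheap: after one pass over the image, any rectangle sum costs four array lookups, so a two-rectangle filter costs eight. The tiny image is made up for illustration.

```python
def integral_image(img):
    """ii[y][x] = sum of img over rows < y and cols < x (zero-padded top/left)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle via four integral-image lookups."""
    return (ii[top + height][left + width] - ii[top][left + width]
            - ii[top + height][left] + ii[top][left])

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)

# A two-rectangle filter: left column minus right column of a 2x2 patch.
feature = rect_sum(ii, 0, 0, 2, 1) - rect_sum(ii, 0, 1, 2, 1)   # (1+4) - (2+5) = -2
```

Thresholding `feature` yields one of the binary weak classifiers that AdaBoost selects among.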

35 Viola 2003


37 Viola 2003 Feature Selection For each round of boosting:
– Evaluate each rectangle filter on each example
– Sort examples by filter value
– Select the best threshold for each filter (minimum Z)
– Select the best filter/threshold pair (= the feature)
– Reweight the examples
With M filters, T thresholds, N examples, and learning time L: the naïve wrapper method costs O(MT · L(MTN)), while the AdaBoost feature selector costs O(MN).

38 Viola 2003 Example Classifier for Face Detection A classifier with 200 rectangle features was learned using AdaBoost: 95% correct detection on the test set, with 1 false positive per 14,084 windows (see the ROC curve for the 200-feature classifier). Not quite competitive…

39 Viola 2003 Building Fast Classifiers Given a nested set of classifier hypothesis classes: computational risk minimization. Each image sub-window passes through a cascade of classifiers (1, 2, 3, …); a rejection at any stage labels the window NON-FACE, and only windows accepted by every stage are labeled FACE. The accompanying ROC sketch plots % detection against % false positives; the false-negative rate is determined by the chosen operating point.

40 Viola 2003 Other Fast Classification Work Simard Rowley (Faces) Fleuret & Geman (Faces)

41 Viola 2003 Cascaded Classifier The cascade: a 1-feature stage, then a 5-feature stage, then a 20-feature stage; each rejection is labeled NON-FACE. A 1-feature classifier achieves a 100% detection rate with about a 50% false-positive rate. A 5-feature classifier achieves a 100% detection rate with a 40% false-positive rate (20% cumulative), using data from the previous stage. A 20-feature classifier achieves a 100% detection rate with a 10% false-positive rate (2% cumulative).
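The slide's cumulative numbers follow directly because per-stage false-positive rates multiply down the cascade (each stage only sees what the previous stage passed), while the detection rate is held at 100% by design:

```python
# Reproducing the cascade arithmetic from the slide: 0.5 * 0.4 * 0.1 = 0.02.
stage_fp_rates = [0.5, 0.4, 0.1]    # per-stage false-positive rates

cumulative = 1.0
cumulative_after_stage = []
for fp in stage_fp_rates:
    cumulative *= fp                 # rates multiply stage by stage
    cumulative_after_stage.append(cumulative)

# 50% after stage 1, 20% after stage 2, 2% after stage 3 - the slide's numbers.
assert abs(cumulative_after_stage[0] - 0.50) < 1e-12
assert abs(cumulative_after_stage[1] - 0.20) < 1e-12
assert abs(cumulative_after_stage[2] - 0.02) < 1e-12
```

This is also why the cascade is fast: the cheap 1-feature stage discards half of all windows before the expensive stages ever run.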

42 Viola 2003 Comparison to Other Systems Detection rate (%) at a given number of false detections:

False detections:      10    31    50    65    78    95   110   167   422
Viola-Jones:         78.3  85.2  88.8  90.0  90.1  90.8  91.1  91.8  93.7
Rowley-Baluja-Kanade: 89.9, 90.…
Schneiderman-Kanade: 94.4
Roth-Yang-Ahuja:     (94.8)

43 Viola 2003 Output of Face Detector on Test Images

44 Viola 2003 Solving other “Face” Tasks Facial Feature Localization Demographic Analysis Profile Detection

45 Viola 2003 Feature Localization Surprising properties of our framework: the cost of detection is not a function of image size, just of the number of features; and learning automatically focuses attention on key regions. Conclusion: the "feature" detector can include a large contextual region around the feature.

46 Viola 2003 Feature Localization Features Learned features reflect the task

47 Viola 2003 Profile Detection

48 Viola 2003 More Results

49 Viola 2003 Profile Features

50 Viola 2003 One-Nearest Neighbor One-nearest-neighbor for fitting is described shortly. It is similar to join-the-dots, with two pros and one con. PRO: easy to implement with multivariate inputs. CON: it no longer interpolates locally. PRO: an excellent introduction to instance-based learning. Thanks to Andrew Moore

51 Viola 2003 1-Nearest Neighbor is an example of… instance-based learning, a function approximator that has been around since about 1910. Store the training pairs (x_1, y_1), (x_2, y_2), …, (x_n, y_n); to make a prediction, search the database for similar datapoints and fit with the local points. Four things make a memory-based learner: a distance metric; how many nearby neighbors to look at; a weighting function (optional); and how to fit with the local points. Thanks to Andrew Moore

52 Viola 2003 Nearest Neighbor Four things make a memory-based learner: 1. A distance metric: Euclidean. 2. How many nearby neighbors to look at? One. 3. A weighting function (optional): unused. 4. How to fit with the local points? Just predict the same output as the nearest neighbor. Thanks to Andrew Moore
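The slide's four choices fit in a few lines. A minimal sketch with made-up training points: Euclidean metric, one neighbor, no weighting, copy the neighbor's output.

```python
def squared_dist(a, b):
    """Squared Euclidean distance (the minimizer is the same as for the true distance)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def one_nn_predict(train_x, train_y, query):
    """Return the output of the single nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: squared_dist(train_x[i], query))
    return train_y[nearest]

# Made-up labeled points.
train_x = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
train_y = ["a", "a", "b"]

assert one_nn_predict(train_x, train_y, (0.2, 0.1)) == "a"
assert one_nn_predict(train_x, train_y, (3.5, 3.9)) == "b"
```

Skipping the square root is a standard trick: it does not change which neighbor is nearest.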

53 Viola 2003 Multivariate Distance Metrics Suppose the input vectors x_1, x_2, …, x_N are two-dimensional: x_1 = (x_11, x_12), x_2 = (x_21, x_22), …, x_N = (x_N1, x_N2). One can draw the nearest-neighbor regions in input space. Compare Dist(x_i, x_j)^2 = (x_i1 – x_j1)^2 + (x_i2 – x_j2)^2 with Dist(x_i, x_j)^2 = (x_i1 – x_j1)^2 + (3x_i2 – 3x_j2)^2: the relative scalings in the distance metric affect the region shapes. Thanks to Andrew Moore

54 Viola 2003 Euclidean Distance Metric (formulas shown as figures on the slide). Other metrics: Mahalanobis, rank-based, correlation-based (Stanfill & Waltz; Maes' Ringo system, …). Thanks to Andrew Moore

55 Viola 2003 Notable Distance Metrics Thanks to Andrew Moore

56 Viola 2003 Simard: Tangent Distance

57 Viola 2003 Simard: Tangent Distance

58 Viola 2003 FERET Photobook Moghaddam & Pentland (1995) Thanks to Baback Moghaddam

59 Viola 2003 Eigenfaces Moghaddam & Pentland (1995) Normalized Eigenfaces Thanks to Baback Moghaddam

60 Viola 2003 Euclidean (Standard) “Eigenfaces” Turk & Pentland (1992) Moghaddam & Pentland (1995) Projects all the training faces onto a universal eigenspace to “encode” variations (“modes”) via principal components (PCA) Uses inverse-distance as a similarity measure for matching & recognition Thanks to Baback Moghaddam

61 Viola 2003 Euclidean Similarity Measures Metric (distance-based) similarity measures: template matching, normalized correlation, etc. Disadvantages: they assume isotropic variation (that all variations are equi-probable), and they cannot distinguish incidental changes from critical ones. This is particularly bad for face recognition, in which so many variations – for example, lighting and expression – are incidental! Thanks to Baback Moghaddam

62 Viola 2003 PCA-Based Density Estimation (Moghaddam & Pentland, ICCV'95) Perform PCA and factorize into (orthogonal) Gaussian subspaces; solve for the minimal KL-divergence residual for the orthogonal subspace. See Tipping & Bishop (1997) for an ML derivation within a more general factor-analysis framework (PPCA). Thanks to Baback Moghaddam

63 Viola 2003 Bayesian Face Recognition Moghaddam et al ICPR’96, FG’98, NIPS’99, ICCV’99 Moghaddam ICCV’95 PCA-based density estimation Intrapersonal Extrapersonal dual subspaces for dyads (image pairs) Equate “similarity” with posterior on Thanks to Baback Moghaddam

64 Viola 2003 Intra-Extra (Dual) Subspaces specs lightmouthsmile Intra Extra Standard PCA Thanks to Baback Moghaddam

65 Viola 2003 Intra-Extra Subspace Geometry Two “pancake” subspaces with different orientations intersecting near the origin. If each is in fact Gaussian, then the optimal discriminant is hyperquadratic Thanks to Baback Moghaddam

66 Viola 2003 Bayesian Similarity Measure Bayesian (MAP) similarity: priors can be adjusted to reflect operational settings, or used for Bayesian fusion (evidential "belief" from another level of inference). Likelihood (ML) similarity: intra-only (ML) recognition is only slightly inferior to MAP (by a few percent). Therefore, if you had to pick only one subspace to work in, you should pick Intra – and not standard eigenfaces! Thanks to Baback Moghaddam

67 Viola 2003 FERET Identification: Pre-Test Bayesian (Intra-Extra) Standard (Eigenfaces) Thanks to Baback Moghaddam

68 Viola 2003 Official 1996 FERET Test Bayesian (Intra-Extra) Standard (Eigenfaces) Thanks to Baback Moghaddam

69 Viola 2003 One-Nearest Neighbor Objection: that noise-fitting is really objectionable. What's the most obvious way of dealing with it? …Let's leave distance metrics for now, and go back to… Thanks to Andrew Moore

70 Viola 2003 k-Nearest Neighbor Four things make a memory-based learner: 1. A distance metric: Euclidean. 2. How many nearby neighbors to look at? k. 3. A weighting function (optional): unused. 4. How to fit with the local points? Just predict the average output among the k nearest neighbors. Thanks to Andrew Moore
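The only change from 1-NN is step 4: average the k nearest outputs instead of copying one. A 1-D sketch with made-up data shows the smoothing effect the next slide discusses:

```python
def knn_regress(train_x, train_y, query, k):
    """Average the outputs of the k nearest training points (1-D inputs)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

# Made-up, slightly noisy data near the line y = x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.1, 1.9, 3.2, 4.0]

# k=1 copies the single nearest output; larger k smooths over the noise.
assert knn_regress(xs, ys, 2.1, 1) == 1.9
assert abs(knn_regress(xs, ys, 2.1, 3) - (1.9 + 3.2 + 1.1) / 3) < 1e-12
```

Larger k trades noise-fitting for bias, which is exactly the deficiency catalogued on the next slide.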

71 Viola 2003 k-Nearest Neighbor (here k=9) k-nearest-neighbor function fitting smoothes away noise, but there are clear deficiencies. A magnificent job of noise-smoothing – three cheers for 9-nearest-neighbor – but the lack of gradients and the jerkiness aren't good. Elsewhere, appalling behavior: it loses all the detail that join-the-dots and 1-nearest-neighbor gave us, yet smears the ends. It fits much less of the noise and captures trends, but is still, frankly, pathetic compared with linear regression. What can we do about all the discontinuities that k-NN gives us? Thanks to Andrew Moore

72 Viola 2003 Kernel Regression Four things make a memory-based learner: 1. A distance metric: scaled Euclidean. 2. How many nearby neighbors to look at? All of them. 3. A weighting function (optional): w_i = exp(-D(x_i, query)^2 / K_w^2). Points near the query are weighted strongly, far points weakly. The K_w parameter is the kernel width – very important. 4. How to fit with the local points? Predict the weighted average of the outputs: predict = Σ w_i y_i / Σ w_i. Thanks to Andrew Moore
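The slide's recipe in a few lines (made-up 1-D data): every training point votes, weighted by exp(-D²/K_w²), and the prediction is the weighted average. It also shows the kernel-width behavior of the following slides: a small K_w hugs nearby points, and as K_w → ∞ the prediction tends to the global average.

```python
import math

def kernel_regress(xs, ys, query, kw):
    """Nadaraya-Watson style kernel regression with a Gaussian-shaped weight."""
    weights = [math.exp(-((x - query) ** 2) / kw ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Made-up data lying exactly on y = x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 2.0, 3.0, 4.0]

# Small kernel width: only the two closest points matter, prediction ~3.5.
local = kernel_regress(xs, ys, 3.5, kw=0.3)
# Huge kernel width: every point weighs (almost) equally, prediction -> mean = 2.0.
global_avg = kernel_regress(xs, ys, 3.5, kw=1e6)

assert abs(local - 3.5) < 1e-3
assert abs(global_avg - 2.0) < 1e-6
```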

73 Viola 2003 Kernel Regression in Pictures Take this dataset…..and do a kernel prediction with x q (query) = 310, K w = 50. Thanks to Andrew Moore

74 Viola 2003 Varying the Query x q = 150x q = 395 Thanks to Andrew Moore

75 Viola 2003 Varying the kernel width Increasing the kernel width K_w means points further away get an opportunity to influence you. As K_w → ∞, the prediction tends to the global average. Shown for x_q = 310 with K_w = 50 (see the double arrow at the top of the diagram), K_w = 100, and K_w = 150. Thanks to Andrew Moore

76 Viola 2003 Kernel Regression Predictions Increasing the kernel width K_w means points further away get an opportunity to influence you. As K_w → ∞, the prediction tends to the global average. Shown for K_w = 10, K_w = 20, and K_w = 80. Thanks to Andrew Moore

77 Viola 2003 Kernel Regression on our test cases With K_w = 1/32 of the x-axis width: it's nice to see a smooth curve at last, but it's rather bumpy – and if K_w gets any higher, the fit is poor. On the second case (also K_w = 1/32): quite splendid – well done, kernel regression – though the author needed to choose the right K_w to achieve this. With K_w = 1/16 of the axis width: nice and smooth, but are the bumps justified, or is this overfitting? Choosing a good K_w is important – not just for kernel regression, but for all the locally weighted learners we're about to see. Thanks to Andrew Moore

78 Viola 2003 Weighting functions Let d = D(x_i, x_query) / K_w. Then here are some commonly used weighting functions… (we use a Gaussian). Thanks to Andrew Moore

79 Viola 2003 Kernel Regression can look bad Even at the best K_w: clearly not capturing the simple structure of the data, with a complete failure to extrapolate at the edges. Even at the best K_w: also much too local. Why wouldn't increasing K_w help? Because then it would all be "smeared". On three noisy linear segments, even the best kernel regression gives poor gradients. Time to try something more powerful… Thanks to Andrew Moore

80 Viola 2003 Locally Weighted Regression Kernel Regression: Take a very very conservative function approximator called AVERAGING. Locally weight it. Locally Weighted Regression: Take a conservative function approximator called LINEAR REGRESSION. Locally weight it. Let’s Review Linear Regression…. Thanks to Andrew Moore

81 Viola 2003 Unweighted Linear Regression You're lying asleep in bed. Then Nature wakes you. YOU: "Oh. Hello, Nature!" NATURE: "I have a coefficient β in mind. I took a bunch of real numbers called x_1, x_2, …, x_N thus: x_1 = 3.1, x_2 = 2, …, x_N = 4.5. For each of them (k = 1, 2, …, N), I generated y_k = βx_k + ε_k, where ε_k is a Gaussian (i.e. normal) random variable with mean 0 and standard deviation σ. The ε_k's were generated independently of each other. Here are the resulting y_k's: y_1 = 5.1, y_2 = 4.2, …, y_N = 10.2." YOU: "Uh-huh." NATURE: "So what do you reckon β is then, eh?" WHAT IS YOUR RESPONSE? Thanks to Andrew Moore
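Nature's puzzle has a standard answer: under the Gaussian-noise model y_k = βx_k + ε_k, the maximum-likelihood estimate coincides with least squares, β̂ = Σ x_k y_k / Σ x_k². The check below uses made-up data generated with β = 2 and zero noise, so the estimate recovers β exactly.

```python
# Maximum-likelihood (= least-squares) estimate of beta for y = beta*x + noise:
#   beta_hat = sum(x_k * y_k) / sum(x_k^2)
# Data are made up: generated with beta = 2 and no noise for the check.

xs = [3.1, 2.0, 4.5]
ys = [2 * x for x in xs]          # Nature chose beta = 2 here

beta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
assert abs(beta_hat - 2.0) < 1e-12
```

With real noise, β̂ is unbiased with variance shrinking as Σ x_k² grows; the locally *weighted* version of this same formula is what the next slide builds.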

82 Viola 2003 Locally Weighted Regression Four things make a memory-based learner: 1. A distance metric: scaled Euclidean. 2. How many nearby neighbors to look at? All of them. 3. A weighting function (optional): w_k = exp(-D(x_k, x_query)^2 / K_w^2). Points near the query are weighted strongly, far points weakly; the K_w parameter is the kernel width. 4. How to fit with the local points? First form a local linear model: find the β that minimizes the locally weighted sum of squared residuals. Then predict y_predict = β^T x_query. Thanks to Andrew Moore
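The recipe above can be sketched directly: weight each training point by exp(-D²/K_w²), then solve the weighted least-squares normal equations for a local line at the query. This 1-D version adds an intercept and uses made-up data.

```python
import math

def lwr_predict(xs, ys, query, kw):
    """Locally weighted regression: fit y = b0 + b1*x with Gaussian weights at query."""
    w = [math.exp(-((x - query) ** 2) / kw ** 2) for x in xs]
    # Weighted least squares via the closed-form 2x2 normal equations.
    sw   = sum(w)
    swx  = sum(wi * x for wi, x in zip(w, xs))
    swy  = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    denom = sw * swxx - swx * swx
    b1 = (sw * swxy - swx * swy) / denom    # local slope
    b0 = (swy - b1 * swx) / sw              # local intercept
    return b0 + b1 * query

# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# On perfectly linear data, LWR reproduces the line at any query point.
assert abs(lwr_predict(xs, ys, 2.5, kw=1.0) - 6.0) < 1e-9
```

Unlike kernel regression's weighted average, the local line can extrapolate and capture gradients at the edges, addressing the failures shown on slide 79.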

83 Viola 2003 How LWR works Linear regression not flexible but trains like lightning. Locally weighted regression is very flexible and fast to train. Query Thanks to Andrew Moore

84 Viola 2003 LWR on our test cases Shown for K_w = 1/16, 1/32, and 1/8 of the x-axis width. Nicer and smoother – but even now, are the bumps justified, or is this overfitting? Thanks to Andrew Moore

85 Viola 2003 Features, Features, Features In almost every case: Good Features beat Good Learning Learning beats No Learning Critical classifier ratio: AdaBoost >> SVM
