
1
Learning and Vision: Discriminative Models. Chris Bishop and Paul Viola (2003).

2
Part I: Fundamentals
Part II: Algorithms and Applications
–Support Vector Machines: face and pedestrian detection
–AdaBoost: faces
–Building fast classifiers: trading off speed for accuracy; face and object detection
–Memory-based learning: Simard; Moghaddam

3
History Lesson
1950s: Perceptrons are cool – very simple learning rule, can learn "complex" concepts; generalized perceptrons are better, but have too many weights
1960s: Perceptrons stink (Minsky and Papert) – some simple concepts require an exponential number of features; can't possibly learn that, right?
1980s: MLPs are cool (Rumelhart and McClelland / PDP) – sort-of-simple learning rule, can learn anything (?); create just the features you need
1990: MLPs stink – hard to train: slow, local minima
1996: Perceptrons are cool again

4
Why did we need multi-layer perceptrons? Problems like this seem to require very complex non-linearities. Minsky and Papert showed that an exponential number of features is necessary to solve generic problems.

5
Why an exponential number of features? A 14th-order fit already needs 120 features; with N = 21 inputs and degree k = 5, roughly 65,000 features.

6
MLPs vs. Perceptrons
MLPs are hard to train:
–Training takes a long time (unpredictably long)
–Training can converge to poor local minima
MLPs are hard to understand: what are they really doing?
Perceptrons are easy to train:
–A type of linear programming; polynomial time
–One minimum, which is global
Generalized perceptrons are easier to understand: polynomial functions.

7
Perceptron Training is Linear Programming. Polynomial time in the number of variables and in the number of constraints. What about linearly inseparable data?
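Besides the linear-programming view, the classic on-line update also trains a perceptron. A minimal sketch on a hypothetical AND-style toy set (all data and names here are illustrative, not from the talk):

```python
# On-line perceptron sketch. Labels are +1/-1; each mistake nudges the
# weight vector toward the misclassified example.

def perceptron_train(examples, epochs=100):
    # examples: list of (feature_list, label) with label in {+1, -1}
    dim = len(examples[0][0])
    w = [0.0] * (dim + 1)                  # last entry is the bias term
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:
            activation = sum(wi * xi for wi, xi in zip(w, x + [1.0]))
            if y * activation <= 0:        # misclassified (or on the boundary)
                for i, xi in enumerate(x + [1.0]):
                    w[i] += y * xi         # classic perceptron update
                mistakes += 1
        if mistakes == 0:                  # converged: data is separated
            break
    return w

def perceptron_predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x + [1.0])) > 0 else -1

# AND-like, linearly separable toy data
data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w = perceptron_train(data)
```

On separable data the update rule is guaranteed to converge, which is why the one global minimum claim on the previous slide matters.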

8
Rebirth of Perceptrons
How to train effectively?
–Linear programming (and, later, quadratic programming)
–Though on-line training works great too
How to get so many features inexpensively?
–The kernel trick
How to generalize with so many features?
–VC dimension (or is it regularization?)
Support Vector Machines

9
Lemma 1: Weight vectors are simple. The weight vector lives in a sub-space spanned by the examples: its dimensionality is determined by the number of examples, not the complexity of the space.

10
Lemma 2: Only need to compare examples

11
Simple Kernels yield Complex Features
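A sketch of this idea: a dual-form (kernel) perceptron stores a mistake count per example and lets the kernel stand in for the explicit feature expansion. With a degree-2 polynomial kernel it separates XOR, which no linear perceptron can. The toy data and names are illustrative:

```python
# Kernel perceptron sketch (hypothetical toy data). alpha[i] counts the
# mistakes on example i; the kernel replaces explicit polynomial features.

def poly_kernel(a, b, degree=2):
    return (1 + sum(ai * bi for ai, bi in zip(a, b))) ** degree

def kernel_perceptron_train(examples, kernel, epochs=200):
    alpha = [0] * len(examples)
    for _ in range(epochs):
        mistakes = 0
        for j, (xj, yj) in enumerate(examples):
            score = sum(a * yi * kernel(xi, xj)
                        for a, (xi, yi) in zip(alpha, examples))
            if yj * score <= 0:            # mistake: remember this example
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def kernel_predict(alpha, examples, kernel, x):
    score = sum(a * yi * kernel(xi, x)
                for a, (xi, yi) in zip(alpha, examples))
    return 1 if score > 0 else -1

# XOR is not linearly separable, but the degree-2 kernel's implicit
# feature space (which includes the x1*x2 cross term) separates it.
xor = [([0, 0], 1), ([1, 1], 1), ([0, 1], -1), ([1, 0], -1)]
alpha = kernel_perceptron_train(xor, poly_kernel)
```

Note that training and prediction only ever compare examples through the kernel, exactly as Lemmas 1 and 2 promise.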

12
But Kernel Perceptrons Can Generalize Poorly

13
Perceptron Rebirth: Generalization. Too many features… Occam is unhappy. Perhaps we should encourage smoothness?

14
The Linear Program is not unique: it can return any multiple of the correct weight vector. Slack variables and a weight prior force the solution toward zero.

15
Definition of the Margin. Geometric margin: the gap between negatives and positives, measured perpendicular to a hyperplane. [Figure: classifier margin.]

16
Require a non-zero margin. The original constraint allows solutions with zero margin; the revised constraint enforces a non-zero margin between the examples and the decision boundary.

17
Constrained Optimization. Find the smoothest function that separates the data: quadratic programming (similar to linear programming), with a single global minimum and a polynomial-time algorithm.

18
Constrained Optimization 2

19
SVM: examples

20
SVM: Key Ideas
–Augment inputs with a very large feature set (polynomials, etc.)
–Use the kernel trick to do this efficiently
–Enforce/encourage smoothness with a weight penalty
–Introduce a margin
–Find the best solution using quadratic programming
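The slides solve the SVM with quadratic programming; as a rough illustration of the same ingredients (margin plus weight penalty), here is a hinge-loss subgradient sketch of a linear SVM on hypothetical 1-D data. It regularizes the bias too, a simplification; everything here is illustrative, not the QP formulation from the talk:

```python
# Stochastic subgradient descent on the regularized hinge loss:
#   minimize  lam/2 * ||w||^2 + mean(max(0, 1 - y * (w . x + b)))

def svm_train(examples, lam=0.01, lr=0.1, epochs=200):
    dim = len(examples[0][0])
    w = [0.0] * (dim + 1)                    # last entry is the bias
    for _ in range(epochs):
        for x, y in examples:
            xb = x + [1.0]
            margin = y * sum(wi * xi for wi, xi in zip(w, xb))
            for i in range(dim + 1):
                # hinge term is active only inside the margin
                hinge = -y * xb[i] if margin < 1 else 0.0
                w[i] -= lr * (lam * w[i] + hinge)
    return w

def svm_predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x + [1.0])) > 0 else -1

points = [([2.0], 1), ([3.0], 1), ([-1.0], -1), ([-2.0], -1)]
w = svm_train(points)
```

The weight penalty (the lam term) plays the "encourage smoothness" role; the hinge threshold of 1 enforces the margin.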

21
SVM: Zip Code Recognition. Data dimension: 256. Feature space: 4th order, roughly 100,000,000 dimensions.

22
The Classical Face Detection Process. A window is scanned over the image at roughly 50,000 locations and scales, from the smallest scale up to larger scales.

23
Classifier is Learned from Labeled Data
Training data:
–5,000 faces, all frontal, normalized for scale and translation
–10^8 non-faces
–Many variations: across individuals, illumination, pose (rotation both in plane and out of plane)

24
Key Properties of Face Detection. Each image contains thousands of locations and scales. Faces are rare per image: roughly 1,000 times as many non-faces as faces. An extremely small false positive rate is required: about 10^-6.

25
Sung and Poggio

26
Rowley, Baluja & Kanade: the first fast system, scanning from low resolution to high.

27
Osuna, Freund, and Girosi

28
Support Vectors

29
P, O, & G: First Pedestrian Work

30
On to AdaBoost
Given a set of weak classifiers, none much better than random, iteratively combine them:
–Form a linear combination
–Training error converges to 0 quickly
–Test error is related to the training margin

31
AdaBoost (Freund & Schapire). [Figure: weak classifier 1 is trained; the weights of misclassified examples are increased; weak classifiers 2 and 3 are trained on the reweighted data.] The final classifier is a linear combination of the weak classifiers.

32
AdaBoost Properties

33
AdaBoost: Super-Efficient Feature Selector. Features = weak classifiers. Each round selects the optimal feature given the previously selected features and the exponential loss.
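A minimal AdaBoost sketch with decision stumps as the weak classifiers/features, on a hypothetical 1-D dataset (+/-/+) that no single stump can classify. Exhaustive stump search stands in for the rectangle-filter pool; the half-unit threshold offsets assume unit-spaced toy data:

```python
import math

def stump_predict(stump, x):
    feat, thresh, sign = stump
    return sign if x[feat] > thresh else -sign

def adaboost_train(examples, rounds=3):
    n = len(examples)
    dim = len(examples[0][0])
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None                          # (weighted error, stump)
        for feat in range(dim):
            for value in sorted({x[feat] for x, _ in examples}):
                for sign in (1, -1):
                    stump = (feat, value - 0.5, sign)
                    err = sum(w for w, (x, y) in zip(weights, examples)
                              if stump_predict(stump, x) != y)
                    if best is None or err < best[0]:
                        best = (err, stump)
        err, stump = best
        err = max(err, 1e-10)                # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Reweight: misclassified examples gain weight, correct ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for w, (x, y) in zip(weights, examples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score > 0 else -1

# No single stump classifies this +/-/+ interval; three boosted stumps do.
data = [([0.0], 1), ([1.0], -1), ([2.0], 1)]
ensemble = adaboost_train(data, rounds=3)
```

Each round picks the stump that is optimal under the current weights, which is exactly the "feature selector" view on this slide.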

34
Boosted Face Detection: Image Features. "Rectangle filters", similar to Haar wavelets (Papageorgiou et al.), yield a large pool of unique binary features.
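Rectangle filters are cheap because of the integral image: any rectangle sum takes four array lookups, independent of the rectangle's size. A sketch of the standard construction (the image and filter geometry below are illustrative, not data from the talk):

```python
def integral_image(img):
    # ii[y][x] holds the sum of img over rows < y and columns < x,
    # with a zero row and column of padding.
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    # Sum over img[y:y+h][x:x+w] using exactly 4 lookups.
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def two_rect_feature(ii, x, y, w, h):
    # A horizontal two-rectangle filter: left half minus right half.
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12]]
ii = integral_image(img)
```

Because every filter evaluation is a handful of lookups, the huge filter pool on the next slides stays computationally feasible.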

35

36

37
Feature Selection
For each round of boosting:
–Evaluate each rectangle filter on each example
–Sort the examples by filter value
–Select the best threshold for each filter (minimizing Z)
–Select the best filter/threshold pair (= feature)
–Reweight the examples
With M filters, T thresholds, N examples, and learning time L:
–Naïve wrapper method: O(MT · L(MTN))
–AdaBoost feature selector: O(MN)
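The "sort, then pick the best threshold" step can be sketched as a single pass over sorted feature values, tracking the weighted error of both polarities. The function name and tie handling are illustrative; a fuller implementation would place thresholds only between distinct values:

```python
def best_stump(values, labels, weights):
    # One pass over sorted feature values. Polarity +1 means "predict +1
    # when value > threshold"; we track the class weight below the threshold.
    total_pos = sum(w for w, y in zip(weights, labels) if y == 1)
    total_neg = sum(w for w, y in zip(weights, labels) if y == -1)
    pos_below = neg_below = 0.0
    # Threshold below every value: one polarity errs on all negatives,
    # the other on all positives.
    best = min((total_neg, float('-inf'), 1), (total_pos, float('-inf'), -1))
    for v, y, w in sorted(zip(values, labels, weights)):
        if y == 1:
            pos_below += w
        else:
            neg_below += w
        err_plus = pos_below + (total_neg - neg_below)   # polarity +1
        err_minus = neg_below + (total_pos - pos_below)  # polarity -1
        best = min(best, (err_plus, v, 1), (err_minus, v, -1))
    return best  # (weighted error, threshold, polarity)
```

After one O(N log N) sort per filter, the scan itself is linear in N, which is where the cheap per-round cost comes from.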

38
Example Classifier for Face Detection. A classifier with 200 rectangle features was learned using AdaBoost: 95% correct detection on the test set at a low false-positive rate (ROC curve shown for the 200-feature classifier). Not quite competitive…

39
Building Fast Classifiers. Given a nested set of classifier hypothesis classes: computational risk minimization. The trade-off between false negatives and computation is determined by the % false positives and % detection at each stage. [Figure: an image sub-window passes through Classifiers 1, 2, and 3 in turn; a False at any stage rejects it as NON-FACE, and only windows passing every stage are labeled FACE.]

40
Other Fast Classification Work: Simard; Rowley (faces); Fleuret & Geman (faces).

41
Cascaded Classifier
[Figure: IMAGE SUB-WINDOW → 1-feature classifier (passes 50%) → 5-feature classifier (passes 20% cumulative) → 20-feature classifier (passes 2% cumulative) → FACE; any rejection goes to NON-FACE.]
A 1-feature classifier achieves a 100% detection rate and about a 50% false positive rate.
A 5-feature classifier achieves a 100% detection rate and a 40% false positive rate (20% cumulative), using data from the previous stage.
A 20-feature classifier achieves a 100% detection rate with a 10% false positive rate (2% cumulative).
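The cascade's control flow is just early rejection. A sketch with a hypothetical stage interface (each stage is a scoring function plus a threshold; the dict-based "window" is purely illustrative):

```python
def cascade_classify(stages, window):
    # stages: list of (score_fn, threshold). A window must pass every stage
    # to be declared a face; most windows are rejected by the first,
    # cheapest stage, so the average cost per window stays low.
    for score, threshold in stages:
        if score(window) < threshold:
            return False        # early rejection: later stages never run
    return True

# Hypothetical stages scoring a dict-based "window".
stages = [
    (lambda w: w["cheap_score"], 0.5),
    (lambda w: w["expensive_score"], 0.5),
]
```

With the cumulative pass rates on this slide, the expected work per non-face window is roughly 1 + 0.5·5 + 0.2·20 = 7.5 feature evaluations, versus 26 if every stage always ran.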

42
Comparison to Other Systems. [Table: detection rate versus number of false detections for the Viola-Jones detector, Rowley-Baluja-Kanade, Schneiderman-Kanade (94.4), and Roth-Yang-Ahuja (94.8).]

43
Output of Face Detector on Test Images

44
Solving other "Face" Tasks: facial feature localization, demographic analysis, profile detection.

45
Feature Localization. Surprising properties of our framework: the cost of detection is not a function of image size, just of the number of features; and learning automatically focuses attention on key regions. Conclusion: the "feature" detector can include a large contextual region around the feature.

46
Feature Localization Features: the learned features reflect the task.

47
Profile Detection

48
More Results

49
Profile Features

50
One-Nearest Neighbor
One-nearest-neighbor for function fitting is described shortly. Similar to join-the-dots, with two pros and one con:
PRO: It is easy to implement with multivariate inputs.
CON: It no longer interpolates locally.
PRO: An excellent introduction to instance-based learning.
Thanks to Andrew Moore

51
Nearest Neighbor is an example of… instance-based learning, a function approximator that has been around for a long time: to make a prediction, search the database of stored pairs (x1, y1), (x2, y2), …, (xn, yn) for similar datapoints, and fit with the local points.
Four things make a memory-based learner:
1. A distance metric
2. How many nearby neighbors to look at
3. A weighting function (optional)
4. How to fit with the local points
Thanks to Andrew Moore

52
Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean
2. How many nearby neighbors to look at? One
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the same output as the nearest neighbor.
Thanks to Andrew Moore
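Those four choices collapse into a few lines of code. A minimal sketch with hypothetical toy data (the function name and data are illustrative):

```python
def one_nearest_neighbor(train, query):
    # train: list of (input_vector, output); returns the stored output of
    # the closest point under squared Euclidean distance.
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda pair: sq_dist(pair[0], query))
    return label

train = [([0.0, 0.0], -1), ([5.0, 5.0], 1)]
```

There is no training phase at all: the "learner" is the stored dataset plus the distance metric.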

53
Multivariate Distance Metrics. Suppose the input vectors x1, x2, …, xN are two-dimensional: x1 = (x11, x12), x2 = (x21, x22), …, xN = (xN1, xN2). One can draw the nearest-neighbor regions in input space. Compare Dist(xi, xj) = (xi1 - xj1)^2 + (xi2 - xj2)^2 with Dist(xi, xj) = (xi1 - xj1)^2 + (3xi2 - 3xj2)^2: the relative scalings in the distance metric affect region shapes. Thanks to Andrew Moore
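A tiny numeric sketch of why rescaling matters: the same query can have a different nearest neighbor under a rescaled metric. The points and scale factors below are illustrative:

```python
def scaled_sq_dist(a, b, scales):
    # Rescaling dimension i by s_i multiplies its contribution by s_i^2.
    return sum((s * (ai - bi)) ** 2 for s, ai, bi in zip(scales, a, b))

q, p1, p2 = [0.0, 0.0], [2.0, 0.0], [0.0, 3.0]
```

With unit scales, p1 is nearer to q; after tripling the weight of the first dimension, p2 becomes the nearer point.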

54
Euclidean Distance Metric: D(x, x')^2 = sum_i sigma_i^2 (x_i - x'_i)^2, or equivalently D(x, x')^2 = (x - x')^T Sigma (x - x'), where Sigma is the diagonal matrix of the sigma_i^2. Other metrics: Mahalanobis, rank-based, correlation-based (Stanfill & Waltz; Maes' Ringo system). Thanks to Andrew Moore

55
Notable Distance Metrics Thanks to Andrew Moore

56
Simard: Tangent Distance

57
Simard: Tangent Distance

58
FERET Photobook Moghaddam & Pentland (1995) Thanks to Baback Moghaddam

59
Eigenfaces. Moghaddam & Pentland (1995): normalized eigenfaces. Thanks to Baback Moghaddam

60
Euclidean (Standard) "Eigenfaces" (Turk & Pentland, 1992; Moghaddam & Pentland, 1995). Projects all the training faces onto a universal eigenspace to "encode" variations ("modes") via principal components (PCA). Uses inverse distance as a similarity measure for matching and recognition. Thanks to Baback Moghaddam

61
Euclidean Similarity Measures
Metric (distance-based) similarity measures: template matching, normalized correlation, etc.
Disadvantages:
–Assumes isotropic variation (that all variations are equi-probable)
–Cannot distinguish incidental changes from critical ones
–Particularly bad for face recognition, in which so many changes are incidental: for example, lighting and expression
Thanks to Baback Moghaddam

62
PCA-Based Density Estimation (Moghaddam & Pentland, ICCV'95). Perform PCA and factorize the density into (orthogonal) Gaussian subspaces; solve for the minimal KL-divergence residual for the orthogonal subspace. See Tipping & Bishop (1997) for an ML derivation within a more general factor-analysis framework (PPCA). Thanks to Baback Moghaddam

63
Bayesian Face Recognition (Moghaddam et al. ICPR'96, FG'98, NIPS'99, ICCV'99; Moghaddam ICCV'95). PCA-based density estimation over dual subspaces, intrapersonal and extrapersonal, built from dyads (image pairs); "similarity" is equated with the posterior probability of the intrapersonal class. Thanks to Baback Moghaddam

64
Intra-Extra (Dual) Subspaces. [Figure: leading eigenvectors of the Intra, Extra, and standard PCA subspaces; the intra modes capture variations such as specs, lighting, mouth, and smile.] Thanks to Baback Moghaddam

65
Intra-Extra Subspace Geometry. Two "pancake" subspaces with different orientations, intersecting near the origin. If each is in fact Gaussian, then the optimal discriminant is hyperquadratic. Thanks to Baback Moghaddam

66
Bayesian Similarity Measure
Bayesian (MAP) similarity: priors can be adjusted to reflect operational settings, or used for Bayesian fusion (evidential "belief" from another level of inference).
Likelihood (ML) similarity: intra-only (ML) recognition is only slightly inferior to MAP (by a few %). Therefore, if you had to pick only one subspace to work in, pick Intra, not standard eigenfaces!
Thanks to Baback Moghaddam

67
FERET Identification: Pre-Test. [Chart: Bayesian (Intra-Extra) versus Standard (Eigenfaces).] Thanks to Baback Moghaddam

68
Official 1996 FERET Test. [Chart: Bayesian (Intra-Extra) versus Standard (Eigenfaces).] Thanks to Baback Moghaddam

69
One-Nearest Neighbor. Objection: that noise-fitting is really objectionable. What's the most obvious way of dealing with it? Let's leave distance metrics for now and go back to function fitting. Thanks to Andrew Moore

70
k-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean
2. How many nearby neighbors to look at? k
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the average output among the k nearest neighbors.
Thanks to Andrew Moore
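The change from 1-NN is one line: average the k nearest outputs instead of copying the single nearest one. A sketch with hypothetical 1-D data:

```python
def knn_predict(train, query, k):
    # train: list of (input_vector, output); average the outputs of the
    # k nearest stored points under squared Euclidean distance.
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda pair: sq_dist(pair[0], query))[:k]
    return sum(y for _, y in nearest) / k

train = [([0.0], 0.0), ([1.0], 1.0), ([2.0], 2.0), ([10.0], 10.0)]
```

Averaging smooths the noise, at the cost of the discontinuities the next slide complains about.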

71
k-Nearest Neighbor (here k = 9)
k-nearest-neighbor function fitting smooths away noise, but there are clear deficiencies:
–A magnificent job of noise-smoothing; three cheers for 9-nearest-neighbor. But the lack of gradients and the jerkiness isn't good.
–Appalling behavior! Loses all the detail that join-the-dots and 1-nearest-neighbor gave us, yet smears the ends.
–Fits much less of the noise and captures trends, but is still, frankly, pathetic compared with linear regression.
What can we do about all the discontinuities that k-NN gives us?
Thanks to Andrew Moore

72
Kernel Regression
Four things make a memory-based learner:
1. A distance metric: scaled Euclidean
2. How many nearby neighbors to look at? All of them
3. A weighting function (optional): w_i = exp(-D(x_i, query)^2 / K_w^2). Nearby points are weighted strongly, far points weakly. The parameter K_w is the kernel width; it is very important.
4. How to fit with the local points? Predict the weighted average of the outputs: predict = (sum_i w_i y_i) / (sum_i w_i)
Thanks to Andrew Moore
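The recipe above translates directly into code. A sketch with hypothetical data (names and data are illustrative):

```python
import math

def kernel_regress(train, query, kw):
    # Gaussian weights: w_i = exp(-D(x_i, query)^2 / kw^2); the prediction
    # is the weighted mean of the stored outputs.
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    weights = [math.exp(-sq_dist(x, query) / kw ** 2) for x, _ in train]
    return sum(w * y for w, (_, y) in zip(weights, train)) / sum(weights)

train = [([0.0], 0.0), ([2.0], 2.0)]
```

Because every point contributes with a smoothly decaying weight, the prediction varies continuously with the query, unlike k-NN; and as kw grows, the prediction tends toward the global average, as the later slides show.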

73
Kernel Regression in Pictures. Take this dataset… and do a kernel prediction with query x_q = 310 and K_w = 50. Thanks to Andrew Moore

74
Varying the Query. [Panels: x_q = 150 and x_q = 395.] Thanks to Andrew Moore

75
Varying the Kernel Width. Increasing the kernel width K_w means points further away get an opportunity to influence you. As K_w goes to infinity, the prediction tends to the global average. [Panels: x_q = 310 with K_w = 50 (see the double arrow at the top of the diagram), K_w = 100, and K_w = 150.] Thanks to Andrew Moore

76
Kernel Regression Predictions. Increasing the kernel width K_w means points further away get an opportunity to influence you. As K_w goes to infinity, the prediction tends to the global average. [Panels: K_w = 10, 20, 80.] Thanks to Andrew Moore

77
Kernel Regression on our test cases
–K_w = 1/32 of x-axis width: it's nice to see a smooth curve at last, but it is rather bumpy. If K_w gets any higher, the fit is poor.
–K_w = 1/32 of x-axis width: quite splendid. Well done, kernel regression. The author needed to choose the right K_w to achieve this.
–K_w = 1/16 of x-axis width: nice and smooth, but are the bumps justified, or is this overfitting?
Choosing a good K_w is important, not just for kernel regression, but for all the locally weighted learners we're about to see.
Thanks to Andrew Moore

78
Weighting Functions. Let d = D(x_i, x_query) / K_w. Then here are some commonly used weighting functions (we use a Gaussian). Thanks to Andrew Moore

79
Kernel Regression can look bad
–At its best K_w: clearly not capturing the simple structure of the data; note the complete failure to extrapolate at the edges.
–At its best K_w: also much too local. Why wouldn't increasing K_w help? Because then it would all be "smeared".
–At its best K_w: three noisy linear segments, but even the best kernel regression gives poor gradients.
Time to try something more powerful…
Thanks to Andrew Moore

80
Locally Weighted Regression. Kernel regression: take a very, very conservative function approximator called AVERAGING, and locally weight it. Locally weighted regression: take a conservative function approximator called LINEAR REGRESSION, and locally weight it. Let's review linear regression… Thanks to Andrew Moore

81
Unweighted Linear Regression
You're lying asleep in bed. Then Nature wakes you.
YOU: "Oh. Hello, Nature!"
NATURE: "I have a coefficient β in mind. I took a bunch of real numbers called x_1, x_2, …, x_N thus: x_1 = 3.1, x_2 = 2, …, x_N = 4.5. For each of them (k = 1, 2, …, N), I generated y_k = βx_k + ε_k, where ε_k is a Gaussian (i.e. Normal) random variable with mean 0 and standard deviation σ. The ε_k's were generated independently of each other. Here are the resulting y_k's: y_1 = 5.1, y_2 = 4.2, …, y_N = 10.2."
YOU: "Uh-huh."
NATURE: "So what do you reckon β is then, eh?"
WHAT IS YOUR RESPONSE?
Thanks to Andrew Moore
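The maximum-likelihood answer: with Gaussian noise, the β that maximizes the likelihood is the one that minimizes the sum of squared residuals, giving β = (sum_k x_k y_k) / (sum_k x_k^2). A sketch with hypothetical, noise-free data where the answer is exact:

```python
def mle_beta(xs, ys):
    # Maximum-likelihood estimate for y = beta * x + Gaussian noise:
    # minimizing sum_k (y_k - beta * x_k)^2 gives
    # beta = sum(x * y) / sum(x * x).
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Noise-free data generated with beta = 2, so the estimate recovers it.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
```

This closed form is the building block that the next slide localizes with a weighting function.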

82
Locally Weighted Regression
Four things make a memory-based learner:
1. A distance metric: scaled Euclidean
2. How many nearby neighbors to look at? All of them
3. A weighting function (optional): w_k = exp(-D(x_k, x_query)^2 / K_w^2). Nearby points are weighted strongly, far points weakly. The parameter K_w is the Kernel Width.
4. How to fit with the local points? First form a local linear model: find the β that minimizes the locally weighted sum of squared residuals. Then predict y_predict = β^T x_query.
Thanks to Andrew Moore
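For 1-D inputs the weighted fit has a closed form, so the whole learner fits in a few lines. A sketch on hypothetical, exactly linear data (where LWR recovers the line regardless of the weights):

```python
import math

def lwr_predict(train, query, kw):
    # 1-D locally weighted linear regression: fit y = b0 + b1 * x with
    # weights w_k = exp(-(x_k - query)^2 / kw^2) via weighted least
    # squares, then evaluate the local line at the query point.
    w = [math.exp(-(x - query) ** 2 / kw ** 2) for x, _ in train]
    sw = sum(w)
    mx = sum(wi * x for wi, (x, _) in zip(w, train)) / sw
    my = sum(wi * y for wi, (_, y) in zip(w, train)) / sw
    num = sum(wi * (x - mx) * (y - my) for wi, (x, y) in zip(w, train))
    den = sum(wi * (x - mx) ** 2 for wi, (x, _) in zip(w, train))
    b1 = num / den if den else 0.0          # weighted slope
    b0 = my - b1 * mx                       # weighted intercept
    return b0 + b1 * query

# Points on y = 2x + 1, so the prediction at 1.5 should be 4.
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
```

A new local line is fit for every query, which is what makes LWR flexible while each individual fit stays as simple as linear regression.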

83
How LWR works. Linear regression is not flexible, but trains like lightning. Locally weighted regression is very flexible and fast to train. [Figure: a local linear model is fit around the query point.] Thanks to Andrew Moore

84
LWR on our test cases. [Panels: K_w = 1/16, 1/32, and 1/8 of x-axis width.] Nicer and smoother, but even now, are the bumps justified, or is this overfitting? Thanks to Andrew Moore

85
Features, Features, Features. In almost every case: good features beat good learning, and learning beats no learning. Critical classifier ratio: AdaBoost >> SVM.
