1 Learning and Vision: Discriminative Models
Chris Bishop and Paul Viola
2 Part II: Algorithms and Applications
Part I: Fundamentals
Part II: Algorithms and Applications
  Support Vector Machines
    Face and pedestrian detection
  AdaBoost
    Faces
  Building Fast Classifiers
    Trading off speed for accuracy…
    Face and object detection
  Memory Based Learning
    Simard
    Moghaddam
3 History Lesson
1950s: Perceptrons are cool
  Very simple learning rule, can learn “complex” concepts
  Generalized perceptrons are better -- too many weights
1960s: Perceptrons stink (M+P)
  Some simple concepts require an exponential # of features
  Can’t possibly learn that, right?
1980s: MLPs are cool (R+M / PDP)
  Sort of simple learning rule, can learn anything (?)
  Create just the features you need
MLPs stink
  Hard to train: Slow / Local Minima
Perceptrons are cool
4 Why did we need multi-layer perceptrons?
Problems like this seem to require very complex non-linearities.
Minsky and Papert showed that an exponential number of features is necessary to solve generic problems.
5 Why an exponential number of features?
14th order??? 120 features
N=21, k=5 --> 65,000 features
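The counts on this slide can be reproduced under one reading, which is an assumption rather than something the slide states: the features are all monomials of total degree at most k in N inputs, and there are C(N+k, k) of those. The helper name `num_poly_features` is hypothetical.

```python
from math import comb

def num_poly_features(n_inputs: int, order: int) -> int:
    """Count monomials of total degree <= order in n_inputs variables."""
    return comb(n_inputs + order, order)

# N=21, k=5: close to the "65,000 features" on the slide.
print(num_poly_features(21, 5))   # 65780
# Assuming the 14th-order example uses a 2-dimensional input.
print(num_poly_features(2, 14))   # 120
```

Under the same assumption, the 256-dimensional, 4th-order zip code example later in the talk comes out at about 1.9 x 10^8, the same order of magnitude as the "roughly 100,000,000 dims" quoted there.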
6 MLPs vs. Perceptrons
MLPs are hard to train…
  Takes a long time (unpredictably long)
  Can converge to poor minima
MLPs are hard to understand
  What are they really doing?
Perceptrons are easy to train…
  Type of linear programming. Polynomial time.
  One minimum, which is global.
Generalized perceptrons are easier to understand.
  Polynomial functions.
7 Perceptron Training is Linear Programming
Polynomial time in the number of variables and in the number of constraints.
What about linearly inseparable data?
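Training a perceptron means finding weights w that satisfy one linear constraint per example, y_i (w · x_i) > 0, which is why it can be posed as a linear program. The sketch below does not call an LP solver; it swaps in the classic on-line perceptron update, which finds a feasible w when the data are linearly separable. The function name and the `max_epochs` cap are assumptions made for illustration.

```python
def train_perceptron(xs, ys, max_epochs=100):
    """On-line perceptron rule; labels are +1/-1, last weight is the bias."""
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            xb = list(x) + [1.0]                      # append constant bias input
            score = sum(wi * xi for wi, xi in zip(w, xb))
            if y * score <= 0:                        # constraint violated
                w = [wi + y * xi for wi, xi in zip(w, xb)]
                mistakes += 1
        if mistakes == 0:                             # every constraint satisfied
            break
    return w

def satisfies_constraints(w, xs, ys):
    """Check y_i * (w . x_i) > 0 for every example."""
    return all(
        y * sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) > 0
        for x, y in zip(xs, ys)
    )

# Linearly separable toy problem (logical AND).
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [-1, -1, -1, 1]
w = train_perceptron(xs, ys)
print(w, satisfies_constraints(w, xs, ys))
```

On inseparable data (XOR, for example) no w satisfies all the constraints, so the loop simply stops at `max_epochs`; that is the case the slide's closing question points at.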
8 Rebirth of Perceptrons
How to train effectively?
  Linear Programming (… later quadratic programming)
  Though on-line works great too.
How to get so many features inexpensively?!?
  Kernel Trick
How to generalize with so many features?
  VC dimension. (Or is it regularization?)
Support Vector Machines
9 Lemma 1: Weight vectors are simple
The weight vector lives in a sub-space spanned by the examples…
Dimensionality is determined by the number of examples, not the complexity of the space.
10 Lemma 2: Only need to compare examples
This is called the kernel matrix.
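A small sketch of what "only need to compare examples" can mean in practice. It assumes a degree-2 polynomial kernel on 2-D inputs (the slides do not fix a kernel here) and checks that the kernel value equals the dot product of explicitly expanded features. All function names are hypothetical.

```python
import math

def poly_kernel(x, z, degree=2):
    """Polynomial kernel (1 + x.z)^degree, computed without expanding features."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree

def explicit_features(x):
    """Explicit degree-2 feature map for a 2-D input matching poly_kernel."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

def kernel_matrix(examples, kernel):
    """K[i][j] = kernel(x_i, x_j): every pairwise comparison of examples."""
    return [[kernel(a, b) for b in examples] for a in examples]

examples = [(1.0, 2.0), (0.5, -1.0), (3.0, 0.0)]
K = kernel_matrix(examples, poly_kernel)

# The same numbers via the explicit (6-dimensional) feature space.
phi = [explicit_features(x) for x in examples]
K_explicit = [[sum(p * q for p, q in zip(a, b)) for b in phi] for a in phi]
print(K[0][1], K_explicit[0][1])
```

The kernel matrix is N x N for N examples whatever the dimension of the expanded feature space, which is the point of Lemma 1 and Lemma 2 taken together.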
20 SVM: Key Ideas
Augment inputs with a very large feature set
  Polynomials, etc.
Use Kernel Trick(TM) to do this efficiently
Enforce/Encourage Smoothness with weight penalty
Introduce Margin
Find best solution using Quadratic Programming
21 SVM: Zip Code recognition
Data dimension: 256
Feature Space: 4th order
  roughly 100,000,000 dims
22 The Classical Face Detection Process
[Figure: scanning from Smallest Scale to Larger Scale; 50,000 Locations/Scales]
23 Classifier is Learned from Labeled Data
Training Data
  5000 faces
    All frontal
  10^8 non faces
  Faces are normalized
    Scale, translation
Many variations
  Across individuals
  Illumination
  Pose (rotation both in plane and out)
This situation with negative examples is actually quite common… where negative examples are free.
24 Key Properties of Face Detection
Each image contains thousands of locs/scales
Faces are rare per image
  1000 times as many non-faces as faces
Extremely small # of false positives: 10^-6
As I said earlier, the classifier is evaluated 50,000 times.
Faces are quite rare…. Perhaps 1 or 2 faces per image.
A reasonable goal is to make the false positive rate less than the true positive rate...
29 P, O, & G: First Pedestrian Work
Flaws: very poor model for feature selection.
Little real motivation for features at all. Why not use the original image?
30 On to AdaBoost
Given a set of weak classifiers
  None much better than random
Iteratively combine classifiers
  Form a linear combination
  Training error converges to 0 quickly
  Test error is related to training margin
31 AdaBoost (Freund & Schapire)
[Figure: Weak Classifier 1, Weights Increased, …]
Final classifier is linear combination of weak classifiers
33 AdaBoost: Super Efficient Feature Selector
Features = Weak Classifiers
Each round selects the optimal feature given:
  Previous selected features
  Exponential Loss
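A minimal sketch of the loop these slides describe, assuming the standard discrete AdaBoost update and single-feature threshold "stumps" as the weak classifiers. The stump search here is brute force, and all names are hypothetical; it is an illustration, not the authors' implementation.

```python
import math

def stump_predict(x, feature, threshold, polarity):
    """Weak classifier: +polarity if x[feature] > threshold, else -polarity."""
    return polarity if x[feature] > threshold else -polarity

def best_stump(xs, ys, weights):
    """Brute-force search for the stump with the lowest weighted error."""
    best = None
    for f in range(len(xs[0])):
        values = sorted(set(x[f] for x in xs))
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for t in thresholds:
            for polarity in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump_predict(x, f, t, polarity) != y)
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Return a list of (alpha, feature, threshold, polarity) tuples."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, f, t, p = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)          # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, t, p))
        # Exponential-loss reweighting: misclassified examples gain weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, f, t, p))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Final classifier: sign of the linear combination of weak classifiers."""
    score = sum(a * stump_predict(x, f, t, p) for a, f, t, p in ensemble)
    return 1 if score > 0 else -1

# 1-D data that no single stump can separate (+ + - - + +).
xs = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

Each round picks exactly one feature, so the list of selected stumps doubles as a feature-selection record.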
34 Boosted Face Detection: Image Features
“Rectangle filters”
  Similar to Haar wavelets
  Papageorgiou, et al.
For real problems results are only as good as the features used...
  This is the main piece of ad-hoc (or domain) knowledge
Rather than the pixels, we have selected a very large set of simple functions
  Sensitive to edges and other critical features of the image
  At multiple scales
Since the final classifier is a perceptron it is important that the features be non-linear… otherwise the final classifier will be a simple perceptron.
We introduce a threshold to yield binary features
Unique Binary Features
37 Feature Selection
For each round of boosting:
  Evaluate each rectangle filter on each example
  Sort examples by filter values
  Select best threshold for each filter (min Z)
  Select best filter/threshold (= Feature)
  Reweight examples
M filters, T thresholds, N examples, L learning time
  O( MT L(MTN) ) Naïve Wrapper Method
  O( MN ) AdaBoost feature selector
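The "sort examples by filter values, then select the best threshold" step can be sketched as one pass over the sorted values with running weight totals. One swapped-in detail, stated plainly: the slide selects by minimum Z, while this sketch minimizes weighted classification error, which is simpler to show. The function name is hypothetical.

```python
def best_threshold(values, labels, weights):
    """Return (error, threshold, polarity) for one filter.

    polarity +1 means "predict +1 when value > threshold";
    polarity -1 means "predict +1 when value <= threshold".
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    total_pos = sum(w for w, y in zip(weights, labels) if y == 1)
    total_neg = sum(w for w, y in zip(weights, labels) if y == -1)

    # Threshold below every value: all examples fall on the "above" side.
    best = min((total_neg, float("-inf"), 1), (total_pos, float("-inf"), -1))

    pos_below = neg_below = 0.0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            pos_below += weights[i]
        else:
            neg_below += weights[i]
        # Only place a threshold between distinct values.
        if rank + 1 < len(order) and values[order[rank + 1]] == values[i]:
            continue
        err_plus = pos_below + (total_neg - neg_below)    # polarity +1
        err_minus = neg_below + (total_pos - pos_below)   # polarity -1
        best = min(best, (err_plus, values[i], 1), (err_minus, values[i], -1))
    return best

values = [0.2, 0.9, 0.4, 0.7, 0.1]      # one filter evaluated on 5 examples
labels = [-1, 1, -1, 1, -1]
weights = [0.2] * 5
print(best_threshold(values, labels, weights))
```

After the sort, each filter costs a single linear scan per boosting round, instead of re-scoring every candidate threshold against every example.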
38 Example Classifier for Face Detection
A classifier with 200 rectangle features was learned using AdaBoost
95% correct detection on test set with 1 in 14084 false positives.
Not quite competitive...
[Figure: ROC curve for 200 feature classifier]
39 Building Fast Classifiers
Given a nested set of classifier hypothesis classes
Computational Risk Minimization
[Figure: ROC curve, % Detection vs. % False Pos, with the false positive vs. false negative trade-off]
[Figure: IMAGE SUB-WINDOW -> Classifier 1 -> (T) -> Classifier 2 -> (T) -> Classifier 3 -> FACE; each F branch leads to NON-FACE]
In general simple classifiers, while they are more efficient, are also weaker.
We could define a computational risk hierarchy (in analogy with structural risk minimization)…
  A nested set of classifier classes
The training process is reminiscent of boosting…
  - previous classifiers reweight the examples used to train subsequent classifiers
The goal of the training process is different
  - instead of minimizing errors, minimize false positives
40 Other Fast Classification Work
Simard
Rowley (Faces)
Fleuret & Geman (Faces)
41 Cascaded Classifier
[Figure: IMAGE SUB-WINDOW -> 1 Feature -> (50%) -> 5 Features -> (20%) -> 20 Features -> (2%) -> FACE; each F branch leads to NON-FACE]
A 1 feature classifier achieves 100% detection rate and about 50% false positive rate.
A 5 feature classifier achieves 100% detection rate and 40% false positive rate (20% cumulative), using data from the previous stage.
A 20 feature classifier achieves 100% detection rate with 10% false positive rate (2% cumulative).
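The cumulative figures on this slide are products of the per-stage false positive rates, and the early exits are what make the cascade cheap. The sketch below reproduces that arithmetic; the function names are hypothetical, and the cost estimate assumes almost every sub-window is a non-face (consistent with the earlier "faces are rare" slide).

```python
def cascade_classify(stages, window):
    """Accept only if every stage accepts; reject at the first stage that fails."""
    for stage in stages:
        if not stage(window):
            return False        # NON-FACE: later stages are never evaluated
    return True                 # FACE

def cumulative_rates(stage_rates):
    """Running product of per-stage false positive rates."""
    rates, running = [], 1.0
    for r in stage_rates:
        running *= r
        rates.append(running)
    return rates

def expected_features(feature_counts, stage_rates):
    """Average features evaluated per non-face window."""
    reach, total = 1.0, 0.0
    for count, rate in zip(feature_counts, stage_rates):
        total += reach * count  # only windows that reach this stage pay for it
        reach *= rate
    return total

print(cumulative_rates([0.5, 0.4, 0.1]))             # about 50%, 20%, 2%
print(expected_features([1, 5, 20], [0.5, 0.4, 0.1]))
```

With the slide's numbers the average non-face window costs about 7.5 feature evaluations, against 26 if all three stages always ran.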
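42 Comparison to Other Systems
[Table: detection rate (%) by number of false detections. Values are listed in order of increasing false detections; rows with fewer entries had blank cells whose column positions are not shown.]
False Detections: 10, 31, 50, 65, 78, 95, 110, 167, 422
Viola-Jones: 78.3, 85.2, 88.8, 90.0, 90.8, 91.1, 91.8, 93.7
Rowley-Baluja-Kanade: 83.2, 86.0, 89.2, 90.1, 89.9
Schneiderman-Kanade: 94.4
Roth-Yang-Ahuja: (94.8)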
44 Solving other “Face” Tasks
Profile Detection
Facial Feature Localization
Demographic Analysis
45 Feature Localization
Surprising properties of our framework
  The cost of detection is not a function of image size
    Just the number of features
  Learning automatically focuses attention on key regions
Conclusion: the “feature” detector can include a large contextual region around the feature
46 Feature Localization Features
Learned features reflect the task
50 One-Nearest Neighbor
Thanks to Andrew Moore
…One nearest neighbor for fitting is described shortly…
Similar to Join The Dots with two Pros and one Con.
PRO: It is easy to implement with multivariate inputs.
CON: It no longer interpolates locally.
PRO: An excellent introduction to instance-based learning…
51 1-Nearest Neighbor is an example of…. Instance-based learning
Thanks to Andrew Moore
[Table: stored datapoints x1 y1, x2 y2, x3 y3, …, xn yn]
A function approximator that has been around since about 1910.
To make a prediction, search database for similar datapoints, and fit with the local points.
Four things make a memory based learner:
  A distance metric
  How many nearby neighbors to look at?
  A weighting function (optional)
  How to fit with the local points?
52 Nearest Neighbor
Thanks to Andrew Moore
Four things make a memory based learner:
  A distance metric: Euclidean
  How many nearby neighbors to look at? One
  A weighting function (optional): Unused
  How to fit with the local points? Just predict the same output as the nearest neighbor.
53 Multivariate Distance Metrics
Thanks to Andrew Moore
Suppose the input vectors x1, x2, …xN are two dimensional:
x1 = ( x11 , x12 ) , x2 = ( x21 , x22 ) , …xN = ( xN1 , xN2 ).
One can draw the nearest-neighbor regions in input space.
Dist(xi,xj) = (xi1 – xj1)^2 + (xi2 – xj2)^2
Dist(xi,xj) = (xi1 – xj1)^2 + (3xi2 – 3xj2)^2
The relative scalings in the distance metric affect region shapes.
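A short sketch of 1-nearest-neighbor prediction with the two distance metrics on this slide, showing that rescaling one input dimension can change which stored point is nearest. The function names and the toy data are made up for illustration.

```python
def scaled_sq_dist(a, b, scales):
    """Sum over dimensions of (scale * (a_d - b_d))^2."""
    return sum((s * (p - q)) ** 2 for p, q, s in zip(a, b, scales))

def nearest_neighbor_predict(data, query, scales):
    """data is a list of (x, y) pairs; return the y of the closest x."""
    _, y = min(data, key=lambda xy: scaled_sq_dist(xy[0], query, scales))
    return y

data = [((0.0, 1.0), "A"), ((2.0, 0.0), "B")]
query = (0.0, 0.0)

# First metric on the slide: both dimensions weighted equally.
print(nearest_neighbor_predict(data, query, scales=(1.0, 1.0)))   # A
# Second metric: the second dimension is multiplied by 3.
print(nearest_neighbor_predict(data, query, scales=(1.0, 3.0)))   # B
```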
60 Euclidean (Standard) “Eigenfaces”
Thanks to Baback Moghaddam
Turk & Pentland (1992), Moghaddam & Pentland (1995)
Projects all the training faces onto a universal eigenspace to “encode” variations (“modes”) via principal components (PCA)
Uses inverse-distance as a similarity measure for matching & recognition
61 Euclidean Similarity Measures
Thanks to Baback Moghaddam
Metric (distance-based) Similarity Measures
  template-matching, normalized correlation, etc.
Disadvantages
  Assumes isotropic variation (that all variations are equi-probable)
  Cannot distinguish incidental changes from the critical ones
  Particularly bad for Face Recognition in which so many are incidental!
    for example: lighting and expression
62 PCA-Based Density Estimation
Moghaddam & Pentland ICCV’95
Thanks to Baback Moghaddam
Solve for minimal KL divergence residual for the orthogonal subspace:
Perform PCA and factorize into (orthogonal) Gaussian subspaces:
See Tipping & Bishop (97) for an ML derivation within a more general factor analysis framework (PPCA)
63 Bayesian Face Recognition
Thanks to Baback Moghaddam
Moghaddam et al ICPR’96, FG’98, NIPS’99, ICCV’99
Intrapersonal / Extrapersonal
dual subspaces for dyads (image pairs)
Equate “similarity” with posterior on
PCA-based density estimation (Moghaddam ICCV’95)
64 Intra-Extra (Dual) Subspaces
Thanks to Baback Moghaddam
[Figure: eigenvectors of the Intra, Extra, and Standard PCA subspaces; labels: specs, light, mouth, smile]
Note that the Extra and Standard subspaces are qualitatively the same (e.g. both have beard-encoders, not present in the Intra space). In fact, their 1st eigenvector is almost exactly the same (bangs in the forehead).
65 Intra-Extra Subspace Geometry
Thanks to Baback Moghaddam
The lower-order eigenvectors/eigenvalues allow us to “visualize” these two distributions and to judge their (relative) size and orientation at least.
The key thing is that we’re effectively talking about two “pencils” which only intersect near the origin (of delta space). Discriminating between these two subspaces (a binary classification task) is often much easier than discriminating between N faces (N-ary classification). In fact, not only easier, but very often much more effective.
Two “pancake” subspaces with different orientations intersecting near the origin. If each is in fact Gaussian, then the optimal discriminant is hyperquadratic.
66 Bayesian Similarity Measure
Thanks to Baback Moghaddam
Bayesian (MAP) Similarity
  priors can be adjusted to reflect operational settings or used for Bayesian fusion (evidential “belief” from another level of inference)
Likelihood (ML) Similarity
Intra-only (ML) recognition is only slightly inferior to MAP (by a few %). Therefore, if you had to pick only one subspace to work in, you should pick Intra, and not standard eigenfaces!
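The slide's formulas did not survive in this transcript, so the following is only a sketch of the two-class Bayes rule that the MAP and ML similarities rest on: the posterior probability that an image difference is intrapersonal. It assumes 1-D zero-mean Gaussian likelihoods purely for illustration (the talk uses PCA-based subspace densities), and every name and parameter is hypothetical.

```python
import math

def gaussian_pdf(delta, sigma):
    """Zero-mean 1-D Gaussian density (stand-in for a subspace density)."""
    return math.exp(-0.5 * (delta / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def ml_similarity(delta, sigma_intra):
    """Likelihood (ML) similarity: intrapersonal likelihood only."""
    return gaussian_pdf(delta, sigma_intra)

def map_similarity(delta, sigma_intra, sigma_extra, prior_intra=0.5):
    """Bayesian (MAP) similarity: posterior of the intrapersonal class."""
    intra = gaussian_pdf(delta, sigma_intra) * prior_intra
    extra = gaussian_pdf(delta, sigma_extra) * (1.0 - prior_intra)
    return intra / (intra + extra)

# Small differences look intrapersonal, large ones extrapersonal.
print(map_similarity(0.0, sigma_intra=1.0, sigma_extra=5.0))
print(map_similarity(4.0, sigma_intra=1.0, sigma_extra=5.0))
```

The `prior_intra` argument is where the "priors can be adjusted" remark would enter: raising it pushes every similarity score upward.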
68 Official 1996 FERET Test
Thanks to Baback Moghaddam
[Figure: Bayesian (Intra-Extra) vs. Standard (Eigenfaces)]
69 ..let’s leave distance metrics for now, and go back to…. One-Nearest Neighbor
Thanks to Andrew Moore
Objection: That noise-fitting is really objectionable.
What’s the most obvious way of dealing with it?
70 k-Nearest Neighbor
Thanks to Andrew Moore
Four things make a memory based learner:
  A distance metric: Euclidean
  How many nearby neighbors to look at? k
  A weighting function (optional): Unused
  How to fit with the local points? Just predict the average output among the k nearest neighbors.
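The four choices above translate almost directly into code. This is a minimal sketch for 1-D inputs with absolute difference as the distance; the function name and data are illustrative only.

```python
def knn_predict(data, x_query, k):
    """Average the outputs of the k stored points closest to x_query.

    data is a list of (x, y) pairs with scalar x.
    """
    nearest = sorted(data, key=lambda xy: abs(xy[0] - x_query))[:k]
    return sum(y for _, y in nearest) / len(nearest)

data = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0), (10.0, 100.0)]
print(knn_predict(data, 1.2, k=1))   # 1.0: same as 1-nearest neighbor
print(knn_predict(data, 1.2, k=3))   # average of the outputs at x = 1, 2, 0
```

The prediction changes abruptly whenever the set of k nearest points changes as the query moves, which is one source of discontinuities in k-NN fits.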
71 k-Nearest Neighbor (here k=9)
Thanks to Andrew Moore
[Figure: three test-case panels with the captions below]
A magnificent job of noise-smoothing. Three cheers for 9-nearest-neighbor. But the lack of gradients and the jerkiness isn’t good.
Appalling behavior! Loses all the detail that join-the-dots and 1-nearest-neighbor gave us, yet smears the ends.
Fits much less of the noise, captures trends. But still, frankly, pathetic compared with linear regression.
K-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies.
What can we do about all the discontinuities that k-NN gives us?
72 Kernel Regression
Thanks to Andrew Moore
Four things make a memory based learner:
  A distance metric: Scaled Euclidean
  How many nearby neighbors to look at? All of them
  A weighting function (optional): wi = exp(-D(xi, query)^2 / Kw^2)
    Nearby points to the query are weighted strongly, far points weakly. The Kw parameter is the Kernel Width. Very important.
  How to fit with the local points? Predict the weighted average of the outputs:
    predict = Σ wi yi / Σ wi
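The weighting function and the weighted average on this slide can be written out directly. The sketch assumes scalar inputs, so D is the absolute difference; the function name is hypothetical.

```python
import math

def kernel_regression(data, x_query, kernel_width):
    """Weighted average of all outputs, with wi = exp(-D(xi, query)^2 / Kw^2).

    data is a list of (x, y) pairs with scalar x.
    """
    weights = [math.exp(-((x - x_query) ** 2) / kernel_width ** 2) for x, _ in data]
    # If the query is very far from all data relative to Kw, every weight can
    # underflow to zero and this division fails; a real implementation
    # would need to guard against that.
    return sum(w * y for w, (_, y) in zip(weights, data)) / sum(weights)

data = [(0.0, 0.0), (1.0, 10.0), (2.0, 20.0)]
print(kernel_regression(data, 0.0, kernel_width=0.1))   # close to 0: very local
print(kernel_regression(data, 0.0, kernel_width=1e6))   # close to 10: global average
```

The two calls show the extremes described in the later kernel-width slides: a narrow kernel reproduces the nearest point, and a very wide one tends to the global average.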
73 Kernel Regression in Pictures
Thanks to Andrew Moore
Take this dataset…
..and do a kernel prediction with xq (query) = 310, Kw = 50.
74 Varying the Query
Thanks to Andrew Moore
[Figure panels: xq = 150; xq = 395]
75 Varying the kernel width
Thanks to Andrew Moore
[Figure panels: xq = 310, Kw = 50 (see the double arrow at top of diagram); xq = 310 (the same), Kw = 100; Kw = 150]
Increasing the kernel width Kw means further away points get an opportunity to influence you.
As Kw → infinity, the prediction tends to the global average.
76 Kernel Regression Predictions
Thanks to Andrew Moore
[Figure panels: Kw = 10; Kw = 20; Kw = 80]
Increasing the kernel width Kw means further away points get an opportunity to influence you.
As Kw → infinity, the prediction tends to the global average.
77 Kernel Regression on our test cases
Thanks to Andrew Moore
Kw = 1/32 of x-axis width.
It’s nice to see a smooth curve at last. But rather bumpy. If Kw gets any higher, the fit is poor.
Quite splendid. Well done, kernel regression. The author needed to choose the right Kw to achieve this.
Kw = 1/16 axis width.
Nice and smooth, but are the bumps justified, or is this overfitting?
Choosing a good Kw is important. Not just for Kernel Regression, but for all the locally weighted learners we’re about to see.
78 Weighting functions
Thanks to Andrew Moore
Let d = D(xi, xquery) / Kw
Then here are some commonly used weighting functions…
(we use a Gaussian)
79 Kernel Regression can look bad
Thanks to Andrew Moore
Kw = Best.
Clearly not capturing the simple structure of the data. Note the complete failure to extrapolate at edges.
Also much too local. Why wouldn’t increasing Kw help? Because then it would all be “smeared”.
Three noisy linear segments. But best kernel regression gives poor gradients.
Time to try something more powerful…
80 Locally Weighted Regression
Thanks to Andrew Moore
Kernel Regression: Take a very very conservative function approximator called AVERAGING. Locally weight it.
Locally Weighted Regression: Take a conservative function approximator called LINEAR REGRESSION. Locally weight it.
Let’s Review Linear Regression….
81 Unweighted Linear Regression
Thanks to Andrew Moore
You’re lying asleep in bed. Then Nature wakes you.
YOU: “Oh. Hello, Nature!”
NATURE: “I have a coefficient β in mind. I took a bunch of real numbers called x1, x2 ..xN thus: x1=3.1, x2=2, …xN=4.5.
For each of them (k=1,2,..N), I generated yk = βxk + εk, where εk is a Gaussian (i.e. Normal) random variable with mean 0 and standard deviation σ. The εk’s were generated independently of each other.
Here are the resulting yk’s: y1=5.1, y2=4.2, …yN=10.2”
YOU: “Uh-huh.”
NATURE: “So what do you reckon β is then, eh?”
WHAT IS YOUR RESPONSE?
82 Locally Weighted Regression
Thanks to Andrew Moore
Four things make a memory-based learner:
  A distance metric: Scaled Euclidean
  How many nearby neighbors to look at? All of them
  A weighting function (optional): wk = exp(-D(xk, xquery)^2 / Kw^2)
    Nearby points to the query are weighted strongly, far points weakly. The Kw parameter is the Kernel Width.
  How to fit with the local points?
    First form a local linear model. Find the β that minimizes the locally weighted sum of squared residuals:
    Then predict ypredict = β^T xquery
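A minimal sketch of locally weighted regression for scalar inputs. Two assumptions are not in the transcript, since the slide's objective formula is missing: the weights enter the objective linearly, as the sum over k of wk (yk - β^T xk)^2, and each input is augmented with a constant 1 so the local line has an intercept. The function name is hypothetical.

```python
import math

def lwr_predict(xs, ys, x_query, kernel_width):
    """Fit a weighted least-squares line around x_query and evaluate it there."""
    ws = [math.exp(-((x - x_query) ** 2) / kernel_width ** 2) for x in xs]

    # Weighted normal equations for features (1, x):
    #   [s_w   s_wx ] [b0]   [s_wy ]
    #   [s_wx  s_wxx] [b1] = [s_wxy]
    s_w = sum(ws)
    s_wx = sum(w * x for w, x in zip(ws, xs))
    s_wxx = sum(w * x * x for w, x in zip(ws, xs))
    s_wy = sum(w * y for w, y in zip(ws, ys))
    s_wxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))

    det = s_w * s_wxx - s_wx ** 2
    b0 = (s_wxx * s_wy - s_wx * s_wxy) / det
    b1 = (s_w * s_wxy - s_wx * s_wy) / det
    return b0 + b1 * x_query          # ypredict = beta^T xquery

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]      # noise-free line y = 2x + 1
print(lwr_predict(xs, ys, 2.5, kernel_width=1.0))
print(lwr_predict(xs, ys, 6.0, kernel_width=2.0))   # extrapolates past the data
```

With every weight set to 1 this reduces to ordinary (unweighted) linear regression, and on exactly linear data the local fit recovers the line at any query, including just beyond the last data point.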
83 How LWR works
Thanks to Andrew Moore
[Figure: Query]
Linear regression not flexible but trains like lightning.
Locally weighted regression is very flexible and fast to train.
84 LWR on our test cases
Thanks to Andrew Moore
[Figure panels: Kw = 1/16 of x-axis width; Kw = 1/32 of x-axis width; Kw = 1/8 of x-axis width]
Nicer and smoother, but even now, are the bumps justified, or is this overfitting?
85 Features, Features, Features
In almost every case:
  Good Features beat Good Learning
  Learning beats No Learning
Critical classifier ratio:
  AdaBoost >> SVM
This is not to say that a