Face Detection and Recognition Reading: Chapter 18.10 and, optionally, “Face Recognition using Eigenfaces” by M. Turk and A. Pentland.


1 Face Detection and Recognition Reading: Chapter 18.10 and, optionally, “Face Recognition using Eigenfaces” by M. Turk and A. Pentland

2 Face Detection Problem Scan a window over the image Classify each window as either: –Face –Non-face [Diagram: window → classifier → face / non-face]

3 http://www.ri.cmu.edu/projects/project_320.html Face Detection Problem

4 Face Detection in most Consumer Cameras and Smartphones for Autofocus

5 The Viola-Jones Real-Time Face Detector P. Viola and M. Jones, 2004 Challenges: Each image contains 10,000 – 50,000 locations and scales where a face may occur Faces are rare: 0 – 50 per image >1,000 times as many non-face windows as faces Want a very small false positive rate: about 10^-6

6 Training Data (grayscale) 5,000 faces (frontal) 10^8 non-faces Faces are normalized –Scale, translation Many variations Across individuals Illumination Pose (rotation both in plane and out) Use Machine Learning to Create a 2-Class Classifier

7 Use Classifier at All Locations and Scales

8 Building a Classifier Compute lots of very simple features Efficiently choose best features Each feature is used to define a “weak classifier” Combine weak classifiers into an ensemble classifier based on boosting Learn multiple ensemble classifiers and “chain” them together to improve classification accuracy

9 Computing Features At each position and scale, use a sub-image (“window”) of size 24 x 24 Compute multiple candidate features for each window Want to rapidly compute these features

10 What Types of Features? Useful domain knowledge: –The eye region is darker than the forehead or the upper cheeks –The nose bridge region is brighter than the eyes –The mouth is darker than the chin Encoding –Location and size: eyes, nose bridge, mouth, etc. –Value: darker vs. lighter

11 Features 4 feature types (similar to “Haar wavelets”): Two-rectangle Three-rectangle Four-rectangle Value = ∑ (pixels in white area) - ∑ (pixels in black area)
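To make the feature definition concrete, here is a minimal Python/NumPy sketch of a two-rectangle feature computed by direct pixel sums over a 24 x 24 window; the rectangle placement is illustrative, not one of the learned features.

```python
import numpy as np

def two_rectangle_feature(window, top, left, h, w):
    """Vertical two-rectangle feature: sum of pixels in the upper (white)
    rectangle minus the sum in the lower (black) rectangle of equal size."""
    white = window[top:top + h, left:left + w].sum()
    black = window[top + h:top + 2 * h, left:left + w].sum()
    return float(white - black)

# Example on a random 24 x 24 grayscale window
window = np.random.rand(24, 24)
print(two_rectangle_feature(window, top=2, left=4, h=4, w=8))
```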

12 Huge Number of Features 160,000 features for each window!

13 Computing Features Efficiently: The Integral Image Intermediate representation of the image –Sum of all pixels above and to the left of (x, y) in image i Computed in one pass over the image: ii(x, y) = i(x, y) + ii(x-1, y) + ii(x, y-1) − ii(x-1, y-1)
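A minimal sketch of the one-pass construction above; the nested loops follow the recurrence literally (NumPy's cumulative sums give the same result much faster). Note that image coordinates (x, y) map to array indices [y, x].

```python
import numpy as np

def integral_image(i):
    """ii(x, y) = i(x, y) + ii(x-1, y) + ii(x, y-1) - ii(x-1, y-1),
    with ii treated as 0 outside the image."""
    h, w = i.shape
    ii = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            ii[y, x] = (i[y, x]
                        + (ii[y, x - 1] if x > 0 else 0.0)
                        + (ii[y - 1, x] if y > 0 else 0.0)
                        - (ii[y - 1, x - 1] if x > 0 and y > 0 else 0.0))
    return ii

# Equivalent vectorized form:
# ii = i.cumsum(axis=0).cumsum(axis=1)
```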

14 Using the Integral Image With the integral image representation, we can compute the value of any rectangular sum in constant time For example, the integral sum in rectangle D (with corner points 1, 2, 3, 4) is computed as: ii(4) + ii(1) – ii(2) – ii(3) The integral image itself is built with the pair of recurrences s(x, y) = s(x, y-1) + i(x, y) and ii(x, y) = ii(x-1, y) + s(x, y), where s(x, y) is the cumulative row sum and (0, 0) is the image origin
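A sketch of the constant-time rectangle sum, following the ii(4) + ii(1) − ii(2) − ii(3) rule above; padding the integral image with a zero row and column avoids special cases at the border. The rectangle coordinates here are illustrative.

```python
import numpy as np

def rect_sum(ii, top, left, h, w):
    """Sum over the h x w rectangle with top-left corner (top, left),
    using four lookups into the (zero-padded) integral image."""
    p = np.pad(ii, ((1, 0), (1, 0)))
    return (p[top + h, left + w] + p[top, left]
            - p[top, left + w] - p[top + h, left])

# Sanity check against a direct sum on a random window
window = np.random.rand(24, 24)
ii = window.cumsum(axis=0).cumsum(axis=1)
assert np.isclose(rect_sum(ii, 3, 5, 4, 8), window[3:7, 5:13].sum())
```

A two-rectangle feature then needs only a handful of such lookups (six array references), independent of the rectangle size.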

15 Features as Weak Classifiers Given window x, feature detector f_t, and threshold θ_t, construct a weak classifier, h_t
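The classifier definition on this slide was a formula image in the original deck; a standard form, consistent with the feature/threshold/parity description on slides 43-44, is:

```latex
\[
h_t(x) =
\begin{cases}
1 & \text{if } p_t\, f_t(x) < p_t\, \theta_t \\
0 & \text{otherwise}
\end{cases}
\]
```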

16 Combining Classifiers: Ensemble Methods Aggregation of the predictions of multiple classifiers, with the goal of improving accuracy by reducing the variance of an estimated prediction function (“mixture of experts”) Combining multiple classifiers often produces higher accuracy than any individual classifier

17 Ensemble Strategies Bagging –Create classifiers using different training sets, where each training set is created by “bootstrapping,” i.e., drawing examples (with replacement) from all possible training examples –Used by Random Forests Boosting –Sequential production of classifiers, where each classifier is dependent on the previous one –Make examples misclassified by the current classifier more important in the next classifier

18 Boosting Boosting is a class of ensemble methods for sequentially producing multiple weak classifiers, where each classifier is dependent on the previous ones Make examples misclassified by the previous classifier more important in the next classifier Combine all weak classifiers linearly into the combined classifier H(x) = Σ_t α_t h_t(x), where α_t is the weight of weak classifier h_t

19 Boosting How to select the best features? How to learn the final classification function? C(x) = α_1 h_1(x) + α_2 h_2(x) + ...

20 AdaBoost Algorithm Given a set of training windows labelled +/-, initially give equal weight to each training example Repeat T times 1.Select best weak classifier (min total weighted error on all training examples) 2.Increase weights of examples misclassified by current weak classifier Each round greedily selects the best feature given all previously selected features Final classifier weights weak classifiers by their accuracy
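A minimal Python/NumPy sketch of this loop, assuming the rectangle-feature values have already been computed into a matrix F (rows = training windows, columns = features); the brute-force threshold search is only for clarity.

```python
import numpy as np

def adaboost(F, y, T):
    """F: (n_examples, n_features) feature values, y: labels in {+1, -1},
    T: rounds. Returns a list of (feature, threshold, polarity, alpha)."""
    n, k = F.shape
    w = np.full(n, 1.0 / n)                    # equal initial weights
    strong = []
    for _ in range(T):
        best = None
        for j in range(k):                     # 1. pick weak classifier with
            for theta in np.unique(F[:, j]):   #    minimum weighted error
                for p in (+1, -1):
                    pred = np.where(p * F[:, j] < p * theta, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, theta, p, pred)
        err, j, theta, p, pred = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)         # 2. up-weight mistakes
        w /= w.sum()
        strong.append((j, theta, p, alpha))
    return strong
```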

21 AdaBoost Algorithm 1. Initialize Weights 2. Construct a classifier. Compute the error. 3. Update the weights, and repeat step 2. … 4. Finally, sum hypotheses… Slide by T. Holloway

22 AdaBoost Example Each data point x_t has a class label y_t ∈ {+1, −1} and a weight w_t = 1

23 Example Weak learners come from the family of decision lines Each data point has a class label y_t ∈ {+1, −1} and a weight w_t = 1

24 Example This decision line is the best ‘weak classifier’: it performs slightly better than chance Each data point has a class label y_t ∈ {+1, −1} and a weight w_t = 1

25 Example Re-weight the misclassified examples Update the weights: w_t ← w_t exp{−y_t h_t} for each data point with class label y_t ∈ {+1, −1}

26 Example Select the 2nd weak classifier on the re-weighted data Update the weights: w_t ← w_t exp{−y_t h_t}

27 Example Re-weight the misclassified examples again Update the weights: w_t ← w_t exp{−y_t h_t}

28 Example Select the 3rd weak classifier Update the weights: w_t ← w_t exp{−y_t h_t}

29 Example The final classifier is the combination of all the weak, linear classifiers h_1, h_2, h_3, h_4

30 Classifications (colors) and weights (sizes) after 1, 3, and 20 iterations of AdaBoost From Elder, John, From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods, 2007 Slide by T. Holloway

31 AdaBoost with Rectangle Features For each round of boosting: –Evaluate each rectangle feature on each example –Sort examples by feature values –Select best threshold (θ) for each feature (one with lowest error) –Select best feature/threshold combination from all candidate features –Compute weight (α) and incorporate feature into strong classifier: F(x) ← F(x) + α f(x) –Re-weight examples O(mn) time per round, where m = #features, n = #examples

32 AdaBoost - Adaptive Boosting Learn a single simple classifier Classify the data Look at where it makes errors Re-weight the data so that the inputs where we made errors get higher weight in the learning process Now learn a 2nd simple classifier on the weighted data Combine the 1st and 2nd classifiers and weight the data according to where they make errors Learn a 3rd classifier based on the weighted data … and so on until we learn T simple classifiers Final classifier is the weighted combination of all T classifiers

33 AdaBoost Advantages –Very little code –Reduces variance Disadvantages –Sensitive to noise and outliers Slide by T. Holloway

34 Example Classifier A classifier with T=200 rectangle features was learned using AdaBoost 95% correct detection rate on the test set with a false positive rate of 1 in 14,084 Not quite competitive [Figure: ROC curve for the 200-feature classifier, correct detection rate vs. false positive rate]

35 Learned Features

36 Classifier Error is Driven Down as T Increases, but Cost Increases

37 AdaBoost: Training Error Freund and Schapire (1997) proved that AdaBoost ADApts to the error rates of the individual weak hypotheses
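The bound on this slide was a formula image in the original deck; the standard statement of the result, with ε_t the weighted error of the t-th weak hypothesis and γ_t = 1/2 − ε_t its edge over random guessing, is:

```latex
\[
\text{training error}(H) \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
\;=\; \prod_{t=1}^{T} \sqrt{1-4\gamma_t^{2}}
\;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^{2}\Big)
\]
```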

38 AdaBoost: Generalization Error Freund and Schapire (1997) also bounded the generalization error of the combined classifier (see below)
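The bound referred to here was also an image; the standard form of the Freund and Schapire (1997) result, with d the VC dimension of the weak-classifier space, m the number of training examples, and Õ hiding logarithmic factors, is:

```latex
\[
\Pr\big[H(x)\neq y\big] \;\le\; \widehat{\Pr}\big[H(x)\neq y\big]
\;+\; \tilde{O}\!\left(\sqrt{\frac{T\,d}{m}}\right)
\]
```

This bound grows with the number of rounds T, which is what suggests overfitting for large T.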

39 AdaBoost: Generalization Error An alternative analysis, which better matches the empirical findings, was presented by Schapire et al. (1998)
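Their margin-based bound (again reconstructed here from the standard statement, with f the normalized weighted vote and θ > 0 any margin) does not depend on the number of rounds T:

```latex
\[
\Pr\big[y\,f(x)\le 0\big] \;\le\;
\widehat{\Pr}\big[y\,f(x)\le \theta\big]
\;+\; \tilde{O}\!\left(\sqrt{\frac{d}{m\,\theta^{2}}}\right)
\]
```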

40 AdaBoost: Generalization Error One might expect boosting to overfit if run for too many rounds However, it has been observed empirically that AdaBoost often does not overfit, even when run for thousands of rounds Moreover, the generalization error often continues to go down long after the training error reaches 0

41 AdaBoost: Different Point of View We try to solve the problem of approximating the y’s using a linear combination of weak hypotheses In other words, we are interested in the problem of finding a vector of parameters α such that f(x_i) = Σ_t α_t h_t(x_i) is a good approximation of y_i For classification problems we try to match the sign of f(x_i) to y_i

42 AdaBoost: Different Point of View Sometimes it is advantageous to minimize some other (non-negative) loss function instead of the number of classification errors For AdaBoost the loss function is the exponential loss (see below) This point of view was used by Collins, Schapire, and Singer (2002) to demonstrate that AdaBoost converges to optimality
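The loss referred to above is the exponential loss of the combined function f(x) = Σ_t α_t h_t(x) over the m training examples:

```latex
\[
L(\alpha) \;=\; \sum_{i=1}^{m} \exp\big(-y_i f(x_i)\big)
\;=\; \sum_{i=1}^{m} \exp\Big(-y_i \sum_{t} \alpha_t h_t(x_i)\Big)
\]
```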

43 Weak Classifiers Defined by a feature, f, computed on a window in the image, a threshold, θ, and a parity, p Thresholds are obtained by computing the mean feature value on each class and then averaging the two means Parity defines the direction of the inequality

44 Weak Classifiers Weak Classifier: a feature that best separates a set of training examples Given a sub-window (x), a feature (f), a threshold (θ), and a polarity (p) indicating the direction of the inequality: h(x, f, p, θ) = 1 if p·f(x) < p·θ, and 0 otherwise
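As a one-line sketch in Python (the feature value f(x) is assumed to be precomputed, e.g., from the integral-image rectangle sums above):

```python
def weak_classifier(f_value, theta, p):
    """h(x, f, p, theta) = 1 if p * f(x) < p * theta, else 0.
    p in {+1, -1} flips the direction of the inequality."""
    return 1 if p * f_value < p * theta else 0

# Example: with p = -1 the classifier fires when the feature value
# exceeds the threshold.
print(weak_classifier(f_value=12.3, theta=10.0, p=-1))   # -> 1
```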

45 Weak Classifiers A weak classifier is a combination of a feature and a threshold We have K features We have N thresholds where N is the number of examples Thus there are KN weak classifiers

46 Weak Classifier Selection For each feature, sort the examples based on feature value For each element, evaluate the total sum of positive/negative example weights (T+ / T−) and the sum of positive/negative weights below the current example (S+ / S−) The error for a threshold which splits the range between the current and previous example in the sorted list is e = min( S+ + (T− − S−), S− + (T+ − S+) )
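A minimal Python/NumPy sketch of this single-pass search; it returns the lowest-error threshold and the matching polarity for the h(x) = 1 if p·f(x) < p·θ convention used above.

```python
import numpy as np

def best_threshold(f, y, w):
    """f: feature values, y: labels in {+1, -1}, w: example weights.
    Scan the examples in sorted order, keeping running sums S+ / S- of
    positive / negative weight below the current split, and evaluate
    e = min(S+ + (T- - S-), S- + (T+ - S+)) at each candidate threshold.
    Returns (error, threshold, polarity)."""
    order = np.argsort(f)
    f, y, w = f[order], y[order], w[order]
    T_pos, T_neg = w[y == +1].sum(), w[y == -1].sum()
    S_pos = S_neg = 0.0
    best = (np.inf, None, None)
    for i in range(len(f)):
        theta = f[i] if i == 0 else 0.5 * (f[i] + f[i - 1])
        # p = +1 predicts "face" below theta: errors are the negatives
        # below plus the positives above.
        e_plus = S_neg + (T_pos - S_pos)
        # p = -1 predicts "face" above theta: errors are the positives
        # below plus the negatives above.
        e_minus = S_pos + (T_neg - S_neg)
        for e, p in ((e_plus, +1), (e_minus, -1)):
            if e < best[0]:
                best = (e, theta, p)
        if y[i] == +1:
            S_pos += w[i]
        else:
            S_neg += w[i]
    return best
```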

47 Example Threshold selection for one feature, with five examples sorted by feature value f. Each example has weight W = 1/5; the total positive and negative weights are T+ = 3/5 and T− = 2/5; S+ and S− are the positive/negative weights below the current example; the two error candidates are A = S+ + (T− − S−) and B = S− + (T+ − S+), and e = min(A, B):

 x    y    f    W     T+    T−    S+    S−    A     B     e
 X1   −1   2    1/5   3/5   2/5   0     0     2/5   3/5   2/5
 X2   −1   3    1/5   3/5   2/5   0     1/5   1/5   4/5   1/5
 X3   +1   5    1/5   3/5   2/5   0     2/5   0     5/5   0
 X4   +1   7    1/5   3/5   2/5   1/5   2/5   1/5   4/5   1/5
 X5   +1   8    1/5   3/5   2/5   2/5   2/5   2/5   3/5   2/5

The best threshold (error 0) therefore lies between f = 3 and f = 5, separating X1, X2 from X3, X4, X5

48 AdaBoost Given a training set of images –For t = 1, …, T (rounds of boosting) A weak classifier is trained using a single feature The error of the classifier is calculated –The classifier (or single feature) with the lowest error is selected, and combined with the others to make a strong classifier After T rounds, a strong classifier is created –It is the weighted linear combination of the T weak classifiers selected

49 Cascading Improve accuracy and speed by “cascading” multiple classifiers together Start with a simple (small T) classifier that rejects many of the negative windows while detecting almost all positive windows A positive result from the first classifier triggers the evaluation of a second (more complex; larger T) classifier, and so on –Each classifier is trained with the false positives from the previous classifier A negative outcome at any point leads to the immediate rejection of the window (i.e., no face detected)
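At test time the cascade is just a short-circuiting loop over stages; a minimal sketch, where each stage is a hypothetical callable returning True (pass) or False (reject):

```python
def cascade_classify(window, stages):
    """Run a window through increasingly complex stages; a single
    rejection ends the evaluation early, which is what makes the
    cascade fast on the overwhelmingly common non-face windows."""
    for stage in stages:
        if not stage(window):
            return False          # rejected: non-face
    return True                   # survived every stage: face
```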

50 Building Fast Classifiers Given a nested set of classifier hypothesis classes, perform computational risk minimization: trade off false negatives against false positives (and computation) at each stage [Figure: % Detection vs. % False Pos plot, and a cascade diagram: IMAGE WINDOW → Classifier 1 → Classifier 2 → Classifier 3 → FACE, where an F (reject) output at any stage leads to NON-FACE]

51 Cascaded Classifier [Diagram: IMAGE WINDOW → 1-feature classifier → 5-feature classifier (50%) → 20-feature classifier (20% → 2%) → FACE; an F (reject) output at any stage leads to NON-FACE] A T=1 feature classifier achieves 100% detection rate and about 50% false positive rate A T=5 feature classifier achieves 100% detection rate and 40% false positive rate (20% cumulative) –using data from the previous stage A T=20 feature classifier achieves 100% detection rate with 10% false positive rate (2% cumulative)
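For a cascade of K stages with per-stage false positive rates f_i and detection rates d_i, the cumulative rates multiply, which is where the 50% × 40% = 20% and 20% × 10% = 2% figures above come from:

```latex
\[
F \;=\; \prod_{i=1}^{K} f_i, \qquad D \;=\; \prod_{i=1}^{K} d_i
\]
```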

52 Training a Cascade User selects values for: –Maximum acceptable false positive rate per node –Minimum acceptable detection rate per node –Target overall false positive rate User gives a set of positive and negative training examples

53 Training a Cascade Algorithm While the overall false positive rate is not met: –While the false positive rate of the current node is greater than the maximum allowed: Train a classifier with n features using AdaBoost on the set of positive and negative training examples Decrease the threshold of the current classifier until the detection rate of the layer (measured on a tuning set) exceeds the minimum Evaluate the current cascade classifier on the tuning set –Evaluate the current cascade detector on a set of non-face images and put any false detections into the negative training set
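A hedged sketch of this training loop; train_stage, detection_rate, false_positive_rate, and the stage's threshold attribute are hypothetical stand-ins for the AdaBoost trainer and evaluation routines, not part of any particular library.

```python
def train_cascade(pos, neg, tune, train_stage, detection_rate,
                  false_positive_rate, f_max=0.5, d_min=0.99, F_target=1e-6):
    """Sketch of cascade training: grow each stage until its false positive
    rate is low enough, tuning its threshold to keep the detection rate up,
    then mine the next stage's negatives from the current cascade's mistakes."""
    cascade, F_overall = [], 1.0
    while F_overall > F_target and neg:
        n, f_stage, stage = 0, 1.0, None
        while f_stage > f_max:                 # add features until FP rate met
            n += 1
            stage = train_stage(pos, neg, n)   # AdaBoost with n features
            # Lower the stage threshold until its detection rate on the
            # tuning set exceeds the per-stage minimum.
            while detection_rate(cascade + [stage], tune) < d_min:
                stage.threshold -= 0.01
            f_stage = false_positive_rate(cascade + [stage], tune)
        cascade.append(stage)
        F_overall *= f_stage
        # Keep only negatives (false detections) that the cascade still passes.
        neg = [x for x in neg if all(s(x) for s in cascade)]
    return cascade
```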

54 Results

55 Detection in Real Images Basic classifier operates on 24 x 24 windows Scaling –Scale the detector (rather than the images) –Features can easily be evaluated at any scale –Scale by factors of 1.25 Location –Move detector around the image (e.g., 1 pixel increments) Final Detection –A real face may result in multiple nearby detections –Post-process detected sub-windows to combine overlapping detections into a single detection
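A minimal sketch of the scanning strategy described above, assuming a hypothetical classify(top, left, size) callable that evaluates the (scaled) detector at that window; overlapping detections would still need to be merged afterwards.

```python
def scan_image(image_shape, classify, base=24, scale_step=1.25, stride=1):
    """Enumerate all window positions and scales (size grows by 1.25 each
    level) and collect the windows the detector accepts."""
    H, W = image_shape
    detections = []
    size = base
    while size <= min(H, W):
        for top in range(0, H - size + 1, stride):
            for left in range(0, W - size + 1, stride):
                if classify(top, left, size):
                    detections.append((top, left, size))
        size = int(round(size * scale_step))
    return detections
```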

56 Training Data Set 4,916 hand-labeled face images (24 x 24) 9,500 non-face images

57 Structure of the Detector 38 layer cascade 6,060 features

58 Speed of Final Detector In 2003, on a 700 MHz processor, the algorithm could process a 384 x 288 image in about 0.067 seconds Real-time implementations exist for face tracking in video

59 Results Notice detection at multiple scales

60 Results

61 Solving other “Face” Tasks Facial Feature Localization Demographic Analysis Profile Detection

62

63 Profile Features

64 Demo Implementing Viola & Jones system Frank Fritze, 2004

65 http://vimeo.com/12774628#

66 Real-Time Demos http://flashfacedetection.com/ http://flashfacedetection.com/camdemo2.html

67 Face detection: recent approaches X. Zhu and D. Ramanan, Face Detection, Pose Estimation, and Landmark Localization in the Wild, CVPR, 2012

68 Face detection: recent approaches Shen et al., Detecting and Aligning Faces by Image Retrieval, CVPR, 2013

69 Face detection: recent approaches Shen et al., Detecting and Aligning Faces by Image Retrieval, CVPR 2013

70 Face Alignment and Landmark Localization Goal of face alignment: automatically align a face (usually non-rigidly) to a canonical reference Goal of face landmark localization: automatically locate face landmarks of interest http://www.mathworks.com/matlabcentral/fx_files/32704/4/icaam.jpg http://homes.cs.washington.edu/~neeraj/projects/face-parts/images/teaser.png

71 Face Image Parsing Given an input face image, automatically segment the face into its constituent parts Smith, Zhang, Brandt, Lin, and Yang, Exemplar-Based Face Parsing, CVPR 2013

72 Face Image Parsing: Results [Figure columns: Input | Soft segments | Hard segments | Ground truth]

