# Face Alignment with Part-Based Modeling

Face Alignment with Part-Based Modeling
Vahid Kazemi, Josephine Sullivan. CVAP, KTH Institute of Technology. I’m Vahid Kazemi, a PhD student in the Computer Vision and Active Perception Lab (CVAP) at KTH, and I’m presenting the paper Face Alignment with Part-Based Modeling.

Objective: Face Alignment
Find the correspondences between landmarks of a template face model and the target face. The main objective of this work is to find the correspondences of all the pixels between a template model and a target image. To simplify this task, we only look for a subset of pixels. We hope that, with a good selection of these points, we can recover the positions of the remaining pixels by interpolation. Figures: annotated images (source: IMM dataset); test image (source: YouTube).

Why: Possible Applications
The outcome can be used for: Motion capture, by determining head pose and facial expressions. Face recognition, by comparing registered facial features with a database. 3D reconstruction, by determining camera parameters using correspondences in an image sequence. The outcome of this work can be used for many different applications, including motion capture, face recognition, and 3D reconstruction. All of these, and many more applications in computer vision, rely on accurate localization of a set of landmarks on an object.

Global Methods
Overview: Create a constrained generative template model. Start with a rough estimate of the face position. Refine the template to match the target face. Properties: Model deformations more precisely. Arbitrary number of landmarks. Examples: Active Shape Models [Cootes 95], Active Appearance Models [Cootes 98], 3D Morphable Models [Blanz 99]. Traditional methods for the face alignment problem rely on the global appearance of the object. These methods build a constrained generative model of the object and try to find the best match by minimizing a complicated nonlinear cost function. The optimal solution to this problem is not computable, but we can find an approximation by starting from a close enough initialization and refining the parameters over multiple iterations until convergence. Global methods model deformations more precisely, but they are harder to train and have limited generalization ability. In practice these methods usually fail when the initialization error is high, so they cannot be applied, for example, to fast motion sequences.

Part-Based Methods
Overview: Train a different classifier for each part. Learn constraints on the relative positions of parts. Properties: More robust to partial occlusion. Better generalization ability. Sparse results. Examples: Elastic Bunch Graph Matching [Wiskott 97], Pictorial Structures [Felzenszwalb 2003]. Part-based methods, on the other hand, divide the object into multiple parts and train a separate classifier for each part. These methods simplify the problem by assuming that each part is rigid, and limit the deformation to the relative transformation of the parts. The work by Felzenszwalb in this area is very important. He proposed an algorithm based on Pictorial Structures which finds the optimal solution of the part-based matching problem in linear time. Part-based methods are more robust to partial occlusion, because even if one or a few parts are completely occluded, it is still possible to determine their locations from the spatial constraints with the visible parts. These methods generalize better, since each part is treated individually and the learned model can therefore cover a wider range of variation with limited training data. The downside is that these methods cannot be directly applied to dense matching problems: that would require an individual classifier for each landmark, which is not practical, because not all landmarks have a distinctive local appearance and it is computationally expensive to define a part for each landmark.

Our approach to face alignment
How can we avoid the drawbacks of existing models? In the next slides we will describe a combined model which avoids some of the drawbacks of the methods mentioned.

Our approach to face alignment
Find the mapping, q, from appearance to the landmark positions. But q is complex and non-linear… What we want to do is to find a mapping q which maps a feature representation of the appearance of the object to a set of landmarks representing the shape of the object. Ideally we want to use a linear regression function to describe the mapping, because a linear model needs less training data, has a lower chance of overfitting, and is computationally less expensive. But we know that q is highly nonlinear and complex, so we need to transform our data somehow to linearize the model.

Linearizing the model: use piecewise linear functions
One way to do this is to use piecewise linear functions, that is, to cluster the data and train a linear model for each cluster. For prediction on novel images we would then look for the cluster closest to the target image and use the corresponding mapping to find the location of the landmarks.
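The piecewise-linear idea can be sketched as follows. This is only an illustration under strong assumptions: a one-dimensional feature, cluster centres that are already given, and ordinary least squares per cluster. The talk does not specify the clustering method, and the names `fit_line`, `fit_piecewise`, and `predict` are hypothetical.

```python
def nearest(x, centroids):
    """Index of the cluster centre closest to x."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept for one cluster.

    Assumes the cluster holds at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_piecewise(xs, ys, centroids):
    """Assign each sample to its nearest centre, then fit one line per cluster."""
    groups = [([], []) for _ in centroids]
    for x, y in zip(xs, ys):
        k = nearest(x, centroids)
        groups[k][0].append(x)
        groups[k][1].append(y)
    return [fit_line(gx, gy) for gx, gy in groups]


def predict(x, centroids, models):
    """Pick the closest cluster and apply its linear model."""
    slope, intercept = models[nearest(x, centroids)]
    return slope * x + intercept


# A nonlinear target (y = |x|) that two linear pieces capture exactly.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [abs(x) for x in xs]
centroids = [-1.0, 1.0]
models = fit_piecewise(xs, ys, centroids)
```

A single line fitted to all eight samples would have slope zero here, which is the kind of failure that motivates splitting the data before fitting.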

Linearizing the model: use a part-based model
Another way is to use a part-based model, that is, to divide the object into multiple smaller parts and train an individual linear mapping for each part. This requires a part detector in order to predict the location of landmarks on the target image.

Linearizing the model: use a suitable feature descriptor
Using a suitable feature descriptor can also help to linearize the model.

Part Selection Criteria
Contain strong features, to detect the parts accurately and reliably. Have minimum variation, to ensure a simple (linear) model. Cover the whole object, to capture the global appearance. Using a part-based model, we face a new set of challenges. One of these is the selection of parts. In this work we did not try to build an automatic system for part selection, but we point out a set of requirements that a good part needs to meet. First, we want the parts to contain strong features. This ensures that we can accurately and reliably detect the parts in novel images. The parts also need to have minimum variation in shape and appearance. The lower the variation, the simpler the model, and the better we can approximate it with a linear model. Finally, we need the parts to cover the whole object so that we can capture its global appearance.

Part Selection for the face
We chose the nose, eyes, and mouth as good candidates. For human faces, a natural selection of parts is the eyes, the nose, and the mouth. For a more general case we need an automated system to find the best part candidates based on the criteria mentioned, which can be a subject for further investigation. Image from the IMM dataset.

Appearance descriptor
Variation of the PHOG descriptor. Divide the patch into 8 sub-regions. Recursively repeat for square regions. For the choice of feature descriptor, we use a variation of the PHOG descriptor. As you can see from the figure, at the first level an image patch is divided into 8 sub-regions, and each of these regions can in turn be recursively divided into 8 more sub-regions, down to the required pyramid level. At the end, all the histograms are concatenated to form the final descriptor. This descriptor allows us to capture the appearance of an object at different scales, as well as the joint appearance of adjacent regions both horizontally and vertically. In this way we can better represent shape information while maintaining a degree of spatial invariance.

Part detection
Build a tree-structured model of the face, with the nose at the root and the eyes and mouth as the leaves of the tree. To train the part detector, we model each part with a multivariate Gaussian based on the appearance descriptor of that part. Spatial constraints are learned by representing the object in the form of a tree. In our case, the nose is defined as the root of the tree, and the mouth and eyes are the leaves.

Part detection
Detect the parts by sliding a patch over the image and calculating the Mahalanobis distance of the patch from the mean model. To detect the location of parts in a novel image, a window of fixed size is slid across the image, and the Mahalanobis distance of the appearance descriptor from the mean model is calculated for each window to create a likelihood map.
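The scoring step can be sketched roughly as below. To stay dependency-free, this sketch assumes a diagonal covariance (the talk uses a multivariate Gaussian, so a full covariance matrix would replace the per-dimension division) and assumes the descriptor of every window has already been extracted into a grid. The names `fit_gaussian_diag`, `mahalanobis_diag`, and `cost_map` are hypothetical.

```python
import math


def fit_gaussian_diag(descriptors, eps=1e-9):
    """Per-dimension mean and variance of the training descriptors of one part.

    eps is a small floor that avoids division by zero for constant dimensions.
    """
    n = len(descriptors)
    dims = len(descriptors[0])
    mean = [sum(d[j] for d in descriptors) / n for j in range(dims)]
    var = [sum((d[j] - mean[j]) ** 2 for d in descriptors) / n + eps
           for j in range(dims)]
    return mean, var


def mahalanobis_diag(f, mean, var):
    """Mahalanobis distance of descriptor f under a diagonal covariance."""
    return math.sqrt(sum((a - m) ** 2 / v for a, m, v in zip(f, mean, var)))


def cost_map(descriptor_grid, mean, var):
    """Distance for every window position; a lower value means a better match."""
    return [[mahalanobis_diag(f, mean, var) for f in row]
            for row in descriptor_grid]


# Toy example: a 1 x 2 grid of two-dimensional window descriptors.
grid = [[[0.0, 0.0], [3.0, 4.0]]]
costs = cost_map(grid, mean=[0.0, 0.0], var=[1.0, 1.0])
```

The map holds distances, so the most likely part location is the position with the smallest value.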

Part detection
Find the optimal solution by minimizing the pictorial structure cost function. We can solve this efficiently using the generalized distance transform [Felzenszwalb 2003] by restricting the form of the cost function. After creating the likelihood map of each part on the image, we find the best match by minimizing the pictorial structure cost function. The first term in this expression is the sum of the appearance mismatch errors over all parts, and the second term is the sum of the deformation costs between each pair of connected parts. Using the generalized distance transform and dynamic programming, we can find the optimal solution for this cost function in linear time, assuming we use the Mahalanobis distance for the deformation cost function. Details of this procedure are described in the Felzenszwalb paper.
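To make the two-term cost concrete, here is a brute-force sketch for a star-shaped tree (a root with leaves attached directly to it). It is not the linear-time generalized distance transform from the Felzenszwalb paper: it simply tries every root location and, for each leaf, takes the cheapest location. The deformation cost is assumed to be a squared Mahalanobis distance with diagonal variances, and the data layout and the names `deformation_cost` and `best_configuration` are hypothetical.

```python
import math


def deformation_cost(root_loc, leaf_loc, mean_offset, var):
    """Squared Mahalanobis distance between the observed and expected offset."""
    dx = leaf_loc[0] - root_loc[0] - mean_offset[0]
    dy = leaf_loc[1] - root_loc[1] - mean_offset[1]
    return dx * dx / var[0] + dy * dy / var[1]


def best_configuration(appearance, springs, root="nose"):
    """Minimize sum(appearance costs) + sum(deformation costs) over a star tree.

    appearance: {part: {(x, y): appearance mismatch cost}}
    springs:    {leaf: (mean offset from root, per-axis variance)}
    Runs in quadratic time per edge; the distance transform avoids this.
    """
    best_cost, best_config = math.inf, None
    for root_loc, root_cost in appearance[root].items():
        total = root_cost
        config = {root: root_loc}
        for leaf, (mean_offset, var) in springs.items():
            # Leaves are independent given the root, so each is minimized alone.
            def leaf_total(loc):
                return (appearance[leaf][loc]
                        + deformation_cost(root_loc, loc, mean_offset, var))
            leaf_loc = min(appearance[leaf], key=leaf_total)
            total += leaf_total(leaf_loc)
            config[leaf] = leaf_loc
        if total < best_cost:
            best_cost, best_config = total, config
    return best_cost, best_config


# Toy example: the mouth is expected 3 pixels below the nose.
appearance = {
    "nose": {(5, 5): 0.0, (9, 9): 1.0},
    "mouth": {(5, 8): 0.5, (0, 0): 0.0},
}
springs = {"mouth": ((0, 3), (1.0, 1.0))}
cost, config = best_configuration(appearance, springs)
```

In the toy example the mouth candidate at (0, 0) has the better appearance score, but its deformation cost is so large that the spatially consistent candidate at (5, 8) wins.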

Regression
Model the mapping between the patch’s appearance feature (f) and its landmark positions (x) as a linear function. Estimate the weights from the training set using ridge regression. After detecting the parts, we want to estimate the positions of the landmarks inside each patch from its appearance, using a linear function. We train the weights using the annotations in the training data. There are dozens of methods for estimating the weights of a linear regression function. One method which is very fast for both training and prediction, and which gave good results in our experiments, is ridge regression, which minimizes the squared error of the estimate from the target values plus a quadratic regularization term.
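For reference, the standard ridge regression objective and its closed-form solution are written out below. The notation beyond f and x is assumed here, not taken from the slides: F stacks the n training descriptors as rows, X stacks the corresponding landmark coordinates as rows, W is the weight matrix, and λ > 0 is the regularization strength. The data are assumed to be centred, so no intercept term appears.

```latex
\hat{W} = \arg\min_{W} \; \lVert F W - X \rVert_F^2 + \lambda \lVert W \rVert_F^2
\qquad\Longrightarrow\qquad
\hat{W} = \left( F^{\top} F + \lambda I \right)^{-1} F^{\top} X ,
\qquad
\hat{x} = \hat{W}^{\top} f .
```

Training reduces to solving one linear system and prediction to one matrix-vector product, which is consistent with the speed noted in the talk.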

Regression: comparison of different regression methods
This figure compares the performance of different regression methods. At the top, nearest neighbor and ordinary regression perform worst. The green line shows the result of principal component regression (PCR) with different numbers of components. The red line is the result of PCR with regularization, which with enough components gives the same result as ridge regression.

Robustify the regression function
Why: Compensate for bad part detection. Deformable parts don’t exactly fit in a box. How: Extend the training set by adding noise to part positions. Note that the appearance descriptors are usually extracted from fixed patches around different parts of the object. For deformable objects, however, the parts probably do not fit exactly in these patches, and there is no definite way to say which patch represents a part better or worse. To overcome this problem, we add zero-mean Gaussian noise to the locations of the patch corners, creating a variety of patches with different scales and positions for every image. This is also beneficial when the part detector cannot find the exact location of the parts in the testing phase, because the regression model then contains information about mapping shifted appearance descriptors to the correct shape data. This figure shows the effect of adding noise to the location of parts. The black line shows the part detection error, the red line is the regression error based on ground truth part locations, and the blue line is the regression error with predicted part locations. As you can see, with 4% noise we get the best overall results on our dataset.
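The augmentation step can be sketched as below. The talk does not say what the 4% is measured against; this sketch assumes the noise standard deviation is a fraction of the patch width (for x coordinates) and height (for y coordinates). The box layout `(x0, y0, x1, y1)` and the names `jitter_patch` and `augment` are hypothetical.

```python
import random


def jitter_patch(box, noise_frac, rng):
    """Perturb both corners of a patch with zero-mean Gaussian noise.

    Moving the two corners independently changes position and scale together.
    """
    x0, y0, x1, y1 = box
    sigma_x = noise_frac * (x1 - x0)  # assumed: noise relative to patch width
    sigma_y = noise_frac * (y1 - y0)  # assumed: noise relative to patch height
    return (
        x0 + rng.gauss(0.0, sigma_x),
        y0 + rng.gauss(0.0, sigma_y),
        x1 + rng.gauss(0.0, sigma_x),
        y1 + rng.gauss(0.0, sigma_y),
    )


def augment(box, n_copies, noise_frac=0.04, seed=0):
    """Create n_copies noisy variants of one annotated part box."""
    rng = random.Random(seed)  # seeded so the augmented set is reproducible
    return [jitter_patch(box, noise_frac, rng) for _ in range(n_copies)]


# Five jittered versions of a hypothetical 100 x 80 pixel mouth patch.
patches = augment((100, 100, 200, 180), n_copies=5)
```

Each jittered patch would be paired with the original landmark annotations, so the regressor sees shifted and rescaled appearance mapped to the correct shape.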

Experiments
Use 240 face images from the IMM dataset. The dataset contains still images of 40 individual subjects with various facial expressions under the same lighting settings. 58 landmarks are used to represent the shape of each subject. In our experiments we used the IMM face database, which contains 240 still images of 40 different human faces at a resolution of 640 × 480 pixels. The database includes a variety of facial expressions and head orientations, and contains both male and female subjects. Each image comes with 58 annotated landmarks outlining the major facial features.

Results
Comparison of the localization accuracy of our algorithm with some existing methods on the IMM dataset. To benchmark the performance of our algorithm we ran a 40-fold cross validation on the IMM dataset. In each fold we used all the images of one person for testing and the rest of the images for training. This table summarizes the results of our algorithm compared to the results of some existing methods. As you can see, the mean Euclidean distance error between the prediction and the ground truth for our method is around 4 pixels, which is better than the results of some existing AAM implementations. Taking into account the average size of the head, which is 240 × 320 pixels, we can say the error is roughly less than 2 percent. Note: mean error is the mean Euclidean distance, in pixels, between the predicted and ground truth locations of the landmarks.
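The evaluation protocol described above can be sketched as follows: one fold per subject, and the mean Euclidean distance as the error measure. The data layout (a subject id per image, landmarks as `(x, y)` tuples) and the names `mean_landmark_error` and `leave_one_subject_out` are assumptions made for illustration.

```python
import math


def mean_landmark_error(predicted, truth):
    """Mean Euclidean distance in pixels between matching landmarks."""
    distances = [math.hypot(p[0] - t[0], p[1] - t[1])
                 for p, t in zip(predicted, truth)]
    return sum(distances) / len(distances)


def leave_one_subject_out(subject_ids):
    """Yield (held-out subject, train indices, test indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train, test


# 240 images of 40 subjects, assuming 6 images per subject for this example.
subject_ids = [i // 6 for i in range(240)]
folds = list(leave_one_subject_out(subject_ids))

# Two landmarks: one exact, one off by a 3-4-5 triangle.
error = mean_landmark_error([(0, 0), (3, 4)], [(0, 0), (0, 0)])
```

As a check on the quoted figure, an error of about 4 pixels relative to a head width of about 240 pixels is 4 / 240 ≈ 1.7%, which matches "roughly less than 2 percent".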

Results
The results of cross validation on the IMM dataset (legend: predicted, ground truth). These are a few examples from the cross validation on the IMM dataset. The red lines in these images show the predicted shape, and the green lines represent the ground truth. Note that the predicted landmark position is sometimes more accurate than the annotation, so the actual result might be even better than the numbers we provided earlier.

Demo
To get a qualitative measure of the performance of the algorithm on real-world data, we tried it on a set of image sequences. This video shows the result of running the algorithm on a novel sequence of images with unknown expressions and lighting settings. The model used in this demo is trained only on the 240 images from the IMM dataset. Note that the localization of landmarks is fairly stable, although we haven’t done any sort of temporal smoothing on the results. For more videos you can visit my KTH homepage. More videos: http://www.csc.kth.se/~vahidk/face/

Conclusion and future work
Part-based models can be used to simplify complicated models. The choice of parts is very important. HOG descriptors are not fully descriptive. In this work we have shown that, using part-based models, we can simplify very complicated functions and achieve similar or better performance compared to the global methods, without the risk of falling into local minima. As mentioned before, the choice of parts is a crucial factor in the performance of the algorithm, and in many cases it is not trivial. An automated method could be developed to find the optimal part-based configuration, which can be a subject for future work. We have used HOG descriptors to improve the robustness of the algorithm. The problem is that by transforming the image to the feature space, we lose some appearance information. One way to overcome this problem is to use a cascade model which starts with a rough HOG descriptor and ends with the actual image pixels. This can also be a subject for future work.

Questions?