Cascaded Models for Articulated Pose Estimation


1 Cascaded Models for Articulated Pose Estimation
Ben Sapp, Alexander Toshev and Ben Taskar University of Pennsylvania

2 Human Pose Estimation
Goal: Image -> Stick Figure, i.e., the 2D locations of anatomical parts from a single image (input: image; output: stick figure). We're interested in human pose estimation: the goal is to take a single, monocular image as input and recover the locations of body parts as output, with efficient inference.

3 Human Pose Estimation: It’s Hard
Challenges: pose variation, intrinsic scale variation, lighting variation, background clutter, and foreshortening. I shouldn't have to convince you that this is a hard problem: the individual parts are very difficult to detect (especially the lower arms), and the joint configuration of all parts is highly variable.

4 Articulated Pose and Pictorial Structures
Pictorial structures (PS) is a popular choice for (articulated) parts-based models, and one of the primary tools used to tackle this problem. A non-exhaustive timeline:

1972: Fischler & Elschlager, The representation and matching of pictorial structures
2005: Felzenszwalb & Huttenlocher, PS for Object Recognition; Fergus et al., ICCV Short Course
2006: Ramanan, Learning to Parse Images of Articulated Objects
2008: Felzenszwalb et al., A Discriminatively Trained, Multiscale, Deformable Part Model; Ferrari et al., Progressive Search Space Reduction
2009: Eichner & Ferrari, Better Appearance Models for Pictorial Structures; Andriluka et al., Pictorial Structures Revisited
2010: Sapp et al., Adaptive Pose Priors for PS

In the past 5 years, pictorial structures has been extremely popular, applied to both rigid and articulated objects.

5 Background: How PS works
Let's review how pictorial structures works. PS models each part's location in the image as a variable l_i = (x, y, ω), with parts head, torso, left/right upper arm (luarm/ruarm), and left/right lower arm (llarm/rlarm). The joint configuration of all parts is scored as a log-linear combination of unary and pairwise terms:

    s(l) = Σ_i s_i(l_i) + Σ_(i,j) s_ij(l_i, l_j)

The unary terms can be thought of as individual part detectors (detection maps), expressing the affinity for a part being at any location and orientation in the image. The pairwise terms are a simple function of the geometric displacement between neighboring parts; this geometric cost is typically independent of the image, and hence referred to as a geometric prior. The PS scoring function can be optimized using standard inference techniques: with max-product inference we infer the most likely configuration of parts (prediction); with sum-product inference we obtain marginal distributions over location for each part.
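To make this concrete, here is a toy sketch of max-product (Viterbi) inference on a 3-part chain, written for this transcript rather than taken from the paper's code; a real PS model is a tree over 6 parts with ~150,000 states per part, but the recursion is the same:

```python
# Toy max-product inference for a chain PS model:
#   score(L) = sum_i unary[i][L_i] + sum over edges of pair(L_i, L_j)
def max_product_chain(unary, pair):
    """unary: list of per-part score lists; pair(a, b): pairwise score
    between consecutive parts. Returns (best score, best configuration)."""
    n = len(unary)
    # forward pass: m[i][b] = best score of parts 0..i with part i in state b
    m = [list(unary[0])]
    back = []
    for i in range(1, n):
        scores, ptrs = [], []
        for b, ub in enumerate(unary[i]):
            best_a = max(range(len(unary[i - 1])),
                         key=lambda a: m[i - 1][a] + pair(a, b))
            scores.append(m[i - 1][best_a] + pair(best_a, b) + ub)
            ptrs.append(best_a)
        m.append(scores)
        back.append(ptrs)
    # backtrack the argmax configuration
    best = max(range(len(unary[-1])), key=lambda b: m[-1][b])
    cfg = [best]
    for ptrs in reversed(back):
        cfg.append(ptrs[cfg[-1]])
    cfg.reverse()
    return m[-1][best], cfg
```

Backtracking the argmax pointers recovers the best configuration; sum-product marginals are obtained by replacing max with sum in the same forward pass.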

6 Background: The Complexity of PS
While the number of variables in this PS model is small, the state space for each variable (x, y, ω) is huge: a typical discretization used in human pose estimation has n > 150,000 locations and orientations per part. PART x PART = HUGE. Standard inference in a tree graphical model is O(n²): each state must be compared with at least a fraction of the states of a neighboring part, for both feature generation and inference. In a realistic setting, the typical number of valid combinations for two neighboring parts is (80 x 80 x 24) x (80/5 x 80/5 x 24) ≈ 1 billion state-pairs!
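The slide's counts can be checked with a few lines of arithmetic (the divide-by-5 factor on the neighbor's spatial grid is taken from the slide's (80/5 x 80/5 x 24) expression):

```python
# State-space arithmetic for a PS discretization: (grid x grid x angles)
# states per part; each state is paired with the states of a neighbor on
# a 5x-coarser spatial grid, per the slide.
def state_pairs(grid, angles=24, factor=5):
    n = grid * grid * angles
    return n, n * (grid // factor) * (grid // factor) * angles

print(state_pairs(20))  # ~3.7 million pairwise comparisons
print(state_pairs(40))  # ~59 million
print(state_pairs(80))  # ~0.94 billion ("1 billion state-pairs")
```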

7 Background: The Complexity of PS
If the pairwise score s_ij(l_i, l_j) depends only on the geometric displacement between the parts, efficient inference tricks can be used [Felzenszwalb & Huttenlocher, 2005]:
Max-product with a unimodal cost: distance transform, O(n)
Sum-product with a linear-filter cost: convolution, O(n log n)
As Felzenszwalb and Huttenlocher showed in 2005, this tremendous speed-up has made pictorial structures practical, and all state-of-the-art systems use this restriction. In summary, the current PS is efficient if it simply pieces together individual part detector scores with geometric consistency. Q: Are we losing too much in expressivity (and accuracy) for this gain in efficiency?
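As a sketch of the first trick: the 1-D generalized distance transform of Felzenszwalb & Huttenlocher computes d[p] = min_q ((p - q)² + f[q]) in O(n) via the lower envelope of parabolas (written here in min form, as for negative log-scores; the 2-D case applies it along rows and then columns):

```python
def dt1d(f):
    """1-D generalized distance transform: d[p] = min_q ((p - q)**2 + f[q]),
    computed in O(n) by maintaining the lower envelope of the parabolas
    rooted at each q."""
    n = len(f)
    d = [0.0] * n
    v = [0] * n          # locations of parabolas in the lower envelope
    z = [0.0] * (n + 1)  # boundaries between envelope parabolas
    k = 0
    z[0], z[1] = float('-inf'), float('inf')
    for q in range(1, n):
        while True:
            # intersection of parabola at q with the last envelope parabola
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
            if s <= z[k]:
                k -= 1   # parabola at v[k] is dominated; pop it
            else:
                break
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = float('inf')
    k = 0
    for p in range(n):
        while z[k + 1] < p:
            k += 1
        d[p] = (p - v[k]) ** 2 + f[v[k]]
    return d
```

Keeping argmin pointers alongside d recovers, for each state of one part, the best-scoring state of its neighbor.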

8 Goal: Integrating richer pairwise terms
For example, we'd like to incorporate image evidence into the pairwise terms, which is not possible in the standard PS model. A simple and intuitive cue along these lines is the distance in color distribution between neighboring parts.
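Per pair of hypotheses the cue itself is trivial to compute; a minimal sketch of the χ² distance between two color histograms (the eps smoothing term is my own choice, not from the talk):

```python
def chi2(h1, h2, eps=1e-10):
    """Chi-squared distance between two (same-length) histograms:
    0.5 * sum over bins of (a - b)^2 / (a + b)."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))
```

The problem is not this per-pair cost but the number of pairs it must be evaluated on, as the next slides show.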

9 Computation example ( , )
χ² distance computation between the color histograms of all pairs of part hypotheses. Let's try to compute this cue and see how long it takes. At a very coarse spatial grid (20x20, 24 angles), we need 3.7 million comparisons, which take under a second of CPU time.

10 Computation example ( , )
χ² distance computation between the color histograms of all pairs of part hypotheses. As we scale up to a finer resolution (40x40 grid, 24 angles), the processing gets considerably slower: 59 million comparisons, 20 seconds of CPU time.

11 Computation example ( , )
χ² distance computation between the color histograms of all pairs of part hypotheses. At the standard resolution (80x80 grid, 24 angles), this simple feature computation requires 1 billion comparisons and 5 minutes of CPU time.

12 Computation example ( , )
χ² distance computation between the color histograms of all pairs of part hypotheses. And if we want to scale up beyond the standard PS representation to model a scale for each part (80x80 grid, 24 angles, 3 scales), this is prohibitively expensive: 2 hours of CPU time!

13 Computation example ( , )
Similarly, if you wanted to store this feature for learning or analysis, you would have to buy a lot of hard drives: 70 MB at 20x20, 1 GB at 40x40, 18 GB at 80x80, and 375 GB at 80x80 with 3 scales.

14 Exhaustive Inference Clearly, exhaustive inference is not going to work.

15 Our Contribution: Coarse-to-Fine Structured Inference
Our solution is to focus inference on promising states by learning a coarse-to-fine cascade of structured models, built on two principles:
Safety: no groundtruth left behind, i.e., avoid pruning the correct answer
Efficiency: prune unpromising (wrong) states as early as possible
This enables us to use richer features and beat the state of the art in both accuracy and efficiency.

16 Inspiration: Cascade of Classifiers [Viola & Jones 2001, Fleuret & Geman 2000]
For inspiration, let's turn to one of the most successful pruning strategies in computer vision: a cascade of classifiers, running from a simple model at level 1 to a complex model at level N, rejecting candidates at every level (e.g., for face detection). The cascade throws out easy-to-reject portions of the image with only a few feature computations, and focuses more computational effort on areas of the image that are harder to disambiguate. This works well for binary classification, but how do we generalize it to parts-based models?

17 Generalizing Classifier Cascades
Naïve solution: filter states based on part detector scores, with an independent cascade for each part (head, torso, luarm, ruarm, llarm, rlarm). Each part's state space is pruned individually, and the reduced state spaces are combined in the end into a richer PS model over the reduced state space to make a final prediction.

18 Generalizing Classifier Cascades
Naïve solution: filter states based on part detector scores. Let's see an example of this in action. On the bottom is a detector heatmap for the left lower arm, showing its likelihood at every location and orientation. If we prune this heatmap down to a reasonable number of states (800), each remaining state can be drawn as a joint location and direction vector. We are left with lower arm states all over the image, and in fact we miss the correct left elbow joint location.

19 Generalizing Classifier Cascades
The fundamental problem with this approach is that it scores locally and prunes locally: it only takes local detector scores into account, and typically misses correct locations with weak signal. We instead want information from other parts to help. For example, a strong belief in an upper arm location should save a lower arm hypothesis that would otherwise be pruned.

20 Generalizing Classifier Cascades (Our Take)
Better: prune based on a cascade of pictorial structures, from PS model 1 through PS model N, moving from a coarse state-space resolution (10x10x12) to a fine one (80x80x24) before predicting. The motto: score globally, prune locally.
0. Start with a coarse, efficiently computable state space.
1. Compute a global scoring measure (to be explained).
2. Throw away low-scoring states.
3. Refine the resolution, refine the model, and repeat.

21 Generalizing Classifier Cascades
Better solution: prune based on a cascade of pictorial structures. An illustration of our approach, shown for the torso, left upper arm (shoulder), and left lower arm (elbow) at 10x10, 20x20, 40x40, and 80x80 resolutions: we start with a coarse grid and an exhaustive set of locations and angles, then prune. The torso and upper arm are much easier to detect than the lower arm, so more of their states get pruned initially. We then refine and prune again, and repeat until we are at the standard resolution, at which point we make a prediction from the states we have left.

22 Global vs. Local Pruning (elbow joint)
Pruning the elbow joint to 800 states from the original 150K, our global score pruning keeps the correct answer unpruned. By comparison, on the same image, naïve local detector-score pruning eliminates the correct answer.

23 Computing a Global Pruning Score
Define the score of the most likely configuration of parts (a.k.a. the MAP score, a.k.a. the Viterbi score) as s★. Let's look at how we can compute a global pruning score. First, we run inference to obtain the globally best configuration of all parts, e.g. s★ = 27.85. This tells us that all of those parts are likely, but it tells us nothing about, say, the left lower arm at some other location: should that state be pruned? To obtain a score for that hypothesis, we fix the lower arm there and re-run inference, e.g. s★llarm(x=20, y=80, ω=-π/2) = 24.76. The new score is a global measure, directly comparable to the MAP score and to the max-marginal scores of other parts. We call this quantity the max-marginal score for part i at location l_i, and it is the key ingredient of our pruning.

24 Computing a Global Pruning Score
We can continue placing the lower arm at all possible locations in the image to get a max-marginal score si★(x, y, ω) for each hypothesis (e.g. 24.76, 14.28, 7.10), and do the same for every part at every location (upper arm: -3.67, ...; head: 0.85, 13.19, 6.31, 25.55). The important thing to remember is that this score is a global quantity, and all scores are on the same scale.
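A brute-force toy version of max-marginals, just to pin down the definition (my own illustration, not the paper's dynamic-programming implementation, which shares work across states):

```python
# Max-marginals by exhaustive enumeration on a tiny model:
#   s*_i(a) = max score over all configurations with part i fixed to a.
import itertools

def max_marginals(unary, pairwise, edges):
    """unary: list of dicts state -> score; edges: list of (i, j);
    pairwise: dict (i, j) -> score function. Returns one dict of
    max-marginal scores per part."""
    n = len(unary)
    states = [list(u) for u in unary]

    def score(cfg):
        s = sum(unary[i][cfg[i]] for i in range(n))
        return s + sum(pairwise[e](cfg[e[0]], cfg[e[1]]) for e in edges)

    mm = [{a: float('-inf') for a in states[i]} for i in range(n)]
    for cfg in itertools.product(*states):
        s = score(cfg)
        for i in range(n):
            mm[i][cfg[i]] = max(mm[i][cfg[i]], s)
    return mm
```

A useful sanity check: for every part i, the maximum of its max-marginals equals the MAP score s★.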

25 Max-Marginals
We can collect all scores for a single part and view them as a heatmap at the coarsest level: the si★ (max-marginal) heatmap versus the original si detector heatmap for the lower arm. Even though the correct location is locally not very promising, its max-marginal indicates it should not be pruned.

26 Learning to Prune Goal: “No true pose left behind!” On training data:
the max-marginals of the groundtruth pose should be above average. At test time: the max-marginals of the groundtruth are above average with high probability (see D. Weiss & B. Taskar, "Structured Prediction Cascades", AISTATS'10 for details). Cascaded learning: we want models optimized for the task of pruning, so we learn the cascade one level at a time, from coarse to fine, with each successive level trained on the unpruned output states of the previous one. In the histogram of max-marginal scores, everything below the threshold is pruned away and everything above (including the true pose) is kept.
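The pruning rule can be sketched in a few lines. In the structured prediction cascades framework the threshold interpolates between the best score and the mean max-marginal with a parameter alpha (alpha = 0 gives the prune-below-average rule used in the talk); the interpolation form is my reading of Weiss & Taskar's paper, so treat the details as an assumption:

```python
import statistics

def keep_states(mm_scores, alpha=0.0):
    """Keep the indices of states whose max-marginal score is at least
    alpha * max + (1 - alpha) * mean of the max-marginals."""
    thresh = (alpha * max(mm_scores)
              + (1 - alpha) * statistics.mean(mm_scores))
    return [i for i, s in enumerate(mm_scores) if s >= thresh]
```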

27 Learning One Cascade Level
To formulate learning, let l^t denote the true pose for training example t, let s(l^t, x^t) be the score of the true pose on example t, and let s̄★(x^t) be the average max-marginal score. We represent our unary and pairwise scores as linear combinations of features, with parameters (weights) θ and features φ. The learning problem minimizes a regularized loss subject to the convex constraint that the score of the true pose be above the average max-marginal score, s(l^t, x^t) ≥ s̄★(x^t). This gives safe pruning: we can show the constraint implies that the max-marginals of the true pose are above average too.

28 Stochastic Sub-gradient Learning
Pick a random training example t. We solve the optimization problem using a simple stochastic sub-gradient update (λ: regularization, η: step size): the parameters are adjusted by the difference between the features of the true pose and the average of the features used in computing max-marginals, which costs O(n²) to obtain along with the max-marginals themselves. It is interesting to compare this with the standard structured perceptron update, which uses the difference between the features of the groundtruth and the features of the highest-scoring non-truth: the complexity of the update is essentially the same. The difference is that the perceptron tries to separate the truth from the second best; we just try to keep the truth near the top.
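The update itself is one line per parameter; a sketch under the notation above (variable names are mine):

```python
def subgradient_step(theta, true_feats, avg_mm_feats, lam, eta):
    """One stochastic sub-gradient step:
    theta <- theta - eta * (lam * theta + avg_mm_feats - true_feats),
    i.e., move toward the true pose's features and away from the average
    features used in computing max-marginals, with L2 regularization."""
    return [t - eta * (lam * t + a - f)
            for t, f, a in zip(theta, true_feats, avg_mm_feats)]
```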

29 Recap: At Test Time
Now that we've specified all the details of the model, we can review the processing at test time: compute max-marginals, prune below average, refine the resolution, and repeat. We start with an exhaustive coarse state space, compute max-marginals (shown here for the left elbow joint), and prune all states below average. The left lower arm has a lot of uncertainty, so not much is pruned at this stage for this particular part. We then refine the resolution, compute max-marginals with a more refined model, prune again, and finally predict.

30 Coarse-to-Fine Pruning Results
On the Buffy upper body pose dataset:

cascade   state space   % reduction in   % true arms        cumulative cascade
level     size          state space      closely matched    CPU time*
0         10x10x12      -                100.0              1.1 s
1         10x10x24      52.5             76.6               1.5 s
3         20x20x24      95.6             72.3               2.6 s
5         40x40x24      98.3             70.5               3.6 s
7         80x80x24      99.7             68.4               5.2 s
detector pruning        -                58.6               -

* additional time after computing unary scores

We quantify how well our cascade model works in practice. The cascade is 8 stages long, doubling one dimension of the state space at each level, and successfully prunes down to fewer than 500 states per part, running in 5 additional seconds. Pruning with the local detector down to the same number of states matches only 58.6% of true arms: the cascade performs 10% better on matching lower arms, outperforming naïve local pruning.

31 Richer Features
We still need to make a final prediction. Now we can afford to include rich features and a complex pairwise cost function: texture, geometry, color, shape from regions, and shape from contours.

32 Features
Texture | Geometry | Color | Shape:Regions | Shape:Contours
Standard part detector cue [Andriluka et al. 2009]: a HoG/AdaBoost lower arm detector.

33 Features Texture | Geometry | Color | Shape:Regions | Shape:Contours
Standard geometric cues: displacement in x, y and angle in part-relative coordinate frame. [Felzenszwalb & Huttenlocher, 2005]

34 Richer features
Texture | Geometry | Color | Shape:Regions | Shape:Contours
New color cues. Unary: image-adaptive skin and clothing color compatibility, using face and torso color models [Eichner & Ferrari, 2009]. Pairwise: quantize color into 8 bins and compute the histogram difference between neighboring parts.

35 Richer features
Texture | Geometry | Color | Shape:Regions | Shape:Contours
New region cue: measure the shape moments of the superpixels supporting a part hypothesis, using an NCut oversegmentation. This distinguishes hypotheses with good region support from those with weak region support.

36 Richer features
Texture | Geometry | Color | Shape:Regions | Shape:Contours
New contour cue: extract long contours from NCut segment boundaries, assign each limb-pair to the single contour that aligns well with both, and use the alignment score as a feature.

37 Experiments
Challenging, real-world upper body human pose estimation datasets: Buffy Stickmen (from television) and ETHZ PASCAL Stickmen v1.0 (from Flickr), provided by the ETH Zurich CALVIN research lab.

38 End-system results
Note: the numbers here have changed since the talk to exactly match the publicly available reference implementation. The lower arms are the most challenging parts.

Buffy v2.1, PCP0.5       torso   head   upper arms   lower arms   total
Andriluka et al. 2009     98.3   95.7      86.6         52.8       78.8
Eichner et al. 2009       98.7   97.9      82.8         59.8       80.3
Sapp et al. 2010         100      -        91.1         65.7       85.9
CPS (this paper)           -     99.6      91.9         64.5       85.2

Sapp et al. 2010 takes 10 minutes*; CPS takes 1.5 minutes*.
* not counting part detector & segmentation time

PASCAL, PCP0.5           torso   head   upper arms   lower arms   total
Eichner et al. 2009       97.2   88.6      73.8         41.5       69.3
Sapp et al. 2010         100     98.0      83.9         54.0       79.0
CPS (this paper)           -     99.2      81.5         53.9       78.3

39 Results: Us vs. Local Pruning
% of correctly matched arms on Buffy: ours (a cascade of PS, which scores globally and prunes locally) vs. the naïve approach (score locally, prune locally).

42 Results: Feature analysis
% of correct parts for each feature set: geometry; geometry + regions (new); geometry + contours (new); geometry + color (new); all features (new).

43 Summary
A learned cascade of coarse-to-fine PS models: "score globally, prune locally." It overcomes the state-space explosion to enable complex pairwise scores, and can be naturally extended to: higher-order cliques; richer state spaces (e.g., occlusion, scale); and full-frame, temporal modeling of human pose. See our upcoming NIPS 2010 paper: D. Weiss, B. Sapp and B. Taskar, Tracking Complex Dynamics with Structured Prediction Cascades.

44 Thanks!
Code available soon at http://vision.grasp.upenn.edu/video

