Articulated Pose Estimation with Flexible Mixtures of Parts




1 Articulated Pose Estimation with Flexible Mixtures of Parts
Yi Yang & Deva Ramanan, University of California, Irvine. Good morning everyone. Welcome to our talk. My name is Yi. Our talk is on articulated pose estimation with flexible mixtures of parts. First, what is the task that we address?

2 Goal Articulated pose estimation
According to Wikipedia, the goal of articulated pose estimation is to report the joint positions of articulated limbs, as we show here for our Kung Fu Panda. Articulated pose estimation recovers the pose of an articulated object which consists of joints and rigid parts.

3 Applications Action HCI Gaming Segmentation Object Detection ……
There are lots of applications like action recognition, human-computer interaction and gaming where being able to estimate human pose is useful. Articulated pose estimation can also play a role in general image and scene understanding such as image segmentation and object detection.

4 Unconstrained Images In unconstrained images, human pose estimation can be a very hard problem because people can appear with a variety of poses, clothing, and body shape. In the slides, you can see some very interesting and unusual examples that demonstrate how flexible the human pose is.

5 Classic Approach Part Representation Head, Torso, Arm, Leg
Marr & Nishihara 1978 Part Representation Head, Torso, Arm, Leg Location, Rotation, Scale A classic approach for human pose estimation is to model the human as a set of parts, such as a head, torso, arm, and leg part. In 3D, these parts can be modeled as cylinders.

6 Felzenszwalb & Huttenlocher 2005
Classic Approach Marr & Nishihara 1978 Part Representation Head, Torso, Arm, Leg Location, Rotation, Scale Fischler & Elschlager 1973 Felzenszwalb & Huttenlocher 2005 Pictorial Structure Unary Templates Pairwise Springs Pictorial structures use 2D part models, where geometric relations between parts are encoded by springs.

7 Felzenszwalb & Huttenlocher 2005
Classic Approach Marr & Nishihara 1978 Part Representation Head, Torso, Arm, Leg Location, Rotation, Scale Fischler & Elschlager 1973 Felzenszwalb & Huttenlocher 2005 Pictorial Structure Unary Templates Pairwise Springs However, capturing the whole range of appearances using pictorial structures is still quite difficult, and there has been a large body of recent work on generalizing and extending pictorial structures to explicitly capture this variability. A big problem is that even projections of a simple cylinder into 2D yield many different appearances, so one usually has to explicitly evaluate many different possible in-plane orientations and foreshortenings in order to find a good match for a part template. Lan & Huttenlocher 2005 Andriluka et al. 2009 Sigal & Black 2006 Eichner et al. 2009 Ramanan 2007 Singh et al. 2010 Epshtein & Ullman 2007 Johnson & Everingham 2010 Wang & Mori 2008 Sapp et al. 2010 Ferrari et al. 2008 Tran & Forsyth 2010

8 Problem: Wide Variations
In-plane rotation Foreshortening However, in addition to in-plane rotation and foreshortening, there are many other factors which can affect the appearance of a part.

9 Problem: Wide Variations
In-plane rotation Foreshortening Scaling Out-of-plane rotation Intra-category variation Aspect ratio For example, here you see variations in scaling, out of plane rotation, intra-class variation unrelated to pose and finally aspect ratio.

10 Problem: Wide Variations
In-plane rotation Foreshortening Scaling Out-of-plane rotation Intra-category variation Aspect ratio Explicitly searching over all of these factors can be very expensive. And here we describe an efficient method for capturing such general deformations of parts. Naïve brute-force evaluation is expensive

11 Our Method – “Mini-Parts”
We do this with a model of many miniature parts, which represents the appearance of a single limb using multiple parts connected with springs. For example, the lower leg of the panda can be modeled with two parts and the torso of the panda can be modeled with four parts. Key idea: a "mini part" model can approximate deformations

12 Example: Arm Approximation
Here is a video illustration of our “mini” part model for modeling a lower arm. On the left, we visualize 3D transformations of an arm as 2D image foreshortening and rotation. On the right, we approximate these transformed images with a two-part model where parts only need to shift. If the arm rotates or foreshortens a lot, we use a different set of templates and springs. This means we use a pool of part templates and springs to capture such transformations.

13 Example: Torso Approximation
The mini part model can also be used to capture out-of-plane rotations of torsos, which vary in width when viewed frontally versus sideways.

14 Key Advantages Flexibility: Speed:
General affine warps (orientation, foreshortening, …) Speed: Mixtures of local templates + dynamic programming Then what is the advantage of such a representation? The key point is that we can model a general set of affine deformations of a template very efficiently. Rather than separately evaluating a large number of oriented and foreshortened limbs, we share computation among them using a small number of local templates. This sharing makes inference faster and learning easier due to fewer parameters.

15 Pictorial Structure Model
I: image; l_i: location of part i. To explain our model, we define a scoring function based on the well-known pictorial structure model. We write l_i for the 2D location of part i. We score a configuration of parts with a summation of two terms.

16 Pictorial Structure Model
α_i: unary template for part i; φ(I, l_i): local image features at location l_i. The first term scores a local match for placing a part template at a particular image location. We write α for the part template, and φ for the image feature vector extracted at that location. We use HOG features, and score the template with a dot product.
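As a rough sketch of what this unary term computes (hypothetical helper name; random arrays stand in for real HOG features):

```python
import numpy as np

def unary_score_map(feat, template):
    """Slide a linear template over a feature map, taking a dot product at
    each placement. feat is H x W x D (e.g. HOG cells), template is h x w x D."""
    H, W, D = feat.shape
    h, w, _ = template.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            scores[y, x] = np.sum(feat[y:y+h, x:x+w] * template)
    return scores

# Toy stand-in for a HOG feature pyramid level (random, just to run the code).
rng = np.random.default_rng(0)
feat = rng.standard_normal((10, 10, 31))
template = rng.standard_normal((3, 3, 31))
scores = unary_score_map(feat, template)
print(scores.shape)  # (8, 8)
```

In practice this correlation is done over a feature pyramid to handle scale; the loop here is purely illustrative.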

17 Pictorial Structure Model
ψ(l_i, l_j): spatial features between l_i and l_j; β_ij: pairwise spring between part i and part j. The second term consists of a deformation model that evaluates the relative locations of pairs of parts. We write ψ for the squared offset between two part locations, and we write β for the parameters of a spring that favors certain offsets over others. β encodes both the rest position and rigidity of the spring. In a Gaussian model, this would be the mean and covariance.
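Putting the last two slides together, the classic pictorial-structure score can be transcribed as follows (the quadratic offset features shown for ψ follow the description above; treat the exact form as a sketch):

```latex
S(I, L) = \sum_{i \in V} \alpha_i \cdot \phi(I, l_i)
        \;+\; \sum_{(i,j) \in E} \beta_{ij} \cdot \psi(l_i, l_j),
\qquad
\psi(l_i, l_j) = \left[\, dx \;\; dx^2 \;\; dy \;\; dy^2 \,\right]^{\mathsf T},
\quad dx = x_i - x_j,\;\; dy = y_i - y_j
```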

18 Our Flexible Mixture Model
m_i: mixture type of part i. Our model is an extension of the previous formulation that allows for mixtures of parts. For simplicity, assume that each mixture type corresponds to one of, say, 4 orientations. We now would like to score a configuration of parts, where a configuration includes both part locations and mixture types. We write m_i for the mixture type of part i,

19 Our Flexible Mixture Model
m_i: mixture type of part i; α_i^{m_i}: unary template for part i with mixture m_i. ...and score parts using a mixture-specific template α.

20 Our Flexible Mixture Model
m_i: mixture type of part i; α_i^{m_i}: unary template for part i with mixture m_i; β_ij^{m_i m_j}: pairwise spring between part i with mixture m_i and part j with mixture m_j. We write β for a collection of 4 × 4 = 16 springs that are specific to each combination of mixtures.

21 Our Flexible Mixture Model
m_i: mixture type of part i; α_i^{m_i}: unary template for part i with mixture m_i; β_ij^{m_i m_j}: pairwise spring between part i with mixture m_i and part j with mixture m_j. Finally, we have an additional term S(M) that scores the combination of mixtures itself.
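Collecting the pieces from the last few slides, the full flexible-mixture score reads:

```latex
S(I, L, M) = S(M)
           + \sum_{i \in V} \alpha_i^{m_i} \cdot \phi(I, l_i)
           + \sum_{(i,j) \in E} \beta_{ij}^{m_i m_j} \cdot \psi(l_i, l_j),
\qquad
S(M) = \sum_{(i,j) \in E} b_{ij}^{m_i m_j}
```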

22 Co-occurrence "Bias" b_ij^{m_i m_j}: pairwise co-occurrence prior between part i with mixture m_i and part j with mixture m_j. We score a combination with a co-occurrence model that favors certain pairs of mixtures of parts.

23 Co-occurrence "Bias" b_ij^{m_i m_j}: pairwise co-occurrence prior between part i with mixture m_i and part j with mixture m_j. We visualize the co-occurrence model for each pair as a 4 × 4 matrix. For example, if two parts are on the same rigid limb, they should have consistent orientations, so this matrix will be diagonal.

24 Co-occurrence "Bias" b_ij^{m_i m_j}: pairwise co-occurrence prior between part i with mixture m_i and part j with mixture m_j. If two parts are on different limbs that can flex into all orientations, then this matrix will be uniform.
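A toy illustration of the two matrix shapes just described (the numbers here are made up; in the real model the biases b are learned from data):

```python
import numpy as np

M = 4  # mixture types per part, e.g. 4 discretized orientations

# Two parts on the same rigid limb: reward matching mixture types and
# penalize mismatched ones -> a (roughly) diagonal co-occurrence matrix.
b_rigid = np.where(np.eye(M, dtype=bool), 0.0, -1.0)

# Two parts across a freely flexing joint: no preference -> uniform matrix.
b_flexible = np.zeros((M, M))

print(b_rigid)
```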

25 Inference & Learning Inference
max_{L,M} S(I, L, M). For a tree graph (V, E): dynamic programming. Given an image I, we would like to find the best-scoring configuration of parts. When the edge structure is a tree, this can be done with dynamic programming.
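A minimal sketch of the max-sum dynamic program, reduced to a chain where each discrete "state" stands in for a (location, mixture) pair. This is an illustrative toy, not the paper's implementation, which runs over a tree and speeds up the spring terms:

```python
import numpy as np

def max_sum_chain(unary, pairwise):
    """Max-sum dynamic programming on a chain of parts.
    unary[i][s]       : score of part i in state s
    pairwise[i][s, t] : spring score of part i in state s, part i+1 in state t
    Returns (best total score, best state per part)."""
    K = len(unary)
    msg = unary[-1].copy()
    back = []
    for i in range(K - 2, -1, -1):           # pass messages toward part 0
        scores = pairwise[i] + msg[None, :]  # shape (S_i, S_{i+1})
        back.append(scores.argmax(axis=1))
        msg = unary[i] + scores.max(axis=1)
    back.reverse()
    states = [int(msg.argmax())]             # pick the root, then backtrack
    for i in range(K - 1):
        states.append(int(back[i][states[-1]]))
    return float(msg.max()), states

# Tiny toy problem: 2 parts with 2 candidate states each.
score, states = max_sum_chain(
    [np.array([1.0, 0.0]), np.array([0.0, 2.0])],
    [np.array([[0.0, -5.0], [0.0, 0.0]])])
print(score, states)  # 2.0 [1, 1]
```

Each message pass costs O(S^2) per edge; for spatial states a distance transform can reduce this further.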

26 Inference & Learning Inference
max_{L,M} S(I, L, M). For a tree graph (V, E): dynamic programming.

Learning: given labeled positives {I_n, L_n, M_n} and negatives {I_n}, write z_n = (L_n, M_n) and S(I, z) = w · Φ(I, z). We solve

    min_w  (1/2) ||w||^2
    s.t.   w · Φ(I_n, z_n) ≥  1   for all n ∈ pos
           w · Φ(I_n, z)   ≤ −1   for all n ∈ neg and all z

In order to train our model, we assume a fully supervised training dataset, where we are given positive and negative images with part locations and mixture types. Since the scoring function is linear in its parameters w, we can learn the model using an SVM solver. This means we learn the local part templates, the springs, and the mixture co-occurrences simultaneously.

27 Benchmark Datasets PARSE Full-body BUFFY Upper-body
We use two benchmark datasets to evaluate our model. One is the PARSE dataset, consisting of full-body poses. The other is the Buffy dataset, consisting of upper-body poses.

28 How to Get Part Mixtures?
Solution: Cluster relative locations of joints w.r.t. parents These datasets include training data that is annotated with joint locations, but not mixture labels. Since we want mixtures to correspond to orientation and foreshortening, we derive mixture labels by clustering the relative locations of joints. Here, we show the result of k-means with K = 4. For example, we learn mixtures that correspond to different head orientations on the left and mixtures that correspond to different arm orientations on the right.
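A small sketch of this clustering step (a plain k-means written out for illustration, initialized from the first K points for simplicity; toy 2D offsets stand in for real annotations):

```python
import numpy as np

def kmeans(points, K, iters=20):
    """Plain k-means. In our setting each point is a child joint's position
    relative to its parent, and each cluster becomes one mixture label."""
    centers = points[:K].copy()
    for _ in range(iters):
        labels = ((points[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return labels, centers

# Toy offsets: a joint that sits either below or to the right of its parent.
offsets = np.array([[0.0, 10.0], [1.0, 9.0], [-1.0, 11.0],
                    [10.0, 0.0], [9.0, 1.0], [11.0, -1.0]])
labels, centers = kmeans(offsets, K=2)
print(labels)  # the first three points end up in one cluster, the rest in the other
```

The slide uses K = 4 clusters on real annotations, so each part gets four orientation-like mixture labels.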

29 Articulation We show the learned full-body model for the Parse dataset. We show the local templates on the top, and draw a skeleton on the bottom to denote the spring rest locations. Each skeleton corresponds to a particular combination of part mixtures.

30 K parts, M mixtures ⇒ M^K unique pictorial structures
Articulation Though we show 4 skeletons, there exists an exponential number of mixture combinations: with K parts and M mixtures per part, there are M^K unique pictorial structures.
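To make the count concrete: each of the K parts independently picks one of M mixture types, so a model of the size used later in the talk already encodes hundreds of millions of skeletons.

```python
K, M = 14, 4  # parts and mixtures per part (sizes that appear in the experiments)
print(M ** K)  # 268435456 distinct skeletons
```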

31 Articulation Not all combinations are equally likely; the co-occurrence model S(M) favors certain combinations over others. This is in contrast with a classic pictorial structure, which encodes only a single skeleton. K parts, M mixtures ⇒ M^K unique pictorial structures. Not all are equally likely --- "prior" given by S(M)

32 Qualitative Results We show qualitative results of our model for the Parse dataset. For each pair of images, we draw a skeleton on the left and the bounding boxes directly produced by our model, on the right. Our model is able to capture a large set of transformations including in-plane rotation, out of plane rotation and foreshortening.

33 Quantitative Results on PARSE
% of correctly localized limbs. All previous work uses explicitly articulated models.

Image Parse test set:

  Method           Total
  Ramanan 2007      27.2
  Andriluka 2009    55.2
  Johnson 2009      56.4
  Singh 2010        60.9
  Johnson 2010      66.2
  Our Model         74.9

For quantitative results, we compare our model to all previously published work on the Parse dataset, using the standard criterion of percentage of correctly localized parts, or PCP. Our total performance is 75%, comparing favorably to the previous best result of 66%.

34 Quantitative Results on PARSE
% of correctly localized limbs. Roughly 1 second per image.

Image Parse test set:

  Method           Head   Torso  U. Legs  L. Legs  U. Arms  L. Arms  Total
  Ramanan 2007     52.1   37.5   31.0     29.0     17.5     13.6     27.2
  Andriluka 2009   81.4   75.6   63.2     55.1     47.6     31.7     55.2
  Johnson 2009     77.6   68.8   61.5     54.9     53.2     39.3     56.4
  Singh 2010       91.2   76.6   71.5     64.9     50.0     34.2     60.9
  Johnson 2010     85.4   76.1   73.4     65.4     64.7     46.9     66.2
  Our Model        97.6   93.2   83.9     75.1     72.0     48.3     74.9

We also outperform all previous results on a per-part basis. Our model requires roughly 1 second per image on the PARSE dataset. We believe our model could run in real time with additional speed-ups.

35 More Parts and Mixtures Help
We also compare the effect of different model structures on pose estimation for the PARSE dataset. Overall, increasing the number of parts from 14 to 27 improves performance, probably because more image evidence is covered and more flexibility is allowed. With the number of parts fixed, increasing the number of mixtures also improves performance, likely because more orientations can be modeled. 14 parts (joints); 27 parts (joints + midpoints).

36 Quantitative Results on BUFFY
% of correctly localized limbs. Our algorithm = 5 seconds vs. next best = 5 minutes.

Subset of Buffy test set:

  Method           Total
  Tran 2010         62.3
  Andriluka 2009    73.5
  Eichner 2009      80.1
  Sapp 2010a        85.9
  Sapp 2010b        85.5
  Our Model         89.1

For the Buffy dataset, we again obtain the best overall performance while being orders of magnitude faster. Our total pipeline requires roughly 5 seconds to process an image, while the next best method takes 5 minutes.

37 Quantitative Results on BUFFY
% of correctly localized limbs. All previous work uses explicitly articulated models.

Subset of Buffy test set:

  Method           Head   Torso  U. Arms  L. Arms  Total
  Tran 2010         ---    ---    ---      ---     62.3
  Andriluka 2009   90.7   95.5   79.3     41.2     73.5
  Eichner 2009     98.7   97.9   82.8     59.8     80.1
  Sapp 2010a       100    100    91.1     65.7     85.9
  Sapp 2010b       100    96.2   95.3     63.0     85.5
  Our Model        100    99.6   96.6     70.9     89.1

We also outperform or nearly tie all previous results on a per-part basis on the Buffy dataset.

38 Human Detection Our model can be used to detect humans as well as estimate their poses. We compare results with a state-of-the-art upper-body detector included in the Buffy dataset release. The upper-body detector is based on the weakly-supervised part model introduced by Felzenszwalb et al. Our detector is more precise at almost all recall values. This suggests that our flexible mixture model and supervised learning framework are useful for general object detection.

39 Conclusion Model affine warps with a part-based model
In conclusion, we have described a state-of-the-art system, in both performance and speed, for the tasks of human pose estimation and detection. It is based on a novel representation for capturing limb deformations using flexible part-based models.

40 Conclusion Model affine warps with a part-based model
Exponential set of pictorial structures Our model encodes an exponential set of pictorial structures,

41 Conclusion Model affine warps with a part-based model
Exponential set of pictorial structures Flexible vs rigid relations and so it is quite flexible, but it can also learn to be rigid when needed.

42 Conclusion Model affine warps with a part-based model
Exponential set of pictorial structures Flexible vs rigid relations Supervision helps Finally, we find that supervised learning is helpful when learning such complex models.

43 Thank you Thank you very much.

