Yao, B., and Fei-Fei, L. IEEE Transactions on PAMI (2012)


1 Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses
Yao, B., and Fei-Fei, L. IEEE Transactions on PAMI (2012)
Date: 2013/05/27
Instructor: Prof. Wang, Sheng-Jyh
Student: Hung, Fei-Fan

2 Outline
Introduction: Intuition and goal
Model Representation
Model Learning: Obtaining Atomic Poses; Training Detectors and Classifiers; Estimating Model Parameters
Model Inference
Experimental Results
Conclusion

4 Why use context in computer vision?
Simple images vs. human activities.
[Figure: recognition without context vs. with mutual context; one annotation reads "~3-4%", bars labeled "with context" and "without context"]

5 Challenges in Human Pose Estimation
Human pose estimation is challenging:
Difficult part appearance
Self-occlusion
Image regions that look like a body part
Object detection can facilitate human pose estimation.

6 Challenges in Object Detection
Object detection is challenging:
Objects are small, low-resolution, or partially occluded
Image regions that look similar to the detection target
Human pose estimation can facilitate object detection.

7 The Goal
To build a mutual context model for Human-Object Interaction (HOI) activities

9 Model representation
Modeling the mutual context of objects and human poses with a conditional random field without hidden variables.
A: activity, e.g. croquet shot, volleyball smash, tennis forehand
O: objects, e.g. tennis ball, croquet mallet, volleyball, tennis racket; $O^1, O^2, \ldots, O^M$, where $M$ is the number of bounding boxes
H: atomic pose; an activity A can contain more than one atomic pose H
P: body parts, $P_1, P_2, \ldots, P_L$

10 Model representation
[Figure: graphical model with activity A, human pose H, body parts P1, P2, ..., PL, and objects O1, O2]
$\phi_1$: co-occurrence compatibility between A, O, H
$\phi_2$: spatial relationship between O and H
$\phi_3$ to $\phi_5$: image evidence modeled with detectors or classifiers
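The slide lists the five potentials but not the expression that combines them. As a sketch, assuming the overall score $\Psi(A,O,H,I)$ optimized during inference is a plain sum of the five terms (the argument lists of $\phi_1$ and $\phi_2$ are inferred from the descriptions on this slide):

```latex
\Psi(A,O,H,I) = \phi_1(A,O,H) + \phi_2(A,O,H) + \phi_3(O,I) + \phi_4(H,I) + \phi_5(A,I)
```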

11 $\phi_1$: Co-occurrence context
Co-occurrence between all A, O, H.
$\varsigma_{i,j,k}$: strength of the co-occurrence interaction between $h_i$, $o_j$, $a_k$
$\mathbf{1}(\cdot)$: indicator function
$N_h$: total number of atomic poses
$N_o$: total number of objects
$N_a$: total number of activity classes
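The formula on this slide did not survive extraction. A reconstruction consistent with the symbol definitions above, offered as a sketch rather than a quotation (the sum over bounding boxes $m = 1, \ldots, M$ is an assumption):

```latex
\phi_1(A,O,H) = \sum_{i=1}^{N_h} \sum_{m=1}^{M} \sum_{j=1}^{N_o} \sum_{k=1}^{N_a}
  \mathbf{1}(H = h_i)\, \mathbf{1}(O^m = o_j)\, \mathbf{1}(A = a_k)\, \varsigma_{i,j,k}
```

Only the single $(i, j, k)$ combination that matches the current labels contributes for each box, so the potential is a table lookup of co-occurrence strengths.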

12 $\phi_2$: Spatial context
Spatial relationship between all O and different H.
$\mathbf{x}_I^l$: location of the $l$-th body part in image $I$
$b(\mathbf{x}_I^l, O^m)$: a sparse binary vector that encodes the location of $O^m$ relative to $\mathbf{x}_I^l$
$\lambda_{i,j,k}$: weight of $b(\mathbf{x}_I^l, O^m)$ when $O^m = o_j$
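The formula itself is missing from the transcript. One form that agrees with the definitions on this slide is sketched below; the summation ranges are assumptions, and the weight may additionally depend on the body part index $l$:

```latex
\phi_2(A,O,H) = \sum_{m=1}^{M} \sum_{i=1}^{N_h} \sum_{j=1}^{N_o} \sum_{k=1}^{N_a} \sum_{l=1}^{L}
  \mathbf{1}(H = h_i)\, \mathbf{1}(O^m = o_j)\, \mathbf{1}(A = a_k)\,
  \lambda_{i,j,k}^{T}\, b(\mathbf{x}_I^l, O^m)
```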

13 $\phi_3$: Modeling objects
Model O in the image I using object detection scores.
For each object O:
$O^m$: object in the $m$-th bounding box
$g(O^m)$: vector of scores for detecting $O^m$
$\gamma_j$: weight of $g(O^m)$ when $O^m = o_j$
Between $O^m$ and $O^{m'}$:
$b(O^m, O^{m'})$: binary feature vector that encodes the spatial relationship between boxes $m$ and $m'$
$\gamma_{j,j'}$: weight for the geometric configuration between $o_j$ and $o_{j'}$
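The two groups of definitions suggest a unary term per bounding box plus a pairwise term between boxes. A hedged reconstruction of the missing formula, with the summation ranges assumed:

```latex
\phi_3(O,I) = \sum_{m=1}^{M} \sum_{j=1}^{N_o} \mathbf{1}(O^m = o_j)\, \gamma_j^{T} g(O^m)
  + \sum_{m=1}^{M} \sum_{m'=1}^{M} \sum_{j=1}^{N_o} \sum_{j'=1}^{N_o}
    \mathbf{1}(O^m = o_j)\, \mathbf{1}(O^{m'} = o_{j'})\, \gamma_{j,j'}^{T}\, b(O^m, O^{m'})
```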

14 $\phi_4$: Modeling human pose
Model the atomic pose that H belongs to and the likelihood $P(I \mid h_i)$.
$P(\mathbf{x}_I^l \mid \mathbf{x}_{h_i}^l)$: Gaussian likelihood of the body part given the atomic pose
$f^l(I)$: vector of scores for detecting the body part at $\mathbf{x}_I^l$
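A sketch of how the two ingredients on this slide could combine. The weights $\alpha$ and $\beta$ appear later in the list of learned parameters; indexing them by atomic pose $i$ and body part $l$ is an assumption made here for illustration:

```latex
\phi_4(H,I) = \sum_{i=1}^{N_h} \sum_{l=1}^{L} \mathbf{1}(H = h_i)
  \left[ \alpha_{i,l}^{T}\, P(\mathbf{x}_I^l \mid \mathbf{x}_{h_i}^l) + \beta_{i,l}^{T} f^l(I) \right]
```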

15 $\phi_5$: Modeling activity
Model the HOI activity by training an activity classifier.
$s(I)$: $N_a$-dimensional output of a one-versus-all (OVA) discriminative classifier that takes the image as features
$\eta_k$: feature weight of $a_k$
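Following the same pattern as the other potentials, the missing formula would plausibly weight the classifier output by the parameters of the selected activity class. This is a reconstruction from the definitions above, not a quotation:

```latex
\phi_5(A,I) = \sum_{k=1}^{N_a} \mathbf{1}(A = a_k)\, \eta_k^{T} s(I)
```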

16 One-versus-all classifier
[Figure: one-versus-all (OVA) vs. one-versus-one (OVO) classification]
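To make the OVA/OVO contrast concrete, here is a minimal Python sketch. It assumes linear scorers with made-up weights (`WEIGHTS`, `BIASES` are hypothetical, not from the slides): OVA keeps one scorer per class and predicts the class with the highest score, while OVO would need one classifier per unordered pair of classes.

```python
from itertools import combinations


def ova_scores(x, weights, biases):
    """Return one score per class: s_k = w_k . x + b_k (linear scorers assumed)."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def ova_predict(x, weights, biases):
    """Pick the class whose one-versus-all scorer responds most strongly."""
    scores = ova_scores(x, weights, biases)
    return max(range(len(scores)), key=scores.__getitem__)


def num_classifiers(n_classes):
    """OVA trains one classifier per class; OVO trains one per unordered pair."""
    ova = n_classes
    ovo = len(list(combinations(range(n_classes), 2)))
    return ova, ovo


# Hypothetical 3-class problem with 2-D features.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
BIASES = [0.0, 0.0, 0.0]
```

The vector returned by `ova_scores` plays the role of the $N_a$-dimensional classifier output $s(I)$.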

17 Model Properties
Spatial context between O and H: object detection and human pose estimation facilitate each other.
The model can ignore objects and body parts that are unreliable.
Flexible to extend to large-scale datasets and other activities: the joint model can share all objects and atomic poses.

19 Model Learning
Assign each human pose to an atomic pose
Train detectors and classifiers
Estimate parameters by maximum likelihood

20 Obtaining Atomic Poses
Use clustering to obtain atomic poses:
Normalize the annotations $\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^L$.
Find missing parts using the nearest visible neighbor.
Obtain a set of atomic poses by hierarchical clustering with the maximum linkage measure $\sum_{l=1}^{L} \mathbf{w}^T |\mathbf{x}_i^l - \mathbf{x}_j^l|$.
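The clustering step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes each pose is a list of $L$ part vectors, uses the weighted L1 distance from the slide, and merges clusters by maximum (complete) linkage until a chosen number of clusters remains. The function names and the stopping rule `n_clusters` are hypothetical.

```python
def pose_distance(pose_a, pose_b, weights):
    """Weighted L1 distance: sum over parts l of w^T |x_a^l - x_b^l|."""
    total = 0.0
    for part_a, part_b in zip(pose_a, pose_b):
        for w, a, b in zip(weights, part_a, part_b):
            total += w * abs(a - b)
    return total


def complete_linkage(poses, weights, n_clusters):
    """Agglomerative clustering; cluster distance is the max pairwise distance."""
    clusters = [[i] for i in range(len(poses))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(pose_distance(poses[i], poses[j], weights)
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Four toy poses, each with two parts described by (x, y).
POSES = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.1, 0.0), (1.0, 1.1)],
    [(5.0, 5.0), (6.0, 6.0)],
    [(5.0, 5.2), (6.0, 6.0)],
]
CLUSTERS = complete_linkage(POSES, (1.0, 1.0), 2)
```

Each resulting cluster would correspond to one atomic pose; the brute-force search is quadratic per merge and is only meant for small examples.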

21 Training Detectors and Classifiers
$g(O^m)$: object detector in $\phi_3(O,I)$, trained as a deformable part model
$f^l(I)$: human body part detector in $\phi_4(H,I)$, trained as a deformable part model
$s(I)$: overall activity classifier in $\phi_5(A,I)$, trained with spatial pyramid matching (SPM): SIFT + 3-level image pyramid

22 Spatial pyramid matching
SIFT features, 3-level pyramid, histogram intersection kernel.
S. Lazebnik, C. Schmid, and J. Ponce, “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2006.
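As a rough illustration of the kernel named on this slide, the Python sketch below computes a histogram intersection at each pyramid level and combines the levels with the weights from the cited spatial pyramid matching paper (coarsest level weighted $1/2^{L}$, level $l \geq 1$ weighted $1/2^{L-l+1}$). The histograms are toy numbers; extracting and quantizing SIFT descriptors is out of scope here.

```python
def histogram_intersection(h1, h2):
    """Sum of bin-wise minima of two histograms of equal length."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def pyramid_match(levels_a, levels_b):
    """Weighted sum of per-level intersections.

    levels_x[l] is the concatenated histogram of all cells at pyramid
    level l, where level 0 covers the whole image.
    """
    top = len(levels_a) - 1
    score = 0.0
    for level, (ha, hb) in enumerate(zip(levels_a, levels_b)):
        if level == 0:
            weight = 1.0 / 2 ** top
        else:
            weight = 1.0 / 2 ** (top - level + 1)
        score += weight * histogram_intersection(ha, hb)
    return score


# Toy 3-level pyramids (real ones have 1, 4, and 16 cells per level).
IMAGE_A = [[4, 2], [1, 3], [2, 2]]
IMAGE_B = [[3, 3], [2, 1], [2, 0]]
```

Finer levels receive larger weights because a match inside a small cell is stronger evidence of spatial agreement than a match anywhere in the image.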

23 Spatial pyramid matching

24 Estimating Model Parameters
Estimate $\varsigma, \lambda, \gamma, \alpha, \beta$ with a maximum likelihood (ML) approach and a zero-mean Gaussian prior.
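Maximum likelihood with a zero-mean Gaussian prior on the weights amounts to an L2-regularized log-likelihood. A sketch of the objective, where $W$ collects the parameters listed above, $n$ indexes training images, and the prior variance $\sigma^2$ is a symbol introduced here for illustration:

```latex
\hat{W} = \arg\max_{W} \sum_{n} \log p(A_n, O_n, H_n \mid I_n, W) - \frac{\lVert W \rVert_2^2}{2\sigma^2}
```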

25 Learning result

27 Model Inference
Given a new image:
Initialize with learned results
Update human body parts
Update object detection results
Update A and H labels

28 Initialization
For a new image, initialize with the learned results:
A (activity classification): SPM classification
O (object detection): object detection
H (human pose estimation): pictorial structure model

29 Update model inference
Step: update human body parts.
Marginal distribution of the human pose: $\{p(H = h_i)\}_{i=1}^{N_h}$; check which human pose H has the highest probability.
Use a mixture of Gaussians to refine the prior of each body part: from $\{\mathcal{N}(\mathbf{x}_{h_i}^l)\}_{i=1}^{N_h}$ to $\sum_{i=1}^{N_h} p(H = h_i)\, \mathcal{N}(\mathbf{x}_{h_i}^l)$.
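The refined body-part prior is a mixture whose weights are the current pose marginals. The Python sketch below shows the idea for a single scalar coordinate; the one-dimensional simplification and all numbers are assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_prior(x, pose_probs, means, stds):
    """Prior of one body-part coordinate: sum_i p(H=h_i) * N(x; mean_i, std_i)."""
    return sum(p * gaussian_pdf(x, m, s)
               for p, m, s in zip(pose_probs, means, stds))


def most_likely_pose(pose_probs):
    """Index of the atomic pose with the highest marginal probability."""
    return max(range(len(pose_probs)), key=pose_probs.__getitem__)
```

When one atomic pose dominates the marginal, the mixture collapses toward that pose's Gaussian, which narrows the search region for the body part.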

30 Update model inference
Step: update object detection results (terms involving O: O,H; O,A,H; O,I).
Greedy forward search method:
Initialize the scores $(m,j)$ with no object in any bounding box, where $(m,j)$ is the score of assigning the $m$-th object bounding box to object $o_j$.
Select $(m^*, j^*) = \arg\max (m,j)$.
Label box $m^*$ as $o_{j^*}$ and update $(m,j)$.
Stop when the score of $(m^*, j^*) < 0$.
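The greedy forward search can be sketched as follows. This is a minimal Python illustration under stated assumptions: `score(m, j, labels)` is a hypothetical callable returning the gain of labeling box `m` as object `j` given the labels chosen so far, and the toy score (a table of unary scores minus a penalty for reusing an object class) is invented for the example.

```python
def greedy_label_boxes(n_boxes, n_objects, score):
    """Greedily assign object labels to bounding boxes.

    Starts with every box unlabeled, repeatedly commits the best-scoring
    (box, object) pair, and stops once the best remaining score is negative.
    """
    labels = {}  # box index -> object index; missing key means "no object"
    while len(labels) < n_boxes:
        candidates = [(score(m, j, labels), m, j)
                      for m in range(n_boxes) if m not in labels
                      for j in range(n_objects)]
        best_score, m_star, j_star = max(candidates)
        if best_score < 0:
            break
        labels[m_star] = j_star
    return labels


# Hypothetical unary scores for 3 boxes and 2 object classes.
UNARY = [[2.0, -1.0], [0.5, 1.5], [-0.3, -0.2]]


def toy_score(m, j, labels):
    """Unary score minus a penalty when object j is already used elsewhere."""
    penalty = 1.0 if j in labels.values() else 0.0
    return UNARY[m][j] - penalty


RESULT = greedy_label_boxes(3, 2, toy_score)
```

Recomputing the candidate scores after every commitment mirrors the "update $(m,j)$" step: earlier choices change how attractive the remaining assignments are.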

31 Update model inference
Step: update A and H labels.
Enumerate the possible A and H labels and optimize $\Psi(A,O,H,I)$, based on the current results of human pose estimation and object detection.
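Because the number of activities and atomic poses is small, the last step can be an exhaustive search. A minimal Python sketch, assuming a hypothetical callable `psi(a, h)` that evaluates the overall score with the object labels and body-part locations held fixed:

```python
from itertools import product


def best_activity_and_pose(n_activities, n_poses, psi):
    """Score every (activity, atomic pose) pair and return the best pair."""
    return max(product(range(n_activities), range(n_poses)),
               key=lambda pair: psi(pair[0], pair[1]))


# Hypothetical score table: rows are activities, columns are atomic poses.
TABLE = [[0.1, 0.4], [0.9, 0.2], [0.3, 0.8]]
BEST = best_activity_and_pose(3, 2, lambda a, h: TABLE[a][h])
```

The cost is one evaluation per (activity, pose) pair, which stays cheap as long as both label sets are small.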

33 Experimental Results (Sports Dataset)

34 Experimental Results (Sports Dataset)

35 Experimental Results (Sports Dataset)
Activity classification

37 Experimental results (PPMI Dataset)

38 Experimental results (PPMI Dataset)

41 Conclusion
Mutual context can significantly improve performance on difficult visual recognition problems.
The joint model can share all the information.
The model requires annotating all human body parts and objects in the training images.

42 Reference
B. Yao and L. Fei-Fei, “Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2012.
B. Yao and L. Fei-Fei, “Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
B. Sapp, A. Toshev, and B. Taskar, “Cascade Models for Articulated Pose Estimation,” Proc. European Conf. Computer Vision, 2010.
S. Lazebnik, C. Schmid, and J. Ponce, “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2006.

43 [Figure: two copies of the graphical model with nodes A, H, P1, P2, ..., PL, O1, O2]

