Bangpeng Yao and Li Fei-Fei

Presentation transcript:

Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities Bangpeng Yao and Li Fei-Fei Computer Science Department, Stanford University {bangpeng,feifeili}@cs.stanford.edu

Human-Object Interaction: robots interact with objects; automatic sports commentary (“Kobe is dunking the ball.”); medical care.

Human-Object Interaction: holistic image-based classification (previous talk: Grouplet; playing saxophone, playing bassoon) vs. detailed understanding and reasoning (HOI activity: tennis forehand). Grouplet is a generic feature for structured objects, or interactions of groups of objects.
Caltech101 results: Berg & Malik, 2005: 48%; Grauman & Darrell, 2005: 59%; Gehler & Nowozin, 2009: 77%; ours: 62%.

Human-Object Interaction: holistic image-based classification vs. detailed understanding and reasoning. Detailed understanding and reasoning: human pose estimation (head, torso, right arm, left arm, right leg, left leg); object detection (tennis racket); HOI activity: tennis forehand.

Outline Background and Intuition Mutual Context of Object and Human Pose Model Representation Model Learning Model Inference Experiments Conclusion

Human pose estimation & Object detection: human pose estimation is challenging. Difficult part appearance; self-occlusion; image regions that look like a body part. [Felzenszwalb & Huttenlocher, 2005; Ren et al, 2005; Ramanan, 2006; Ferrari et al, 2008; Yang & Mori, 2008; Andriluka et al, 2009; Eichner & Ferrari, 2009]

Human pose estimation & Object detection: object detection can facilitate human pose estimation, given that the object is detected.

Human pose estimation & Object detection: object detection is challenging. Objects are small, low-resolution, or partially occluded; image regions can look similar to the detection target. [Viola & Jones, 2001; Lampert et al, 2008; Divvala et al, 2009; Vedaldi et al, 2009]

Human pose estimation & Object detection: human pose estimation can facilitate object detection, given that the pose is estimated.

Human pose estimation & Object detection Mutual Context

Context in Computer Vision
Previous work uses context cues to facilitate object detection: helpful, but methods with context outperform those without context only moderately (~3-4%). [Hoiem et al, 2006; Rabinovich et al, 2007; Oliva & Torralba, 2007; Heitz & Koller, 2008; Desai et al, 2009; Divvala et al, 2009; Murphy et al, 2003; Shotton et al, 2006; Harzallah et al, 2009; Li, Socher & Fei-Fei, 2009; Marszalek et al, 2009; Bao & Savarese, 2010] [Viola & Jones, 2001; Lampert et al, 2008]
Our approach: two challenging tasks serve as mutual context for each other. (Figure: results with mutual context vs. without context.)

Outline Background and Intuition Mutual Context of Object and Human Pose Model Representation Model Learning Model Inference Experiments Conclusion

Mutual Context Model Representation
A: Activity (croquet shot, volleyball smash, tennis forehand).
O: Object (croquet mallet, volleyball, tennis racket).
H: Human pose. Intra-class variations: more than one H for each A; unobserved during training.
P: Body parts P1, P2, ..., PN. lP: location; θP: orientation; sP: scale.
f: Image evidence (fO, f1, f2, ..., fN). Shape context. [Belongie et al, 2002]

Mutual Context Model Representation
Markov Random Field over A, H, O, and P1, P2, ..., PN, with image evidence fO, f1, f2, ..., fN. Each clique has a clique weight and a clique potential.
Potentials between A, O, and H: frequency of co-occurrence between A, O, and H.
Potentials among the object and body parts: spatial relationship (location, orientation, size). The connectivity is obtained by structure learning: learn structural connectivity among the body parts and the object.
Potentials on the image evidence: discriminative part detection scores. Shape context + AdaBoost. [Andriluka et al, 2009] [Belongie et al, 2002] [Viola & Jones, 2001]
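As a rough illustration of how a model of this form could score one candidate configuration, the sketch below assumes the overall score is the sum over cliques of clique weight times clique potential. The function name `model_score`, the edge names, and every numeric value are made-up placeholders, not values from the talk.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def model_score(potentials: Dict[Edge, float], weights: Dict[Edge, float]) -> float:
    """Sum of (clique weight * clique potential) over all cliques of the graph."""
    return sum(weights[edge] * potentials[edge] for edge in potentials)


# Hypothetical potential values for one candidate configuration.
potentials = {
    ("A", "O"): 0.8,    # co-occurrence of activity and object
    ("A", "H"): 0.6,    # co-occurrence of activity and pose
    ("O", "H"): 0.7,    # co-occurrence of object and pose
    ("O", "P1"): 0.5,   # spatial relationship between object and one body part
    ("P1", "f1"): 0.9,  # detection score of that body part
}
# Uniform weights, purely for illustration; in the talk the weights are learned.
weights = {edge: 1.0 for edge in potentials}

score = model_score(potentials, weights)
```

Comparing such scores across candidate activities, objects, and poses is one way to picture how the three tasks constrain each other.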

Outline Background and Intuition Mutual Context of Object and Human Pose Model Representation Model Learning Model Inference Experiments Conclusion

Model Learning
Input: training images (cricket shot, cricket bowling).
Goals:
Hidden human poses (hidden variables)
Structural connectivity (structure learning)
Potential parameters; potential weights (parameter estimation)

Model Learning. Goals: hidden human poses; structural connectivity; potential parameters; potential weights. Approach: illustrated on croquet shot.

Model Learning. Goals: hidden human poses; structural connectivity; potential parameters; potential weights. Approach: hill-climbing, adding an edge or removing an edge at each step. The objective combines the joint density of the model with a Gaussian prior on the number of edges.
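The add-an-edge / remove-an-edge search can be sketched as a generic greedy hill-climb over edge sets. This is a minimal sketch, not the authors' implementation: `hill_climb`, `toy_score`, the `USEFUL` edge set, and the prior's mean and variance are all hypothetical stand-ins, with `toy_score` playing the role of "joint density plus Gaussian prior on the edge count".

```python
import itertools
from typing import Callable, FrozenSet, Iterable, Tuple

Edge = Tuple[str, str]


def hill_climb(
    nodes: Iterable[str],
    score: Callable[[FrozenSet[Edge]], float],
    max_iters: int = 100,
) -> Tuple[FrozenSet[Edge], float]:
    """Greedy search: toggle one edge at a time, keep any change that raises the score."""
    candidates = [tuple(sorted(pair)) for pair in itertools.combinations(nodes, 2)]
    current: FrozenSet[Edge] = frozenset()
    best = score(current)
    for _ in range(max_iters):
        improved = False
        for edge in candidates:
            neighbor = current ^ {edge}  # adds the edge if absent, removes it if present
            value = score(neighbor)
            if value > best:
                current, best, improved = neighbor, value, True
        if not improved:
            break  # local optimum: no single add or remove helps
    return current, best


# Hypothetical objective: a stand-in data-fit term plus a Gaussian log-prior
# on the number of edges (mean 2, standard deviation 2; both made up).
USEFUL = {("O", "P1"), ("P1", "P2")}


def toy_score(edges: FrozenSet[Edge]) -> float:
    fit = sum(1.0 if edge in USEFUL else -1.0 for edge in edges)
    log_prior = -((len(edges) - 2) ** 2) / (2 * 2.0 ** 2)
    return fit + log_prior


learned_edges, learned_score = hill_climb(["O", "P1", "P2"], toy_score)
```

Hill-climbing only guarantees a local optimum, so the result can depend on the starting structure and on the order in which edges are tried.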

Model Learning. Goals: hidden human poses; structural connectivity; potential parameters; potential weights. Approach: maximum likelihood; standard AdaBoost.

Model Learning. Goals: hidden human poses; structural connectivity; potential parameters; potential weights. Approach: max-margin learning.
Notations:
xi: potential values of the i-th image.
wr: potential weights of the r-th pose.
y(r): activity of the r-th pose.
ξi: a slack variable for the i-th image.
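The slide lists only the notation, so the following is a sketch of one standard multi-class max-margin objective that fits it, not necessarily the exact formulation used in the talk. It assumes the pose of each image should outscore every pose belonging to a different activity by a margin of 1, with the slack ξi absorbing violations. The trade-off constant `beta`, the function names, and all numbers are assumptions.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def slack(x_i: Sequence[float], pose_i: int,
          w: List[Sequence[float]], y: List[str]) -> float:
    """Hinge slack for one image: zero when its own pose beats every pose
    of a different activity by a margin of at least 1."""
    own = dot(w[pose_i], x_i)
    rivals = [dot(w[r], x_i) for r in range(len(w)) if y[r] != y[pose_i]]
    if not rivals:
        return 0.0
    return max(0.0, 1.0 - (own - max(rivals)))


def objective(X: List[Sequence[float]], poses: List[int],
              w: List[Sequence[float]], y: List[str], beta: float = 1.0) -> float:
    """0.5 * sum_r ||w_r||^2 + beta * sum_i xi_i (beta is an assumed constant)."""
    regularizer = 0.5 * sum(dot(w_r, w_r) for w_r in w)
    return regularizer + beta * sum(slack(x, c, w, y) for x, c in zip(X, poses))


# Hypothetical toy problem: three poses, the first two belong to the same activity.
w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
y = ["tennis forehand", "tennis forehand", "croquet shot"]
X = [[2.0, 0.0], [1.0, 1.0]]   # potential values of two images
poses = [0, 2]                 # pose index assigned to each image

xi_0 = slack(X[0], poses[0], w, y)
xi_1 = slack(X[1], poses[1], w, y)
total = objective(X, poses, w, y)
```

Because poses that share an activity are never treated as rivals, several poses per activity can coexist without being pushed apart, which matches the idea of more than one H for each A.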

Learning Results: cricket defensive shot; cricket bowling; croquet shot.

Learning Results: tennis forehand; tennis serve; volleyball smash.

Outline Background and Intuition Mutual Context of Object and Human Pose Model Representation Model Learning Model Inference Experiments Conclusion

Model Inference The learned models

Model Inference: the learned models. Compositional inference [Chen et al, 2007]: head detection, torso detection, tennis racket detection; layout of the object and body parts.

Model Inference The learned models Output

Outline Background and Intuition Mutual Context of Object and Human Pose Model Representation Model Learning Model Inference Experiments Conclusion

Dataset and Experiment Setup [Gupta et al, 2009] Cricket defensive shot Cricket bowling Croquet shot Tennis forehand Tennis serve Volleyball smash Sport data set: 6 classes 180 training (supervised with object and part locations) & 120 testing images Tasks: Object detection; Pose estimation; Activity classification.

Object Detection Results (plots for cricket bat, cricket ball, croquet mallet, tennis racket, volleyball; valid region marked). Methods compared: sliding window; pedestrian context; our method. [Andriluka et al, 2009] [Dalal & Triggs, 2006]

Object Detection Results. Cricket ball (small object) and volleyball (background clutter): sliding window; pedestrian context; our method.

Dataset and Experiment Setup [Gupta et al, 2009] Cricket defensive shot Cricket bowling Croquet shot Tennis forehand Tennis serve Volleyball smash Sport data set: 6 classes 180 training & 120 testing images Tasks: Object detection; Pose estimation; Activity classification.

Human Pose Estimation Results
Columns: Method; Torso; Upper Leg; Lower Leg; Upper Arm; Lower Arm; Head
Ramanan, 2006: .52 .22 .21 .28 .24 .17 .14 .42
Andriluka et al, 2009: .50 .31 .30 .27 .18 .19 .11 .45
Our full model: .66 .43 .39 .44 .34 .40 .29 .58
One pose per class: .63 .36 .41 .38 .35 .23
(Figures: tennis serve model and volleyball smash model, each with our estimation result and that of Andriluka et al, 2009; further estimation results.)

Dataset and Experiment Setup [Gupta et al, 2009] Cricket defensive shot Cricket bowling Croquet shot Tennis forehand Tennis serve Volleyball smash Sport data set: 6 classes 180 training & 120 testing images Tasks: Object detection; Pose estimation; Activity classification.

Activity Classification Results. Methods compared: our model; Gupta et al, 2009; bag-of-words SIFT+SVM. Plot notes: no scene information; scene is critical!! Examples: cricket shot; tennis forehand.

Conclusion. Human-Object Interaction: Grouplet representation vs. mutual context model. Next steps: pose estimation & object detection on PPMI images; modeling multiple objects and humans.

Acknowledgment. Stanford Vision Lab reviewers: Barry Chai (1985-2010), Juan Carlos Niebles, Hao Su. Silvio Savarese, U. Michigan. Anonymous reviewers.

Human-Object Interaction: holistic image-based classification (how to beat this???) vs. detailed understanding and reasoning: human pose estimation (head, torso, right arm, left arm, right leg, left leg); object detection (tennis racket).

Mutual Context Model Representation: hierarchical representation of images. A: human-object interaction activity; H: human pose; O: object; P1, P2, ..., PN: body parts; fO, f1, f2, ..., fN: image patches.