Extracting Simple Verb Frames from Images Toward Holistic Scene Understanding Prof. Daphne Koller Research Group Stanford University Geremy Heitz DARPA.

Extracting Simple Verb Frames from Images Toward Holistic Scene Understanding Prof. Daphne Koller Research Group Stanford University Geremy Heitz DARPA CLLR Workshop December 2, 2008

Grand Goal: Scene Understanding “man wearing a backpack, smoking a cigarette, walking a dog on a sidewalk” Man Dog Backpack Cigarette “A cow walking through the grass on a pasture by the sea”

Understanding Verb Frames “a man is walking on a sidewalk” “a dog is walking on a sidewalk” Frame: to walk. Primitives: Objects, Parts, Surfaces, Regions, Interactions, Context, Actions. Methods exist to extract each of these, but we need to both do a better job and extract them all at once. Modeling verb frames requires understanding the interactions between primitives, which fits well into the framework of graphical models. Image labels: Man, Dog, Backpack, Cigarette, Building, Sidewalk

Outline Extracting the Primitives Qualitative 3D Scene Layout Modeling Relationships Learning Frames Refined Characterization of Objects

Computer View of a “Scene” BUILDING ROAD STREET SCENE

Object Detection = Car = Person = Motorcycle = Boat = Sheep = Cow Detection Window W Score(W) > 0.5
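The detection rule on this slide (keep a window W when Score(W) > 0.5) can be sketched as a plain sliding-window loop. This is only an illustration: the window grid, the stride, and `toy_score` are assumptions made for the example, not the detector used in the talk.

```python
from typing import Callable, List, Tuple

Window = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def sliding_windows(img_w: int, img_h: int, win_w: int, win_h: int,
                    stride: int) -> List[Window]:
    """Enumerate fixed-size windows that fit entirely inside the image."""
    return [(x, y, win_w, win_h)
            for y in range(0, img_h - win_h + 1, stride)
            for x in range(0, img_w - win_w + 1, stride)]


def detect(windows: List[Window], score: Callable[[Window], float],
           threshold: float = 0.5) -> List[Window]:
    """Keep every window W with score(W) > threshold, as on the slide."""
    return [w for w in windows if score(w) > threshold]


def toy_score(w: Window) -> float:
    # Hypothetical stand-in for a trained classifier: it is confident
    # only about the window whose top-left corner is (20, 10).
    return 0.9 if (w[0], w[1]) == (20, 10) else 0.1


windows = sliding_windows(img_w=60, img_h=40, win_w=20, win_h=20, stride=10)
detections = detect(windows, toy_score)
print(detections)
```

A real detector would also scan several window scales and suppress overlapping detections; both are left out here to keep the thresholding step visible.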

Finding the Primitives Jointly SEASIDE PASTURE GRASS SKY Grass = Flat Sky = Far FG = Vertical 40% Grass, 30% Sky… 1 cow, 2 boats… [Heitz et al., NIPS 2008a]

Results – TAS Model Contextual Detector Base Detector [Heitz et al., ECCV 2008]

Qualitative 3D Scene Layout. Primitives imply a certain 3D layout of the scene, although absolute depth may not be preserved. For example: Sky is a far, vertical plane. Water and road are horizontal planes. Objects “pop up” from the image.
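One way to read this slide is as a lookup from a region's semantic label to a coarse geometric description. The table below is a minimal sketch under that reading; the tuple encoding and the "unknown" depth value are assumptions introduced for the example.

```python
from typing import Dict, Tuple

# (orientation, depth) per "stuff" label, taken from the examples on the slide.
QUALITATIVE_GEOMETRY: Dict[str, Tuple[str, str]] = {
    "sky": ("vertical", "far"),
    "water": ("horizontal", "unknown"),
    "road": ("horizontal", "unknown"),
}


def layout_for(label: str) -> Tuple[str, str]:
    """Return a qualitative (orientation, depth) pair for a region label.

    Labels not in the table are treated as foreground objects, which
    "pop up" from the image as vertical surfaces (assumed default).
    """
    return QUALITATIVE_GEOMETRY.get(label, ("vertical", "unknown"))


for label in ("sky", "road", "cow"):
    print(label, layout_for(label))
```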

Modeling Relationships: Beside, In front of, On. We have explored how to model 2D relationships. We should be able to extend this to 3D relationships. [Gould et al., IJCV 2008] [Heitz et al., ECCV 2008]
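As a rough illustration of 2D relationships, the three relations named on this slide can be approximated from bounding boxes alone. The geometric tests and the tolerance below are assumptions chosen for the example; the cited papers learn such relationships rather than hard-coding them.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float  # left
    y0: float  # top (image y grows downward)
    x1: float  # right
    y1: float  # bottom


def overlap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Length of the overlap between intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def relation(a: Box, b: Box, tol: float = 5.0) -> str:
    """Heuristic 2D relation of box a with respect to box b."""
    x_ov = overlap(a.x0, a.x1, b.x0, b.x1)
    y_ov = overlap(a.y0, a.y1, b.y0, b.y1)
    # "on": a starts above b and a's bottom edge falls inside b's extent.
    if x_ov > 0 and a.y0 < b.y0 and b.y0 - tol <= a.y1 <= b.y1:
        return "on"
    # "in front of": boxes overlap and a's base is lower in the image,
    # which suggests a is closer to the camera on a ground plane.
    if x_ov > 0 and y_ov > 0 and a.y1 > b.y1:
        return "in front of"
    # "beside": same height band, no horizontal overlap.
    if x_ov == 0 and y_ov > 0:
        return "beside"
    return "none"


man = Box(40, 20, 60, 100)
sidewalk = Box(0, 90, 200, 120)
dog = Box(70, 70, 100, 100)
car = Box(30, 60, 120, 130)
building = Box(0, 0, 200, 110)

print(relation(man, sidewalk), relation(dog, man), relation(car, building))
```

Extending this to 3D would replace the image-plane tests with tests on estimated depth and ground-plane position.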

Outline Extracting the Primitives Qualitative 3D Scene Layout Modeling Relationships Learning Frames Refined Characterization of Objects

Learning Semantics: Verb Frames. Given primitives, rough layout, and relationships, let’s learn subjects, verbs, and objects for frames: The [S] [V] the [O]. [S],[O]: CAR, ROAD, COW, GRASS, PERSON, APPLE, … [V]: WALKS ON, EATS, DRIVES ON, JUMPS OVER, THROWS, …

The CAR DRIVES ON the ROAD
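The slides do not say how frames are learned. As a minimal stand-in, assuming a set of annotated (subject, verb, object) triples is available, the most likely verb for a subject/object pair can be estimated by simple counting. The training triples below are made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, verb, object)


def learn_frames(triples: Iterable[Triple]) -> Dict[Tuple[str, str], Counter]:
    """Count how often each verb occurs with each (subject, object) pair."""
    counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for s, v, o in triples:
        counts[(s, o)][v] += 1
    return counts


def most_likely_verb(counts: Dict[Tuple[str, str], Counter],
                     s: str, o: str) -> Optional[str]:
    """Return the most frequent verb for (s, o), or None if the pair is unseen."""
    verbs = counts.get((s, o))
    if not verbs:
        return None
    return verbs.most_common(1)[0][0]


def render(s: str, v: str, o: str) -> str:
    """Fill the frame template from the slide: The [S] [V] the [O]."""
    return f"The {s} {v} the {o}"


# Hypothetical annotations, not data from the talk.
training = [
    ("CAR", "DRIVES ON", "ROAD"),
    ("CAR", "DRIVES ON", "ROAD"),
    ("COW", "WALKS ON", "GRASS"),
    ("COW", "EATS", "GRASS"),
    ("COW", "EATS", "GRASS"),
    ("PERSON", "EATS", "APPLE"),
]
counts = learn_frames(training)
verb = most_likely_verb(counts, "CAR", "ROAD")
sentence = render("CAR", verb, "ROAD")
print(sentence)
```

In a full system the verb would also depend on layout and relationship evidence from the image, not only on which primitives are present.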

Refined Characterization We need to know that the white stick is a cigarette… and where the man’s mouth is… in order to determine that he’s smoking.

Refined Object Characterization Set of “keypoint” landmarks Outline shape defined by connecting contour [Heitz et al., NIPS 2008b, IJCV in submission]

Results: Giraffe, Llama, Rhino

Mammals [Heitz et al., NIPS 2008b, IJCV in submission] Eating, Standing, Running, Standing

Activity Recognition: Eating vs. Drinking. 1) Localize the landmarks of the cow, including the head. 2) Extract histogram of “stuff” in a window around the head landmark (e.g. Grass, Cow). 3) Make a decision: Eating.
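The three steps above can be sketched on a toy label map. The head landmark is assumed to be already localized (step 1), the per-pixel "stuff" labels are assumed to come from a segmentation model, and the grass-versus-water decision rule is an assumption made for this example rather than the classifier used in the talk.

```python
from collections import Counter
from typing import Dict, List, Tuple


def stuff_histogram(label_map: List[List[str]], head: Tuple[int, int],
                    radius: int) -> Dict[str, float]:
    """Step 2: normalized histogram of labels in a square window around head."""
    r0, c0 = head
    rows = range(max(0, r0 - radius), min(len(label_map), r0 + radius + 1))
    cols = range(max(0, c0 - radius), min(len(label_map[0]), c0 + radius + 1))
    counts = Counter(label_map[r][c] for r in rows for c in cols)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def classify(hist: Dict[str, float]) -> str:
    """Step 3 (assumed rule): more grass than water near the head means Eating."""
    grass = hist.get("grass", 0.0)
    water = hist.get("water", 0.0)
    return "Eating" if grass >= water else "Drinking"


# Toy 4x4 label map; rows are image rows from top to bottom.
label_map = [
    ["sky", "sky", "sky", "sky"],
    ["cow", "cow", "sky", "sky"],
    ["cow", "cow", "grass", "grass"],
    ["grass", "grass", "grass", "water"],
]
head = (2, 2)  # (row, col) of the head landmark, assumed given by step 1

hist = stuff_histogram(label_map, head, radius=1)
decision = classify(hist)
print(hist, decision)
```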

Activity Recognition with People: Running, Walking, Standing, Hitting. Pose of the person is one of the important factors. We also need to recognize the objects the person interacts with.

How far can we take this? Front legs off ground = Jumping Ball near hands = Throwing Apple near mouth = Eating
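The three rules on this slide can be written as simple geometric tests on landmark and object positions. This is a sketch only: the landmark names, the distance threshold, and the ground-line margin are hypothetical, and the slide itself asks how far such rules can really be pushed.

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in pixels, y grows downward


def near(p: Point, q: Point, max_dist: float) -> bool:
    """True when two image points are within max_dist pixels of each other."""
    return math.dist(p, q) <= max_dist


def infer_action(landmarks: Dict[str, Point], objects: Dict[str, Point],
                 ground_y: float, max_dist: float = 15.0,
                 margin: float = 10.0) -> Optional[str]:
    """Apply the slide's rules in order; return None if none fires."""
    # Apple near mouth = Eating
    if "apple" in objects and "mouth" in landmarks:
        if near(objects["apple"], landmarks["mouth"], max_dist):
            return "Eating"
    # Ball near hands = Throwing
    if "ball" in objects:
        hands = [landmarks[k] for k in ("left_hand", "right_hand")
                 if k in landmarks]
        if any(near(objects["ball"], h, max_dist) for h in hands):
            return "Throwing"
    # Front legs off ground = Jumping
    front = [landmarks[k] for k in ("front_left_foot", "front_right_foot")
             if k in landmarks]
    if front and all(y < ground_y - margin for _, y in front):
        return "Jumping"
    return None


person = {"mouth": (50, 20), "left_hand": (30, 60), "right_hand": (70, 60)}
horse = {"front_left_foot": (40, 150), "front_right_foot": (50, 155)}

print(infer_action(person, {"apple": (55, 25)}, ground_y=200))
print(infer_action(person, {"ball": (75, 55)}, ground_y=200))
print(infer_action(horse, {}, ground_y=200))
```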

Does phased learning help? Cartoon/Caricature: Exaggerates the most salient features of the object class. Simple BG: Real object with no confusing clutter. Cluttered BG: Object in standard pose on natural background. Articulated: Once we have built a strong appearance model, can we learn complicated articulations?

Our Related Papers
G. Elidan, B. Packer, G. Heitz, and D. Koller. Convex Point Estimation using Undirected Bayesian Transfer Hierarchies. UAI, 2008.
S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-Class Segmentation with Relative Location Prior. IJCV, 2008.
S. Gould, P. Baumstarck, M. Quigley, A. Ng, and D. Koller. Integrating Visual and Range Data for Robotic Object Detection. ECCV Workshop M2SFA2, 2008.
G. Heitz and D. Koller. Learning Spatial Context: Using Stuff to Find Things. ECCV, 2008.
G. Heitz, S. Gould, A. Saxena, and D. Koller. Cascaded Classification Models: Combining Models for Holistic Scene Understanding. NIPS, 2008a.
G. Heitz, G. Elidan, B. Packer, and D. Koller. Shape-based Object Localization for Descriptive Classification. NIPS, 2008b.
