1 Integrating Vision Models for Holistic Scene Understanding Geremy Heitz CS223B March 4 th, 2009.

2 Scene/Image Understanding What’s happening in these pictures?

3 Human View of a “Scene” “A car passes a bus on the road, while people walk past a building.” ROAD BUILDING CAR BUS PEOPLE WALKING

4 Computer View of a “Scene”: BUILDING, ROAD, STREET SCENE. Can we integrate all of these subtasks so that the whole > the sum of its parts?

5 Outline: Overview; Integrating Vision Models (CCM: Cascaded Classification Models) [Heitz et al. NIPS 2008a]; Learning Spatial Context (TAS: Things and Stuff) [Heitz & Koller ECCV 2008]; Future Directions

6 Image/Scene Understanding: “a man and a dog are walking on a sidewalk in front of a building”. Image labels: Man, Dog, Backpack, Cigarette, Building, Sidewalk. Levels: Primitives, Objects, Parts, Surfaces, Regions, Interactions, Context, Actions, Scene Descriptions. Established techniques address these in isolation. Reasoning over image statistics. Reasoning over more abstract entities. Complex web of relations, well represented by graphical models.

7 Why will integration help? What is this object?

8 More Context Context is key!

9 Outline: Overview; Integrating Vision Models (CCM: Cascaded Classification Models) [Heitz et al. NIPS 2008a]; Learning Spatial Context (TAS: Things and Stuff); Future Directions

10 Human View of a “Scene”: ROAD, BUILDING, CAR, BUS, PEOPLE WALKING. Subtasks: Scene Categorization, Object Detection, Region Labelling, Depth Reconstruction, Surface Orientations, Boundary/Edge Detection, Outlining/Refined Localization, Occlusion Reasoning, ...

11 Related Work: Intrinsic Images [Barrow and Tenenbaum, 1978], [Tappen et al., 2005]. Hoiem et al., “Closing the Loop in Scene Interpretation”, 2008. We want to focus more on “semantic” classes. We want to be flexible about using outside models. We want an extensible framework, not one engineered for a particular set of tasks.

12 How Should We Integrate? Option 1: single joint model over all variables. Pros: tighter interactions, more designer control. Cons: need expertise in each of the subtasks. Option 2: simple, flexible combination of existing models. Pros: state-of-the-art models, easier to extend, limited “black-box” interface to components. Cons: missing some of the modeling power. Existing models: DETECTION (Dalal & Triggs, 2006), REGION LABELING (Gould et al., 2007), DEPTH RECONSTRUCTION (Saxena et al., 2007)

13 Cascaded Classification Models [diagram: Image -> Features f_DET, f_REG, f_REC -> Independent Models DET_0, REG_0, REC_0 (Object Detection, Region Labeling, 3D Reconstruction) -> Context-aware Models DET_1, REG_1, REC_1]

14 Integrated Model for Scene Understanding: Object Detection, Multi-class Segmentation, Depth Reconstruction, Scene Categorization. I’ll show you these

15 Basic Object Detection [color legend: Car, Person, Motorcycle, Boat, Sheep, Cow]. Detection Window W: Score(W) > 0.5

16 Base Detector - HOG [Dalal & Triggs, CVPR, 2006]. HOG Detector: Feature Vector X -> SVM Classifier

17 Context-Aware Object Detection. Features Φ(W): From Base Detector: log score D(W). From Scene Category: MAP category, marginals. From Region Labels: how much of each label is in a window adjacent to W. From Depths: mean, variance of depths, estimate of “true” object size. Final Classifier: P(Y) = Logistic(Φ(W)). Example features: Scene Type: Urban scene; % of “road” below W; variance of depths in W
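The context-aware detector on this slide can be sketched as a logistic classifier over a concatenated feature vector. This is a minimal illustration, not the authors' implementation: the function names, the feature ordering, the bias term, and the linear weighting inside the logistic are all assumptions (the slide only states P(Y) = Logistic(Φ(W))), and the “true” object size feature is omitted.

```python
import math


def logistic(z):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


def context_features(det_log_score, scene_marginals, region_fractions, depths):
    """Build a hypothetical feature vector Phi(W) for one candidate window W.

    det_log_score    : log score D(W) from the base detector
    scene_marginals  : scene-category probabilities for the image
    region_fractions : fraction of each region label in a window adjacent to W
    depths           : estimated depths of the pixels inside W
    """
    mean_depth = sum(depths) / len(depths)
    var_depth = sum((d - mean_depth) ** 2 for d in depths) / len(depths)
    # Leading 1.0 is a bias term (an assumption, not stated on the slide).
    return ([1.0, det_log_score]
            + list(scene_marginals)
            + list(region_fractions)
            + [mean_depth, var_depth])


def detection_probability(phi, weights):
    """P(Y = 1 | W), assuming a linear model inside the logistic."""
    return logistic(sum(w * x for w, x in zip(weights, phi)))


phi = context_features(0.3, [0.7, 0.3], [0.6, 0.1], [10.0, 12.0])
prob = detection_probability(phi, [0.0] * len(phi))
```

With all weights at zero the classifier is uninformative; in practice the weights would be learned from labeled windows.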

18 Multi-class Segmentation CRF Model [Gould et al., IJCV 2007]. Label each pixel as one of: {‘grass’, ‘road’, ‘sky’, etc.}. Conditional Markov random field (CRF) over superpixels. Singleton potentials: log-linear function of boosted detector scores for each class. Pairwise potentials: affinity of classes appearing together, conditioned on (x,y) location within the image.

19 Context-Aware Multi-class Seg. Additional Feature: Relative Location Map Where is the grass?

20 Depth Reconstruction CRF [Saxena et al., PAMI 2008]. Label each pixel with its distance from the camera. Conditional Markov random field (CRF) over superpixels, with continuous variables. Models depth as a linear function of features with pairwise smoothness constraints. http://make3d.stanford.edu

21 Depth Reconstruction with Context [diagram: BLACK BOX; region labels GRASS, SKY]. Find d*. Reoptimize depths with new constraints: $d_{\mathrm{CCM}} = \arg\min_{d} \; \alpha \lVert d - d^{*} \rVert + \beta \lVert d - d_{\mathrm{CONTEXT}} \rVert$
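To make the reoptimization step concrete, here is a small sketch under one explicit assumption: both penalty terms are squared Euclidean norms. The slide writes the norms unsquared and does not say which norm is used, so this is a simplification chosen because it has a closed-form solution. Under that assumption, each depth is a weighted average of the black-box estimate d* and the context-suggested depth. The function name and the list representation are hypothetical.

```python
def reoptimize_depths(d_star, d_context, alpha, beta):
    """Minimize alpha*||d - d_star||^2 + beta*||d - d_context||^2 over d.

    Assumes squared L2 penalties (not stated on the slide) and
    non-negative weights. Setting the gradient to zero gives, per element,
    d = (alpha*d_star + beta*d_context) / (alpha + beta).
    """
    if len(d_star) != len(d_context):
        raise ValueError("depth vectors must have the same length")
    if alpha < 0 or beta < 0 or alpha + beta == 0:
        raise ValueError("weights must be non-negative and not both zero")
    total = alpha + beta
    return [(alpha * a + beta * b) / total for a, b in zip(d_star, d_context)]


# Black-box depths vs. depths suggested by context (toy numbers).
d_ccm = reoptimize_depths([10.0, 20.0], [14.0, 20.0], alpha=3.0, beta=1.0)
```

A larger alpha keeps the result close to the original reconstruction; a larger beta trusts the context more.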

22 Training. Notation: I: Image; f: Image Features; Ŷ: Output labels. Training Regimes: Independent [diagram: I -> f_D, f_S, f_Z -> Ŷ_D^0, Ŷ_S^0, Ŷ_Z^0]; Ground: Groundtruth Input [diagram nodes: I; f_D, f_S, f_Z; Ŷ_D^1; Ŷ_S^*, Ŷ_S^1; Ŷ_Z^*, Ŷ_Z^1]

23 Training: CCM Training Regime. Later models can ignore the mistakes of previous models. Training realistically emulates testing setup. Allows disjoint datasets. K-CCM: A CCM with K levels of classifiers. [Diagram: I -> f_D, f_S, f_Z -> Ŷ_D^0, Ŷ_S^0, Ŷ_Z^0 -> Ŷ_D^1, Ŷ_S^1, Ŷ_Z^1]
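The feed-forward structure of a K-CCM can be sketched as a loop over tiers, where each model sees its own image features plus the outputs of the previous tier. This is an illustrative sketch only: `run_ccm`, the dictionary layout, and the toy models are hypothetical, and exactly which previous-tier outputs each model consumes is an assumption (here every model receives all of them).

```python
def run_ccm(features, tiers):
    """Run a cascade of classifiers tier by tier.

    features : dict mapping task name -> that task's image features
    tiers    : list of dicts, one per level, mapping task name -> callable
               model(own_features, previous_outputs) -> output
    Returns the outputs of the last tier.
    """
    outputs = {}  # tier 0 models receive an empty context
    for models in tiers:
        previous = dict(outputs)  # freeze the previous tier's outputs
        outputs = {
            task: model(features[task], previous)
            for task, model in models.items()
        }
    return outputs


# Toy 2-CCM with two tasks; the "models" are stand-in arithmetic.
features = {"det": [1, 2], "reg": [3, 5]}
tier0 = {
    "det": lambda f, ctx: sum(f),
    "reg": lambda f, ctx: max(f),
}
tier1 = {
    "det": lambda f, ctx: sum(f) + ctx["reg"],
    "reg": lambda f, ctx: max(f) + ctx["det"],
}
final = run_ccm(features, [tier0, tier1])
```

Because later tiers are trained on the actual outputs of earlier tiers, the same loop describes both the training-time data flow and the test-time one.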

24 Experiments. DS1: 422 images, fully labeled; Categorization, Detection, Multi-class Segmentation; 5-fold cross validation. DS2: 1745 images, disjoint labels; Detection, Multi-class Segmentation, 3D Reconstruction; 997 train, 748 test

25 CCM Results – DS1 [plots: CAR, PEDESTRIAN, MOTORBIKE, BOAT, CATEGORIES, REGION LABELS]

26 CCM Results – DS2

Detection   Car    Person  Bike   Boat   Sheep  Cow    Depth
INDEP       0.357  0.267   0.410  0.096  0.319  0.395  16.7m
2-CCM       0.364  0.272   0.410  0.212  0.289  0.415  15.4m

Regions     Tree   Road    Grass  Water  Sky    Building  FG
INDEP       0.541  0.702   0.859  0.444  0.924  0.436     0.828
2-CCM       0.581  0.692   0.860  0.565  0.930  0.489     0.819

[Figure: Boats]

27 Example Results INDEPENDENT CCM

28 Example Results [panels: Independent Objects, Independent Regions, CCM Objects; Independent Objects, Independent Regions, CCM Regions]

29 Understanding the man “a man, a dog, a sidewalk, a building”

30 Outline: Overview; Integrating Vision Models (CCM: Cascaded Classification Models); Learning Spatial Context (TAS: Things and Stuff) [Heitz & Koller ECCV 2008]; Future Directions

31 Things vs. Stuff Stuff (n): Material defined by a homogeneous or repetitive pattern of fine-scale properties, but has no specific or distinctive spatial extent or shape. (REGIONS) Thing (n): An object with a specific size and shape. (DETECTIONS) From: Forsyth et al. Finding pictures of objects in large collections of images. Object Representation in Computer Vision, 1996.

32 Cascaded Classification Models [diagram: Image -> Features f_DET, f_REG, f_REC -> Independent Models DET_0, REG_0, REC_0 (Object Detection, Region Labeling, 3D Reconstruction) -> Context-aware Models DET_1, REG_1, REC_1]

33 CCMs vs. TAS. CCM (Feedforward): Image -> f_DET, f_REG -> DET_0, REG_0 -> DET_1, REG_1. TAS (Modeled Jointly): Image -> f_DET, f_REG -> DET, REG, linked by Relationships

34 Satellite Detection Example FALSE POSITIVE TRUE POSITIVE

35 Stuff-Thing Context. Stuff-Thing: based on spatial relationships. Intuition: Trees = no cars; Houses = cars nearby; Road = cars here. “Cars drive on roads”, “Cows graze on grass”, “Boats sail on water”. Goal: Unsupervised

36 Things: Detection. T_i ∈ {0,1}; T_i = 1: candidate window contains a positive detection. Image Window W_i. P(T_i) = Logistic(score(W_i))

37 Stuff: Coherent image regions; coarse “superpixels”. Feature vector F_j in R^n. Cluster label S_j in {1…C}. Stuff model: Naïve Bayes [diagram nodes: S_j, F_j]

38 Relationships. Descriptive Relations: “Near”, “Above”, “In front of”, etc. Choose set R = {r_1 … r_K}. R_ijk = 1: detection i and region j have relation k. Relationship model [diagram nodes: T_i, S_j, R_ijk]. Example: S_72 = Trees, S_4 = Houses, S_10 = Road; T_1; R_1,10,in = 1

39 Unrolled Model. Candidate Windows: T_1, T_2, T_3. Image Regions: S_1, S_2, S_3, S_4, S_5. Relations: R_2,1,above = 0; R_3,1,left = 1; R_1,3,near = 0; R_3,3,in = 1; R_1,1,left = 1

40 Learning the Parameters. Assume we know R. S_j is hidden; everything else observed. Expectation-Maximization: “contextual clustering”. Parameters are readily interpretable. [Plate diagram: Image Window W_i, T_i, R_ijk, S_j, F_j; plates N, J, K; legend: Supervised in Training Set, Always Observed, Always Hidden]

41 Which Relationships to Use? R_ijk = spatial relationship between candidate i and region j. R_ij1 = candidate in region; R_ij2 = candidate closer than 2 bounding boxes (BBs) to region; R_ij3 = candidate closer than 4 BBs to region; R_ij4 = candidate farther than 8 BBs from region; R_ij5 = candidate 2 BBs left of region; R_ij6 = candidate 2 BBs right of region; R_ij7 = candidate 2 BBs below region; R_ij8 = candidate more than 2 and less than 4 BBs from region; …; R_ijK = candidate near region boundary. How do we avoid overfitting?
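A few of the distance-based relationships listed on this slide can be sketched as follows. The slide does not define how distance “in bounding boxes” is measured, so the choices here are assumptions made for illustration: the region is a set of pixel coordinates, distance is measured from the box center to the nearest region pixel, and one “BB” is the longer side of the box. The function and key names are hypothetical.

```python
import math


def relation_vector(box, region_pixels):
    """Compute four binary relations between a candidate box and a region.

    box           : (x_min, y_min, x_max, y_max) of the candidate window
    region_pixels : non-empty set of (x, y) integer pixel coordinates
    Distances are expressed in units of the box's longer side (assumption).
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    bb_size = max(x1 - x0, y1 - y0)
    nearest = min(math.hypot(px - cx, py - cy) for px, py in region_pixels)
    dist_bb = nearest / bb_size
    # "Candidate in region": the box center falls on a region pixel.
    inside = (round(cx), round(cy)) in region_pixels
    return {
        "in": int(inside),
        "within_2bb": int(dist_bb < 2),
        "within_4bb": int(dist_bb < 4),
        "beyond_8bb": int(dist_bb > 8),
    }


box = (0, 0, 10, 10)
near = relation_vector(box, {(5, 5), (6, 5)})
far = relation_vector(box, {(105, 5)})
```

Each candidate-region pair yields one such vector, and a structure-learning step can then decide which entries are worth keeping active.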

42 Learning the TAS Relations. Intuition: “Detached” R_ijk = inactive relationship. Structural EM iterates: learn parameters; decide which edge to toggle. Evaluate with l(T|F,W,R): requires inference; better results than using standard E[l(T,S,F,W,R)]. [Diagram nodes: T_i, S_j, F_j, R_ij1, R_ij2, …, R_ijK]

43 Inference Goal: Block Gibbs Sampling. Easy to sample T_i’s given S_j’s and vice versa
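The alternation between the two blocks can be sketched on a toy model. The sketch assumes a factorization in which each relation R_ij depends only on its own T_i and S_j, so that the T_i's are conditionally independent given S and the S_j's given T. It also simplifies to a single binary relation per pair (K = 1), folds the region prior and feature likelihood into one score per cluster, and uses no burn-in. All names and numbers are hypothetical.

```python
import random


def bern(p, r):
    """Probability of observing r in {0, 1} under Bernoulli(p)."""
    return p if r == 1 else 1.0 - p


def sample_index(weights, rng):
    """Draw an index with probability proportional to its weight."""
    u = rng.random() * sum(weights)
    acc = 0.0
    for idx, w in enumerate(weights):
        acc += w
        if u < acc:
            return idx
    return len(weights) - 1


def block_gibbs(det_prior, stuff_scores, R, rel_prob, n_sweeps, seed=0):
    """Estimate P(T_i = 1) by alternating two sampling blocks.

    det_prior    : det_prior[i] = detector-based P(T_i = 1 | W_i)
    stuff_scores : stuff_scores[j][c] = unnormalized P(S_j = c) P(F_j | S_j = c)
    R            : R[i][j] in {0, 1}, observed relation for the pair (i, j)
    rel_prob     : rel_prob[t][c] = P(R_ij = 1 | T_i = t, S_j = c)
    """
    rng = random.Random(seed)
    n, m, C = len(det_prior), len(stuff_scores), len(stuff_scores[0])
    T = [int(rng.random() < p) for p in det_prior]
    S = [sample_index(scores, rng) for scores in stuff_scores]
    counts = [0] * n
    for _ in range(n_sweeps):
        for i in range(n):  # block 1: every T_i given all S_j
            w = []
            for t in (0, 1):
                p = bern(det_prior[i], t)
                for j in range(m):
                    p *= bern(rel_prob[t][S[j]], R[i][j])
                w.append(p)
            T[i] = sample_index(w, rng)
        for j in range(m):  # block 2: every S_j given all T_i
            w = []
            for c in range(C):
                p = stuff_scores[j][c]
                for i in range(n):
                    p *= bern(rel_prob[T[i]][c], R[i][j])
                w.append(p)
            S[j] = sample_index(w, rng)
        for i in range(n):
            counts[i] += T[i]
    return [cnt / n_sweeps for cnt in counts]


# Two candidates with an ambiguous detector score; one region that looks
# like cluster 0, which favors detections that have the relation.
posterior = block_gibbs(
    det_prior=[0.5, 0.5],
    stuff_scores=[[0.99, 0.01]],
    R=[[1], [0]],
    rel_prob=[[0.1, 0.5], [0.9, 0.5]],
    n_sweeps=2000,
)
```

In this toy setup the candidate that has the relation to the region ends up with a higher estimated posterior than the one that does not, even though the detector alone cannot tell them apart. For realistic problem sizes the products should be accumulated in log space to avoid underflow.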

44 Learned Satellite Clusters

45 Results - Satellite [panels: Prior: Detector Only; Posterior: Detections; Posterior: Region Labels]

46 Discovered Context - Bicycles [figure labels: Bicycles, Cluster #3]

47 TAS Results – Bicycles. Examples: discover “true positives”; remove “false positives”. [Figure labels: BIKE, ?, ?, ?]

48 Results – VOC 2005 [plot legend: TAS, Base Detector]

49 Understanding the man “a man and a dog on a sidewalk, in front of a building”

50 Outline: Overview; Integrating Vision Models (CCM: Cascaded Classification Models); Learning Spatial Context (TAS: Things and Stuff); Future Directions

51 Shape models for segmentation. We have a good deformable shape model (LOOPS) for outlining objects. We have good models for segmenting objects. Let’s combine them: add terms encouraging landmarks to lie on segmentation boundaries. Ben Packer is working on this… [Figure panels: Outline, Segmentation, Joint Outline, Joint Segmentation; diagram nodes: Landmark, Seg Mask]

52 Refined Segmentation. Our segmentation only knows about pixel “classes”. What about objects? Steve Gould is working on this… [Diagram nodes: Region Class, Region Appearance, Pixel/Region Assignment, Pixel Appearance]

53 Full TAS-like Integration [diagram nodes: R_ijk, T_i, S_j, Depths, Occlusion Edges, Surface Edges, Shape Models]
