Indoor Scene Segmentation using a Structured Light Sensor


1 Indoor Scene Segmentation using a Structured Light Sensor
Nathan Silberman and Rob Fergus, Courant Institute. ICCV 2011 Workshop on 3D Representation and Recognition.

2 Overview Indoor Scene Recognition using the Kinect
Introduce a new indoor scene depth dataset; describe a CRF-based model; explore the use of RGB and depth cues.

3 Motivation Indoor scene recognition is hard:
Far less texture than outdoor scenes; more geometric structure.

4 Motivation Indoor scene recognition is hard:
Far less texture than outdoor scenes; more geometric structure. The Kinect gives us a depth map (and RGB): direct access to shape and geometry information.

5 Overview Indoor Scene Recognition using the Kinect
Introduce a new indoor scene depth dataset; describe a CRF-based model; explore the use of RGB and depth cues.

6 Capturing our Dataset We wanted a dataset that could be used for scene understanding rather than just detection. The Kinect normally runs off an AC adapter; we replaced it with a battery. Data was recorded using open-source drivers, with a mouse used to trigger capture.

7 Statistics of the Dataset
Scene Type     Scenes    Frames   Labeled Frames*
Bathroom          6       5,588        76
Bedroom          17      22,764       480
Bookstore         3      27,173       784
Cafe              1       1,933        48
Kitchen          10      12,643       285
Living Room      13      19,262       355
Office           14      19,254       319
Total            64     108,617     2,347
* Labels obtained via LabelMe
The labeled frames are densely labeled. Most scenes were collected from friends' apartments in New York. Each image is 480x640, and each pixel is a 5-tuple: RGB (3), depth (1), label (1).

8 Dataset Examples: Living Room. RGB, raw depth, labels. Note the noise in the raw depth image.

9 Dataset Examples: Living Room. RGB, depth*, labels.
* Bilateral filtering used to clean up the raw depth image (fills in the holes).

10 Dataset Examples: Bathroom. RGB, depth, labels.
Holes in the depth image are filled by projecting the depth into the RGB frame and applying a bilateral filter.

11 Dataset Examples: Bedroom. RGB, depth, labels.

12 Existing Depth Datasets
RGB-D Object Dataset [1], Stanford Make3D [2]
Most depth datasets are very small (a few images), but there are a few larger ones. The RGB-D Object Dataset uses the Kinect but covers small objects only, similar to COIL. Make3D is not densely labeled and is entirely outdoors. B3DO uses the Kinect but is aimed at detection rather than scene understanding; it has 849 labeled frames, 75 scenes, and around 50 object classes.
[1] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. ICRA 2011.
[2] B. Liu, S. Gould, and D. Koller. Single Image Depth Estimation from Predicted Semantic Labels. CVPR 2010.

13 Existing Depth Datasets
Point Cloud Data [1] (52 scenes), B3DO [2]
[1] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Semantic Labeling of 3D Point Clouds for Indoor Scenes. NIPS 2011.
[2] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A Category-Level 3-D Object Dataset: Putting the Kinect to Work. ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011.

14 Dataset Freely Available
Images, labels, train/test splits, and code are all freely available, along with the raw data streams.

15 Overview Indoor Scene Recognition using the Kinect
Introduce a new indoor scene depth dataset; describe a CRF-based model; explore the use of RGB and depth cues.

16 Segmentation using CRF Model
Cost(labels) = Local Terms(label i) + Spatial Smoothness(label i, label j)
A standard CRF formulation: a cost function (energy) over a discrete label set (~12 classes), optimized via a standard graph-cuts solver. We will go through each term in detail, the alternative choices for each, and the evaluation last. The interesting part is how depth is incorporated via the local appearance and the spatial smoothness.
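As a rough sketch of this energy (the notation below is ours, not from the slides; the appearance and location terms are defined on the following slides):

```latex
E(\mathbf{y}) \;=\; \sum_{i}\Big[-\log \Phi_{\mathrm{app}}(y_i \mid \mathbf{x}_i) \;-\; \log \Phi_{\mathrm{loc}}(y_i \mid i)\Big]
\;+\; \lambda \sum_{(i,j)\in\mathcal{N}} w_{ij}\,\mathbb{1}[\,y_i \neq y_j\,]
```

Here y_i is the discrete label at pixel i, x_i its descriptor, N the set of neighbouring pixel pairs, and w_ij an optional edge-based modulation of the Potts penalty (see slide 39).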

17 Model Cost(labels) = Local Terms(label i) +
Spatial Smoothness(label i, label j). The local term combines Appearance(label i | descriptor i) and Location(i). The model is solved with a standard graph-cuts solver; the interesting part is how depth is incorporated via the local appearance and the spatial smoothness.

18 Model Cost(labels) = Local Terms(label i) +
Spatial Smoothness(label i, label j). The local term combines Appearance(label i | descriptor i) and Location(i). The model is solved with a standard graph-cuts solver; the interesting part is how depth is incorporated via the local appearance and the spatial smoothness.

19 Appearance Term Appearance(label i | descriptor i)
Several descriptor types to choose from: RGB-SIFT, Depth-SIFT, Depth-SPIN, RGBD-SIFT, RGB-SIFT/D-SPIN. We'll evaluate these alternatives later.

20 Descriptor Type: RGB-SIFT
Standard SIFT (128-D), extracted over a discrete grid from the Kinect RGB image.

21 Descriptor Type: Depth-SIFT
SIFT (128-D), extracted over a discrete grid from the Kinect depth image, linearly scaled so that pixel intensity is proportional to depth. Why Depth-SIFT is a good idea: it captures large-scale shape information, and its small-magnitude directional gradients are essentially surface normals.
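A minimal sketch of extracting dense SIFT from a linearly scaled depth image with OpenCV; the grid step and patch size are illustrative assumptions, not values from the talk:

```python
import cv2
import numpy as np

def dense_depth_sift(depth_m, step=10, patch=16):
    """Extract SIFT descriptors on a regular grid from a depth image.

    depth_m: HxW float array of depth values (e.g. metres).
    Returns the grid keypoints and an (N, 128) array of descriptors.
    """
    # Linearly rescale depth to 8-bit intensity (intensity proportional to depth).
    d = depth_m.astype(np.float32)
    img = cv2.normalize(d, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # Keypoints on a discrete grid; 'patch' sets the SIFT support size.
    h, w = img.shape
    kps = [cv2.KeyPoint(float(x), float(y), float(patch))
           for y in range(patch // 2, h - patch // 2, step)
           for x in range(patch // 2, w - patch // 2, step)]

    sift = cv2.SIFT_create()
    kps, desc = sift.compute(img, kps)
    return kps, desc
```

The same routine applied to the RGB image (converted to grayscale) gives RGB-SIFT; concatenating the two per grid point gives RGBD-SIFT.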

22 Descriptor Type: Depth-SPIN
Spin images (50-D), extracted over a discrete grid from the linearly scaled Kinect depth image; each descriptor is a 2-D histogram indexed by radius and depth around the grid point.
A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE PAMI, 21(5):433-449, 1999.

23 Descriptor Type: RGBD-SIFT
RGB-SIFT from the Kinect RGB image concatenated with Depth-SIFT from the linearly scaled depth image: 256-D.

24 Descriptor Type: RGB-SIFT, D-SPIN
RGB-SIFT from the Kinect RGB image (128-D) concatenated with Depth-SPIN from the linearly scaled depth image (50-D): 178-D.

25 Descriptor at each location
Appearance Model: Appearance(label i | descriptor i) is modeled by a neural network with a single hidden layer, applied to the descriptor at each location.

26 Descriptor at each location
Appearance Model: Appearance(label i | descriptor i). Architecture: 128/178/256-D input descriptor at each location, 1000-D hidden layer, softmax output layer over 13 classes.

27 Appearance Model Appearance(label i | descriptor i) Interpreted as
p(label | descriptor): a probability distribution over the 13 classes. (Same architecture: 128/178/256-D input descriptor at each location, 1000-D hidden layer.)

28 Appearance Model Appearance(label i | descriptor i)
The output is a probability distribution over the 13 classes; the network is trained with backpropagation. (128/178/256-D input descriptor at each location, 1000-D hidden layer.)
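A minimal sketch of such an appearance model in PyTorch; the hidden size (1000) and class count (13) follow the slides, while the activation, optimizer, and learning rate are assumptions:

```python
import torch
import torch.nn as nn

# Single-hidden-layer appearance model: descriptor -> p(label | descriptor).
class AppearanceNet(nn.Module):
    def __init__(self, descriptor_dim=128, hidden_dim=1000, num_classes=13):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(descriptor_dim, hidden_dim),
            nn.Tanh(),                        # hidden non-linearity (assumed)
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, descriptors):
        # Logits; the softmax is applied inside the loss below.
        return self.net(descriptors)

# One backpropagation step on a dummy batch of (descriptor, label) pairs.
model = AppearanceNet(descriptor_dim=128)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()               # softmax + negative log-likelihood

descriptors = torch.randn(64, 128)            # stand-in for dense SIFT descriptors
labels = torch.randint(0, 13, (64,))
optimizer.zero_grad()
loss = loss_fn(model(descriptors), labels)
loss.backward()
optimizer.step()
```

At test time, softmax over the logits gives the per-pixel class distribution used as the appearance term in the CRF.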

29 Model Cost(labels) = Local Terms(label i) +
Spatial Smoothness(label i, label j). The local term combines Appearance(label i | descriptor i) and Location(i). The model is solved with a standard graph-cuts solver; the interesting part is how depth is incorporated via the local appearance and the spatial smoothness.

30 Model Cost(labels) = Local Terms(label i) +
Spatial Smoothness(label i, label j). The local term combines Appearance(label i | descriptor i) and Location(i). The model is solved with a standard graph-cuts solver; the interesting part is how depth is incorporated via the local appearance and the spatial smoothness.

31 Model Cost(labels) = Local Terms(label i) +
Spatial Smoothness(label i, label j). The local term combines Appearance(label i | descriptor i) and Location(i); the location term can use either 2D or 3D priors. The model is solved with a standard graph-cuts solver; the interesting part is how depth is incorporated via the local appearance and the spatial smoothness.

32 Location Priors: 2D 2D Priors are histograms of P(class, location)
The histograms are smoothed to avoid image-specific artifacts.
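A minimal sketch of building such priors from the densely labeled training images; the grid resolution and smoothing width are assumptions, not values from the talk:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def location_priors_2d(label_maps, num_classes=13, grid=(48, 64), sigma=2.0):
    """Smoothed 2D location priors P(class, location) on a coarse image grid.

    label_maps: list of HxW integer label images (0..num_classes-1).
    Returns an array of shape (num_classes, *grid); each slice is a smoothed,
    normalized histogram of where that class occurs in the image.
    """
    counts = np.zeros((num_classes,) + grid, dtype=np.float64)
    for lab in label_maps:
        h, w = lab.shape
        # Map each pixel to a cell of the coarse prior grid.
        ys = np.arange(h) * grid[0] // h
        xs = np.arange(w) * grid[1] // w
        yy, xx = np.meshgrid(ys, xs, indexing="ij")
        for c in range(num_classes):
            np.add.at(counts[c], (yy, xx), (lab == c).astype(np.float64))
    # Smooth to avoid image-specific artifacts, then normalize per class.
    priors = np.stack([gaussian_filter(counts[c], sigma) for c in range(num_classes)])
    priors /= priors.sum(axis=(1, 2), keepdims=True) + 1e-12
    return priors
```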

33 Motivation: 3D Location Priors
2D priors don't capture 3D geometry. 3D priors can be built from the depth data, but rooms have different shapes and sizes, so how do we align them? The idea of a 3D prior is to capture the high degree of regularity in indoor scenes.

34 Motivation: 3D Location Priors
To align rooms, we use a normalized cylindrical coordinate system: each vertical scanline is divided by its maximum depth (the band of maximum depths along the scanlines), so the far surface is at unit distance.
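A minimal sketch of the per-scanline normalization; the function name and the handling of missing depth values are ours:

```python
import numpy as np

def relative_depth(depth_m):
    """Normalize depth per vertical scanline (image column) to roughly [0, 1].

    depth_m: HxW array of metric depths with missing values already filled.
    Each column is divided by the maximum depth observed along that column
    (the band of per-column maxima), so the far wall maps to ~1 regardless of
    room size.
    """
    col_max = depth_m.max(axis=0, keepdims=True)   # band of per-column maxima
    return depth_m / np.maximum(col_max, 1e-6)
```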

35 Relative Depth Distributions
(Plots of density versus relative depth, from 0 to 1, for Table, Television, Bed, and Wall.) The cylindrical normalization is what gives the relative-depth axis: walls pile up near 1, while tables and televisions sit at intermediate relative depths.

36 Location Priors: 3D Each column is a bin of relative depth
The prior is binned over 3 dimensions: (log) relative depth, height, and angle.
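A minimal sketch of accumulating such a prior from labeled pixels already expressed in the normalized cylindrical frame; the bin counts are illustrative assumptions:

```python
import numpy as np

def location_priors_3d(rel_depth, height, angle, labels, num_classes=13,
                       bins=(6, 10, 12)):
    """Per-class histograms over (relative depth, height, angle).

    rel_depth, height, angle, labels: flat arrays, one entry per labeled pixel,
    with rel_depth in [0, 1] from the cylindrical normalization.
    Returns (num_classes, *bins) normalized histograms.
    """
    edges = [np.linspace(0.0, 1.0, bins[0] + 1),
             np.linspace(height.min(), height.max(), bins[1] + 1),
             np.linspace(angle.min(), angle.max(), bins[2] + 1)]
    priors = np.zeros((num_classes,) + bins)
    for c in range(num_classes):
        sel = labels == c
        hist, _ = np.histogramdd(
            np.stack([rel_depth[sel], height[sel], angle[sel]], axis=1),
            bins=edges)
        priors[c] = hist / max(hist.sum(), 1.0)   # normalize per class
    return priors
```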

37 Model Cost(labels) = Local Terms(label i) +
Spatial Smoothness(label i, label j). The local term combines Appearance(label i | descriptor i) and Location(i); the location term can use either 2D or 3D priors. The model is solved with a standard graph-cuts solver; the interesting part is how depth is incorporated via the local appearance and the spatial smoothness.

38 Model Cost(labels) = Local Terms(label i) +
Spatial Smoothness(label i, label j): a penalty for adjacent labels disagreeing (standard Potts model). The model is solved with a standard graph-cuts solver.
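As a minimal sketch, in the same (assumed) notation as the energy on slide 16, the pairwise term is

```latex
\text{Smoothness}(y_i, y_j) \;=\; \lambda \, w_{ij}\,\mathbb{1}[\,y_i \neq y_j\,]
```

with w_ij = 1 for the plain Potts model; the next slide considers modulating w_ij using RGB edges, depth edges, or superpixel boundaries.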

39 Spatial Modulation of Smoothness
The pairwise Potts penalty can be modulated spatially by: none, RGB edges, depth edges, RGB + depth edges, superpixel edges, superpixel + RGB edges, or superpixel + depth edges. In practice these variants don't make a huge difference.

40 Experimental Setup 60% train (~1,408 images), 40% test (~939 images)
10-fold cross-validation; images of the same scene never appear on opposite sides of a split. The performance criterion is pixel-level classification accuracy (mean of the confusion-matrix diagonal) over the 12 most common classes plus 1 background class formed from the rest.
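A minimal sketch of this criterion (the helper name and signature are ours, not from the talk):

```python
import numpy as np

def mean_diagonal_accuracy(pred, gt, num_classes=13):
    """Pixel-level criterion: mean of the row-normalized confusion-matrix diagonal.

    pred, gt: flat integer arrays of predicted and ground-truth pixel labels.
    """
    conf = np.zeros((num_classes, num_classes), dtype=np.float64)
    np.add.at(conf, (gt, pred), 1.0)               # confusion counts
    row_sums = conf.sum(axis=1, keepdims=True)
    conf_norm = conf / np.maximum(row_sums, 1.0)   # each row sums to 1
    return conf_norm.diagonal().mean()             # average per-class accuracy
```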

41 Evaluating Descriptors
(Bar chart of accuracy in percent for the 2D and 3D descriptors.) Using RGBD gives roughly a 5% gain over RGB alone and 7% over depth alone.

42 Evaluating Location Priors
(Bar chart of accuracy in percent, grouped by 2D and 3D descriptors, showing the boost from adding the location priors.)

43 Qualitative results: the 2nd column is the standard RGB-only model, the 3rd column leverages the depth cues.

44 Failure cases: note the failure in the top row. The system is still somewhat unreliable and still does not understand the 3D structure of objects. The absolute numbers may be low, but there are many, many objects in an indoor scene.

45 Conclusion The Kinect depth signal helps scene parsing
Performance is still a long way from great. We have only shown standard approaches on RGB-D data; there is lots of potential for more sophisticated methods. No complicated geometric reasoning was used.

46 Preprocessing the Data
We use open-source calibration software [1] to infer: the parameters of the RGB and depth cameras, and the homography between the cameras (depth and RGB are not aligned). Missing depth pixels are due to: shadows caused by the displacement between the infrared emitter and camera, dark/specular surfaces, and random noise. Raw depth values lie between 0 and 65,535 and must be inverted to obtain metric depth.
[1] N. Burrus. Kinect RGB Demo, Feb. 2011.
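A minimal sketch of the depth inversion, assuming a reciprocal relationship between the raw value and metric depth; the constants a and b are hypothetical placeholders standing in for the per-device values recovered by the calibration software:

```python
import numpy as np

def raw_depth_to_meters(raw, a, b):
    """Invert raw Kinect depth values into metric depth (a sketch).

    raw: HxW array of raw sensor values (0..65535). The raw value is inversely
    related to metric depth, so depth = 1 / (a * raw + b); a and b are
    per-device constants from calibration, and this functional form is an
    assumption for illustration.
    """
    depth = 1.0 / (a * raw.astype(np.float64) + b)
    depth[raw <= 0] = 0.0      # zero raw values mark missing measurements
    return depth
```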

47 Preprocessing the data
A bilateral filter is used to diffuse depth across regions of similar RGB intensity, filling in the missing values; a naïve GPU implementation runs in ~100 ms.
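A minimal CPU sketch of this idea, filling holes with a cross-bilateral weighting guided by the RGB image; the window size and sigmas are illustrative assumptions, and the talk's GPU implementation is not reproduced here:

```python
import numpy as np

def cross_bilateral_fill(depth, rgb, radius=5, sigma_s=3.0, sigma_r=10.0):
    """Fill missing depth (zeros) using nearby valid depths, weighted by
    spatial distance and RGB similarity.

    depth: HxW float array, 0 where missing.  rgb: HxWx3 uint8 image.
    """
    h, w = depth.shape
    gray = rgb.astype(np.float64).mean(axis=2)        # guidance intensity
    out = depth.astype(np.float64).copy()
    ys, xs = np.where(depth == 0)                     # holes to fill
    for y, x in zip(ys, xs):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        d_win = depth[y0:y1, x0:x1]
        g_win = gray[y0:y1, x0:x1]
        valid = d_win > 0
        if not valid.any():
            continue
        # Spatial weight (distance to centre) and range weight (RGB similarity).
        yy, xx = np.mgrid[y0:y1, x0:x1]
        w_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
        w_r = np.exp(-((g_win - gray[y, x]) ** 2) / (2 * sigma_r ** 2))
        wgt = (w_s * w_r)[valid]
        out[y, x] = (wgt * d_win[valid]).sum() / (wgt.sum() + 1e-12)
    return out
```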

48 Motivation Results from spatial pyramid-based classification [1] using 5 indoor scene types. Contrast this with the 81% that [1] achieves on a 13-class (mostly outdoor) scene dataset; they note similar confusion within indoor scenes. [1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.

