
1 Indoor Scene Segmentation using a Structured Light Sensor Nathan Silberman and Rob Fergus ICCV 2011 Workshop on 3D Representation and Recognition Courant Institute

2 Overview: Indoor scene recognition using the Kinect. Introduce a new indoor scene depth dataset. Describe a CRF-based model and explore the use of RGB/depth cues.

3 Motivation: Indoor scene recognition is hard – far less texture than outdoor scenes, more geometric structure.

4 Motivation: Indoor scene recognition is hard – far less texture than outdoor scenes, more geometric structure. The Kinect gives us a depth map (and RGB) – direct access to shape and geometry information.

5 Overview: Indoor scene recognition using the Kinect. Introduce a new indoor scene depth dataset. Describe a CRF-based model and explore the use of RGB/depth cues.

6 Capturing our Dataset

7 Statistics of the Dataset

Scene Type   | Number of Scenes | Frames  | Labeled Frames*
Bathroom     | 6                | 5,588   | 76
Bedroom      | 17               | 22,…    | …
Bookstore    | 3                | 27,…    | …
Cafe         | 1                | 1,933   | 48
Kitchen      | 10               | 12,…    | …
Living Room  | 13               | 19,…    | …
Office       | 14               | 19,…    | …
Total        | 64               | 108,617 | 2,347

* Labels obtained via LabelMe

8 Dataset Examples – Living Room: RGB / Raw Depth / Labels

9 Dataset Examples – Living Room: RGB / Depth* / Labels (* bilateral filtering used to clean up the raw depth image)

10 Dataset Examples – Bathroom: RGB / Depth / Labels

11 Dataset Examples – Bedroom: RGB / Depth / Labels

12 Existing Depth Datasets: RGB-D Dataset [1], Stanford Make3D [2]. [1] K. Lai, L. Bo, X. Ren, and D. Fox. A Large-Scale Hierarchical Multi-View RGB-D Object Dataset. ICRA 2011. [2] B. Liu, S. Gould, and D. Koller. Single Image Depth Estimation from Predicted Semantic Labels. CVPR 2010.

13 Existing Depth Datasets: Point Cloud Data [1], B3DO [2]. [1] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena. Semantic Labeling of 3D Point Clouds for Indoor Scenes. NIPS 2011. [2] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A Category-Level 3-D Object Dataset: Putting the Kinect to Work. ICCV Workshop on Consumer Depth Cameras for Computer Vision.

14 Dataset Freely Available

15 Overview: Indoor scene recognition using the Kinect. Introduce a new indoor scene depth dataset. Describe a CRF-based model and explore the use of RGB/depth cues.

16 Segmentation using a CRF Model. Cost(labels) = Σ_i LocalTerm(label_i) + Σ_(i,j) SpatialSmoothness(label_i, label_j). Standard CRF formulation, optimized via graph cuts over a discrete label set (~12 classes).
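As a concrete illustration of this formulation, here is a minimal Python sketch (not the authors' code) of the cost the CRF assigns to a candidate labeling. The function name `crf_energy` and the array layouts are assumptions; in practice the cost would be minimized with a graph-cut solver rather than evaluated directly.

```python
import numpy as np

def crf_energy(labels, unary, edges, pairwise_weight):
    """Cost of one candidate labeling under the CRF sketched above.

    labels          : (N,) class index per node (pixel or superpixel)
    unary           : (N, C) local costs, e.g. -log Appearance - log Location
    edges           : (E, 2) index pairs of neighboring nodes
    pairwise_weight : (E,) smoothness strength per edge
    """
    local = unary[np.arange(len(labels)), labels].sum()
    li, lj = labels[edges[:, 0]], labels[edges[:, 1]]
    # Potts model: a constant penalty is paid only where neighbors disagree.
    smooth = (pairwise_weight * (li != lj)).sum()
    return local + smooth
```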

17 Model. Cost(labels) = Σ_i LocalTerm(label_i) + Σ_(i,j) SpatialSmoothness(label_i, label_j), where LocalTerm(label_i) = Appearance(label_i | descriptor_i) × Location(i).

18 Model. The local term factors into the appearance term Appearance(label_i | descriptor_i) and the location prior Location(i).

19 Appearance Term: Appearance(label_i | descriptor_i). Several descriptor types to choose from: RGB-SIFT, Depth-SIFT, Depth-SPIN, RGBD-SIFT, RGB-SIFT/D-SPIN.

20 Descriptor Type: RGB-SIFT. 128-D SIFT extracted over a discrete grid on the RGB image from the Kinect.
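A minimal sketch of dense SIFT on a fixed grid using OpenCV; the grid step and patch size are assumptions, not values from the talk, and `dense_sift` is a hypothetical helper.

```python
import cv2

def dense_sift(gray, step=10, size=16):
    """128-D SIFT descriptors at fixed grid locations (no keypoint detection).

    gray : 8-bit single-channel image, e.g. the intensity of the Kinect RGB frame.
    """
    sift = cv2.SIFT_create()
    grid = [cv2.KeyPoint(float(x), float(y), float(size))
            for y in range(step, gray.shape[0] - step, step)
            for x in range(step, gray.shape[1] - step, step)]
    grid, descriptors = sift.compute(gray, grid)
    return descriptors            # shape: (num_grid_points, 128)
```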

21 Descriptor Type: Depth-SIFT. 128-D SIFT extracted over a discrete grid on the Kinect depth image after linear scaling.
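Assuming the raw depth is linearly rescaled to an 8-bit range, the same hypothetical `dense_sift` helper from the previous slide can be applied to the depth image; the normalization call below is one plausible way to do that scaling (`depth` is an assumed in-memory depth array).

```python
import cv2
import numpy as np

# Linear scaling of the Kinect depth map to 8 bits, then dense SIFT as before.
depth_8u = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
depth_descriptors = dense_sift(depth_8u)   # (num_grid_points, 128)
```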

22 Descriptor Type: Depth-SPIN. 50-D spin-image descriptor (radius × depth bins) extracted over a discrete grid on the Kinect depth image after linear scaling. A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE PAMI, 21(5):433–449, 1999.
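The spin image of Johnson & Hebert bins neighboring 3-D points by their radial distance from the spin axis and their signed height along it. The sketch below is a simplified illustration (bin counts chosen so 10 × 5 = 50 dimensions matches the slide); the function name, support size, and the assumption that neighboring points and a surface normal are already available are all hypothetical.

```python
import numpy as np

def depth_spin(points, center, normal, n_radius=10, n_depth=5, support=0.5):
    """Simplified spin-image sketch: 2-D histogram over (radius, depth), 50-D by default.

    points : (N, 3) 3-D points in the support region around `center`
    center : (3,) the oriented point the image is spun around
    normal : (3,) unit surface normal at `center` (the spin axis)
    """
    d = points - center                        # neighbors relative to the oriented point
    beta = d @ normal                          # signed distance along the spin axis ("depth")
    alpha = np.sqrt(np.maximum((d * d).sum(1) - beta ** 2, 0.0))   # distance from the axis ("radius")
    hist, _, _ = np.histogram2d(alpha, beta,
                                bins=(n_radius, n_depth),
                                range=((0.0, support), (-support, support)))
    return hist.ravel()
```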

23 Descriptor Type: RGBD-SIFT. Concatenate SIFT from the Kinect RGB image with SIFT from the linearly scaled depth image to form a 256-D descriptor.

24 Descriptor Type: RGB-SIFT/D-SPIN. Concatenate SIFT from the Kinect RGB image with the spin-image descriptor from the linearly scaled depth image to form a 178-D descriptor.
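Both combined descriptors are simply the per-location features stacked together; a sketch, assuming the hypothetical arrays from the previous slides are aligned row-for-row over the same grid:

```python
import numpy as np

# RGBD-SIFT: 128-D RGB-SIFT + 128-D Depth-SIFT = 256-D per grid point
rgbd_sift = np.concatenate([rgb_descriptors, depth_descriptors], axis=1)

# RGB-SIFT / D-SPIN: 128-D RGB-SIFT + 50-D depth spin image = 178-D per grid point
rgb_sift_dspin = np.concatenate([rgb_descriptors, spin_descriptors], axis=1)
```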

25 Appearance Model: Appearance(label_i | descriptor_i) is modeled by a neural network with a single hidden layer, applied to the descriptor at each location.

26 Appearance Model: the network takes the 128/178/256-D descriptor as input, has a 1000-D hidden layer, and a softmax output layer over 13 classes.

27 Appearance Model: the network maps the descriptor at each location to a probability distribution over the 13 classes; the softmax output is interpreted as p(label | descriptor) and used as Appearance(label_i | descriptor_i).

28 Appearance Model: the network (128/178/256-D input, 1000-D hidden layer, 13-class output) is trained with backpropagation.
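A minimal PyTorch sketch of such a network, assuming the slide's 1000-unit hidden layer and 13-class softmax; the hidden non-linearity, optimizer, and learning rate are not specified in the talk and are assumptions here, as are the `descriptor_batch` and `label_batch` tensors.

```python
import torch
import torch.nn as nn

class AppearanceNet(nn.Module):
    """Single-hidden-layer network: descriptor -> class scores."""
    def __init__(self, in_dim=256, hidden=1000, n_classes=13):
        super().__init__()
        self.hidden = nn.Linear(in_dim, hidden)
        self.act = nn.Sigmoid()          # assumed non-linearity
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):
        return self.out(self.act(self.hidden(x)))   # logits; softmax is folded into the loss

model = AppearanceNet(in_dim=256)                    # 128/178/256 depending on the descriptor
criterion = nn.CrossEntropyLoss()                    # softmax + negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One backpropagation step on a batch of (descriptor, label) pairs.
# descriptor_batch: (B, 256) float tensor, label_batch: (B,) long tensor (assumed to exist).
logits = model(descriptor_batch)
loss = criterion(logits, label_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# At test time, softmax of the logits is read off as p(label | descriptor).
probs = torch.softmax(model(descriptor_batch), dim=1)
```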

29 Model. Cost(labels) = Σ_i LocalTerm(label_i) + Σ_(i,j) SpatialSmoothness(label_i, label_j), where LocalTerm(label_i) = Appearance(label_i | descriptor_i) × Location(i).

30 Model. The local term again factors into Appearance(label_i | descriptor_i) and Location(i); next we describe the location prior.

31 Model. Cost(labels) = Σ_i LocalTerm(label_i) + Σ_(i,j) SpatialSmoothness(label_i, label_j), with LocalTerm(label_i) = Appearance(label_i | descriptor_i) × Location(i). The location prior Location(i) can use 2D priors or 3D priors.

32 Location Priors: 2D. 2D priors are histograms of P(class, location), smoothed to avoid image-specific artifacts.
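A sketch of how such priors could be built from the labeled training frames; the smoothing bandwidth and the function name are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def location_priors_2d(label_maps, n_classes, sigma=10.0):
    """2-D location priors: how often each class occurs at each pixel location.

    label_maps : (num_frames, H, W) integer ground-truth label images
    Returns    : (n_classes, H, W) smoothed, normalized P(class | location)
    """
    counts = np.stack([(label_maps == c).sum(axis=0) for c in range(n_classes)]).astype(float)
    # Smooth spatially (not across classes) to avoid image-specific artifacts.
    counts = gaussian_filter(counts, sigma=(0, sigma, sigma))
    return counts / np.maximum(counts.sum(axis=0, keepdims=True), 1e-8)
```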

33 Motivation: 3D Location Priors. 2D priors don't capture 3D geometry. 3D priors can be built from depth data. But rooms are of different shapes and sizes – how do we align them?

34 Motivation: 3D Location Priors. To align rooms, we use a normalized cylindrical coordinate system based on the band of maximum depths along each vertical scanline.
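One way to read this slide is that each pixel's depth is expressed relative to the farthest depth in its vertical scanline, which makes rooms of different sizes roughly comparable. The sketch below is that interpretation only; the function name and epsilon are assumptions.

```python
import numpy as np

def relative_depth(depth):
    """Relative depth: each pixel divided by the maximum depth in its column.

    Values near 1 lie toward the far wall; smaller values are closer to the camera,
    independent of the absolute size of the room.
    """
    column_max = depth.max(axis=0, keepdims=True)   # band of maximum depths per vertical scanline
    return depth / np.maximum(column_max, 1e-6)
```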

35 Relative Depth Distributions. [Figure: density of relative depth for the Table, Television, Bed, and Wall classes; x-axis: relative depth, y-axis: density.]

36 Location Priors: 3D

37 Model. Cost(labels) = Σ_i LocalTerm(label_i) + Σ_(i,j) SpatialSmoothness(label_i, label_j), with LocalTerm(label_i) = Appearance(label_i | descriptor_i) × Location(i), where Location(i) uses 2D or 3D priors.

38 Model. Cost(labels) = Σ_i LocalTerm(label_i) + Σ_(i,j) SpatialSmoothness(label_i, label_j). The smoothness term is a penalty for adjacent labels disagreeing (standard Potts model).

39 Model. Cost(labels) = Σ_i LocalTerm(label_i) + Σ_(i,j) SpatialSmoothness(label_i, label_j). Spatial modulation of the smoothness term: none, RGB edges, depth edges, RGB + depth edges, superpixel edges, superpixel + RGB edges, or superpixel + depth edges (a sketch of this modulation follows below).
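A sketch of edge-modulated smoothness: the Potts penalty between two neighbors is reduced where the chosen edge signal (RGB gradient, depth gradient, superpixel boundary, or a combination) is strong. The weighting function and its parameters are assumptions, not the talk's exact choice.

```python
import numpy as np

def pairwise_weights(edge_strength, edges, base=1.0, beta=5.0):
    """Per-edge Potts weights modulated by an edge map.

    edge_strength : (H, W) edge signal in [0, 1], e.g. RGB or depth gradient magnitude,
                    superpixel boundaries, or a combination of them
    edges         : list of ((r1, c1), (r2, c2)) neighboring pixel pairs
    Returns       : (len(edges),) weights; strong edges make label changes cheap
    """
    strength = np.array([max(edge_strength[p], edge_strength[q]) for p, q in edges])
    return base * np.exp(-beta * strength)
```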

40 Experimental Setup. 60% train (~1,408 images), 40% test (~939 images), 10-fold cross-validation. Images of the same scene never appear in both train and test. The performance criterion is pixel-level classification accuracy (mean diagonal of the confusion matrix) over the 12 most common classes plus 1 background class formed from the rest.
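A sketch of a scene-grouped split, so that frames of the same scene never straddle train and test; note that taking 60% of the scenes only approximately yields 60% of the frames, and the helper name and seeding are assumptions.

```python
import random

def scene_split(frames_by_scene, train_frac=0.6, seed=0):
    """Train/test split at the scene level: all frames of a scene stay together.

    frames_by_scene : dict mapping scene id -> list of frame ids
    """
    scenes = sorted(frames_by_scene)
    random.Random(seed).shuffle(scenes)
    n_train = int(round(train_frac * len(scenes)))
    train = [f for s in scenes[:n_train] for f in frames_by_scene[s]]
    test = [f for s in scenes[n_train:] for f in frames_by_scene[s]]
    return train, test
```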

41 Evaluating Descriptors. [Chart: pixel-level accuracy in percent for 2D descriptors vs. 3D descriptors.]

42 Evaluating Location Priors. [Chart: pixel-level accuracy in percent for 2D and 3D descriptors with location priors.]

43

44

45 Conclusion. The Kinect depth signal helps scene parsing, but we are still a long way from great performance. We have shown standard approaches on RGB-D data, with no complicated geometric reasoning; there is lots of potential for more sophisticated methods.

46 Preprocessing the Data. We use open-source calibration software [1] to infer the parameters of the RGB and depth cameras and the homography between the cameras. [1] N. Burrus. Kinect RGB Demo.

47 Preprocessing the Data. A bilateral filter is used to diffuse depth across regions of similar RGB intensity; a naïve GPU implementation runs in ~100 ms.
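The description matches a cross (joint) bilateral filter, where the RGB image guides the smoothing of the depth map. A minimal CPU sketch using OpenCV's contrib module is below; the parameter values are assumptions, and it does not reproduce the ~100 ms GPU implementation mentioned on the slide.

```python
import cv2
import numpy as np

def smooth_depth(rgb, raw_depth, sigma_color=25.0, sigma_space=10.0):
    """Cross-bilateral smoothing of the depth map guided by the RGB image.

    Depth diffuses within regions of similar colour but not across colour edges.
    Requires opencv-contrib-python for cv2.ximgproc.
    """
    guide = rgb.astype(np.float32)          # guide and source must share the same element type
    depth = raw_depth.astype(np.float32)
    return cv2.ximgproc.jointBilateralFilter(guide, depth, -1, sigma_color, sigma_space)
```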

48 Motivation. Results from spatial-pyramid-based classification [1] using 5 indoor scene types. Contrast this with the 81% achieved by [1] on a 13-class (mostly outdoor) scene dataset; they note similar confusion within indoor scenes. [1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.

