Humanising GrabCut: Learning to segment humans using the Kinect
Varun Gulshan, Victor Lempitsky and Andrew Zisserman
Dept. of Engineering Science, University of Oxford, UK
1. Introduction

Idea: learn to segment humans in RGB images, using a large dataset of RGB-D images during training.

Segmentation pipeline: original image -> bounding-box detection -> top-down segmentation -> bottom-up refinement.

Dataset acquisition: RGB image -> Kinect scene labels -> cleaned-up ground truth. In order to train the top-down segmentation, a large training corpus that captures variation in human poses, clothing and backgrounds is acquired using the Kinect.

2. Top-Down Learning

(Figure: local HOG descriptor, local image patch, local ground-truth mask.)

The sparse coding toolbox of [2] is used to learn a dictionary with 2500 elements. A linear classifier is trained to predict the segmentation mask at each location, given the sparsely coded local HOG descriptor; Liblinear is used for training.

Training statistics:
- Dimensionality of h_i = 325; dimensionality of x_i = 2500.
- The local mask y_i is scaled to 40x40 pixels, so 1600 independent per-pixel SVM classifiers are trained.
- Each w_l consists of 2500 parameters, for a total of 1600 x 2500 = 4 million learned parameters.
- A total of 180,000 (h_i, y_i) pairs are extracted from the training set (training images are also flipped left-right to generate more data).
- All 1600 SVMs are trained in approximately 1 hour in total using Liblinear.

4. Evaluation

Results using ground-truth bounding boxes (overlap score with ground truth):

Method                   Train (%)    Test (%)
Box+GC                   76.5 ± 0.4   72.5 ± 0.6
Box+LocalGC              78.0 ± 0.4   74.4 ± 0.6
LinSVM                   73.9 ± 0.2   76.1 ± 0.3
SpSVM                    86.1 ± 0.1   80.6 ± 0.2
SpSVM+Pos                89.8 ± 0.1   82.6 ± 0.2
SpSVM+Pos+GC             87.3 ± 0.2   86.5 ± 0.3
SpSVM+Pos+LocalGC        91.8 ± 0.1   88.5 ± 0.2

Results using the bounding-box detectors of [3]:

Method                   Train (%)    Test (%)
SpSVM+Pos+LocalGC        80.6 ± 0.5   78.6 ± 0.6
Ladicky [1]              70.4 ± 0.3   42.3 ± 0.5
Ladicky [1] + Detection  71.4 ± 0.5   66.2 ± 0.5
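As a toy illustration of the per-pixel prediction step, the sketch below applies one independent linear classifier per mask pixel to a sparse-coded local descriptor. All sizes and the random weights are illustrative stand-ins, not the trained 2500-dimensional, 40x40-mask model described on the poster.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the poster's sizes (2500-dim sparse codes x_i,
# 40x40 = 1600 per-pixel classifiers); everything here is illustrative.
n_codes, mask_h, mask_w = 50, 4, 4
n_pixels = mask_h * mask_w

x = rng.standard_normal(n_codes)              # sparse code of one local HOG descriptor
W = rng.standard_normal((n_pixels, n_codes))  # one linear weight vector w_l per mask pixel
b = np.zeros(n_pixels)

# Each mask pixel is predicted independently: label_l = [w_l . x + b_l > 0].
scores = W @ x + b
mask = (scores > 0).astype(np.uint8).reshape(mask_h, mask_w)
```

Because the classifiers are independent, training and prediction parallelise trivially over the 1600 pixel positions.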
OpenNI (http://www.openni.org) libraries are used for segmenting humans from depth images and for registration with the RGB camera. The dataset is limited to indoor locations, as the Kinect works indoors only.

(Figure: sample images and segmentations from our dataset.)

3386 humans are segmented in total (1930 for training and 1456 for testing): 10 human subjects across 4 indoor locations for training, and 6 human subjects across 4 indoor locations for testing. Human subjects and locations are disjoint between the train and test sets.

Training: separate spatial SVMs are learnt for each vertical level (levels 1-4) of the bounding box, to make the training task easier.

Testing: at test time, predicted segmentations in overlapping local regions are combined using majority voting.

3. Bottom-up refinement: Local GrabCut

(Figure: top-down segmentation; local color model windows; unary terms; result after graph-cut segmentation.)

Similar to SnapCut [4], local windows of size 61x61 pixels are slid over the image, and windows that intersect the segmentation boundary are used to estimate color models. Unary terms from overlapping local windows are averaged. The energy function combines a unary term obtained from the local color models with a unary term penalising deviations from the top-down segmentation. Graph-cut segmentation and color-model re-estimation are repeated 4 times.

(Figure: qualitative results for Box+GC, SpSVM+Pos and SpSVM+Pos+LocalGC.)

References:
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In Proc. ICCV, 2009.
[2] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proc. ICML, 2009.
[3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE PAMI, 2010.
[4] X. Bai, J. Wang, D. Simons, and G. Sapiro. Video SnapCut: robust video object cutout using localized classifiers. In Proc. ACM SIGGRAPH, 2009.
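The test-time fusion of overlapping local predictions by majority voting can be sketched as follows; `majority_vote`, its signature, and the tie-breaking rule are illustrative choices, not the authors' code.

```python
import numpy as np

def majority_vote(local_masks, offsets, out_shape):
    """Fuse binary masks predicted in overlapping local regions by per-pixel
    majority voting (ties count as foreground here). Name and signature are
    illustrative, not the authors' code."""
    votes = np.zeros(out_shape, dtype=np.int32)
    counts = np.zeros(out_shape, dtype=np.int32)
    for mask, (r, c) in zip(local_masks, offsets):
        h, w = mask.shape
        votes[r:r + h, c:c + w] += mask
        counts[r:r + h, c:c + w] += 1
    # Pixels covered by no local region default to background.
    return np.where(counts > 0, 2 * votes >= counts, False).astype(np.uint8)

# Two overlapping 2x2 predictions on a 2x3 image: the overlap column gets
# one foreground and one background vote, and the tie resolves to foreground.
m_fg = np.ones((2, 2), dtype=np.uint8)
m_bg = np.zeros((2, 2), dtype=np.uint8)
fused = majority_vote([m_fg, m_bg], [(0, 0), (0, 1)], (2, 3))
```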
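One plausible reading of the per-window color models is sketched below: foreground/background histograms over quantized RGB colors, estimated inside a single window from the current segmentation, yielding a per-pixel unary term. The function name, the histogram binning and the Laplace smoothing are all assumptions; the poster only states that SnapCut-style local color models are used.

```python
import numpy as np

def color_unary(window_rgb, window_mask, bins=8):
    """Per-pixel unary term for one local window: quantized-RGB histogram
    color models of foreground/background built from the current
    segmentation. Binning and smoothing are illustrative assumptions."""
    q = window_rgb.astype(int) // (256 // bins)          # quantize each channel
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    n = bins ** 3
    fg = np.bincount(idx[window_mask == 1], minlength=n) + 1.0  # Laplace smoothing
    bg = np.bincount(idx[window_mask == 0], minlength=n) + 1.0
    fg /= fg.sum()
    bg /= bg.sum()
    return np.log(bg[idx]) - np.log(fg[idx])  # low value => foreground-like

# A window whose left half is red (labelled foreground) and right half blue.
win = np.zeros((4, 4, 3), dtype=np.uint8)
win[:, :2] = (255, 0, 0)
win[:, 2:] = (0, 0, 255)
seg = np.zeros((4, 4), dtype=np.uint8)
seg[:, :2] = 1
unary = color_unary(win, seg)
```

Averaging such unaries over all overlapping windows, as the text describes, smooths out estimates from individual windows that straddle ambiguous colors.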
Dataset available at: http://www.robots.ox.ac.uk/~vgg/data/humanSeg/

The bounding box around the object is divided into dense overlapping local regions, and independent per-pixel classifiers are trained to predict the label of every pixel within each local region. Local regions are described using HOG descriptors h_i, which are non-linearly mapped via sparse coding onto a learned dictionary to give the feature vectors x_i; each x_i is sparse, with only a few non-zero entries (e.g. x_i = (0, 0, ..., 0, 0.3, 0.9, 0, ...)).
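The sparse-coding step (HOG descriptor h_i mapped to sparse code x_i) can be sketched with a minimal ISTA solver for the lasso problem; this is an illustrative stand-in for the online dictionary-learning toolbox cited on the poster, with the dictionary D assumed already learned.

```python
import numpy as np

def sparse_code(h, D, lam=0.1, n_iter=200):
    """Sparse-code a descriptor h onto dictionary D by solving the lasso
    problem min_x 0.5*||h - D x||^2 + lam*||x||_1 with ISTA. A minimal
    illustrative stand-in for the toolbox cited on the poster."""
    L = np.linalg.norm(D, 2) ** 2              # step size from spectral norm of D
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = x - (D.T @ (D @ x - h)) / L        # gradient step on the data term
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return x
```

With an identity dictionary this reduces to plain soft-thresholding: `sparse_code(np.array([1.0, 0.05, 0.0]), np.eye(3))` returns approximately (0.9, 0, 0), zeroing the entry below the threshold.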