Fully Convolutional Networks for Semantic Segmentation


Fully Convolutional Networks for Semantic Segmentation
Jonathan Long* Evan Shelhamer* Trevor Darrell, UC Berkeley
Goal of the work: use an FCN to predict a class at every pixel
Transfer existing classification models to dense prediction tasks
Presented by: Gordon Christie
Slide credit: Jonathan Long

Overview
Reinterpret standard classification convnets as “fully convolutional” networks (FCNs) for semantic segmentation
Use AlexNet, VGG, and GoogLeNet in experiments
Novel architecture: combine information from different layers for segmentation
State-of-the-art segmentation for PASCAL VOC 2011/2012, NYUDv2, and SIFT Flow at the time
Inference takes less than one fifth of a second for a typical image
Note that reusing existing networks is transfer learning
Slide credit: Jonathan Long

pixels in, pixels out
semantic segmentation
monocular depth estimation (Liu et al. 2015)
boundary prediction (Xie & Tu 2015)
Slide credit: Jonathan Long

convnets perform classification < 1 millisecond “tabby cat” 1000-dim vector end-to-end learning Slide credit: Jonathan Long

R-CNN does detection R-CNN many seconds “dog” “cat” Slide credit: Jonathan Long

R-CNN figure: Girshick et al. Slide credit: Jonathan Long

< 1/5 second ??? end-to-end learning Slide credit: Jonathan Long

a classification network “tabby cat” note omissions “activations” fixed size input, single label output desire: efficient per-pixel output Slide credit: Jonathan Long

becoming fully convolutional Slide credit: Jonathan Long

upsampling output Slide credit: Jonathan Long

end-to-end, pixels-to-pixels network upsampling pixelwise output + loss conv, pool, nonlinearity Slide credit: Jonathan Long

Dense Predictions
Shift-and-stitch: a trick that yields dense predictions without interpolation
Changing only the filters and layer strides of a convnet can produce the same output as the shift-and-stitch trick
Shift-and-stitch was used in preliminary experiments but not included in the final model
Upsampling via deconvolution was found to be more effective and efficient
“Final layer deconvolutional filters are fixed to bilinear interpolation, while intermediate upsampling layers are initialized to bilinear upsampling”
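
The bilinear deconvolution filters mentioned in the quote can be sketched in plain Python. This is a minimal illustration, not the authors' Caffe code: the kernel formula is the one commonly used to initialize deconvolution layers to bilinear interpolation, and the function names (`bilinear_kernel_1d`, `bilinear_kernel_2d`, `deconv_upsample`) are hypothetical. No cropping is applied, so the output is slightly larger than `factor` times the input.

```python
def bilinear_kernel_1d(factor):
    """1-D bilinear interpolation weights for upsampling by an integer factor."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    return [1 - abs(i - center) / factor for i in range(size)]


def bilinear_kernel_2d(factor):
    """2-D bilinear kernel as the outer product of the 1-D weights."""
    k = bilinear_kernel_1d(factor)
    return [[a * b for b in k] for a in k]


def deconv_upsample(image, factor):
    """Transposed convolution with stride `factor` and a fixed bilinear kernel.

    Each input value is scattered into the output, weighted by the kernel.
    Output side length is (n - 1) * factor + kernel_size (no cropping).
    """
    kernel = bilinear_kernel_2d(factor)
    size = len(kernel)
    h, w = len(image), len(image[0])
    out = [[0.0] * ((w - 1) * factor + size)
           for _ in range((h - 1) * factor + size)]
    for y in range(h):
        for x in range(w):
            for i in range(size):
                for j in range(size):
                    out[y * factor + i][x * factor + j] += image[y][x] * kernel[i][j]
    return out


# A constant 3 x 3 score map stays constant in the interior after 2x upsampling.
coarse = [[1.0] * 3 for _ in range(3)]
dense = deconv_upsample(coarse, 2)
print(len(dense), len(dense[0]))
print(bilinear_kernel_1d(2))
```

Because the kernel is just the initial value of an ordinary filter, an intermediate upsampling layer can start from these weights and then be learned, while a final layer can keep them fixed.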

Classifier to Dense FCN
Convolutionalize proven classification architectures: AlexNet, VGG, and GoogLeNet (reimplementation)
Remove the classification layer and convert all fully connected layers to convolutions
Append a 1x1 convolution with channel dimension 21 to predict scores at each of the coarse output locations (20 categories + background for PASCAL)
“Despite similar classification accuracy, our implementation of GoogLeNet did not match this segmentation result.”
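
Converting a fully connected layer to a convolution can be illustrated with a small single-channel sketch. The weights are unchanged; they are only reshaped into a kernel the size of the layer's original input, so that on a larger image the layer produces a map of scores instead of a single vector. The function names and toy numbers below are hypothetical, and stride 1 with no padding is assumed.

```python
def fc_layer(patch, weights, biases):
    """Fully connected layer on a flattened kh x kw patch: one score per unit."""
    flat = [v for row in patch for v in row]
    return [sum(w * v for w, v in zip(w_row, flat)) + b
            for w_row, b in zip(weights, biases)]


def convolutionalized_fc(image, weights, biases, kh, kw):
    """The same weights used as kh x kw kernels slid over a larger image.

    Returns out[unit][y][x] (stride 1, no padding).
    """
    h, w = len(image), len(image[0])
    out = []
    for w_row, b in zip(weights, biases):
        # Reshape the flat weight row into a kh x kw kernel.
        kernel = [w_row[i * kw:(i + 1) * kw] for i in range(kh)]
        score_map = [[b + sum(kernel[i][j] * image[y + i][x + j]
                              for i in range(kh) for j in range(kw))
                      for x in range(w - kw + 1)]
                     for y in range(h - kh + 1)]
        out.append(score_map)
    return out


weights = [[1, 0, 0, -1], [2, 1, 1, 2]]  # two output units over a 2 x 2 input
biases = [0, 1]
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # larger than the 2 x 2 the layer expects

maps = convolutionalized_fc(image, weights, biases, 2, 2)
top_left = [row[0:2] for row in image[0:2]]
print(fc_layer(top_left, weights, biases), [m[0][0] for m in maps])
```

Each location of the score maps equals what the original fully connected layer would output on the patch under it, but the computation is shared across overlapping patches.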

Classifier to Dense FCN
Cast ILSVRC classifiers into FCNs and compare performance on the validation set of PASCAL VOC 2011
Note: these are validation numbers. Even at this first stage, the results are already state of the art
The networks are initialized from the classification models trained on ImageNet
Train with a per-pixel multinomial logistic loss and validate with mean intersection over union
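
The training loss and the validation metric named on this slide can be sketched as follows. This is a minimal plain-Python illustration with hypothetical function names; it assumes every pixel has a valid label, and it skips classes that appear in neither the prediction nor the ground truth when averaging, which is a choice made here rather than something stated in the slides.

```python
import math


def pixelwise_cross_entropy(scores, labels):
    """Mean softmax cross-entropy over all pixels.

    scores[y][x] is a list of per-class scores; labels[y][x] is a class index.
    """
    total, count = 0.0, 0
    for score_row, label_row in zip(scores, labels):
        for s, label in zip(score_row, label_row):
            m = max(s)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in s))
            total += log_z - s[label]
            count += 1
    return total / count


def mean_iou(pred, truth, num_classes):
    """Mean over classes of intersection / union of predicted and true pixels."""
    pairs = [(p, t) for p_row, t_row in zip(pred, truth)
             for p, t in zip(p_row, t_row)]
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)


pred = [[0, 0], [1, 1]]
truth = [[0, 1], [1, 1]]
print(mean_iou(pred, truth, num_classes=2))

uniform_scores = [[[0.0, 0.0], [0.0, 0.0]]]
print(pixelwise_cross_entropy(uniform_scores, [[0, 1]]))
```

In the toy example, class 0 has intersection 1 and union 2, and class 1 has intersection 2 and union 3, so the mean is (1/2 + 2/3) / 2.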

spectrum of deep features
combine where (local, shallow) with what (global, deep)
fuse features into deep jet (cf. Hariharan et al. CVPR15 “hypercolumn”)
Slide credit: Jonathan Long

skip layers
interp + sum
skip to fuse layers!
dense output
end-to-end, joint learning of semantics and location
Slide credit: Jonathan Long
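
The "interp + sum" fusion can be sketched for a single class channel. This is a simplified illustration with hypothetical function names: it uses nearest-neighbour 2x interpolation as a stand-in for the upsampling layer, and it assumes the finer score map is exactly twice the size of the coarser one.

```python
def upsample2x_nearest(grid):
    """Repeat every value twice along both axes (stand-in for learned upsampling)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_skip(coarse_scores, fine_scores):
    """Upsample the coarse (deep) scores 2x and add the finer (shallow) scores."""
    up = upsample2x_nearest(coarse_scores)
    if len(up) != len(fine_scores) or len(up[0]) != len(fine_scores[0]):
        raise ValueError("fine score map must be exactly 2x the coarse one")
    return [[u + f for u, f in zip(u_row, f_row)]
            for u_row, f_row in zip(up, fine_scores)]


coarse = [[1, 2], [3, 4]]              # e.g. scores predicted at stride 32
fine = [[10] * 4 for _ in range(4)]    # e.g. scores predicted at stride 16
fused = fuse_skip(coarse, fine)
for row in fused:
    print(row)
```

The fused map has the resolution of the finer layer, so repeating the step with an even shallower layer refines the prediction further.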

skip layers
“Max fusion made learning difficult due to gradient switching.”
Decreasing the stride of pooling layers is the most straightforward way to obtain finer predictions. However, doing so is problematic for our VGG16-based net. Setting the pool5 layer to have stride 1 requires our convolutionalized fc6 to have a kernel size of 14 × 14 in order to maintain its receptive field size. In addition to their computational cost, we had difficulty learning such large filters. We made an attempt to re-architect the layers above pool5 with smaller filters, but were not successful in achieving comparable performance; one possible explanation is that the initialization from ImageNet-trained weights in the upper layers is important.

Comparison of skip FCNs
Results on a subset of the validation set of PASCAL VOC 2011
Fixed = only the final layer is fine-tuned

skip layer refinement
Columns: input image, stride 32 (no skips), stride 16 (1 skip), stride 8 (2 skips), ground truth
Slide credit: Jonathan Long

training + testing
train on a full image at a time, without patch sampling
reshape the network to take input of any size
forward time is ~150 ms for a 500 x 500 x 21 output
Slide credit: Jonathan Long

Results – PASCAL VOC 2011/12
VOC 2011: 8498 training images (from additional labeled data)
For the following 3 results, dropout was used wherever the original network used it
SDS: MCG proposals, feature extraction, SVM to classify, region refinement

Results – NYUDv2
1449 RGB-D images with pixelwise labels, coalesced into 40 categories
Gupta et al.: region proposals (using depth and RGB), deep features for depth and RGB, SVM classifier, segmentation
Gupta et al. encode depth differently (surface normals and height from ground included)
To add depth information, we train on a model upgraded to take four-channel RGB-D input (early fusion)
RGB-D (early fusion) gives little improvement, perhaps because it is difficult to propagate meaningful gradients through the model
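
Early fusion here means the depth map is appended to the colour channels before the first layer, so the network sees a four-channel input. A minimal sketch of that input construction, with a hypothetical function name and toy values:

```python
def early_fusion(rgb, depth):
    """Stack depth as a fourth channel: (r, g, b) + d -> (r, g, b, d) per pixel."""
    if len(rgb) != len(depth) or any(len(a) != len(b) for a, b in zip(rgb, depth)):
        raise ValueError("RGB image and depth map must have the same height and width")
    return [[tuple(pixel) + (d,) for pixel, d in zip(rgb_row, depth_row)]
            for rgb_row, depth_row in zip(rgb, depth)]


rgb = [[(255, 0, 0), (0, 255, 0)],
       [(0, 0, 255), (10, 20, 30)]]
depth = [[1.5, 2.0],
         [0.5, 3.25]]

rgbd = early_fusion(rgb, depth)
print(rgbd[0][0], len(rgbd[0][0]))
```

Only the first convolution has to change to accept the extra channel; the rest of the network is unchanged.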

Results – SIFT Flow
2688 images with pixel labels
33 semantic categories, 3 geometric categories
Semantic: bridge, mountain, sun, etc.
Geometric: horizontal, vertical, sky
Learn both label spaces jointly: performance and computation for learning and inference are similar to those of independent models
Farabet: multi-scale convnet, averaging class predictions across superpixels
Pinheiro: patch-based learning at multiple scales with recurrent CNNs (rCNNs)

results
Figure columns: FCN, SDS*, Truth, Input
Relative to prior state-of-the-art SDS: 20% relative improvement for mean IoU, 286× faster
+ NYUD net for multi-modal input and SIFT Flow net for multi-task output
*Simultaneous Detection and Segmentation, Hariharan et al. ECCV14
Slide credit: Jonathan Long

leaderboard
Legend: == segmentation with Caffe
Many segmentation methods are powered by Caffe, and most of them are FCNs
Slide credit: Jonathan Long

conclusion
fully convolutional networks are fast, end-to-end models for pixelwise problems
code in Caffe branch (merged soon)
models for PASCAL VOC, NYUDv2, SIFT Flow, PASCAL-Context
caffe.berkeleyvision.org
fcn.berkeleyvision.org
github.com/BVLC/caffe
Slide credit: Jonathan Long