Learning Deconvolution Network for Semantic Segmentation


1 Learning Deconvolution Network for Semantic Segmentation
Computer Vision Lab. Hyeonwoo Noh, Seunghoon Hong, Bohyung Han

2 Contents
Semantic Segmentation
Previous CNN-based Semantic Segmentation
Our Approach
Discussion
Future Directions

3 Semantic Segmentation
Objective: recognize objects in an image with pixel-level detail.
[Figure: example images with labels (person, dog, bicycle, lion, giraffe, ball) comparing image classification, object detection, and semantic segmentation.]

4 Previous CNN-based Semantic Segmentation
Fully Convolutional Networks for Semantic Segmentation [FCN] [1]
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs [Deeplab-CRF] [2]
[1] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.

5 Fully Convolutional Networks for Semantic Segmentation [FCN]
J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015

6 Approach
Apply a classification CNN to a large image with large padding (FCN).

7 How to Apply CNN to Large Input
A fully connected layer is also a convolution layer.
[Figure: pool5 (7x7x512), fc6, and fc7 (4096-d) of the classification network reinterpreted as convolutions; applied to a larger input, a 22x22 pool5 map yields a 16x16 fc7 score map.]
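The idea can be made concrete with a small sketch. The following PyTorch snippet (an illustration, not the authors' Caffe implementation) reinterprets the fc6/fc7 layers as convolutions, using the dimensions from the figure above:

```python
import torch
import torch.nn as nn

# Sketch: treat fc6/fc7 of a VGG-style classifier as convolutions.
# fc6 takes a 7x7x512 pool5 window -> a 7x7 convolution with 4096 outputs;
# fc7 becomes a 1x1 convolution on the 4096-d feature map.
fc6_conv = nn.Conv2d(512, 4096, kernel_size=7)   # weights could be copied from fc6
fc7_conv = nn.Conv2d(4096, 4096, kernel_size=1)  # weights could be copied from fc7

# On a 7x7 pool5 map the output is 1x1 (the original classifier behaviour);
# on a larger 22x22 pool5 map the same weights yield a 16x16 score map.
pool5_small = torch.randn(1, 512, 7, 7)
pool5_large = torch.randn(1, 512, 22, 22)
print(fc7_conv(fc6_conv(pool5_small)).shape)  # torch.Size([1, 4096, 1, 1])
print(fc7_conv(fc6_conv(pool5_large)).shape)  # torch.Size([1, 4096, 16, 16])
```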

8 How to Obtain Higher Resolution Output - 1
A single deconvolution layer with a large stride. The deconvolution filter is initialized with bilinear weights, making it equivalent to bilinear interpolation.
[Figure: 16x16 score map upsampled to a 544x544 output.]
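As an illustration, a minimal PyTorch sketch of a stride-32 transposed convolution initialized with a bilinear kernel; the 21-class count and the 64/32 kernel/stride sizes follow common FCN practice and are assumptions here:

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Build a (channels, channels, k, k) bilinear upsampling kernel (one filter per class)."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size, dtype=torch.float32)
    filt = 1 - torch.abs(og - center) / factor
    kernel = filt[:, None] * filt[None, :]
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = kernel  # each class is upsampled independently
    return weight

num_classes = 21  # PASCAL VOC classes + background (assumed)
# Stride-32 upsampling deconvolution initialised with bilinear weights,
# so before any training it behaves like plain bilinear interpolation.
upsample = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=64,
                              stride=32, bias=False)
upsample.weight.data.copy_(bilinear_kernel(num_classes, 64))
```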

9 How to Obtain Higher Resolution Output - 2
Skip Architecture
Pros: low-level features are spatially larger than high-level features.
Cons: low-level features are less discriminative.
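A rough sketch of the skip fusion step, in the spirit of FCN-16s; the tensor names and the bilinear upsampling choice are assumptions, not the exact FCN code:

```python
import torch
import torch.nn.functional as F

def fuse_skip(score_fc7, score_pool4):
    """Fuse coarse but discriminative fc7 scores with larger, weaker pool4 scores."""
    # Upsample the coarse score map to the pool4 resolution, then add element-wise.
    up = F.interpolate(score_fc7, size=score_pool4.shape[2:],
                       mode='bilinear', align_corners=False)
    return up + score_pool4  # both tensors have num_classes channels
```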

10 Limitations
Coarse output score maps
The skip architecture generates noisy predictions
Fixed receptive field size (causes label confusion)

11 Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs [Deeplab-CRF]
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015

12 Approach
Better than FCN in two aspects:
Produces denser output predictions using the "hole algorithm"
CRF-based post-processing
Note: this approach does not use a skip architecture; it simply upscales the output score map without a deconvolution layer.

13 Hole Algorithm
Why is the output score map smaller than the input image? Because of pooling with stride.
Removing pooling, or pooling without stride, makes it hard to utilize a pre-trained CNN; the hole algorithm solves this problem.
[Figure: feature map sizes in an FCN with conventional pooling vs. an FCN with the hole algorithm.]
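The hole (à trous) algorithm corresponds to what is now called dilated convolution: the pooling stride is removed, and the following filters sample their input with gaps so the pre-trained receptive field is preserved. A minimal sketch, assuming a 3x3 VGG-style convolution:

```python
import torch
import torch.nn as nn

# After removing the stride from a pooling layer, the next convolution must
# "skip" input positions to keep the same effective receptive field.
# A 3x3 filter with dilation=2 covers the same 5x5 area the pre-trained filter
# saw after a stride-2 pooling, so the pre-trained weights remain valid.
conv_dilated = nn.Conv2d(512, 512, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 512, 40, 40)
print(conv_dilated(x).shape)  # torch.Size([1, 512, 40, 40]) -- resolution preserved
```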

14 Limitations
Still coarse output score maps: a 39x39 output map is produced from a 306x306 input image
Fixed receptive field size (causes label confusion)

15 Learning Deconvolution Network for Semantic Segmentation
Hyeonwoo Noh, Seunghoon Hong, Bohyung Han

16 Our Approach
Deconvolution Network: a CNN architecture designed to generate large outputs, enabling dense output score prediction
Instance-wise prediction: inference on object proposals followed by aggregation, enabling recognition of objects at multiple scales

17 Deconvolution Network
Generates a dense segmentation from the fc7 feature representation
Multiple deconvolution layers with ReLU non-linearities
Unpooling-based upscaling

18 Deconvolution Network
Unpooling: places activations back at the locations recorded during pooling, preserving the structure of the activations
Deconvolution: densifies the sparse activations; the learned filters act as bases to reconstruct object shape
[Figure: forward propagation of an input through unpooling and deconvolution layers.]
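A minimal PyTorch sketch of one unpooling + deconvolution stage; the real DeconvNet mirrors the full VGG16 pooling hierarchy, and the channel sizes here are assumptions:

```python
import torch
import torch.nn as nn

# Encoder pooling must remember where each maximum came from ...
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
# ... so the decoder can place activations back at those locations (unpooling),
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
# and a deconvolution (transposed convolution) densifies the sparse unpooled map.
deconv = nn.Sequential(
    nn.ConvTranspose2d(512, 512, kernel_size=3, padding=1),
    nn.BatchNorm2d(512),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 512, 28, 28)
pooled, indices = pool(x)            # 14x14 map plus the argmax locations
restored = unpool(pooled, indices)   # 28x28, sparse (structure preserved)
dense = deconv(restored)             # 28x28, densified by learned filters
```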

19 Deconvolution Network
Training a deconvolution network is difficult: it is a very deep network with a large output space.
Batch Normalization [3]: normalize the input of every layer to a standard Gaussian distribution, preventing drastic changes of the input distribution in upper layers.
Two-stage training: the first stage trains with object-centered examples, the second stage with real object proposals. This approach makes the network generalize better.
[3] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

20 Instance-wise Prediction
Inference on object proposals, followed by pixel-wise aggregation
Objects at multiple scales are detected in different proposals
[Figure: pipeline stages: 1. input image, 2. object proposals, 3. prediction and aggregation with DeconvNet, 4. results.]
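A hedged sketch of the aggregation step: each proposal's score map is resized into its box in full-image coordinates and combined by a per-pixel maximum (the later discussion slide notes that max vs. average is itself an open design choice); the helper name and the bilinear resizing are assumptions:

```python
import torch
import torch.nn.functional as F

def aggregate_proposals(proposal_scores, boxes, image_size, num_classes):
    """Pixel-wise max aggregation of per-proposal DeconvNet score maps.

    proposal_scores: list of (num_classes, h_i, w_i) tensors, one per proposal
    boxes:           list of (x1, y1, x2, y2) integer boxes in image coordinates
    """
    H, W = image_size
    full = torch.zeros(num_classes, H, W)  # uncovered pixels keep score 0
    for scores, (x1, y1, x2, y2) in zip(proposal_scores, boxes):
        # Resize the proposal's output to its box size, then paste it in place.
        resized = F.interpolate(scores[None], size=(y2 - y1, x2 - x1),
                                mode='bilinear', align_corners=False)[0]
        full[:, y1:y2, x1:x2] = torch.maximum(full[:, y1:y2, x1:x2], resized)
    return full  # per-pixel labels: full.argmax(dim=0)
```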

21 Further performance enhancement
Ensemble with an FCN-based model: FCN-based models have characteristics complementary to ours. Ours captures fine detail and handles objects at various scales; FCN captures context within the image.
Fully connected CRF [4] based post-processing.
[4] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
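A tiny sketch of one plausible combination rule: average the two models' class score maps for the same image, then run the fully connected CRF on the result. The equal-weight averaging is an assumption here, not a confirmed detail of the method:

```python
import torch

def ensemble_scores(deconvnet_scores, fcn_scores):
    """Combine DeconvNet (fine detail) and FCN (context) outputs.

    Both inputs: (num_classes, H, W) score maps for the same image.
    The combined map can then be refined with a fully connected CRF.
    """
    return 0.5 * (deconvnet_scores + fcn_scores)
```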

22 Quantitative Result
Best among the models trained with VOC2012 training data
VOC2012: 12,031 images; MSCOCO: 123,287 images

23 Qualitative Result

24 Discussion
Understanding details of our work

25 Importance of Two-Stage Training
Two-stage training is not that critical: DeconvNet can be trained well with single-stage training (directly applying the 2nd step).
But two-stage training is better: in our experiment, the validation accuracy of two-stage training is higher (almost 1.00) than that of single-stage training.
Based on previous experiments, this margin could make a big difference in the mean IoU score; a detailed study has not yet been carried out.
[Figure: example training inputs for the 1st stage and the 2nd stage.]

26 Importance of Batch Normalization [3]
Without batch normalization, DeconvNet gets stuck in local minima.
Experimental result with an early DeconvNet model (binary segmentation, a single deconvolution layer between unpooling layers):
Maximum segmentation accuracy: with batch normalization (0.18 loss); without batch normalization (0.59 loss)
What is batch normalization? Brief introduction: normalize the output of every layer to a standard Gaussian distribution (within each mini-batch), and learn the "mean" and "variance" instead.
Why does it work? (my opinion: it relates to ReLU)
[3] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
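A small numeric sketch of what batch normalization computes on one mini-batch, matching the "normalize, then learn the mean and variance instead" description; nn.BatchNorm2d is the standard layer form used after convolution/deconvolution layers:

```python
import torch
import torch.nn as nn

# What batch normalization computes, written out for one mini-batch of feature maps:
x = torch.randn(8, 64, 32, 32)                       # (batch, channels, H, W)
mean = x.mean(dim=(0, 2, 3), keepdim=True)           # per-channel batch mean
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + 1e-5)          # ~ standard Gaussian per channel
gamma = torch.ones(1, 64, 1, 1)                      # learned scale ("variance")
beta = torch.zeros(1, 64, 1, 1)                      # learned shift ("mean")
y = gamma * x_hat + beta

# Equivalent layer form (training mode uses the batch statistics as above):
bn = nn.BatchNorm2d(64)
assert torch.allclose(bn(x), y, atol=1e-4)
```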

27 Training Deconvolution Network
Training plot of the best-performing model, examining network learning capacity and generalization ability.
[Figure: training curves for the Train+val / val and Train / val settings (accuracies 87.53 and 92.98).]

28 Training Deconvolution Network
Network learning capacity: fc7 is a 4096-dimensional vector. Is that enough to encode every possible segmentation?

29 Training Deconvolution Network
Improving generalization with dropout? Not effective.
Baseline: min train loss 0.08, min test loss 0.24, max test accuracy 92.8
Dropout[fc6,fc7](0.2): min train loss 0.11, min test loss 0.23, max test accuracy 92.6
Dropout[fc6,fc7](0.5): min train loss 0.12, min test loss 0.23, max test accuracy 92.6
Dropout[final layer](0.2): min train loss 0.11, min test loss 0.24, max test accuracy 92.7

30 Instance-wise Prediction
What is instance-wise prediction really doing? (short demo)
Instances work like "attention" rather than object proposals for detection: the network sees the image from various aspects, and the observations are aggregated into an image-level prediction.
Disadvantages of instance-wise prediction:
Obstacles to end-to-end training: when to stop training? how to aggregate predictions (max? average?)
How to construct training data: object-centered bounding boxes? random cropping? hard negative mining?
Predictions for each proposal are independent

31 Results on COCO+VOC
Why didn't we use MSCOCO?
DeconvNet: 63.683
DeconvNet+CRF:
EDeconvNet+CRF:

32 Possible Future Directions
Data augmentation with MSCOCO
Enhance performance on the training set
Make the network more flexible (apply the hole algorithm to encode more features)
Study why dropout doesn't work in our setting
Tackle the disadvantages of instance-wise prediction: attention models [4] [5] [6]
[4] V. Mnih, N. Heess, and A. Graves. Recurrent models of visual attention. In NIPS, 2014.
[5] Y. Tang, N. Srivastava, and R. Salakhutdinov. Learning generative models with visual attention. In NIPS, 2014.
[6] K. Xu et al. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.

