VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION


VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION Karen Simonyan & Andrew Zisserman Visual Geometry Group, Department of Engineering Science, University of Oxford ICLR 2015

ILSVRC 2014

Outline INTRODUCTION CONVNET CONFIGURATIONS CLASSIFICATION FRAMEWORK CLASSIFICATION EXPERIMENTS CONCLUSION

Outline INTRODUCTION CONVNET CONFIGURATIONS CLASSIFICATION FRAMEWORK CLASSIFICATION EXPERIMENTS CONCLUSION

INTRODUCTION
Prior ILSVRC work: Krizhevsky et al., 2012 (AlexNet); Zeiler & Fergus, 2013 (ZF); Sermanet et al., 2014 (OverFeat); Howard, 2014
Improvements on AlexNet: smaller receptive window size (11 → 7) and smaller stride (4 → 2); training and testing over multiple scales

INTRODUCTION Deeper networks with very small convolution filters: a single 7×7 conv. layer has 7²C² = 49C² weights, whereas a stack of three 3×3 layers has 3 × (3²C²) = 27C² weights (for C input and output channels) while covering the same 7×7 effective receptive field
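The weight-count comparison on this slide can be checked with a short sketch (C = 512 is an illustrative channel count, not taken from the slide):

```python
def conv_params(kernel, channels, layers=1):
    """Weights in a stack of `layers` conv layers with a square `kernel`,
    each mapping `channels` input channels to `channels` outputs (biases ignored)."""
    return layers * kernel ** 2 * channels ** 2

C = 512  # illustrative channel count (an assumption, not from the slide)
print(conv_params(7, C))     # one 7x7 layer:    49 * C^2 = 12845056
print(conv_params(3, C, 3))  # three 3x3 layers: 27 * C^2 = 7077888
```

The stacked 3×3 layers use roughly 45% fewer weights and interleave two extra non-linearities.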

Outline INTRODUCTION CONVNET CONFIGURATIONS CLASSIFICATION FRAMEWORK CLASSIFICATION EXPERIMENTS CONCLUSION

CONVNET CONFIGURATIONS
Input: a fixed-size 224 × 224 RGB image
Pre-processing: subtracting the mean RGB value
Conv. layers: very small receptive fields (3 × 3, and 1 × 1), 1-pixel stride
Max-pooling: 5 layers, 2 × 2 pixel window, stride 2
Fully-connected (FC): 3 layers (2 × 4096 channels + 1000 channels)
Non-linearity: ReLU (Rectified Linear Unit)
Normalisation: LRN (Local Response Normalisation), used in only one configuration
1 × 1 conv.: a linear transformation of the input channels ("Network in Network", Lin et al., 2014)
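These numbers are mutually consistent: 3 × 3 convolutions with 1-pixel padding preserve spatial resolution, so only the five max-pool layers shrink the 224 × 224 input:

```python
# 3x3 convs with 1-pixel padding preserve resolution; each of the
# five 2x2, stride-2 max-pool layers halves it.
size = 224
for _ in range(5):
    size //= 2
print(size)  # 7 -> the last conv feature map is 7 x 7 (x 512 channels)
```

This is why the first FC layer sees a 7 × 7 × 512 volume, a fact the fully-convolutional conversion later relies on.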

Outline INTRODUCTION CONVNET CONFIGURATIONS CLASSIFICATION FRAMEWORK CLASSIFICATION EXPERIMENTS CONCLUSION

CLASSIFICATION FRAMEWORK TRAINING: generally follows Krizhevsky et al. (2012)
Mini-batch gradient descent with momentum
Batch size: 256
Momentum: 0.9
Weight decay: 5 · 10⁻⁴
Dropout ratio (first two FC layers): 0.5
Learning rate: 10⁻², decreased by a factor of 10 three times

CLASSIFICATION FRAMEWORK TRAINING: 370K iterations (74 epochs), fewer than Krizhevsky et al. (2012), attributed to the implicit regularisation of greater depth and smaller conv. filter sizes, and to pre-initialisation
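The iteration and epoch counts on this slide agree with the dataset and batch sizes quoted later, as a quick back-of-the-envelope check shows:

```python
# With ~1.3M training images and batch size 256, 370K iterations
# correspond to roughly the 74 epochs quoted on the slide.
images, batch, iterations = 1_300_000, 256, 370_000
iters_per_epoch = images / batch        # about 5078 updates per epoch
epochs = iterations / iters_per_epoch   # about 72.9, close to the reported 74
print(round(epochs, 1))
```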

CLASSIFICATION FRAMEWORK TRAINING: pre-initialisation
First train the shallow configuration A with random initialisation
Then initialise the first 4 conv. layers and the last 3 FC layers of the deeper nets with A's weights
※ Alternatively, the random initialisation procedure of Glorot & Bengio (2010) can be used throughout
Data augmentation: 224 × 224 crops randomly sampled from rescaled training images, plus horizontal flipping, random RGB colour shift, and rescaling
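A minimal sketch of the crop-and-flip part of this augmentation (the RGB colour shift is omitted; `sample_augmentation` is a hypothetical helper, and `image_w`/`image_h` are the sides of the rescaled training image):

```python
import random

def sample_augmentation(image_w, image_h, crop=224):
    """Pick a random 224x224 crop position and a horizontal-flip decision."""
    x = random.randrange(image_w - crop + 1)  # left edge of the crop
    y = random.randrange(image_h - crop + 1)  # top edge of the crop
    flip = random.random() < 0.5              # flip with probability 0.5
    return x, y, flip

x, y, flip = sample_augmentation(340, 256)
```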

CLASSIFICATION FRAMEWORK TRAINING: training image size
S: the smallest side of an isotropically-rescaled training image
S = 224: the crop captures whole-image statistics; S ≫ 224: the crop corresponds to a small part of the image
Two approaches for setting the training scale S: single-scale and multi-scale
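Isotropic rescaling to a given S can be sketched as follows (`isotropic_rescale` is a hypothetical helper, not from the slides):

```python
def isotropic_rescale(w, h, S):
    """Target size (w', h') such that the smallest side equals S and the
    aspect ratio is preserved (isotropic rescaling)."""
    if w <= h:
        return S, round(h * S / w)
    return round(w * S / h), S

print(isotropic_rescale(500, 375, 256))  # (341, 256)
```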

CLASSIFICATION FRAMEWORK TRAINING: single-scale
S = 256 (as in Krizhevsky et al.; Zeiler & Fergus; Sermanet et al.) and S = 384
The S = 384 network is initialised with the weights pre-trained at S = 256 (for speed-up) and uses a smaller initial learning rate (10⁻³)

CLASSIFICATION FRAMEWORK TRAINING: multi-scale
Randomly sample S from the range [S_min, S_max] = [256, 512]
Trained by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed S = 384
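The scale-jittering step amounts to one random draw per training image:

```python
import random

S_MIN, S_MAX = 256, 512  # the jittering range from the slide

def sample_training_scale():
    """Draw the smallest-side scale S uniformly from [S_min, S_max]."""
    return random.randint(S_MIN, S_MAX)

scales = [sample_training_scale() for _ in range(5)]
```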

CLASSIFICATION FRAMEWORK TESTING
Q: the pre-defined smallest image side to which a test image is isotropically rescaled (the test scale); Q is not necessarily equal to the training scale S
Dense evaluation, similar to Sermanet et al. (2014): the fully-connected layers are converted to convolutional layers (a fully-convolutional network), which is applied to the whole (uncropped) image

CLASSIFICATION FRAMEWORK Fully-convolutional network
The functional form is identical (dot products)
E.g. the first FC layer (K = 4096) on a 7 × 7 × 512 input is equivalent to a conv. layer with F = 7, P = 0, S = 1, K = 4096, giving a 1 × 1 × 4096 output; the last 2 FC layers become F = 1 convolutions
The resulting class score map has a variable spatial resolution, dependent on the input image size, and is spatially averaged (sum-pooled) to a fixed-size vector of class scores
Test-set augmentation: the scores for the original and horizontally flipped images are averaged
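The FC-to-conv equivalence is just a reshape of the weight matrix, which a small NumPy check makes concrete (scaled-down dimensions for speed; the real layer is K = 4096 on a 7 × 7 × 512 input):

```python
import numpy as np

K, H, W, C = 8, 7, 7, 16              # toy sizes standing in for 4096 / 7x7x512
rng = np.random.default_rng(0)
fc_w = rng.standard_normal((K, H * W * C))
x = rng.standard_normal((H, W, C))

fc_out = fc_w @ x.ravel()             # FC layer: one dot product per output unit
conv_w = fc_w.reshape(K, H, W, C)     # the same weights, viewed as K HxWxC filters
conv_out = np.einsum('khwc,hwc->k', conv_w, x)  # 7x7 "valid" conv at one position

print(np.allclose(fc_out, conv_out))  # True: identical functional form
```

On a larger input the convolutional version simply slides, producing the spatial class score map described above.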

CLASSIFICATION FRAMEWORK TESTING
No need to sample multiple crops, which is less efficient than dense evaluation
Szegedy et al. (2014): using a large set of crops can nevertheless improve accuracy
Crops and dense evaluation use different convolution boundary conditions (zero padding vs. context from neighbouring image regions), so multi-crop evaluation is complementary to dense evaluation

CLASSIFICATION FRAMEWORK IMPLEMENTATION DETAILS
C++ Caffe toolbox
4 NVIDIA Titan Black GPUs
Training a single net took 2–3 weeks

Outline INTRODUCTION CONVNET CONFIGURATIONS CLASSIFICATION FRAMEWORK CLASSIFICATION EXPERIMENTS CONCLUSION

CLASSIFICATION EXPERIMENTS Dataset: ILSVRC (1000 classes)
Training: 1.3M images; validation: 50K images; testing: 100K images with held-out class labels

CLASSIFICATION EXPERIMENTS SINGLE SCALE EVALUATION MULTI-SCALE EVALUATION MULTI-CROP EVALUATION CONVNET FUSION COMPARISON WITH THE STATE OF THE ART

CLASSIFICATION EXPERIMENTS SINGLE-SCALE EVALUATION
Single-scale training (fixed S): Q = S
Multi-scale training (S ∈ [S_min, S_max]): Q = 0.5 (S_min + S_max)

CLASSIFICATION EXPERIMENTS SINGLE SCALE EVALUATION

CLASSIFICATION EXPERIMENTS MULTI-SCALE EVALUATION
Single-scale training (fixed S): Q = {S − 32, S, S + 32}
Multi-scale training (S ∈ [S_min, S_max]): Q = {S_min, 0.5 (S_min + S_max), S_max}

CLASSIFICATION EXPERIMENTS MULTI-SCALE EVALUATION

CLASSIFICATION EXPERIMENTS MULTI-CROP EVALUATION: multi-crop and dense evaluation are complementary due to their different treatment of convolution boundary conditions, and their combination outperforms either alone

CLASSIFICATION EXPERIMENTS CONVNET FUSION: combine the soft-max class posteriors of several trained models by averaging
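Model averaging here simply means averaging soft-max class posteriors across models (the posterior values below are toy numbers for illustration):

```python
import numpy as np

def fuse(posteriors):
    """ConvNet fusion: average the soft-max class posteriors of several models."""
    return np.mean(posteriors, axis=0)

p1 = np.array([0.75, 0.125, 0.125])  # model 1's class posteriors (toy values)
p2 = np.array([0.25, 0.625, 0.125])  # model 2's class posteriors (toy values)
fused = fuse([p1, p2])               # [0.5, 0.375, 0.125]; prediction = class 0
```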

CLASSIFICATION EXPERIMENTS COMPARISON WITH THE STATE OF THE ART

Outline INTRODUCTION CONVNET CONFIGURATIONS CLASSIFICATION FRAMEWORK CLASSIFICATION EXPERIMENTS CONCLUSION

CONCLUSION
We evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification
Demonstrated that representation depth is beneficial for classification accuracy
State-of-the-art performance on the ImageNet challenge dataset can be achieved with a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) of substantially increased depth

Thanks for Listening!