
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION


1 VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
Karen Simonyan & Andrew Zisserman
Visual Geometry Group, Department of Engineering Science, University of Oxford
ICLR 2015

2 ILSVRC 2014

3 Outline
INTRODUCTION
CONVNET CONFIGURATIONS
CLASSIFICATION FRAMEWORK
CLASSIFICATION EXPERIMENTS
CONCLUSION

4 Outline
INTRODUCTION
CONVNET CONFIGURATIONS
CLASSIFICATION FRAMEWORK
CLASSIFICATION EXPERIMENTS
CONCLUSION

5 INTRODUCTION
Krizhevsky et al. (2012): ILSVRC-2012 winner (AlexNet)
Improvements on the original architecture:
Zeiler & Fergus (2013) (ZF); Sermanet et al. (2014) (OverFeat): smaller receptive window size (11 → 7) and smaller stride (4 → 2) in the first conv. layer
Sermanet et al. (2014); Howard (2014): training and testing over multiple scales

6 INTRODUCTION
Deeper layers & very small convolution filters
A stack of three 3 × 3 conv. layers has the same effective receptive field as a single 7 × 7 layer, but uses three non-linearities instead of one and fewer parameters (C channels per layer):
One 7 × 7 layer: 7²C² = 49C² weights
Three 3 × 3 layers: 3 × (3²C²) = 27C² weights
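To make the parameter arithmetic concrete, a quick Python sketch (the channel count C = 512 is just an example value, not from the slide):

```python
def conv_params(kernel, channels, layers=1):
    """Weights of `layers` stacked kernel x kernel conv layers,
    each mapping `channels` -> `channels` feature maps (biases ignored)."""
    return layers * (kernel ** 2) * channels * channels

C = 512
print(conv_params(7, C))     # one 7x7 layer:    49 * C^2 = 12,845,056
print(conv_params(3, C, 3))  # three 3x3 layers: 27 * C^2 =  7,077,888
```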

7 Outline
INTRODUCTION
CONVNET CONFIGURATIONS
CLASSIFICATION FRAMEWORK
CLASSIFICATION EXPERIMENTS
CONCLUSION

8 CONVNET CONFIGURATIONS
Input: a fixed-size 224 × 224 RGB image
Pre-processing: subtracting the mean RGB value (computed over the training set) from each pixel
Conv. layers: very small receptive fields (3 × 3; also 1 × 1 in one configuration), with a 1-pixel stride
Max-pooling: 5 layers, 2 × 2 pixel window, stride 2
Fully-connected (FC): 3 layers (the first two with 4096 channels each; the third performs 1000-way classification)
Non-linearity: ReLU (Rectified Linear Unit)
Normalisation: LRN (Local Response Normalisation) in one configuration only (it does not improve performance)
A 1 × 1 conv. is a linear transformation of the input channels (followed by a non-linearity), cf. "Network in Network", Lin et al. (2014)
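As an illustration only (the authors used a modified Caffe, not PyTorch), a minimal PyTorch sketch assembling configuration D (VGG-16) from the rules above:

```python
import torch.nn as nn

# Configuration D (VGG-16): numbers are output channels of 3x3 conv.
# layers, 'M' marks a 2x2 max-pool with stride 2.
CFG_D = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg):
    layers, in_ch = [], 3                      # 224 x 224 RGB input
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]  # 3x3 field, stride 1
            in_ch = v
    return nn.Sequential(*layers)

classifier = nn.Sequential(                    # the three FC layers
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096),        nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 1000),                     # 1000-way ILSVRC classification
)
```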

9 [Table of ConvNet configurations A–E — image not transcribed]

10 Outline
INTRODUCTION
CONVNET CONFIGURATIONS
CLASSIFICATION FRAMEWORK
CLASSIFICATION EXPERIMENTS
CONCLUSION

11 CLASSIFICATION FRAMEWORK
TRAINING : generally follows Krizhevsky et al. (2012)
Mini-batch gradient descent with momentum
Batch size : 256
Momentum : 0.9
Weight decay : 5 · 10⁻⁴
Dropout ratio (first two FC layers) : 0.5
Learning rate : 10⁻² initially, decreased by a factor of 10 when validation accuracy stopped improving (3 times in total)
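A minimal PyTorch sketch of this optimisation setup, under the assumption that `model` is any of the configurations (the original training ran in modified Caffe):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1))  # placeholder net

optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-2,            # initial learning rate 10^-2
                            momentum=0.9,
                            weight_decay=5e-4)  # weight decay 5 * 10^-4

# Decrease the learning rate by a factor of 10 when validation accuracy
# stops improving (done 3 times in total during training).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1)
# ... after each validation pass: scheduler.step(val_accuracy)
```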

12 CLASSIFICATION FRAMEWORK
TRAINING : generally follows Krizhevsky et al. (2012)
370K iterations (74 epochs) — fewer epochs than Krizhevsky et al., despite the greater depth, thanks to:
Implicit regularisation imposed by greater depth and smaller conv. filter sizes
Pre-initialisation of certain layers

13 CLASSIFICATION FRAMEWORK
TRAINING : generally follows Krizhevsky et al. (2012)
Pre-initialisation :
Train configuration A first (shallow enough for random initialisation)
Initialise the first 4 conv. layers and the last 3 FC layers of the deeper nets with A's weights
※ Random initialisation (Glorot & Bengio, 2010) was later found to work without pre-training
Data augmentation : randomly crop 224 × 224 from rescaled training images; to further augment the training set (see the sketch below):
Horizontal flipping
Random RGB colour shift
Rescaling (scale jittering)
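A minimal torchvision sketch of this pipeline; note the paper's colour shift is the PCA-based RGB scheme of Krizhevsky et al. (2012), for which ColorJitter below is only a stand-in:

```python
from torchvision import transforms

S = 256  # training scale: smallest image side after isotropic rescaling

train_transform = transforms.Compose([
    transforms.Resize(S),                   # isotropic rescale, smallest side = S
    transforms.RandomCrop(224),             # random 224 x 224 crop
    transforms.RandomHorizontalFlip(),      # horizontal flipping
    transforms.ColorJitter(brightness=0.1,  # stand-in for the paper's
                           saturation=0.1), #   PCA-based RGB colour shift
    transforms.ToTensor(),
])
```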

14 CLASSIFICATION FRAMEWORK
TRAINING : generally follows Krizhevsky et al. (2012)
Training image size S : the smallest side of an isotropically-rescaled training image
S = 224 : the crop captures whole-image statistics, spanning the smallest side
S ≫ 224 : the crop corresponds to a small part of the image (an object or object part)
Two approaches for setting the training scale S :
Single-scale
Multi-scale

15 CLASSIFICATION FRAMEWORK
TRAINING : generally follows Krizhevsky et al. (2012)
Single-scale :
S = 256 (as in Krizhevsky et al.; Zeiler & Fergus; Sermanet et al.)
S = 384 : initialised with the weights pre-trained at S = 256 (for speed-up), with a smaller initial learning rate of 10⁻³

16 CLASSIFICATION FRAMEWORK
TRAINING : generally follows Krizhevsky et al. (2012)
Multi-scale (scale jittering) :
Randomly sample S from a range [S_min, S_max] for each training image (S_min = 256, S_max = 512 in the paper)
For speed, fine-tune all layers of a single-scale model with the same configuration, pre-trained with fixed S = 384
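A minimal sketch of the scale-jittering step, assuming torchvision:

```python
import random
from torchvision.transforms import functional as TF

S_MIN, S_MAX = 256, 512  # jittering range used in the paper

class RandomScale:
    """Rescale an image so its smallest side equals a random S
    drawn from [S_MIN, S_MAX] (one draw per image)."""
    def __call__(self, img):
        s = random.randint(S_MIN, S_MAX)
        return TF.resize(img, s)  # isotropic rescale, smallest side = s
```

Dropping RandomScale() in place of the fixed transforms.Resize(S) in the earlier pipeline yields scale-jittered training crops.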

17 CLASSIFICATION FRAMEWORK
TESTING
Test scale Q : the smallest side of the isotropically-rescaled test image
Q is not necessarily equal to the training scale S
Dense evaluation, similar to Sermanet et al. (2014) (sliding window) :
The fully-connected layers are converted to convolutional layers (a fully-convolutional network)
The net is applied to the whole (uncropped) image

18 CLASSIFICATION FRAMEWORK
Fully-convolutional network
The functional form is identical (the same dot products)
E.g. the first FC layer, with K = 4096 and input 7 × 7 × 512, is equivalent to a conv. layer with F = 7, P = 0, S = 1, K = 4096 (output 1 × 1 × 4096)
The last two FC layers become conv. layers with F = 1
On larger inputs, the class score map has variable spatial resolution, dependent on the input image size
The class score map is spatially averaged (sum-pooled) to a fixed-size vector of class scores
Test-set augmentation : the scores of the original and horizontally flipped images are averaged
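A PyTorch sketch of the FC-to-conv conversion, assuming `fc` is a trained nn.Linear(512*7*7, 4096) as in the classifier sketch earlier:

```python
import torch.nn as nn

def fc_to_conv(fc, in_channels=512, field=7):
    """Turn a trained nn.Linear(in_channels*field*field, K) into the
    equivalent conv. layer (F=field, P=0, S=1, K=fc.out_features)."""
    conv = nn.Conv2d(in_channels, fc.out_features,
                     kernel_size=field, padding=0, stride=1)
    # Same dot products: reshape the (K x in*F*F) weight matrix
    # into K filters of shape in_channels x field x field.
    conv.weight.data = fc.weight.data.view(
        fc.out_features, in_channels, field, field)
    conv.bias.data = fc.bias.data
    return conv

# On a larger test image the resulting class score map has spatial size
# > 1x1; it is then spatially averaged:
#   scores = score_map.mean(dim=(-2, -1))   # N x 1000
```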

19 CLASSIFICATION FRAMEWORK
TESTING
No need to sample multiple crops (less efficient : the network must be re-applied to each crop)
Szegedy et al. (2014) : using a large set of crops can lead to improved accuracy
Crops and dense evaluation differ in their convolution boundary conditions (zero padding vs. padding from neighbouring image regions)
Multi-crop evaluation is thus complementary to dense evaluation

20 CLASSIFICATION FRAMEWORK
IMPLEMENTATION DETAILS
C++ Caffe toolbox
4 NVIDIA Titan Black GPUs
Training a single net took 2–3 weeks

21 Outline
INTRODUCTION
CONVNET CONFIGURATIONS
CLASSIFICATION FRAMEWORK
CLASSIFICATION EXPERIMENTS
CONCLUSION

22 CLASSIFICATION EXPERIMENTS
Dataset : ILSVRC, 1000 classes
Training : 1.3M images
Validation : 50K images
Testing : 100K images with held-out class labels

23 CLASSIFICATION EXPERIMENTS
SINGLE-SCALE EVALUATION
MULTI-SCALE EVALUATION
MULTI-CROP EVALUATION
CONVNET FUSION
COMPARISON WITH THE STATE OF THE ART

24 CLASSIFICATION EXPERIMENTS
SINGLE-SCALE EVALUATION (test image size Q) :
Single-scale training (fixed S) : Q = S
Multi-scale training (S ∈ [S_min, S_max]) : Q = 0.5(S_min + S_max)

25 CLASSIFICATION EXPERIMENTS
SINGLE-SCALE EVALUATION

26 CLASSIFICATION EXPERIMENTS
MULTI-SCALE EVALUATION (run the model over several rescaled versions of a test image, then average the resulting class posteriors) :
Single-scale training (fixed S) : Q = {S − 32, S, S + 32}
Multi-scale training (S ∈ [S_min, S_max]) : Q = {S_min, 0.5(S_min + S_max), S_max}

27 CLASSIFICATION EXPERIMENTS
MULTI-SCALE EVALUATION

28 CLASSIFICATION EXPERIMENTS
MULTI-CROP EVALUATION
Multi-crop evaluation performs slightly better than dense evaluation, and their combination outperforms each of them, presumably due to a different treatment of convolution boundary conditions

29 CLASSIFICATION EXPERIMENTS
CONVNET FUSION
Combine the outputs of several models by averaging their soft-max class posteriors
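A minimal PyTorch sketch of the fusion step (`models` is a hypothetical list of trained nets):

```python
import torch

def fuse(models, image):
    """Average the soft-max class posteriors of an ensemble of ConvNets."""
    probs = [torch.softmax(net(image), dim=1) for net in models]
    return torch.stack(probs).mean(dim=0)  # averaged posteriors, N x 1000
```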

30 CLASSIFICATION EXPERIMENTS
COMPARISON WITH THE STATE OF THE ART

31 Outline
INTRODUCTION
CONVNET CONFIGURATIONS
CLASSIFICATION FRAMEWORK
CLASSIFICATION EXPERIMENTS
CONCLUSION

32 CONCLUSION
We evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification
Demonstrated that representation depth is beneficial for classification accuracy
State-of-the-art performance on the ImageNet challenge dataset can be achieved with a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) of substantially increased depth

33 Thanks for Listening!

