Presentation on theme: "Spatial Pyramid Pooling in Deep Convolutional"— Presentation transcript:
1 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. CS688/WST665. Presenter: ByungIn Yoo
2 Contents: Introduction; Motivation; Previous work; Main Idea; Details; Experiments; Conclusion
3 Introduction: Web-scale image retrieval. Tasks: classify images or videos; detect and localize objects; estimate semantic and geometrical attributes. Why is this challenging? Viewpoint; illumination; occlusion; scale; deformation; background clutter.
4 Motivation: Current CNNs require a fixed input image size (e.g., 224 x 224), so each image is cropped or warped to 224x224 before it enters the Convolutional Neural Network (CNN). Cropping causes content loss and warping causes distortion. Recognition accuracy is degraded!
5 Motivation: Spatial Pyramid Pooling. Current CNNs require a fixed input image size (e.g., 224 x 224); cropping an image to 224x224 loses content and warping it introduces distortion, so recognition accuracy is degraded!
6 Previous work (1/2): Spatial Pyramid Matching, very successful in traditional computer vision. Grauman et al., "The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features", ICCV 2005. Lazebnik et al., "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", CVPR 2006.
7 Previous work (2/2): Zeiler-Fergus architecture (2013, 1st): 8 layers. Still low accuracy, and fixed image size! GoogLeNet (2014, 1st): 22 layers. Too complex a model, and fixed image size! [Figure legend: Convolution; Pooling; Softmax; Other] M. D. Zeiler et al., "Visualizing and Understanding Convolutional Neural Networks", arXiv, 2013. Christian Szegedy et al., "Going Deeper with Convolutions", arXiv, 2014.
8 Main Idea (1/2): Add a Spatial Pyramid Pooling layer! [Figure: previous nets vs. SPP-net]
9 Main Idea (2/2): Generate a fixed-length representation regardless of image size/scale. Simple (still 8 layers) and powerful model! Variable input size/scale: multi-size training, multi-scale testing, full-image view. Multi-level pooling: robust to deformation. Operates on the feature map: pooling in regions.
10 Details – Convolutional Layers and Feature Maps: Convolutional layers can inherently accept images of arbitrary size. Feature maps encode not only the strength of the responses but also their spatial positions.
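The point that convolutional layers accept any input size can be illustrated with the standard output-size formula for a strided convolution. This is a minimal sketch: the layer stack `LAYERS` below is hypothetical and is not the architecture used in the talk; it only shows that the feature map size changes with the input size, which is why a fixed-size fully connected layer cannot be attached directly.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical stack of (kernel, stride, padding) tuples, for illustration only.
LAYERS = [(7, 2, 3), (3, 2, 1), (5, 2, 2), (3, 2, 1)]


def feature_map_size(size):
    """Push an input side length through every layer in LAYERS."""
    for kernel, stride, padding in LAYERS:
        size = conv_output_size(size, kernel, stride, padding)
    return size


if __name__ == "__main__":
    for side in (224, 180, 321):
        print(side, "->", feature_map_size(side))
```

Every input size passes through the stack without error; only the size of the resulting feature map differs.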
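11 Details – The Spatial Pyramid Pooling Layer: SPP-net adds a new Spatial Pyramid Pooling layer after the last convolutional layer. With 256 filters and a pyramid of 4x4, 2x2, and 1x1 bins, the output is a 256 x (4x4 + 2x2 + 1) = 5376-dimensional vector. [Architecture: Conv1 -> Conv2 -> Conv3 -> Conv4 -> Conv5 -> SPP -> FC6 -> FC7 -> SoftMax]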
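The dimension count on this slide can be checked with a small NumPy sketch of pyramid pooling. The function name `spatial_pyramid_pool`, the choice of max pooling inside each bin, and the floor/ceil bin boundaries are assumptions made for illustration; the slide only fixes the pyramid levels (4x4, 2x2, 1x1) and the 256 filters.

```python
import numpy as np


def spatial_pyramid_pool(feature_map, levels=(4, 2, 1)):
    """Pool a (channels, height, width) array into a fixed-length vector.

    Each level n splits the map into n x n bins and max-pools every bin
    (max pooling is an assumption of this sketch). The output length is
    channels * sum(n * n for n in levels), independent of height and width.
    """
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin rows: floor for the start, ceil for the end, so bins cover the map.
            r0 = (i * height) // n
            r1 = ((i + 1) * height + n - 1) // n
            for j in range(n):
                c0 = (j * width) // n
                c1 = ((j + 1) * width + n - 1) // n
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for side in (13, 10):
        vec = spatial_pyramid_pool(rng.standard_normal((256, side, side)))
        print(side, vec.shape)
```

Both a 13x13 and a 10x10 feature map give a vector of the same length, which is what lets the fully connected layers stay unchanged.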
12 Details – Training with the Spatial Pyramid Pooling: Single-size training. Simply modify the configuration file of CNN frameworks. [Architecture: Conv1 -> Conv2 -> Conv3 -> Conv4 -> Conv5 -> SPP -> FC6 -> FC7 -> SoftMax; feature map: 13x13]
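For a fixed 13x13 feature map, each pyramid level can be expressed as an ordinary pooling layer whose window and stride are precomputed, which is what makes a configuration-file change sufficient. The rule used below (window = ceil(a/n), stride = floor(a/n) for an a x a map and n x n bins) is an assumption of this sketch and is not stated on the slide; the helper names are hypothetical.

```python
import math


def pooling_params(feature_size, bins):
    """Window and stride that turn a feature_size-wide map into `bins` outputs.

    Assumed rule: window = ceil(a / n), stride = floor(a / n).
    """
    window = math.ceil(feature_size / bins)
    stride = math.floor(feature_size / bins)
    return window, stride


def pooled_size(feature_size, window, stride):
    """Number of outputs of a sliding-window pooling layer along one axis."""
    return (feature_size - window) // stride + 1


if __name__ == "__main__":
    for bins in (4, 2, 1):
        window, stride = pooling_params(13, bins)
        print(f"{bins}x{bins}: window={window}, stride={stride}, "
              f"outputs={pooled_size(13, window, stride)}")
```

The rule yields the intended bin counts for the sizes checked here (13x13 and 10x10 maps with 4, 2, and 1 bins); it is not guaranteed for every combination of map size and bin count.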
13 Details – Training with the Spatial Pyramid Pooling: Multiple-size training. Multiple networks share all weights, each network handling a single input size (e.g., 224x224, 180x180). Improves scale invariance. [Figure label: resize]
14 Details – Fast CNN-based Object Detection: The features can be computed from the entire image only once. Similar accuracy, much faster (24x~64x) than R-CNN. R-CNN: 2000 convolutions! SPP-net: 1 convolution!
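The compute-once idea can be sketched as follows: one shared feature map is produced for the whole image, and each candidate window is pooled from that map into a fixed-length vector. Everything specific here is an assumption for illustration: the random array standing in for the convolutional features, the total stride of 16 used to project image coordinates onto the feature map, max pooling, and the names `STRIDE` and `pool_region`. Boxes are assumed to lie inside the image.

```python
import numpy as np

STRIDE = 16  # assumed ratio between image pixels and feature-map cells


def pool_region(feature_map, box, levels=(4, 2, 1)):
    """Pyramid-pool the part of a shared feature map covered by an image box.

    feature_map: (channels, height, width); box: (x0, y0, x1, y1) in pixels.
    """
    x0, y0, x1, y1 = box
    # Project the box onto the feature map, keeping at least one cell.
    fx0, fy0 = x0 // STRIDE, y0 // STRIDE
    fx1 = max(fx0 + 1, (x1 + STRIDE - 1) // STRIDE)
    fy1 = max(fy0 + 1, (y1 + STRIDE - 1) // STRIDE)
    region = feature_map[:, fy0:fy1, fx0:fx1]
    _, h, w = region.shape
    out = []
    for n in levels:
        for i in range(n):
            r0, r1 = (i * h) // n, ((i + 1) * h + n - 1) // n
            for j in range(n):
                c0, c1 = (j * w) // n, ((j + 1) * w + n - 1) // n
                out.append(region[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for the features of one 640x480 image, computed a single time.
    shared = rng.standard_normal((256, 30, 40))
    boxes = [(0, 0, 640, 480), (100, 50, 300, 200), (320, 240, 352, 272)]
    for box in boxes:
        print(box, pool_region(shared, box).shape)
```

The expensive step (producing `shared`) runs once per image; only the cheap pooling step is repeated per window.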
16 Experiments (2/4): ILSVRC image classification task (rank #3). SPP improves all CNN architectures. [Table columns: top-5 test accuracy; top-5 val. accuracy]
17 Experiments (3/4): ILSVRC image detection task. Fully annotated 200 object classes across 121,931 images. Allows evaluation of generic object detection in cluttered scenes at scale. [Figure legend: detected region; ground truth; true; false]
18 Experiments (4/4): ILSVRC image detection task (rank #2). More practical than R-CNN.
19 Conclusion: SPP is a flexible solution for handling different scales, sizes, and aspect ratios. Spatial Pyramid Pooling improves accuracy. Multi-size training improves accuracy. Full-image representation improves accuracy. Classification: SPP improves all CNNs in the literature. Detection: more practical and faster than R-CNN, with similar accuracy.