Lecture 4a: Imagenet: Classification with Localization

Lecture 4a: Imagenet: Classification with Localization boris.ginsburg@gmail.com

Agenda
- ILSVRC competition
- Classification with Localization
- Overfeat: integrated classification, localization, and detection
- R-CNN (Regions with CNN)
- SPP-net (Spatial Pyramid Pooling)
- Fast R-CNN

ImageNet Database
ImageNet database: ~22K categories, ~14 mln labeled images (~700 images/class)

ILSVRC
http://www.image-net.org/challenges/LSVRC/2014/
Classification over 1000 categories:
- 1.2 million training images
- 50,000 validation images
- 150,000 testing images
Tasks:
- Classification: assign 5 guesses (labels) to each image
- Classification & Localization: 5 guesses, each a label + bounding box
- Detection: any number of objects in an image (including zero); false positives are penalized

ILSVRC: Classification top-5 labels http://www.image-net.org/challenges/LSVRC/2014/

ILSVRC: Classification & Localization top-5 labels + bounding box

ILSVRC 2014 vs PASCAL 2012
                          PASCAL 2012   ILSVRC 2013   ILSVRC 2014
# classes                 20            200           200
Training    # images      5,717         395,909       456,567
            # objects     13,609        345,854       478,807
Validation  # images      5,823         20,121        20,121
            # objects     13,841        55,502        55,502
Testing     # images      10,991        40,152        40,152
http://image-net.org/challenges/LSVRC/2014/ http://pascallin.ecs.soton.ac.uk/challenges/VOC/

ILSVRC: Classification These are two dogs from two distinct classes

ILSVRC: Classification Groundtruth: ????

ILSVRC: Classification Groundtruth: coffee mug

ILSVRC: Classification Groundtruth: coffee mug. Top-5: table lamp, lamp shade, printer, projector, desktop computer
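The top-5 rule used on these slides can be sketched in a few lines of Python. This is only an illustration: the score values below are made up to mirror the coffee-mug slide, and the function names are hypothetical.

```python
def top5_correct(scores, true_label, k=5):
    """Return True if true_label is among the k highest-scoring labels."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_label in ranked[:k]


def top5_accuracy(all_scores, true_labels, k=5):
    """Fraction of images whose ground-truth label is in the top-k guesses."""
    hits = sum(top5_correct(s, t, k) for s, t in zip(all_scores, true_labels))
    return hits / len(true_labels)


# Hypothetical class scores for the coffee-mug image (values are invented).
scores = {
    "table lamp": 0.30, "lamp shade": 0.25, "printer": 0.15,
    "projector": 0.12, "desktop computer": 0.10, "coffee mug": 0.08,
}
print(top5_correct(scores, "coffee mug"))  # the mug is ranked 6th, so a miss
```

Under this rule the image counts as an error even though the network's guesses are all plausible objects in the scene.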

AlexNet (2012)
- AlexNet: winner of ILSVRC 2012 with 85% top-5 accuracy
- 8 layers (5 conv. + 3 fully connected layers + dropout + soft-max)
- 650K neurons, 60 mln weights
- Trained on two GTX-580 GPUs with 3 GB memory each; training took 6 days
www.cs.toronto.edu/~fritz/absps/imagenet.pdf

AlexNet: Training
- SGD parameters: batch = 128 examples, momentum = 0.9, weight decay = 0.0005
- Weight initialization: a 0-mean Gaussian with std dev = 0.01
- Learning rate: initialized at 0.01 and adjusted manually throughout training (divided by 10 when the validation error rate stopped improving); the same for all layers
- Dropout for fully connected layers
- 90 epochs through the whole image dataset

GoogLeNet (2014)
- Winner of ILSVRC 2014 with ~93.3% top-5 accuracy (6.67% top-5 error)
- 22 layers (2 conv. + 9 inception + linear with dropout + softmax)
- 5 mln parameters
- Trained on Google DistBelief cluster, ~1 week on multi-GPU system (?)

Overfeat: Classification with localization

Overfeat: integrated classification, localization & detection
Overfeat (NYU): a convolutional network to simultaneously classify, locate and detect objects. Key ideas:
- apply a ConvNet at multiple locations in the image, in a sliding window, at multiple scales
- train the system to produce for each window:
  - a distribution over categories
  - a prediction of the location and size of the object's bounding box relative to the viewing window
- accumulate results for each category at each location and scale
http://cilvr.nyu.edu/doku.php?id=software:overfeat:start

Overfeat: topology summary
The first 5 layers are similar to AlexNet (conv. layer with ReLU and max pooling), with the following differences:
- no contrast normalization
- pooling regions are non-overlapping
- smaller stride to improve accuracy
Feature extraction: input 3 x [221x221] -> output of last convolutional layer 1024 x [5x5]; down-sampling 36:1

Overfeat: classification
Take an image and apply a [221x221] window. The output of the last conv. layer is 1024 features x [5x5]; FC layers + log-loss give scores for each class. The feature extractor has multiple poolings and strides with a total scale ratio of 36:1, so if we slide the input window with step 36, the feature window slides with step 1.
Diagram: input window [221x221] -> features 1024 x [5x5] -> 1000 class scores

Overfeat: classification
Two adjacent windows share many computations, so we compute features for all window locations simultaneously.
- Feature extraction: compute the first 5 layers for the whole image (this corresponds to 12:1 "subsampling")
- Classifier: the classifier has a fixed-size 5x5 input, which is applied to layer 5. We shift the classifier's input window by 1 pixel through the pooling layer, without subsampling.
In the end we have [MxN] x C scores, where M, N are the sliding window indices and C is the number of classes.

Input      Layer 5 before pooling   After pool 3x3    Classifier map
245x245    17x17                    [3x3] x [5x5]     [3x3] x C
281x317    20x23                    [6x9] x [5x5]     [6x9] x C
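The table rows can be reproduced with a short calculation, sketched below under one stated assumption: the 3x3 non-overlapping pooling is repeated at pixel offsets 0, 1 and 2 along each axis, and the 5x5 classifier then slides with stride 1 over each pooled map. The function name is hypothetical.

```python
def classifier_map_size(pre_pool, pool=3, clf=5):
    """Number of classifier outputs along one axis of the layer-5 map.

    pre_pool: layer-5 size before pooling (e.g. 17, 20 or 23 from the table).
    """
    total = 0
    for offset in range(pool):                 # shift the pooling grid by 0, 1, 2 pixels
        pooled = (pre_pool - offset) // pool   # non-overlapping pool x pool max pooling
        total += max(pooled - clf + 1, 0)      # clf x clf classifier sliding with stride 1
    return total


for h, w in [(17, 17), (20, 23)]:
    print((h, w), "->", (classifier_map_size(h), classifier_map_size(w)))
```

For the two table rows this gives 3x3 and 6x9 classifier maps, matching the last column.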

Overfeat: data augmentation and scaling
- To locate objects of different sizes, we rescale the image to 6 scales (the typical ratio from one scale to the next is ~1.4)
- Horizontal flipping
Final post-processing:
- for each class, take the local spatial max over the different locations and scales
- take the top-5 classes

Overfeat: Boosting
Train 7 different models with different initial weights, and select the best result

Overfeat: Training parameters
Data augmentation:
- Each image is down-sampled so that the smallest dimension is 256 pixels
- Then 5 random crops (and their horizontal flips) of size 221x221 pixels are extracted
Training:
- weight initialization: random with (µ, σ) = (0, 1×10^-2)
- SGD with momentum = 0.6
- weight decay = 1×10^-5
- learning rate = 0.05, decreased by a factor of ½ after 30, 50, 60, 70, 80 epochs
- dropout
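The step schedule above is easy to express as a function of the epoch index. The sketch below assumes 0-based epochs and that "after 30 epochs" means the halved rate applies from epoch index 30 onward; the function name is hypothetical.

```python
def overfeat_lr(epoch, base_lr=0.05, milestones=(30, 50, 60, 70, 80), factor=0.5):
    """Learning rate at a given epoch (0-based) for the step schedule on the slide."""
    drops = sum(1 for m in milestones if epoch >= m)  # milestones already passed
    return base_lr * factor ** drops


for e in (0, 29, 30, 55, 85):
    print(e, overfeat_lr(e))
```

After all five milestones the rate is 0.05 / 32, roughly 0.0016.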

Overfeat: Localization
- Start from the classification-trained network
- Fix the feature extraction layers (1-5) and replace the classifier layers with a regression network
- The regression net takes the features from layer 5 as input and has:
  - 2 fully-connected hidden layers of size 4096 and 1024 channels
  - an output layer with 4 units for each class: the coordinates of the bounding box edges
- Train the regression net on the same set of scales as in multi-scale classification
- Use the ℓ2 loss between the predicted and true bounding box for each example
- Compare the prediction of the regressor at each spatial location with the ground-truth bounding box, shifted into the frame of reference of the regressor's window offset

Overfeat: Regression Net Topology
Input 3x231x231
conv: 11×11 stride 4×4; ReLU; maxpool: 2×2 stride 2×2; output: 96x24x24
conv: 5×5 stride 1×1; ReLU; maxpool: 2×2 stride 2×2; output: 256x12x12
conv: 3×3 stride 1×1 0-padded; ReLU; output: 512x12x12
conv: 3×3 stride 1×1 0-padded; ReLU; output: 1024x12x12
conv: 3×3 stride 1×1 0-padded; ReLU; maxpool: 2×2 stride 2×2; output: 1024x6x6
full; ReLU; output: 4096x1x1
full; output: 1024x1x1
Regressor: output: 200 x 4

Overfeat: localization pipeline 1. The raw classifier/detector outputs a class confidence for each window location (multi-scale)

Overfeat: localization pipeline 2. The regression predicts the location of the object with respect to each window (multi-scale)

Overfeat: localization pipeline 3. Box merging for the top-5 classes

Overfeat: Localization pipeline
- Choose the top-5 classes Cs for each scale s ∈ 1…6 by taking the maximum detection class outputs across spatial locations.
- Initial set of bounding boxes: B ← ∪s Bs, where Bs is the set of bounding boxes predicted by the regressor net for each class in Cs, across all spatial locations at scale s.
- Repeat merging of boxes from B until done:
  - (b1, b2) = argmin over b1 ≠ b2 ∈ B of match_score(b1, b2), where match_score is the sum of the distance between the centers of the two bounding boxes and the intersection area of the boxes.
  - If match_score(b1, b2) < t, set B ← B \ {b1, b2} ∪ box_merge(b1, b2), where box_merge computes the average of the bounding boxes' coordinates; otherwise stop.
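The greedy merging loop can be sketched as follows. This is a simplified illustration, not the Overfeat implementation: boxes are assumed to be (x1, y1, x2, y2) tuples, and match_score here uses only the distance between box centers, leaving out the intersection-area term mentioned on the slide. The threshold t is a free parameter.

```python
import math
from itertools import combinations


def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def match_score(b1, b2):
    # ASSUMPTION: center distance only; the slide's score also has an
    # intersection-area term, which is omitted in this sketch.
    (cx1, cy1), (cx2, cy2) = center(b1), center(b2)
    return math.hypot(cx1 - cx2, cy1 - cy2)


def box_merge(b1, b2):
    """Average the coordinates of two boxes."""
    return tuple((a + b) / 2.0 for a, b in zip(b1, b2))


def merge_boxes(boxes, t):
    """Repeatedly merge the best-matching pair until no pair scores below t."""
    boxes = list(boxes)
    while len(boxes) > 1:
        i, j = min(combinations(range(len(boxes)), 2),
                   key=lambda p: match_score(boxes[p[0]], boxes[p[1]]))
        if match_score(boxes[i], boxes[j]) >= t:
            break  # the closest pair is already too far apart: done
        merged = box_merge(boxes[i], boxes[j])
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)] + [merged]
    return boxes


boxes = [(0, 0, 10, 10), (2, 2, 12, 12), (100, 100, 110, 110)]
print(merge_boxes(boxes, t=5.0))
```

In this toy input the two overlapping boxes collapse into one averaged box, while the distant box is left untouched.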

R-CNN: Regions with CNN

R-CNN: Regions with CNN features
The Regions with CNN detection approach:
- generates ~2000 category-independent regions for the input image
- extracts a fixed-length feature vector from each region using a CNN
- classifies each region with category-specific linear SVMs
R-CNN outperforms OverFeat on ILSVRC 2013 detection, with mAP = 31.4% vs 24.3%.
R. Girshick et al., “Rich feature hierarchies…” http://www.cs.berkeley.edu/~rbg/slides/rcnn-cvpr14-slides.pdf

R-CNN pipeline
- Region detection -> 2000 regions, see http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-124.pdf
- Feature extraction with an ImageNet-trained CNN:
  - each region is cropped and scaled to [227 x 227]
  - 5 conv. layers + 2 FC -> 4096 features
- SVM for 200 classes
- Greedy non-maximum suppression for each class: reject a region if it has a big overlap with a region that has a higher score
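The greedy non-maximum suppression step can be sketched as below. The box format (x1, y1, x2, y2) and the overlap threshold of 0.3 are assumptions made for illustration; the slide only says "big overlap".

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.3):
    """Return indices of kept boxes for one class, highest score first.

    iou_threshold is a hypothetical value chosen for this example.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a region only if it does not overlap a higher-scoring kept region.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

The second box overlaps the first heavily (IoU of about 0.68) and has a lower score, so it is rejected.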

R-CNN Training
The key idea is to train the feature-extraction CNN on a large auxiliary dataset (ILSVRC classification, 1000 classes), followed by domain-specific fine-tuning on a small dataset (200 classes):
- Pre-training: train the CNN on ImageNet for classification
- Replace the last layer with an FC layer with N+1 outputs: N classes + 1 "background"
- Labels: if a region has IoU > ½ with a ground-truth box, it is a positive for that box's class; otherwise it is background
- Regular SGD-based training; each batch of 128 regions = 32 positive + 96 background
- Random weight initialization for the new layer
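A minimal sketch of the labeling rule and batch composition described above, assuming (x1, y1, x2, y2) boxes, integer class labels with 0 reserved for background, and hypothetical function names:

```python
import random


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_regions(regions, gt_boxes, gt_labels, thresh=0.5):
    """Give each region the class of its best-overlapping GT box, or 0 (background)."""
    labels = []
    for region in regions:
        best_iou, best_label = 0.0, 0
        for gt, lab in zip(gt_boxes, gt_labels):
            overlap = iou(region, gt)
            if overlap > best_iou:
                best_iou, best_label = overlap, lab
        labels.append(best_label if best_iou > thresh else 0)
    return labels


def sample_batch(labels, n_pos=32, n_bg=96, rng=random):
    """Pick up to 32 positive and 96 background region indices for one batch."""
    pos = [i for i, lab in enumerate(labels) if lab > 0]
    bg = [i for i, lab in enumerate(labels) if lab == 0]
    return rng.sample(pos, min(n_pos, len(pos))) + rng.sample(bg, min(n_bg, len(bg)))


regions = [(0, 0, 10, 10), (0, 0, 9, 9), (50, 50, 60, 60)]
labels = label_regions(regions, gt_boxes=[(0, 0, 10, 10)], gt_labels=[7])
print(labels, sample_batch(labels))
```

With only three toy regions the batch is smaller than 128; on a real image with ~2000 proposals the 32 + 96 split would be filled.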

R-CNN: ILSVRC 2013 detection performance

R-CNN drawbacks
- Training is expensive in time, because it is a multi-stage pipeline:
  - fine-tune a ConvNet for detection using cross-entropy loss
  - train linear SVMs on ConvNet features computed on warped object proposals
  - learn bounding-box regressors
- Training is expensive in storage: for SVM and regressor training, features are extracted from each warped object proposal in each image and written to disk (hundreds of GB)
- Test-time detection is slow

R-CNN speed
R-CNN detection is very slow: 12.5 sec/image (time to analyze 1 frame)

CNN with Spatial Pyramid pooling

SPP-net = Spatial Pyramid Pooling + CNN
A classical conv. net takes a fixed-size (e.g. 224x224) input image:
- cropping or warping is needed to transform the original image to a square shape
- this constraint is related to the fully-connected layers ONLY
Idea: use Spatial Pyramid Pooling to transform an image of any shape into a "fixed-length" feature vector.
Kaiming He et al., “Spatial Pyramid Pooling in Deep Convolutional Networks …” http://research.microsoft.com/en-us/um/people/kahe/

SPP-Net (diagram of the layer stack, forward direction): Data Layer [28x28] -> Convolutional layer [5x5] -> Pooling [2x2, stride 2] -> Convolutional layer [5x5] -> Pooling [2x2, stride 2] -> SPP (5x5 + 7x7 + 13x13) -> Inner Product -> ReLU -> Inner Product -> Soft Max

Spatial Pyramid Pooling
Example of an SPP layer:
- The input feature maps have size [13x13]
- 3 pooling layers, which have fixed output sizes: [4x4], [2x2], and [1x1]
- The size of the pooling window and the stride for each pooling layer depend on the input feature map size
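The window and stride for each pyramid level can be computed from the feature map size. The sketch below uses the rule from the SPP-net paper (window = ceil(a/n), stride = floor(a/n) for an a x a map and n x n bins); the channel count of 256 in the usage line is an assumption for illustration. The rule yields exactly n outputs for the 13x13 example, but that is not guaranteed for every map size.

```python
import math


def spp_level(a, n):
    """Pooling window, stride and resulting output count for an a x a map and n x n bins."""
    win = math.ceil(a / n)
    stride = a // n
    n_out = (a - win) // stride + 1   # outputs actually produced along one axis
    return win, stride, n_out


def spp_length(channels, levels=(4, 2, 1)):
    """Length of the fixed-size vector produced by concatenating all levels."""
    return channels * sum(n * n for n in levels)


for n in (4, 2, 1):
    print(n, spp_level(13, n))
print(spp_length(256))  # 256 channels is a hypothetical value
```

For the 13x13 map this gives windows of 4, 7 and 13 with strides of 3, 6 and 13, and a vector of 21 values per channel regardless of the input size.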

SPP-net training
Network: based on Overfeat
Training:
- Data augmentation: horizontal flipping and color altering
- Dropout on the 2 last FC layers
- Initial learning rate = 0.01; divide by 10 when the error plateaus

SPP-net: Detection pipeline
- find 2000 region candidates (like R-CNN)
- extract the feature maps from the entire image
- SPP maps each window of the features, corresponding to a region candidate, to a fixed-length representation
- classify with an SVM on the features from 2 FC layers

SPP-net vs R-CNN
Detection is ~100x faster than R-CNN, but training is very slow:
- Multi-stage training (fine-tuning of last layers, SVM and regressors)
- Still needs a lot of disk space to save feature vectors

Fast R-CNN

Fast-RCNN
Fast R-CNN is similar to SPP-net, but it trains both the classification and regression networks together, using a multi-task loss in a single training stage:
- Unlike SPP-net, all network layers can be updated during fine-tuning
- The multi-task loss simplifies learning and improves detection accuracy
Currently (as of 2015) the best and the fastest detection algorithm: 3x faster training than SPP-net, detection takes only 0.3 s, best performance
R. Girshick, Fast R-CNN: http://arxiv.org/abs/1504.08083 code: https://github.com/rbgirshick/fast-rcnn

Fast-RCNN: architecture
- Start from a CNN trained for ImageNet classification
- Add a Region-of-Interest (RoI) pooling layer after the last convolutional layer
- Replace the FC layers with two "sibling" nets:
  - Softmax: estimates probability over K+1 classes (K object classes + background)
  - Regression layer, which predicts a box (x, y, w, h) for each of the K classes

Fast-RCNN: architecture
- Each RoI is pooled into a fixed-size vector
- 2 FC layers -> RoI feature vector
- Multi-task loss: softmax classifier + per-class box regressor

Fast-RCNN: RoI pooling layer
Input:
- feature maps from the last convolutional layer [C x H x W]
- list of regions of interest. Each RoI is a tuple (n, r, c, h, w), where n is the image index (or scale), (r, c) is the region's top-left location, and (h, w) are its height and width
Output: for each region, max-pooled feature maps [C x H’ x W’] (H’ ≤ H and W’ ≤ W). The pooling area ("bin") size is ~[h/H’, w/W’] (SPP with one level)
RoI pooling layer back-propagation: the gradient of each pooled output is routed to the input element that was the max in its bin, summed over all RoIs that share that element
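RoI max pooling for a single channel can be sketched in plain Python. The bin boundaries below (floor for the start, ceil for the end) are one common choice and an assumption of this sketch; the RoI is given as (r, c, h, w) in feature-map cells and the image index n is dropped.

```python
import math


def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map (list of lists) to out_h x out_w.

    roi = (r, c, h, w): top-left row/column and height/width in feature-map cells.
    """
    r, c, h, w = roi
    pooled = []
    for i in range(out_h):
        r0 = r + (i * h) // out_h                 # first row of bin i
        r1 = r + math.ceil((i + 1) * h / out_h)   # one past the last row of bin i
        row = []
        for j in range(out_w):
            c0 = c + (j * w) // out_w
            c1 = c + math.ceil((j + 1) * w / out_w)
            row.append(max(fmap[y][x] for y in range(r0, r1) for x in range(c0, c1)))
        pooled.append(row)
    return pooled


# 6x6 toy feature map with values 0..35; pool a 4x4 RoI into 2x2 bins.
fmap = [[row * 6 + col for col in range(6)] for row in range(6)]
print(roi_max_pool(fmap, roi=(1, 1, 4, 4), out_h=2, out_w=2))
```

Each bin here covers about [h/H', w/W'] = [2, 2] cells, as stated on the slide; a full layer would apply the same pooling to every channel.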

Fast-RCNN: Multi-task Loss
For each RoI there are 2 outputs:
- Softmax: estimates probability p over K+1 classes (K object classes + background)
- Regression layer: predicts box b = (x, y, w, h) for each class
Multi-task loss: L(p, k*, b, b*) = Lcls(p, k*) + λ δ(k*) Lloc(b, b*)
- k* is the true label; δ(k*) = 1 if k* is an object class and 0 if k* is background (k* = 0)
- Lcls is the regular log-loss: Lcls(p, k*) = -log(p_k*)
- Lloc is the smooth L1 loss between b and the true box b* = (x*, y*, w*, h*): Lloc(b, b*) = [x - x*] + [y - y*] + [w - w*] + [h - h*], where [t] is the smooth L1 function: [t] = 0.5 t^2 if |t| < 1, and |t| - 0.5 otherwise
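The loss for a single RoI can be written out directly. The sketch below uses the standard smooth L1 function from the Fast R-CNN paper (quadratic for |t| < 1, linear otherwise) and treats label 0 as background; λ = 1 is an assumed default, and the function names are hypothetical.

```python
import math


def smooth_l1(t):
    """0.5 * t^2 for |t| < 1, |t| - 0.5 otherwise."""
    a = abs(t)
    return 0.5 * t * t if a < 1.0 else a - 0.5


def multitask_loss(p, k_true, b, b_true, lam=1.0):
    """Classification log-loss plus, for non-background RoIs, the box regression loss.

    p: class probabilities (index 0 = background); b, b_true: (x, y, w, h) tuples.
    """
    l_cls = -math.log(p[k_true])
    if k_true == 0:          # background: no box to regress, so no localization term
        return l_cls
    l_loc = sum(smooth_l1(x - y) for x, y in zip(b, b_true))
    return l_cls + lam * l_loc


print(multitask_loss([0.1, 0.9], 1, (0.0, 0.0, 1.0, 1.0), (0.5, 0.0, 1.0, 3.0)))
```

The linear tail makes the box term less sensitive to outliers than an L2 loss, which is the motivation given in the Fast R-CNN paper.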

Fast-RCNN: One pass training
Training combines fine-tuning of the convolutional layers with training of the classifier and box regressors:
- Each batch consists of 2 images x 64 RoIs per image:
  - 16 "true" RoIs (IoU > ½ with a ground truth box)
  - 48 "background" RoIs (IoU < ½), labeled as k = 0
- Data augmentation: each image is horizontally flipped with prob = ½
- Option: multi-scale image pyramid
- SGD: momentum = 0.9, weight decay = 0.0005
- weight initialization for FC layers: Gaussian (σ = 0.01, 0.001)
- initial global lr = 0.001; after 30,000 iterations lr = 0.0001

Fast R-CNN: detection time optimization
For classification, FC layers usually take a small amount of time compared to convolutional layers. For detection, FC layers become compute-heavy, since they have to be evaluated for each RoI. We can compress a single fully-connected layer W using a truncated SVD, W ≈ U Σ V^T, into two fully-connected layers without a non-linearity between them: with y = W x, the first FC layer applies Σ V^T and the second applies U.
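A minimal NumPy sketch of this compression, assuming the convention y = W x with W of shape (outputs, inputs) and no bias terms; the matrix sizes and the rank t are arbitrary values picked for the example.

```python
import numpy as np


def compress_fc(W, t):
    """Split y = W @ x into two smaller layers using the top-t singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    first = s[:t, None] * Vt[:t, :]   # (t, inputs): Sigma_t V^T, applied first
    second = U[:, :t]                 # (outputs, t): U, applied second
    return first, second


rng = np.random.default_rng(0)
# Toy weight matrix of exact rank 3, so t = 3 reproduces it up to rounding.
W = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 12))
first, second = compress_fc(W, t=3)
x = rng.standard_normal(12)
print(np.allclose(second @ (first @ x), W @ x))
```

The two layers hold t * (inputs + outputs) weights instead of inputs * outputs, so the saving grows as t gets small relative to the layer size; for a real layer the truncation is approximate rather than exact.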

BACKUP

ILSVRC: Localization

Overfeat: classification

Overfeat: "fast" net topology
Input 3x231x231
conv: 11×11 stride 4×4; ReLU; maxpool: 2×2 stride 2×2; output: 96x24x24
conv: 5×5 stride 1×1; ReLU; maxpool: 2×2 stride 2×2; output: 256x12x12
conv: 3×3 stride 1×1 0-padded; ReLU; output: 512x12x12
conv: 3×3 stride 1×1 0-padded; ReLU; output: 1024x12x12
conv: 3×3 stride 1×1 0-padded; ReLU; maxpool: 2×2 stride 2×2; output: 1024x6x6
conv: 6×6 stride 1×1; ReLU; output: 3072x1x1
full; ReLU; output: 4096x1x1
full; output: 1000x1x1
softmax; output: 1000x1x1

Single-class Regression vs Per-Class Regression
Using a different top layer for each class in the regressor network (Per-Class Regression, PCR) surprisingly did not outperform using a single network shared among all classes: 44.1% vs. 31.3% localization error.

Overfeat: Detection
The detection task differs from localization in that there can be any number of objects in each image (including zero), and false positives are penalized by the mean average precision (mAP) measure. The main difference from the localization task is the need to predict a background class when no object is present. Traditionally, negative examples are initially taken at random for training; then the most offending negative errors are added to the training set in bootstrapping passes.

R-CNN: PASCAL VOC performance 2012 SIFT, HOG,…

R-CNN: PASCAL VOC performance 2014: Regions with CNN

R-CNN CODE https://github.com/rbgirshick/rcnn Requires Matlab!

CNN regression
Szegedy et al. (Google), 2013, “Deep Neural Networks for Object Detection”:
- start with AlexNet
- replace the last soft-max layer with a regression layer that generates a binary mask of size “d x d”: 1 if the pixel is inside the box, 0 otherwise
- train the net by minimizing the L2 error vs the ground truth mask m:

CNN regression Multi-scale

CNN regression
Issues:
- Issue 1: overlapping masks for multiple touching objects
- Localization accuracy
- Recognition of small objects
Issue 1: To deal with multiple touching objects, we generate not one but several masks, each representing either the full object or part of it. We use one network to predict the object box mask and four additional networks to predict four halves of the box: bottom, top, left and right halves.