
1 What is the Best Multi-Stage Architecture for Object Recognition? Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun. Presented by Lingbo Li, ECE, Duke University, Dec. 13th, 2010

2 Outline
- Introduction
- Model Architecture
- Training Protocol
- Experiments: Caltech 101, NORB, and MNIST datasets
- Conclusions

3 Introduction (I)
Feature extraction stages:
- A filter bank
- A non-linear operation
- A pooling operation
Recognition architectures:
- A single stage of features + supervised classifier: SIFT, HOG, etc.
- Two or more successive stages of feature extractors + supervised classifier: convolutional networks

4 Introduction (II)
Q1: How do the non-linearities that follow the filter banks influence the recognition accuracy?
Q2: Is there any advantage to using an architecture with two successive stages of feature extraction, rather than a single stage?
Q3: Does learning the filter banks in an unsupervised or supervised manner improve the performance over hard-wired filters, or even random filters?

5 Model Architecture (I)
Filter bank layer (F_CSG: convolution, tanh, gain): the input x is a stack of feature maps x_i and the output y is a stack of feature maps y_j; each output map is computed with trainable filters k_ij and a trainable gain g_j as
y_j = g_j * tanh(sum_i k_ij * x_i),
where * denotes 2-D convolution and y_j is the j-th feature map. Example: a filter bank layer with 64 filters of size 9x9.
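For concreteness, here is a minimal NumPy sketch of such a filter bank layer. The naive correlation loop, the function names, and the filter scaling are illustrative choices, not taken from the paper:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2-D correlation of one feature map x with one filter k
    (the filter flip of true convolution is immaterial for learned filters)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

def filter_bank_layer(x, filters, gains):
    """F_CSG-style layer: y_j = g_j * tanh(sum_i k_ij * x_i).

    x       : (n_in, H, W) input feature maps
    filters : (n_out, n_in, kh, kw) filter bank
    gains   : (n_out,) trainable gain per output map
    """
    n_out, _, kh, kw = filters.shape
    H, W = x.shape[1], x.shape[2]
    y = np.zeros((n_out, H - kh + 1, W - kw + 1))
    for j in range(n_out):
        acc = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(x.shape[0]):
            acc += conv2d_valid(x[i], filters[j, i])
        y[j] = gains[j] * np.tanh(acc)
    return y

# 64 filters of size 9x9, as on the slide, applied here to a small 32x32 input
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 32, 32))
k = rng.standard_normal((64, 1, 9, 9)) * 0.1
y = filter_bank_layer(x, k, gains=np.ones(64))
print(y.shape)  # (64, 24, 24): each 9x9 'valid' convolution shrinks 32 -> 24
```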

6 Model Architecture (II)
Subtractive normalization operation: each value has the Gaussian-weighted average of its local neighborhood (across all feature maps) subtracted from it,
v_ijk = x_ijk - sum_{ipq} w_pq * x_{i, j+p, k+q},
where w_pq is a normalized Gaussian weighting window.
Divisive normalization operation: each value is then divided by the local standard deviation,
y_ijk = v_ijk / max(c, sigma_jk), with sigma_jk = (sum_{ipq} w_pq * v_{i, j+p, k+q}^2)^(1/2)
(the Caltech-101 preprocessing later uses this N layer with c = 1).
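A sketch of the two normalization steps, assuming a 9x9 spatial Gaussian window shared across feature maps and a constant floor c in the divisive step (the exact window size, its normalization, and the denominator handling are assumptions):

```python
import numpy as np

def gaussian_window(size=9, sigma=2.0):
    """Spatial Gaussian weighting window, normalized to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def weighted_local_sum(x, w):
    """sum_{i,p,q} w[p,q] * x[i, j+p, k+q] at every (j, k), zero-padded borders.
    x: (n_maps, H, W); returns an (H, W) array."""
    n, H, W = x.shape
    kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((H, W))
    for p in range(kh):
        for q in range(kw):
            out += w[p, q] * xp[:, p:p + H, q:q + W].sum(axis=0)
    return out

def local_contrast_normalize(x, size=9, c=1.0):
    """Subtractive then divisive normalization of feature maps x: (n, H, W)."""
    w = gaussian_window(size) / x.shape[0]   # weights sum to 1 over i, p, q
    v = x - weighted_local_sum(x, w)         # subtractive normalization
    sigma = np.sqrt(weighted_local_sum(v ** 2, w))
    return v / np.maximum(c, sigma)          # divisive normalization, floored at c

# e.g. normalize a single gray-scale image treated as one feature map
img = np.random.default_rng(0).standard_normal((1, 96, 96))
print(local_contrast_normalize(img).shape)   # (1, 96, 96)
```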

7 Model Architecture (III)
An average pooling layer with 4x4 down-sampling: each output value is the average over a local window of the corresponding input map, and the maps are spatially down-sampled by 4 in each direction.
A max-pooling layer with 4x4 down-sampling: each output value is instead the maximum over the local window.
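A minimal sketch of both pooling types, assuming non-overlapping windows with stride equal to the window size (the paper's pooling windows may overlap):

```python
import numpy as np

def pool(x, size=4, mode="avg"):
    """Non-overlapping pooling over size x size windows (stride = size).
    x: (n_maps, H, W) with H and W divisible by `size`."""
    n, H, W = x.shape
    blocks = x.reshape(n, H // size, size, W // size, size)
    if mode == "avg":
        return blocks.mean(axis=(2, 4))   # boxcar average over each window
    return blocks.max(axis=(2, 4))        # maximum over each window

x = np.random.default_rng(0).standard_normal((64, 88, 88))
print(pool(x, 4, "avg").shape, pool(x, 4, "max").shape)  # (64, 22, 22) twice
```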

8 Model Architecture (IV)
Combining Modules into a Hierarchy: a feature extraction stage is built by chaining the modules above, e.g. F_CSG - P_A (filter bank + average pooling), F_CSG - R_abs - P_A (adding an absolute-value rectification), F_CSG - R_abs - N - P_A (adding local normalization), or F_CSG - P_M (filter bank + max pooling); one or two such stages are followed by a classifier.

9 Training Protocol (I)
Optimal sparse coding: reconstruct each input y as a linear combination of dictionary columns, y ≈ W_d z, using a code z with few non-zero components. Under the sparsity condition, this can be written as the optimization problem
L(y, z; W_d) = (1/2) ||y - W_d z||_2^2 + lambda ||z||_1.
Given training samples, learning proceeds by alternating two steps:
1) Minimize the loss function with respect to the dictionary W_d.
2) Find the optimal code z* for each sample by running a rather expensive optimization algorithm.
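The "rather expensive optimization" step can be illustrated with ISTA, one standard solver for this L1-regularized problem (the slide does not name the actual solver used); the dictionary update is shown as a plain gradient step with column renormalization:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(y, W, lam, n_iter=200):
    """Find z* ~ argmin_z 0.5*||y - W z||^2 + lam*||z||_1 by ISTA."""
    L = np.linalg.norm(W, 2) ** 2          # Lipschitz constant of the gradient
    z = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = W.T @ (W @ z - y)
        z = soft_threshold(z - grad / L, lam / L)
    return z

def dictionary_step(y, z, W, lr=0.01):
    """One gradient step on the reconstruction loss w.r.t. the dictionary W,
    followed by renormalizing its columns to unit norm."""
    W = W - lr * np.outer(W @ z - y, z)
    return W / np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-8)

rng = np.random.default_rng(0)
W = rng.standard_normal((81, 64))          # e.g. 9x9 patches, 64 dictionary atoms
W /= np.linalg.norm(W, axis=0, keepdims=True)
y = rng.standard_normal(81)
z = ista(y, W, lam=0.1)
W = dictionary_step(y, z, W)
print(np.count_nonzero(z), "non-zero code components")
```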

10 Training Protocol (II)
Predictive Sparse Decomposition (PSD): PSD trains a regressor F(y; P_f) to approximate the sparse solution z* for all training samples, where F(y; P_f) = G tanh(W_e y + b) with trainable gains G, filters W_e, and biases b. Learning proceeds by minimizing the loss function
L = (1/2) ||y - W_d z||_2^2 + lambda ||z||_1 + alpha ||z - F(y; P_f)||_2^2,
where z is the code and P_f = {G, W_e, b}. Thus W_d (dictionary) and W_e (filters) are simultaneously optimized; after training, a cheap feed-forward pass through F replaces the expensive per-sample optimization.
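A sketch of the PSD objective and its feed-forward predictor; the predictor form F(y) = G tanh(W_e y + b) and the weight alpha follow the PSD papers but should be read as assumptions here. Only the loss evaluation and the fast inference pass are shown, not the full alternating training loop:

```python
import numpy as np

def predictor(y, G, We, b):
    """Feed-forward regressor F(y) = G * tanh(We @ y + b) (element-wise gain G)."""
    return G * np.tanh(We @ y + b)

def psd_loss(y, z, Wd, G, We, b, lam=0.1, alpha=1.0):
    """PSD objective: reconstruction + L1 sparsity + prediction penalty."""
    recon = 0.5 * np.sum((y - Wd @ z) ** 2)
    sparsity = lam * np.sum(np.abs(z))
    pred = alpha * np.sum((z - predictor(y, G, We, b)) ** 2)
    return recon + sparsity + pred

# after training, inference is a single feed-forward pass instead of an
# iterative sparse-coding optimization:
rng = np.random.default_rng(0)
dim_in, dim_code = 81, 64
Wd = rng.standard_normal((dim_in, dim_code)) * 0.1   # dictionary
We = rng.standard_normal((dim_code, dim_in)) * 0.1   # encoder "filters"
G, b = np.ones(dim_code), np.zeros(dim_code)
y = rng.standard_normal(dim_in)
z_fast = predictor(y, G, We, b)                      # approximate sparse code
print(psd_loss(y, z_fast, Wd, G, We, b))
```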

11 Training Protocol (III)
A single letter (e.g. R or U) denotes an architecture with a single stage of feature extraction followed by a classifier; a double letter (e.g. RR or UU) denotes an architecture with two stages of feature extraction followed by a classifier.
- R: filters are set to random values and kept fixed; the classifier is trained in supervised mode.
- U: filters are trained with the unsupervised PSD algorithm and kept fixed; the classifier is trained in supervised mode.
- R+: filters are initialized with random values; the entire system (feature stages + classifier) is trained in supervised mode by gradient descent.
- U+: filters are initialized with the PSD unsupervised learning algorithm; the entire system (feature stages + classifier) is trained in supervised mode by gradient descent.

12 Experiments (I) – Caltech 101
Data pre-processing:
1) Convert to gray-scale and resize so that the longer side is 151 pixels;
2) Subtract the image mean and divide by the image standard deviation;
3) Apply subtractive/divisive normalization (N layer with c=1);
4) Zero-pad the shorter side to 143 pixels.
Recognition rates are averaged over 5 drawings of the training set (30 images per class). Hyper-parameters are selected to maximize performance on a validation set of 5 samples per class taken out of the training sets.
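A rough sketch of this preprocessing pipeline. The interplay between the resize, the N layer (which shrinks the valid region), and the final padding is not fully specified on the slide, so here the longer side is resized directly to 143 pixels and the local normalization step is omitted (it would be applied as in the earlier normalization sketch):

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_caltech(img, out_size=143):
    """Per-image preprocessing sketch: gray-scale, resize, global
    standardization, zero-pad the shorter side to a square canvas."""
    if img.ndim == 3:
        img = img.mean(axis=2)                       # 1) gray-scale
    scale = out_size / max(img.shape)
    img = zoom(img, scale)                           # 2) longer side -> 143 px
    img = (img - img.mean()) / (img.std() + 1e-8)    # 3) global standardization
    canvas = np.zeros((out_size, out_size))          # 4) zero-pad shorter side
    h, w = min(img.shape[0], out_size), min(img.shape[1], out_size)
    canvas[:h, :w] = img[:h, :w]
    return canvas

print(preprocess_caltech(np.random.rand(300, 200, 3)).shape)   # (143, 143)
```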

13 Experiments (I) – Caltech 101
Using a single stage of feature extraction: the 64 26x26 feature maps from stage 1 are fed either to a multinomial logistic regression classifier or to a PMK-SVM.
Using two stages of feature extraction: stage 1 produces 64 26x26 feature maps and stage 2 produces 256 4x4 feature maps, which are fed either to a multinomial logistic regression classifier or to a PMK-SVM.

14 Experiments (I) – Caltech 101

15 Random filters, with no filter learning whatsoever, can achieve decent performance;
Supervised fine-tuning improves the performance;
Two-stage systems are better than their single-stage counterparts;
With rectification and normalization, unsupervised training does not improve the performance;
abs rectification is a crucial component for good performance;
A single-stage system with PMK-SVM reaches the same performance as a two-stage system with logistic regression.

16 Experiments (II) – NORB Dataset
The NORB dataset has 5 object categories, with 24,300 training samples and 24,300 test samples (4,860 per class); each image is gray-scale with 96x96 pixels. Only a subset of the training protocols is considered.
1) Random filters do not perform as well as learned filters when more labeled samples are available.
2) The use of abs rectification and normalization makes a big difference.

17 Experiments (II) – NORB Dataset
Use gradient descent to find the optimal input patterns that maximize the response of units in a given architecture. In the figure: (1-a) random stage-1 filters; (1-b) corresponding optimal inputs; (2-a) PSD filters; (2-b) corresponding optimal input patterns; (3) subset of stage-2 filters after PSD and supervised refinement on Caltech-101.
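A toy version of this analysis: gradient ascent on a unit-norm input to maximize the response of a single tanh unit. The model, step size, and norm constraint are illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def optimal_input(grad, shape, n_steps=200, lr=0.1, seed=0):
    """Gradient ascent on the input (kept at unit norm) to maximize one unit's
    response -- the kind of analysis behind the 'optimal input patterns' figure."""
    x = np.random.default_rng(seed).standard_normal(shape)
    x /= np.linalg.norm(x)
    for _ in range(n_steps):
        x = x + lr * grad(x)
        x /= np.linalg.norm(x)          # project back onto the unit sphere
    return x

# toy stage: response = tanh(<k, x>) for a single random 9x9 filter k
k = 0.1 * np.random.default_rng(1).standard_normal((9, 9))
grad = lambda x: (1 - np.tanh(np.sum(k * x)) ** 2) * k   # d response / d x
x_star = optimal_input(grad, (9, 9))
print(np.corrcoef(x_star.ravel(), k.ravel())[0, 1])      # ~1: x* aligns with the filter
```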

18 Experiments (III) – MNIST Dataset
60,000 gray-scale 28x28 pixel images for training and 10,000 images for testing; two stages of feature extraction:
- Input image: 34x34.
- First stage: convolution with 50 7x7 filters -> 50 28x28 feature maps; max-pooling over 2x2 windows -> 50 14x14 feature maps.
- Second stage: convolution with 1024 5x5 filters -> 64 10x10 feature maps; max-pooling over 2x2 windows -> 64 5x5 feature maps.
- Classifier: 10-way multinomial logistic regression.
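The feature-map sizes on the slide follow from "valid" convolutions and non-overlapping 2x2 pooling, which the following few lines verify (this bookkeeping is an assumption about the layer arithmetic, not code from the paper):

```python
def valid_conv(size, k):      # 'valid' convolution output size
    return size - k + 1

def pooled(size, p):          # non-overlapping pooling output size
    return size // p

s = 34                        # input image size from the slide
s = valid_conv(s, 7)          # stage 1: 50 7x7 filters   -> 28
print("stage 1 conv:", s)
s = pooled(s, 2)              # 2x2 max-pooling           -> 14
print("stage 1 pool:", s)
s = valid_conv(s, 5)          # stage 2: 1024 5x5 filters -> 10
print("stage 2 conv:", s)
s = pooled(s, 2)              # 2x2 max-pooling           -> 5
print("stage 2 pool:", s)     # 64 * 5 * 5 features -> 10-way logistic regression
```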

19 Experiments (III) – MNIST Dataset
Parameters are trained with PSD; the only hyper-parameter is tuned with a validation set of 10,000 training samples. The classifier is randomly initialized, and the whole system is then fine-tuned in supervised mode. A test error rate of 0.53% was obtained.

20 Conclusions (I)
Q1: How do the non-linearities that follow the filter banks influence the recognition accuracy?
1) A rectifying non-linearity is the single most important factor.
2) A local normalization layer can also improve the performance.
Q2: Is there any advantage to using an architecture with two successive stages of feature extraction, rather than a single stage?
1) Two stages are better than one.
2) The performance of the two-stage systems is similar to that of the best single-stage systems based on SIFT and PMK-SVM.

21 Conclusions (II)
Q3: Does learning the filter banks in an unsupervised or supervised manner improve the performance over hard-wired filters or even random filters?
1) Random filters yield good performance only with small training sets.
2) The optimal input patterns for a randomly initialized stage are similar to the optimal inputs for a stage that uses learned filters.
3) Global supervised learning of the filters yields good recognition rates when the proper non-linearities are used.
4) Unsupervised pre-training followed by supervised refinement yields the best overall accuracy.

