A Theoretical Analysis of Feature Pooling in Visual Recognition Y-Lan Boureau, Jean Ponce and Yann LeCun ICML 2010 Presented by Bo Chen

Outline 1. Max-pooling and average pooling 2. Success stories about max-pooling 3. Pooling binary features 4. Pooling continuous sparse codes 5. Discussions

Max-pooling and Average-pooling Max-pooling: take the maximum value in each block. Average-pooling: average all the values in each block. (Figure: a feature map before and after pooling.)
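To make the two operators concrete, here is a minimal NumPy sketch of block-wise pooling on a toy 4 x 4 feature map (the block size, the input values, and the name pool2d are illustrative, not from the paper):

import numpy as np

def pool2d(x, block=2, mode="max"):
    # Pool non-overlapping block x block regions of a 2-D feature map.
    h, w = x.shape
    x = x[:h - h % block, :w - w % block]          # drop any ragged border
    tiles = x.reshape(h // block, block, w // block, block)
    return tiles.max(axis=(1, 3)) if mode == "max" else tiles.mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, 2, "max"))    # [[ 5.  7.] [13. 15.]]
print(pool2d(fmap, 2, "mean"))   # [[ 2.5  4.5] [10.5 12.5]]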

Success Stories about Max-Pooling 1. Vector quantization + spatial pyramid model (Lazebnik et al., 2006). 2. Spatial pyramid pooling after sparse coding or hard quantization (Yang et al., 2009; Boureau et al., 2010). 3. Convolutional (deep) networks (LeCun et al., 1998; Ranzato et al., 2007; Lee et al., 2009).

Why Pooling? 1. In general terms, the objective of pooling is to transform the joint feature representation into a new, more usable one that preserves important information while discarding irrelevant detail, the crux of the matter being to determine what falls in which category. 2. Achieving invariance to changes in position or lighting conditions, robustness to clutter, and compactness of representation are all common goals of pooling.

Pooling Binary Features Notation: 1. If the unpooled data is a P × k matrix of 1-of-k codes taken at P locations, we extract a single P-dimensional column v of 0s and 1s, indicating the absence or presence of the feature at each location. 2. The vector v is reduced by a pooling operation to a single scalar f(v). 3. Max pooling: f_m(v) = max_i v_i. Average pooling: f_a(v) = (1/P) Σ_i v_i. 4. Given two classes C1 and C2, we examine the separation of the class-conditional distributions of the pooled feature. Model: consider two distributions; the larger the distance between their means (or the smaller their variances), the better their separation.
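As a small illustration of this notation, the following NumPy sketch builds a hypothetical P x k matrix of 1-of-k codes and pools one column v with both operators (the sizes and the column index are assumptions made for the example):

import numpy as np

rng = np.random.default_rng(0)
P, k = 100, 10
codes = np.eye(k)[rng.integers(0, k, size=P)]   # P x k matrix of 1-of-k codes

v = codes[:, 3]      # 0/1 indicator of one codeword at each of the P locations
f_max = v.max()      # max pooling: 1 iff the feature is active at least once
f_avg = v.mean()     # average pooling: fraction of locations where it is active
print(f_max, f_avg)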

Distribution Separability Average pooling: the sum over P i.i.d. Bernoulli variables of mean α follows a binomial distribution B(P, α), so E[f_a] = α and Var[f_a] = α(1 − α)/P. The expected value of f_a is independent of the sample size P, and the variance decreases like 1/P; therefore the separation ratio (distance between the class means over the standard deviation) increases monotonically like √P. Max pooling: f_m is a Bernoulli variable of mean 1 − (1 − α)^P and variance (1 − (1 − α)^P)(1 − α)^P. The separation of the class-conditional expectations of the max-pooled feature is ψ(P) = (1 − α2)^P − (1 − α1)^P (taking α1 > α2); viewed as a function of a continuous P, it first increases and then decreases, with its maximum at P_M = log(log(1 − α1)/log(1 − α2)) / log((1 − α2)/(1 − α1)). There exists a range of pooling cardinalities for which this distance is greater with max pooling than with average pooling if and only if P_M > 1. The standard deviation of the max-pooled feature also first increases and then decreases, reaching its maximum of 0.5 when the mean 1 − (1 − α)^P equals 0.5.
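These formulas can be checked with a quick Monte Carlo simulation under the i.i.d. Bernoulli model; the class activation means alpha1 and alpha2 and the cardinalities used below are illustrative choices, not values from the paper:

import numpy as np

rng = np.random.default_rng(0)
alpha1, alpha2, trials = 0.4, 0.2, 50000        # illustrative class activation means

for P in (1, 2, 3, 4, 8, 16):
    v1 = rng.random((trials, P)) < alpha1       # class-1 Bernoulli activations
    v2 = rng.random((trials, P)) < alpha2       # class-2 Bernoulli activations
    d_avg = abs(v1.mean(axis=1).mean() - v2.mean(axis=1).mean())   # ~ |alpha1 - alpha2|
    d_max = abs(v1.max(axis=1).mean() - v2.max(axis=1).mean())     # ~ (1-a2)^P - (1-a1)^P
    print(P, round(d_avg, 3), round(d_max, 3))
# d_avg stays near 0.2 for every P, while d_max rises above it, peaks near P_M
# (about 2.9 for these alphas), and then decays; the standard deviation of the
# average-pooled feature shrinks like 1/sqrt(P), which drives its separation ratio up.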

Empirical Experiments and Predictions 1. Max pooling is particularly well suited to the separation of features that are very sparse (i.e., have a very low probability of being active). 2. Using all available samples to perform the pooling may not be optimal. 3. The optimal pooling cardinality should increase with dictionary size.

Experiments on Optimal Pooling Cardinality Empirical: an empirical average of the max over different random subsamples of cardinality P. Expectation: the closed-form expectation of the max, 1 − (1 − α)^P, with α the empirical mean activation and P treated as a free parameter.
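A sketch of the two estimators for a single sparse binary feature, assuming NumPy; the activation probability, subsample count, and cardinalities are illustrative (subsamples are drawn with replacement for simplicity):

import numpy as np

rng = np.random.default_rng(0)
v = (rng.random(1024) < 0.03).astype(float)   # activations of one sparse binary feature

def empirical_max(v, P, n_subsamples=50):
    # Empirical estimator: average of the max over random subsamples of cardinality P.
    picks = rng.integers(0, len(v), size=(n_subsamples, P))
    return v[picks].max(axis=1).mean()

def expected_max(v, P):
    # Expectation estimator: closed form 1 - (1 - alpha)^P, with P a free parameter.
    alpha = v.mean()                          # empirical activation probability
    return 1.0 - (1.0 - alpha) ** P

for P in (16, 64, 256):
    print(P, round(empirical_max(v, P), 3), round(expected_max(v, P), 3))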

Pooling Cardinalities

Pooling Continuous Sparse Codes Two conclusions: 1. When no smoothing is performed, larger cardinalities provide a better signal-to-noise ratio. 2. This ratio grows more slowly than when the additional samples are simply used to smooth the estimate.

Transition from Average to Max Pooling 1. P-norm: f(v) = ((1/N) Σ_i v_i^P)^(1/P), which gives average pooling for P = 1 and tends to max pooling as P → ∞. 2. Softmax (log-sum-exp): f(v) = (1/β) log((1/N) Σ_i exp(β v_i)), which tends to average pooling as β → 0 and to max pooling as β → ∞. 3. Pooling over random subsamples of varying cardinality (as in the experiments above) also moves continuously from average pooling (cardinality 1) toward max pooling (full cardinality).
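A small sketch of the first two interpolations on an illustrative vector of non-negative codes (the vector and the parameter values are arbitrary):

import numpy as np

v = np.array([0.1, 0.0, 0.4, 0.2])   # illustrative non-negative codes

def p_norm_pool(v, p):
    # ((1/N) * sum_i v_i^p)^(1/p): average pooling at p = 1, max pooling as p -> infinity
    return np.mean(v ** p) ** (1.0 / p)

def softmax_pool(v, beta):
    # (1/beta) * log((1/N) * sum_i exp(beta*v_i)): average as beta -> 0, max as beta -> infinity
    return np.log(np.mean(np.exp(beta * v))) / beta

print([round(p_norm_pool(v, p), 3) for p in (1, 4, 32)])        # 0.175 -> ... -> about 0.4
print([round(softmax_pool(v, b), 3) for b in (0.1, 10, 200)])   # 0.176 -> ... -> about 0.4
print(v.mean(), v.max())                                        # the two endpoints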

Discussions 1. By carefully adjusting the pooling step of feature extraction, relatively simple combinations of local features and classifiers can become competitive with more complex ones. 2. In the binary case, max pooling may account for this good performance, and this pooling strategy is well adapted to features with a low probability of activation. Two ways to obtain a smoother estimate: (1) use directly the formula for the expectation of the maximum; (2) pool over smaller samples and take the average. 3. When using sparse coding, some limited improvement may be obtained by pooling over subsamples of smaller cardinality and averaging, and by searching for the optimal pooling cardinality, but this is not always the case.