
1 What is the Best Multi-Stage Architecture for Object Recognition?
Ruiwen Wu
[1] Jarrett, Kevin, et al. "What is the best multi-stage architecture for object recognition?" 2009 IEEE 12th International Conference on Computer Vision (ICCV). IEEE, 2009. (Cited by 396 as of 2014-11-12.)

2 Outline
Usual architecture of neural networks
Each part of the network
Unsupervised learning concepts
Experiments
Contribution of this paper

3

4

5

6 Deep Learning Methods
Deep learning methods aim at learning feature hierarchies, with features from higher levels of the hierarchy formed by the composition of lower-level features [2]:
Neural networks with many hidden layers
Graphical models with many levels of hidden layers
Other methods
[2] Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. Why Does Unsupervised Pre-training Help Deep Discriminant Learning?

7 Usual architecture of neural networks
Non-linear operation: quantization, winner-take-all, sparsification, normalization, S-function
Pooling operation: max, average, or histogramming operator
Classifier: neural network (NN), k-nearest neighbor (KNN), support vector machine (SVM), logistic regression (LR)

8 Questions to address
This paper addresses three questions:
How do the non-linearities that follow the filter banks influence recognition accuracy?
Does learning the filter banks in an unsupervised or supervised manner improve performance over random or hardwired filters?
Is there any advantage to using an architecture with two stages of feature extraction rather than one?

9 Experiments
To address these three questions, the authors experimented with various combinations of architectures:
One stage or two stages of feature extraction
Different types of non-linearities
Different types of filters
Different filter learning methods (random, unsupervised, and supervised)
Test datasets: Caltech-101, the NORB object dataset, and MNIST

10 Model Architecture
Filter bank layer (F_CSG)
Local contrast normalization layer (N)
Pooling and subsampling layer (P_A or P_M)

11 Filter Bank Layer (F_CSG)
The module computes y_i = g_i tanh( sum_j k_ij * x_j ), where * is the convolution operator, tanh is the hyperbolic tangent non-linearity, and g_i is a trainable scalar gain coefficient.
Output size: if each input map is n1 x n2 and each kernel is l1 x l2, then each output map y_i is (n1 - l1 + 1) x (n2 - l2 + 1).
The kernels can be either trained with supervision or pre-trained without supervision.
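
To make the module concrete, here is a small NumPy/SciPy sketch of the F_CSG computation. It is not the authors' code: the map count, kernel size, and the helper name f_csg are illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def f_csg(maps, kernels, gains):
    """Filter bank layer: y_i = g_i * tanh(sum_j k_ij * x_j).

    maps:    list of input feature maps, each n1 x n2
    kernels: kernels[i][j] is the l1 x l2 kernel connecting input map j to output map i
    gains:   one trainable scalar g_i per output map
    """
    outputs = []
    for k_i, g_i in zip(kernels, gains):
        # 'valid' convolution, so each output map is (n1-l1+1) x (n2-l2+1)
        acc = sum(convolve2d(x_j, k_ij, mode="valid") for x_j, k_ij in zip(maps, k_i))
        outputs.append(g_i * np.tanh(acc))
    return outputs

# illustrative sizes: 3 input maps of 32x32, 4 output maps, 9x9 kernels
rng = np.random.default_rng(0)
maps = [rng.standard_normal((32, 32)) for _ in range(3)]
kernels = rng.standard_normal((4, 3, 9, 9))
y = f_csg(maps, kernels, np.ones(4))
print(len(y), y[0].shape)  # 4 (24, 24)
```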

12 Local Contrast Normalization Layer (N)
Subtractive step: v_ijk = x_ijk - sum_{ipq} w_pq x_{i,j+p,k+q}, where w_pq is a Gaussian weighting window normalized to sum to 1.
Divisive step: y_ijk = v_ijk / max(c, sigma_jk), with sigma_jk = ( sum_{ipq} w_pq v_{i,j+p,k+q}^2 )^(1/2) and c set to the mean of sigma_jk over the map.
(I do not quite understand this part.)

13 Local Contrast Normalization Layer (N)
The result of this module: it looks as though the module is doing edge extraction, since smooth regions are suppressed and high-contrast transitions are kept.
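
A rough single-map sketch of this normalization in NumPy/SciPy (not the authors' code). The Gaussian width sigma and the single-map simplification are my assumptions; the paper applies the weighting window across neighboring feature maps as well.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_norm(x, sigma=2.0):
    """Subtractive then divisive normalization of one feature map."""
    # subtractive step: remove the local Gaussian-weighted mean
    v = x - gaussian_filter(x, sigma)
    # local standard deviation sigma_jk = sqrt(sum_pq w_pq * v^2)
    local_std = np.sqrt(gaussian_filter(v ** 2, sigma))
    # divisive step: divide by max(c, sigma_jk), with c = mean(sigma_jk),
    # so the division only kicks in where local contrast is above average
    c = local_std.mean()
    return v / np.maximum(c, local_std)

# usage: normalize a random 24x24 map
rng = np.random.default_rng(0)
print(local_contrast_norm(rng.standard_normal((24, 24))).std())
```

Because smooth regions end up near zero while high-contrast transitions survive the division, the output does look edge-like, which matches the impression above.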

14 Pooling and Subsampling Layer (P_A or P_M)
For each small neighborhood: y_ijk = sum_{pq} w_pq x_{i,j+p,k+q}
where w_pq is either a uniform weighting window (average pooling, P_A) or a max operator over the window (max pooling, P_M).
Each output feature map is then subsampled spatially by a factor S, horizontally and vertically.
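
A small sketch of the two pooling variants in NumPy/SciPy (not the authors' code); the 4x4 window and the subsampling factor S = 4 are illustrative choices.

```python
import numpy as np
from scipy.ndimage import maximum_filter, uniform_filter

def pool_and_subsample(x, size=4, stride=4, mode="average"):
    """Pool each small neighborhood, then subsample by `stride` in both directions."""
    if mode == "average":            # P_A: uniform weighting window
        pooled = uniform_filter(x, size)
    else:                            # P_M: max over the window
        pooled = maximum_filter(x, size)
    return pooled[::stride, ::stride]

# usage: a 24x24 map pooled over 4x4 windows and subsampled by S = 4 -> 6x6
rng = np.random.default_rng(0)
print(pool_and_subsample(rng.standard_normal((24, 24))).shape)  # (6, 6)
```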

15 Combining Modules
Three types of architectures can be built from these modules:
F_CSG - P_A
F_CSG - N - P_A
F_CSG - P_M
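
As an end-to-end illustration, the following chains the three modules into the F_CSG - N - P_A variant on a single map (NumPy/SciPy, with made-up sizes and one random filter, not the paper's configuration):

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import gaussian_filter, uniform_filter

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))                    # one input map
k = rng.standard_normal((9, 9))                      # one random 9x9 filter

y = 1.0 * np.tanh(convolve2d(x, k, mode="valid"))    # F_CSG: conv + tanh + gain, 24x24
v = y - gaussian_filter(y, 2.0)                      # N: subtractive normalization
s = np.sqrt(gaussian_filter(v ** 2, 2.0))
y = v / np.maximum(s.mean(), s)                      #    divisive normalization
y = uniform_filter(y, 4)[::4, ::4]                   # P_A: average pool + subsample, 6x6
print(y.shape)
```

Dropping the N lines gives F_CSG - P_A, and replacing uniform_filter with maximum_filter gives F_CSG - P_M.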

16 Training Protocol
Random features and supervised classifier - R and RR
Unsupervised features, supervised classifier - U and UU
Random features, global supervised refinement - R+ and R+R+
Unsupervised features, global supervised refinement - U+ and U+U+
(A single letter denotes a one-stage system, a doubled letter a two-stage system.)

17 Unsupervised Training of Filter Banks
For a given input X and a matrix W whose columns are the dictionary elements, the feature vector Z* is obtained by minimizing the energy function
E(X, Z) = ||X - WZ||_2^2 + lambda ||Z||_1
where lambda is a sparsity hyper-parameter. For any input X, one then needs to run a rather expensive optimization algorithm to find Z*; to alleviate this problem, the Predictive Sparse Decomposition (PSD) method is used.
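
For concreteness, here is one standard way to minimize this energy, ISTA, sketched in NumPy. The paper does not prescribe this particular optimizer, so treat the function sparse_code_ista and its settings as illustrative.

```python
import numpy as np

def soft_threshold(u, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def sparse_code_ista(X, W, lam=0.1, n_iter=200):
    """Minimize ||X - W Z||_2^2 + lam * ||Z||_1 over Z by iterative shrinkage (ISTA)."""
    L = 2.0 * np.linalg.norm(W, 2) ** 2        # Lipschitz constant of the smooth term
    Z = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * W.T @ (W @ Z - X)         # gradient of the reconstruction term
        Z = soft_threshold(Z - grad / L, lam / L)
    return Z

# usage: a random 64-element dictionary for a 16-dimensional input
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))
X = rng.standard_normal(16)
print(np.count_nonzero(sparse_code_ista(X, W)), "active dictionary elements")
```

Running an iterative solver like this for every input is what makes exact inference expensive.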

18 Predictive Sparse Decomposition (PSD) [3]
PSD adds a trainable feed-forward predictor that approximates the optimal sparse code:
F(X; g, S, D) = g tanh(SX + D)
where S ∈ R^{m×n} is a filter matrix, D ∈ R^m is a vector of biases, and g is a vector of trainable gains. During training, the code Z is encouraged to stay close to F(X) while still reconstructing X sparsely.
[3] Kavukcuoglu, Koray, Marc'Aurelio Ranzato, and Yann LeCun. "Fast inference in sparse coding algorithms with applications to object recognition." arXiv preprint arXiv:1010.3467 (2010). (Cited by 94.)
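
A NumPy sketch of the predictor and the joint PSD energy. The parametrization g * tanh(SX + D), with a per-component gain g, follows my reading of [3]; the sizes and hyper-parameters below are illustrative, not the paper's settings.

```python
import numpy as np

def psd_predictor(X, S, D, g):
    """Feed-forward approximation of the sparse code: F(X) = g * tanh(S X + D)."""
    return g * np.tanh(S @ X + D)

def psd_energy(X, Z, W, S, D, g, lam=0.1, alpha=1.0):
    """Joint PSD energy: reconstruction + sparsity + prediction terms."""
    recon = np.sum((X - W @ Z) ** 2)                             # ||X - WZ||_2^2
    sparse = lam * np.sum(np.abs(Z))                             # lam * ||Z||_1
    predict = alpha * np.sum((Z - psd_predictor(X, S, D, g)) ** 2)
    return recon + sparse + predict

# illustrative sizes: n = 16 input dimensions, m = 64 code units
rng = np.random.default_rng(0)
n, m = 16, 64
X, Z = rng.standard_normal(n), rng.standard_normal(m)
W, S = rng.standard_normal((n, m)), rng.standard_normal((m, n))
D, g = np.zeros(m), np.ones(m)
print(psd_energy(X, Z, W, S, D, g))
```

Once trained, the predictor alone produces approximate sparse codes in a single feed-forward pass, so the expensive minimization over Z can be skipped at test time; this is what makes PSD practical for initializing the filter banks.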

19 Result

20 Why does Unsupervised Pre-training Help Deep Discriminant Learning? [2]

21 References for the graph
[2] Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. Why Does Unsupervised Pre-training Help Deep Discriminant Learning?
[3] Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
[4] Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-Markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 114–128.
[5] Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pp. 1168–1175. New York, NY, USA: ACM.
[6] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278–2324.
[7] Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

22 Non-convex objective
In deep learning, the objective function is usually a highly non-convex function of the parameters, so there are typically many local minima in the model parameter space.
Purely supervised learning starts from a fixed or random point as the initialization, so in many situations it converges to a poor local minimum.

23 Local Minima

24 Random Initialization

25 Unsupervised Pre-training

26 Reasons
There are a few reasonable hypotheses for why pre-training might work.
One possibility is that unsupervised pre-training acts as a kind of regularizer, putting the parameter values in the appropriate range for discriminant training.
Another possibility is that pre-training initializes the model to a point in parameter space that somehow renders the optimization process more effective, in the sense of achieving a lower minimum of the empirical cost function.

27 Conclusion
Understanding and improving deep architectures remains a challenge. This work helps with such understanding via extensive simulations, and it puts forward and confirms a hypothesis explaining the mechanisms behind the effect of unsupervised pre-training on the final discriminant learning task. Future work should clarify this hypothesis further.

28 References
[1] Jarrett, Kevin, et al. "What is the best multi-stage architecture for object recognition?" 2009 IEEE 12th International Conference on Computer Vision (ICCV). IEEE, 2009. (Cited by 396 as of 2014-11-12.)
[2] Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. Why Does Unsupervised Pre-training Help Deep Discriminant Learning?
[3] Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
[4] Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-Markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 114–128.
[5] Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), pp. 1168–1175. New York, NY, USA: ACM.
[6] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278–2324.
[7] Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

29 Thank You!

