1
Lecture 3b: CNN: Advanced Layers
2
Agenda: Advanced Layers
- Dropout (Hinton et al.)
- Stochastic pooling (Zeiler, Fergus)
- Maxout (I. Goodfellow)
- Network-in-Network (Min Lin et al.)
- GoogLeNet (Szegedy et al.)
- Siamese networks
3
Dropout
Dropout is a very powerful training technique, usually applied to fully connected layers (Hinton et al.).
Training:
- Set the output of each hidden neuron to 0 with probability 0.5 ("drop" it).
- Neurons that are "dropped out" in this way do not contribute to the forward pass and do not participate in back-propagation.
- So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights.
- Note that smaller weight initialization ( ) should be used.
Testing:
- Use all the neurons, but multiply their outputs by 0.5.
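The training and testing rules above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation from Hinton et al.: the function names `dropout_train` and `dropout_test` and the example activations are hypothetical, and the drop probability of 0.5 is the value from the slide.

```python
import random

def dropout_train(activations, p_drop=0.5, rng=random):
    """Training pass: zero each activation independently with probability p_drop."""
    mask = [0.0 if rng.random() < p_drop else 1.0 for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask

def dropout_test(activations, p_drop=0.5):
    """Test pass: keep every unit, but scale outputs by the keep probability."""
    keep = 1.0 - p_drop
    return [a * keep for a in activations]

# Hypothetical hidden-layer outputs, used only for illustration.
hidden = [0.2, 1.5, 0.0, 3.1]
rng = random.Random(0)                       # fixed seed so the run is repeatable
dropped, mask = dropout_train(hidden, rng=rng)
scaled = dropout_test(hidden)                # every output multiplied by 0.5
```

Each call to `dropout_train` draws a fresh mask, which is what "samples a different architecture" means in practice; the weights feeding `hidden` are shared across all masks.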
4
Training with dropout
The ideal training regime for dropout is one in which the training procedure resembles training an ensemble with bagging under parameter-sharing constraints: each dropout update can be seen as an update to a different model on a different subset of the training set.
Training with dropout is very different from ordinary SGD:
- SGD moves slowly and steadily in the most promising direction. It usually works best with a small learning rate, which results in a smoothly decreasing objective function.
- Dropout rapidly explores many different directions and rejects the ones that worsen performance. It works best with a large learning rate, which results in a constantly fluctuating objective function.
5
Zeiler & Fergus http://arxiv.org/pdf/1302.4389.pdf
Stochastic Pooling
Similar to the dropout technique; used instead of max-pooling.
Training:
1. Compute a probability for each element in the pooling region $R$ by normalizing the activations inside the region: $p_i = \frac{a_i}{\sum_{k \in R} a_k}$
2. Sample the pooled activation from the region according to the probabilities from step 1.
Testing: weighted pooling, $s = \sum_{k \in R} p_k a_k$
Zeiler & Fergus
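A minimal sketch of the two stochastic pooling rules for a single pooling region, assuming non-negative activations (for example, outputs of a ReLU). The function names are hypothetical, and the uniform fallback for an all-zero region is an assumption made here to avoid dividing by zero.

```python
import random

def pooling_probabilities(region):
    """p_i = a_i / sum_k a_k over one pooling region."""
    total = sum(region)
    if total == 0:
        # Assumption: fall back to uniform probabilities for an all-zero region.
        return [1.0 / len(region)] * len(region)
    return [a / total for a in region]

def stochastic_pool_train(region, rng):
    """Training: sample one activation from the region with probability p_i."""
    probs = pooling_probabilities(region)
    return rng.choices(region, weights=probs, k=1)[0]

def stochastic_pool_test(region):
    """Testing: probability-weighted sum s = sum_k p_k * a_k."""
    probs = pooling_probabilities(region)
    return sum(p * a for p, a in zip(probs, region))

region = [1.0, 2.0, 0.0, 1.0]                # one flattened 2x2 pooling region
probs = pooling_probabilities(region)
sampled = stochastic_pool_train(region, random.Random(0))
weighted = stochastic_pool_test(region)
```

Larger activations are picked more often during training, but smaller non-zero ones still get a chance, which is where the regularizing effect comes from.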
6
Stochastic Pooling
7
Maxout
Maxout is a new type of non-linear activation function that takes the maximum across k affine features ("pooling across channels"); in the formulas below k = M.
Example:
Classical MLP: $h_i = \mathrm{ReLU}\left(\sum_{j=1}^{M} z_{ij}\right) = \mathrm{ReLU}\left(\sum_{j=1}^{M} (w_{ij} v_j + b_{ij})\right)$
Maxout: $h_i = \max_{j=1..M} z_{ij} = \max_{j=1..M} (w_{ij} v_j + b_{ij})$
Figure: MLP with 2 maxout units (Goodfellow)
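A small sketch of a single maxout unit may help make the formula concrete. It assumes that each of the M affine features is a full dot product of its own weight vector with the input plus a bias; the function name `maxout_unit` and the example weights are hypothetical.

```python
def maxout_unit(v, weights, biases):
    """Return max over j of (w_j . v + b_j), with one weight vector and bias per piece j."""
    z = [sum(w_k * v_k for w_k, v_k in zip(w, v)) + b
         for w, b in zip(weights, biases)]
    return max(z)

# With M = 2 pieces, weights +1 and -1 and zero biases, the unit computes |v|.
abs_weights = [[1.0], [-1.0]]
abs_biases = [0.0, 0.0]
out_negative = maxout_unit([-3.0], abs_weights, abs_biases)
out_positive = maxout_unit([2.0], abs_weights, abs_biases)
```

The example shows that the shape of the activation is learned: the same unit becomes an absolute value, a ReLU (one piece fixed at zero), or another piecewise-linear convex function, depending on its weights.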
8
Maxout + Dropout Maxout works exceptionally well with dropout
9
Network-in-Network
Replace the conventional linear filter with a micro non-linear filter (a small MLP), which is slid over the input (similar to a CNN).
Min Lin et al.
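The idea of sliding a small MLP over the input can be sketched as follows. This is a simplified illustration, assuming a 1x1 receptive field so that the MLP only mixes the channels at each spatial position; the names `dense_relu` and `mlpconv_1x1` and all weights are hypothetical.

```python
def relu(x):
    return max(0.0, x)

def dense_relu(x, weights, biases):
    """One fully connected layer followed by ReLU."""
    return [relu(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
            for w, b in zip(weights, biases)]

def mlpconv_1x1(feature_map, layers):
    """Apply the same small MLP at every spatial position.

    feature_map[y][x] is the list of channel values at that position;
    layers is a list of (weights, biases) pairs shared across positions.
    """
    out = []
    for row in feature_map:
        out_row = []
        for pixel in row:
            h = pixel
            for weights, biases in layers:
                h = dense_relu(h, weights, biases)
            out_row.append(h)
        out.append(out_row)
    return out

# 2x2 input with 2 channels, MLP with 2 hidden units and 1 output channel.
feature_map = [[[1.0, 2.0], [2.0, 1.0]],
               [[0.0, 0.0], [1.0, 1.0]]]
layers = [([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),
          ([[1.0, 1.0]], [0.0])]
output = mlpconv_1x1(feature_map, layers)
```

Because the MLP weights are shared across positions, each MLP layer in this sketch is equivalent to a 1x1 convolution followed by ReLU.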
10
https://github.com/BVLC/caffe/wiki/Model-Zoo
Network-in-Network - 2
The overall structure of NIN:
- a stack of mlpconv (MLP) layers
- global average pooling, used instead of FC layers: the last mlpconv layer has # of output feature maps = # of classes
- loss layer
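Global average pooling as a replacement for FC layers can be sketched like this, assuming the last mlpconv layer has already produced one feature map per class. The helper names and the example maps are hypothetical; the softmax stands in for the loss layer's normalization.

```python
import math

def global_average_pool(feature_maps):
    """feature_maps[c] is an H x W map for class c; return one score per class."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

def softmax(scores):
    """Numerically stable softmax over class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three hypothetical classes, each with a 2x2 feature map.
feature_maps = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
scores = global_average_pool(feature_maps)
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))
```

The pooling step has no trainable parameters, so unlike an FC layer it adds nothing to the weight count.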
11
Network-in-Network: performance
NIN - one of top performers:
12
GoogLeNet (2014)
Winner of ILSVRC.
- Very deep network with 22 layers: network-in-network-in-network ...
- Removed fully connected layers
- Small # of parameters (5 mln weights)
Figure legend: Convolution, Pooling, Softmax, Other. Overfeat, 2013
13
GoogLeNet (2014) Inception layer
14
GoogLeNet (2014) First “naïve” version of Inception module:
The naïve module is very expensive when placed on top of a convolutional layer with a large number of filters. It becomes even more expensive when pooling units are used: the number of output filters of a pooling unit equals the number of filters in the previous stage.
15
GoogLeNet (2014) Adding 1x1 convolutions reduces dimensions, so less compute is needed.
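A back-of-the-envelope count shows why the 1x1 reduction saves compute. The sizes below (28x28 input with 192 channels, 32 filters of size 5x5, reduction to 16 channels) are illustrative assumptions, and the count assumes stride 1 with padding so that the output keeps the input's spatial size.

```python
def conv_macs(height, width, in_channels, out_channels, kernel):
    """Multiply-accumulate operations for one convolution layer (stride 1, same padding)."""
    return height * width * out_channels * kernel * kernel * in_channels

H, W, C_IN = 28, 28, 192        # assumed input size
C_REDUCED, C_OUT = 16, 32       # assumed 1x1 bottleneck width and 5x5 filter count

# Naive branch: 5x5 convolution applied directly to all input channels.
direct = conv_macs(H, W, C_IN, C_OUT, 5)

# Reduced branch: 1x1 convolution down to C_REDUCED channels, then the 5x5 convolution.
reduced = conv_macs(H, W, C_IN, C_REDUCED, 1) + conv_macs(H, W, C_REDUCED, C_OUT, 5)

savings = direct / reduced
print(f"direct: {direct:,} MACs, with 1x1 reduction: {reduced:,} MACs")
```

Under these assumptions the 5x5 branch needs roughly ten times fewer multiply-accumulates once the 1x1 layer is inserted in front of it.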
16
GoogLeNet (2014) Auxiliary classifiers Main classifier
17
GoogLeNet (2014) Details of GoogleNet architecture
18
GoogLeNet (2014)
Training:
- DistBelief CPU cluster
- Asynchronous SGD with momentum 0.9
- Fixed learning rate schedule (decreased by 4% every 8 epochs)
- Polyak averaging for the final model
- Data augmentation
Testing:
- Softmax averaged over 7 models
- Multi-scale detection
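Two of the training ingredients above are easy to sketch: the step-wise learning rate schedule and Polyak averaging of the parameters. The base learning rate of 0.01 is a hypothetical value (the slides do not give one), and `PolyakAverager` is an illustrative helper that keeps a plain running mean of the parameter vectors it is shown.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.96, step=8):
    """Decrease the learning rate by 4% every 8 epochs."""
    return base_lr * decay ** (epoch // step)

class PolyakAverager:
    """Running average of parameter vectors seen during training."""

    def __init__(self):
        self.count = 0
        self.average = None

    def update(self, params):
        self.count += 1
        if self.average is None:
            self.average = list(params)
        else:
            # Incremental mean: avg += (p - avg) / n
            self.average = [avg + (p - avg) / self.count
                            for avg, p in zip(self.average, params)]
        return self.average

rates = [learning_rate(epoch) for epoch in (0, 7, 8, 16)]

averager = PolyakAverager()
for params in ([1.0, 2.0], [3.0, 4.0]):     # stand-ins for successive SGD iterates
    averager.update(params)
final_params = averager.average
```

The averaged parameters, rather than the last SGD iterate, would then be used as the final model.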
19
Siamese Networks
Face verification problem:
- the number of categories is very large and not known during training
- the number of training samples for a single category is very small
The idea:
- learn a function that maps input patterns into a target space such that the distance is small for pairs of faces from the same person, and large for pairs from different persons
- the mapping from the raw input to the target space is a convolutional network
- the system is trained on pairs of patterns taken from a training set
20
Siamese Network
- The training set is composed of an equal number of "true" and "false" samples.
- Each sample consists of a pair of images and a label ("true" or "false").
21
Siamese Network
Minimize a contrastive loss: the distance between the outputs is minimized for images from the same class and maximized for images from different classes.
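The slides do not give a formula for the loss, so the sketch below assumes one common margin-based form: half the squared distance for "true" pairs, and half the squared hinge on (margin - distance) for "false" pairs. The function names, the margin of 1.0, and the example embeddings are hypothetical; the embeddings stand in for the outputs of the shared convolutional network.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    same=True  -> penalize any distance between the two outputs.
    same=False -> penalize only pairs that are closer than the margin.
    """
    d = euclidean_distance(emb_a, emb_b)
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

# Hypothetical 2-D embeddings, used only for illustration.
loss_true_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], same=True)
loss_false_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], same=False)
loss_false_close = contrastive_loss([0.0, 0.0], [0.0, 0.0], same=False)
```

The margin keeps the "maximize for different classes" part bounded: once a "false" pair is farther apart than the margin, it no longer contributes to the loss.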