Lecture 3b: CNN: Advanced Layers

1 Lecture 3b: CNN: Advanced Layers

2 Agenda Advanced Layers:
Dropout (Hinton et al.)
Stochastic pooling (Zeiler & Fergus)
Maxout (Goodfellow et al.)
Network-in-Network (Min Lin et al.)
GoogLeNet (Szegedy et al.)
Siamese networks

3 Dropout Dropout is a very powerful training technique, usually applied to fully connected layers (Hinton et al.).
Training: set the output of each hidden neuron to 0 with probability 0.5 ("drop" it). The neurons which are "dropped out" in this way do not contribute to the forward pass and do not participate in back-propagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. Note that a smaller weight initialization should be used.
Testing: use all the neurons, but multiply their outputs by 0.5.
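A minimal NumPy sketch of the scheme described above (drop at training time, scale by 0.5 at test time); the drop probability and the toy activations are illustrative only:

import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p_drop=0.5):
    # Training: zero each activation with probability p_drop ("drop" it).
    mask = rng.random(h.shape) >= p_drop      # keep a unit with probability 1 - p_drop
    return h * mask                           # dropped units send nothing forward and get no gradient

def dropout_test(h, p_drop=0.5):
    # Testing: keep every unit but scale its output, so the expected input
    # to the next layer matches what it saw during training.
    return h * (1.0 - p_drop)

h = np.array([1.0, -2.0, 0.5, 3.0])           # toy hidden activations
print(dropout_train(h))                       # one randomly sampled sub-network
print(dropout_test(h))                        # all units, outputs multiplied by 0.5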

4 Training with dropout The ideal training regime for dropout is when the training procedure resembles training an ensemble with bagging under parameter-sharing constraints: each dropout update can be seen as an update to a different model on a different subset of the training set.
Training with dropout is very different from ordinary SGD:
SGD moves slowly and steadily in the most promising direction. SGD usually works best with a small learning rate that results in a smoothly decreasing objective function.
Dropout rapidly explores many different directions and rejects the ones that worsen performance. Dropout works best with a large learning rate, resulting in a constantly fluctuating objective function.

5 Stochastic Pooling Similar to the dropout technique, used instead of max-pooling (Zeiler & Fergus).
Training:
Compute a probability for each element in the pooling region by normalizing the activations inside the region: $p_i = \frac{a_i}{\sum_{k \in R} a_k}$
Sample the pooled activation based on the probabilities from step 1.
Testing: probability-weighted pooling: $s = \sum_{k \in R} p_k a_k$
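A minimal NumPy sketch of the two rules above for a single pooling region; the region values are toy numbers, and the activations are assumed non-negative (e.g. post-ReLU):

import numpy as np

rng = np.random.default_rng(0)

def stochastic_pool_train(region):
    # Training: sample one activation with probability p_i = a_i / sum_k a_k.
    a = np.asarray(region, dtype=float)
    p = a / a.sum()                  # note: the all-zero region edge case is not handled here
    return rng.choice(a, p=p)

def stochastic_pool_test(region):
    # Testing: probability-weighted pooling, s = sum_k p_k * a_k.
    a = np.asarray(region, dtype=float)
    p = a / a.sum()
    return (p * a).sum()

region = [0.0, 1.0, 3.0, 2.0]        # a flattened 2x2 pooling region
print(stochastic_pool_train(region)) # one of 1.0, 3.0 or 2.0 (0.0 has probability 0)
print(stochastic_pool_test(region))  # (1*1 + 3*3 + 2*2) / 6 = 14/6 ≈ 2.33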

6 Stochastic Pooling

7 Maxout Maxout is a type of non-linear activation function which takes the maximum across k affine features ("pooling across channels") (Goodfellow et al., http://arxiv.org/pdf/1302.4389.pdf).
Classical MLP unit: $h_i = \mathrm{ReLU}(z_i) = \mathrm{ReLU}\left(\sum_{j=1}^{M} w_{ij} v_j + b_i\right)$
Maxout unit: $h_i = \max_{j=1..k} z_{ij} = \max_{j=1..k} \left(w_{ij}^{\top} v + b_{ij}\right)$
Example: an MLP with 2 maxout units.
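A NumPy sketch of a maxout layer implementing the formula above; the sizes (4 units, k = 3 pieces, 8 inputs) are arbitrary:

import numpy as np

def maxout(v, W, b):
    # W: (num_units, k, input_dim), b: (num_units, k).
    # z[i, j] = w_ij . v + b_ij; each unit keeps the max over its k affine features.
    z = np.einsum('ikd,d->ik', W, v) + b
    return z.max(axis=1)

rng = np.random.default_rng(0)
v = rng.normal(size=8)               # toy input vector
W = rng.normal(size=(4, 3, 8))       # 4 maxout units, k = 3 linear pieces each
b = rng.normal(size=(4, 3))
print(maxout(v, W, b))               # 4 outputs, no extra non-linearity needed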

8 Maxout + Dropout Maxout works exceptionally well with dropout

9 Network-in-Network Replace the conventional linear filter with a micro non-linear filter (a small MLP) that is slid over the input, just like a convolution (Min Lin et al.).

10 Network-in-Network - 2 The overall structure of NIN is:
a stack of mlpconv layers,
global average pooling, used instead of FC layers: the last mlpconv layer has # of output feature maps = # of classes,
a loss layer.
See also: https://github.com/BVLC/caffe/wiki/Model-Zoo
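A minimal PyTorch sketch of the mlpconv idea and the global-average-pooling head described above; the kernel sizes, widths and the 10-class output are illustrative, not the exact published NIN configuration:

import torch
import torch.nn as nn

def mlpconv(in_ch, mid_ch, out_ch, kernel_size, padding):
    # A normal conv followed by two 1x1 convs: a tiny MLP slid over every spatial position.
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size, padding=padding), nn.ReLU(),
        nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(),
        nn.Conv2d(mid_ch, out_ch, 1), nn.ReLU(),
    )

num_classes = 10                                              # e.g. CIFAR-10 (assumption)
nin = nn.Sequential(
    mlpconv(3, 96, 96, kernel_size=5, padding=2),
    nn.MaxPool2d(3, stride=2),
    mlpconv(96, 192, num_classes, kernel_size=3, padding=1),  # last mlpconv: one feature map per class
    nn.AdaptiveAvgPool2d(1),                                  # global average pooling, no FC layers
    nn.Flatten(),                                             # (N, num_classes) scores for the loss layer
)
print(nin(torch.randn(2, 3, 32, 32)).shape)                   # torch.Size([2, 10])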

11 Network-in-Network: performance
NIN is one of the top performers.

12 GoogLeNet (2014) Winner of ILSVRC 2014.
Very deep network with 22 layers: network-in-network-in-network ...
Removed fully connected layers → small # of parameters (5 million weights).
[Figure legend: Convolution, Pooling, Softmax, Other; OverFeat, 2013]

13 GoogLeNet (2014) Inception layer

14 GoogLeNet (2014) First “naïve” version of Inception module:
It is very expensive when placed on top of a convolutional layer with a large number of filters, and becomes even more expensive when pooling units are added: the pooling branch outputs as many feature maps as the previous stage has filters, so the number of channels grows after every module.
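A back-of-the-envelope multiply-add count illustrating the cost of just the 5x5 branch of the naïve module; the sizes (28x28 grid, 192 input channels, 32 output filters) are hypothetical, not taken from the slide:

H = W = 28                   # spatial size of the feature map (assumed)
C_in, C_out = 192, 32        # input channels and number of 5x5 filters in this branch (assumed)
mults = H * W * 5 * 5 * C_in * C_out
print(f"{mults / 1e6:.0f}M multiply-adds")   # ~120M for the 5x5 branch alone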

15 GoogLeNet (2014) Adding 1x1 convolutions → reduced dimensions → less compute
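A PyTorch sketch of an Inception module with the 1x1 reductions; the branch widths below are assumptions roughly in the spirit of GoogLeNet's early modules, and ReLUs after the final convolutions are omitted for brevity:

import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                          # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
                                nn.Conv2d(96, 128, 3, padding=1))  # 1x1 reduction, then 3x3
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                nn.Conv2d(16, 32, 5, padding=2))   # 1x1 reduction, then 5x5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))           # pool, then 1x1 projection

    def forward(self, x):
        # All branches keep the spatial size, so their outputs concatenate along channels.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(Inception(192)(x).shape)   # torch.Size([1, 256, 28, 28]): 64 + 128 + 32 + 32 channels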

16 GoogLeNet (2014) Auxiliary classifiers (attached to intermediate layers) and the main classifier

17 GoogLeNet (2014) Details of the GoogLeNet architecture

18 GoogLeNet (2014) Training: DistBelief CPU cluster
Asynchronous SGD with momentum 0.9
Fixed learning-rate schedule (decreased by 4% every 8 epochs)
Polyak averaging for the final model
Data augmentation
Testing:
Softmax averaged over 7 models
Multi-scale detection
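A small sketch of one common way to realize Polyak averaging for the final model, as a moving average of the weights kept alongside SGD; the decay value and the exponential-moving-average variant are assumptions, since the slide does not specify them:

import copy
import torch
import torch.nn as nn

def polyak_update(avg_model, model, decay=0.999):
    # Keep an exponential moving average of the parameters; the averaged copy
    # is used as the final model instead of the last SGD iterate.
    with torch.no_grad():
        for p_avg, p in zip(avg_model.parameters(), model.parameters()):
            p_avg.mul_(decay).add_(p, alpha=1 - decay)

model = nn.Linear(4, 2)                 # stand-in for the real network
avg_model = copy.deepcopy(model)        # same architecture, separately stored weights
# ... call this after every optimizer.step() during training:
polyak_update(avg_model, model)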

19 Siamese Networks Face verification problem:
the number of categories is very large and not known during training,
the number of training samples for a single category is very small.
The idea: learn a function that maps input patterns into a target space such that the distance is small for pairs of faces from the same person, and large for pairs from different persons.
The mapping from the raw input to the target space is a convolutional network.
The system is trained on pairs of patterns taken from a training set.
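A toy PyTorch sketch of the setup: both inputs of a pair are mapped into the target space by the same convolutional network (shared weights). The tiny architecture and the 28x28 single-channel inputs are placeholders, not the network from the slides:

import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    def __init__(self, embed_dim=32):
        super().__init__()
        # One shared encoder; both images of a pair pass through the SAME weights.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.LazyLinear(embed_dim),
        )

    def forward(self, x1, x2):
        return self.encoder(x1), self.encoder(x2)   # two points in the target space

net = SiameseNet()
z1, z2 = net(torch.randn(4, 1, 28, 28), torch.randn(4, 1, 28, 28))  # 4 image pairs
print(torch.norm(z1 - z2, dim=1))                   # pairwise distances in the target space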

20 Siamese Network The training set is composed of an equal number of "true" and "false" samples.
Each sample consists of a pair of images and a label ("true" = same person, "false" = different persons).

21 Siamese Network Training minimizes a contrastive loss, which makes the distance between outputs small for images from the same class and large for images from different classes.
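A sketch of the standard contrastive loss (in the Hadsell, Chopra & LeCun form) that implements this behaviour; the margin value is an assumption, and z1, z2 stand for the two embeddings produced by the shared network sketched above:

import torch

def contrastive_loss(z1, z2, same, margin=1.0):
    # same = 1 for a "true" pair (same person), 0 for a "false" pair.
    d = torch.norm(z1 - z2, dim=1)                             # distance in the target space
    pull = same * d.pow(2)                                     # true pairs: penalize large distances
    push = (1 - same) * torch.clamp(margin - d, min=0).pow(2)  # false pairs: penalize distances below the margin
    return 0.5 * (pull + push).mean()

# usage with embeddings like those from the SiameseNet sketch:
z1, z2 = torch.randn(4, 32), torch.randn(4, 32)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(contrastive_loss(z1, z2, labels))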

