Lecture 3a Analysis of training of NN

1 Lecture 3a Analysis of training of NN

2 Agenda
- Analysis of deep networks
- Variance analysis
- Non-linear units
- Weight initialization
- Local Response Normalization (LRN)
- Batch Normalization

3 Understanding the difficulty of training convolutional networks
The key idea: debug training by monitoring
- the mean and variance of the activations $y_l$ (outputs of the non-linear units),
- the mean and variance of the gradients $\partial E / \partial y_{l-1}$ and $\partial E / \partial W_l$.
Reminder: the variance of $x$ is $Var(x) = E\left[(x - \bar{x})^2\right]$.
We compute a scalar mean and variance for each layer, and then average over the images in the test set.
X. Glorot, Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks"
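To make the monitoring concrete, here is a minimal sketch (my own toy setup, not code from the lecture) that records the scalar mean and variance of each layer's activations for a small NumPy MLP and averages them over a test batch:

```python
import numpy as np

def forward_with_stats(x, weights, act=np.tanh):
    """Forward pass through a toy MLP, recording per-layer activation statistics."""
    stats = []
    y = x
    for W in weights:
        y = act(y @ W)                      # y_l = f(y_{l-1} W_l)
        stats.append((y.mean(), y.var()))   # scalar mean and variance for this layer
    return y, stats

# Toy setup: 4 tanh layers with Xavier-style init (see the later slides)
rng = np.random.default_rng(0)
sizes = [256, 256, 256, 256, 256]
weights = [rng.uniform(-np.sqrt(3 / n_in), np.sqrt(3 / n_in), size=(n_in, n_out))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal((128, sizes[0]))    # a batch of test inputs
_, stats = forward_with_stats(x, weights)
for l, (m, v) in enumerate(stats, 1):
    print(f"layer {l}: mean={m:+.4f}  var={v:.4f}")
```

The same idea applies to gradients: record mean and variance of the backpropagated errors per layer and watch how they evolve during training.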

4 Understanding the difficulty of training convolutional networks
Activation graph for an MLP: 4 x [fully connected layer + sigmoid].
[Figure: mean and standard deviation of the activations (sigmoid outputs) during learning, for the 4 hidden layers.]
The top hidden layer quickly saturates at 0 (slowing down all learning), but then slowly desaturates around epoch 100. The sigmoid is not symmetric around zero, which makes the network difficult to train.

5 Understanding the non-linear function behavior
Let's try an MLP with symmetric non-linear functions: tanh and soft-sign.
Tanh: $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$   |   Soft-sign: $\dfrac{x}{1 + |x|}$
[Figure: 98th percentile (markers only) and standard deviation (solid lines with markers) of the activation values during training.]
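For reference, a tiny sketch of the two symmetric activations mentioned above (plain NumPy; the function names are mine):

```python
import numpy as np

def tanh(x):
    # (e^x - e^-x) / (e^x + e^-x); np.tanh is the numerically stable version
    return np.tanh(x)

def softsign(x):
    # x / (1 + |x|): saturates much more slowly than tanh (polynomial rather than exponential tails)
    return x / (1.0 + np.abs(x))

x = np.linspace(-6, 6, 7)
print(tanh(x))
print(softsign(x))
```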

6 MLP: Debugging the forward path
How can we use variance analysis to debug NN training? Let's start with a classical Multi-Layer Perceptron (MLP).
Forward propagation for an FC layer: $y_i = f\left(\sum_{j=1}^{n} w_{ij} x_j\right)$
where:
- $x = [x_j]$ – layer inputs, $n$ – number of inputs
- $W$ – layer weight matrix
- $y = [y_i]$ – layer outputs (hidden nodes)
Assume that $f$ is a symmetric non-linear activation function. For the initial analysis we ignore the non-linear unit $f$ and its derivative ($f$ does not saturate and $f' \approx 1$).

7 MLP: Debugging the forward path
Assume that:
- all $x_j$ are independent and have the same variance $Var(X)$,
- all $w_{ij}$ are independent and have the same variance $Var(W)$.
Then $Var(y) = n_{in} \cdot Var(W) \cdot Var(X)$.
We want to keep the output $y$ in the same dynamic range as the input $x$:
$n_{in} \cdot Var(W) = 1 \;\Rightarrow\; Var(W) = \dfrac{1}{n_{in}}$
Xavier rule for weight initialization with a uniform rand():
$W = uni\_rand\left(-\sqrt{\dfrac{3}{n_{in}}},\; \sqrt{\dfrac{3}{n_{in}}}\right)$
X. Glorot, Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks"
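A quick numerical sanity check of this rule (my own sketch, not from the slides): drawing W uniformly from (-sqrt(3/n_in), sqrt(3/n_in)) gives Var(W) = 1/n_in, so the pre-activation variance stays close to the input variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 512, 10000

x = rng.standard_normal((batch, n_in))                # Var(x) ~ 1
limit = np.sqrt(3.0 / n_in)                           # uniform(-a, a) has variance a^2/3 = 1/n_in
W = rng.uniform(-limit, limit, size=(n_in, n_out))    # Xavier (forward-path) init

y = x @ W                                             # linear part only (f ignored, f' ~ 1)
print("Var(x):", x.var())                             # ~1.0
print("Var(y):", y.var())                             # ~ n_in * Var(W) * Var(x) ~ 1.0
```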

8 MLP: Debugging the backward propagation
Backward propagation of gradients:
$\dfrac{\partial E}{\partial x} = \dfrac{\partial E}{\partial y} \cdot W^{T}$; $\qquad \dfrac{\partial E}{\partial w} = \dfrac{\partial E}{\partial y} \cdot x$
Then:
$Var\!\left(\dfrac{\partial E}{\partial x}\right) = n_{out} \cdot Var(W) \cdot Var\!\left(\dfrac{\partial E}{\partial y}\right)$
$Var\!\left(\dfrac{\partial E}{\partial w}\right) = Var(x) \cdot Var\!\left(\dfrac{\partial E}{\partial y}\right)$
We want to keep the gradients $\dfrac{\partial E}{\partial x}$ from vanishing and from exploding:
$n_{out} \cdot Var(W) = 1 \;\Rightarrow\; Var(W) = \dfrac{1}{n_{out}}$.
Combining with the formula from the forward path:
$Var(W) = \dfrac{2}{n_{in} + n_{out}}$
Xavier rule 2 for weight initialization with a uniform rand():
$W = uni\_rand\left(-\sqrt{\dfrac{6}{n_{in} + n_{out}}},\; \sqrt{\dfrac{6}{n_{in} + n_{out}}}\right)$
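A minimal helper implementing this combined rule (a sketch; the function name and defaults are mine):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Glorot/Xavier init: Var(W) = 2 / (n_in + n_out), drawn from a uniform distribution."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (n_in + n_out))     # uniform(-a, a) => Var = a^2/3 = 2/(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_uniform(784, 256)
print(W.var(), 2.0 / (784 + 256))             # the two numbers should be close
```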

9 Extension of gradient analysis for convolutional networks
Convolutional layer. Forward propagation: $Y_i = f\left(\sum_{j=1}^{M} W_{ij} * X_j\right)$
where:
- $Y_i$ – output feature map (H' x W')
- $W_{ij}$ – convolutional filter (K x K)
- $X_j$ – input feature map (H x W)
- $W_{ij} * X_j$ – convolution of the input feature map $X_j$ with the filter $W_{ij}$
- $M$ – number of input feature maps (each H x W)
Backward propagation: $\dfrac{\partial E}{\partial X_j} = \sum_{i=1}^{N} \dfrac{\partial E}{\partial Y_i} * W_{ij}$, $\qquad \dfrac{\partial E}{\partial W_{ij}} = \dfrac{\partial E}{\partial Y_i} * X_j$
Here $*$ denotes convolution.
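A naive sketch of this forward pass for a single output feature map, using SciPy's 2-D convolution (shapes, names, and the choice of f are my assumptions):

```python
import numpy as np
from scipy.signal import convolve2d

def conv_forward_one_map(X, W, f=np.tanh):
    """Y_i = f( sum_j W_ij * X_j ) for one output feature map.

    X: (M, H, W)  input feature maps
    W: (M, K, K)  filters for this output map (one per input map)
    """
    acc = sum(convolve2d(X[j], W[j], mode="valid") for j in range(X.shape[0]))
    return f(acc)                                # (H-K+1, W-K+1) output map

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8, 8))               # M=3 input maps, 8x8
Wf = rng.standard_normal((3, 3, 3))              # 3x3 filters
print(conv_forward_one_map(X, Wf).shape)         # (6, 6)
```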

10 Extension of gradient analysis for convolutional networks
Convolutional layer.
Forward propagation: $Var(Y) = n_{in} \cdot Var(W) \cdot Var(X)$, where $n_{in}$ = (# input feature maps) $\cdot\, k^2$.
Backward propagation: $Var\!\left(\dfrac{\partial E}{\partial x}\right) = n_{out} \cdot Var(W) \cdot Var\!\left(\dfrac{\partial E}{\partial y}\right)$, where $n_{out}$ = (# output feature maps) $\cdot\, k^2$.
For the weight gradients: $Var\!\left(\dfrac{\partial E}{\partial w}\right) \sim (H \cdot W) \cdot Var(X) \cdot Var\!\left(\dfrac{\partial E}{\partial y}\right)$.
We can compensate for the $(H \cdot W)$ factor with a per-layer learning rate.
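A sketch of how the Xavier rule extends to a convolutional weight tensor; the (maps_out, maps_in, k, k) layout is my assumption:

```python
import numpy as np

def xavier_uniform_conv(maps_in, maps_out, k, rng=None):
    """Xavier init for a conv layer: n_in = maps_in * k^2, n_out = maps_out * k^2."""
    rng = rng or np.random.default_rng()
    n_in, n_out = maps_in * k * k, maps_out * k * k
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(maps_out, maps_in, k, k))

W = xavier_uniform_conv(maps_in=64, maps_out=128, k=3)
print(W.shape, W.var(), 2.0 / (64 * 9 + 128 * 9))   # empirical vs. target variance
```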

11 Local Contrast Normalization
Local Contrast Normalization can be performed on the state of every layer, including the input.
- Subtractive Local Contrast Normalization: subtracts from every value in a feature map a Gaussian-weighted average of its neighbors (a high-pass filter).
- Divisive Local Contrast Normalization: divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps.
- Subtractive + Divisive LCN: applies both in sequence.
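A rough single-map sketch of subtractive + divisive LCN using a Gaussian-weighted neighborhood (sigma and eps are my choices; the slide's divisive step also pools over feature maps, which this sketch omits):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(x, sigma=2.0, eps=1e-6):
    """Subtractive then divisive LCN on one feature map (H x W)."""
    # Subtractive: remove the Gaussian-weighted local mean (acts as a high-pass filter)
    centered = x - gaussian_filter(x, sigma)
    # Divisive: divide by the Gaussian-weighted local standard deviation
    local_std = np.sqrt(gaussian_filter(centered ** 2, sigma)) + eps
    return centered / local_std

img = np.random.default_rng(0).standard_normal((32, 32))
print(local_contrast_normalize(img).std())
```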

12 Local Response Normalization Layer
The LRN layer "damps" responses that are too large by normalizing over a local neighborhood inside the feature map:
$y_l(x,y) = \dfrac{y_{l-1}(x,y)}{1 + \dfrac{\alpha}{N^2} \sum_{x'=x-N/2}^{x+N/2} \; \sum_{y'=y-N/2}^{y+N/2} y_{l-1}(x',y')^2}$
where:
- $y_{l-1}(x,y)$ is the activity map prior to normalization,
- $N$ is the size of the region used for normalization,
- the constant 1 in the denominator prevents numerical issues for small values.
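A direct, unoptimized sketch of this within-map normalization (the parameter values N and alpha are my assumptions):

```python
import numpy as np

def lrn_within_map(a, N=5, alpha=1e-4):
    """Local response normalization inside one feature map, following the formula above."""
    H, W = a.shape
    out = np.empty_like(a)
    half = N // 2
    for x in range(H):
        for y in range(W):
            x0, x1 = max(0, x - half), min(H, x + half + 1)
            y0, y1 = max(0, y - half), min(W, y + half + 1)
            denom = 1.0 + (alpha / N**2) * np.sum(a[x0:x1, y0:y1] ** 2)
            out[x, y] = a[x, y] / denom
    return out

fmap = np.random.default_rng(0).standard_normal((8, 8))
print(lrn_within_map(fmap).shape)
```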

13 Local Response Normalization Layer
[Network diagram, bottom to top: Data Layer → Convolutional layer → ReLU → Pooling → LRN layer → Convolutional layer → ReLU → Pooling → LRN layer → Inner Product → SoftMax]

14 Batch Normalization Layer
A layer that normalizes the output of a convolutional layer before the non-linear unit:
- Whitening: normalize each element of the feature map over the mini-batch. All locations of the same feature map are normalized in the same way.
- Adaptive scale γ and shift β (per map) – learned parameters.
S. Ioffe, C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", 2015
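A minimal sketch of the training-time forward pass for feature maps of shape (batch, channels, H, W): one mean and variance per channel over the mini-batch and all spatial locations, then a learned per-channel scale γ and shift β (all names are mine, not Caffe's):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: (C,). Returns normalized output and a cache for backprop."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)          # per-channel mean over batch and space
    var = x.var(axis=(0, 2, 3), keepdims=True)          # per-channel variance
    x_hat = (x - mu) / np.sqrt(var + eps)               # whitened activations
    out = gamma[None, :, None, None] * x_hat + beta[None, :, None, None]
    return out, (x_hat, var, eps, gamma)

x = np.random.default_rng(0).standard_normal((16, 3, 8, 8))
gamma, beta = np.ones(3), np.zeros(3)
y, cache = batchnorm_forward(x, gamma, beta)
print(y.mean(axis=(0, 2, 3)), y.var(axis=(0, 2, 3)))    # ~0 and ~1 per channel
```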

15 Batch Normalization Layer
[Network diagram, bottom to top: Data Layer → Convolutional layer → Batch Normalization layer → ReLU → Pooling → Convolutional layer → Batch Normalization layer → ReLU → Pooling → Inner Product → SoftMax]

16 Batch Normalization: training
Back-propagation for the BN layer (as implemented in Caffe):
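The gradients follow the chain-rule derivation in the Ioffe & Szegedy paper; below is a compact NumPy sketch matching the forward cache from the earlier Batch Normalization slide (it is not Caffe's actual implementation):

```python
import numpy as np

def batchnorm_backward(dout, cache):
    """Gradients w.r.t. x, gamma, beta for the forward pass sketched earlier."""
    x_hat, var, eps, gamma = cache
    N, C, H, W = dout.shape
    m = N * H * W                                        # elements per channel
    dgamma = np.sum(dout * x_hat, axis=(0, 2, 3))
    dbeta = np.sum(dout, axis=(0, 2, 3))
    dxhat = dout * gamma[None, :, None, None]
    inv_std = 1.0 / np.sqrt(var + eps)
    # Standard BN gradient: combines the direct path with the paths through the mean and variance
    dx = (inv_std / m) * (m * dxhat
                          - dxhat.sum(axis=(0, 2, 3), keepdims=True)
                          - x_hat * np.sum(dxhat * x_hat, axis=(0, 2, 3), keepdims=True))
    return dx, dgamma, dbeta
```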

17 Batch Normalization: inference
During inference we do not have a mini-batch to normalize over, so instead we use a fixed mean and variance estimated over the whole training set:
$\hat{x} = \dfrac{x - E[x]}{\sqrt{Var[x] + \epsilon}}$
For testing during training, we can use running estimates of $E[x]$ and $Var[x]$, as sketched below.
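One common way to maintain such estimates during training (an assumption on my part, not necessarily what Caffe does) is an exponential moving average of the per-batch statistics, used as fixed statistics at inference:

```python
import numpy as np

class RunningStats:
    """Running E[x] and Var[x] per channel, updated from each training mini-batch."""
    def __init__(self, channels, momentum=0.9):
        self.mean = np.zeros(channels)
        self.var = np.ones(channels)
        self.momentum = momentum

    def update(self, x):                       # x: (N, C, H, W), called during training
        self.mean = self.momentum * self.mean + (1 - self.momentum) * x.mean(axis=(0, 2, 3))
        self.var = self.momentum * self.var + (1 - self.momentum) * x.var(axis=(0, 2, 3))

    def normalize(self, x, eps=1e-5):          # used at inference: fixed statistics
        return (x - self.mean[None, :, None, None]) / np.sqrt(self.var[None, :, None, None] + eps)
```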

18 Batch Normalization: performance
Networks with batch normalization train much faster:
- a much higher learning rate with fast exponential decay can be used,
- no need for LRN.
Baseline: caffe cifar_full. VGG-16: caffe VGG_ILSVRC_16.

19 Batch Normalization: ImageNet performance
Models:
- Inception (GoogLeNet, ILSVRC 2014), trained with its original learning rate.
- BN-Baseline: Inception + Batch Normalization before each ReLU.
- BN-x5: Inception + Batch Normalization, without dropout and LRN; the initial learning rate increased 5x.
- BN-x30: like BN-x5, but with the initial learning rate increased 30x over Inception.
- BN-x5-Sigmoid: like BN-x5, but with sigmoid instead of ReLU.

