1 Batch Normalization Accelerating Deep Network Training by Reducing Internal Covariate Shift

2 In This Presentation
Internal Covariate Shift
Normalization
Normalization Via Mini-Batch Statistics
Batch Normalized Convolutional Networks
Accelerating BN Networks
Results
Conclusion

3 Internal Covariate Shift
The distribution of each layer's inputs changes during training as the parameters of the previous layers change.

4 Internal Covariate Shift
Consider a network computing $\ell = F_2(F_1(u, \theta_1), \theta_2)$. Writing $x = F_1(u, \theta_1)$, we have $\ell = F_2(x, \theta_2)$. A change in $\theta_1$ changes the input distribution to $F_2$, so $\theta_2$ has to readjust to compensate.

5 Internal Covariate Shift
Consider the layer $z = g(Wu + b)$ with the sigmoid nonlinearity $g(x) = \frac{1}{1 + e^{-x}}$, where $W$ and $b$ are the layer's parameters to be learned. Since $\lim_{|x| \to \infty} \frac{\partial g}{\partial x} = 0$, the gradient vanishes for all dimensions of $x = Wu + b$ except those with small absolute values. Changes to $W$ and $b$ during training will likely move many dimensions of $x$ into the saturated regime.

6 Internal Covariate Shift
Before this paper, the problem was handled with the following practices: using ReLU as the nonlinearity, careful initialization, and small learning rates.

7 Normalization An input vector x is considered whitened if it has zero mean, unit variance, and decorrelated components (its covariance matrix is the identity). Network training is known to converge faster if its inputs are whitened (LeCun et al., 1998b; Wiesler & Ney, 2011).

8 Normalization Gradient descent should take into account that normalization occurs. What if it doesn't? Take, for example, a layer that adds a bias, $x = u + b$, followed by a simple normalization $\hat{x} = x - E[x]$. A gradient descent step updates $b \leftarrow b + \Delta b$. If the step ignores the normalization, afterwards $\hat{x} = (u + b + \Delta b) - E[u + b + \Delta b] = u + b - E[u + b]$: the normalized output, and hence the loss, is unchanged, while $b$ keeps growing. The problem gets worse when the normalization also scales.

9 Normalization Why not use full whitening?
Ideally we would normalize $\hat{x} = \text{Norm}(x, \mathcal{X})$ with respect to the whole training set $\mathcal{X}$. Backpropagation would then need both Jacobians $\frac{\partial \text{Norm}(x, \mathcal{X})}{\partial x}$ and $\frac{\partial \text{Norm}(x, \mathcal{X})}{\partial \mathcal{X}}$, and whitening requires computing $\text{Cov}[x]^{-1/2}\,(x - E[x])$ and its derivatives. This is computationally heavy and complicated.

10 Normalization Via Mini-Batch Statistics
Two necessary simplifications:
Normalize each dimension independently (no decorrelation): $\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\text{Var}[x^{(k)}]}}$, then scale and shift: $y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$. The original activations can be recovered by setting $\gamma^{(k)} = \sqrt{\text{Var}[x^{(k)}]}$ and $\beta^{(k)} = E[x^{(k)}]$.
Use a mini-batch of size m: each mini-batch produces estimates of the mean and variance of each activation.

11 Normalization Via Mini-Batch Statistics
Effectively, for each layer whose inputs we wish to normalize, we add a normalization layer just before it; the layer's new input is the output of the normalization layer. We denote the output of the normalization over a mini-batch as $y_i = \text{BN}_{\gamma,\beta}(x_i)$. The parameters $\gamma, \beta$ are added to keep the same representation power as before: we want to normalize without limiting the network to a subset of inputs. BN introduces $2 \cdot \dim(x)$ new learnable parameters for each normalized layer.

12 Normalization Via Mini-Batch Statistics
Computing BN:
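The slide shows the batch-normalizing transform as an image (Algorithm 1 in the paper). A minimal NumPy sketch of that forward computation, with illustrative names, might look like this:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalizing transform over a mini-batch.

    x: (m, d) mini-batch of activations; gamma, beta: (d,) learned scale and shift.
    """
    mu = x.mean(axis=0)                      # mini-batch mean, per activation
    var = x.var(axis=0)                      # mini-batch (biased) variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    y = gamma * x_hat + beta                 # scale and shift
    cache = (x_hat, gamma, var, eps)         # saved for the backward pass
    return y, cache
```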

13 Normalization Via Mini-Batch Statistics
BN derivatives for back propagation:
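The derivative equations are also shown as an image in the slides. A sketch of the corresponding backward pass, obtained by applying the chain rule through the mini-batch mean and variance (same illustrative names as above):

```python
def batchnorm_backward(dy, cache):
    """Gradients of the loss w.r.t. x, gamma and beta, given dL/dy of shape (m, d)."""
    x_hat, gamma, var, eps = cache
    m = dy.shape[0]
    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)
    dx_hat = dy * gamma
    inv_std = 1.0 / np.sqrt(var + eps)
    # The mini-batch mean and variance also depend on x, hence the extra terms.
    dx = (inv_std / m) * (m * dx_hat
                          - dx_hat.sum(axis=0)
                          - x_hat * np.sum(dx_hat * x_hat, axis=0))
    return dx, dgamma, dbeta
```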

14 Normalization Via Mini-Batch Statistics
During inference, we want the output to depend only on the input, deterministically. For this, we use statistics over the population rather than over a mini-batch, with an unbiased variance estimate (m is the size of each mini-batch, averaged over batches $\mathcal{B}$): $\hat{x} = \frac{x - E[x]}{\sqrt{\text{Var}[x] + \epsilon}}$, where $\text{Var}[x] = \frac{m}{m-1}\, E_{\mathcal{B}}[\sigma_{\mathcal{B}}^2]$.
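A possible sketch of this inference-time transform, assuming the population mean and unbiased variance have been accumulated from the training mini-batches:

```python
def batchnorm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Deterministic BN at inference time, using fixed population statistics."""
    x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
    return gamma * x_hat + beta
```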

15 Overall algorithm:

16 Batch Normalized Convolutional Networks
For convolutional networks, we want different elements of a feature map, at different locations, to be normalized in the same way. Therefore, we jointly normalize all the activations in a feature map over all locations. For a batch of size $m$ and feature maps of size $p \times q$, the effective batch size is $mpq$. We learn $\gamma^{(k)}, \beta^{(k)}$ per feature map rather than per activation.
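A minimal sketch of this per-feature-map normalization, assuming NCHW-shaped activations (names are illustrative):

```python
def batchnorm_conv_forward(x, gamma, beta, eps=1e-5):
    """BN for convolutional feature maps.

    x: (N, C, H, W). Each channel is normalized over the effective batch of
    N*H*W values; gamma, beta: (C,) one pair per feature map.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```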

17 Batch Normalized Convolutional Networks
Note: when normalizing $x = Wu + b$, there is no need to keep the bias $b$: subtracting the mini-batch mean cancels its effect, and its role is taken over by $\beta$.

18 Accelerating BN Networks:
$\text{BN}(Wu) = \text{BN}((\alpha W)u)$, and thus $\frac{\partial \, \text{BN}((\alpha W)u)}{\partial u} = \frac{\partial \, \text{BN}(Wu)}{\partial u}$: the scale of $W$ affects neither the Jacobian nor the gradient propagation. Furthermore, $\frac{\partial \, \text{BN}((\alpha W)u)}{\partial (\alpha W)} = \frac{1}{\alpha} \cdot \frac{\partial \, \text{BN}(Wu)}{\partial W}$: larger weights lead to smaller gradients, so batch normalization stabilizes parameter growth. Training is more stable overall, so we can use higher learning rates.
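A quick numerical check of this scale invariance, reusing the batchnorm_forward sketch from above (the scale only interacts with the tiny eps term):

```python
rng = np.random.default_rng(0)
u = rng.normal(size=(64, 10))
W = rng.normal(size=(10, 5))
gamma, beta = np.ones(5), np.zeros(5)

y1, _ = batchnorm_forward(u @ W, gamma, beta)
y2, _ = batchnorm_forward(u @ (8.0 * W), gamma, beta)   # scale the weights by alpha = 8
print(np.allclose(y1, y2))   # True: BN(Wu) == BN(alpha * W u), up to eps
```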

19 Accelerating BN Networks:
Increase the learning rate: no ill side effects.
Remove dropout: the conjecture is that it becomes redundant, since BN provides a similar regularization effect through the random selection of examples in each mini-batch.
Shuffle training examples more thoroughly: prevent the same examples from always appearing together in the same mini-batch.
Reduce the L2 weight regularization.
Remove Local Response Normalization.
Reduce photometric distortions: since batch-normalized networks train faster, each example is seen fewer times, so training should focus on more "real" images.

20 Results: MNIST Classification
This experiment compares a simple network with 3 fully connected hidden layers to the same network with batch normalization, on the MNIST dataset, with a batch size of 60. (a) shows the test accuracy. (b) and (c) show the 15th, 50th and 85th percentiles of the input distribution to one typical sigmoid activation in the last hidden layer.


22 Results: ImageNet Classification
BN enabled the researchers to beat the state-of-the-art top-5 error rate. BN enabled training a network that uses sigmoid nonlinearities with meaningful results (the same Inception network, with every ReLU replaced by a sigmoid). The BN-modified Inception network reached the same accuracy in far fewer training steps, and an ensemble of BN-modified networks achieved better than state-of-the-art results in top-5 error rate.

23 Single Network Comparison over ImageNet
Ensemble Comparison over ImageNet

24 Conclusion Merely adding Batch Normalization speeds up convergence of the network and gives better results for a single network. Applying the other modifications that Batch Normalization makes affordable (higher learning rates, removing or reducing dropout, etc.) reaches the former state of the art in a fraction of the training steps, and an ensemble surpasses the state of the art by a significant margin.

25 Residual Neural Networks

26 Problems
Vanishing / exploding gradients
Degradation at increasing depths

27 Vanishing / exploding gradients
$w_t = w_{t-1} - \alpha \frac{\partial E}{\partial w_{t-1}}, \qquad \frac{\partial E}{\partial y_i} = \frac{\partial y_{i+1}(w, y_i)}{\partial y_i} \cdot \frac{\partial E}{\partial y_{i+1}}, \qquad \frac{\partial E}{\partial w_{i+1}} = \frac{\partial y_{i+1}(w, y_i)}{\partial w_{i+1}} \cdot \frac{\partial E}{\partial y_{i+1}}$
During backpropagation these products cause gradients to reach very low or very high values, resulting in problematic weights that "kill" information. Fixing methods: weight normalization, batch normalization.

28 Degradation First layers = coarse features; last layers = fine features. Experiment: take a working shallow network and make it deeper by inserting weighted "identity layers". We expect performance to remain the same or improve, but empirically this is wrong! (Deep Residual Learning for Image Recognition, Kaiming He, Fig1)

29 Degradation Counter-intuitively, increasing depth may decrease accuracy, even with normalization of inputs/weights. Conclusion: it is hard for CNNs to approximate identity connections (due to the non-linear modules?) (Deep Residual Learning for Image Recognition, Kaiming He, Fig1)

30 Residual approach to NN
Basic idea: "bypass" groups of n layers using "skip connections"; usually n = 2 or n = 3, and the skip connection is usually the identity. If H(x) is the desired mapping from input to output, the block learns the residual F(x) := H(x) - x and outputs H(x) = F(x) + x. (Deep Residual Learning for Image Recognition, Kaiming He, Fig2)
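A minimal sketch of such a block in the same NumPy style, assuming two fully connected weight layers with a ReLU in between (a simplification of the two-convolution block in the paper) and matching dimensions so the skip is the identity:

```python
def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Output = relu(F(x) + x), where F is two weight layers with a ReLU in between."""
    f = relu(x @ W1) @ W2          # the residual function F(x)
    return relu(f + x)             # add the identity skip connection, then the nonlinearity
```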

31 Residual approach to NN
The network learns many small residual functions instead of one huge function. Dotted lines stand for skip connections between layers of different dimensions. (Deep Residual Learning for Image Recognition, Kaiming He, Fig3)

32 Residual approach to NN
Hypothesis: residual functions can be learned faster and more accurately than plain ones. This resembles a perturbation model. (Deep Residual Learning for Image Recognition, Kaiming He, Fig3)

33 Residual approach to NN
Gradient magnitudes are smaller in ResNets than in plain nets, and the deeper the network, the smaller the magnitudes. This strengthens the conclusion that only small perturbations are needed on top of the direct inputs. (Deep Residual Learning for Image Recognition, Kaiming He, Fig7)

34 Handling dimension differences
(A) Zero-pad the extra dimensions; no extra parameters (weights). (B) Identity shortcuts when possible, weighted projection matrices when the dimensions increase. (C) Use weighted projection matrices for all shortcuts. (Deep Residual Learning for Image Recognition, Kaiming He, Table3)
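For illustration, a sketch of how option (B) could be implemented, with a hypothetical learned projection matrix Ws used only when the dimensions differ:

```python
def shortcut(x, out_dim, Ws=None):
    """Option (B): identity skip when dimensions match, learned projection otherwise."""
    if x.shape[-1] == out_dim:
        return x          # identity shortcut, no extra parameters
    return x @ Ws         # projection to the new dimension (Ws has shape (in_dim, out_dim))
```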

35 Depth matters Increasing a residual network's depth does improve its performance (Deep Residual Learning for Image Recognition, Kaiming He, Table4)

36 Depth matters The trend stops at extremely deep networks, possibly due to overfitting (an exaggerated number of parameters compared to the amount of training data). Better regularization might help? (Deep Residual Learning for Image Recognition, Kaiming He, Table6)

37 Depth matters (Deep Residual Learning for Image Recognition, Kaiming He, Fig6)

38 Unraveled view Residual networks can be viewed as a recursive application of a single building block. Module i "sees" $2^{i-1}$ paths: a direct path to the input and all possible paths through the earlier modules. (Residual Networks Behave Like Ensembles of Relatively Shallow Networks, Andreas Veit, Fig1)

39 Unraveled view In a simple 3 module example: Residual network:
(Residual Networks Behave Like Ensembles of Relatively Shallow Networks, Andreas Veit, Fig1)

40 Unraveled view In a simple 3 module example: Plain network:
(Residual Networks Behave Like Ensembles of Relatively Shallow Networks, Andreas Veit, Fig2)

41 Lesion study at test time
Let's assume we can "turn off" module $f_i$ by having the signal pass only through its skip connection. Denoting a working module by 1 and a disconnected module by 0, there are $2^n$ possible paths through the network, and the path-length distribution is binomial. (Residual Networks Behave Like Ensembles of Relatively Shallow Networks, Andreas Veit, Fig2)
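To make the counting concrete, a small illustrative sketch that enumerates paths by which modules they pass through (n is an arbitrary example value):

```python
from collections import Counter
from itertools import product

n = 10                                    # number of residual modules (example value)
# Each path chooses, per module, the skip connection (0) or the residual branch (1).
paths = list(product([0, 1], repeat=n))
print(len(paths))                         # 2**n = 1024 possible paths
lengths = Counter(sum(p) for p in paths)  # path length = number of modules traversed
print(sorted(lengths.items()))            # binomial(n, 1/2), centered around n/2
```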

42 Lesion study at test time
Disabling individual layers has little effect on ResNets and a severe effect on plain networks. In ResNets this means reducing the number of paths from $2^n$ to $2^{n-1}$; in plain networks it destroys the only path. The exceptions in ResNets are the down-sampling layers. (Residual Networks Behave Like Ensembles of Relatively Shallow Networks, Andreas Veit, Fig3)

43 Lesion study at test time
Disabling multiple layers increases the error smoothly (on average). This indicates an ensemble-like behavior of the network. (Residual Networks Behave Like Ensembles of Relatively Shallow Networks, Andreas Veit, Fig5)

44 Effective paths in ResNets
Path length distribution: binomial, tightly centered around n/2. E.g., in a 54-module ResNet, 95% of the paths go through 19 to 35 modules. (Residual Networks Behave Like Ensembles of Relatively Shallow Networks, Andreas Veit, Fig6)

45 Effective paths in ResNets
Sample paths of k modules, feed them forward, then backpropagate and measure the gradients. Taking into account both the abundance of each path length and its gradient magnitude, we surprisingly find that almost all gradient contribution comes from paths of 7 to 15 modules (the effective paths). These paths constitute only 0.45% of all paths in the network! (Residual Networks Behave Like Ensembles of Relatively Shallow Networks, Andreas Veit, Fig6)

46 Effective paths in ResNets
Experiment: train a ResNet using only the effective paths. Result: effective paths only = 6.10% error rate; full network = 5.96% error rate. Comparable! (Residual Networks Behave Like Ensembles of Relatively Shallow Networks, Andreas Veit, Fig6)

47 Alternative methods
Highway networks: the weight is split between the skip connection and the module; empirical experiments show that after convergence the weights "prefer" the skip connections.
Dropout: during training, randomly drop neurons to average over an ensemble of paths; test with all neurons.
Stochastic depth: during training, randomly drop whole layers/modules to average over an ensemble of networks; test with all layers/modules (see the sketch below).
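As an illustration of the stochastic-depth idea, a sketch of a forward pass that randomly drops whole residual modules during training; the survival probability and the test-time rescaling follow the common formulation and are assumptions here:

```python
def stochastic_depth_forward(x, blocks, p_survive=0.8, training=True, rng=None):
    """Randomly drop whole residual modules during training; keep them all at test time.

    blocks: list of callables, each computing a residual function F_i(x).
    """
    rng = rng or np.random.default_rng()
    for F in blocks:
        if training:
            if rng.random() < p_survive:
                x = x + F(x)              # module kept: residual branch plus skip
            # else: module dropped, the signal passes through the skip connection only
        else:
            x = x + p_survive * F(x)      # test time: expected contribution of the branch
    return x
```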

48 ResNets: conclusions & discussion
ResNets introduce the concept of skip connections. The main idea is to learn residual functions rather than one big function. ResNets address the degradation problem (where adding more layers decreases accuracy): skip connections help learn small functions on top of identity mappings. A limitation remains at extremely deep networks (overfitting?).

49 ResNets: conclusions & discussion
Effective paths in ResNets are surprisingly short. Paths in ResNets do not strongly depend on each other; they act like an ensemble. ResNets do not solve the vanishing gradient problem; rather, they exploit the flexible nature of an ensemble of paths to improve gradient flow and increase accuracy. There is a tension between the improvement gained by increasing depth and the discovery that the effective paths are actually short.

50 Questions?

51 Thank You

