
1 ALL YOU NEED IS A GOOD INIT
Dmytro Mishkin, Jiri Matas
Center for Machine Perception, Czech Technical University in Prague
Presenter: Qi Sun

2 Weight Initialization
Why: a proper initialization of the weights in a neural network is critical to its convergence, avoiding exploding or vanishing gradients.
How: keep the signal's variance constant across layers.
Gaussian noise with variance Var(W_l) = 2 / (n_l + n_{l+1}) (Glorot et al. 2010)
Gaussian noise with variance Var(W_l) = 2 / n_l (He et al. 2015)
Orthogonal initial conditions on weights: SVD of a Gaussian matrix, W = U or V (Saxe et al. 2013)
Data-dependent: LSUV (this paper)
Other method: batch normalization (Ioffe et al. 2015), which relaxes the need for careful tuning of the weight initialization
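For concreteness, a minimal NumPy sketch of the two Gaussian variance formulas listed above; the layer sizes (784, 256, 128) are arbitrary placeholders, not values from the slides.

```python
# Sketch: Gaussian weight initialization with the Glorot and He variance formulas.
import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(n_in, n_out):
    # Var(W) = 2 / (n_in + n_out)  (Glorot et al. 2010)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def he_normal(n_in, n_out):
    # Var(W) = 2 / n_in  (He et al. 2015), suited to ReLU units
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

W1 = glorot_normal(784, 256)
W2 = he_normal(256, 128)
print(W1.var(), W2.var())   # close to 2/(784+256) and 2/256 respectively
```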

3 Weight Initialization

4 Layer-Sequential Unit-Variance Initialization

5 Pre-initialize
Pre-initialize the network with orthonormal matrices, as in Saxe et al. (2014).
Why: the fundamental operation in a deep net is repeated matrix multiplication; all the eigenvalues of an orthogonal matrix have absolute value 1, so the resulting product neither explodes nor vanishes.
How (briefly):
Initialize a matrix with a standard Gaussian distribution.
Apply singular value decomposition (SVD) to the matrix.
Initialize the weights with one of the resulting orthogonal factors (U or V), depending on the shape of the weight array, as in the sketch below.
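A minimal NumPy sketch of this pre-initialization step; the shape used in the example is an arbitrary placeholder, and convolutional kernels would first be flattened to 2-D.

```python
# Sketch: orthonormal pre-initialization via SVD of a standard Gaussian matrix.
import numpy as np

def orthonormal_init(shape, rng=np.random.default_rng(0)):
    rows, cols = shape
    a = rng.standard_normal((rows, cols))            # standard Gaussian matrix
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    # u has shape (rows, k), vt has shape (k, cols), with k = min(rows, cols);
    # keep whichever orthogonal factor matches the requested shape.
    return u if u.shape == shape else vt

W = orthonormal_init((256, 128))
print(np.allclose(W.T @ W, np.eye(128), atol=1e-6))  # columns are orthonormal -> True
```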

6 Iterative Procedure
Using minibatches of data, rescale the weights so that the output of each layer has variance 1.
Why: the procedure is data-driven; it resembles batch normalization performed only on the first minibatch. The similarity to batch normalization is the unit-variance normalization step.
How (briefly), for each layer in turn (see the sketch after this list):
STEP 1: Feed a minibatch through the network and compute the variance of the layer's output.
STEP 2: If the variance differs from the target value 1 by more than a tolerance Tol_var and the maximum number of iterations has not been reached, divide the layer's weights by the square root of that variance (its standard deviation) and go back to STEP 1; otherwise the layer is initialized.
Variants:
LSUV normalizing the input activations of each layer instead of the output ones: SAME performance.
Pre-initialization of the weights with Gaussian noise instead of orthonormal matrices: SMALL DECREASE in performance.
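A minimal NumPy sketch of this loop on a toy fully connected ReLU network; tol_var, max_iter, the layer sizes, and the minibatch are placeholder choices, and QR is used here in place of SVD for the orthonormal pre-initialization.

```python
# Sketch: layer-sequential unit-variance (LSUV) rescaling on one minibatch.
import numpy as np

def lsuv_init(weights, x_batch, tol_var=0.1, max_iter=10):
    """Rescale each layer so its output has approximately unit variance."""
    h = x_batch
    for W in weights:                       # layers are initialized in order
        for _ in range(max_iter):
            v = (h @ W).var()               # variance of this layer's output
            if abs(v - 1.0) < tol_var:
                break
            W /= np.sqrt(v)                 # divide by the standard deviation
        h = np.maximum(h @ W, 0.0)          # ReLU output feeds the next layer
    return weights

rng = np.random.default_rng(0)
sizes = [784, 256, 256, 10]
# Orthonormal pre-initialization (QR used here for brevity instead of SVD).
weights = [np.linalg.qr(rng.standard_normal((m, n)))[0]
           for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal((128, 784))         # one minibatch of data
weights = lsuv_init(weights, x)
print((x @ weights[0]).var())               # first layer output variance, close to 1
```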

7 Main contribution
A simple initialization procedure that leads to state-of-the-art thin and very deep neural nets.
Explored the initialization with different activation functions in very deep networks: ReLU, hyperbolic tangent, sigmoid, maxout, and VLReLU.
Motivation: the absence of a general, repeatable and efficient procedure for the end-to-end training of such networks. Romero et al. (2015) stated that deep and thin networks are very hard to train by back-propagation if deeper than five layers, especially with uniform initialization.

8 Validation
Network architecture
Datasets: MNIST, CIFAR-10/100, ILSVRC-2012

9 Validation

10 Validation

11 Validation

12 Comparison to batch normalization
LSUV-initialized networks perform as well as batch-normalized ones, with no extra computation during training.

13 Conclusion
LSUV initialization allows learning of very deep nets via standard SGD and leads to (near) state-of-the-art results on the MNIST, CIFAR, and ImageNet datasets, outperforming sophisticated systems designed specifically for very deep nets, such as FitNets.
The proposed initialization works well with different activation functions.

14 Questions?

