1
Random walk initialization for training very deep feedforward networks
Wenhua Jiao, 10/30/2017
2
Outline
Introduction
Analysis and Proposed Initialization
Experiment and Results
Summary
3
Introduction
Author: David Sussillo, Research Scientist @ Google Brain, Adjunct Professor at Stanford.
Paper purpose: use mathematical analysis to address the vanishing gradient problem and identify the features that affect training of very deep networks.
4
Analysis and proposed initialization
The vanishing gradient (VG) problem has been known since the early 1990s. Because of VG, adding many extra layers to an FFN usually does not improve performance. Back-propagation involves applying similar matrices repeatedly to compute the error gradient, and the outcome of this process depends on whether the magnitudes of the leading eigenvalues tend to be greater than or less than one. Only if these magnitudes are tightly constrained can there be a useful "non-vanishing" gradient, and this can be achieved by appropriate initialization.
5
Analysis and proposed initialization
Feedforward networks of the form:
$$a_d = g\,W_d\,h_{d-1} + b_d, \qquad h_d = f(a_d)$$
$h_d$ is the vector of hidden activations, $W_d$ the linear transformation, and $b_d$ the biases, all at depth $d$, with $d = 0, 1, \dots, D$. $f$ is an element-wise nonlinearity with $f'(0) = 1$, and $g$ is a scale factor on the matrices. Assume the network has $D$ layers and each layer has width $N$; the elements of $W_d$ are initially drawn from a Gaussian distribution with mean 0 and variance $1/N$, and the elements of $b_d$ are initialized to 0. Define $h_0$ to be the inputs, $h_D$ the outputs, and $E$ the objective function.
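A minimal NumPy sketch of this setup (the function names `init_layers` and `forward` and the choice of tanh are illustrative assumptions, not taken from the paper): weights are drawn with variance 1/N, biases start at zero, and g scales the linear transformation.

```python
import numpy as np

def init_layers(N, D, g, seed=0):
    """Draw each W_d from N(0, 1/N) elementwise and set each b_d to zero."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N)) for _ in range(D)]
    bs = [np.zeros(N) for _ in range(D)]
    return Ws, bs

def forward(h0, Ws, bs, g, f=np.tanh):
    """Apply a_d = g * W_d @ h_{d-1} + b_d and h_d = f(a_d); tanh satisfies f'(0) = 1."""
    h = h0
    pre_activations = []
    for W, b in zip(Ws, bs):
        a = g * (W @ h) + b
        pre_activations.append(a)
        h = f(a)
    return h, pre_activations
```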
6
Analysis and proposed initialization
$$\delta_d \equiv \left.\frac{\partial E}{\partial a}\right|_d = g\,\tilde W_{d+1}\,\delta_{d+1} \qquad \text{(back-propagation equation for the gradient vector)}$$
$$\tilde W_d(i,j) = f'(a_d(i))\,W_d(j,i)$$
$$|\delta_d|^2 = g^2\, z_{d+1}\, |\delta_{d+1}|^2 \qquad \text{(squared magnitude of the gradient vector)}$$
$$z_d = \frac{|\tilde W_d\,\delta_d|^2}{|\delta_d|^2}$$
$$Z = \frac{|\delta_0|^2}{|\delta_D|^2} = g^{2D} \prod_{d=1}^{D} z_d \qquad \text{(across-all-layers gradient magnitude)}$$
Solving the VG problem amounts to keeping $Z$ close to 1 by appropriately adjusting $g$. The matrices $W$ and $\tilde W$ change during learning, so the author argues this can only be done for the initial configuration of the network, before learning has made these changes.
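A hedged numerical check of these relations for the linear case, where $f'(a) = 1$ and the back-propagated matrix is simply the transpose of $W_{d+1}$ (variable names are my own): it back-propagates a random error vector and confirms that the two expressions for $Z$ agree.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, g = 100, 50, 1.005

# Linear network: f'(a) = 1, so back-propagation applies g * W_{d+1}^T at each layer.
Ws = [rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N)) for _ in range(D)]

delta = rng.normal(size=N)      # delta_D: error vector at the deepest layer
delta_D = delta.copy()
zs = []
for W in reversed(Ws):
    Wt_delta = W.T @ delta
    zs.append(Wt_delta @ Wt_delta / (delta @ delta))  # z for this layer
    delta = g * Wt_delta                              # delta_d = g * W_{d+1}^T delta_{d+1}

Z_direct = (delta @ delta) / (delta_D @ delta_D)      # |delta_0|^2 / |delta_D|^2
Z_product = g ** (2 * D) * np.prod(zs)
print(Z_direct, Z_product)                            # identical up to rounding
```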
7
Analysis and proposed initialization
Because the matrices $\tilde W$ are initially random, we can think of the $z$ variables as random variables; $Z$ is then proportional to a product of random variables:
$$\ln(Z) = D \ln g^2 + \sum_{d=1}^{D} \ln(z_d)$$
We can view $\ln(Z)$ as the result of a random walk, with step $d$ of the walk given by the random variable $\ln(z_d)$. The goal of Random Walk Initialization is to choose $g$ to make $\ln(Z)$ as close to zero as possible.
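Treating the steps $\ln(z_d)$ as approximately independent and identically distributed (an idealization, since successive layers share the back-propagated vector), the mean and variance of this walk are
$$\langle \ln(Z) \rangle = D\left(\ln g^2 + \langle \ln(z) \rangle\right), \qquad \mathrm{Var}\!\left[\ln(Z)\right] \approx D\,\mathrm{Var}\!\left[\ln(z)\right],$$
so the drift of the walk can be cancelled by the choice of $g$ (as done on the next slide), while the remaining fluctuations grow like $\sqrt{D}$ and shrink as the layer width $N$ increases.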
8
Calculation of the optimal g values
To make analytic statements that apply to all networks, the author averages over realizations of the matrices $\tilde W$ applied during back-propagation:
$$\langle \ln(Z) \rangle = D\left(\ln g^2 + \langle \ln(z) \rangle\right) = 0 \;\Rightarrow\; g = \exp\!\left(-\tfrac{1}{2}\langle \ln(z) \rangle\right)$$
Here $z$ is a random variable determined by $z = |\tilde W\,\delta|^2 / |\delta|^2$, with $\tilde W$ and $\delta$ drawn from the same distributions as the $\tilde W_d$ and $\delta_d$ variables of the different layers of the network.
9
Calculation of the optimal g values
When $\tilde W$ is Gaussian, $\tilde W\,\delta/|\delta|$ is Gaussian for any vector $\delta$, so $z$ is $\chi^2$ distributed. With the $N \times N$ matrix $\tilde W$ having variance $1/N$, writing $z = q/N$, $q$ is distributed according to $\chi^2_N$. Expanding the logarithm in a Taylor series about $z = 1$ and using the mean and variance of the distribution:
$$\langle \ln(z) \rangle \approx \langle z - 1 \rangle - \tfrac{1}{2}\left\langle (z-1)^2 \right\rangle = -\frac{1}{N}$$
$$g_{\text{linear}} = \exp\!\left(\frac{1}{2N}\right)$$
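As a quick worked check of this formula at the width used in the verification experiments below:
$$g_{\text{linear}} = \exp\!\left(\frac{1}{2 \cdot 100}\right) = e^{0.005} \approx 1.005,$$
which matches the g = 1.005 used for the N = 100 linear-network random walks.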
10
Calculation of the optimal g values
For the ReLU case, the derivative of the ReLU function sets $N-M$ rows of $\tilde W$ to 0 and leaves $M$ rows with Gaussian entries. $z$ is then the sum of the squares of $M$ random variables with variance $1/N$. Writing $z = q/N$, $q$ is distributed according to $\chi^2_M$. Expanding $\langle \ln(z) \rangle$ in a Taylor series about $\langle z \rangle = 1/2$ gives the first approximation $\langle \ln(z) \rangle \approx -\ln 2 - 2/N$. The author computed $\langle \ln(z) \rangle$ numerically and fit simple analytic expressions to the results to obtain:
$$\langle \ln(z) \rangle \approx -\ln 2 - \frac{2.4}{\max(N, 6) - 2.4}$$
$$g_{\text{ReLU}} = \sqrt{2}\,\exp\!\left(\frac{1.2}{\max(N, 6) - 2.4}\right)$$
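A short sketch of both prescriptions, together with a Monte Carlo estimate of $\langle \ln(z) \rangle$ for the ReLU model described above (drawing M as a Binomial(N, 1/2) count of surviving rows is my reading of the slide, and the function names are illustrative):

```python
import numpy as np

def g_linear(N):
    return np.exp(1.0 / (2.0 * N))

def g_relu(N):
    return np.sqrt(2.0) * np.exp(1.2 / (max(N, 6) - 2.4))

def mean_ln_z_relu(N, trials=20000, seed=0):
    """Estimate <ln z> when z is the sum of squares of M ~ Binomial(N, 1/2)
    Gaussian variables with variance 1/N (rows kept by the ReLU derivative)."""
    rng = np.random.default_rng(seed)
    logs = np.empty(trials)
    for t in range(trials):
        M = rng.binomial(N, 0.5)
        z = np.sum(rng.normal(0.0, np.sqrt(1.0 / N), size=M) ** 2)
        logs[t] = np.log(z)
    return logs.mean()

N = 100
print(g_linear(N), g_relu(N))                        # ~1.005 and ~1.43
print(mean_ln_z_relu(N), -np.log(2) - 2.4 / (max(N, 6) - 2.4))
```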
11
Computational verification
Sample random walks of random vectors back-propagated through a linear network, with N = 100, D = 500, and g = 1.005.
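A reconstruction of how such walks could be generated, using the stated N = 100, D = 500, and g = 1.005 (the number of sample walks and all variable names are my own choices):

```python
import numpy as np

N, D, g = 100, 500, 1.005        # g = exp(1/(2N)) for the linear case
rng = np.random.default_rng(0)

final_log_norms = []
for _ in range(5):               # a few sample random walks
    delta = rng.normal(size=N)
    start = np.log(np.linalg.norm(delta))
    walk = [0.0]                 # ln|delta_d| relative to the starting norm
    for _ in range(D):
        W = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))
        delta = g * (W.T @ delta)            # one linear back-propagation step
        walk.append(np.log(np.linalg.norm(delta)) - start)
    final_log_norms.append(walk[-1])

# With g chosen this way the walks hover near zero instead of drifting exponentially.
print(final_log_norms)
```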
12
Computational verification
Predicted g values as a function of the layer width N and the nonlinearity, with D = 200.
13
Computational verification
The growth of the magnitude of $\delta_0$ in comparison to $\delta_D$ for a fixed N = 100, as a function of the g scaling parameter and the nonlinearity.
14
Experiment and Results
A slight adjustment to g may be helpful, as most real-world data is far from a normal distribution. The initial scaling of the final output layer may need to be adjusted separately, since the back-propagating errors are affected by the initialization of that layer. Random Walk Initialization therefore requires tuning three parameters: the input scaling (or $g_1$), $g_D$, and g; the first two handle transient effects of the inputs and errors, and the last tunes the entire network. By far the most important of the three is g. One possible initializer exposing these knobs is sketched below.
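This is a hypothetical helper, not code from the paper; it folds each layer's scale factor directly into the weights rather than multiplying by g in the forward pass (equivalent at initialization), and `g_input`, `g_output`, and `g_hidden` correspond to $g_1$, $g_D$, and the global g.

```python
import numpy as np

def random_walk_init(N_in, N_hidden, N_out, D, g_hidden, g_input=None, g_output=None, seed=0):
    """Return Gaussian weight matrices with variance 1/fan_in, each scaled by its layer's g."""
    rng = np.random.default_rng(seed)
    g_input = g_hidden if g_input is None else g_input
    g_output = g_hidden if g_output is None else g_output

    def layer(fan_in, fan_out, g):
        return g * rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

    Ws = [layer(N_in, N_hidden, g_input)]                                      # input scaling (g_1)
    Ws += [layer(N_hidden, N_hidden, g_hidden) for _ in range(max(D - 2, 0))]  # bulk layers (g)
    Ws.append(layer(N_hidden, N_out, g_output))                                # output layer (g_D)
    return Ws
```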
15
Experiment and Results
Experiments were run on both the MNIST and TIMIT datasets with a standard FFN. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The TIMIT speech corpus contains a total of 6,300 sentences: 10 sentences spoken by each of 630 speakers selected from 8 major dialect regions of the USA; 70% of the speakers are male and 30% are female.
16
Experiment and Results
Is training on a real-world dataset affected by choosing g according to Random Walk Initialization?
17
Experiment and Results
Does increased depth actually help to decrease the objective function?
18
Summary
With g values chosen by Random Walk Initialization, networks can be trained successfully on real datasets at depths upwards of 200 layers.
Simply increasing N decreases the fluctuations in the norm of the back-propagated errors.
Learning rate scheduling made a huge difference in performance for very deep networks.