
Random walk initialization for training very deep feedforward networks


1 Random walk initialization for training very deep feedforward networks
Wenhua Jiao, 10/30/2017

2 Outline
Introduction
Analysis and Proposed Initialization
Experiments and Results
Summary

3 Introduction
Author: David Sussillo, Research Scientist at Google Brain and adjunct faculty at Stanford.
Paper purpose: use mathematical analysis to address the vanishing gradient problem and identify the factors that affect the training of very deep networks.

4 Analysis and proposed initialization
The vanishing gradient (VG) problem has been known since the early 1990s. Because of VG, adding many extra layers to a feedforward network (FFN) usually does not improve performance. Back-propagation applies similar matrices repeatedly to compute the error gradient, and the outcome of this process depends on whether the magnitudes of the leading eigenvalues tend to be greater than or less than one. Only if these magnitudes are tightly constrained can there be a useful, non-vanishing gradient, and this can be achieved by appropriate initialization.

5 Analysis and proposed initialization
Feedforward networks of the form:
a_d = g W_d h_{d-1} + b_d
h_d = f(a_d)
h_d is the vector of hidden activations, W_d the linear transformation, and b_d the biases, all at depth d, with d = 0, 1, ..., D. f is an element-wise nonlinearity with f'(0) = 1, and g is a scale factor on the matrices. Assume the network has D layers, each of width N; the elements of W_d are initially drawn from a Gaussian distribution with mean 0 and variance 1/N, and the elements of b_d are initialized to 0. Define h_0 to be the inputs and h_D the outputs, and let E be the objective function.
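The forward pass above can be sketched in a few lines (a minimal NumPy illustration; the function name and the tanh default are assumptions, not from the paper):

```python
import numpy as np

def forward(h0, D, g, f=np.tanh, rng=None):
    """Run a_d = g * W_d h_{d-1} + b_d, h_d = f(a_d) for d = 1..D.

    Each W_d is N x N with i.i.d. N(0, 1/N) entries; biases start at 0.
    """
    rng = rng or np.random.default_rng(0)
    N = h0.shape[0]
    h = h0
    for _ in range(D):
        W = rng.standard_normal((N, N)) / np.sqrt(N)  # variance 1/N
        b = np.zeros(N)
        a = g * (W @ h) + b
        h = f(a)
    return h
```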

6 Analysis and proposed initialization
𝛿 𝑑 ≑ πœ•πΈ πœ•π‘Ž | 𝑑 =𝑔 π‘Š 𝑑+1 𝛿 𝑑+1 back-propagation equation(gradient vector) π‘Š 𝑑 𝑖,𝑗 = 𝑓 β€² ( π‘Ž 𝑑 (𝑖)) π‘Š 𝑑 (𝑗,𝑖) 𝛿 𝑑 2 = g 2 𝑧 𝑑+1 𝛿 𝑑+1 2 squared magnitude of gradient vector 𝑧 𝑑 = π‘Š 𝑑 𝛿 𝑑 / 𝛿 𝑑 2 𝑍= 𝛿 0 2 𝛿 𝐷 2 = 𝑔 2𝐷 𝑑=1 𝐷 𝑧 𝑑 across-all-layer gradient magnitude Solving the VG problem amounts to keep Z to be 1, appropriately adjusting g. The matrices π‘Šand π‘Š change during learning, so author think we can only do this for the initial configuration of the network before learning has made these changes

7 Analysis and proposed initialization
Because the matrices Ŵ are initially random, the z variables can be treated as random variables, so Z is proportional to a product of random variables:
ln Z = D ln g² + Σ_{d=1}^{D} ln z_d
Think of ln Z as the result of a random walk, with step d of the walk given by the random variable ln z_d. The goal of Random Walk Initialization is to choose g so as to make ln Z as close to zero as possible.
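This random-walk picture is easy to check numerically for a linear network (f the identity). The sketch below, assuming NumPy, back-propagates a random vector and accumulates ln g² + ln z_d step by step:

```python
import numpy as np

def ln_Z_walk(N, D, g, seed=0):
    """Back-propagate a random vector through D random N x N matrices
    and return the running sum d*ln(g^2) + sum of ln z up to depth d."""
    rng = np.random.default_rng(seed)
    delta = rng.standard_normal(N)
    walk, lnZ = [], 0.0
    for _ in range(D):
        W = rng.standard_normal((N, N)) / np.sqrt(N)  # variance 1/N
        new = W @ delta
        z = (new @ new) / (delta @ delta)  # z_d for this layer
        lnZ += np.log(g**2) + np.log(z)
        walk.append(lnZ)
        delta = new
    return np.array(walk)
```

With N = 100, D = 500, and g = exp(1/(2N)) ≈ 1.005 (the linear-case optimum derived on the next slides), the walk hovers near zero instead of drifting off, which is exactly the behavior shown in the verification figures.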

8 Calculation of the optimal 𝑔 values
To make analytic statements that apply to all networks, the author averages over realizations of the matrices Ŵ applied during back-propagation:
⟨ln Z⟩ = D (ln g² + ⟨ln z⟩) = 0 → g = exp(−⟨ln z⟩ / 2)
Here z is a random variable determined by z = |Ŵ δ|² / |δ|², with Ŵ and δ drawn from the same distributions as the Ŵ_d and δ_d variables of the different layers of the network.

9 Calculation of the optimal 𝑔 values
When Ŵ is Gaussian, Ŵδ/|δ| is Gaussian for any vector δ, so z is χ²-distributed. With the N×N matrix Ŵ having entries of variance 1/N, writing z = η/N makes η distributed according to χ²_N. Expanding the logarithm in a Taylor series about z = 1 and using the mean and variance of the distribution:
⟨ln z⟩ ≈ ⟨z − 1⟩ − ⟨(z − 1)²⟩/2 = −1/N
g_linear = exp(1/(2N))
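The ⟨ln z⟩ ≈ −1/N result can be verified directly by sampling (a NumPy sketch; the sample size and seed are arbitrary choices):

```python
import numpy as np

N = 100
rng = np.random.default_rng(1)
eta = rng.chisquare(df=N, size=200_000)  # eta ~ chi^2_N
mean_ln_z = np.log(eta / N).mean()       # estimate of <ln z>, close to -1/N

g_linear = np.exp(1.0 / (2 * N))         # = exp(-<ln z>/2)
```

For N = 100 this gives g_linear ≈ 1.005, the value used in the computational-verification slide below.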

10 Calculation of the optimal 𝑔 values
For the ReLU case, the derivative of the ReLU function sets N − M rows of Ŵ to 0 and leaves M rows with Gaussian entries, so z is the sum of the squares of M random variables of variance 1/N. Writing z = η/N, η is distributed according to χ²_M. Expanding ⟨ln z⟩ in a Taylor series about the mean ⟨z⟩ = 1/2 gives ⟨ln z⟩ ≈ −ln 2 − 2/N to leading order. The author computed ⟨ln z⟩ numerically and fit simple analytic expressions to the results, obtaining:
⟨ln z⟩ ≈ −ln 2 − 2.4 / (max(N, 6) − 2.4)
g_ReLU = √2 · exp(1.2 / (max(N, 6) − 2.4))
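The fitted expression translates directly to code (a small sketch assuming NumPy; note that for large N it approaches √2, the familiar variance-scaling factor for ReLU networks):

```python
import numpy as np

def g_relu(N):
    """Fitted optimal scale g_ReLU for ReLU networks of width N."""
    return np.sqrt(2.0) * np.exp(1.2 / (max(N, 6) - 2.4))
```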

11 Computational verification
Sample random walks of random vectors back-propagated through a linear network, with N = 100, D = 500, and g = 1.005.

12 Computational verification
Predicted g values as a function of the layer width N and the nonlinearity, with D = 200.

13 Computational verification
The growth of the magnitude of δ_0 relative to δ_D for fixed N = 100, as a function of the scaling parameter g and the nonlinearity.

14 Experiment and Results
A slight adjustment to g may be helpful, as most real-world data is far from normally distributed. The initial scaling of the final output layer may also need separate adjustment, since the back-propagating errors are affected by its initialization. Random Walk Initialization therefore requires tuning three parameters: the input scaling (or g_1), g_D, and g; the first two handle transient effects of the inputs and errors, and the last tunes the network as a whole. By far the most important of the three is g.
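One way the three scales could be wired together (a NumPy sketch; the function name and the g_in/g_out arguments standing in for g_1 and g_D are illustrative, and the scale is folded into the stored matrices rather than applied at run time):

```python
import numpy as np

def init_weights(n_in, n_hidden, n_out, D, g, g_in=1.0, g_out=1.0, seed=0):
    """Create D weight matrices: the first scaled by g_in (g_1), the
    last by g_out (g_D), and all hidden-to-hidden layers by g."""
    rng = np.random.default_rng(seed)
    sizes = [n_in] + [n_hidden] * (D - 1) + [n_out]  # D+1 layer widths
    scales = [g_in] + [g] * (D - 2) + [g_out]        # one scale per matrix
    return [s * rng.standard_normal((m, n)) / np.sqrt(n)
            for (n, m), s in zip(zip(sizes[:-1], sizes[1:]), scales)]
```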

15 Experiment and Results
Experiments were run on both the MNIST and TIMIT datasets with a standard FFN. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The TIMIT speech corpus contains 6,300 sentences in total: 10 sentences spoken by each of 630 speakers drawn from 8 major dialect regions of the USA; 70% of the speakers are male and 30% female.

16 Experiment and Results
Question: is training on a real-world dataset improved by choosing g according to Random Walk Initialization?

17 Experiment and Results
Does increased depth actually help to decrease the objective function?

18 Summary
With the g values above, networks can be successfully trained on real datasets at depths upwards of 200 layers.
Simply increasing N decreases the fluctuations in the norm of the back-propagated errors.
Learning-rate scheduling made a large difference in performance for very deep networks.


