Tips for Training Neural Network


1 Tips for Training Neural Network
scratch the surface

2 Two Concerns There are two things to be concerned about.
Optimization: Can I find the "best" parameter set θ* in a limited amount of time? Generalization: Is the "best" parameter set θ* also good for the testing data?

3 Initialization For gradient descent, we need to pick an initial parameter set θ0. Do not set all the parameters in θ0 equal; set the parameters in θ0 randomly.
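Below is a minimal sketch (not from the slides; the tiny two-unit network and its numbers are purely illustrative) of why equal initialization fails: identically initialized hidden units receive identical gradients, so they stay identical after every update, while random initialization breaks the symmetry.

    import numpy as np

    # Illustrative sketch: a 2-hidden-unit network trained on one example.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 1))          # one input example
    y = np.array([[1.0]])                # its target

    def train(W1, W2, steps=100, eta=0.1):
        for _ in range(steps):
            h = np.tanh(W1 @ x)                          # hidden activations
            out = W2 @ h                                 # linear output
            d_out = out - y                              # d(0.5*(out-y)^2)/d(out)
            dW2 = d_out @ h.T
            dW1 = (W2.T @ d_out) * (1 - h ** 2) @ x.T
            W1 -= eta * dW1
            W2 -= eta * dW2
        return W1

    # Equal initialization: the two hidden rows stay identical after every update.
    print(train(np.full((2, 3), 0.5), np.full((1, 2), 0.5)))
    # Random initialization: the rows differentiate, so both units become useful.
    print(train(rng.normal(scale=0.5, size=(2, 3)), rng.normal(scale=0.5, size=(1, 2))))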

4 Learning Rate Set the learning rate η carefully. Toy example:
Training Data (20 examples) x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5] y = [0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7, 5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5]

5 Learning Rate Toy Example [Figure: error surface C(w,b), with the path from the starting point toward the target minimum.]

6 Learning Rate Toy Example: behaviour with different learning rates η (see the sketch below).
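As a rough illustration (my own sketch, not the slides' code or figures), here is plain gradient descent fitting y ≈ w·x + b to the toy data from slide 4, run with a few different learning rates η; on this error surface a small η converges slowly while a large one overshoots and blows up.

    import numpy as np

    # Toy data from slide 4.
    x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5,
                  5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5])
    y = np.array([0.1, 0.4, 0.9, 1.6, 2.2, 2.5, 2.8, 3.5, 3.9, 4.7,
                  5.1, 5.3, 6.3, 6.5, 6.7, 7.5, 8.1, 8.5, 8.9, 9.5])

    def gradient_descent(eta, steps=100):
        w, b = 0.0, 0.0                        # starting point on the error surface C(w, b)
        for _ in range(steps):
            err = (w * x + b) - y              # residuals on all 20 examples
            w -= eta * 2 * np.mean(err * x)    # step along -dC/dw
            b -= eta * 2 * np.mean(err)        # step along -dC/db
        return w, b, np.mean(((w * x + b) - y) ** 2)

    for eta in (0.001, 0.01, 0.05):            # small, moderate, too large (diverges here)
        print(eta, gradient_descent(eta))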

7 Gradient descent Gradient descent vs. Stochastic Gradient descent
Stochastic gradient descent: pick one example x^r and update on it alone. The two approaches update the parameters towards the same direction (in expectation, if every example x^r has an equal probability of being picked), but the stochastic one updates after every example and is therefore faster. A sketch of both update rules follows.
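Here is a sketch of the two update rules side by side (my own code, with made-up stand-in data; only the structure matters): full gradient descent averages the gradient over all examples before one update, while stochastic gradient descent updates after every single example x^r picked at random.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(0.0, 10.0, 0.5)                         # stand-in training inputs
    y = x + rng.normal(scale=0.3, size=x.size)            # stand-in targets

    def gd_update(w, b, eta=0.01):
        """One full-gradient update: average over all examples, move once."""
        err = (w * x + b) - y
        return w - eta * 2 * np.mean(err * x), b - eta * 2 * np.mean(err)

    def sgd_epoch(w, b, eta=0.01):
        """One epoch of SGD: every example equally likely, one update per example."""
        for r in rng.permutation(x.size):
            err = (w * x[r] + b) - y[r]
            w -= eta * 2 * err * x[r]
            b -= eta * 2 * err
        return w, b

    # After one pass over the 20 examples, GD has moved once and SGD twenty times.
    print(gd_update(0.0, 0.0))
    print(sgd_epoch(0.0, 0.0))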

8 Gradient descent Stochastic Gradient descent: one epoch
Starting at θ0, pick x^1, then x^2, …, x^r, …, x^R from the training data. Once all R examples have been seen once, that is one epoch (pronounced [ˋɛpək]); then the next pass starts again with x^1.

9 Gradient descent Toy Example
Gradient descent sees all the examples before each update, so it updates once per epoch; stochastic gradient descent sees only one example per update, so it updates 20 times in one epoch on the 20-example toy set.

10 Gradient descent Gradient descent, Stochastic Gradient descent, and Mini-Batch Gradient Descent
Mini-batch gradient descent: shuffle your data (i.e. randomly reorder the training examples), pick B examples as a batch b (B is the batch size), and average the gradient of the examples in the batch b before updating. A sketch follows.
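Here is a sketch of the mini-batch variant (again my own illustrative code and data, not the slides'): shuffle the example order once per epoch, slice it into batches of size B, and average the gradient within each batch before updating.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.arange(0.0, 10.0, 0.5)                          # stand-in training inputs
    y = x + rng.normal(scale=0.3, size=x.size)             # stand-in targets

    def minibatch_epoch(w, b, eta=0.01, B=4):
        order = rng.permutation(x.size)                    # "shuffle your data" each epoch
        for start in range(0, x.size, B):
            idx = order[start:start + B]                   # pick B examples as a batch b
            err = (w * x[idx] + b) - y[idx]
            w -= eta * 2 * np.mean(err * x[idx])           # average gradient over the batch
            b -= eta * 2 * np.mean(err)
        return w, b

    w, b = 0.0, 0.0
    for _ in range(50):
        w, b = minibatch_epoch(w, b)
    print(w, b)                                            # roughly recovers the line y ≈ x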

11 Gradient descent Real Example: Handwriting Digit Classification
[Figure: results with batch size = 1 gradient descent.]

12 Two Concerns There are two things to be concerned about.
Optimization: Can I find the "best" parameter set θ* in a limited amount of time? Generalization: Is the "best" parameter set θ* also good for the testing data?

13 Generalization Training data and testing data can have different distributions.
You pick a "best" parameter set θ* on the training data; however, because the training data and testing data have different distributions, that θ* is not necessarily good for the testing data. [Figure: training data vs. testing data.]

14 Panacea Have more training data if possible ……
Or create more training data (?). Handwriting recognition: create new training images from the original ones, e.g. by shifting/rotating a digit by 15°. In speech recognition: add noise, apply warping. A sketch for images follows.
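A sketch of "creating more training data" for images (my own example; the functions are the standard scipy.ndimage ones, and the particular angles and offsets are just illustrative stand-ins for the 15° transformation on the slide):

    import numpy as np
    from scipy.ndimage import rotate, shift

    def augment(image):
        """image: a 2-D greyscale array, e.g. a 28x28 handwritten digit."""
        yield image                                       # the original example
        yield rotate(image, angle=15, reshape=False)      # rotated copy (+15 degrees)
        yield rotate(image, angle=-15, reshape=False)     # rotated copy (-15 degrees)
        yield shift(image, (1, 0))                        # shifted down by one pixel
        yield shift(image, (0, 1))                        # shifted right by one pixel

    # For speech, the analogous tricks are adding noise or warping the signal.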

15 Reference Chapter 3 of Neural Networks and Deep Learning
http://neuralnetworksanddeeplearning.com/chap3.html

16 Appendix A lot of references
Who is afraid of the nonconvex function? Can the structure also be learned? I do not know; maybe Bayesian. The story of the cat. Rprop, Momentum (OK, very good gradient). Minecraft……: supplement?: B-diagram, NN VC dimension

17 Overfitting The function that performs well on the training data does not necessarily perform well on the testing data. [Figure: training data vs. testing data.] A different view of regularization: the picked hypothesis fits the training data well but fails to generalize to the testing data. Overfitting in our daily life: memorizing the answers to previous examples ……

18 A joke about overfitting

19 Initialization For gradient descent, we need to pick an initial parameter set θ0. Do not set all the parameters in θ0 equal, or the parameters will stay equal no matter how many times you update them. Pick θ0 randomly. If the preceding layer has more neurons, the initialization values should be smaller; e.g. scale them down with the number of neurons N_{l-1} in the preceding layer (see the sketch below).
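A sketch of that scaling rule (the 1/√N_{l-1} factor is a common heuristic, e.g. the one used in the referenced book; treat the exact constant as an assumption rather than something fixed by the slide):

    import numpy as np

    rng = np.random.default_rng(0)

    def init_layer(n_in, n_out):
        """Draw weights whose spread shrinks as the fan-in N_{l-1} = n_in grows."""
        W = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_out, n_in))
        b = np.zeros((n_out, 1))
        return W, b

    # With many incoming neurons, W @ x stays O(1) instead of growing like sqrt(n_in),
    # so sigmoid/tanh units do not start out saturated.
    W1, b1 = init_layer(784, 30)      # e.g. an MNIST-sized first hidden layer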

20 MNIST The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git
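For reference, this is roughly how the book's accompanying code is used once cloned (module and function names follow that repository and may differ between versions, so treat this as a sketch rather than a definitive recipe):

    # After cloning, run from the repository's src/ directory:
    import mnist_loader
    import network

    training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
    net = network.Network([784, 30, 10])     # 28x28 = 784 inputs, 30 hidden units, 10 digits
    net.SGD(training_data, 30, 10, 3.0, test_data=test_data)   # 30 epochs, batch size 10, eta = 3.0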

21 MNIST The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence.

22 Early Stopping [Figure: error versus training iteration.]
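The slide only names the technique, so the following is a minimal sketch of one common (patience-based) way to do early stopping; the loop structure and names here are assumptions, not from the slides:

    def train_with_early_stopping(step, validate, max_iters=10_000, patience=5):
        """step(): perform one training update; validate(): return current validation error."""
        best_err, best_iter, waited = float("inf"), 0, 0
        for it in range(max_iters):
            step()
            err = validate()
            if err < best_err:                 # validation error still improving
                best_err, best_iter, waited = err, it, 0
            else:
                waited += 1
                if waited >= patience:         # stopped improving: stop training early
                    break
        return best_iter, best_err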

23 Difficulty of Deep Lower layers cannot learn well.

