1 Neural Networks - Berrin Yanıkoğlu: Applications and Examples (from Mitchell, Ch. 4)

2 ALVINN drives 70 mph on highways

3 Speech Recognition

4 Hidden Node Functions

10 Head Pose Recognition

11 MLP & Backpropagation Issues

12 Considerations
– Network architecture: typically feedforward; however, you may also use local receptive fields for the hidden nodes, and recurrent nodes in sequence learning.
– Number of input, hidden, and output nodes: the number of hidden nodes is quite important; the others are determined by the problem setup.
– Activation functions: careful, regression requires a linear activation on the output; for other tasks, sigmoid or hyperbolic tangent is a good choice (see the sketch below).
– Learning rate: typically the software adjusts this.
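A minimal sketch (not from the slides) of how the activation-function advice translates to code for a regression network, assuming a single hidden layer; the function and variable names are illustrative:

```python
import numpy as np

def init_mlp(n_in, n_hidden, n_out, seed=0):
    """Small random initial weights for a single-hidden-layer MLP."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))   # input -> hidden weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))  # hidden -> output weights
    b2 = np.zeros(n_out)
    return W1, b1, W2, b2

def forward(x, W1, b1, W2, b2):
    """Forward pass: tanh on the hidden nodes, linear output as regression requires."""
    h = np.tanh(W1 @ x + b1)   # hyperbolic tangent hidden activations
    y = W2 @ h + b2            # linear activation on the output layer
    return y, h
```

For a classification network, the linear output would typically be replaced by a sigmoid (or softmax) activation.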

13 Considerations
– Preprocessing: important (see next slides)
– Learning algorithm: backpropagation with momentum or Levenberg-Marquardt is suggested
– When to stop training: important (see next slides)

14 Preprocessing
Input variables should be decorrelated and have roughly equal variance. Typically, however, only a very simple linear transformation is applied to obtain zero-mean, unit-variance inputs (see the sketch below):
\( x_i \leftarrow (x_i - \bar{x}_i)/\sigma_i \), where \( \sigma_i^2 = \frac{1}{N-1}\sum_{p} (x_{pi} - \bar{x}_i)^2 \) and the sum runs over the patterns \( p \).
More complex preprocessing is also commonly done, e.g. principal component analysis.
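A minimal sketch of this standardization, assuming the patterns are stored one per row of a NumPy array (the function name is illustrative):

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Scale each input variable to zero mean and unit variance.

    X holds one pattern per row and one input variable per column;
    ddof=1 gives the 1/(N-1) denominator used in the formula above.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / (std + eps)   # eps guards against constant columns
```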

15 When to stop training
No precise formula:
1) At a local minimum, the gradient magnitude is 0.
– Stop when the gradient is sufficiently small; this requires calculating the gradient over the whole set of patterns.
– You may need to measure the gradient in several directions, to avoid errors caused by numerical instability.
2) A local minimum is a stationary point of the performance index (the error).
– Stop when the absolute change in the weights is small. How to measure? Typically, rates of about 0.01%.
3) We are interested in generalization ability.
– Stop when the error measured on a validation set starts to increase (early stopping; see the sketch below).
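A minimal sketch of criterion 3, assuming hypothetical train_one_epoch and validation_error helpers and an illustrative patience value:

```python
def train_with_early_stopping(net, train_set, val_set, max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_err, best_weights, epochs_since_best = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(net, train_set)       # hypothetical: one backprop pass over the training set
        err = validation_error(net, val_set)  # hypothetical: error on the held-out validation set
        if err < best_err:
            best_err, best_weights = err, net.copy_weights()
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # validation error has started to increase
                break
    net.set_weights(best_weights)             # roll back to the best network seen
    return net
```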

16 Effects of Sequential versus Batch Mode: Summary
– Batch:
  – Better estimation of the gradient
– Sequential (online):
  – Better if the data is highly correlated
  – Better in terms of local minima (stochastic search)
  – Easier to implement
(A sketch contrasting the two update schedules follows.)
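A sketch of the two schedules, assuming a hypothetical error_gradient(w, pattern) function that returns the gradient of the squared error for one pattern:

```python
def batch_epoch(w, patterns, lr):
    """Batch mode: accumulate the gradient over all patterns, then update once."""
    grad = sum(error_gradient(w, p) for p in patterns)
    return w - lr * grad

def sequential_epoch(w, patterns, lr, rng):
    """Sequential (online) mode: update after every pattern, visited in random order."""
    for i in rng.permutation(len(patterns)):
        w = w - lr * error_gradient(w, patterns[i])
    return w
```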

17 Performance Surface: Motivation for some of the practical issues

18 Local Minima of the Performance Criteria
– The performance surface is a very high-dimensional (W) space full of local minima.
– Your best bet using gradient descent is to locate one of the local minima:
  – Start the training from different random locations (we will later see how we can make use of several networks trained this way); a sketch follows below.
  – You may also use simulated annealing or genetic algorithms to improve the search in the weight space.
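A sketch of the random-restart idea, assuming hypothetical init_random_weights, train, and error helpers:

```python
def train_with_restarts(data, n_restarts=10, seed=0):
    """Run training from several random starting points and keep the best result."""
    best_w, best_err = None, float("inf")
    for r in range(n_restarts):
        w0 = init_random_weights(seed + r)  # hypothetical: fresh random initial weights
        w = train(w0, data)                 # hypothetical: gradient descent until it stalls
        err = error(w, data)                # hypothetical: squared error over the data
        if err < best_err:                  # keep the network that reached the lowest minimum
            best_w, best_err = w, err
    return best_w
```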

19 Performance Surface Example: network architecture, nominal function, and parameter values (layer numbers are shown as superscripts)

20 Squared Error vs. w^1_{1,1} and b^1_1

21 Squared Error vs. w^1_{1,1} and w^2_{1,1}

22 Squared Error vs. b^1_1 and b^1_2

23 MLP & Backpropagation Summary
The REST of the SLIDES are ADVANCED MATERIAL (read only if you are interested, or if there is something you don't understand…). These slides are thanks to John Bullinaria.

41 Alternatives to Gradient Descent: ADVANCED MATERIAL (read only if interested)

42 SUMMARY
There are alternatives to standard backpropagation intended to speed up its convergence. These either choose a different search direction (p) or a different step size.
In this course, we will cover updates to standard backpropagation as an overview, namely momentum and variable-rate learning, skipping the other alternatives (those that do not follow steepest descent, such as the conjugate gradient method).
– Remember that you are never responsible for the HIDDEN slides (those that do not show in presentation mode but are visible when you step through the slides!)

43 Variations of Backpropagation
– Momentum: adds a momentum term to effectively increase the step size when successive updates are in the same direction.
– Adaptive learning rate: tries to increase the step size and, if the effect is bad (it causes oscillations, as evidenced by a decrease in performance), reduces it again.
– Other alternatives: Newton's method, conjugate gradient, Levenberg-Marquardt, line search.

44 Motivation for momentum (Bishop 7.5)

45 Effect of momentum
\( \Delta w_{ij}(n) = -\eta\, \partial E/\partial w_{ij}(n) + \alpha\, \Delta w_{ij}(n-1) \)
Unrolling the recursion gives
\( \Delta w_{ij}(n) = -\eta \sum_{t=0}^{n} \alpha^{\,n-t}\, \partial E/\partial w_{ij}(t) \)
If the gradient has the same sign in consecutive iterations, the magnitude of the update grows; if it has opposite signs, the magnitude shrinks. For \( \Delta w_{ij}(n) \) not to diverge, \( \alpha \) must be < 1.
Momentum effectively adds inertia to the motion through weight space and smooths out the oscillations; the closer \( \alpha \) is to 1, the smoother the trajectory (a sketch of the update follows).
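A minimal sketch of this update rule (names are illustrative; the velocity argument carries the previous Δw and starts at zero):

```python
import numpy as np

def momentum_step(w, velocity, gradient, lr=0.1, alpha=0.9):
    """One momentum update: Δw(n) = -η ∂E/∂w(n) + α Δw(n-1)."""
    velocity = alpha * velocity - lr * gradient   # inertia term plus current gradient step
    return w + velocity, velocity

# usage sketch: velocity = np.zeros_like(w) before the first call,
# then w, velocity = momentum_step(w, velocity, gradient) on each iteration.
```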

46 Effect of momentum

47 Effect of momentum (Bishop 7.7)

48 Convergence Example of Backpropagation (trajectory plotted over w^1_{1,1} and w^2_{1,1})

49 Learning Rate Too Large (trajectory plotted over w^1_{1,1} and w^2_{1,1})

50 Momentum Backpropagation (trajectory plotted over w^1_{1,1} and w^2_{1,1})

51 Variable Learning Rate
If the squared error (over the entire training set) decreases after a weight update:
– the weight update is accepted;
– the learning rate is multiplied by some factor greater than 1;
– if the momentum coefficient had previously been set to zero, it is reset to its original value.
If the squared error increases by more than some set percentage after a weight update:
– the weight update is discarded;
– the learning rate is multiplied by some factor between 0 and 1;
– the momentum coefficient is set to zero.
If the squared error increases by less than that percentage, the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
(A sketch of this rule follows.)
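A sketch of this rule, assuming a hypothetical total_squared_error helper and illustrative values for the growth/shrink factors and the allowed error increase:

```python
def variable_rate_step(w, update, lr, momentum, momentum0, prev_err, data,
                       grow=1.05, shrink=0.7, max_ratio=1.04):
    """Accept or reject one weight update and adapt the learning rate accordingly."""
    new_err = total_squared_error(w + update, data)  # hypothetical: error over the whole training set
    if new_err < prev_err:                  # error decreased: accept and speed up
        return w + update, lr * grow, momentum0, new_err
    if new_err > max_ratio * prev_err:      # error grew too much: reject and slow down
        return w, lr * shrink, 0.0, prev_err
    # small increase: accept, keep learning rate and momentum unchanged
    return w + update, lr, momentum, new_err
```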

52 Example (trajectory plotted over w^1_{1,1} and w^2_{1,1})

