1 Neural Networks - Berrin Yanıkoğlu: Applications and Examples, from Mitchell Chp. 4

2 ALVINN drives 70 mph on highways

3 Speech Recognition

4 Hidden Node Functions

10 Head Pose Recognition

11 MLP & Backpropagation Issues

12 Considerations
Network architecture
– Typically feedforward; you may also use local receptive fields for the hidden nodes, and recurrent nodes for sequence learning
Number of input, hidden, and output nodes
– The number of hidden nodes is quite important; the others are determined by the problem setup
Activation functions
– Careful: regression requires a linear activation on the output node
– For other tasks, the sigmoid or hyperbolic tangent is a good choice
Learning rate
– Typically the software adjusts this
A minimal sketch of these choices is given below.
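To make the architecture choices above concrete, here is a minimal NumPy sketch (not from the slides) of a one-hidden-layer feedforward MLP in which the hidden layer uses a sigmoid and the output activation depends on the task; the layer sizes, weight scales, and function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2, task="regression"):
    """Forward pass of a one-hidden-layer feedforward MLP.

    Hidden layer uses a sigmoid; the output is linear for regression
    and sigmoid for (binary) classification."""
    h = sigmoid(W1 @ x + b1)          # hidden activations
    z = W2 @ h + b2                   # output pre-activation
    return z if task == "regression" else sigmoid(z)

# Example: 3 inputs, 5 hidden nodes, 1 output (illustrative sizes)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(5, 3)), np.zeros(5)
W2, b2 = rng.normal(scale=0.1, size=(1, 5)), np.zeros(1)
y = mlp_forward(rng.normal(size=3), W1, b1, W2, b2)
```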

13 Considerations
Preprocessing
– Important (see next slides)
Learning algorithm
– Backpropagation with momentum or Levenberg-Marquardt is suggested
When to stop training
– Important (see next slides)

14 Preprocessing
Input variables should be decorrelated and have roughly equal variance.
Typically, a very simple linear transformation is applied to the input to obtain zero-mean, unit-variance inputs:
$\hat{x}_{pi} = (x_{pi} - \bar{x}_i)/\sigma_i$, where $\sigma_i^2 = \frac{1}{N-1}\sum_{p}(x_{pi} - \bar{x}_i)^2$ and the sum runs over the patterns p.
More complex preprocessing is also commonly done, e.g. principal component analysis.
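A minimal sketch of this standardization step (NumPy; the function name and example values are illustrative). The mean and standard deviation estimated on the training set should be reused unchanged on validation and test data.

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance scaling of each input variable.

    X has shape (N patterns, d inputs); ddof=1 gives the 1/(N-1)
    sample variance used on the slide."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    return (X - mean) / std, mean, std

X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0]])
X_scaled, mean, std = standardize(X)   # reuse mean/std on new data
```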

15 When to stop training
No precise formula:
1a) At a local minimum, the gradient magnitude is 0
– Stop when the gradient is sufficiently small
– You need to calculate the gradient over the whole set of patterns
– You may need to measure the gradient in several directions, to avoid errors caused by numerical instability
1b) A local minimum is a stationary point of the performance index (the error)
– Stop when the absolute change in weights is small
– How to measure? Typically as a rate, e.g. 0.01%
2) We are interested in generalization ability
– Stop when the error on a validation set (our estimate of generalization) starts to increase
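A sketch of the validation-based stopping rule (item 2), assuming a hypothetical network object with `train_one_epoch`, `error`, `get_weights`, and `set_weights` methods; these names are placeholders, not a specific library API.

```python
def train_with_early_stopping(net, train_set, val_set,
                              max_epochs=1000, patience=10):
    """Stop when validation error has not improved for `patience` epochs."""
    best_err, best_weights, wait = float("inf"), net.get_weights(), 0
    for epoch in range(max_epochs):
        net.train_one_epoch(train_set)
        val_err = net.error(val_set)
        if val_err < best_err:
            best_err, best_weights, wait = val_err, net.get_weights(), 0
        else:
            wait += 1
            if wait >= patience:        # validation error keeps rising: stop
                break
    net.set_weights(best_weights)       # roll back to the best model seen
    return net
```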

16 Effects of Sequential versus Batch Mode: Summary
Batch:
– Better estimation of the gradient
Sequential (online):
– Better if data is highly correlated
– Better in terms of local minima (stochastic search)
– Easier to implement
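The two modes differ only in how often the weights are updated. A minimal sketch, with `grad_fn(w, p)` standing in for the backpropagated gradient of the error on pattern p (an assumed helper, not shown on the slides):

```python
import numpy as np

def batch_update(w, grad_fn, patterns, eta):
    """Batch mode: one update from the gradient summed over all patterns."""
    g = sum(grad_fn(w, p) for p in patterns)
    return w - eta * g

def sequential_update(w, grad_fn, patterns, eta, rng):
    """Sequential (online) mode: update after each pattern, in random order.

    rng is a NumPy Generator, e.g. np.random.default_rng()."""
    for idx in rng.permutation(len(patterns)):
        w = w - eta * grad_fn(w, patterns[idx])
    return w
```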

17 Performance Surface: motivation for some of the practical issues

18 Local Minima of the Performance Criterion
– The performance surface is a very high-dimensional space (one dimension per weight, W in total) full of local minima.
– Your best bet using gradient descent is to locate one of the local minima.
– Start the training from different random locations (we will later see how we can make use of several networks trained this way).
– You may also use simulated annealing or genetic algorithms to improve the search in the weight space.
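One simple way to exploit random restarts in practice is a multi-start loop: train from several random initializations and keep the best run. A sketch, with `make_net` and `train` standing in for your own initialization and training routines (assumed helpers, not from the slides):

```python
def train_with_restarts(make_net, train, n_restarts=5):
    """Train from several random initial weight vectors and keep the
    network that reaches the lowest training error."""
    best_net, best_err = None, float("inf")
    for seed in range(n_restarts):
        net = make_net(seed)            # different random starting point
        err = train(net)                # gradient descent from that point
        if err < best_err:
            best_net, best_err = net, err
    return best_net
```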

19 Performance Surface Example (figure: network architecture, nominal function, and parameter values; layer numbers are shown as superscripts)

20 Squared Error vs. w^1_{1,1} and b^1_1 (error surface plot)

21 Squared Error vs. w^1_{1,1} and w^2_{1,1} (error surface plot)

22 Squared Error vs. b^1_1 and b^1_2 (error surface plot)

23 MLP & Backpropagation Summary. The rest of the slides are ADVANCED MATERIAL (read only if you are interested, or if there is something you don't understand). These slides are thanks to John Bullinaria.

24 Gradient Descent Learning Summary:
– The purpose of neural network learning or training is to minimise the output errors on a particular set of training data, by adjusting the network weights w_ij.
– We start by defining an appropriate Error or Cost Function E(w_ij) that "measures" how far the current network is from the desired (correctly trained) one.
– The gradients, given by the partial derivatives of the error function ∂E(w_ij)/∂w_ij, then tell us which direction we need to move in weight space to reduce the error.
– The gradients are multiplied by a learning rate η that specifies the step sizes we take in weight space for each iteration of the weight update equation.
– We keep stepping through weight space until the errors are "small enough".
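Written out, the update rule described in these bullets is the standard gradient-descent step (standard notation, not quoted from a hidden slide):

$$\Delta w_{ij} = -\,\eta\,\frac{\partial E(w_{ij})}{\partial w_{ij}}, \qquad w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$$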

25 If the neuron activation functions have derivatives that take on particularly simple forms, that can make the weight update computations very efficient. These factors lead to powerful learning algorithms for training neural networks.
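For example, the logistic sigmoid is a case where the derivative takes a particularly simple form, since it can be written in terms of the unit's own output (a standard identity, added here for reference):

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$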

26 Practical Considerations for Gradient Descent Learning
The general idea is straightforward, but there remain a number of important questions about training single-layer neural networks that still need to be resolved:
– Do we need to pre-process the training data? If so, how?
– How do we choose the initial weights from which the training is started?
– How do we choose an appropriate learning rate η?
– Should we change the weights after each training pattern, or after the whole set?
– How can we avoid local minima in the error function?
– How can we avoid flat spots in the error function?
– How do we know when we should stop the training?
We shall now consider each of these practical issues in turn.

41 Alternatives to Gradient Descent: ADVANCED MATERIAL (read only if interested)

42 SUMMARY
There are alternatives to standard backpropagation, intended to speed up its convergence. These either choose a different search direction (p) or a different step size.
In this course, we will cover updates to standard backpropagation as an overview, namely momentum and variable learning rate, skipping the other alternatives (those that do not follow steepest descent, such as the conjugate gradient method).
– Remember that you are never responsible for the HIDDEN slides (those that do not show in show mode but are visible when you step through the slides!)

43 Variations of Backpropagation
– Momentum: adds a momentum term to effectively increase the step size when successive updates are in the same direction.
– Adaptive Learning Rate: tries to increase the step size, and backs off if the effect is bad (i.e. it causes oscillations, as evidenced by a decrease in performance).
Other alternatives (not covered):
– Newton's Method
– Conjugate Gradient
– Levenberg-Marquardt
– Line search

44 Motivation for momentum (Bishop 7.5)

45 Effect of momentum
$\Delta w_{ij}(n) = -\eta\,\frac{\partial E}{\partial w_{ij}}(n) + \alpha\,\Delta w_{ij}(n-1)$
which, unrolled over iterations, gives
$\Delta w_{ij}(n) = -\eta \sum_{t=0}^{n} \alpha^{\,n-t}\,\frac{\partial E}{\partial w_{ij}}(t)$
– If the gradient has the same sign in consecutive iterations, the magnitude of the update grows
– If it has opposite signs in consecutive iterations, the magnitude shrinks
– For $\Delta w_{ij}(n)$ not to diverge, $\alpha$ must be < 1
– Momentum effectively adds inertia to the motion through the weight space and smooths out the oscillations; the closer $\alpha$ is to 1, the more past gradients are averaged and the smoother the trajectory
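A minimal NumPy sketch of the momentum update above; the learning rate η = 0.1 and momentum α = 0.9 are illustrative defaults, and the gradients in the usage loop are stand-ins, not backpropagated values.

```python
import numpy as np

def momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
    """One gradient-descent step with momentum.

    velocity accumulates past updates: when successive gradients point
    the same way the effective step grows; when they alternate sign it
    shrinks."""
    velocity = alpha * velocity - eta * grad
    return w + velocity, velocity

w = np.zeros(3)
v = np.zeros(3)
for grad in [np.array([1.0, -0.5, 0.2])] * 10:   # stand-in gradients
    w, v = momentum_step(w, grad, v)
```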

46 Effect of momentum

47 Effect of momentum (Bishop 7.7)

48 Convergence Example of Backpropagation (trajectory plotted over w^1_{1,1} and w^2_{1,1})

49 Learning Rate Too Large (trajectory plotted over w^1_{1,1} and w^2_{1,1})

50 Momentum Backpropagation (trajectory plotted over w^1_{1,1} and w^2_{1,1})

51 Variable Learning Rate
If the squared error (over the entire training set) decreases after a weight update:
– the weight update is accepted
– the learning rate is multiplied by some factor greater than 1
– if the momentum coefficient had previously been set to zero, it is reset to its original value
If the squared error increases by more than some set percentage after a weight update:
– the weight update is discarded
– the learning rate is multiplied by some factor between 0 and 1
– the momentum coefficient is set to zero
If the squared error increases by less than that percentage:
– the weight update is accepted, but the learning rate and the momentum coefficient are left unchanged
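A sketch of one training step under this rule, assuming a hypothetical network object with `error`, `get_weights`, `set_weights`, and `apply_update` methods; the growth/shrink factors and the 4% tolerance are illustrative values, not taken from the slides.

```python
def variable_lr_step(net, data, eta, alpha, alpha0,
                     grow=1.05, shrink=0.7, tol=0.04):
    """One weight update with the adaptive learning-rate rule above."""
    old_err = net.error(data)
    old_w = net.get_weights()
    net.apply_update(eta, alpha)              # tentative backprop step
    new_err = net.error(data)

    if new_err < old_err:                     # error decreased: accept, speed up
        eta *= grow
        if alpha == 0:
            alpha = alpha0                    # restore momentum if it was reset
    elif new_err > old_err * (1 + tol):       # error grew too much: reject, slow down
        net.set_weights(old_w)
        eta *= shrink
        alpha = 0                             # switch momentum off
    # otherwise: accept the step, keep eta and alpha unchanged
    return eta, alpha
```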

52 Example (trajectory plotted over w^1_{1,1} and w^2_{1,1})