CSC321: Neural Networks Lecture 9: Speeding up the Learning


CSC321: Neural Networks Lecture 9: Speeding up the Learning Geoffrey Hinton

The error surface for a linear neuron The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error. It is a quadratic bowl, i.e. the height can be expressed as a function of the weights without using powers higher than 2. Quadratics have constant curvature (because the second derivative is constant). Vertical cross-sections are parabolas; horizontal cross-sections are ellipses. [Figure: the quadratic bowl, with error E plotted over the weight axes w1 and w2.]
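
To make the quadratic-bowl claim concrete, here is a minimal numpy sketch; the toy data and variable names are my own, not from the lecture. It evaluates the squared error of a two-weight linear neuron on a grid: every vertical cross-section of E is a parabola and every contour is an ellipse.

```python
import numpy as np

# Toy data for a linear neuron with two input weights (made-up values).
X = np.array([[1.0, 0.5], [0.2, 1.5], [1.3, 0.7], [0.4, 0.9]])
t = np.array([1.0, 2.0, 1.5, 1.2])

def error(w):
    """Squared error E(w) = sum_n (t_n - w . x_n)^2 for a linear neuron."""
    return np.sum((t - X @ w) ** 2)

# Evaluate E on a grid of (w1, w2): vertical cross-sections are parabolas,
# horizontal cross-sections (contours) are ellipses.
w1s = np.linspace(-2, 4, 61)
w2s = np.linspace(-2, 4, 61)
E = np.array([[error(np.array([a, b])) for b in w2s] for a in w1s])
print(E.shape, E.min())   # the bottom of the bowl is the least-squares solution
```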

Convergence speed The direction of steepest descent does not point at the minimum unless the ellipse is a circle. The gradient is big in the direction in which we only want to travel a small distance, and small in the direction in which we want to travel a large distance. The steepest-descent update Δw = −ε ∂E/∂w is sick: the RHS needs to be multiplied by a term of dimension w^2 to make the dimensions balance (spelled out below).
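
To spell the dimensional point out (a worked restatement under the assumption that ε and E are treated as pure numbers; the formulas are standard, not reproduced from the missing figure):

```latex
% Steepest descent changes each weight in proportion to the gradient:
\[
  \Delta w_i \;=\; -\,\varepsilon\,\frac{\partial E}{\partial w_i}.
\]
% Treating E as a pure number, \partial E/\partial w_i has dimension 1/[w],
% so the right-hand side has dimension 1/[w] instead of [w]: it must be
% multiplied by something of dimension [w]^2.  The inverse curvature
% (\partial^2 E/\partial w_i^2)^{-1} has exactly that dimension, which is the
% correction Newton's method (later slides) supplies:
\[
  \Delta w_i \;=\; -\,\varepsilon\,
     \Bigl(\frac{\partial^2 E}{\partial w_i^2}\Bigr)^{-1}
     \frac{\partial E}{\partial w_i}.
\]
```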

How the learning goes wrong If the learning rate is big, the weight vector sloshes to and fro across the ravine. If the rate is too big, this oscillation diverges. How can we move quickly in directions with small gradients without getting divergent oscillations in directions with big gradients? [Figure: error E against a weight w, showing the oscillation across the ravine.]
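
A tiny numeric illustration of the sloshing; the quadratic and the learning rates are invented for the example, not taken from the lecture. On E = 0.5*(100*w1^2 + w2^2) the gradient along w1 is 100 times steeper, so a rate that is safe for w2 makes w1 oscillate and then diverge.

```python
import numpy as np

# Elongated quadratic "ravine": steep along w1, gentle along w2 (toy example).
curv = np.array([100.0, 1.0])            # second derivatives along each axis

def run(lr, steps=50):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        grad = curv * w                   # dE/dw for E = 0.5 * sum(curv * w**2)
        w = w - lr * grad
    return w

print(run(0.009))   # safe for the steep axis, but painfully slow on the gentle one
print(run(0.015))   # steep axis oscillates: multiplier 1 - 0.015*100 = -0.5
print(run(0.021))   # steep axis diverges:   |1 - 0.021*100| = 1.1 > 1
```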

Five ways to speed up learning
1. Use an adaptive global learning rate: increase the rate slowly if it's not diverging; decrease the rate quickly if it starts diverging (a sketch follows this list).
2. Use a separate adaptive learning rate on each connection: adjust it using the consistency of the gradient on that weight axis.
3. Use momentum: instead of using the gradient to change the position of the weight “particle”, use it to change the velocity.
4. Use a stochastic estimate of the gradient from a few cases: this works very well on large, redundant datasets.
5. Don't go in the direction of steepest descent: the gradient does not point at the minimum. Can we preprocess the data or do something to the gradient so that we move directly towards the minimum?
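
A minimal sketch of the first idea, the adaptive global learning rate. The growth factor 1.05, the cut factor 0.5, and the "loss went up" test for divergence are my own choices; the slide only says increase slowly, decrease quickly.

```python
import numpy as np

def train_with_adaptive_global_rate(grad_fn, loss_fn, w, lr=0.01, steps=100):
    """Grow the global rate slowly while the loss keeps falling;
    cut it quickly as soon as the loss rises (a sign of divergence)."""
    prev_loss = loss_fn(w)
    for _ in range(steps):
        w = w - lr * grad_fn(w)
        loss = loss_fn(w)
        if loss < prev_loss:
            lr *= 1.05        # not diverging: increase the rate slowly
        else:
            lr *= 0.5         # diverging: decrease the rate quickly
        prev_loss = loss
    return w, lr

# Usage on a toy elongated quadratic (hypothetical functions):
curv = np.array([100.0, 1.0])
w, lr = train_with_adaptive_global_rate(
    grad_fn=lambda w: curv * w,
    loss_fn=lambda w: 0.5 * np.sum(curv * w ** 2),
    w=np.array([1.0, 1.0]))
print(w, lr)          # the weights it reached and the rate it settled on
```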

The momentum method Imagine a ball on the error surface with velocity v. It starts off by following the gradient, but once it has velocity, it no longer does steepest descent. It damps oscillations by combining gradients with opposite signs. It builds up speed in directions with a gentle but consistent gradient. On an inclined plane it reaches a terminal velocity.
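
A sketch of the momentum update described above; the coefficient 0.9 and the learning rate are typical illustrative choices, not values from the slide.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, momentum=0.9):
    """One momentum update: the gradient changes the velocity of the weight
    'particle', and the velocity changes the position."""
    v = momentum * v - lr * grad     # old velocity decays; the gradient accelerates it
    w = w + v
    return w, v

# On an "inclined plane" (constant gradient g) the velocity approaches the
# terminal value -lr * g / (1 - momentum), i.e. an effective 10x larger step here.
w, v = np.zeros(2), np.zeros(2)
g = np.array([1.0, -0.5])
for _ in range(200):
    w, v = momentum_step(w, v, g)
print(v, -0.01 * g / (1 - 0.9))      # the velocity is close to the terminal value
```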

Adaptive learning rates on each connection Use a global learning rate multiplied by a local gain on each connection. Increase the local gains if the gradient does not change sign. Use additive increases and multiplicative decreases. This ensures that big learning rates decay rapidly when oscillations start.
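
A sketch of the per-connection gains. The additive step 0.05 and multiplicative factor 0.95 are illustrative; the slide only specifies additive increases and multiplicative decreases.

```python
import numpy as np

def gain_update(gains, grad, prev_grad, up=0.05, down=0.95):
    """Additive increase when the gradient keeps its sign on a weight axis,
    multiplicative decrease when it flips (oscillation starting)."""
    same_sign = grad * prev_grad > 0
    return np.where(same_sign, gains + up, gains * down)

def adaptive_step(w, grad, prev_grad, gains, global_lr=0.01):
    gains = gain_update(gains, grad, prev_grad)
    w = w - global_lr * gains * grad        # global rate times local gain
    return w, gains

# Example: gains grow where the gradient is consistent, shrink where it flips.
gains = np.ones(3)
gains = gain_update(gains, np.array([0.2, -0.1, 0.3]), np.array([0.1, 0.2, 0.3]))
print(gains)   # [1.05, 0.95, 1.05]
```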

Online versus batch learning Online learning updates the weights after each training case. It zig-zags around the direction of steepest descent. Batch learning does steepest descent on the error surface. [Figure: two weight-space plots (axes w1, w2) showing the constraint lines from training case 1 and training case 2.]

Stochastic gradient descent If the dataset is highly redundant, the gradient on the first half is almost identical to the gradient on the second half. So instead of computing the full gradient, update the weights using the gradient on the first half and then get a gradient for the new weights on the second half. The extreme version is to update the weights after each example, but balanced mini-batches are just as good and are faster in Matlab (they make much better use of matrix operations than per-case updates do).
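
A sketch of the mini-batch version for a linear neuron. Numpy stands in for the Matlab matrix operations the slide alludes to, and the batch size, learning rate, and toy data are arbitrary choices of mine.

```python
import numpy as np

def minibatch_sgd(X, t, batch_size=10, lr=0.01, epochs=20):
    """Update the weights from the gradient on each small batch instead of
    waiting for the full gradient over the whole (redundant) dataset."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = np.random.permutation(n)          # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w - t[idx]             # residuals on this batch
            grad = X[idx].T @ err / len(idx)      # gradient of 0.5 * mean squared error
            w -= lr * grad
    return w

# Highly redundant toy data: many noisy copies of the same linear relation.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=1000)
print(minibatch_sgd(X, t))                        # approaches [1, -2, 0.5]
```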

Newton’s method Newton did not have a computer. So he had a lot of motivation to find efficient numerical methods. The basic problem is that the gradient is not the direction we want to go in. If the error surface had circular cross-sections, the gradient would be fine. So let’s apply a linear transformation that turns ellipses into circles. The covariance matrix of the input vectors determines how elongated the ellipse is.
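
A standard expansion (my own working, with the conventional 1/2 in front of the squared error) showing why the input covariance sets the shape of the ellipse:

```latex
% For a linear neuron with squared error, the weight dependence is quadratic,
% and the quadratic term is set by the second-moment matrix of the inputs:
\[
  E(\mathbf{w}) \;=\; \tfrac12 \sum_n \bigl(t_n - \mathbf{w}^{\top}\mathbf{x}_n\bigr)^2
  \;=\; \tfrac12\,\mathbf{w}^{\top}\!\Bigl(\sum_n \mathbf{x}_n\mathbf{x}_n^{\top}\Bigr)\mathbf{w}
        \;-\; \Bigl(\sum_n t_n\mathbf{x}_n\Bigr)^{\!\top}\mathbf{w}
        \;+\; \tfrac12\sum_n t_n^2 .
\]
% The matrix \sum_n x_n x_n^T is (up to centring and scaling) the covariance of
% the inputs, so elongated input covariance means elongated elliptical contours.
```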

Covariance Matrices For each pair of input dimensions we can compute the covariance of their values over a set of data vectors. These covariances can be arranged in a symmetric matrix. The terms along the diagonal are the variances of the individual input dimensions. [Figure: a symmetric matrix with rows and columns indexed by input dimensions i, j, k.]
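
For reference, the standard definition the slide is describing (notation mine):

```latex
% Covariance of input dimensions i and j over N data vectors x^{(n)}:
\[
  \Sigma_{ij} \;=\; \frac{1}{N}\sum_{n=1}^{N}
     \bigl(x_i^{(n)} - \bar{x}_i\bigr)\bigl(x_j^{(n)} - \bar{x}_j\bigr),
  \qquad
  \bar{x}_i \;=\; \frac{1}{N}\sum_{n=1}^{N} x_i^{(n)} .
\]
% \Sigma is symmetric (\Sigma_{ij} = \Sigma_{ji}) and \Sigma_{ii} is the
% variance of input dimension i, as the slide says.
```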

Fixing up the error surface So multiply each input vector by the inverse of the covariance matrix. The preprocessed input vectors will now have uncorrelated components, so the error surface will have circular cross-sections. Suppose that inputs i and j are highly correlated. Changing wi will change the error in the same way as changing wj, so changes in wi will have a big effect on the gradient along wj. [Figure: an elongated elliptical error contour over the weight axes i and j.]
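
A sketch of the preprocessing. The slide says to multiply by the inverse of the covariance matrix; a common concrete recipe that makes the components exactly uncorrelated (with unit variance) is to multiply by the inverse symmetric square root of the covariance, which is what this sketch does, so treat it as one possible reading rather than the slide's exact prescription.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Decorrelate the input dimensions by multiplying each (centred) input
    vector by the inverse symmetric square root of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    vals, vecs = np.linalg.eigh(cov)                   # cov = V diag(vals) V^T
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ inv_sqrt

# Strongly correlated toy inputs become uncorrelated after preprocessing, so
# the error surface of a linear neuron trained on them has circular cross-sections.
rng = np.random.default_rng(1)
z = rng.normal(size=(500, 2))
X = z @ np.array([[1.0, 0.9], [0.0, 0.4]])             # correlated inputs
Xw = whiten(X)
print(np.round(Xw.T @ Xw / len(Xw), 3))                # approximately the identity
```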

Curvature Matrices (optional material!) Each element in the curvature matrix specifies how the gradient in one direction changes as we move in some other direction. For a linear network with a squared error, the curvature matrix of the error, E, is the covariance matrix of the inputs. The reason steepest descent goes wrong is that the ratio of the magnitudes of the gradients for different weights changes as we move down the gradient. [Figure: a curvature matrix with rows and columns indexed by the weights i, j, k.]
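
A short check of the claim that, for a linear network with squared error, the curvature matrix is the covariance matrix of the inputs (my working; strictly it gives the uncentred second-moment version):

```latex
% For E = (1/2) \sum_n (t_n - w^T x_n)^2 :
\[
  \frac{\partial E}{\partial w_i}
    = -\sum_n \bigl(t_n - \mathbf{w}^{\top}\mathbf{x}_n\bigr)\,x_i^{(n)},
  \qquad
  H_{ij} \;=\; \frac{\partial^2 E}{\partial w_i\,\partial w_j}
          \;=\; \sum_n x_i^{(n)} x_j^{(n)} .
\]
% The off-diagonal terms H_{ij} are exactly the coupling between correlated
% inputs discussed on the previous slide: moving along w_j changes the
% gradient along w_i by an amount proportional to how correlated i and j are.
```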

Another, more general way to fix up the error surface We can leave the error surface alone, but apply the correction to the gradient vector. We multiply the vector of gradients by the inverse of the curvature matrix. This produces a direction that points straight at the minimum for a quadratic surface. The curvature matrix has too many terms to be of use in a big network. Maybe we can get some benefit from just using the terms along the leading diagonal (Le Cun).
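
A sketch of correcting the gradient rather than the inputs, for the linear-neuron case where the curvature matrix is just X^T X. The diagonal variant keeps only the leading-diagonal curvature terms, in the spirit of the Le Cun suggestion; the toy data and function names are my own.

```python
import numpy as np

def gradient_and_curvature(X, t, w):
    """Gradient and curvature matrix of E = 0.5 * sum((t - X w)^2)."""
    grad = X.T @ (X @ w - t)
    H = X.T @ X
    return grad, H

def newton_direction(grad, H):
    """Full correction: multiply the gradient vector by the inverse curvature
    matrix.  For a quadratic surface this points straight at the minimum."""
    return -np.linalg.solve(H, grad)

def diagonal_direction(grad, H):
    """Cheap variant: divide each gradient component by its own diagonal
    curvature term, ignoring the off-diagonal interactions."""
    return -grad / np.diag(H)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
t = X @ np.array([0.5, -1.0, 2.0])
w = np.zeros(3)
grad, H = gradient_and_curvature(X, t, w)
print(w + newton_direction(grad, H))      # one full Newton step lands on [0.5, -1, 2]
print(w + diagonal_direction(grad, H))    # the diagonal step only approximates it
```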

Extra problems that occur in multilayer non-linear networks If we start with a big learning rate, the bias and all of the weights for one of the output units may become very negative. The output unit is then very firmly off (saturated), so it will never produce a significant error derivative and will never recover (unless we have weight-decay). In classification networks that use a squared error, the best guessing strategy is to make each output unit produce an output equal to the proportion of the time its target is a 1. The network finds this strategy quickly and then takes a long time to improve on it, so it looks like a local minimum.
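
A one-line check of the guessing-strategy claim (my worked example, not from the slide):

```latex
% If a unit's target is 1 on a fraction p of the cases and 0 otherwise, a
% constant output y gives squared error
\[
  E(y) \;=\; p\,(1-y)^2 + (1-p)\,y^2,
  \qquad
  \frac{dE}{dy} \;=\; -2p(1-y) + 2(1-p)\,y \;=\; 0
  \;\;\Rightarrow\;\; y = p .
\]
% e.g. a class that occurs 20% of the time is best "guessed" with a constant
% output of 0.2, and the network reaches this plateau quickly.
```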