How to do backpropagation in a brain

Presentation transcript:

How to do backpropagation in a brain Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto Google Inc.

Prelude I will start with three slides explaining a popular type of deep learning. It is this kind of deep learning that makes backpropagation easy to implement.

Pre-training a deep network First train a layer of features that receive input directly from the pixels. The features are trained to be good at reconstructing the pixels. Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer; these are trained to be good at reconstructing the activities in the first hidden layer. Each time we add another layer of features we capture more complex, longer-range regularities in the ensemble of training images.
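To make the greedy, layer-by-layer procedure concrete, here is a minimal NumPy sketch that pre-trains a stack of tied-weight sigmoid auto-encoders. The particular architecture (tied weights, cross-entropy reconstruction error, plain SGD) and all hyper-parameters are illustrative assumptions, not the exact setup used in the talk.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder_layer(X, n_hidden, epochs=10, lr=0.1):
    # Train one tied-weight sigmoid auto-encoder on the rows of X.
    n_visible = X.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    for _ in range(epochs):
        for x in X:                                # plain SGD, one case at a time
            h = sigmoid(x @ W + b_h)               # encode
            x_hat = sigmoid(h @ W.T + b_v)         # reconstruct with the same (tied) weights
            d_v = x_hat - x                        # error derivative at the reconstruction
            d_h = (d_v @ W) * h * (1.0 - h)        # backpropagate one step to the code
            W -= lr * (np.outer(x, d_h) + np.outer(d_v, h))
            b_v -= lr * d_v
            b_h -= lr * d_h
    return W, b_h

def pretrain_stack(X, layer_sizes):
    # Greedily train each layer to reconstruct the activities of the layer below.
    weights, acts = [], X
    for n_hidden in layer_sizes:
        W, b_h = train_autoencoder_layer(acts, n_hidden)
        weights.append((W, b_h))
        acts = sigmoid(acts @ W + b_h)             # treat the new features as the next layer's "pixels"
    return weights

# e.g. pretrain_stack(images, [500, 500, 2000]) for a three-hidden-layer stack (sizes are arbitrary here).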

Discriminative fine-tuning First train multiple hidden layers greedily to be good auto-encoders. This is unsupervised learning. Then connect some classification units to the top layer of features and do back-propagation through all of the layers to fine-tune all of the feature detectors. On a dataset of handwritten digits called MNIST, this worked much better than standard back-propagation and better than Support Vector Machines (2006). On a dataset of spoken sentences called TIMIT, it beat the state of the art and led to a major shift in the way speech recognition is done (2009).
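A matching sketch of the fine-tuning stage, again in NumPy and again only illustrative: a softmax classifier is attached to the top layer of the (W, b) pairs produced by the pre-training sketch above, and the error is back-propagated through all of the feature layers. The function name and layer format are assumptions of this sketch.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def finetune_step(x, label_onehot, weights, W_out, b_out, lr=0.01):
    # Forward pass through the pre-trained feature layers, then the classifier.
    acts = [x]
    for W, b in weights:
        acts.append(sigmoid(acts[-1] @ W + b))
    p = softmax(acts[-1] @ W_out + b_out)

    # Backward pass. Softmax + cross-entropy gives delta = p - target at the output.
    delta_out = p - label_onehot
    grad_W_out = np.outer(acts[-1], delta_out)
    delta = (delta_out @ W_out.T) * acts[-1] * (1.0 - acts[-1])
    W_out -= lr * grad_W_out
    b_out -= lr * delta_out

    # Back-propagate through every feature layer, slightly adjusting all the features.
    for l in range(len(weights) - 1, -1, -1):
        W, b = weights[l]
        grad_W = np.outer(acts[l], delta)
        grad_b = delta
        if l > 0:
            delta = (delta @ W.T) * acts[l] * (1.0 - acts[l])
        W -= lr * grad_W
        b -= lr * grad_b
    return weights, W_out, b_out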

Why does pre-training followed by fine-tuning work so well? Greedily learning one layer at a time scales well to really big networks, especially if we have locality in each layer. We do not start backpropagation until we already have sensible features in each layer. So the initial gradients are sensible and back-propagation only needs to perform a local search. Most of the information in the final weights comes from modeling the distribution of input vectors. The precious information in the labels is only used for the final fine-tuning. It slightly modifies the features. It does not need to discover features. So we can do very well when most of the training data is unlabelled.

But how can the brain back-propagate through a multilayer neural network? Some very good researchers have postulated inefficient algorithms that use random perturbations. Do you really believe that evolution could not find an efficient way to adapt a feature so that it is more useful to higher-level features in the same sensory pathway? (have faith!)

Three obvious reasons why the brain cannot be doing backpropagation Cortical neurons do not communicate real-valued activities: they send spikes. The neurons would need to send two different types of signal (forward pass: signal = activity = y; backward pass: signal = dE/dx). Neurons do not have point-wise reciprocal connections with the same weight in both directions.

Small data: A good reason for spikes Synapses are much cheaper than training cases. We have 10^14 synapses and live for 10^9 seconds. A good way to throw a lot of parameters at a task is to use big neural nets with strong, zero-mean noise in the activities. Noise in the activities has the same regularization advantages as averaging big ensembles of models but makes much more efficient use of hardware. In the small data regime, noise is good so sending random spikes from a Poisson process is better than sending real values. Poisson noise is special because it is exactly neutral about the sparsity of the codes. Multiplicative noise penalizes sparse codes.

A way to simplify the explanations Let's ignore the Poisson noise for now: we will pretend that neurons communicate real analog values. Once we have understood how to do backprop in a brain, we can treat these real analog values as the underlying rates of a Poisson process. We will get the same expected value for the derivatives from the Poisson spikes, but with added noise. Stochastic gradient descent is very robust to added noise so long as the noise is not biased.
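A quick numerical illustration of the "same expected value, extra noise" point, using an arbitrary made-up rate:

import numpy as np

rng = np.random.default_rng(0)

rate = 0.37                  # hypothetical underlying real-valued activity (spikes per second)
dt, T = 0.001, 1000.0        # bin width and total observation time, in seconds
spikes = rng.poisson(rate * dt, size=int(T / dt))   # Poisson spike counts in each bin

print(spikes.sum() / T)      # about 0.37: the spike train is an unbiased but noisy code for the rate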

A way to encode error derivatives Consider a logistic output unit j with a cross-entropy error function. Let x_j be the total input to j, t_j the target value, and y_j the output probability when j is driven bottom-up; then the derivative of the error w.r.t. the total input to j is dE/dx_j = y_j - t_j. Suppose we start with the pure bottom-up output y_j and then take a weighted average of the target value and the bottom-up output, making the weight on the target value grow linearly with time.

A fundamental representational decision: temporal derivatives represent error derivatives This allows the rate of change of the blended output to represent the error derivative w.r.t. the neuron's input: dy_j/dt is proportional to t_j - y_j = -dE/dx_j. This allows the same neuron to code both the normal activity and the error derivative (for a limited time).
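A tiny numerical check of this coding scheme, with made-up values for the output and the target:

y = 0.8              # bottom-up output of the logistic unit
t = 1.0              # target value
dE_dx = y - t        # logistic output + cross-entropy error: dE/dx_j = y_j - t_j

def blended(tau):
    # Weighted average of the bottom-up output and the target; the weight on the
    # target grows linearly with time tau.
    return (1.0 - tau) * y + tau * t

d_tau = 1e-6
dy_dt = (blended(d_tau) - blended(0.0)) / d_tau   # temporal derivative at tau = 0
print(dy_dt, -dE_dx)                              # both 0.2: the temporal derivative codes -dE/dx_j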

The payoff In a pre-trained stack of auto-encoders, this way of representing error derivatives makes back-propagation through multiple layers of neurons happen automatically.

If the auto-encoder is perfect, replacing the bottom-up input to unit i by the top-down input it receives from the units above it (j and k) has no effect on the output of i. If we then start moving the activities of j and k towards their target values, the rate of change of the top-down input to i comes to represent the error derivative w.r.t. i's output, which (given symmetric weights) is exactly what backpropagation would deliver to i.

The synaptic update rule First do an upward (forward) pass as usual. Then do top-down reconstructions at each level. Then perturb the top-level activities by blending them with the target values, so that the rate of change of activity of a top-level unit represents the derivative of the error w.r.t. the total input to that unit. This will make the activity changes at every level represent error derivatives. Then update each synapse in proportion to: pre-synaptic activity × rate-of-change of post-synaptic activity.
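Here is one way the whole recipe could look for a small pre-trained stack, as a NumPy sketch. It assumes symmetric weights (each bottom-up matrix W is reused top-down as W.T), uses the difference between perturbed and unperturbed activities as the "rate of change", and the (W, b_up, b_down) layer format is just an assumption of this sketch, not Hinton's exact implementation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def synaptic_update_step(x, target, weights, lr=0.01, blend=0.1):
    # `weights` is a list of (W, b_up, b_down); W is used bottom-up as acts @ W
    # and top-down as acts @ W.T (assumed symmetric).

    # 1. Upward (forward) pass as usual.
    acts = [x]
    for W, b_up, _ in weights:
        acts.append(sigmoid(acts[-1] @ W + b_up))

    # 2. Blend the top-level activities with the target values, then propagate the
    #    perturbed activities back down through the top-down connections. If the
    #    stack is a good auto-encoder, the change at each level (perturbed minus
    #    unperturbed) represents the error derivative there.
    perturbed = [None] * len(acts)
    perturbed[-1] = (1.0 - blend) * acts[-1] + blend * target
    for l in range(len(weights) - 1, 0, -1):
        W, _, b_down = weights[l]
        perturbed[l] = sigmoid(perturbed[l + 1] @ W.T + b_down)

    # 3. Update each synapse: pre-synaptic activity x rate of change of post-synaptic activity.
    for l, (W, b_up, _) in enumerate(weights):
        d_post = perturbed[l + 1] - acts[l + 1]
        W += lr * np.outer(acts[l], d_post)
        b_up += lr * d_post
    return weights

The weights here would come from the pre-training stage sketched earlier, with separate top-down biases added.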

If this is what is happening, what should neuroscientists see? They should see spike-time-dependent plasticity: a curve of weight change as a function of the relative time of the post-synaptic spike. Spike-time-dependent plasticity is just a derivative filter. You need a computational theory to recognize what you have discovered!

An obvious prediction For the top-down weights to stay symmetric with the bottom-up weights, their learning rule should be: rate-of-change of pre-synaptic activity × post-synaptic activity.

A problem (this is where the waffle starts) This way of performing backpropagation requires symmetric weights. But auto-encoders can still be trained if we first split each symmetric connection into two oppositely directed connections and then randomly remove many of these directed connections.

Functional symmetry If the representations are highly redundant, the state of a hidden unit can be estimated very well from the states of the other hidden units in the same layer that have similar receptive fields. So top-down connections from these other correlated units could learn to mimic the effect of the missing top-down part of a symmetric connection. All we require is functional symmetry on and near the data manifold.

But what about the backpropagation required to learn the stack of autoencoders? One-step Contrastive Divergence (CD) learning was initially viewed as a poor approximation to running a Markov chain to equilibrium; the equilibrium statistics are needed for maximum likelihood learning. But a better view is that it is a neat way of doing gradient descent to learn an auto-encoder when the hidden units are stochastic with discrete states. It uses temporal derivatives as error derivatives.

Contrastive divergence learning: A quick way to learn an RBM Start with a training vector on the visible units (the "data", t = 0). Update all the hidden units in parallel. Update all the visible units in parallel to get a "reconstruction" (t = 1). Update all the hidden units again. Then change each weight in proportion to the difference between the pairwise statistics measured on the data and on the reconstruction: Δw_ij ∝ <v_i h_j>_data - <v_i h_j>_reconstruction. This is not following the gradient of the log likelihood, but it works well; it is approximately following the gradient of another objective function.
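For concreteness, a minimal NumPy sketch of one CD-1 update for a binary RBM; using probabilities rather than sampled states for the final statistics, and the particular learning rate, are common choices assumed here rather than dictated by the slide.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_v, b_h, lr=0.05):
    # Up: drive the hidden units from the data and sample binary states.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # Down: get a "reconstruction" of the visible units.
    p_v1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)

    # Up again: hidden probabilities driven by the reconstruction.
    p_h1 = sigmoid(v1 @ W + b_h)

    # CD-1 update: <v h>_data - <v h>_reconstruction
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_v += lr * (v0 - v1)
    b_h += lr * (p_h0 - p_h1)
    return W, b_v, b_h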

One-step CD is just backpropagation learning of an auto-encoder using temporal derivatives The equations on this slide compare the CD weight update with the true gradient for the auto-encoder: the two agree to first order, and the terms in which they differ are negligible to first order.

THE END