CSC321: Neural Networks Lecture 22 Learning features one layer at a time Geoffrey Hinton

Learning multi-layer networks

We want to learn models with multiple layers of non-linear features.

- Perceptrons: use a layer of hand-coded, non-adaptive features followed by a layer of adaptive decision units. They need a supervision signal for each training case, and there is only one layer of adaptive weights.
- Back-propagation: uses multiple layers of adaptive features and trains them by back-propagating error derivatives. Learning time scales poorly for deep networks.
- Support Vector Machines: use a very large set of fixed features, and do not learn multiple layers of features.

Learning multi-layer networks (continued)

- Boltzmann Machines: a nice local learning rule that works in arbitrary networks. However, getting the pairwise statistics required for learning can be very slow if the hidden units are interconnected, and maximum likelihood learning requires unbiased samples from the model's distribution, which are very hard to get.
- Restricted Boltzmann Machines: exact inference is very easy when the states of the visible units are known, because the hidden states are then independent. Maximum likelihood learning is still slow because of the need to get samples from the model, but contrastive divergence learning is fast and often works well. The catch is that we can only learn one layer of adaptive features!
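For concreteness, here is a minimal NumPy sketch of a single contrastive-divergence (CD-1) update for a binary RBM. The function and parameter names (cd1_update, lr, and so on) are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b_vis, b_hid, v0, lr=0.1):
    """One CD-1 step on a batch of binary visible vectors v0 (shape: batch x n_visible)."""
    # Positive phase: sample hidden states given the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one step of alternating Gibbs sampling (the "reconstruction").
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_hid)
    # Approximate gradient: <v h>_data minus <v h>_reconstruction.
    batch = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / batch
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid
```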

Recursive Restricted Boltzmann Machines

First learn a layer of hidden features. Then treat the feature activations as data and learn a second layer of hidden features. And so on, for as many hidden layers as we want.

[Figure: RBM1 is trained on the data to give the first layer of features; RBM2 is then trained on the activities of the first layer of features, giving the second layer of features.]
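A hedged sketch of this greedy recipe, assuming a train_rbm(data, n_hidden) helper along the lines of the CD-1 update above (that helper and the layer sizes are assumptions for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_stack(data, layer_sizes, train_rbm):
    """Greedily train one RBM per hidden layer; returns the learned parameters."""
    params = []
    x = data
    for n_hidden in layer_sizes:                  # e.g. [500, 500, 2000]
        W, b_vis, b_hid = train_rbm(x, n_hidden)  # fit an RBM to the current "data"
        params.append((W, b_vis, b_hid))
        x = sigmoid(x @ W + b_hid)                # feature activities become the next data
    return params
```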

Recursive Restricted Boltzmann Machines

Is learning a model of the hidden activities just a hack? It does not seem as if we are learning a proper multi-layer model, because the lower weights do not depend on the higher ones.

Can we treat the hidden layers of the whole stack of RBMs as part of one big generative model, rather than a model plus a model of a model, etc.? If it is one big model, it is definitely not a Boltzmann machine: the first hidden layer would have two sets of weights (above and below), which would make its activities very different from the activities of those units in either of the RBMs.

The overall model produced by composing two RBMs

[Figure: the stacked model. RBM1, with weights W1, connects the data to the first layer of features; RBM2, with weights W2, connects the first layer of features to the second layer of features, where the "data" for RBM2 is the activities of the first layer of features.]

The generative model

[Figure: a stack of layers h3, h2, h1 above the data; the top two layers (h2 and h3) form an RBM, and generation proceeds top-down through the lower layers to the data.]

To generate data:
- Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
- Perform a single top-down pass to get states for all the other layers.

The lower-level, bottom-up connections are not part of the generative model. They are there to do fast approximate inference.
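A minimal sketch of that generative procedure, assuming the stacked parameters from the recipe above (array shapes, names, and the number of Gibbs iterations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

def generate(top_W, b_low, b_high, down_layers, n_gibbs=1000):
    """Sample a data vector: prolonged Gibbs sampling in the top RBM, then one top-down pass."""
    h_low = sample(np.full(b_low.shape, 0.5))            # arbitrary start for the lower of the top two layers
    for _ in range(n_gibbs):                             # alternating Gibbs sampling in the top-level RBM
        h_high = sample(sigmoid(h_low @ top_W + b_high))
        h_low = sample(sigmoid(h_high @ top_W.T + b_low))
    x = h_low
    for W, b in down_layers:                             # single top-down pass through the directed layers
        x = sample(sigmoid(x @ W + b))
    return x
```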

Why does stacking RBMs produce this kind of generative model?

It is not at all obvious that stacking RBMs produces a model in which the top two layers of features form an RBM while the layers beneath are not at all like a Boltzmann machine. To understand why this happens, we need to ask how an RBM defines a probability distribution over visible vectors.

How an RBM defines the probabilities of hidden and visible vectors

The weights in an RBM define p(h|v) and p(v|h) in a very straightforward way (let's ignore biases for now):
- To sample from p(v|h), sample the binary state of each visible unit from a logistic that depends on its weight vector times h.
- To sample from p(h|v), sample the binary state of each hidden unit from a logistic that depends on its weight vector times v.

If we use these two conditional distributions to do alternating Gibbs sampling for a long time, we can sample from p(v) or p(h), i.e. from the model's distribution over the visible or hidden units.
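Written out directly (a sketch; biases are ignored, as in the text, and the matrix shapes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(W, v):
    """Sample p(h|v): each hidden unit is a logistic of its weight vector times v."""
    p = sigmoid(v @ W)            # W has shape (n_visible, n_hidden)
    return (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(W, h):
    """Sample p(v|h): each visible unit is a logistic of its weight vector times h."""
    p = sigmoid(h @ W.T)
    return (rng.random(p.shape) < p).astype(float)

# Alternating these two samplers for a long time yields samples from the model's
# distribution over visible vectors (and over hidden vectors).
```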

Why does layer-by-layer learning work?

The weights, W, in the bottom-level RBM define p(v|h), and they also, indirectly, define p(h). So we can express the RBM model as

    p(v) = Σ_h p(h) p(v|h)

where p(v|h) is the conditional probability, p(h) p(v|h) is the joint probability, and the index h runs over all hidden vectors.

If we leave p(v|h) alone and build a better model of p(h), we will improve p(v). What we need is a better model of the posterior hidden vectors produced by applying W to the data.
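To make the notation concrete, here is a tiny brute-force check of that decomposition for a 3-visible, 2-hidden binary RBM with energy E(v, h) = -vᵀWh (biases omitted); everything here is illustrative:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))                       # 3 visible units, 2 hidden units

vs = [np.array(v, float) for v in itertools.product([0, 1], repeat=3)]
hs = [np.array(h, float) for h in itertools.product([0, 1], repeat=2)]

joint = np.array([[np.exp(v @ W @ h) for h in hs] for v in vs])  # unnormalised p(v, h)
joint /= joint.sum()                                             # divide by the partition function

p_h = joint.sum(axis=0)                 # p(h): marginal over hidden vectors
p_v_given_h = joint / p_h               # p(v|h): one column per hidden vector
p_v_direct = joint.sum(axis=1)          # p(v) obtained by summing out h

# p(v) recovered through the decomposition p(v) = sum_h p(h) p(v|h):
p_v_decomposed = (p_v_given_h * p_h).sum(axis=1)
assert np.allclose(p_v_direct, p_v_decomposed)
```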

An analogy

In a mixture model, we define the probability of a data vector to be

    p(v) = Σ_h π_h p(v|h)

where π_h is the mixing proportion of Gaussian h, p(v|h) is the probability of v given Gaussian h, and the index h runs over all Gaussians. The learning rule for the mixing proportions is to make them match the posterior probability of using each Gaussian.

The weights of an RBM implicitly define a mixing proportion for each possible hidden vector. To fit the data better, we can leave p(v|h) the same and make the mixing proportion of each hidden vector more like the posterior over hidden vectors.

A guarantee

Can we prove that adding more layers will always help? It would be very nice if we could learn a big model one layer at a time and guarantee that the model gets better as we add each new hidden layer.

We can actually guarantee the following: there is a lower bound on the log probability of the data, and provided that the layers do not get smaller and the weights are initialized correctly (which is easy), every time we learn a new hidden layer this bound is improved (unless it is already maximized). The derivation of this guarantee is quite complicated.
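For reference, one standard way to write such a bound (a sketch of the usual variational argument, not the derivation from the lecture): let Q(h|v) be the approximate posterior obtained by applying the first RBM's weights bottom-up, and H(·) its entropy.

```latex
% Variational lower bound on the log probability of a data vector v.
% Freezing p(v|h) and Q(h|v) while replacing the prior p(h) with a better model of
% the aggregated posterior (the next RBM in the stack) cannot decrease this bound.
\log p(v) \;\ge\; \sum_{h} Q(h \mid v)\,\bigl[\log p(h) + \log p(v \mid h)\bigr] + H\bigl(Q(h \mid v)\bigr)
```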

Back-fitting

After we have learned all the layers greedily, the weights in the lower layers will no longer be optimal. They can be fine-tuned in several ways:
- For the generative model that comes next, the fine-tuning involves a complicated and slow stochastic learning procedure.
- If our ultimate goal is discrimination, we can use backpropagation for the fine-tuning (a sketch follows below).
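A hedged sketch of the discriminative option, using PyTorch purely for illustration: the greedily learned RBM weights initialise a feed-forward net, a fresh output layer is added, and the whole stack is then trained with backpropagation. The helper name and argument shapes are assumptions.

```python
import torch
import torch.nn as nn

def build_finetune_net(rbm_weights, rbm_hidden_biases, n_classes=10):
    """Turn a stack of RBM parameters into a feed-forward net ready for backprop fine-tuning."""
    layers = []
    for W, b_hid in zip(rbm_weights, rbm_hidden_biases):   # W assumed shape (n_in, n_out)
        linear = nn.Linear(W.shape[0], W.shape[1])
        with torch.no_grad():
            linear.weight.copy_(torch.as_tensor(W.T, dtype=torch.float32))
            linear.bias.copy_(torch.as_tensor(b_hid, dtype=torch.float32))
        layers += [linear, nn.Sigmoid()]
    layers.append(nn.Linear(rbm_weights[-1].shape[1], n_classes))  # new, randomly initialised output layer
    return nn.Sequential(*layers)

# net = build_finetune_net(weights, hidden_biases)
# Training net with nn.CrossEntropyLoss() and any optimiser back-propagates through the
# pre-trained layers as well, fine-tuning them for discrimination.
```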

A neural network model of digit recognition

[Figure: the architecture. A 28 x 28 pixel image feeds into a layer of 500 units, then another layer of 500 units; these, together with 10 label units, connect to 2000 top-level units.]

The top two layers form a restricted Boltzmann machine whose free energy landscape models the low-dimensional manifolds of the digits. The valleys have names.

The model learns a joint density for labels and images. To perform recognition we can start with a neutral state of the label units and do one or two iterations of the top-level RBM, or we can just compute the harmony of the RBM with each of the 10 labels clamped in turn (see the sketch below).
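The "compute the harmony with each of the 10 labels" option can be read as scoring each label by the free energy of the top-level RBM with that label clamped; here is a minimal sketch under that reading (shapes and names are assumptions: 500 penultimate feature activities plus 10 one-hot label units form the RBM's visible vector).

```python
import numpy as np

def free_energy(v, W, b_vis, b_hid):
    """Binary-RBM free energy: F(v) = -v.b_vis - sum_j log(1 + exp(v.W_j + b_hid_j))."""
    return -v @ b_vis - np.logaddexp(0.0, v @ W + b_hid).sum()

def classify(features, W, b_vis, b_hid, n_labels=10):
    """Pick the label whose clamped one-hot state gives the lowest free energy."""
    scores = []
    for k in range(n_labels):
        label = np.zeros(n_labels)
        label[k] = 1.0
        v = np.concatenate([features, label])    # 500 feature activities + 10 label units
        scores.append(free_energy(v, W, b_vis, b_hid))
    return int(np.argmin(scores))                # lowest free energy = highest unnormalised probability
```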

See the movie at http://www.cs.toronto.edu/~hinton/adi/index.htm

Samples generated by running the top-level RBM with one label clamped. There are 1000 iterations of alternating Gibbs sampling between samples.

Examples of correctly recognized MNIST test digits (the 49 closest calls)

How well does it discriminate on the MNIST test set with no extra information about geometric distortions?

    Up-down net with RBM pre-training + CD10        1.25%
    SVM (Decoste & Scholkopf)                       1.4%
    Backprop with 1000 hiddens (Platt)              1.5%
    Backprop with 500 --> 300 hiddens               1.5%
    Separate hierarchy of RBM's per class           1.7%
    Learned motor program extraction                ~1.8%
    K-Nearest Neighbor                              ~3.3%

It's better than backprop and much more neurally plausible, because the neurons only need to send one kind of signal, and the teacher can be another sensory input.

All 125 errors

The receptive fields of the first hidden layer

The generative fields of the first hidden layer