CSC321 Lecture 27 Using Boltzmann machines to initialize backpropagation Geoffrey Hinton.

Slides:



Advertisements
Similar presentations
CSC2535: 2013 Advanced Machine Learning Lecture 8b Image retrieval using multilayer neural networks Geoffrey Hinton.
Advertisements

CSC321 Introduction to Neural Networks and Machine Learning Lecture 21 Using Boltzmann machines to initialize backpropagation Geoffrey Hinton.
CSC321: Introduction to Neural Networks and Machine Learning Lecture 24: Non-linear Support Vector Machines Geoffrey Hinton.
Deep Learning Bing-Chen Tsai 1/21.
CIAR Second Summer School Tutorial Lecture 2a Learning a Deep Belief Net Geoffrey Hinton.
Stochastic Neural Networks Deep Learning and Neural Nets Spring 2015.
CS590M 2008 Fall: Paper Presentation
CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 7: Learning in recurrent networks Geoffrey Hinton.
Advanced topics.
Stacking RBMs and Auto-encoders for Deep Architectures References:[Bengio, 2009], [Vincent et al., 2008] 2011/03/03 강병곤.
What kind of a Graphical Model is the Brain?
CIAR Summer School Tutorial Lecture 2b Learning a Deep Belief Net
Deep Learning.
Structure learning with deep neuronal networks 6 th Network Modeling Workshop, 6/6/2013 Patrick Michl.
How to do backpropagation in a brain
Y. Weiss (Hebrew U.) A. Torralba (MIT) Rob Fergus (NYU)
CSC321: Introduction to Neural Networks and Machine Learning Lecture 20 Learning features one layer at a time Geoffrey Hinton.
CSC2535: 2013 Advanced Machine Learning Lecture 3a: The Origin of Variational Bayes Geoffrey Hinton.
Efficient Image Search and Retrieval using Compact Binary Codes
CIAR Second Summer School Tutorial Lecture 2b Autoencoders & Modeling time series with Boltzmann machines Geoffrey Hinton.
How to do backpropagation in a brain
Using Fast Weights to Improve Persistent Contrastive Divergence Tijmen Tieleman Geoffrey Hinton Department of Computer Science, University of Toronto ICML.
CSC2535: Computation in Neural Networks Lecture 11: Conditional Random Fields Geoffrey Hinton.
An efficient way to learn deep generative models Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science University of.
CSC321: Neural Networks Lecture 13: Learning without a teacher: Autoencoders and Principal Components Analysis Geoffrey Hinton.
Geoffrey Hinton CSC2535: 2013 Lecture 5 Deep Boltzmann Machines.
Learning to perceive how hand-written digits were drawn Geoffrey Hinton Canadian Institute for Advanced Research and University of Toronto.
CSC321: Neural Networks Lecture 24 Products of Experts Geoffrey Hinton.
CSC 2535 Lecture 8 Products of Experts Geoffrey Hinton.
CSC2535: Computation in Neural Networks Lecture 12: Non-linear dimensionality reduction Geoffrey Hinton.
CSC321: Introduction to Neural Networks and Machine Learning Lecture 18 Learning Boltzmann Machines Geoffrey Hinton.
CSC321 Introduction to Neural Networks and Machine Learning Lecture 3: Learning in multi-layer networks Geoffrey Hinton.
CSC2515: Lecture 7 (post) Independent Components Analysis, and Autoencoders Geoffrey Hinton.
CIAR Summer School Tutorial Lecture 1b Sigmoid Belief Nets Geoffrey Hinton.
CSC321: Lecture 7:Ways to prevent overfitting
CSC321: Introduction to Neural Networks and Machine Learning Lecture 19: Learning Restricted Boltzmann Machines Geoffrey Hinton.
Boltzman Machines Stochastic Hopfield Machines Lectures 11e 1.
CSC2515 Lecture 10 Part 2 Making time-series models with RBM’s.
CSC321 Lecture 5 Applying backpropagation to shape recognition Geoffrey Hinton.
Cognitive models for emotion recognition: Big Data and Deep Learning
Convolutional Restricted Boltzmann Machines for Feature Learning Mohammad Norouzi Advisor: Dr. Greg Mori Simon Fraser University 27 Nov
CSC321: Introduction to Neural Networks and Machine Learning Lecture 15: Mixtures of Experts Geoffrey Hinton.
CSC321: Introduction to Neural Networks and Machine Learning Lecture 23: Linear Support Vector Machines Geoffrey Hinton.
CSC321: Lecture 25: Non-linear dimensionality reduction Geoffrey Hinton.
Preliminary version of 2007 NIPS Tutorial on: Deep Belief Nets Geoffrey Hinton Canadian Institute for Advanced Research & Department of Computer Science.
Deep learning Tsai bing-chen 10/22.
Neural Networks Lecture 11: Learning in recurrent networks Geoffrey Hinton.
CSC2515 Fall 2008 Introduction to Machine Learning Lecture 8 Deep Belief Nets All lecture slides will be available as.ppt,.ps, &.htm at
Neural Networks William Cohen [pilfered from: Ziv; Geoff Hinton; Yoshua Bengio; Yann LeCun; Hongkak Lee - NIPs 2010 tutorial ]
CSC321 Lecture 24 Using Boltzmann machines to initialize backpropagation Geoffrey Hinton.
Deep Belief Network Training Same greedy layer-wise approach First train lowest RBM (h 0 – h 1 ) using RBM update algorithm (note h 0 is x) Freeze weights.
CSC321: Extra Lecture (not on the exam) Non-linear dimensionality reduction Geoffrey Hinton.
Deep Learning Overview Sources: workshop-tutorial-final.pdf
CSC2535: Lecture 4: Autoencoders, Free energy, and Minimum Description Length Geoffrey Hinton.
Some Slides from 2007 NIPS tutorial by Prof. Geoffrey Hinton
Learning Deep Generative Models by Ruslan Salakhutdinov
CSC2535: Computation in Neural Networks Lecture 11 Extracting coherent properties by maximizing mutual information across space or time Geoffrey Hinton.
Deep Learning Amin Sobhani.
Energy models and Deep Belief Networks
CSC321: Neural Networks Lecture 22 Learning features one layer at a time Geoffrey Hinton.
All lecture slides will be available as .ppt, .ps, & .htm at
Matt Gormley Lecture 16 October 24, 2016
Restricted Boltzmann Machines for Classification
Multimodal Learning with Deep Boltzmann Machines
Structure learning with deep autoencoders
Department of Electrical and Computer Engineering
ECE 599/692 – Deep Learning Lecture 9 – Autoencoder (AE)
CSC321 Winter 2007 Lecture 21: Some Demonstrations of Restricted Boltzmann Machines Geoffrey Hinton.
CSC 2535: Computation in Neural Networks Lecture 9 Learning Multiple Layers of Features Greedily Geoffrey Hinton.
CSC 578 Neural Networks and Deep Learning
Presentation transcript:

CSC321 Lecture 27 Using Boltzmann machines to initialize backpropagation Geoffrey Hinton

Some problems with backpropagation The amount of information that each training case provides about the weights is at most the log of the number of possible output labels. –So to train a big net we need lots of labeled data. In nets with many layers of weights the backpropagated derivatives either grow or shrink multiplicatively at each layer. –Learning is tricky either way. Dumb gradient descent is not a good way to perform a global search for a good region of a very large, very non- linear space. –So deep nets trained by backpropagation are rare in practice.

The obvious solution to all of these problems: Use greedy unsupervised learning to find a sensible set of weights one layer at a time. Then fine-tune with backpropagation Greedily learning one layer at a time scales well to really big networks, especially if we have locality in each layer. Most of the information in the final weights comes from modeling the distribution of input vectors. –The precious information in the labels is only used for the final fine-tuning. We do not start backpropagation until we already have sensible weights that already do well at the task. –So the learning is well-behaved and quite fast.

Modelling the distribution of digit images 2000 units 500 units 28 x 28 pixel image The network learns a density model for unlabeled digit images. When we generate from the model we often get things that look like real digits of all classes. More hidden layers make the generated fantasies look better (YW Teh, Simon Osindero). But do the hidden features really help with digit discrimination? Add 10 softmaxed units to the top and do backprop. The top two layers form a restricted Boltzmann machine whose free energy landscape should model the low dimensional manifolds of the digits.

Results on permutation-invariant MNIST task Very carefully trained backprop net with 1.53% one or two hidden layers (Platt; Hinton) SVM (Decoste & Schoelkopf) 1.4% Generative model of joint density of 1.25% images and labels (with unsupervised fine-tuning) Generative model of unlabelled digits 1.2% followed by gentle backpropagtion Generative model of joint density 1.1% followed by gentle backpropagation

Deep Autoencoders They always looked like a really nice way to do non-linear dimensionality reduction: –They provide mappings both ways –The learning time is linear (or better) in the number of training cases. –The final model is compact and fast. But it turned out to be very very difficult to optimize deep autoencoders using backprop. –We now have a much better way to optimize them.

How to find documents that are similar to a query document Convert each document into a “bag of words”. –This is a vector of word counts ignoring the order. –Ignore stop words (like “the” or “over”) We could compare the word counts of the query document and millions of other documents but this is too slow. –So we reduce each query vector to a much smaller vector that still contains most of the information about the content of the document. fish cheese vector count school query reduce bag pulpit iraq word

The non-linearity used for reconstructing bags of words Divide the counts in a bag of words vector by N, where N is the total number of non-stop words in the document. –The resulting probability vector gives the probability of getting a particular word if we pick a non-stop word at random from the document. At the output of the autoencoder, we use a softmax. –The probability vector defines the desired outputs of the softmax. When we train the first RBM in the stack we use the same trick. –We treat the word counts as probabilities, but we make the visible to hidden weights N times bigger than the hidden to visible because we have N observations from the probability distribution.

Performance of the autoencoder at document retrieval Train on bags of 2000 words for 400,000 training cases of business documents. –First train a stack of RBM’s. Then fine-tune with backprop. Test on a separate 400,000 documents. –Pick one test document as a query. Rank order all the other test documents by using the cosine of the angle between codes. –Repeat this using each of the 400,000 test documents as the query (requires 0.16 trillion comparisons). Plot the number of retrieved documents against the proportion that are in the same hand-labeled class as the query document.