
CSC 2535: Computation in Neural Networks Lecture 10 Learning Deterministic Energy-Based Models Geoffrey Hinton

A different kind of hidden structure Instead of trying to find a set of independent hidden causes, try to find factors of a different kind: capture structure by finding constraints that are Frequently Approximately Satisfied (FAS). Violations of FAS constraints reduce the probability of a data vector, but if a constraint already has a big violation, violating it more does not make the data vector much worse (i.e., the distribution of violations is assumed to be heavy-tailed).

Energy-Based Models with deterministic hidden units Use multiple layers of deterministic hidden units with non-linear activation functions. Hidden activities contribute additively to the global energy, E. [Figure: a feedforward network in which the data feed into hidden units j and k, whose activities contribute energies E_j and E_k to the global energy.]
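
As an illustration only (none of this code is from the lecture; the sizes, nonlinearity, and parameter names are assumptions), a minimal sketch of such an energy function, where each hidden activity contributes to the energy through a learned per-unit scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 16-D data, hidden layers of 20 and 5 units.
W1 = rng.normal(scale=0.1, size=(20, 16))
W2 = rng.normal(scale=0.1, size=(5, 20))
s1 = rng.normal(scale=0.1, size=20)   # learned energy scale for each unit in layer 1
s2 = rng.normal(scale=0.1, size=5)    # learned energy scale for each unit in layer 2

def energy(x):
    """Global energy E(x): every hidden activity contributes additively."""
    h1 = np.tanh(W1 @ x)
    h2 = np.tanh(W2 @ h1)
    return s1 @ h1 + s2 @ h2
```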

Two types of density model. (1) Stochastic generative models that use a directed acyclic graph (e.g. a Bayes net): generation from the model is easy, inference can be hard, and learning is easy after inference. (2) Energy-based models that associate an energy (or free energy) with each data vector: generation from the model is hard, inference can be easy, but is learning hard?

Reminder Maximum likelihood learning is hard in Energy-Based Models. To get high probability for d we need low energy for d and high energy for its main rivals, c. It is easy to lower the energy of d, but we also need to find the serious rivals to d and raise their energy, and that seems hard.
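
The standard maximum-likelihood gradient for an energy-based model (a textbook identity, not reproduced on this slide) makes the difficulty explicit: with p(d) = e^{-E(d)}/Z,

```latex
\frac{\partial \log p(d)}{\partial \theta}
  \;=\; -\,\frac{\partial E(d)}{\partial \theta}
  \;+\; \sum_{c} p(c)\,\frac{\partial E(c)}{\partial \theta}
```

The first term lowers the energy of the data vector; the second term raises the energy of rival configurations in proportion to their probability under the model, and computing that expectation over the model's distribution is what is hard.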

Reminder: Maximum likelihood learning is hard. To get high log probability for d we need low energy for d and high energy for its main rivals, c. To sample rivals from the model we can use Markov Chain Monte Carlo, but what kind of chain can we use when the hidden units are deterministic?

Hybrid Monte Carlo We could find good rivals by repeatedly making a random perturbation to the data and accepting the perturbation with a probability that depends on the energy change. –Diffuses very slowly over flat regions. –Cannot cross energy barriers easily. In high-dimensional spaces, it is much better to use the gradient to choose good directions and to use momentum. –Beats diffusion and scales well. –Can cross energy barriers. –Back-propagation can give us this gradient.
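
A minimal sketch of one Hybrid Monte Carlo update (assumptions: generic `energy` and `grad_energy` functions such as the ones sketched above, unit mass, and illustrative step sizes; this is not code from the lecture):

```python
def hmc_step(x, energy, grad_energy, step_size=0.01, n_leapfrog=20, rng=None):
    """One HMC update: renew the momentum, follow leapfrog dynamics, accept/reject."""
    rng = rng or np.random.default_rng()
    p = rng.normal(size=x.shape)                      # fresh random momentum
    x_new, p_new = x.astype(float), p.copy()
    p_new -= 0.5 * step_size * grad_energy(x_new)     # half step for momentum
    for i in range(n_leapfrog):
        x_new += step_size * p_new                    # full step for position
        if i < n_leapfrog - 1:
            p_new -= step_size * grad_energy(x_new)   # full step for momentum
    p_new -= 0.5 * step_size * grad_energy(x_new)     # final half step for momentum
    # Metropolis test on the change in total energy (potential + kinetic).
    h_old = energy(x) + 0.5 * p @ p
    h_new = energy(x_new) + 0.5 * p_new @ p_new
    return x_new if rng.random() < np.exp(min(0.0, h_old - h_new)) else x
```

The gradient steps are what let the proposal travel far across the energy surface instead of diffusing, which is the point made on the slide.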

Trajectories with different initial momenta

Backpropagation can compute the gradient that Hybrid Monte Carlo needs: 1. Do a forward pass computing the hidden activities. 2. Do a backward pass all the way to the data to compute the derivative of the global energy w.r.t. each component of the data vector. This works with any smooth non-linearity. [Figure: the same network as before, with the data feeding hidden units j and k that contribute energies E_j and E_k.]
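
Continuing the hypothetical two-layer energy sketched earlier (again an illustration, not the lecture's code), the backward pass to the data looks like this; in an autograd framework it would be a single gradient call:

```python
def grad_energy(x):
    """Forward pass for the hidden activities, backward pass for dE/dx."""
    h1 = np.tanh(W1 @ x)                 # forward pass
    h2 = np.tanh(W2 @ h1)
    d_h2 = s2                            # E depends linearly on h2 through its scales
    d_a2 = d_h2 * (1.0 - h2 ** 2)        # back through the tanh of layer 2
    d_h1 = s1 + W2.T @ d_a2              # direct contribution plus the path via layer 2
    d_a1 = d_h1 * (1.0 - h1 ** 2)        # back through the tanh of layer 1
    return W1.T @ d_a1                   # derivative of the global energy w.r.t. the data
```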

The online HMC learning procedure: 1. Start at a datavector, d, and use backprop to compute ∂E(d)/∂θ for every parameter θ. 2. Run HMC for many steps, with frequent renewal of the momentum, to get an equilibrium sample, c. 3. Use backprop to compute ∂E(c)/∂θ. 4. Update the parameters by Δθ ∝ ∂E(c)/∂θ − ∂E(d)/∂θ, which lowers the energy of d and raises the energy of c.
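
Putting the pieces together, a sketch of that procedure (assuming a hypothetical helper `grad_energy_wrt_params` that backpropagates the energy to the parameters, plus the `energy`, `grad_energy`, and `hmc_step` sketches above; the learning rate and step counts are illustrative, and `energy`/`grad_energy` are treated as closing over the current parameters):

```python
def online_hmc_learning(data, params, lr=0.001, n_hmc_updates=100):
    """One parameter update per datavector: lower E(d), raise E(c)."""
    rng = np.random.default_rng()
    for d in data:
        # grad_energy_wrt_params: hypothetical helper returning {name: dE/dtheta}.
        pos = grad_energy_wrt_params(params, d)
        c = d.astype(float)
        for _ in range(n_hmc_updates):          # frequent renewal of the momentum
            c = hmc_step(c, energy, grad_energy, rng=rng)
        neg = grad_energy_wrt_params(params, c)
        for name in params:                     # raise E(c), lower E(d)
            params[name] += lr * (neg[name] - pos[name])
    return params
```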

A surprising shortcut Instead of taking the negative samples from the equilibrium distribution, use slight corruptions of the datavectors. Only add random momentum once, and only follow the dynamics for a few steps. –Much less variance because a datavector and its confabulation form a matched pair. –Seems to be very biased, but maybe it is optimizing a different objective function. If the model is perfect and there is an infinite amount of data, the confabulations will be equilibrium samples. So the shortcut will not cause learning to mess up a perfect model.

Intuitive motivation It is silly to run the Markov chain all the way to equilibrium if we can get the information required for learning in just a few steps. –The way in which the model systematically distorts the data distribution in the first few steps tells us a lot about how the model is wrong. –But the model could have strong modes far from any data. These modes will not be sampled by confabulations. Is this a problem in practice?

Contrastive divergence The aim is to minimize the amount by which a step toward equilibrium improves the data distribution. Writing Q^0 for the data distribution, Q^1 for the distribution after one step of the Markov chain, and Q^∞ for the model's distribution, minimize the contrastive divergence CD = KL(Q^0 ‖ Q^∞) − KL(Q^1 ‖ Q^∞): minimize the divergence between the data distribution and the model's distribution, while maximizing the divergence between the confabulations and the model's distribution.

Contrastive divergence (continued). Differentiating CD with respect to the parameters gives an update of the form Δθ ∝ ⟨∂E/∂θ⟩_{Q^1} − ⟨∂E/∂θ⟩_{Q^0}: contrastive divergence makes the awkward terms (the intractable derivatives of the log partition function) cancel between the two KL divergences. There is a further term because changing the parameters also changes the distribution of the confabulations, Q^1, but this term is small and is ignored.

A simple 2-D dataset The true data is uniformly distributed within the 4 squares. The blue dots are samples from the model.

The network for the 4 squares task Each hidden unit contributes an energy equal to its activity times a learned scale. [Architecture: 2 input units → 20 logistic units → 3 logistic units → global energy E.]

Frequently Approximately Satisfied constraints The intensities in a typical image satisfy many different linear constraints very accurately, and violate a few constraints by a lot. The constraint violations fit a heavy-tailed distribution, and the negative log probabilities of the constraint violations can be used as energies. [Figure: energy as a function of the violation for a Gaussian versus a heavy-tailed Cauchy distribution of violations; an example constraint is that, on a smooth intensity patch, the two sides balance the middle.]
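
In symbols (my notation, not preserved in the transcript): if v is the violation of a constraint, a Gaussian model of violations gives a quadratic energy, whereas a Cauchy (heavy-tailed) model gives an energy that saturates:

```latex
E_{\text{Gauss}}(v) \;=\; \frac{v^2}{2\sigma^2} + \text{const},
\qquad
E_{\text{Cauchy}}(v) \;=\; \log\!\left(1 + \frac{v^2}{\sigma^2}\right) + \text{const}
```

Because the Cauchy energy flattens out, making an already-large violation larger barely raises the energy, which is exactly the frequently-approximately-satisfied behaviour described earlier.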

Learning the constraints on an arm A 3-D arm with 4 links and 5 joints. [Figure: hidden units compute linear functions of the joint coordinates, their outputs are squared, and non-zero squared outputs contribute energy; for each link there is one constraint relating the coordinates of its two joints.]
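
The equation on the original slide is not preserved in this transcript, but the natural constraint for each rigid link is that the squared distance between the 3-D coordinates of its two joints is a constant (my reconstruction):

```latex
(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2 \;=\; \ell^2
```

This is built from squared linear functions of the coordinates, matching the slide's architecture of linear units followed by squaring.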

[Figure: visualization of the learned weights. One panel shows the weights of a hidden unit, with positive and negative weights onto the coordinates of joints 4 and 5; another shows the weights of a top-level unit, the biases of the top-level units, and their mean total input from the layer below.]

Dealing with missing inputs The network learns the constraints even if 10% of the inputs are missing. –First fill in the missing inputs randomly. –Then use the back-propagated energy derivatives to slowly change the filled-in values until they fit in with the learned constraints. Why don’t the corrupted inputs interfere with the learning of the constraints? –The energy function has a small slope when a constraint is violated by a lot. –So when a constraint is violated by a lot it does not adapt: don’t learn when things don’t make sense.
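
A sketch of that fill-in procedure (assuming the `grad_energy` helper above, a boolean `missing` mask, and an illustrative step size; not code from the lecture):

```python
def fill_in_missing(x_observed, missing, n_iters=200, step=0.01, rng=None):
    """Fill missing inputs randomly, then relax them down the energy gradient."""
    rng = rng or np.random.default_rng()
    x = x_observed.astype(float)
    x[missing] = rng.normal(size=int(missing.sum()))   # random initial fill-in
    for _ in range(n_iters):
        g = grad_energy(x)                              # backprop the energy to the data
        x[missing] -= step * g[missing]                 # only the filled-in values move
    return x
```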

Learning constraints from natural images (Yee-Whye Teh) We used 16x16 image patches and a single layer of 768 hidden units (3 x over-complete). Confabulations are produced from data by adding random momentum once and simulating dynamics for 30 steps. Weights are updated every 100 examples. A small amount of weight decay helps.

A random subset of 768 basis functions

The distribution of all 768 learned basis functions

How to learn a topographic map The outputs of the linear filters are squared and locally pooled. This makes it cheaper to put filters that are violated at the same time next to each other. [Figure: image → linear filters (global connectivity) → squared filter outputs → local pooling (local connectivity); because the pooled energy is heavy-tailed, the cost of a second violation within a pool is smaller than the cost of the first.]
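
One common way to write such a pooled, heavy-tailed energy (my notation; the precise form used in the lecture is not preserved here) is to square the linear filter outputs, pool them locally, and pass each pool through a saturating log:

```latex
E(\mathbf{x}) \;=\; \sum_{p\,\in\,\text{pools}} \log\!\Big(1 + \sum_{i \in p} \big(\mathbf{w}_i^{\top}\mathbf{x}\big)^{2}\Big)
```

Because the log saturates, a second violation inside a pool costs less than the first, so the total energy is lower when filters that tend to be violated together share the same spatially local pool; this is what produces the topographic map.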

Faster mixing chains Hybrid Monte Carlo can only take small steps because the energy surface is curved. With a single layer of hidden units, it is possible to use alternating parallel Gibbs sampling. –Step 1: each student-t hidden unit picks a variance from the posterior distribution over variances given the violation produced by the current datavector. If the violation is big, it picks a big variance. –With the variances fixed, the hidden units define one-dimensional Gaussians in the dataspace. –Step 2: pick a datavector from the product of all the one-dimensional Gaussians.
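
To make the two Gibbs steps concrete (my notation and a particular Gamma parameterization, not taken from the slide), write each Student-t hidden unit as a Gaussian over its violation v_i = w_iᵀx whose precision τ_i is itself random, and assume zero means throughout:

```latex
% Student-t energy as a Gaussian scale mixture over the violation v_i = w_i^T x
p(v_i) \;=\; \int \mathcal{N}\!\big(v_i;\,0,\,\tau_i^{-1}\big)\,
             \mathrm{Gam}(\tau_i;\,a,\,b)\,\mathrm{d}\tau_i
% Step 1: sample each precision from its posterior given the current violation
\tau_i \;\sim\; \mathrm{Gam}\!\Big(a + \tfrac{1}{2},\; b + \tfrac{1}{2}\,v_i^{2}\Big)
% Step 2: with the precisions fixed, the product of the 1-D Gaussians over the
% violations is a multivariate Gaussian over the data
\mathbf{x} \;\sim\; \mathcal{N}\!\big(\mathbf{0},\,\Lambda^{-1}\big),
\qquad \Lambda \;=\; \sum_i \tau_i\, \mathbf{w}_i \mathbf{w}_i^{\top}
```

A large violation makes the posterior rate parameter large, so the sampled precision tends to be small, i.e. the unit picks a big variance, exactly as described above.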

Pros and cons of Gibbs sampling Advantages: –Much faster mixing. –Can be extended to use a pooled second layer (Max Welling). Disadvantages: –Can only be used in deep networks by learning hidden layers (or pairs of layers) greedily. But maybe this is OK.

Density models fall into two families:
– Causal models
  – Tractable posterior (mixture models, sparse Bayes nets, factor analysis): compute the exact posterior.
  – Intractable posterior (densely connected DAGs): use Markov Chain Monte Carlo, or minimize a variational free energy.
– Energy-Based Models
  – Stochastic hidden units: full Boltzmann Machine (full MCMC); Restricted Boltzmann Machine (minimize contrastive divergence).
  – Deterministic hidden units: full MCMC, or fix the features (“maxent”), or minimize contrastive divergence.

Two views of Independent Components Analysis. As a stochastic causal generative model, the posterior distribution is normally intractable, but for ICA the posterior collapses to a point. As a deterministic energy-based model, the partition function Z is normally intractable, but for ICA Z becomes a determinant. When the number of linear hidden units equals the dimensionality of the data, the model has both marginal and conditional independence.
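
The standard square, noiseless ICA log-likelihood (a well-known identity, added here for clarity rather than copied from the slide) shows both views at once: the causes are recovered deterministically as s = Wx, and the normalizer is a determinant rather than an intractable partition function:

```latex
\log p(\mathbf{x}) \;=\; \sum_{i} \log p_i\!\big(\mathbf{w}_i^{\top}\mathbf{x}\big)
\;+\; \log\big|\det \mathbf{W}\big|
```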