Learning Lateral Connections between Hidden Units
Geoffrey Hinton, University of Toronto
in collaboration with Kejie Bao, University of Toronto


Overview of the talk
Causal model: learns to represent images using multiple, simultaneous, hidden, binary causes.
– Introduces the variational approximation trick.
Boltzmann machines: learning to model the probabilities of binary vectors.
– Introduces the brief Monte Carlo trick.
Hybrid model: uses a Boltzmann machine to model the prior distribution over configurations of binary causes.
– Uses both tricks.
Causal hierarchies of MRFs: generalize the hybrid model to many hidden layers.
– The causal connections act as insulators that keep the local partition functions separate.

Bayes nets: hierarchies of causes
It is easy to generate an unbiased example at the leaf nodes.
It is typically hard to compute the posterior distribution over all possible configurations of hidden causes.
Given samples from the posterior, it is easy to learn the local interactions.
[Figure: a directed graph with hidden causes above and visible effects below.]

A simple set of images
[Figure: two of the training images, the probabilities of turning on the binary hidden units, and the reconstructions of the images.]

The generative model
To generate a datavector:
– first generate a code from the prior distribution
– then generate an ideal datavector from the code
– then add Gaussian noise.
The value that code c predicts for the i'th component of the data vector is
\hat{d}_i^{\,c} = \sum_j w_{ij}\, s_j^{\,c} + b_i
where w_{ij} is the weight from hidden unit j to pixel i, s_j^{\,c} is the binary state of hidden unit j in code vector c, and b_i is a bias.
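To make the generative story concrete, here is a minimal NumPy sketch of the three steps above. The sizes, the factorial Bernoulli prior, and the noise level are assumptions made up for the example, not values from the talk.

import numpy as np

rng = np.random.default_rng(0)

n_pixels, n_code = 64, 16                      # assumed sizes
W = rng.normal(0.0, 0.1, (n_pixels, n_code))   # generative weights: hidden unit j -> pixel i
b = np.zeros(n_pixels)                         # pixel biases
prior_p = np.full(n_code, 0.2)                 # assumed factorial prior over code units
sigma = 0.1                                    # std of the Gaussian pixel noise

def generate():
    s = (rng.random(n_code) < prior_p).astype(float)  # 1. sample a binary code from the prior
    ideal = W @ s + b                                  # 2. ideal datavector predicted by the code
    return ideal + rng.normal(0.0, sigma, n_pixels)    # 3. add Gaussian noise

d = generate()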

Learning the model
For each image in the training set we ought to consider all possible codes. This is exponentially expensive.
The posterior probability of code c combines the prior probability of the code with the prediction error of code c:
p(c \mid d) \;\propto\; p(c)\, \exp\!\Big(-\tfrac{1}{2\sigma^2}\sum_i \big(d_i - \hat{d}_i^{\,c}\big)^2\Big)

How to beat the exponential explosion of possible codes
Instead of considering each code separately, we could use an approximation to the true posterior distribution. This makes it tractable to consider all the codes at once.
Instead of computing a separate prediction error for each binary code, we compute the expected squared error given the approximate posterior distribution over codes.
– Then we just change the weights to minimize this expected squared error.

A factorial approximation
For a given datavector, assume that each code unit has a probability of being on, but that the code units are conditionally independent of each other:
q(c) = \prod_j q_j^{\,s_j^c}\,(1 - q_j)^{\,1 - s_j^c}
The product runs over all code units; the factor q_j is used if code unit j is on in code vector c, and (1 - q_j) otherwise.

The expected squared prediction error
The expected squared error is the squared error of the expected prediction plus the additional squared error caused by the variance in the prediction:
\big\langle (d_i - \hat{d}_i)^2 \big\rangle_q = \Big(d_i - \sum_j w_{ij} q_j - b_i\Big)^2 + \sum_j w_{ij}^2\, q_j (1 - q_j)
The variance term prevents it from cheating by using the precise real-valued q values to make precise predictions.
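A quick way to sanity-check this decomposition (as reconstructed above) is to compare it against a Monte Carlo average over sampled binary codes; the weights, q values and target pixel value below are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
n_code = 16
w = rng.normal(0.0, 0.5, n_code)      # weights into one pixel (assumed values)
q = rng.random(n_code)                # approximate posterior probabilities of the code units
d_i = 1.3                             # target value of that pixel (assumed)

# Closed form: squared error of the expected prediction plus the prediction variance.
closed = (d_i - w @ q) ** 2 + np.sum(w ** 2 * q * (1 - q))

# Monte Carlo estimate: average squared error over sampled binary codes.
samples = (rng.random((100_000, n_code)) < q).astype(float)
mc = np.mean((d_i - samples @ w) ** 2)

print(closed, mc)   # the two numbers should agree closely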

Approximate inference
We use an approximation to the posterior distribution over hidden configurations.
– Assume the posterior factorizes into a product of distributions for each hidden cause.
If we use the approximation for learning, there is no guarantee that learning will increase the probability that the model would generate the observed data.
But maybe we can find a different and sensible objective function that is guaranteed to improve at each update.

A trade-off between how well the model fits the data and the tractability of inference
The new objective function combines how well the model fits the data with the inaccuracy of inference:
G(\theta) = \sum_d \Big[\, \log p(d \mid \theta) \;-\; \mathrm{KL}\big(Q(\cdot \mid d) \,\|\, P(\cdot \mid d, \theta)\big) \Big]
where \theta is the parameters, d is the data, Q is the approximating posterior distribution and P is the true posterior distribution.
This makes it feasible to fit models that are so complicated that we cannot figure out how the model would generate the data, even if we know the parameters of the model.

Where does the approximate posterior come from?
We have a tractable cost function expressed in terms of the approximating probabilities, q.
So we can use the gradient of the cost function w.r.t. the q values to train a "recognition network" to produce good q values.
Assume that the prior over codes also factors, so it can be represented by generative biases.
[Figure: a recognition network running from the data up to the q values.]

Two types of density model
Stochastic generative model using a directed acyclic graph (e.g. a Bayes net):
– Generation from the model is easy.
– Inference can be hard.
– Learning is easy after inference.
Energy-based models that associate an energy with each data vector:
– Generation from the model is hard.
– Inference can be easy.
– Is learning hard?

A simple energy-based model
Connect a set of binary stochastic units together using symmetric connections.
Define the energy of a binary configuration, alpha, to be
E_\alpha = -\sum_{i<j} s_i^{\,\alpha} s_j^{\,\alpha} w_{ij}
The energy of a binary vector determines its probability via the Boltzmann distribution:
p(\alpha) = \frac{e^{-E_\alpha}}{\sum_\beta e^{-E_\beta}}
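For a net that is small enough to enumerate, the Boltzmann distribution can be computed exactly by brute force. A minimal sketch, with made-up weights:

import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 5                                          # small enough to enumerate all 2^n states
W = rng.normal(0.0, 1.0, (n, n))
W = np.triu(W, 1) + np.triu(W, 1).T            # symmetric connections, no self-connections

def energy(s):
    return -0.5 * s @ W @ s                    # E_alpha = -sum_{i<j} s_i s_j w_ij

states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
E = np.array([energy(s) for s in states])
p = np.exp(-E) / np.exp(-E).sum()              # Boltzmann distribution over configurations
print(states[p.argmax()], p.max())             # most probable binary configuration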

Maximum likelihood learning is hard in energy-based models
To get high probability for d we need low energy for d and high energy for its main rivals, r.
We need to find the serious rivals to d and raise their energy. This seems hard.
It is easy to lower the energy of d.

Markov chain Monte Carlo
It is easy to set up a Markov chain so that it finds the rivals to the data with just the right probability, i.e. so that it samples rivals r with probability
p(r) = \frac{e^{-E_r}}{\sum_\beta e^{-E_\beta}}

A picture of the learning rule for a fully visible Boltzmann machine
[Figure: states of the chain at t = 0, t = 1, t = 2, ..., t = infinity; the state at t = infinity is a "fantasy".]
Start with a training vector. Then pick units at random and update their states stochastically using the rule:
p(s_i = 1) = \frac{1}{1 + \exp\!\big(-\sum_j s_j w_{ij}\big)}
The maximum likelihood learning rule is then
\Delta w_{ij} \;\propto\; \langle s_i s_j \rangle^{t=0} - \langle s_i s_j \rangle^{t=\infty}

A surprising shortcut
Instead of taking the negative samples from the equilibrium distribution, use slight corruptions of the datavectors. Only run the Markov chain for a few steps.
– Much less variance, because a datavector and its confabulation form a matched pair.
– Seems to be very biased, but maybe it is optimizing a different objective function.
If the model is perfect and there is an infinite amount of data, the confabulations will be equilibrium samples. So the shortcut will not cause learning to mess up a perfect model.
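A minimal sketch of the shortcut for a fully visible Boltzmann machine, assuming toy binary data and a single brief Gibbs sweep (the learning rate, data and number of sweeps are illustrative, not from the talk): the negative statistics come from the confabulation a few steps from the data rather than from an equilibrium sample.

import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n = 8
W = np.zeros((n, n))                              # symmetric weights, zero diagonal
data = (rng.random((200, n)) < 0.5).astype(float) # toy binary "data" (assumed)
lr, k = 0.05, 1                                   # k = number of brief Gibbs sweeps

def gibbs_sweep(s):
    s = s.copy()
    for i in rng.permutation(n):                  # update units in random order
        p_on = sigmoid(W[i] @ s)
        s[i] = float(rng.random() < p_on)
    return s

for epoch in range(10):
    for v in data:
        conf = v
        for _ in range(k):                        # slight corruption: a few steps from the data
            conf = gibbs_sweep(conf)
        # positive statistics from the data, negative from the confabulation
        dW = np.outer(v, v) - np.outer(conf, conf)
        np.fill_diagonal(dW, 0.0)
        W += lr * dW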

Intuitive motivation
It is silly to run the Markov chain all the way to equilibrium if we can get the information required for learning in just a few steps.
– The way in which the model systematically distorts the data distribution in the first few steps tells us a lot about how the model is wrong.
– But the model could have strong modes far from any data. These modes will not be sampled by brief Monte Carlo. Is this a problem in practice? Apparently not.

Mean field Boltzmann machines
Instead of using binary units with stochastic updates, approximate the Markov chain by using deterministic units with real-valued states, q, that represent a distribution over binary states.
We can then run a deterministic approximation to the brief Markov chain:
q_i^{\,t+1} = \frac{1}{1 + \exp\!\big(-\sum_j w_{ij}\, q_j^{\,t}\big)}
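A minimal sketch of this deterministic approximation, assuming symmetric weights W and an initial vector of real-valued states q0:

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def mean_field_rollout(W, q0, steps=3):
    """Deterministic approximation to a brief Markov chain: each pass replaces
    every unit's real-valued state with the probability it would have of being
    on given the current real-valued states of the other units."""
    q = q0.copy()
    for _ in range(steps):
        q = sigmoid(W @ q)
    return q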

The hybrid model
We can use the same factored distribution over code units in a causal model and in a mean field Boltzmann machine that learns to model the prior distribution over codes.
The stochastic generative model is:
– First sample a binary vector from the prior distribution that is specified by the lateral connections between code units.
– Then use this code vector to produce an ideal data vector.
– Then add Gaussian noise.
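As a sketch of this generative process (the lateral weights, sizes and number of Gibbs steps below are made-up assumptions), the prior is sampled by running Gibbs updates among the code units, and the result is then decoded and corrupted with Gaussian noise:

import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_code, n_pixels = 16, 64                     # assumed sizes
L = rng.normal(0.0, 0.5, (n_code, n_code))
L = np.triu(L, 1) + np.triu(L, 1).T           # symmetric lateral connections among code units
W = rng.normal(0.0, 0.1, (n_pixels, n_code))  # generative weights (code -> pixels)
sigma = 0.1

def sample_from_hybrid(gibbs_steps=50):
    # 1. Sample a binary code from the prior defined by the lateral connections.
    s = (rng.random(n_code) < 0.5).astype(float)
    for _ in range(gibbs_steps):
        for j in rng.permutation(n_code):
            s[j] = float(rng.random() < sigmoid(L[j] @ s))
    # 2. Use the code to produce an ideal data vector.  3. Add Gaussian noise.
    return W @ s + rng.normal(0.0, sigma, n_pixels)

x = sample_from_hybrid()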

A hybrid model
[Figure: the hybrid architecture, with a recognition model running from the data up to the code units; annotations note that the partition function is independent of the causal model and that the free energy is the expected energy minus the entropy.]

The learning procedure
Do a forward pass through the recognition model to compute q+ values for the code units.
Use the q+ values to compute top-down predictions of the data and use the expected prediction errors to compute:
– derivatives for the generative weights
– likelihood derivatives for the q+ values.
Run the code units for a few steps ignoring the data to get the q− values. Use these q− values to compute:
– the derivatives for the lateral weights
– the derivatives for the q+ values that come from the prior.
Combine the likelihood and prior derivatives of the q+ values and backpropagate through the recognition net.
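To show how these pieces fit together, here is a rough NumPy sketch of one update. It is a simplification under stated assumptions: the sizes and learning rate are invented, the prior derivative for the q+ values is approximated crudely by the difference of lateral inputs at q+ and q−, and the recognition net is a single logistic layer; none of this is claimed to match the original implementation.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_pixels, n_code = 64, 16                     # assumed sizes
R = rng.normal(0.0, 0.1, (n_code, n_pixels))  # recognition weights (data -> code)
W = rng.normal(0.0, 0.1, (n_pixels, n_code))  # generative weights (code -> data)
L = np.zeros((n_code, n_code))                # symmetric lateral weights among code units
lr = 0.01

def train_step(d):
    # Forward pass through the recognition model: q+ values for the code units.
    q_plus = sigmoid(R @ d)

    # Top-down prediction of the data from q+ and the prediction error.
    err = d - W @ q_plus

    # Derivatives for the generative weights.
    dW = np.outer(err, q_plus)

    # Run the code units for a few steps, ignoring the data, to get q-.
    q_minus = q_plus.copy()
    for _ in range(3):
        q_minus = sigmoid(L @ q_minus)

    # Derivatives for the lateral weights: q+ statistics minus q- statistics.
    dL = np.outer(q_plus, q_plus) - np.outer(q_minus, q_minus)
    np.fill_diagonal(dL, 0.0)

    # Likelihood derivative plus (crude) prior derivative for the q+ values,
    # backpropagated through the logistic recognition net.
    dq = W.T @ err + (L @ q_plus - L @ q_minus)
    dR = np.outer(dq * q_plus * (1.0 - q_plus), d)
    return dW, dL, dR

d = rng.normal(size=n_pixels)                 # stand-in "image" just to run the step
dW, dL, dR = train_step(d)
W += lr * dW
L += lr * (dL + dL.T) / 2                     # keep the lateral weights symmetric
R += lr * dR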

Simulation by Kejie Bao

Generative weights of hidden units

Adding more hidden layers
[Figure: a multilayer architecture with a recognition model running upward through the hidden layers.]

The cost function for a multilayer model
[Equation omitted on the slide; its annotation notes a conditional partition function that depends on the current top-down inputs to each unit.]

The learning procedure for multiple hidden layers
The top-down inputs control the conditional partition function of a layer, but all the required derivatives can still be found using the differences between the q+ and the q− statistics.
The learning procedure is just the same, except that the top-down inputs to a layer from the layer above must be frozen in place while each layer separately runs its brief Markov chain.

Advantages of a causal hierarchy of Markov Random Fields
Allows clean-up at each stage of generation in a multilayer generative model. This makes it easy to maintain constraints.
The lateral connections implement a prior that squeezes the redundancy out of each hidden layer by making most possible configurations very unlikely. This creates a bottleneck of the appropriate size.
The causal connections between layers separate the partition functions so that the whole net does not have to settle. Each layer can settle separately.
– This solves Terry's problem.

THE END

Energy-based models with deterministic hidden units
Use multiple layers of deterministic hidden units with non-linear activation functions.
Hidden activities contribute additively to the global energy, E.
[Figure: a network with the data at the bottom and hidden units j and k contributing energies E_j and E_k.]

Contrastive divergence
The aim is to minimize the amount by which a step toward equilibrium improves the data distribution.
Minimize contrastive divergence:
\mathrm{CD} = \mathrm{KL}\big(P^0 \,\|\, P_\theta^\infty\big) - \mathrm{KL}\big(P_\theta^1 \,\|\, P_\theta^\infty\big)
where P^0 is the data distribution, P_\theta^\infty is the model's distribution, and P_\theta^1 is the distribution after one step of the Markov chain.
The first term minimizes the divergence between the data distribution and the model's distribution; the second term maximizes the divergence between the confabulations and the model's distribution.

Contrastive divergence
Changing the parameters changes the distribution of confabulations.
Contrastive divergence makes the awkward terms cancel.
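In practice (this is the standard consequence in the contrastive divergence literature, not a line from the slide), the cancellation means the weights of a Boltzmann machine can be updated using only data and confabulation statistics:
-\frac{\partial\,\mathrm{CD}}{\partial w_{ij}} \;\approx\; \langle s_i s_j \rangle^{0} - \langle s_i s_j \rangle^{1}
where the averages are taken over the data distribution and over the distribution after one step of the Markov chain, respectively.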