CSC2535 2011 Lecture 6a Learning Multiplicative Interactions Geoffrey Hinton.

Presentation transcript:

CSC2535 2011 Lecture 6a Learning Multiplicative Interactions Geoffrey Hinton

Two different meanings of "multiplicative"
If we take two density models and multiply together their probability distributions at each point in data-space, we get a "product of experts".
–The product of two Gaussian experts is a Gaussian.
If we take two variables and multiply them together to provide input to a third variable, we get a "multiplicative interaction".
–The distribution of the product of two Gaussian-distributed variables is NOT Gaussian distributed. It is a heavy-tailed distribution. One Gaussian determines the standard deviation of the other Gaussian.
–Heavy-tailed distributions are the signatures of multiplicative interactions between latent variables.
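A quick numerical illustration (not from the slides; plain NumPy with arbitrarily chosen sizes): the product of two independent Gaussian variables has strongly positive excess kurtosis, the signature of heavy tails.

import numpy as np

def excess_kurtosis(a):
    # 0 for a Gaussian; large positive values indicate heavy tails.
    a = a - a.mean()
    return (a ** 4).mean() / (a ** 2).mean() ** 2 - 3.0

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)   # one Gaussian variable
s = rng.standard_normal(200_000)   # the other Gaussian, acting as a scale
product = x * s                    # multiplicative interaction

print(excess_kurtosis(x))          # close to 0
print(excess_kurtosis(product))    # close to 6: heavy-tailed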

The heavy-tailed world
The prediction errors for financial time-series are typically heavy-tailed. This is mainly because the variance is much higher in times of uncertainty.
The prediction errors made by linear dynamical systems are usually heavy-tailed on real data.
–Occasionally very weird things happen. This violates the conditions of the central limit theorem.
The outputs of linear filters applied to images are heavy-tailed.
–Gabor filters nearly always output almost exactly zero, but occasionally they have large outputs.
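A small hedged sketch of the filter point: a piecewise-constant signal stands in for the smooth-regions-plus-edges structure of natural images, and a simple difference filter stands in for a Gabor filter; the synthetic signal and thresholds are my own choices, not from the lecture.

import numpy as np

rng = np.random.default_rng(1)
# Synthetic "image row": piecewise-constant segments with occasional jumps.
levels = rng.normal(size=200).repeat(50)              # 10,000 samples, 200 segments
signal = levels + 0.01 * rng.standard_normal(levels.size)

# A simple derivative filter, a crude stand-in for a Gabor filter.
response = np.convolve(signal, [1.0, -1.0], mode="valid")

def excess_kurtosis(a):
    a = a - a.mean()
    return (a ** 4).mean() / (a ** 2).mean() ** 2 - 3.0

print("fraction of responses near zero:", np.mean(np.abs(response) < 0.05))
print("excess kurtosis of responses:   ", excess_kurtosis(response))   # >> 0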

Learning multiplicative interactions
It is fairly easy to learn multiplicative interactions if all of the variables are observed.
–This is possible if we control the variables used to create a training set (e.g. pose, lighting, identity, ...).
It is also easy to learn energy-based models in which all but one of the terms in each multiplicative interaction are observed.
–Inference is still easy.
If more than one of the terms in each multiplicative interaction is unobserved, the interactions between hidden variables make inference difficult.
–Alternating Gibbs sampling can be used if the latent variables form a bipartite graph.

Higher-order Boltzmann machines (Sejnowski, ~1986)
The usual energy function is quadratic in the states, but we could use higher-order interactions (both forms are written out below).
Hidden unit h acts as a switch. When h is on, it switches in the pairwise interaction between unit i and unit j.
–Units i and j can also be viewed as switches that control the pairwise interactions between j and h or between i and h.
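The slide's own equations are not in the transcript; in the standard notation for binary stochastic states s (biases omitted in the higher-order case) they are:

E(s) = -\sum_{i<j} w_{ij}\, s_i s_j \;-\; \sum_i b_i s_i            % quadratic energy
E(s) = -\sum_{i<j,\,h} w_{ijh}\, s_i s_j s_h                        % three-way energy: h gates the (i, j) interaction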

Learning how style and content interact
Tenenbaum and Freeman (2000) describe a model in which a "style" vector and a "content" vector interact multiplicatively to determine a datavector (e.g. an image).
The outer product of the style and content vectors determines a set of coefficients for basis functions.
–This is not at all like the way a user vector and a movie vector interact to determine a rating. The rating is the inner product.
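A minimal NumPy sketch of the bilinear idea (names, sizes, and random weights are illustrative, not Tenenbaum and Freeman's code): the outer product of a style vector and a content vector supplies one coefficient per basis vector, whereas a rating model only takes a single inner product.

import numpy as np

rng = np.random.default_rng(2)
J, K, D = 3, 4, 10                       # style dims, content dims, data dims (illustrative)
W = rng.standard_normal((J, K, D))       # one basis vector per (style, content) coefficient

s = rng.standard_normal(J)               # style vector
c = rng.standard_normal(K)               # content vector

# Bilinear model: outer(s, c) gives J*K coefficients for the basis vectors.
coeffs = np.outer(s, c)                           # shape (J, K)
datavector = np.einsum('jk,jkd->d', coeffs, W)    # shape (D,)

# Contrast: a user vector and a movie vector of the same length determine a
# rating by an inner product -- a single scalar, no basis functions.
user, movie = rng.standard_normal(5), rng.standard_normal(5)
rating = user @ movie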

It is an unfortunate coincidence that the number of components in each pose vector is equal to the number of different pose vectors. The model is only really interesting if we have fewer components per style or content vector than there are style or content vectors.

A higher-order Boltzmann machine with one visible group and two hidden groups
[Figure: a network with retina-based features, object-based features, and a viewing transform, given an ambiguous input: "Is this an I or an H?"]
We can view it as a Boltzmann machine in which the inputs create interactions between the other variables.
–This type of model is now called a conditional random field.
–Inference can be hard in this model.
–Inference is much easier with two visible groups and one hidden group.

Using higher-order Boltzmann machines to model image transformations (Memisevic and Hinton, 2007)
A global transformation specifies which pixel goes to which other pixel. Conversely, each pair of similar-intensity pixels, one in each image, votes for a particular global transformation.
[Figure: image(t) and image(t+1) connected through the image transformation units.]
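A hedged sketch of the unfactored three-way computation (a tiny random tensor, not the trained model from the paper): every pair of pixels, one from each image, votes through w[i, j, k] for the transformation units k.

import numpy as np

rng = np.random.default_rng(3)
n_pix, n_hid = 64, 20                                   # tiny sizes for illustration
w = 0.01 * rng.standard_normal((n_pix, n_pix, n_hid))   # w[i, j, k]

x = rng.standard_normal(n_pix)   # image(t), treated as given
y = rng.standard_normal(n_pix)   # image(t+1)

# Total input to each transformation unit k: sum_ij x_i * y_j * w_ijk.
hid_input = np.einsum('i,j,ijk->k', x, y, w)
hid_prob = 1.0 / (1.0 + np.exp(-hid_input))   # logistic transformation units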

Making the reconstruction easier
Condition on the first image so that only one visible group needs to be reconstructed.
–Given the hidden states and the previous image, the pixels in the second image are conditionally independent.
[Figure: image(t) and image(t+1) connected through the image transformation units.]

The main problem with 3-way interactions
There are far too many of them. We can reduce the number in several straightforward ways:
–Do dimensionality reduction on each group before the three-way interactions.
–Use spatial locality to limit the range of the three-way interactions.
A much more interesting approach (which can be combined with the other two) is to factor the interactions so that they can be specified with fewer parameters.
–This leads to a novel type of learning module.

Factoring three-way interactions
If three-way interactions are being used to model a nice regular multi-linear structure, we may not need cubically many degrees of freedom.
–For modelling effects like viewpoint and illumination, many fewer degrees of freedom may be sufficient.
There are many ways to factor 3-D interaction tensors. We use factors that correspond to 3-way outer products.
–Each factor only has 3N parameters.
–By using about N/3 factors we get quadratically many parameters, which is the same as a simple weight matrix.
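An illustrative sketch of the factorization (sizes are arbitrary): each factor contributes a rank-1 tensor, and with about N/3 factors the parameter count drops from cubic to quadratic.

import numpy as np

N, F = 30, 10                         # N units per group, F factors (about N/3)
rng = np.random.default_rng(4)
a = rng.standard_normal((N, F))       # group-1 to factor weights
b = rng.standard_normal((N, F))       # group-2 to factor weights
c = rng.standard_normal((N, F))       # group-3 to factor weights

# Implied three-way tensor: w[i, j, k] = sum_f a[i, f] * b[j, f] * c[k, f].
w = np.einsum('if,jf,kf->ijk', a, b, c)

print("unfactored parameters:", N ** 3)      # 27,000 (cubic)
print("factored parameters:  ", 3 * N * F)   # 900 (3N per factor, quadratic overall)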

Factoring the three-way interactions
[Figure: the unfactored and factored energy functions (reconstructed below); how changing the binary state of unit j changes the energy contributed by factor f is what unit j needs to know in order to do Gibbs sampling.]
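Written out, the two energy functions the figure contrasts take the standard form from Memisevic and Hinton's factored model (a reconstruction; the slide's own equations are not in the transcript):

E = -\sum_{i,j,k} w_{ijk}\, u_i v_j h_k                                                        % unfactored
E = -\sum_{f} \Big(\sum_i u_i w^{u}_{if}\Big)\Big(\sum_j v_j w^{v}_{jf}\Big)\Big(\sum_k h_k w^{h}_{kf}\Big)   % factored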

A picture of the rank-1 tensor contributed by factor f
It's a 3-way outer product. Each layer is a scaled version of the same rank-1 matrix.

The dynamics
The visible and hidden units get weighted input from the factors and use this input in the usual stochastic way.
–They have stochastic binary states (or a mean-field approximation to stochastic binary states).
The factors are deterministic and implement a type of belief propagation. They do not have "states".
–Each factor computes three separate sums by adding up the input it gets from each separate group of units.
–Then it sends the product of the summed inputs from two groups to the third group.

Belief propagation
The outgoing message at each vertex of the factor is the product of the weighted sums at the other two vertices.
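A self-contained sketch of these factor messages (random weights and states, purely illustrative): each factor forms three weighted sums, one per group, and the input it sends to a group is the product of the other two groups' sums, mapped back through that group's weights.

import numpy as np

rng = np.random.default_rng(5)
N, F = 30, 10
a, b, c = (rng.standard_normal((N, F)) for _ in range(3))           # factor weights per group
u, v, h = (rng.integers(0, 2, N).astype(float) for _ in range(3))   # binary states

# Each factor computes three separate weighted sums, one per group.
su, sv, sh = u @ a, v @ b, h @ c          # each has shape (F,)

# Message to a group: product of the other two groups' sums, sent back
# through that group's factor weights -- exactly what Gibbs sampling needs.
input_to_u = a @ (sv * sh)
input_to_v = b @ (su * sh)
input_to_h = c @ (su * sv)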

A nasty numerical problem
In a standard Boltzmann machine the gradient of a weight on a training case always lies between -1 and 1. With factored three-way interactions, the gradient contains the product of two sums, each of which can be large, so the gradient can explode.
We can keep a running average of each sum over many training cases and divide the gradient by this average (or its square). This helps.
–For any particular weight, we must divide the gradient by the same quantity on all training cases to guarantee a positive correlation with the true gradient.
Updating the weights on every training case may also help, because we get feedback faster when weights are blowing up.
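One hedged way to implement the rescaling described above (the decay constant and the interface are my own choices, not from the lecture):

import numpy as np

class GradientScaler:
    """Keep a running average of the magnitude of the product-of-sums term
    (one entry per factor) and divide the raw gradient by it. Because the
    average moves slowly, every training case is divided by nearly the same
    quantity, preserving a positive correlation with the true gradient."""

    def __init__(self, n_factors, decay=0.99):
        self.avg = np.ones(n_factors)
        self.decay = decay

    def scale(self, raw_grad, product_of_sums):
        # raw_grad: (n_units, n_factors); product_of_sums: (n_factors,)
        self.avg = self.decay * self.avg + (1 - self.decay) * np.abs(product_of_sums)
        return raw_grad / self.avg        # broadcasts over the unit dimension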

Showing what a factor learns by alternating between its pre- and post-image receptive fields
[Figure: the factor's receptive field in the pre-image and in the post-image.]

The factor receptive fields
The network is trained on translated random dot patterns.

The network is trained on rotated random dot patterns.

How does it perceive two overlaid sparse dot patterns moving in different directions?
First we train a second hidden layer. Each of these units prefers motion in a different direction.
Then we compute the perceived motion by adding up the preferences of the active units in the second hidden layer.
If the two motions are within about 30 degrees it sees a single average motion. If they are further apart it sees two separate motions.
–The separate motions are slightly further apart than the real ones.
–This is just like human perception, and the model was not trained on transparent motion.
–The training is entirely unsupervised.

Time series models
Inference is difficult in directed models of time series if we use non-linear, distributed representations in the hidden units.
–It is hard to fit directed graphical models to high-dimensional sequences (e.g. motion capture data).
So people tend to use methods with much less representational power:
–HMMs give up on distributed representations.
–Linear dynamical systems give up on non-linearity.

The conditional RBM model (a partially observed bipartite CRF)
Start with a generic RBM. Add two types of conditioning connections.
Given the data, the hidden units at time t are conditionally independent.
The autoregressive weights can model most short-term temporal structure very well, leaving the hidden units to model nonlinear irregularities.
[Figure: visible frames at t-2, t-1 and t feeding the hidden units h and the current visible units v.]

Causal generation from a learned model
Keep the previous visible states fixed.
–They provide a time-dependent bias for the hidden units.
Perform alternating Gibbs sampling for a few iterations between the hidden units and the most recent visible units.
–This picks new hidden and visible states that are compatible with each other and with the recent history.
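A minimal sketch of the conditioning and the generation loop described on the last two slides (weight names, sizes, and the mean-field treatment of the visibles are placeholder choices, not Taylor and Hinton's released code):

import numpy as np

rng = np.random.default_rng(6)
n_vis, n_hid, order = 30, 50, 2               # visible dims, hidden dims, history length

W = 0.01 * rng.standard_normal((n_vis, n_hid))           # RBM weights
A = 0.01 * rng.standard_normal((order * n_vis, n_vis))   # autoregressive visible-to-visible weights
B = 0.01 * rng.standard_normal((order * n_vis, n_hid))   # conditioning visible-to-hidden weights
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_frame(history, n_gibbs=30):
    """history: the previous `order` visible frames concatenated, held fixed."""
    # The past frames act as time-dependent biases on the current units.
    dyn_v = b_v + history @ A
    dyn_h = b_h + history @ B
    v = dyn_v.copy()                                     # start from the dynamic mean
    for _ in range(n_gibbs):                             # alternating Gibbs sampling
        h = (sigmoid(v @ W + dyn_h) > rng.random(n_hid)).astype(float)
        v = dyn_v + h @ W.T                              # visibles kept at their conditional means
    return v

history = rng.standard_normal(order * n_vis)             # stand-in for two real frames
next_frame = generate_frame(history)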

Higher level models
Once we have trained the model, we can add more layers. Treat the hidden activities of the first CRBM as data for training the next CRBM.
–Add "autoregressive" connections to a layer when it becomes the visible layer.
Adding a second layer makes it generate more realistic sequences.
[Figure: the stacked model over frames t-2, t-1 and t.]

An application to modeling motion capture data
Human motion can be captured by placing reflective markers on the joints.
–Use lots of infrared cameras to track the 3-D positions of the markers.
Given a skeletal model, the 3-D positions of the markers can be converted into:
–The joint angles
–The 3-D translation of the pelvis
–The roll, pitch and delta yaw of the pelvis

Using a style variable to modulate the interactions (there is additional weight sharing: Taylor & Hinton, ICML 2009)
[Figure: 6 earlier visible frames and the current visible frame connect through 200 factors to 600 hidden units, modulated by 100 style features driven by a 1-of-N style variable.]

Show demos of multiple styles of walking. These can be found at