Restricted Boltzmann Machines and Deep Belief Networks


Restricted Boltzmann Machines and Deep Belief Networks
Presented by Matt Luciw
Using a vast, vast majority of slides originally from: Geoffrey Hinton, Sue Becker, Yann Le Cun, Yoshua Bengio, Frank Wood

Motivations
- Supervised training of deep models (e.g. many-layered neural nets) is difficult: it poses a hard optimization problem.
- Shallow models (SVMs, one-hidden-layer neural nets, boosting, etc.) are unlikely candidates for learning the high-level abstractions needed for AI.
- Unsupervised learning can do "local learning": each module tries its best to model what it sees.
- Inference (and learning) is intractable in directed graphical models with many hidden variables.
- Current unsupervised learning methods don't easily extend to learning multiple levels of representation.

Belief Nets
A belief net is a directed acyclic graph composed of stochastic variables (in the figure, stochastic hidden causes with visible effects). We can observe some of the variables, and we would like to solve two problems:
- The inference problem: infer the states of the unobserved variables.
- The learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.
We use nets composed of layers of stochastic binary variables with weighted connections. Later, we will generalize to other types of variable.

Stochastic binary neurons
These have a state of 1 or 0 which is a stochastic function of the neuron's bias, b, and the input the neuron receives from other neurons. (The slide's figure shows the logistic curve, rising from 0 through 0.5 to 1.)
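The turn-on probability that the figure depicts is the standard logistic function of the unit's total input (a reconstruction of the pictured equation, not text from the slide):

p(s_i = 1) = \frac{1}{1 + \exp\!\left(-b_i - \sum_j s_j w_{ji}\right)}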

Stochastic units
Replace the binary threshold units by binary stochastic units that make biased random decisions. The temperature controls the amount of noise: decreasing all the energy gaps between configurations is equivalent to raising the noise level.
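A minimal NumPy sketch of such a unit (my own illustration; the function and variable names are not from the slides):

    import numpy as np

    def sample_stochastic_binary(states, weights, bias, temperature=1.0, rng=None):
        """Sample a 0/1 state with logistic turn-on probability.

        Dividing the energy gap by the temperature raises the noise level,
        which has the same effect as shrinking all the energy gaps.
        """
        rng = np.random.default_rng() if rng is None else rng
        energy_gap = bias + states @ weights      # total input from the other units
        p_on = 1.0 / (1.0 + np.exp(-energy_gap / temperature))
        return float(rng.random() < p_on)

For example, sample_stochastic_binary(np.array([1.0, 0.0, 1.0]), np.array([0.5, -0.3, 0.2]), bias=-0.1) returns 1.0 with probability roughly 0.6.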

The energy of a joint configuration
With configuration v on the visible units and h on the hidden units, the energy is

E(v,h) = -\sum_i s_i^{v,h} b_i - \sum_{i<j} s_i^{v,h} s_j^{v,h} w_{ij}

where s_i^{v,h} is the binary state of unit i in joint configuration (v,h), b_i is the bias of unit i, w_{ij} is the weight between units i and j, and the sum over i < j indexes every non-identical pair of i and j once.

Weights → Energies → Probabilities
Each possible joint configuration of the visible and hidden units has an energy. The energy is determined by the weights and biases (as in a Hopfield net). The energy of a joint configuration of the visible and hidden units determines its probability (the formula is reconstructed below). The probability of a configuration over the visible units is found by summing the probabilities of all the joint configurations that contain it.
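The probability the slide refers to (shown there only as a pictured formula) is the Boltzmann distribution; reconstructed here for reference:

p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}, \qquad p(v) = \sum_h p(v,h)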

Restricted Boltzmann Machines
We restrict the connectivity to make learning easier:
- Only one layer of hidden units (we deal with more layers later).
- No connections between hidden units.
In an RBM, the hidden units are conditionally independent given the visible states, so we can quickly get an unbiased sample from the posterior distribution when given a data vector. This is a big advantage over directed belief nets.
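Concretely, conditional independence means the posterior factorizes, with each unit sampled from a logistic (the standard RBM conditionals, reconstructed rather than quoted from the slide):

p(h_j = 1 \mid v) = \sigma\!\Big(b_j + \sum_i v_i w_{ij}\Big), \qquad p(v_i = 1 \mid h) = \sigma\!\Big(a_i + \sum_j h_j w_{ij}\Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}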

Maximizing the training data log likelihood
We write the model in standard product-of-experts (PoE) form and, assuming the data vectors d are drawn independently from the data distribution, we want the parameters that maximize the log likelihood summed over all training data. Differentiate with respect to all parameters and perform gradient ascent to find optimal parameters. The derivation is nasty. (Frank Wood - fwood@cs.brown.edu)

Equilibrium Is Hard to Achieve
With the gradient above we can now train our PoE model. But there's a problem: the expectation under the model distribution is computationally infeasible to obtain (especially in an inner gradient-ascent loop), because the sampling Markov chain must converge to the target distribution, and often this takes a very long time! (Frank Wood - fwood@cs.brown.edu)

A very surprising fact
Everything that one weight needs to know about the other weights and the data in order to do maximum likelihood learning is contained in the difference of two correlations: the derivative of the log probability of one training vector with respect to a weight equals the expected value of the product of the two units' states at thermal equilibrium when the training vector is clamped on the visible units, minus the same expectation when nothing is clamped.
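In symbols (a reconstruction of the pictured derivative, consistent with the energy function defined above):

\frac{\partial \log p(v)}{\partial w_{ij}} = \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}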

The (theoretical) batch learning algorithm
Positive phase:
- Clamp a data vector on the visible units.
- Let the hidden units reach thermal equilibrium at a temperature of 1.
- Sample ⟨s_i s_j⟩ for all pairs of units.
- Repeat for all data vectors in the training set.
Negative phase:
- Do not clamp any of the units.
- Let the whole network reach thermal equilibrium at a temperature of 1 (where do we start?).
- Sample ⟨s_i s_j⟩ for all pairs of units.
- Repeat many times to get good estimates.
Weight updates:
- Update each weight by an amount proportional to the difference in ⟨s_i s_j⟩ between the two phases.

Solution: Contrastive Divergence!
Now we don't have to run the sampling Markov chain to convergence; instead we can stop after 1 iteration (or perhaps a few iterations, more typically). Why does this work? It attempts to minimize the ways that the model distorts the data. (Frank Wood - fwood@cs.brown.edu)

Contrastive Divergence
- Maximum likelihood gradient: pull down the energy surface at the examples and pull it up everywhere else, with more emphasis where the model puts more probability mass.
- Contrastive divergence updates: pull down the energy surface at the examples and pull it up in their neighborhood, with more emphasis where the model puts more probability mass.

Restricted Boltzmann Machines
In an RBM, the hidden units are conditionally independent given the visible states, so it only takes one step to reach thermal equilibrium when the visible units are clamped. We can therefore quickly get the exact value of the clamped expectation ⟨v_i h_j⟩.

A picture of the Boltzmann machine learning algorithm for an RBM
(The figure shows alternating Gibbs sampling at t = 0, 1, 2, …, ∞; the sample at t = ∞ is a "fantasy".) Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

Contrastive divergence learning: A quick way to learn an RBM
(The figure shows the chain at t = 0 (data) and t = 1 (reconstruction).) Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a "reconstruction". Update the hidden units again. This is not following the gradient of the log likelihood, but it works well. When we consider infinite directed nets it will be easy to see why it works.
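A minimal NumPy sketch of the CD-1 update described on this slide (my own illustration; the function and variable names are assumptions, not from the slides):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0, W, a, b, lr=0.1, rng=None):
        """One CD-1 step for a binary RBM.

        v0: batch of data vectors, shape (batch, n_visible)
        W:  weights, shape (n_visible, n_hidden); a, b: visible/hidden biases.
        Returns updated (W, a, b).
        """
        rng = np.random.default_rng() if rng is None else rng

        # Positive phase: sample the hidden units given the data.
        ph0 = sigmoid(v0 @ W + b)
        h0 = (rng.random(ph0.shape) < ph0).astype(v0.dtype)

        # Negative phase: one reconstruction step, then the hidden probabilities again.
        pv1 = sigmoid(h0 @ W.T + a)
        v1 = (rng.random(pv1.shape) < pv1).astype(v0.dtype)
        ph1 = sigmoid(v1 @ W + b)

        # <v h>_data minus <v h>_reconstruction, averaged over the batch.
        batch = v0.shape[0]
        W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / batch
        a = a + lr * (v0 - v1).mean(axis=0)
        b = b + lr * (ph0 - ph1).mean(axis=0)
        return W, a, b

Repeated over mini-batches, this implements the data-minus-reconstruction rule used in the digit example on the next slide.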

How to learn a set of features that are good for reconstructing images of the digit 2
(The figure shows 50 binary feature neurons above a 16 x 16 pixel image, once for the data ("reality") and once for the reconstruction ("better than reality").) On the data, increment the weights between an active pixel and an active feature; on the reconstruction, decrement the weights between an active pixel and an active feature.

Using an RBM to learn a model of a digit class
(The figure shows an RBM with 100 hidden units (features) and 256 visible units (pixels), together with rows of images: the data, reconstructions by a model trained on 2's, and reconstructions by a model trained on 3's.)

The final 50 x 256 weights: each neuron grabs a different feature.

How well can we reconstruct the digit images from the binary feature activations?
(The figure shows data and reconstructions from the activated binary features for two cases: new test images from the digit class that the model was trained on, and images from an unfamiliar digit class, where the network tries to see every image as a 2.)

Deep Belief Networks

Explaining away (Judea Pearl)
Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence. If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck. (In the figure, "truck hits house" and "earthquake" each have bias -10 and a weight of +20 to "house jumps", which has bias -20; the posterior over the two causes given that the house jumps is p(1,1)=.0001, p(1,0)=.4999, p(0,1)=.4999, p(0,0)=.0001.)
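A small sketch that recomputes that posterior from the sigmoid belief net in the figure (my own worked example; it reproduces the slide's numbers up to rounding):

    import numpy as np
    from itertools import product

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Biases and weights taken from the slide's figure.
    bias_truck, bias_quake, bias_house = -10.0, -10.0, -20.0
    w_truck, w_quake = 20.0, 20.0

    # Unnormalized posterior over (truck, earthquake) given house_jumps = 1.
    posterior = {}
    for truck, quake in product([0, 1], repeat=2):
        p_truck = sigmoid(bias_truck) if truck else 1 - sigmoid(bias_truck)
        p_quake = sigmoid(bias_quake) if quake else 1 - sigmoid(bias_quake)
        p_house = sigmoid(bias_house + w_truck * truck + w_quake * quake)
        posterior[(truck, quake)] = p_truck * p_quake * p_house

    total = sum(posterior.values())
    for config, p in posterior.items():
        print(config, round(p / total, 4))   # (1,0) and (0,1) each get ~0.5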

Why multilayer learning is hard in a sigmoid belief net
To learn W (the weights between the data and the first hidden layer), we need the posterior distribution in the first hidden layer.
- Problem 1: The posterior is typically intractable because of "explaining away".
- Problem 2: The posterior depends on the prior created by higher layers as well as the likelihood. So to learn W, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact.
- Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!
(The figure shows a stack of hidden-variable layers forming the prior above the first hidden layer, with the likelihood term W connecting it to the data.)

Solution: Complementary Priors
There is a special type of multi-layer directed model in which it is easy to infer the posterior distribution over the hidden units, because it has complementary priors. This special type of directed model is equivalent to an undirected model. At first, this equivalence just seems like a neat trick, but it leads to a very effective new learning algorithm that allows multilayer directed nets to be learned one layer at a time. The new learning algorithm resembles boosting, with each layer being like a weak learner.

An example of a complementary prior
An infinite DAG with replicated weights (the figure shows the layers …, h2, v2, h1, v1, h0, v0). An ancestral pass of the DAG is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium. This infinite DAG defines the same distribution as an RBM.

Inference in a DAG with replicated weights
The variables in h0 are conditionally independent given v0, so inference is trivial: we just multiply v0 by the transposed weight matrix. This is because the model above h0 implements a complementary prior. Inference in the DAG is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data. (The figure again shows the infinite stack …, h2, v2, h1, v1, h0, v0.)

Divide and conquer multilayer learning
Re-representing the data: each time the base learner is called, it passes a transformed version of the data to the next learner. Can we learn a deep, dense DAG one layer at a time, starting at the bottom, and still guarantee that learning each layer improves the overall model of the training data? This seems very unlikely: surely we need to know the weights in higher layers to learn the lower layers?

Multilayer contrastive divergence Start by learning one hidden layer. Then re-present the data as the activities of the hidden units. The same learning algorithm can now be applied to the re-presented data. Can we prove that each step of this greedy learning improves the log probability of the data under the overall model? What is the overall model?
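A minimal sketch of that greedy procedure, reusing the hypothetical cd1_update from the earlier CD-1 example (illustrative only; layer sizes, epoch counts, and learning rates are placeholders):

    import numpy as np

    def train_rbm(data, n_hidden, epochs=10, lr=0.1, rng=None):
        """Train one binary RBM on `data` with CD-1 and return its parameters."""
        rng = np.random.default_rng() if rng is None else rng
        n_visible = data.shape[1]
        W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        a = np.zeros(n_visible)
        b = np.zeros(n_hidden)
        for _ in range(epochs):
            W, a, b = cd1_update(data, W, a, b, lr=lr, rng=rng)  # from the earlier sketch
        return W, a, b

    def train_dbn(data, layer_sizes, rng=None):
        """Greedy layer-wise training: each RBM is trained on the hidden
        activities (the re-presented data) of the RBM below it."""
        rng = np.random.default_rng() if rng is None else rng
        rbms, x = [], data
        for n_hidden in layer_sizes:
            W, a, b = train_rbm(x, n_hidden, rng=rng)
            rbms.append((W, a, b))
            # Re-present the data as hidden probabilities for the next layer.
            x = 1.0 / (1.0 + np.exp(-(x @ W + b)))
        return rbms

For instance, train_dbn(binary_images, layer_sizes=[500, 500, 2000]) would mimic the shape of the digit-recognition architecture shown later in these slides.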

Learning a deep directed network
First learn with all the weights tied. This is exactly equivalent to learning an RBM. Contrastive divergence learning is equivalent to ignoring the small derivatives contributed by the tied weights between deeper layers. (The figure shows the infinite net …, h2, v2, h1, v1, h0, v0 alongside the equivalent RBM between v0 and h0.)

Then freeze the first layer of weights in both directions and learn the remaining weights (still tied together). This is equivalent to learning another RBM, using the aggregated posterior distribution of h0 as the data. (The figure shows the remaining infinite net alongside the equivalent RBM between h0 and v1.)

A simplified version with all hidden layers the same size
The RBM at the top can be viewed as shorthand for an infinite directed net. When learning W1 we can view the model in two quite different ways:
- The model is an RBM composed of the data layer and h1.
- The model is an infinite DAG with tied weights.
After learning W1 we untie it from the other weight matrices. We then learn W2, which is still tied to all the matrices above it. (The figure shows the stack: data, h1, h2, h3, …)

Why greedy learning works
Each time we learn a new layer, the inference at the layer below becomes incorrect, but the variational bound on the log prob of the data improves (only true in theory -ml). Since the bound starts as an equality, learning a new layer never decreases the log prob of the data, provided we start the learning from the tied weights that implement the complementary prior. Now that we have a guarantee, we can loosen the restrictions and still feel confident:
- Allow layers to vary in size.
- Do not start the learning at each layer from the weights in the layer below.

Why the hidden configurations should be treated as data when learning the next layer of weights
After learning the first layer of weights, we have a variational bound on the log probability of the data. If we freeze the generative weights that define the likelihood term and the recognition weights that define the distribution over hidden configurations, the only term that can still improve is the prior over hidden configurations, so maximizing the RHS is equivalent to maximizing the log prob of "data" (hidden configurations) that occur with the probability given by the recognition weights.
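A reconstruction of the variational bound the slide refers to (the standard bound for sigmoid belief nets and DBNs, not verbatim from the slide): with recognition distribution Q(h|v),

\log p(v) \;\ge\; \sum_h Q(h \mid v)\big[\log p(h) + \log p(v \mid h)\big] \;-\; \sum_h Q(h \mid v)\log Q(h \mid v)

With Q(h|v) and p(v|h) frozen, only the term \sum_h Q(h \mid v)\log p(h) can be improved, and that is exactly the expected log probability of "data" h drawn with probability Q(h|v).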

Fine-tuning with a contrastive version of the "wake-sleep" algorithm
After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass: adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM: adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass: adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
Not required, but it helps the recognition rate (-ml).

A neural network model of digit recognition
The top two layers form a restricted Boltzmann machine whose free-energy landscape models the low-dimensional manifolds of the digits. The valleys have names. The model learns a joint density for labels and images. To perform recognition we can start with a neutral state of the label units and do one or two iterations of the top-level RBM, or we can just compute the free energy of the RBM with each of the 10 labels. (The figure shows the architecture: 2000 top-level units connected to 500 units plus 10 label units, above another layer of 500 units, above a 28 x 28 pixel image.)
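For reference, the free energy of a binary RBM (a standard formula, not shown as text on the slide) that would be evaluated for each candidate label: writing v for the top RBM's other visible input (here the 500 penultimate features), y for a one-of-10 label vector, and u, c for hypothetical label-to-hidden weights and label biases,

F(v, y) = -\sum_i v_i a_i \;-\; \sum_k y_k c_k \;-\; \sum_j \log\!\Big(1 + e^{\,b_j + \sum_i v_i w_{ij} + \sum_k y_k u_{kj}}\Big)

and the label with the lowest free energy is chosen.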

Show the movie of the network generating digits (available at www.cs.toronto/~hinton).

Limits of the Generative Model
1. Designed for images where non-binary values can be treated as probabilities.
2. Top-down feedback only in the highest (associative) layer.
3. No systematic way to deal with invariance.
4. Assumes segmentation has already been performed and does not learn to attend to the most informative parts of objects.