
Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient
Tijmen Tieleman, University of Toronto
(Training MRFs using a new algorithm: Persistent Contrastive Divergence)

A problem with MRFs
Markov Random Fields (MRFs) for unsupervised learning (data density modeling) are intractable in general. Popular workarounds:
– Very restricted connectivity.
– Inaccurate gradient approximators.
– Deciding that MRFs are scary, and avoiding them.
This paper: there is a simple solution.

Details of the problem
MRFs are unnormalized. To balance the model against the data, we need samples from the model.
– In places where the model assigns too much probability compared to the data, we need to reduce probability.
– The difficult part is finding those places: exact sampling from MRFs is intractable.
Exact sampling would mean MCMC with infinitely many Gibbs transitions.
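For reference, the quantity these methods approximate is the log-likelihood gradient; for an RBM with weights W_ij it has the standard two-expectation form below (the notation is mine, not from the slides). The second expectation is the model-sample term that makes training hard.

```latex
\frac{\partial \log p(\mathbf{v})}{\partial W_{ij}}
  = \langle v_i h_j \rangle_{\text{data}}
  - \langle v_i h_j \rangle_{\text{model}}
```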

Approximating algorithms: Contrastive Divergence; Pseudo-Likelihood
These use surrogate samples that stay close to the training data, so balancing happens only locally. Far from the training data, anything can happen.
– In particular, the model can put much of its probability mass far from the data.
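As a point of comparison, here is a minimal sketch of a CD-1 update for a binary RBM in numpy. All names (W, b_v, b_h, the helper functions, the learning rate) are illustrative, not from the paper; the point is that CD-1's negative statistics come from one Gibbs step away from the data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, b_h):
    # Hidden unit probabilities and a binary sample, for a binary RBM.
    p_h = sigmoid(v @ W + b_h)
    return p_h, (np.random.rand(*p_h.shape) < p_h).astype(v.dtype)

def sample_v_given_h(h, W, b_v):
    # Visible unit probabilities and a binary sample.
    p_v = sigmoid(h @ W.T + b_v)
    return p_v, (np.random.rand(*p_v.shape) < p_v).astype(h.dtype)

def cd1_update(v_data, W, b_v, b_h, lr=0.01):
    # Positive phase: statistics at the training data.
    p_h_data, h_sample = sample_h_given_v(v_data, W, b_h)
    # Negative phase: one Gibbs step away from the data ("surrogate samples").
    _, v_model = sample_v_given_h(h_sample, W, b_v)
    p_h_model, _ = sample_h_given_v(v_model, W, b_h)
    # Approximate gradient: <v h>_data - <v h>_one-step-reconstruction.
    W += lr * (v_data.T @ p_h_data - v_model.T @ p_h_model) / len(v_data)
    b_v += lr * (v_data - v_model).mean(axis=0)
    b_h += lr * (p_h_data - p_h_model).mean(axis=0)
```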

CD/PL problem, in pictures

CD/PL problem, in pictures (continued)
Figure: samples from an RBM that was trained with CD-1, next to what better samples would look like.

Solution
Gradient descent is iterative:
– We can reuse information from the previous gradient estimate.
Use a Markov chain to get samples. Plan: keep the Markov chain close to equilibrium.
– Do a few transitions after each weight update, so the chain catches up after the model changes.
– Do not reset the Markov chain after a weight update (hence 'Persistent' CD).
Thus we always have samples from very close to the current model (a code sketch follows below).
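A minimal sketch of the persistent update described above, reusing the helper functions from the CD-1 sketch. The only structural change is that the negative-phase chain state (v_persistent) is carried across weight updates instead of being re-initialized at the data; names and defaults are illustrative.

```python
def pcd_update(v_data, v_persistent, W, b_v, b_h, lr=0.001, k=1):
    # Reuses sample_h_given_v / sample_v_given_h from the CD-1 sketch above.
    # Positive phase: statistics at the training data.
    p_h_data, _ = sample_h_given_v(v_data, W, b_h)

    # Negative phase: run the persistent chains forward k Gibbs transitions.
    # Crucially, the chains are NOT re-initialized at the data.
    v_model = v_persistent
    for _ in range(k):
        _, h_model = sample_h_given_v(v_model, W, b_h)
        _, v_model = sample_v_given_h(h_model, W, b_v)
    p_h_model, _ = sample_h_given_v(v_model, W, b_h)

    # Same update form as CD-1, but with negative statistics from the chains.
    W += lr * (v_data.T @ p_h_data / len(v_data)
               - v_model.T @ p_h_model / len(v_model))
    b_v += lr * (v_data.mean(axis=0) - v_model.mean(axis=0))
    b_h += lr * (p_h_data.mean(axis=0) - p_h_model.mean(axis=0))
    return v_model  # becomes v_persistent for the next weight update
```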

More about the Solution
If we did not change the model at all, we would have exact samples (after burn-in); it would be a regular Markov chain. The model does change slightly at each update,
– so the Markov chain is always a little behind.
This is known in statistics as 'stochastic approximation'.
– Conditions for convergence have been analyzed.

In practice…
– Use 1 Gibbs transition per weight update.
– Use several persistent chains (e.g. 100).
– Use a smaller learning rate than for CD-1.
– Converting an existing CD-1 program is easy (see the training-loop sketch below).
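Concretely, continuing the sketches above, a training loop following the slide's suggestions (1 transition per update, e.g. 100 persistent chains, a smaller learning rate than CD-1) might look like this. The data here is a random stand-in and all sizes are illustrative.

```python
n_visible, n_hidden, n_chains = 25, 50, 100
np.random.seed(0)
W = 0.01 * np.random.randn(n_visible, n_hidden)
b_v = np.zeros(n_visible)
b_h = np.zeros(n_hidden)

# Persistent "fantasy" chains, initialized once and never reset.
v_persistent = (np.random.rand(n_chains, n_visible) < 0.5).astype(float)

# Toy binary data standing in for, e.g., 5x5 image patches.
train_data = (np.random.rand(10000, n_visible) < 0.5).astype(float)

for epoch in range(5):
    np.random.shuffle(train_data)
    for start in range(0, len(train_data), n_chains):
        v_batch = train_data[start:start + n_chains]
        # One Gibbs transition per update, smaller learning rate than CD-1.
        v_persistent = pcd_update(v_batch, v_persistent, W, b_v, b_h,
                                  lr=0.001, k=1)
```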

Results on fully visible MRFs
Data: 5x5 patches of MNIST digits. Fully connected model with no hidden units, so the data-dependent statistics need to be computed only once.

Results on RBMs
– Mini-RBM data density modeling.
– Classification (see also Hugo Larochelle's poster).

More experiments
– Infinite data, i.e. training data = test data.
– Bigger data (horse image segmentations).

More experiments
Full-size RBM data density modeling (see also Ruslan Salakhutdinov's poster).

Balancing now works

Conclusion
A simple algorithm; much closer to the likelihood gradient.

Notes: learning rate
PCD is not always best. Not with:
– little training time (i.e. a big data set).
PCD has high variance; CD-10 is occasionally better.

Notes: weight decay
Weight decay (WD) helps all CD algorithms, including PCD,
– even with infinite data!
PCD needs less weight decay; in fact, zero works fine. Reason: PCD is less dependent on the mixing rate (see the snippet below for where weight decay enters the update).
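For completeness, a minimal illustration of where weight decay enters the update used by the CD-1 and PCD sketches above; the function and coefficient names are mine, not from the slides.

```python
def weight_update_with_decay(W, pos_stats, neg_stats, lr=0.001, weight_decay=1e-4):
    # pos_stats / neg_stats: the data-side and model-side <v_i h_j> estimates
    # from the CD-1 / PCD sketches above. Weight decay shrinks W toward zero.
    W += lr * (pos_stats - neg_stats - weight_decay * W)
    return W
```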

Acknowledgements
– Supervisor and inspiration in general: Geoffrey Hinton.
– Useful discussions: Ruslan Salakhutdinov.
– Data sets: Nikola Karamanov & Alex Levinshtein.
– Financial support: NSERC and Microsoft.
– Reviewers (who suggested extensive experiments).