Wake-Sleep algorithm for Representational Learning


Wake-Sleep algorithm for Representational Learning
Hamid Reza Maei, Physiology & Neuroscience Program, University of Toronto

Motivation
The brain is able to learn the underlying representation of received input data (e.g. images) in an unsupervised manner. Challenges for standard neural networks:
1. They need a specific teacher to supply the desired output.
2. They need to train all of their connections.
The Wake-Sleep algorithm avoids these two problems.
[Figure: a layered network with data d, generative weights G1, G2 and recognition weights R1, R2 between the visible and hidden layers (V/H).]

Logistic belief network
[Figure: a two-layer belief net with top-layer units y_j and bottom-layer units x_i, connected by generative weights G; each x_i receives top-down input from the y_j.]
Advantage: the conditional distributions are factorial, as in the equation sketch below.
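A hedged sketch of this factorial conditional, writing G_ji for the generative weight from top-layer unit y_j to bottom-layer unit x_i; the bias term b_i and the symbol σ are my notation, not spelled out in the slides:

```latex
% Sigmoid (logistic) belief net: each unit x_i is on with a probability given
% by a sigmoid of its top-down input, so the layer's conditional factorizes.
p(x_i = 1 \mid \mathbf{y}) = \sigma\Big(b_i + \sum_j y_j \, G_{ji}\Big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```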

Learning the generative weights
Exact inference of the posterior P(h|d; G) is intractable because of explaining away: in the classic sprinkler/rain example, Sprinkler and Rain become conditionally dependent once the wet grass is observed.
Though it is very crude, let us approximate P(h|d; G) with a factorial distribution Q(h|d; R), parameterized by recognition weights R (see the sketch below).
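One common way to write such a factorial recognition distribution, sketched here with recognition weights R_ji and biases omitted for brevity (the exact parameterization is not shown in the slides):

```latex
% Factorial recognition distribution over hidden states h given data d:
Q(h \mid d; R) = \prod_i q_i^{\,h_i} (1 - q_i)^{1 - h_i},
\qquad q_i = \sigma\Big(\sum_j d_j \, R_{ji}\Big)
```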

Is there any guarantee that learning improves? Yes. Using Jensen's inequality we find a lower bound on the log likelihood, the negative of a free energy (see the derivation sketch below). Decreasing the free energy raises this lower bound on the log likelihood. This leads to the wake phase.
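A sketch of the bound in the notation used above (F denotes the free energy; the second line makes the gap explicit):

```latex
% Jensen's inequality gives a lower bound on the log likelihood:
\log p(d; G) = \log \sum_h p(h, d; G)
  \;\ge\; \sum_h Q(h \mid d; R) \log \frac{p(h, d; G)}{Q(h \mid d; R)}
  \;=\; -F(d; G, R)
% The gap is the KL divergence to the true posterior:
\log p(d; G) = -F(d; G, R) + \mathrm{KL}\big(Q(h \mid d; R) \,\|\, P(h \mid d; G)\big)
```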

Wake phase
Recall that the true posterior has been replaced by Q(h|d; R). Get samples (x^o and y^o) from the factorial distribution Q(h|d; R) in a bottom-up pass, then use these samples in the generative model to change the generative weights (delta rule sketched below).
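A sketch of the wake-phase delta rule for a single generative weight, assuming a learning rate ε and a bias b_i (standard wake-sleep notation; symbols not defined in the slides are assumptions):

```latex
% Wake phase: s are the states sampled bottom-up from Q(h|d;R); the generative
% weight into unit i is nudged so the top-down prediction p_i matches the
% sampled state s_i.
\Delta g_{ji} = \varepsilon \, s_j \, (s_i - p_i),
\qquad p_i = \sigma\Big(b_i + \sum_j s_j \, g_{ji}\Big)
```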

Learning the recognition weights
The derivative of the free energy with respect to R gives complicated terms that are computationally intractable. What should be done? Switch the arguments of the KL divergence (KL is not a symmetric function!) and change the recognition weights to minimize the resulting free energy, as sketched below. This leads to the sleep phase.
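A sketch of the switch in the notation above; the equivalence in the last line holds up to a term that does not depend on R:

```latex
% Wake phase (R fixed): lowering the free energy fits the generative model G.
F(d; G, R) = -\log p(d; G) + \mathrm{KL}\big(Q(h \mid d; R) \,\|\, P(h \mid d; G)\big)
% Sleep phase: swap the KL arguments and average over fantasies (h, d)
% drawn from the generative model, which is tractable:
\min_R \; \mathbb{E}_{(h, d) \sim p(h, d; G)}\big[-\log Q(h \mid d; R)\big]
  \;\equiv\;
  \min_R \; \mathbb{E}_{d \sim p(d; G)}\,
  \mathrm{KL}\big(P(h \mid d; G) \,\|\, Q(h \mid d; R)\big)
```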

Sleep phase
1. Get fantasy samples (x•, y•) generated by the generative model, i.e. using data coming from nowhere (a top-down dream).
2. Change the recognition connections using the delta rule sketched below.
Is there any guarantee of improvement for the sleep phase? No.
- In the sleep phase we are minimizing KL(P, Q), which is the wrong objective.
- In the wake phase we are minimizing KL(Q, P), which is the right thing to do.
[Figure: how the factorial Q approximates P under the wake-phase and sleep-phase objectives.]
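A sketch of the sleep-phase delta rule for a single recognition weight, with learning rate ε and recognition bias c_j assumed (they are not spelled out in the slides):

```latex
% Sleep phase: s are the fantasy states dreamed by the generative model; the
% recognition weight is nudged so that Q recovers the dreamed hidden state s_j.
\Delta r_{ij} = \varepsilon \, s_i \, (s_j - q_j),
\qquad q_j = \sigma\Big(c_j + \sum_i s_i \, r_{ij}\Big)
```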

The wake-sleep algorithm
1. Wake phase:
- Use the recognition weights to perform a bottom-up pass that creates samples for the layers above (starting from the data).
- Train the generative weights using the samples obtained from the recognition model.
2. Sleep phase:
- Use the generative weights to reconstruct data by performing a top-down pass.
- Train the recognition weights using the samples obtained from the generative model.
[Figure: a layered network with data d, generative weights G1, G2 and recognition weights R1, R2.]
What is the wake-sleep algorithm really trying to achieve? It turns out that its goal is to learn representations that are economical to describe, which can be formalized with Shannon's coding theory. A minimal code sketch of the full algorithm follows.
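A minimal NumPy sketch of one wake-sleep update for a Helmholtz machine with a single hidden layer (the networks in the slides have two hidden layers); all names (W_gen, W_rec, the biases, the learning rate) are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

n_vis, n_hid = 16, 8               # sizes chosen to match the 4x4 bars example
lr = 0.05                          # learning rate (assumed)

b_hid = np.zeros(n_hid)            # generative bias of the hidden layer (its "prior")
W_gen = np.zeros((n_hid, n_vis))   # top-down generative weights
b_vis = np.zeros(n_vis)            # generative bias of the visible layer
W_rec = np.zeros((n_vis, n_hid))   # bottom-up recognition weights
c_hid = np.zeros(n_hid)            # recognition bias of the hidden layer

def wake_sleep_step(d):
    """One wake-sleep update on a single binary data vector d of length n_vis."""
    global b_hid, W_gen, b_vis, W_rec, c_hid
    # Wake phase: bottom-up recognition pass, then delta rule on generative weights.
    q = sigmoid(d @ W_rec + c_hid)          # Q(h|d;R), factorial
    h = sample(q)                           # sample hidden states from Q
    p_vis = sigmoid(h @ W_gen + b_vis)      # top-down prediction of the data
    p_hid = sigmoid(b_hid)                  # generative prior over hidden units
    W_gen += lr * np.outer(h, d - p_vis)
    b_vis += lr * (d - p_vis)
    b_hid += lr * (h - p_hid)
    # Sleep phase: top-down dream, then delta rule on recognition weights.
    h_dream = sample(sigmoid(b_hid))                     # fantasy hidden states
    d_dream = sample(sigmoid(h_dream @ W_gen + b_vis))   # fantasy data
    q_dream = sigmoid(d_dream @ W_rec + c_hid)
    W_rec += lr * np.outer(d_dream, h_dream - q_dream)
    c_hid += lr * (h_dream - q_dream)
```

Calling wake_sleep_step(d) repeatedly on binary data vectors trains both sets of weights; a deeper network repeats the same bottom-up and top-down passes layer by layer.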

Simple example
1. Training: for 4x4 images we use a belief network with one visible layer and two hidden layers of binary neurons:
- The visible layer has 16 neurons.
- The first hidden layer (8 neurons) decides which of the possible bars (positions/orientations) are present.
- The top hidden layer (1 neuron) decides between vertical and horizontal bars.
2. The network was trained on 2 x 10^6 random examples (a data-generator sketch follows).
Hinton et al., Science (1995)
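A hedged sketch of a generator for such training images; the exact bar probabilities used in Hinton et al. (1995) may differ, so p_bar = 0.3 is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bars(p_bar=0.3):
    """Return one 16-dim binary 4x4 'bars' image: either all-vertical or all-horizontal bars."""
    img = np.zeros((4, 4))
    horizontal = rng.random() < 0.5      # top-level choice: orientation
    for k in range(4):
        if rng.random() < p_bar:         # each bar present independently
            if horizontal:
                img[k, :] = 1.0
            else:
                img[:, k] = 1.0
    return img.reshape(-1)
```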

Wake-sleep algorithm on the 20 Newsgroups data set
- It contains about 20,000 articles.
- Many categories fall into overlapping clusters.
- We used the tiny version of this data set, with binary occurrences of 100 words across 16,242 postings, which can be divided into 4 classes: comp.*, sci.*, rec.*, talk.*

Training
- Visible layer: 100 visible units.
- First hidden layer: 50 hidden units.
- Second hidden layer: 20 hidden units at the top.
For training we used 60% of the data (9,745 training examples) and kept the remainder for testing the model (6,497 testing examples); a split sketch follows.
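A small sketch of that split; the random placeholder arrays stand in for the real binary 100-word occurrence matrix and its 4-way labels, which are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder data with the shapes described above.
X = rng.integers(0, 2, size=(16242, 100)).astype(float)
y = rng.integers(0, 4, size=16242)

perm = rng.permutation(len(X))
train_idx, test_idx = perm[:9745], perm[9745:]      # 9,745 training / 6,497 testing
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```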

Just for fun!
Performance of the model trained on comp.* (class 1): 'windows', 'win', 'video', 'card', 'dos', 'memory', 'program', 'ftp', 'help', 'system', ...
Performance of the model trained on talk.* (class 4): 'god', 'bible', 'jesus', 'question', 'christian', 'israel', 'religion', 'card', 'jews', 'email'; 'world', 'jews', 'war', 'religion', 'god', 'jesus', 'christian', 'israel', 'children', 'food', ...

Testing (classification)
Learn two separate wake-sleep models on the two classes 1 and 4, i.e. comp.* and talk.* respectively. Present the training examples from classes 1 and 4 to each of the two learned models and compute the free energy as a score under each model (a classification sketch follows).
[Figures: free energies of examples from class 1 and class 4 under the model trained on comp.*, and under the model trained on talk.*]
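A hedged sketch of this free-energy scoring rule, written for the single-hidden-layer model of the earlier sketch (the slides use two hidden layers); the dictionary keys and the one-sample Monte-Carlo estimate are my assumptions:

```python
import numpy as np

def free_energy(d, model, rng=np.random.default_rng(0)):
    """One-sample estimate of F(d) = E_Q[log Q(h|d) - log p(h, d)] for one model."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    eps = 1e-9
    q = sigmoid(d @ model["W_rec"] + model["c_hid"])         # Q(h|d;R)
    h = (rng.random(q.shape) < q).astype(float)              # sample from Q
    log_q = np.sum(h * np.log(q + eps) + (1 - h) * np.log(1 - q + eps))
    p_h = sigmoid(model["b_hid"])                            # generative prior p(h)
    p_v = sigmoid(h @ model["W_gen"] + model["b_vis"])       # p(d|h)
    log_p = (np.sum(h * np.log(p_h + eps) + (1 - h) * np.log(1 - p_h + eps))
             + np.sum(d * np.log(p_v + eps) + (1 - d) * np.log(1 - p_v + eps)))
    return log_q - log_p

def classify(d, model_comp, model_talk):
    # Lower free energy means a better fit under that class's generative model.
    return 1 if free_energy(d, model_comp) < free_energy(d, model_talk) else 4
```

The example is assigned to the class whose model gives it the lower free energy.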

Naïve Bayes classifier
Assumptions:
- P(c_j): the frequency of each class in the training examples (9,745).
- Conditional independence of the words given the class.
Use Bayes' rule and learn the model parameters by maximum likelihood (e.g. for classes 1 and 4).
Correct prediction on testing examples: present testing examples from classes 1 and 4 to the trained model and predict which class each belongs to. Result: 80% correct predictions.
Most probable words in each class:
- comp.*: 'windows', 'help', 'email', 'problem', 'system', 'computer', 'software', 'program', 'university', 'drive'
- talk.*: 'fact', 'god', 'government', 'question', 'world', 'christian', 'case', 'course', 'state', 'jews'
McCallum et al. (1998)
A code sketch of this baseline follows.
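A minimal sketch of this baseline on binary word vectors; the slides fit plain maximum likelihood, while the sketch adds a small smoothing term alpha to avoid zero probabilities (an assumption):

```python
import numpy as np

def fit_bernoulli_nb(X, y, n_classes=4, alpha=1.0):
    """Fit class priors P(c_j) and per-class word probabilities P(w_i = 1 | c_j)."""
    priors = np.array([(y == c).mean() for c in range(n_classes)])
    probs = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                      for c in range(n_classes)])
    return priors, probs

def predict_nb(x, priors, probs):
    # log P(c) + sum_i [ x_i log P(w_i|c) + (1 - x_i) log(1 - P(w_i|c)) ]
    log_post = (np.log(priors)
                + (x * np.log(probs) + (1 - x) * np.log(1 - probs)).sum(axis=1))
    return int(np.argmax(log_post))
```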

Conclusion
- Wake-sleep is an unsupervised learning algorithm.
- The higher hidden layers store the learned representations.
- Although we have used very crude approximations, it works very well on some realistic data.
- Wake-sleep tries to make the representation economical to describe (Shannon's coding theory).

Flaws of the wake-sleep algorithm
The sleep phase makes horrible assumptions (although it worked!):
- It minimizes KL(P||Q) rather than KL(Q||P).
- The recognition weights are trained not on data space but on dream space!
- Both phases rely on crude variational approximations.

Using complementary priors to eliminate explaining away
1. Because of explaining away, the hidden units become correlated in the posterior; complementary priors remove these correlations in the hidden layers.
Do complementary priors exist? That is a very hard question and not at all obvious. But it is possible to remove the effect of explaining away using the following architecture:
[Figure: an infinite directed stack of alternating layers ..., H1, V1, H0, V0 with tied weights G (top-down) and G^T (bottom-up).]
This is equivalent to a Restricted Boltzmann Machine, where inference is very easy because the conditional distributions are factorial (see the sketch below).
Hinton et al., Neural Computation (2006); Hinton et al., Science (2006)
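A sketch of why RBM inference is easy: with weight W_ji between visible unit v_j and hidden unit h_i and biases a_j, b_i (standard notation, assumed here), both conditionals factorize:

```latex
p(h_i = 1 \mid \mathbf{v}) = \sigma\Big(b_i + \sum_j v_j \, W_{ji}\Big),
\qquad
p(v_j = 1 \mid \mathbf{h}) = \sigma\Big(a_j + \sum_i h_i \, W_{ji}\Big)
```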