1
**Greedy Layer-Wise Training of Deep Networks**

Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle. NIPS 2007. Presented by Ahmed Hefny.

2
Story so far … Deep neural nets are more expressive: they can learn wider classes of functions with fewer hidden units (parameters) and training examples. Unfortunately, they are not easy to train with gradient-based methods starting from random initialization.

3
Story so far … Hinton et al. (2006) proposed greedy unsupervised layer-wise training. Greedy layer-wise: train layers sequentially, starting from the bottom (input) layer. Unsupervised: each layer learns a higher-level representation of the layer below; the training criterion does not depend on the labels. Each layer is trained as a Restricted Boltzmann Machine (the RBM is the building block of Deep Belief Networks). The trained model can be fine-tuned using a supervised method.

(Figure: a stack of three RBMs, labeled RBM 2, RBM 1, RBM 0.)

4
**This paper**

Extends the concept to: continuous variables, uncooperative input distributions, and simultaneous layer training.

Explores variations to better understand the training method: What if we use greedy supervised layer-wise training? What if we replace RBMs with auto-encoders?

(Figure: a stack of three RBMs, labeled RBM 2, RBM 1, RBM 0.)

5
**Outline**

Review: Restricted Boltzmann Machines; Deep Belief Networks; Greedy Layer-wise Training

Supervised Fine-tuning

Extensions: Continuous Inputs; Uncooperative Input Distributions; Simultaneous Training

Analysis Experiments

7
**Restricted Boltzmann Machine**

Undirected bipartite graphical model with connections between visible nodes $v$ and hidden nodes $h$. Corresponds to the joint probability distribution

$$P(v,h) = \frac{1}{Z}\exp(-\mathrm{energy}(v,h)) = \frac{1}{Z}\exp(h'Wv + b'v + c'h)$$

8
**Restricted Boltzmann Machine**

Undirected bipartite graphical model with connections between visible nodes $v$ and hidden nodes $h$. Corresponds to the joint probability distribution

$$P(v,h) = \frac{1}{Z}\exp(h'Wv + b'v + c'h)$$

Factorized conditionals:

$$Q(h \mid v) = \prod_j P(h_j \mid v), \qquad Q(h_j = 1 \mid v) = \mathrm{sigm}\Big(c_j + \sum_k W_{jk} v_k\Big)$$

$$P(v \mid h) = \prod_k P(v_k \mid h), \qquad P(v_k = 1 \mid h) = \mathrm{sigm}\Big(b_k + \sum_j W_{jk} h_j\Big)$$
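To make the factorized conditionals concrete, here is a minimal sketch for binary units. It uses plain Python lists so it needs no array library. The function names (`sigm`, `hidden_probs`, `visible_probs`, `sample_bernoulli`) are illustrative choices, not from the slides, and `W[j][k]` is assumed to connect hidden unit `j` to visible unit `k`, as in the index convention $W_{jk}$ above.

```python
import math
import random

def sigm(x):
    """Logistic sigmoid, arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def hidden_probs(v, W, c):
    """Q(h_j = 1 | v) = sigm(c_j + sum_k W[j][k] * v[k]) for every hidden unit j."""
    return [sigm(c[j] + sum(W[j][k] * v[k] for k in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    """P(v_k = 1 | h) = sigm(b_k + sum_j W[j][k] * h[j]) for every visible unit k."""
    return [sigm(b[k] + sum(W[j][k] * h[j] for j in range(len(h))))
            for k in range(len(b))]

def sample_bernoulli(probs, rng):
    """Draw each binary unit independently, which is valid because the
    conditionals factorize over units."""
    return [1 if rng.random() < p else 0 for p in probs]
```

Because each conditional is a product over units, a whole layer can be sampled in one pass given the other layer, which is what makes Gibbs sampling in an RBM cheap.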

9
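**Restricted Boltzmann Machine (Training)**

Given input vectors $V_0$, adjust $\theta = (W, b, c)$ to increase $\log P(V_0)$.

$$\log P(v_0) = \log \sum_h P(v_0, h) = \log \sum_h \exp(-\mathrm{energy}(v_0, h)) - \log \sum_{v,h} \exp(-\mathrm{energy}(v, h))$$

$$\frac{\partial \log P(v_0)}{\partial \theta} = -\sum_h Q(h \mid v_0)\, \frac{\partial\, \mathrm{energy}(v_0, h)}{\partial \theta} + \sum_{v,h} P(v, h)\, \frac{\partial\, \mathrm{energy}(v, h)}{\partial \theta}$$

$$\frac{\partial \log P(v_0)}{\partial \theta_k} = -\sum_h Q(h \mid v_0)\, \frac{\partial\, \mathrm{energy}(v_0, h)}{\partial \theta_k} + \sum_v P(v) \sum_{h_k} Q(h_k \mid v)\, \frac{\partial\, \mathrm{energy}(v, h)}{\partial \theta_k}$$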

11
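**Restricted Boltzmann Machine (Training)**

Given input vectors $V_0$, adjust $\theta = (W, b, c)$ to increase $\log P(V_0)$.

$$\log P(v_0) = \log \sum_h P(v_0, h) = \log \sum_h \exp(-\mathrm{energy}(v_0, h)) - \log \sum_{v,h} \exp(-\mathrm{energy}(v, h))$$

$$\frac{\partial \log P(v_0)}{\partial \theta} = -\sum_h Q(h \mid v_0)\, \frac{\partial\, \mathrm{energy}(v_0, h)}{\partial \theta} + \sum_{v,h} P(v, h)\, \frac{\partial\, \mathrm{energy}(v, h)}{\partial \theta}$$

$$\frac{\partial \log P(v_0)}{\partial \theta_k} = -\sum_h Q(h \mid v_0)\, \frac{\partial\, \mathrm{energy}(v_0, h)}{\partial \theta_k} + \sum_v P(v) \sum_{h_k} Q(h_k \mid v)\, \frac{\partial\, \mathrm{energy}(v, h)}{\partial \theta_k}$$

First term: sample $h_0$ given $v_0$. Second term: sample $v_1$ and $h_1$ using Gibbs sampling.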

12
**Restricted Boltzmann Machine (Training)**

Now we can perform stochastic gradient ascent on the data log-likelihood. Stop based on some criterion, e.g. the reconstruction error $-\log P(v_1 = x \mid v_0 = x)$.
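The training recipe on these slides (sample $h_0$ given $v_0$, take one Gibbs step to get $v_1$, then follow the approximate gradient) can be sketched as a single contrastive-divergence (CD-1) update for a binary RBM. This is an illustrative sketch under stated assumptions, not the presenter's code: the name `cd1_update`, the learning rate `lr`, and the use of the probabilities $Q(h \mid v_1)$ rather than a sample in the negative phase are choices made here.

```python
import math
import random

def sigm(x):
    """Logistic sigmoid, arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def cd1_update(v0, W, b, c, lr, rng):
    """One CD-1 step on a binary RBM. Updates W, b, c in place; returns v1.

    W[j][k] links hidden unit j to visible unit k; b are visible biases,
    c are hidden biases (same roles as in theta = (W, b, c)).
    """
    n_hid, n_vis = len(c), len(b)
    # Positive phase: Q(h_j = 1 | v0), then a sample h0 ~ Q(h | v0).
    q0 = [sigm(c[j] + sum(W[j][k] * v0[k] for k in range(n_vis)))
          for j in range(n_hid)]
    h0 = [1 if rng.random() < q else 0 for q in q0]
    # Negative phase: one Gibbs step gives v1 ~ P(v | h0), then Q(h | v1).
    p1 = [sigm(b[k] + sum(W[j][k] * h0[j] for j in range(n_hid)))
          for k in range(n_vis)]
    v1 = [1 if rng.random() < p else 0 for p in p1]
    q1 = [sigm(c[j] + sum(W[j][k] * v1[k] for k in range(n_vis)))
          for j in range(n_hid)]
    # Approximate gradient of log P(v0): data statistics minus one-step
    # reconstruction statistics.
    for j in range(n_hid):
        for k in range(n_vis):
            W[j][k] += lr * (h0[j] * v0[k] - q1[j] * v1[k])
        c[j] += lr * (h0[j] - q1[j])
    for k in range(n_vis):
        b[k] += lr * (v0[k] - v1[k])
    return v1
```

Calling `cd1_update` once per training vector, over several passes through the data, gives the stochastic procedure described above; the returned `v1` can be compared with `v0` to monitor a reconstruction-based stopping criterion.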

13
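**Deep Belief Network**

A DBN is a model of the form

$$P(x, g^1, \dots, g^l) = P(x \mid g^1)\, P(g^1 \mid g^2) \cdots P(g^{l-2} \mid g^{l-1})\, P(g^{l-1}, g^l)$$

where $x = g^0$ denotes the input variables and $g^i$ denote the hidden layers of causal variables.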

15
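**Deep Belief Network**

A DBN is a model of the form

$$P(x, g^1, \dots, g^l) = P(x \mid g^1)\, P(g^1 \mid g^2) \cdots P(g^{l-2} \mid g^{l-1})\, P(g^{l-1}, g^l)$$

where $x = g^0$ denotes the input variables and $g^i$ denote the hidden layers of causal variables.

$P(g^{l-1}, g^l)$ is an RBM. The other conditionals factorize:

$$P(g^i \mid g^{i+1}) = \prod_j P(g^i_j \mid g^{i+1}), \qquad P(g^i_j = 1 \mid g^{i+1}) = \mathrm{sigm}\Big(b^i_j + \sum_{k=1}^{n^{i+1}} W^i_{kj}\, g^{i+1}_k\Big)$$

RBM = infinitely deep network with tied weights.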

16
**Greedy layer-wise training**

$P(g^1 \mid g^0)$ is intractable; approximate it with $Q(g^1 \mid g^0)$. Treat the bottom two layers as an RBM. Fit the parameters using contrastive divergence.

17
**Greedy layer-wise training**

$P(g^1 \mid g^0)$ is intractable; approximate it with $Q(g^1 \mid g^0)$. Treat the bottom two layers as an RBM. Fit the parameters using contrastive divergence.

That gives an approximate $\hat{P}(g^1)$. We need to match it with $P(g^1)$.

18
**Greedy layer-wise training**

Approximate $P(g^l \mid g^{l-1}) \approx Q(g^l \mid g^{l-1})$. Treat layers $l-1, l$ as an RBM. Fit the parameters using contrastive divergence. Sample $g^{l-1}_0$ recursively using $Q(g^i \mid g^{i-1})$, starting from $g^0$.
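The layer-by-layer procedure can be sketched as a loop that trains one RBM at a time on inputs sampled upward through the layers already trained. This is a simplified illustration, not the paper's implementation: the names `greedy_pretrain` and `train_rbm`, the small uniform weight initialization, the fixed number of epochs as the stopping criterion, and drawing each layer's inputs once rather than resampling them on every pass are all assumptions made here.

```python
import math
import random

def sigm(x):
    """Logistic sigmoid, arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def up_probs(g, W, c):
    """Q(g^i_j = 1 | g^{i-1}) for every unit j of the layer above."""
    return [sigm(c[j] + sum(w * x for w, x in zip(W[j], g)))
            for j in range(len(c))]

def down_probs(h, W, b):
    """P(g^{i-1}_k = 1 | g^i) for every unit k of the layer below."""
    return [sigm(b[k] + sum(W[j][k] * h[j] for j in range(len(h))))
            for k in range(len(b))]

def sample(probs, rng):
    """Independent binary draws, one per unit."""
    return [1 if rng.random() < p else 0 for p in probs]

def train_rbm(inputs, n_hid, epochs, lr, rng):
    """Fit one binary RBM with CD-1; returns (W, b, c)."""
    n_vis = len(inputs[0])
    W = [[rng.uniform(-0.01, 0.01) for _ in range(n_vis)] for _ in range(n_hid)]
    b, c = [0.0] * n_vis, [0.0] * n_hid
    for _ in range(epochs):
        for v0 in inputs:
            h0 = sample(up_probs(v0, W, c), rng)
            v1 = sample(down_probs(h0, W, b), rng)
            q1 = up_probs(v1, W, c)
            for j in range(n_hid):
                for k in range(n_vis):
                    W[j][k] += lr * (h0[j] * v0[k] - q1[j] * v1[k])
                c[j] += lr * (h0[j] - q1[j])
            for k in range(n_vis):
                b[k] += lr * (v0[k] - v1[k])
    return W, b, c

def greedy_pretrain(data, hidden_sizes, epochs, lr, rng):
    """Train a stack of RBMs bottom-up, one layer at a time."""
    layers = []
    for n_hid in hidden_sizes:
        # Build this layer's inputs by sampling upward through the layers
        # trained so far, starting from the data g^0.
        inputs = []
        for x in data:
            g = list(x)
            for W, _, c in layers:
                g = sample(up_probs(g, W, c), rng)
            inputs.append(g)
        layers.append(train_rbm(inputs, n_hid, epochs, lr, rng))
    return layers
```

For example, `greedy_pretrain(data, [500, 500], epochs=10, lr=0.1, rng=random.Random(0))` would return two `(W, b, c)` triples, the second trained on samples produced by the first.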

20
**Supervised Fine Tuning (In this paper)**

Use greedy layer-wise training to initialize the weights of all layers except the output layer. For fine-tuning, use stochastic gradient descent on a cost function of the outputs, where the conditional expected values of the hidden nodes are approximated using mean-field:

$$E[g^i \mid g^{i-1} = \mu^{i-1}] = \mu^i = \mathrm{sigm}(b^i + W^i \mu^{i-1})$$
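The mean-field recursion amounts to a deterministic forward pass through sigmoid layers. The sketch below shows only that pass; the name `mean_field_forward` and the representation of each layer as a `(W, b)` pair with `W[j][k]` linking unit `k` below to unit `j` above are assumptions for illustration.

```python
import math

def sigm(x):
    """Logistic sigmoid, arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def mean_field_forward(x, layers):
    """Propagate mu^0 = x upward using mu^i = sigm(b^i + W^i mu^{i-1})."""
    mu = [float(t) for t in x]
    for W, b in layers:
        mu = [sigm(b[j] + sum(w * m for w, m in zip(W[j], mu)))
              for j in range(len(b))]
    return mu
```

The top-level `mu` would then feed an output layer, and the cost gradient can be backpropagated through these same deterministic activations.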

21
**Supervised Fine Tuning (In this paper)**

Use greedy layer-wise training to initialize weights of all layers except output layer. Use backpropagation

23
**Continuous Inputs**

Recall RBMs:

$$Q(h_j \mid v) \propto Q(h_j, v) \propto \exp(h_j w'v + b_j h_j) \propto \exp((w'v + b_j)\, h_j) = \exp(a(v)\, h_j)$$

If we restrict $h_j \in I = \{0, 1\}$, normalization gives a binomial with $p$ given by the sigmoid. If instead $I = [0, \infty)$, we get an exponential density. If $I$ is a closed interval, we get a truncated exponential.

24
**Continuous Inputs (Case for truncated exponential [0,1])**

Sampling: for the truncated exponential, the inverse CDF can be used,

$$h_j = F^{-1}(U) = \frac{\log\big(1 - U \times (1 - \exp(a(v)))\big)}{a(v)}$$

where $U$ is sampled uniformly from $[0,1]$.

Conditional expectation:

$$E[h_j \mid v] = \frac{1}{1 - \exp(-a(v))} - \frac{1}{a(v)}$$
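As a sanity check on the two formulas for the truncated exponential on $[0,1]$, the sketch below implements the inverse-CDF sampler and the conditional mean. The function names are illustrative, and the use of `math.expm1` / `math.log1p` plus the special case near $a = 0$ (where the density becomes uniform) are numerical-stability choices made here, not something stated on the slide.

```python
import math
import random

def sample_truncated_exp(a, u):
    """Inverse-CDF sample on [0, 1] for a density proportional to exp(a * h).

    Equivalent to log(1 - u * (1 - exp(a))) / a, written with expm1/log1p
    to stay accurate when a is small.
    """
    if abs(a) < 1e-6:
        return u  # the a -> 0 limit is the uniform distribution
    return math.log1p(u * math.expm1(a)) / a

def mean_truncated_exp(a):
    """Conditional mean 1 / (1 - exp(-a)) - 1 / a of the same density."""
    if abs(a) < 1e-6:
        return 0.5  # mean of the uniform limit
    return -1.0 / math.expm1(-a) - 1.0 / a

if __name__ == "__main__":
    rng = random.Random(0)
    a = 2.0
    draws = [sample_truncated_exp(a, rng.random()) for _ in range(20000)]
    # The empirical mean should be close to the closed-form mean.
    print(sum(draws) / len(draws), mean_truncated_exp(a))
```

Averaging many samples and comparing with `mean_truncated_exp` is a quick way to confirm that the sampler and the expectation describe the same distribution.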

25
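**Continuous Inputs**

To handle Gaussian inputs, we need to augment the energy function with a term quadratic in $h$. For a diagonal covariance matrix, the conditional log-density is, up to an additive constant,

$$\log P(h_j \mid v) = a(v)\, h_j - d_j^2 h_j^2$$

giving

$$E[h_j \mid v] = \frac{a(v)}{2 d_j^2}$$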
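A short derivation may help show where the Gaussian expectation comes from. It assumes the conditional log-density has the form $a(v)\, h_j - d_j^2 h_j^2$ up to a constant; the negative sign and the square on $d_j$ are assumptions chosen so that the result agrees with the stated expectation $a(v) / 2d^2$. Completing the square in $h_j$:

$$
\begin{aligned}
\exp\big(a\, h_j - d_j^2 h_j^2\big)
&= \exp\!\left(-d_j^2 \Big(h_j - \frac{a}{2 d_j^2}\Big)^{2} + \frac{a^2}{4 d_j^2}\right) \\
&\propto \exp\!\left(-\frac{\big(h_j - \mu_j\big)^2}{2 \sigma_j^2}\right),
\qquad \mu_j = \frac{a}{2 d_j^2}, \quad \sigma_j^2 = \frac{1}{2 d_j^2}
\end{aligned}
$$

Under that assumption, $h_j$ given $v$ is Gaussian with mean $a(v) / (2 d_j^2)$, and the quadratic coefficient $d_j^2$ controls the variance.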

27
**Continuous Hidden Nodes?**

Truncated exponential:

$$E[h_j \mid v] = \frac{1}{1 - \exp(-a(v))} - \frac{1}{a(v)}$$

Gaussian:

$$E[h_j \mid v] = \frac{a(v)}{2 d_j^2}$$

28
**Uncooperative Input Distributions**

Setting: $x \sim p(x)$, $y = f(x) + \text{noise}$. No particular relation between $p$ and $f$ (e.g. Gaussian and sine).

29
**Uncooperative Input Distributions**

Setting: $x \sim p(x)$, $y = f(x) + \text{noise}$. No particular relation between $p$ and $f$ (e.g. Gaussian and sine).

Problem: unsupervised pre-training may not help prediction.

31
**Uncooperative Input Distributions**

Proposal: mix unsupervised and supervised training for each layer.

(Figure labels: temporary output layer; stochastic gradient of input log-likelihood by contrastive divergence; stochastic gradient of prediction error; combined update.)

32
**Simultaneous Layer Training**

Greedy layer-wise training: for each layer, repeat until a criterion is met: sample the layer input (by recursively applying the trained layers to the data), then update the parameters using contrastive divergence.

33
**Simultaneous Layer Training**

Simultaneous training: repeat until a criterion is met: sample the input to all layers, then update the parameters of all layers using contrastive divergence.

Simpler: one criterion for the entire network. Takes more time.

35
**Experiments**

Does greedy unsupervised pre-training help? What if we replace RBMs with auto-encoders? What if we do greedy supervised pre-training? Does continuous variable modeling help? Does partially supervised pre-training help?

36
**Experiment 1**

Does greedy unsupervised pre-training help? What if we replace RBMs with auto-encoders? What if we do greedy supervised pre-training? Does continuous variable modeling help? Does partially supervised pre-training help?

37
Experiment 1

38
Experiment 1

39
**Experiment 1 (MSE and Training Errors)**

Partially Supervised < Unsupervised Pre-training < No Pre-training

Gaussian < Binomial

40
**Experiment 2**

Does greedy unsupervised pre-training help? What if we replace RBMs with auto-encoders? What if we do greedy supervised pre-training? Does continuous variable modeling help? Does partially supervised pre-training help?

41
**Experiment 2: Auto-Encoders**

Learn a compact representation to reconstruct $X$:

$$p(x) = \mathrm{sigm}\big(c + W\, \mathrm{sigm}(b + W'x)\big)$$

Trained to minimize the reconstruction cross-entropy

$$R = -\sum_i \big[\, x_i \log p_i(x) + (1 - x_i) \log(1 - p_i(x)) \,\big]$$
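The reconstruction and its cross-entropy can be written out directly. This is a minimal sketch with tied weights, assuming `W[i][j]` links visible unit `i` to hidden unit `j` (so that $W'x$ maps to the hidden layer), `b` is the hidden bias and `c` the visible bias; the function names are illustrative, and inputs are assumed to lie in $[0, 1]$.

```python
import math

def sigm(x):
    """Logistic sigmoid, arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def reconstruct(x, W, b, c):
    """p(x) = sigm(c + W sigm(b + W' x)) with tied weights W."""
    n_vis, n_hid = len(c), len(b)
    # Encoder: hidden code from W' x plus the hidden bias b.
    h = [sigm(b[j] + sum(W[i][j] * x[i] for i in range(n_vis)))
         for j in range(n_hid)]
    # Decoder: reconstruction probabilities from W h plus the visible bias c.
    return [sigm(c[i] + sum(W[i][j] * h[j] for j in range(n_hid)))
            for i in range(n_vis)]

def reconstruction_cross_entropy(x, p):
    """R = -sum_i [x_i log p_i + (1 - x_i) log(1 - p_i)]."""
    return -sum(xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
                for xi, pi in zip(x, p))
```

With all-zero parameters every reconstruction probability is 0.5, so $R$ equals $n \log 2$ for an $n$-dimensional input, which is a convenient baseline when checking an implementation.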

42
Experiment 2: layer widths of 500~1000; 20 nodes in the last two layers.

43
Experiment 2: Auto-encoder pre-training outperforms supervised pre-training but is still outperformed by RBM pre-training. Without pre-training, deep nets do not generalize well, but they can still fit the data if the output layers are wide enough.

44
**Conclusions**

Unsupervised pre-training is important for deep networks. Partial supervision further enhances results, especially when the input distribution and the function to be estimated are not closely related. Explicitly modeling continuous inputs is better than using binomial models.

45
Thanks
