
1 Deep Learning Qing LU, Siyuan CAO

2 Some Applications Deep Learning started to show its power in speech recognition in 2012 (Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition", 2012).

3 Contents
Deep Learning Basic Idea
A simple deep network: Deep Belief Network
Unsupervised Learning: Auto-encoder
A more popular network: Convolutional Neural Network

4 Deep Learning Basic Idea
Basic Architecture

5 Architecture Example: This is a deep network with
one 3-unit Input Layer
one 2-unit Output Layer
two 5-unit Hidden Layers
Note: a unit is also known as a "neuron".
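As a minimal sketch of this 3-5-5-2 structure (assuming a sigmoid activation and random toy parameters for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
layer_sizes = [3, 5, 5, 2]           # input, two hidden layers, output

# One weight matrix and bias vector per connection between consecutive layers.
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases  = [np.zeros(m) for m in layer_sizes[1:]]

x = np.array([0.2, 0.7, 0.1])        # a 3-unit input vector
for W, b in zip(weights, biases):
    x = sigmoid(W @ x + b)           # each layer re-represents the previous one

print(x)                             # 2-unit output
```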

6 Why such an architecture? Because we learn things in this way.
Most popular Deep Learning architectures are built from the Artificial Neural Network, which was popular from the 1950s until the 1990s. The Artificial Neural Network met resistance because of the computation time needed to train the network.

7 Deep Learning Basic Idea
Basic Architecture
How it works

8 How does it work?
[Figure: sensory signals such as "It is an apple!" and "Smells good!", plus other parameters, enter the network]

9 How does it work?
Chemical materials (e.g. hormones) are created; signals pass through the neurons.

10 How does it work?
Signals pass to the next level of neurons.

11 How does it work?
Signals pass to the brain and the mouth: "It is an apple!", "It is delicious!", the mouth waters.
The input and the output are visible, while the middle layers are unknown to us unless we are biologists. Therefore we call the middle layers "Hidden Layers".

12 Notice
Each layer is another kind of representation of the input (feature learning).
There is no feedback loop in this specific architecture. Feedback models exist, e.g. the Recurrent Neural Network, but their computational complexity increases; that is why they are less popular than non-feedback models.

13 How can we make a machine learn in this way?

14 Deep Belief Network Input Vector

15 Deep Belief Network
The unit $h_{1,j}$ is activated according to a certain probability $P(h_{1,j}=1 \mid \boldsymbol{v})$.
Note:
1. Each unit is a binary unit, i.e. its value is either 0 or 1.
2. The weights of the units and of the connections are real numbers.
Input Vector

16 Deep Belief Network
The unit $h_{2,j}$ is activated according to a certain probability $P(h_{2,j}=1 \mid \boldsymbol{h}_1)$.
Input Vector

17 Deep Belief Network Again, the output unit is activated according to a certain probability. Input Vector
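A minimal sketch of this layer-by-layer stochastic activation (assuming sigmoid conditionals and random toy parameters; the sigmoid form is derived later in the Activation Probability slide):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_layer(prev, W, b, rng):
    """Sample binary units: each unit is 1 with probability P(unit = 1 | prev)."""
    p = sigmoid(prev @ W + b)
    return (rng.random(p.shape) < p).astype(np.float64), p

rng = np.random.default_rng(0)
v = np.array([1.0, 0.0, 1.0, 1.0])                 # binary input vector
W1, b1 = rng.normal(0, 0.1, (4, 6)), np.zeros(6)   # layer 1 parameters (toy sizes)
W2, b2 = rng.normal(0, 0.1, (6, 3)), np.zeros(3)   # layer 2 parameters

h1, p1 = sample_layer(v,  W1, b1, rng)             # h_{1,j} ~ P(h_{1,j}=1 | v)
h2, p2 = sample_layer(h1, W2, b2, rng)             # h_{2,j} ~ P(h_{2,j}=1 | h_1)
print(p1, h1, p2, h2)
```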

18 Question How can we find this probability? (i.e. the training process)
First, a DBN is a stack of several simple networks, i.e. Restricted Boltzmann Machines (RBMs), invented under the name "Harmonium" by Paul Smolensky in 1986.

19 Architecture of DBN

24 Restricted Boltzmann Machine (RBM)
Introduction to the RBM:
The RBM is a variant of the Boltzmann Machine.
An RBM has only two layers, commonly referred to as the "visible" and the "hidden" units.
Connections only exist between a "visible" unit and a "hidden" unit. There is NO connection between two "visible" units or two "hidden" units.

25 Question How can we find this probability? (i.e. the training process)
First, a DBN is a stack of several simple networks, i.e. Restricted Boltzmann Machines (RBMs), invented under the name "Harmonium" by Paul Smolensky in 1986.
Second, energy is introduced into the model.

26 Energy Based Models
We associate a scalar energy $E(x,y)$ with each configuration. The probability distribution w.r.t. the energy is defined as
$p(x,y) = \frac{e^{-E(x,y)}}{Z}$, where $Z$ is the normalizing factor $Z = \sum_{x,y} e^{-E(x,y)}$.
We want the following property: lower energy indicates a more "desirable" configuration.
What is "desirable"? For a given data pair $(x,y)$, $x$ is the input and $y$ is the output.
If $x$ and $y$ are compatible, then the energy should be low.
If $x$ and $y$ are not compatible, then the energy should be high.
Source: LeCun et al., 2006 (lecun-06.pdf)
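A tiny worked example (with assumed energies for three hypothetical configurations) shows how lower energy translates into higher probability:

```python
import numpy as np

# Assumed energies for three hypothetical configurations (x, y):
# a compatible pair gets low energy, incompatible pairs get high energy.
E = np.array([0.5, 2.0, 4.0])

unnormalized = np.exp(-E)          # e^{-E(x, y)}
Z = unnormalized.sum()             # normalizing factor
p = unnormalized / Z               # p(x, y) = e^{-E(x, y)} / Z

print(p)          # ~[0.80, 0.18, 0.02]; lower energy -> higher probability
print(p.sum())    # 1.0
```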

27 Energy Function
For an RBM, each configuration contains two units and one connection. Therefore the energy of a single configuration is defined as:
$E(v_i, h_j) = -(a_i v_i + b_j h_j + v_i w_{ij} h_j)$
where
$v_i$ and $h_j$ are binary units (i.e. $v_i, h_j \in \{0,1\}$),
$a_i$ and $b_j$ are the biases of the units,
$w_{ij}$ is the weight of the connection.

28 Energy Function
We expand the energy function to the vectors $\boldsymbol{v}$ and $\boldsymbol{h}$:
$E(\boldsymbol{v},\boldsymbol{h}) = -\left(\sum_i a_i v_i + \sum_j b_j h_j + \sum_i \sum_j v_i w_{ij} h_j\right)$
Furthermore, entirely in vector form:
$E(\boldsymbol{v},\boldsymbol{h}) = -\left(\boldsymbol{a}^T \boldsymbol{v} + \boldsymbol{b}^T \boldsymbol{h} + \boldsymbol{v}^T \boldsymbol{W} \boldsymbol{h}\right)$
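A minimal sketch (with assumed toy values) that evaluates the element-wise form and the vector form and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.array([1.0, 0.0, 1.0])            # binary visible units
h = np.array([1.0, 1.0])                 # binary hidden units
a = rng.normal(size=3)                   # visible biases
b = rng.normal(size=2)                   # hidden biases
W = rng.normal(size=(3, 2))              # connection weights w_ij

# Element-wise form: -(sum_i a_i v_i + sum_j b_j h_j + sum_ij v_i w_ij h_j)
E_sum = -(np.sum(a * v) + np.sum(b * h) + sum(v[i] * W[i, j] * h[j]
                                              for i in range(3) for j in range(2)))

# Vector form: -(a^T v + b^T h + v^T W h)
E_vec = -(a @ v + b @ h + v @ W @ h)

print(np.isclose(E_sum, E_vec))          # True
```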

29 Something about Probabilities
The probability distribution is defined as
$P(\boldsymbol{v},\boldsymbol{h}) = \frac{e^{-E(\boldsymbol{v},\boldsymbol{h})}}{Z}$, where $Z$ is the normalizing factor $Z = \sum_{\boldsymbol{v},\boldsymbol{h}} e^{-E(\boldsymbol{v},\boldsymbol{h})}$.
The probability of a visible vector is
$P(\boldsymbol{v}) = \frac{1}{Z} \sum_{\boldsymbol{h}} e^{-E(\boldsymbol{v},\boldsymbol{h})}$
Because there is no connection between two visible units or two hidden units, the hidden units are conditionally independent of each other given the visible units, and likewise for the visible units. Therefore, the conditional probabilities factorize:
$P(\boldsymbol{v} \mid \boldsymbol{h}) = \prod_i P(v_i \mid \boldsymbol{h})$
$P(\boldsymbol{h} \mid \boldsymbol{v}) = \prod_j P(h_j \mid \boldsymbol{v})$

30 Activation Probability
The activation probability of unit $v_i$ or $h_j$ is:
$P(v_i = 1 \mid \boldsymbol{h}) = \frac{1}{1 + e^{-(a_i + \sum_j w_{ij} h_j)}}$
$P(h_j = 1 \mid \boldsymbol{v}) = \frac{1}{1 + e^{-(b_j + \sum_i w_{ij} v_i)}}$
How do we get the sigmoid function?
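A minimal sketch (assuming toy parameters) that computes these conditionals for a whole layer at once:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
a = rng.normal(size=4)            # visible biases a_i
b = rng.normal(size=3)            # hidden biases b_j
W = rng.normal(size=(4, 3))       # weights w_ij (visible x hidden)

v = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([0.0, 1.0, 1.0])

p_h_given_v = sigmoid(b + v @ W)      # P(h_j = 1 | v) for every hidden unit j
p_v_given_h = sigmoid(a + W @ h)      # P(v_i = 1 | h) for every visible unit i

print(p_h_given_v, p_v_given_h)
```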

31 Training Algorithm
For a given training data set $V$ (a matrix in which each row is a visible vector $\boldsymbol{v}$), the RBM is trained to find
$\operatorname{argmax}_{\theta} \prod_{\boldsymbol{v} \in V} P(\boldsymbol{v})$, or equivalently $\operatorname{argmax}_{\theta} \sum_{\boldsymbol{v} \in V} \log P(\boldsymbol{v})$, with $\theta = \{\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{W}\}$.
Maximum likelihood: $\prod_{\boldsymbol{v} \in V} P(\boldsymbol{v})$ is the likelihood function, as in the Machine Learning course.

32 Training Algorithm
$\frac{\partial \log P(\boldsymbol{v})}{\partial \theta} = \frac{\partial}{\partial \theta} \left( \log \sum_{\boldsymbol{h}} e^{-E(\boldsymbol{v},\boldsymbol{h})} - \log \sum_{\boldsymbol{v},\boldsymbol{h}} e^{-E(\boldsymbol{v},\boldsymbol{h})} \right)$
$\frac{\partial \log P(\boldsymbol{v})}{\partial \theta} = \sum_{\boldsymbol{h}} P(\boldsymbol{h} \mid \boldsymbol{v}) \frac{\partial}{\partial \theta} \left( -E(\boldsymbol{v},\boldsymbol{h}) \right) - \sum_{\boldsymbol{v},\boldsymbol{h}} P(\boldsymbol{v},\boldsymbol{h}) \frac{\partial}{\partial \theta} \left( -E(\boldsymbol{v},\boldsymbol{h}) \right)$

33 Training Algorithm
$\frac{\partial E(\boldsymbol{v},\boldsymbol{h})}{\partial w_{ij}} = -v_i h_j$, $\frac{\partial E(\boldsymbol{v},\boldsymbol{h})}{\partial a_i} = -v_i$, $\frac{\partial E(\boldsymbol{v},\boldsymbol{h})}{\partial b_j} = -h_j$
$\frac{\partial \log P(\boldsymbol{v})}{\partial w_{ij}} = \sum_{\boldsymbol{h}} P(\boldsymbol{h} \mid \boldsymbol{v})\, v_i h_j - \sum_{\boldsymbol{v},\boldsymbol{h}} P(\boldsymbol{v},\boldsymbol{h})\, v_i h_j$
$\frac{\partial \log P(\boldsymbol{v})}{\partial w_{ij}} = P(h_j = 1 \mid \boldsymbol{v})\, v_i - \sum_{\boldsymbol{v}} P(\boldsymbol{v})\, P(h_j = 1 \mid \boldsymbol{v})\, v_i$

34 Training Algorithm
$\frac{\partial \log P(\boldsymbol{v})}{\partial w_{ij}} = P(h_j = 1 \mid \boldsymbol{v})\, v_i - \sum_{\boldsymbol{v}} P(\boldsymbol{v})\, P(h_j = 1 \mid \boldsymbol{v})\, v_i$
$\frac{\partial \log P(\boldsymbol{v})}{\partial a_i} = v_i - \sum_{\boldsymbol{v}} P(\boldsymbol{v})\, v_i$
$\frac{\partial \log P(\boldsymbol{v})}{\partial b_j} = P(h_j = 1 \mid \boldsymbol{v}) - \sum_{\boldsymbol{v}} P(\boldsymbol{v})\, P(h_j = 1 \mid \boldsymbol{v})$

35 Training Algorithm
To compute $\theta = \{\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{W}\}$, the three gradient equations should be set to 0. The first terms of the gradients are easy to compute; however, the second terms are difficult to compute (they require many sampling steps, e.g. Gibbs sampling).
$\frac{\partial \log P(\boldsymbol{v})}{\partial w_{ij}} = P(h_j = 1 \mid \boldsymbol{v})\, v_i - \sum_{\boldsymbol{v}} P(\boldsymbol{v})\, P(h_j = 1 \mid \boldsymbol{v})\, v_i$

36 Training Algorithm
$\frac{\partial \log P(\boldsymbol{v})}{\partial w_{ij}} = P(h_j = 1 \mid \boldsymbol{v})\, v_i - \sum_{\boldsymbol{v}} P(\boldsymbol{v})\, P(h_j = 1 \mid \boldsymbol{v})\, v_i$
$\frac{\partial \log P(\boldsymbol{v})}{\partial a_i} = v_i - \sum_{\boldsymbol{v}} P(\boldsymbol{v})\, v_i$
$\frac{\partial \log P(\boldsymbol{v})}{\partial b_j} = P(h_j = 1 \mid \boldsymbol{v}) - \sum_{\boldsymbol{v}} P(\boldsymbol{v})\, P(h_j = 1 \mid \boldsymbol{v})$

37 Training Algorithm
To compute $\theta = \{\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{W}\}$, the three gradient equations should be set to 0. The first terms of the gradients are easy to compute; however, the second terms are difficult to compute (they require many sampling steps, e.g. Gibbs sampling).
However, it has been shown that estimates obtained after just a few sampling steps can be sufficient for model training.
Hinton, G.E.: Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation 14, 1771–1800 (2002)
Contrastive Divergence is commonly used to approximate the log-likelihood gradient when training an RBM.

38 Contrastive Divergence (CD or CD-k)
Usually, only 1 step (CD-1) is enough.
Source: An Introduction to Restricted Boltzmann Machines, by Asja Fischer and Christian Igel
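A minimal CD-1 sketch for one RBM, following the gradient formulas above (the learning rate, initialization, and toy data are assumptions; this is not the code from the cited sources):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, a, b, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-1 step on a batch V (rows are binary visible vectors)."""
    # Positive phase: P(h = 1 | data)
    ph0 = sigmoid(V @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct v, then recompute P(h = 1 | reconstruction)
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # Approximate gradients: <v h>_data - <v h>_reconstruction
    n = V.shape[0]
    W += lr * (V.T @ ph0 - v1.T @ ph1) / n
    a += lr * (V - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

rng = np.random.default_rng(1)
V = (rng.random((16, 6)) < 0.5).astype(float)      # toy batch of binary data
W = rng.normal(0, 0.01, (6, 4))
a, b = np.zeros(6), np.zeros(4)
for _ in range(100):
    W, a, b = cd1_update(V, W, a, b)
```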

39 A Short Conclusion
Until now, we have only dealt with ONE RBM.
An RBM has only two layers, so it is not exactly "deep".
We can use Contrastive Divergence to train an RBM.

40 Architecture of DBN
Until now, we have only dealt with ONE RBM.
Next, we apply the same procedure to the remaining RBMs to compute their $\boldsymbol{a}$, $\boldsymbol{b}$, and $\boldsymbol{W}$.

41 Architecture of DBN
Train the remaining RBMs with the same approach.
Note: up to this point we have only obtained a locally optimal configuration.

42 Backpropagation (Fine-Tuning)
The backpropagation algorithm is used to fine-tune the network and to get closer to the global optimum. Because backpropagation needs labeled data, this makes the DBN a supervised model.

43 Demo (MNIST Classifier)
Code provided by Ruslan Salakhutdinov and Geoff Hinton

44 Conclusion to DBN Simple architecture, easy to scale
There exists an efficient algorithm to pre-train the network.
My test:
Layer 1: 500, Layer 2: 500, Layer 3: 2000, BP: 200, time: 36 hours
Layer 1: 500, Layer 2: 500, Layer 3: 2000, BP: 50, time: 6.5 hours
Layer 1: 200, Layer 2: 200, Layer 3: 1000, BP: 50, time: 2 hours
Limits:
Labeled data are needed.
Computation time is still an issue (imagine a full-color picture taken with a camera: how many parameters would need to be updated in such a network?).

45 Auto-encoder To train a model with a classifier, labeled data are needed. However, in most cases only unlabeled data are available, and labeling data is very expensive. Therefore, we need an unsupervised way to train the model.

46 Auto-encoder basic idea
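A minimal sketch of the basic idea (assuming a single sigmoid encoder/decoder pair trained with squared reconstruction error on toy data; this is not the demo code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = (rng.random((200, 8)) < 0.5).astype(float)   # toy unlabeled data

n_in, n_code = 8, 3
W1 = rng.normal(0, 0.1, (n_in, n_code)); b1 = np.zeros(n_code)   # encoder
W2 = rng.normal(0, 0.1, (n_code, n_in)); b2 = np.zeros(n_in)     # decoder
lr = 0.5

for _ in range(2000):
    code  = sigmoid(X @ W1 + b1)           # encode
    recon = sigmoid(code @ W2 + b2)        # decode
    err = recon - X                        # target is the input itself
    # Backpropagate the squared reconstruction error
    d2 = err * recon * (1 - recon)
    d1 = (d2 @ W2.T) * code * (1 - code)
    W2 -= lr * code.T @ d2 / len(X); b2 -= lr * d2.mean(axis=0)
    W1 -= lr * X.T @ d1 / len(X);    b1 -= lr * d1.mean(axis=0)

print(((recon - X) ** 2).mean())           # reconstruction error decreases
```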

47 Auto-encoder with DBN

48 Demo (MNIST Autoencoder)
Code provided by Ruslan Salakhutdinov and Geoff Hinton

49 Convolutional Neural Network
Very popular in image recognition.
Special architecture that reduces the data size significantly (which means the number of parameters of the network is also reduced).
However, it still takes a long time to train the network because of the training algorithm.

50 Question Given an image, how would you reduce the data size?

51 Convolutional Neural Network Architecture
Convolution layer
Subsampling layer
Full connection layer
LeNet-5 architecture. Source: Gradient-Based Learning Applied to Document Recognition, by Yann LeCun et al., 1998

52 Basic Idea of CNN
Feedforward pass: to compute the error.
Backpropagation pass: to update the weights and biases.

53 Basic Idea of Feedforward Pass
Convolution layer: use several filters to enhance features from the input (or the previous layer).
Subsampling layer: because an image has local spatial correlation, down-sampling can reduce the data size while still keeping the valuable information (e.g. imagine that you can still recognize a picture from its thumbnail).
Full connection layer: can be regarded as a classifier.
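A minimal sketch of these three layer types (assuming a single 3x3 filter, 2x2 average pooling, and a sigmoid output; illustrative toy sizes):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_valid(img, kernel):
    """Plain 2D 'valid' convolution (no padding, stride 1)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    flipped = kernel[::-1, ::-1]                    # convolution flips the kernel
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(img[r:r + kh, c:c + kw] * flipped)
    return out

def avg_pool(x, size=2):
    """Non-overlapping average pooling (down-sampling)."""
    H, W = x.shape
    return x[:H - H % size, :W - W % size].reshape(H // size, size, W // size, size).mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((10, 10))                        # toy gray-scale image
kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])   # edge-like filter

feature_map = conv2d_valid(image, kernel)           # convolution layer: 8x8
pooled = avg_pool(feature_map)                      # subsampling layer: 4x4
fc_W = rng.normal(0, 0.1, (pooled.size, 2))         # full connection layer (2 classes)
scores = sigmoid(pooled.ravel() @ fc_W)
print(feature_map.shape, pooled.shape, scores)
```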

54 A good video about Feedforward Pass
Convolution layer part (from 5:42 to 7:16): what 2D matrix convolution is, and what effects 2D matrix convolution can achieve.
Subsampling layer part (at 10:10): how to subsample.

55 Question How many parameters are needed for a CNN, compared with a DBN? (Input data are 32x32 digits, or full-color pictures.)

56 Convolutional Neural Network Architecture
Task to train the CNN: given labeled data, obtain suitable weights and biases of the matrices for the convolution layers and subsampling layers.
LeNet-5 architecture. Source: Gradient-Based Learning Applied to Document Recognition, by Yann LeCun et al., 1998

57 Model of CNN The following slides are based on "Notes on Convolutional Neural Networks" by Jake Bouvrie, 2006.

58 Model of CNN
For a multiclass problem with $c$ classes and $N$ training examples, the error is given by:
$E^N = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{c} (t_{n,k} - y_{n,k})^2$
$E^N$: error over the whole set of training examples
$t_{n,k}$: label (target) of the $n$-th input example w.r.t. the $k$-th class
$y_{n,k}$: output of the network for the $n$-th input example w.r.t. the $k$-th class

59 Model of CNN
For a multiclass problem with $c$ classes and $N$ training examples, the error of the $n$-th example is given by:
$E^n = \frac{1}{2} \sum_{k=1}^{c} (t_{n,k} - y_{n,k})^2$, or equivalently $E^n = \frac{1}{2} \left\| \boldsymbol{t}_n - \boldsymbol{y}_n \right\|^2$

60 Model of General Feedforward Pass
The output of a certain layer is:
$\boldsymbol{x}^l = f(\boldsymbol{u}^l)$, with $\boldsymbol{u}^l = \boldsymbol{W}^l \boldsymbol{x}^{l-1} + \boldsymbol{b}^l$
$l$: current layer. Layer 1 is the input data layer and layer $L$ is the output layer of the CNN; therefore $l$ runs from 2 to $L$.
$\boldsymbol{x}^l$: output of layer $l$
$\boldsymbol{W}^l$ and $\boldsymbol{b}^l$: weights and biases of layer $l$
$f(\cdot)$: activation function, commonly the sigmoid or the hyperbolic tangent function.
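A direct transcription of this recursion (assuming a sigmoid $f$ and random toy parameters):

```python
import numpy as np

def f(u):
    return 1.0 / (1.0 + np.exp(-u))       # sigmoid activation

def feedforward(x1, weights, biases):
    """x1 is the input layer (layer 1); returns [x^1, x^2, ..., x^L]."""
    xs = [x1]
    for W_l, b_l in zip(weights, biases):
        u_l = W_l @ xs[-1] + b_l           # u^l = W^l x^{l-1} + b^l
        xs.append(f(u_l))                  # x^l = f(u^l)
    return xs

rng = np.random.default_rng(0)
sizes = [4, 6, 3]                          # toy layer sizes
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [np.zeros(m) for m in sizes[1:]]
outputs = feedforward(rng.random(4), weights, biases)
print([o.shape for o in outputs])
```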

61 Model of General Backpropagation Pass
The backpropagation algorithm is used to update the weights and biases. $\delta$ is regarded as the bias sensitivity, which is propagated back through the network:
$\delta \equiv \frac{\partial E}{\partial b} = \frac{\partial E}{\partial u} \frac{\partial u}{\partial b}$
Since $\frac{\partial u}{\partial b} = 1$, $\delta$ becomes
$\delta \equiv \frac{\partial E}{\partial b} = \frac{\partial E}{\partial u}$

62 Model of General Backpropagation Pass
$\delta$ for layer $l$: $\boldsymbol{\delta}^l = (\boldsymbol{W}^{l+1})^T \boldsymbol{\delta}^{l+1} \circ f'(\boldsymbol{u}^l)$
For layer $L$: $\boldsymbol{\delta}^L = f'(\boldsymbol{u}^L) \circ (\boldsymbol{y}_n - \boldsymbol{t}_n)$
$\circ$: element-wise multiplication
Final equation to update the bias: $\Delta \boldsymbol{b}^l = -\eta \frac{\partial E}{\partial \boldsymbol{b}^l} = -\eta \boldsymbol{\delta}^l$
$\eta$: learning rate

63 Model of General Backpropagation Pass
To update the weights, with a process analogous to the bias update:
$\frac{\partial E}{\partial \boldsymbol{W}^l} = \boldsymbol{x}^{l-1} (\boldsymbol{\delta}^l)^T$
$\Delta \boldsymbol{W}^l = -\eta \frac{\partial E}{\partial \boldsymbol{W}^l}$
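Putting the last two slides together, a minimal sketch of one gradient step for a fully connected network (assuming a sigmoid activation, toy sizes, and an assumed learning rate):

```python
import numpy as np

def f(u):  return 1.0 / (1.0 + np.exp(-u))    # sigmoid
def fp(u): return f(u) * (1.0 - f(u))         # its derivative f'(u)

rng = np.random.default_rng(0)
sizes = [4, 5, 3]                              # layers 1..L (toy)
Ws = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
x1, t = rng.random(4), np.array([1.0, 0.0, 0.0])   # one example and its target
eta = 0.5                                      # learning rate

# Feedforward: keep u^l and x^l for every layer
us, xs = [], [x1]
for W, b in zip(Ws, bs):
    us.append(W @ xs[-1] + b)
    xs.append(f(us[-1]))

# Backpropagation: delta^L = f'(u^L) o (y - t), delta^l = (W^{l+1})^T delta^{l+1} o f'(u^l)
deltas = [fp(us[-1]) * (xs[-1] - t)]
for W_next, u in zip(reversed(Ws[1:]), reversed(us[:-1])):
    deltas.insert(0, (W_next.T @ deltas[0]) * fp(u))

# Updates: Delta b^l = -eta delta^l; with W stored as (out x in), the weight gradient
# is outer(delta^l, x^{l-1}), i.e. the transpose of the x^{l-1} (delta^l)^T form above.
for i in range(len(Ws)):
    bs[i] -= eta * deltas[i]
    Ws[i] -= eta * np.outer(deltas[i], xs[i])
```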

64 Detailed Form for the Convolution Layer
Feedforward pass:
$\boldsymbol{x}_j^l = f\left( \sum_{i \in M_j} \boldsymbol{x}_i^{l-1} * \boldsymbol{k}_{ij}^l + \boldsymbol{b}_j^l \right)$
$\boldsymbol{k}_{ij}^l$: weight (kernel) matrix of layer $l$ between input feature map $i$ and output feature map $j$
$M_j$: a selection of input maps
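A minimal sketch of this multi-map feedforward pass (assuming SciPy's 'valid' 2D convolution, a sigmoid $f$, and $M_j$ taken as all input maps):

```python
import numpy as np
from scipy.signal import convolve2d    # assumes SciPy is available

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in_maps, n_out_maps = 2, 3
x_prev = [rng.random((8, 8)) for _ in range(n_in_maps)]            # x^{l-1}_i
kernels = {(i, j): rng.normal(0, 0.1, (3, 3))                      # k^l_{ij}
           for i in range(n_in_maps) for j in range(n_out_maps)}
b = np.zeros(n_out_maps)                                           # b^l_j

# x^l_j = f( sum_{i in M_j} x^{l-1}_i * k^l_{ij} + b^l_j ), with M_j = all input maps here
x_l = []
for j in range(n_out_maps):
    acc = sum(convolve2d(x_prev[i], kernels[(i, j)], mode='valid')
              for i in range(n_in_maps))
    x_l.append(sigmoid(acc + b[j]))

print([m.shape for m in x_l])          # three 6x6 feature maps
```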

65 Detailed Form for the Convolution Layer
Backpropagation pass:
$\boldsymbol{\delta}_j^l = \beta_j^{l+1} \left( f'(\boldsymbol{u}_j^l) \circ \mathrm{up}(\boldsymbol{\delta}_j^{l+1}) \right)$
$\beta_j^{l+1}$: see the next slide
$\mathrm{up}(\cdot)$: up-sampling method, e.g. the Kronecker product
$f'(\cdot)$: derivative of the activation function
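When the following subsampling layer works on non-overlapping blocks, $\mathrm{up}(\cdot)$ can be implemented with a Kronecker product; a minimal sketch (assuming a sigmoid $f$ and toy values):

```python
import numpy as np

def up(delta, s=2):
    """Up-sample by repeating each entry over an s x s block (Kronecker product)."""
    return np.kron(delta, np.ones((s, s)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
u_l = rng.normal(size=(4, 4))            # pre-activation of a conv feature map u^l_j
delta_next = rng.normal(size=(2, 2))     # delta^{l+1}_j from the subsampling layer
beta_next = 0.5                          # the subsampling layer's scalar weight beta^{l+1}_j

f_of_u = sigmoid(u_l)
f_prime = f_of_u * (1.0 - f_of_u)        # f'(u^l_j) for the sigmoid

# delta^l_j = beta^{l+1}_j * ( f'(u^l_j) o up(delta^{l+1}_j) )
delta_l = beta_next * (f_prime * up(delta_next))
print(delta_l.shape)                     # (4, 4)
```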

66 Detailed Form for the Subsampling Layer
Feedforward pass:
$\boldsymbol{x}_j^l = f\left( \beta_j^l \, \mathrm{down}(\boldsymbol{x}_j^{l-1}) + \boldsymbol{b}_j^l \right)$
$\beta_j^l$: nothing special, just a "weight"; here it is only a scalar, not a matrix.
$\mathrm{down}(\cdot)$: down-sampling method, e.g. average, maximum, etc.
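A minimal sketch of this subsampling feedforward pass (assuming 2x2 average down-sampling, a sigmoid $f$, and toy values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def down(x, s=2):
    """Average over non-overlapping s x s blocks (down-sampling)."""
    H, W = x.shape
    return x.reshape(H // s, s, W // s, s).mean(axis=(1, 3))

rng = np.random.default_rng(0)
x_prev = rng.random((6, 6))              # x^{l-1}_j: one feature map from the conv layer
beta = 1.5                               # scalar weight beta^l_j
b = 0.1                                  # bias b^l_j

# x^l_j = f( beta^l_j * down(x^{l-1}_j) + b^l_j )
x_l = sigmoid(beta * down(x_prev) + b)
print(x_l.shape)                          # (3, 3)
```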

67 Detailed Form for the Subsampling Layer
Backpropagation pass:
$\boldsymbol{\delta}_j^l = (\boldsymbol{k}_j^{l+1})^T \boldsymbol{\delta}_j^{l+1} \circ f'(\boldsymbol{u}_j^l)$

68 A Short Conclusion (Feedforward)
Target: compute the error.
Convolution layers: convolution is used instead of matrix multiplication.
Subsampling layers: different down-sampling methods can be used.

69 A Short Conclusion (Backpropagation)
Target: back-propagate the error and update the weights and biases.
Convolution layers: up-sampling is needed.
Subsampling layers: a shortcut method exists in MATLAB (more details are in the paper by Jake Bouvrie).
More detailed backpropagation steps are introduced in "Notes on Convolutional Neural Networks" by Jake Bouvrie.

70 Conclusion to CNN
Significantly reduces the data size and the number of parameters.
The training algorithm is not efficient (currently only the BP algorithm).
There is research on combining CNNs and DBNs.
Personal view: it is a little easier to understand what is happening in the different layers than in a DBN, although it is still hard for us to understand why certain filters are chosen after training.

71 Conclusion to Deep Learning
Feature learning.
Hierarchical architecture (simulates brain activity).
There is no theoretical proof of what the optimal parameters are (number of layers, number of units, etc.).
Good performance in image and speech recognition, although it is hard for us to understand what is happening inside the network.
Computation time is still an issue.

