Deep Learning Qing LU, Siyuan CAO
Some Applications
http://cs.stanford.edu/people/karpathy/deepimagesent/
http://deeplearning.cs.toronto.edu/
http://yann.lecun.com/exdb/lenet/
Deep Learning has shown its power in speech recognition since 2010 (Hinton et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition”, 2012)
9/22/2018 Deep Learning
Contents
Deep Learning Basic Idea
A simple deep network: Deep Belief Network
Unsupervised Learning: Auto-encoder
A more popular network: Convolutional Neural Network
Deep Learning Basic Idea
Basic Architecture
Architecture
Example: This is a deep network with
one 3-unit Input Layer
two 5-unit Hidden Layers
one 2-unit Output Layer
Note: A unit is also known as a “neuron”.
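The 3-5-5-2 example can be sketched in a few lines of NumPy. This is only an illustration: the sigmoid activation, the random initial weights, and the helper names (`init_params`, `forward`, `count_params`) are assumptions and are not taken from the slides.

```python
import numpy as np

# Layer sizes from the slide: 3 inputs, two hidden layers of 5 units, 2 outputs.
LAYER_SIZES = [3, 5, 5, 2]

def init_params(sizes, seed=0):
    """Create one (weights, biases) pair per pair of adjacent layers (assumed init)."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    """Propagate an input vector through every layer; return all activations."""
    activations = [x]
    for W, b in params:
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

def count_params(params):
    """Total number of weights and biases in the network."""
    return sum(W.size + b.size for W, b in params)

params = init_params(LAYER_SIZES)
activations = forward(params, np.array([0.2, -0.4, 1.0]))
print([a.shape for a in activations])  # [(3,), (5,), (5,), (2,)]
print(count_params(params))            # 62 = (3*5+5) + (5*5+5) + (5*2+2)
```

Even this tiny network already has 62 adjustable parameters, which hints at why computation time becomes an issue for larger inputs.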
Why such an architecture?
Because we learn things in this way. http://en.wikipedia.org/wiki/Deep_learning#Deep_learning_in_the_human_brain
Most popular Deep Learning architectures are built from the Artificial Neural Network, which was quite popular from the 1950s until the 1990s.
The Artificial Neural Network met resistance because of the computation time needed to train the network.
Deep Learning Basic Idea
Basic Architecture
How it works
How does it work?
Inputs: “It is an apple!”, “Smells good!”, and other parameters.
1. Chemical materials (e.g. hormones) are created and signals pass through the neurons.
2. Signals pass to the next level of neurons.
3. Signals pass to the brain and the mouth: “It is an apple!”, “It is delicious!”, “Smells good!”, and the mouth waters.
The input and the output are visible, while the middle layers are unknown to us unless we are biologists. Therefore we call the middle layers “Hidden Layers”.
Notice
Each layer is another kind of representation of the input (Feature Learning).
There is no feedback loop in this specific architecture.
Feedback models exist, e.g. the Recurrent Neural Network, but their computational complexity is higher. That is why they are less popular than non-feedback models.
How can we make a machine learn in this way?
Deep Belief Network
Starting from the input vector $\mathbf{v}$, the unit $h_{1,j}$ is activated according to a certain probability $P(h_{1,j} = 1 \mid \mathbf{v})$.
The unit $h_{2,j}$ is activated according to a certain probability $P(h_{2,j} = 1 \mid \mathbf{h}_1)$.
Again, the output unit is activated according to a certain probability.
Note:
1. Each unit is a binary unit, i.e. its value is either 0 or 1.
2. The weights of units and connections are real numbers.
Question
How can we find this probability? (i.e. the training process)
First, a DBN is built by stacking several simple networks, namely Restricted Boltzmann Machines (RBMs), invented under the name “Harmonium” by Paul Smolensky in 1986.
Architecture of DBN
Restricted Boltzmann Machine (RBM)
Introduction to the RBM:
The RBM is a variant of the Boltzmann Machine.
An RBM has only two layers, commonly referred to as the “visible” and “hidden” units.
Connections exist only between a “visible” unit and a “hidden” unit.
There is NO connection between two “visible” units or between two “hidden” units.
Question
How can we find this probability? (i.e. the training process)
First, a DBN is built by stacking several Restricted Boltzmann Machines (RBMs).
Second, energy is introduced into the model.
Energy Based Models
We associate a scalar energy $E(x,y)$ with each configuration.
The probability distribution with respect to the energy is defined as
$$p(x,y) = \frac{e^{-E(x,y)}}{Z}$$
where $Z$ is the normalizing factor
$$Z = \sum_{x,y} e^{-E(x,y)}$$
We want the following property: lower energy indicates a more “desirable” configuration.
What is “desirable”? For a given data pair $(x,y)$, $x$ is the input and $y$ is the output.
If $x$ and $y$ are compatible, then the energy should be low.
If $x$ and $y$ are not compatible, then the energy should be high.
Source: Lecun-06.pdf
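As a small numerical illustration of the definition above, the following sketch turns a table of energies into probabilities. The energy values are made up for the example and the function name `boltzmann` is hypothetical.

```python
import numpy as np

# Hypothetical energy table E[x, y] for 2 inputs x and 3 outputs y.
# Low energy marks a compatible (x, y) pair.
E = np.array([[0.5, 2.0, 3.0],
              [3.0, 0.2, 2.5]])

def boltzmann(energy):
    """p(x, y) = exp(-E(x, y)) / Z, with Z summed over all configurations."""
    unnormalized = np.exp(-energy)
    Z = unnormalized.sum()
    return unnormalized / Z

p = boltzmann(E)
print(p.round(3))
print(p.sum())  # the probabilities sum to 1 (up to rounding)
```

The configuration with the lowest energy receives the highest probability, which is the “desirable” behaviour described on the slide.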
Energy Function
In an RBM, each elementary configuration contains two units and one connection. Therefore the energy function is defined as follows:
$$E(v_i, h_j) = -\left(a_i v_i + b_j h_j + v_i w_{ij} h_j\right)$$
where
$v_i$ and $h_j$ are binary units (i.e. $v_i, h_j \in \{0,1\}$)
$a_i$ and $b_j$ are the biases of the units
$w_{ij}$ is the weight of the connection
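Energy Function
We expand the energy function to the vectors $\mathbf{v}$ and $\mathbf{h}$:
$$E(\mathbf{v},\mathbf{h}) = -\left(\sum_i a_i v_i + \sum_j b_j h_j + \sum_i \sum_j v_i w_{ij} h_j\right)$$
Furthermore, entirely in vector form:
$$E(\mathbf{v},\mathbf{h}) = -\left(\mathbf{a}^T \mathbf{v} + \mathbf{b}^T \mathbf{h} + \mathbf{v}^T \mathbf{W} \mathbf{h}\right)$$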
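The two forms of the energy function can be checked against each other numerically. The sketch below assumes that `W` has one row per visible unit and one column per hidden unit; the function names are hypothetical.

```python
import numpy as np

def energy_sum(v, h, a, b, W):
    """Element-wise form: -(sum_i a_i v_i + sum_j b_j h_j + sum_ij v_i w_ij h_j)."""
    total = 0.0
    for i in range(len(v)):
        total += a[i] * v[i]
    for j in range(len(h)):
        total += b[j] * h[j]
    for i in range(len(v)):
        for j in range(len(h)):
            total += v[i] * W[i, j] * h[j]
    return -total

def energy_vec(v, h, a, b, W):
    """Vector form: -(a^T v + b^T h + v^T W h)."""
    return -(a @ v + b @ h + v @ W @ h)

rng = np.random.default_rng(0)
a, b, W = rng.normal(size=3), rng.normal(size=2), rng.normal(size=(3, 2))
v, h = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0])
print(energy_sum(v, h, a, b, W), energy_vec(v, h, a, b, W))  # the two values agree
```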
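Something about Probabilities
The probability distribution is defined as
$$P(\mathbf{v},\mathbf{h}) = \frac{e^{-E(\mathbf{v},\mathbf{h})}}{Z}$$
where $Z$ is the normalizing factor
$$Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}$$
The probability of a visible vector is
$$P(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}$$
Because there is no connection between two visible units or between two hidden units, the hidden units are independent of each other given the visible units, and the visible units are independent of each other given the hidden units. Therefore, we have the following conditional probabilities:
$$P(\mathbf{v} \mid \mathbf{h}) = \prod_i P(v_i \mid \mathbf{h})$$
$$P(\mathbf{h} \mid \mathbf{v}) = \prod_j P(h_j \mid \mathbf{v})$$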
Activation Probability
The activation probability of unit $v_i$ or $h_j$ is:
$$P(v_i = 1 \mid \mathbf{h}) = \frac{1}{1 + e^{-\left(a_i + \sum_j w_{ij} h_j\right)}}$$
$$P(h_j = 1 \mid \mathbf{v}) = \frac{1}{1 + e^{-\left(b_j + \sum_i w_{ij} v_i\right)}}$$
How do we get the sigmoid function?
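One way to convince ourselves that the sigmoid form follows from the energy function is to compare it with a brute-force computation on a tiny RBM. The sketch below enumerates every hidden vector, normalizes $e^{-E(\mathbf{v},\mathbf{h})}$ for a fixed $\mathbf{v}$, and compares the result with the closed form. The sizes, the random parameters, and the function names are assumptions made for the example.

```python
import itertools
import numpy as np

def energy(v, h, a, b, W):
    return -(a @ v + b @ h + v @ W @ h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_v_sigmoid(v, b, W):
    """Closed form: P(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i)."""
    return sigmoid(b + v @ W)

def p_h_given_v_bruteforce(v, a, b, W):
    """Enumerate every hidden vector and normalize exp(-E(v, h)) directly."""
    hs = [np.array(bits, dtype=float)
          for bits in itertools.product([0, 1], repeat=len(b))]
    weights = np.array([np.exp(-energy(v, h, a, b, W)) for h in hs])
    weights /= weights.sum()  # P(h | v) for each enumerated h
    # For binary units, the expected value of h_j equals P(h_j = 1 | v).
    return sum(w * h for w, h in zip(weights, hs))

rng = np.random.default_rng(0)
a, b = rng.normal(size=3), rng.normal(size=2)
W = rng.normal(size=(3, 2))
v = np.array([1.0, 0.0, 1.0])
print(p_h_given_v_sigmoid(v, b, W))
print(p_h_given_v_bruteforce(v, a, b, W))  # matches the line above
```

The term $\mathbf{a}^T \mathbf{v}$ is the same for every $\mathbf{h}$ and cancels in the normalization, which is why only $b_j$ and the weights appear in the sigmoid.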
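Training Algorithm
For a given training data set $V$ (a matrix in which each row is a visible vector $\mathbf{v}$), the RBM is trained to find
$$\arg\max_{\theta} \prod_{\mathbf{v} \in V} P(\mathbf{v})$$
or equivalently
$$\arg\max_{\theta} \sum_{\mathbf{v} \in V} \log P(\mathbf{v})$$
with $\theta = \{\mathbf{a}, \mathbf{b}, \mathbf{W}\}$.
This is Maximum Likelihood: $\prod_{\mathbf{v} \in V} P(\mathbf{v})$ is the likelihood function, as in the Machine Learning course.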
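Training Algorithm
$$\frac{\partial \log P(\mathbf{v})}{\partial \theta} = \frac{\partial}{\partial \theta} \left( \log \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})} - \log \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})} \right)$$
$$\frac{\partial \log P(\mathbf{v})}{\partial \theta} = \sum_{\mathbf{h}} P(\mathbf{h} \mid \mathbf{v}) \frac{\partial}{\partial \theta} \left(-E(\mathbf{v},\mathbf{h})\right) - \sum_{\mathbf{v},\mathbf{h}} P(\mathbf{v},\mathbf{h}) \frac{\partial}{\partial \theta} \left(-E(\mathbf{v},\mathbf{h})\right)$$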
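Training Algorithm
$$\frac{\partial}{\partial w_{ij}} E(\mathbf{v},\mathbf{h}) = -v_i h_j \qquad \frac{\partial}{\partial a_i} E(\mathbf{v},\mathbf{h}) = -v_i \qquad \frac{\partial}{\partial b_j} E(\mathbf{v},\mathbf{h}) = -h_j$$
$$\frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} = \sum_{\mathbf{h}} P(\mathbf{h} \mid \mathbf{v})\, v_i h_j - \sum_{\mathbf{v},\mathbf{h}} P(\mathbf{v},\mathbf{h})\, v_i h_j$$
$$\frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} = P(h_j = 1 \mid \mathbf{v})\, v_i - \sum_{\mathbf{v}} P(\mathbf{v})\, P(h_j = 1 \mid \mathbf{v})\, v_i$$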
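Training Algorithm
$$\frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} = P(h_j = 1 \mid \mathbf{v})\, v_i - \sum_{\mathbf{v}} P(\mathbf{v})\, P(h_j = 1 \mid \mathbf{v})\, v_i$$
$$\frac{\partial \log P(\mathbf{v})}{\partial a_i} = v_i - \sum_{\mathbf{v}} P(\mathbf{v})\, v_i$$
$$\frac{\partial \log P(\mathbf{v})}{\partial b_j} = P(h_j = 1 \mid \mathbf{v}) - \sum_{\mathbf{v}} P(\mathbf{v})\, P(h_j = 1 \mid \mathbf{v})$$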
Training Algorithm
To compute $\theta = \{\mathbf{a}, \mathbf{b}, \mathbf{W}\}$, the three gradients should be equal to 0.
The first terms of the gradients are easy to compute; the second terms are difficult to compute (they require many sampling steps, e.g. using Gibbs sampling).
However, it was shown that estimates obtained after just a few steps can be sufficient for model training.
Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Computation 14, 1771–1800 (2002)
Contrastive Divergence is commonly used to approximate the log-likelihood gradient for training an RBM.
Contrastive Divergence (CD or CD-k)
Usually, only 1 step (k = 1) is enough.
Source: An Introduction to Restricted Boltzmann Machines by Asja Fischer and Christian Igel
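A minimal sketch of one CD-1 update is given below. It is not the algorithm listing from the cited source, only an illustration under assumptions: the function name `cd1_update`, the learning rate, the batch averaging, and the layout of `W` (visible units by hidden units) are all choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(V, a, b, W, lr=0.1, rng=None):
    """One CD-1 step on a batch V (each row is a binary visible vector)."""
    rng = np.random.default_rng() if rng is None else rng
    # Positive phase: hidden probabilities and samples driven by the data.
    ph0 = sigmoid(b + V @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct the visible units, then recompute the hiddens.
    pv1 = sigmoid(a + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    # Gradient estimate: data statistics minus reconstruction statistics.
    n = V.shape[0]
    W = W + lr * (V.T @ ph0 - v1.T @ ph1) / n
    a = a + lr * (V - v1).mean(axis=0)
    b = b + lr * (ph0 - ph1).mean(axis=0)
    return a, b, W

# Toy usage: two repeated binary patterns, 4 visible and 3 hidden units.
rng = np.random.default_rng(0)
V = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)
a, b, W = np.zeros(4), np.zeros(3), rng.normal(0.0, 0.01, size=(4, 3))
for _ in range(100):
    a, b, W = cd1_update(V, a, b, W, rng=rng)
```

The reconstruction after a single Gibbs step replaces the intractable sum over all visible vectors in the second term of the gradient.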
A Short Conclusion
So far, we have dealt with only ONE RBM.
An RBM has only two layers, so it is not exactly “Deep”.
We can use Contrastive Divergence to train an RBM.
Architecture of DBN
So far, we have dealt with only ONE RBM.
We then train the remaining RBMs with the same approach, to compute their a, b, and W.
Note: At this point, we only have a locally optimal configuration.
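The layer-by-layer procedure can be sketched as follows. This is an assumption-laden illustration rather than the demo code: each RBM is trained with a simplified CD-1 that uses reconstruction probabilities instead of samples, each upper RBM is trained on the hidden activation probabilities of the RBM below it, and the names `train_rbm` and `pretrain_dbn` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=20, lr=0.1, seed=0):
    """Train one RBM with a simplified CD-1 and return (a, b, W)."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    a, b = np.zeros(n_visible), np.zeros(n_hidden)
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    for _ in range(epochs):
        ph0 = sigmoid(b + data @ W)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        v1 = sigmoid(a + h0 @ W.T)  # reconstruction probabilities
        ph1 = sigmoid(b + v1 @ W)
        W += lr * (data.T @ ph0 - v1.T @ ph1) / len(data)
        a += lr * (data - v1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
    return a, b, W

def pretrain_dbn(data, hidden_sizes):
    """Greedy stacking: each RBM is trained on the output of the one below."""
    layers, layer_input = [], data
    for n_hidden in hidden_sizes:
        a, b, W = train_rbm(layer_input, n_hidden)
        layers.append((a, b, W))
        layer_input = sigmoid(b + layer_input @ W)  # feed probabilities upward
    return layers

# Toy usage: 20 random binary vectors with 6 visible units, hidden sizes 5 and 4.
data = (np.random.default_rng(1).random((20, 6)) < 0.5).astype(float)
layers = pretrain_dbn(data, [5, 4])
print([W.shape for _, _, W in layers])  # [(6, 5), (5, 4)]
```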
Backpropagation (Fine-Tuning)
Use the backpropagation algorithm to fine-tune the network and get closer to a global optimum.
Fine-tuning uses labels, so it makes the DBN a supervised model.
Demo (MNIST Classifier)
Code provided by Ruslan Salakhutdinov and Geoff Hinton
Conclusion to DBN
Simple architecture, easy to scale.
An efficient algorithm exists to pre-train the network.
My test:
Layer 1   Layer 2   Layer 3   BP    Time
500       500       2000      200   36 hours
500       500       2000      50    6.5 hours
200       200       1000      50    2 hours
Limits:
Labeled data are needed.
Computation time is still an issue (imagine a full color picture taken with a camera: how many parameters need to be updated for a 200-200-1000 network?).
Auto-encoder
To train a model with a classifier, labeled data are needed.
However, in most cases only unlabeled data are available, and labeling data is very expensive.
Therefore, we need an unsupervised way to train the model.
Auto-encoder basic idea
Auto-encoder with DBN
Demo (MNIST Autoencoder)
Code provided by Ruslan Salakhutdinov and Geoff Hinton
Convolutional Neural Network
Very popular in image recognition.
http://cs.stanford.edu/people/karpathy/deepimagesent/
http://deeplearning.cs.toronto.edu/
Special architecture that reduces the data size significantly (which means the number of network parameters is also reduced).
However, it still needs a long time to train the network because of the training algorithm.
Question
Given an image, how would you reduce the data size?
Convolutional Neural Network Architecture
Convolution layer
Subsampling layer
Full Connection layer
LeNet-5 architecture, source: Gradient-based Learning Applied to Document Recognition, by Yann LeCun et al., 1998
Basic Idea of CNN
Feedforward pass: to compute the error
Backpropagation pass: to update the weights and biases
Basic Idea of Feedforward Pass
Convolution Layer: Use several filters to enhance the features from the input (or the previous layer).
Subsampling Layer: Because an image has local spatial relations, down-sampling can reduce the data size while still keeping valuable information (e.g. imagine that you can still recognize a picture from its thumbnail).
Full Connection Layer: Can be regarded as a classifier.
A good video about the Feedforward Pass
https://www.youtube.com/watch?v=n6hpQwq7Inw
Convolution Layer part: from 5:42 to 7:16
What 2D matrix convolution is
What effect 2D matrix convolution can achieve
Subsampling Layer part: at 10:10
How to subsample
Question
How many parameters are needed for a CNN, compared with a DBN? (Input data are 32x32 digits, or full color pictures)
Convolutional Neural Network Architecture
Task when training a CNN: Given labeled data, obtain suitable weights and biases of the matrices for the convolution layers and subsampling layers.
LeNet-5 architecture, source: Gradient-based Learning Applied to Document Recognition, by Yann LeCun et al., 1998
Model of CNN
The following slides are based on “Notes on Convolutional Neural Networks” by Jake Bouvrie, 2006
Model of CNN
For a multiclass problem with $c$ classes and $N$ training examples, the error is given by
$$E_N = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{c} \left(t_{n,k} - y_{n,k}\right)^2$$
$E_N$: error over all training examples
$t_{n,k}$: label (target) of the n-th input data w.r.t. the k-th class
$y_{n,k}$: output of the n-th input data w.r.t. the k-th class
Model of CNN
For a multiclass problem with $c$ classes and $N$ training examples, the error of the n-th example is given by
$$E_n = \frac{1}{2} \sum_{k=1}^{c} \left(t_{n,k} - y_{n,k}\right)^2$$
or
$$E_n = \frac{1}{2} \left\lVert \mathbf{t}_n - \mathbf{y}_n \right\rVert^2$$
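A small worked example of the error, with made-up labels and outputs for two examples and three classes. The function names are hypothetical; the squared difference is symmetric, so the value does not depend on which of the two vectors is called the label.

```python
import numpy as np

def example_error(t_n, y_n):
    """E_n = 1/2 * sum_k (t_{n,k} - y_{n,k})^2 for one example."""
    return 0.5 * np.sum((t_n - y_n) ** 2)

def total_error(T, Y):
    """E_N: sum of the per-example errors over all N examples."""
    return sum(example_error(t, y) for t, y in zip(T, Y))

T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
Y = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.6, 0.1]])
print(example_error(T[0], Y[0]))  # 0.5 * (0.04 + 0.01 + 0.01) = 0.03 (up to rounding)
print(total_error(T, Y))          # 0.03 + 0.13 = 0.16 (up to rounding)
```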
Model of General Feedforward Pass
The output of a certain layer:
$$\mathbf{x}_l = f(\mathbf{u}_l) \quad \text{with} \quad \mathbf{u}_l = \mathbf{W}_l \mathbf{x}_{l-1} + \mathbf{b}_l$$
$l$: current layer. Layer 1 is the input data layer and layer $L$ is the output layer of the CNN. Therefore, $l$ runs from 2 to $L$.
$\mathbf{x}_l$: output of layer $l$
$\mathbf{W}_l$ and $\mathbf{b}_l$: weights and biases of layer $l$
$f(\cdot)$: activation function, commonly the sigmoid or hyperbolic tangent function
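Model of General Backpropagation Pass
The backpropagation algorithm is used to update the weights and biases.
$\delta$ is regarded as the bias sensitivity, which is propagated back through the network:
$$\delta \equiv \frac{\partial E}{\partial b} = \frac{\partial E}{\partial u} \frac{\partial u}{\partial b}$$
Since $\frac{\partial u}{\partial b} = 1$, $\delta$ becomes
$$\delta \equiv \frac{\partial E}{\partial b} = \frac{\partial E}{\partial u}$$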
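Model of General Backpropagation Pass
$\delta$ for layer $l$:
$$\boldsymbol{\delta}_l = \left(\mathbf{W}_{l+1}\right)^T \boldsymbol{\delta}_{l+1} \circ f'(\mathbf{u}_l)$$
$\delta$ for layer $L$:
$$\boldsymbol{\delta}_L = f'(\mathbf{u}_L) \circ \left(\mathbf{y}_n - \mathbf{t}_n\right)$$
$\circ$: element-wise multiplication
Final equation to update the bias:
$$\Delta \mathbf{b}_l = -\eta \frac{\partial E}{\partial \mathbf{b}_l} = -\eta\, \boldsymbol{\delta}_l$$
$\eta$: learning rate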
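Model of General Backpropagation Pass
To update the weights, with a process analogous to the bias update:
$$\frac{\partial E}{\partial \mathbf{W}_l} = \mathbf{x}_{l-1} \boldsymbol{\delta}_l^T$$
$$\Delta \mathbf{W}_l = -\eta \frac{\partial E}{\partial \mathbf{W}_l}$$
Note: with $\mathbf{u}_l = \mathbf{W}_l \mathbf{x}_{l-1} + \mathbf{b}_l$, the outer product $\mathbf{x}_{l-1} \boldsymbol{\delta}_l^T$ has the shape of $\mathbf{W}_l^T$, so it is transposed before being applied to $\mathbf{W}_l$.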
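The feedforward and backpropagation equations for fully connected layers can be sketched as below. This is an illustration under assumptions: the activation is the sigmoid (so $f'(u) = f(u)(1 - f(u))$), the output-layer $\delta$ uses output minus label, the weight gradient is stored as `np.outer(delta, x_prev)` so that it has the same shape as the weight matrix, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, params):
    """Return the outputs x_l of every layer (xs[0] is the input)."""
    xs = [x]
    for W, b in params:
        xs.append(sigmoid(W @ xs[-1] + b))
    return xs

def error(x, t, params):
    """E_n = 1/2 * ||t - output||^2 for one example."""
    return 0.5 * np.sum((t - forward(x, params)[-1]) ** 2)

def backprop(x, t, params):
    """Return [(dE/dW_l, dE/db_l)] for every layer using the delta recursion."""
    xs = forward(x, params)
    # Output layer: delta_L = f'(u_L) * (output - label).
    delta = xs[-1] * (1.0 - xs[-1]) * (xs[-1] - t)
    grads = []
    for l in range(len(params) - 1, -1, -1):
        # dE/db = delta; dE/dW is the outer product, shaped like W.
        grads.append((np.outer(delta, xs[l]), delta))
        if l > 0:
            W = params[l][0]
            # delta for the layer below: W^T delta, times f'(u) of that layer.
            delta = (W.T @ delta) * xs[l] * (1.0 - xs[l])
    return grads[::-1]

def sgd_step(x, t, params, eta=0.5):
    """Apply Delta W = -eta * dE/dW and Delta b = -eta * dE/db."""
    grads = backprop(x, t, params)
    return [(W - eta * gW, b - eta * gb)
            for (W, b), (gW, gb) in zip(params, grads)]

rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), rng.normal(size=4)),
          (rng.normal(size=(2, 4)), rng.normal(size=2))]
x, t = rng.normal(size=3), np.array([1.0, 0.0])
print(error(x, t, params), error(x, t, sgd_step(x, t, params, eta=0.01)))
```

A finite-difference check on a single weight is a convenient way to confirm that the recursion matches the derivative of the error.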
Detail Form for Convolution Layer
Feedforward Pass
$$\mathbf{x}_{l,j} = f\left( \sum_{i \in M_j} \mathbf{x}_{l-1,i} * \mathbf{k}_{l,ij} + b_{l,j} \right)$$
$\mathbf{k}_{l,ij}$: weight matrix for layer $l$, between feature maps $i$ and $j$
$M_j$: a selection of input maps
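The convolution-layer feedforward formula can be sketched with a hand-written “valid” 2D convolution. The choice of “valid” borders, the tanh activation, and the function names are assumptions made for the example.

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2D convolution: flip the kernel, then slide it over the input."""
    kh, kw = k.shape
    flipped = k[::-1, ::-1]
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * flipped)
    return out

def conv_layer_map(inputs, kernels, bias, f=np.tanh):
    """x_{l,j} = f(sum_i x_{l-1,i} * k_{l,ij} + b_{l,j}) for one output map j."""
    u = sum(conv2d_valid(x, k) for x, k in zip(inputs, kernels)) + bias
    return f(u)

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[1.0, 0.0],
              [0.0, -1.0]])
print(conv2d_valid(x, k))  # every entry is x[r+1, c+1] - x[r, c] = 5
print(conv_layer_map([x, x], [k, k], bias=0.1).shape)  # (3, 3)
```

A 4x4 input and a 2x2 kernel give a 3x3 feature map, and the same small kernel is reused at every position, which is where the reduction in parameters comes from.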
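Detail Form for Convolution Layer
Backpropagation Pass
$$\boldsymbol{\delta}_{l,j} = \beta_{l+1,j} \left( f'(\mathbf{u}_{l,j}) \circ up\left(\boldsymbol{\delta}_{l+1,j}\right) \right)$$
$\beta_{l+1,j}$: the weight of the following subsampling layer (see next slide)
$up(\cdot)$: up-sampling method, e.g. Kronecker product
$f'(\cdot)$: derivative of the activation function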
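Up-sampling with the Kronecker product can be written with `np.kron`. In the sketch below the subsampling factor of 2, the tanh activation, and the function names are assumptions made for the example.

```python
import numpy as np

def up(delta, n=2):
    """Repeat each entry of delta over an n x n block (Kronecker product with ones)."""
    return np.kron(delta, np.ones((n, n)))

def conv_layer_delta(delta_next, u, beta_next, n=2):
    """delta_{l,j} = beta_{l+1,j} * (f'(u_{l,j}) times up(delta_{l+1,j})), with f = tanh."""
    f_prime = 1.0 - np.tanh(u) ** 2
    return beta_next * (f_prime * up(delta_next, n))

delta_next = np.array([[1.0, 2.0],
                       [3.0, 4.0]])
print(up(delta_next))
# [[1. 1. 2. 2.]
#  [1. 1. 2. 2.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]
print(conv_layer_delta(delta_next, np.zeros((4, 4)), beta_next=0.5))
```

Up-sampling is needed because the sensitivity map of the subsampling layer is smaller than the feature map of the convolution layer below it.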
Detail Form for Subsampling Layer
Feedforward Pass
$$\mathbf{x}_{l,j} = f\left( \beta_{l,j}\, down\left(\mathbf{x}_{l-1,j}\right) + b_{l,j} \right)$$
$\beta_{l,j}$: nothing special, just a “weight”. Here it is only a scalar, not a matrix.
$down(\cdot)$: down-sampling method, e.g. average, maximum, etc.
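Down-sampling over non-overlapping blocks can be sketched with a reshape. The block size of 2, the tanh activation, and the function names are assumptions made for the example.

```python
import numpy as np

def down(x, n=2, mode="average"):
    """Reduce each non-overlapping n x n block of x to a single value."""
    rows, cols = x.shape[0] // n, x.shape[1] // n
    blocks = x[:rows * n, :cols * n].reshape(rows, n, cols, n)
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    if mode == "maximum":
        return blocks.max(axis=(1, 3))
    raise ValueError(f"unknown mode: {mode}")

def subsampling_layer_map(x_prev, beta, bias, f=np.tanh, n=2):
    """x_{l,j} = f(beta_{l,j} * down(x_{l-1,j}) + b_{l,j}) with a scalar beta."""
    return f(beta * down(x_prev, n) + bias)

x = np.arange(16, dtype=float).reshape(4, 4)
print(down(x, mode="average"))  # [[ 2.5  4.5] [10.5 12.5]]
print(down(x, mode="maximum"))  # [[ 5.  7.] [13. 15.]]
```

A 4x4 map becomes a 2x2 map, so the data size drops by a factor of four while each output still summarizes its local neighbourhood.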
Detail Form for Subsampling Layer
Backpropagation Pass
$$\boldsymbol{\delta}_{l,j} = \left(\mathbf{k}_{l+1,j}\right)^T \boldsymbol{\delta}_{l+1,j} \circ f'(\mathbf{u}_{l,j})$$
A Short Conclusion (Feedforward)
Target: Compute the error.
Convolution Layers: Convolution is used instead of multiplication.
Subsampling Layers: Different down-sampling methods can be used.
A Short Conclusion (Backpropagation)
Target: Back-propagate the error and update the weights and biases.
Convolution Layers: Up-sampling is needed.
Subsampling Layers: A shortcut method exists in MATLAB (more details are in the paper by Jake Bouvrie).
More detailed BP steps are introduced in “Notes on Convolutional Neural Networks” by Jake Bouvrie.
Conclusion to CNN
Significantly reduces the data size and the number of parameters.
The training algorithm is not efficient (only the BP algorithm currently).
Research exists on combining CNN and DBN.
Personal view: It is a little easier to understand what is happening in the different layers of a CNN than in a DBN, although it is still hard to understand why certain filters are chosen after training.
Conclusion to Deep Learning
Feature Learning
Hierarchical architecture (simulates brain activity)
There is no theoretical proof of what the optimal parameters are (number of layers, number of units, etc.)
Good performance in image and speech recognition, although it is hard for us to understand what is happening in the network
Computation time is still an issue