Deep Learning Qing LU, Siyuan CAO

Some Applications http://cs.stanford.edu/people/karpathy/deepimagesent/ http://deeplearning.cs.toronto.edu/ http://yann.lecun.com/exdb/lenet/ Deep Learning started to show its power in speech recognition around 2010 (Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition", 2012)

Contents: Deep Learning Basic Idea; A simple deep network: Deep Belief Network; Unsupervised Learning: Auto-encoder; A more popular network: Convolutional Neural Network

Deep Learning Basic Idea Basic Architecture

Architecture Example: This is a deep network with one 3-unit Input Layer, one 2-unit Output Layer, and two 5-unit Hidden Layers. Note: a unit is also known as a "neuron".

Why such an architecture? Because we learn things in this way. http://en.wikipedia.org/wiki/Deep_learning#Deep_learning_in_the_human_brain Most popular Deep Learning architectures are built from Artificial Neural Networks, which were quite popular from the 1950s to the 1990s. Artificial Neural Networks met resistance because of the computation time needed to train them.

Deep Learning Basic Idea Basic Architecture How it works

How does it work? It is an apple! Smells good! Other parameters

How does it work? It is an apple! Smells good! Other parameters. Chemical materials (e.g. hormones) are created. Signals pass through neurons.

How does it work? It is an apple! Smells good! Other parameters. Signals pass to the next level of neurons.

How does it work? It is an apple! It is delicious! Smells good! Other parameters. Signals pass to the brain and mouth, making the mouth water. The input and the output are visible, while the middle layers are unknown to us unless we are biologists. Therefore, we call the middle layers "Hidden Layers".

Notice Each layer is another kind of representation of the input (Feature Learning). There is no feedback loop in this specific architecture. Feedback models exist, e.g. the Recurrent Neural Network, but their computational complexity increases, which is why they are less popular than feedback-free models.

How do we make a machine learn in this way?

Deep Belief Network Input Vector

Deep Belief Network The unit $h_{1,j}$ is activated according to a certain probability $P(h_{1,j} = 1 \mid \boldsymbol{v})$. Note: 1. Each unit is a binary unit, i.e. its value is either 0 or 1. 2. The biases and connection weights are real numbers. Input Vector

Deep Belief Network The unit $h_{2,j}$ is activated according to a certain probability $P(h_{2,j} = 1 \mid \boldsymbol{h}_1)$. Input Vector

Deep Belief Network Again, the output unit is activated according to a certain probability. Input Vector

Question How can we find this probability? (i.e. the training process) First, a DBN is built by stacking several simple networks, namely Restricted Boltzmann Machines (RBMs), invented under the name "Harmonium" by Paul Smolensky in 1986.

Architecture of DBN

Restricted Boltzmann Machine (RBM) Introduction to the RBM: The RBM is a variant of the Boltzmann Machine. An RBM has only two layers, commonly referred to as the "visible" and "hidden" units. Connections exist only between a "visible" unit and a "hidden" unit. There is NO connection between two "visible" units or between two "hidden" units.

Question How can we find this probability? (i.e. the training process) First, a DBN is built by stacking several simple networks, namely Restricted Boltzmann Machines (RBMs), invented under the name "Harmonium" by Paul Smolensky in 1986. Second, energy is introduced into the model.

Energy Based Models We associate a scalar energy $E(x,y)$ to each configuration. The probability distribution w.r.t. the energy is defined as $p(x,y) = \frac{e^{-E(x,y)}}{Z}$, where $Z$ is the normalizing factor $Z = \sum_{x,y} e^{-E(x,y)}$. We want the following property: lower energy indicates a more "desirable" configuration. What is "desirable"? For a given data pair $(x,y)$, $x$ is the input and $y$ is the output. If $x$ and $y$ are compatible, then the energy should be low; if $x$ and $y$ are not compatible, then the energy should be high. Source: lecun-06.pdf

Energy Function For an RBM, each configuration contains two units and one connection. Therefore the energy function is defined as follows: $E(v_i, h_j) = -\left(a_i v_i + b_j h_j + v_i w_{ij} h_j\right)$, where $v_i$ and $h_j$ are binary units (i.e. $v_i, h_j \in \{0,1\}$), $a_i$ and $b_j$ are the biases of the units, and $w_{ij}$ is the weight of the connection.

Energy Function We expand the energy function to the vectors $\boldsymbol{v}$ and $\boldsymbol{h}$: $E(\boldsymbol{v}, \boldsymbol{h}) = -\left(\sum_i a_i v_i + \sum_j b_j h_j + \sum_i \sum_j v_i w_{ij} h_j\right)$. Furthermore, entirely in vector form: $E(\boldsymbol{v}, \boldsymbol{h}) = -\left(\boldsymbol{a}^T \boldsymbol{v} + \boldsymbol{b}^T \boldsymbol{h} + \boldsymbol{v}^T \boldsymbol{W} \boldsymbol{h}\right)$.
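As a sanity check, here is a minimal NumPy sketch of this vector-form energy. The 3-visible/2-hidden sizes and the 0.1 weights are illustrative values of mine, not from the slides.

```python
import numpy as np

def rbm_energy(v, h, a, b, W):
    """Energy of a joint configuration: E(v, h) = -(a^T v + b^T h + v^T W h)."""
    return -(a @ v + b @ h + v @ W @ h)

# Tiny illustrative configuration: 3 visible units, 2 hidden units.
v = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 1.0])
a = np.zeros(3)             # visible biases
b = np.zeros(2)             # hidden biases
W = 0.1 * np.ones((3, 2))   # connection weights w_ij
print(rbm_energy(v, h, a, b, W))  # -> -0.4
```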

Something about Probabilities The probability distribution is defined as $P(\boldsymbol{v}, \boldsymbol{h}) = \frac{e^{-E(\boldsymbol{v},\boldsymbol{h})}}{Z}$, where $Z$ is the normalizing factor $Z = \sum_{\boldsymbol{v},\boldsymbol{h}} e^{-E(\boldsymbol{v},\boldsymbol{h})}$. The probability of a visible vector is $P(\boldsymbol{v}) = \frac{1}{Z} \sum_{\boldsymbol{h}} e^{-E(\boldsymbol{v},\boldsymbol{h})}$. Because there is no connection between two visible units or two hidden units, the hidden units are conditionally independent of each other given the visible units, and the same holds for the visible units given the hidden units. Therefore, we have the conditional probabilities $P(\boldsymbol{v} \mid \boldsymbol{h}) = \prod_i P(v_i \mid \boldsymbol{h})$ and $P(\boldsymbol{h} \mid \boldsymbol{v}) = \prod_j P(h_j \mid \boldsymbol{v})$.

Activation Probability The activation probability of unit $v_i$ or $h_j$ is: $P(v_i = 1 \mid \boldsymbol{h}) = \frac{1}{1 + e^{-\left(a_i + \sum_j w_{ij} h_j\right)}}$ and $P(h_j = 1 \mid \boldsymbol{v}) = \frac{1}{1 + e^{-\left(b_j + \sum_i w_{ij} v_i\right)}}$. How do we arrive at the sigmoid function?
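A small NumPy sketch of these two conditional (sigmoid) activation probabilities, using the same $\boldsymbol{W}$ of shape (visible, hidden) as the energy sketch above; the example values are mine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, b, W):
    """P(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i), computed for all j at once."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, a, W):
    """P(v_i = 1 | h) = sigmoid(a_i + sum_j w_ij h_j), computed for all i at once."""
    return sigmoid(a + W @ h)

# With the toy values from the energy sketch: sigmoid(0.2) for both hidden units.
print(p_h_given_v(np.array([1.0, 0.0, 1.0]), np.zeros(2), 0.1 * np.ones((3, 2))))
# -> [0.5498 0.5498]
```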

Training Algorithm For a given training data set $V$ (a matrix in which each row is a visible vector $\boldsymbol{v}$), the RBM is trained to $\arg\max_\theta \prod_{\boldsymbol{v} \in V} P(\boldsymbol{v})$, or equivalently $\arg\max_\theta \sum_{\boldsymbol{v} \in V} \log P(\boldsymbol{v})$, with $\theta = \{\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{W}\}$. This is Maximum Likelihood: $\prod_{\boldsymbol{v} \in V} P(\boldsymbol{v})$ is the likelihood function, as in the Machine Learning course.

Training Algorithm $\frac{\partial \log P(\boldsymbol{v})}{\partial \theta} = \frac{\partial}{\partial \theta}\left(\log \sum_{\boldsymbol{h}} e^{-E(\boldsymbol{v},\boldsymbol{h})} - \log \sum_{\boldsymbol{v},\boldsymbol{h}} e^{-E(\boldsymbol{v},\boldsymbol{h})}\right) = \sum_{\boldsymbol{h}} P(\boldsymbol{h} \mid \boldsymbol{v}) \frac{\partial \left(-E(\boldsymbol{v},\boldsymbol{h})\right)}{\partial \theta} - \sum_{\boldsymbol{v},\boldsymbol{h}} P(\boldsymbol{v},\boldsymbol{h}) \frac{\partial \left(-E(\boldsymbol{v},\boldsymbol{h})\right)}{\partial \theta}$

Training Algorithm $\frac{\partial E(\boldsymbol{v},\boldsymbol{h})}{\partial w_{ij}} = -v_i h_j$, $\frac{\partial E(\boldsymbol{v},\boldsymbol{h})}{\partial a_i} = -v_i$, $\frac{\partial E(\boldsymbol{v},\boldsymbol{h})}{\partial b_j} = -h_j$. Therefore $\frac{\partial \log P(\boldsymbol{v})}{\partial w_{ij}} = \sum_{\boldsymbol{h}} P(\boldsymbol{h} \mid \boldsymbol{v}) v_i h_j - \sum_{\boldsymbol{v},\boldsymbol{h}} P(\boldsymbol{v},\boldsymbol{h}) v_i h_j = P(h_j = 1 \mid \boldsymbol{v}) v_i - \sum_{\boldsymbol{v}} P(\boldsymbol{v}) P(h_j = 1 \mid \boldsymbol{v}) v_i$

Training Algorithm $\frac{\partial \log P(\boldsymbol{v})}{\partial w_{ij}} = P(h_j = 1 \mid \boldsymbol{v}) v_i - \sum_{\boldsymbol{v}} P(\boldsymbol{v}) P(h_j = 1 \mid \boldsymbol{v}) v_i$, $\frac{\partial \log P(\boldsymbol{v})}{\partial a_i} = v_i - \sum_{\boldsymbol{v}} P(\boldsymbol{v}) v_i$, $\frac{\partial \log P(\boldsymbol{v})}{\partial b_j} = P(h_j = 1 \mid \boldsymbol{v}) - \sum_{\boldsymbol{v}} P(\boldsymbol{v}) P(h_j = 1 \mid \boldsymbol{v})$

Training Algorithm To find $\theta = \{\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{W}\}$, these three gradients should be driven to zero. The first terms of the gradients are easy to compute; however, the second terms are difficult to compute (they require many sampling steps, e.g. using Gibbs sampling). $\frac{\partial \log P(\boldsymbol{v})}{\partial w_{ij}} = P(h_j = 1 \mid \boldsymbol{v}) v_i - \sum_{\boldsymbol{v}} P(\boldsymbol{v}) P(h_j = 1 \mid \boldsymbol{v}) v_i$

Training Algorithm To find $\theta = \{\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{W}\}$, these three gradients should be driven to zero. The first terms of the gradients are easy to compute; however, the second terms are difficult to compute (they require many sampling steps, e.g. using Gibbs sampling). However, it has been shown that estimates obtained after just a few steps can be sufficient for model training (Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Computation 14, 1771–1800 (2002)). Contrastive Divergence is commonly used to approximate the log-likelihood gradient for training RBMs.

Contrastive Divergence (CD or CD-k) Usually, only 1 step is enough. Source: An Introduction to Restricted Boltzmann Machines by Asja Fischer and Christian Igel
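A minimal sketch of a CD-1 update for one binary training vector, assuming the same shapes as the earlier sketches ($\boldsymbol{W}$ of shape visible x hidden). This is an illustrative reading of the algorithm, not the authors' demo code.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, a, b, W, lr=0.1):
    """One CD-1 step for a single binary training vector v0 (updates a, b, W in place).

    Positive phase: statistics from the data v0.
    Negative phase: one Gibbs step v0 -> h0 -> v1 -> h1.
    The gradient is approximated by the difference of the two correlation terms."""
    ph0 = sigmoid(b + v0 @ W)                            # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)     # sample hidden states
    pv1 = sigmoid(a + W @ h0)                            # P(v=1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)     # "reconstruction"
    ph1 = sigmoid(b + v1 @ W)                            # P(h=1 | v1)
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)
```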

A Short Conclusion So far, we have only dealt with ONE RBM. An RBM has only two layers, which is not exactly "deep". We can use Contrastive Divergence to train an RBM.

Architecture of DBN So far, we have trained only ONE RBM. We then do the same thing to the remaining RBMs, computing their $\boldsymbol{a}$, $\boldsymbol{b}$, and $\boldsymbol{W}$.

Architecture of DBN Train the remaining RBMs with the same approach. Note: so far, we only obtain a locally optimal configuration.
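A sketch of this greedy layer-wise pre-training loop. `train_rbm` is a hypothetical helper standing in for the CD training above; each trained layer is frozen and its hidden probabilities become the data for the next RBM in the stack.

```python
import numpy as np

def greedy_pretrain(data, layer_sizes, train_rbm):
    """Greedy layer-wise pre-training sketch.

    `train_rbm(x, n_hidden)` is a hypothetical helper that trains one RBM
    (e.g. with the CD-1 update above) and returns (a, b, W)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    params, x = [], data
    for n_hidden in layer_sizes:
        a, b, W = train_rbm(x, n_hidden)   # train this layer, then freeze it
        params.append((a, b, W))
        x = sigmoid(b + x @ W)             # propagate the data up one layer
    return params
```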

Backpropagation (Fine-Tuning) The backpropagation algorithm is used to fine-tune the network and get closer to the global optimum. This makes the DBN a supervised model.

Demo (MNIST Classifier) Code provided by Ruslan Salakhutdinov and Geoff Hinton

Conclusion to DBN Simple architecture, easy to scale. An efficient algorithm exists to pre-train the network. My test: Layer 1: 500, Layer 2: 500, Layer 3: 2000, BP: 200, Time: 36 hours. Layer 1: 500, Layer 2: 500, Layer 3: 2000, BP: 50, Time: 6.5 hours. Layer 1: 200, Layer 2: 200, Layer 3: 1000, BP: 50, Time: 2 hours. Limits: needs labeled data; computation time is still an issue (imagine a full-color picture taken from a camera: how many parameters need to be updated for a 200-200-1000 network? A rough count follows below.)
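A rough back-of-the-envelope count for that question; the 1920x1080 camera resolution is my assumption, not from the slides.

```python
# Rough weight count for a 200-200-1000 network fed by a full-color camera image.
inputs = 1920 * 1080 * 3                         # ~6.2 million visible units
weights = inputs * 200 + 200 * 200 + 200 * 1000  # ignoring biases
print(f"{weights:,}")                            # -> 1,244,400,000 (about 1.2 billion)
```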

Auto-encoder To train a model with a classifier, labeled data are needed. However, in most cases only unlabeled data are available, and labeling data is very expensive. Therefore, we need an unsupervised way to train the model.

Auto-encoder basic idea
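The basic idea, sketched in NumPy: an encoder maps the input to a (usually smaller) code, a decoder maps it back, and the model is trained to minimize the reconstruction error, so no labels are required. The function and variable names here are illustrative, not from the slides.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def autoencoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode the input to a code, decode it back, and measure how well
    the input is reconstructed; no labels are needed."""
    code = sigmoid(b_enc + x @ W_enc)        # encoder
    x_hat = sigmoid(b_dec + code @ W_dec)    # decoder
    reconstruction_error = np.mean((x - x_hat) ** 2)
    return code, x_hat, reconstruction_error
```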

Auto-encoder with DBN

Demo (MNIST Autoencoder) Code provided by Ruslan Salakhutdinov and Geoff Hinton

Convolutional Neural Network Very popular in image recognition http://cs.stanford.edu/people/karpathy/deepimagesent/ http://deeplearning.cs.toronto.edu/ Its special architecture reduces the data size significantly (which means the number of network parameters is also reduced). However, it still needs a long time to train because of the algorithm.

Question Given an image, how would you reduce the data size?

Convolutional Neural Network Architecture Convolution layer, Subsampling layer, Full Connection layer. LeNet-5 architecture, source: Gradient-Based Learning Applied to Document Recognition, by Yann LeCun et al., 1998

Basic Idea of CNN Feedforward pass: to compute the error. Backpropagation pass: to update the weights and biases.

Basic Idea of the Feedforward Pass Convolution Layer: uses several filters to enhance features from the input (or the previous layer). Subsampling Layer: because images have local spatial structure, down-sampling can reduce the data size while still keeping the valuable information (e.g. imagine that you can still recognize a picture from its thumbnail). Full Connection Layer: can be regarded as a classifier. (A toy sketch of the first two layer types follows below.)
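A toy NumPy sketch of one convolution layer followed by 2x2 average-pooling subsampling; the image size, the edge filter, and the tanh activation are illustrative choices of mine.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D convolution: slide the flipped kernel over the image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    flipped = kernel[::-1, ::-1]
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * flipped)
    return out

def avg_pool_2x2(x):
    """2x2 average pooling (subsampling): halves each spatial dimension."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

image = np.random.rand(28, 28)                           # a fake grayscale input
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)           # a simple vertical-edge filter
feature_map = np.tanh(conv2d_valid(image, edge_filter))  # convolution layer + activation
pooled = avg_pool_2x2(feature_map)                       # subsampling layer
print(feature_map.shape, pooled.shape)                   # (26, 26) (13, 13)
```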

A good video about the Feedforward Pass https://www.youtube.com/watch?v=n6hpQwq7Inw Convolution Layer part: from 5:42 to 7:16 — what 2-D matrix convolution is and what effect it can achieve. Subsampling Layer part: at 10:10 — how to subsample.

Question How many parameters are needed for a CNN, compared with a DBN? (Input data are 32x32 digit images, or full-color pictures.)
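An illustrative comparison for a 32x32 grayscale digit; the 500-unit dense layer is my choice of a DBN-style first layer, while the six 5x5 filters follow LeNet-5's first convolution layer (C1).

```python
dense_weights = 32 * 32 * 500      # fully connected first layer: 512,000 weights
conv_params = 6 * (5 * 5 + 1)      # six 5x5 filters plus one bias each: 156 parameters
print(dense_weights, conv_params)
```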

Convolutional Neural Network Architecture Task for training a CNN: given labeled data, obtain suitable weights and biases for the convolution layers and subsampling layers. LeNet-5 architecture, source: Gradient-Based Learning Applied to Document Recognition, by Yann LeCun et al., 1998

Model of CNN The following slides are based on "Notes on Convolutional Neural Networks" by Jake Bouvrie, 2006.

Model of CNN For a multiclass problem with $c$ classes and $N$ training examples, the error is given by $E^N = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{c} \left(t_{n,k} - y_{n,k}\right)^2$, where $E^N$ is the error over the whole training set, $t_{n,k}$ is the label (target) of the $n$-th example w.r.t. the $k$-th class, and $y_{n,k}$ is the network output for the $n$-th example w.r.t. the $k$-th class.

Model of CNN For a multiclass problem with $c$ classes and $N$ training examples, the error of the $n$-th example is given by $E^n = \frac{1}{2} \sum_{k=1}^{c} \left(t_{n,k} - y_{n,k}\right)^2$, or $E^n = \frac{1}{2} \left\| \boldsymbol{t}_n - \boldsymbol{y}_n \right\|_2^2$.

Model of the General Feedforward Pass The output of a certain layer is $\boldsymbol{x}^l = f(\boldsymbol{u}^l)$ with $\boldsymbol{u}^l = \boldsymbol{W}^l \boldsymbol{x}^{l-1} + \boldsymbol{b}^l$, where $l$ is the current layer (layer 1 is the input data layer and layer $L$ is the output layer of the CNN, so $l$ runs from 2 to $L$), $\boldsymbol{x}^l$ is the output of layer $l$, $\boldsymbol{W}^l$ and $\boldsymbol{b}^l$ are the weights and biases of layer $l$, and $f(\cdot)$ is the activation function, commonly a sigmoid or hyperbolic tangent function.
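A minimal sketch of this layer-by-layer pass for fully connected layers, keeping the pre-activations $\boldsymbol{u}^l$ so they can be reused in the backpropagation sketch further below; the sigmoid choice and names are mine.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def feedforward(x, weights, biases):
    """Compute x_l = f(W_l x_{l-1} + b_l) layer by layer, keeping the
    pre-activations u_l and activations x_l for use in backpropagation."""
    activations, pre_activations = [x], []
    for W, b in zip(weights, biases):
        u = W @ activations[-1] + b
        pre_activations.append(u)
        activations.append(sigmoid(u))
    return activations, pre_activations
```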

Model of the General Backpropagation Pass The backpropagation algorithm is used to update the weights and biases. $\delta$ is regarded as the bias sensitivity, which is propagated back through the network: $\delta \equiv \frac{\partial E}{\partial b} = \frac{\partial E}{\partial u} \frac{\partial u}{\partial b}$. Since $\frac{\partial u}{\partial b} = 1$, $\delta$ becomes $\delta \equiv \frac{\partial E}{\partial b} = \frac{\partial E}{\partial u}$.

Model of the General Backpropagation Pass $\delta$ for layer $l$: $\boldsymbol{\delta}^l = \left(\boldsymbol{W}^{l+1}\right)^T \boldsymbol{\delta}^{l+1} \circ f'(\boldsymbol{u}^l)$; for layer $L$: $\boldsymbol{\delta}^L = f'(\boldsymbol{u}^L) \circ \left(\boldsymbol{y}_n - \boldsymbol{t}_n\right)$, where $\circ$ denotes element-wise multiplication. The final equation to update the bias is $\Delta \boldsymbol{b}^l = -\eta \frac{\partial E}{\partial \boldsymbol{b}^l} = -\eta \boldsymbol{\delta}^l$, where $\eta$ is the learning rate.

Model of the General Backpropagation Pass To update the weights, with a process analogous to the bias update: $\frac{\partial E}{\partial \boldsymbol{W}^l} = \boldsymbol{x}^{l-1} \left(\boldsymbol{\delta}^l\right)^T$ and $\Delta \boldsymbol{W}^l = -\eta \frac{\partial E}{\partial \boldsymbol{W}^l}$.
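A sketch of this $\delta$ recursion, paired with the feedforward sketch above. Because that sketch uses the $\boldsymbol{W}^l \boldsymbol{x}^{l-1}$ (output x input) convention, the weight gradient appears as $\boldsymbol{\delta}^l (\boldsymbol{x}^{l-1})^T$, i.e. transposed relative to the slide; names and the sigmoid derivative are my choices.

```python
import numpy as np

def backprop_gradients(activations, pre_activations, weights, target):
    """delta_L = f'(u_L) o (y - t); delta_l = (W_{l+1}^T delta_{l+1}) o f'(u_l).
    Written for the W_l @ x_{l-1} convention of the feedforward sketch."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    f_prime = lambda u: sigmoid(u) * (1.0 - sigmoid(u))    # derivative of the sigmoid
    delta = f_prime(pre_activations[-1]) * (activations[-1] - target)  # output-layer delta
    grads_W, grads_b = [], []
    for l in reversed(range(len(weights))):
        grads_W.insert(0, np.outer(delta, activations[l])) # dE/dW_l
        grads_b.insert(0, delta)                           # dE/db_l
        if l > 0:
            delta = (weights[l].T @ delta) * f_prime(pre_activations[l - 1])
    return grads_W, grads_b   # then apply W_l -= eta * dE/dW_l, b_l -= eta * dE/db_l
```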

Detailed Form for the Convolution Layer Feedforward pass: $\boldsymbol{x}_j^l = f\left(\sum_{i \in M_j} \boldsymbol{x}_i^{l-1} * \boldsymbol{k}_{ij}^l + \boldsymbol{b}_j^l\right)$, where $\boldsymbol{k}_{ij}^l$ is the weight (filter) matrix of layer $l$ between feature maps $i$ and $j$, and $M_j$ is a selection of input maps.

Detailed Form for the Convolution Layer Backpropagation pass: $\boldsymbol{\delta}_j^l = \beta_j^{l+1} f'(\boldsymbol{u}_j^l) \circ \mathrm{up}\!\left(\boldsymbol{\delta}_j^{l+1}\right)$, where $\beta_j^{l+1}$ is explained on the next slide, $\mathrm{up}(\cdot)$ is an up-sampling method (e.g. a Kronecker product), and $f'(\cdot)$ is the derivative of the activation function.

Detailed Form for the Subsampling Layer Feedforward pass: $\boldsymbol{x}_j^l = f\left(\beta_j^l\, \mathrm{down}\!\left(\boldsymbol{x}_j^{l-1}\right) + \boldsymbol{b}_j^l\right)$, where $\beta_j^l$ is nothing special, just a "weight" (here it is only a scalar, not a matrix), and $\mathrm{down}(\cdot)$ is a down-sampling method, e.g. averaging or taking the maximum.

Detailed Form for the Subsampling Layer Backpropagation pass: $\boldsymbol{\delta}_j^l = \left(\boldsymbol{k}_j^{l+1}\right)^T \boldsymbol{\delta}_j^{l+1} \circ f'(\boldsymbol{u}_j^l)$

A Short Conclusion (Feedforward) Target: compute the error. Convolution Layers: convolution is used instead of multiplication. Subsampling Layers: different down-sampling methods can be used.

A Short Conclusion (Backpropagation) Target: back-propagate the error and update the weights and biases. Convolution Layers: up-sampling is needed. Subsampling Layers: a shortcut method exists in MATLAB (more details are in the paper by Jake Bouvrie). More detailed BP steps are introduced in "Notes on Convolutional Neural Networks" by Jake Bouvrie.

Conclusion to CNN Significantly reduces the data size and the number of parameters. The training algorithm is not efficient (currently only the BP algorithm). There is research on combining CNNs and DBNs. Personal view: it is a little easier to understand what is happening in the different layers than with a DBN, although it is still hard to understand why certain filters are chosen after training.

Conclusion to Deep Learning Feature Learning. Hierarchical architecture (simulating brain activity). There is no theoretical proof of what the optimal parameters are (number of layers, number of units, etc.). Good performance in image and speech recognition, although it is hard for us to understand what is happening inside the network. Computation time is still an issue.