Presentation on theme: "A shallow introduction to Deep Learning"— Presentation transcript:

1 A shallow introduction to Deep Learning
Zhiting Hu

2 Outline
Motivation: why go deep?
DL since 2006
Some DL Models
Discussion

3 Outline
Motivation: why go deep?
DL since 2006
Some DL Models
Discussion

4 Motivation: Definition
Deep Learning: a wide class of machine learning techniques and architectures, with the hallmark of using many layers of non-linear information processing that are hierarchical in nature.
An example: Deep Neural Networks

5 Example: Neural Network

6 Example: Neural Network
Input: x = (image shown on slide)
Output: Y = (0, 0, 0, 0, 0, 1, 0, 0, 0, 0)
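To make the input/output mapping concrete, here is a minimal NumPy sketch of a one-hidden-layer network turning a flattened image into ten class scores; the layer sizes and the random input are illustrative assumptions, not details from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: a 28x28 image flattened to 784 inputs,
# one hidden layer of 128 units, 10 output classes (digits 0-9).
rng = np.random.default_rng(0)
x  = rng.random(784)                         # stand-in for the input image x
W1 = rng.standard_normal((128, 784)) * 0.01  # input -> hidden weights
b1 = np.zeros(128)
W2 = rng.standard_normal((10, 128)) * 0.01   # hidden -> output weights
b2 = np.zeros(10)

h = sigmoid(W1 @ x + b1)                     # hidden activations
y = softmax(W2 @ h + b2)                     # class probabilities

# A trained network would put most of the probability mass on one class,
# approximating a one-hot vector such as Y = (0,0,0,0,0,1,0,0,0,0).
print(y.argmax(), y.round(2))
```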

7 Deep Neural Network (DNN)

8 Parameter learning: Back-propagation
Given a training dataset: (X, Y)
Learn parameters: W

9 Parameter learning: Back-propagation
Given a training dataset: (X, Y)
Learn parameters: W
2 phases:

10 Parameter learning: Back-propagation
Given a training dataset: (X, Y)
Learn parameters: W
2 phases: (1) forward propagation

11 Parameter learning: Back-propagation
Given a training dataset: (X, Y)
Learn parameters: W
2 phases: (1) forward propagation, (2) backward propagation
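A minimal NumPy sketch of both phases for a one-hidden-layer network; the toy data, layer sizes, squared loss, and learning rate are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data and shapes (hypothetical): 4 inputs, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
X = rng.random((5, 4))                       # 5 training examples
Y = rng.integers(0, 2, size=(5, 2)).astype(float)

W1 = rng.standard_normal((4, 3)) * 0.1
W2 = rng.standard_normal((3, 2)) * 0.1
lr = 0.5

for step in range(100):
    # (1) Forward propagation: compute activations layer by layer.
    H     = sigmoid(X @ W1)                  # hidden layer
    Y_hat = sigmoid(H @ W2)                  # output layer

    # (2) Backward propagation: push the error back through the network
    # and compute gradients of the squared loss w.r.t. W2 and W1.
    err    = Y_hat - Y
    dY_hat = err * Y_hat * (1 - Y_hat)       # delta at the output layer
    dH     = (dY_hat @ W2.T) * H * (1 - H)   # delta at the hidden layer
    gW2    = H.T @ dY_hat
    gW1    = X.T @ dH

    # Gradient-descent update of the parameters W.
    W2 -= lr * gW2
    W1 -= lr * gW1

print(((Y_hat - Y) ** 2).mean())             # training loss after 100 steps
```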

12 Motivation: why go deep?
Brains have a deep architecture
Humans organize their ideas hierarchically, through composition of simpler ideas
Insufficiently deep architectures can be exponentially inefficient
Distributed representations are necessary to achieve non-local generalization
Intermediate representations allow sharing statistical strength

13 Brains have a deep architecture

14 Brains have a deep architecture
[Lee, Grosse, Ranganath & Ng, 2009]

15 Brains have a deep architecture
Deep Learning = learning hierarchical representations (features)
[Lee, Grosse, Ranganath & Ng, 2009]

16 Deep Architecture in our Mind
Humans organize their ideas and concepts hierarchically
Humans first learn simpler concepts and then compose them to represent more abstract ones
Engineers break up solutions into multiple levels of abstraction and processing

17 Insufficiently deep architectures can be exponentially inefficient
Theoretical arguments:
Two layers of neurons = universal approximator
Some functions compactly represented with k layers may require exponential size with only 2 layers
Theorems on the advantage of depth: (Håstad et al. 1986 & 1991; Bengio et al. 2007; Bengio & Delalleau 2011; Braverman 2011)

18 Insufficiently deep architectures can be exponentially inefficient
“Shallow” computer program vs. “deep” computer program (examples on slide)

19 Outline
Motivation: why go deep?
DL since 2006
Some DL Models
Discussion

20 Why now? The “Winter of Neural Networks” since the ’90s
Before 2006, training deep architectures was unsuccessful (except for convolutional neural nets)
Main difficulty: local optima in the non-convex objective function of deep networks
Back-propagation (local gradient descent from random initialization) often gets trapped in poor local optima

21 Why now? The “Winter of Neural Networks” since the ’90s
Before 2006, training deep architectures was unsuccessful (except for convolutional neural nets)
Main difficulty: local optima in the non-convex objective function of deep networks
Back-propagation (local gradient descent from random initialization) often gets trapped in poor local optima
Other difficulties:
Too many parameters and small labeled datasets => overfitting
Hard to do theoretical analysis
Need a lot of tricks to play with ...

22 Why now? The “Winter of Neural Networks” since the ’90s
Before 2006, training deep architectures was unsuccessful (except for convolutional neural nets)
Main difficulty: local optima in the non-convex objective function of deep networks
Back-propagation (local gradient descent from random initialization) often gets trapped in poor local optima
Other difficulties:
Too many parameters and small labeled datasets => overfitting
Hard to do theoretical analysis
Need a lot of tricks to play with ...
So people turned to shallow models with convex loss functions (e.g., SVMs, CRFs, etc.)

23 What has changed?
New methods for unsupervised pre-training have been developed
Unsupervised: use unlabeled data
Pre-training: better initialization => better local optima

24 What has changed?
New methods for unsupervised pre-training have been developed
Unsupervised: use unlabeled data
Pre-training: better initialization => better local optima
GPUs and distributed systems => large-scale learning

25 Success in object recognition
Task: classify the 1.2 million images of the ImageNet LSVRC-2010 contest into 1000 different classes.

26 Success in object recognition

27 Success in object recognition

28 Success in speech recognition
Google uses DL in its Android speech recognizer (both server-side and on some phones with enough memory)
Results from Google, IBM, and Microsoft

29 Success in NLP: neural word embeddings
Use a neural network to learn vector representations of words
Semantic relations appear as linear relationships in the space of learned representations:
King – Queen ~= Man – Woman
Paris – France + Italy ~= Rome
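A small sketch of the analogy test using gensim; the choice of library and the pretrained vector file name are assumptions, and any embedding in word2vec format would do.

```python
from gensim.models import KeyedVectors

# Hypothetical path to a pretrained word2vec-format embedding file.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# "King - Man + Woman" should land near "Queen" in the embedding space.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# "Paris - France + Italy" should land near "Rome".
print(kv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=3))
```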

30 DL in Industry
Microsoft: first successful DL models for speech recognition, by MSR in 2009
Google: “Google Brain”, led by Google Fellow Jeff Dean; large-scale deep learning infrastructure (Le et al., ICML’12): 10 million 200x200 images, a network with 1 billion connections, trained on 1,000 machines (16K cores) for 3 days
Facebook: hires NYU deep learning expert to run its new AI lab

31 Outline
Motivation: why go deep?
DL since 2006
Some DL Models: Convolutional Neural Networks, Deep Belief Nets, Stacked auto-encoders / sparse coding
Discussion

32 Convolutional Neural Networks (CNNs)
Proposed by LeCun et al. (1989); the “only” successful DL model before 2006
Widely applied to image data (recently also to other tasks)

33 Convolutional Neural Networks (CNNs)
Proposed by LeCun et al. (1989); the “only” successful DL model before 2006
Widely applied to image data (recently also to other tasks)
Nearby pixels are more strongly correlated than distant pixels
Translation invariance

34 Convolutional Neural Networks (CNNs)
Proposed by LeCun et al. (1989); the “only” successful DL model before 2006
Widely applied to image data (recently also to other tasks)
Nearby pixels are more strongly correlated than distant pixels
Translation invariance
CNN ingredients:
Local receptive fields
Weight sharing: all units in a convolutional layer detect the same pattern, but at different locations in the input image
Subsampling: relatively insensitive to small shifts of the image

35 Convolutional Neural Networks (CNNs)

36 Convolutional Neural Networks (CNNs)
Proposed by LeCun et al. (1989); the “only” successful DL model before 2006
Widely applied to image data (recently also to other tasks)
Nearby pixels are more strongly correlated than distant pixels
Translation invariance
CNN ingredients:
Local receptive fields
Weight sharing: all units in a convolutional layer detect the same pattern, but at different locations in the input image
Subsampling: relatively insensitive to small shifts of the image
Training: back-propagation
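A minimal LeNet-style sketch in PyTorch (the framework and layer sizes are assumptions, not details from the slides), showing local receptive fields with shared weights (Conv2d) and subsampling (max pooling) on 28x28 inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        # Each filter is a local receptive field whose weights are shared
        # across all spatial positions of the image.
        self.conv1 = nn.Conv2d(1, 8, kernel_size=5)    # 28x28 -> 24x24
        self.conv2 = nn.Conv2d(8, 16, kernel_size=5)   # 12x12 -> 8x8
        self.fc = nn.Linear(16 * 4 * 4, n_classes)

    def forward(self, x):
        # Subsampling (max pooling) makes the features relatively
        # insensitive to small shifts of the image.
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)     # 24x24 -> 12x12
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)     # 8x8   -> 4x4
        return self.fc(x.flatten(1))                   # class scores

# The whole network is trained end to end with back-propagation,
# e.g. torch.optim.SGD on a cross-entropy loss.
scores = SmallCNN()(torch.randn(2, 1, 28, 28))
print(scores.shape)  # torch.Size([2, 10])
```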

37 Convolutional Neural Networks (CNNs)
MNIST handwritten digits benchmark
State of the art: 0.35% error rate (IJCAI 2011)

38 Outline
Motivation: why go deep?
DL since 2006
Some DL Models: Convolutional Neural Networks, Deep Belief Nets, Stacked auto-encoders / sparse coding
Discussion

39 Restricted Boltzmann Machine (RBM)
Building block of Deep Belief Nets (DBNs) and Deep Boltzmann Machines (DBMs)
Bipartite undirected graphical model over visible units v and hidden units h
Defines an energy E(v, h) = -b^T v - c^T h - v^T W h, with P(v, h) proportional to exp(-E(v, h))
Parameter learning:
Model parameters: W, b, c
Maximize the log-likelihood of the training data
Gradient descent, but use Contrastive Divergence (CD) to approximate the gradient
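A minimal NumPy sketch of one-step Contrastive Divergence (CD-1) for a binary RBM; the unit counts, toy data, and learning rate are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 visible units, 4 hidden units, a tiny binary dataset.
V = (rng.random((20, 6)) > 0.5).astype(float)
W = rng.standard_normal((6, 4)) * 0.1
b = np.zeros(6)          # visible bias
c = np.zeros(4)          # hidden bias
lr = 0.1

for epoch in range(50):
    # Positive phase: hidden probabilities given the data.
    ph_data = sigmoid(V @ W + c)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Negative phase (one Gibbs step): reconstruct the visibles, then
    # recompute the hidden probabilities from the reconstruction.
    pv_recon = sigmoid(h_sample @ W.T + b)
    ph_recon = sigmoid(pv_recon @ W + c)

    # CD-1 update: difference between data-driven and model-driven statistics.
    W += lr * (V.T @ ph_data - pv_recon.T @ ph_recon) / len(V)
    b += lr * (V - pv_recon).mean(axis=0)
    c += lr * (ph_data - ph_recon).mean(axis=0)

print(np.abs(V - pv_recon).mean())   # reconstruction error as a rough check
```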

40 Deep Belief Nets (DBNs)

41 DBNs Layer-wise pre-training

42 DBNs Layer-wise pre-training

43 DBNs Layer-wise pre-training

44 Supervised fine-tuning
After pre-training, the parameters W and c of each layer can be used to initialize a deep multi-layer neural network.
These parameters are then fine-tuned using back-propagation on labeled data.
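A sketch of the full recipe under stated assumptions: greedy layer-wise pre-training of stacked RBMs, whose learned (W, c) pairs then initialize a feed-forward network. It reuses the CD-1 idea from the earlier sketch; the layer sizes and toy data are hypothetical, and the supervised fine-tuning step is only indicated in a comment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, n_epochs=20, lr=0.1, rng=np.random.default_rng(0)):
    """One-step Contrastive Divergence, as in the RBM sketch above."""
    W = rng.standard_normal((data.shape[1], n_hidden)) * 0.1
    b, c = np.zeros(data.shape[1]), np.zeros(n_hidden)
    for _ in range(n_epochs):
        ph  = sigmoid(data @ W + c)
        h   = (rng.random(ph.shape) < ph).astype(float)
        pv  = sigmoid(h @ W.T + b)
        ph2 = sigmoid(pv @ W + c)
        W += lr * (data.T @ ph - pv.T @ ph2) / len(data)
        b += lr * (data - pv).mean(axis=0)
        c += lr * (ph - ph2).mean(axis=0)
    return W, c

# Hypothetical unlabeled data: 100 binary vectors of length 20.
rng = np.random.default_rng(0)
X = (rng.random((100, 20)) > 0.5).astype(float)

# Greedy layer-wise pre-training: train each RBM on the hidden
# activations of the layer below, then stack the learned (W, c).
layer_sizes = [20, 16, 8]
params, data = [], X
for n_hidden in layer_sizes[1:]:
    W, c = train_rbm(data, n_hidden)
    params.append((W, c))
    data = sigmoid(data @ W + c)        # input to the next RBM

# The stacked (W, c) pairs initialize a deep feed-forward network, which
# would then be fine-tuned with back-propagation on labeled data
# (fine-tuning omitted here).
h = X
for W, c in params:
    h = sigmoid(h @ W + c)
print(h.shape)                          # (100, 8): top-level representation
```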

45 Outline
Motivation: why go deep?
DL since 2006
Some DL Models: Convolutional Neural Networks, Deep Belief Nets, Stacked auto-encoders / sparse coding
Discussion

46 Stacked auto-encoders / sparse coding
Building blocks: auto-encoders / sparse coding (non-probabilistic)
Structure similar to DBNs

47 Stacked auto-encoders / sparse coding
Building blocks: auto-encoders / sparse coding (non-probabilistic)
Structure similar to DBNs
Let’s skip it ...
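Although the slides skip the details, here is a minimal auto-encoder sketch in PyTorch (the framework and layer sizes are assumptions): the encoder maps the input to a lower-dimensional code, the decoder reconstructs the input, and training minimizes reconstruction error on unlabeled data, which is why auto-encoders can be stacked and pre-trained layer-wise like RBMs.

```python
import torch
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),      # encoder: 784 -> 64 code
    nn.Linear(64, 784), nn.Sigmoid(),   # decoder: 64 -> 784 reconstruction
)

x = torch.rand(8, 784)                  # hypothetical batch of flattened images
loss = nn.functional.mse_loss(autoencoder(x), x)
loss.backward()                         # gradients of the reconstruction loss
print(loss.item())
```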

48 DL Models

49 DL Models

50 DL Models

51 Outline
Motivation: why go deep?
DL since 2006
Some DL Models: Convolutional Neural Networks, Deep Belief Nets, Stacked auto-encoders / sparse coding
Discussion

52 Deep Learning = Learning Hierarchical features

53 Deep Learning = Learning Hierarchical features
The pipeline of machine visual perception

54 Deep Learning = Learning Hierarchical features
The pipeline of machine visual perception
Features in NLP (hand-crafted)

55 Deep Learning = Learning Hierarchical features

56 Discussion: Problems
No need for feature engineering, but training DL models still requires a significant amount of engineering, e.g., hyperparameter tuning: number of layers, layer sizes, connectivity, learning rate

57 Discussion: Problems
No need for feature engineering, but training DL models still requires a significant amount of engineering, e.g., hyperparameter tuning: number of layers, layer sizes, connectivity, learning rate
Computational scaling: recent breakthroughs in speech, object recognition, and NLP hinged on faster computing, GPUs, and large datasets

58 Discussion: Problems
No need for feature engineering, but training DL models still requires a significant amount of engineering, e.g., hyperparameter tuning: number of layers, layer sizes, connectivity, learning rate
Computational scaling: recent breakthroughs in speech, object recognition, and NLP hinged on faster computing, GPUs, and large datasets
Lack of theoretical analysis

59 Outline
Motivation: why go deep?
DL since 2006
Some DL Models: Convolutional Neural Networks, Deep Belief Nets, Stacked auto-encoders / sparse coding
Discussion

60 References

