Presentation on theme: "A shallow introduction to Deep Learning"— Presentation transcript:

1 A shallow introduction to Deep Learning
Zhiting Hu

2 Outline
Motivation: why go deep?
DL since 2006
Some DL Models
Discussion

3 Outline
Motivation: why go deep?
DL since 2006
Some DL Models
Discussion

4 Motivation: Definition
Deep Learning: a wide class of machine learning techniques and architectures, with the hallmark of using many layers of non-linear information processing that are hierarchical in nature.
An example: Deep Neural Networks

5 Example: Neural Network

6 Example: Neural Network
Input: x = (image shown on slide)
Output: Y = (0, 0, 0, 0, 0, 1, 0, 0, 0, 0)
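To make the input/output mapping concrete, here is a minimal NumPy sketch of a one-hidden-layer network turning a flattened image into ten class scores; the layer sizes and the random input are illustrative assumptions, not details from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: a 28x28 image flattened to 784 inputs,
# one hidden layer of 128 units, 10 output classes (digits 0-9).
rng = np.random.default_rng(0)
x  = rng.random(784)                         # stand-in for the input image x
W1 = rng.standard_normal((128, 784)) * 0.01  # input -> hidden weights
b1 = np.zeros(128)
W2 = rng.standard_normal((10, 128)) * 0.01   # hidden -> output weights
b2 = np.zeros(10)

h = sigmoid(W1 @ x + b1)                     # hidden activations
y = softmax(W2 @ h + b2)                     # class probabilities

# A trained network would put most of the probability mass on one class,
# approximating a one-hot vector such as Y = (0,0,0,0,0,1,0,0,0,0).
print(y.argmax(), y.round(2))
```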

7 Deep Neural Network (DNN)

8 Parameter learning: Back-propagation
Given a training dataset: (X, Y)
Learn parameters: W

9 Parameter learning: Back-propagation
Given a training dataset: (X, Y)
Learn parameters: W
2 phases:

10 Parameter learning: Back-propagation
Given a training dataset: (X, Y)
Learn parameters: W
2 phases: (1) forward propagation

11 Parameter learning: Back-propagation
Given a training dataset: (X, Y)
Learn parameters: W
2 phases: (1) forward propagation, (2) backward propagation
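A minimal NumPy sketch of both phases for a one-hidden-layer network; the toy data, layer sizes, squared loss, and learning rate are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data and shapes (hypothetical): 4 inputs, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
X = rng.random((5, 4))                       # 5 training examples
Y = rng.integers(0, 2, size=(5, 2)).astype(float)

W1 = rng.standard_normal((4, 3)) * 0.1
W2 = rng.standard_normal((3, 2)) * 0.1
lr = 0.5

for step in range(100):
    # (1) Forward propagation: compute activations layer by layer.
    H     = sigmoid(X @ W1)                  # hidden layer
    Y_hat = sigmoid(H @ W2)                  # output layer

    # (2) Backward propagation: push the error back through the network
    # and compute gradients of the squared loss w.r.t. W2 and W1.
    err    = Y_hat - Y
    dY_hat = err * Y_hat * (1 - Y_hat)       # delta at the output layer
    dH     = (dY_hat @ W2.T) * H * (1 - H)   # delta at the hidden layer
    gW2    = H.T @ dY_hat
    gW1    = X.T @ dH

    # Gradient-descent update of the parameters W.
    W2 -= lr * gW2
    W1 -= lr * gW1

print(((Y_hat - Y) ** 2).mean())             # training loss after 100 steps
```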

12 Motivation: why go deep?
Brains have a deep architecture
Humans organize their ideas hierarchically, through composition of simpler ideas
Insufficiently deep architectures can be exponentially inefficient
Distributed representations are necessary to achieve non-local generalization
Intermediate representations allow sharing statistical strength

13 Brains have a deep architecture

14 Brains have a deep architecture
[Lee, Grosse, Ranganath & Ng, 2009]

15 Brains have a deep architecture
Deep Learning = learning hierarchical representations (features)
[Lee, Grosse, Ranganath & Ng, 2009]

16 Deep Architecture in our Mind
Humans organize their ideas and concepts hierarchically
Humans first learn simpler concepts and then compose them to represent more abstract ones
Engineers break up solutions into multiple levels of abstraction and processing

17 Insufficiently deep architectures can be exponentially inefficient
Theoretical arguments:
Two layers of neurons = universal approximator
Some functions compactly represented with k layers may require exponential size with only 2 layers
Theorems on the advantage of depth: (Håstad et al. 1986 & 1991; Bengio et al. 2007; Bengio & Delalleau 2011; Braverman 2011)

18 Insufficiently deep architectures can be exponentially inefficient
“Shallow” computer program vs. “deep” computer program (examples on slide)

19 Outline
Motivation: why go deep?
DL since 2006
Some DL Models
Discussion

20 Why now? The “Winter of Neural Networks” since the ’90s
Before 2006, training deep architectures was unsuccessful (except for convolutional neural nets)
Main difficulty: local optima in the non-convex objective function of deep networks
Back-propagation (local gradient descent from random initialization) often gets trapped in poor local optima

21 Why now? The “Winter of Neural Networks” since the ’90s
Before 2006, training deep architectures was unsuccessful (except for convolutional neural nets)
Main difficulty: local optima in the non-convex objective function of deep networks
Back-propagation (local gradient descent from random initialization) often gets trapped in poor local optima
Other difficulties:
Too many parameters and small labeled datasets => overfitting
Hard to do theoretical analysis
Need a lot of tricks to play with ...

22 Why now? The “Winter of Neural Networks” since the ’90s
Before 2006, training deep architectures was unsuccessful (except for convolutional neural nets)
Main difficulty: local optima in the non-convex objective function of deep networks
Back-propagation (local gradient descent from random initialization) often gets trapped in poor local optima
Other difficulties:
Too many parameters and small labeled datasets => overfitting
Hard to do theoretical analysis
Need a lot of tricks to play with ...
So people turned to shallow models with convex loss functions (e.g., SVMs, CRFs, etc.)

23 What has changed?
New methods for unsupervised pre-training have been developed
Unsupervised: use unlabeled data
Pre-training: better initialization => better local optima

24 What has changed?
New methods for unsupervised pre-training have been developed
Unsupervised: use unlabeled data
Pre-training: better initialization => better local optima
GPUs and distributed systems => large-scale learning

25 Success in object recognition
Task: classify the 1.2 million images of the ImageNet LSVRC-2010 contest into 1000 different classes.

26 Success in object recognition

27 Success in object recognition

28 Success in speech recognition
Google uses DL in its Android speech recognizer (both server-side and on some phones with enough memory)
Results from Google, IBM, and Microsoft

29 Success in NLP: neural word embeddings
Use a neural network to learn vector representations of words
Semantic relations appear as linear relationships in the space of learned representations:
King – Queen ~= Man – Woman
Paris – France + Italy ~= Rome
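A small sketch of the analogy test using gensim; the choice of library and the pretrained vector file name are assumptions, and any embedding in word2vec format would do.

```python
from gensim.models import KeyedVectors

# Hypothetical path to a pretrained word2vec-format embedding file.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# "King - Man + Woman" should land near "Queen" in the embedding space.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# "Paris - France + Italy" should land near "Rome".
print(kv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=3))
```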

30 DL in Industry
Microsoft: first successful DL models for speech recognition, by MSR in 2009
Google: “Google Brain”, led by Google Fellow Jeff Dean; large-scale deep learning infrastructure (Le et al., ICML’12): 10 million 200x200 images, a network with 1 billion connections, trained on 1,000 machines (16K cores) for 3 days
Facebook: hires NYU deep learning expert to run its new AI lab

31 Outline
Motivation: why go deep?
DL since 2006
Some DL Models: Convolutional Neural Networks, Deep Belief Nets, Stacked auto-encoders / sparse coding
Discussion

32 Convolutional Neural Networks (CNNs)
Proposed by LeCun et al. (1989); the “only” successful DL model before 2006
Widely applied to image data (recently also to other tasks)

33 Convolutional Neural Networks (CNNs)
Proposed by LeCun et al. (1989); the “only” successful DL model before 2006
Widely applied to image data (recently also to other tasks)
Nearby pixels are more strongly correlated than distant pixels
Translation invariance

34 Convolutional Neural Networks (CNNs)
Proposed by LeCun et al. (1989); the “only” successful DL model before 2006
Widely applied to image data (recently also to other tasks)
Nearby pixels are more strongly correlated than distant pixels
Translation invariance
CNN ingredients:
Local receptive fields
Weight sharing: all units in a convolutional layer detect the same pattern, but at different locations in the input image
Subsampling: relatively insensitive to small shifts of the image

35 Convolutional Neural Networks (CNNs)

36 Convolutional Neural Networks (CNNs)
Proposed by LeCun et al. (1989); the “only” successful DL model before 2006
Widely applied to image data (recently also to other tasks)
Nearby pixels are more strongly correlated than distant pixels
Translation invariance
CNN ingredients:
Local receptive fields
Weight sharing: all units in a convolutional layer detect the same pattern, but at different locations in the input image
Subsampling: relatively insensitive to small shifts of the image
Training: back-propagation
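A minimal LeNet-style sketch in PyTorch (the framework and layer sizes are assumptions, not details from the slides), showing local receptive fields with shared weights (Conv2d) and subsampling (max pooling) on 28x28 inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        # Each filter is a local receptive field whose weights are shared
        # across all spatial positions of the image.
        self.conv1 = nn.Conv2d(1, 8, kernel_size=5)    # 28x28 -> 24x24
        self.conv2 = nn.Conv2d(8, 16, kernel_size=5)   # 12x12 -> 8x8
        self.fc = nn.Linear(16 * 4 * 4, n_classes)

    def forward(self, x):
        # Subsampling (max pooling) makes the features relatively
        # insensitive to small shifts of the image.
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)     # 24x24 -> 12x12
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)     # 8x8   -> 4x4
        return self.fc(x.flatten(1))                   # class scores

# The whole network is trained end to end with back-propagation,
# e.g. torch.optim.SGD on a cross-entropy loss.
scores = SmallCNN()(torch.randn(2, 1, 28, 28))
print(scores.shape)  # torch.Size([2, 10])
```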

37 Convolutional Neural Networks (CNNs)
MNIST handwritten digits benchmark
State of the art: 0.35% error rate (IJCAI 2011)

38 Outline
Motivation: why go deep?
DL since 2006
Some DL Models: Convolutional Neural Networks, Deep Belief Nets, Stacked auto-encoders / sparse coding
Discussion

39 Restricted Boltzmann Machine (RBM)
Building block of Deep Belief Nets (DBNs) and Deep Boltzmann Machines (DBMs)
Bipartite undirected graphical model over visible units v and hidden units h
Defines an energy E(v, h) = -b^T v - c^T h - v^T W h, with P(v, h) proportional to exp(-E(v, h))
Parameter learning:
Model parameters: W, b, c
Maximize the log-likelihood of the training data
Gradient descent, but use Contrastive Divergence (CD) to approximate the gradient
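A minimal NumPy sketch of one-step Contrastive Divergence (CD-1) for a binary RBM; the unit counts, toy data, and learning rate are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 visible units, 4 hidden units, a tiny binary dataset.
V = (rng.random((20, 6)) > 0.5).astype(float)
W = rng.standard_normal((6, 4)) * 0.1
b = np.zeros(6)          # visible bias
c = np.zeros(4)          # hidden bias
lr = 0.1

for epoch in range(50):
    # Positive phase: hidden probabilities given the data.
    ph_data = sigmoid(V @ W + c)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Negative phase (one Gibbs step): reconstruct the visibles, then
    # recompute the hidden probabilities from the reconstruction.
    pv_recon = sigmoid(h_sample @ W.T + b)
    ph_recon = sigmoid(pv_recon @ W + c)

    # CD-1 update: difference between data-driven and model-driven statistics.
    W += lr * (V.T @ ph_data - pv_recon.T @ ph_recon) / len(V)
    b += lr * (V - pv_recon).mean(axis=0)
    c += lr * (ph_data - ph_recon).mean(axis=0)

print(np.abs(V - pv_recon).mean())   # reconstruction error as a rough check
```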

40 Deep Belief Nets (DBNs)

41 DBNs Layer-wise pre-training

42 DBNs Layer-wise pre-training

43 DBNs Layer-wise pre-training

44 Supervised fine-tuning
After pre-training, the parameters W and c of each layer can be used to initialize a deep multi-layer neural network.
These parameters are then fine-tuned using back-propagation on labeled data.
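A sketch of the full recipe under stated assumptions: greedy layer-wise pre-training of stacked RBMs, whose learned (W, c) pairs then initialize a feed-forward network. It reuses the CD-1 idea from the earlier sketch; the layer sizes and toy data are hypothetical, and the supervised fine-tuning step is only indicated in a comment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, n_epochs=20, lr=0.1, rng=np.random.default_rng(0)):
    """One-step Contrastive Divergence, as in the RBM sketch above."""
    W = rng.standard_normal((data.shape[1], n_hidden)) * 0.1
    b, c = np.zeros(data.shape[1]), np.zeros(n_hidden)
    for _ in range(n_epochs):
        ph  = sigmoid(data @ W + c)
        h   = (rng.random(ph.shape) < ph).astype(float)
        pv  = sigmoid(h @ W.T + b)
        ph2 = sigmoid(pv @ W + c)
        W += lr * (data.T @ ph - pv.T @ ph2) / len(data)
        b += lr * (data - pv).mean(axis=0)
        c += lr * (ph - ph2).mean(axis=0)
    return W, c

# Hypothetical unlabeled data: 100 binary vectors of length 20.
rng = np.random.default_rng(0)
X = (rng.random((100, 20)) > 0.5).astype(float)

# Greedy layer-wise pre-training: train each RBM on the hidden
# activations of the layer below, then stack the learned (W, c).
layer_sizes = [20, 16, 8]
params, data = [], X
for n_hidden in layer_sizes[1:]:
    W, c = train_rbm(data, n_hidden)
    params.append((W, c))
    data = sigmoid(data @ W + c)        # input to the next RBM

# The stacked (W, c) pairs initialize a deep feed-forward network, which
# would then be fine-tuned with back-propagation on labeled data
# (fine-tuning omitted here).
h = X
for W, c in params:
    h = sigmoid(h @ W + c)
print(h.shape)                          # (100, 8): top-level representation
```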

45 Outline
Motivation: why go deep?
DL since 2006
Some DL Models: Convolutional Neural Networks, Deep Belief Nets, Stacked auto-encoders / sparse coding
Discussion

46 Stacked auto-encoders / sparse coding
Building blocks: auto-encoders / sparse coding (non-probabilistic)
Structure similar to DBNs

47 Stacked auto-encoders / sparse coding
Building blocks: auto-encoders / sparse coding (non-probabilistic)
Structure similar to DBNs
Let’s skip it ...
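Although the slides skip the details, here is a minimal auto-encoder sketch in PyTorch (the framework and layer sizes are assumptions): the encoder maps the input to a lower-dimensional code, the decoder reconstructs the input, and training minimizes reconstruction error on unlabeled data, which is why auto-encoders can be stacked and pre-trained layer-wise like RBMs.

```python
import torch
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),      # encoder: 784 -> 64 code
    nn.Linear(64, 784), nn.Sigmoid(),   # decoder: 64 -> 784 reconstruction
)

x = torch.rand(8, 784)                  # hypothetical batch of flattened images
loss = nn.functional.mse_loss(autoencoder(x), x)
loss.backward()                         # gradients of the reconstruction loss
print(loss.item())
```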

48 DL Models

49 DL Models

50 DL Models

51 Outline
Motivation: why go deep?
DL since 2006
Some DL Models: Convolutional Neural Networks, Deep Belief Nets, Stacked auto-encoders / sparse coding
Discussion

52 Deep Learning = Learning Hierarchical features

53 Deep Learning = Learning Hierarchical features
The pipeline of machine visual perception

54 Deep Learning = Learning Hierarchical features
The pipeline of machine visual perception
Features in NLP (hand-crafted)

55 Deep Learning = Learning Hierarchical features

56 Discussion: Problems
No need for feature engineering, but training DL models still requires a significant amount of engineering, e.g., hyperparameter tuning: number of layers, layer sizes, connectivity, learning rate

57 Discussion: Problems
No need for feature engineering, but training DL models still requires a significant amount of engineering, e.g., hyperparameter tuning: number of layers, layer sizes, connectivity, learning rate
Computational scaling: recent breakthroughs in speech, object recognition, and NLP hinged on faster computing, GPUs, and large datasets

58 Discussion: Problems
No need for feature engineering, but training DL models still requires a significant amount of engineering, e.g., hyperparameter tuning: number of layers, layer sizes, connectivity, learning rate
Computational scaling: recent breakthroughs in speech, object recognition, and NLP hinged on faster computing, GPUs, and large datasets
Lack of theoretical analysis

59 Outline
Motivation: why go deep?
DL since 2006
Some DL Models: Convolutional Neural Networks, Deep Belief Nets, Stacked auto-encoders / sparse coding
Discussion

60 References

