
1 Deep Learning RUSSIR 2017 – Day 3
Efstratios Gavves

2 Recap from Day 2
Deep Learning is good not only for classifying things. Structured prediction is also possible. Multi-task structured prediction allows for unified networks and for handling missing data. Discovering structure in data is also possible.

3 Sequential Data
So far, all tasks assumed stationary data. However, not all data, and not all tasks, are stationary.

4-6 Sequential Data: Text
What about inputs that appear in sequences, such as text? Could a neural network handle such modalities? (Slides 4-6 build up this question word by word.)

7 Memory needed
To handle such sequences the model needs memory: $\Pr(x) = \prod_i \Pr(x_i \mid x_1, \dots, x_{i-1})$.

8 Sequential data: Video

9 Quiz: Other sequential data?

10 Quiz: Other sequential data?
Time series data: stock exchange, biological measurements, climate measurements, market analysis. Speech/music. User behavior in websites. ...

11 Applications
(Links to a YouTube video and a website demo in the original slide.)

12 Machine Translation
The phrase in the source language is one sequence: “Today the weather is good”. The phrase in the target language is also a sequence: “Погода сегодня хорошая”. (Diagram: an encoder LSTM reads “Today the weather is good <EOS>”; a decoder LSTM then generates “Погода сегодня хорошая <EOS>”, the encoder-decoder architecture.)

13 Image captioning
An image is a thousand words, literally! Pretty much the same as machine translation: replace the encoder part with the output of a ConvNet (e.g. use an AlexNet or a VGG16 network) and keep the decoder part to operate as a translator. (Diagram: a ConvNet encoder followed by an LSTM decoder generating “Today the weather is good <EOS>”.)

14 Demo
(Link to a YouTube video in the original slide.)

15 Question answering
Bleeding-edge research, no real consensus; very interesting open research problems. Again, pretty much like machine translation: the encoder-decoder paradigm. Insert the question into the encoder part and model the answer at the decoder part. Question answering with images is also possible (again, bleeding-edge research): how/where to add the image? What has worked so far is to add the image only at the beginning.
Q: John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter? A: John entered the living room.
Q: What are the people playing? A: They play beach football.

16 Demo
(Link to a website demo in the original slide.)

17 Handwriting
(Link to a website demo in the original slide.)

18 Recurrent Networks
(Diagram: an input feeding an NN cell with a state and recurrent connections, producing the output.)

19 Sequences
Next data depend on previous data. Roughly equivalent to predicting what comes next: $\Pr(x) = \prod_i \Pr(x_i \mid x_1, \dots, x_{i-1})$.

20 Why sequences?
Parameters are reused $\Rightarrow$ fewer parameters $\Rightarrow$ easier modelling. Generalizes well to arbitrary lengths $\Rightarrow$ not constrained to a specific length. However, often we pick a “frame” $T$ instead of an arbitrary length: $\Pr(x) = \prod_i \Pr(x_i \mid x_{i-T}, \dots, x_{i-1})$.
Examples: $RecurrentModel($ “I think, therefore, I am!” $)$; $RecurrentModel($ “Everything is repeated, in a circle. History is a master because it teaches us that it doesn't exist. It's the permutations that matter.” $)$

21 Quiz: What is really a sequence?
Data inside a sequence are … ? (Word tiles on the slide: “McGuire”, “Bond”, “Bond”, “I”, “am”, “Bond”, “,”, “James”, “tired”, “am”, “!”)

22 Quiz: What is really a sequence?
Data inside a sequence are not i.i.d. (independently, identically distributed). The next “word” depends on the previous “words”, ideally on all of them. We need context, and we need memory! How do we model context and memory? (The word tiles order into “I am Bond, James Bond”.)

23 One-hot vectors
$x_i \equiv$ a one-hot vector: all zeros except for the active dimension. 12 words in a sequence $\Rightarrow$ 12 one-hot vectors. After the one-hot vectors, apply an embedding (Word2Vec, GloVe). Vocabulary: I, am, Bond, “,”, James, McGuire, tired, “!”; each word activates exactly one dimension.
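As an illustration, here is a minimal NumPy sketch (not from the slides; the toy vocabulary and word order are assumptions) that turns the example sentence into one-hot vectors:

```python
import numpy as np

# Hypothetical toy vocabulary, in the spirit of the slide's example.
vocab = ["I", "am", "Bond", ",", "James", "McGuire", "tired", "!"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return a vector of zeros with a single 1 at the word's index."""
    x = np.zeros(vocab_size)
    x[word_to_index[word]] = 1.0
    return x

sentence = ["I", "am", "Bond", ",", "James", "Bond"]
X = np.stack([one_hot(w) for w in sentence])  # shape: (6 words, 8 vocabulary entries)
print(X)
```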

24 Quiz: Why not just indices?
One-hot representation OR index representation? E.g. index representation: $x_{\text{“I”}} = 1$, $x_{\text{“am”}} = 2$, $x_{\text{“James”}} = 4$, $x_{\text{“McGuire”}} = 7$.

25 Quiz: Why not just indices?
No, because then some words get closer together for no good reason (an artificial bias between words). With one-hot vectors, $\mathrm{distance}(x_{\text{“am”}}, x_{\text{“McGuire”}}) = 1$ and $\mathrm{distance}(x_{\text{“I”}}, x_{\text{“am”}}) = 1$, i.e. all words are equally far apart. With indices $q_{\text{“I”}} = 1$, $q_{\text{“am”}} = 2$, $q_{\text{“James”}} = 7$ we get $\mathrm{distance}(q_{\text{“am”}}, q_{\text{“McGuire”}}) = 5$ but $\mathrm{distance}(q_{\text{“I”}}, q_{\text{“am”}}) = 1$.

26 Memory
A representation of the past. A memory must project information at timestep $t$ onto a latent space $c_t$ using parameters $\theta$, then re-use the projected information from $t$ at $t+1$: $c_{t+1} = h(x_{t+1}, c_t; \theta)$. The memory parameters $\theta$ are shared for all timesteps $t = 0, \dots$:
$c_{t+1} = h(x_{t+1}, h(x_t, h(x_{t-1}, \dots h(x_1, c_0; \theta); \theta); \theta); \theta)$
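To make the shared-parameter recurrence concrete, a minimal sketch (a toy scalar state with one arbitrary choice of $h$; the names and values are illustrative, not from the slides):

```python
import numpy as np

def h(x_t, c_prev, theta):
    """One illustrative choice of memory update; the slide only requires some h(x, c; theta)."""
    U, W = theta
    return np.tanh(U * x_t + W * c_prev)

theta = (0.5, 0.9)           # shared parameters, reused at every timestep
c = 0.0                      # initial memory c_0
for x_t in [1.0, -0.3, 0.7]:
    c = h(x_t, c, theta)     # c_{t+1} = h(x_{t+1}, c_t; theta)
print(c)
```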

27 Memory as a Graph
Simplest model: input $x_t$ with parameters $U$, memory embedding $c_t$ with parameters $W$, output $y_t$ with parameters $V$.

28 Memory as a Graph
The same model, unrolled one step: memory $c_t \to c_{t+1}$, inputs $x_t, x_{t+1}$, outputs $y_t, y_{t+1}$, with $U$, $V$, $W$ shared across steps.

29 Memory as a Graph
Unrolled over several timesteps: memories $c_t, c_{t+1}, c_{t+2}, c_{t+3}$, inputs $x_t, \dots, x_{t+3}$, outputs $y_t, \dots, y_{t+3}$; the same $U$, $V$, $W$ are reused at every step.

30 Folding the memory
Folded network vs. unrolled/unfolded network: the unrolled graph repeats $U$, $V$, $W$ at timesteps $t, t+1, t+2$; the folded graph is a single cell that maps $x_t$ to $y_t$ through memory $c_t$, with the recurrent connection feeding $c_{t-1}$ back through $W$.

31 Recurrent Neural Network (RNN)
Only two equations:
$c_t = \tanh(U x_t + W c_{t-1})$
$y_t = \mathrm{softmax}(V c_t)$

32 RNN Example
$c_t = \tanh(U x_t + W c_{t-1})$, $y_t = \mathrm{softmax}(V c_t)$. A vocabulary of 5 words; a memory of 3 units (a hyperparameter we choose, like layer size): $c_t: [3 \times 1]$, $W: [3 \times 3]$; an input projection of 3 dimensions: $U: [3 \times 5]$; an output projection of 10 dimensions: $V: [10 \times 3]$. Since $x_t$ is one-hot, $U \cdot x_{t=4}$ simply selects the 4th column of $U$.
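A minimal NumPy sketch of these two equations, using the dimensions stated on the slide (5-word vocabulary, 3 memory units, 10-dimensional output); the random weights are placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(3, 5))    # input projection,  3 x 5
W = rng.normal(size=(3, 3))    # memory weights,    3 x 3
V = rng.normal(size=(10, 3))   # output projection, 10 x 3

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, c_prev):
    c_t = np.tanh(U @ x_t + W @ c_prev)   # c_t = tanh(U x_t + W c_{t-1})
    y_t = softmax(V @ c_t)                # y_t = softmax(V c_t)
    return c_t, y_t

x4 = np.zeros(5); x4[3] = 1.0             # one-hot input for the 4th vocabulary word
c, y = rnn_step(x4, np.zeros(3))          # U @ x4 just selects the 4th column of U
print(c.shape, y.shape)                   # (3,) (10,)
```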

33 Rolled Network vs. MLP? What is really different?
Steps instead of layers. Step parameters are shared in a recurrent network; in a multi-layer network the parameters $W_1, W_2, W_3$ are different. (Diagram: a 3-gram unrolled recurrent network with shared $U$, $V$, $W$ per step, next to a 3-layer neural network.)

34 Rolled Network vs. MLP? What is really different?
(Same comparison as the previous slide.)

35 Rolled Network vs. MLP? What is really different?
Steps instead of layers; step parameters shared in the recurrent network, different in the multi-layer network. Sometimes the intermediate outputs are not even needed. Removing them, we almost end up with a standard neural network.

36 Rolled Network vs. MLP? What is really different?
(Diagram: the unrolled recurrent network with only the final output $y$, next to the 3-layer network.)

37 Training Recurrent Networks
Cross-entropy loss: $P = \prod_{t,k} y_{tk}^{l_{tk}} \Rightarrow \mathcal{L} = -\log P = \sum_t \mathcal{L}_t = -\frac{1}{T} \sum_t l_t \log y_t$
Backpropagation Through Time (BPTT): again the chain rule. The only difference is that gradients survive over time steps.
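A small sketch of the summed cross-entropy loss over timesteps, assuming one-hot targets $l_t$ and predicted distributions $y_t$ (for the gradients one would unroll the same graph backwards, i.e. BPTT, which in practice is left to automatic differentiation):

```python
import numpy as np

def sequence_cross_entropy(Y, L):
    """Y: (T, K) predicted distributions y_t; L: (T, K) one-hot targets l_t.
    Implements L = -(1/T) * sum_t l_t . log y_t from the slide."""
    T = Y.shape[0]
    return -np.sum(L * np.log(Y + 1e-12)) / T

# toy example: 2 timesteps, 3 classes
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
L = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(sequence_cross_entropy(Y, L))
```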

38 Training RNNs is hard
Vanishing gradients: after a few time steps the gradients become almost 0. Exploding gradients: after a few time steps the gradients become huge. Either way, the network can't capture long-term dependencies.

39 Alternative formulation for RNNs
An alternative formulation to derive conclusions and intuitions: $c_t = W \cdot \tanh(c_{t-1}) + U \cdot x_t + b$, with $\mathcal{L} = \sum_t \mathcal{L}_t(c_t)$.

40 Another look at the gradients
$\mathcal{L} = L(c_T(c_{T-1}(\dots c_1(x_1, c_0; W); W); W); W)$
$\frac{\partial \mathcal{L}_t}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W}$
$\frac{\partial \mathcal{L}}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} = \frac{\partial \mathcal{L}}{\partial c_t} \cdot \frac{\partial c_t}{\partial c_{t-1}} \cdot \frac{\partial c_{t-1}}{\partial c_{t-2}} \cdots \frac{\partial c_{\tau+1}}{\partial c_\tau} \le \eta^{\,t-\tau} \frac{\partial \mathcal{L}_t}{\partial c_t}$
The RNN gradient is a recursive product of the factors $\frac{\partial c_t}{\partial c_{t-1}}$: terms with $t \gg \tau$ are long-term factors, the rest are short-term factors.

41 Exploding/Vanishing gradients
$\frac{\partial \mathcal{L}}{\partial c_t} = \frac{\partial \mathcal{L}}{\partial c_T} \cdot \frac{\partial c_T}{\partial c_{T-1}} \cdot \frac{\partial c_{T-1}}{\partial c_{T-2}} \cdots \frac{\partial c_{t+1}}{\partial c_t}$
If the factors are $< 1$, then $\frac{\partial \mathcal{L}}{\partial W} \ll 1$: vanishing gradient. If the factors are $> 1$, then $\frac{\partial \mathcal{L}}{\partial W} \gg 1$: exploding gradient.

42 Vanishing gradients
The gradient of the error w.r.t. an intermediate cell:
$\frac{\partial \mathcal{L}_t}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial y_t} \frac{\partial y_t}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W}$, with $\frac{\partial c_t}{\partial c_\tau} = \prod_{t \ge k \ge \tau} \frac{\partial c_k}{\partial c_{k-1}} = \prod_{t \ge k \ge \tau} W \cdot \partial \tanh(c_{k-1})$

43 Vanishing gradients
(Same expression as the previous slide.) Long-term dependencies get exponentially smaller weights.
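A quick numerical illustration of why this product shrinks or blows up (a sketch with random states and an assumed scaling of $W$, not from the slides): each factor is $\frac{\partial c_k}{\partial c_{k-1}} = W \cdot \mathrm{diag}(\tanh'(c_{k-1}))$, and multiplying many such Jacobians drives the norm towards zero or towards huge values depending on the scale of $W$.

```python
import numpy as np

rng = np.random.default_rng(1)

def jacobian_product_norm(scale, steps=30, dim=10):
    """Norm of prod_k W @ diag(tanh'(c_{k-1})) for random states c_{k-1}."""
    W = scale * rng.normal(size=(dim, dim)) / np.sqrt(dim)
    J = np.eye(dim)
    for _ in range(steps):
        c_prev = rng.normal(size=dim)
        J = J @ (W @ np.diag(1.0 - np.tanh(c_prev) ** 2))  # one factor of the chain
    return np.linalg.norm(J)

print(jacobian_product_norm(scale=0.5))  # typically shrinks towards 0: vanishing
print(jacobian_product_norm(scale=3.0))  # typically grows very large: exploding
```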

44 Gradient clipping for exploding gradients
Scale the gradients to a threshold. Pseudocode:
1. $g \leftarrow \frac{\partial \mathcal{L}}{\partial W}$
2. if $\|g\| > \theta_0$: $g \leftarrow \theta_0 \frac{g}{\|g\|}$, else: do nothing
Simple, but it works!
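A minimal NumPy version of this clipping rule (the threshold value is just an example):

```python
import numpy as np

def clip_gradient(g, threshold=5.0):
    """Rescale g to have norm at most `threshold`, as in the pseudocode above."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = threshold * g / norm
    return g

g = np.array([30.0, -40.0])   # exploding gradient with norm 50
print(clip_gradient(g))       # rescaled to norm 5: [ 3. -4.]
```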

45 Rescaling vanishing gradients?
Not a good solution. Weights are shared between timesteps and the loss is summed over timesteps:
$\mathcal{L} = \sum_t \mathcal{L}_t \Rightarrow \frac{\partial \mathcal{L}}{\partial W} = \sum_t \frac{\partial \mathcal{L}_t}{\partial W}$, with $\frac{\partial \mathcal{L}_t}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W}$
Rescaling the gradient for one timestep ($\frac{\partial \mathcal{L}_t}{\partial W}$) affects all timesteps, and the rescaling factor for one timestep does not work for another.

46 More intuitively
$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}_1}{\partial W} + \frac{\partial \mathcal{L}_2}{\partial W} + \frac{\partial \mathcal{L}_3}{\partial W} + \frac{\partial \mathcal{L}_4}{\partial W} + \frac{\partial \mathcal{L}_5}{\partial W}$

47 More intuitively
Let's say $\frac{\partial \mathcal{L}_1}{\partial W} \propto 1$, $\frac{\partial \mathcal{L}_2}{\partial W} \propto 1/10$, $\frac{\partial \mathcal{L}_3}{\partial W} \propto 1/100$, $\frac{\partial \mathcal{L}_4}{\partial W} \propto 1/1000$, $\frac{\partial \mathcal{L}_5}{\partial W} \propto 1/10000$, so $\frac{\partial \mathcal{L}}{\partial W} = \sum_r \frac{\partial \mathcal{L}_r}{\partial W} = 1.1111$. If $\frac{\partial \mathcal{L}}{\partial W}$ is rescaled to 1, then $\frac{\partial \mathcal{L}_5}{\partial W} \propto 10^{-5}$: longer-term dependencies become negligible, recurrent modelling is weak, and learning focuses on the short term only.

48 Recurrent networks ∝ Chaotic systems

49 Fixing vanishing gradients
Regularization on the recurrent weights: force the error signal not to vanish,
$\Omega = \sum_t \Omega_t = \sum_t \left( \frac{\left\| \frac{\partial \mathcal{L}}{\partial c_{t+1}} \frac{\partial c_{t+1}}{\partial c_t} \right\|}{\left\| \frac{\partial \mathcal{L}}{\partial c_{t+1}} \right\|} - 1 \right)^2$
Advanced recurrent modules: the Long Short-Term Memory (LSTM) module and the Gated Recurrent Unit (GRU) module.
Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013

50 Advanced Recurrent Nets
(LSTM cell diagram: input $x_t$, previous memory $m_{t-1}$ and cell state $c_{t-1}$, gates $i_t$, $f_t$, $o_t$ built from $\sigma$ and $\tanh$ nonlinearities, new cell state $c_t$ and output memory $m_t$.)

51 How to fix the vanishing gradients?
The error signal over time must have a norm that is neither too large nor too small. Solution: use an activation function with gradient equal to 1.
$\frac{\partial \mathcal{L}_t}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial y_t} \frac{\partial y_t}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W}$, with $\frac{\partial c_t}{\partial c_\tau} = \prod_{t \ge k \ge \tau} \frac{\partial c_k}{\partial c_{k-1}}$
The identity function has a gradient equal to 1. By using it, gradients become neither too small nor too large.

52 Long Short-Term Memory (LSTM): a beefed-up RNN
Simple RNN: $c_t = W \cdot \tanh(c_{t-1}) + U \cdot x_t + b$
LSTM:
$i_t = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$
$f_t = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$
$o_t = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$
$\tilde{c}_t = \tanh(x_t U^{(g)} + m_{t-1} W^{(g)})$
$c_t = c_{t-1} \odot f_t + \tilde{c}_t \odot i_t$
$m_t = \tanh(c_t) \odot o_t$
The previous state $c_{t-1}$ is connected to the new $c_t$ with no nonlinearity (identity function); the only other factor is the forget gate $f_t$, which rescales the previous LSTM state. More info at:
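A direct NumPy transcription of the six LSTM equations above (a sketch: the weight shapes and random initialization are assumptions, and the slide's row-vector convention $x_t U$ is kept):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev, params):
    """One LSTM step following the slide's equations (x_t, m_prev as row vectors)."""
    Ui, Wi, Uf, Wf, Uo, Wo, Ug, Wg = params
    i = sigmoid(x_t @ Ui + m_prev @ Wi)        # input gate
    f = sigmoid(x_t @ Uf + m_prev @ Wf)        # forget gate
    o = sigmoid(x_t @ Uo + m_prev @ Wo)        # output gate
    c_tilde = np.tanh(x_t @ Ug + m_prev @ Wg)  # candidate memory
    c_t = c_prev * f + c_tilde * i             # new cell state (no nonlinearity on c_{t-1})
    m_t = np.tanh(c_t) * o                     # new memory / output
    return m_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 5, 4                             # assumed toy sizes
params = [rng.normal(scale=0.1, size=s)
          for s in [(n_in, n_hid), (n_hid, n_hid)] * 4]
m, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(3):                             # run a few steps on random inputs
    m, c = lstm_step(rng.normal(size=n_in), m, c, params)
print(m, c)
```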

53 Cell state
The cell state carries the essential information over time, along the cell state line of the diagram: $c_t = c_{t-1} \odot f_t + \tilde{c}_t \odot i_t$ (the gate equations are as on the previous slide).

54 LSTM nonlinearities
$\sigma \in (0, 1)$: control gate, something like a switch. $\tanh \in (-1, 1)$: the recurrent nonlinearity. (Gate and state equations as on slide 52.)

55 LSTM Step-by-Step: Step (1)
E.g. an LSTM on “Yesterday she slapped me. Today she loves me.” Decide what to forget and what to remember for the new memory: sigmoid at 1 $\Rightarrow$ remember everything, sigmoid at 0 $\Rightarrow$ forget everything. $f_t = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$

56 LSTM Step-by-Step: Step (2)
Decide what new information is relevant from the new input and should be added to the new memory: modulate the input with $i_t = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$ and generate candidate memories $\tilde{c}_t = \tanh(x_t U^{(g)} + m_{t-1} W^{(g)})$.

57 LSTM Step-by-Step: Step (3)
Compute and update the current cell state $c_t = c_{t-1} \odot f_t + \tilde{c}_t \odot i_t$: it depends on the previous cell state, what we decide to forget, what inputs we allow, and the candidate memories.

58 LSTM Step-by-Step: Step (4)
Modulate the output: is the new cell state relevant? If yes, sigmoid at 1; if not, sigmoid at 0. Then generate the new memory: $o_t = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$, $m_t = \tanh(c_t) \odot o_t$.

59 LSTM Unrolled Network
Macroscopically very similar to standard RNNs; the engine is a bit different (more complicated). Because of their gates, LSTMs capture both long- and short-term dependencies.

60 Beyond RNN & LSTM
LSTM with peephole connections: the gates also have access to the previous cell state $c_{t-1}$ (not only to the memories). Coupled forget and input gates: $c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{c}_t$. Bi-directional recurrent networks. Gated Recurrent Units (GRU). Deep recurrent architectures (e.g. stacked LSTM(1), LSTM(2) layers). Recursive neural networks (tree-structured). Multiplicative interactions. Generative recurrent architectures.
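Since the slide lists the Gated Recurrent Unit (GRU) among the alternatives, here is a hedged NumPy sketch of the standard GRU update (the notation and shapes are mine, not from the slides): an update gate $z$ and a reset gate $r$ replace the LSTM's separate cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU step (standard formulation; a sketch, not the slide's notation)."""
    Uz, Wz, Ur, Wr, Uh, Wh = params
    z = sigmoid(x_t @ Uz + h_prev @ Wz)              # update gate
    r = sigmoid(x_t @ Ur + h_prev @ Wr)              # reset gate
    h_tilde = np.tanh(x_t @ Uh + (r * h_prev) @ Wh)  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde          # blend old and new state

rng = np.random.default_rng(0)
n_in, n_hid = 5, 4                                   # assumed toy sizes
params = [rng.normal(scale=0.1, size=s)
          for s in [(n_in, n_hid), (n_hid, n_hid)] * 3]
h = np.zeros(n_hid)
for _ in range(3):                                   # run a few steps on random inputs
    h = gru_step(rng.normal(size=n_in), h, params)
print(h)
```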

61 Take-away message
Recurrent Neural Networks (RNNs) for sequences; Backpropagation Through Time; vanishing and exploding gradients and remedies; RNNs using Long Short-Term Memory (LSTM); applications of recurrent neural networks.

62 Thank you!

