
1 Deep Learning RUSSIR 2017 – Day 3
Efstratios Gavves

2 Recap from Day 2
Deep Learning is good not only for classifying things. Structured prediction is also possible. Multi-task structured prediction allows for unified networks and for handling missing data. Discovering structure in data is also possible.

3 Sequential Data
So far, all tasks assumed stationary data. However, not all data, and not all tasks, are stationary.

4-6 Sequential Data: Text
What about inputs that appear in sequences, such as text? Could a neural network handle such modalities? (Slides 4-6 build up this question word by word.)

7 Memory needed
To handle such sequences the model needs memory: $\Pr(x) = \prod_i \Pr(x_i \mid x_1, \dots, x_{i-1})$.

8 Sequential data: Video

9 Quiz: Other sequential data?

10 Quiz: Other sequential data?
Time series data: stock exchange, biological measurements, climate measurements, market analysis. Speech/music. User behavior in websites. ...

11 Applications
(Links to a YouTube video and a website demo in the original slide.)

12 Machine Translation
The phrase in the source language is one sequence: “Today the weather is good”. The phrase in the target language is also a sequence: “Погода сегодня хорошая”. (Diagram: an encoder LSTM reads “Today the weather is good <EOS>”; a decoder LSTM then generates “Погода сегодня хорошая <EOS>”, the encoder-decoder architecture.)

13 Image captioning
An image is a thousand words, literally! Pretty much the same as machine translation: replace the encoder part with the output of a ConvNet (e.g. use an AlexNet or a VGG16 network) and keep the decoder part to operate as a translator. (Diagram: a ConvNet encoder followed by an LSTM decoder generating “Today the weather is good <EOS>”.)

14 Demo
(Link to a YouTube video in the original slide.)

15 Question answering
Bleeding-edge research, no real consensus; very interesting open research problems. Again, pretty much like machine translation: the encoder-decoder paradigm. Insert the question into the encoder part and model the answer at the decoder part. Question answering with images is also possible (again, bleeding-edge research): how/where to add the image? What has worked so far is to add the image only at the beginning.
Q: John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter? A: John entered the living room.
Q: What are the people playing? A: They play beach football.

16 Demo
(Link to a website demo in the original slide.)

17 Handwriting
(Link to a website demo in the original slide.)

18 Recurrent Networks
(Diagram: an input feeding an NN cell with a state and recurrent connections, producing the output.)

19 Sequences
Next data depend on previous data. Roughly equivalent to predicting what comes next: $\Pr(x) = \prod_i \Pr(x_i \mid x_1, \dots, x_{i-1})$.

20 Why sequences?
Parameters are reused $\Rightarrow$ fewer parameters $\Rightarrow$ easier modelling. Generalizes well to arbitrary lengths $\Rightarrow$ not constrained to a specific length. However, often we pick a “frame” $T$ instead of an arbitrary length: $\Pr(x) = \prod_i \Pr(x_i \mid x_{i-T}, \dots, x_{i-1})$.
Examples: $RecurrentModel($ “I think, therefore, I am!” $)$; $RecurrentModel($ “Everything is repeated, in a circle. History is a master because it teaches us that it doesn't exist. It's the permutations that matter.” $)$

21 Quiz: What is really a sequence?
Data inside a sequence are … ? (Word tiles on the slide: “McGuire”, “Bond”, “Bond”, “I”, “am”, “Bond”, “,”, “James”, “tired”, “am”, “!”)

22 Quiz: What is really a sequence?
Data inside a sequence are not i.i.d. (independently, identically distributed). The next “word” depends on the previous “words”, ideally on all of them. We need context, and we need memory! How do we model context and memory? (The word tiles order into “I am Bond, James Bond”.)

23 One-hot vectors
$x_i \equiv$ a one-hot vector: all zeros except for the active dimension. 12 words in a sequence $\Rightarrow$ 12 one-hot vectors. After the one-hot vectors, apply an embedding (Word2Vec, GloVe). Vocabulary: I, am, Bond, “,”, James, McGuire, tired, “!”; each word activates exactly one dimension.
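As an illustration, here is a minimal NumPy sketch (not from the slides; the toy vocabulary and word order are assumptions) that turns the example sentence into one-hot vectors:

```python
import numpy as np

# Hypothetical toy vocabulary, in the spirit of the slide's example.
vocab = ["I", "am", "Bond", ",", "James", "McGuire", "tired", "!"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return a vector of zeros with a single 1 at the word's index."""
    x = np.zeros(vocab_size)
    x[word_to_index[word]] = 1.0
    return x

sentence = ["I", "am", "Bond", ",", "James", "Bond"]
X = np.stack([one_hot(w) for w in sentence])  # shape: (6 words, 8 vocabulary entries)
print(X)
```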

24 Quiz: Why not just indices?
One-hot representation OR index representation? E.g. index representation: $x_{\text{“I”}} = 1$, $x_{\text{“am”}} = 2$, $x_{\text{“James”}} = 4$, $x_{\text{“McGuire”}} = 7$.

25 Quiz: Why not just indices?
No, because then some words get closer together for no good reason (an artificial bias between words). With one-hot vectors, $\mathrm{distance}(x_{\text{“am”}}, x_{\text{“McGuire”}}) = 1$ and $\mathrm{distance}(x_{\text{“I”}}, x_{\text{“am”}}) = 1$, i.e. all words are equally far apart. With indices $q_{\text{“I”}} = 1$, $q_{\text{“am”}} = 2$, $q_{\text{“James”}} = 7$ we get $\mathrm{distance}(q_{\text{“am”}}, q_{\text{“McGuire”}}) = 5$ but $\mathrm{distance}(q_{\text{“I”}}, q_{\text{“am”}}) = 1$.

26 Memory
A representation of the past. A memory must project information at timestep $t$ onto a latent space $c_t$ using parameters $\theta$, then re-use the projected information from $t$ at $t+1$: $c_{t+1} = h(x_{t+1}, c_t; \theta)$. The memory parameters $\theta$ are shared for all timesteps $t = 0, \dots$:
$c_{t+1} = h(x_{t+1}, h(x_t, h(x_{t-1}, \dots h(x_1, c_0; \theta); \theta); \theta); \theta)$
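To make the shared-parameter recurrence concrete, a minimal sketch (a toy scalar state with one arbitrary choice of $h$; the names and values are illustrative, not from the slides):

```python
import numpy as np

def h(x_t, c_prev, theta):
    """One illustrative choice of memory update; the slide only requires some h(x, c; theta)."""
    U, W = theta
    return np.tanh(U * x_t + W * c_prev)

theta = (0.5, 0.9)           # shared parameters, reused at every timestep
c = 0.0                      # initial memory c_0
for x_t in [1.0, -0.3, 0.7]:
    c = h(x_t, c, theta)     # c_{t+1} = h(x_{t+1}, c_t; theta)
print(c)
```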

27 Memory as a Graph
Simplest model: input $x_t$ with parameters $U$, memory embedding $c_t$ with parameters $W$, output $y_t$ with parameters $V$.

28 Memory as a Graph
The same model, unrolled one step: memory $c_t \to c_{t+1}$, inputs $x_t, x_{t+1}$, outputs $y_t, y_{t+1}$, with $U$, $V$, $W$ shared across steps.

29 Memory as a Graph
Unrolled over several timesteps: memories $c_t, c_{t+1}, c_{t+2}, c_{t+3}$, inputs $x_t, \dots, x_{t+3}$, outputs $y_t, \dots, y_{t+3}$; the same $U$, $V$, $W$ are reused at every step.

30 Folding the memory
Folded network vs. unrolled/unfolded network: the unrolled graph repeats $U$, $V$, $W$ at timesteps $t, t+1, t+2$; the folded graph is a single cell that maps $x_t$ to $y_t$ through memory $c_t$, with the recurrent connection feeding $c_{t-1}$ back through $W$.

31 Recurrent Neural Network (RNN)
Only two equations:
$c_t = \tanh(U x_t + W c_{t-1})$
$y_t = \mathrm{softmax}(V c_t)$

32 RNN Example
$c_t = \tanh(U x_t + W c_{t-1})$, $y_t = \mathrm{softmax}(V c_t)$. A vocabulary of 5 words; a memory of 3 units (a hyperparameter we choose, like layer size): $c_t: [3 \times 1]$, $W: [3 \times 3]$; an input projection of 3 dimensions: $U: [3 \times 5]$; an output projection of 10 dimensions: $V: [10 \times 3]$. Since $x_t$ is one-hot, $U \cdot x_{t=4}$ simply selects the 4th column of $U$.
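A minimal NumPy sketch of these two equations, using the dimensions stated on the slide (5-word vocabulary, 3 memory units, 10-dimensional output); the random weights are placeholders, not trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(3, 5))    # input projection,  3 x 5
W = rng.normal(size=(3, 3))    # memory weights,    3 x 3
V = rng.normal(size=(10, 3))   # output projection, 10 x 3

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, c_prev):
    c_t = np.tanh(U @ x_t + W @ c_prev)   # c_t = tanh(U x_t + W c_{t-1})
    y_t = softmax(V @ c_t)                # y_t = softmax(V c_t)
    return c_t, y_t

x4 = np.zeros(5); x4[3] = 1.0             # one-hot input for the 4th vocabulary word
c, y = rnn_step(x4, np.zeros(3))          # U @ x4 just selects the 4th column of U
print(c.shape, y.shape)                   # (3,) (10,)
```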

33 Rolled Network vs. MLP? What is really different?
Steps instead of layers. Step parameters are shared in a recurrent network; in a multi-layer network the parameters $W_1, W_2, W_3$ are different. (Diagram: a 3-gram unrolled recurrent network with shared $U$, $V$, $W$ per step, next to a 3-layer neural network.)

34 Rolled Network vs. MLP? What is really different?
(Same comparison as the previous slide.)

35 Rolled Network vs. MLP? What is really different?
Steps instead of layers; step parameters shared in the recurrent network, different in the multi-layer network. Sometimes the intermediate outputs are not even needed. Removing them, we almost end up with a standard neural network.

36 Rolled Network vs. MLP? What is really different?
(Diagram: the unrolled recurrent network with only the final output $y$, next to the 3-layer network.)

37 Training Recurrent Networks
Cross-entropy loss: $P = \prod_{t,k} y_{tk}^{l_{tk}} \Rightarrow \mathcal{L} = -\log P = \sum_t \mathcal{L}_t = -\frac{1}{T} \sum_t l_t \log y_t$
Backpropagation Through Time (BPTT): again the chain rule. The only difference is that gradients survive over time steps.
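A small sketch of the summed cross-entropy loss over timesteps, assuming one-hot targets $l_t$ and predicted distributions $y_t$ (for the gradients one would unroll the same graph backwards, i.e. BPTT, which in practice is left to automatic differentiation):

```python
import numpy as np

def sequence_cross_entropy(Y, L):
    """Y: (T, K) predicted distributions y_t; L: (T, K) one-hot targets l_t.
    Implements L = -(1/T) * sum_t l_t . log y_t from the slide."""
    T = Y.shape[0]
    return -np.sum(L * np.log(Y + 1e-12)) / T

# toy example: 2 timesteps, 3 classes
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
L = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(sequence_cross_entropy(Y, L))
```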

38 Training RNNs is hard
Vanishing gradients: after a few time steps the gradients become almost 0. Exploding gradients: after a few time steps the gradients become huge. Either way, the network can't capture long-term dependencies.

39 Alternative formulation for RNNs
An alternative formulation to derive conclusions and intuitions: $c_t = W \cdot \tanh(c_{t-1}) + U \cdot x_t + b$, with $\mathcal{L} = \sum_t \mathcal{L}_t(c_t)$.

40 Another look at the gradients
$\mathcal{L} = L(c_T(c_{T-1}(\dots c_1(x_1, c_0; W); W); W); W)$
$\frac{\partial \mathcal{L}_t}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W}$
$\frac{\partial \mathcal{L}}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} = \frac{\partial \mathcal{L}}{\partial c_t} \cdot \frac{\partial c_t}{\partial c_{t-1}} \cdot \frac{\partial c_{t-1}}{\partial c_{t-2}} \cdots \frac{\partial c_{\tau+1}}{\partial c_\tau} \le \eta^{\,t-\tau} \frac{\partial \mathcal{L}_t}{\partial c_t}$
The RNN gradient is a recursive product of the factors $\frac{\partial c_t}{\partial c_{t-1}}$: terms with $t \gg \tau$ are long-term factors, the rest are short-term factors.

41 Exploding/Vanishing gradients
$\frac{\partial \mathcal{L}}{\partial c_t} = \frac{\partial \mathcal{L}}{\partial c_T} \cdot \frac{\partial c_T}{\partial c_{T-1}} \cdot \frac{\partial c_{T-1}}{\partial c_{T-2}} \cdots \frac{\partial c_{t+1}}{\partial c_t}$
If the factors are $< 1$, then $\frac{\partial \mathcal{L}}{\partial W} \ll 1$: vanishing gradient. If the factors are $> 1$, then $\frac{\partial \mathcal{L}}{\partial W} \gg 1$: exploding gradient.

42 Vanishing gradients
The gradient of the error w.r.t. an intermediate cell:
$\frac{\partial \mathcal{L}_t}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial y_t} \frac{\partial y_t}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W}$, with $\frac{\partial c_t}{\partial c_\tau} = \prod_{t \ge k \ge \tau} \frac{\partial c_k}{\partial c_{k-1}} = \prod_{t \ge k \ge \tau} W \cdot \partial \tanh(c_{k-1})$

43 Vanishing gradients
(Same expression as the previous slide.) Long-term dependencies get exponentially smaller weights.
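A quick numerical illustration of why this product shrinks or blows up (a sketch with random states and an assumed scaling of $W$, not from the slides): each factor is $\frac{\partial c_k}{\partial c_{k-1}} = W \cdot \mathrm{diag}(\tanh'(c_{k-1}))$, and multiplying many such Jacobians drives the norm towards zero or towards huge values depending on the scale of $W$.

```python
import numpy as np

rng = np.random.default_rng(1)

def jacobian_product_norm(scale, steps=30, dim=10):
    """Norm of prod_k W @ diag(tanh'(c_{k-1})) for random states c_{k-1}."""
    W = scale * rng.normal(size=(dim, dim)) / np.sqrt(dim)
    J = np.eye(dim)
    for _ in range(steps):
        c_prev = rng.normal(size=dim)
        J = J @ (W @ np.diag(1.0 - np.tanh(c_prev) ** 2))  # one factor of the chain
    return np.linalg.norm(J)

print(jacobian_product_norm(scale=0.5))  # typically shrinks towards 0: vanishing
print(jacobian_product_norm(scale=3.0))  # typically grows very large: exploding
```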

44 Gradient clipping for exploding gradients
Scale the gradients to a threshold. Pseudocode:
1. $g \leftarrow \frac{\partial \mathcal{L}}{\partial W}$
2. if $\|g\| > \theta_0$: $g \leftarrow \theta_0 \frac{g}{\|g\|}$, else: do nothing
Simple, but it works!
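A minimal NumPy version of this clipping rule (the threshold value is just an example):

```python
import numpy as np

def clip_gradient(g, threshold=5.0):
    """Rescale g to have norm at most `threshold`, as in the pseudocode above."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = threshold * g / norm
    return g

g = np.array([30.0, -40.0])   # exploding gradient with norm 50
print(clip_gradient(g))       # rescaled to norm 5: [ 3. -4.]
```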

45 Rescaling vanishing gradients?
Not a good solution. Weights are shared between timesteps and the loss is summed over timesteps:
$\mathcal{L} = \sum_t \mathcal{L}_t \Rightarrow \frac{\partial \mathcal{L}}{\partial W} = \sum_t \frac{\partial \mathcal{L}_t}{\partial W}$, with $\frac{\partial \mathcal{L}_t}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W}$
Rescaling the gradient for one timestep ($\frac{\partial \mathcal{L}_t}{\partial W}$) affects all timesteps, and the rescaling factor for one timestep does not work for another.

46 More intuitively
$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}_1}{\partial W} + \frac{\partial \mathcal{L}_2}{\partial W} + \frac{\partial \mathcal{L}_3}{\partial W} + \frac{\partial \mathcal{L}_4}{\partial W} + \frac{\partial \mathcal{L}_5}{\partial W}$

47 More intuitively
Let's say $\frac{\partial \mathcal{L}_1}{\partial W} \propto 1$, $\frac{\partial \mathcal{L}_2}{\partial W} \propto 1/10$, $\frac{\partial \mathcal{L}_3}{\partial W} \propto 1/100$, $\frac{\partial \mathcal{L}_4}{\partial W} \propto 1/1000$, $\frac{\partial \mathcal{L}_5}{\partial W} \propto 1/10000$, so $\frac{\partial \mathcal{L}}{\partial W} = \sum_r \frac{\partial \mathcal{L}_r}{\partial W} = 1.1111$. If $\frac{\partial \mathcal{L}}{\partial W}$ is rescaled to 1, then $\frac{\partial \mathcal{L}_5}{\partial W} \propto 10^{-5}$: longer-term dependencies become negligible, recurrent modelling is weak, and learning focuses on the short term only.

48 Recurrent networks ∝ Chaotic systems

49 Fixing vanishing gradients
Regularization on the recurrent weights: force the error signal not to vanish,
$\Omega = \sum_t \Omega_t = \sum_t \left( \frac{\left\| \frac{\partial \mathcal{L}}{\partial c_{t+1}} \frac{\partial c_{t+1}}{\partial c_t} \right\|}{\left\| \frac{\partial \mathcal{L}}{\partial c_{t+1}} \right\|} - 1 \right)^2$
Advanced recurrent modules: the Long Short-Term Memory (LSTM) module and the Gated Recurrent Unit (GRU) module.
Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013

50 Advanced Recurrent Nets
(LSTM cell diagram: input $x_t$, previous memory $m_{t-1}$ and cell state $c_{t-1}$, gates $i_t$, $f_t$, $o_t$ built from $\sigma$ and $\tanh$ nonlinearities, new cell state $c_t$ and output memory $m_t$.)

51 How to fix the vanishing gradients?
The error signal over time must have a norm that is neither too large nor too small. Solution: use an activation function with gradient equal to 1.
$\frac{\partial \mathcal{L}_t}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial y_t} \frac{\partial y_t}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W}$, with $\frac{\partial c_t}{\partial c_\tau} = \prod_{t \ge k \ge \tau} \frac{\partial c_k}{\partial c_{k-1}}$
The identity function has a gradient equal to 1. By using it, gradients become neither too small nor too large.

52 Long Short-Term Memory (LSTM): a beefed-up RNN
Simple RNN: $c_t = W \cdot \tanh(c_{t-1}) + U \cdot x_t + b$
LSTM:
$i_t = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$
$f_t = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$
$o_t = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$
$\tilde{c}_t = \tanh(x_t U^{(g)} + m_{t-1} W^{(g)})$
$c_t = c_{t-1} \odot f_t + \tilde{c}_t \odot i_t$
$m_t = \tanh(c_t) \odot o_t$
The previous state $c_{t-1}$ is connected to the new $c_t$ with no nonlinearity (identity function); the only other factor is the forget gate $f_t$, which rescales the previous LSTM state. More info at:
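A direct NumPy transcription of the six LSTM equations above (a sketch: the weight shapes and random initialization are assumptions, and the slide's row-vector convention $x_t U$ is kept):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev, params):
    """One LSTM step following the slide's equations (x_t, m_prev as row vectors)."""
    Ui, Wi, Uf, Wf, Uo, Wo, Ug, Wg = params
    i = sigmoid(x_t @ Ui + m_prev @ Wi)        # input gate
    f = sigmoid(x_t @ Uf + m_prev @ Wf)        # forget gate
    o = sigmoid(x_t @ Uo + m_prev @ Wo)        # output gate
    c_tilde = np.tanh(x_t @ Ug + m_prev @ Wg)  # candidate memory
    c_t = c_prev * f + c_tilde * i             # new cell state (no nonlinearity on c_{t-1})
    m_t = np.tanh(c_t) * o                     # new memory / output
    return m_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 5, 4                             # assumed toy sizes
params = [rng.normal(scale=0.1, size=s)
          for s in [(n_in, n_hid), (n_hid, n_hid)] * 4]
m, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(3):                             # run a few steps on random inputs
    m, c = lstm_step(rng.normal(size=n_in), m, c, params)
print(m, c)
```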

53 Cell state
The cell state carries the essential information over time, along the cell state line of the diagram: $c_t = c_{t-1} \odot f_t + \tilde{c}_t \odot i_t$ (the gate equations are as on the previous slide).

54 LSTM nonlinearities
$\sigma \in (0, 1)$: control gate, something like a switch. $\tanh \in (-1, 1)$: the recurrent nonlinearity. (Gate and state equations as on slide 52.)

55 LSTM Step-by-Step: Step (1)
E.g. an LSTM on “Yesterday she slapped me. Today she loves me.” Decide what to forget and what to remember for the new memory: sigmoid at 1 $\Rightarrow$ remember everything, sigmoid at 0 $\Rightarrow$ forget everything. $f_t = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$

56 LSTM Step-by-Step: Step (2)
Decide what new information is relevant from the new input and should be added to the new memory: modulate the input with $i_t = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$ and generate candidate memories $\tilde{c}_t = \tanh(x_t U^{(g)} + m_{t-1} W^{(g)})$.

57 LSTM Step-by-Step: Step (3)
Compute and update the current cell state $c_t = c_{t-1} \odot f_t + \tilde{c}_t \odot i_t$: it depends on the previous cell state, what we decide to forget, what inputs we allow, and the candidate memories.

58 LSTM Step-by-Step: Step (4)
Modulate the output: is the new cell state relevant? If yes, sigmoid at 1; if not, sigmoid at 0. Then generate the new memory: $o_t = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$, $m_t = \tanh(c_t) \odot o_t$.

59 LSTM Unrolled Network
Macroscopically very similar to standard RNNs; the engine is a bit different (more complicated). Because of their gates, LSTMs capture both long- and short-term dependencies.

60 Beyond RNN & LSTM
LSTM with peephole connections: the gates also have access to the previous cell state $c_{t-1}$ (not only to the memories). Coupled forget and input gates: $c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{c}_t$. Bi-directional recurrent networks. Gated Recurrent Units (GRU). Deep recurrent architectures (e.g. stacked LSTM(1), LSTM(2) layers). Recursive neural networks (tree-structured). Multiplicative interactions. Generative recurrent architectures.
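Since the slide lists the Gated Recurrent Unit (GRU) among the alternatives, here is a hedged NumPy sketch of the standard GRU update (the notation and shapes are mine, not from the slides): an update gate $z$ and a reset gate $r$ replace the LSTM's separate cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU step (standard formulation; a sketch, not the slide's notation)."""
    Uz, Wz, Ur, Wr, Uh, Wh = params
    z = sigmoid(x_t @ Uz + h_prev @ Wz)              # update gate
    r = sigmoid(x_t @ Ur + h_prev @ Wr)              # reset gate
    h_tilde = np.tanh(x_t @ Uh + (r * h_prev) @ Wh)  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde          # blend old and new state

rng = np.random.default_rng(0)
n_in, n_hid = 5, 4                                   # assumed toy sizes
params = [rng.normal(scale=0.1, size=s)
          for s in [(n_in, n_hid), (n_hid, n_hid)] * 3]
h = np.zeros(n_hid)
for _ in range(3):                                   # run a few steps on random inputs
    h = gru_step(rng.normal(size=n_in), h, params)
print(h)
```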

61 Take-away message
Recurrent Neural Networks (RNNs) for sequences; Backpropagation Through Time; vanishing and exploding gradients and remedies; RNNs using Long Short-Term Memory (LSTM); applications of recurrent neural networks.

62 Thank you!

