1 SD Study RNN & LSTM 2016/11/10 Seitaro Shinagawa

2 This is a description for people who already understand simple neural network architectures such as the feed-forward network.

3 I will introduce LSTM, how to use it, and some tips for Chainer.

4 1. RNN to LSTM
A simple RNN consists of an input layer, a middle (hidden) layer, and an output layer.

5 FAQ from LSTM beginners
A-san: "I hear LSTM is a kind of RNN, but the LSTM diagrams look like a different architecture... Are these the same or different?" (A-san usually sees the RNN drawn as a single recurrent cell with input x_t, hidden state h_t, and output y_t, while the LSTM is drawn as a chain of LSTM blocks mapping x_1, x_2, x_3 to y_1, y_2, y_3.)
Neural bear: "They have the same architecture! Please follow me!"

6 Introducing the LSTM figure from the RNN figure
(Figure: a single RNN cell with input x_t, hidden state h_t, and output y_t.)

7 Introducing the LSTM figure from the RNN figure
Unroll it along the time axis. (Figure: the same cell with x_t, h_t, y_t, about to be unrolled over time.)

8 A-san: "Oh, I often see this form for RNNs!"
Unrolled along the time axis, the inputs x_1, x_2, x_3, hidden states h_0, h_1, h_2, h_3, and outputs y_1, y_2, y_3 are related by
$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1}), \qquad y_t = \mathrm{sigmoid}(W_{hy} h_t).$
So this figure focuses on the variables and shows the relationships between them.
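As a concrete reference, here is a minimal NumPy sketch of that recurrence; the sizes, random weights, and helper names are illustrative assumptions, not the presenter's code.

```python
import numpy as np

in_size, hidden, out_size = 3, 5, 2
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((hidden, in_size)) * 0.1
W_hh = rng.standard_normal((hidden, hidden)) * 0.1
W_hy = rng.standard_normal((out_size, hidden)) * 0.1

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_step(x_t, h_prev):
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)   # h_t = tanh(W_xh x_t + W_hh h_{t-1})
    y_t = sigmoid(W_hy @ h_t)                   # y_t = sigmoid(W_hy h_t)
    return y_t, h_t

h = np.zeros(hidden)                            # h_0
for x in [rng.standard_normal(in_size) for _ in range(3)]:
    y, h = rnn_step(x, h)
```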

9 Let's look at the actual computation in more detail
I will write out the architecture in detail. View the RNN as one large function that takes $(x_t, h_{t-1})$ as input and returns $(y_t, h_t)$:
$u_t = W_{xh} x_t + W_{hh} h_{t-1}, \quad h_t = \tanh(u_t), \quad v_t = W_{hy} h_t, \quad y_t = \mathrm{sigmoid}(v_t).$

11 Let's look at the actual computation in more detail
Wrapping that whole function, with $u_t = W_{xh} x_t + W_{hh} h_{t-1}$, $h_t = \tanh(u_t)$, $v_t = W_{hy} h_t$, $y_t = \mathrm{sigmoid}(v_t)$, into a single block gives one box per time step. A-san: "Oh, this looks the same as the LSTM figure!"

12 Summary of this section
The LSTM figure is nothing special. Moreover, the initial hidden state $h_0$ is often omitted, so the picture reduces to a chain of RNN blocks mapping x_1, x_2, x_3 to y_1, y_2, y_3. If you read that picture as an LSTM, each block in fact also has to pass its cell state on to the LSTM module at the next time step, but that arrow is usually omitted too.

13 By the way, if you want to see the contents of LSTM... "Too complex!"
(Let $\sigma(\cdot)$ denote $\mathrm{sigmoid}(\cdot)$.)
$g_{i,t} = \sigma(W_{xi} x_t + W_{hi} h_{t-1})$ (input gate)
$g_{f,t} = \sigma(W_{xf} x_t + W_{hf} h_{t-1})$ (forget gate)
$g_{o,t} = \sigma(W_{xo} x_t + W_{ho} h_{t-1})$ (output gate)
$z_t = \tanh(W_{xz} x_t + W_{hz} h_{t-1})$
$c_t = g_{f,t} \odot c_{t-1} + g_{i,t} \odot z_t$
$h_t = g_{o,t} \odot \tanh(c_t)$
$y_t = \sigma(W_{hy} h_t)$
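For concreteness, a minimal NumPy sketch of one LSTM step following these equations; the weight shapes, random values, and helper names are illustrative assumptions.

```python
import numpy as np

in_size, hidden = 3, 4
rng = np.random.default_rng(0)
# One (input-to-hidden, hidden-to-hidden) weight pair per gate and for the candidate z.
W = {k: (rng.standard_normal((hidden, in_size)) * 0.1,
         rng.standard_normal((hidden, hidden)) * 0.1)
     for k in ("i", "f", "o", "z")}

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, c_prev, h_prev):
    g_i = sigmoid(W["i"][0] @ x_t + W["i"][1] @ h_prev)   # input gate
    g_f = sigmoid(W["f"][0] @ x_t + W["f"][1] @ h_prev)   # forget gate
    g_o = sigmoid(W["o"][0] @ x_t + W["o"][1] @ h_prev)   # output gate
    z   = np.tanh(W["z"][0] @ x_t + W["z"][1] @ h_prev)   # candidate input
    c_t = g_f * c_prev + g_i * z                          # update the cell (CEC)
    h_t = g_o * np.tanh(c_t)                              # gated output
    return c_t, h_t

c, h = np.zeros(hidden), np.zeros(hidden)
for x in [rng.standard_normal(in_size) for _ in range(3)]:
    c, h = lstm_step(x, c, h)
```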

14 LSTM FAQ
Q. What is the difference between RNN and LSTM?
A. The Constant Error Carousel (CEC, often called the cell) and three gates: input, forget, and output.
- Input gate: chooses whether to accept the current input into the cell.
- Forget gate: chooses whether to throw away the cell's stored information.
- Output gate: chooses how much of the cell's information to pass on to the next time step.
Q. Why does LSTM avoid the vanishing gradient problem?
A. 1. Plain backpropagation suffers because the sigmoid derivative is multiplied in repeatedly. 2. A plain RNN's output is affected by the constantly changing hidden state. 3. The LSTM cell stores previous inputs as a sum of weighted inputs, so it is robust to the current hidden state (of course, there is still a limit to how long a sequence it can remember).

15 LSTM
Figure quoted from the article "わかるLSTM ～最近の動向と共に" ("Understanding LSTM, together with recent trends").

16 LSTM with peephole connections
This is known as the standard LSTM, but the variant with the peepholes omitted is also often used. Figure quoted from "わかるLSTM ～最近の動向と共に".

17 Chainer usage
Without peephole (the standard version in Chainer): chainer.links.LSTM. With peephole: chainer.links.StatefulPeepholeLSTM. "Stateful" means that the link wraps the hidden state inside its own internal state (※): a Stateful link is called as stateful_lstm(x1); stateful_lstm(x2), whereas a Stateless link is called as h = init_state(); h = stateless_lstm(h, x1); h = stateless_lstm(h, x2). A usage sketch follows below.
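A minimal sketch of the two calling styles, assuming Chainer is installed; the layer sizes, random inputs, and the use of chainer.links.StatelessLSTM as the stateless counterpart are illustrative assumptions.

```python
import numpy as np
import chainer.links as L

x1 = np.random.randn(8, 100).astype(np.float32)   # mini-batch of 8, 100-dim inputs
x2 = np.random.randn(8, 100).astype(np.float32)

# Stateful: the link keeps c and h internally between calls.
stateful_lstm = L.LSTM(100, 50)
stateful_lstm.reset_state()
h1 = stateful_lstm(x1)
h2 = stateful_lstm(x2)          # continues from the state left after x1

# Stateless: you carry the state (c, h) around yourself.
stateless_lstm = L.StatelessLSTM(100, 50)
c, h = None, None               # initial state
c, h = stateless_lstm(c, h, x1)
c, h = stateless_lstm(c, h, x2)
```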

18 2. LSTM Learning Methods: Full BPTT and Truncated BPTT
(BPTT: Backpropagation Through Time.) Reference: Graham Neubig, NLP tutorial 8, recurrent neural networks.

19 Truncated BPTT in Chainer

20 Truncated BPTT in Chainer
(Figure: an LSTM unrolled over a long sequence x_1, x_2, ... producing y_1, y_2, ...) Backpropagation is run only back until i = 30, then the weights are updated, and training continues from there.
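A minimal sketch of the usual truncated-BPTT loop in Chainer; the tiny model, the random data, and the truncation length of 30 are illustrative assumptions, not the presenter's code.

```python
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import optimizers

# Tiny stand-in model: a stateful LSTM followed by a linear readout.
lstm = L.LSTM(4, 8)
out = L.Linear(8, 4)
model = chainer.ChainList(lstm, out)    # group the links so one optimizer sees both

optimizer = optimizers.SGD()
optimizer.setup(model)

bprop_len = 30                          # BP until i = 30, then update the weights
xs = np.random.randn(120, 1, 4).astype(np.float32)
ts = np.random.randn(120, 1, 4).astype(np.float32)

lstm.reset_state()
loss = 0
for i in range(len(xs)):
    y = out(lstm(xs[i]))
    loss += F.mean_squared_error(y, ts[i])   # keep extending the graph over time
    if (i + 1) % bprop_len == 0:
        model.cleargrads()
        loss.backward()                  # backprop through the last bprop_len steps
        loss.unchain_backward()          # cut the graph so it does not keep growing
        optimizer.update()
        loss = 0
```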

21 Mini-batch calculation with GPU
What should I do if I want to use the GPU but the sequences in the batch have different lengths? Filling up the end of each sequence is the standard approach: for example, pad the tail of the shorter sequences with 0. I call this zero padding (see the sketch below).
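A minimal NumPy sketch of zero-padding a mini-batch of variable-length sequences; the data and the choice of 0 as the padding id are illustrative.

```python
import numpy as np

seqs = [np.array([3, 1, 4]), np.array([1, 5]), np.array([9, 2, 6, 5])]
max_len = max(len(s) for s in seqs)
batch = np.zeros((len(seqs), max_len), dtype=np.int32)   # 0 = padding id
for i, s in enumerate(seqs):
    batch[i, :len(s)] = s
# batch:
# [[3 1 4 0]
#  [1 5 0 0]
#  [9 2 6 5]]
```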

22 Mini-batch calculation with GPU (continued)
With plain zero padding the learned model becomes redundant: it has to learn a "keep outputting 0" rule for the padded positions. Adding a handcrafted rule can solve this. There are two ways to do it in Chainer: chainer.functions.where and NStepLSTM (v1.16.0 or later).

23 chainer.functions.where
Prepare a condition matrix S, one row per example in the mini-batch: True where the sequence is still active at time t, False where it has already ended. Run the LSTM step to obtain temporary states (c_tmp, h_tmp), then keep the previous state wherever the condition is False:
c_t = F.where(S, c_tmp, c_{t-1})
h_t = F.where(S, h_tmp, h_{t-1})
For finished sequences the cell and hidden state are simply carried over unchanged.
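A minimal sketch of this masking with chainer.functions.where, assuming a stateless LSTM step; the batch size, layer sizes, and the example condition are illustrative assumptions.

```python
import numpy as np
import chainer.functions as F
import chainer.links as L

batch, in_size, hidden = 3, 4, 5
lstm = L.StatelessLSTM(in_size, hidden)

x_t = np.random.randn(batch, in_size).astype(np.float32)
c_prev = np.zeros((batch, hidden), dtype=np.float32)
h_prev = np.zeros((batch, hidden), dtype=np.float32)

# True where the sequence is still running at time t, False where it is padding.
active = np.array([True, True, False])
S = np.tile(active[:, None], (1, hidden))        # condition matrix, one row per example

c_tmp, h_tmp = lstm(c_prev, h_prev, x_t)
c_t = F.where(S, c_tmp, c_prev)   # finished sequences keep their previous cell state
h_t = F.where(S, h_tmp, h_prev)   # and their previous hidden state
```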

24 NStepLSTM (v1.16.0 or later)
NStepLSTM handles variable-length sequences automatically, so I don't even need F.where. Hahaha... There was a bug involving cuDNN and dropout (※); the fixed version was merged into the master repository on 10/25, so use the latest version (wait for the next release or git clone from GitHub). There is no documentation yet, so read the raw script.
(※) "Predicting Nico Nico Douga comments with Chainer's NStepLSTM" (ChainerのNStepLSTMでニコニコ動画のコメント予測)
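A minimal sketch of feeding variable-length sequences to chainer.links.NStepLSTM; the layer sizes, sequence lengths, and random data are illustrative assumptions.

```python
import numpy as np
import chainer.links as L

n_layers, in_size, out_size, dropout = 1, 4, 5, 0.0
nstep_lstm = L.NStepLSTM(n_layers, in_size, out_size, dropout)

# A mini-batch of three sequences with different lengths: no padding is needed.
xs = [np.random.randn(t, in_size).astype(np.float32) for t in (6, 3, 4)]

hy, cy, ys = nstep_lstm(None, None, xs)   # None means zero initial states
# ys is a list of per-sequence outputs; ys[i] has shape (len(xs[i]), out_size)
```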

25 Gradient clipping can suppress gradient explosion
LSTM mitigates the vanishing gradient problem, but recurrent networks still suffer from exploding gradients (※). Gradient clipping, proposed in ※, works as follows: if the norm of the overall gradient exceeds a threshold, rescale it so that the norm equals the threshold. In Chainer you can use optimizer.add_hook(chainer.optimizer.GradientClipping(threshold)).
※ "On the difficulty of training recurrent neural networks"
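A minimal sketch of attaching the clipping hook; the tiny Linear model, the Adam optimizer, and the threshold 5.0 are illustrative placeholders.

```python
import chainer
import chainer.links as L
from chainer import optimizers

model = L.Linear(10, 10)                # stand-in for your recurrent model
optimizer = optimizers.Adam()
optimizer.setup(model)
optimizer.add_hook(chainer.optimizer.GradientClipping(5.0))
# From now on, each optimizer.update() rescales gradients whose overall
# L2 norm exceeds 5.0 down to norm 5.0.
```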

26 Applying dropout to LSTM
Dropout is a strong regularization method, but applying it just anywhere does not always succeed. According to ※, comparing (1) dropout on the recurrent hidden state of the LSTM, (2) dropout on the cell, and (3) dropout on the input gate, option 3 achieved the best performance. The basic rule: dropout should not be applied to the recurrent part; it should be applied to the forward (input-to-hidden) part.
※ "Recurrent Dropout without Memory Loss"
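A minimal sketch of that rule of thumb, putting dropout on the feed-forward input to the LSTM and leaving the recurrent path untouched; the layer sizes, dropout ratio, and random input are illustrative assumptions.

```python
import numpy as np
import chainer.functions as F
import chainer.links as L

in_size, hidden = 100, 50
fc = L.Linear(in_size, hidden)      # forward (input-to-hidden) part
lstm = L.LSTM(hidden, hidden)       # recurrent part
lstm.reset_state()

x_t = np.random.randn(8, in_size).astype(np.float32)
h_t = lstm(F.dropout(fc(x_t), ratio=0.5))   # dropout only on the feed-forward input
# No dropout is inserted on the recurrent h_{t-1} -> h_t connection inside the LSTM.
```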

27 Batch Normalization on the activation
What is Batch Normalization? It rescales the distribution of the activation x (the sum of weighted inputs) toward N(0, 1). In theory, BN should be computed over all the data; in practice it is applied per mini-batch.

28 Batch Normalization on LSTM
Applying BN to an RNN does not improve performance (※): on the hidden-to-hidden connections, the repeated rescaling leads to exploding gradients, and on the input-to-hidden connections it makes learning faster but does not improve final performance. Three newer proposals (in order of publication): Weight Normalization, Recurrent Batch Normalization, and Layer Normalization.
※ "Batch Normalized Recurrent Neural Networks"

29 Batch Normalization and Layer Normalization
The difference between Batch Normalization and Layer Normalization: assume activations $a_i^{(n)} = \sum_j w_{ij} x_j^{(n)}$ with $h_i^{(n)} = f(a_i^{(n)})$, arranged in a matrix whose rows are the examples n of the mini-batch and whose columns are the hidden units i. Batch Normalization normalizes vertically (each unit across the mini-batch), while Layer Normalization normalizes horizontally (each example across its units). The variance $\sigma$ becomes large when gradients explode, so normalizing makes the output more robust (details are in the paper).
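A minimal NumPy sketch of the two normalization axes; the matrix size, epsilon, and the omission of the learned scale and shift parameters are simplifications.

```python
import numpy as np

a = np.random.randn(32, 100)   # activations: (mini-batch N, hidden units H)
eps = 1e-5

# Batch Normalization: normalize each hidden unit over the mini-batch (vertical).
bn = (a - a.mean(axis=0, keepdims=True)) / np.sqrt(a.var(axis=0, keepdims=True) + eps)

# Layer Normalization: normalize each example over its hidden units (horizontal).
ln = (a - a.mean(axis=1, keepdims=True)) / np.sqrt(a.var(axis=1, keepdims=True) + eps)
```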

30 Initialization tips
"Exact solutions to the nonlinear dynamics of learning in deep linear neural networks" (https://arxiv.org/abs/1312.6120v3)
"A Simple Way to Initialize Recurrent Networks of Rectified Linear Units": an RNN with ReLU units whose recurrent weights are initialized to the identity matrix is as good as LSTM.

31 From “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units”

32 MNIST 784 sequence prediction

33 IRNN
An ordinary RNN uses $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1})$ and $y_t = \mathrm{sigmoid}(W_{hy} h_t)$. The IRNN replaces tanh with ReLU, $h_t = \mathrm{ReLU}(W_{xh} x_t + W_{hh} h_{t-1})$, $y_t = \mathrm{ReLU}(W_{hy} h_t)$, and initializes the recurrent weights $W_{hh}$ with the identity matrix. When $x = 0$, the update reduces to $h = \mathrm{ReLU}(h)$, so the (non-negative) hidden state is carried through unchanged.
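A minimal NumPy sketch of the IRNN initialization and one update step; the sizes and random input weights are illustrative.

```python
import numpy as np

in_size, hidden = 10, 20
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((hidden, in_size)) * 0.01
W_hh = np.eye(hidden)                 # recurrent weights start as the identity matrix
b = np.zeros(hidden)

def irnn_step(x_t, h_prev):
    return np.maximum(0.0, W_xh @ x_t + W_hh @ h_prev + b)   # ReLU activation

h = np.zeros(hidden)
h = irnn_step(np.zeros(in_size), h)   # with x = 0, this is just h = ReLU(h)
```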

34 Extra materials

35 Various RNN models: Encoder-Decoder, Bidirectional LSTM, Attention model

36 Focusing on the initial value of the RNN hidden state
The RNN's output changes depending on the initial hidden state $h_0$. $h_0$ is itself learnable by backpropagation, and it can be connected to the output of an encoder, which gives the encoder-decoder model. (Figure: RNN slice-wise pixel generation. The first slice is 0 (black), yet various sequences appear: the original, sequences generated from the learned $h_0$, and sequences generated from a random $h_0$.)

37 Encoder-Decoder model
The encoder RNN reads $x_1^{enc}, x_2^{enc}, x_3^{enc}$ and its final state becomes the decoder's initial state $h_0^{dec}$, from which the decoder RNN, fed with $x_1, x_2, x_3$, produces $y_1, y_2, y_3$. Point: use this when your input and output data have different sequence lengths. $h_0^{dec}$ is learned by training the encoder and decoder at the same time. To improve performance, you can use beam search in the decoder.
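A minimal sketch of this wiring with two stateless LSTMs in Chainer; the sizes, the dummy decoder inputs, the readout layer, and the greedy feedback loop are illustrative assumptions, not the presenter's code.

```python
import numpy as np
import chainer.links as L

in_size, hidden, out_size = 4, 8, 4
enc = L.StatelessLSTM(in_size, hidden)
dec = L.StatelessLSTM(out_size, hidden)
readout = L.Linear(hidden, out_size)

src = [np.random.randn(1, in_size).astype(np.float32) for _ in range(5)]
tgt_len = 3

# Encoder: run over the source and keep only the final state (c, h).
c, h = None, None
for x in src:
    c, h = enc(c, h, x)

# Decoder: start from the encoder state; here the first input is a zero vector.
y_prev = np.zeros((1, out_size), dtype=np.float32)
outputs = []
for _ in range(tgt_len):
    c, h = dec(c, h, y_prev)
    y = readout(h)
    outputs.append(y)
    y_prev = y.data        # feed the prediction back in (teacher forcing omitted)
```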

38 Bidirectional LSTM
One RNN reads the input in the original order ("I remember the earlier information!") and another reads it in reverse ("I remember the later information!"), and their hidden states are combined. Long-range dependencies are difficult to learn unless you use LSTM (and even LSTM does not fundamentally solve gradient vanishing). In an encoder-decoder you can also improve performance simply by feeding the encoder the input sequence in reverse order.
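A minimal sketch of a bidirectional encoder built from two stateless LSTMs whose per-step states are concatenated; the sizes and random data are illustrative assumptions.

```python
import numpy as np
import chainer.functions as F
import chainer.links as L

in_size, hidden = 4, 8
fwd = L.StatelessLSTM(in_size, hidden)
bwd = L.StatelessLSTM(in_size, hidden)

xs = [np.random.randn(1, in_size).astype(np.float32) for _ in range(5)]

c, h, hs_f = None, None, []
for x in xs:                      # forward pass over the sequence
    c, h = fwd(c, h, x)
    hs_f.append(h)

c, h, hs_b = None, None, []
for x in reversed(xs):            # backward pass over the reversed sequence
    c, h = bwd(c, h, x)
    hs_b.append(h)
hs_b.reverse()

# Concatenate the forward and backward states at each time step.
hs = [F.concat((hf, hb), axis=1) for hf, hb in zip(hs_f, hs_b)]
```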

39 Attention model
Going further, using the intermediate hidden states of the encoder leads to better performance. At each decoder step t, attention weights $\alpha_{1,t}, \alpha_{2,t}, \alpha_{3,t}$ over the encoder hidden states $h_1^{enc}, h_2^{enc}, h_3^{enc}$ are computed and used to mix those states into a context vector for the decoder.
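A minimal NumPy sketch of one attention step using dot-product scoring; the scoring function, shapes, and random states are illustrative assumptions (the presentation does not specify a particular scoring function).

```python
import numpy as np

hidden, src_len = 8, 3
rng = np.random.default_rng(0)
h_enc = rng.standard_normal((src_len, hidden))   # h_1^enc ... h_3^enc
h_dec = rng.standard_normal(hidden)              # current decoder state

scores = h_enc @ h_dec                           # one score per encoder state
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                             # attention weights alpha_{i,t}
context = alpha @ h_enc                          # weighted mixture of encoder states
# `context` is then fed into the decoder together with its own hidden state.
```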

40 Gated Recurrent Unit (GRU)
A variant of LSTM: the cell is removed and the number of gates is reduced to two. Despite the lower complexity, its performance is not bad. GRUs often appear in machine translation (MT) and spoken dialogue (SD) tasks.

41 GRU can be interpreted as a special case of LSTM
Try splitting the LSTM and turning it upside down: the GRU's hidden state plays the role that the cell plays in the LSTM. The input gate and output gate are shared as a single update gate, and the tanh on the cell output of the LSTM is removed.

43 GRU can be interpreted as a special case of LSTM
Putting the steps together: split the LSTM and turn it upside down, view the LSTM cell as the GRU hidden state, share the input gate and output gate as a single update gate, and delete the tanh on the cell output of the LSTM.
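For reference, a minimal NumPy sketch of the standard GRU update equations (the usual formulation with an update gate and a reset gate; the slides do not write these out, and the sizes and weights here are illustrative).

```python
import numpy as np

in_size, hidden = 3, 4
rng = np.random.default_rng(0)
W = {k: (rng.standard_normal((hidden, in_size)) * 0.1,
         rng.standard_normal((hidden, hidden)) * 0.1)
     for k in ("z", "r", "h")}

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev):
    z = sigmoid(W["z"][0] @ x_t + W["z"][1] @ h_prev)          # update gate
    r = sigmoid(W["r"][0] @ x_t + W["r"][1] @ h_prev)          # reset gate
    h_tilde = np.tanh(W["h"][0] @ x_t + W["h"][1] @ (r * h_prev))
    return (1.0 - z) * h_prev + z * h_tilde                    # no separate cell state

h = np.zeros(hidden)
for x in [rng.standard_normal(in_size) for _ in range(3)]:
    h = gru_step(x, h)
```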


Download ppt "SD Study RNN & LSTM 2016/11/10 Seitaro Shinagawa."

Similar presentations


Ads by Google