Convolutional Sequence to Sequence Learning

1 Convolutional Sequence to Sequence Learning
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin Facebook AI Research ( ) Shiyu Zhang

2 Classic RNN seq2seq
Encoder-decoder with soft attention. Encoder:

3 How about CNN seq2seq?
Advantages:
Do not depend on previous steps => parallelization.
Hierarchical structure provides a shorter path to capture long-range dependencies: O(n) -> O(n/k).
Can see the whole sentence.
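The O(n) -> O(n/k) claim can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions of width k stacked without pooling, so each layer widens the receptive field by k - 1 positions; the function names are illustrative only.

```python
import math

def receptive_field(num_layers: int, kernel_width: int) -> int:
    """Input positions visible to one output after stacking stride-1 convolutions."""
    return num_layers * (kernel_width - 1) + 1

def layers_to_cover(sentence_length: int, kernel_width: int) -> int:
    """Fewest stacked layers whose receptive field spans the whole sentence."""
    if sentence_length <= 1:
        return 0
    return math.ceil((sentence_length - 1) / (kernel_width - 1))

n, k = 25, 5
print(receptive_field(6, k))  # 25
print(layers_to_cover(n, k))  # 6
```

A recurrent encoder needs n = 25 sequential steps to connect the first and last word of the same sentence, against 6 layers here.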

4 Architecture encoder
Input: word embedding + position embedding.
Kernel parameters: $W \in \mathbb{R}^{2d \times kd}$, $b_w \in \mathbb{R}^{2d}$.
GLU (gated linear unit): $v([A\ B]) = A \otimes \sigma(B)$. The non-linearity allows the network to exploit the full input field or to focus on fewer elements; the gates $\sigma(B)$ control which inputs $A$ are relevant.
$z_i^l = v(W^l [z_{i-k/2}^{l-1}, \ldots, z_{i+k/2}^{l-1}] + b_w^l) + z_i^{l-1}$ (the last term is the residual connection)
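A minimal sketch of one encoder block as described on this slide: convolution over k inputs, GLU, then the residual add. It uses plain Python lists for readability and assumes an odd kernel width k, zero padding at the sentence boundaries, and W stored as 2d rows of length kd. The function names are illustrative, not taken from the authors' code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def glu_conv_block(z, W, b, k):
    """One block: convolution over k inputs, GLU, then residual add.

    z: list of n vectors of length d (states of the previous layer)
    W: 2d rows, each of length k*d; b: list of length 2d
    """
    n, d = len(z), len(z[0])
    pad = [[0.0] * d for _ in range(k // 2)]
    padded = pad + z + pad  # zero padding keeps the output length at n (odd k)
    out = []
    for i in range(n):
        # Concatenate the k input vectors centred on position i.
        window = [x for vec in padded[i:i + k] for x in vec]
        y = [sum(w * x for w, x in zip(row, window)) + bias
             for row, bias in zip(W, b)]
        A, B = y[:d], y[d:]  # split the 2d outputs into values A and gates B
        gated = [a * sigmoid(g) for a, g in zip(A, B)]
        out.append([g + r for g, r in zip(gated, z[i])])  # residual connection
    return out

d, k = 2, 3
z = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W = [[0.0] * (k * d) for _ in range(2 * d)]  # zero kernel: only the bias acts
b = [1.0, 1.0, 0.0, 0.0]                     # A = 1, B = 0, so the gate is 0.5
print(glu_conv_block(z, W, b, k))  # [[1.5, 2.5], [3.5, 4.5], [5.5, 6.5]]
```

With a zero kernel the block reduces to the residual input plus A times sigmoid(B) = 0.5, which makes the gating and the skip connection easy to see.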

5 Architecture Decoder
$h_i^l = v(W^l [h_{i-k/2}^{l-1}, \ldots, h_{i+k/2}^{l-1}] + b_w^l) + h_i^{l-1}$
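The slide writes the decoder update with the same centred window as the encoder. The paper additionally pads the decoder input so that no future target positions are visible to the convolution. The sketch below shows one common way to get that effect, left-padding with k - 1 zero vectors; this is an assumed equivalent formulation, and `causal_windows` is a hypothetical helper name.

```python
def causal_windows(h, k):
    """For each position i, return the k states h[i-k+1..i], zero-padded on the left."""
    d = len(h[0])
    padded = [[0.0] * d for _ in range(k - 1)] + h
    return [padded[i:i + k] for i in range(len(h))]

h = [[1.0], [2.0], [3.0]]
for window in causal_windows(h, k=3):
    print(window)
# [[0.0], [0.0], [1.0]]
# [[0.0], [1.0], [2.0]]
# [[1.0], [2.0], [3.0]]
```

Because no window reaches past its own position, all target positions can be computed at once during training, when the full target sentence is known. At inference time the decoder still generates one token at a time.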

6 Architecture Attention
Separate attention for each decoder layer.
The context vector c is simply added to the decoder state h.
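A simplified sketch of the attention step for a single decoder state, assuming plain dot-product scores followed by a softmax. The paper also projects the decoder state and adds the previous target embedding before scoring, and uses encoder outputs plus input embeddings as values; those details are left out here, and `attend` is a hypothetical name.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(h_i, encoder_out, values):
    """Dot-product attention for one decoder state; returns (h_i + c_i, weights)."""
    scores = [sum(a * b for a, b in zip(h_i, z_j)) for z_j in encoder_out]
    weights = softmax(scores)
    d = len(h_i)
    # Context vector: attention-weighted sum of the value vectors.
    c = [sum(w * v[t] for w, v in zip(weights, values)) for t in range(d)]
    return [x + y for x, y in zip(h_i, c)], weights

new_h, weights = attend([1.0, 0.0],
                        [[1.0, 0.0], [1.0, 0.0]],
                        [[2.0, 0.0], [4.0, 0.0]])
print(weights)  # [0.5, 0.5]
print(new_h)    # [4.0, 0.0]
```

Running this once per decoder layer, each with its own decoder states, gives the "separate attention for each layer" behaviour from the slide.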

7 Architecture Output

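8 Strategies
To stabilize learning: maintain the variance of activations throughout the forward and backward passes.
Normalization:
Multiply the sum of a residual connection by $\sqrt{0.5}$.
Multiply the weighted sum of attention over $m$ vectors by $m\sqrt{1/m}$.
Initialization ($n_l$ is the number of input connections to each neuron):
Layers not followed by a GLU: initialize weights from $\mathcal{N}(0, \sqrt{1/n_l})$.
Layers followed by a GLU: the GLU output variance is 1/4 of the input variance, so initialize weights from $\mathcal{N}(0, \sqrt{4/n_l})$.
With dropout, where inputs are retained with probability $p$, these become $\mathcal{N}(0, \sqrt{p/n_l})$ and $\mathcal{N}(0, \sqrt{4p/n_l})$.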
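The scaling factors and initialization rules from the paper can be written as a few small helpers. This is a sketch under two assumptions: the second argument of N(., .) is read as a standard deviation, and p is the probability of keeping an input under dropout. All names are illustrative.

```python
import math

# Factor applied to the sum of a block's input and output (residual connection).
RESIDUAL_SCALE = math.sqrt(0.5)

def attention_scale(m: int) -> float:
    """Factor applied to the attention-weighted sum over m source vectors."""
    return m * math.sqrt(1.0 / m)

def init_std(n_in: int, feeds_glu: bool, keep_prob: float = 1.0) -> float:
    """Standard deviation for weight initialization.

    n_in: number of input connections per neuron (n_l on the slide)
    feeds_glu: True if the layer output goes through a GLU, which cuts the
               variance to roughly 1/4, hence the factor of 4
    keep_prob: probability that dropout keeps an input (1.0 means no dropout)
    """
    gain = 4.0 if feeds_glu else 1.0
    return math.sqrt(gain * keep_prob / n_in)

print(attention_scale(4))  # 2.0
print(init_std(100, feeds_glu=False))
print(init_std(100, feeds_glu=True))
```

The layer feeding a GLU gets twice the standard deviation (four times the variance) of a plain layer with the same fan-in, which offsets the variance lost in the gate.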

9 Experiments

10 Experiments

11 Experiments summarization

12 Reference
Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional Sequence to Sequence Learning. ArXiv e-prints (May 2017).
Gehring, J., Auli, M., Grangier, D., and Dauphin, Y. N. A Convolutional Encoder Model for Neural Machine Translation. ArXiv e-prints (Nov ).

14 Questions
Parallelization at decoder?

