1
Convolutional Sequence to Sequence Learning
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin (Facebook AI Research). Presented by Shiyu Zhang
2
Classic RNN seq2seq: an encoder-decoder with soft attention.
3
How about CNN seq2seq? Advantages:
- Does not depend on previous time steps => parallelization
- The hierarchical structure provides a shorter path for capturing long-range dependencies: path length drops from O(n) to O(n/k) with kernel width k
- Can see the whole sentence
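The O(n/k) claim can be made concrete: each width-k convolution layer grows the receptive field by k - 1 positions, so the number of stacked layers needed to connect any two positions shrinks roughly by a factor of k. A minimal sketch (the function name is an illustration, not from the paper):

```python
import math

def layers_for_full_context(n, k):
    """Number of stacked width-k convolution layers needed so that one
    output position covers all n input positions.
    Each layer extends the receptive field by k - 1, so after L layers
    the receptive field is L * (k - 1) + 1."""
    return math.ceil((n - 1) / (k - 1))

# A 25-token sentence with kernel width 5 is fully covered after only
# 6 layers, versus 25 sequential steps for an RNN.
```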
4
Architecture: encoder. Input: word embedding + position embedding.
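The input representation is just the elementwise sum of a word-embedding lookup and an absolute-position-embedding lookup. A sketch, where the table shapes (vocab x d and max_len x d) and the function name are assumptions for illustration:

```python
import numpy as np

def input_representation(tokens, word_emb, pos_emb):
    """Encoder input e = w + p: word embedding plus absolute position
    embedding, looked up from two tables.
    tokens: integer ids of length n; word_emb: (vocab, d); pos_emb: (max_len, d)."""
    return word_emb[tokens] + pos_emb[np.arange(len(tokens))]
```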
Kernel parameters: W (kd x 2d), b (2d). GLU (gated linear unit): $v([A\ B]) = A \otimes \sigma(B)$. The non-linearity lets the network exploit the full input field or focus on fewer elements; the sigmoid gate $\sigma(B)$ controls which inputs A are relevant.
$$z_i^l = v\left(W^l \left[z_{i-k/2}^{l-1}, \ldots, z_{i+k/2}^{l-1}\right] + b_w^l\right) + z_i^{l-1}$$
Residual connection: the layer input $z_i^{l-1}$ is added back to the block output.
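The encoder block above can be sketched in numpy. This is a minimal illustration, not the paper's implementation: the loop over positions stands in for a batched convolution, and zero padding keeps the sequence length fixed.

```python
import numpy as np

def glu_conv_block(z_prev, W, b):
    """One encoder convolution block (sketch): a width-k convolution
    producing 2d channels, split into halves A and B, gated as
    A * sigmoid(B) (the GLU), plus a residual connection.
    z_prev: (n, d); W: (k*d, 2*d); b: (2*d,)."""
    n, d = z_prev.shape
    k = W.shape[0] // d
    pad = k // 2
    z_pad = np.pad(z_prev, ((pad, pad), (0, 0)))   # zero-pad to keep length n
    out = np.empty_like(z_prev)
    for i in range(n):
        window = z_pad[i:i + k].reshape(-1)        # concatenate k inputs -> (k*d,)
        y = window @ W + b                         # (2*d,)
        A, B = y[:d], y[d:]
        out[i] = A * (1.0 / (1.0 + np.exp(-B)))    # GLU: A * sigmoid(B)
    return out + z_prev                            # residual connection
```

With all-zero weights the gated term vanishes and the residual passes the input through unchanged, which is one way to sanity-check the block.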
5
Architecture: decoder. The same convolutional block is applied to the decoder states h:
$$h_i^l = v\left(W^l \left[h_{i-k/2}^{l-1}, \ldots, h_{i+k/2}^{l-1}\right] + b_w^l\right) + h_i^{l-1}$$
6
Architecture: attention. A separate attention is computed for each decoder layer.
The resulting context c is simply added to the decoder state h.
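The per-layer attention above can be sketched as a softmax over dot-product scores; following the paper, the weighted sum runs over the encoder outputs plus the input embeddings (z + e), and the context is added back to the decoder state. A simplified sketch (the paper additionally transforms h before scoring, which is omitted here; the function name is an assumption):

```python
import numpy as np

def multistep_attention(h, z, e):
    """Attention for one decoder layer (sketch).
    h: (m, d) decoder states; z: (n, d) encoder outputs; e: (n, d)
    encoder input embeddings. The context is a weighted sum of z + e,
    and c is simply added to h."""
    scores = h @ z.T                                  # (m, n) dot-product scores
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                 # softmax over source positions
    c = a @ (z + e)                                   # (m, d) context vectors
    return h + c                                      # add context to decoder state
```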
7
Architecture: output.
8
Strategies to stabilize learning: maintain the variance of activations throughout the forward and backward passes.
Normalization:
- Scale the sum of a residual connection by √0.5, halving the variance of the sum.
- Scale the conditional input from the attention (a weighted sum of m vectors) by m√(1/m).
Initialization:
- Layers whose output is not fed into a GLU: initialize weights from N(0, √(1/n_l)).
- Layers whose output is fed into a GLU: the GLU's output variance is 1/4 of its input variance, so initialize weights from N(0, √(4/n_l)).
- With dropout (retain probability p), the two become N(0, √(p/n_l)) and N(0, √(4p/n_l)).
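The initialization rules above reduce to one small formula for the weight standard deviation. A sketch, assuming n_l is the fan-in of the layer and the argument names are illustrative:

```python
import math

def init_std(n_l, feeds_glu=False, retain_p=1.0):
    """Standard deviation for N(0, std) weight initialization that keeps
    activation variance stable (sketch of the scheme above).
    n_l: fan-in. feeds_glu: the layer's output goes into a GLU, which
    quarters the variance, so the init variance gets a 4x correction.
    retain_p: dropout retain probability p, contributing a factor p."""
    var = retain_p / n_l
    if feeds_glu:
        var *= 4.0
    return math.sqrt(var)

# Residual sums are separately scaled by sqrt(0.5) to halve the
# variance of adding two equal-variance terms.
RESIDUAL_SCALE = math.sqrt(0.5)
```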
9
Experiments
10
Experiments
11
Experiments: summarization
12
References
Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional Sequence to Sequence Learning. ArXiv e-prints (May 2017).
Gehring, J., Auli, M., Grangier, D., and Dauphin, Y. N. A Convolutional Encoder Model for Neural Machine Translation. ArXiv e-prints (Nov.).
14
Questions: Is parallelization possible at the decoder (during inference, generation is still sequential)?