Convolutional Sequence to Sequence Learning
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin
Facebook AI Research (2017.05.12)
Presented by Shiyu Zhang, 2017.05.18
Classic RNN seq2seq: encoder-decoder with soft attention
Encoder: can see the whole sentence
How about CNN seq2seq? Advantages:
Does not depend on previous time steps => parallelization
Hierarchical structure gives a shorter path for capturing long-range dependencies: O(n) -> O(n/k) (see the sketch below)
Can still see the whole sentence
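A tiny illustration of the O(n/k) claim (my own sketch, not from the paper or slides): each stacked convolution of width k widens the receptive field by k - 1 positions, so relating two words n positions apart needs roughly n/(k-1) layers instead of n recurrent steps.

import math

def layers_to_cover(n, k):
    # stacked width-k convolutions: receptive field after l layers is l*(k-1) + 1
    return math.ceil((n - 1) / (k - 1))

print(layers_to_cover(40, 5))  # 10 layers instead of ~40 recurrent steps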
Architecture: Encoder
Input: word embedding + position embedding
Kernel parameters: $W \in \mathbb{R}^{2d \times kd}$, $b_w \in \mathbb{R}^{2d}$
GLU (gated linear units): $v([A\;B]) = A \otimes \sigma(B)$; the non-linearity lets the network exploit the full input field or focus on fewer elements, and the gate $\sigma(B)$ controls which parts of the input $A$ are relevant
Residual connections from the block input to its output:
$z_i^l = v\left(W^l [z_{i-k/2}^{l-1}, \ldots, z_{i+k/2}^{l-1}] + b_w^l\right) + z_i^{l-1}$
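A minimal PyTorch sketch of one encoder block as described above; this is my own illustration, not the fairseq implementation, and the class/variable names are made up (the sqrt(0.5) residual scaling is taken from the Strategies slide below).

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGLUEncoderBlock(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        # W^l in R^{2d x kd}, b_w^l in R^{2d}: a width-k convolution with 2d output channels
        self.conv = nn.Conv1d(d, 2 * d, kernel_size=k, padding=k // 2)

    def forward(self, z):                     # z: (batch, d, src_len), the states z^{l-1}
        a_b = self.conv(z)                    # (batch, 2d, src_len), the concatenation [A B]
        out = F.glu(a_b, dim=1)               # GLU: A * sigmoid(B), back to d channels
        return (out + z) * math.sqrt(0.5)     # residual connection with variance-preserving scaling

block = ConvGLUEncoderBlock(d=256, k=3)
print(block(torch.randn(2, 256, 10)).shape)   # torch.Size([2, 256, 10])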
Architecture: Decoder
Same convolutional block structure as the encoder, but the convolution may only see already-generated target positions (the paper pads the input and trims the convolution output so no future information is available):
$h_i^l = v\left(W^l [h_{i-k/2}^{l-1}, \ldots, h_{i+k/2}^{l-1}] + b_w^l\right) + h_i^{l-1}$
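A hedged sketch of the corresponding decoder block (again my own names, not fairseq code); here I use the standard left-padding causal-convolution trick, which has the same effect as the padding-and-trimming described in the paper.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGLUDecoderBlock(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.k = k
        self.conv = nn.Conv1d(d, 2 * d, kernel_size=k)   # no symmetric padding here

    def forward(self, h):                      # h: (batch, d, tgt_len), the states h^{l-1}
        x = F.pad(h, (self.k - 1, 0))          # left-pad so position i only sees positions <= i
        out = F.glu(self.conv(x), dim=1)       # GLU non-linearity, as in the encoder
        return (out + h) * math.sqrt(0.5)      # residual connection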
Architecture: Attention
A separate attention is computed for each decoder layer (multi-step attention)
The resulting context vector c is simply added to the decoder state h
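A sketch of one decoder layer's attention under the above description; the equations follow the paper, but the function and parameter names (W_d, b_d) are my own.

import math
import torch

def multistep_attention(h, g, z, e, W_d, b_d):
    # h: (batch, tgt_len, d)  decoder states h^l
    # g: (batch, tgt_len, d)  embeddings of the previous target elements
    # z: (batch, src_len, d)  outputs of the last encoder layer
    # e: (batch, src_len, d)  source word + position embeddings
    d_l = h @ W_d.T + b_d + g                                 # combine decoder state with previous target
    scores = torch.softmax(d_l @ z.transpose(1, 2), dim=-1)   # (batch, tgt_len, src_len)
    m = z.size(1)
    c = scores @ (z + e)                                      # weighted sum over encoder outputs + embeddings
    c = c * (m * math.sqrt(1.0 / m))                          # variance scaling (see Strategies slide)
    return h + c                                              # c is simply added to h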
Architecture: Output
The top decoder state is mapped to a distribution over the next target word:
$p(y_{i+1} \mid y_1, \ldots, y_i, \mathbf{x}) = \mathrm{softmax}(W_o h_i^L + b_o)$
Strategies
To stabilize learning: maintain the variance of activations throughout the forward and backward passes.
Normalization:
Multiply the sum of a residual connection by $\sqrt{0.5}$ to halve its variance
Multiply the attention's weighted sum (over $m$ source vectors) by $m\sqrt{1/m}$
Initialization:
Layers not followed by a GLU: initialize weights from $N(0, \sqrt{1/n_l})$, where $n_l$ is the number of input connections per neuron
Layers followed by a GLU: the GLU output has 1/4 of the input variance, so initialize weights from $N(0, \sqrt{4/n_l})$
If dropout retains inputs with probability $p$, the above two become $N(0, \sqrt{p/n_l})$ and $N(0, \sqrt{4p/n_l})$
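A small helper sketching this initialization scheme (my own function, not the fairseq API):

import math
import torch.nn as nn

def init_convs2s_weight(weight, n_l, feeds_glu, retain_p=1.0):
    # n_l: number of input connections per neuron; retain_p: dropout retention probability
    # layers whose output feeds a GLU get 4x the variance to compensate for the gating
    variance = (4.0 if feeds_glu else 1.0) * retain_p / n_l
    nn.init.normal_(weight, mean=0.0, std=math.sqrt(variance))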
Experiments: machine translation
Experiments: abstractive summarization
References
Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional Sequence to Sequence Learning. arXiv e-prints (May 2017).
Gehring, J., Auli, M., Grangier, D., and Dauphin, Y. N. A Convolutional Encoder Model for Neural Machine Translation. arXiv e-prints (Nov. 2016).
https://github.com/facebookresearch/fairseq
Questions
Parallelization at the decoder? (During training the decoder can run over all target positions in parallel because the gold targets are fed in; at inference time, generation is still sequential.)