1 Application of RNNs to Language Processing Andrey Malinin, Shixiang Gu CUED Division F Speech Group

2 Overview Language Modelling Machine Translation

3 Overview Language Modelling Machine Translation

4 Language Modelling Problem Aim is to calculate the probability of a sequence (sentence) P(X). Can be decomposed into a product of conditional probabilities of tokens (words): P(x_1, …, x_T) = ∏_t P(x_t | x_1, …, x_{t-1}). In practice, only a finite context is used.
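As a concrete illustration of this chain-rule decomposition, here is a minimal Python sketch; cond_prob is a hypothetical stand-in for whatever conditional model is used (n-gram, feed-forward NN, RNN):

```python
import math

def sentence_log_prob(tokens, cond_prob):
    """Chain rule: log P(x_1..x_T) = sum_t log P(x_t | x_1..x_{t-1}).

    cond_prob(history, token) is a hypothetical stand-in for any
    conditional model (n-gram, feed-forward NN, RNN, ...).
    """
    log_p = 0.0
    for t, token in enumerate(tokens):
        history = tokens[:t]  # in practice truncated to a finite context
        log_p += math.log(cond_prob(history, token))
    return log_p
```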

5 N-Gram Language Model N-grams estimate word conditional probabilities via counting: P(x_t | x_{t-n+1}, …, x_{t-1}) = count(x_{t-n+1}, …, x_t) / count(x_{t-n+1}, …, x_{t-1}). Sparse (alleviated by back-off, but not entirely). Doesn't exploit word similarity. Finite context.
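A toy instance of the counting estimate for n = 2 (bigrams); this sketch does no smoothing or back-off, so unseen bigrams get probability zero, which is exactly the sparsity problem noted above:

```python
from collections import defaultdict

def train_bigram_lm(corpus):
    """Maximum-likelihood bigram estimates by counting:
    P(w_t | w_{t-1}) = count(w_{t-1}, w_t) / count(w_{t-1})."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    # Unseen bigrams get probability 0 -- the sparsity problem that
    # back-off and smoothing only partially alleviate.
    return lambda prev, cur: bigram[(prev, cur)] / unigram[prev] if unigram[prev] else 0.0

# bigram_p = train_bigram_lm([["the", "cat", "sat"], ["the", "dog", "sat"]])
# bigram_p("the", "cat")  # -> 0.5
```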

6 Neural Network Language Model Y. Bengio et al., JMLR’03

7 Limitations of Neural Network Language Model Sparsity – Solved. Word Similarity – Solved. Finite Context – Not solved. Computational Complexity – Softmax.

8 Recurrent Neural Network Language Model [X. Liu, et al.]

9 Wall Street Journal Results – T. Mikolov Google 2010

10 Limitations of RNN Language Model Sparsity – Solved! Word Similarity -> Sentence Similarity – Solved! Finite Context – Solved? Not quite… Still Computationally Complex – Softmax.

11 Lattice Rescoring with RNNs Applying RNNs to lattices expands the search space: the lattice is expanded into a prefix tree or an N-best list, which is impractical for large lattices. Approximate lattice expansion – expand only if: the N-gram history is different, or the RNN history vector distance exceeds a threshold.
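A sketch of the merge test implied by these two conditions, assuming hypothetical lattice-node objects that carry a word history and an RNN hidden state; the exact distance measure and threshold used by Liu et al. (2014) may differ:

```python
import numpy as np

def can_merge(node_a, node_b, ngram_order=4, dist_threshold=0.1):
    """Approximate-expansion test (sketch): two lattice nodes are merged
    (no further expansion) only if their truncated n-gram histories match
    AND their RNN history vectors are close.  The node attributes and the
    threshold value are illustrative assumptions."""
    hist_a = tuple(node_a.words[-(ngram_order - 1):])
    hist_b = tuple(node_b.words[-(ngram_order - 1):])
    if hist_a != hist_b:
        return False                       # different n-gram history -> expand
    dist = np.linalg.norm(node_a.rnn_state - node_b.rnn_state)
    return dist <= dist_threshold          # distant RNN histories -> expand
```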

12 Overview Language Modelling Machine Translation

13 Machine Translation Task Translate a source sentence E into a target sentence F. Can be formulated in the Noisy-Channel Framework: F' = argmax_F [P(F|E)] = argmax_F [P(E|F)·P(F)]. P(F) is just a language model – we need to estimate P(E|F).
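A minimal sketch of noisy-channel decoding over an explicit candidate list; log_p_translate and log_p_lm are hypothetical scoring functions standing in for the translation model P(E|F) and the language model P(F):

```python
def noisy_channel_decode(source_E, candidates_F, log_p_translate, log_p_lm):
    """F' = argmax_F P(E|F) * P(F), searched over a candidate list.
    The two scoring functions are placeholders for whatever translation
    model and language model are actually used."""
    return max(candidates_F,
               key=lambda F: log_p_translate(source_E, F) + log_p_lm(F))
```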

14 Previous Approaches: Word Alignment Use IBM Models 1-5 to create initial word alignments of increasing complexity and accuracy from sentence pairs. Make conditional independence assumptions to separate out sentence length, alignment and translation models. Bootstrap using simpler models to initialize more complex models. W. Byrne, 4F11
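To make the bootstrapping idea concrete, here is a minimal EM sketch for the simplest member of the family, the IBM Model 1 lexical translation probabilities t(e|f); it omits the NULL word and the fertility/distortion machinery of Models 2-5:

```python
from collections import defaultdict

def ibm_model1(sentence_pairs, iterations=10):
    """EM for IBM Model 1 (sketch): start from uniform t(e|f) and
    iteratively re-estimate from expected alignment counts.
    sentence_pairs is a list of (E, F) token-list pairs."""
    e_vocab = {e for E, _ in sentence_pairs for e in E}
    t = defaultdict(lambda: 1.0 / len(e_vocab))      # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for E, F in sentence_pairs:
            for e in E:
                norm = sum(t[(e, f)] for f in F)
                for f in F:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta           # expected counts (E-step)
                    total[f] += delta
        t = defaultdict(float,                        # re-estimate (M-step)
                        {(e, f): count[(e, f)] / total[f] for (e, f) in count})
    return t
```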

15 Previous Approaches: Phrase Based SMT Using the IBM word alignments, create phrase alignments and a phrase translation model. Parameters are estimated by Maximum Likelihood or EM. Apply a Synchronous Context-Free Grammar to learn hierarchical rules over phrases. W. Byrne, 4F11
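A sketch of the maximum-likelihood phrase translation estimate, assuming phrase_pairs is the output of a (not shown) phrase-extraction step run over the word-aligned corpus:

```python
from collections import defaultdict

def phrase_translation_probs(phrase_pairs):
    """Maximum-likelihood phrase translation probabilities from extracted
    phrase pairs: phi(e_phrase | f_phrase) = count(f, e) / count(f)."""
    pair_count = defaultdict(int)
    f_count = defaultdict(int)
    for f_phrase, e_phrase in phrase_pairs:
        pair_count[(f_phrase, e_phrase)] += 1
        f_count[f_phrase] += 1
    return {pair: c / f_count[pair[0]] for pair, c in pair_count.items()}
```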

16 Problems with Previous Approaches Highly memory intensive. The initial alignment makes conditional independence assumptions. Word and phrase translation models only count co-occurrences of surface forms – they don't take word similarity into account. Highly non-trivial to decode: hierarchical phrase-based translation combines word alignments, a lexical reordering model, a language model, phrase translations and the parse of a synchronous context-free grammar over the text – components that are very different from one another.

17 Neural Machine Translation The translation problem is expressed as a probability P(F|E). Equivalent to P(f_n, f_{n-1}, …, f_0 | e_m, e_{m-1}, …, e_0) -> a sequence conditioned on another sequence. Create an RNN architecture where the output of one RNN (the decoder) is conditioned on another RNN (the encoder). We can connect them using a joint alignment and translation mechanism. Results in a single gestalt Machine Translation model which can generate candidate translations.

18 Bi-Directional RNNs

19 Neural Machine Translation: Encoder [Diagram: bidirectional encoder annotations h_0 … h_N over the source words e_0 … e_N] Can be pre-trained as a bidirectional RNN language model.
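A minimal numpy sketch of such an encoder, using simple tanh recurrences rather than the gated units of the actual model; the weight-matrix names are illustrative placeholders:

```python
import numpy as np

def bidirectional_encode(embeddings, Wf, Uf, Wb, Ub):
    """Run a simple (tanh) RNN over the source embeddings forwards and
    backwards and concatenate the two states, so each annotation h_j
    summarises the sentence on both sides of word e_j.
    embeddings: (N, d); Wf, Wb: (h_dim, d); Uf, Ub: (h_dim, h_dim)."""
    N, _ = embeddings.shape
    h_dim = Wf.shape[0]
    forward, backward = np.zeros((N, h_dim)), np.zeros((N, h_dim))
    h = np.zeros(h_dim)
    for j in range(N):                           # left-to-right pass
        h = np.tanh(Wf @ embeddings[j] + Uf @ h)
        forward[j] = h
    h = np.zeros(h_dim)
    for j in reversed(range(N)):                 # right-to-left pass
        h = np.tanh(Wb @ embeddings[j] + Ub @ h)
        backward[j] = h
    return np.concatenate([forward, backward], axis=1)   # annotations h_j
```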

20 Neural Machine Translation: Decoder [Diagram: decoder states s_0 … s_M generating the target words f_0 … f_M] f_t is produced by sampling the discrete probability distribution produced by the softmax output layer. Can be pre-trained as an RNN language model.
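A sketch of a single decoder step under the same simplifications (plain tanh recurrence, illustrative matrix names W, U, C, Wo), showing how f_t is sampled from the softmax output:

```python
import numpy as np

def decoder_step(s_prev, f_prev_emb, context, W, U, C, Wo, rng=None):
    """One decoder step (sketch): update the state from the previous state,
    the previous output word embedding and the attention context, then
    sample f_t from the softmax over the target vocabulary."""
    rng = rng or np.random.default_rng()
    s_t = np.tanh(W @ f_prev_emb + U @ s_prev + C @ context)
    logits = Wo @ s_t
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over the target vocabulary
    f_t = rng.choice(len(probs), p=probs)    # sampled target word id
    return f_t, s_t
```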

21 Neural Machine Translation: Joint Alignment [Diagram: encoder annotations h_0 … h_N connected to decoder states s_0 … s_M through alignment scores z_0 … z_N] z_j = W ∙ tanh(V ∙ s_{t-1} + U ∙ h_j), a_{t,1:N} = softmax(z_{1:N}), c_t = ∑_j a_{tj} h_j
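A numpy sketch of this alignment computation, with shapes assumed for illustration (W is a score vector, V and U project the decoder state and the encoder annotations into a common space):

```python
import numpy as np

def attention_context(s_prev, H, W, V, U):
    """Joint alignment as written on the slide:
    z_j = W . tanh(V . s_{t-1} + U . h_j),  a_t = softmax(z),  c_t = sum_j a_tj h_j.
    H has shape (N, h_dim); s_prev is the previous decoder state."""
    z = np.array([W @ np.tanh(V @ s_prev + U @ h_j) for h_j in H])  # scores z_j
    a = np.exp(z - z.max())
    a /= a.sum()             # alignment weights a_{t,1:N}
    return a @ H             # context vector c_t
```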

22 Neural Machine Translation: Features End-to-end differentiable, trained using SGD with cross-entropy error function. Encoder and Decoder learn to represent source and target sentences in a compact, distributed manner Does not make conditional independence assumptions to separate out translation model, alignment model, re-ordering model, etc… Does not pre-align words by bootstrapping from simpler models. Learns translation and joint alignment in a semantic space, not over surface forms. Conceptually easy to decode – complexity similar to speech processing, not SMT. Fewer Parameters – more memory efficient.

23 NMT BLEU results on English to French Translation D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate.

24 Conclusion RNNs and LSTM RNNs have been widely applied to a large range of language processing tasks. State of the art in language modelling. Competitive performance on new tasks. Quickly evolving.

25 Bibliography
W. Byrne. Engineering Part IIB: Module 4F11 Speech and Language Processing, Lecture 12. http://mi.eng.cam.ac.uk/~pcw/local/4F11/4F11_2014_lect12.pdf
D. Bahdanau, K. Cho, Y. Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate". 2014.
Y. Bengio, et al. "A Neural Probabilistic Language Model". Journal of Machine Learning Research, No. 3 (2003).
X. Liu, et al. "Efficient Lattice Rescoring Using Recurrent Neural Network Language Models". In: Proceedings of IEEE ICASSP 2014.
T. Mikolov. "Statistical Language Models Based on Neural Networks". PhD Thesis, Brno University of Technology, Faculty of Information Technology, Department of Computer Graphics and Multimedia, 2012.

