An Overview of Machine Translation


1 An Overview of Machine Translation
Jiyuan Zhang, CSLT / RIIT, Tsinghua University. Slide 1: Self-introduction

2 Machine Translation definition
Machine translation, often referred to by the acronym MT, is a subfield of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. (Wikipedia)

3 Why is MT so important?
- Information society and the production of multilingual content
- Globalization and demand for translation services
- Size of the world translation market
- Size of the translation industry
- MT can improve the productivity of human translators
- MT can supply cheap gist translation

4 Why is MT so difficult? High-quality human translation implies:
- Deep and rich understanding of the source language and text
- Sophisticated and creative command of the target language
Nowadays, feasible goals for machine translation are tasks where:
- An approximate translation is still useful (gist translation)
- Human translators can post-edit MT output (computer-assisted translation)

5 MT system technology
- Hand-crafted: knowledge for analysis, transfer, generation, meaning representation, or direct translation is encoded manually. Includes: rule-based MT
- Machine-learned: representations are implemented by mathematical models learnable from data. Includes: statistical MT, example-based MT, and DNN-based MT

6 Three basic types of rule-based MT systems
- Direct systems
- Transfer systems
- Interlingua systems

7 The rule-based MT system pyramid

8 Direct Systems
- Lack any kind of intermediate stage
- Designed in all details for one particular pair of languages, in one direction
- No analysis of syntactic structure or semantic relationships
- Some local reordering rules

9 Direct System

10 Interlingua systems
- The source text is analysed into a representation from which the target text is directly generated
- The representation is neutral between two or more languages
- Main drawback: the difficulty of creating an interlingua

11 Interlingua system

12 Transfer systems
- Bilingual transfer modules map between intermediate representations of the two languages
- The input to generation is an abstract representation of the target text (possibly a tree)
- A grammar between languages is sometimes more difficult to formulate

13 Transfer systems

14 Example-based machine translation
- Assumption: people translate by analogy
- Decompose a sentence into phrases
- Translate the phrases by analogy to previous translations
- Properly compose the translated fragments into one long sentence
- Requires a very large bilingual example database

15 Example-based machine translation

16 Statistical machine translation
Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. (Wikipedia)

17 Statistical machine translation

18 Statistical machine translation
- Translation model (“adequacy”): assigns better scores to accurate translations
- Language model (“fluency”): assigns better scores to natural target-language text

19 Statistical machine translation
Source-Channel Model
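In the source-channel (noisy-channel) formulation, with f the source sentence and e the target sentence, the two models from the previous slide combine as

    ê = argmax_e P(e | f) = argmax_e P(f | e) · P(e)

where P(f | e) is the translation model (adequacy) and P(e) is the language model (fluency).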

20 Statistical machine translation
Direct Maximum Entropy Translation Model
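A brief sketch of the standard log-linear (maximum entropy) formulation in the style of Och and Ney; the exact notation on the slide may differ. The translation and language models become two of M feature functions h_m(e, f) with learned weights λ_m:

    ê = argmax_e P(e | f) = argmax_e exp( Σ_{m=1..M} λ_m · h_m(e, f) ) / Z(f)
      = argmax_e Σ_{m=1..M} λ_m · h_m(e, f)

since the normalizer Z(f) does not depend on e.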

21 Statistical Machine Translation
- Word-based SMT
- Phrase-based SMT
- Hierarchical phrase-based SMT
- etc.

22 Word-based SMT
- Usually directional: each word in the target is generated by one word in the source
- One-to-many and null-to-many links are allowed
- Classic IBM models of Brown et al.
- Now used mostly for word alignment, not translation
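To make the classic IBM models concrete, here is a minimal sketch of IBM Model 1 EM training in Python. The toy corpus and variable names are invented for illustration, and the NULL word and the refinements of Models 2-5 are omitted:

    from collections import defaultdict

    corpus = [
        (["das", "haus"], ["the", "house"]),
        (["das", "buch"], ["the", "book"]),
        (["ein", "buch"], ["a", "book"]),
    ]

    tgt_vocab = {e for _, es in corpus for e in es}
    # t[(e, f)] = P(e | f): probability that source word f translates to target word e
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))    # uniform initialisation

    for _ in range(10):                              # EM iterations
        count = defaultdict(float)                   # expected counts c(e, f)
        total = defaultdict(float)                   # expected counts c(f)
        for fs, es in corpus:
            for e in es:
                z = sum(t[(e, f)] for f in fs)       # normalise over possible source words
                for f in fs:
                    delta = t[(e, f)] / z
                    count[(e, f)] += delta
                    total[f] += delta
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]                 # M-step: re-estimate P(e | f)

    print(round(t[("house", "haus")], 3))            # approaches 1.0 on this toy corpus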

23 Word-based SMT

24 Phrase-based SMT
- Translation: segment the input, translate and re-arrange the phrases
- Steps: select a source segment, translate it, and attach the result to the target
- Scores: linear combination of feature functions
- Features: phrase pairs, target n-grams, etc.
- Decoder: efficient algorithm to compute optimal solutions
- Features and combination weights are machine-learnable
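A small sketch of the "linear combination of feature functions" used to score one hypothesis; the feature names, values, and weights below are invented for illustration (real systems such as Moses learn the weights with tuning methods like MERT):

    def loglinear_score(features, weights):
        """Score of one translation hypothesis: weighted sum of its feature values."""
        return sum(weights[name] * value for name, value in features.items())

    # Hypothetical features of one candidate translation (mostly log-probabilities).
    features = {
        "phrase_translation_logprob": -3.2,
        "lexical_weighting_logprob": -4.1,
        "language_model_logprob": -5.0,
        "distortion_penalty": -2.0,
        "word_count": 6,
    }
    weights = {
        "phrase_translation_logprob": 1.0,
        "lexical_weighting_logprob": 0.5,
        "language_model_logprob": 1.2,
        "distortion_penalty": 0.6,
        "word_count": -0.1,
    }
    print(loglinear_score(features, weights))  # the decoder picks the hypothesis maximising this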

25 Phrase-based SMT

26 Hierarchical Phrase-based SMT
- Discontinuous phrases, i.e. phrases with gaps
- Long-range reordering rules
- Formalized as synchronous context-free grammars
- Not based on syntactic rules: just two non-terminal symbols!
- The model is fully machine-learnable
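For illustration, a hierarchical rule in the style of Chiang's Hiero system (an example in the spirit of the original paper, not taken from the slides) is a synchronous context-free production with aligned gaps, e.g. for Chinese-English:

    X → ⟨ X1 的 X2 ,  the X2 of X1 ⟩

A single non-terminal X on both sides captures the long-range reordering (the two gaps swap position) without any linguistic syntax.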

27 Hierarchical Phrase-Based SMT

28 Deep Neural networks in MT
Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers with complex structures, or otherwise composed of multiple non-linear transformations. (Wikipedia) As one of the more challenging NLP tasks, machine translation (MT) has become a testing ground for researchers who want to evaluate various kinds of DNNs.

29 Two types of Deep Neural networks in MT
- Direct application, which adopts DNNs to design a purely neural MT model
- Indirect application, which attempts to improve standard MT systems

30 Neural network model A neural network is put together by hooking together many simple “neurons”, so that the output of one neuron can be the input of another.
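A minimal sketch of a layer of such simple "neurons" in numpy; shapes and values are illustrative only:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.array([0.5, -1.0, 2.0])     # 3 input activations
    W = np.random.randn(4, 3) * 0.1    # weights of a layer of 4 neurons
    b = np.zeros(4)                    # biases

    h = sigmoid(W @ x + b)             # each neuron computes sigmoid(w . x + b)
    print(h)                           # 4 outputs, which can feed the next layer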

31 Backpropagation Algorithm
Backpropagation is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent.
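A minimal sketch of backpropagation plus gradient descent for a single sigmoid neuron with squared error; the data and learning rate are invented for illustration:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.array([0.5, -1.0, 2.0])   # one training input
    y = 1.0                          # its target output
    w = np.zeros(3)                  # weights to learn
    b = 0.0
    lr = 0.5                         # learning rate

    for _ in range(100):
        z = w @ x + b                # forward pass
        a = sigmoid(z)
        # backward pass (chain rule): dLoss/dz for loss = 0.5 * (a - y)^2
        dz = (a - y) * a * (1.0 - a)
        w -= lr * dz * x             # gradient descent update
        b -= lr * dz

    print(sigmoid(w @ x + b))        # prediction has moved close to the target 1.0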

32 Recurrent neural network
Recurrent Neural Networks (RNNs) are popular models that have shown great promise in many NLP tasks, including MT. The idea behind RNNs is to make use of sequential information. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations.
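A minimal sketch of a vanilla RNN forward pass over a sequence; dimensions and parameter names are illustrative:

    import numpy as np

    d_in, d_h = 5, 8                       # input and hidden sizes
    U = np.random.randn(d_h, d_in) * 0.1   # input-to-hidden weights
    W = np.random.randn(d_h, d_h) * 0.1    # hidden-to-hidden (recurrent) weights
    b = np.zeros(d_h)

    xs = [np.random.randn(d_in) for _ in range(4)]   # a sequence of 4 input vectors
    h = np.zeros(d_h)                                # initial hidden state

    for x_t in xs:
        # the same transformation is applied at every time step:
        # h_t = tanh(U x_t + W h_{t-1} + b)
        h = np.tanh(U @ x_t + W @ h + b)

    print(h)  # final hidden state summarises the whole sequence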

33 Recurrent neural network
A recurrent neural network and the unfolding in time of the computation involved in its forward computation.

34 LSTM/GRU networks LSTMs were designed to combat vanishing gradients through a gating mechanism. The GRU is much simpler to compute and implement.
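A minimal sketch of one GRU step in numpy; sizes and random parameters are purely illustrative, and the z / (1 - z) gate convention varies across papers:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    d_in, d_h = 5, 8
    Wz, Uz = np.random.randn(d_h, d_in) * 0.1, np.random.randn(d_h, d_h) * 0.1
    Wr, Ur = np.random.randn(d_h, d_in) * 0.1, np.random.randn(d_h, d_h) * 0.1
    Wh, Uh = np.random.randn(d_h, d_in) * 0.1, np.random.randn(d_h, d_h) * 0.1

    def gru_step(x_t, h_prev):
        z = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate
        r = sigmoid(Wr @ x_t + Ur @ h_prev)               # reset gate
        h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))   # candidate state
        return (1.0 - z) * h_prev + z * h_tilde           # mix old state and candidate

    h = gru_step(np.random.randn(d_in), np.zeros(d_h))
    print(h.shape)                                        # (8,)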

35 MT with RNNs – Simplest Model
- Encoder: (equation shown on the slide)
- Decoder: (equation shown on the slide)
- Minimize cross-entropy error for all target words conditioned on the source words
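As a hedged sketch of the training objective for this simplest model (generic notation, not necessarily the slide's): the encoder reads the source words x_1 .. x_S into a final hidden state, the decoder is an RNN conditioned on that state, and training minimizes the cross-entropy

    loss(θ) = - (1/N) Σ_n Σ_t log p_θ( y_t(n) | y_1(n) .. y_{t-1}(n), x(n) )

summed over target positions t and the N training sentence pairs n.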

36 MT with RNNs – Simplest Model

37 RNN encoder-decoder (LSTM)
The basic architecture (LSTM) includes two networks: one encodes the variable-length source sentence into a real-valued vector, and the other decodes the vector into a variable-length target sentence.

38 RNN encoder-decoder (GRU)
The neural network architecture (GRU) learns to encode a variable-length sequence into a fixed-length vector representation and to decode that vector back into a variable-length sequence.

39 RNN encoder-decoder (GRU)

40 RNN encoder-decoder (including alignment)
The new architecture consists of a bidirectional RNN as an encoder and a decoder that emulates searching through the source sentence while decoding a translation. This architecture learns to align and translate simultaneously: it allows the model to automatically search for the parts of a source sentence that are relevant to predicting a target word.

41 RNN encoder-decoder (including alignment)

42 Attention mechanism An attention mechanism allows the decoder to “attend” to different parts of the source sentence at each step of output generation. The attention mechanism simply gives the network access to its internal memory, which is the hidden states of the encoder.
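A minimal sketch of attention over encoder states (simplified to a dot-product score rather than the additive score of the original align-and-translate model; all tensors are random and purely illustrative):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    S, d_h = 6, 8                              # source length, hidden size
    enc_states = np.random.randn(S, d_h)       # h_1 .. h_S from the (bi)RNN encoder
    dec_state = np.random.randn(d_h)           # current decoder state s_{t-1}

    scores = enc_states @ dec_state            # relevance of each source position
    alphas = softmax(scores)                   # attention weights, sum to 1
    context = alphas @ enc_states              # weighted combination of encoder states

    print(alphas)    # where the decoder "attends" for this output step
    print(context)   # fed into the decoder to predict the next target word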

43 Attention = (Fuzzy) Memory?
The attention mechanism simply gives the network access to its internal memory, which is the hidden states of the encoder. Instead of choosing what to “attend” to, the network chooses what to retrieve from memory. Unlike typical memory, the memory access mechanism here is soft, which means that the network retrieves a weighted combination of all memory locations, not a value from a single discrete location.

44 DNNs in Standard SMT Frameworks
- DNNs for word alignment: the DNN-based method can not only learn bilingual word embeddings that capture the similarity between words, but can also make use of wide contextual information
- DNNs for translation rule selection: DNNs achieve better rule prediction by addressing different aspects such as phrase similarity and topic similarity

45 DNNs in Standard SMT Frameworks
- DNNs for reordering and structure prediction
- DNNs for joint translation prediction
- DNNs for language models in SMT: the recurrent NN language model is employed to rescore the n-best translation candidates

46 NMT advantages over SMT
- NMT requires a minimal amount of domain knowledge
- The whole system is jointly tuned to maximize translation performance, unlike a phrase-based system, which consists of many feature functions that are tuned separately
- The memory footprint of an NMT model is often much smaller than that of a phrase-based system, which relies on maintaining large tables of phrase pairs

47 Problems in DNN machine translation
- How to efficiently cover most of the vocabulary
- How to make use of large-scale target-side monolingual data
- How to utilize more syntactic/semantic information
- Computational complexity
- Error analysis
- Remembering and reasoning

48 Evaluation metrics in MT
- Subjective judgments by human evaluators
- Automatic evaluation metrics: BLEU, METEOR, WER/TER
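For reference (standard definitions, not from the slide): BLEU combines modified n-gram precisions p_n (usually up to n = 4) with a brevity penalty BP, where c is the candidate length and r the reference length:

    BLEU = BP · exp( Σ_{n=1..4} (1/4) · log p_n ),    BP = min(1, exp(1 - r/c))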

49 MT systems and tools
- GIZA++, a training tool for IBM Models 1-5
- Moses, a complete SMT system
- Joshua, a decoder for syntax-based SMT
- Pharaoh, a decoder for phrase-based SMT
- etc.

50 Conclusion DNNs still have a long way to go in MT. Due to their effective representations of language, they could eventually be a good solution. It is interesting and imperative to investigate more efficient algorithms for parameter learning in complicated neural network architectures.

51 Presented by Jiyuan Zhang
Thank you! Presented by Jiyuan Zhang. Slide 25: Acknowledgements

