Machine Translation Course 2 Diana Trandabăț Academic year: 2015-2016.


1 Machine Translation Course 2 Diana Trandabăț Academic year: 2015-2016

2 Brief history 1940s: war-time use of computers in code breaking. Warren Weaver's memorandum, 1949: suggests applying decoding techniques to mechanically recognize fundamental aspects of NL: "When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." (Warren Weaver, "Translation"). Alan Turing, 1948: computers could be used for – "(i) various games, e.g. chess, noughts and crosses, bridge, poker; – (ii) the learning of languages; – (iii) translation of languages; – (iv) cryptography; – (v) mathematics." Big investment by the US Government (mostly on Russian-English). Early promise of FAHQT – fully automatic high quality translation.

3 Brief history (2) 1952: first MT conference (US). 1954: Georgetown University (Peter Toma) + IBM carry out the first MT experiment (Russian-English): 250 words, 6 rules, 49 translated sentences. 1955: MT abroad: England (U. Cambridge), France (GETA), USSR (Moscow Academy), Germany (U. Bonn). 1956: first international conference on MT at Georgetown University. 1959: MT in Japan (MITI "Yamato"). Difficulties soon recognised: – no formal linguistics – crude computers – need for "real-world knowledge" – Bar-Hillel's "semantic barrier"

4 Brief history (3) 1960: Bar-Hillel's warning – very pessimistic that full automation of the translation process could ever be achieved. 1966: ALPAC (Automatic Language Processing Advisory Committee) report: – "insufficient demand for translation" – "MT is more expensive, slower and less accurate" – "no immediate or future prospect" – should invest instead in fundamental CL research – Result: no public funding for MT research in the US for the next 25 years (though some privately funded research continued)

5 1966-1985 Research confined to Europe and Canada. "2nd generation approach": linguistically and computationally more sophisticated. 1976: success of Météo (Canada). 1978: Europe starts discussions of its own MT project, Eurotra. First commercial systems appear in the early 1980s. FAHQT abandoned in favour of – "Translator's Workstation" – interactive systems – sublanguage / controlled input

6 1985-2000 Lots of research in Europe and Japan in this "linguistic" paradigm. PC replaces mainframe computers. More systems marketed despite low quality; users claim increased productivity. General explosion in the translation market thanks to international organizations and the globalisation of the marketplace ("buy in your language, sell in mine"). Renewed funding in the US (work on Farsi, Pashto, Arabic, Korean; includes speech translation). Emergence of a new research paradigm ("empirical" methods; allows rapid development of new target languages). Growth of the WWW, including translation tools.

7 Present situation Creditable commercial systems now available, in a wide price range, many very cheap (£30). MT available free on the WWW, widely used for web-page and e-mail translation. Low-quality output is acceptable for reading foreign-language web pages, but still only a small set of languages is covered. Speech translation widely researched.

8 What is MT? Translate Text 1 (source language) into Text 2 (target language) so that: – meaning(text2) == meaning(text1), i.e. faithfulness – text2 is grammatically correct, i.e. fluency. MT is hard! – No MT system has completely solved the problem.

9 MT difficulties Different word order (SVO vs VSO vs SOV languages): "a black cat" (DT ADJ N) --> "o pisică neagră" (DT N ADJ). Multiple translations: "John knows Bill." --> "John connaît Bill." --> "John îl cunoaște pe Bill"; "John knows Bill will be late." --> "John sait que Bill sera en retard." --> "John știe că Bill va întârzia". Cross-lingual mapping of word senses (there are no perfect translation equivalents).

10 How do humans do translation? Learning a foreign language (training stage): – Memorize word translations (a translation lexicon) – Learn some patterns (templates, transfer rules) – Exercise: passive activity (read, listen) and active activity (write, speak) – reinforced learning, reranking. Translation (decoding stage): – Understand the sentence (parsing, semantic analysis?) – Clarify or ask for help (optional) (interactive MT?) – Translate the sentence (word-level? phrase-level? generate from meaning?)

11 What kinds of resources are available to MT? Translation lexicon: – bilingual dictionary. Templates, transfer rules: – grammar books. Parallel data, comparable data. Thesaurus, WordNet, FrameNet, … NLP tools: tokenizer, morphological analyzer, parser, … More resources exist for major languages, fewer for "minor" languages.

12 Major approaches Transfer-based Interlingua Example-based (EBMT) Statistical MT (SMT) Hybrid approach

13 Direct translation No complete intermediary sentence structure. Translation proceeds in a number of steps, each step dedicated to a specific task. The most important component is the bilingual dictionary, typically for general language. Problems with: – ambiguity – inflection – word order and other structural shifts

14 Simplistic approach – sentence splitting – tokenisation – handling capital letters – dictionary look-up and lexical substitution, incl. some heuristics for handling ambiguities – copying unknown words, digits, signs of punctuation etc. – formal editing (a toy sketch follows)
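A minimal sketch of this pipeline in Python, assuming a toy English-Romanian lexicon; the entries and heuristics are illustrative, not from any real system:

import re

# Toy English-Romanian lexicon; the entries are illustrative only.
LEXICON = {"a": "o", "black": "neagra", "cat": "pisica", "sleeps": "doarme"}

def translate_direct(text):
    """Word-for-word 'direct' translation following the steps above."""
    result = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):  # sentence splitting
        tokens = re.findall(r"\w+|[^\w\s]", sentence)          # tokenisation
        # dictionary look-up and lexical substitution; unknown words,
        # digits and punctuation are copied unchanged
        words = [LEXICON.get(t.lower(), t) for t in tokens]
        out = " ".join(words)
        result.append(out[0].upper() + out[1:])                # formal editing
    return " ".join(result)

print(translate_direct("A black cat sleeps."))
# -> "O neagra pisica doarme ." -- note the untreated DT ADJ N word order,
#    exactly the structural problem flagged on slide 9

The output makes the limitation of the simplistic approach visible: nothing in the pipeline can reorder the adjective and the noun.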

15 Advanced classical approach (Tucker 1987) – source text dictionary look-up and morphological analysis – identification of homographs – identification of compound nouns – identification of noun and verb phrases – processing of idioms

16 Advanced approach, cont. – processing of prepositions – subject-predicate identification – syntactic ambiguity identification – synthesis and morphological processing of the target text – rearrangement of words and phrases in the target text

17 Feasibility of the direct translation strategy Is it possible to carry out the direct translation steps as suggested by Tucker with sufficient precision without relying on a complete sentence structure?

18 Current trends in direct translation Re-use of translations: – translation memories of sentences and sub-sentence units such as words, phrases and larger units – lexicalistic translation – example-based translation – statistical translation. Will re-use of translations overcome the problems with the direct translation approach? If so, how can they be handled?

19 Systran (System Translation) Developed in the US by Peter Toma; first version 1969 (Ru-En). The EC bought the rights to Systran in 1976. Currently 18 language pairs. http://babelfish.altavista.com/

20 Systran Linguistic Resources Dictionaries – POS Definitions – Inflection Tables – Decomposition Tables – Segmentation Dictionaries Disambiguation Rules Analysis Rules

21 Transfer-based MT Analysis, transfer, generation: 1. Parse the source sentence. 2. Transform the parse tree with transfer rules. 3. Translate source words. 4. Get the target sentence from the tree. Resources required: – a source parser – a translation lexicon – a set of transfer rules. An example: "Mary bought a book yesterday." (a sketch follows)
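A minimal Python sketch of the four steps, using the smaller NP example from slide 9 ("a black cat" --> "o pisică neagră") so it fits in a few lines; the tree encoding, the single transfer rule and the lexicon are toy assumptions:

# Steps: parse (hand-built here), transfer, translate words, generate.
LEXICON = {"a": "o", "black": "neagra", "cat": "pisica"}

# parse of the source phrase as nested (label, children) tuples; leaves are words
source_tree = ("NP", [("DT", ["a"]), ("ADJ", ["black"]), ("N", ["cat"])])

def transfer(node):
    if isinstance(node, str):                    # leaf: word-to-word translation
        return LEXICON.get(node, node)
    label, children = node
    children = [transfer(c) for c in children]
    if label == "NP" and [c[0] for c in children] == ["DT", "ADJ", "N"]:
        children = [children[0], children[2], children[1]]  # rule: DT ADJ N -> DT N ADJ
    return (label, children)

def generate(node):
    """Read the target sentence off the transformed tree."""
    return node if isinstance(node, str) else " ".join(generate(c) for c in node[1])

print(generate(transfer(source_tree)))           # -> "o pisica neagra"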

22 The Transfer Metaphor analysis --> transfer --> generation; each arrow can be implemented with rule-based methods or probabilistically. Transfer can happen at several levels: – knowledge transfer via an Interlingua: attraction(NamedJohn, NamedMary, high) – semantic transfer: English semantics loves(John, Mary) <--> French semantics aime(Jean, Marie) – syntactic transfer: English syntax S(NP(John) VP(loves, NP(Mary))) <--> French syntax S(NP(Jean) VP(aime, NP(Marie))) – word transfer (memory-based translation): English words "John loves Mary" <--> French words "Jean aime Marie"

23 Transfer-based MT (cont) Parsing: linguistically motivated grammar or formal grammar? Transfer: – context-free rules? A path on a dependency tree? – Apply at most one rule at each level? – How are rules created? Translating words: word-to-word translation? Generation: using LM or other additional knowledge? How to create the needed resources automatically?

24 Syntactic transfer Solves some problems… – word order – some cases of lexical choice. Example: – Dictionary of analysis: know: verb; transitive; subj: human; obj: NP || sentence – Dictionary of transfer: know + obj [NP] --> cunoaște; know + obj [sentence] --> ști. But syntax is not enough… – no one-to-one correspondence between syntactic structures in different languages (syntactic mismatch). (A sketch of the lexical-choice rule follows.)
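A toy sketch of the transfer dictionary above; the rule encoding is invented for illustration:

# Lexical choice for "know" driven by the syntactic category of its object.
TRANSFER_RULES = [
    ("know", "NP", "cunoaște"),   # know + obj[NP]       --> cunoaște
    ("know", "S",  "ști"),        # know + obj[sentence] --> ști
]

def transfer_verb(verb, obj_category):
    for src, obj_cat, tgt in TRANSFER_RULES:
        if src == verb and obj_cat == obj_category:
            return tgt
    return verb                    # fall back: copy the source word

print(transfer_verb("know", "NP"))  # "John knows Bill."          -> cunoaște
print(transfer_verb("know", "S"))   # "John knows Bill is late."  -> ști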

25 Interlingua For n languages, pairwise transfer needs n(n-1) MT systems. Interlingua uses a language-independent representation. Conceptually, Interlingua is elegant: we only need n analyzers and n generators (e.g. for n = 10: 90 pairwise systems vs. 10 + 10 modules). Resources needed: – a language-independent representation – sophisticated analyzers – sophisticated generators. (A toy sketch follows.)
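A toy sketch of the n-generators idea, reusing the loves(John, Mary) / aime(Jean, Marie) example from slide 22; the representation and both generators are illustrative assumptions:

# One language-independent meaning, one generator per target language.
INTERLINGUA = ("loves", "John", "Mary")   # predicate(agent, patient)

NAME = {"en": {"John": "John", "Mary": "Mary"},
        "fr": {"John": "Jean", "Mary": "Marie"}}
VERB = {"en": {"loves": "loves"}, "fr": {"loves": "aime"}}

def generate(meaning, lang):
    """Realise the interlingua predicate as a sentence in language lang."""
    pred, agent, patient = meaning
    return f"{NAME[lang][agent]} {VERB[lang][pred]} {NAME[lang][patient]}"

print(generate(INTERLINGUA, "en"))   # -> John loves Mary
print(generate(INTERLINGUA, "fr"))   # -> Jean aime Marie

Adding a new language means adding one analyzer and one generator, not a new system for every existing language pair.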

26 Interlingua (cont) Questions: – Does language-independent meaning representation really exist? If so, what does it look like? – It requires deep analysis: how to get such an analyzer: e.g., semantic analysis – It requires non-trivial generation: How is that done? – It forces disambiguation at various levels: lexical, syntactic, semantic, discourse levels. – It cannot take advantage of similarities between a particular language pair.

27 Example-based MT Basic idea: translate a sentence by using the closest match in parallel data. First proposed by Nagao (1981). Example: – Training data: w1 w2 w3 w4 --> w1' w2' w3' w4'; w5 w6 w7 --> w5' w6' w7'; w8 w9 --> w8' w9' – Test sentence: w1 w2 w6 w7 w9 --> w1' w2' w6' w7' w9' (a sketch follows)
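A toy sketch of closest-match retrieval over the w1…w9 data above; plain word overlap stands in for the similarity measure a real EBMT system would use, and a real system would also recombine matched fragments rather than return one whole pair:

corpus = [
    ("w1 w2 w3 w4".split(), "w1' w2' w3' w4'".split()),
    ("w5 w6 w7".split(),    "w5' w6' w7'".split()),
    ("w8 w9".split(),       "w8' w9'".split()),
]

def closest_match(test, corpus):
    """Return the training pair whose source side overlaps most with test."""
    return max(corpus, key=lambda pair: len(set(test) & set(pair[0])))

src, tgt = closest_match("w1 w2 w6 w7 w9".split(), corpus)
print(src, "->", tgt)   # first pair wins (shares w1, w2; ties go to the earliest)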

28 EBMT (cont) Types of EBMT: – lexical (shallow) – morphological / POS analysis – parse-tree based (deep). Types of data required by EBMT systems: – parallel text – bilingual dictionary – thesaurus for computing semantic similarity – syntactic parser, dependency parser, etc.

29 EBMT (cont) Word alignment: using a dictionary and heuristics --> exact match. Generalization: – clusters: dates, numbers, colors, shapes, etc. – clusters can be built by hand or learned automatically. Example: – Exact match: 12 players met in Paris last Tuesday --> 12 jucători s-au întâlnit marțea trecută în Paris – Templates: $num players met in $city $time --> $num jucători s-au întâlnit $time în $city (a sketch follows)
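A toy sketch of template matching, with the $num/$city/$time slots realised as named regex groups; the pattern and the time-phrase map are illustrative assumptions:

import re

TEMPLATE_SRC = r"(?P<num>\d+) players met in (?P<city>\w+) (?P<time>last \w+)$"
TEMPLATE_TGT = "{num} jucători s-au întâlnit {time} în {city}"
TIME_MAP = {"last Tuesday": "marțea trecută"}   # the $time cluster

def translate_template(sentence):
    m = re.match(TEMPLATE_SRC, sentence)
    if not m:
        return None                 # no template applies; fall back elsewhere
    slots = m.groupdict()
    slots["time"] = TIME_MAP.get(slots["time"], slots["time"])
    return TEMPLATE_TGT.format(**slots)

print(translate_template("12 players met in Paris last Tuesday"))
# -> "12 jucători s-au întâlnit marțea trecută în Paris"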

30 Statistical MT Basic idea: learn all the parameters from parallel data. Major types: – Word-based – Phrase-based Strengths: – Easy to build, and it requires no human knowledge – Good performance when a large amount of training data is available. Weaknesses: – How to express linguistic generalization?

31 Statistical MT: being faithful & fluent It is often impossible to have a true translation, one that is both: – faithful to the source language, and – fluent in the target language. Example: Japanese "fukaku hansei shite orimasu"; fluent translation: "we apologize"; faithful translation: "we are deeply reflecting (on our past behaviour, and what we did wrong, and how to avoid the problem next time)". So we need a compromise between faithfulness and fluency. Statistical MT tries to maximise a function that represents the importance of both: best translation T* = argmax_T fluency(T) × faithfulness(T, S)

32 The Noisy Channel Model Statistical MT is based on the noisy channel model, developed by Shannon to model communication (e.g. over a phone line). Noisy channel model in SMT (e.g. En to Ro): – assume that the true text is in English – but when it was transmitted over the noisy channel, it somehow got corrupted and came out in Romanian, i.e. the noisy channel has deformed/corrupted the original English input into Romanian. So really… Romanian is a form of noisy English. The task is to recover the original English sentence (to decode the Romanian into English).
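In symbols, noisy-channel decoding picks T* = argmax_T P(T) · P(S|T), where P(T) is a language model scoring fluency and P(S|T) a translation model scoring faithfulness. A toy Python sketch, with all probabilities made up for illustration (not estimated from data):

import math

LM = {"we apologize": 0.02, "apologize we": 0.0001}        # P(E): fluency
TM = {("ne cerem scuze", "we apologize"): 0.3,             # P(Ro|E): faithfulness
      ("ne cerem scuze", "apologize we"): 0.3}

def decode(romanian, candidates):
    """argmax over candidate English sentences of log P(E) + log P(Ro|E)."""
    return max(candidates,
               key=lambda e: math.log(LM.get(e, 1e-12))
                           + math.log(TM.get((romanian, e), 1e-12)))

print(decode("ne cerem scuze", ["we apologize", "apologize we"]))
# -> "we apologize": both candidates are equally faithful here,
#    so the language model's fluency preference decides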

33 Comparison of resource requirements

Resource            Transfer-based   Interlingua                EBMT        SMT
dictionary          +                +                          +
transfer rules      +
parser              +                +                          + (?)
semantic analyzer                    +
parallel data                                                   +           +
others                               universal representation   thesaurus

34 Hybrid MT Basic idea: combine the strengths of different approaches: – syntax-based: generalization at the syntactic level – Interlingua: conceptually elegant – EBMT: memorizing translations of n-grams; generalization at various levels – SMT: fully automatic; uses LMs; optimizes objective functions. Types of hybrid MT: – Borrowing concepts/methods: SMT from EBMT (phrase-based SMT, alignment templates); EBMT from SMT (automatically learned translation lexicon); transfer-based from SMT (automatically learned translation lexicon and transfer rules, using language models, …) – Using two MT systems in a pipeline: using transfer-based MT as a preprocessor for SMT – Using multiple MT systems in parallel, then adding a re-ranker.

