Machine Translation Overview

Alon Lavie
Language Technologies Institute, Carnegie Mellon University
LTI Open House, February 26, 2009

Machine Translation: History
- MT started in the 1940s, one of the first conceived applications of computers
- Promising “toy” demonstrations in the 1950s failed miserably to scale up to “real” systems
- ALPAC Report (1966): MT recognized as an extremely difficult, “AI-complete” problem in the 1960s
- MT revival started in earnest in the 1980s (US, Japan)
- Field dominated by rule-based approaches, requiring 100s of K-years of manual development
- Economic incentive for developing MT systems for a small number of language pairs (mostly European languages)
February 26, 2009 LTI OH 2009

Machine Translation: Where are we today?
- Age of Internet and globalization: great demand for translation services and MT
  - Multiple official languages of UN, EU, Canada, etc.
  - Documentation dissemination for large manufacturers (Microsoft, IBM, Caterpillar)
- Language and translation services business sector estimated at $16 billion worldwide in 2008
- Economic incentive is still primarily within a small number of language pairs
- Some fairly good commercial products in the market for these language pairs
  - Primarily a product of rule-based systems after many years of development
- Web-based (mostly free) MT services: Google, Babelfish, others…
- Pervasive MT between many language pairs is still non-existent and not on the immediate horizon

Representative Example: Google Translate
Google Translate

Example of High Quality MT
PAHO’s Spanam system:
Source (Spanish): Mediante petición recibida por la Comisión Interamericana de Derechos Humanos (en adelante …) el 6 de octubre de 1997, el señor Lino César Oviedo (en adelante …) denunció que la República del Paraguay (en adelante …) violó en su perjuicio los derechos a las garantías judiciales … en su contra.
MT output (English): Through petition received by the Inter-American Commission on Human Rights (hereinafter …) on 6 October 1997, Mr. Linen César Oviedo (hereinafter “the petitioner”) denounced that the Republic of Paraguay (hereinafter …) violated to his detriment the rights to the judicial guarantees, to the political participation, to equal protection and to the honor and dignity consecrated in articles 8, 23, 24 and 11, respectively, of the American Convention on Human Rights (hereinafter …”), as a consequence of judgments initiated against it.

Core Challenges of MT
- Ambiguity and language divergences:
  - Human languages are highly ambiguous, and differently so in different languages
  - Ambiguity at all “levels”: lexical, syntactic, semantic, language-specific constructions and idioms
- Amount of required knowledge:
  - Translation equivalencies for vast vocabularies (several 100k words and phrases)
  - Syntactic knowledge (how to map syntax of one language to another), plus more complex language divergences (semantic differences, constructions and idioms, etc.)
  - How do you acquire and construct a knowledge base that big that is (even mostly) correct and consistent?

Major Sources of Translation Problems
- Lexical differences: multiple possible translations for a source-language (SL) word, or difficulties expressing the SL word's meaning in a single target-language (TL) word
- Structural differences: syntax of the SL differs from syntax of the TL in word order, sentence structure and constituent structure
- Differences in mappings of syntax to semantics: meaning in the TL is conveyed using a different syntactic structure than in the SL
- Idioms and constructions

How to Tackle the Core Challenges
- Manual labor: 1000s of person-years of human experts developing large word and phrase translation lexicons and translation rules. Example: Systran’s RBMT systems.
- Lots of parallel data: data-driven approaches for finding word and phrase correspondences automatically from large amounts of sentence-aligned parallel texts. Example: Statistical MT systems.
- Learning approaches: learn translation rules automatically from small amounts of human translated and word-aligned data. Example: AVENUE’s Statistical XFER approach.
- Simplify the problem: build systems that are limited-domain or constrained in other ways. Examples: CATALYST, NESPOLE!.

State-of-the-Art in MT
- What users want:
  - General purpose (GP): any text
  - High quality (HQ): human level
  - Fully automatic (FA): no user intervention
- We can meet any 2 of these 3 goals today, but not all three at once:
  - FA HQ: Knowledge-Based MT (KBMT)
  - FA GP: Corpus-Based (Example-Based) MT
  - GP HQ: Human-in-the-loop (efficiency tool)

Types of MT Applications:
- Assimilation: multiple source languages, uncontrolled style/topic. General purpose MT, no semantic analysis. (GP FA or GP HQ)
- Dissemination: one source language, controlled style, single topic/domain. Special purpose MT, full semantic analysis. (FA HQ)
- Communication: lower quality may be okay, but system robustness and real-time operation are required.

Approaches to MT: Vauquois MT Triangle
[Figure: triangle with analysis rising on the source side and generation descending on the target side; the three levels, from top to bottom, are:]
- Interlingua: Give-information+personal-data (name=alon_lavie)
- Transfer: [s [vp accusative_pronoun “chiamare” proper_name]] → [s [np [possessive_pronoun “name”]] [vp “be” proper_name]]
- Direct: Mi chiamo Alon Lavie → My name is Alon Lavie

Knowledge-based Interlingual MT
- The classic “deep” Artificial Intelligence approach:
  - Analyze the source language into a detailed symbolic representation of its meaning
  - Generate this meaning in the target language
- “Interlingua”: one single meaning representation for all languages
  - Nice in theory, but extremely difficult in practice:
    - What kind of representation?
    - What is the appropriate level of detail to represent?
    - How to ensure that the interlingua is in fact universal?
“Demonstration” is now a set of KANT interlinguas, downloaded here from the AMTA interlingua workshop webpage: clicking on a sentence shows the IL. Note the use of phrasal lexical units!
- Start with #4: “simple” example (with a 2-level genl. quantifier!)
- 1: rel. clause
- 3: conj “but”, coordination, another genl. quant.
- 5: implicit rel. clause, apposition
- 6: “latter”/“former”
- 7: appositive VP
- 8: quote marks
- 13/14: complex [numbers different on frontpage and files!]
- 21/22: “all” is (:OR SINGULAR PLURAL) [ambiguity packing!]

Interlingua versus Transfer
- With an interlingua, need only N parsers/generators instead of N² transfer systems
[Figure: two diagrams of six languages L1 to L6, one connected pairwise by transfer systems and one connected through a central interlingua]
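As a rough check on the counting argument, the sketch below compares how many components each architecture needs for N languages. It assumes one analyzer plus one generator per language for the interlingua design (2N modules) and one directed transfer system per ordered language pair (N(N-1), which grows as N²). The function names are illustrative and do not come from the slides.

```python
def interlingua_modules(n_languages: int) -> int:
    """One analyzer plus one generator per language (assumed counting convention)."""
    return 2 * n_languages


def transfer_systems(n_languages: int) -> int:
    """One directed transfer system per ordered pair of distinct languages."""
    return n_languages * (n_languages - 1)


if __name__ == "__main__":
    for n in (2, 6, 20):
        print(n, interlingua_modules(n), transfer_systems(n))
```

For the six languages in the figure this gives 12 interlingua modules versus 30 transfer systems, and the gap widens quickly as languages are added.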

Multi-Engine MT
- Apply several MT engines to each input in parallel
- Create a combined translation from the individual translations
- Goal is to combine strengths and avoid weaknesses
  - Along all dimensions: domain limits, quality, development time/cost, run-time speed, etc.
- Various approaches to the problem

Speech-to-Speech MT
- Speech just makes MT (much) more difficult:
  - Spoken language is messier
    - False starts, filled pauses, repetitions, out-of-vocabulary words
    - Lack of punctuation and explicit sentence boundaries
  - Current speech technology is far from perfect
- Need for speech recognition (SR) and synthesis in foreign languages
- Robustness: MT quality degradation should be proportional to SR quality
- Tight integration: rather than separate sequential tasks, can SR + MT be integrated in ways that improve end-to-end performance?

MT at the LTI
- LTI originated as the Center for Machine Translation (CMT) in 1985
- MT continues to be a prominent sub-discipline of research within the LTI
  - More MT faculty than any of the other areas
  - More MT faculty than anywhere else
- Active research on all main approaches to MT: Interlingua, Transfer, EBMT, SMT
- Leader in the area of speech-to-speech MT
- Multi-Engine MT (MEMT)
- MT Evaluation (METEOR)

Phrase-based Statistical MT
- Word-to-word and phrase-to-phrase translation pairs are acquired automatically from data and assigned probabilities based on a statistical model
- Extracted and trained from very large amounts of sentence-aligned parallel text
  - Word alignment algorithms
  - Phrase detection algorithms
  - Translation model probability estimation
- Main approach pursued in CMU systems in the DARPA/TIDES program and now in GALE
  - Chinese-to-English and Arabic-to-English
- Most active work is on improved word alignment, phrase extraction and advanced decoding techniques
- Contact Faculty: Stephan Vogel

EBMT
- Developed originally for the PANGLOSS system in the early 1990s
  - Translation between English and Spanish
- Generalized EBMT under development for the past several years
- Used in a variety of projects in recent years
  - DARPA TIDES and GALE programs
  - DIPLOMAT and TONGUES
- Active research work on improving alignment and indexing, decoding from a lattice
- Contact Faculty: Ralf Brown and Jaime Carbonell

CMU Statistical Transfer (Stat-XFER) MT Approach
- Integrate the major strengths of rule-based and statistical MT within a common statistically-driven framework:
  - Linguistically rich formalism that can express complex and abstract compositional transfer rules
  - Rules can be written by human experts and also acquired automatically from data
  - Easy integration of morphological analyzers and generators
  - Word and syntactic-phrase correspondences can be automatically acquired from parallel text
  - Search-based decoding from statistical MT adapted to find the best translation within the search space: multi-feature scoring, beam-search, parameter optimization, etc.
- Framework suitable for both resource-rich and resource-poor language scenarios
- Most active work on phrase and rule acquisition from parallel data, efficient decoding, joint decoding with non-syntactic phrases, MT for low-resource languages
- Contact Faculty: Alon Lavie, Lori Levin, Bob Frederking and Jaime Carbonell

Speech-to-Speech MT
- Evolution from JANUS/C-STAR systems to NESPOLE!, LingWear, BABYLON, TRANSTAC
- Early 1990s: first prototype system that fully performed speech-to-speech translation (very limited domains)
- Interlingua-based, but with shallow task-oriented representations:
  - “we have single and double rooms available”
  - [give-information+availability] (room-type={single, double})
- Semantic grammars for analysis and generation
- Multiple languages: English, German, French, Italian, Japanese, Korean, and others
- Phrase-based SMT applied in speech-to-speech scenarios
- Most active work on portable speech translation on small devices: Iraqi-Arabic/English and Thai/English
- Contact Faculty: Alan Black, Stephan Vogel, Florian Metze and Alex Waibel

KBMT: KANT, KANTOO, CATALYST
- Deep knowledge-based framework, with symbolic interlingua as intermediate representation
  - Syntactic and semantic analysis into an unambiguous detailed symbolic representation of meaning using unification grammars and transformation mappers
  - Generation into the target language using unification grammars and transformation mappers
- First large-scale multi-lingual interlingua-based MT system deployed commercially:
  - CATALYST at Caterpillar: high quality translation of documentation manuals for heavy equipment
- Limited domains and controlled English input
- Minor amounts of post-editing
- Active follow-on projects
- Contact Faculty: Eric Nyberg and Teruko Mitamura

Multi-Engine MT
- New decoding-based approach developed in recent years under DoD and DARPA funding (used in GALE)
- Main ideas:
  - Treat original engines as “black boxes”
  - Align the word and phrase correspondences between the translations
  - Build a collection of synthetic combinations based on the aligned words and phrases
  - Score the synthetic combinations based on Language Model and confidence measures
  - Select the top-scoring synthetic combination
- Architecture issues: integrating “workflows” that produce multiple translations and then combine them with MEMT
  - IBM’s UIMA architecture
- Contact Faculty: Alon Lavie

Automatic MT Evaluation
- METEOR: new metric developed at CMU
- Improves upon the BLEU metric developed by IBM and used extensively in recent years
- Main ideas:
  - Assess the similarity between a machine-produced translation and (several) human reference translations
  - Similarity is based on word-to-word matching that matches:
    - Identical words
    - Morphological variants of the same word (stemming)
    - Synonyms
  - Similarity is based on a weighted combination of Precision and Recall
  - Address fluency/grammaticality via a direct penalty: how well-ordered is the matching of the MT output with the reference?
- Improved levels of correlation with human judgments of MT quality
- Contact Faculty: Alon Lavie

Summary
- Main challenges for current state-of-the-art MT approaches: coverage and accuracy
  - Acquiring broad-coverage, high-accuracy translation lexicons (for words and phrases)
  - Learning syntactic mappings between languages from parallel word-aligned data
  - Overcoming syntax-to-semantics differences and dealing with constructions
  - Stronger target language modeling

Questions…

Lexical Differences
- SL word has several different meanings that translate differently into TL
  - Ex: financial bank vs. river bank
- Lexical gaps: SL word reflects a unique meaning that cannot be expressed by a single word in TL
  - Ex: English snub doesn’t have a corresponding verb in French or German
- TL has finer distinctions than SL → SL word should be translated differently in different contexts
  - Ex: English wall can be German Wand (internal) or Mauer (external)

Structural Differences
- Syntax of the SL is different from syntax of the TL:
  - Word order within constituents:
    - English NPs: art adj n (the big boy)
    - Hebrew NPs: art n art adj (ha yeled ha gadol)
  - Constituent structure:
    - English is SVO: Subj Verb Obj (I saw the man)
    - Modern Arabic is VSO: Verb Subj Obj
  - Different verb syntax:
    - Verb complexes in English vs. in German: I can eat the apple / Ich kann den Apfel essen
  - Case marking and free constituent order:
    - German and other languages that mark case: den Apfel esse ich = the(acc) apple eat I(nom)
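The English-to-Hebrew noun-phrase contrast on this slide can be written as a single syntactic transfer rule. The sketch below is only an illustration: the three-word lexicon is taken from the slide's example, while the function name and rule encoding are invented here and are not the formalism of any system mentioned in the talk.

```python
# Toy lexicon limited to the slide's example (romanized Hebrew).
LEXICON = {"the": "ha", "big": "gadol", "boy": "yeled"}


def transfer_np(article: str, adjective: str, noun: str) -> list[str]:
    """Map an English NP [art adj n] to a Hebrew NP [art n art adj].

    The noun moves in front of the adjective, and the article is
    repeated before the adjective.
    """
    art, adj, n = (LEXICON[word] for word in (article, adjective, noun))
    return [art, n, art, adj]


if __name__ == "__main__":
    print(" ".join(transfer_np("the", "big", "boy")))
```

Because the rule refers only to part-of-speech slots, it applies to any article, adjective and noun the lexicon covers, which is what makes structural differences addressable by purely syntactic rules.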

Syntax-to-Semantics Differences
- Meaning in TL is conveyed using a different syntactic structure than in the SL:
  - Changes in verb and its arguments
  - Passive constructions
  - Motion verbs and state verbs
  - Case creation and case absorption
- Main distinction from structural differences:
  - Structural differences are mostly independent of lexical choices and their semantic meaning → can be addressed by transfer rules that are syntactic in nature
  - Syntax-to-semantics mapping differences are meaning-specific: they require the presence of specific words (and meanings) in the SL

Syntax-to-Semantics Differences
- Structure-change example: I like swimming / “Ich schwimme gern” (lit: “I swim gladly”)
- Verb-argument example: Jones likes the film. / “Le film plaît à Jones.” (lit: “the film pleases to Jones”)
- Passive constructions example: French reflexive passives:
  - Ces livres se lisent facilement
  - *“These books read themselves easily”
  - These books are easily read

Idioms and Constructions
- Main distinction: meaning of the whole is not directly compositional from the meaning of its sub-parts → no compositional translation
- Examples:
  - George is a bull in a china shop
  - He kicked the bucket
  - Can you please open the window?

Formulaic Utterances
- Good night.
- tisbaH cala xEr (lit: “waking up on good”)
- Romanization of Arabic from CallHome Egypt

Analysis and Generation Main Steps
Analysis:
- Morphological analysis (word-level) and POS tagging
- Syntactic analysis and disambiguation (produce syntactic parse-tree)
- Semantic analysis and disambiguation (produce symbolic frames or logical form representation)
- Map to language-independent Interlingua
Generation:
- Generate semantic representation in TL
- Sentence planning: generate syntactic structure and lexical selections for concepts
- Surface-form realization: generate correct forms of words

Direct Approaches
- No intermediate stage in the translation
- First MT systems developed in the 1950s-60s (assembly code programs)
  - Morphology, bi-lingual dictionary lookup, local reordering rules
  - “Word-for-word, with some local word-order adjustments”
- Modern approaches:
  - Phrase-based Statistical MT (SMT)
  - Example-based MT (EBMT)

Statistical MT (SMT)
- Proposed by IBM in the early 1990s: a direct, purely statistical model for MT
- Most dominant approach in current MT research
- Evolved from word-level translation to phrase-based translation
- Main ideas:
  - Training: statistical “models” of word and phrase translation equivalence are learned automatically from bilingual parallel sentences, creating a bilingual “database” of translations
  - Decoding: new sentences are translated by a program (the decoder), which matches the source words and phrases with the database of translations, and searches the “space” of all possible translation combinations
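To make the decoding idea concrete, here is a minimal sketch of a monotone phrase-based decoder. It is a deliberate simplification: it uses only phrase translation probabilities, with no reordering, language model or beam pruning, all of which real decoders add. The phrase table entries and their probabilities are invented for illustration.

```python
import math

# Hypothetical phrase table: source phrase -> [(target phrase, probability)].
# Entries and probabilities are made up for this example.
PHRASE_TABLE = {
    ("mi",): [("me", 0.6), ("I", 0.4)],
    ("chiamo",): [("call", 0.7), ("am called", 0.3)],
    ("mi", "chiamo"): [("my name is", 0.9)],
    ("alon", "lavie"): [("Alon Lavie", 1.0)],
}
MAX_PHRASE_LEN = 2


def decode_monotone(source: list[str]) -> tuple[str, float]:
    """Return the highest-scoring left-to-right translation and its log-probability."""
    n = len(source)
    # best_score[i] / best_words[i]: best hypothesis covering source[:i]
    best_score = [-math.inf] * (n + 1)
    best_words = [[] for _ in range(n + 1)]
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_PHRASE_LEN), end):
            if best_score[start] == -math.inf:
                continue  # no way to translate the prefix up to `start`
            phrase = tuple(source[start:end])
            for target, prob in PHRASE_TABLE.get(phrase, []):
                score = best_score[start] + math.log(prob)
                if score > best_score[end]:
                    best_score[end] = score
                    best_words[end] = best_words[start] + [target]
    if best_score[n] == -math.inf:
        raise ValueError("phrase table does not cover this input")
    return " ".join(best_words[n]), best_score[n]


if __name__ == "__main__":
    print(decode_monotone("mi chiamo alon lavie".split()))
```

The dynamic program considers every segmentation of the input into known phrases, which is a small instance of searching the space of translation combinations; with this toy table the two-word phrase pair wins over the word-by-word alternatives.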

Statistical MT (SMT)
- Main steps in training phrase-based statistical MT:
  - Create a sentence-aligned parallel corpus
  - Word alignment: train word-level alignment models (GIZA++)
  - Phrase extraction: extract phrase-to-phrase translation correspondences using heuristics (Pharaoh)
  - Minimum Error Rate Training (MERT): optimize translation system parameters on development data to achieve best translation performance
- Attractive: completely automatic, no manual rules, much reduced manual labor
- Main drawbacks:
  - Translation accuracy levels vary
  - Effective only with large volumes (several mega-words) of parallel text
  - Broad domain, but domain-sensitive
  - Still viable only for a small number of language pairs!
- Impressive progress in the last 5 years
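The word-alignment step can be illustrated with IBM Model 1, the simplest of the alignment models that toolkits such as GIZA++ train. The sketch below runs expectation-maximization on a three-sentence toy corpus invented for this example. It omits the NULL word and everything else a production aligner adds, so it should be read as an illustration of the idea rather than of GIZA++ itself.

```python
from collections import defaultdict

# Toy sentence-aligned corpus (German, English), invented for illustration.
CORPUS = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]


def train_ibm_model1(corpus, iterations=20):
    """Estimate word translation probabilities t[(e, f)] = t(e | f) with EM."""
    target_vocab = {e for _, tgt in corpus for e in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (e, f) link
        total = defaultdict(float)  # expected count of each source word f
        # E-step: distribute each target word over the source words of its sentence
        for src, tgt in corpus:
            for e in tgt:
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # M-step: renormalize expected counts into probabilities
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return t


def best_translation(t, f, target_vocab):
    """Most probable target word for source word f under the trained table."""
    return max(target_vocab, key=lambda e: t[(e, f)])


if __name__ == "__main__":
    table = train_ibm_model1(CORPUS)
    vocab = {"the", "house", "book", "a"}
    for word in ("das", "haus", "buch", "ein"):
        print(word, "->", best_translation(table, word, vocab))
```

Even though no sentence pair is word-aligned by hand, co-occurrence across sentences lets EM separate "das" from "haus" and "buch", which is why the approach needs no manual rules but does need enough parallel text.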

EBMT Paradigm
- New Sentence (Source):
  - Yesterday, 200 delegates met with President Clinton.
- Matches to Source Found:
  - Yesterday, 200 delegates met behind closed doors… / Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
  - Difficulties with President Clinton… / Schwierigkeiten mit Praesident Clinton…
- Alignment (Sub-sentential):
  - Yesterday, 200 delegates met behind closed doors… / Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
  - Difficulties with President Clinton over… / Schwierigkeiten mit Praesident Clinton…
- Translated Sentence (Target):
  - Gestern trafen sich 200 Abgeordnete mit Praesident Clinton.

Transfer Approaches
- Syntactic Transfer:
  - Analyze SL input sentence to its syntactic structure (parse tree)
  - Transfer SL parse-tree to TL parse-tree (various formalisms for specifying mappings)
  - Generate TL sentence from the TL parse-tree
- Semantic Transfer:
  - Analyze SL input to a language-specific semantic representation (e.g., Case Frames, Logical Form)
  - Transfer SL semantic representation to TL semantic representation
  - Generate syntactic structure and then surface sentence in the TL

Transfer Approaches: Main Advantages and Disadvantages
- Syntactic Transfer:
  - No need for semantic analysis and generation
  - Syntactic structures are general, not domain specific → less domain dependent, can handle open domains
  - Requires word translation lexicon
- Semantic Transfer:
  - Requires deeper analysis and generation, symbolic representation of concepts and predicates → difficult to construct for open or unlimited domains
  - Can better handle non-compositional meaning structures → can be more accurate
  - No word translation lexicon: generate in TL from symbolic concepts

The METEOR Metric
Example:
- Reference: “the Iraqi weapons are to be handed over to the army within two weeks”
- MT output: “in two weeks Iraq’s weapons will give army”
- Matching:
  - Ref: Iraqi weapons army two weeks
  - MT: two weeks Iraq’s weapons army
- P = 5/8 = 0.625
- R = 5/14 = 0.357
- Fmean = 10*P*R/(9P+R) = 0.3731
- Fragmentation: 3 frags of 5 words = (3-1)/(5-1) = 0.50
- Discounting factor: DF = 0.5 * (frag**3) = 0.0625
- Final score: Fmean * (1 - DF) = 0.3731 * 0.9375 = 0.3498
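The arithmetic of this example can be reproduced with a few lines of code. The sketch below takes the match statistics as given (5 matched words in 3 contiguous fragments, 8 MT words, 14 reference words) and applies the formulas shown on the slide. It does not perform the word matching itself, and the function name and the handling of zero or single matches are assumptions added here.

```python
def meteor_from_counts(matches: int, mt_length: int, ref_length: int, frags: int) -> dict:
    """Apply the slide's formulas to precomputed match statistics."""
    if matches == 0:
        # Assumption: no matched words means a score of zero.
        return {"P": 0.0, "R": 0.0, "Fmean": 0.0, "frag": 0.0, "DF": 0.0, "score": 0.0}
    precision = matches / mt_length
    recall = matches / ref_length
    fmean = 10 * precision * recall / (9 * precision + recall)
    # Assumption: a single matched word counts as unfragmented.
    frag = (frags - 1) / (matches - 1) if matches > 1 else 0.0
    discount = 0.5 * frag ** 3
    return {
        "P": precision,
        "R": recall,
        "Fmean": fmean,
        "frag": frag,
        "DF": discount,
        "score": fmean * (1 - discount),
    }


if __name__ == "__main__":
    result = meteor_from_counts(matches=5, mt_length=8, ref_length=14, frags=3)
    for name, value in result.items():
        print(f"{name} = {value:.4f}")
```

Because recall carries nine times the weight of precision in Fmean, the short MT output is penalized mainly for the reference words it fails to cover, and the fragmentation discount then lowers the score further for the scrambled word order.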

Synthetic Combination MEMT
- Two-stage approach:
  - Align: identify common words and phrases across the translations provided by the engines
  - Decode: search the space of synthetic combinations of words/phrases and select the highest scoring combined translation
- Example:
  - announced afghan authorities on saturday reconstituted four intergovernmental committees
  - The Afghan authorities on Saturday the formation of the four committees of government
  - MEMT: the afghan authorities announced on Saturday the formation of four intergovernmental committees
