The CMU Milli-RADD Project: Recent Activities and Results
Alon Lavie, Language Technologies Institute, Carnegie Mellon University
TIDES MT Evaluation Workshop, July 21, 2003


Main CMU Milli-RADD Activities
– Participation in the "Small-data" track Chinese/English MT Evaluation, including:
  – SMT system
  – EBMT system
– Development of the CMU Trainable Transfer MT Engine
– Participation in the June-03 Hindi/English SLE (SMT, EBMT, XFER)

May 2003 Chinese "Small" Data Track Results
– Second-best result for the Small Data Track
  – EBMT
  – SMT
  – (best result was ISI)

CMU Main Activities during the June-03 Hindi SLE
– Joint endeavor of the CMU Mega-RADD and Milli-RADD teams
– Data collection and distribution
– SMT system development
– EBMT system development
– Transfer system development

Main Data Collection Efforts

Main Contributions to Shared Resources

Elicited Data Collection
– Goal: acquire high-quality word-aligned Hindi-English data to support system development, especially grammar development and automatic grammar learning
– We recruited a sizeable team of bilingual speakers
– The "original" Elicitation Corpus was translated into Hindi
– A corpus of phrases extracted from the Brown Corpus (NPs and PPs) was broken into files and assigned to translators, here and in India
– Resulting in a total of word-aligned translated phrases

The CMU Elicitation Tool

Transfer System (XFER): Overview

Learning Transfer Rules for Languages with Limited Resources
– Rationale:
  – Large bilingual corpora are not available
  – Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using the elicitation tool
  – The elicitation corpus is designed to be typologically comprehensive and compositional
  – The transfer-rule engine and a new learning approach support acquisition of generalized transfer rules from the data

XFER System Architecture
[Architecture diagram: the user works with the Learning Module, whose elicitation process and SVS learning process produce transfer rules; in the Run-Time Module, SL input passes through the SL parser, the transfer engine, and the TL generator, and the decoder module produces the TL output.]

The Transfer Engine
– Analysis: source text is parsed into its grammatical structure, which determines the transfer application ordering. Example: 他 看 书。 (he read book) is parsed as [S [NP [N 他]] [VP [V 看] [NP [N 书]]]].
– Transfer: a target-language tree is created by reordering, insertion, and deletion: [S [NP [N he]] [VP [V read] [NP [DET a] [N book]]]]. The article "a" is inserted into the object NP, and source words are translated with the transfer lexicon.
– Generation: target-language constraints are checked and the final translation is produced, e.g. "reads" is chosen over "read" to agree with "he". Final translation: "He reads a book."
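
To make the three stages concrete, here is a minimal runnable Python sketch of an analysis/transfer/generation pipeline for the 他 看 书 example. It is an illustration only: the grammar, lexicon, and function names are invented for this toy case and are not the actual XFER engine.

# Toy illustration of the analysis -> transfer -> generation stages
# for the example sentence "ta kan shu" (he read book).
# All rules and names here are hypothetical, not the real XFER engine.

TRANSFER_LEXICON = {"他": "he", "看": "read", "书": "book"}

def analyze(tokens):
    """Parse the source sentence into a toy constituent tree."""
    # Hard-wired S -> NP(N) VP(V NP(N)) analysis for this example.
    return ("S",
            ("NP", ("N", tokens[0])),
            ("VP", ("V", tokens[1]), ("NP", ("N", tokens[2]))))

def transfer(tree):
    """Map the source tree to a target tree: translate the words and
    insert the article 'a' into the object NP."""
    _, (_, (_, subj)), (_, (_, verb), (_, (_, obj))) = tree
    return ("S",
            ("NP", ("N", TRANSFER_LEXICON[subj])),
            ("VP", ("V", TRANSFER_LEXICON[verb]),
                   ("NP", ("DET", "a"), ("N", TRANSFER_LEXICON[obj]))))

def generate(tree):
    """Linearize the target tree and enforce subject-verb agreement."""
    _, (_, (_, subj)), (_, (_, verb), np) = tree
    if subj in ("he", "she", "it"):      # 3rd person singular subject
        verb = verb + "s"                # read -> reads
    words = [subj, verb] + [leaf for _, leaf in np[1:]]
    return " ".join(words).capitalize() + "."

if __name__ == "__main__":
    tree = analyze(["他", "看", "书"])
    print(generate(transfer(tree)))      # -> "He reads a book."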

Transfer Rule Formalism
– Type information
– Part-of-speech / constituent information
– Alignments
– x-side constraints
– y-side constraints
– xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

; SL: the man, TL: der Mann
NP::NP [DET N] -> [DET N]
(
 (X1::Y1)
 (X2::Y2)
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X2 AGR) = *3-SING)
 ((X2 COUNT) = +)
 ((Y1 AGR) = *3-SING)
 ((Y1 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y1 GENDER))
)

Transfer Rule Formalism (II)
– Value constraints
– Agreement constraints

; SL: the man, TL: der Mann
NP::NP [DET N] -> [DET N]
(
 (X1::Y1)
 (X2::Y2)
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X2 AGR) = *3-SING)
 ((X2 COUNT) = +)
 ((Y1 AGR) = *3-SING)
 ((Y1 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y1 GENDER))
)
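
As a rough illustration of how a rule in this formalism might be represented in memory, and how its value and agreement constraints could be checked against feature structures, here is a small Python sketch. The dictionary layout and function names are assumptions for illustration, not the actual formalism implementation.

# Hypothetical in-memory representation of the NP::NP [DET N] -> [DET N] rule
# and a checker for its value and agreement constraints.

det_n_rule = {
    "type": ("NP", "NP"),                 # SL and TL constituent types
    "x_side": ["DET", "N"],               # source right-hand side
    "y_side": ["DET", "N"],               # target right-hand side
    "alignments": [(1, 1), (2, 2)],       # X1::Y1, X2::Y2
    "value_constraints": [                # e.g. ((X1 AGR) = *3-SING)
        (("x", 1, "AGR"), "3-SING"),
        (("x", 1, "DEF"), "DEF"),
        (("x", 2, "AGR"), "3-SING"),
        (("x", 2, "COUNT"), "+"),
        (("y", 1, "AGR"), "3-SING"),
        (("y", 1, "DEF"), "DEF"),
        (("y", 2, "AGR"), "3-SING"),
    ],
    "agreement_constraints": [            # e.g. ((Y2 GENDER) = (Y1 GENDER))
        (("y", 2, "GENDER"), ("y", 1, "GENDER")),
    ],
}

def check_rule(rule, x_feats, y_feats):
    """x_feats / y_feats: lists of feature dicts, one per constituent (1-indexed)."""
    def lookup(side, idx, feat):
        feats = x_feats if side == "x" else y_feats
        return feats[idx - 1].get(feat)

    for (side, idx, feat), value in rule["value_constraints"]:
        if lookup(side, idx, feat) != value:
            return False
    for (s1, i1, f1), (s2, i2, f2) in rule["agreement_constraints"]:
        if lookup(s1, i1, f1) != lookup(s2, i2, f2):
            return False
    return True

# "the man" -> "der Mann": determiner and noun are 3rd person singular,
# definite, and share masculine gender on the German side.
x = [{"AGR": "3-SING", "DEF": "DEF"}, {"AGR": "3-SING", "COUNT": "+"}]
y = [{"AGR": "3-SING", "DEF": "DEF", "GENDER": "MASC"},
     {"AGR": "3-SING", "GENDER": "MASC"}]
print(check_rule(det_n_rule, x, y))       # True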

Rule Learning – Overview
– Goal: acquire syntactic transfer rules
– Use available knowledge from the source side (grammatical structure)
– Three steps (sketched below):
  1. Flat Seed Generation: first guesses at transfer rules; no syntactic structure
  2. Compositionality: use previously learned rules to add structure
  3. Seeded Version Space Learning: refine rules by generalizing with validation (learn appropriate feature constraints)
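
The following Python sketch shows, under heavy simplification, how the three steps could fit together. The data shapes, helper behavior, and function names are assumptions for illustration only, not the actual learning module.

# Illustrative, heavily simplified pipeline for the three learning steps.
# All names and data shapes are assumptions for illustration only.

def flat_seed_generation(aligned_pairs):
    """Step 1: one flat rule per word-aligned pair, abstracted to POS tags."""
    return [{"x_side": [pos for pos, _ in src],
             "y_side": [pos for pos, _ in tgt],
             "alignments": align,
             "constraints": set()}
            for src, tgt, align in aligned_pairs]

def add_compositionality(rule, known_constituents):
    """Step 2: replace sub-sequences covered by known rules with their labels,
    e.g. [DET N] -> NP, giving the rule internal structure."""
    x = list(rule["x_side"])
    for label, pattern in known_constituents.items():
        n, i = len(pattern), 0
        while i + n <= len(x):
            if x[i:i + n] == pattern:
                x[i:i + n] = [label]
            else:
                i += 1
    return {**rule, "x_side": x}

def seeded_version_space_learning(rule, candidate_constraints, validate):
    """Step 3: keep only those feature constraints whose removal the
    validation data rejects, i.e. generalize as far as the data allows."""
    kept = {c for c in candidate_constraints if not validate(rule, drop=c)}
    return {**rule, "constraints": kept}

# Tiny usage example with made-up data:
pairs = [([("DET", "der"), ("N", "Mann")], [("DET", "the"), ("N", "man")],
          [(1, 1), (2, 2)])]
seeds = flat_seed_generation(pairs)
structured = add_compositionality(
    {"x_side": ["DET", "N", "V"], "y_side": ["DET", "N", "V"],
     "alignments": [], "constraints": set()},
    {"NP": ["DET", "N"]})
print(seeds[0]["x_side"])        # ['DET', 'N']
print(structured["x_side"])      # ['NP', 'V']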

Summary of our Final Hindi-to-English Transfer System
– Overview of our lexical resources and how they were used in the system
– Grammar development
– Transfer system runtime configuration
– Dev-test evaluation results
– Observations and lessons learned

Summary of Lexical Resources
– Manual: manually written phrase transfer rules (72)
– Postpos: manually written postpos rules (105)
– Bigram: translations of the 500 most frequent bigrams in Hindi (from Ralf)
– Elicited: elicited data from the controlled corpus and Brown, w-to-w and p-to-p, a total of lexical and phrase rules
– LDC: "master" bilingual dictionary from LDC, frequency sorted; Richard and Shobha manually cleaned up the top 12% of entries, a total of rules
– NE: Named Entity lists from the LDC website and from Fei, a total of 3346 rules
– IBM: statistical w-to-w and p-to-p lexicon from IBM, sorted by translation probability, rules
– JOY: SMT system w-to-w and p-to-p lexicon, sorted by translation probability, rules
– TOTAL: rules

Ordering of Lexical Resources
– Corresponds to three passes of the system:
  – Phrase-to-phrase (used in the first pass)
  – POS-tagged w-to-w pass (morph, enhanced, sorted, can feed into grammar)
  – LEX-tagged w-to-w pass (full forms, can only be used for w-to-w, no grammar)

Ordering of Lexical Resources
– Man rules (p-to-p, w-to-w)
– Postpos (w-to-w)
– Bigrams (p-to-p)
– LDC (w-to-w, enhanced, sorted)
– Etposrules (w-to-w, enhanced, sorted)
– NE (p-to-p, w-to-w)
– Etlexrules (w-to-w, sorted)
– Etphraserules (p-to-p)
– IBM (p-to-p, w-to-w, sorted)
– JOY (p-to-p, w-to-w, sorted)
– Cleaned up and duplicates removed
– Total rules in global lexicon:
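
A minimal Python sketch of how this priority ordering might be applied at lookup time: resources are consulted in the listed order, and the first resource containing the source string wins, so lower-priority duplicates are shadowed. The resource contents below are tiny made-up samples, and the lookup interface is an assumption.

# Illustrative lookup over prioritized lexical resources.
# Resource names follow the slide; contents are tiny made-up samples.

ORDERED_RESOURCES = [
    ("man_rules",      {}),
    ("postpos",        {"meM": "in"}),
    ("bigrams",        {"senA ne": "the army"}),
    ("ldc",            {"senA": "army"}),
    ("etposrules",     {}),
    ("ne",             {"irAka": "Iraq"}),
    ("etlexrules",     {}),
    ("etphraserules",  {}),
    ("ibm",            {"senA": "troops"}),
    ("joy",            {}),
]

def lookup(source):
    """Return the translation from the highest-priority resource that
    contains the source string; lower-priority entries are shadowed."""
    for name, table in ORDERED_RESOURCES:
        if source in table:
            return name, table[source]
    return None, None

print(lookup("senA"))     # ('ldc', 'army')  -- IBM's 'troops' is shadowed
print(lookup("irAka"))    # ('ne', 'Iraq')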

Manual Grammar Development
– The manual grammar covers mostly NPs, PPs, and VPs (verb complexes)
– 73 VP grammar rules, covering all tenses, active and passive, and the subjunctive

Example Grammar Rule

;; SIMPLE PRESENT AND PAST (depends on the tense of the Aux)
; Ex: (tu) bolta hE -> (I) (usually) speak
; Ex: (maiM) sotA hUM -> (I) sleep (now)
; Ex: (maiM) sotA thA -> (I) slept (used to sleep)
{VP,5}
VP::VP : [V Aux] -> [V]
(
 (X1::Y1)
 ((x1 form) = part)
 ((x1 aspect) = imperf)
 ((x2 lexwx) = 'honA')
 ((x2 tense) = (*NOT* fut))
 ((x2 tense) = (*NOT* subj))
 ((x0 tense) = (x2 tense))
 ((x0 agr num) = (x2 agr num))
 ((x0 agr pers) = (x2 agr pers))
 (x0 = x1)
 ((y1 tense) = (x0 tense))
 ((y1 agr num) = (x0 agr num))  ; not always agrees, try commenting
 ((y1 agr pers) = (x0 agr pers))
)

Examples of Learned Rules (I)

{NP,14613}
;;Score:
NP::NP [DET N] -> [DET N]
(
 (X1::Y1)
 (X2::Y2)
)

{NP,14595}
;;Score:
NP::NP [NUM N] -> [NUM N]
(
 (X1::Y1)
 (X2::Y2)
)

{PP,4894}
;;Score:
PP::PP [NP POSTP] -> [PREP NP]
(
 (X2::Y1)
 (X1::Y2)
)

Examples of Learned Rules (II)

;; OF DEQUINDRE AND 14 MILE ROAD EAST
PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
(
 (X7::Y1)
 (X1::Y2)
 (X2::Y3)
 (X3::Y4)
 (X4::Y5)
 (X5::Y6)
 (X6::Y7)
)

NP::NP [ADJ N] -> [ADJ N]
(
 (X1::Y1)
 (X2::Y2)
 ((Y1 NUM) = (X2 NUM))
 ((Y2 NUM) = (Y1 NUM))
)

Pure XFER System for the SLE
– Three passes:
  – Pass 1: match against p-to-p entries
  – Pass 2: morphologically analyze the word and match against all w-to-w resources; halt if a match is found
  – Pass 3: match the original word against all w-to-w resources; provides only w-to-w output, no feeding into grammar rules
– "Weak" decoding: greedy left-to-right search that prefers longer input segments (sketched below)
– Unknown-word policy: remove, or replace with "the"
– Post-processing:
  – remove "be"/"give" at end of sentence if preceded by a verb
  – replace all remaining "be" with "is"
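
A rough Python sketch of the "weak" decoding strategy described above: scan the input left to right and always take the longest segment for which some pass produced a translation. The lattice-as-dictionary representation and all names are assumptions for illustration, not the actual system.

# Greedy left-to-right selection over translated segments, preferring
# longer input spans. 'lattice' maps (start, end) token spans to a
# candidate translation; this toy structure is an assumption.

def greedy_decode(num_tokens, lattice):
    """Walk the input left to right, at each position taking the longest
    span that has a translation; unknown tokens are simply dropped."""
    output, i = [], 0
    while i < num_tokens:
        best_span = None
        for j in range(num_tokens, i, -1):        # longest span first
            if (i, j) in lattice:
                best_span = (i, j)
                break
        if best_span:
            output.append(lattice[best_span])
            i = best_span[1]
        else:
            i += 1                                 # unknown word: remove it
    return " ".join(output)

# Toy lattice over a 4-token input:
lattice = {(0, 1): "the army", (0, 2): "the american army",
           (2, 3): "said", (3, 4): "that"}
print(greedy_decode(4, lattice))   # "the american army said that"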

Development Testing
– Three dev-test sets:
  – India Today: 59 sentences, single reference
  – Full ISI: 358 sentences, newswire, single reference
  – Small ISI: first 25 sentences of Full ISI
– Full ISI was the most meaningful test set, since we had tested on India Today earlier and wanted to ensure no over-fitting
– Post-SLE tests were done on a section of the JHU data (with 4 references)

Debug Output with Sources
Sample Hindi source sentence (romanized):
amerikI senA ne kahA hE ki irAka kI galiyoM meM cAro waraPa vyApwa aparAXa ko niyaMwriwa karane ke lie uMhoMne irAkiyoM ko senA ke kAma meM Ane vAle haWiyAra sOMpane ke lie 2 sapwAha kA samaya xiyA hE.

June 30 Evaluation Submissions
– CMU submitted EBMT, SMT, and XFER
– For XFER we submitted XFER-ONLY and XFER+LM (a "quick" version with a strong decoder)
– Results:
  – EBMT
  – SMT
  – XFER-ONLY
  – XFER+LM
– Aggregate stats from our XFER-ONLY run:
  – Coverage: 88.3%
  – Compounds matched: 2279 (token)
  – Went through morphology and matched: 6256/9605
  – Unknown Hindi words: 1122

Limited Resource Scenario
– The "rules of the game" were skewed against us in this exercise:
  – 1.5 million words of parallel text
  – Noisy statistical lexical resources
  – The pure XFER system did not have a strong decoder
– How would we do in a "real" minority-language scenario, with very limited resources?
– How does this compare with EBMT and SMT under the same scenario?
– How do we do when we add a strong decoder to our XFER system?
– What is the effect of multi-engine combination using a strong decoder?

Limited Resources Scenario: Data Resources
– Data resources used in this scenario:
  – Elicited data corpus: phrases
  – Cleaned portion (top 12%) of the LDC dictionary: ~4500 Hindi words (23612 translation pairs)
  – 500 manual bigram translations
  – 72 manually written phrase transfer rules
  – 105 manually written postpos rules
  – 48 manually written time expression rules
– No additional parallel text!

Adding a "Strong" Decoder
– The current pure XFER system has a very weak decoder:
  – no meaningful scores on edges → poor word translation selection, poor ordering of grammar rules
  – no use of a target language model
– Adding a "strong" decoder:
  – the XFER system produces a full lattice
  – ONLY edges from XFER are allowed
  – edges are scored using w-to-w translation probabilities, trained from the limited data
  – the decoder uses an English LM (70M words)
  – the decoder can also reorder words/phrases (up to 4 words long)
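
A simplified Python sketch of how lattice edges might be scored by combining word-to-word translation probabilities with a target-side language model. The toy bigram LM, the weights, and all names are illustrative assumptions, not the actual strong decoder.

import math

# Toy edge scoring for a translation lattice: each edge carries a
# translation log-probability, and a target bigram LM scores the
# resulting English word sequence. Weights and models are assumptions.

TRANS_LOGPROB = {("senA", "army"): math.log(0.7),
                 ("senA", "troops"): math.log(0.2)}

BIGRAM_LM = {("the", "army"): math.log(0.05),
             ("the", "troops"): math.log(0.01)}

LM_WEIGHT, TM_WEIGHT = 1.0, 1.0
FLOOR = math.log(1e-6)            # back-off score for unseen events

def score_path(edges, lm_context="the"):
    """Sum weighted translation and LM log-scores along a lattice path.
    edges: list of (source_word, target_word) pairs."""
    total, prev = 0.0, lm_context
    for src, tgt in edges:
        total += TM_WEIGHT * TRANS_LOGPROB.get((src, tgt), FLOOR)
        total += LM_WEIGHT * BIGRAM_LM.get((prev, tgt), FLOOR)
        prev = tgt
    return total

# The decoder keeps the highest-scoring path through the lattice:
candidates = [[("senA", "army")], [("senA", "troops")]]
best = max(candidates, key=score_path)
print(best)                       # [('senA', 'army')]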

Limited Resources Scenario: Testing Conditions
– SMT system (stand-alone)
– EBMT system (stand-alone)
– XFER system (naïve decoding)
– XFER system with "strong" decoder:
  – no grammar rules (baseline)
  – manually developed grammar rules
  – automatically learned grammar rules
– XFER+SMT with strong decoder (MEMT)

Limited Resources Scenario: Results on JHU-dev7

System                           BLEU    M-BLEU    NIST
EBMT
SMT
XFER (naïve), manual grammar
XFER (strong), no grammar
XFER (strong), manual grammar
XFER (strong), learned grammar
XFER+SMT

Observations and Lessons
– A strong decoder for the XFER system is essential, even with extremely limited data
– The XFER system with a manual or automatically learned grammar outperforms SMT and EBMT in the extremely limited data scenario
  – where is the cross-over point?
– MEMT based on the strong decoder produced the best results in this scenario
– Reordering within the decoder provided very significant score improvements
  – room for more sophisticated grammar rules
– Conclusion: transfer rules (both manual and learned) offer significant contributions that can complement existing data-driven approaches
  – also in medium and large data settings?