
Slide 1: The CMU Milli-RADD Project: Recent Activities and Results. Alon Lavie, Language Technologies Institute, Carnegie Mellon University

Slide 2: Main CMU Milli-RADD Activities
– Participation in the "small-data" track Chinese/English MT evaluation, including:
  – SMT system
  – EBMT system
– Development of the CMU Trainable Transfer MT Engine
– Participation in the June 2003 Hindi/English Surprise Language Exercise (SLE) with SMT, EBMT, and XFER systems

Slide 3: May 2003 Chinese "Small" Data Track Results
Second-best result in the small data track:
– EBMT: 5.3701
– SMT: 6.7136
– (best result was ISI's 6.8481)

Slide 4: CMU Main Activities during the June 2003 Hindi SLE
– Joint endeavor of the CMU Mega-RADD and Milli-RADD teams
– Data collection and distribution
– SMT system development
– EBMT system development
– Transfer system development

Slide 5: Main Data Collection Efforts

Slide 6: Main Contributions to Shared Resources

Slide 7: Elicited Data Collection
– Goal: acquire high-quality word-aligned Hindi-English data to support system development, especially grammar development and automatic grammar learning
– We recruited a sizeable team of bilingual speakers
– The "original" elicitation corpus was translated into Hindi
– A corpus of phrases extracted from the Brown Corpus (NPs and PPs) was broken into files and assigned to translators, both locally and in India
– Result: a total of 17,589 word-aligned translated phrases

Slide 8: The CMU Elicitation Tool

Slide 9: Transfer System (XFER): Overview

Slide 10: Learning Transfer Rules for Languages with Limited Resources
Rationale:
– Large bilingual corpora are not available
– Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using the elicitation tool
– The elicitation corpus is designed to be typologically comprehensive and compositional
– The transfer-rule engine and a new learning approach support acquisition of generalized transfer rules from the data

Slide 11: XFER System Architecture
[Architecture diagram] Learning side: the user drives the elicitation process, whose output feeds the Seeded Version Space (SVS) learning process in the learning module, producing transfer rules. Run-time side: SL input passes through the SL parser, transfer engine, and TL generator, and a decoder module selects the TL output.

Slide 12: The Transfer Engine
– Analysis: source text is parsed into its grammatical structure, which determines transfer application ordering. Example: 他 看 书。 (he read book) parses as S → NP (N 他), VP (V 看, NP (N 书)).
– Transfer: a target-language tree is created by reordering, insertion, and deletion: S → NP (N he), VP (V read, NP (DET a, N book)). The article "a" is inserted into the object NP; source words are translated with the transfer lexicon.
– Generation: target-language constraints are checked and the final translation is produced; e.g. "reads" is chosen over "read" to agree with "he". Final translation: "He reads a book"
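A minimal Python sketch of these three stages for the example above; the tree representation, toy lexicon, and agreement check are hypothetical illustrations, not the actual engine (which is driven by the rule formalism on the next slides):

# Sketch of the analysis/transfer/generation pipeline for 他 看 书.
# Tree representation, lexicon, and rules are hypothetical toys.
from dataclasses import dataclass

@dataclass
class Node:
    label: str             # constituent or POS label, e.g. "NP", "V"
    children: list = None  # sub-constituents; None for a leaf
    word: str = None       # surface word at a leaf

LEXICON = {"他": "he", "看": "read", "书": "book"}  # transfer lexicon

def transfer(node, in_vp=False):
    """Translate leaves; insert "a" into a bare object NP inside a VP."""
    if node.children is None:
        return Node(node.label, word=LEXICON[node.word])
    kids = [transfer(c, in_vp or node.label == "VP") for c in node.children]
    if node.label == "NP" and in_vp and [k.label for k in kids] == ["N"]:
        kids = [Node("DET", word="a")] + kids   # insertion step
    return Node(node.label, children=kids)

def generate(node, subj_3sing):
    """Emit words left to right; toy agreement check: read -> reads."""
    if node.children is None:
        w = node.word
        if node.label == "V" and subj_3sing:
            w += "s"
        return [w]
    return [w for c in node.children for w in generate(c, subj_3sing)]

# Analysis output (normally produced by the SL parser):
src = Node("S", children=[
    Node("NP", children=[Node("N", word="他")]),
    Node("VP", children=[Node("V", word="看"),
                         Node("NP", children=[Node("N", word="书")])])])
print(" ".join(generate(transfer(src), True)))  # he reads a book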

Slide 13: Transfer Rule Formalism
Each rule carries:
– Type information
– Part-of-speech/constituent information
– Alignments
– x-side constraints
– y-side constraints
– xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

; SL: the man, TL: der Mann
NP::NP [DET N] -> [DET N]
(
  (X1::Y1)
  (X2::Y2)
  ((X1 AGR) = *3-SING)
  ((X1 DEF) = *DEF)
  ((X2 AGR) = *3-SING)
  ((X2 COUNT) = +)
  ((Y1 AGR) = *3-SING)
  ((Y1 DEF) = *DEF)
  ((Y2 AGR) = *3-SING)
  ((Y2 GENDER) = (Y1 GENDER))
)

Slide 14: Transfer Rule Formalism (II)
The same rule, distinguishing:
– Value constraints, e.g. ((X1 AGR) = *3-SING)
– Agreement constraints, e.g. ((Y2 GENDER) = (Y1 GENDER))

; SL: the man, TL: der Mann
NP::NP [DET N] -> [DET N]
(
  (X1::Y1)
  (X2::Y2)
  ((X1 AGR) = *3-SING)
  ((X1 DEF) = *DEF)
  ((X2 AGR) = *3-SING)
  ((X2 COUNT) = +)
  ((Y1 AGR) = *3-SING)
  ((Y1 DEF) = *DEF)
  ((Y2 AGR) = *3-SING)
  ((Y2 GENDER) = (Y1 GENDER))
)
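A rough sketch of this distinction in code: value constraints test a feature against a constant, agreement constraints equate two features. The dict-based feature structures and the simplified unification below are hypothetical, not the engine's actual implementation:

# Hypothetical representation of the rule's constraints; feature
# structures are plain dicts and unification is simplified to copying.

# x-side feature structures for "the" (X1) and "man" (X2).
X = {"X1": {"AGR": "3-SING", "DEF": "DEF"},
     "X2": {"AGR": "3-SING", "COUNT": "+"}}

VALUE_CONSTRAINTS = [          # (constituent, feature, required value)
    ("X1", "AGR", "3-SING"), ("X1", "DEF", "DEF"),
    ("X2", "AGR", "3-SING"), ("X2", "COUNT", "+"),
]
AGREEMENT_CONSTRAINTS = [      # ((c, feat), (c, feat)) pairs to equate
    (("Y2", "GENDER"), ("Y1", "GENDER")),
]

def apply_rule(X):
    # Every value constraint must hold for the rule to apply.
    for c, feat, val in VALUE_CONSTRAINTS:
        if X[c].get(feat) != val:
            return None        # rule is not applicable
    # y-side structures for "der" (Y1) and "Mann" (Y2).
    Y = {"Y1": {"AGR": "3-SING", "DEF": "DEF"},
         "Y2": {"AGR": "3-SING", "GENDER": "MASC"}}
    # An agreement constraint shares one value between constituents.
    for (c1, f1), (c2, f2) in AGREEMENT_CONSTRAINTS:
        v = Y[c1].get(f1) or Y[c2].get(f2)
        Y[c1][f1] = Y[c2][f2] = v
    return Y

print(apply_rule(X))  # Y1 picks up GENDER=MASC from Y2 (der <- Mann)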

Slide 15: Rule Learning: Overview
– Goal: acquire syntactic transfer rules
– Use available knowledge from the source side (grammatical structure)
– Three steps:
  1. Flat seed generation: first guesses at transfer rules; no syntactic structure
  2. Compositionality: use previously learned rules to add structure
  3. Seeded version space learning: refine rules by generalizing with validation (learn appropriate feature constraints)
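To make step 1 concrete, a small sketch of what flat seed generation might produce from one word-aligned elicitation pair; the rule encoding is hypothetical, and the example pair is the Hindi one shown on slide 21:

# Sketch of flat seed generation: one aligned sentence pair becomes a
# maximally specific transfer rule with no internal syntactic structure.

def flat_seed(sl_pos, tl_pos, alignment):
    """alignment: 0-based (sl_index, tl_index) word-alignment pairs."""
    return {
        "type": "S::S",    # whole-sentence rule; structure comes later
        "x_side": sl_pos,  # SL POS sequence
        "y_side": tl_pos,  # TL POS sequence
        "alignments": [(f"X{i + 1}", f"Y{j + 1}") for i, j in alignment],
    }

# maiM sotA hUM -> "I sleep" (aux hUM has no separate English word).
rule = flat_seed(
    sl_pos=["PRO", "V", "Aux"],
    tl_pos=["PRO", "V"],
    alignment=[(0, 0), (1, 1)],
)
print(rule)
# {'type': 'S::S', 'x_side': ['PRO', 'V', 'Aux'], 'y_side': ['PRO', 'V'],
#  'alignments': [('X1', 'Y1'), ('X2', 'Y2')]}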

Slide 16: Summary of Our Final Hindi-to-English Transfer System
– Overview of our lexical resources and how they were used in the system
– Grammar development
– Transfer system runtime configuration
– Dev-test evaluation results
– Observations and lessons learned

Slide 17: Summary of Lexical Resources
– Manual: manually written phrase transfer rules (72)
– Postpos: manually written postposition rules (105)
– Bigram: translations of the 500 most frequent bigrams in Hindi (from Ralf)
– Elicited: elicited data from the controlled corpus and Brown, w-to-w and p-to-p, a total of 84,619 lexical and phrase rules
– LDC: "master" bilingual dictionary from LDC, frequency sorted; Richard and Shobha manually cleaned the top 12% of entries; a total of 87,902 rules
– NE: named-entity lists from the LDC website and from Fei, a total of 1,237 + 2,109 = 3,346 rules
– IBM: statistical w-to-w and p-to-p lexicon from IBM, sorted by translation probability, 81,664 rules
– JOY: SMT system w-to-w and p-to-p lexicon, sorted by translation probability, 189,583 rules
– TOTAL: 447,791 rules

Slide 18: Ordering of Lexical Resources
The ordering corresponds to the system's three passes:
– Phrase-to-phrase (used in the first pass)
– POS-tagged w-to-w pass (morphological forms, enhanced, sorted; can feed into the grammar)
– LEX-tagged w-to-w pass (full forms; can only be used w-to-w, no grammar)

Slide 19: Ordering of Lexical Resources (continued)
In priority order:
1. Man rules (p-to-p, w-to-w)
2. Postpos (w-to-w)
3. Bigrams (p-to-p)
4. LDC (w-to-w, enhanced, sorted)
5. Etposrules (w-to-w, enhanced, sorted)
6. NE (p-to-p, w-to-w)
7. Etlexrules (w-to-w, sorted)
8. Etphraserules (p-to-p)
9. IBM (p-to-p, w-to-w, sorted)
10. JOY (p-to-p, w-to-w, sorted)
Cleaned up and duplicates removed; total rules in the global lexicon: 430,753
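A minimal sketch of prioritized lookup over this ordering, where earlier resources win; the resource contents are hypothetical stand-ins (the Hindi forms are taken from the debug example on slide 26), and the real system further splits lookups across the three passes described above:

# Sketch of priority-ordered lexicon lookup: earlier resources win.
# Resource contents are hypothetical stand-ins.

RESOURCES = [                       # (name, {source -> translation})
    ("man_rules", {"samaya": "time"}),
    ("postpos",   {"ke lie": "for"}),
    ("ldc",       {"senA": "army", "samaya": "period"}),  # outranked
    ("ibm",       {"haWiyAra": "weapon"}),
]

def lookup(source):
    """Return (translation, resource) from the first resource with an
    entry, or None if the word is unknown."""
    for name, table in RESOURCES:
        if source in table:
            return table[source], name
    return None

print(lookup("samaya"))    # ('time', 'man_rules'): manual rule wins
print(lookup("haWiyAra"))  # ('weapon', 'ibm')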

Slide 20: Manual Grammar Development
– The manual grammar covers mostly NPs, PPs, and VPs (verb complexes)
– 73 VP grammar rules, covering all tenses, active and passive, and the subjunctive

Slide 21: Example Grammar Rule

;; SIMPLE PRESENT AND PAST (depends on the tense of the Aux)
; Ex: (tu) bolta hE -> (I) (usually) speak
; Ex: (maiM) sotA hUM -> (I) sleep (now)
; Ex: (maiM) sotA thA -> (I) slept (used to sleep)
{VP,5}
VP::VP : [V Aux] -> [V]
(
  (X1::Y1)
  ((x1 form) = part)
  ((x1 aspect) = imperf)
  ((x2 lexwx) = 'honA')
  ((x2 tense) = (*NOT* fut))
  ((x2 tense) = (*NOT* subj))
  ((x0 tense) = (x2 tense))
  ((x0 agr num) = (x2 agr num))
  ((x0 agr pers) = (x2 agr pers))
  (x0 = x1)
  ((y1 tense) = (x0 tense))
  ((y1 agr num) = (x0 agr num))   ; not always agrees, try commenting
  ((y1 agr pers) = (x0 agr pers))
)

Slide 22: Examples of Learned Rules (I)

{NP,14613} ;; Score: 0.2024
NP::NP [DET N] -> [DET N]
(
  (X1::Y1)
  (X2::Y2)
)

{NP,14595} ;; Score: 0.0480
NP::NP [NUM N] -> [NUM N]
(
  (X1::Y1)
  (X2::Y2)
)

{PP,4894} ;; Score: 0.0470
PP::PP [NP POSTP] -> [PREP NP]
(
  (X2::Y1)
  (X1::Y2)
)

Slide 23: Examples of Learned Rules (II)

;; OF DEQUINDRE AND 14 MILE ROAD EAST
PP::PP [N CONJ NUM N N N POSTP] -> [PREP N CONJ NUM N N N]
(
  (X7::Y1)
  (X1::Y2)
  (X2::Y3)
  (X3::Y4)
  (X4::Y5)
  (X5::Y6)
  (X6::Y7)
)

NP::NP [ADJ N] -> [ADJ N]
(
  (X1::Y1)
  (X2::Y2)
  ((Y1 NUM) = (X2 NUM))
  ((Y2 NUM) = (Y1 NUM))
)

Slide 24: Pure XFER System for the SLE
Three passes:
– Pass 1: match against p-to-p entries
– Pass 2: morphologically analyze the word and match against all w-to-w resources; halt if a match is found
– Pass 3: match the original word against all w-to-w resources; provides only w-to-w output, no feeding into grammar rules
"Weak" decoding: greedy left-to-right search that prefers longer input segments (see the sketch below)
Unknown-word policy: remove, or replace with "the"
Post-processing:
– remove "be"/"give" at end of sentence if preceded by a verb
– replace all remaining "be" with "is"
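A minimal sketch of this "weak" greedy left-to-right decoding: at each position the longest matching input segment wins, and unknown words are dropped. The phrase table is a hypothetical stand-in:

# Sketch of greedy left-to-right decoding that prefers longer segments.
# The phrase table is a hypothetical stand-in for the real lexicon.

PHRASES = {
    ("ke", "lie"): "for",          # p-to-p entry beats two w-to-w lookups
    ("ke",): "of",
    ("lie",): "took",
    ("samaya",): "time",
}
MAX_LEN = 4                        # longest phrase entry we try

def greedy_decode(words):
    out, i = [], 0
    while i < len(words):
        # Try the longest span first, shrinking until something matches.
        for span in range(min(MAX_LEN, len(words) - i), 0, -1):
            key = tuple(words[i:i + span])
            if key in PHRASES:
                out.append(PHRASES[key])
                i += span
                break
        else:
            i += 1                 # unknown-word policy: drop the word
    return " ".join(out)

print(greedy_decode("samaya ke lie".split()))  # "time for"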

Slide 25: Development Testing
Three dev-test sets:
– India Today: 59 sentences, single reference
– Full ISI: 358 sentences, newswire, single reference
– Small ISI: first 25 sentences of Full ISI
Full ISI was the most meaningful test set; we had tested on India Today earlier on, and used the split to ensure no over-fitting. Post-SLE tests were done on a section of the JHU data (with 4 references).

Slide 26: Debug Output with Sources
Example source sentence (romanized Hindi):
amerikI senA ne kahA hE ki irAka kI galiyoM meM cAro waraPa vyApwa aparAXa ko niyaMwriwa karane ke lie uMhoMne irAkiyoM ko senA ke kAma meM Ane vAle haWiyAra sOMpane ke lie 2 sapwAha kA samaya xiyA hE.

Slide 27: June 30 Evaluation Submissions
– CMU submitted EBMT, SMT, and XFER systems
– For XFER we submitted XFER-ONLY and XFER+LM (a "quick" version with a strong decoder)
Results:
– EBMT: 5.9765
– SMT: 6.7441
– XFER-ONLY: 5.3514
– XFER+LM: 5.4732
Aggregate statistics from our XFER-ONLY run:
– Coverage: 88.3%
– Compounds matched: 2,279 (tokens)
– Went through morphology and matched: 6,256 / 9,605
– Unknown Hindi words: 1,122

Slide 28: Limited Resource Scenario
The "rules of the game" were skewed against us in this exercise:
– 1.5 million words of parallel text
– Noisy statistical lexical resources
– The pure XFER system did not have a strong decoder
Open questions:
– How would we do in a "real" minority-language scenario, with very limited resources?
– How does this compare with EBMT and SMT under the same scenario?
– How do we do when we add a strong decoder to our XFER system?
– What is the effect of multi-engine combination using a strong decoder?

Slide 29: Limited Resources Scenario: Data Resources
Data resources used in this scenario:
– Elicited data corpus: 17,589 phrases
– Cleaned portion (top 12%) of the LDC dictionary: ~4,500 Hindi words (23,612 translation pairs)
– 500 manual bigram translations
– 72 manually written phrase transfer rules
– 105 manually written postposition rules
– 48 manually written time-expression rules
– No additional parallel text!

Slide 30: Adding a "Strong" Decoder
The current pure XFER system has a very weak decoder:
– no meaningful scores on edges → poor word-translation selection, poor ordering of grammar rules
– no use of a target language model
Adding a "strong" decoder:
– the XFER system produces a full lattice
– ONLY edges from XFER are allowed
– edges are scored using w-to-w translation probabilities, trained from the limited data
– the decoder uses an English LM (70 million words)
– the decoder can also reorder words/phrases (up to 4 words long)
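A rough sketch of how such edge scoring could combine w-to-w translation probabilities with a target LM; the probabilities, the tiny bigram LM, and the log-linear combination below are hypothetical simplifications of a real lattice decoder:

# Sketch of scoring competing lattice edges with translation
# probabilities plus a target LM. All numbers are made up.
import math

# P(english | hindi) from the limited training data (hypothetical).
TRANS_PROB = {("samaya", "time"): 0.7, ("samaya", "period"): 0.3}

# Tiny bigram LM over English (hypothetical probabilities).
BIGRAM_LM = {("the", "time"): 0.01, ("the", "period"): 0.002}

def score(hindi_word, prev_en, cand_en, lm_weight=1.0):
    """Log-linear score: log P(trans) + lm_weight * log P(LM bigram)."""
    tp = TRANS_PROB.get((hindi_word, cand_en), 1e-9)
    lm = BIGRAM_LM.get((prev_en, cand_en), 1e-9)
    return math.log(tp) + lm_weight * math.log(lm)

best = max(["time", "period"], key=lambda c: score("samaya", "the", c))
print(best)   # "time": both the lexicon and the LM prefer it here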

Slide 31: Limited Resources Scenario: Testing Conditions
– SMT system (stand-alone)
– EBMT system (stand-alone)
– XFER system (naïve decoding)
– XFER system with "strong" decoder:
  – no grammar rules (baseline)
  – manually developed grammar rules
  – automatically learned grammar rules
– XFER+SMT with strong decoder (MEMT)

Slide 32: Limited Resources Scenario: Results on JHU-dev7

System                            BLEU    M-BLEU   NIST
EBMT                              0.058   0.165    4.22
SMT                               0.093   -        4.64
XFER (naïve), manual grammar      0.055   0.177    4.46
XFER (strong), no grammar         0.109   0.224    5.29
XFER (strong), manual grammar     0.134   0.242    5.57
XFER (strong), learned grammar    0.116   0.231    5.28
XFER+SMT                          0.117   0.229    5.45

Slide 33: Observations and Lessons
– A strong decoder for the XFER system is essential, even with extremely limited data
– The XFER system with a manual or automatically learned grammar outperforms SMT and EBMT in the extremely limited data scenario
  – where is the cross-over point?
– MEMT based on the strong decoder produced the best results in this scenario
– Reordering within the decoder provided very significant score improvements
  – there is room for more sophisticated grammar rules
– Conclusion: transfer rules (both manual and learned) offer significant contributions that can complement existing data-driven approaches
  – also in medium and large data settings?

