
1 Large Vocabulary Data Driven MT: New Developments in the CMU SMT System
Stephan Vogel, Alex Waibel
Work done in collaboration with: Ying Zhang, Alicia Tribble, Fei Huang, Ashish Venugopal, Bing Zhao, Matthias Eck, Ralf Brown
Language Technologies Institute, Carnegie Mellon University
08/06/2003

2 Outline
07/12/2003 S. Vogel and A. Waibel, The CMU SMT System
- Overview
- The Knowledge Sources
  - LDC Dictionary
  - Phrase Translations
  - Named Entities
- The CMU-SMT Decoder
  - Building the Translation Lattice
  - First-best Search
  - Recombination and Pruning of Hypotheses
- Recent Advances
  - Overlapping Sub-sentential Translations
  - Reordering
- Conclusion

3 The CMU SMT System: An Overview
- Participation in Arabic-English and Chinese-English evaluations
- Using large bilingual corpora to train word-to-word and phrase-to-phrase translations
- Selection of training data which best covers the test data
- Improvements in phrase-to-phrase translations
- Using large monolingual English corpora for the language model
- Online named entity detection and translation
- Initial language model adaptation experiments

4 Improvements 2002 to 2003
- Significant improvement for Chinese
- Dramatic improvement for Arabic

| Language / track | 2002 | 2003 |
|---|---|---|
| Chinese Large | 7.34 | 7.91 |
| Chinese Small | 6.14 | 6.61 |
| Arabic | 5.46 (6.45) | 8.95 |

5 Domain Portability
- Tides evaluations are on news data; translation systems are tuned towards particular test sets
- Research systems outperform COTS systems on specific test data, but lose when tested on other domains
- Example: Medical domain
  - Columns: C-E, E-S, C-S, C-E-S
  - SMT: 1.4, 2.3, 1.1, 0.9
  - SMT with rules: 2.6, 1.7, 1.1
  - Systran: 2.1, 4.0, 1.5

6 Chinese and Arabic Systems: The Data
- Bilingual Dictionary
  - Chinese-English dictionary from LDC; 53K Chinese entries
  - 10K Chinese-English dictionary: most frequent Chinese words from the full dictionary
- Bilingual Corpora
  - Chinese Small Data Track: Treebank data (100k words)
  - Chinese Large Data Track: large UN corpus, FBIS data, Hong Kong News, Sinorama, Xinhua News comparable corpus
  - Arabic: large UN corpus, Ummah, multi-translation corpus
- Monolingual Data
  - 10 years of Xinhua news for language models
  - Target side of bilingual corpora

7 Preprocessing, Cleaning, Postprocessing
- Preprocessing and Cleaning
  - Length balancing: remove sentence pairs that differ in the number of words by a given factor, usually 1.5.
  - Remove sentence pairs with too many non-words.
  - Normalize numbers: convert into ASCII digits; write large numbers in the form '5 thousand'; fix numbers in Arabic UN data.
  - Lower casing for English.
  - Extract subcorpus covering all phrases of any length.
- Postprocessing
  - Remove untranslated words or replace them using a LM to find the most likely word.
  - Mixed-casing based on Named Entity detection and simple rules.
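The length-balancing step can be illustrated with a minimal sketch. It assumes that "differ by a given factor" means the ratio of the longer to the shorter sentence, and that sentences are already tokenized on whitespace (Chinese would need word segmentation first). The function names are hypothetical and not taken from the CMU system.

```python
def length_balanced(src, tgt, max_ratio=1.5):
    """Return True if the longer sentence has at most max_ratio times
    as many words as the shorter one (assumed reading of the slide)."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False  # an empty side can never be balanced
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def filter_corpus(pairs, max_ratio=1.5):
    """Keep only the sentence pairs that pass the length check."""
    return [(s, t) for s, t in pairs if length_balanced(s, t, max_ratio)]
```

A pair of 3 and 2 words is kept at the default factor of 1.5, while a pair of 4 words and 1 word is removed.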

8 Translation Model
- Manual dictionary (for Chinese to English)
- Phrase-to-phrase translations extracted from bilingual corpora using different approaches:
  - HMM Alignments (HMM): Train HMM word alignment and extract phrase pairs from the Viterbi path. Training in both directions.
  - Integrated Segmentation/Alignment (ISA): Assign scores to each source-target word pair according to their point-wise Mutual Information (MI), then extract contiguous areas of high MI as phrase translations.
  - Bilingual Bracketing (BiBr): Use the SBTG approach (D. Wu) to generate a Viterbi alignment. Extract phrase pairs based on bracketing.
- Named Entities from different sources
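The point-wise mutual information score used by ISA can be sketched as follows. This is only an illustration of the scoring idea, assuming probabilities are estimated from sentence-level co-occurrence counts; the slide does not say how the CMU system estimates them, and the extraction of contiguous high-MI areas is not shown. The name `pmi_table` is hypothetical.

```python
import math
from collections import Counter


def pmi_table(sentence_pairs):
    """Map (source word, target word) to log[p(f, e) / (p(f) p(e))],
    with probabilities estimated from sentence-level co-occurrence."""
    n = len(sentence_pairs)
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint_count.update((f, e) for f in src_words for e in tgt_words)
    # p(f, e) = c / n, p(f) = c_f / n, p(e) = c_e / n
    return {
        (f, e): math.log(c * n / (src_count[f] * tgt_count[e]))
        for (f, e), c in joint_count.items()
    }
```

Word pairs that co-occur more often than chance get a positive score, so a matrix of these scores over one sentence pair shows high-MI regions where phrase translations would be cut out.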

9 Integrating Word and Phrase Translation
- Most phrase pairs are seen only once or twice, i.e. no good statistics possible.
- Want to have phrase translation probabilities near word-to-word translation probabilities.
- For HMM and BiBr phrases, assign probabilities for a phrase pair based on word translation probabilities: $\prod_j \left[ \sum_i p(f_j \mid e_i) \right] > \prod_j \left[ \max_i p(f_j \mid e_i) \right]$
- ISA phrase probabilities adjusted to be comparable.
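The two phrase scores in the formula can be computed with a short sketch. It assumes a word lexicon stored as a dictionary from `(f, e)` to p(f | e) and a small `floor` value for unseen word pairs; both are illustrative assumptions, and no normalization is applied because the slide does not give one.

```python
def phrase_score(src_phrase, tgt_phrase, lexicon, floor=1e-7, use_max=False):
    """Product over source words f_j of the sum (or max) over target
    words e_i of p(f_j | e_i), looked up in a word lexicon."""
    combine = max if use_max else sum
    score = 1.0
    for f in src_phrase:
        score *= combine(lexicon.get((f, e), floor) for e in tgt_phrase)
    return score
```

Since every term is non-negative, the sum variant is never smaller than the max variant for the same phrase pair.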

10 Named Entities: Offline
- Resources
  - Parallel/comparable corpus
  - Bilingual dictionary
- Tools
  - NE taggers for both languages
- Multiple features
- NE tagging
- NE transliteration
- NE translation

11 Named Entities: Online
- Translate documents (Chinese to English)
- Search for similar documents in English newswire, using the Lemur Toolkit
- Extract NEs (persons, locations) from retrieved documents
- Compute transliteration cost between:
  - NEs in test data (source language)
  - NEs in relevant documents (target language)
- Match NE pairs with minimum cost; these are added for second-pass translation
- Translate again with augmented NE list
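The minimum-cost matching step can be sketched as below. The slide does not define the transliteration cost, so this example substitutes a length-normalized Levenshtein distance and assumes the source-language names have already been romanized (for example to pinyin). The function names and the `max_cost` threshold are hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def normalized_cost(src, tgt):
    """Stand-in transliteration cost in [0, 1]; NOT the CMU cost model."""
    return edit_distance(src.lower(), tgt.lower()) / max(len(src), len(tgt), 1)


def match_entities(source_nes, target_nes, max_cost=0.5):
    """Pair each source NE with its cheapest target NE, if cheap enough."""
    pairs = {}
    for src in source_nes:
        best = min(target_nes, key=lambda tgt: normalized_cost(src, tgt), default=None)
        if best is not None and normalized_cost(src, best) <= max_cost:
            pairs[src] = best
    return pairs
```

The resulting pairs would play the role of the augmented NE list used in the second translation pass.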

12 Improvement from NE Translation
Test data: 887 sentences

| System | Small track NIST score | Large track NIST score |
|---|---|---|
| Baseline | 6.5665 | 7.8218 |
| +Offline NE | 6.6124 | 7.8733 |
| +Online NE | 6.8054 | 7.9648 |

- Improvement with offline and online approach
- Online approach has lower precision, but higher recall

13 The Decoder
- Build translation lattice
  - Organize all translation pairs as a prefix tree over the source side
  - Run left to right over the test sentence
  - Search for matching phrases
  - For each translation, insert edges into the lattice
- First-best search
  - Run left to right over the lattice
  - Apply trigram language model
  - Combine translation model score and language model score
  - Recombine and prune hypotheses
  - At sentence end: add sentence length model score
  - Trace back best hypothesis
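The lattice-building half of the decoder can be illustrated with a small sketch. It assumes a phrase table given as a dictionary from source phrases to lists of `(translation, score)` tuples and represents the prefix tree with nested dictionaries; these data structures are illustrative choices, not the CMU implementation, and the search over the lattice is not shown.

```python
def build_prefix_tree(phrase_table):
    """Store all source phrases in a prefix tree (trie) keyed by word."""
    root = {}
    for src_phrase, translations in phrase_table.items():
        node = root
        for word in src_phrase.split():
            node = node.setdefault("children", {}).setdefault(word, {})
        node.setdefault("translations", []).extend(translations)
    return root


def build_lattice(sentence, tree):
    """Run left to right over the sentence and add one edge
    (start, end, translation, score) per matching phrase translation."""
    words = sentence.split()
    edges = []
    for start in range(len(words)):
        node = tree
        for end in range(start, len(words)):
            node = node.get("children", {}).get(words[end])
            if node is None:
                break  # no phrase in the table continues with this word
            for target, score in node.get("translations", []):
                edges.append((start, end + 1, target, score))
    return edges
```

Walking the tree from each start position finds every matching phrase without rescanning the phrase table, which is the point of organizing the pairs by source-side prefix.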

14 Overlapping Phrases
- Goal: capture longer phrase translations, even those not seen in the training corpus.
- First implemented and tested in the EBMT system
  - Improvement for 2003 dry run test data: 5.6% NIST, 27% Bleu relative
- For the SMT system
  - Augment the phrase transducers by merging phrase pairs that overlap on both source and target side
  - Overlap of up to 4 words considered
  - Probabilities assigned based on word alignment model
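Merging two overlapping phrase pairs can be sketched as follows. The example assumes that "overlap" means the end of one phrase equals the beginning of the other, on the source and the target side at once, and it only accepts proper overlaps so that the merged phrase is longer than either input. Phrases are tuples of words; the function names are hypothetical.

```python
def overlap(left, right, max_overlap=4):
    """Longest k (1..max_overlap) such that the last k words of `left`
    equal the first k words of `right`; 0 if there is none.
    k is kept below both lengths so the merge yields a longer phrase."""
    for k in range(min(max_overlap, len(left) - 1, len(right) - 1), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_pairs(pair_a, pair_b, max_overlap=4):
    """Merge two (source, target) phrase pairs that overlap on both sides."""
    (src_a, tgt_a), (src_b, tgt_b) = pair_a, pair_b
    k_src = overlap(src_a, src_b, max_overlap)
    k_tgt = overlap(tgt_a, tgt_b, max_overlap)
    if k_src == 0 or k_tgt == 0:
        return None  # no overlap on at least one side
    return src_a + src_b[k_src:], tgt_a + tgt_b[k_tgt:]
```

The merged pair still needs a probability; scoring it from word translation probabilities is one option consistent with the slide, but that step is not shown here.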

15 Overlapping Phrases: Results for Arabic
- Number of newly generated pairs (matching n-grams in test sentences)
  - ISA: orig. 46,641 - new 19,072
  - HMM: orig. 135,003 - new 686,295
- Average length of phrases increases from 1.3 to 1.4

| System | NIST | Bleu |
|---|---|---|
| Baseline | 8.59 | 0.385 |
| Overlapping Phrases | 8.78 (+2.2%) | 0.425 (+10.4%) |

16 Reordering
- Phrase translation realizes limited word reordering
- Goal: allow for additional word and phrase-level reordering during decoding
- Implementation:
  - Reordering window: expand hypotheses with all words and phrases lying or starting within this window
  - Bookkeeping for covered positions
  - Recombine and prune hypotheses
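The reordering window and the coverage bookkeeping can be illustrated with a minimal sketch. It assumes the window is measured from the first uncovered source position and that an edge qualifies when it starts inside the window; the slide does not spell out these details. Coverage is kept as a `frozenset` of source positions, and edges are hypothetical `(start, end, translation)` tuples.

```python
def expansions(covered, edges, window):
    """List the edges that may extend a hypothesis, together with the
    coverage set that results from taking each edge."""
    first_gap = 0
    while first_gap in covered:
        first_gap += 1  # first source position not yet translated
    result = []
    for start, end, target in edges:
        span = frozenset(range(start, end))
        if span & covered:
            continue  # some of these words were already translated
        if first_gap <= start < first_gap + window:
            result.append(((start, end, target), covered | span))
    return result
```

With `window=1` only edges starting at the first uncovered position qualify, which gives monotone decoding; larger windows let a hypothesis skip ahead and fill the gap later.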

17 Recombination and Pruning of Hypotheses
- Hypotheses can be kept apart according to any combination of:
  - Total coverage of source sentence
  - Coverage pattern, when allowing for reordering
  - Language model history
  - Length of generated partial translation
- Pruning of hypotheses based on same features
  - Use fewer features for pruning
  - Prune all hypotheses outside of beam
- Example: keep hypotheses apart based on coverage and LM history, prune based on coverage
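The example on this slide (keep hypotheses apart by coverage and LM history, prune by coverage) can be sketched as below. The sketch assumes scores are log-probabilities, so higher is better, and that the beam is a fixed score margin below the best hypothesis with the same coverage. The `Hypothesis` record and function names are hypothetical.

```python
from collections import namedtuple

# coverage: number of covered source words; lm_history: last target words
Hypothesis = namedtuple("Hypothesis", "coverage lm_history score")


def recombine(hyps):
    """Among hypotheses with equal coverage and LM history keep the best."""
    best = {}
    for h in hyps:
        key = (h.coverage, h.lm_history)
        if key not in best or h.score > best[key].score:
            best[key] = h
    return list(best.values())


def prune(hyps, beam):
    """Drop hypotheses scoring more than `beam` below the best
    hypothesis that has the same coverage."""
    top = {}
    for h in hyps:
        top[h.coverage] = max(top.get(h.coverage, h.score), h.score)
    return [h for h in hyps if h.score >= top[h.coverage] - beam]
```

Recombination uses more features than pruning: two hypotheses with different LM histories stay separate, yet they compete in the same beam when they cover the same number of words.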

18 Reordering: Results
- Continuous improvement with larger window up to 4
- Improvement mainly due to longer matching n-grams
- Improvement for 2003 Arabic test set: 8.95 -> 9.26

| System | NIST | Bleu |
|---|---|---|
| Baseline | 8.59 | 0.385 |
| R4 | 9.02 (+4.8%) | 0.441 (+12.7%) |
| R4 + OP | 9.09 (+5.8%) | 0.455 (+18.2%) |

19 Summary
- Significant improvement from evaluation 2002 to evaluation 2003, esp. for Arabic to English
- Phrase translation has proven to be key in achieving good translation quality
- Adding probabilities to the manual translation lexicon helps
- Recent extensions of the SMT system (overlapping phrases and reordering) have resulted in further improvements

20 Future Research Directions
- LM Adaptation
- Class-based LM
- Discriminative Training
- Confidence measures and n-best list re-ranking

