Large Vocabulary Data Driven MT: New Developments in the CMU SMT System
Stephan Vogel, Alex Waibel
Work done in collaboration with: Ying Zhang, Alicia Tribble, Fei Huang, Ashish Venugopal, Bing Zhao, Matthias Eck, Ralf Brown
Language Technologies Institute, Carnegie Mellon University
08/06/2003

07/12/2003 S. Vogel and A. Waibel, The CMU SMT System
Outline
- Overview
- The Knowledge Sources
  - LDC Dictionary
  - Phrase Translations
  - Named Entities
- The CMU-SMT Decoder
  - Building the Translation Lattice
  - First-best Search
  - Recombination and Pruning of Hypotheses
- Recent Advances
  - Overlapping Sub-sentential Translations
  - Reordering
- Conclusion

The CMU SMT System: An Overview
- Participation in Arabic-English and Chinese-English evaluations
- Using large bilingual corpora to train word-to-word and phrase-to-phrase translations
- Selection of training data which best covers the test data
- Improvements in phrase-to-phrase translations
- Using large monolingual English corpora for the language model
- Online named entity detection and translation
- Initial language model adaptation experiments

Improvements 2002 to 2003
- Significant improvement for Chinese
- Dramatic improvement for Arabic
[Chart labels: Chinese Large, Chinese Small (6.45), Arabic]

Domain Portability
- TIDES evaluations are on news data; translation systems are tuned towards particular test sets
- Research systems outperform COTS on specific test data, but lose when tested on other domains
- Example: Medical domain
[Table: SMT, SMT with rules, and Systran compared on C-E, E-S, C-S, C-E-S]

Chinese and Arabic Systems: The Data
- Bilingual Dictionary
  - Chinese-English Dictionary from LDC; 53K Chinese entries
  - 10K Chinese-English Dictionary, most frequent Chinese words from full dictionary
- Bilingual Corpora
  - Chinese Small Data Track: Treebank data (100k words)
  - Chinese Large Data Track: Large UN corpus, FBIS data, Hongkong News, Sinorama, Xinhua News comparable corpus
  - Arabic: Large UN corpus, Ummah, multi-translation corpus
- Monolingual Data
  - 10 years of Xinhua news for language models
  - Target side of bilingual corpora

Preprocessing, Cleaning, Postprocessing
- Preprocessing and Cleaning
  - Length balancing: remove sentence pairs whose word counts differ by more than a given factor, usually 1.5.
  - Remove sentence pairs with too many non-words.
  - Normalize numbers: convert into ASCII digits; write large numbers in the form '5 thousand'; fix numbers in Arabic UN data.
  - Lower casing for English.
  - Extract subcorpus covering all phrases of any length.
- Postprocessing
  - Remove untranslated words or replace them using a LM to find the most likely word.
  - Mixed-casing based on Named Entity detection and simple rules.
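
The length-balancing step can be illustrated with a small filter. This is only a sketch under stated assumptions: whitespace tokenization, a symmetric longer/shorter ratio test, and the function names `length_balanced` and `filter_corpus` are all hypothetical and not taken from the CMU system.

```python
def length_balanced(src_tokens, tgt_tokens, max_ratio=1.5):
    """Return True if the two sides differ in length by at most max_ratio."""
    if not src_tokens or not tgt_tokens:
        return False  # an empty side can never be balanced
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    return longer <= max_ratio * shorter


def filter_corpus(sentence_pairs, max_ratio=1.5):
    """Keep only sentence pairs that pass the length-balance test.

    sentence_pairs: iterable of (source_string, target_string).
    Tokenization by whitespace is an assumption of this sketch.
    """
    return [
        (src, tgt)
        for src, tgt in sentence_pairs
        if length_balanced(src.split(), tgt.split(), max_ratio)
    ]


corpus = [("a b c", "x y"), ("a b c d", "x y"), ("a", "")]
kept = filter_corpus(corpus)  # only the first pair survives
```

A ratio test in both directions keeps the filter independent of which language tends to be more verbose; a per-language-pair expected ratio would be a natural refinement.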

Translation Model
- Manual dictionary (for Chinese to English)
- Phrase to phrase translations extracted from bilingual corpora using different approaches:
  - HMM Alignments (HMM): Train HMM word alignment and extract phrase pairs from Viterbi path. Training in both directions.
  - Integrated Segmentation/Alignment (ISA): Assign scores to each source-target word pair according to their Point-wise Mutual Information (MI), then extract contiguous areas of high MI as phrase translations.
  - Bilingual Bracketing (BiBr): Use the SBTG approach (D. Wu) to generate Viterbi alignment. Extract phrase pairs based on bracketing.
- Named Entities from different sources
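
The scoring step of ISA rests on point-wise mutual information between source and target words. A minimal sketch of that scoring is given here, assuming sentence-level co-occurrence counts as the probability estimates; the slides do not say how the probabilities were estimated, and `pmi_table` is a hypothetical name. The extraction of contiguous high-MI areas is not shown.

```python
import math
from collections import Counter


def pmi_table(sentence_pairs):
    """Point-wise mutual information for every co-occurring word pair.

    sentence_pairs: list of (source_tokens, target_tokens).
    PMI(s, t) = log( p(s, t) / (p(s) * p(t)) ), with probabilities
    estimated as the fraction of sentence pairs containing the word(s).
    """
    n = len(sentence_pairs)
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_set, tgt_set = set(src), set(tgt)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        joint_count.update((s, t) for s in src_set for t in tgt_set)
    return {
        (s, t): math.log(c * n / (src_count[s] * tgt_count[t]))
        for (s, t), c in joint_count.items()
    }


toy_corpus = [
    (["a", "b"], ["x", "y"]),
    (["a"], ["x"]),
    (["b"], ["y"]),
    (["c"], ["z"]),
]
pmi = pmi_table(toy_corpus)  # ("a", "x") scores higher than ("a", "y")
```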

Integrating Word and Phrase Translation
- Most phrase pairs seen only once or twice, i.e. no good statistics possible.
- Want to have phrase translation probabilities near word to word translation probabilities.
- For HMM and BiBr phrases, assign probabilities for a phrase pair based on word translation probabilities: $\prod_j \left[ \sum_i p(f_j \mid e_i) \right] > \prod_j \left[ \max_i p(f_j \mid e_i) \right]$
- ISA phrase probabilities adjusted to be comparable.
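
The two phrase scores in the formula, the product of sums and the product of maxima over word translation probabilities, can be computed as follows. This is a sketch only: the lexicon is assumed to be a dictionary keyed by (source word, target word) holding p(f | e), unseen pairs are assumed to have probability 0, and both function names are hypothetical.

```python
import math


def phrase_score_sum(src_words, tgt_words, lexicon):
    """Product over source words f_j of the sum over target words e_i of p(f_j | e_i)."""
    return math.prod(
        sum(lexicon.get((f, e), 0.0) for e in tgt_words) for f in src_words
    )


def phrase_score_max(src_words, tgt_words, lexicon):
    """Product over source words f_j of the max over target words e_i of p(f_j | e_i)."""
    return math.prod(
        max(lexicon.get((f, e), 0.0) for e in tgt_words) for f in src_words
    )


# Toy lexicon p(f | e); the values are made up for illustration.
lexicon = {("f1", "e1"): 0.5, ("f1", "e2"): 0.25, ("f2", "e2"): 0.5}
sum_score = phrase_score_sum(["f1", "f2"], ["e1", "e2"], lexicon)  # 0.75 * 0.5
max_score = phrase_score_max(["f1", "f2"], ["e1", "e2"], lexicon)  # 0.5 * 0.5
```

Because every sum is at least as large as its largest term, the sum-based score can never fall below the max-based one.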

Named Entities: Offline
- Resources
  - Parallel/comparable corpus
  - Bilingual dictionary
- Tools
  - NE Taggers for both languages
- Multiple Features
  - NE Tagging
  - NE Transliteration
  - NE Translation

Named Entities: Online
- Translate documents (Chinese to English)
- Search for similar documents in English newswire, using Lemur Toolkit
- Extract NEs (persons, locations) from retrieved documents
- Compute transliteration cost
  - NEs in test data (source language)
  - NEs in relevant documents (target language)
- Match NE pairs with minimum cost, added for second pass translation
- Translate again with augmented NE list
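
The minimum-cost matching step might look like the sketch below. The slides do not define the transliteration cost, so plain character edit distance between a romanized source name and an English candidate is used here purely as a stand-in; the romanization itself, the `max_cost` threshold, and the function names are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (stand-in transliteration cost)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def match_named_entities(source_nes, target_nes, max_cost=2):
    """Pair each romanized source NE with its cheapest target NE.

    Pairs whose cost exceeds max_cost (a hypothetical threshold) are dropped.
    """
    matches = {}
    for src in source_nes:
        costs = [(edit_distance(src.lower(), tgt.lower()), tgt) for tgt in target_nes]
        if not costs:
            continue
        cost, best = min(costs)
        if cost <= max_cost:
            matches[src] = best
    return matches


pairs = match_named_entities(["musharaf", "xyzxyz"], ["Musharraf", "Mubarak"])
```

The matched pairs would then be added to the NE list used for the second translation pass.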

Improvement from NE Translation
- Test data: 887 sentences
- [Table: small track NIST score and large track NIST score for baseline, Offline NE, Online NE]
- Improvement with offline and online approach
- Online approach has lower precision, but higher recall

The Decoder
- Build translation lattice
  - Organize all translation pairs as prefix tree over the source side
  - Run left-right over test sentence
  - Search for matching phrases
  - For each translation, insert edges into the lattice
- First best search
  - Run left-right over lattice
  - Apply trigram language model
  - Combine translation model score and language model score
  - Recombine and prune hypotheses
  - At sentence end: add sentence length model score
  - Trace back best hypothesis
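
The lattice-building half of the decoder can be sketched as a prefix tree lookup. The data layout is an assumption of this sketch: the phrase table maps source word tuples to lists of (target string, score), lattice edges are (start, end, target, score) tuples, and `TrieNode`, `build_prefix_tree` and `build_lattice` are hypothetical names. The first-best search is not shown.

```python
class TrieNode:
    """One node of the prefix tree over source-side phrases."""

    def __init__(self):
        self.children = {}      # next source word -> TrieNode
        self.translations = []  # (target, score) for the phrase ending here


def build_prefix_tree(phrase_table):
    """phrase_table: dict mapping source word tuples to lists of (target, score)."""
    root = TrieNode()
    for src_phrase, translations in phrase_table.items():
        node = root
        for word in src_phrase:
            node = node.children.setdefault(word, TrieNode())
        node.translations.extend(translations)
    return root


def build_lattice(sentence, root):
    """Run left to right over the sentence and emit one edge per matching translation."""
    edges = []
    for start in range(len(sentence)):
        node = root
        for end in range(start, len(sentence)):
            node = node.children.get(sentence[end])
            if node is None:
                break  # no longer phrase starts with this prefix
            for target, score in node.translations:
                edges.append((start, end + 1, target, score))
    return edges


table = {("a",): [("A", 0.5)], ("a", "b"): [("AB", 0.4)], ("c",): [("C", 0.9)]}
lattice = build_lattice(["a", "b", "c"], build_prefix_tree(table))
```

Walking the tree from each start position means all phrases sharing a prefix are matched in one pass, which is the point of organizing the pairs this way.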

Overlapping Phrases
- Goal: capture longer phrase translations, even those not seen in training corpus.
- First implemented and tested in EBMT system
  - Improvement for 2003 dry run test data: 5.6% NIST, 27% Bleu relative
- For SMT system
  - Augment the phrase transducers by merging phrase pairs that overlap on both source and target side
  - Overlap up to 4 words considered
  - Probabilities assigned based on word alignment model
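
Merging two phrase pairs that overlap on both sides could be sketched as below. The slides give no detail on how overlap is detected, so this assumes the simplest case: the end of the first phrase repeats the start of the second, on the source side and on the target side, with up to 4 shared words. The function names are hypothetical, and scoring of the merged pair is left out.

```python
def overlap_len(left, right, max_overlap=4):
    """Length of the longest suffix of left that equals a prefix of right."""
    for k in range(min(max_overlap, len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_pairs(pair_a, pair_b, max_overlap=4):
    """Merge two (source_tuple, target_tuple) phrase pairs if both sides overlap.

    Returns the merged pair, or None when either side has no overlap.
    """
    (src_a, tgt_a), (src_b, tgt_b) = pair_a, pair_b
    k_src = overlap_len(src_a, src_b, max_overlap)
    k_tgt = overlap_len(tgt_a, tgt_b, max_overlap)
    if k_src == 0 or k_tgt == 0:
        return None
    return (src_a + src_b[k_src:], tgt_a + tgt_b[k_tgt:])


first = (("f1", "f2"), ("e1", "e2"))
second = (("f2", "f3"), ("e2", "e3"))
merged = merge_pairs(first, second)
```

The merged pair covers a longer source n-gram than either input, which is how unseen longer phrases become available to the decoder.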

Overlapping Phrases: Results for Arabic
- Number of newly generated pairs (matching n-grams in test sentences)
  - ISA: orig. 46,641 - new 19,072
  - HMM: orig. 135,003 - new 686,295
- Average length of phrases increases from 1.3 to 1.4

                      NIST            Bleu
Baseline
Overlapping Phrases   8.78 (+2.2%)    0.425 (+10.4%)

Reordering
- Phrase translation realizes limited word reordering
- Goal: Allow for additional word and phrase-level reordering during decoding
- Implementation:
  - Reordering window: expand hypotheses with all words and phrases lying or starting within this window
  - Bookkeeping for covered positions
  - Recombine and prune hypotheses
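
The reordering window and the coverage bookkeeping can be illustrated by a function that lists the lattice edges a hypothesis may be extended with. One possible reading is assumed here: the window begins at the first uncovered source position, and an edge qualifies if it starts inside the window and touches no covered position. The edge format (start, end, target) and the name `expandable_edges` are hypothetical.

```python
def expandable_edges(edges, covered, window):
    """Return the edges that may extend a hypothesis with the given coverage.

    edges:   list of (start, end, target), covering source positions start..end-1
    covered: set of source positions already translated
    window:  number of positions, counted from the first gap, where an edge may start
    """
    first_gap = 0
    while first_gap in covered:
        first_gap += 1
    result = []
    for start, end, target in edges:
        if any(pos in covered for pos in range(start, end)):
            continue  # would translate a source word twice
        if first_gap <= start < first_gap + window:
            result.append((start, end, target))
    return result


edges = [(0, 1, "A"), (1, 3, "BC"), (3, 4, "D"), (5, 6, "F")]
narrow = expandable_edges(edges, covered={0}, window=2)
wide = expandable_edges(edges, covered={0}, window=3)
```

With a window of 1 this reduces to monotone decoding; larger windows allow a phrase further to the right to be translated first.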

Recombination and Pruning of Hypotheses
- Hypotheses can be kept apart according to any combination of:
  - Total coverage of source sentence
  - Coverage pattern, when allowing for reordering
  - Language model history
  - Length of generated partial translation
- Pruning of hypotheses based on same features
  - Use fewer features for pruning
  - Prune all hypotheses outside of beam
- Example: Keep hypotheses apart based on coverage and LM history, prune based on coverage
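
The example on this slide, keeping hypotheses apart by coverage and LM history and pruning by coverage alone, can be sketched as follows. Hypotheses are assumed to be dictionaries with a log-probability `score` (higher is better), and the beam is assumed to be a fixed score margin below the best hypothesis in each pruning class; none of these representation choices come from the slides.

```python
def recombine(hypotheses, key):
    """Keep only the best-scoring hypothesis for each recombination key."""
    best = {}
    for hyp in hypotheses:
        k = key(hyp)
        if k not in best or hyp["score"] > best[k]["score"]:
            best[k] = hyp
    return list(best.values())


def prune(hypotheses, key, beam):
    """Drop hypotheses scoring more than `beam` below the best in their class."""
    best_score = {}
    for hyp in hypotheses:
        k = key(hyp)
        best_score[k] = max(best_score.get(k, hyp["score"]), hyp["score"])
    return [h for h in hypotheses if h["score"] >= best_score[key(h)] - beam]


hyps = [
    {"coverage": 2, "lm_history": ("the", "house"), "score": -2.0},
    {"coverage": 2, "lm_history": ("the", "house"), "score": -3.0},
    {"coverage": 2, "lm_history": ("a", "house"), "score": -2.5},
    {"coverage": 2, "lm_history": ("a", "home"), "score": -9.0},
    {"coverage": 3, "lm_history": ("x", "y"), "score": -7.0},
]
# Keep apart on coverage and LM history, then prune on coverage only.
kept_apart = recombine(hyps, key=lambda h: (h["coverage"], h["lm_history"]))
survivors = prune(kept_apart, key=lambda h: h["coverage"], beam=1.0)
```

Using fewer features for pruning than for recombination makes hypotheses with different LM histories compete with each other inside the same beam.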

Reordering: Results
- Continuous improvement with larger window up to 4
- Improvement mainly due to longer matching n-grams
- Improvement for 2003 Arabic test set: > 9.26

            NIST            Bleu
Baseline
R4          9.02 (+4.8%)    0.441 (+12.7%)
R4 + OP     9.09 (+5.8%)    0.455 (+18.2%)

Summary
- Significant improvement from evaluation 2002 to evaluation 2003, esp. for Arabic to English
- Phrase translation has proven to be key in achieving good translation quality
- Adding probabilities to the manual translation lexicon helps
- Recent extensions of the SMT system (overlapping phrases and reordering) have resulted in further improvements

Future Research Directions
- LM Adaptation
- Class-based LM
- Discriminative Training
- Confidence measures and n-best list re-ranking