Direct Translation Approaches: Statistical Machine Translation

Direct Translation Approaches: Statistical Machine Translation
Stephan Vogel, Alicia Tribble
Interactive Systems Lab, Carnegie Mellon University & University of Karlsruhe
Speech-to-Speech Translation Workshop, ESSLLI 2002, Trento, Italy

Overview
- Translation Approaches
- Statistical Machine Translation
- Translating with Cascaded Transducers
- Experiments on Nespole Data

16 July 2002, Speech-to-Speech Translation Workshop, ESSLLI, Trento, Italy

Translation Approaches
- Interlingua-based
- Transfer-based
- Direct
- Example-based
- Statistical

Statistical Machine Translation
Based on Bayes' decision rule:

    ê = argmax_e { p(e | f) } = argmax_e { p(e) p(f | e) }
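The decision rule above can be sketched in a few lines: pick the target sentence e that maximizes p(e) p(f | e), working in log space. All names and probabilities below are made-up illustrations, not trained values.

```python
import math

# Toy sketch of the noisy-channel decision rule: choose the candidate e
# maximizing log p(e) + log p(f | e). Probabilities are invented.

def best_translation(f, candidates, lm_logprob, tm_logprob):
    """Return the candidate e maximizing log p(e) + log p(f | e)."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))

# Hypothetical language-model and translation-model scores for "das haus":
lm = {"the house": math.log(0.010), "house the": math.log(0.0001)}
tm = {("das haus", "the house"): math.log(0.2),
      ("das haus", "house the"): math.log(0.2)}

e_hat = best_translation("das haus", ["the house", "house the"],
                         lambda e: lm[e], lambda f, e: tm[(f, e)])
```

Here the translation model scores both word orders equally, and the language model p(e) is what prefers the fluent order, which is exactly the division of labor the decision rule encodes.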

Tasks in SMT
- Modelling: build statistical models which capture characteristic features of translation equivalences and of the target language
- Training: train the translation model on a bilingual corpus, train the language model on a monolingual corpus
- Decoding: find the best translation for new sentences according to the models

Alignment Example

Translation Models
- IBM1: lexical probabilities only
- IBM2: lexicon plus absolute positions
- HMM: lexicon plus relative positions
- IBM3: plus fertilities
- IBM4: inverted relative position alignment
- IBM5: non-deficient version of Model 4
[Brown et al. 1993; Vogel et al. 1996]
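The simplest of these, IBM Model 1, can be trained with a short EM loop over expected word-pair counts, following Brown et al. 1993. The three-sentence corpus below is a toy illustration, not the workshop data.

```python
from collections import defaultdict

# IBM Model 1 EM training sketch: lexical probabilities t(f | e) only.
corpus = [("das haus".split(), "the house".split()),
          ("das buch".split(), "the book".split()),
          ("ein buch".split(), "a book".split())]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}
# uniform initialization of t(f | e)
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for _ in range(10):                          # EM iterations
    count = defaultdict(float)               # expected counts c(f, e)
    total = defaultdict(float)               # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)   # normalize over alignments
            for e in es:
                c = t[(f, e)] / z            # E-step: expected count
                count[(f, e)] += c
                total[e] += c
    for f, e in t:                           # M-step: re-estimate t
        if total[e] > 0:
            t[(f, e)] = count[(f, e)] / total[e]
```

After a few iterations the co-occurrence statistics disambiguate the pairs: "haus" becomes the preferred translation of "house" even though "das" co-occurs with it equally often, because "das" is better explained by "the".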

HMM Alignment Model

    p(f | e) = Σ_a p(f_1^J, a_1^J | e_1^I)
             = Σ_a Π_j p(f_j, a_j | f_1^{j-1}, a_1^{j-1}, e_1^I)
             = Σ_a Π_j p(a_j | a_{j-1}) p(f_j | e_{a(j)})
             ≈ max_a Π_j p(a_j | a_{j-1}) p(f_j | e_{a(j)})

The alignment a_j of the current word f_j depends on the alignment a_{j-1} of the previous word f_{j-1}.
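The max in the last line is a Viterbi search over alignments, which a small dynamic program can compute. The jump and lexicon probabilities in the usage example are invented stand-ins, not trained parameters.

```python
import math

# Viterbi approximation of the HMM alignment model: find a_1..a_J
# maximizing prod_j p(a_j | a_{j-1}) * p(f_j | e_{a_j}).

def viterbi_alignment(f_words, e_words, lex, jump):
    J, I = len(f_words), len(e_words)
    # delta[j][i]: best log score for f_1..f_j with a_j = i
    delta = [[-math.inf] * I for _ in range(J)]
    back = [[0] * I for _ in range(J)]
    for i in range(I):
        delta[0][i] = math.log(lex(f_words[0], e_words[i]))
    for j in range(1, J):
        for i in range(I):
            prev = max(range(I),
                       key=lambda k: delta[j-1][k] + math.log(jump(i - k)))
            delta[j][i] = (delta[j-1][prev] + math.log(jump(i - prev))
                           + math.log(lex(f_words[j], e_words[i])))
            back[j][i] = prev
    # trace back the best path
    a = [max(range(I), key=lambda i: delta[J-1][i])]
    for j in range(J - 1, 1 - 1, -1):
        if j > 0:
            a.append(back[j][a[-1]])
    return list(reversed(a))

# Invented lexicon and jump distributions favoring the diagonal:
pairs = {("das", "the"), ("haus", "house"), ("ist", "is"), ("klein", "small")}
lex = lambda f, e: 0.9 if (f, e) in pairs else 0.01
jump = lambda d: {0: 0.2, 1: 0.6, -1: 0.1}.get(d, 0.05)
a = viterbi_alignment("das haus ist klein".split(),
                      "the house is small".split(), lex, jump)
```

With a monotone sentence pair the best path follows the diagonal, reflecting the first-order dependence of a_j on a_{j-1}.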

Phrase Translation
Why?
- To capture context
- Local word reordering
How?
- Train an alignment model
- Extract phrase-to-phrase translations from the Viterbi path
Notes:
- Often better results when training target-to-source for the extraction of phrase translations
- Phrases are not fully integrated into the alignment model; they are extracted only after training is completed
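One common way to realize the extraction step is the consistency criterion over alignment links: a source/target span pair is kept if its links stay inside the rectangle. This is a sketch of that general idea, not necessarily the exact procedure of this system; the links stand in for a Viterbi path and the sentence pair is a toy example.

```python
# Phrase-pair extraction from a word alignment using the consistency
# criterion: keep (f-span, e-span) iff it contains at least one link and
# no link crosses the span boundary in only one dimension.

def extract_phrases(f_words, e_words, links, max_len=3):
    pairs = set()
    for f1 in range(len(f_words)):
        for f2 in range(f1, min(f1 + max_len, len(f_words))):
            for e1 in range(len(e_words)):
                for e2 in range(e1, min(e1 + max_len, len(e_words))):
                    inside = [(fi, ei) for fi, ei in links
                              if f1 <= fi <= f2 and e1 <= ei <= e2]
                    crossing = [(fi, ei) for fi, ei in links
                                if (f1 <= fi <= f2) != (e1 <= ei <= e2)]
                    if inside and not crossing:
                        pairs.add((" ".join(f_words[f1:f2 + 1]),
                                   " ".join(e_words[e1:e2 + 1])))
    return pairs

phrases = extract_phrases("das haus".split(), "the house".split(),
                          links={(0, 0), (1, 1)})
```

For the two-word example this yields the two single-word pairs plus the full phrase pair, while inconsistent pairs such as ("das", "house") are rejected.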

Translation with Transducers
Finite state machine:
- Reads a sequence of words, writes a sequence of words
- Output vocabulary can be different from the input vocabulary
Transducer used in the current implementation:
- Tree transducer, i.e. a prefix tree over input strings
- Output from final states
- Used to encode the lexicon, phrase translations, bilingual word classes, and grammars
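A minimal sketch of such a prefix-tree transducer, assuming greedy longest-match application and a tiny invented phrase table (the real system composes several transducers and keeps alternatives in a lattice rather than committing greedily):

```python
# Prefix-tree transducer sketch: a trie over source phrases, with the
# output string stored at final states; applied longest-match first.

class TrieTransducer:
    def __init__(self, phrase_table):
        self.root = {}
        for src, tgt in phrase_table.items():
            node = self.root
            for w in src.split():
                node = node.setdefault(w, {})
            node["__out__"] = tgt  # final state carries the output

    def transduce(self, words):
        out, i = [], 0
        while i < len(words):
            node, best, j = self.root, None, i
            while j < len(words) and words[j] in node:
                node = node[words[j]]
                j += 1
                if "__out__" in node:       # longest final state seen so far
                    best = (j, node["__out__"])
            if best:
                i, tgt = best
                out.append(tgt)
            else:
                out.append(words[i])        # pass unknown words through
                i += 1
        return " ".join(out)

t = TrieTransducer({"das": "the", "das haus": "the house",
                    "ist klein": "is small"})
result = t.transduce("das haus ist klein".split())
```

Because "das haus" reaches a deeper final state than "das", the phrase translation wins over the single-word entry, which is the point of the prefix-tree organization.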

Cascaded Transducers
Generalization through cascaded transducers: replace words by category labels and have a transducer for each category. [Vogel, Ney 2000]

Language Model
Standard n-gram model:

    p(w_1 ... w_n) = Π_i p(w_i | w_1 ... w_{i-1})
                   ≈ Π_i p(w_i | w_{i-2}, w_{i-1})   (trigram)
                   ≈ Π_i p(w_i | w_{i-1})            (bigram)

Many events are not seen, so smoothing is required.
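The bigram case with add-one smoothing, one of the simplest of the many smoothing schemes the slide alludes to; the training sentences are made up:

```python
from collections import Counter

# Bigram language model with add-one (Laplace) smoothing.

def train_bigram(sentences):
    uni, bi = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        uni.update(words)
        bi.update(zip(words, words[1:]))
    vocab = len(uni)
    def p(w, prev):
        # add-one smoothed p(w | prev): unseen bigrams get a small
        # nonzero probability instead of zero
        return (bi[(prev, w)] + 1) / (uni[prev] + vocab)
    return p

p = train_bigram(["the house is small", "the house is big"])
```

A bigram seen in training ("the house") now scores higher than an unseen one ("the small"), but the unseen one no longer gets probability zero, which is what the smoothing buys.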

Decoding Strategies
Sequential construction of the target sentence:
- Extend a partial translation by words which are translations of words in the source sentence
- The language model can be applied immediately
- A mechanism is required to ensure proper coverage of the source sentence
Left-to-right over the source sentence:
- Find translations for sequences of words
- Construct a translation lattice
- Apply the language model and select the best path
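The second strategy can be sketched as a dynamic program over source positions: phrase options form lattice edges, and the best-scoring path is kept at each position. The options and scores are invented, and the language-model cost is left as a trivial stub.

```python
import math

# Best-path search over a translation lattice built left-to-right over
# the source sentence. options maps a start position to a list of
# (end position, target phrase, log probability) edges.

def decode(n_words, options, lm_cost):
    best = {0: (0.0, [])}  # source position -> (score, partial translation)
    for start in range(n_words):
        if start not in best:
            continue
        score, trans = best[start]
        for end, phrase, lp in options.get(start, []):
            s = score + lp + lm_cost(trans, phrase)
            if end not in best or s > best[end][0]:
                best[end] = (s, trans + [phrase])   # keep the best path
    return " ".join(best[n_words][1])

# Invented lattice for a 2-word source: a phrase edge competes with
# two single-word edges; the LM cost is a stub returning 0.
opts = {0: [(1, "the", math.log(0.5)), (2, "the house", math.log(0.6))],
        1: [(2, "house", math.log(0.5))]}
result = decode(2, opts, lambda partial, phrase: 0.0)
```

The phrase edge wins here (log 0.6 beats log 0.5 + log 0.5); with a real LM cost, the same recursion also applies the language model along each path, as the slide describes.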

Translation Graph

Speech Recognition and Translation
Search for the best string in the target language for a given acoustic signal x in the source language:

    ê = argmax_e { p(e) p(x | e) }
      = argmax_e { p(e) Σ_f p(f, x | e) }
      = argmax_e { p(e) Σ_f p(f | e) p(x | f, e) }
      ≈ argmax_e { p(e) Σ_f p(f | e) p(x | f) }

i.e. the recognizer language model p(f) is not needed!? [Ney, 2001]

Coupling Recognition and Translation
Sequential: first recognition, then translation
- First-best recognition hypothesis
- N-best list: translate n times
- Word lattice: translate all paths in the lattice, reusing results from partial paths
Integrated: recognition and translation in a combined search
- The subsequential transducer approach uses this
Note: in the Eutrans project, the best results were obtained when translating the first-best hypothesis.

Example-Based Machine Translation
Re-use translations to create new translations:
- Store a bilingual corpus with (partial) alignment
- Find partial matches, i.e. sequences of words in the stored corpus that cover a new sentence
- Extract translation(s) and build a translation lattice
- Apply the language model to find the best path, i.e. the best translation

Nespole Experiments
- Application of direct translation techniques to dialogue data collected in Nespole!
- Testing the effect of phrase translation
- Experiments with additional knowledge sources:
  - Pre-existing: monolingual data for the LM and publicly available lexica
  - Engineered: handwritten rules for fixed expressions and knowledge extracted from semantic grammars

Nespole Project Data
- CMU database of dialogues in the travel domain
- German, English (Italian, French)
- Speech recognizer hypotheses and human transcriptions both available
- Segmented into SDUs (Speech Dialogue Units)

This is a description of a set of experiments in speech translation, specifically on Nespole!, using a direct approach, i.e. the statistical approach with cascaded transducers described by Stephan earlier.

Nespole Corpus: Training
3182 parallel SDUs

Language     English  German
Tokens       15572    14992
Vocabulary   1032     1338
Singletons   404      620

50% of the German and 40% of the English vocabulary are singletons. This indicates that we are far from the end of our Zipfian type-token curve, i.e. far from saturation, and we can expect many unknown words in the test data.

Nespole Corpus: Testing
70 parallel SDUs

             German        Reference A  Reference B
Tokens       437           610          607
Vocabulary   183 (45 OOV)  165          160

The German statistics are for the transcribed data, but the speech-recognizer data is very similar in these statistics.

Corpus Challenges: Sentence Length
(Average sentence lengths for the training and testing data were shown here; the figures are not preserved in the transcript.)

These are averages over the two reference translators, but their average lengths agreed very closely. However, we did not tune for this length difference, which is something of a disadvantage for our system; we did not do so because the training data did not tell us to.

Evaluation
Human scoring:
- Good / Okay / Bad (cf. the Nespole evaluation)
- Collapsed into a "human score" on [0,1]
Bleu score:
- Average of n-gram precisions for n = 1..N, typically N = 3 or 4
- A penalty for short translations substitutes for a recall measure
[Papineni et al. 2001]
Note: automatic evaluation is good because it is cheap and reproducible; human evaluation is good only if it is consistent.
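A sketch of such a score against a single reference, with clipped n-gram counts and the brevity penalty. This is a simplified sentence-level version; the Bleu of Papineni et al. accumulates n-gram counts over a whole test set, uses N = 4 by default, and supports multiple references.

```python
import math
from collections import Counter

# Simplified single-reference Bleu: geometric mean of clipped n-gram
# precisions for n = 1..N, times a brevity penalty for short hypotheses.

def bleu(hyp, ref, N=3):
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, N + 1):
        h = Counter(zip(*[hyp[i:] for i in range(n)]))  # hyp n-grams
        r = Counter(zip(*[ref[i:] for i in range(n)]))  # ref n-grams
        match = sum(min(c, r[g]) for g, c in h.items())  # clipped counts
        total = max(sum(h.values()), 1)
        # floor the precision to avoid log(0) on this toy scale
        log_prec += math.log(max(match, 1e-9) / total) / N
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0; a hypothesis that is a correct but short prefix of the reference is pulled down by the brevity penalty, which is the recall substitute the slide mentions.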

Phrase Translation
Unequal sentence lengths mean that training can be improved directionally: S→T or T→S. German compounds are better suited to one-to-many alignments with English multiword phrases, so the direction is important.

Configuration                                             Bleu
Statistical lexicon alone                                 0.1903
Statistical lexicon, phrases from S→T training            0.2350
Statistical lexicon, phrases from bidirectional training  0.2654

As Stephan pointed out earlier, training in different directions can affect the phrase translation quality. The restriction of the alignment model is one-to-many but not many-to-one in training, so we need to use these morphological differences wisely. (Q: what about when we have to translate from English to German? Then we can also use the German-to-English model, but we do not have to go to the additional trouble of flipping the transducer.) Note: unequal sentence lengths are a product, or at least an indicator, of a mismatch between the amount of morphology in the two languages.

Language Model
- Monolingual text available from Verbmobil
- 500,000 words (32× the size of the original English corpus)
- Helps to choose among translation hypotheses but will not generate new ones

Configuration                                                            Bleu
Stat. lexicon, phrases, fixed-expression rules, gen. lexicon, small LM   0.2613
Stat. lexicon, phrases, fixed-expression rules, gen. lexicon, large LM   0.3172

General-Purpose Lexicon

Configuration                                                              Bleu
Statistical lexicon, phrases, and fixed expressions, small LM              0.2654
Adding the general-purpose lexicon as a transducer                         0.2522
Using the large instead of the small LM                                    0.3141
General-purpose lexicon as training data instead of a separate transducer  0.3275

Lexicon size is 160,000 words, ~15× the original English training data.
Why do results fall with additional knowledge? They don't always; even adding the lexicon as a transducer gave small improvements in some configurations. Adding it to the training data allows the common words in the non-statistical lexicon to be given probabilities commensurate with their frequencies in the corpus. Adding it as a separate transducer throws these weights off for the alignment model and causes translation quality to fall as a result. It gives better results in combination with the large LM than with the small one, and better results when combined with the training data than when not, both because of the probabilities as stated above.

Fixed Expression Rules
- Transducer rules are human-readable and can be added by hand
- Fixed expressions for times and dates are re-usable, require less time to build than domain-specific rules, and improve coverage of some semi-idiomatic constructions

Configuration                                                   Bleu
Statistical lexicon, small LM                                   0.1893
Statistical lexicon and fixed-expression transducer, small LM   0.1903

The format of a transducer rule is similar to a rule in a probabilistic context-free grammar:
    @LABEL # LHS # RHS # prob    (i.e. LHS -> RHS with probability prob)

There is a trade-off of human effort vs. reusability. Note that these parts are not like interlingua grammar rules, because they are language-pair specific, but they are domain-generic (probably by design). The improvement in vocabulary coverage is not great, but it helps with word reordering. These rules can be weighted above the phrase or statistical-lexicon transducers in order to let these expressions show up in the final translation output.

Knowledge from Existing Grammars
- Could help in domain portability, but not language portability
- Benefit mostly in additional vocabulary

Configuration                                                               Bleu
Statistical lexicon, fixed expressions, phrases, general lexicon, large LM  0.3141
The same, plus the I-transducer, large LM                                   0.3172

*** Experimental; requires more testing and development. ***
Portability: domain portability as opposed to language portability, because this has to be specific to a given language pair, like all parts of the statistical system; direct systems have to be retrained for new languages anyway.
Vocabulary expansion: we treat the grammars as another domain-specific language source, but they are not exactly parallel; the best we can do is guess at what means the same thing and add entries to our lexicon this way. (Q: why didn't we try adding these entries to the training data as we did with the additional lexicon, since we already argued that this works better? We were in the mindset that we could eventually get a deeper structure out of the grammars, so we were reserving a transducer for that.)
This allows us to take advantage of generalization by the grammar writers: have they introduced new vocabulary (synonyms, etc.) beyond the forms in the training data? Not a lot: we checked the additional vocabulary coverage gained on our test data by adding the grammar vocabulary, and it was only 8 types. But this is out of 45 unknowns, so a fifth of the unknowns were recovered for this test set.

Comparative Evaluation Results

                Good  Okay  Bad  Score  Bleu
Text    IF       77   104   227  0.32   0.068
        SMT     127    80   205  0.40   0.333
Speech  IF       64   101   243  0.28   0.059
        SMT      95    83    -   0.34   0.262

Automatic evaluation is good for measuring the effect of these small changes, but we need human evaluation to determine how good the translation actually is, and as a reality check on comparative results. The human evaluation here shows that the Bleu scores were too unequal, but it did give higher scores to the direct approaches on both text and speech. Inter-coder agreement was very good.
Note: what about the fact that we included grammar knowledge and then compared against the grammars alone? One could consider the final SMT system a sort of hybrid, but the scores for adding grammar knowledge show that it did not improve the SMT system that much; also, we are not including the grammars themselves but only extracting translation pairs. In the future we would like to see whether we could extract better structure from grammars. This is applicable when a new system is being built and old 'legacy' grammars are around without the infrastructure to support them; the work that went into building them would still be usable by simply extracting useful information and adding it to a direct system.

Selected References

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), pp. 263-311, 1993.

Stephan Vogel, Hermann Ney, Christoph Tillmann. HMM-Based Word Alignment in Statistical Translation. Int. Conf. on Computational Linguistics (COLING), Copenhagen, Denmark, pp. 836-841, August 1996.

Stephan Vogel, Hermann Ney. Translation with Cascaded Finite State Transducers. 38th Annual Meeting of the Association for Computational Linguistics, pp. 23-30, Hong Kong, China, October 2000.

Stephan Vogel, Alicia Tribble. Improving Statistical Machine Translation for a Speech-to-Speech Translation Task. To appear in ICSLP 2002.

H. Ney. The Statistical Approach to Spoken Language Translation. Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Madonna di Campiglio, Trento, Italy, December 2001.

Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. Bleu: a Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022), September 17, 2001.