Statistical Machine Translation SMT – Basic Ideas

Presentation transcript:

Statistical Machine Translation
SMT – Basic Ideas
Stephan Vogel
MT Class, Spring Semester 2011
Stephan Vogel - Machine Translation

Overview
- Deciphering foreign text - an example
- Principles of SMT
- Data processing

Deciphering Example
Apinaye - English
- Apinaye belongs to the Ge family of Brazil
- Spoken by 800 people (according to SIL, 1994)
  http://www.ethnologue.com/show_family.asp?subid=90784
  http://www.language-museum.com/a/apinaye.php
- Example from Linguistic Olympics 2008, see http://www.naclo.cs.cmu.edu
Parallel corpus (some characters adapted):
  1. Kukre kokoi             The monkey eats
  2. Ape kra                 The child works
  3. Ape kokoi rats          The big monkey works
  4. Ape mi mets             The good man works
  5. Ape mets kra            The child works well
  6. Ape punui mi pinjets    The old man works badly
Can we translate a new sentence?

Deciphering Example
Parallel corpus (some characters adapted): the six sentence pairs above.
Can we build a lexicon from these sentence pairs?
Observations:
- Apinaye: Kukre (1), Ape (5); English: The (6), works (5)
- Aha! -> first guess: Ape - works
- monkey in 1, 3; child in 2, 5; man in 4, 6
- Different distributions over the corpus: do we find words with similar distributions on the Apinaye side?

… Vocabularies
Corpus: the six sentence pairs above.
Vocabularies:
  Apinaye    English
  kukre      The
  kokoi      monkey
  ape        eats
  kra        child
  rats       works
  mi         big
  mets       good
  punui      man
  pinjets    well
             old
             badly
Observations: 9 Apinaye words, 11 English words
Expectations:
- English words without translation?
- Apinaye words corresponding to more than one English word?

… Word Frequencies
Corpus: the six sentence pairs above.
Vocabularies, with frequencies (counted over the six sentence pairs):
  Apinaye   Freq    English   Freq
  kukre     1       The       6
  kokoi     2       monkey    2
  ape       5       eats      1
  kra       2       child     2
  rats      1       works     5
  mi        2       big       1
  mets      2       good      1
  punui     1       man       2
  pinjets   1       well      1
                    old       1
                    badly     1
Suggestions:
- 'ape' (5) could align to 'The' (6) or 'works' (5)
- More likely that the content word 'works' has a match, i.e. 'ape' = 'works'
- Other word pairs are difficult to predict: too many similar frequencies
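The frequency counts on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the `corpus` list simply retypes the six sentence pairs, and lowercasing every word is an assumption made here so that 'Ape' and 'ape' count as the same type.

```python
from collections import Counter

# The six sentence pairs from the slides (Apinaye, English).
corpus = [
    ("Kukre kokoi", "The monkey eats"),
    ("Ape kra", "The child works"),
    ("Ape kokoi rats", "The big monkey works"),
    ("Ape mi mets", "The good man works"),
    ("Ape mets kra", "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]

def word_frequencies(sentences):
    """Count how often each (lowercased) word occurs across the sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts

ap_freq = word_frequencies(ap for ap, _ in corpus)
en_freq = word_frequencies(en for _, en in corpus)

# Print both vocabularies, most frequent words first.
for name, freq in (("Apinaye", ap_freq), ("English", en_freq)):
    print(name, len(freq), "types:", freq.most_common())
```

Frequencies alone only single out 'ape' and 'works'; the remaining words need the location information on the next slide.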

… Location in Corpus
Corpus: the six sentence pairs above.
Vocabularies, with occurrences (sentence numbers):
  Apinaye   Sentences     English   Sentences
  kukre     1             The       1 2 3 4 5 6
  kokoi     1 3           monkey    1 3
  ape       2 3 4 5 6     eats      1
  kra       2 5           child     2 5
  rats      3             works     2 3 4 5 6
  mi        4 6           big       3
  mets      4 5           good      4
  punui     6             man       4 6
  pinjets   6             well      5
                          old       6
                          badly     6
Observations:
- Same sentences: 'kukre' - 'eats', 'kokoi' - 'monkey', 'ape' - 'works', 'kra' - 'child', 'rats' - 'big', 'mi' - 'man'
- 'mets' (4 and 5) =? 'good' (4) and 'well' (5); makes sense
- 'punui' and 'pinjets' match 'old' and 'badly': which is which?
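The "same sentences" argument can be sketched as a set comparison: a word pair is a lexicon candidate when both words occur in exactly the same sentence numbers. The function and variable names below are illustrative, not taken from the lecture, and exact set equality is a deliberately crude criterion that only works on a corpus this small.

```python
from collections import defaultdict

# The six sentence pairs from the slides (Apinaye, English).
corpus = [
    ("Kukre kokoi", "The monkey eats"),
    ("Ape kra", "The child works"),
    ("Ape kokoi rats", "The big monkey works"),
    ("Ape mi mets", "The good man works"),
    ("Ape mets kra", "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]

def occurrences(sentences):
    """Map each (lowercased) word to the set of sentence numbers containing it."""
    occ = defaultdict(set)
    for number, sentence in enumerate(sentences, start=1):
        for word in sentence.lower().split():
            occ[word].add(number)
    return occ

ap_occ = occurrences(ap for ap, _ in corpus)
en_occ = occurrences(en for _, en in corpus)

# Candidate translations: English words seen in exactly the same sentences.
candidates = {
    ap_word: sorted(en for en, sents in en_occ.items() if sents == ap_sents)
    for ap_word, ap_sents in ap_occ.items()
}

for ap_word, en_words in candidates.items():
    print(ap_word, "->", en_words)
```

On this corpus the comparison leaves 'mets' without an exact match and gives both 'punui' and 'pinjets' the same two candidates, which is the ambiguity the slide points out.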

… Location in Sentence
  Apinaye                 English                    Alignment EN - AP
  Kukre kokoi             The monkey eats            1-0 2-2 3-1
  Ape kra                 The child works
  Ape kokoi rats          The big monkey works       1-0 2-3 3-2 4-1
  Ape mi mets             The good man works
  Ape mets kra            The child works well       1-0 2-3 3-1 4-2
  Ape punui mi pinjets    The old man works badly    1-0 2-??? 3-3 4-1 5-???
Observations:
- First English word ('The') does not align; we say it aligns to the NULL word
- Apinaye verb in first position
- English last word aligns to 1st or 2nd position
- English -> Apinaye: reverse word order (not strictly in sentence pair 5)
Hypothesis:
- Alignment for the last sentence pair is 1-0 2-4 3-3 4-1 5-2
- I.e. 'pinjets' - 'old' and 'punui' - 'badly'
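To make the alignment notation concrete, the sketch below expands a string such as "1-0 2-2 3-1" into word pairs. It assumes the convention used in the table: each link is "English position - Apinaye position", positions are 1-based, and position 0 stands for the NULL word. The helper name `aligned_pairs` is illustrative only.

```python
def aligned_pairs(apinaye, english, alignment):
    """Expand 'EN-AP' position links into (English word, Apinaye word) pairs.

    Positions are 1-based; Apinaye position 0 denotes the NULL word.
    """
    ap_words = apinaye.split()
    en_words = english.split()
    pairs = []
    for link in alignment.split():
        en_pos, ap_pos = (int(p) for p in link.split("-"))
        ap_word = ap_words[ap_pos - 1] if ap_pos > 0 else "NULL"
        pairs.append((en_words[en_pos - 1], ap_word))
    return pairs

# The hypothesised alignment for the last sentence pair.
pairs = aligned_pairs(
    "Ape punui mi pinjets",
    "The old man works badly",
    "1-0 2-4 3-3 4-1 5-2",
)
for en_word, ap_word in pairs:
    print(en_word, "-", ap_word)
```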

… POS Information
  Apinaye                 POS            English                    POS
  Kukre kokoi             V N            The monkey eats            Det N V
  Ape kra                                The child works            Det N V
  Ape kokoi rats          V N Adj        The big monkey works       Det Adj N V
  Ape mi mets                            The good man works
  Ape mets kra            V Adv N        The child works well       Det N V Adv
  Ape punui mi pinjets    V ??? N ???    The old man works badly    Det Adj N V Adv
Observations:
- English determiner ('The') does not align; perhaps no determiners in Apinaye
- English Verb Adverb -> Apinaye: Verb Adverb -> no reordering
- English Adjective Noun -> Apinaye: Noun Adjective -> reordering
Hypothesis:
- 'pinjets' is Adj to make it N Adj, 'punui' is Adv (consistent with alignment hypothesis)

Translate New Sentences: Ap - En
Source sentence: Ape rats mi mets
- Lexical information: works big man good/well
- Reordering information: The good man works big
- Better lexical choice: The good man works hard
- Compare: Ape mi mets -> The good man works
Source sentence: Kukre rats kokoi punui
- Lexical information: eats big monkey badly
- Reordering information: The bad monkey eats big
- Better lexical choice: The bad monkey eats a lot
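The two steps "lexical information" and "reordering information" can be imitated with a toy rule-based translator. Everything here is an assumption read off the earlier slides rather than the lecture's method: the `LEXICON` uses the hypothesised word pairs, `NOUNS` lists the three nouns, and the single reordering rule is "Apinaye V (Adv) N (Adj) becomes English The (Adj) N V (Adv)". The "better lexical choice" step is not modelled.

```python
# Hypothesised lexicon from the deciphering example (one gloss per word).
LEXICON = {
    "ape": "works", "kukre": "eats",
    "kokoi": "monkey", "kra": "child", "mi": "man",
    "rats": "big", "mets": "good", "pinjets": "old", "punui": "badly",
}
NOUNS = {"kokoi", "kra", "mi"}

def translate(sentence):
    """Translate Apinaye V (Adv) N (Adj) into English The (Adj) N V (Adv)."""
    words = sentence.lower().split()
    verb, rest = words[0], words[1:]
    # Words between verb and noun act as adverbs, words after the noun as adjectives.
    noun_index = next(i for i, w in enumerate(rest) if w in NOUNS)
    adverbs = rest[:noun_index]
    noun = rest[noun_index]
    adjectives = rest[noun_index + 1:]
    english = (
        ["The"]
        + [LEXICON[w] for w in adjectives]
        + [LEXICON[noun], LEXICON[verb]]
        + [LEXICON[w] for w in adverbs]
    )
    return " ".join(english)

print(translate("Ape rats mi mets"))
print(translate("Ape mi mets"))
```

The first call yields the literal "works big" reading from the slide; turning that into "works hard" needs knowledge about fluent English, which is the job of the language model introduced later.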

Translate New Sentences: En - Ap
Source sentence: The old monkey eats a lot
- Lexical information: NULL pinjets kokoi kukre rats
- Reordering information: kukre rats kokoi pinjets
Or:
- Deleting words: old monkey eats a lot
- Rephrase: old monkey eats big
- Reorder: eats big monkey old
- Lexical information: kukre rats kokoi pinjets
Source sentence: The big child works a long time
- Delete plus rephrase: big child works big
- Reorder: works big child big
- Lexical information: Ape rats kra rats

Overview
- Deciphering foreign text - an example
- Principles of SMT
- Data processing

Principles of SMT
- We will use the same approach: learning from data
  - Build translation models using frequency, co-occurrence, word position, etc. information
  - Use the models to translate new sentences
- Not manually, but fully automatically
  - The training is done automatically
  - There is still lots of manual work left: designing models, preparing data, running experiments, etc.

Machine Translation Approaches
- Grammar-based
- Interlingua-based
- Transfer-based
- Direct
- Example-based
- Statistical

Statistical Approach
Using statistical models:
- Create many alternatives; we call them hypotheses
- Give a score to each hypothesis, based on statistical models
- Select the best -> search problem
Advantages:
- Avoid hard decisions
- Sometimes, optimality can be guaranteed
- Speed can be traded against quality; not all-or-nothing
- It works better!
Disadvantages:
- Difficulties in handling structurally rich models, mathematically and computationally (but that's also true for non-statistical systems)
- Need data to train the model parameters

Statistical versus Grammar-Based
Often statistical and grammar-based MT are seen as alternatives, even opposing approaches: wrong!!!
Dichotomies are:
- Use probabilities || everything is equally likely, yes/no decision
- Rich (deep) structure || no or only flat structure
Both dimensions are continuous.
Examples:
- EBMT: no/little structure and heuristics
- SMT: (initially only) flat structure and probabilities
- XFER: deep(er) structure and heuristics
Goal: structurally rich probabilistic models
- statXFER: deep structure and probabilities
- Syntax-augmented SMT: deep structure and probabilities
                   No Probs             Probs
  Flat Structure   EBMT                 SMT
  Deep Structure   XFER, Interlingua    Holy Grail

Statistical Machine Translation
- Translator translates source text
- Use machine learning techniques to extract useful knowledge
  - Translation model: word and phrase translations
  - Language model: how likely words are to follow each other in a particular sequence
- Translation system (decoder) uses these models to translate new sentences
Advantages:
- Can quickly train for new languages
- Can adapt to new domains
Problems:
- Need parallel data
- All words, even punctuation, are equal
- Difficult to pinpoint the causes of errors
[Figure labels: Source, Target, Translation Model, Language Model, Source Sentence, Translation]

Tasks in SMT
- Modelling: build statistical models which capture characteristic features of translation equivalences and of the target language
- Training: train translation model on bilingual corpus, train language model on monolingual corpus
- Decoding: find best translation for new sentences according to models
- Evaluation
  - Subjective evaluation: fluency, adequacy
  - Automatic evaluation: WER, BLEU, etc.
- And all the nitty-gritty stuff
  - Text preprocessing, data cleaning
  - Parameter tuning (minimum error rate training)

Noisy Channel View
"French is actually English, which has been garbled during transmission; recover the correct, original English"
- Speaker speaks English
- Noisy channel distorts into French
- You hear French, but need to recover the English

Bayesian Approach
Select the translation which has the highest probability:
  ê = argmax{ p(e | f) } = argmax{ p(e) p(f | e) }
[Figure labels: Model, Channel, Search Process, Model, Source]
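The step between the two argmax expressions is Bayes' rule. Written out, with the argmax taken over candidate translations e of a fixed source sentence f:

```latex
\hat{e} \;=\; \arg\max_{e}\, p(e \mid f)
        \;=\; \arg\max_{e}\, \frac{p(e)\, p(f \mid e)}{p(f)}
        \;=\; \arg\max_{e}\, p(e)\, p(f \mid e)
```

The denominator p(f) does not depend on e, so dropping it does not change which e attains the maximum.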

SMT Architecture
- p(e): language model
- p(f | e): translation model

Log-Linear Model
In practice:
  ê = argmax{ log(p(e)) + log(p(f | e)) }
Translation model (TM) and language model (LM) may be of different quality:
- simplifying assumptions
- trained on different amounts of data
Give different weights to both models:
  ê = argmax{ w1 * log(p(e)) + w2 * log(p(f | e)) }
Why not add more features?
  ê = argmax{ w1 * h1(e, f) + ... + wn * hn(e, f) }
Note: we don't need the normalization constant for the argmax
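A minimal sketch of the weighted selection is shown below. The two hypotheses and their probabilities are invented purely for illustration, and the feature names "lm" and "tm" are placeholders for log(p(e)) and log(p(f | e)); the point is only that changing the weights can change which hypothesis wins.

```python
import math

def loglinear_score(features, weights):
    """Weighted sum of feature values: w1 * h1(e, f) + ... + wn * hn(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

def best_hypothesis(hypotheses, weights):
    """Return the hypothesis with the highest log-linear score (the argmax)."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))

# Hypothetical scores: 'hard' is more fluent English (higher LM probability),
# 'big' is the more literal translation (higher TM probability).
hypotheses = [
    {"text": "The good man works hard",
     "features": {"lm": math.log(0.02), "tm": math.log(0.10)}},
    {"text": "The good man works big",
     "features": {"lm": math.log(0.001), "tm": math.log(0.30)}},
]

equal = best_hypothesis(hypotheses, {"lm": 1.0, "tm": 1.0})
weak_lm = best_hypothesis(hypotheses, {"lm": 0.1, "tm": 1.0})
print(equal["text"])
print(weak_lm["text"])
```

With equal weights the fluent hypothesis wins; once the language model is down-weighted, the literal one does. Choosing such weights is what the parameter tuning step (minimum error rate training) is for.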

Overview
- Deciphering foreign text - an example
- Principles of SMT
- Data processing

Corpus Statistics
We want to know how much data we have
- Corpus size: not file size, not documents, but words and sentences
  - Why is file size not important?
- Vocabulary: number of word types
We want to know some distributions
- How many words are seen only once?
  - Why is this interesting?
  - Does it help to increase the corpus?
- …
- How long are the sentences?
  - Does it matter if we have many short or fewer, but longer sentences?
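These numbers are easy to compute. The sketch below uses Python rather than the Unix tools mentioned in the lecture, treats whitespace-separated strings as words (so no tokenization or case folding), and runs on the English side of the toy corpus as stand-in data.

```python
from collections import Counter

def corpus_stats(sentences):
    """Basic corpus statistics over whitespace-separated words."""
    sentences = list(sentences)
    tokens = [word for sentence in sentences for word in sentence.split()]
    counts = Counter(tokens)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),                      # running words
        "types": len(counts),                       # vocabulary size
        "singletons": singletons,                   # words seen only once
        "singleton_rate": singletons / len(counts),
        "avg_sentence_length": len(tokens) / len(sentences),
    }

english = [
    "The monkey eats",
    "The child works",
    "The big monkey works",
    "The good man works",
    "The child works well",
    "The old man works badly",
]
stats = corpus_stats(english)
for key, value in stats.items():
    print(key, value)
```

More than half of the vocabulary in this tiny sample is seen only once, which illustrates why the singleton count is worth reporting.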

All Simple, Basic, Important
Important:
- When you publish, these numbers are important
  - To be able to interpret the results, e.g. what works on small corpora may not work on large corpora
  - To make them comparable to other papers
Basic: no deep thinking, nothing fancy
Simple: a few Unix commands, a few simple scripts
- wc, grep, sed, sort, uniq
- perl, awk (my favorite), perhaps python, …
Let's look at some data!

BTEC Spa-Eng
Corpus statistics:
- Corpus and vocabulary size
- Percentage of singletons
- Number of unknown words, out-of-vocabulary (OOV) rate
Sentence length balance
Text normalization:
- Spoken language forms: I'll, we're, but also I will, we are
Note: this was shown online
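The OOV rate can be computed as the share of test-set words that never occur in the training data. The sketch below is token-based and lowercases everything; both are choices made here for illustration, and the BTEC data itself is not reproduced, so the toy English sentences stand in for it.

```python
def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens whose word type is absent from the training data."""
    vocabulary = {
        word for sentence in train_sentences for word in sentence.lower().split()
    }
    test_tokens = [
        word for sentence in test_sentences for word in sentence.lower().split()
    ]
    unknown = [word for word in test_tokens if word not in vocabulary]
    return len(unknown) / len(test_tokens), unknown

train = [
    "The monkey eats",
    "The child works",
    "The big monkey works",
    "The good man works",
    "The child works well",
    "The old man works badly",
]
rate, unknown = oov_rate(train, ["The old monkey eats a lot"])
print(f"OOV rate: {rate:.1%}, unknown words: {unknown}")
```

Here 'a' and 'lot' are unknown, which matches the earlier slide where "a lot" had to be rephrased as "big" before it could be translated.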

Tokenization
Punctuation attached to words
- Example: 'you' 'you,' 'you.' 'you?'
- All different strings, i.e. all are different words
Tokenization can be tricky
- What about punctuation in numbers?
- What about abbreviations (A5-0104/1999)?
Numbers are not just numbers
- Percentages: 1.2%
- Ordinals: 1st, 2.
- Ranges: 2000-2006, 3:1
- And more: (A5-0104/1999)
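A small regular-expression tokenizer shows both the basic idea and where it gets tricky. The pattern below is an illustrative sketch, not the tokenizer used in the course: it keeps simple numeric expressions and contractions together, splits off all other punctuation, and makes no attempt at abbreviations or ordinals written with a period.

```python
import re

TOKEN = re.compile(
    r"""
    \d+(?:[.,:/-]\d+)*%?(?!\w)   # numbers: 1.2%, 2000-2006, 3:1
  | \w+(?:['’-]\w+)*             # words, including I'll and hyphenated forms
  | [^\w\s]                      # any other single punctuation character
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN.findall(text)

print(tokenize("you, you. you?"))
print(tokenize("I'll say it grew 1.2% in 2000-2006."))
print(tokenize("(A5-0104/1999)"))
```

After tokenization the four 'you' variants collapse into one word type plus separate punctuation tokens. The document number in the last call is still broken into several pieces, which is the kind of case that needs extra rules.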

GigaWord Corpus
- Distributed by LDC
- Collection of newspapers: NYT, Xinhua News, …
- > 3 billion words
- How large is the vocabulary?
Some observations in vocabulary:
- Number of entries with digits
- Number of entries with special characters
- Number of strange 'words'
Some observations in corpus:
- Sentences with lots of numbers
- Sentences with lots of punctuation
- Sentences with very long words
Note: this was shown online

And Then the More Interesting Stuff
- POS tagging
- Parsing
  - For syntax-based MT systems
  - How parallel are the parse trees?
- Word segmentation
- Morphological processing
In all these tasks the central problem is: how to make the corpus more parallel?