Translation Models Philipp Koehn (USC/Information Sciences Institute, USC/Computer Science Department; School of Informatics, University of Edinburgh) Some slides adapted from David Kauchak (CS159 – Fall 2014), Kevin Knight (Computer Science Department), and Dan Klein (UC Berkeley)

Admin Assignment 5 extended to Thursday at 2:45 –Should have worked on it some already! –Extra credit (5%) if submitted by Tuesday (i.e. now) –No late days

Language translation Yo quiero Taco Bell

Noisy channel model –translation model: how do English sentences get translated to foreign? –language model: what do English sentences look like?
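(As a rough illustration of the decision rule behind the noisy channel, here is a minimal sketch; `language_model_logprob`, `translation_model_logprob`, and the candidate list are hypothetical stand-ins, and a real decoder searches the space of translations rather than scoring a fixed list:)

```python
import math

# Minimal sketch of the noisy channel decision rule:
# pick the English sentence e maximizing P(e) * P(f | e).
# Working in log space avoids numerical underflow.
def noisy_channel_decode(f, candidate_translations,
                         language_model_logprob, translation_model_logprob):
    best_e, best_score = None, -math.inf
    for e in candidate_translations:
        # log P(e) + log P(f | e)
        score = language_model_logprob(e) + translation_model_logprob(f, e)
        if score > best_score:
            best_e, best_score = e, score
    return best_e
```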

Problems for Statistical MT Preprocessing –How do we get aligned bilingual text? –Tokenization –Segmentation (document, sentence, word) Language modeling –Given an English string e, assigns P(e) by formula Translation modeling –Given a pair of strings, assigns P(f | e) by formula Decoding –Given a language model, a translation model, and a new sentence f … find translation e maximizing P(e) * P(f | e) Parameter optimization –Given a model with multiple feature functions, how are they related? What are the optimal parameters? Evaluation –How well is a system doing? How can we compare two systems?

Problems for Statistical MT Preprocessing Language modeling Translation modeling Decoding Parameter optimization Evaluation

From No Data to Sentence Pairs Easy way: Linguistic Data Consortium (LDC) Really hard way: pay $$$ –Suppose one billion words of parallel data were sufficient –At 20 cents/word, that’s $200 million Pretty hard way: Find it, and then earn it! How would you obtain data? What are the challenges?

From No Data to Sentence Pairs Easy way: Linguistic Data Consortium (LDC) Really hard way: pay $$$ –Suppose one billion words of parallel data were sufficient –At 20 cents/word, that’s $200 million Pretty hard way: Find it, and then earn it! –De-formatting –Remove strange characters –Character code conversion –Document alignment –Sentence alignment –Tokenization (also called Segmentation)

What is this? 49 20 6c 69 6b 65 20 4e 4c 50 20 63 6c 61 73 73 (hexadecimal file contents) I _ l i k e _ N L P _ c l a s s Totally depends on file encoding!
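(A small illustration of why the bytes alone don't determine the text: the byte values below are the ASCII codes for "I like NLP class", and decoding them under a different encoding gives different characters entirely:)

```python
# The same bytes decode to different strings under different encodings.
raw = bytes([0x49, 0x20, 0x6C, 0x69, 0x6B, 0x65, 0x20,
             0x4E, 0x4C, 0x50, 0x20, 0x63, 0x6C, 0x61, 0x73, 0x73])

print(raw.hex(" "))             # 49 20 6c 69 6b 65 20 4e 4c 50 20 63 6c 61 73 73
print(raw.decode("ascii"))      # I like NLP class
print(raw.decode("latin-1"))    # same here, since all bytes are < 0x80
print(raw.decode("utf-16-le"))  # wrong encoding: completely different characters
```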

ISO 8859-2 (Latin-2) ISO 8859-6 (Arabic)

Chinese? GB Code GBK Code Big 5 Code CNS …

If you don’t get the characters right…

Document Alignment Input: –Big bag of files obtained from somewhere, believed to contain pairs of files that are translations of each other. Output: –List of pairs of files that are actually translations. [figure: a jumbled collection of English (E) and Chinese (C) documents]

Sentence Alignment The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await. El viejo está feliz porque ha pescado muchos veces. Su mujer habla con él. Los tiburones esperan.

Sentence Alignment 1.The old man is happy. 2.He has fished many times. 3.His wife talks to him. 4.The fish are jumping. 5.The sharks await. 1.El viejo está feliz porque ha pescado muchos veces. 2.Su mujer habla con él. 3.Los tiburones esperan. What should be aligned?

Sentence Alignment 1.The old man is happy. 2.He has fished many times. 3.His wife talks to him. 4.The fish are jumping. 5.The sharks await. 1.El viejo está feliz porque ha pescado muchos veces. 2.Su mujer habla con él. 3.Los tiburones esperan.

Sentence Alignment 1.The old man is happy. He has fished many times. 2.His wife talks to him. 3.The sharks await. 1.El viejo está feliz porque ha pescado muchos veces. 2.Su mujer habla con él. 3.Los tiburones esperan. Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).
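(One classic cue for sentence alignment is length: translated sentences tend to have roughly proportional lengths. The sketch below is a toy cost function in that spirit; it is not a full dynamic-programming aligner like Gale & Church:)

```python
# Toy length-based scoring for candidate sentence "beads" (1-1, 2-1, 1-2 groupings).
# Lower cost = better match; a real aligner combines such costs with dynamic programming.
def bead_cost(src_sents, tgt_sents, ratio=1.0):
    src_len = sum(len(s) for s in src_sents)
    tgt_len = sum(len(t) for t in tgt_sents)
    if src_len == 0 or tgt_len == 0:
        return float("inf")          # dropping sentences is handled elsewhere
    return abs(tgt_len / src_len - ratio)

# English sentences 1-2 merged against Spanish sentence 1 (the 2-1 bead from the slide)
en = ["The old man is happy.", "He has fished many times."]
es = ["El viejo está feliz porque ha pescado muchos veces."]
print(bead_cost(en, es))        # small cost: plausible 2-1 bead
print(bead_cost(en[:1], es))    # larger cost: a 1-1 pairing matches lengths worse
```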

Tokenization (or Segmentation) English –Input (some byte stream): "There," said Bob. –Output (7 “tokens” or “words”): " There , " said Bob . Chinese –Input (byte stream): –Output: 美国关岛国际机场及其办公室均接获 一名自称沙地阿拉伯富商拉登等发出 的电子邮件。
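(A minimal sketch of the English case: split punctuation off into its own tokens. Real tokenizers handle abbreviations, numbers, hyphens, and so on, and Chinese requires a genuine word segmenter rather than a regex:)

```python
import re

# Each token is either a run of word characters or a single punctuation mark.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize('"There," said Bob.'))
# ['"', 'There', ',', '"', 'said', 'Bob', '.']  -- the 7 "tokens" from the slide
```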

Problems for Statistical MT Preprocessing Language modeling Translation modeling Decoding Parameter optimization Evaluation

Language Modeling Most common: n-gram language models The more data the better (Google n-grams) Domain is important
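(For concreteness, a toy bigram model with maximum-likelihood estimates over a two-sentence corpus; real systems add smoothing and vastly more data:)

```python
from collections import Counter

# Bigram MLE: p(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
corpus = [["<s>", "i", "like", "nlp", "</s>"],
          ["<s>", "i", "like", "tacos", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent[:-1])      # context counts
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus
                  for i in range(len(sent) - 1))

def p_bigram(prev, word):
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("i", "like"))     # 1.0
print(p_bigram("like", "nlp"))   # 0.5
```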

Problems for Statistical MT Preprocessing Language modeling Translation modeling Decoding Parameter optimization Evaluation

Translation Model Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Can we just model this directly, i.e. p(foreign | english)? How would we estimate these probabilities, e.g. p( “Maria …” | “Mary …”)? Want: a probabilistic model that gives us how likely one sentence is to be a translation of another, i.e. p(foreign | english)

Translation Model Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Want: a probabilistic model that gives us how likely one sentence is to be a translation of another, i.e. p(foreign | english) p( “Maria…” | “Mary…” ) = count(“Mary…” aligned-to “Maria…”) / count( “Mary…”) Not enough data for most sentences!

Translation Model Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Key: break up the process into smaller steps, so we have sufficient statistics for each smaller step

What kind of Translation Model? Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word-level models Phrasal models Syntactic models Semantic models

IBM Word-level models Generative story: description of how the translation happens 1. Each English word gets translated as 0 or more foreign words 2. Some additional foreign words get inserted 3. Foreign words then get shuffled Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word-level model

IBM Word-level models Each foreign word is aligned to exactly one English word. Key idea: decompose p( foreign | english ) into word translation probabilities of the form p( foreign_word | english_word ) IBM described 5 different levels of models with increasing complexity (and decreasing independence assumptions) Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word-level model

Some notation E = e_1 … e_|E|: English sentence with length |E| F = f_1 … f_|F|: foreign sentence with length |F| Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Translation model: p(F | E)

Word models: IBM Model 1 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Each foreign word is aligned to exactly one English word This is the ONLY thing we model! Does the model handle foreign words that are not aligned, e.g. “a”? p(verde | green)

Word models: IBM Model 1 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Each foreign word is aligned to exactly one English word This is the ONLY thing we model! NULL Include a “NULL” English word and align to this to account for deletion p(verde | green)

Word models: IBM Model 1 generative story -> probabilistic model –Key idea: introduce “hidden variables” to model the word alignment –one variable for each foreign word –a_i corresponds to the ith foreign word –each a_i can take a value 0…|E|

Alignment variables Mary did not slap the green witch Maria no dió una bofetada a la bruja verde a_1 = 1, a_2 = 3, a_3 = 4, a_4 = 4, a_5 = 4, a_6 = 0, a_7 = 5, a_8 = 7, a_9 = 6
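(In code, such an alignment is just one English-position index per foreign word, with 0 standing for NULL; the list below encodes the a_1 … a_9 values above:)

```python
# One alignment variable per foreign word, pointing into the English sentence (0 = NULL).
english = ["NULL", "Mary", "did", "not", "slap", "the", "green", "witch"]
foreign = ["Maria", "no", "dió", "una", "bofetada", "a", "la", "bruja", "verde"]
alignment = [1, 3, 4, 4, 4, 0, 5, 7, 6]   # a_i = index of the aligned English word

for f_word, a in zip(foreign, alignment):
    print(f"{f_word} -> {english[a]}")
```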

Alignment variables And the program has been implemented Le programme a ete mis en application Alignment?

Alignment variables And the program has been implemented Le programme a ete mis en application a_1 = ?, a_2 = ?, a_3 = ?, a_4 = ?, a_5 = ?, a_6 = ?, a_7 = ?

Alignment variables And the program has been implemented Le programme a ete mis en application a_1 = 2, a_2 = 3, a_3 = 4, a_4 = 5, a_5 = 6, a_6 = 6, a_7 = 6

Probabilistic model p(foreign | english) = p(foreign, alignment | english)? NO! How do we get rid of the alignment variables?

Joint distribution
NLPPass, EngPass     P(NLPPass, EngPass)
true, true           .88
true, false          .01
false, true          .04
false, false         .07
What is P(EngPass)?

Joint distribution
NLPPass, EngPass     P(NLPPass, EngPass)
true, true           .88
true, false          .01
false, true          .04
false, false         .07
P(EngPass) = .88 + .04 = 0.92 How did you figure that out?

Joint distribution
NLPPass, EngPass     P(NLPPass, EngPass)
true, true           .88
true, false          .01
false, true          .04
false, false         .07
Called “marginalization”, aka summing over a variable

Probabilistic model Sum over all possible values, i.e. marginalize out the alignment variables
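(The equation on this slide didn't survive the transcript; in the notation above, with each alignment variable a_i ranging over positions 0…|E|, the marginalization it describes can be written as:)

```latex
p(F \mid E) \;=\; \sum_{A} p(F, A \mid E)
            \;=\; \sum_{a_1 = 0}^{|E|} \cdots \sum_{a_{|F|} = 0}^{|E|} p(F, a_1, \ldots, a_{|F|} \mid E)
```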

Independence assumptions IBM Model 1: What independence assumptions are we making? What information is lost?
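(The Model 1 equation is likewise missing from the transcript. The standard IBM Model 1 form, consistent with the description on the surrounding slides and ignoring the sentence-length term the full model includes, treats every alignment as equally likely and multiplies independent word-translation probabilities:)

```latex
p(F, A \mid E) \;=\; \frac{1}{(|E| + 1)^{|F|}} \prod_{i=1}^{|F|} p\!\left(f_i \mid e_{a_i}\right)
```

The assumptions: each foreign word depends only on the single English word it is aligned to, and the alignment itself carries no positional preference, which is exactly why word order is lost (as the next slides point out).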

And the program has been implemented Le programme a ete mis en application And the program has been implemented application en programme Le mis ete a Are the probabilities any different under model 1?

And the program has been implemented Le programme a ete mis en application And the program has been implemented application en programme Le mis ete a No. Model 1 ignores word order!

IBM Model 2 NULL Mary did not slap the green witch Maria no dió una bofetada a la bruja verde p(i aligned-to a_i) Maria no dió una bofetada a la verde bruja p(la|the) Models word movement by position, e.g. –Words don’t tend to move too much –Words at the beginning move less than words at the end
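(The Model 2 equation is also missing from the transcript; a common way to write it adds a position-based alignment (distortion) term to Model 1's word-translation term:)

```latex
p(F, A \mid E) \;=\; \prod_{i=1}^{|F|} p\!\left(f_i \mid e_{a_i}\right)\, p\!\left(a_i \mid i, |E|, |F|\right)
```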

IBM Model 3 Mary did not slap the green witch Mary not slap slap slap the green witch p(3|slap) Maria no dió una bofetada a la bruja verde p(i aligned-to a_i) Maria no dió una bofetada a la verde bruja p(la|the) NULL Incorporates “fertility”: how likely a particular English word is to produce multiple foreign words

Word-level models Problems/concerns? –Multiple English words for one French word: IBM models can do one-to-many (fertility) but not many-to-one –Phrasal Translation: “real estate”, “note that”, “interest in” –Syntactic Transformations: verb at the beginning in Arabic; the translation model penalizes any proposed re-ordering, and the language model is not strong enough to force the verb to move to the right place

Benefits of word-level model Rarely used in practice for modern MT systems Why talk about them? Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Two key side effects of training a word-level model: –word-level alignment –p(f | e): translation dictionary

Training a word-level model Where do these translation probabilities come from? The old man is happy. He has fished many times. His wife talks to him. The sharks await. … El viejo está feliz porque ha pescado muchos veces. Su mujer habla con él. Los tiburones esperan. … Have to learn them!

Training a word-level model The old man is happy. He has fished many times. His wife talks to him. The sharks await. … El viejo está feliz porque ha pescado muchos veces. Su mujer habla con él. Los tiburones esperan. … p(f | e): probability that e is translated as f How do we learn these? What data would be useful?

Thought experiment The old man is happy. He has fished many times. El viejo está feliz porque ha pescado muchos veces. His wife talks to him. Su mujer habla con él. The sharks await. Los tiburones esperan. ?

Thought experiment The old man is happy. He has fished many times. El viejo está feliz porque ha pescado muchos veces. His wife talks to him. Su mujer habla con él. The sharks await. Los tiburones esperan. p(el | the) = 0.5 p(Los | the) = 0.5 Any problems/concerns?

Thought experiment The old man is happy. He has fished many times. El viejo está feliz porque ha pescado muchos veces. His wife talks to him. Su mujer habla con él. The sharks await. Los tiburones esperan. Getting data like this is expensive! Even if we had it, what happens when we switch to a new domain/corpus?

Training without alignments Two toy sentence pairs: a b ↔ x y and c b ↔ z x How should the words be aligned? There is some information! (Think of the alien translation task last time)

Thought experiment #2 The old man is happy. He has fished many times. El viejo está feliz porque ha pescado muchos veces. The old man is happy. He has fished many times. El viejo está feliz porque ha pescado muchos veces. Annotator 1 Annotator 2 What do we do?

Thought experiment #2 The old man is happy. He has fished many times. El viejo está feliz porque ha pescado muchos veces. The old man is happy. He has fished many times. El viejo está feliz porque ha pescado muchos veces. 80 annotators 20 annotators What do we do?

Thought experiment #2 The old man is happy. He has fished many times. El viejo está feliz porque ha pescado muchos veces. The old man is happy. He has fished many times. El viejo está feliz porque ha pescado muchos veces. 80 annotators 20 annotators Use partial counts: –count(viejo | man) = 0.8 –count(viejo | old) = 0.2
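(EM training of IBM Model 1 generalizes this idea: instead of annotators voting, the current model itself spreads fractional counts over all possible alignments of each foreign word, and those counts re-estimate p(f | e). A minimal sketch, run on the toy "a b / x y" pairs from earlier in the lecture:)

```python
from collections import defaultdict

def train_model1(sentence_pairs, iterations=10):
    """EM for IBM Model 1. sentence_pairs: list of (english_words, foreign_words);
    "NULL" is added to every English sentence to absorb unaligned foreign words."""
    e_vocab = {e for E, F in sentence_pairs for e in E + ["NULL"]}
    t = defaultdict(lambda: 1.0 / len(e_vocab))   # uniform initialization of p(f | e)

    for _ in range(iterations):
        count = defaultdict(float)   # expected count(e, f)
        total = defaultdict(float)   # expected count(e)
        # E-step: distribute each foreign word's count over all English words
        for E, F in sentence_pairs:
            E = ["NULL"] + E
            for f in F:
                z = sum(t[(e, f)] for e in E)      # normalizer over alignments of f
                for e in E:
                    frac = t[(e, f)] / z           # partial count, e.g. 0.8 vs 0.2
                    count[(e, f)] += frac
                    total[e] += frac
        # M-step: re-estimate p(f | e) from the partial counts
        for (e, f), c in count.items():
            t[(e, f)] = c / total[e]
    return t

pairs = [(["a", "b"], ["x", "y"]), (["c", "b"], ["z", "x"])]
t = train_model1(pairs)
print(round(t[("b", "x")], 2))   # b and x co-occur in both pairs, so p(x | b) grows
```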