5/28/031 Data Intensive Linguistics Statistical Alignment and Machine Translation.

Presentation transcript:

5/28/031 Data Intensive Linguistics Statistical Alignment and Machine Translation

5/28/032 Overview MT is very hard: by any reasonable standard, translation programs available today do not perform very well. But they are still used and useful. Most MT systems are a mix of probabilistic and non-probabilistic components, though there are a few completely statistical translation systems.

5/28/033 Overview (Cont’d) A large part of implementing an MT system is not specific to MT. Nonetheless, parts of MT that are specific to it are: text alignment and word alignment. Definition: In the sentence alignment problem, one seeks to say that some group of sentences in one language corresponds in content to some other group of sentences in another language. Such a grouping is referred to as a bead of sentences.

5/28/034 Overview of the Lecture Text Alignment Word Alignment Fully Statistical Attempt at MT

5/28/035 Text Alignment: Aligning Sentences and Paragraphs Text alignment is useful for bilingual lexicography and MT, and also as a first step toward using bilingual corpora for other tasks. Text alignment is not trivial because translators do not always translate one sentence in the input into one sentence in the output, although they do so in about 90% of cases. Another problem is that of crossing dependencies, where the order of sentences is changed in the translation.

5/28/036 Different Approaches to Text Alignment Length-Based Approaches: short sentences will be translated as short sentences and long sentences as long sentences. Offset Alignment by Signal Processing Techniques: these approaches do not attempt to align beads of sentences but rather just to align position offsets in the two parallel texts. Lexical Methods: Use lexical information to align beads of sentences.

5/28/037 Length-Based Methods I: General Approach Goal: Find alignment A with highest probability given the two parallel texts S and T: $\arg\max_A P(A \mid S, T) = \arg\max_A P(A, S, T)$. To estimate the above probabilities, the aligned text is decomposed in a sequence of aligned beads where each bead is assumed to be independent of the others. Then $P(A, S, T) \approx \prod_{k=1}^{K} P(B_k)$. The question, then, is how to estimate the probability of a certain type of alignment bead given the sentences in that bead.

5/28/038 Length-Based Methods II: Gale and Church, 1993 The algorithm uses sentence length (measured in characters) to evaluate how likely an alignment of some number of sentences in L1 is with some number of sentences in L2. The algorithm uses a Dynamic Programming technique that allows the system to efficiently consider all possible alignments and find the minimum cost alignment. The method performs well (at least on related languages). It gets a 4% error rate. It works best on 1:1 alignments [only 2% error rate]. It has a high error rate on more difficult alignments.
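
The dynamic-programming search described on this slide can be sketched roughly as follows. This is a minimal illustration and not Gale and Church's exact formulation: the bead-type priors in `BEAD_PRIORS`, the `VARIANCE` constant, and the function names `length_cost` and `align` are all assumptions chosen for the example. The cost of a bead combines a penalty for its type (1:1, 1:0, 0:1, 2:1, 1:2, 2:2) with a penalty for mismatched character lengths.

```python
import math

# Assumed bead-type priors (illustrative values, not taken from the slides).
BEAD_PRIORS = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005,
               (2, 1): 0.045, (1, 2): 0.045, (2, 2): 0.01}
VARIANCE = 6.8  # assumed variance of the length difference per character


def length_cost(len_s, len_t):
    """Negative log probability that two character lengths belong together."""
    if len_s == 0 and len_t == 0:
        return 0.0
    mean = (len_s + len_t) / 2.0
    delta = (len_t - len_s) / math.sqrt(mean * VARIANCE)
    # Two-sided tail probability of a standard normal variable.
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300))


def align(src, tgt):
    """Return the minimum-cost list of beads (src_sentences, tgt_sentences)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            # Try extending the partial alignment with every bead type.
            for (di, dj), prior in BEAD_PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] - math.log(prior) + length_cost(len_s, len_t)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end of both texts to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align(src, tgt)` on two lists of sentence strings returns the beads in order. Because every cell is filled from a fixed set of predecessors, the search considers all alignments built from these bead types in time proportional to the product of the two text lengths.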

5/28/039 Length-Based Methods II: Other Approaches Brown et al., 1991: Same approach as Gale and Church, except that sentence lengths are compared in terms of words rather than characters. Another difference is in the goal: Brown et al. did not want to align entire articles but just a subset of the corpus suitable for further research. Wu, 1994: Wu applies Gale and Church’s method to a corpus of parallel English and Cantonese text. The results are not much worse than on related languages. To improve accuracy, Wu uses lexical cues.

5/28/0310 Lexical Methods of Sentence Alignment I: Kay & Röscheisen, 1993 Assume the first and last sentences of the texts align. These are the initial anchors. Then, until most sentences are aligned: 1. Form an envelope of possible alignments. 2. Choose pairs of words that tend to co-occur in these potential partial alignments. 3. Find pairs of source and target sentences which contain many possible lexical correspondences. The most reliable of these pairs are used to induce a set of partial alignments which will be part of the final result.

5/28/0311 Lexical Methods of Sentence Alignment II: Chen, 1993 Chen does sentence alignment by constructing a simple word-to-word translation model as he goes along. The best alignment is the one that maximizes the likelihood of generating the corpus given the translation model. This best alignment is found by using dynamic programming.

5/28/0312 Lexical Methods of Sentence Alignment III: Haruno & Yamazaki, 1996 Their method is a variant of Kay & Röscheisen (1993) with the following differences: For structurally very different languages, function words impede alignment, so they eliminate function words using a POS tagger. When aligning short texts, there are not enough repeated words for reliable alignment using Kay & Röscheisen (1993), so they use an online dictionary to find matching word pairs.

5/28/0313 Word Alignment A common use of aligned texts is the derivation of bilingual dictionaries and terminology databases. This is usually done in two steps: First, the text alignment is extended to a word alignment. Then, some criterion, such as frequency, is used to select aligned pairs for which there is enough evidence to include them in the bilingual dictionary. Using a χ² measure works well unless one word in L1 occurs with more than one word in L2. Then, it is useful to assume a one-to-one correspondence. Future work is likely to use existing bilingual dictionaries.
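
As a rough illustration of the χ² criterion mentioned above, the sketch below scores a candidate word pair from a 2x2 contingency table counted over aligned sentence pairs. The function name `chi_square` and the representation of the corpus as a list of `(source_tokens, target_tokens)` tuples are assumptions made for the example.

```python
def chi_square(pairs, word_s, word_t):
    """Chi-square association score for word_s and word_t over aligned pairs."""
    a = b = c = d = 0
    for src, tgt in pairs:
        in_s = word_s in src
        in_t = word_t in tgt
        if in_s and in_t:
            a += 1      # both words present
        elif in_s:
            b += 1      # only the source word present
        elif in_t:
            c += 1      # only the target word present
        else:
            d += 1      # neither word present
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0      # a word occurs in every pair or in none; no evidence
    return n * (a * d - b * c) ** 2 / denom
```

A high score suggests that the two words tend to appear in the same beads; pairs whose score falls below some chosen threshold would simply not be entered in the dictionary.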

5/28/0314 Fully Statistical MT Suppose we are translating F -> E. We conceptualize this by imagining that the French has arisen by a stochastic process from an underlying but unknown English version. We need to guess the most likely such version.

5/28/0315 Stochastic MT Model P(E|F) as P(F|E)P(E) / P(F). P(F) is constant; all we want is to choose E. P(F|E)P(E) is a noisy channel model, where P(F|E) is a distorting channel that converts E into F.

5/28/0316 Why not just model P(E|F)? It seems circular to model P(E|F) by an approach that requires P(F|E)

5/28/0317 Why not just model P(E|F)? It seems circular to model P(E|F) by an approach that requires P(F|E). P(F|E) is the translation specialist. P(E) is the English specialist. Both can be sloppy.

5/28/0318 Scene of the crime To be guilty: you have to have a motive, and you have to have the means. The CSI person can ignore motive. The police psychologist can ignore means. To be a good translation: you have to convey the message, and you have to be plausible English. P(F|E) can score translation while ignoring many aspects of English. P(E) doesn’t have to know anything about French.

5/28/0319 Summa cum laude “Topmost with praise” scores high on P(L|E) but poorly on P(E) “Suit and tie” scores much higher on P(E) but badly on P(L|E) “With highest honors” maximizes P(E)P(L|E) (we hope)
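
The way the two specialists combine can be sketched with a toy decoder. All of the numbers below are invented purely for illustration (the slides give no actual probabilities), and the names `CHANNEL`, `SOURCE` and `decode` are assumptions; the point is only that the candidate maximizing the product P(E)P(L|E) need not be the one that maximizes either factor alone.

```python
import math

# Hypothetical scores, invented for illustration only.
CHANNEL = {  # stands in for P(L|E): how well E explains "summa cum laude"
    "topmost with praise": 0.4,
    "suit and tie": 0.0001,
    "with highest honors": 0.2,
}
SOURCE = {   # stands in for P(E): how plausible E is as English
    "topmost with praise": 1e-9,
    "suit and tie": 1e-5,
    "with highest honors": 1e-6,
}


def decode(candidates, channel, source):
    """Pick the candidate E maximizing log P(L|E) + log P(E)."""
    return max(candidates,
               key=lambda e: math.log(channel[e]) + math.log(source[e]))


best = decode(list(CHANNEL), CHANNEL, SOURCE)
```

With these made-up values, the channel alone would prefer "topmost with praise" and the source model alone would prefer "suit and tie", while the product prefers "with highest honors".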

5/28/0320 What are the models? Source model p(E) could be a trigram model: it guarantees semi-fluent English. Channel model p(F|E) or p(L|E) could be a finite-state transducer: it stochastically translates each word and allows a little random rearrangement; with high probability, words stay more or less put. Maximizing p(F|E) alone would give a really lousy French translation of English: random word translation is stupid (word sense must come from context), and random word rearrangement is stupid (phrases rearrange!). This channel has no idea what fluent French looks like. But maximizing p(E)*p(F|E) gives a better English translation of French, because p(E) knows what English should look like.

5/28/0321 IBM models 1,2,3,4,5 These are models for P(F|E). There is a set of English words and the extra English word NULL. Each English word generates and places 0 or more French words. Any remaining French words are deemed to have been produced by NULL.

5/28/0322 IBM models 1,2,3,4,5 In Model 2, the placement of a word in the French depends on where it was in the English

5/28/0323 IBM models 1,2,3,4,5 In Model 3 we model how many French words an English word can produce, using a concept called fertility.

5/28/0324 IBM models 1,2,3,4,5 In model 4 the placement of later French words produced by an English word depends on what happened to earlier French words generated by that same English word

5/28/0325 Alignments The basic idea in all 5 is to build models of translation on top of models of the alignment. We really want P(F|E), which we could obtain from direct models of the joint distribution P(F,E). Instead we introduce probabilistic models of the alignment A, and work with P(F,A,E).

5/28/0326 Alignments The dog ate my homework Le chien a mangé mes devoirs

5/28/0327 IBM models 1,2,3,4,5 In model 5 they do non-deficient alignment. To understand what is at stake, you need to know a little more about the internals of the models

5/28/0328 Alignments All of models 1 through 5 are actually models of how alignments come to be Conceptually, we form the probability of a pair of sentences by summing over all alignments that relate these sentences. In practice, the Candide team sometimes preferred to approximate by assuming that the sentence pair having the single best alignment was also the best sentence pair.
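
In the notation of the earlier Alignments slide, the summation described here and the single-best-alignment shortcut can be written out as below. This is a restatement of the slide text, not an additional result; the symbol $\hat{A}$ for the best alignment is introduced here only for convenience.

```latex
% Probability of a sentence pair: marginalize over all alignments A
P(F, E) = \sum_{A} P(F, A, E)

% Approximation using only the single best alignment
P(F, E) \approx P(F, \hat{A}, E),
\qquad \hat{A} = \arg\max_{A} P(F, A, E)
```

The approximation replaces a sum over a very large set of alignments with one term, which is cheaper to compute but ignores the probability mass of all other alignments.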

5/28/0329 Why all the models We don’t start with aligned text, so we have to get initial alignments from somewhere. Model 1 is words only, and is relatively easy to train with technology similar to the forward-backward algorithm for HMMs. This is the EM algorithm.
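
A minimal sketch of EM training for a Model 1 style word-translation table is given below, assuming a tiny corpus of tokenized sentence pairs. The function name `train_model1`, the uniform initialization, and the fixed iteration count are choices made for this example; it is not the Candide implementation. The table `t[(f, e)]` estimates the probability that English word `e` (or `NULL`) generates French word `f`.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """corpus: list of (english_tokens, french_tokens) pairs.

    Returns a dict-like table t with t[(f, e)] approximating P(f | e).
    """
    french_vocab = {f for _, fs in corpus for f in fs}
    uniform = 1.0 / len(french_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform table

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of e generating f
        total = defaultdict(float)  # expected count of e generating anything
        # E-step: distribute each French word over the English words
        # of its sentence (plus NULL) in proportion to the current table.
        for es, fs in corpus:
            es = ["NULL"] + es
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float,
                        {(f, e): c / total[e] for (f, e), c in count.items()})
    return t
```

For example, training on the two pairs (`the dog`, `le chien`) and (`the cat`, `le chat`) pushes `t[("chien", "dog")]` above `t[("le", "dog")]`, because `le` is also explained by `the` and `NULL`, which occur in both sentences.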

5/28/0330 Why all the models We are working in a space with many parameters and many local optima. EM guarantees only local optimality, so it pays (as in Pereira and Schabes’ work on grammar) to start from a place which is already good.

5/28/0331 Why all the models The purpose of model 1 is to find a good place in which to start training model 2 Model 2 feeds model 3 And so on

5/28/0332 Maximum entropy Going beyond the 5 models, there is a systematic use of log-linear maximum entropy models. These refine the probability models using features like: French word is “en”, English word is “in”, neighboring English word is a month name.

5/28/0333 Practicalities Training for 2 million Hansard sentences takes 3600 processor hours on a farm of vintage IBM PowerPC processors. My guess is that we could do this in a week (8*2* hours

5/28/0334 Performance Rated on fluency and adequacy by independent raters (see Candide paper for details). Systran gets a score of .743 on adequacy; Candide gets .670. Candide gets a score of .580 on fluency, where Systran gets only .540. Both are significantly less good than Transman, which is a human-in-the-loop system.

5/28/0335 Prospects More elaborate translation models, with serious notions of phrase, word-sense, etc. are in the works. Nobody really knows what level of quality is attainable. Nobody really knows what level of quality is required for wider adoption. For specific tasks, MT is clearly useful

5/28/0336 Prospects Many other tasks can be coerced to look like translation: Summarization (E -> Summary), Information retrieval (E -> Keywords), Title generation (E -> Title), Paraphrasing (E -> Paraphrase). There’s a large and growing literature using this idea.

5/28/0337 Where to read more M&S chapter 13 and references therein. Kevin Knight, “Automating Knowledge Acquisition for Machine Translation,” AI Magazine 18(4). Especially the overview paper on Candide and the longer paper on Maximum Entropy with Berger as first author.

5/28/0338 Offset Alignment by Signal Processing Techniques I: Church, 1993 Church argues that length-based methods work well on clean text but may break down in real-world situations (noisy OCR or unknown markup conventions). Church’s method is to induce an alignment by using cognates (words that are similar across languages) at the level of character sequences. The method consists of building a dot-plot, i.e., the source and translated text are concatenated and then a square graph is made with this text on both axes. A dot is placed at (x,y) when there is a match. [Unit=4-grams].
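
The dot-plot construction can be sketched as follows, assuming a match means that the same character 4-gram starts at both offsets of the concatenated text. The function name `dot_plot` is an assumption, and the sketch omits the weighting and compression steps that a practical system would need.

```python
def dot_plot(source, target, n=4):
    """Return the set of (x, y) offsets where the same character n-gram starts.

    Offsets refer to the concatenation source + target, so dots with x in the
    source and y in the target show cross-language correspondences.
    """
    text = source + target
    positions = {}
    for i in range(len(text) - n + 1):
        positions.setdefault(text[i:i + n], []).append(i)
    dots = set()
    for offsets in positions.values():
        # Every pair of occurrences of the same n-gram yields a dot.
        for x in offsets:
            for y in offsets:
                dots.add((x, y))
    return dots
```

For `dot_plot("government", "gouvernement")`, the shared 4-gram `vern` starts at offset 2 in the source and at offset 13 in the concatenation, so `(2, 13)` is among the dots. Very frequent n-grams produce many dots, which is one reason the full plot needs to be compressed before searching it.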

5/28/0339 Offset Alignment by Signal Processing Techniques II: Church, 1993 (Cont’d) Signal processing methods are then used to compress the resulting plot. The interesting parts of a dot-plot are called the bitext maps. These maps show the correspondence between the two languages. The bitext maps contain faint, roughly straight diagonals. A heuristic search along such a diagonal provides an alignment in terms of offsets in the two texts.

5/28/0340 Offset Alignment by Signal Processing Techniques III: Fung & McKeown, 1994 Fung and McKeown’s algorithm works without having found sentence boundaries, in only roughly parallel text (with certain sections missing in one language), and with unrelated language pairs. The technique is to infer a small bilingual dictionary that will give points of alignment. For each word, a signal is produced, as an arrival vector of integer numbers giving the number of words between each occurrence of the word at hand.
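
The arrival vector for a word can be computed in a few lines. This sketch assumes the text is already tokenized and takes the vector to be the differences between successive positions of the word; the function name `arrival_vector` is made up for the example.

```python
def arrival_vector(tokens, word):
    """Distances, in tokens, between successive occurrences of word."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]
```

For the token list `a b a c c a`, the word `a` occurs at positions 0, 2 and 5, giving the arrival vector `[2, 3]`. Words in the two languages whose arrival vectors look alike are candidates for the small bilingual dictionary.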