THE MATHEMATICS OF STATISTICAL MACHINE TRANSLATION
Sriraman M. Tallam

Slide 2 (Sriraman Tallam, August 18, 2015): The Problem
 The problem of machine translation is discussed.
 Five statistical models are proposed for the translation process, and algorithms for estimating their parameters are described.
 The learning process uses pairs of sentences that are translations of one another.
 Previous work shows statistical methods to be useful in achieving linguistically interesting goals; a natural extension is matching up words within pairs of aligned sentences.
 Results show the power of statistical methods in extracting linguistically interesting correlations.

Slide 3: Statistical Translation
 Warren Weaver first suggested the use of statistical techniques for machine translation. [Weaver 1955]
 Fundamental Equation of Machine Translation:
   Pr(e|f) = Pr(e) Pr(f|e) / Pr(f)
   ê = argmax_e Pr(e) Pr(f|e)

Slide 4: Statistical Translation
 A translator writing a French sentence, even a native speaker, is imagined to conceive an English sentence first and then mentally translate it. Machine translation's goal is to find that English sentence.
 The equation summarizes the three computational challenges presented by statistical translation:
   Language model probability estimation - Pr(e)
   Translation model probability estimation - Pr(f|e)
   Search problem - maximizing their product
 Why not reverse the translation models? Class Discussion !!
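The decision rule ê = argmax_e Pr(e) Pr(f|e) can be sketched over a toy candidate list. A minimal sketch; every probability below is invented for illustration and is not from the paper:

```python
# Noisy-channel decision rule: pick the English candidate e that
# maximizes Pr(e) * Pr(f|e). All probabilities here are toy values.

def best_translation(f, candidates, lm, tm):
    """Return argmax over candidates of lm[e] * tm[(f, e)]."""
    return max(candidates, key=lambda e: lm[e] * tm[(f, e)])

lm = {"the house": 0.6, "house the": 0.001}       # language model Pr(e)
tm = {("la maison", "the house"): 0.2,            # translation model Pr(f|e)
      ("la maison", "house the"): 0.2}

e_hat = best_translation("la maison", ["the house", "house the"], lm, tm)
```

Note how the language model does the work here: both candidates explain the French equally well, but "house the" is penalized as bad English.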

Slide 5: Alignments
 What is a translation? A pair of strings that are translations of one another: (Qu' aurions-nous pu faire ? | What could we have done ?)
 What is an alignment?

Slide 6: Alignments
 The mapping in an alignment can range from one-to-one to many-to-many.
 The alignment in the figure is expressed as (Le programme a ete mis en application | And the(1) program(2) has(3) been(4) implemented(5,6,7)).
 The following alignment, though acceptable, has a lower probability: (Le programme a ete mis en application | And(1,2,3,4,5,6,7) the program has been implemented).
 A(e,f) is the set of alignments of (f|e). If e has length l and f has length m, there are 2^(lm) alignments in all.
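As a small illustration of this notation, an alignment can be stored as a map from each English position to the set of French positions it connects to, and the 2^(lm) count follows from choosing any subset of the l×m possible connections. A sketch of my own, not code from the paper:

```python
# Sketch: an alignment maps each English position to the set of French
# positions it connects to (indices follow the slide's example).

def num_alignments(l, m):
    # Any subset of the l*m possible (English, French) connections is an
    # alignment, so there are 2**(l*m) alignments in all.
    return 2 ** (l * m)

# (Le programme a ete mis en application | And the(1) program(2) has(3)
#  been(4) implemented(5,6,7)) -- English position -> French positions.
alignment = {1: set(),        # And       -> (nothing)
             2: {1},          # the       -> Le
             3: {2},          # program   -> programme
             4: {3},          # has       -> a
             5: {4},          # been      -> ete
             6: {5, 6, 7}}    # implemented -> mis en application
```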

Slide 7: Cepts
 What is a cept? It expresses the fact that each word is related to a concept: in a figurative sense, a sentence is a web of concepts woven together.
 In the example "The poor don't have any money", the cepts are "The", "poor", and "don't have any money".
 There is also the notion of an empty cept.

Slide 8: Translation Models
 Five translation models have been developed.
 Each model is a recipe for computing Pr(f|e), which is called the likelihood of the translation (f, e).
 The likelihood is a function of many parameters.
 The idea is to guess values for these parameters and to apply the EM algorithm iteratively.

Slide 9: Translation Models
 Models 1 and 2:
   All possible lengths of the French string are equally likely.
   In Model 1, all connections for each French position are equally likely.
   In Model 2, connection probabilities are more realistic.
   These models often lead to unsatisfactory alignments.
 Models 3, 4 and 5:
   No assumptions on the length of the French string.
   Models 3 and 4 make more realistic assumptions about the connection probabilities.
   Models 1-4 are stepping stones for the training of Model 5: start with Model 1 for initial estimates and pipe the results through the models in turn.

Slide 10: Translation Models
 The likelihood of f given e is a sum over all elements of A(e,f): Pr(f|e) = Σ_a Pr(f, a|e).
 The generative story: first choose the length of the French string given the English; then, for each French word position, choose the alignment, given the previous alignments and words; finally, choose the identity of the word at this position given our knowledge of the previous alignments and words.

Slide 11: Model 1 Assumptions
 We assume Pr(m|e) = ε, independent of e and m: all reasonable lengths of the French string are equally likely.
 The alignment probability Pr(a_j | a_1..a_{j-1}, f_1..f_{j-1}, m, e) depends only on l: all connections are equally likely, and each French word has (l + 1) possible connections, so this quantity equals (l + 1)^(-1).
 t(f_j | e_{a_j}) is called the translation probability of f_j given e_{a_j}.

Slide 12: Model 1
 The joint likelihood function for Model 1 is
   Pr(f, a|e) = ε / (l + 1)^m × Π_{j=1..m} t(f_j | e_{a_j}),
 for j = 1 … m and a_j from 0 … l (position 0 is the empty cept).
 Therefore, summing over alignments,
   Pr(f|e) = ε / (l + 1)^m × Π_{j=1..m} Σ_{i=0..l} t(f_j | e_i),
 subject to Σ_f t(f|e) = 1 for each English word e.
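The closed-form Model 1 likelihood Pr(f|e) = ε/(l+1)^m Π_j Σ_i t(f_j|e_i) can be computed directly. A sketch assuming a toy translation table with invented values (pairs absent from the table get probability 0):

```python
# Sketch of the Model 1 likelihood
#   Pr(f|e) = eps / (l+1)**m * prod_j sum_i t(f_j | e_i),
# where position 0 is the NULL word (empty cept).

def model1_likelihood(f_words, e_words, t, eps=1.0):
    e_full = ["NULL"] + e_words           # l + 1 English positions
    l, m = len(e_words), len(f_words)
    p = eps / (l + 1) ** m
    for f in f_words:
        p *= sum(t.get((f, e), 0.0) for e in e_full)
    return p

t = {("la", "the"): 0.7, ("maison", "house"): 0.8}    # invented toy values
p = model1_likelihood(["la", "maison"], ["the", "house"], t)
```

Because the sum over alignments factorizes across French positions, this runs in O(lm) rather than enumerating (l+1)^m alignments; that tractability is what makes Model 1 a convenient starting point.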

Slide 13: Model 1
 Using the technique of Lagrange multipliers, the constrained maximization yields re-estimation formulas, and the EM algorithm is applied repeatedly.
 The expected number of times e connects to f in the pair (f, e) is
   c(f|e; f, e) = t(f|e) / (t(f|e_0) + … + t(f|e_l)) × (number of times f occurs in f) × (number of times e occurs in e),
 and the new t(f|e) is obtained by normalizing these counts over the training corpus.
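A minimal end-to-end sketch of this EM loop on a two-sentence toy corpus (my own illustration, not the paper's code; the NULL word is omitted for brevity):

```python
from collections import defaultdict

# Model 1 EM on a toy corpus. E-step: collect expected counts
# c(f|e) = t(f|e) / sum_i t(f|e_i). M-step: renormalize counts into t(f|e).
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"])]

t = defaultdict(lambda: 0.25)            # uniform initialization

for _ in range(10):
    count = defaultdict(float)           # expected counts c(f|e)
    total = defaultdict(float)           # normalizers per English word
    for f_sent, e_sent in corpus:
        for f in f_sent:
            z = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    t = defaultdict(float,
                    {pair: count[pair] / total[pair[1]] for pair in count})
```

After a few iterations the ambiguity resolves: "la" co-occurs with "the" in both pairs, so t(la|the) grows while t(maison|house) and t(fleur|flower) absorb the rest, exactly the "linguistically interesting correlations" the results slide refers to.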

Slide 14: Model 1 (slide content is an image not included in the transcript)

Slide 15: Model 1 -> Model 2
 Model 1 does not take into account where words appear in either string: all connections are equally probable.
 In Model 2, alignment probabilities a(i | j, m, l) are introduced, which satisfy the constraints Σ_{i=0..l} a(i | j, m, l) = 1 for each j, m, l.

Slide 16: Model 2
 The likelihood function now is
   Pr(f|e) = ε × Π_{j=1..m} Σ_{i=0..l} t(f_j | e_i) a(i | j, m, l),
 and the auxiliary function to be maximized adds Lagrange multipliers for both the translation and the alignment constraints.
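Model 2's likelihood differs from Model 1 only in weighting each connection (i, j) by a(i|j,m,l) instead of the uniform 1/(l+1). A sketch with invented toy tables (missing entries are treated as 0):

```python
# Sketch of the Model 2 likelihood
#   Pr(f|e) = eps * prod_j sum_i t(f_j | e_i) * a(i | j, m, l),
# where position 0 is NULL. Tables t and a hold invented toy values.

def model2_likelihood(f_words, e_words, t, a, eps=1.0):
    e_full = ["NULL"] + e_words
    l, m = len(e_words), len(f_words)
    p = eps
    for j, f in enumerate(f_words, start=1):
        p *= sum(t.get((f, e), 0.0) * a.get((i, j, m, l), 0.0)
                 for i, e in enumerate(e_full))
    return p

t = {("la", "the"): 0.7, ("la", "NULL"): 0.1}
a = {(0, 1, 1, 1): 0.5, (1, 1, 1, 1): 0.5}      # uniform over i here
p = model2_likelihood(["la"], ["the"], t, a)    # 0.5*0.1 + 0.5*0.7
```

With a uniform over i, as in this toy, Model 2 reduces to Model 1; training makes a peak on realistic positions.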

Slide 17: Fertility and Tablet
 The fertility of an English word is the number of French words it is connected to: φ_i.
 Each English word translates to a set of French words called its tablet, T_i.
 The collection of tablets is the tableau, T.
 The final French string is a permutation π of the words in the tableau.
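Fertilities can be read directly off an alignment. A sketch reusing the earlier example, with position 0 reserved for the empty cept (NULL); the representation is my own, not the paper's:

```python
# Sketch: fertility phi_i = number of French words that English position i
# generates. alignment maps English position -> set of French positions;
# position 0 is the empty cept (NULL).

def fertilities(alignment, l):
    return [len(alignment.get(i, set())) for i in range(l + 1)]

# (Le programme a ete mis en application | And the(1) program(2) has(3)
#  been(4) implemented(5,6,7)): 'And' generates nothing, 'implemented'
# generates three French words.
align = {1: set(), 2: {1}, 3: {2}, 4: {3}, 5: {4}, 6: {5, 6, 7}}
phi = fertilities(align, 6)
```

Since every French word is generated by exactly one cept, the fertilities must sum to m, the French length; that invariant is a handy sanity check.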

Slide 18: Joint Likelihood of a Tableau and Permutation
 The joint likelihood of a tableau and permutation is Pr(T, π | e), built up from fertility, translation, and distortion terms.
 Pr(f|e) is then obtained by summing Pr(T, π | e) over all pairs (T, π) that yield the French string f.

Slide 19: Model 3 Assumptions
 The fertility probability of an English word depends only on the word: n(φ_i | e_i).
 The translation probability is t(f_j | e_{a_j}).
 The distortion probability is d(j | i, m, l): the probability that a word connected to English position i lands in French position j.

Slide 20: Model 3
 The likelihood function for Model 3 is now
   Pr(f|e) = Σ_a C(m - φ_0, φ_0) p_0^(m - 2φ_0) p_1^(φ_0) × Π_{i=1..l} φ_i! n(φ_i | e_i) × Π_{j=1..m} t(f_j | e_{a_j}) × Π_{j: a_j > 0} d(j | a_j, m, l),
 where φ_0 is the fertility of the empty cept and p_0, p_1 (with p_0 + p_1 = 1) govern how many words the empty cept generates.

Slide 21: Deficiency of Model 3
 The fertility of word i does not depend on the fertilities of previous words, and distortions are chosen independently of the positions already filled, so the model does not always concentrate its probability on events of interest: some "strings" receive several words in one position and none in another. Such a model is called deficient.
 This deficiency is not a serious problem: it may decrease the probability of all well-formed strings by a constant factor.

Slide 22: Model 4
 Model 4 allows phrases in the English string to move and be translated as units in the French string. Model 3 doesn't account for this well, because its distortions move words one by one.
 The distortion probabilities are conditioned on word classes A(e) and B(f), where A and B are functions of the English and French words respectively.
 Using this, Model 4 can account for facts such as an adjective appearing before a noun in English and after it in French. - THIS IS GOOD !

Slide 23: Model 4
 For example, 'implemented' produces 'mis en application', all occurring together, whereas 'not' produces 'ne ... pas', which occurs with a word in between.
 So d_{>1}(2 | B(pas)) is relatively large when compared to d_{>1}(2 | B(en)).
 Models 3 and 4 are both deficient: words can be placed before the first position or beyond the last position in the French string. Model 5 removes this deficiency.

Slide 24: Model 5
 They define v_j to be the number of vacancies up to and including position j just before forming the words of the i-th cept.
 The distortion probabilities are then conditioned on these vacancy counts, so probability is placed only on positions that are actually vacant.
 Model 5 is powerful but must be used in tandem with the other four models.

Slide 25: Results (slide content is a table/figure not included in the transcript)

Slide 26: Changing Viterbi Alignments with Iterations (slide content is a figure not included in the transcript)

Slide 27: Key Points from Results
 Words like 'nodding' have a large fertility because they do not slip gracefully into French.
 Words like 'should' do not have a fertility greater than one, but they translate into many different possible words, so their translation probability is spread more thinly.
 Words like 'the' sometimes have zero fertility, since English prefers an article in some places where French does not.