Improving Word-Alignments for Machine Translation Using Phrase-Based Techniques Mike Rodgers Sarah Spikes Ilya Sherman

IBM Model 2 - recap
Alignments are word-to-word.
Factors considered: the words themselves, and their positions within the source and target sentences.
Formally, the probability that the i-th word of sentence S aligns with the j-th word of sentence T depends on:
what S[i] and T[j] are
f(i, j, length(S), length(T))
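As a rough illustration of this factorization (not the actual PA2 code), the score for a single word pair can be computed from two learned tables; t_table and a_table are hypothetical names for the lexical and positional parameters:

def model2_align_prob(S, T, i, j, t_table, a_table):
    """Score for aligning the i-th word of S with the j-th word of T."""
    lexical = t_table.get((S[i], T[j]), 1e-12)               # what S[i] and T[j] are
    positional = a_table.get((i, j, len(S), len(T)), 1e-12)  # f(i, j, length(S), length(T))
    return lexical * positional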

Introducing Phrases
Groups of words tend to translate as a unit (i.e., as a "phrase"). IBM Model 2 has no notion of this.
We began with a working IBM Model 2 word aligner (from PA2) and looked at three ways to extend this model using the notion of phrases.

Technique 1: Nearby Neighbors
Ideal: instead of measuring displacement relative to the diagonal, measure displacement relative to the previous alignment.
This is hard: to be efficient, EM assumes that all alignments are independent, so referring to "the previous alignment" has no meaning.
We get around this with a weaker dependency. For the likelihood of aligning S[i] to T[j]:
don't ask whether S[i-1] is actually aligned to T[j-1]
ask whether aligning S[i-1] to T[j-1] would be a good alignment

Technique 1: Nearby Neighbors
Suppose we have P(S, T, i, j) that returns the probability that S[i] aligns with T[j].
Define P'(S, T, i, j) = λ1 · P(S, T, i, j) + λ2 · P(S, T, i − 1, j − 1), with λ1 = 0.95 and λ2 = 0.05.
Use this distribution both in the EM phase and in computing the final results.
We also tried a variety of similar models.
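A minimal sketch of the interpolated distribution P', assuming P is whatever function returns the Model 2 alignment probability; the fallback at sentence edges is our own assumption, since the slides do not say how boundaries were handled:

LAMBDA_1, LAMBDA_2 = 0.95, 0.05

def neighbor_align_prob(P, S, T, i, j):
    # P'(S, T, i, j) = lambda1 * P(S, T, i, j) + lambda2 * P(S, T, i-1, j-1)
    base = P(S, T, i, j)
    # At sentence edges there is no (i-1, j-1) neighbor; reuse the base score.
    neighbor = P(S, T, i - 1, j - 1) if i > 0 and j > 0 else base
    return LAMBDA_1 * base + LAMBDA_2 * neighbor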

Technique 1: Nearby Neighbors - Results
When added to Model 2, provided only a slight improvement in the quality of the final results.
Provided a massive speedup in EM convergence, by pre-encoding information that would otherwise have to be learned.
When added to Model 1, provided a notable improvement in the quality of results: the model adds information, but most of that information is already captured by Model 2.

Technique 2: Beam Search
The IBM models had a slightly different solution.
IBM Model 2 penalized alignments of S[i] to T[j] with a high displacement d(S[i], T[j]) from the diagonal. Since phrases tend to move together, each word in the phrase incurs the penalty.
IBM Model 4 instead penalizes alignments of S[i] to T[j] that have a high displacement relative to the alignment of S[i − 1]. Thus, only the first word in each phrase is penalized.

Technique 2: Beam Search
But to know where the previous source word was aligned, we need to keep track of each partial alignment for the sentence.
We cannot afford to evaluate every possible alignment (exponential in the length of the sentence).
Instead, we maintain a beam of the n best alignments for the previous word.

Technique 2: Beam Search
To assess a penalty for aligning S[i] to T[j], we compute d'(S[i], T[j]) as the minimal displacement, measured either
absolutely, from the diagonal, or
relative to one of the previous word's n best alignments.
The two cases represent starting a new phrase and continuing an old phrase, respectively.
Formally, d'(S[i], T[j]) = min( d(S[i], T[j]), min over 1 ≤ m ≤ n of the displacement of T[j] relative to T[k_m] ), where T[k_m] is the m-th best alignment for S[i − 1].
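A hedged sketch of how d' could be computed from a beam of the previous word's best alignments. The slides do not give the exact displacement function, so diag_displacement below uses one common definition (relative position difference), and the relative case assumes a continued phrase would place S[i] at position k + 1:

def diag_displacement(i, j, len_s, len_t):
    # Distance of the pair (i, j) from the sentence diagonal.
    return abs(i / len_s - j / len_t)

def beam_displacement(i, j, len_s, len_t, beam):
    """d'(S[i], T[j]): minimum of the absolute displacement and the displacement
    relative to each of the n best alignments k_1..k_n kept for S[i-1]."""
    absolute = diag_displacement(i, j, len_s, len_t)  # starting a new phrase
    relative = min((abs(j - (k + 1)) / len_t for k in beam), default=absolute)  # continuing a phrase
    return min(absolute, relative)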

Technique 2: Beam Search - Results
In practice, n = 2 worked best: it gives just enough context without blurring distinctions between phrases.
Resulted in more than a 20% improvement in AER.
Combined with the nearby-neighbors approach, it gives a massive speedup as well.

Technique 3: Phrase Pre-chunking
Another idea was to find common phrases in each language and store them as a set.
Take the sentences and, whenever we see one of our common phrases, treat it as a single word in Model 2.
Ideally, we would find phrases of any length, taking the most probable phrases over the sentence as our chunks.

Technique 3: Phrase Pre-chunking - Implementation Issues
We began by using only bigrams as our phrases, for simplicity.
However, we found that this did not work well with our pre-existing Model 2 code: the function that gets the best word alignment expects alignments based on the original sentences' indices.
We need to pre-chunk the sentences to get any meaningful results from our training, but this destroys the original indices, so we have to either store the old sentence or reconstruct the indices as we go.
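A rough sketch of this pre-chunking bookkeeping, assuming a hypothetical set good_bigrams of frequent word pairs; the index map is what lets alignments over chunks be reported against the original sentence indices:

def prechunk(sentence, good_bigrams):
    """Merge known bigrams into single tokens; index_map[c] lists the original
    word positions covered by chunk c."""
    chunks, index_map, i = [], [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in good_bigrams:
            chunks.append(sentence[i] + " " + sentence[i + 1])
            index_map.append([i, i + 1])
            i += 2
        else:
            chunks.append(sentence[i])
            index_map.append([i])
            i += 1
    return chunks, index_map

# Example: prechunk(["the", "new", "york", "times"], {("new", "york")})
#   -> (["the", "new york", "times"], [[0], [1, 2], [3]])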

Technique 3: Phrase Pre-chunking - Ideas for Improvement
Expanding to N-grams.
Finding the best bigrams/N-grams rather than just the first one we see that is in our "good enough" set: once we had a bigram, we tried checking whether the second word and the following word made a "better" bigram, and if so, used that one instead.
This could potentially be improved with better techniques, though it would obviously be more complicated for longer N-grams.

Summary
Nearby neighbors: massive speed-up.
Beam search: 20% AER improvement.
Combined neighbors and beam: both improvements maintained (speed and AER).
Phrase pre-chunking: a good idea for further exploration.