Johns Hopkins 2003 Summer Workshop on Syntax and Statistical Machine Translation Chapters 5-8 Ethan Phelps-Goodman.


1 Johns Hopkins 2003 Summer Workshop on Syntax and Statistical Machine Translation Chapters 5-8
Ethan Phelps-Goodman

2 Outline
“Tricky” Syntactic Features
Reranking with Perceptrons
Reranking with Minimum Bayes Risk
Conclusion
In the interest of time I’m skipping some features. If there is one you’d like to talk about, ask me.

3 “Tricky” Syntactic Features
Constituent Alignment Features
Markov Assumption
Bi-gram model over elementary trees

4 Constituent Alignment Features
Hypothesis: target sentences should be rewarded for having syntactic structure similar to the source sentence
Experiments: Tree to String Penalty, Tree to Tree Penalty, Constituent Label Probability
Uses: parse trees in both languages, word alignments
Is that hypothesis really true?

5 Tree to String Penalty Penalize target words that cross source constituents
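
The slide does not spell out the computation, so here is a minimal sketch of one plausible formalization: a target word "crosses" a source constituent if it lies inside the target span projected from that constituent but is aligned entirely to source words outside it. The function name and the span/alignment representations are my assumptions, not the workshop's.

```python
def tree_to_string_penalty(src_spans, alignment):
    """Count target words that 'cross' a source constituent.

    src_spans : list of (i, j) inclusive source-token spans, one per constituent
    alignment : set of (src_pos, tgt_pos) word-alignment links
    """
    penalty = 0
    for i, j in src_spans:
        # target positions the constituent projects onto
        inside = {t for s, t in alignment if i <= s <= j}
        if len(inside) < 2:
            continue  # a single projected word cannot be interrupted
        lo, hi = min(inside), max(inside)
        for t in range(lo, hi + 1):
            links = {s for s, tt in alignment if tt == t}
            # aligned only outside the constituent -> it interrupts the span
            if links and links.isdisjoint(range(i, j + 1)):
                penalty += 1
    return penalty

# Toy usage: constituents (0,1) and (2,2); target word 1 links outside (0,1)
# print(tree_to_string_penalty([(0, 1), (2, 2)], {(0, 0), (1, 2), (2, 1)}))  # -> 1
```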

6 Tree to Tree Penalty Penalize target constituents that don’t align to source constituents

7 Constituent Label Probability
Learns Pr( target node label | source node label ), e.g. Pr( target = VP | source = NP )
Align tree nodes by finding the lowest common ancestor of the aligned leaves
Training: ML counts from training data
Also: Pr( target label, target leaf count | source label, source leaf count )
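
A sketch of how the ML counts might be collected: for each source node, project its leaf span through the alignment and take the lowest common ancestor of the aligned target leaves. The Node class and helper names here are invented for illustration; the workshop's data structures are not given on the slide.

```python
from collections import defaultdict

class Node:
    def __init__(self, label, children=(), leaf_index=None):
        self.label, self.children, self.leaf_index = label, list(children), leaf_index

def leaves(node):
    return [node] if not node.children else [l for c in node.children for l in leaves(c)]

def all_nodes(node):
    yield node
    for c in node.children:
        yield from all_nodes(c)

def lca(root, targets):
    """Lowest node whose leaf set covers every leaf index in `targets`."""
    if not targets <= {l.leaf_index for l in leaves(root)}:
        return None
    for child in root.children:
        hit = lca(child, targets)
        if hit is not None:
            return hit
    return root

def label_probs(corpus):
    """ML estimate of Pr(target label | source label) from
    (src_tree, tgt_tree, alignment) triples; alignment is a set of
    (source_leaf_index, target_leaf_index) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for src_tree, tgt_tree, alignment in corpus:
        for src_node in all_nodes(src_tree):
            span = {l.leaf_index for l in leaves(src_node)}
            aligned = {t for s, t in alignment if s in span}
            tgt_node = lca(tgt_tree, aligned) if aligned else None
            if tgt_node is not None:
                counts[src_node.label][tgt_node.label] += 1
    # normalize the counts into conditional probabilities
    return {s: {t: n / sum(d.values()) for t, n in d.items()}
            for s, d in counts.items()}
```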

8 Results
Not statistically significant
Possible problems: noisy parses, noisy alignments, insensitivity of BLEU

9 Markov Assumption for Tree Models
Tree-based translation models from Chapter 4: too slow (they only had time to parse 300 out of the 1,000-best) and limited in reordering among the higher levels of the tree
Solution: split trees into independent tree fragments
Lesson: it is difficult to get complicated features to work when there is so much noise in the system.

10 Markov Example

11 TAG Elementary Trees
Break into tree fragments by head word
Build an n-gram model on the tree fragments, with basic unit u_i = (e_i, f_i, t_{e_i}, t_{f_i}), where e_i and f_i are the i-th target and source words and t_{e_i} and t_{f_i} are their tree fragments
Unigram model: Pr(u_1 … u_n) = ∏_i Pr(u_i)
Bi-gram model: Pr(u_1 … u_n) = ∏_i Pr(u_i | u_{i−1})
Intuition: still a simple bi-gram model, but the basic unit is syntactically motivated
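
A sketch of the bi-gram model, treating each unit u_i as an opaque hashable token, e.g. ('hit', '(VP (V hit) NP^)'). The add-one smoothing and the <s> padding are my assumptions; the report's estimation details are not on the slide.

```python
from collections import Counter
import math

def train_fragment_bigram(corpus):
    """corpus: list of sentences, each a list of hashable tokens u_i."""
    unigrams, bigrams = Counter(), Counter()
    for seq in corpus:
        padded = ["<s>"] + list(seq)
        unigrams.update(padded[:-1])             # history counts
        bigrams.update(zip(padded, padded[1:]))  # consecutive pairs
    return unigrams, bigrams

def logprob(seq, unigrams, bigrams, vocab_size):
    """Add-one smoothed log Pr(u_1..u_n) = sum_i log Pr(u_i | u_{i-1})."""
    lp, prev = 0.0, "<s>"
    for tok in seq:
        lp += math.log((bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab_size))
        prev = tok
    return lp
```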

12 Finding TAG Elementary Trees
Heuristically assign head nodes

13 Finding TAG Elementary Trees
Split at head words
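
Slides 12 and 13 only show pictures. The sketch below assumes head children have already been chosen by heuristic head rules (Magerman/Collins style) and splits the tree wherever a child's head word differs from its parent's, yielding one elementary tree per head word. The N class and the '^' substitution-site marker are inventions for illustration.

```python
class N:
    """Parse-tree node; head_child indexes the child chosen by the head rules."""
    def __init__(self, label, children=(), word=None, head_child=0):
        self.label, self.children = label, list(children)
        self.word, self.head_child = word, head_child

def head_word(n):
    """Follow head-child links down to the lexical head."""
    return n.word if n.word is not None else head_word(n.children[n.head_child])

def extract_fragments(root):
    """Split a head-annotated tree into elementary trees, one per head word.
    Cut edges where the child's head word differs from the parent's; the cut
    child starts a new fragment and leaves a substitution site behind."""
    fragments = []
    def build(n):
        if n.word is not None:
            return N(n.label, word=n.word)
        kids = []
        for c in n.children:
            if head_word(c) != head_word(n):
                fragments.append(build(c))     # new fragment rooted at c
                kids.append(N(c.label + "^"))  # substitution site
            else:
                kids.append(build(c))
        return N(n.label, kids, head_child=n.head_child)
    fragments.append(build(root))
    return fragments
```

On a tree like (S (NP Bob) (VP (V hit) (NP the ball))) headed by "hit", this yields fragments anchored at "Bob", "the", "ball", and "hit", matching the one-fragment-per-head-word idea.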

14 Results
Why does performance improve with the independence assumption? Just because coverage increases?

15 Outline
“Tricky” Syntactic Features
Reranking with Perceptrons
Reranking with Minimum Bayes Risk
Conclusion
In the interest of time I’m skipping some features. If there is one you’d like to talk about, ask me.

16 Why rerank?
Successful in POS tagging and parsing
Easy incorporation of new features
Decreased decoding complexity
But the entire report is about reranking. Why would the perceptron be better than MER training of the log-linear model?

17 Reranking with a linear classifier
Log-linear and Perceptron both give a linear reranking rule: ê = argmax_e Σ_m λ_m h_m(e, f)
Difference is in training:
Log-linear MER: optimize BLEU of the 1-best
Perceptron: attempt to separate “good” from “bad” in the 1,000-best
Note: the λ_m are trained with MER on the dev set
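
The rule itself is one line; a sketch, assuming the n-best list comes as (hypothesis, feature-vector) pairs and weights are the λ_m:

```python
def rerank(nbest, weights):
    """Linear rule: return the hypothesis e maximizing sum_m lambda_m * h_m(e, f).
    nbest: list of (hypothesis, feature_vector) pairs; weights: the lambda_m."""
    return max(nbest, key=lambda p: sum(w * h for w, h in zip(weights, p[1])))[0]
```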

18 Training Data
For a single sentence in the dev set:
sort the 1,000-best translations by BLEU score
count the top 1/3 of the 1,000-best as good
count the bottom 1/3 as bad
This gives us our training data; it finds features that on average lead to translations with higher BLEU scores.
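
A sketch of the labeling step, assuming sentence_bleu is any sentence-level BLEU implementation supplied by the caller (e.g. NLTK's); the middle third is simply discarded:

```python
def label_nbest(nbest, references, sentence_bleu):
    """Turn one sentence's n-best list into training examples:
    top third by sentence-level BLEU -> +1, bottom third -> -1.
    nbest: list of (hypothesis, feature_vector) pairs."""
    ranked = sorted(nbest, key=lambda p: sentence_bleu(p[0], references), reverse=True)
    k = len(ranked) // 3
    return ([(feats, +1) for _, feats in ranked[:k]] +
            [(feats, -1) for _, feats in ranked[-k:]])
```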

19 Separating in feature space
For a single sentence:

20 Separating in feature space
For many sentences: the separator can sit in a different place for each sentence, but its direction is restricted to be the same for all sentences

21 Reranking
The distance from the hyperplane is the score of a translation
In effect, the perceptron is trying to predict the BLEU score
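
One standard way to realize "same direction, per-sentence offset" is to train on within-sentence differences: when a good and a bad hypothesis of the same sentence are compared, any per-sentence bias cancels out. The pairwise perceptron below is a common realization of this idea; the workshop's exact update may differ.

```python
def train_perceptron(groups, epochs=10):
    """groups: one list per sentence of (feature_vector, label) pairs, labels +1/-1.
    Returns a single weight vector w, the direction shared by all sentences."""
    dim = len(groups[0][0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for group in groups:
            good = [f for f, y in group if y > 0]
            bad = [f for f, y in group if y < 0]
            for g in good:
                for b in bad:
                    # require w . (g - b) > 0; update on violated pairs
                    if sum(wi * (gi - bi) for wi, gi, bi in zip(w, g, b)) <= 0:
                        w = [wi + gi - bi for wi, gi, bi in zip(w, g, b)]
    return w
```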

22 Features
Baseline: the 12 features used in Och’s baseline
Full: Baseline + all features from the workshop
POS sequence: a 0/1 feature for every possible POS sequence
Parse trees: a 0/1 feature for every possible subtree
The last two need explaining
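
A sketch of the two sparse feature maps. Trees are represented here as nested (label, children...) tuples, and only complete subtrees rooted at each node are enumerated; whether the workshop also counted partial subtrees is not stated on the slide.

```python
def pos_sequence_feature(pos_tags):
    """One 0/1 indicator per distinct full POS sequence."""
    return {("POS",) + tuple(pos_tags): 1}

def subtree_features(tree):
    """One 0/1 indicator per subtree occurring in the parse;
    leaves are plain strings and are skipped."""
    feats = {}
    def walk(t):
        if isinstance(t, tuple):
            feats[("TREE", t)] = 1
            for child in t[1:]:
                walk(child)
    walk(tree)
    return feats

# subtree_features(("S", ("NP", "Bob"), ("VP", "ran")))
# -> indicators for the S, NP, and VP subtrees
```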

23 Results
Log-linear reranking: baseline 31.6, workshop features
Perceptron reranking: baseline, workshop features, POS sequence, parse tree

24 Analysis
The dev set was too small to optimize for the POS and tree features
Using only the POS sequence gives results as good as the much more complicated baseline

25 Outline
“Tricky” Syntactic Features
Reranking with Perceptrons
Reranking with Minimum Bayes Risk
Conclusion
This one will take a bit of background in machine learning theory.

26 An Example
Say our 4-best list is:
“Ball at the from Bob” .31
“Bob hit the ball”
“Bob struck the ball” .23
“Bob hit a ball”
MAP: choose the single most likely hypothesis
MBR: weight each hypothesis by its similarity with the other hypotheses
If they only get one slide from this section, this should be it.

27 MBR, formally
ê = argmax_{(e′,a′,T′)} Σ_{(e,a,T)} L((e′,a′,T′),(e,a,T); f,T(f)) · Pr(e,a,T | f)
where f = source sentence
e, e′ = target sentences
a, a′ = word alignments
T, T′ = target parse trees; T(f) = source parse tree
L((e′,a′,T′),(e,a,T); f,T(f)) = similarity between translations e and e′

28 Making MBR tractable
Restrict e to the 1,000-best list
Use the translation model score for P(e, a | f)
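
Putting slides 26 through 28 together, a sketch of n-best MBR: normalize model scores into posteriors over the list, then pick the hypothesis with the highest expected similarity. The softmax normalization of log scores is my assumption about how the model scores are turned into P(e, a | f).

```python
import math

def mbr_decode(nbest, similarity):
    """nbest: list of (hypothesis, model_log_score);
    similarity(h1, h2) -> float, higher means more alike.
    Returns the hypothesis with maximum posterior-weighted similarity."""
    m = max(s for _, s in nbest)
    post = [math.exp(s - m) for _, s in nbest]   # unnormalized posteriors
    z = sum(post)
    post = [p / z for p in post]

    def expected_gain(hyp):
        return sum(p * similarity(hyp, other) for (other, _), p in zip(nbest, post))

    return max((h for h, _ in nbest), key=expected_gain)
```

With similarity set to sentence-level BLEU, the three "Bob ... ball" hypotheses on slide 26 reinforce one another, so one of them wins even though "Ball at the from Bob" has the single highest posterior (.31).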

29 Loss functions: translation similarity
Tier 1, string based: L(e, e′) = BLEU score between e and e′
Tier 2, syntax based: L((e,T),(e′,T′)) = tree kernel between T and T′
Tier 3, alignment based: L((e′,a′,T′),(e,a,T); f,T(f)) = number of source-to-target node alignments that are the same in T and T′
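
For tier 2, the slide does not say which tree kernel was used; a common choice is the Collins-Duffy convolution kernel, which counts the tree fragments shared by two parses. A minimal unweighted version (no decay factor), with trees again as (label, children...) tuples:

```python
def tree_kernel(t1, t2):
    """Count tree fragments shared by two parses (Collins & Duffy 2001,
    without the decay weight); leaves are plain strings."""
    def nodes(t):
        return [] if isinstance(t, str) else [t] + [n for c in t[1:] for n in nodes(c)]

    def common(n1, n2):
        # fragments rooted at this node pair; zero unless same production
        kids1 = [c if isinstance(c, str) else c[0] for c in n1[1:]]
        kids2 = [c if isinstance(c, str) else c[0] for c in n2[1:]]
        if n1[0] != n2[0] or kids1 != kids2:
            return 0
        result = 1
        for c1, c2 in zip(n1[1:], n2[1:]):
            if not isinstance(c1, str):
                result *= 1 + common(c1, c2)
        return result

    return sum(common(a, b) for a in nodes(t1) for b in nodes(t2))
```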

30 Results
Using a loss based on a particular measurement leads to better performance on that measurement
BLEU score goes up by 0.3

31 Outline
“Tricky” Syntactic Features
Reranking with Perceptrons
Reranking with Minimum Bayes Risk
Conclusion

32 What Works
Baseline: 31.6
Model 1 (on 250M words): 32.5
All features from workshop
Human upper bound
Simple Model 1 is the only feature to give a statistically significant improvement in isolation
Complex features are helpful in aggregate

33 Their wish list
Better evaluation metrics: for individual sentences; for specific issues, such as missing content words
Better parameter tuning: larger dev set
Take divergence into account in the syntax
Better quality parses: a confidence measure on the parse

