Johns Hopkins 2003 Summer Workshop on Syntax and Statistical Machine Translation Chapters 5-8 Ethan Phelps-Goodman.


1 Johns Hopkins 2003 Summer Workshop on Syntax and Statistical Machine Translation Chapters 5-8
Ethan Phelps-Goodman

2 Outline
“Tricky” Syntactic Features
Reranking with Perceptrons
Reranking with Minimum Bayes Risk
Conclusion
In the interest of time I’m skipping some features. If there is one you’d like to talk about, ask me.

3 “Tricky” Syntactic Features
Constituent Alignment Features
Markov Assumption
Bi-gram model over elementary trees

4 Constituent Alignment Features
Hypothesis: target sentences should be rewarded for having syntactic structure similar to the source sentence
Experiments: Tree to String Penalty, Tree to Tree Penalty, Constituent Label Probability
Uses: parse trees in both languages, word alignments
Is that hypothesis really true?

5 Tree to String Penalty Penalize target words that cross source constituents
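
The slide does not spell out the computation, so here is a minimal sketch of one plausible formalization: a target word "crosses" a source constituent if it lies inside the target span projected from that constituent but is aligned entirely to source words outside it. The function name and the span/alignment representations are my assumptions, not the workshop's.

```python
def tree_to_string_penalty(src_spans, alignment):
    """Count target words that 'cross' a source constituent.

    src_spans : list of (i, j) inclusive source-token spans, one per constituent
    alignment : set of (src_pos, tgt_pos) word-alignment links
    """
    penalty = 0
    for i, j in src_spans:
        # target positions the constituent projects onto
        inside = {t for s, t in alignment if i <= s <= j}
        if len(inside) < 2:
            continue  # a single projected word cannot be interrupted
        lo, hi = min(inside), max(inside)
        for t in range(lo, hi + 1):
            links = {s for s, tt in alignment if tt == t}
            # aligned only outside the constituent -> it interrupts the span
            if links and links.isdisjoint(range(i, j + 1)):
                penalty += 1
    return penalty

# Toy usage: constituents (0,1) and (2,2); target word 1 links outside (0,1)
# print(tree_to_string_penalty([(0, 1), (2, 2)], {(0, 0), (1, 2), (2, 1)}))  # -> 1
```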

6 Tree to Tree Penalty Penalize target constituents that don’t align to source constituents

7 Constituent Label Probability
Learns Pr( target node label | source node label ), e.g. Pr( target = VP | source = NP )
Align tree nodes by finding the lowest common ancestor of the aligned leaves
Training: ML counts from training data
Also: Pr( target label, target leaf count | source label, source leaf count )
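
A sketch of how the ML counts might be collected: for each source node, project its leaf span through the alignment and take the lowest common ancestor of the aligned target leaves. The Node class and helper names here are invented for illustration; the workshop's data structures are not given on the slide.

```python
from collections import defaultdict

class Node:
    def __init__(self, label, children=(), leaf_index=None):
        self.label, self.children, self.leaf_index = label, list(children), leaf_index

def leaves(node):
    return [node] if not node.children else [l for c in node.children for l in leaves(c)]

def all_nodes(node):
    yield node
    for c in node.children:
        yield from all_nodes(c)

def lca(root, targets):
    """Lowest node whose leaf set covers every leaf index in `targets`."""
    if not targets <= {l.leaf_index for l in leaves(root)}:
        return None
    for child in root.children:
        hit = lca(child, targets)
        if hit is not None:
            return hit
    return root

def label_probs(corpus):
    """ML estimate of Pr(target label | source label) from
    (src_tree, tgt_tree, alignment) triples; alignment is a set of
    (source_leaf_index, target_leaf_index) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for src_tree, tgt_tree, alignment in corpus:
        for src_node in all_nodes(src_tree):
            span = {l.leaf_index for l in leaves(src_node)}
            aligned = {t for s, t in alignment if s in span}
            tgt_node = lca(tgt_tree, aligned) if aligned else None
            if tgt_node is not None:
                counts[src_node.label][tgt_node.label] += 1
    # normalize the counts into conditional probabilities
    return {s: {t: n / sum(d.values()) for t, n in d.items()}
            for s, d in counts.items()}
```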

8 Results
Not statistically significant
Possible problems: noisy parses, noisy alignments, insensitivity of BLEU

9 Markov Assumption for Tree Models
Tree-based translation models from Chapter 4: too slow (they only had time to parse 300 out of the 1,000-best) and limited in reordering among the higher levels of the tree
Solution: split trees into independent tree fragments
Lesson: it is difficult to get complicated features to work when there is so much noise in the system.

10 Markov Example

11 TAG Elementary Trees
Break into tree fragments by head word
Build an n-gram model on the tree fragments, with basic unit u_i = (e_i, f_i, t_{e_i}, t_{f_i}), where e_i and f_i are the i-th target and source words and t_{e_i} and t_{f_i} are their tree fragments
Unigram model: Pr(u_1 … u_n) = ∏_i Pr(u_i)
Bi-gram model: Pr(u_1 … u_n) = ∏_i Pr(u_i | u_{i−1})
Intuition: still a simple bi-gram model, but the basic unit is syntactically motivated
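
A sketch of the bi-gram model, treating each unit u_i as an opaque hashable token, e.g. ('hit', '(VP (V hit) NP^)'). The add-one smoothing and the <s> padding are my assumptions; the report's estimation details are not on the slide.

```python
from collections import Counter
import math

def train_fragment_bigram(corpus):
    """corpus: list of sentences, each a list of hashable tokens u_i."""
    unigrams, bigrams = Counter(), Counter()
    for seq in corpus:
        padded = ["<s>"] + list(seq)
        unigrams.update(padded[:-1])             # history counts
        bigrams.update(zip(padded, padded[1:]))  # consecutive pairs
    return unigrams, bigrams

def logprob(seq, unigrams, bigrams, vocab_size):
    """Add-one smoothed log Pr(u_1..u_n) = sum_i log Pr(u_i | u_{i-1})."""
    lp, prev = 0.0, "<s>"
    for tok in seq:
        lp += math.log((bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab_size))
        prev = tok
    return lp
```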

12 Finding TAG Elementary Trees
Heuristically assign head nodes

13 Finding TAG Elementary Trees
Split at head words
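
Slides 12 and 13 only show pictures. The sketch below assumes head children have already been chosen by heuristic head rules (Magerman/Collins style) and splits the tree wherever a child's head word differs from its parent's, yielding one elementary tree per head word. The N class and the '^' substitution-site marker are inventions for illustration.

```python
class N:
    """Parse-tree node; head_child indexes the child chosen by the head rules."""
    def __init__(self, label, children=(), word=None, head_child=0):
        self.label, self.children = label, list(children)
        self.word, self.head_child = word, head_child

def head_word(n):
    """Follow head-child links down to the lexical head."""
    return n.word if n.word is not None else head_word(n.children[n.head_child])

def extract_fragments(root):
    """Split a head-annotated tree into elementary trees, one per head word.
    Cut edges where the child's head word differs from the parent's; the cut
    child starts a new fragment and leaves a substitution site behind."""
    fragments = []
    def build(n):
        if n.word is not None:
            return N(n.label, word=n.word)
        kids = []
        for c in n.children:
            if head_word(c) != head_word(n):
                fragments.append(build(c))     # new fragment rooted at c
                kids.append(N(c.label + "^"))  # substitution site
            else:
                kids.append(build(c))
        return N(n.label, kids, head_child=n.head_child)
    fragments.append(build(root))
    return fragments
```

On a tree like (S (NP Bob) (VP (V hit) (NP the ball))) headed by "hit", this yields fragments anchored at "Bob", "the", "ball", and "hit", matching the one-fragment-per-head-word idea.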

14 Results
Why does performance improve with the independence assumption? Just because coverage increases?

15 Outline
“Tricky” Syntactic Features
Reranking with Perceptrons
Reranking with Minimum Bayes Risk
Conclusion
In the interest of time I’m skipping some features. If there is one you’d like to talk about, ask me.

16 Why rerank?
Successful in POS tagging and parsing
Easy incorporation of new features
Decreased decoding complexity
But the entire report is about reranking. Why would the perceptron be better than MER training of the log-linear model?

17 Reranking with a linear classifier
Log-linear and Perceptron both give a linear reranking rule: ê = argmax_e Σ_m λ_m h_m(e, f)
Difference is in training:
Log-linear MER: optimize BLEU of the 1-best
Perceptron: attempt to separate “good” from “bad” in the 1,000-best
Note: the λ_m are trained with MER on the dev set
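
The rule itself is one line; a sketch, assuming the n-best list comes as (hypothesis, feature-vector) pairs and weights are the λ_m:

```python
def rerank(nbest, weights):
    """Linear rule: return the hypothesis e maximizing sum_m lambda_m * h_m(e, f).
    nbest: list of (hypothesis, feature_vector) pairs; weights: the lambda_m."""
    return max(nbest, key=lambda p: sum(w * h for w, h in zip(weights, p[1])))[0]
```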

18 Training Data
For a single sentence in the dev set:
sort the 1,000-best translations by BLEU score
count the top 1/3 of the 1,000-best as good
count the bottom 1/3 as bad
This gives us our training data; it finds features that on average lead to translations with higher BLEU scores.
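
A sketch of the labeling step, assuming sentence_bleu is any sentence-level BLEU implementation supplied by the caller (e.g. NLTK's); the middle third is simply discarded:

```python
def label_nbest(nbest, references, sentence_bleu):
    """Turn one sentence's n-best list into training examples:
    top third by sentence-level BLEU -> +1, bottom third -> -1.
    nbest: list of (hypothesis, feature_vector) pairs."""
    ranked = sorted(nbest, key=lambda p: sentence_bleu(p[0], references), reverse=True)
    k = len(ranked) // 3
    return ([(feats, +1) for _, feats in ranked[:k]] +
            [(feats, -1) for _, feats in ranked[-k:]])
```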

19 Separating in feature space
For a single sentence:

20 Separating in feature space
For many sentences: the separator can sit in a different place for each sentence, but its direction is restricted to be the same for all sentences

21 Reranking
The distance from the hyperplane is the score of a translation
In effect, the perceptron is trying to predict the BLEU score
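
One standard way to realize "same direction, per-sentence offset" is to train on within-sentence differences: when a good and a bad hypothesis of the same sentence are compared, any per-sentence bias cancels out. The pairwise perceptron below is a common realization of this idea; the workshop's exact update may differ.

```python
def train_perceptron(groups, epochs=10):
    """groups: one list per sentence of (feature_vector, label) pairs, labels +1/-1.
    Returns a single weight vector w, the direction shared by all sentences."""
    dim = len(groups[0][0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for group in groups:
            good = [f for f, y in group if y > 0]
            bad = [f for f, y in group if y < 0]
            for g in good:
                for b in bad:
                    # require w . (g - b) > 0; update on violated pairs
                    if sum(wi * (gi - bi) for wi, gi, bi in zip(w, g, b)) <= 0:
                        w = [wi + gi - bi for wi, gi, bi in zip(w, g, b)]
    return w
```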

22 Features
Baseline: the 12 features used in Och’s baseline
Full: Baseline + all features from the workshop
POS sequence: a 0/1 feature for every possible POS sequence
Parse trees: a 0/1 feature for every possible subtree
The last two need explaining
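
A sketch of the two sparse feature maps. Trees are represented here as nested (label, children...) tuples, and only complete subtrees rooted at each node are enumerated; whether the workshop also counted partial subtrees is not stated on the slide.

```python
def pos_sequence_feature(pos_tags):
    """One 0/1 indicator per distinct full POS sequence."""
    return {("POS",) + tuple(pos_tags): 1}

def subtree_features(tree):
    """One 0/1 indicator per subtree occurring in the parse;
    leaves are plain strings and are skipped."""
    feats = {}
    def walk(t):
        if isinstance(t, tuple):
            feats[("TREE", t)] = 1
            for child in t[1:]:
                walk(child)
    walk(tree)
    return feats

# subtree_features(("S", ("NP", "Bob"), ("VP", "ran")))
# -> indicators for the S, NP, and VP subtrees
```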

23 Results
Log-linear reranking: baseline 31.6, workshop features
Perceptron reranking: baseline, workshop features, POS sequence, parse tree

24 Analysis
The dev set was too small to optimize for the POS and tree features
Using only the POS sequence gives results as good as the much more complicated baseline

25 Outline
“Tricky” Syntactic Features
Reranking with Perceptrons
Reranking with Minimum Bayes Risk
Conclusion
This one will take a bit of background in machine learning theory.

26 An Example
Say our 4-best list is:
“Ball at the from Bob” .31
“Bob hit the ball”
“Bob struck the ball” .23
“Bob hit a ball”
MAP: choose the single most likely hypothesis
MBR: weight each hypothesis by its similarity with the other hypotheses
If they only get one slide from this section, this should be it.

27 MBR, formally
ê = argmax_{(e′,a′,T′)} Σ_{(e,a,T)} L((e′,a′,T′),(e,a,T); f,T(f)) · Pr(e,a,T | f)
where f = source sentence
e, e′ = target sentences
a, a′ = word alignments
T, T′ = target parse trees; T(f) = source parse tree
L((e′,a′,T′),(e,a,T); f,T(f)) = similarity between translations e and e′

28 Making MBR tractable
Restrict e to the 1,000-best list
Use the translation model score for P(e, a | f)
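
Putting slides 26 through 28 together, a sketch of n-best MBR: normalize model scores into posteriors over the list, then pick the hypothesis with the highest expected similarity. The softmax normalization of log scores is my assumption about how the model scores are turned into P(e, a | f).

```python
import math

def mbr_decode(nbest, similarity):
    """nbest: list of (hypothesis, model_log_score);
    similarity(h1, h2) -> float, higher means more alike.
    Returns the hypothesis with maximum posterior-weighted similarity."""
    m = max(s for _, s in nbest)
    post = [math.exp(s - m) for _, s in nbest]   # unnormalized posteriors
    z = sum(post)
    post = [p / z for p in post]

    def expected_gain(hyp):
        return sum(p * similarity(hyp, other) for (other, _), p in zip(nbest, post))

    return max((h for h, _ in nbest), key=expected_gain)
```

With similarity set to sentence-level BLEU, the three "Bob ... ball" hypotheses on slide 26 reinforce one another, so one of them wins even though "Ball at the from Bob" has the single highest posterior (.31).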

29 Loss functions: translation similarity
Tier 1, string based: L(e, e′) = BLEU score between e and e′
Tier 2, syntax based: L((e,T),(e′,T′)) = tree kernel between T and T′
Tier 3, alignment based: L((e′,a′,T′),(e,a,T); f,T(f)) = number of source-to-target node alignments that are the same in T and T′
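
For tier 2, the slide does not say which tree kernel was used; a common choice is the Collins-Duffy convolution kernel, which counts the tree fragments shared by two parses. A minimal unweighted version (no decay factor), with trees again as (label, children...) tuples:

```python
def tree_kernel(t1, t2):
    """Count tree fragments shared by two parses (Collins & Duffy 2001,
    without the decay weight); leaves are plain strings."""
    def nodes(t):
        return [] if isinstance(t, str) else [t] + [n for c in t[1:] for n in nodes(c)]

    def common(n1, n2):
        # fragments rooted at this node pair; zero unless same production
        kids1 = [c if isinstance(c, str) else c[0] for c in n1[1:]]
        kids2 = [c if isinstance(c, str) else c[0] for c in n2[1:]]
        if n1[0] != n2[0] or kids1 != kids2:
            return 0
        result = 1
        for c1, c2 in zip(n1[1:], n2[1:]):
            if not isinstance(c1, str):
                result *= 1 + common(c1, c2)
        return result

    return sum(common(a, b) for a in nodes(t1) for b in nodes(t2))
```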

30 Results
Using a loss based on a particular measurement leads to better performance on that measurement
BLEU score goes up by 0.3

31 Outline
“Tricky” Syntactic Features
Reranking with Perceptrons
Reranking with Minimum Bayes Risk
Conclusion

32 What Works
Baseline: 31.6
Model 1 (on 250M words): 32.5
All features from workshop
Human upper bound
Simple Model 1 is the only feature to give a statistically significant improvement in isolation
Complex features are helpful in aggregate

33 Their wish list
Better evaluation metrics: for individual sentences; for specific issues, such as missing content words
Better parameter tuning: larger dev set
Take divergence into account in the syntax
Better quality parses: a confidence measure on the parse

