A Syntax-Driven Bracketing Model for Phrase-Based Translation Deyi Xiong, et al. ACL 2009
Introduction Machine Translation –Chinese to English –Chinese: 把 7 月 11 日 設立 為 航海 節 –An ideal case: to establish July 11 as Sailing Festival day
Wrong Linguistic Structure 航海 節 is a syntactic constituent –Source: 把 7 月 11 日 設立 為 航海 節 –Wrong output: to set up for navigation on July 11 knots
A Naive Solution Employ syntactic constraints –Fully respect linguistic structures
A Naive Solution (2) Unfortunately, it damages the performance –Non-syntactic translations are sometimes useful –Example: 把 今天 設立 為 → establish today as; 航海 節 → Sailing Festival day
Syntax-Driven Bracketing Model (SDB model) The translation unit matters more –Whether it is syntactic or non-syntactic Includes, but is not limited to, constituent matching/violation Preserves the strength of the phrase-based system
Translation Unit A bracketable source phrase and its corresponding translation Bracketable –A source phrase is bracketable if its translation is contiguous –A pair of neighboring phrases is bracketable if their translations are contiguous once combined
Translation Unit Examples: Bracketable –Source: 把 今天 設立 為 –Translation: establish today as –把 今天 設立 and 為 are bracketable –把 今天 設立 為 is bracketable
Translation Unit Examples: Unbracketable –Source: 把 今天 設立 為 –Translation: establish today as –設立 and 為 are unbracketable –設立 為 is unbracketable
Bracketing Instances Extraction Extract bracketable and unbracketable instances from training data –Aligned sentence pair + parsed source sentence Estimate whether a source phrase is bracketable at run time
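The bracketability test and the instance extraction can be sketched roughly as follows. This is a minimal sketch, not the authors' code. The word alignment for the running example (今天 to today, 設立 to establish, 為 to as, 把 unaligned) is an assumption, as are the function names and the choice to treat fully unaligned phrases as unbracketable. "Contiguous" is read here as: no target word inside the translated span is aligned to a source word outside the phrase.

```python
from itertools import combinations

# Running example from the slides; the alignment links are assumed.
SOURCE = ["把", "今天", "設立", "為"]
TARGET = ["establish", "today", "as"]
ALIGNMENT = {(1, 1), (2, 0), (3, 2)}  # (source index, target index)


def is_bracketable(alignment, start, end):
    """True if source span [start, end) maps to a contiguous target span."""
    targets = [t for s, t in alignment if start <= s < end]
    if not targets:
        return False  # assumption: fully unaligned phrases are skipped
    lo, hi = min(targets), max(targets)
    # Any link into [lo, hi] from outside the phrase breaks contiguity.
    return all(start <= s < end for s, t in alignment if lo <= t <= hi)


def extract_instances(n_source, alignment, min_len=2):
    """Label every source span of at least min_len words."""
    return {
        (i, j): is_bracketable(alignment, i, j)
        for i, j in combinations(range(n_source + 1), 2)
        if j - i >= min_len
    }


instances = extract_instances(len(SOURCE), ALIGNMENT)
# Source phrases labeled unbracketable under the assumed alignment.
unbracketable = [
    " ".join(SOURCE[i:j])
    for (i, j), ok in sorted(instances.items())
    if not ok
]
```

Under the assumed alignment, 設立 為 is the only unbracketable span, which agrees with the example slides. In the full model each labeled span would also be paired with features read off the source parse tree.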
SDB Features
Rule Features Rule Features (RF) –CFG rule –Horizontal context
Rule Features (2) S1: ADVP → AD; S2: VP → VV AS NP; S: VP → ADVP VP
Path Features Path features (PF) –Path to roots: S1 to the root of S; S2 to the root of S; S to the root of the whole tree –Vertical context
Path Features (2) S1: ADVP VP; S2: VP VP; S: VP IP
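The rule and path features on the example slides can be reproduced with a small tree sketch. The `Node` class, the helper names, the placeholder words, and the NP subject under IP are all hypothetical; only the labels ADVP, AD, VP, VV, AS, NP and IP and their nesting come from the slides.

```python
class Node:
    """A parse-tree node that remembers its parent."""

    def __init__(self, label, children=(), word=None):
        self.label = label
        self.children = list(children)
        self.word = word
        self.parent = None
        for child in self.children:
            child.parent = self


def cfg_rule(node):
    """Rule feature: the CFG rule that expands this node."""
    rhs = " ".join(child.label for child in node.children)
    return f"{node.label} -> {rhs}"


def path_to(node, ancestor):
    """Path feature: labels from node up to the given ancestor."""
    labels = [node.label]
    while node is not ancestor:
        if node.parent is None:
            raise ValueError("ancestor is not above node")
        node = node.parent
        labels.append(node.label)
    return " ".join(labels)


# Placeholder words w0..w4; the subject NP is an assumed filler.
advp = Node("ADVP", [Node("AD", word="w1")])                   # S1
inner_vp = Node("VP", [Node("VV", word="w2"),
                       Node("AS", word="w3"),
                       Node("NP", [Node("NN", word="w4")])])   # S2
vp = Node("VP", [advp, inner_vp])                              # S
ip = Node("IP", [Node("NP", [Node("NN", word="w0")]), vp])

rules = [cfg_rule(advp), cfg_rule(inner_vp), cfg_rule(vp)]
paths = [path_to(advp, vp), path_to(inner_vp, vp), path_to(vp, ip)]
```

Here `rules` holds the horizontal context (what each node expands to) and `paths` holds the vertical context (what sits above each node).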
Constituent Boundary Matching Features Constituent Boundary Matching Features (CBMF) –Exact match: the source phrase exactly covers the boundaries of its tree –Inside match: the source phrase covers a sequence of subtrees within its tree –Crossing match: the source phrase crosses a subtree of its tree
Constituent Boundary Matching Features (3) [Figure: examples of exact match, inside match, and crossing match]
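One way to compute the three match types is sketched below. It assumes that "its tree" means the lowest subtree covering the phrase, and that an inside match means the phrase lines up with a run of that subtree's children; both readings are assumptions, as are the tuple-based tree encoding and the function names.

```python
def annotate(tree, start=0):
    """Turn (label, child, ...) tuples into (label, start, end, children)."""
    label, *kids = tree
    if len(kids) == 1 and isinstance(kids[0], str):
        return (label, start, start + 1, [])  # preterminal over one word
    children, pos = [], start
    for kid in kids:
        child = annotate(kid, pos)
        children.append(child)
        pos = child[2]
    return (label, start, pos, children)


def lowest_covering(node, i, j):
    """Descend to the lowest node whose span contains [i, j)."""
    for child in node[3]:
        if child[1] <= i and j <= child[2]:
            return lowest_covering(child, i, j)
    return node


def boundary_match(root, i, j):
    """Classify source span [i, j) as exact, inside, or crossing."""
    node = lowest_covering(root, i, j)
    if (node[1], node[2]) == (i, j):
        return "exact"
    starts = {child[1] for child in node[3]}
    ends = {child[2] for child in node[3]}
    return "inside" if i in starts and j in ends else "crossing"


# Hypothetical five-word sentence with placeholder words w0..w4.
TREE = annotate(
    ("IP",
     ("NP", ("NN", "w0")),
     ("VP",
      ("ADVP", ("AD", "w1")),
      ("VP", ("VV", "w2"), ("AS", "w3"), ("NP", ("NN", "w4")))))
)

labels = {span: boundary_match(TREE, *span)
          for span in [(1, 5), (2, 4), (1, 3)]}
```

In this toy tree, span (1, 5) is the whole outer VP, span (2, 4) covers the VV and AS children of the inner VP, and span (1, 3) starts at ADVP but ends partway through the inner VP.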
Integration into Phrase-based MT The SDB model estimates the probability that a source phrase is bracketable –Whether it can be translated as a unit Integrated into a BTG MT system –Bracketing Transduction Grammar (Wu, 1997) –Straight: 把 今天 設立 + 為 → establish today as –Inverted: 把 今天 設立 + 為 → as establish today
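A rough sketch of how such a probability could feed a BTG merge step is given below. The slide does not say how the score is combined, so adding a weighted log-probability to the hypothesis score is an assumption, as are the function names, the feature string, and the feature weights.

```python
import math


def bracketable_probability(weights, features):
    """Binary maximum-entropy estimate of P(bracketable | features)."""
    scores = {
        label: sum(weights.get((f, label), 0.0) for f in features)
        for label in (True, False)
    }
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[True]) / z


def merge(left, right, order, p_bracketable, sdb_weight=1.0):
    """Combine two neighboring blocks with a BTG straight or inverted rule.

    Each block is a (source_words, target_words, score) triple.
    """
    src_l, tgt_l, score_l = left
    src_r, tgt_r, score_r = right
    if order == "straight":
        target = tgt_l + tgt_r      # keep source order on the target side
    elif order == "inverted":
        target = tgt_r + tgt_l      # swap the two target sides
    else:
        raise ValueError(f"unknown order: {order}")
    # Assumption: SDB enters as one weighted log-probability feature.
    score = score_l + score_r + sdb_weight * math.log(p_bracketable)
    return src_l + src_r, target, score


# Hypothetical feature weights; a trained ME model would supply these.
weights = {("CBMF=exact", True): 1.0}
p = bracketable_probability(weights, ["CBMF=exact"])

left = (["把", "今天", "設立"], ["establish", "today"], 0.0)
right = (["為"], ["as"], 0.0)
straight = merge(left, right, "straight", p)
inverted = merge(left, right, "inverted", p)
```

With these inputs the straight merge yields "establish today as" and the inverted merge yields "as establish today", matching the two orders on the slide. A higher bracketing probability for the merged source span raises the score of both hypotheses equally, so the choice between the orders is left to the other decoder features.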
Experiment Models compared –Baseline: BTG system –XP+ (Marton and Resnik, 2008): NP, VP, PP, ADVP…; penalizes each violation of a syntactic boundary (soft constraint) –UniSDB: only S features –BiSDB: S1, S2 and S features
Experiment (2) Chinese parser –Lexicalized PCFG parser (Xiong et al., 2005) Parallel corpus –FBIS corpus Word alignment –GIZA++ Four-gram language model –Built with SRILM –Xinhua section of the English Gigaword corpus Maximum Entropy (ME) trainer –(Zhang, 2004)
Result SDB receives the largest feature weight –This implies its impact on the decoder –Features compared: baseline features (common to phrase-based systems), XP+ and SDB
Result (2) NIST MT-05 test set –Improvement of 1.67 BLEU over baseline –Improvement of 0.59 BLEU over XP+
Result (3) On top of CBMF, adding rule and path features yields further improvement BiSDB is consistently better than UniSDB –Inner contexts (S1 and S2) are useful
XP+ and SDB Same –Both consider syntactic constituents Different –XP+ only penalizes non-syntactic source phrases –SDB can encourage a non-syntactic phrase if it is bracketable
Conclusion The SDB model predicts whether a source phrase can be translated as a unit. Appropriate constituent violations are helpful –Because the model better inherits the strength of the phrase-based approach