Imposing Constraints from the Source Tree on ITG Constraints for SMT Hirofumi Yamamoto, Hideo Okuma, Eiichiro Sumita National Institute of Information and Communications Technology ATR Spoken Language Communication Research Labs. Kindai University School of Science and Engineering Department of Information
Background In current SMT, erroneous word reordering is one of the most serious problems, especially for dissimilar language pairs such as English-Chinese or English-Japanese. 1) Introduce linguistic syntax directly: tree-to-string, string-to-tree, tree-to-tree. This approach is not robust to parsing errors.
Background 2) Assign probabilistic constraints to word reordering: IBM distortion, lexical reordering, ITG. These are weaker constraints than the first type. Aim: introduce syntax information into the second type.
ITG Constraints The source sentence is represented by a binary tree. Target sentences are generated by rotating the branches at the nodes of the source tree. [Figure: source words A B C D aligned to two target word orders.] The target word orders in the figure cannot be generated from any source binary tree. Limitation: the actual source binary tree instance is not considered.
Basic Idea of IST-ITG Use ITG constraints under the given source tree. [Figure: two source binary trees over A B C D.] Tree ((A B) (C D)) allows abcd, abdc, bacd, badc, cdab, cdba, dcab, dcba. Tree (((A B) C) D) allows abcd, bacd, cabd, cbad, dabc, dbac, dcab, dcba. Under the original ITG constraints, 22 combinations are allowed.
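The two order lists on this slide can be reproduced with a short enumeration. The sketch below is illustrative only: the nested-tuple tree encoding and the function name `ist_itg_orders` are assumptions made here, not part of the original system.

```python
from itertools import product

def ist_itg_orders(tree):
    """Return every target order obtained by optionally swapping the two
    children at each node of a binary tree given as nested 2-tuples."""
    if isinstance(tree, str):            # leaf: a single source word
        return [tree]
    left, right = tree
    orders = []
    for l, r in product(ist_itg_orders(left), ist_itg_orders(right)):
        orders.append(l + r)             # keep the branch order
        orders.append(r + l)             # rotate the branches
    return orders

# Hypothetical encodings of the two trees over the source words a b c d.
balanced = (("a", "b"), ("c", "d"))
left_branching = ((("a", "b"), "c"), "d")

print(sorted(ist_itg_orders(balanced)))
print(sorted(ist_itg_orders(left_branching)))
```

Each tree yields 8 orders, against 22 when only the plain ITG constraints apply.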
The Number of Word Order Combinations For a binary source tree with $N$ words, $N!$ word order combinations are allowed without constraints. Under the IST-ITG constraints, this number is reduced to $2^{N-1}$. [Table: number of combinations without constraints, under ITG constraints, and under IST-ITG constraints for two example values of $N$.]
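For small sentence lengths, the three counts compared on this slide can be checked by brute force. The sketch below assumes the usual characterization of ITG-reachable orders: a permutation qualifies if it splits into two contiguous blocks that cover adjacent source ranges, recursively. It also relies on a binary tree over n words having n - 1 nodes that can each be rotated. The names `is_itg_order` and `counts` are hypothetical.

```python
from itertools import permutations
from math import factorial

def is_itg_order(perm):
    """True if perm (a tuple of distinct source positions) can be split into
    two contiguous blocks, one holding only smaller positions than the other,
    with both blocks satisfying the same condition recursively."""
    n = len(perm)
    if n <= 1:
        return True
    for k in range(1, n):
        left, right = perm[:k], perm[k:]
        if max(left) < min(right) or min(left) > max(right):
            if is_itg_order(left) and is_itg_order(right):
                return True
    return False

def counts(n):
    """Return (no constraints, ITG, IST-ITG with a binary tree) for n words."""
    itg = sum(is_itg_order(p) for p in permutations(range(n)))
    return factorial(n), itg, 2 ** (n - 1)

for n in range(2, 7):
    print(n, counts(n))
```

For four words this gives 24, 22, and 8. The brute-force ITG count is only practical for short sentences.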
Extension to Non-binary Tree Parsing results are sometimes not binary trees. For nodes that have more than two branches, any reordering of the branches is allowed. [Figure: source tree in which A is a sibling of a three-branch node over B, C, D.] Allowed orders: abcd, abdc, acbd, acdb, adbc, adcb, bcda, bdca, cbda, cdba, dbca, dcba
Extension to Non-binary Tree For a non-binary tree, the number of combinations under IST-ITG is given by $\prod_{i} B_i!$, where $B_i$ is the number of branches of the $i$-th node.
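The non-binary case can be sketched the same way: every node may emit its children in any order, so a node with k branches contributes a factor of k!. The tree encoding (nested tuples of any arity) and the names `orders` and `count` are assumptions for illustration.

```python
from itertools import permutations, product
from math import factorial, prod

def orders(tree):
    """Enumerate the target orders when the children of every node may be
    emitted in any order (binary nodes simply get their 2 orders)."""
    if isinstance(tree, str):
        return [tree]
    child_orders = [orders(child) for child in tree]
    result = []
    for arrangement in permutations(child_orders):   # order of the branches
        for pieces in product(*arrangement):         # order inside each branch
            result.append("".join(pieces))
    return result

def count(tree):
    """Closed form: product over all nodes of (number of branches)!."""
    if isinstance(tree, str):
        return 1
    return factorial(len(tree)) * prod(count(child) for child in tree)

# Hypothetical tree: 'a' is a sibling of a three-branch node over b, c, d.
tree = ("a", ("b", "c", "d"))
print(sorted(orders(tree)))
print(count(tree))
```

For this tree the enumeration and the product formula both give 2! × 3! = 12 orders.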
IST-ITG in Phrase-based SMT (1) Problem (×): the unit of the parse tree is the “word”, but the unit of phrase-based SMT is the “phrase”, so the units differ. Additional rules for phrase-based SMT: 1) Word reordering that breaks a phrase is not allowed. 2) Phrase-internal word reordering is not checked. Advantage (○): word-to-word alignments are sometimes not one-to-one, but phrase-to-phrase alignments are always one-to-one.
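IST-ITG in Phrase-based SMT (2) [Figure: source tree over the words A to G with candidate phrases (Ph) numbered 1 to 5. Phrases 1, 2, and 4 are NG (unacceptable); phrases 3 and 5 are OK.]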
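Decoding Algorithm with IST-ITG Each sub-tree of the source tree carries a state: 0: Untranslated, 1: Translated, 2: Translating. [Figure: source tree over phrases A to I with node states; target output so far: d e.]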
Decoding Algorithm with IST-ITG If phrases A and B are translated next (target: d e a b), a sub-tree then includes two or more “2” (Translating) sub-trees: NG.
Decoding Algorithm with IST-ITG Consider the minimum Translating sub-tree, that is, the smallest sub-tree that includes both “0” and “1”. [Figure: source tree with node states; target output so far: d e.]
Decoding Algorithm with IST-ITG If the whole of the minimum Translating sub-tree is translated (target: d e f g h): OK.
Decoding Algorithm with IST-ITG If a sub-part of the minimum Translating sub-tree is translated (target: d e g): OK.
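The check walked through on the decoding slides can be sketched as a test applied before each hypothesis expansion. This is a minimal illustration, not the CleopATRa implementation. It assumes the tree leaves are the translation units, encodes the tree as nested tuples, and phrases the test as "the next unit must lie inside the minimum Translating sub-tree". All function names are hypothetical.

```python
def leaves(tree):
    """Set of leaf labels under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def state(tree, done):
    """0: Untranslated, 1: Translated, 2: Translating (mixed)."""
    ls = leaves(tree)
    if ls <= done:
        return 1
    if ls.isdisjoint(done):
        return 0
    return 2

def min_translating(tree, done):
    """Smallest sub-tree in state 2, or None if no sub-tree is in state 2."""
    if state(tree, done) != 2:
        return None
    for child in tree:                      # a leaf is never in state 2
        sub = min_translating(child, done)
        if sub is not None:
            return sub
    return tree

def can_translate_next(tree, done, unit):
    """True if translating `unit` next keeps the hypothesis IST-ITG legal."""
    if unit in done:
        return False
    scope = min_translating(tree, done)
    return scope is None or unit in leaves(scope)

def is_legal_order(tree, order):
    """Apply the incremental check to a whole translation order."""
    done = set()
    for unit in order:
        if not can_translate_next(tree, done, unit):
            return False
        done.add(unit)
    return True

tree = (("a", "b"), ("c", "d"))
print(is_legal_order(tree, "cdba"))   # finishes (c d) before entering (a b)
print(is_legal_order(tree, "acbd"))   # leaves (a b) while it is Translating
```

The order of expansion matters and not only the resulting states, which is why the sketch checks each step against the current minimum Translating sub-tree.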
Experiments: English and Japanese Patent Corpus
Experimental corpus size (E/J)
Train: # of sent. 1.8M; total words M/64M; # of entries 188K/118K
Dev: total words 30K/32K; # of entries 4,072/3,646
Eval: total words 29K/32K; # of entries 3,967/3,682
Single reference
Other Experimental Conditions LM training: SRI Language Modeling Toolkit (5-grams). Word alignment for TM training: GIZA++. Decoder: CleopATRa, a Moses-compatible in-house decoder. Evaluation measures: BLEU, NIST, WER, PER.
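Of the listed measures, WER is sensitive to word order while PER ignores it, which is why the two are reported together. The sketch below is not the scoring tool used in the experiments. It assumes whitespace tokenization, WER as word-level edit distance divided by reference length, and one common bag-of-words formulation of PER.

```python
from collections import Counter

def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))          # distances for the empty reference
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (rw != hw))) # substitution or match
        prev = cur
    return prev[-1] / len(r)

def per(ref, hyp):
    """Position-independent error rate based on bag-of-words matches."""
    r, h = ref.split(), hyp.split()
    matches = sum((Counter(r) & Counter(h)).values())
    return 1 - (matches - max(0, len(h) - len(r))) / len(r)

ref = "a b c d"
hyp = "c d a b"          # all words correct, global order wrong
print(wer(ref, hyp), per(ref, hyp))
```

A hypothesis with the right words in the wrong global order is penalized by WER and not by PER.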
English and Japanese Patent Translation Experimental Results: English-to-Japanese [Table: BLEU, NIST, WER, and PER for Monotone, No Constraint, IBM, IBM+Lex, IST-ITG, and IBM+Lex+IST.]
English and Japanese Patent Translation Experimental Results: Japanese-to-English [Table: BLEU, NIST, WER, and PER for IBM+Lex and +IST-ITG.]
English-to-Chinese Translation Experiments NIST MT08 English-to-Chinese track. Training data for TM: 6.2M. Training data for LM: 20.1M. Development data: 1,664 (1 reference). Evaluation data: 1,859 (4 references). Experimental Results [Table: W-BLEU, C-BLEU, WER, and CER for IBM+Lex and +IST-ITG.]
Conclusion We proposed IST-ITG, a new set of word reordering constraints that uses the source tree structure. It is an extension of the ITG constraints. We conducted three experiments with the proposed method: E-J and J-E patent translation and the NIST MT08 E-C track. In all experiments, improvements in BLEU and WER were confirmed. The improvement in WER is especially large, which confirms the method's effectiveness for global word reordering.
Thank you!