NRC Report Conclusion. Tu Zhaopeng, 2009-09-08.


1 NRC Report Conclusion Tu Zhaopeng 2009-09-08

2 NIST06
- The Portage System
- For Chinese large-track entry, used simple, but carefully-tuned, phrase-based system:
  - Pre-process source text
  - Viterbi decoding using loglinear model
  - Nbest rescoring using fancier loglinear model
  - Post-process raw translation
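The loglinear model mentioned on this slide scores a hypothesis as a weighted sum of feature values. The following is a minimal sketch of that scoring rule only, not Portage code; the feature names (`tm`, `lm`, `word_count`), the weights, and the toy hypotheses are assumptions made for illustration.

```python
from typing import Dict, List


def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values (features are log-domain scores or counts)."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(hyps: List[dict], weights: Dict[str, float]) -> dict:
    """Return the hypothesis with the highest loglinear score."""
    return max(hyps, key=lambda h: loglinear_score(h["features"], weights))


# Toy data: feature names and values are hypothetical.
hyps = [
    {"text": "hyp A", "features": {"tm": -2.0, "lm": -3.0, "word_count": 2.0}},
    {"text": "hyp B", "features": {"tm": -1.5, "lm": -3.5, "word_count": 3.0}},
]
weights = {"tm": 1.0, "lm": 0.5, "word_count": 0.1}
print(best_hypothesis(hyps, weights)["text"])
```

A real decoder searches over partial hypotheses rather than enumerating complete ones; the sketch only shows how a fixed set of candidates would be ranked.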

3 NIST06
- Pre-processing:
  - Convert to GB2312, removing traditional characters with no GB2312 representation
  - Segment using LDC segmenter
  - Translate numbers and dates using rules
  - Strip non-ASCII OOVs
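Two of these pre-processing steps can be sketched with the Python standard library. This is an illustration under assumptions, not the Portage pipeline: the function names are hypothetical, the first function only drops characters that cannot be encoded in GB2312 (it does not map traditional forms to simplified ones), and the vocabulary is a toy set. Segmentation and the number/date rules are not shown.

```python
from typing import Iterable, List, Set


def drop_non_gb2312(text: str) -> str:
    """Remove every character that has no GB2312 encoding."""
    # errors="ignore" silently skips characters the codec cannot encode.
    return text.encode("gb2312", errors="ignore").decode("gb2312")


def strip_non_ascii_oovs(tokens: Iterable[str], vocab: Set[str]) -> List[str]:
    """Keep in-vocabulary tokens and ASCII-only OOVs; drop non-ASCII OOVs."""
    return [t for t in tokens if t in vocab or t.isascii()]


print(drop_non_gb2312("汉漢字"))
print(strip_non_ascii_oovs(["北京", "xyz", "囧囧"], {"北京"}))
```

ASCII out-of-vocabulary tokens are kept here on the assumption that they are names or numbers that can pass through to the English output unchanged.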

4 NIST06
- Post-processing
  - Truecase using 4-gram HMM (via SRILM disambig) trained on parallel corpus
  - Detokenization heuristics

5 NIST06
- Rescoring
  - Rescoring based on 5k-best lists, using Powell’s algorithm to find max-BLEU weights
  - Features (22)
    - All 12 decoder features
    - Character length
    - IBM2 scores in both directions
    - IBM1-based “missing word” feature (compare score of best translation for each word to best known)
    - Posterior probabilities calculated from nbest list for: sentence length, phrases, words, unigrams, and bigrams.
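The nbest-based posterior features can be illustrated for the sentence-length case. The sketch below assumes each nbest entry carries a log-domain model score, normalizes those scores over the list, and sums the probability mass of all hypotheses sharing a length. The function names and the absence of any score scaling are assumptions; the slides do not give the exact formulation.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def nbest_posteriors(scores: List[float]) -> List[float]:
    """Normalize log-domain scores over the nbest list (softmax)."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def length_posteriors(nbest: List[Tuple[List[str], float]]) -> Dict[int, float]:
    """Posterior probability of each sentence length, summed over the list."""
    posteriors = nbest_posteriors([score for _, score in nbest])
    by_length: Dict[int, float] = defaultdict(float)
    for (tokens, _), p in zip(nbest, posteriors):
        by_length[len(tokens)] += p
    return dict(by_length)


# Toy nbest list: (tokens, log-domain score), values are hypothetical.
nbest = [(["a", "b"], 0.0), (["c", "d"], 0.0), (["e"], 0.0)]
print(length_posteriors(nbest))
```

As a rescoring feature, each hypothesis would then receive the (log) posterior of its own length; the same pattern extends to words or n-grams by summing over the hypotheses that contain them.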

6 NIST06
- Search Parameters

7 NIST08
- Towards Tighter Integration of Rule-based and Statistical MT in Serial System Combination
- Rule-based
  - Systran
- Phrase-based
  - Portage

8 NIST08
- Annotation of Systran output, five different chunk types:
  - named entities, numbers, dates
  - unknown words or unlikely sequences of short words
  - ‘strong’ rules: very reliable chunks, e.g., rules based on a long distance syntactic relationship, or a long multiword expression

9 NIST09
- Serial system combination

10 NIST09
- NRC system trained on SY/EN parallel corpus:
  - use SYSTRAN to translate ZH half of parallel ZH/EN training corpus, discarding UN, HKH/L corpora for efficiency → 3M sentence pairs
  - preprocess SY: strip markup, tokenize, lowercase
  - standard phrase-based training

11 NIST09
- Two strategies that didn't work:
  - Exploit SY/EN surface similarity: boost HMM ttable scores of similar forms, prior to phrase extraction → no improvement
  - Use SY case information: adopt SY case for aligned EN words; no improvement compared to baseline independent truecaser

12 NIST09
- Common features:
  - phrase table based on symmetrized HMM word alignments (4 features: lex+rf, fwd+bkw)
  - 5g mixture LM from parallel corpus (Foster & Kuhn, WMT07)
  - 6g LM from GW
  - word count and distortion

13 NIST09

14
- Useful
  - rescoring with IBM- and nbest-based features (Ueffing and Ney, CL07; Chen et al, IWSLT05): +0.3 BLEU
  - greedy feature pruning for rescoring: +0.3 BLEU
  - truecasing with “title trick”: +0.3 BLEU
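The slide does not say how the greedy feature pruning was carried out. One common reading is backward elimination: repeatedly drop the single feature whose removal most improves a development-set score, and stop when no removal helps. The sketch below follows that assumption; `greedy_prune` and the toy `evaluate` function are hypothetical, and in a real setup `evaluate` would re-tune the rescoring weights and return BLEU.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def greedy_prune(
    features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Backward elimination: remove one feature at a time while the score improves."""
    current = frozenset(features)
    best = evaluate(current)
    while len(current) > 1:
        # Score every feature set obtained by removing exactly one feature.
        candidates = [(evaluate(current - {f}), f) for f in sorted(current)]
        score, feature = max(candidates)
        if score <= best:
            break  # no single removal improves the score
        current = current - {feature}
        best = score
    return current, best


# Toy stand-in for a dev-set metric: each feature adds or subtracts a fixed amount.
gains = {"lm": 1.0, "tm": 2.0, "noise1": -0.5, "noise2": -0.2}


def evaluate(subset: FrozenSet[str]) -> float:
    return sum(gains[f] for f in sorted(subset))


kept, score = greedy_prune(gains, evaluate)
print(sorted(kept), score)
```

The procedure is greedy, so it can miss feature subsets that only help in combination; it trades that risk for needing far fewer evaluations than an exhaustive search.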
