Does Syntactic Knowledge Help English-Hindi SMT? Avinesh PVS, K. Taraka Rama, Karthik Gali


1 Does Syntactic Knowledge Help English-Hindi SMT?
Avinesh PVS, K. Taraka Rama, Karthik Gali

2 Statistical Machine Translation
- MOSES (Koehn et al., 2007)
  - Started for European language pairs
  - Now being used for linguistically distant pairs: English-Arabic, English-Chinese
  - Surging interest in English-Hindi SMT
    - Simple Syntactic and Morphological Processing (R. Ananthakrishnan et al., 2008)
    - Global Lexical Selection and Sentence Reconstruction (S. Venkatapathy and S. Bangalore, 2006)
  - Evaluated using BLEU score
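For reference, the BLEU score used for evaluation can be written as below, following the standard definition in Papineni et al. (2001). The symbols are the usual ones and not taken from the slides: p_n is the modified n-gram precision, w_n the n-gram weight (commonly 1/N with N = 4), c the candidate length and r the effective reference length.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Because every p_n depends on exact n-gram matches against the reference, sparse data and free word order can lower the score even when a translation is acceptable.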

3 Observed Problems with English-Hindi SMT
- Low BLEU scores recorded for small data sets
  - Linguistically distant languages
  - Morphological differences leading to data sparseness
  - Problem of unknown words
  - Reordering problems
  - Lack of huge language models
  - Quality of the reference translation
  - BLEU score not directly proportional to the quality of the translation

4 Our Approach
- Proposed and Tested Solutions
  - Morphological differences leading to data sparseness
    - Use stemming and dictionary-based techniques
  - Problem of unknown words
    - NER techniques and transliteration
  - Reordering issues
    - Prior reordering of the English side using transfer rules
  - Lack of domain-specific huge language models
    - Use a huge monolingual corpus for generating language models
  - Remove erroneous phrase translations from the phrase table

5 Data Sets
- EILMT parallel corpus
  - Training Set: 7000 sentences
  - Development Set: 500 sentences
  - Test Set: 500 sentences
- IIIT-TIDES data set
  - Training Set: 50,000 sentences
  - Development Set: 1000 sentences
  - Test Set: 1000 sentences

6 Reordering: Transfer Grammar
- Transfer rules learned using
  - Dependency tree of the English sentence
    - Libin Shen's parser (Shen, 2006) used for parsing the English side
  - POS tags from both sides
- Example Rules
  - IN~1_NN&_VB~2 ==> NN&_IN~1_VB~2
- The English side of the training and test corpora is reordered
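A minimal sketch of how one such rule could be applied to the English side. The reading of the notation is an assumption: `&` is taken to mark the head node and `~k` the k-th child, the tilde is written in ASCII, and each unit stands for the head word or a whole child subtree. The function names are hypothetical and not from the authors' system.

```python
RULE = "IN~1_NN&_VB~2 ==> NN&_IN~1_VB~2"  # example rule from the slide


def parse_rule(rule):
    """Split 'A_B ==> B_A' into two lists of slot labels."""
    lhs, rhs = (side.strip().split("_") for side in rule.split("==>"))
    if sorted(lhs) != sorted(rhs):
        raise ValueError("right-hand side must be a permutation of the left")
    return lhs, rhs


def apply_rule(rule, units):
    """Reorder units, a list of (slot_label, phrase) pairs, if the rule matches.

    Each unit is the head word or one child subtree, so a subtree moves as
    a block. Input that does not match the left-hand side is left in order.
    """
    lhs, rhs = parse_rule(rule)
    if [label for label, _ in units] != lhs:
        return [phrase for _, phrase in units]
    by_label = dict(units)
    return [by_label[label] for label in rhs]


if __name__ == "__main__":
    units = [("IN~1", "w1"), ("NN&", "w2"), ("VB~2", "w3")]
    print(apply_rule(RULE, units))
```

Applied to the three-word pattern above, the preposition moves behind its noun, which matches the postpositional order of Hindi.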

7 Reordering: Learning rules
- Word alignment using GIZA++
- For each node
  - Consider child nodes
  - Check relative positions of node and child nodes in Source
  - Check relative positions of projections of word and child nodes in Target
  - Combine this information to form transfer rule

8 Reordering: Example
[Figure: source node (w2//NN, w1//IN, w3//VBZ), alignment, and target sentence (... t2 ... t1 ... t3 ...)]
Learnt Rule: IN~1_NN&_VB~2 ==> NN&_IN~1_VB~2
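The learning step can be sketched roughly as follows. This is a simplification under stated assumptions, not the authors' implementation: the alignment is one-to-one, each child is projected through its own word rather than its whole subtree, and `learn_rule` and its arguments are hypothetical names.

```python
def learn_rule(head, children, pos, alignment):
    """Build a transfer rule string for one dependency node.

    head:      source index of the node
    children:  source indices of its child nodes
    pos:       POS tag for each source index
    alignment: dict mapping source index -> target index (e.g. from GIZA++)
    Returns None if the node or any child is unaligned.
    """
    nodes = sorted([head] + list(children))  # source order
    if any(i not in alignment for i in nodes):
        return None
    labels = {}
    child_no = 0
    for i in nodes:
        if i == head:
            labels[i] = pos[i] + "&"  # '&' marks the head
        else:
            child_no += 1  # children are numbered in source order
            labels[i] = f"{pos[i]}~{child_no}"
    # Target order: sort the same words by the position of their projection.
    target_order = sorted(nodes, key=lambda i: alignment[i])
    lhs = "_".join(labels[i] for i in nodes)
    rhs = "_".join(labels[i] for i in target_order)
    return f"{lhs} ==> {rhs}"


if __name__ == "__main__":
    # Slide example: w1/IN w2/NN w3/VB with target order t2 t1 t3.
    print(learn_rule(head=1, children=[0, 2],
                     pos=["IN", "NN", "VB"],
                     alignment={0: 1, 1: 0, 2: 2}))
```

With the slide's configuration this yields the learnt rule shown above; in practice the rules collected over the corpus would still need counting and filtering.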

9 Reordering: Simple Syntactic Rules
- Movement of the verb in a sentence
  - 30% (~2100) of the sentences are compound or complex sentences
  - Based on POS tags
  - No deep syntactic information (like parsing)
  - Handcrafted rules
    - Tag the corpus with Complex and Compound Sentence tags
    - The words and, but, because, or mark the conjunctions
    - Move the verbs on both sides of the connective (e.g. and, but) to the end of the conjuncts
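A minimal sketch of the verb-movement rule, assuming Penn Treebank style tags where verb tags start with `VB` and sentence-final punctuation is tagged `.`. The function name and the example sentence are illustrative only.

```python
CONNECTIVES = {"and", "but", "because", "or"}  # conjunctions listed on the slide


def move_verbs_to_end(tagged):
    """Move verbs to the end of each conjunct.

    tagged: list of (word, POS) pairs for a sentence already identified as
    compound or complex. Returns the reordered list of words.
    """
    out, conjunct = [], []

    def flush():
        # Emit the non-verbs first, then the verbs, keeping relative order.
        verbs = [w for w, t in conjunct if t.startswith("VB")]
        others = [w for w, t in conjunct if not t.startswith("VB")]
        out.extend(others + verbs)
        conjunct.clear()

    for word, tag in tagged:
        if word.lower() in CONNECTIVES or tag == ".":
            flush()           # a connective or full stop closes the conjunct
            out.append(word)  # and stays where it is
        else:
            conjunct.append((word, tag))
    flush()
    return out


if __name__ == "__main__":
    sent = [("Ram", "NNP"), ("ate", "VBD"), ("rice", "NN"), ("and", "CC"),
            ("Sita", "NNP"), ("drank", "VBD"), ("milk", "NN"), (".", ".")]
    print(" ".join(move_verbs_to_end(sent)))
```

The sketch treats every listed connective as a clause boundary, so it should only be run on sentences tagged as compound or complex; an "and" joining two noun phrases would otherwise be split incorrectly.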

10 Handling Unknown Words
- Dictionary
  - Extract the root from the word on the English side
  - Generate TAM (tense, aspect, modality) information
  - Translate the root into Hindi using a dictionary
  - Map the TAM information to Hindi TAM
  - Generate the Hindi word using root and TAM information
- Transliteration
  - Transliterate the NE
  - Tool currently not available
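The dictionary route can be sketched as a small pipeline. Everything in it is a toy assumption: the suffix stripper stands in for a real morphological analyser, the one-entry dictionary and the TAM table (romanised, masculine singular forms only) are invented for illustration, and none of the names come from the authors' system.

```python
# Toy resources, invented for illustration only.
ROOT_DICT = {"play": "khel"}                        # English root -> Hindi root
TAM_MAP = {"": "", "ed": "aa", "ing": " rahaa"}     # English TAM -> Hindi TAM


def split_root_tam(word):
    """Naive suffix stripping; returns (root, TAM suffix)."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)], suffix
    return word, ""


def translate_unknown(word):
    """Translate an unknown English word via root and TAM lookup.

    Returns None when the root or TAM is not covered, so the caller can
    fall back to transliteration.
    """
    root, tam = split_root_tam(word.lower())
    if root not in ROOT_DICT or tam not in TAM_MAP:
        return None
    return ROOT_DICT[root] + TAM_MAP[tam]


if __name__ == "__main__":
    for w in ("played", "playing", "Delhi"):
        print(w, "->", translate_unknown(w))
```

A realistic version would also need gender and number agreement on the Hindi side, which this sketch ignores.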

11 Experiments with phrase table
- Observed that end-of-sentence markers such as (.) are aligned wrongly
  - Remove the phrase translations with (.)s aligned to words
  - Found 10,000 of them (~2.5% of the total phrase table)
  - Train and tune the system with EOS markers removed and stored elsewhere
  - Add the EOS markers after running the decoder
  - Evaluate the system
  - Observed that the BLEU score increases
  - The quality of the translation improves
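A rough sketch of the phrase-table clean-up. It assumes each entry has the layout `source ||| target ||| scores ||| alignment`, with the alignment given as `i-j` word index pairs; the actual field order depends on the Moses version, so the index used below is an assumption, as are the function names and the sample entries.

```python
EOS = {".", "\u0964"}  # assumed EOS tokens: ASCII full stop and Devanagari danda


def eos_misaligned(line):
    """True if an EOS marker on one side is aligned to an ordinary word."""
    fields = [f.strip() for f in line.split("|||")]
    if len(fields) < 4:
        return False  # no alignment information: keep the entry
    src, tgt = fields[0].split(), fields[1].split()
    for point in fields[3].split():
        i, j = map(int, point.split("-"))
        if (src[i] in EOS) != (tgt[j] in EOS):
            return True
    return False


def filter_phrase_table(lines):
    """Keep only the entries without a misaligned EOS marker."""
    return [line for line in lines if not eos_misaligned(line)]


if __name__ == "__main__":
    table = [
        "the house . ||| ghar hai ||| 0.5 ||| 1-0 2-1",  # '.' aligned to a word
        "the house . ||| ghar . ||| 0.5 ||| 1-0 2-1",    # '.' aligned to '.'
    ]
    for entry in filter_phrase_table(table):
        print(entry)
```

The slides go one step further and strip the EOS markers before training, re-attaching them after decoding; the filter above only covers the removal of the bad entries.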

12 Effect of language models
- Experimented with 7K tourism corpus
  - No conclusion can be drawn
  - Hugely varying results
  - Cannot explain the variation in the results

13 Results - I
Without Tuning    System
18.09             Phrase Removal
18.35             Additional Language Model (7K Hindi corpus)
17.75             Dictionary
17.70             Reordering with TG
17.70             Baseline

14 Results - II
With Tuning       System
22.60             Phrase Removal
20.68             Additional Language Model (7K Hindi corpus)
-                 Dictionary
-                 Reordering with TG
20.18             Baseline

15 Conclusions & Future Work
- Incorporate syntactic phrases into SMT
- Long-distance reordering models are required
  - Current systems handle local reordering
- Robust TG rules for reordering the source side
- Build huge language models

16 References
- P. Koehn and H. Hoang. 2007. Factored translation models. In Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP-CoNLL).
- P. Koehn, F.J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics, Morristown, NJ, USA.
- P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 2.
- S. Kumar and W. Byrne. 2004. Minimum Bayes-Risk Decoding for Statistical Machine Translation. Johns Hopkins University, Center for Language and Speech Processing (CLSP), Baltimore, MD.

17 References
- F.J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
- F.J. Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, Volume 1, pages 160–167. Association for Computational Linguistics, Morristown, NJ, USA.
- K. Papineni, S. Roukos, T. Ward, and W.J. Zhu. 2001. BLEU: a method for automatic evaluation of MT. Research Report, Computer Science RC22176 (W0109-022), IBM Research Division, T.J. Watson Research Center, 17.
- L. Shen. 2006. Statistical LTAG Parsing. Ph.D. thesis, University of Pennsylvania.
- A. Stolcke. 2002. SRILM: an extensible language modeling toolkit. In International Conference on Spoken Language Processing. SRI, Denver, Colorado, Tech. Rep.
- S. Venkatapathy and S. Bangalore. Three models for discriminative machine translation using Global Lexical Selection and Sentence Reconstruction. In SSST.

