English-Persian SMT Reza Saeedi 1 WTLAB Wednesday, May 25, 2011.


1 English-Persian SMT
Reza Saeedi (Reza.saeedi@stu-mail.um.ac.ir)
WTLAB, Wednesday, May 25, 2011

2 Outline
- MT Introduction
- SMT Introduction
- Requirements for SMT
- Evaluation metrics
- English-Persian MT challenges
- English-Persian SMT
  - System1
  - System2
- Problems in English-Persian SMT

3 MT Introduction
- Machine Translation (MT) is the automatic translation of text written in one natural language into another by computer.
- There are several ways to do this:
  - Dictionary-based
  - Rule-based
  - Example-based
  - Statistical

4 SMT Introduction
- The first ideas of statistical machine translation were proposed by Warren Weaver in 1947.
- Statistical machine translation tries to learn how to translate by examining translations made by humans.

5 SMT Introduction (Cont.)
- Statistical MT models take the view that every sentence in the target language is a translation of the source-language sentence with some probability.
- The best translation is the sentence that has the highest probability.
- The key problems in statistical MT are:
  - estimating the probability of a translation
  - efficiently finding the sentence with the highest probability

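6 SMT Introduction (Cont.)
- Given a source sentence f, we seek the target sentence e that maximizes P(e | f):
  $\hat{e} = \arg\max_e P(e \mid f)$
- By Bayes' rule:
  $P(e \mid f) = \frac{P(e)\, P(f \mid e)}{P(f)}$
- P(f) is the same for every candidate e, so it can be dropped from the maximization:
  $\arg\max_e P(e \mid f) = \arg\max_e P(e)\, P(f \mid e)$
- Intuitively, P(e | f) therefore depends on two factors: P(e), the fluency of e, and P(f | e), the faithfulness of e to f.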
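The decision rule on this slide can be illustrated with a minimal Python sketch. The candidate names and all probabilities below are made up for illustration and do not come from any trained model; the point is only that the product P(e) * P(f | e) rewards a candidate that is both fluent and faithful.

```python
import math

# Hypothetical toy values. lm[e] stands in for P(e) (fluency),
# tm[e] stands in for P(f | e) (faithfulness) for one fixed source sentence f.
lm = {
    "fluent_faithful": 0.02,
    "disfluent_faithful": 0.0001,
    "fluent_unfaithful": 0.03,
}
tm = {
    "fluent_faithful": 0.3,
    "disfluent_faithful": 0.3,
    "fluent_unfaithful": 0.002,
}

def best_translation(candidates):
    """Return argmax_e P(e) * P(f | e), computed in log space for stability."""
    def score(e):
        return math.log(lm[e]) + math.log(tm[e])
    return max(candidates, key=score)

best = best_translation(list(lm))
print(best)
```

A real decoder cannot enumerate all candidate sentences as this sketch does; it searches the space of translations heuristically.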

7 SMT Introduction (Cont.)
- Philipp Koehn
- http://homepages.inf.ed.ac.uk/pkoehn

8 Why SMT?
- Better use of resources
- No linguistic knowledge is needed
- It can be used for any language pair
- But: a large training corpus is needed

9 Steps of SMT

10 Requirements for SMT
- Bilingual and monolingual corpora:
  - The bilingual corpus requires two files aligned sentence by sentence (one file for the source language and one for the target language)
  - Microsoft Bilingual Sentence Aligner
- Language model:
  - We need a tool to compute P(e)
  - This step requires a monolingual corpus
  - SRILM: a toolkit for building N-gram language models
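To make the role of the language model concrete, here is a minimal sketch of a bigram model that estimates P(e) from a monolingual corpus. It uses add-one smoothing purely for simplicity; this is an assumption of the sketch and is not how SRILM is typically configured, since SRILM offers more sophisticated smoothing methods. The function names and the toy corpus are illustrative.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count bigrams and their histories in a tokenized monolingual corpus."""
    histories, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        histories.update(tokens[:-1])                 # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (history, word) pairs
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return histories, bigrams, vocab

def sentence_prob(sentence, model):
    """P(e) as a product of add-one smoothed bigram probabilities."""
    histories, bigrams, vocab = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for h, w in zip(tokens[:-1], tokens[1:]):
        p *= (bigrams[(h, w)] + 1) / (histories[h] + len(vocab))
    return p

model = train_bigram_lm(["the cat sat", "the dog sat"])
print(sentence_prob("the cat sat", model))
print(sentence_prob("sat cat the", model))
```

The word order seen in training receives a higher probability than the scrambled one, which is the fluency signal the decoder relies on.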

11 LM output

12 Requirements for SMT
- Translation model:
  - We need a tool to compute P(f | e)
  - This step requires a bilingual corpus
  - GIZA++
  - GIZA++ produces word alignments, from which the phrase table is built
- Decoder:
  - Searches for the best translation
  - Moses
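GIZA++ implements the IBM word-alignment models. As a rough illustration of how word-level translation probabilities P(f | e) can be learned from a sentence-aligned corpus, here is a minimal sketch of EM training for the simplest of them, IBM Model 1. It omits the NULL word and everything beyond Model 1, and the three sentence pairs (English as f, transliterated Persian as e) are toy data chosen for this example.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """pairs: list of (f_tokens, e_tokens). Returns t[(f, e)], an estimate of P(f | e)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e being aligned at all
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm  # E-step: posterior that f aligns to e
                    count[(f, e)] += delta
                    total[e] += delta
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]  # M-step: renormalize per target word
    return t

pairs = [
    (["this", "house"], ["in", "khaneh"]),
    (["this", "book"], ["in", "ketab"]),
    (["a", "book"], ["yek", "ketab"]),
]
t = ibm_model1(pairs)
print(t[("house", "khaneh")], t[("this", "khaneh")])
```

Although no pair is word-aligned in the input, co-occurrence across sentences lets EM conclude that "khaneh" is more likely to generate "house" than "this".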

13 Phrase table

14 Moses tool

15 The training steps
1. Prepare data
2. Run GIZA++
3. Align words
4. Get lexical translation table
5. Extract phrases
6. Score phrases
7. Build reordering model
8. Build generation models
9. Create configuration file
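The "Score phrases" step can be sketched as a relative-frequency estimate over the extracted phrase pairs: phi(f | e) = count(f, e) / count(e). The sketch below covers only this one estimate in one direction; the actual Moses scoring step computes more than this. The function name and the phrase pairs are hypothetical toy values.

```python
from collections import Counter

def score_phrases(extracted_pairs):
    """Estimate phi(f | e) = count(f, e) / count(e) from extracted phrase pairs."""
    pair_counts = Counter(extracted_pairs)
    target_counts = Counter(e for _, e in extracted_pairs)
    return {
        (f, e): count / target_counts[e]
        for (f, e), count in pair_counts.items()
    }

# Hypothetical extracted phrase pairs: (English phrase f, transliterated Persian phrase e).
extracted = [
    ("this book", "in ketab"),
    ("this book", "in ketab"),
    ("the book", "in ketab"),
    ("a book", "yek ketab"),
]
phrase_table = score_phrases(extracted)
for (f, e), p in sorted(phrase_table.items()):
    print(f"{f} ||| {e} ||| {p:.3f}")
```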

16 Evaluation metrics
- BLEU (Bilingual Evaluation Understudy)
  - Developed at IBM
  - The closer a machine translation is to a professional human translation, the better it is
- NIST
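The idea behind BLEU, closeness to a human reference, is measured with clipped n-gram precisions combined with a brevity penalty. The following is a simplified sketch for a single sentence and a single reference with no smoothing; BLEU is normally computed over a whole test corpus, possibly with several references, so the scores of this sketch are not comparable to published figures.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, uniform n-gram weights, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precision += math.log(clipped / total) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_precision)

reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", reference))
print(bleu("the cat sat on the", reference))
print(bleu("the the the the", reference))
```

The clipping step is what stops a candidate such as "the the the the" from scoring well simply by repeating a common reference word.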

17 English-Persian MT challenges
- The structure of Persian is very different from that of English
- The structure of Persian is very complex
- There has been little previous work on this language pair
- Effective SMT systems rely on very large bilingual corpora, but these are not readily available for the English-Persian language pair

18 English-Persian SMT
- Few English-Persian MT systems have been developed
- Most of them are purely rule-based
- There are two lines of work on English-Persian SMT:
  - Mohaghegh and Sarrafzadeh (Massey University)
  - Pilevar and Faili (Tehran University)

19 System1
- Corpus: BBC news

20 System1 (Cont.)
- Tools: SRILM, GIZA++, Moses

21 System1: Improved Language Modeling

22 System2
- Corpus:
  - Bilingual (TEP): film subtitles, 3 books, KDE4

23 System2 (Cont.)
- Corpus:
  - Monolingual: Hamshahri, film subtitles

24 System2 (Cont.)
- Tools: SRILM, GIZA++, Moses
- PersianSMT with 4-gram Sub-LM

25 Comparison of PersianSMT with Google Translate

26 Problems in English-Persian SMT
- Compound verbs (alignment problem)
  - Use a phrase-based SMT system
  - But inflectional morphology remains a problem: the large number of inflected verb forms prevents the system from learning to translate every individual form of a compound verb
- Personal pronouns are optional in Persian sentences (alignment problem)

27 Problems (Cont.)
- The system fails to place the elements of the sentence in the right order
  - Use a phrase-based SMT system
  - Re-rank the n-best output list and/or reorder the output sentences
  - Reorder the input sentence prior to translation using morpho-syntactic information, so that its word order better resembles that of the target language
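Re-ranking an n-best list can be sketched as rescoring each hypothesis with a weighted sum of feature scores and sorting by the result. Everything in this sketch is hypothetical: the feature names ("decoder", "order"), their values, and the weights are invented for illustration, and a real system would obtain them from the decoder and from a separately trained reordering or syntactic model.

```python
def rerank(nbest, weights):
    """Sort hypotheses by a weighted sum of their feature scores, best first."""
    def score(item):
        _, features = item
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
    return [hyp for hyp, _ in sorted(nbest, key=score, reverse=True)]

# Hypothetical n-best list: (hypothesis, log-domain feature scores).
nbest = [
    ("hypothesis A", {"decoder": -4.0, "order": -3.0}),
    ("hypothesis B", {"decoder": -4.5, "order": -0.5}),
    ("hypothesis C", {"decoder": -6.0, "order": -0.2}),
]

decoder_only = rerank(nbest, {"decoder": 1.0})
reranked = rerank(nbest, {"decoder": 1.0, "order": 1.0})
print(decoder_only)
print(reranked)
```

With the extra word-order feature switched on, a hypothesis that the decoder ranked second can overtake the decoder's first choice.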


29 References
[1] A. Ramanathan, "Statistical Machine Translation", Ph.D. Seminar Report, Department of Computer Science and Engineering, Indian Institute of Technology, 2000.
[2] A. Lopez, "Statistical Machine Translation", ACM Computing Surveys, 2008.
[3] M. Mohaghegh and A. Sarrafzadeh, "The first English-Persian statistical machine translation", New Zealand Postgraduate Conference, 2009.
[4] M. Mohaghegh and A. Sarrafzadeh, "An analysis of the effect of training data variation in English-Persian Statistical Machine Translation", 2009 International Conference on Innovations in Information Technology (IIT 2009).
[5] M. Mohaghegh and A. Sarrafzadeh, "Performance evaluation of various training data in English-Persian statistical machine translation", to appear in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010), Rome, Italy, June 9-11, 2010.
[6] M. Mohaghegh and A. Sarrafzadeh, "Improved Language Modeling for English-Persian Statistical Machine Translation", COLING 2010 / SIGMT Workshop, 23rd International Conference on Computational Linguistics, Beijing, China, 28 August 2010.

30 References (Cont.)
[7] M.T. Pilevar and H. Faili, "PersianSMT: A First Attempt to English-Persian Statistical Machine Translation", to appear in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010).

