Morphological Analysis for Phrase-Based Statistical Machine Translation. Luong Minh Thang. WING group meeting, 15 Aug 2008. HYP update, part 1. 4/30/2015

1 Morphological Analysis for Phrase-Based Statistical Machine Translation
Luong Minh Thang
WING group meeting – 15 Aug, 2008
HYP update - part 1

2 Agenda
– Introduction: what does my project title mean?
– Language pair
– English-Finnish challenges
– Related works
– Project direction

3 Introduction I: phrase-based SMT
– Statistical: derive statistical information from large data
– Phrase-based: capture local constraints
[Figure: word alignment between a source and a target sentence]
Source (positions 1-9): Maria no daba una bofetada a la bruja verde
Target (positions 0-7): NULL Mary did not slap the green witch
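The alignment figure on this slide can be approximated in code. The sketch below is only an illustration: the slide's alignment links did not survive in the transcript, so the `alignment` dictionary is an assumed, textbook-style set of links, and the helper names are hypothetical.

```python
# Toy word alignment for the slide's sentence pair.
# ASSUMPTION: the links below are illustrative; the slide's own links are not in the transcript.
source = "Maria no daba una bofetada a la bruja verde".split()   # positions 1-9
target = ["NULL"] + "Mary did not slap the green witch".split()  # positions 0-7

# Hypothetical links: source position -> target position (0 means NULL).
alignment = {1: 1, 2: 3, 3: 4, 4: 4, 5: 4, 6: 0, 7: 5, 8: 7, 9: 6}


def aligned_pairs(source, target, alignment):
    """Return (source word, target word) pairs in source order."""
    return [(source[i - 1], target[j]) for i, j in sorted(alignment.items())]


def source_phrase_for(target_pos, source, alignment):
    """Collect the source words linked to one target position."""
    return [source[i - 1] for i, j in sorted(alignment.items()) if j == target_pos]


if __name__ == "__main__":
    for src_word, tgt_word in aligned_pairs(source, target, alignment):
        print(f"{src_word:10s} -> {tgt_word}")
    # Several source words map to one target word: a phrase-level unit.
    print(source_phrase_for(4, source, alignment))
```

Under these assumed links, three Spanish words map to the single English word "slap", which is the kind of local, multi-word correspondence a phrase-based model is meant to capture.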

4 Introduction II – Morphology
Morpheme: minimal meaning-bearing unit
– machines = machine + s
– translation = translate + ion
– goalkeeper = goal + keeper
English is a low-inflected language with a simple morphological structure
-> Highly inflected languages are much more complicated!

5 Introduction III – highly inflected languages
Concatenate a chain of morphemes to form a word
– Finnish: oppositio + kansa + n + edusta + ja (opposition + people + of + represent + -ative) = opposition member of parliament
– Turkish: uygarlaştıramadıklarımızdanmışsınızcasına (uygar + laş + tır + ama + dık + lar + ımız + dan + mış + sınız + casına) = (behaving) as if you are among those whom we could not cause to become civilized
This is a word!!!

6 Introduction IV – Why morphology-aware SMT?
– Tackle the data sparseness problem (statistics from 1.021.180 sentence pairs)
           Type count   Token count
  English  105.144      121.442.173
  Finnish  516.102      130.128.883
– Capture the relations among words
  English: machine, machines
  Spanish: máquina, máquinas
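To make the sparseness argument concrete, here is a minimal sketch that counts word types and tokens before and after morpheme segmentation. The toy corpus and the hand-written `SEGMENTATION` table are assumptions for illustration only; the type counts in the ratio are the ones shown on the slide.

```python
# ASSUMPTION: toy corpus and hand-written segmentation, for illustration only.
CORPUS = ["machine machines", "house houses", "word words"]
SEGMENTATION = {
    "machines": ["machine", "+s"],
    "houses": ["house", "+s"],
    "words": ["word", "+s"],
}


def type_token_counts(tokens):
    """Return (number of distinct types, number of tokens)."""
    return len(set(tokens)), len(tokens)


def surface_tokens(corpus):
    """Whitespace-tokenize every sentence into surface word forms."""
    return [word for sentence in corpus for word in sentence.split()]


def morpheme_tokens(corpus, segmentation):
    """Replace each word by its morphemes when a segmentation is known."""
    tokens = []
    for word in surface_tokens(corpus):
        tokens.extend(segmentation.get(word, [word]))
    return tokens


# Ratio of Finnish to English type counts, using the figures on the slide.
finnish_to_english_type_ratio = 516_102 / 105_144

if __name__ == "__main__":
    print(type_token_counts(surface_tokens(CORPUS)))
    print(type_token_counts(morpheme_tokens(CORPUS, SEGMENTATION)))
    print(round(finnish_to_english_type_ratio, 2))
```

Segmenting merges inflected variants under a shared stem, so the same data supports fewer, better-attested units; the slide's type counts suggest Finnish has roughly five times as many distinct word forms as English in that corpus.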

7 Language pair I – our choice?
We chose English-Finnish as our main translation task
[Figure: scale of languages from low-inflected to highly inflected (Dyer, 2007), with Vietnamese labeled]

8 Language pair II – why Finnish?
Honestly, I don’t know Finnish … But because:
– Corpora are available
– Finnish is an agglutinative, morphologically complex language, suitable for our project scope
– Translation from low- to highly inflected languages -> an area still to explore, yet hard!

9 English-Finnish challenges I – many-to-one word relationship
Finnish uses suffixes to express grammatical relations and also to derive new words

  Case         Suffix  English prep.    Sample word form  Translation of the sample
  nominatiivi  -                        talo              house
  genetiivi    -n      of               talon             of (a) house
  essiivi      -na     as               talona            as a house
  inessiivi    -ssa    in               talossa           in (a) house
  elatiivi     -sta    from (inside)    talosta           from (a) house
  komitatiivi  -ne-    together (with)  taloineni         with my house(s)

Many-to-one English-Finnish word relationship -> need word-morpheme correspondence
(about 14-15 cases for nouns)
Not merely concatenating
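The case table can be read as a mapping between English prepositions and Finnish suffixes. The following is a minimal sketch under strong assumptions: it covers only the single-suffix rows of the table, ignores vowel harmony and consonant gradation, and leaves out komitatiivi, where the sample form taloineni involves more than one suffix. The names `CASE_TABLE`, `inflect`, and `gloss` are hypothetical.

```python
# ASSUMPTION: only the single-suffix rows from the slide's table; no vowel
# harmony, no consonant gradation, no komitatiivi (needs several suffixes).
CASE_TABLE = {
    "nominatiivi": ("", None),
    "genetiivi": ("n", "of"),
    "essiivi": ("na", "as"),
    "inessiivi": ("ssa", "in"),
    "elatiivi": ("sta", "from"),
}


def inflect(stem, case):
    """Attach the case suffix to a Finnish stem by plain concatenation."""
    suffix, _ = CASE_TABLE[case]
    return stem + suffix


def gloss(stem_gloss, case):
    """Build a rough English gloss: preposition + translated stem."""
    _, preposition = CASE_TABLE[case]
    return stem_gloss if preposition is None else f"{preposition} {stem_gloss}"


if __name__ == "__main__":
    for case in CASE_TABLE:
        print(f"{case:12s} {inflect('talo', case):8s} {gloss('house', case)}")
```

One Finnish word form corresponds to an English preposition plus a noun, which is why a word-to-morpheme correspondence is needed rather than a word-to-word one.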

10 English-Finnish challenges II – word order
Word order is “free” in Finnish
– Pete rakastaa Annaa = Pete loves Anna (normal)
– Annaa Pete rakastaa: emphasizes Annaa
– Rakastaa Pete Annaa: emphasizes rakastaa = Pete does love Anna
– Pete Annaa rakastaa: stress on Pete
– Rakastaa Annaa Pete: does not sound like a normal sentence, but is quite understandable

11 English-Finnish challenges III – surface form generation
After translating from English words -> Finnish morphemes, a surface generation step is needed
oppositio + kansa + n + edusta + ja -> oppositiokansanedustaja
What if morphemes are missing or the morpheme order changes?
-> Need a more error-tolerant surface recovery algorithm
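As a rough illustration of the surface generation step, the sketch below concatenates morphemes and, when the result is not a known word, falls back to the closest entry in a vocabulary. This is not the recovery algorithm the project proposes (the slides do not describe one); the fuzzy lookup with Python's `difflib` is a stand-in technique, and the vocabulary, cutoff, and function names are assumptions.

```python
from difflib import get_close_matches

# ASSUMPTION: a tiny vocabulary of known Finnish surface forms.
VOCABULARY = ["oppositiokansanedustaja", "kansanedustaja", "oppositio"]


def concatenate(morphemes):
    """Join morphemes into a candidate surface form, dropping '+' markers."""
    return "".join(m.strip("+ ") for m in morphemes)


def recover_surface(morphemes, vocabulary, cutoff=0.8):
    """Return a known word if one is close enough, else the raw concatenation."""
    candidate = concatenate(morphemes)
    if candidate in vocabulary:
        return candidate
    # Stand-in for an error-tolerant step: nearest known word by string similarity.
    matches = get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


if __name__ == "__main__":
    print(recover_surface(["oppositio", "kansa", "n", "edusta", "ja"], VOCABULARY))
    # The linking morpheme "n" is missing here.
    print(recover_surface(["oppositio", "kansa", "edusta", "ja"], VOCABULARY))
```

String similarity tolerates a dropped morpheme but degrades quickly when morphemes are reordered, so it only hints at what a real recovery step would have to handle.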

12 Related works I – low-to-high inflected languages
Many works translate from highly to low-inflected languages, but very few address the opposite direction, which is considered hard in (Koehn, 2005)
– (Yang & Kirchhoff, 2006): Finnish-English, backoff
– (Oflazer & Durgar El-Kahlout, 2006, 2007): English-Turkish, word-morpheme translation, then simply concatenating morphemes
All use language-dependent tools & syntactic knowledge: TreeTagger, Snowball stemmer …

13 Related works II – surface form recovery
– (Toutanova et al., 2007, 2008): English-Russian, English-Arabic; translate stem-to-stem; predict inflection from stems using many different features (lexical, morphological, and syntactic)
– (Avramidis & Koehn, 2008): English-Greek; use syntax to get the “missing” morphology, depending on the syntactic position; noun case agreement and verb person conjugation
-> Rely mostly on manually annotated data

14 Project direction
– Use a language-independent tool (Morfessor), based on unannotated data only (i.e. no feature data or syntactic information)
– Work on general surface-form recovery
– We would like a unified view of the translation process, separating low-low, low-high, high-low, and high-high
[Figure callout: "We are here"]

15 Reference I
– Chris Dyer, 2007
– Jurafsky, D., & Martin, J. H. (2007). Speech and Language Processing (book)
– The Finnish language
– Yang & Kirchhoff, 2006: Phrase-based backoff models for machine translation of highly inflected languages
– Oflazer & Durgar El-Kahlout, 2006: Initial explorations in English to Turkish statistical machine translation

16 Reference II
– Oflazer & Durgar El-Kahlout, 2007: Exploring different representational units in English-to-Turkish statistical machine translation
– Toutanova et al., 2007: Generating complex morphology for machine translation
– Toutanova et al., 2008: Applying morphology generation models to machine translation
– Avramidis & Koehn, 2008: Enriching morphologically poor languages for statistical machine translation

17 Q & A?

18 To be continued … Thank you!
