Raivis SKADIŅŠ (a, b), Kārlis GOBA (a) and Valters ŠICS (a). (a) Tilde SIA, Latvia; (b) University of Latvia, Latvia. Machine Translation and Morphologically-Rich Languages.


1 Raivis SKADIŅŠ (a, b), Kārlis GOBA (a) and Valters ŠICS (a)
(a) Tilde SIA, Latvia; (b) University of Latvia, Latvia
Machine Translation and Morphologically-Rich Languages
University of Haifa, Israel, 23-27 January, 2011

2 The exotic Baltic languages
- Problems we all know about
- What Philipp & others said
- Agreement, reordering
- Sparseness
- Alignment and evaluation
- Morphology integration
- (Not) using factor models
- English-Latvian
- Lithuanian-English
- Interlude: Human evaluation

3
Case     PIE       Latvian   Lithuanian
Nom.     pod-s     pēd-a     pėd-a
Voc.     pod       pēd-a     pėd-a
Acc.     pod-m̥     pēd-u     pėd-ą
Instr.   ped-eh    pēd-u     pėd-a
Dat.     ped-ey    pēd-ai    pėd-ai
Abl.     ped-es
Gen.     ped-es    pēd-as    pėd-os
Loc.     ped-(i)   pēd-ā     pėd-oje
- >2000 morphosyntactic tags, >1000 observed
- Derivation vs inflection
- Similar problems as with Czech, German, etc.
- Case/gender/number/definiteness
- Quite elaborate participle system
- Feature redundancy

4 Latv(ij|i|j)a|Latfia|Let(land|[oó]nia)|Lett(land|ország|oni[ae])|Lotyšk[áo]|Łotwa|Läti|An Laitvia
Fancy letters: ā ē ī ū š ž č ķ ģ ņ ļ
Lietuva|Liettua|Lith([uw]ania|áen)|Lit(uani[ea]|[vw]a|vánia|vanija|(au|ouw)en)|An Liotuáin
Fancy letters: ė ę ą ų ū

5 Quality
- BLEU useful for development, but not for the end user
- Efficient human evaluation
Hardware requirements
- Decoding speed
- Memory
Not just bare translation
- Correct casing
- Domain-specific translation
- User feedback

6 English: dependency parser
- Latvian: morphological analysis, disambiguated
- Add lemmas and morphological tags as factors
- pēdas → pēda | Nfsn
- Tags also mark agreement for prepositions
- Alternatively, split words into stems and suffixes
- pēdas → pēd | as
- More ambiguous
- -u: Amsa, Ampg, Afsa, Afpg, ...
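The factored representation on this slide can be sketched in a few lines. This is only an illustration, assuming the disambiguated analysis is already available as (surface, lemma, tag) triples. The `|` delimiter is the one used in Moses-style factored corpora; that the authors used exactly this layout is an assumption, and the function name `to_factored` is hypothetical.

```python
from typing import Iterable, List, Tuple

# (surface form, lemma, morphological tag)
Analysis = Tuple[str, str, str]


def to_factored(sentence: Iterable[Analysis], sep: str = "|") -> str:
    """Join each token's factors with `sep` and the tokens with spaces."""
    tokens: List[str] = []
    for surface, lemma, tag in sentence:
        for factor in (surface, lemma, tag):
            # A factor must not be empty or contain the delimiters,
            # otherwise the line could not be split back unambiguously.
            if not factor or sep in factor or " " in factor:
                raise ValueError(f"invalid factor: {factor!r}")
        tokens.append(sep.join((surface, lemma, tag)))
    return " ".join(tokens)


# The only analysis taken from the slide: pēdas -> pēda | Nfsn
print(to_factored([("pēdas", "pēda", "Nfsn")]))
```

The stem/suffix alternative would use the same token layout with two factors instead of three.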

7

8 Tuning and evaluation data
- 1000 sentence tuning set + 512 sentence eval set
- Domain/topic mixture
- Available in EN, LV, LT, ET, RO, SL, HR, DE, RU
- ACCURAT project

9

10 What Philipp said...
- 5-gram LM over surface forms
- 7-gram LM over morphology tags / suffixes
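One way to picture the two language models is as two token streams read from the same factored corpus, each with its own n-gram order. The sketch below only counts n-grams; a real LM would be trained with a dedicated toolkit. It assumes tokens of the form `surface|lemma|tag`, which is a hypothetical layout, and the helper names are illustrative.

```python
from collections import Counter
from typing import Iterator, List, Tuple


def factor_stream(line: str, index: int, sep: str = "|") -> List[str]:
    """Return one factor (by position) for every token of a factored line."""
    return [token.split(sep)[index] for token in line.split()]


def ngrams(tokens: List[str], order: int) -> Iterator[Tuple[str, ...]]:
    """Yield n-grams of the given order, padded with sentence boundaries."""
    padded = ["<s>"] * (order - 1) + tokens + ["</s>"]
    for i in range(len(padded) - order + 1):
        yield tuple(padded[i:i + order])


# Hypothetical one-line corpus built from the slide's example token.
corpus = ["pēdas|pēda|Nfsn"]

surface_counts: Counter = Counter()
tag_counts: Counter = Counter()
for line in corpus:
    surface_counts.update(ngrams(factor_stream(line, 0), 5))  # surface forms
    tag_counts.update(ngrams(factor_stream(line, 2), 7))      # morphology tags
```

The tag vocabulary is much smaller than the surface vocabulary, which is what makes a longer n-gram order practical for the tag stream.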

11

12 Comparable corpora
- Heavily filtered
- Alignment score
- Length
- Alphanumeric
- Language detection
- 22.5M parallel units
- 0.9M left
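The four filters named on this slide could be combined roughly as follows. This is a sketch under stated assumptions: every threshold is hypothetical, the alignment score is assumed to come from the sentence aligner as a float, and language detection is passed in as a callable (`detect_language` is a placeholder, not a real library function).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ParallelUnit:
    source: str
    target: str
    alignment_score: float  # assumed to be produced by the sentence aligner


def alnum_ratio(text: str) -> float:
    """Share of non-space characters that are letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)


def keep_unit(
    unit: ParallelUnit,
    detect_language: Callable[[str], str],
    src_lang: str = "en",
    tgt_lang: str = "lv",
    min_score: float = 0.5,         # hypothetical threshold
    max_length_ratio: float = 2.0,  # hypothetical threshold
    min_alnum: float = 0.7,         # hypothetical threshold
) -> bool:
    """Apply the alignment, length, alphanumeric and language filters in turn."""
    if unit.alignment_score < min_score:
        return False
    src_len = len(unit.source.split())
    tgt_len = len(unit.target.split())
    if src_len == 0 or tgt_len == 0:
        return False
    if max(src_len, tgt_len) / min(src_len, tgt_len) > max_length_ratio:
        return False
    if min(alnum_ratio(unit.source), alnum_ratio(unit.target)) < min_alnum:
        return False
    return (
        detect_language(unit.source) == src_lang
        and detect_language(unit.target) == tgt_lang
    )


def toy_detector(text: str) -> str:
    """Stand-in detector for the demo: Latvian if it sees a Latvian letter."""
    return "lv" if any(c in "āēīūšžčķģņļ" for c in text.lower()) else "en"


# Illustrative sentence pair, not taken from the slides.
unit = ParallelUnit("This is a foot .", "Šī ir pēda .", 0.9)
print(keep_unit(unit, toy_detector))
```

Ordering the cheap checks first keeps the expensive language detection off most of the rejected units.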

13
Domain                                      Our SMT   Google   Delta
General information about European Union    41.6%              0.0%
Fiction                                     32.4%     22.5%    9.9%
Letters                                     11.1%     12.5%    -1.4%
News and magazines                          10.0%     8.9%     1.1%
Popular science and education               23.0%     40.0%    -16.9%
Official and legal documents                52.0%     46.8%    5.3%
Information technology                      63.9%     47.5%    16.5%
Specifications, instructions and manuals    31.6%     29.8%    1.8%

14 We got slightly better BLEU scores, but is it really getting better?
Looking for practical methods: simple, reliable, relatively cheap

15 Ranking of translated sentences relative to each other
- Only 2 systems
- The same 500 sentences as for BLEU
- Web-based evaluation system
- Upload source text and 2 target outputs
- Send URL to many evaluators
- See results

16

17

18 We calculate
- Probability that system A is better than B: (formula not preserved in the transcript)
- Confidence interval: (formula not preserved in the transcript)
- We calculate it
- Based on all evaluations
- Based on a sentence level
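The formulas themselves did not survive in this transcript. A common way to compute such quantities, shown here purely as an assumption and not as the authors' method, is to estimate p = A / (A + B) from the non-tied judgements and attach a normal-approximation binomial confidence interval. The sketch covers both variants named on the slide: pooling all evaluations, and letting each sentence vote once by majority. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def preference_with_ci(a_wins: int, b_wins: int, z: float = 1.96) -> Tuple[float, float]:
    """Return (p, half_width): share of A wins and a normal-approximation
    interval half-width (z = 1.96 corresponds to roughly 95%)."""
    n = a_wins + b_wins
    if n == 0:
        raise ValueError("no non-tied judgements")
    p = a_wins / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p, half_width


def sentence_level_counts(judgements: Iterable[Tuple[int, str]]) -> Tuple[int, int]:
    """judgements are (sentence_id, 'A' | 'B' | 'tie') pairs.
    Each sentence counts once, for whichever system most evaluators preferred."""
    per_sentence: Dict[int, Counter] = {}
    for sentence_id, choice in judgements:
        per_sentence.setdefault(sentence_id, Counter())[choice] += 1
    a = sum(1 for c in per_sentence.values() if c["A"] > c["B"])
    b = sum(1 for c in per_sentence.values() if c["B"] > c["A"])
    return a, b


# Made-up judgements for illustration only.
judgements = [(1, "A"), (1, "A"), (1, "B"), (2, "B"), (3, "A"), (3, "tie")]

# Based on all evaluations: pool every non-tied judgement.
pooled = Counter(choice for _, choice in judgements)
print(preference_with_ci(pooled["A"], pooled["B"]))

# Based on a sentence level: one vote per sentence.
print(preference_with_ci(*sentence_level_counts(judgements)))
```

Under this approximation, system A can be called better with the chosen confidence when the interval p ± half_width lies entirely above 0.5.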

19

20

21 No morphology tagger for Lithuanian
- Split
- Stem and an optional suffix
- Mark the suffix
- vienas → vien #as
- Suffixes correspond to endings, but are ambiguous
- One would expect #ai to/for
- Prefixes – verb negation
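The split-and-mark step can be sketched as a longest-match suffix stripper. This is an illustration, not the authors' splitter: the suffix list is a small sample taken from endings that appear on the slides, and the minimum stem length and the function name `split_suffix` are hypothetical.

```python
from typing import Iterable, List


def split_suffix(
    word: str,
    suffixes: Iterable[str],
    min_stem: int = 2,   # hypothetical guard against over-splitting short words
    marker: str = "#",
) -> List[str]:
    """Split `word` into [stem, marked suffix] using the longest matching
    suffix, or return [word] unchanged if no suffix applies."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return [word[: -len(suffix)], marker + suffix]
    return [word]


# Sample endings seen on the slides; a real list would be much longer.
SUFFIXES = ["as", "ai", "os", "oje"]

print(" ".join(split_suffix("vienas", SUFFIXES)))
```

Marking the suffix keeps it distinct from a stand-alone word with the same spelling, so the translation model can learn separate entries for the two.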

22

23 Automatic evaluation Human evaluation

24 Translating to a highly inflected language
- Some success in predicting the right inflections by an LM
- Things to try: two-step approach; marking the relevant source features
Translating from a highly inflected language
- Slight decrease in BLEU
- Decrease in OOV rate
- Human evaluation suggests users prefer lower OOV rate
- Things to try: removing the irrelevant features
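The OOV effect mentioned here is easy to see on a toy example. The sketch below assumes OOV rate means the share of test tokens absent from the training vocabulary, and that the decrease comes from stem/suffix splitting; both are assumptions, and the word forms are illustrative rather than taken from the authors' data.

```python
from typing import Iterable, Set


def oov_rate(test_tokens: Iterable[str], vocabulary: Set[str]) -> float:
    """Share of test tokens that do not occur in the training vocabulary."""
    tokens = list(test_tokens)
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token not in vocabulary) / len(tokens)


# Surface forms: an unseen inflection of a known word is fully out of vocabulary.
surface_rate = oov_rate(["vieno"], {"vienas"})

# After splitting, the stem is known and only the unseen suffix is OOV.
split_rate = oov_rate(["vien", "#o"], {"vien", "#as"})

print(surface_rate, split_rate)
```

With a realistic training corpus most suffixes are seen many times, so the remaining OOV tokens are mostly genuinely new stems.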

