Presentation is loading. Please wait.

Presentation is loading. Please wait.

Machine Translation: Challenges and Approaches Based on a talk by Nizar Habash Associate Research.

Similar presentations


Presentation on theme: "Machine Translation: Challenges and Approaches Based on a talk by Nizar Habash Associate Research."— Presentation transcript:

1 Machine Translation: Challenges and Approaches Based on a talk by Nizar Habash Associate Research Scientist Center for Computational Learning Systems Columbia University, New York Natural Language Processing

2 Google offers translations between many languages: Afrikaans, Albanian, Arabic, Belarusian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Vietnamese, Welsh, Yiddish

3 WorldMapper The world as youve never seen it before List of languages with maps showing world-wide distribution: Eg Arabic:

4 BBC found similar support!!!

5 Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

6 Multilingual Challenges Complex Orthography (writing system) –Ambiguous spelling, eg Arabic vowels omitted كتب الاولاد اشعارا كَتَبَ الأوْلادُ اشعَاراً – Ambiguous word boundaries, eg Chinese Lexical Ambiguity –Bank بنك (financial) vs. ضفة(river) –Eat essen (human) vs. fressen (animal)

7 Multilingual Challenges Morphological complexity and variation Affixation vs. Root+Pattern write writtenكتب مكتوب kill killedقتل مقتول do doneفعل مفعول conj noun pluralarticle Tokenization: a word can be a whole phrase And the cars and the cars والسيارات w Al SyArAt Et les voitures et les voitures

8 لست هنا I-am-not here am Ihere I am not here not لست هنا Translation Divergences conflation Je ne suis pas ici I not am not here suis Jeicinepas

9 Translation Divergences head swap and categorial EnglishJohn swam across the river quickly SpanishJuan cruzó rapidamente el río nadando Gloss: John crossed fast the river swimming Arabicاسرع جون عبور النهر سباحة Gloss: sped john crossing the-river swimming Chinese Gloss: John quickly (DE) swam cross the (Quantifier) river RussianДжон быстро переплыл реку Gloss: John quickly cross-swam river

10 Corpus resources for training Need a Corpus (eg web-as-corpus: BootCat) and DICTIONARY, other NLP resources. BUT: really need PARALLEL corpus, with SOURCE and TARGET setentence ALIGNED. Some languages have few resources, esp non- European languages: Bengali, Amharic, … WorldMapperWorldMapper : many International languages

11 Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

12 MT Approaches MT Pyramid Source word Source syntax Source meaningTarget meaning Target syntax Target word AnalysisGeneration Gisting

13 MT Approaches Gisting Example Sobre la base de dichas experiencias se estableció en 1988 una metodología. Envelope her basis out speak experiences them settle at 1988 one methodology. On the basis of these experiences, a methodology was arrived at in 1988.

14 MT Approaches MT Pyramid Source word Source syntax Source meaningTarget meaning Target syntax Target word AnalysisGeneration GistingTransfer

15 MT Approaches Transfer Example Transfer Lexicon –Map SL structure to TL structure poner X mantequilla en Y :obj :mod:subj :obj butter X Y :subj:obj X puso mantequilla en YX buttered Y

16 MT Approaches MT Pyramid Source word Source syntax Source meaningTarget meaning Target syntax Target word AnalysisGeneration GistingTransferInterlingua

17 MT Approaches Interlingua Example: Lexical Conceptual Structure (Dorr, 1993)

18 MT Approaches MT Pyramid Source word Source syntax Source meaningTarget meaning Target syntax Target word AnalysisGeneration Interlingua Gisting Transfer

19 MT Approaches MT Pyramid Source word Source syntax Source meaningTarget meaning Target syntax Target word AnalysisGeneration Interlingual Lexicons Dictionaries/Parallel Corpora Transfer Lexicons

20 MT Approaches Statistical vs. Rule-based Source word Source syntax Source meaningTarget meaning Target syntax Target word AnalysisGeneration

21 Statistical MT Noisy Channel Model Portions from

22 Statistical MT Automatic Word Alignment GIZA++ –A statistical machine translation toolkit used to train word alignments. –Uses Expectation-Maximization with various constraints to bootstrap alignments Slide based on Kevin Knights Mary did not slap the green witch Maria no dio una bofetada a la bruja verde

23 Statistical MT IBM Model (Word-based Model)

24 Phrase-Based Statistical MT Foreign input segmented in to phrases –phrase is any sequence of words Each phrase is probabilistically translated into English –P(to the conference | zur Konferenz) –P(into the meeting | zur Konferenz) Phrases are probabilistically re-ordered See [Koehn et al, 2003] for an intro. This is state-of-the-art! Morgenfliegeichnach Kanadazur Konferenz TomorrowIwill flyto the conferenceIn Canada Slide courtesy of Kevin Knight

25 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word Alignment Induced Phrases (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) Slide courtesy of Kevin Knight

26 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word Alignment Induced Phrases (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) Slide courtesy of Kevin Knight

27 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word Alignment Induced Phrases (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) Slide courtesy of Kevin Knight

28 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) … Word Alignment Induced Phrases Slide courtesy of Kevin Knight

29 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) … (Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch) Word Alignment Induced Phrases Slide courtesy of Kevin Knight

30 Advantages of Phrase-Based SMT Many-to-many mappings can handle non- compositional phrases Local context is very useful for disambiguating –Interest rate … –Interest in … The more data, the longer the learned phrases –Sometimes whole sentences Slide courtesy of Kevin Knight

31 Source word Source syntax Source meaningTarget meaning Target syntax Target word AnalysisGeneration MT Approaches Statistical vs. Rule-based vs. Hybrid

32 MT Approaches Practical Considerations Resources Availability –Parsers and Generators Input/Output compatability –Translation Lexicons Word-based vs. Transfer/Interlingua –Parallel Corpora Domain of interest Bigger is better Time Availability –Statistical training, resource building

33 Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

34 More art than science Wide range of Metrics/Techniques –interface, …, scalability, …, faithfulness,... space/time complexity, … etc. Automatic vs. Human-based –Dumb Machines vs. Slow Humans

35 Human-based Evaluation Example Accuracy Criteria

36 Human-based Evaluation Example Fluency Criteria

37 Automatic Evaluation Example Bleu Metric (Papineni et al 2001) Bleu –BiLingual Evaluation Understudy –Modified n-gram precision with length penalty –Quick, inexpensive and language independent –Correlates highly with human evaluation –Bias against synonyms and inflectional variations

38 Test Sentence colorless green ideas sleep furiously Gold Standard References all dull jade ideas sleep irately drab emerald concepts sleep furiously colorless immature thoughts nap angrily Automatic Evaluation Example Bleu Metric

39 Test Sentence colorless green ideas sleep furiously Gold Standard References all dull jade ideas sleep irately drab emerald concepts sleep furiously colorless immature thoughts nap angrily Unigram precision = 4/5 Automatic Evaluation Example Bleu Metric

40 Test Sentence colorless green ideas sleep furiously Gold Standard References all dull jade ideas sleep irately drab emerald concepts sleep furiously colorless immature thoughts nap angrily Unigram precision = 4 / 5 = 0.8 Bigram precision = 2 / 4 = 0.5 Bleu Score = (a 1 a 2 …a n ) 1/n = ( ) ½ = Automatic Evaluation Example Bleu Metric

41 Summary Multilingual Challenges for MT: –Different writing systems, morphology, segmentation, word order; –Need parallel corpus, dictionaries, NLP tools MT Approaches: statistical v rule-based –Gisting: word-for-word; –Transfer: source-to-target phrase mapping; –Interlingua: map to/from semantic representation MT Evaluation: Accuracy, Fluency –IBM BLEU method: count overlaps with Reference (human) translations

42 Interested in MT?? Contact Research courses, projects Languages of interest: –English, Arabic, Hebrew, Chinese, Urdu, Spanish, Russian, …. Topics –Statistical, Hybrid MT Phrase-based MT with linguistic extensions Component improvements or full-system improvements –MT Evaluation –Multilingual computing


Download ppt "Machine Translation: Challenges and Approaches Based on a talk by Nizar Habash Associate Research."

Similar presentations


Ads by Google