Machine Translation Dr. Nizar Habash Center for Computational Learning Systems Columbia University COMS 4705: Natural Language Processing Fall 2010.

Why (Machine) Translation? Languages in the world 6,800 living languages 600 with written tradition 95% of world population speaks 100 languages Translation Market $26 Billion Global Market (2010) Doubling every five years (Donald Barabé, invited talk, MT Summit 2003)

Machine Translation Science Fiction Star Trek Universal Translator an "extremely sophisticated computer program" which functions by "analyzing the patterns" of an unknown foreign language, starting from a speech sample of two or more speakers in conversation. The more extensive the conversational sample, the more accurate and reliable is the "translation matrix"….

Machine Translation Reality

Currently, Google offers translations between the following languages (over 3,000 language pairs): Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Haitian Creole, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh, Yiddish

“BBC found similar support”!!!

Why Machine Translation?
Full Translation
–Domain specific, e.g., weather reports
Machine-aided Translation
–Requires post-editing
Cross-lingual NLP applications
–Cross-language IR
–Cross-language Summarization
Testing grounds
–Extrinsic evaluation of NLP tools, e.g., parsers, POS taggers, tokenizers, etc.

Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

Multilingual Challenges
Orthographic Variations
–Ambiguous spelling: كتب الاولاد اشعارا vs. كَتَبَ الأوْلادُ اشعَاراً ("the boys wrote poems", undiacritized vs. diacritized)
–Ambiguous word boundaries
Lexical Ambiguity
–Bank → بنك (financial) vs. ضفة (river)
–Eat → essen (human) vs. fressen (animal)

Multilingual Challenges
Morphological Variations
–Affixational (prefix/suffix) vs. Templatic (Root+Pattern)
–write → written, kill → killed, do → done
–كتب → مكتوب, قتل → مقتول, فعل → مفعول
Tokenization (aka segmentation + normalization)
–And the cars → and the cars
–والسيارات → w Al SyArAt (w = conjunction, Al = article, SyArAt = plural noun)
–Et les voitures → et les voitures

Multilingual Challenges
Syntactic Variations
–Arabic: يقرأ الطالب المجتهد كتابا عن الصين في الصف (read the-student the-diligent a-book about china in the-classroom)
–English: the diligent student is reading a book about China in the classroom
–Chinese: 这位勤奋的学生在教室读一本关于中国的书 (this quant diligent de student in classroom read one quant about china de book)
Word order differences (Arabic / English / Chinese):
–Subj-Verb: V Subj / Subj V / Subj … V
–Verb-PP: V … PP / V PP / PP V
–Adjectives: N Adj / Adj N / Adj de N
–Possessives: N Poss / N of Poss, Poss 's N / Poss de N
–Relatives: N Rel / N Rel / Rel de N

Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

MT Approaches
The MT Pyramid: Source word, Source syntax, Source meaning / Target meaning, Target syntax, Target word; Analysis (source side), Generation (target side)
Approach shown: Gisting (word level)

MT Approaches
Gisting Example
–Source (Spanish): Sobre la base de dichas experiencias se estableció en 1988 una metodología.
–Word-for-word gist: Envelope her basis out speak experiences them settle at 1988 one methodology.
–Reference translation: On the basis of these experiences, a methodology was arrived at in 1988.
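
To make the gisting idea concrete, here is a toy word-for-word sketch in Python: each source word is replaced by a single dictionary sense with no reordering. The function and the lexicon entries are invented for illustration (the lexicon is simply read off the example above), not part of the original lecture.

```python
def gist(sentence, lexicon):
    """Word-for-word gisting: translate each word independently, keep source order."""
    return " ".join(lexicon.get(word, word) for word in sentence.lower().split())

# Toy Spanish-English lexicon read off the example above (one sense per word).
lexicon = {
    "sobre": "envelope", "la": "her", "base": "basis", "de": "out",
    "dichas": "speak", "experiencias": "experiences", "se": "them",
    "estableció": "settle", "en": "at", "una": "one", "metodología": "methodology",
}

print(gist("Sobre la base de dichas experiencias se estableció en 1988 una metodología", lexicon))
# -> envelope her basis out speak experiences them settle at 1988 one methodology
```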

MT Approaches
The MT Pyramid: Source word, Source syntax, Source meaning / Target meaning, Target syntax, Target word; Analysis (source side), Generation (target side)
Approaches shown: Gisting (word level), Transfer (syntax level)

MT Approaches
Transfer Example
Transfer Lexicon: map SL structure to TL structure
–poner [:subj X] [:obj mantequilla] [:mod en [:obj Y]] → butter [:subj X] [:obj Y]
–X puso mantequilla en Y → X buttered Y

MT Approaches
The MT Pyramid: Source word, Source syntax, Source meaning / Target meaning, Target syntax, Target word; Analysis (source side), Generation (target side)
Approaches shown: Gisting (word level), Transfer (syntax level), Interlingua (meaning level)

MT Approaches Interlingua Example: Lexical Conceptual Structure (Dorr, 1993)

MT Approaches
The MT Pyramid: Source word, Source syntax, Source meaning / Target meaning, Target syntax, Target word; Analysis (source side), Generation (target side)
Approaches: Interlingua, Transfer, Gisting

MT Approaches
The MT Pyramid with the resources needed at each level: Interlingual Lexicons (meaning level), Transfer Lexicons (syntax level), Dictionaries/Parallel Corpora (word level)

MT Approaches
Statistical vs. Rule-based (shown on the MT Pyramid: Source word, Source syntax, Source meaning / Target meaning, Target syntax, Target word; Analysis, Generation)

Statistical MT Noisy Channel Model Portions from
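
For reference, the noisy channel objective named on this slide is the standard decomposition below (a sketch in conventional notation, not reproduced from the slide itself): the decoder searches for the target sentence e that maximizes the product of a translation model and a language model.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
       \;=\; \arg\max_{e} \frac{P(f \mid e)\,P(e)}{P(f)}
       \;=\; \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}}\;\underbrace{P(e)}_{\text{language model}}
```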

Statistical MT
Automatic Word Alignment
GIZA++
–A statistical machine translation toolkit used to train word alignments
–Uses Expectation-Maximization with various constraints to bootstrap alignments
Example: Mary did not slap the green witch / Maria no dio una bofetada a la bruja verde
Slide based on Kevin Knight's

Statistical MT IBM Model (Word-based Model)
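
As a concrete illustration of the word-based IBM models and the EM training mentioned above, here is a minimal IBM Model 1 trainer in Python. It is only a sketch under simplifying assumptions: GIZA++ implements Models 1-5 plus HMM alignment with many refinements, and the function name and data format below are invented for this example.

```python
from collections import defaultdict

def train_ibm_model1(bitext, iterations=10):
    """Estimate word translation probabilities t(f | e) with EM (IBM Model 1).

    bitext: list of (source_tokens, target_tokens) pairs.
    Returns a dict mapping (f, e) to P(f | e), including a NULL target word.
    """
    # Uniform initialization over co-occurring word pairs.
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = {}
    for fs, es in bitext:
        for f in fs:
            for e in es + ["NULL"]:
                t[(f, e)] = 1.0 / len(f_vocab)

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        # E-step: distribute each source word's mass over candidate target words.
        for fs, es in bitext:
            es_null = es + ["NULL"]
            for f in fs:
                z = sum(t[(f, e)] for e in es_null)
                for e in es_null:
                    delta = t[(f, e)] / z
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-normalize expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t
```

The learned table can then be used to read off Viterbi-style alignments by linking each source word to the target word with the highest t(f | e), which is the kind of alignment visualized in the Mary/Maria example.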

Phrase-Based Statistical MT
Foreign input segmented into phrases
–A "phrase" is any sequence of words
Each phrase is probabilistically translated into English
–P(to the conference | zur Konferenz)
–P(into the meeting | zur Konferenz)
Phrases are probabilistically re-ordered
See [Koehn et al, 2003] for an intro. This is state-of-the-art!
Example: Morgen fliege ich nach Kanada zur Konferenz → Tomorrow I will fly to the conference in Canada
Slide courtesy of Kevin Knight
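
For reference, the objective optimized in this phrase-based setting, following Koehn et al. (2003), combines phrase translation probabilities, a distortion (reordering) penalty d, and a language model; phrase table entries such as P(to the conference | zur Konferenz) are typically estimated by relative frequency over extracted phrase pairs. A sketch in standard notation (not copied from the slide):

```latex
\hat{e} \;=\; \arg\max_{e}\; p_{LM}(e)\,\prod_{i=1}^{I}\phi(\bar f_i \mid \bar e_i)\,
              d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1),
\qquad
\phi(\bar f \mid \bar e) \;=\; \frac{\mathrm{count}(\bar f, \bar e)}{\sum_{\bar f'} \mathrm{count}(\bar f', \bar e)}
```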

Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word Alignment Induced Phrases (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) Slide courtesy of Kevin Knight

Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word Alignment Induced Phrases (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) Slide courtesy of Kevin Knight

Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word Alignment Induced Phrases (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) Slide courtesy of Kevin Knight

Mary did not slap the green witch Maria no dió una bofetada a la bruja verde (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) … Word Alignment Induced Phrases Slide courtesy of Kevin Knight

Mary did not slap the green witch Maria no dió una bofetada a la bruja verde (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) … (Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch) Word Alignment Induced Phrases Slide courtesy of Kevin Knight
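
The induced phrases above follow from a consistency rule: a phrase pair is kept when no alignment link crosses its boundaries. The sketch below implements that rule in Python under simplifying assumptions (it skips the extension over unaligned target words that full phrase extraction performs); the function name, input format, and alignment links are invented for illustration.

```python
def extract_phrases(src, tgt, alignment, max_len=7):
    """Extract phrase pairs consistent with a word alignment.

    src, tgt: token lists; alignment: set of (src_index, tgt_index) links.
    """
    phrases = []
    for s1 in range(len(src)):
        for s2 in range(s1, min(len(src), s1 + max_len)):
            # Target positions linked to the source span [s1, s2].
            linked = [j for (i, j) in alignment if s1 <= i <= s2]
            if not linked:
                continue
            t1, t2 = min(linked), max(linked)
            if t2 - t1 + 1 > max_len:
                continue
            # Consistency: nothing inside the target span may align outside [s1, s2].
            if any(t1 <= j <= t2 and not s1 <= i <= s2 for (i, j) in alignment):
                continue
            phrases.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return phrases

src = "Maria no dió una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
links = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3), (6, 4), (7, 6), (8, 5)}  # 'a' left unaligned
pairs = extract_phrases(src, tgt, links)
# Contains e.g. ('Maria', 'Mary'), ('no', 'did not'), ('bruja verde', 'green witch').
```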

Advantages of Phrase-Based SMT
Many-to-many mappings can handle non-compositional phrases
Local context is very useful for disambiguating
–"Interest rate" → …
–"Interest in" → …
The more data, the longer the learned phrases
–Sometimes whole sentences
Slide courtesy of Kevin Knight

MT Approaches
Statistical vs. Rule-based vs. Hybrid (shown on the MT Pyramid: Source word, Source syntax, Source meaning / Target meaning, Target syntax, Target word; Analysis, Generation)

MT Approaches
Practical Considerations
Resource Availability
–Parsers and Generators: input/output compatibility
–Translation Lexicons: word-based vs. transfer/interlingua
–Parallel Corpora: domain of interest; bigger is better
Time Availability
–Statistical training, resource building

Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

More art than science
Wide range of Metrics/Techniques
–interface, …, scalability, …, faithfulness, …, space/time complexity, etc.
Automatic vs. Human-based
–Dumb Machines vs. Slow Humans

Human-based Evaluation Example Accuracy Criteria

Human-based Evaluation Example Fluency Criteria

Fluency vs. Accuracy
A plot of accuracy against fluency locating different MT use cases (conMT, FAHQ MT, Prof. MT, Info. MT)

Automatic Evaluation Example Bleu Metric (Papineni et al 2001) Bleu –BiLingual Evaluation Understudy –Modified n-gram precision with length penalty –Quick, inexpensive and language independent –Correlates highly with human evaluation –Bias against synonyms and inflectional variations

Automatic Evaluation Example: Bleu Metric
Test Sentence: colorless green ideas sleep furiously
Gold Standard References:
–all dull jade ideas sleep irately
–drab emerald concepts sleep furiously
–colorless immature thoughts nap angrily

Automatic Evaluation Example: Bleu Metric
Test Sentence: colorless green ideas sleep furiously
Gold Standard References:
–all dull jade ideas sleep irately
–drab emerald concepts sleep furiously
–colorless immature thoughts nap angrily
Unigram precision = 4/5

Automatic Evaluation Example: Bleu Metric
Test Sentence: colorless green ideas sleep furiously
Gold Standard References:
–all dull jade ideas sleep irately
–drab emerald concepts sleep furiously
–colorless immature thoughts nap angrily
Unigram precision = 4/5 = 0.8
Bigram precision = 2/4 = 0.5
Bleu Score = (a1 a2 … an)^(1/n) = (0.8 × 0.5)^(1/2) ≈ 0.632
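
A minimal BLEU-style computation matching the example above: modified (clipped) n-gram precision up to bigrams, combined by a geometric mean with a brevity penalty. Real BLEU uses 4-grams and corpus-level counts, so this is only an illustrative sketch; the function names here are invented.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=2):
    """Modified n-gram precision with brevity penalty (BLEU-style, sentence level)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its maximum count in any single reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, c in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], c)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(clipped / max(1, sum(cand_counts.values())))
    # Brevity penalty against the closest reference length.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    if any(p == 0 for p in precisions):
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "colorless green ideas sleep furiously".split()
refs = ["all dull jade ideas sleep irately".split(),
        "drab emerald concepts sleep furiously".split(),
        "colorless immature thoughts nap angrily".split()]
print(round(bleu(cand, refs), 3))  # 0.632 = (0.8 * 0.5) ** 0.5; brevity penalty is 1 here
```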

Metrics MATR Workshop
A workshop at the AMTA 2008 conference (Association for Machine Translation in the Americas) on evaluating evaluation metrics
Compared 39 metrics
–7 baselines and 32 new metrics
–Various measures of correlation with human judgment
–Different conditions: text genre, source language, number of references, etc.

Interested in MT?? Contact me Research courses, projects Languages of interest: –English, Arabic, Hebrew, Chinese, Urdu, Spanish, Russian, …. Topics –Statistical, Hybrid MT Phrase-based MT with linguistic extensions Component improvements or full-system improvements –MT Evaluation –Multilingual computing

Thank You