Machine Translation, Statistical Approach
Heshaam Faili
Natural Language and Text Processing Laboratory
School of Electrical and Computer Engineering, College of Engineering, University of Tehran

Machine Translation History
Machine Translation Architectures
Rule-Based Machine Translation
Statistical Machine Translation
State of the Art and Future Trends
How to Develop SMT Fast!

MT History
- earliest systems (1950s and 1960s)
- ALPAC report (1966)
- 'quiet' decade (1966 to 1975)
- adoption of Systran by CEC (1976); Météo
- coming of PC systems (1980s); coming of translation aids
- increasing use (since mid 1980s): companies, localisation, etc.
- dominance of 'transfer' framework (1980s)
- interlingua systems (late 1980s)
- translation memory (since late 1980s)
- corpus-based MT research (from late 1980s)
- statistical machine translation (since mid 1990s)
- online MT (since late 1990s)

1949: Warren Weaver writes his memorandum on MT
At the time, collaborating with Claude Shannon on 'information theory'
Discussed the ideas in 1946 and 1947
Four main suggestions:
- disambiguation by examining adjacent words
- brain networks similar to computers
- use of cryptographic methods, treating the source text as a code to be broken: 'When I look at an article in Russian, I say: "This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."'
- a universal language

1952: first MT conference
Yehoshua Bar-Hillel, appointed at MIT in May 1951, surveyed MT and convened the first MT conference at MIT
Topics covered:
- pre-editing, post-editing
- controlled language
- domain restriction (Oswald's microglossaries)
- syntactic analysis (Bar-Hillel's categorial grammar)
- computer hardware, programming
- funding

1954: first public demonstration
Leon Dostert determined to show the 'technical feasibility' of MT
Collaboration of Georgetown University and IBM
Public demonstration in New York, 7 January 1954, of a Russian-English system
Linguistic foundations by Paul Garvin of Georgetown University: 250 words, 6 rules
Programming by Peter Sheridan of IBM
Reported widely; worldwide interest
Beginning of government funding, in both the US and the Soviet Union

1955: beginnings of MT in the Soviet Union
1953: death of Stalin (March 1953) opened access to the science of the West: structural linguistics, and computers
1954: news of the Georgetown-IBM demonstration
1955: first attempts (on the BESM computer)
1956: foundation of groups: Institute of Precision Mechanics, Steklov Mathematical Institute, Institute of Linguistics, Leningrad University

1966: the ALPAC report
Set up by the NSF for the US sponsors of MT research
Concluded: no effective MT despite massive funding, and none in prospect; poor quality output
Criticised at the time for short-sightedness
Brought US funding to an end for many years
Affected funding elsewhere

From 1967 to 1978
Continuation of research in the US (Texas, Wayne State), Soviet Union, UK, Canada, France
1970: Systran installed at USAF (Foreign Technology Division)
1970: TITUS installed (restricted language: textile industry abstracts)
1975: Météo 'sublanguage' English-French system (weather broadcasts)
1975: CULT Chinese-English (restricted language: mathematics)
1976: European Commission acquires Systran
1978: Xerox Corporation uses Systran with controlled language (Caterpillar English)

1981: MT for personal computers
Previously all MT systems ran on mainframe computers
Subsequently (in the 1980s and 1990s): ESI, Instant Spanish, LogoMedia, Personal Translator, PeTra, PROMT, Systran
Many Japanese systems, e.g. Crossroad, LogoVista

1982: AI and interlinguas
Beginning of the 'Fifth Generation' (AI) program in Japan; influence on US research
Research on interlingua systems:
- at Philips (Rosetta), implementing Montague grammar
- at Utrecht (DLT), modified Esperanto, bilingual knowledge base
Research on knowledge-based systems at Colgate University, Carnegie Mellon University, New Mexico State University (PANGLOSS)

1986: speech translation
ATR in Japan, JANUS at Carnegie Mellon, Verbmobil (at various German universities)
Speech recognition, speech synthesis
Highly context dependent; use of 'knowledge databases'
Discourse semantics, 'ill-formed' utterances, ellipsis, use of stress, intonation, modality markers
Restricted fields (telephone booking of hotels and conferences)
Still continuing

1988: corpus-based MT
Availability of large bilingual corpora
Beginning of example-based MT research, first proposed in 1981 by Makoto Nagao
First article on statistical MT, 1988 (research at IBM): revival of Warren Weaver's idea ('decoding' SL as TL)

1993: translation memory
Previous tools: dictionaries, termbanks, concordances
In 1993, launch of the first commercial system: Trados
Later followed by Transit, Déjà Vu, ProMemoria, WordFast, ...
Using aligned bilingual corpora (of human translation), searchable by words and phrases

Developments of statistical MT
- bilingual alignment and monolingual corpora (as 'language model')
- word alignment, e.g. 'Man be khaneh nemiravam' / 'I do not go to the house'
- phrase-based alignment
- features/tags on source trees (corpora) to aid reordering
- comparable corpora (not bilingual translations)
- syntax-aware SMT: syntactic pre-ordering, tree-string decoding
- crowdsourcing (for data, evaluations)

1992: evaluation of MT
Previous evaluations by human judgments: FEMTI (Framework for the Evaluation of Machine Translation in ISLE)
In 1992/1994, DARPA evaluates unedited US systems, comparing automatic measures and human judgments of adequacy, fluency, informativeness
Development of MT evaluation metrics (in parallel with SMT):
- 2001: BLEU (Bilingual Evaluation Understudy), a statistical measure of the similarity of SMT output and human translations
- 2002: NIST (National Institute of Standards and Technology)
- 2005: METEOR (Carnegie Mellon), etc.

1997: free online MT
CompuServe started testing in 1992 (limited to subscribers of some forums)
Systran has offered online translations of webpages since 1996
Babel Fish launched on AltaVista on 9 December 1997, free for all Internet users
Subsequently: FreeTranslation, PROMT, Google, etc.
Usage: mainly short phrases, text not webpages, into the user's native language
Rapid growth:
- Babel Fish: 500,000 per day (May 1998) to 1.3 million (October 2000)
- FreeTranslation: 50,000 per day (December 1999) to 3.4 million (September 2006)

Since 2004: open source toolkits
GIZA++: tool for word alignment in SMT
Moses: platform for building SMT systems
Joshua: decoder for syntax-based (hierarchical) SMT
Apertium: platform for building rule-based MT
META-SHARE: data for EU projects
LetsMT: cloud-based resource for supporting MT research

2006-: some current projects
EuroMatrix, founded 2006
- goal: MT systems between all EU languages (over 500 language pairs); many centres involved, led by Philipp Koehn at Edinburgh University
- statistical phrase-based MT, hybrid systems, statistics-based tree transfer (dependency)
- open source (Moses), collaboration, shared resources; rapid development of systems, open continuous evaluation
LetsMT, funded 2010 in the Baltic states
- online platform for data sharing and building SMT systems, with easy user interaction; cloud computing

Different Architectures
- statistical approaches
- rule-based approaches

Three MT Approaches: Direct, Transfer, Interlingua

Statistical Machine Translation (SMT)
Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world

SMT
Proposed by IBM in the early 1990s: a direct, purely statistical model for MT
Statistical translation models are trained on a sentence-aligned parallel bilingual corpus:
- train word-level alignment models
- extract phrase-to-phrase correspondences
- apply them at runtime on the source input and "decode"
Attractive: completely automatic, no manual rules, much reduced manual labor

Statistical Approach
Consider NLP problems as decoding an encrypted string
Amenable to machine learning (training and generalization)
In classical NLP, rules are obtained from linguists
In statistical NLP, probabilities are learnt from data
Uses aligned corpora
The bigger problem is how to obtain a word-aligned corpus from a sentence-aligned corpus

Statistical Approach
Main drawbacks:
- effective only with large volumes (several mega-words) of parallel text
- broad domain, but domain-sensitive
- still viable only for a small number of language pairs!
Impressive progress in the last 5 years
Large DARPA funding programs (TIDES, GALE)
Lots of research in this direction: GIZA++, Pharaoh, CAIRO, Google, Language Weaver, translator, ...

Warren Weaver (1947)
ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv...

(successive slides fill in the letter guesses e, t, h, o, f, i, s one at a time until the plaintext emerges)

Warren Weaver (1947): the solution
decipherment is the analysis of documents written in ancient languages...
(ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv...)

The non-Turkish guy next to me is even deciphering Turkish! All he needs is a statistical table of letter-pair frequencies in Turkish, collected mechanically from a Turkish body of text, or corpus.
Can this be computerized? A sketch follows.
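As a taste of how it could be computerized, here is a minimal Python sketch that builds Weaver's letter-pair frequency table from a body of text. The toy corpus is an illustrative stand-in, not anything from the slides.

```python
from collections import Counter

def letter_pair_frequencies(corpus: str) -> Counter:
    """Count adjacent letter pairs (bigrams), ignoring non-letters."""
    letters = [c for c in corpus.lower() if c.isalpha()]
    return Counter(zip(letters, letters[1:]))

# Hypothetical stand-in for "a Turkish body of text, or corpus".
corpus = "merhaba dunya bugun hava cok guzel"
print(letter_pair_frequencies(corpus).most_common(5))
```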

“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.” - Warren Weaver, March 1947
“... as to the problem of mechanical translation, I frankly am afraid that the [semantic] boundaries of words in different languages are too vague... to make any quasi-mechanical translation scheme very hopeful.” - Norbert Wiener, April 1947

Spanish/English corpus
1a. Garcia and associates. / 1b. Garcia y asociados.
2a. Carlos Garcia has three associates. / 2b. Carlos Garcia tiene tres asociados.
3a. his associates are not strong. / 3b. sus asociados no son fuertes.
4a. Garcia has a company also. / 4b. Garcia tambien tiene una empresa.
5a. its clients are angry. / 5b. sus clientes estan enfadados.
6a. the associates are also angry. / 6b. los asociados tambien estan enfadados.
7a. the clients and the associates are enemies. / 7b. los clientes y los asociados son enemigos.
8a. the company has three groups. / 8b. la empresa tiene tres grupos.
9a. its groups are in Europe. / 9b. sus grupos estan en Europa.
10a. the modern groups sell strong pharmaceuticals. / 10b. los grupos modernos venden medicinas fuertes.
11a. the groups do not sell zenzanine. / 11b. los grupos no venden zanzanina.
12a. the small groups are not modern. / 12b. los grupos pequenos no son modernos.

Translate: Clients do not sell pharmaceuticals in Europe.

1a. ok-voon ororok sprok. / 1b. at-voon bichat dat.
2a. ok-drubel ok-voon anok plok sprok. / 2b. at-drubel at-voon pippat rrat dat.
3a. erok sprok izok hihok ghirok. / 3b. totat dat arrat vat hilat.
4a. ok-voon anok drok brok jok. / 4b. at-voon krat pippat sat lat.
5a. wiwok farok izok stok. / 5b. totat jjat quat cat.
6a. lalok sprok izok jok stok. / 6b. wat dat krat quat cat.
7a. lalok farok ororok lalok sprok izok enemok. / 7b. wat jjat bichat wat dat vat eneat.
8a. lalok brok anok plok nok. / 8b. iat lat pippat rrat nnat.
9a. wiwok nok izok kantok ok-yurp. / 9b. totat nnat quat oloat at-yurp.
10a. lalok mok nok yorok ghirok clok. / 10b. wat nnat gat mat bat hilat.
11a. lalok nok crrrok hihok yorok zanzanok. / 11b. wat nnat arrat mat zanzanat.
12a. lalok rarok nok izok hihok mok. / 12b. wat nnat forat arrat vat gat.

Your assignment: translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

(Successive slides work through the example: pairing each source word with a target word by process of elimination, spotting a likely cognate (zanzanok/zanzanat), and noting a zero-fertility word that disappears in translation.)

Follow-up assignment: put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }

“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.” - Warren Weaver, March 1947
The required statistical tables would have millions of entries: too much for the computers of Weaver's day. Not enough RAM!

IBM Candide Project
How do you get quantities of human translation in computer-readable form? A parallel corpus.
IBM's John Cocke, inventor of CKY parsing & RISC processors

IBM Candide Project [Brown et al 93]
(diagram: French/English bilingual text and English text feed a statistical analysis, yielding a French → "broken English" → English pipeline)
Input: J' ai si faim
Candidates: What hunger have I / Hungry I am so / I am so hungry / Have me that hunger ...
Output: I am so hungry

Noisy Channel Model

Mathematical Formulation
J' ai si faim → I am so hungry
Components: translation model P(f | e), language model P(e), decoding algorithm argmax_e P(e) · P(f | e)
Given a source (French) sentence f:
argmax_e P(e | f) = argmax_e P(f | e) · P(e) / P(f)   (by Bayes' rule)
                  = argmax_e P(f | e) · P(e)          (P(f) is the same for all e)

Alternate Approach: Statistics
Given a Farsi sentence F, we search for an E that maximizes P(E|F)
Three components: language model, translation model, decoder

Why Bayes' rule at all? Why not model P(E|F) directly?
The P(F|E) · P(E) decomposition allows us to separate concerns:
- P(E) worries about good English
- P(F|E) worries about Farsi that matches the English
The two can be trained independently

Jon در تلويزيون نشان داده شد ('Jon was shown on television')
Which candidates are good English (P(E))? Which match the Farsi well (P(F|E))?
- Jon appeared in TV.
- It back twelve saw.
- In Jon appeared TV.
- Jon is happy today.
- Jon appeared on TV.
- TV appeared on Jon.
- Jon was not happy.

I speak English good.
How are we going to model good English? How do we know these sentences are not good English?
- Jon appeared in TV.
- It back twelve saw.
- In Jon appeared TV.
- TV appeared on Jon.
- Je ne parle pas l'anglais.

Je ne parle pas l'anglais.: these aren't English words.
It back twelve saw.: these are English words, but it's not a valid sentence.
Jon appeared in TV.: "appeared in TV" isn't proper English.
I speak English good.

Let's say we have a huge collection of documents written in English. Like, say, the Internet.
It would be a pretty comprehensive list of English words, save for "named entities": people, places, things.
Might include some non-English words. Speling mitsakes! lol!
Could also tell if a phrase is good English: I speak English good.

Google, is this good English?
Jon appeared in TV.
- "Jon appeared": 1,800,000 Google results
- "appeared in TV": 45,000 Google results
- "appeared on TV": 210,000 Google results
It back twelve saw.
- "twelve saw": 1,100 Google results
- "It back twelve": 586 Google results
- "back twelve saw": 0 Google results
Imperfect counting... why?

Google, is this good English?
Language is often modeled this way: collect statistics about the frequency of words and phrases, i.e. n-gram statistics (see the sketch below).
1-gram = unigram, 2-gram = bigram, 3-gram = trigram, 4-gram = four-gram, 5-gram = five-gram
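A minimal sketch of the same idea in code: estimate bigram probabilities from a corpus and use them to compare sentences. The tiny corpus and the add-one smoothing are illustrative assumptions, not how Google counts.

```python
from collections import Counter

corpus = ("jon appeared on tv . jon is happy today . "
          "jon was not happy .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def bigram_prob(w1: str, w2: str) -> float:
    """P(w2 | w1), add-one smoothed so unseen pairs score low but non-zero."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

def sentence_score(sentence: str) -> float:
    """Product of bigram probabilities over adjacent word pairs."""
    words = sentence.lower().split()
    score = 1.0
    for pair in zip(words, words[1:]):
        score *= bigram_prob(*pair)
    return score

print(sentence_score("jon appeared on tv"))  # higher
print(sentence_score("tv appeared on jon"))  # lower: unseen bigrams
```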

Back to Translation
Anyway, where were we? Oh right: we've got P(e), so let's talk about P(f|e).

Where will we get P(F|E)?
Books in English + the same books in Farsi + machine-learning magic → a P(F|E) model
We call collections stored in two languages parallel corpora, or parallel texts.
Want to update your system? Just add more text!

Translated Corpora
The Canadian Parliamentary Debates: available in both French and English
UN documents: available in Arabic, Chinese, English, French, Russian and Spanish

Problem: how are we going to generalize from examples of translations?
The rest of this lecture covers: what makes a useful P(F|E), and how to obtain the statistics needed for P(F|E) from parallel texts.

We can visualize a word alignment in two different ways.

New Information
Call this new info a word alignment (A). With A, we can make a good story.
English: The quick fox jumps over the lazy dog
French: Le renard rapide saute par-dessus le chien paresseux
Persian: روباه سریع از روی سگ تنبل پرید
A sketch of learning such word-level correspondences follows.
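Word alignments like this are classically learned with the IBM models. Below is a minimal sketch of IBM Model 1 training with EM, which learns word-translation probabilities t(f|e) from sentence pairs alone; the three-pair corpus is purely illustrative, and the null word and other refinements of the full model are omitted.

```python
from collections import defaultdict

# Tiny illustrative corpus of (English, French) sentence pairs.
pairs = [("the house", "la maison"),
         ("the book", "le livre"),
         ("a book", "un livre")]
pairs = [(e.split(), f.split()) for e, f in pairs]

t = defaultdict(lambda: 1.0)  # t(f|e), uniform start (any constant works)

for _ in range(10):  # EM iterations
    count = defaultdict(float)  # expected counts of (f, e) links
    total = defaultdict(float)  # normaliser per English word e
    for e_sent, f_sent in pairs:
        for f in f_sent:
            # E-step: how likely is each e to have generated this f?
            z = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    # M-step: re-estimate t(f|e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("maison", "house")], 3))  # approaches 1.0
print(round(t[("livre", "book")], 3))    # approaches 1.0
```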

Where's "heaven" in Vietnamese?
English: In the beginning God created the heavens and the earth.
Vietnamese: Ban đầu Đức Chúa Trời dựng nên trời đất.
English: God called the expanse heaven.
Vietnamese: Đức Chúa Trời đặt tên khoảng không là trời.
English: ... you are this day like the stars of heaven in number.
Vietnamese: ... các ngươi đông như sao trên trời.
(Answer: trời, the word shared by all three Vietnamese sentences.)

Translation Model: sample translation probabilities [Brown et al 93]

e          f              P(f | e)
national   nationale      0.47
national   national       0.42
national   nationaux      0.05
national   nationales     0.03
the        le             0.50
the        la             0.21
the        les            0.16
the        l'             0.09
the        ce             0.02
the        cette          0.01
farmers    agriculteurs   0.44
farmers    les            0.42
farmers    cultivateurs   0.05
farmers    producteurs    0.02

Language Model: sample bigram probabilities

w1     w2        P(w2 | w1)
of     the       0.13
of     a         0.09
of     another   0.01
of     some      0.01
hong   kong      0.98
hong   said      0.01
hong   stated    0.01

Given a new French sentence f and a potential translation e, the translation model supplies P(f | e), the language model supplies P(e), and P(f | e) · P(e) is the score for e (see the sketch below).
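A minimal sketch of this scoring, assuming a one-to-one, in-order word alignment for simplicity (the real translation model sums over alignments). The translation entries echo the sample table above; the language-model numbers are invented for illustration.

```python
# Sample translation probabilities t(f|e), echoing the table above.
t = {("les", "the"): 0.16, ("les", "farmers"): 0.42,
     ("agriculteurs", "farmers"): 0.44}
# Invented bigram language-model values (not from the slides).
lm = {("the", "farmers"): 0.02, ("farmers", "the"): 0.001}

def score(f_words, e_words):
    """P(f|e) * P(e) under a one-to-one, in-order word alignment."""
    p_f_given_e = 1.0
    for f, e in zip(f_words, e_words):
        p_f_given_e *= t.get((f, e), 1e-6)
    p_e = 1.0
    for pair in zip(e_words, e_words[1:]):
        p_e *= lm.get(pair, 1e-6)
    return p_f_given_e * p_e

f = ["les", "agriculteurs"]
for e in (["the", "farmers"], ["farmers", "the"]):
    print(" ".join(e), score(f, e))  # "the farmers" wins on both models
```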

Classic Decoding Algorithm
Given f, find the English string e that maximizes P(e) · P(f | e)
NP-complete [Knight 99], related to the Traveling Salesman Problem
Brown et al 93: "In this paper, we focus on the translation modeling problem. We hope to deal with the [decoding] problem in a later paper."

Beam Search Decoding [Brown et al, US Patent #5,477,451]
(diagram: a search lattice from start to end; hypotheses choose the 1st, 2nd, 3rd, 4th, ... English word until all source words are covered)

Each partial translation hypothesis contains:
- the last English word chosen, plus the source words covered by it
- the next-to-last English word chosen
- the entire coverage vector (so far) of the source sentence
- the language model and translation model scores (so far)
- a link to its best predecessor
[Jelinek 69; Och, Ueffing, and Ney, 01]
A simplified sketch of this stack decoding follows.
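A deliberately simplified stack-decoding sketch in this spirit: hypotheses are extended one source word at a time, grouped by how many source words they cover, and pruned to a beam. Real decoders also fold in language-model scores, reordering, and predecessor links; the toy translation table and its probabilities are assumptions.

```python
import heapq

# Toy word-translation table: source word -> [(english, prob), ...].
table = {"maria": [("mary", 0.9)],
         "no": [("not", 0.5), ("did not", 0.4)],
         "canta": [("sings", 0.8), ("sang", 0.2)]}

def decode(source, beam_size=2):
    """Monotone stack decoding: stacks[i] holds hypotheses covering i words."""
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0] = [(1.0, ())]  # (score so far, English words so far)
    for i, src_word in enumerate(source):
        for score, english in stacks[i]:
            for e, p in table.get(src_word, [(src_word, 1e-6)]):
                stacks[i + 1].append((score * p, english + (e,)))
        # Beam pruning: keep only the best hypotheses in each stack.
        stacks[i + 1] = heapq.nlargest(beam_size, stacks[i + 1])
    best_score, best_words = max(stacks[-1])
    return " ".join(best_words), best_score

print(decode(["maria", "no", "canta"]))  # ('mary not sings', 0.36)
```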

How Much Data Do We Need?
(chart: quality of the automatically trained machine translation system vs. amount of bilingual training data)

Ready-to-Use Online Bilingual Data
(chart: millions of words (English side) per language pair, stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn; a second version of the chart adds the European Parliament data [Koehn 05])

Where is Persian?

Flaws of Word-Based MT
- can't translate multiple English words to one French word
- can't translate phrases: "real estate", "note that", "interest in"
- isn't sensitive to syntax: adjectives/nouns should swap order; the verb comes at the beginning in Arabic
- doesn't understand the meaning (?)

The MT Triangle
(diagram: source and target sides connected at increasing levels of abstraction: words, syntax, logical form, interlingua)

The MT Swimming Pool
(the same levels redrawn as a pool: words, syntax, logical form, interlingua)

Commercial Rule-Based Systems
(positioned on the MT triangle)

Knight et al 95: meaning-based translation, composition rules
(positioned on the MT triangle, with a target language model)

Wu 97, Alshawi 98: inducing syntactic structure as a by-product of aligning words in bilingual text
(positioned on the MT triangle, with a target language model)

Yamada/Knight (01, 02): tree/string model, using an existing target-language parser
(positioned on the MT triangle, with a target language model)

Well, these all seem like good ideas. Which one had the most dramatic effect on MT quality? None of them!

Phrases
How do you translate "real estate" into French?
real estate → real number → dance number → dance card → memory card → memory stick → ...

The Phrase-Based Translation Model (Koehn et al., 2003)

Phrase-Based Statistical MT
Foreign input is segmented into phrases ("phrase" just means "word sequence")
Each phrase is probabilistically translated into English:
- P(to the conference | zur Konferenz)
- P(into the meeting | zur Konferenz)
Phrases are probabilistically re-ordered
See [Koehn et al, 2003] for an overview.
Example:
Morgen | fliege ich | nach Kanada | zur Konferenz
Tomorrow | I will fly | to the conference | in Canada
(phrase pairs: Morgen/Tomorrow, fliege ich/I will fly, nach Kanada/in Canada, zur Konferenz/to the conference; the last two are reordered)
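A minimal sketch of applying a phrase table to this example with greedy longest-match segmentation. The phrase entries mirror the slide's example, but the probabilities are invented, and the reordering step (which would swap "in Canada" and "to the conference") is deliberately left out.

```python
# Toy phrase table mirroring the slide's example (probabilities invented).
phrase_table = {("morgen",): ("tomorrow", 0.8),
                ("fliege", "ich"): ("i will fly", 0.6),
                ("nach", "kanada"): ("in canada", 0.7),
                ("zur", "konferenz"): ("to the conference", 0.9)}

def translate(words):
    """Greedy longest-match segmentation, translated phrase by phrase."""
    output, prob, i = [], 1.0, 0
    while i < len(words):
        for j in range(len(words), i, -1):  # try the longest span first
            src = tuple(words[i:j])
            if src in phrase_table:
                tgt, p = phrase_table[src]
                output.append(tgt)
                prob *= p
                i = j
                break
        else:  # unknown word: pass it through untranslated
            output.append(words[i])
            i += 1
    return " ".join(output), prob

print(translate("morgen fliege ich nach kanada zur konferenz".split()))
# ('tomorrow i will fly in canada to the conference', 0.3024)
```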

How to Learn the Phrase Translation Table?
One method: "alignment templates" [Och et al 99]
Start with a word alignment, then collect all phrase pairs that are consistent with the word alignment (worked example and sketch below).

Word Alignment Induced Phrases
Maria no dió una bofetada a la bruja verde
Mary did not slap the green witch
Phrase pairs consistent with the word alignment, from smallest to largest:
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
(Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) ...
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
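A sketch of the standard consistency check in code: a phrase pair is kept only if none of its words are aligned outside the pair. The alignment links below encode the Maria/Mary example above; the extension over unaligned words (which additionally yields pairs like "a la / the") is omitted to keep the sketch short.

```python
src = "maria no dió una bofetada a la bruja verde".split()
tgt = "mary did not slap the green witch".split()
# (source index, target index) links from the word alignment above.
links = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
         (6, 4), (7, 6), (8, 5)}

def extract_phrases(src, tgt, links, max_len=5):
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # Target positions linked to the source span [s1, s2].
            ts = [t for s, t in links if s1 <= s <= s2]
            if not ts:
                continue
            t1, t2 = min(ts), max(ts)
            # Consistency: no link may enter the target span from outside.
            if any(t1 <= t <= t2 and not s1 <= s <= s2 for s, t in links):
                continue
            pairs.append((" ".join(src[s1:s2 + 1]),
                          " ".join(tgt[t1:t2 + 1])))
    return pairs

for pair in extract_phrases(src, tgt, links):
    print(pair)  # ('maria', 'mary'), ('no', 'did not'), ...
```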

متشکرم (Thank you!)