
1 Machine Translation, Statistical Approach Heshaam Faili Natural Language and Text Processing Laboratory School of Electrical and Computer Engineering, College of Engineering, University of Tehran, hfaili@ut.ac.ir

2 Machine Translation History Machine Translation Architectures Rule-Based Machine Translation Statistical Machine Translation State of the Art and Future Trends How to Develop SMT Fast!

3 MT History earliest systems (1950s and 1960s) ALPAC report (1966) ‘quiet’ decade (1966 to 1975) adoption of Systran by CEC (1976); Météo coming of PC systems (1980s); coming of translation aids increasing use (since mid 1980s): companies, localisation, etc. dominance of ‘transfer’ framework (1980s) interlingua systems (late 1980s) translation memory (since late 1980s) corpus-based MT research (from late 1980s) statistical machine translation (since mid 1990s) online MT (since late 1990s)

4 1949: Warren Weaver writes memorandum on MT At the time, collaborating with Claude Shannon on ‘information theory’ Discussed ideas in 1946 and 1947 Four main suggestions: Disambiguation by examining adjacent words Brain networks similar to computers Use of cryptographic methods: decoding of source text – 'When I look at an article in Russian, I say: “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”' Universal language

5 1952: first MT conference Yehoshua Bar-Hillel, appointed at MIT in May 1951, surveyed MT Convened first MT conference at MIT Topics covered: Pre-editing, post-editing Controlled language Domain restriction (Oswald's microglossaries) Syntactic analysis (Bar-Hillel's categorial grammar) Computer hardware, programming Funding

6 1954: first public demonstration Leon Dostert determined to show 'technical feasibility' of MT Collaboration of Georgetown University and IBM Public demonstration in New York, 7th January 1954, of a Russian-English system Linguistic foundations by Paul Garvin of Georgetown U. – 250 words, 6 rules Programming by Peter Sheridan of IBM Reported widely, worldwide interest Beginning of government funding in both US and Soviet Union

7 1955: beginnings of MT in Soviet Union Death of Stalin, March 1953, opened access to science of the West: structural linguistics, and computers 1954: news of Georgetown-IBM demonstration 1955: first attempts (BESM) 1956: foundation of groups: Institute of Precision Mechanics, Steklov Mathematical Institute, Institute of Linguistics, Leningrad University

8 1966: the ALPAC report Set up by NSF for US sponsors of MT research Concluded: No effective MT despite massive funding, and none in prospect Poor quality output Criticised at the time for short-sightedness Brought US funding to an end for many years Affected funding elsewhere

9 From 1967 to 1978 Continuation of research in US (Texas, Wayne State), Soviet Union, UK, Canada, France 1970: Systran installed at USAF (Foreign Technology Division) 1970: TITUS installed (restricted language: textile industry abstracts) 1975: Météo ‘sublanguage’ English-French system (weather broadcasts) 1975: CULT Chinese-English (restricted language: mathematics) 1976: European Commission acquires Systran 1978: Xerox Corporation uses Systran with controlled language (Caterpillar English)

10 1981: MT for personal computers Previously all MT systems for mainframe computers Subsequently (in 1980s and 1990s): ESI, Instant Spanish, LogoMedia, Personal Translator, PeTra, PROMT, Systran Many Japanese systems e.g. Crossroad, LogoVista

11 1982: AI and interlinguas Beginning of 'Fifth Generation' (AI) program in Japan; influence on US research Research on interlingua systems At Philips (Rosetta) – implementing Montague grammar At Utrecht (DLT) – modified Esperanto, bilingual knowledge base Research on knowledge-based systems At Colgate University, Carnegie-Mellon University, New Mexico State University (PANGLOSS)

12 1986: speech translation ATR in Japan, JANUS at Carnegie-Mellon, Verbmobil (at various German universities) Speech recognition, speech synthesis Highly context dependent, use of ‘knowledge databases’ Discourse semantics, ‘ill-formed’ utterances Ellipsis, use of stress, intonation, modality markers Restricted fields (telephone booking of hotels and conferences) Still continuing

13 1988: corpus-based MT Availability of large bilingual corpora Beginning of Example-based MT research, 1988-89 First proposed in 1981 by Makoto Nagao First article on Statistical MT, 1988 (research at IBM) Revival of Warren Weaver’s idea (‘decoding’ SL as TL)

14 1993: translation memory Previous tools: dictionaries, termbanks, concordances In 1993, launch of first commercial system: Trados Later followed by Transit, Déjà Vu, ProMemoria, WordFast, ... Using aligned bilingual corpora (of human translation), searchable by words and phrases

15 Developments of statistical MT Bilingual alignment and monolingual corpora (as 'language model') Word alignment: Man be khaneh nemiravam ↔ I do not go to the house Phrase-based alignment Features/tags on source trees (corpora) to aid reordering Comparable corpora (not bilingual translations) Syntax-aware SMT: syntactic pre-ordering, tree-string decoding Crowdsourcing (for data, evaluations)

16 1992: evaluation of MT Previous evaluations by human judgments: FEMTI (Framework for the Evaluation of Machine Translation in ISLE) In 1992/1994, DARPA evaluated unedited US systems, comparing automatic measures and human judgments of adequacy, fluency, informativeness Development of MT evaluation metrics (in parallel with SMT): 2001: BLEU (Bilingual Evaluation Understudy) – statistical measures of similarity of SMT output and human translations 2001-2005: NIST (National Institute of Standards and Technology) 2005: METEOR (Carnegie Mellon), etc.

17 1997: free online MT CompuServe started testing in 1992 (limited to subscribers of some forums) Systran offered online translations of webpages since 1996 Babel Fish launched on AltaVista on December 9, 1997, free for all Internet users Subsequently: FreeTranslation, PROMT, Google, etc. Usage: mainly short phrases, text not webpages, into native language Rapid growth: Babel Fish from 500,000 per day (May 1998) to 1.3 million (October 2000); FreeTranslation from 50,000 per day (December 1999) to 3.4 million (September 2006)

18 Since 2004: open source toolkits GIZA++: tool for alignment in SMT Moses: platform for building SMT systems Joshua: decoder for syntax-based (hierarchical) SMT Apertium: platform for building rule-based MT META-SHARE: data for EU projects LetsMT: cloud-based resource for supporting MT research

19 2006-: some current projects EuroMatrix, founded 2006 Goal: MT systems between all EU languages (over 500 language pairs), many centres involved, led by Philipp Koehn at Edinburgh University Statistical phrase-based MT, hybrid systems, statistics-based tree transfer (dependency) Open source (Moses), collaboration, shared resources; rapid development of systems, open continuous evaluation LetsMT – funded 2010 in Baltic states Online platform for data sharing and building SMT systems, with easy user interaction; cloud computing

20 Different Architectures Statistical Approaches Rule-Based Approaches

21 Three MT Approaches: Direct, Transfer, Interlingua

22 Statistical Machine Translation (SMT) Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world

23 SMT Proposed by IBM in the early 1990s: a direct, purely statistical model for MT Statistical translation models are trained on a sentence-aligned parallel bilingual corpus Train word-level alignment models Extract phrase-to-phrase correspondences Apply them at runtime on source input and “decode” Attractive: completely automatic, no manual rules, much reduced manual labor

24 Heshaam Faili (hfaili@ece.ut.ac.ir) 10/2/2016 Statistical Approach Consider NLP problems as decoding an encrypted string Amenable to machine learning (training and generalization) In classical NLP, rules are obtained from linguists In statistical NLP, probabilities are learnt from data Uses aligned corpora The bigger problem is how to obtain a word-aligned corpus from a sentence-aligned corpus

25 Statistical Approach Main drawbacks: Effective only with large volumes (several mega-words) of parallel text Broad domain, but domain-sensitive Still viable only for a small number of language pairs! Impressive progress in the last 5 years Large DARPA funding programs (TIDES, GALE) Lots of research in this direction GIZA++, Pharaoh, CAIRO, Google, Language Weaver, 1-800 translator, …

26 Warren Weaver (1947) ingcmpnqsnwf cv fpn owoktvcv hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv...

27 e e e e ingcmpnqsnwf cv fpn owoktvcv e e e hu ihgzsnwfv rqcffnw cw owgcnwf e kowazoanv... Warren Weaver (1947)

28 e e e the ingcmpnqsnwf cv fpn owoktvcv e e e hu ihgzsnwfv rqcffnw cw owgcnwf e kowazoanv... Warren Weaver (1947)

29 e he e the ingcmpnqsnwf cv fpn owoktvcv e e e t hu ihgzsnwfv rqcffnw cw owgcnwf e kowazoanv... Warren Weaver (1947)

30 e he e of the ingcmpnqsnwf cv fpn owoktvcv e e e t hu ihgzsnwfv rqcffnw cw owgcnwf e kowazoanv... Warren Weaver (1947)

31 e he e of the fof ingcmpnqsnwf cv fpn owoktvcv e f o e o oe t hu ihgzsnwfv rqcffnw cw owgcnwf ef kowazoanv... Warren Weaver (1947)

32 e he e of the ingcmpnqsnwf cv fpn owoktvcv e e e t hu ihgzsnwfv rqcffnw cw owgcnwf e kowazoanv... Warren Weaver (1947)

33 e he e is the sis ingcmpnqsnwf cv fpn owoktvcv e s i e i ie t hu ihgzsnwfv rqcffnw cw owgcnwf es kowazoanv... Warren Weaver (1947)

34 decipherment is the analysis ingcmpnqsnwf cv fpn owoktvcv of documents written in ancient hu ihgzsnwfv rqcffnw cw owgcnwf languages... kowazoanv... Warren Weaver (1947)

35 The non-Turkish guy next to me is even deciphering Turkish! All he needs is a statistical table of letter- pair frequencies in Turkish … Collected mechanically from a Turkish body of text, or corpus Can this be computerized?

36 “When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.” - Warren Weaver, March 1947 “... as to the problem of mechanical translation, I frankly am afraid that the [semantic] boundaries of words in different languages are too vague... to make any quasi-mechanical translation scheme very hopeful.” - Norbert Wiener, April 1947

37 Spanish/English corpus
1a. Garcia and associates. 1b. Garcia y asociados.
2a. Carlos Garcia has three associates. 2b. Carlos Garcia tiene tres asociados.
3a. his associates are not strong. 3b. sus asociados no son fuertes.
4a. Garcia has a company also. 4b. Garcia tambien tiene una empresa.
5a. its clients are angry. 5b. sus clientes estan enfadados.
6a. the associates are also angry. 6b. los asociados tambien estan enfadados.
7a. the clients and the associates are enemies. 7b. los clientes y los asociados son enemigos.
8a. the company has three groups. 8b. la empresa tiene tres grupos.
9a. its groups are in Europe. 9b. sus grupos estan en Europa.
10a. the modern groups sell strong pharmaceuticals. 10b. los grupos modernos venden medicinas fuertes.
11a. the groups do not sell zenzanine. 11b. los grupos no venden zanzanina.
12a. the small groups are not modern. 12b. los grupos pequenos no son modernos.

38 (Same corpus as slide 37.) Translate: Clients do not sell pharmaceuticals in Europe.

39
1a. ok-voon ororok sprok. 1b. at-voon bichat dat.
2a. ok-drubel ok-voon anok plok sprok. 2b. at-drubel at-voon pippat rrat dat.
3a. erok sprok izok hihok ghirok. 3b. totat dat arrat vat hilat.
4a. ok-voon anok drok brok jok. 4b. at-voon krat pippat sat lat.
5a. wiwok farok izok stok. 5b. totat jjat quat cat.
6a. lalok sprok izok jok stok. 6b. wat dat krat quat cat.
7a. lalok farok ororok lalok sprok izok enemok. 7b. wat jjat bichat wat dat vat eneat.
8a. lalok brok anok plok nok. 8b. iat lat pippat rrat nnat.
9a. wiwok nok izok kantok ok-yurp. 9b. totat nnat quat oloat at-yurp.
10a. lalok mok nok yorok ghirok clok. 10b. wat nnat gat mat bat hilat.
11a. lalok nok crrrok hihok yorok zanzanok. 11b. wat nnat arrat mat zanzanat.
12a. lalok rarok nok izok hihok mok. 12b. wat nnat forat arrat vat gat.
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

40-49 (Slides 40-49 repeat the corpus while the translation of 'farok crrrok hihok yorok clok kantok ok-yurp' is worked out word by word: unknown words are flagged '???', most correspondences fall out by process of elimination, and 'ok-yurp' is matched to 'at-yurp' as a likely cognate.)

50 Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp } (note: zero fertility – some source words may generate no target word)

51 “When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.” - Warren Weaver, March 1947 The required statistical tables have millions of entries… Too much for the computers of Weaver’s day: not enough RAM!

52 IBM Candide Project (1988-1994) How to get quantities of human translation in computer-readable form? A parallel corpus. IBM’s John Cocke, inventor of CKY parsing & RISC processors

53 IBM Candide Project [Brown et al 93] (Diagram: French/English bilingual text and English text feed a statistical analysis; the French 'J'ai si faim' yields candidate broken-English renderings 'What hunger have I', 'Hungry I am so', 'I am so hungry', 'Have me that hunger', ..., from which 'I am so hungry' is selected.)

54 Noisy Channel model

55 Mathematical Formulation French 'J'ai si faim' ↔ broken English 'I am so hungry' Translation Model P(f | e), Language Model P(e), Decoding algorithm: argmax_e P(e) · P(f | e) Given source sentence f: argmax_e P(e | f) = argmax_e P(f | e) · P(e) / P(f) (by Bayes Rule) = argmax_e P(f | e) · P(e), since P(f) is the same for all e

56 Alternate Approach: Statistics Given a Farsi sentence F, we could do a search for an E that maximizes P(E|F) Three components: Language model Translation model Decoder

57 Why Bayes rule at all? Why not model P(E|F) directly? P(F|E)P(E) decomposition allows us to : P(E) worries about good English P(F|E) worries about Farsi that matches English The two can be trained independently
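The division of labor can be sketched with toy models. Everything below is invented for illustration: the transliterated Farsi sentence, the candidate set, and all probabilities are made-up numbers, not trained values.

```python
# Toy language model P(e): rewards fluent English (illustrative numbers)
lm = {
    "Jon appeared on TV.": 0.010,
    "Jon appeared in TV.": 0.001,
    "TV appeared on Jon.": 0.002,
}

# Toy translation model P(f|e): rewards matching the Farsi (illustrative)
f = "Jon dar television neshan dade shod"   # transliterated source sentence
tm = {
    (f, "Jon appeared on TV."): 0.30,
    (f, "Jon appeared in TV."): 0.30,
    (f, "TV appeared on Jon."): 0.01,
}

def decode(f):
    """argmax_e P(e) * P(f|e): the two models are trained independently
    but combined at decoding time."""
    candidates = [e for (src, e) in tm if src == f]
    return max(candidates, key=lambda e: lm[e] * tm[(f, e)])

print(decode(f))  # -> Jon appeared on TV.
```

Note how the translation model alone cannot separate "on TV" from "in TV" (both 0.30); the language model breaks the tie, which is exactly why the decomposition pays off.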

58 Candidates for translating “Jon در تلويزيون نشان داده شد” are judged on two axes: good English? P(E); good match to Farsi? P(F|E) Candidates: Jon appeared in TV. It back twelve saw. In Jon appeared TV. Jon is happy today. Jon appeared on TV. TV appeared on Jon. Jon was not happy.


60 I speak English good. How are we going to model good English? How do we know these sentences are not good English? Jon appeared in TV. It back twelve saw. In Jon appeared TV. TV appeared on Jon. Je ne parle pas l'anglais.

61 Je ne parle pas l'anglais. These aren’t English words. It back twelve saw. These are English words, but it’s not a valid sentence. Jon appeared in TV. “appeared in TV” isn’t proper English. I speak English good.

62 Let’s say we have a huge collection of documents written in English Like, say, the Internet. It would be a pretty comprehensive list of English words Save for “named entities”: people, places, things Might include some non-English words Speling mitsakes! lol! Could also tell if a phrase is good English I speak English good.

63 Google, is this good English? Jon appeared in TV. “Jon appeared”: 1,800,000 Google results “appeared in TV”: 45,000 Google results “appeared on TV”: 210,000 Google results It back twelve saw. “twelve saw”: 1,100 Google results “It back twelve”: 586 Google results “back twelve saw”: 0 Google results Imperfect counting… why?

64 Google, is this good English? Language is often modeled this way Collect statistics about the frequency of words and phrases N-gram statistics: 1-gram = unigram, 2-gram = bigram, 3-gram = trigram, 4-gram = four-gram, 5-gram = five-gram
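Counting n-gram statistics is a few lines of code. A minimal sketch with a three-sentence stand-in for "the Internet" (the corpus and the resulting numbers are illustrative only):

```python
from collections import Counter

# Build unigram and bigram counts from a tiny corpus
corpus = [
    "jon appeared on tv",
    "jon appeared on stage",
    "the show appeared on tv",
]
unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p_bigram("appeared", "on"))  # 3/3 = 1.0
print(p_bigram("on", "tv"))        # 2/3
```

Real language models smooth these counts (unseen bigrams should not get probability exactly zero), but the counting core is just this.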

65 Back to Translation Anyway, where were we? Oh right… So, we’ve got P(e), let’s talk P(f|e)

66 Where will we get P(F|E)? (Diagram: books in English + the same books in Farsi → Machine Learning Magic → P(F|E) model) We call collections stored in two languages parallel corpora or parallel texts Want to update your system? Just add more text!

67 Translated Corpora The Canadian Parliamentary Debates (Hansard) Available in both French and English UN documents Available in Arabic, Chinese, English, French, Russian and Spanish

68 Problem: How are we going to generalize from examples of translations? I’ll spend the rest of this lecture telling you: What makes a useful P(F|E) How to obtain the statistics needed for P(F|E) from parallel texts

69 We can visualize the word alignment in two different ways

70 New Information Call this new info a word alignment (A) With A, we can make a good story The quick fox jumps over the lazy dog Le renard rapide saut par - dessus le chien parasseux روباه سریع از روی سگ تنبل پرید

71 Where’s “heaven” in Vietnamese?

72 English :In the beginning God created the heavens and the earth. Vietnamese :Ban dâu Dúc Chúa Tròi dung nên tròi dât. English :God called the expanse heaven. Vietnamese :Dúc Chúa Tròi dat tên khoang không la tròi. English :… you are this day like the stars of heaven in number. Vietnamese :… các nguoi dông nhu sao trên tròi. Where’s “heaven” in Vietnamese?

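The hunt for "heaven" above is just co-occurrence counting, the seed of word-alignment models like IBM Model 1. A minimal sketch over the three verse pairs (tokenization simplified and diacritics dropped, so 'tròi' appears as 'troi'):

```python
from collections import Counter

# Simplified sentence pairs from the slides (diacritics dropped)
pairs = [
    ("in the beginning god created the heavens and the earth",
     "ban dau duc chua troi dung nen troi dat"),
    ("god called the expanse heaven",
     "duc chua troi dat ten khoang khong la troi"),
    ("you are this day like the stars of heaven in number",
     "cac nguoi dong nhu sao tren troi"),
]

def cooccurrence(english_word):
    """Count target words co-occurring with english_word (or its plural)
    across all sentence pairs in which it appears."""
    counts = Counter()
    for e, v in pairs:
        if english_word in e.split() or english_word + "s" in e.split():
            counts.update(set(v.split()))
    return counts

# 'troi' is the only word present in all three matching verses
print(cooccurrence("heaven").most_common(3))
```

The confound is visible too: 'duc chua troi' (God) also contains 'troi', exactly the kind of ambiguity that EM-style alignment training resolves by using many more sentence pairs.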

74 Translation Model: Sample Translation Probabilities [Brown et al 93]
P(f | e):
e = national: nationale 0.47, national 0.42, nationaux 0.05, nationales 0.03
e = the: le 0.50, la 0.21, les 0.16, l’ 0.09, ce 0.02, cette 0.01
e = farmers: agriculteurs 0.44, les 0.42, cultivateurs 0.05, producteurs 0.02

75 (Same translation table as slide 74.) A new French sentence f is scored against each potential translation e by P(f | e).

76 Language Model: sample bigram probabilities P(w2 | w1)
w1 = of: the 0.13, a 0.09, another 0.01, some 0.01
w1 = hong: kong 0.98, said 0.01, stated 0.01
Each potential translation e also receives a language model score P(e).

77 Combining the two: each potential translation e of the new French sentence f is scored by P(f | e) · P(e).

78 Classic Decoding Algorithm Given f, find the English string e that maximizes P(e) · P(f | e) NP-complete [Knight 99]; related to the Traveling Salesman Problem Brown et al 93: “In this paper, we focus on the translation modeling problem. We hope to deal with the [decoding] problem in a later paper.”

79 Beam Search Decoding [Brown et al US Patent #5,477,451] (Diagram: a search graph from start to end; each layer chooses the 1st, 2nd, 3rd, 4th... English word, until all source words are covered.)

80 Each partial translation hypothesis contains: Last English word chosen + source words covered by it Next-to-last English word chosen Entire coverage vector (so far) of the source sentence Language model and translation model scores (so far) Best predecessor link [Jelinek 69; Och, Ueffing, and Ney 01]
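The hypothesis expansion described on this slide can be illustrated with a toy word-based beam decoder. The lexicon, the bigram language model, and the French-like input are all invented for illustration; a real decoder also adds distortion costs, hypothesis recombination, and future-cost estimates.

```python
import heapq

# Toy lexicon: English candidates with P(f|e) per source word (illustrative)
lex = {"maison": [("house", 0.8), ("home", 0.2)],
       "bleue": [("blue", 0.9)]}
# Toy bigram LM P(w2|w1) with <s> start symbol (illustrative)
lm = {("<s>", "blue"): 0.1, ("blue", "house"): 0.3,
      ("<s>", "house"): 0.2, ("house", "blue"): 0.01,
      ("<s>", "home"): 0.1, ("home", "blue"): 0.01,
      ("blue", "home"): 0.1}

def beam_decode(src, beam_size=3):
    # Each hypothesis: (score, covered source indices, last English word, output)
    hyps = [(1.0, frozenset(), "<s>", [])]
    for _ in src:                       # one more source word covered per step
        expanded = []
        for score, covered, last, out in hyps:
            for i, f in enumerate(src):
                if i in covered:
                    continue
                for e, p_fe in lex[f]:
                    s = score * p_fe * lm.get((last, e), 1e-6)
                    expanded.append((s, covered | {i}, e, out + [e]))
        # Beam pruning: keep only the best few partial hypotheses
        hyps = heapq.nlargest(beam_size, expanded, key=lambda h: h[0])
    return max(hyps, key=lambda h: h[0])[3]

print(beam_decode(["bleue", "maison"]))  # -> ['blue', 'house']
```

Note the reordering is free here: the decoder may cover 'maison' before 'bleue', and the language model decides that 'blue house' beats 'house blue'.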

81 How Much Data Do We Need? (Plot: quality of the automatically trained MT system vs. amount of bilingual training data.)

82 Ready-to-Use Online Bilingual Data (Chart, in millions of words on the English side; data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn.)

83 Ready-to-Use Online Bilingual Data (the same chart with European parliament data added [Koehn 05])

84 Where is Persian?

85 Flaws of Word-Based MT Can’t translate multiple English words to one French word Can’t translate phrases “real estate”, “note that”, “interest in” Isn’t sensitive to syntax Adjectives/nouns should swap order Verb comes at the beginning in Arabic Doesn’t understand the meaning (?)

86 The MT Triangle (Diagram: SOURCE and TARGET connected at increasing levels of abstraction: words, syntax, logical form, interlingua.)


88 The MT Swimming Pool (the same levels of abstraction, redrawn)

89 Commercial Rule-Based Systems (positioned on the MT triangle)

90 Knight et al 95 - meaning-based translation, composition rules, plus a Language Model

91 Wu 97, Alshawi 98 - inducing syntactic structure as a by-product of aligning words in bilingual text

92 Yamada/Knight (01, 02) - tree/string model, used an existing target-language parser

93 Well, these all seem like good ideas. Which one had the most dramatic effect on MT quality? None of them!

94 Phrases How do you translate “real estate” into French? real estate, real number, dance number, dance card, memory card, memory stick, … (phrase-to-phrase correspondences at the base of the MT triangle)

95 THE PHRASE-BASED TRANSLATION MODEL [Koehn et al. 2003]

96 Phrase-Based Statistical MT Foreign input segmented into phrases – “phrase” just means “word sequence” Each phrase is probabilistically translated into English – P(to the conference | zur Konferenz) – P(into the meeting | zur Konferenz) Phrases are probabilistically re-ordered See [Koehn et al, 2003] for an overview. Example: Morgen | fliege | ich | nach Kanada | zur Konferenz → Tomorrow | I | will fly | to the conference | in Canada

97 How to Learn the Phrase Translation Table? One method: “alignment templates” [Och et al 99] Start with word alignment Collect all phrase pairs that are consistent with the word alignment

98 Word Alignment Induced Phrases Mary did not slap the green witch / Maria no dió una bofetada a la bruja verde

99-102 (Slides 99-102 repeat the sentence pair and grow the extracted phrase list step by step; the full list appears on slide 103.)

103 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) … (Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
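The extraction walked through on slides 98-103 follows a standard consistency rule: a phrase pair is kept only if no alignment link crosses its boundary. A minimal sketch; the alignment links below are read off the slides' example (the Spanish 'a' carries no link), and max_len caps phrase length at 4 words:

```python
# Word alignment for "Maria no dió una bofetada a la bruja verde" /
# "Mary did not slap the green witch", as (spanish_index, english_index) links
alignment = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (6, 4), (7, 6), (8, 5)}
src = "Maria no dió una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()

def extract_phrases(alignment, src, tgt, max_len=4):
    """Collect phrase pairs consistent with the alignment: every link
    touching the source span must land inside the target span, and
    vice versa."""
    phrases = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2]
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no link reaches the target span from outside [i1, i2]
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                phrases.add((" ".join(src[i1:i2 + 1]),
                             " ".join(tgt[j1:j2 + 1])))
    return phrases

pairs = extract_phrases(alignment, src, tgt)
print(("no", "did not") in pairs, ("bruja verde", "green witch") in pairs)
```

The unaligned 'a' is absorbed into spans like ('dió una bofetada a', slap) for free, because unlinked words never violate consistency; raising max_len yields the longer pairs on slide 103, up to the whole-sentence pair.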

104 متشکرم (Thank you)

