Machine Translation Course 2 Diana Trandabăț Academic year: 2015-2016.


Brief history
- War-time use of computers in code breaking.
- Warren Weaver's memorandum suggests applying decoding techniques to mechanically recognize fundamental aspects of NL: "When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode." (Warren Weaver, "Translation")
- Alan Turing, 1948: computers could be used for "(i) various games, e.g. chess, noughts and crosses, bridge, poker; (ii) the learning of languages; (iii) translation of languages; (iv) cryptography; (v) mathematics."
- Big investment by the US Government (mostly on Russian-English).
- Early promise of FAHQT – fully automatic high-quality translation.

Brief history (2)
- 1952: first MT conference (US).
- 1954: Georgetown University (Peter Toma) + IBM run the first MT experiment (Russian-English): 250 words, 6 rules, 49 translated sentences.
- 1955: MT abroad: England (U. Cambridge), France (GETA), USSR (Moscow Academy), Germany (U. Bonn).
- 1956: first international conference on MT, at Georgetown University.
- 1959: MT in Japan (MITI, "Yamato").
- Difficulties soon recognised:
  – no formal linguistics
  – crude computers
  – need for "real-world knowledge"
  – Bar-Hillel's "semantic barrier"

Brief history (3)
- 1960: Bar-Hillel's warning – very pessimistic that full automation of the translation process could ever be achieved.
- 1966: ALPAC (Automatic Language Processing Advisory Committee) report:
  – "insufficient demand for translation"
  – "MT is more expensive, slower and less accurate"
  – "no immediate or future prospect"
  – should invest instead in fundamental CL research
  – result: no public funding for MT research in the US for the next 25 years (though some privately funded research continued)

Brief history (4)
- Research confined to Europe and Canada.
- "2nd generation approach": linguistically and computationally more sophisticated.
- 1976: success of Météo (Canada).
- 1978: Europe starts discussions of its own MT project, Eurotra.
- First commercial systems in the early 1980s.
- FAHQT abandoned in favour of:
  – the "translator's workstation"
  – interactive systems
  – sublanguage / controlled input

Brief history (5)
- Lots of research in Europe and Japan in this "linguistic" paradigm.
- The PC replaces mainframe computers.
- More systems marketed; despite low quality, users claim increased productivity.
- General explosion in the translation market, thanks to international organizations and the globalisation of the marketplace ("buy in your language, sell in mine").
- Renewed funding in the US (work on Farsi, Pashto, Arabic, Korean; including speech translation).
- Emergence of a new research paradigm ("empirical" methods; allows rapid development of new target languages).
- Growth of the WWW, including translation tools.

Present situation
- Creditable commercial systems now available.
- Wide price range, many very cheap (£30).
- MT available free on the WWW; widely used for web-page translation.
- Low-quality output is acceptable for reading foreign-language web pages.
- But still only a small set of languages is covered.
- Speech translation widely researched.

What is MT?
Text 1 (source language) → Text 2 (target language), so that:
- meaning(text2) == meaning(text1), i.e. faithfulness
- text2 is grammatically correct, i.e. fluency
MT is hard!
- No MT system has completely solved the problem.

MT difficulties
- Different word order (SVO vs. VSO vs. SOV languages): "a black cat" (DT ADJ N) → "o pisică neagră" (DT N ADJ)
- Multiple translations:
  – "John knows Bill." → "John connaît Bill." → "John îl cunoaște pe Bill."
  – "John knows Bill will be late." → "John sait que Bill sera en retard." → "John știe că Bill va întârzia."
- Cross-lingual mapping of word senses (there are no perfect translation equivalents).
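The word-order difficulty above can be made concrete with a minimal sketch: one structural rule that reorders DT ADJ N into DT N ADJ before word-for-word substitution. The lexicon and the rule are toy illustrations (diacritics omitted), not part of any real system.

```python
# One toy structural-transfer rule: English puts the adjective before
# the noun (DT ADJ N), Romanian after it (DT N ADJ).
LEXICON = {"a": "o", "black": "neagra", "cat": "pisica"}

def translate_np(tagged):
    """tagged: list of (word, pos) pairs for a noun phrase."""
    pos_seq = [pos for _, pos in tagged]
    # Reorder DT ADJ N -> DT N ADJ before the dictionary look-up.
    if pos_seq == ["DT", "ADJ", "N"]:
        tagged = [tagged[0], tagged[2], tagged[1]]
    # Word-for-word substitution; unknown words are copied through.
    return " ".join(LEXICON.get(w, w) for w, _ in tagged)

print(translate_np([("a", "DT"), ("black", "ADJ"), ("cat", "N")]))
# -> o pisica neagra
```

Without the reordering rule, pure word-for-word substitution would produce the ungrammatical "o neagra pisica", which is exactly the failure mode the slide describes.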

How do humans translate?
- Learning a foreign language (training stage):
  – memorize word translations (→ translation lexicon)
  – learn some patterns (→ templates, transfer rules)
  – exercise: passive activities (read, listen) and active activities (write, speak) (→ reinforced learning, reranking)
- Translation (decoding stage):
  – understand the sentence (→ parsing, semantic analysis?)
  – clarify or ask for help (optional) (→ interactive MT?)
  – translate the sentence (→ word-level? phrase-level? generate from meaning?)

What kinds of resources are available to MT?
- Translation lexicon: bilingual dictionaries
- Templates, transfer rules: grammar books
- Parallel data, comparable data
- Thesaurus, WordNet, FrameNet, …
- NLP tools: tokenizer, morphological analyzer, parser, …
⇒ More resources for major languages, fewer for "minor" languages.

Major approaches
- Transfer-based
- Interlingua
- Example-based (EBMT)
- Statistical MT (SMT)
- Hybrid approaches

Direct translation
- No complete intermediary sentence structure.
- Translation proceeds in a number of steps, each step dedicated to a specific task.
- The most important component is the bilingual dictionary.
- Typically general language.
- Problems with:
  – ambiguity
  – inflection
  – word order and other structural shifts

Simplistic approach
- sentence splitting
- tokenisation
- handling of capital letters
- dictionary look-up and lexical substitution, incl. some heuristics for handling ambiguities
- copying of unknown words, digits, punctuation marks, etc.
- formal editing
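The steps of the simplistic approach can be sketched end to end in a few lines. The tiny English-Romanian dictionary is illustrative toy data, and real systems would of course need the disambiguation heuristics the slide mentions.

```python
import re

# Toy direct-translation pipeline: sentence splitting, tokenisation,
# dictionary substitution, copying of unknowns, and formal editing.
DICTIONARY = {"john": "John", "knows": "cunoaste", "bill": "Bill"}

def translate(text):
    out_sentences = []
    # Sentence splitting on terminal punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Tokenisation: words vs. punctuation marks.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        # Dictionary look-up; unknown words, digits and punctuation
        # are copied through unchanged.
        out = [DICTIONARY.get(tok.lower(), tok) for tok in tokens]
        # Formal editing: fix spacing before punctuation, restore
        # the sentence-initial capital.
        joined = re.sub(r"\s+([.,!?])", r"\1", " ".join(out))
        out_sentences.append(joined[:1].upper() + joined[1:])
    return " ".join(out_sentences)

print(translate("John knows Bill."))   # -> John cunoaste Bill.
```

Even this toy version shows why the approach breaks down: there is no sentence structure to consult, so ambiguity, inflection, and word order cannot be handled properly.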

Advanced classical approach (Tucker 1987)
- source-text dictionary look-up and morphological analysis
- identification of homographs
- identification of compound nouns
- identification of noun and verb phrases
- processing of idioms

Advanced approach (cont.)
- processing of prepositions
- subject-predicate identification
- identification of syntactic ambiguity
- synthesis and morphological processing of the target text
- rearrangement of words and phrases in the target text

Feasibility of the direct translation strategy Is it possible to carry out the direct translation steps as suggested by Tucker with sufficient precision without relying on a complete sentence structure?

Current trends in direct translation
- Re-use of translations:
  – translation memories of sentences and sub-sentence units such as words, phrases and larger units
  – lexicalistic translation
  – example-based translation
  – statistical translation
- Will re-use of translations overcome the problems of the direct translation approach? If so, how can they be handled?

Systran ("System Translation")
- Developed in the US by Peter Toma.
- First version: 1969 (Russian-English).
- The EC bought the rights to Systran in 1976.
- Currently 18 language pairs.

Systran linguistic resources
- Dictionaries:
  – POS definitions
  – inflection tables
  – decomposition tables
  – segmentation dictionaries
- Disambiguation rules
- Analysis rules

Transfer-based MT
Analysis, transfer, generation:
1. Parse the source sentence.
2. Transform the parse tree with transfer rules.
3. Translate the source words.
4. Read the target sentence off the tree.
Resources required:
- a source-language parser
- a translation lexicon
- a set of transfer rules
An example: "Mary bought a book yesterday."
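The four steps above can be sketched on the slide's example sentence. Everything here is hand-written toy data: the parse tree stands in for step 1, and the single transfer rule (fronting the adverb, as a Romanian translator might front "ieri") is an invented illustration, not a rule from any real system.

```python
# Toy transfer-based pipeline for "Mary bought a book yesterday".
LEXICON = {"Mary": "Maria", "bought": "a cumparat",
           "a": "o", "book": "carte", "yesterday": "ieri"}

# Step 1 (analysis): a pre-parsed source tree as (label, children) pairs.
source_tree = ("S", [("NP", ["Mary"]),
                     ("VP", ["bought", ("NP", ["a", "book"]), "yesterday"])])

def transfer(tree):
    """Step 2: toy structural rule - move a VP-final adverb to the front."""
    label, children = tree
    _, vp_children = children[1]
    if vp_children and vp_children[-1] == "yesterday":
        adv = vp_children.pop()
        children.insert(0, ("ADVP", [adv]))
    return (label, children)

def generate(tree):
    """Steps 3+4: translate the leaves and read off the target sentence."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):          # internal node: recurse
            words.extend(generate(child).split())
        else:                                 # leaf: lexical translation
            words.append(LEXICON.get(child, child))
    return " ".join(words)

print(generate(transfer(source_tree)))
# -> ieri Maria a cumparat o carte
```

The point of the sketch is the division of labour: structural differences are handled on the tree (step 2), lexical differences in the lexicon (step 3), so neither component needs to know about the other.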

The transfer metaphor
analysis --> transfer --> generation; each arrow can be implemented with rule-based methods or probabilistically.
Levels of transfer (English-French example):
- Interlingua: attraction(NamedJohn, NamedMary, high) — knowledge transfer
- Semantics: loves(John, Mary) ↔ aime(Jean, Marie) — semantic transfer
- Syntax: S(NP(John) VP(loves, NP(Mary))) ↔ S(NP(Jean) VP(aime, NP(Marie))) — syntactic transfer
- Words: "John loves Mary" ↔ "Jean aime Marie" — word transfer (memory-based translation)

Transfer-based MT (cont.)
- Parsing: linguistically motivated grammar or formal grammar?
- Transfer:
  – context-free rules? a path on a dependency tree?
  – apply at most one rule at each level?
  – how are the rules created?
- Translating words: word-to-word translation?
- Generation: using an LM or other additional knowledge?
- How can the needed resources be created automatically?

Syntactic transfer
Solves some problems:
- word order
- some cases of lexical choice, e.g.:
  – dictionary of analysis: know: verb; transitive; subj: human; obj: NP || sentence
  – dictionary of transfer: know + obj[NP] → cunoaște; know + obj[sentence] → ști
But syntax is not enough:
- there is no one-to-one correspondence between syntactic structures in different languages (syntactic mismatch)

Interlingua
- For n languages, direct or transfer-based MT needs n(n-1) translation systems; interlingua uses a language-independent representation instead.
- Conceptually, interlingua is elegant: we only need n analyzers and n generators (e.g., for 10 languages, 20 components instead of 90 systems).
- Resources needed:
  – a language-independent representation
  – sophisticated analyzers
  – sophisticated generators

Interlingua (cont.)
Questions:
- Does a language-independent meaning representation really exist? If so, what does it look like?
- It requires deep analysis: how do we get such an analyzer (e.g., semantic analysis)?
- It requires non-trivial generation: how is that done?
- It forces disambiguation at various levels: lexical, syntactic, semantic, discourse.
- It cannot take advantage of similarities between a particular language pair.

Example-based MT
- Basic idea: translate a sentence by using the closest match in parallel data.
- First proposed by Nagao (1981).
- Example:
  – training data:
    w1 w2 w3 w4 → w1' w2' w3' w4'
    w5 w6 w7 → w5' w6' w7'
    w8 w9 → w8' w9'
  – test sentence: w1 w2 w6 w7 w9 → w1' w2' w6' w7' w9'
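The w1..w9 example above can be sketched directly: find the training pair whose source side overlaps most with the input, then assemble the output from a word alignment. Here the alignment is trivially positional (wN → wN'); real EBMT systems would induce it with a dictionary and heuristics.

```python
# Toy EBMT on the slide's schematic data.
corpus = [
    (["w1", "w2", "w3", "w4"], ["w1'", "w2'", "w3'", "w4'"]),
    (["w5", "w6", "w7"],       ["w5'", "w6'", "w7'"]),
    (["w8", "w9"],             ["w8'", "w9'"]),
]

# Word alignment, trivially positional here: wN -> wN'.
alignment = {src: tgt
             for src_sent, tgt_sent in corpus
             for src, tgt in zip(src_sent, tgt_sent)}

def translate(sentence):
    # Closest match: the training pair with the largest word overlap.
    closest = max(corpus, key=lambda pair: len(set(pair[0]) & set(sentence)))
    # Assemble the output from aligned fragments (word-by-word here).
    output = [alignment.get(w, w) for w in sentence]
    return output, closest[0]

output, closest = translate(["w1", "w2", "w6", "w7", "w9"])
print(output)   # -> ["w1'", "w2'", "w6'", "w7'", "w9'"]
```

Note that the test sentence never occurs in the training data; EBMT's bet is that recombining fragments of seen translations still yields a correct new one.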

EBMT (cont.)
Types of EBMT:
- lexical (shallow)
- morphological / POS analysis
- parse-tree based (deep)
Types of data required by EBMT systems:
- parallel text
- bilingual dictionary
- thesaurus for computing semantic similarity
- syntactic parser, dependency parser, etc.

EBMT (cont.)
- Word alignment: using a dictionary and heuristics → exact match.
- Generalization:
  – clusters: dates, numbers, colors, shapes, etc.
  – clusters can be built by hand or learned automatically.
- Example:
  – exact match: "12 players met in Paris last Tuesday" → "12 jucători s-au întâlnit joia trecută în Paris"
  – templates: "$num players met in $city $time" → "$num jucători s-au întâlnit $time în $city"
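The template generalisation can be sketched with a regular expression standing in for the $num/$city/$time clusters. The source pattern, target pattern, and the one-entry time lexicon are illustrative toy data (Romanian diacritics omitted).

```python
import re

# The slide's template "$num players met in $city $time", with named
# groups playing the role of the clusters.
SRC = re.compile(r"(?P<num>\d+) players met in (?P<city>\w+) (?P<time>last \w+)")
TGT = "{num} jucatori s-au intalnit {time} in {city}"

# Cluster members still need their own translations.
TIME_LEX = {"last Tuesday": "joia trecuta"}

def translate(sentence):
    m = SRC.match(sentence)
    if not m:
        return None                    # no template applies
    slots = m.groupdict()
    slots["time"] = TIME_LEX.get(slots["time"], slots["time"])
    return TGT.format(**slots)         # slots land in the target order

print(translate("12 players met in Paris last Tuesday"))
# -> 12 jucatori s-au intalnit joia trecuta in Paris
```

The payoff of the template over the exact match is reuse: the same pattern also translates "7 players met in Madrid last Friday" without ever having seen that sentence.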

Statistical MT
- Basic idea: learn all the parameters from parallel data.
- Major types:
  – word-based
  – phrase-based
- Strengths:
  – easy to build; requires no human knowledge
  – good performance when a large amount of training data is available
- Weaknesses:
  – how to express linguistic generalizations?

Statistical MT: being faithful and fluent
- It is often impossible to have a true translation, one that is both:
  – faithful to the source language, and
  – fluent in the target language.
- Example: Japanese "fukaku hansei shite orimasu"
  – fluent translation: "we apologize"
  – faithful translation: "we are deeply reflecting (on our past behaviour, and what we did wrong, and how to avoid the problem next time)"
- So we need to compromise between faithfulness and fluency.
- Statistical MT maximises a function that weighs both:
  best translation T* = argmax_T fluency(T) × faithfulness(T, S)

The noisy channel model
- Statistical MT is based on the noisy channel model, developed by Shannon to model communication (e.g., over a phone line).
- Noisy channel model in SMT (e.g., En to Ro):
  – assume that the true text is in English;
  – when it was transmitted over the noisy channel, it somehow got corrupted and came out in Romanian;
  – i.e., the noisy channel has deformed/corrupted the original English input into Romanian.
- So really… Romanian is a form of noisy English.
- The task is to recover the original English sentence (to decode the Romanian into English).
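Written out, the decoding task above is just Bayes' rule: recover the most probable English sentence E given the observed Romanian R. The denominator P(R) is constant over candidate E and drops out, leaving the two factors that correspond to faithfulness and fluency from the previous slide:

```latex
E^{*} = \operatorname*{argmax}_{E} P(E \mid R)
      = \operatorname*{argmax}_{E} \frac{P(R \mid E)\,P(E)}{P(R)}
      = \operatorname*{argmax}_{E}
        \underbrace{P(R \mid E)}_{\substack{\text{translation model}\\ \text{(faithfulness)}}}\;
        \underbrace{P(E)}_{\substack{\text{language model}\\ \text{(fluency)}}}
```

The translation model is learned from parallel data, while the language model needs only monolingual target-language text, which is far more plentiful.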

Comparison of resource requirements

Resource           | Transfer-based | Interlingua              | EBMT      | SMT
dictionary         | +              | +                        | +         |
transfer rules     | +              |                          |           |
parser             | +              | +                        | + (?)     |
semantic analyzer  |                | +                        |           |
parallel data      |                |                          | +         | +
others             |                | universal representation | thesaurus |

Hybrid MT
- Basic idea: combine the strengths of the different approaches:
  – syntax-based: generalization at the syntactic level
  – interlingua: conceptually elegant
  – EBMT: memorizing translations of n-grams; generalization at various levels
  – SMT: fully automatic; uses LMs; optimizes objective functions
- Types of hybrid MT:
  – borrowing concepts/methods:
    SMT from EBMT: phrase-based SMT; alignment templates
    EBMT from SMT: automatically learned translation lexicons
    transfer-based MT from SMT: automatically learned translation lexicons and transfer rules; using language models, …
  – using two MT systems in a pipeline: e.g., transfer-based MT as a preprocessor for SMT
  – using multiple MT systems in parallel, followed by a re-ranker.