Machine translation Context-based approach Lucia Otoyo.

Machine translation
- The computerized task of translating from one natural language to another
- Human vs. machine translation
- Difficulties of MT

Brief history of MT
- 17th century: Descartes and Leibniz
- 1930: bilingual dictionary + rules
- After the war: Warren Weaver, translation as decoding a message
- First public demonstration of MT (IBM), which spawned research
- 1966: ALPAC report, MT judged less accurate and more costly than human translation
- 1980: increasing demand; rule-based approach born
- 1990: parallel corpora approach

MT approaches
- Rule based
- Parallel corpora based
- Context based
- Conclusion

Rule-based approach
- Dominant in the 1980s
- Resources: a set of rules and a bilingual dictionary
- Steps: syntax (grammar), semantics (meaning), pragmatics (differences between languages)
- Disadvantages:
  - language experts are needed to write the rules
  - each new language pair needs new rules
  - it is not possible to include all the rules
  - rules have exceptions

Parallel corpora based
- Example based (word frequency and combination)
- Statistical (phrase extraction and combination)
- Resources: parallel corpora (pre-translated), decoder, alignment software
- Steps: disassemble the text into phrases, search the corpora and match phrases, substitute, align phrases to form the text
- Advantages vs. disadvantages:
  - easy to apply to a new language
  - more readable, as it uses human pre-translated text
  - general translation vs. specific domain
  - lexical ambiguity

Context Based MT
[Architecture diagram. Labels: source language, target language, n-gram segmenter, n-gram builder, flooder, edge locker, synonym generator, n-gram connector, overlap-based decoder, cache database, cross-language n-gram database, n-gram candidates, substitution request, stored n-gram pairs, approved n-gram pairs]
- Resources: bilingual dictionary, target corpora, source corpora, gazetteers

CBMT edge illustration
‘This context based machine translation approach looks very interesting’.
1. ‘This context based machine’
2. ‘context based machine translation’
3. ‘based machine translation approach’
4. ‘machine translation approach looks’
5. ‘translation approach looks very’
6. ‘approach looks very interesting’
The first word (‘This’) and the last word (‘interesting’) each appear in only one n-gram, so overlap confirms them less often than the interior words.

CBMT n-grams
- Break the source text down into n-grams (n = 4 to 8)
- Source: ‘This context based machine translation approach looks very interesting’.
- If n = 4, the n-grams are as follows:
1. ‘This context based machine’
2. ‘context based machine translation’
3. ‘based machine translation approach’
4. ‘machine translation approach looks’
5. ‘translation approach looks very’
6. ‘approach looks very interesting’
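The segmentation above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenisation and a one-word shift between n-grams; the function name `sliding_ngrams` is made up for the example and is not taken from the slides.

```python
def sliding_ngrams(sentence, n=4):
    """Return every run of n consecutive words, shifting one word at a time."""
    words = sentence.split()  # assumption: whitespace tokenisation is enough
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "This context based machine translation approach looks very interesting"
for number, gram in enumerate(sliding_ngrams(sentence, n=4), start=1):
    print(number, gram)
```

A nine-word sentence with n = 4 gives six n-grams, matching the list on the slide.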


CBMT Flooding
- Search the monolingual corpora with the translated n-grams
- This produces a large number of n-grams, with different translations for each word
- Words can be in any order, to account for differences between languages
- For each n-gram, keep the high-density matches
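A rough sketch of flooding, under stated assumptions: the slides do not define match density, so here it is taken to be the fraction of source words that have at least one candidate translation inside a corpus window, in any order. The function name `flood`, the candidate sets and the toy corpus are all hypothetical.

```python
def flood(corpus_tokens, candidates, window):
    """Score every window of the target corpus by match density.

    candidates holds one set of possible translations per source word.
    Density (an assumed definition) is the fraction of source words with
    at least one candidate somewhere in the window, regardless of order.
    """
    hits = []
    for start in range(len(corpus_tokens) - window + 1):
        span = corpus_tokens[start:start + window]
        present = set(span)
        matched = sum(1 for options in candidates if options & present)
        density = matched / len(candidates)
        if density > 0:
            hits.append((density, " ".join(span)))
    return sorted(hits, reverse=True)  # densest windows first


# Hypothetical dictionary output: one set of translations per source word.
candidates = [{"this", "that"}, {"approach", "focus"},
              {"looks", "seems", "appears"}, {"interesting"}]
corpus = "this approach seems interesting to many people and that focus looks odd".split()

for density, span in flood(corpus, candidates, window=4)[:3]:
    print(density, span)
```

Because the window is treated as a set, word order inside it does not affect the score, which mirrors the "words can be in any order" point above.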

CBMT Target language lattice overlap maximization
- Align all the n-grams with each other
- Choose the ones with the highest number of left-side and right-side overlaps
- Eliminate non-overlapping or partially overlapping n-grams
1. n-gram ‘This approach for computer’
2. n-gram ‘This context based machine’
3. n-gram ‘based machine translation approach’
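The overlap filter can be sketched as follows. This is a simplification, not the decoder from the slides: it only measures the longest suffix-to-prefix word overlap between candidate n-grams and keeps any candidate that overlaps a neighbour on at least one side. The names `overlap` and `keep_overlapping` are illustrative.

```python
def overlap(left, right):
    """Length of the longest suffix of left that is also a prefix of right."""
    a, b = left.split(), right.split()
    for size in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-size:] == b[:size]:
            return size
    return 0


def keep_overlapping(ngrams):
    """Keep candidates that overlap another candidate on the left or right."""
    kept = []
    for gram in ngrams:
        others = [g for g in ngrams if g != gram]
        left = max((overlap(o, gram) for o in others), default=0)
        right = max((overlap(gram, o) for o in others), default=0)
        if left + right > 0:  # assumption: one-sided overlap is enough to survive
            kept.append((gram, left, right))
    return kept


candidates = ["This approach for computer",
              "This context based machine",
              "based machine translation approach"]
for gram, left, right in keep_overlapping(candidates):
    print(gram, "| left overlap:", left, "| right overlap:", right)
```

With the three n-grams from the slide, ‘This approach for computer’ overlaps nothing and is dropped, while the other two share ‘based machine’.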

CBMT Cross-language database
- Stores cross-language n-gram correspondences for later use, to speed up the translation process
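The caching idea can be illustrated with a small in-memory stand-in. The slides do not describe the database itself, so the class `NgramCache`, its `lookup` method and the placeholder translator below are assumptions made for the sketch.

```python
class NgramCache:
    """In-memory stand-in for a cross-language n-gram database (hypothetical)."""

    def __init__(self, translate):
        self._translate = translate  # slow path, e.g. flooding plus overlap decoding
        self._pairs = {}             # source n-gram -> stored target n-gram
        self.misses = 0

    def lookup(self, source_ngram):
        """Return the stored pair, running the slow path only on a miss."""
        if source_ngram not in self._pairs:
            self.misses += 1
            self._pairs[source_ngram] = self._translate(source_ngram)
        return self._pairs[source_ngram]


# Placeholder "translator": upper-casing stands in for the real pipeline.
cache = NgramCache(lambda gram: gram.upper())
cache.lookup("context based machine translation")
cache.lookup("context based machine translation")  # served from the cache
print(cache.misses)
```

The second lookup never reaches the slow path, which is the speed-up the slide refers to.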

CBMT target language
- Find the globally longest target-language overlap with the highest match density
1. ‘This context based machine’
2. ‘context based machine translation’
3. ‘based machine translation approach’
4. ‘machine translation approach looks’
5. ‘translation approach looks very’
6. ‘approach looks very interesting’
Result: ‘This context based machine translation approach looks very interesting’.
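As a simplified sketch of the final step, the function below joins an already ordered chain of overlapping n-grams into one sentence. It does not search for the globally best chain or use match density; it only shows how overlaps let each n-gram contribute its new words. The name `merge_chain` is hypothetical.

```python
def merge_chain(ngrams):
    """Join n-grams in order, adding only the words not already covered."""
    words = ngrams[0].split()
    for gram in ngrams[1:]:
        nxt = gram.split()
        for size in range(min(len(words), len(nxt)), 0, -1):
            if words[-size:] == nxt[:size]:  # longest suffix/prefix overlap
                words.extend(nxt[size:])
                break
        else:
            raise ValueError("n-gram does not overlap the chain: " + gram)
    return " ".join(words)


chain = ["This context based machine",
         "context based machine translation",
         "based machine translation approach",
         "machine translation approach looks",
         "translation approach looks very",
         "approach looks very interesting"]
print(merge_chain(chain))
```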

CBMT – synonymy
Word and phrasal synonymy
- Increases accuracy when no overlaps, or only partial overlaps, are found
- Dynamic synonyms, no predefined coded patterns
Stages:
1. Search for the word in the corpus (context-related phrases)
   - ‘This establishment was founded in the year’
   - ‘The number of people working in the establishment is far greater than’
   - ‘The establishment is the first hotel’, etc.

CBMT – synonymy cont.
2. Search the corpus with only the surrounding phrases, leaving the word itself blank
   - ‘This ________ was founded in the year’
   - ‘The number of people working in the _______ is far greater than’
   - ‘The ________ is the first hotel’, etc.
3. This may return:
   - ‘This company was founded in the year’
   - ‘The number of people working in the business is far greater than’
   - ‘The institution is the first hotel’, etc.
4. Rank the synonyms according to various criteria and flood
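The stages above can be sketched as a slot-filling search. This is an illustration under assumptions: contexts are the two words on either side of the target word, and ranking is simply the number of shared contexts, since the slides only say "various criteria". The function name `slot_synonyms` and the toy corpus are made up.

```python
from collections import Counter


def slot_synonyms(sentences, word, width=2):
    """Rank words that fill the same left/right contexts as `word`."""
    tokenised = [s.lower().split() for s in sentences]

    def context(tokens, i):
        return (tuple(tokens[max(0, i - width):i]),
                tuple(tokens[i + 1:i + 1 + width]))

    # Stage 1: collect the contexts in which the word occurs.
    contexts = {context(tokens, i)
                for tokens in tokenised
                for i, token in enumerate(tokens) if token == word}

    # Stages 2-3: find other words that appear in the same contexts.
    counts = Counter()
    for tokens in tokenised:
        for i, token in enumerate(tokens):
            if token != word and context(tokens, i) in contexts:
                counts[token] += 1

    # Stage 4 (assumed criterion): rank by number of shared contexts.
    return counts.most_common()


corpus = ["This establishment was founded in the year",
          "The establishment is the first hotel",
          "This company was founded in the year",
          "The institution is the first hotel",
          "The company is the first hotel",
          "This company was sold last week"]
print(slot_synonyms(corpus, "establishment"))
```

No synonym list is coded in advance: the candidates come entirely from the corpus, which is what "dynamic synonyms" suggests.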

CBMT Edge locking
- The first and last words are confirmed by overlap only once or a few times
- Search for other source sentences in which the first and last words of the original n-gram also occur in the middle of a newly found n-gram
- This confirms their suitability within a particular context
- Also used for words around interior punctuation

CBMT Target corpora
- Monolingual
- Very large (50 GB to 1 TB)
- The bigger the corpus, the more accurate the translation
- Easy to obtain from the web

CBMT Bilingual dictionary
- Very large
- The bigger the dictionary, the more accurate the translation
- Usually widely available for most languages
- Used to translate the n-grams, which yields a large number of n-grams with different translations for each word
- Words can be in any order, to account for differences between languages
- For each n-gram, keep the high-density matches

Conclusion
Can we?
- Create a universal foundation for all languages
- Eliminate the need for human translators
- Solve the biggest obstacle in MT: ambiguity
It does not seem so in the foreseeable future.