EBMT1 Example Based Machine Translation as used in the Pangloss system at Carnegie Mellon University Dave Inman.


EBMT2 Outline
EBMT in outline.
What data do we need?
How do we create a lexicon?
Indexing the corpus.
Finding chunks to translate.
Matching a chunk against the target.
Quality of translation.
Speed of translation.
Good and bad points.
Conclusions.

EBMT3 EBMT in outline - Corpus
Corpus:
S1: The cat eats a fish. / Le chat mange un poisson.
S2: A dog eats a cat. / Un chien mange un chat.
…
S99,999,999: …
Index:
the: S1
cat: S1, S2
eats: S1, S2
…
dog: S2
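The indexing step can be sketched in a few lines of Python; the sentences and IDs are the toy examples from the slide, not real Pangloss data.

```python
from collections import defaultdict

# Toy parallel corpus: source sentence and its translation, keyed by sentence ID.
corpus = {
    "S1": ("The cat eats a fish.", "Le chat mange un poisson."),
    "S2": ("A dog eats a cat.", "Un chien mange un chat."),
}

# Inverted index: each source-language word maps to the IDs of the
# sentences that contain it.
index = defaultdict(set)
for sid, (source, _target) in corpus.items():
    for word in source.lower().rstrip(".").split():
        index[word].add(sid)

print(sorted(index["cat"]))  # ['S1', 'S2']
```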

EBMT4 EBMT in outline – find chunks
A source language sentence is input: The cat eats a dog.
Chunks of this sentence are matched against the corpus:
The cat: S1
The cat eats: S1
The cat eats a: S1
a dog: S2

EBMT5 How does EBMT work in outline - Corpus
1. The target language sentences are retrieved for each chunk.
The cat eats: S1
Corpus S1: The cat eats a fish. / Le chat mange un poisson.
2. The chunks are aligned with the target sentences (hard!).
The cat eats → Le chat mange

EBMT6 How does EBMT work in outline - Corpus
Chunks are scored to find the best match…
The cat eats → Le chat mange (score 78%)
The cat eats → Le chat dorme (score 43%)
…
a dog → un chien (score 67%)
a dog → le chien (score 56%)
a dog → un arbre (score 22%)
The best translated chunks are put together to make the final translation:
The cat eats → Le chat mange
a dog → un chien

EBMT7 What data do we need?
1. A large corpus of parallel sentences… if possible in the same domain as the translations.
2. A bilingual dictionary… but we can induce this from the corpus.
3. A target language root/synonym list… so we can see the similarity between words and their inflected forms (e.g. verbs).
4. Classes of words easily translated, such as numbers, towns, weekdays.

EBMT8 How to create a lexicon.
1. Take each sentence pair in the corpus.
2. For each word in the source sentence, add each word in the target sentence and increment its frequency count.
3. Repeat for as many sentences as possible.
4. Use a threshold to get possible alternative translations.

EBMT9 How to create a lexicon… example
The cat eats a fish. / Le chat mange un poisson.
the: le,1 chat,1 mange,1 un,1 poisson,1
cat: le,1 chat,1 mange,1 un,1 poisson,1
eats: le,1 chat,1 mange,1 un,1 poisson,1
a: le,1 chat,1 mange,1 un,1 poisson,1
fish: le,1 chat,1 mange,1 un,1 poisson,1

EBMT10 Create a lexicon… after many sentences
the:
above threshold: le,956 la,925 un,…
below threshold: chat,47 mange,33 poisson,… arbre,18

EBMT11 Create a lexicon… after many sentences
cat:
above threshold: chat,…
below threshold: le,604 la,485 un,305 mange,33 poisson,… arbre,47
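The counting procedure behind these tables can be sketched as follows. With only two toy sentence pairs the counts are far sparser than on the slides, so the threshold of 2 is an assumption chosen for the toy data.

```python
from collections import defaultdict

# Toy sentence pairs; a real corpus would have millions.
pairs = [
    ("the cat eats a fish", "le chat mange un poisson"),
    ("a dog eats a cat", "un chien mange un chat"),
]

# Step 2 of the procedure: for each source word, count every target word
# that co-occurs with it in an aligned sentence pair.
cooc = defaultdict(lambda: defaultdict(int))
for src, tgt in pairs:
    for s_word in src.split():
        for t_word in tgt.split():
            cooc[s_word][t_word] += 1

# Step 4: keep target words whose count clears a threshold (2 here).
def translations(word, threshold=2):
    return sorted(t for t, n in cooc[word].items() if n >= threshold)

print(translations("cat"))  # ['chat', 'mange', 'un']
```

Even on two sentences the correct translation "chat" survives the threshold; on a large corpus the true translation pulls far ahead of the co-occurrence noise, as the tables above show.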

EBMT12 Indexing the corpus. For speed, the corpus is indexed on the source language sentences. Each word in each source language sentence is stored with information about the target sentence. Words can be added to the corpus and the index easily updated. Tokens are used for common classes of words (e.g. numbers), which makes matching more effective.
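The token substitution for word classes can be sketched as below: numbers are replaced by a class token before indexing, so sentences that differ only in a number match the same corpus entries. The token name `<NUM>` is an assumption for illustration.

```python
import re

# Replace any purely numeric word with a class token before indexing.
def tokenize(sentence):
    return [re.sub(r"^\d+$", "<NUM>", w) for w in sentence.lower().split()]

print(tokenize("Buy 7 fish"))  # ['buy', '<NUM>', 'fish']
```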

EBMT13 Finding chunks to translate. Look up each word of the source sentence in the index. Look for chunks in the source sentence (at least two adjacent words) which match the corpus. Select the last few matches against the corpus (translation memory); Pangloss uses the last 5 matches for any chunk.
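Chunk finding can be sketched by enumerating adjacent spans of at least two words and keeping those whose words all index the same corpus sentence. The index entries are the toy examples from earlier slides; as a simplification, this only checks that the words co-occur in a sentence, not that they are adjacent there, which the real matcher must also verify.

```python
# Toy inverted index from the earlier slides.
index = {
    "the": {"S1"}, "cat": {"S1", "S2"}, "eats": {"S1", "S2"},
    "a": {"S1", "S2"}, "dog": {"S2"},
}

def find_chunks(words, index, min_len=2):
    chunks = []
    for i in range(len(words)):
        for j in range(i + min_len, len(words) + 1):
            # Sentence IDs containing every word of the span.
            shared = set.intersection(*(index.get(w, set()) for w in words[i:j]))
            if shared:
                chunks.append((" ".join(words[i:j]), shared))
    return chunks

print(find_chunks("the cat eats a dog".split(), index))
```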

EBMT14 Matching a chunk against the target. For each source chunk found previously, retrieve the target sentences from the corpus (using the index). Try to find the translation for the source chunk from these sentences. This is the hard bit! Look for the minimum and maximum segments in the target sentences which could correspond with the source chunk. Score each of these segments.

EBMT15 Scoring a segment…
Unmatched words: higher priority is given to sentences containing all the words in an input chunk.
Noise: higher priority is given to corpus sentences which have fewer extra words.
Order: higher priority is given to sentences containing the input words in an order closer to their order in the input chunk.
Morphology: higher priority is given to sentences in which words match exactly rather than as morphological variants.
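A toy scorer combining the first three priorities might look like this (morphology would need the root/synonym list, so exact matching stands in for it here). The weights are illustrative assumptions, not the values Pangloss actually uses.

```python
def score_segment(chunk_words, sentence_words):
    matched = [w for w in chunk_words if w in sentence_words]
    unmatched = len(chunk_words) - len(matched)   # chunk words missing from the sentence
    noise = len(sentence_words) - len(matched)    # extra words in the sentence
    positions = [sentence_words.index(w) for w in matched]
    in_order = positions == sorted(positions)     # chunk order preserved?
    score = 100 - 20 * unmatched - 5 * noise + (10 if in_order else 0)
    return max(0, min(score, 100))

# A sentence containing the whole chunk in order outscores a noisier one.
print(score_segment("the cat eats".split(), "the cat eats a fish".split()))
```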

EBMT16 Whole sentence match… If we are lucky, the whole sentence will be found in the corpus! In that case the target sentence is used directly, with no alignment needed. This is especially useful when translation memory is in use (sentences recently translated are added to the corpus).
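The whole-sentence shortcut amounts to an exact lookup before any chunk matching; a dict serves as a minimal translation memory in this sketch, with newly translated sentences simply added back to it.

```python
# Minimal translation memory: exact source sentences mapped to translations.
memory = {"The cat eats a fish.": "Le chat mange un poisson."}

def translate(sentence):
    if sentence in memory:  # exact match: reuse the translation, no alignment needed
        return memory[sentence]
    return None             # otherwise fall back to chunk matching

print(translate("The cat eats a fish."))  # Le chat mange un poisson.
```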

EBMT17 Quality of translation. Pangloss was tested on source sentences from a different domain to the examples in the corpus. Pangloss "covered" about 70% of the input sentences, meaning a match was found against the corpus… but not necessarily a good match. Others report that around 60% of the translation can be understood by a native speaker; Systran manages about 70%.

EBMT18 Speed of translation. Translations are much faster than with Systran: simple sentences are translated in seconds. The corpus can be added to (translation memory) at about 6 MB per minute (Sun SPARCstation). A 270 MB corpus takes 45 minutes to index.

EBMT19 Good points.
Fast.
Easy to add a new language pair.
No need to analyse languages (much).
Can induce a dictionary from the corpus.
Allows easy implementation of translation memory.
Graceful degradation as the size of the corpus is reduced.

EBMT20 Bad points.
Quality is second best at present.
Depends on a large corpus of parallel, well translated sentences.
30% of the source has no coverage (translation).
Matching of words is brittle: we can see a match that Pangloss cannot.
The domain of the corpus should match the domain to be translated, so that chunks match.

EBMT21 Conclusions.
An alternative to Systran: faster, but lower quality.
Quick to develop for a new language pair, if a corpus exists!
Needs no linguistics.
Might improve as bigger corpora become available?