Review: Translating without in-domain corpus: Machine translation post-editing with online learning techniques

Antonio L. Lagarda, Daniel Ortiz-Martínez, Vicent Alabau, Francisco Casacuberta
Lagarda, A.L., et al., Translating without in-domain corpus: Machine translation post-editing with online learning techniques. Comput. Speech Lang. (2014).

Outline
– Overview/Background
– Methods
– Results
– Discussion/Conclusion
– Application

OVERVIEW / BACKGROUND

Overview / Background
– Need for high-quality translation across a variety of corpora
– Manual translation is expensive
– Machine translation is not perfect
– Specialized corpora pose problems because reliable machine translation systems lack in-domain training data

Overview / Background
Popular solutions:
– W/O Resources: manual translation
– WEB: web-based translation application, with post-editing
– RBMT: Rule-Based Machine Translation, with post-editing
– SMT: Statistical Machine Translation, with post-editing

METHODS & STUDY DESIGN

Scenario
Study Scenario
– Comparing the popular solutions with an automatic post-editing (APE) module
– Domain adaptation as an APE problem
– Typical APE translation system employed
  – Statistical APE processes system output
– Results calculated in BLEU score

Methods
Typical setup of MT with APE system (Fig. 1)

Methods
APE is set up in an online learning framework
– Base translation system employed
– User validates/corrects sampled translations
– User-validated output is included in retraining the models of the translation system
– Updated models apply to successive sentences
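The loop on this slide (translate, post-edit, let the user validate, retrain, carry on with the next sentence) can be sketched in a few lines of Python. This is a toy illustration only: `ToyPostEditor`, `run_session` and the word-for-word correction table are hypothetical names and simplifications introduced here, not the online SMT model used in the paper.

```python
from collections import Counter, defaultdict


class ToyPostEditor:
    """Hypothetical stand-in for an online-learning post-editor (not the paper's model)."""

    def __init__(self):
        # For each word seen in MT output, count the words the user validated in its place.
        self.corrections = defaultdict(Counter)

    def post_edit(self, mt_output):
        # Replace each word by its most frequently validated correction, if one is known.
        edited = []
        for word in mt_output.split():
            seen = self.corrections.get(word)
            edited.append(seen.most_common(1)[0][0] if seen else word)
        return " ".join(edited)

    def update(self, mt_output, validated):
        # Toy assumption: learn only from pairs that align word for word.
        mt_words, ok_words = mt_output.split(), validated.split()
        if len(mt_words) != len(ok_words):
            return
        for mt_word, ok_word in zip(mt_words, ok_words):
            self.corrections[mt_word][ok_word] += 1


def run_session(base_translate, editor, stream):
    """Process (source, user-validated translation) pairs one sentence at a time."""
    outputs = []
    for source, user_validated in stream:
        mt_output = base_translate(source)           # base translation system
        outputs.append(editor.post_edit(mt_output))  # APE output shown to the user
        editor.update(mt_output, user_validated)     # retrain on the validated output
    return outputs
```

The point of the sketch is the ordering: each update happens before the next sentence is translated, so a correction validated once can already help on the following sentence.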

Methods
SMT with Online Learning/Active Learning (Fig. 2)

Study Design
Base Translation Systems
– w/o Resources: “THOT” online learning SMT (no translation memory, dictionary or trained models)
– RBMT: 2 for each language pair
– WEB: Google, Bing, Yandex
– SMT: Moses, an open-source toolkit
– Oracle: in-domain Moses-trained SMT

Study Design
Active Learning Automatic Post Editor
– SMT
  – Simplified log-linear model to generate translations
  – Feature functions:
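The slide's own formula and its list of feature functions are not preserved in this transcript. For orientation, the general form of a log-linear translation model is given below; the symbols are assumptions made here rather than the slide's notation: x is the input sentence (in an APE setting, the output of the base system), y a candidate translation, h_m the feature functions and λ_m their weights.

```latex
% General log-linear decision rule (requires amsmath); notation assumed, not taken from the slide.
\hat{y} = \operatorname*{arg\,max}_{y} \sum_{m=1}^{M} \lambda_m \, h_m(x, y)
```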

Study Design
Active Learning APE
– SMT
  – Log-linear model extended from user feedback
  – Incremental EM (expectation maximization) algorithm (convergence after each training sample)
Differs from previous work by word alignment
– Word alignment (WA) used is based on an HMM rather than on edit distance (rule)
– WA integrated in online SMT
– Estimated by the incremental EM

Evaluation Metrics
Results calculated in BLEU score
– BiLingual Evaluation Understudy
– Papineni, K., Roukos, S., Ward, T., Zhu, W.-J., 2002. Bleu: a method for automatic evaluation of machine translation. In: Proceedings of ACL, Philadelphia, PA, USA, pp. 311–318.
– Usually N=4
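As a reminder of what the score measures, here is a minimal sketch of BLEU with N=4: the geometric mean of clipped n-gram precisions for n = 1..N, multiplied by a brevity penalty. It makes simplifying assumptions that should be stated loudly: whitespace tokenization, a single reference, one sentence at a time, and no smoothing. BLEU is normally computed over a whole test corpus, so this is not the evaluation setup of the paper, and the function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Unsmoothed single-reference BLEU for one sentence pair (simplified sketch)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precision_sum += math.log(matches / total)
    # Brevity penalty: penalize hypotheses that are shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision_sum / max_n)
```

Because a single missing 4-gram match gives a score of zero here, sentence-level use generally needs smoothing; corpus-level BLEU avoids the issue by pooling counts over all sentences.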

Corpora
Language Pairs:
– Spanish (es) – English (en)
– Spanish (es) – Catalan (ca)
4 Corpora
– Xerox: non-public source
– Europarl: popular multilingual MT source, used for out-of-domain training

RESULTS

Results
EMEA (Health) (Fig. 3)
– English -> Spanish (reverse direction: same results)

Results
Xerox (Technical) (Fig. 5)
– English -> Spanish (reverse direction: similar results, slightly worse)
– Not statistically significant

Results
i3media (News) (Fig. 7)
– Spanish -> Catalan (reverse direction: similar results)

DISCUSSION AND CONCLUSION

Discussion & Conclusion
Repetition Rate
– Measures the degree of document-internal repetition for a given corpus
– Rate of non-singleton n-grams in a given text, defined in terms of:
  – the set of different n-grams contained in the in-domain corpus I
  – the set of different n-grams occurring only once in I
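The slide's formula did not survive in the transcript, but the two sets it defines suggest the following reading: the number of different n-grams in I, minus the number of those occurring only once, divided by the number of different n-grams in I. The sketch below computes that ratio for a single n. The function name and the whitespace tokenization are assumptions, and the slide does not say how rates for different n are combined or whether the text is processed in windows, so neither is attempted here.

```python
from collections import Counter


def non_singleton_rate(sentences, n):
    """Share of distinct n-grams that occur more than once in the given sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # n-grams are counted within sentences only, never across sentence boundaries.
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    distinct = len(counts)  # size of the set of different n-grams
    if distinct == 0:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)  # n-grams occurring only once
    return (distinct - singletons) / distinct
```

For example, in the two sentences "press the button" and "press the lever", two of the four distinct words repeat (rate 0.5), while only one of the three distinct bigrams does (rate 1/3).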

Discussion & Conclusion
Repetition Rate
– High repetition = Active Learning success
– Low repetition = more work for the editor
– Baseline online MT highly correlated with repetition rate
– Exception: i3media (news)
  – High variability in sub-domains
  – Small documents, each document repr…
  – 95% of documents have fewer than 30 sentences

Application
Disparate subdomains in clinical notes
Evaluation of repetition can assess whether Active Learning methods can work for multiple tasks
– Named Entity Recognition
  – Most frequently occurring NE
  – Pre-Annotation project
– SMT: phrase-based translation
  – Phrase-based NE and Dx prevalent in clinical notes