Improving Machine Translation Quality with Automatic Named Entity Recognition


Improving Machine Translation Quality with Automatic Named Entity Recognition
Bogdan Babych, Centre for Translation Studies, University of Leeds, UK; Department of Computer Science, University of Sheffield, UK
Anthony Hartley, Centre for Translation Studies, University of Leeds, UK

Overview
Problems of Named Entities (NEs) for MT
Experiment set-up
  – Segmentation of the MT output
  – Scoring scheme
Results of the experiment
Discussion
  – Improving MT with IE techniques
Conclusions and future work

Problems of NEs for MT
NEs are the weak point for many MT systems
Distinct linguistic properties of proper nouns and different translation strategies for NEs
“NE internal” errors:
  – Proper / common noun disambiguation errors
  – Errors in morphosyntactic categories of NEs
“NE external” errors in the context of NEs:
  – Word sense disambiguation errors
  – Errors in morphosyntactic features in NE context
  – Segmentation errors

Translation strategies for NEs
Language-dependent strategies
  – Eastern Slavonic languages: person names are transcribed with Cyrillic characters
Strategies dependent on a type of NE
  – [Newmark, 1982: 70-83]: organisation names are often left untranslated
  – Languages with Cyrillic writing system: organisation names are often left in original Roman orthography
E.g.: in 4 articles on international economy from the BBC Russian site, Roman-script NEs cover 6% of the total 1000 tokens
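A proportion like the 6% figure above could be measured with a simple token count. The sketch below is only an illustration: the function name `roman_script_share` and the sample sentence are invented here, and whitespace tokenisation is an assumption, not the method used for the BBC Russian sample.

```python
import re

def roman_script_share(tokens):
    """Return the fraction of tokens that contain at least one Latin letter."""
    if not tokens:
        return 0.0
    roman = [t for t in tokens if re.search(r"[A-Za-z]", t)]
    return len(roman) / len(tokens)

# Invented Russian sentence with three NE tokens left in Roman script.
tokens = "Компания Microsoft и банк Goldman Sachs объявили сделку".split()
share = roman_script_share(tokens)
print(f"{share:.1%} of {len(tokens)} tokens are in Roman script")
```

Counting any token with a Latin letter slightly overestimates NE coverage, since URLs or abbreviations would also match.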

Proper / common disambiguation errors
  – English: “Ray Rogers”
  – MT ProMT E-R: “Луч Rogers” (‘A ray (beam of light) Rogers’)
  – English: “Bill Fisher”
  – MT ProMT E-R: “Выставить счёт Рыбаку”
  – MT ProMT E-F: “Facturez le Pêcheur” (‘(To) send a bill to a fisher’)
  – English: “Jeff Levy”
  – MT Systran E-F: “prélèvement de Jeff” (‘Jeff’s imposing of a tax’)

Contextual changes around unrecognised NEs
Errors in morphosyntactic categories
  – English: “… they have been flying in United cockpits”
  – E-R MT: “… они летали в Объединенных кабинах” (‘they have been flying in united (joined) cockpits’)
Segmentation errors
  – English: “Eastern Airlines executives notified union leaders …”
  – E-R MT: “Восточные исполнители авиалиний уведомили профсоюзных руководителей…” (‘Oriental executives of the Airlines notified …’)

Compound errors -- combining:
“NE internal” errors and errors in the context of NEs
Lexical disambiguation errors and errors in morphosyntactic disambiguation / segmentation
  – English: “In Ford-UAW talks…”
  – E-R MT: “В Броде - UAW говорит” (‘In a ford (shallow place) - UAW is talking’)

Information Extraction (IE) technology
IE: from unrestricted text to a database
  – specific subject domain (e.g. satellite launches)
  – predefined template with fields to be filled
IE tasks:
  – NE recognition
  – Co-reference resolution
  – Word sense disambiguation
  – Template element filling
  – Scenario template filling
  – Summary generation

NE recognition in IE
NE recognition is specifically addressed and benchmarked (DARPA MUC6 & MUC7 competitions)
Manually annotated “gold standard” available
Highly accurate
  – leading IE systems achieve F-score 80-90%
  – performance is higher and less dependent on a subject domain (compared to Scenario Template Filling)
Available under GPL: NE recognition module ANNIE in Sheffield’s GATE system

Using NE recognition for MT
GATE-ANNIE system allows automatic annotation of NEs in English texts
MT systems accept Do-Not-Translate (DNT) lists
  – acceptable translation strategy for many organisation names in certain language pairs
Suggestion: if NE recognition is more accurate in IE systems than in MT systems, then using IE-based NE recognition should improve general MT quality (compared to the baseline performance)
  – NE-internal changes are predictable (DNT strategy)
  – Changes in the context of NEs are more interesting and more difficult to predict

Experiment set-up
Purpose: evaluating morphosyntactic changes in the context of NEs after DNT-processing
Corpus:
  – 30 texts (news articles) from MUC6 evaluation set (11,975 tokens, 510 NE occurrences, 174 NE types)
  – GATE “responses” -- NE recognition output file generated by GATE-1 for MUC6 competition (Precision - 84%; Recall - 94%; F-measure %)
MT systems:
  – E-R ProMT 98; E-F ProMT 2001; E-F Systran 2000
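The F-measure value is missing from the transcript. As a reminder of how the measure relates to precision and recall, here is a minimal sketch of the standard F-beta formula; the function name `f_measure` is introduced here, and equal weighting (beta = 1) is an assumption. Because the inputs are the rounded percentages quoted above, the result need not match the figure on the original slide.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Rounded values quoted for GATE-1 on the MUC6 texts.
f1 = f_measure(0.84, 0.94)
print(f"{f1:.3f}")  # prints 0.887
```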

Experiment set-up (contd.)
Stage 1: Automatic generation of DNT lists from GATE-1 annotation
Stage 2: Generating translations for 3 systems
  – Baseline translation (without a DNT list)
  – DNT-processed translation
Stage 3: Automatic segmentation of translations into NE-internal and NE-external zones
Stage 4: Manual scoring of NE-external differences
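Stage 1 can be pictured as collecting the distinct NE strings from the annotation. The sketch below is hypothetical: it assumes the annotations have already been read into `(text, type)` pairs, and the names `build_dnt_list` and `keep_types` are invented here. It does not reproduce the GATE-1 output format or the DNT-list format of any particular MT system.

```python
def build_dnt_list(annotations, keep_types=None):
    """Collect unique NE strings for a Do-Not-Translate list.

    annotations: iterable of (text, ne_type) pairs (assumed input shape).
    keep_types:  optional set of NE types to keep; None keeps every type.
    """
    seen = set()
    entries = []
    for text, ne_type in annotations:
        entry = text.strip()
        if not entry or entry in seen:
            continue
        if keep_types is not None and ne_type not in keep_types:
            continue
        seen.add(entry)
        entries.append(entry)
    # Longest entries first, so a multi-word name takes priority over
    # a shorter name contained in it.
    return sorted(entries, key=len, reverse=True)

annotations = [
    ("Eastern Airlines", "ORGANIZATION"),
    ("United", "ORGANIZATION"),
    ("Ray Rogers", "PERSON"),
    ("Eastern Airlines", "ORGANIZATION"),
]
dnt_list = build_dnt_list(annotations, keep_types={"ORGANIZATION"})
print(dnt_list)
```

Restricting the list to organisation names is shown only as an option, reflecting the earlier remark that DNT is an acceptable strategy for many organisation names.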

Segmentation algorithm
Annotated NEs in the English original are looked up in the DNT-processed translation
Strings between found NEs are then looked up in the baseline translation
If a string is not found, it is highlighted (signaling a difference in the context of the NE)
  – Result: NE-internal and NE-external zones in the baseline translation are separated
  – NE-external differences are highlighted
  – No complex alignment
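The lookup steps above can be sketched with plain substring search. This is a simplified illustration, not the authors' implementation: the function name `find_context_differences` is invented, matching is exact and case-sensitive, and the French sentences are made-up stand-ins rather than real MT output.

```python
def find_context_differences(nes, dnt_translation, baseline_translation):
    """Return (string, highlighted) pairs for the text between NEs.

    A string is highlighted (True) when it does not occur verbatim in
    the baseline translation, i.e. the NE context has changed.
    """
    # Locate every occurrence of every NE in the DNT-processed translation.
    spans = []
    for ne in set(nes):
        start = dnt_translation.find(ne)
        while start != -1:
            spans.append((start, start + len(ne)))
            start = dnt_translation.find(ne, start + len(ne))
    # Leftmost first; at the same position the longer match wins.
    spans.sort(key=lambda s: (s[0], -s[1]))

    results = []
    pos = 0
    for start, end in spans:
        if start < pos:  # overlaps an NE that was already consumed
            continue
        between = dnt_translation[pos:start].strip()
        if between:
            results.append((between, between not in baseline_translation))
        pos = end
    tail = dnt_translation[pos:].strip()
    if tail:
        results.append((tail, tail not in baseline_translation))
    return results

# Invented toy sentences for illustration only.
dnt = "Les cadres de Eastern Airlines ont informé les chefs syndicaux"
baseline = "Les cadres orientaux des lignes aériennes ont informé les chefs syndicaux"
differences = find_context_differences(["Eastern Airlines"], dnt, baseline)
for text, highlighted in differences:
    print(("DIFF  " if highlighted else "same  ") + text)
```

Because only exact string containment is tested, no alignment between the two translations is needed, which matches the "no complex alignment" point above.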

Segmentation algorithm (contd.)

Scoring scheme Evaluating morphosyntactic well-formedness

Scoring examples: +1 score

Scoring examples: +0.5 score

Scoring examples: =0 score

Scoring examples: -0.5 score

Scoring examples: -1 score

Manually scored part of the corpus
50 highlighted strings for each MT system
Gain score: Overall score / Scored differences
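The gain score is a simple ratio, sketched below. The five permitted values are taken from the scoring-example slides (+1, +0.5, 0, -0.5, -1); the function name `gain_score` and the sample scores are invented, and it is assumed that "scored differences" counts every scored string, including those scored 0.

```python
ALLOWED_SCORES = {1.0, 0.5, 0.0, -0.5, -1.0}

def gain_score(scores):
    """Overall score divided by the number of scored differences."""
    if not scores:
        raise ValueError("no scored differences")
    for s in scores:
        if s not in ALLOWED_SCORES:
            raise ValueError(f"unexpected score: {s}")
    return sum(scores) / len(scores)

# Made-up scores for six highlighted strings.
example_scores = [1, 0.5, 0, -0.5, 1, 1]
gain = gain_score(example_scores)
print(gain)  # prints 0.5
```

A positive gain means that, on average, the DNT-processed translation improved the context of NEs relative to the baseline; a negative gain means it made the context worse.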

Results of the experiment

Results for additional 50 strings...

Improvement in the context of NEs
Aspects of improvement:
  – morphosyntactic features and categories
  – word sense disambiguation
  – word order and syntactic segmentation
Consistency in improvement
  – for both languages
  – for all MT systems

Examples of improvement

Examples of improvement:2

Improvement: languages and systems

Improvement: languages and systems:2

Improvement: languages and systems:3

Discussion
Different aspects of MT quality are interdependent
  – improvements on one level help other levels
IE techniques target specific tasks that are also necessary for the source-language (SL) analysis stage in MT
  – NE recognition
  – co-reference resolution
  – word sense disambiguation
MT can benefit from clearly defined evaluation procedures for specific IE tasks

Conclusions and Future Work
NE recognition within the IE framework not only improves the treatment of NEs by MT but also boosts overall MT quality:
  – morphosyntactic and lexical well-formedness
  – features of the wider context of NEs
Future work: harnessing other focused technologies for MT
  – co-reference resolution
  – word sense disambiguation
  – evaluating the baseline performance of MT systems