An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.

Presentation transcript:

An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson Research Center ACL 2005 Workshop (on Feature Engineering for Machine Learning in NLP)

2/24 Outline (1/2)
• Named entity (NE) translation is crucial for effective
  - cross-language information retrieval
  - machine translation
• NEs (this work focuses only on person, location, and organization names)
  - may be phonetically transliterated
    · person names
  - may also mix phonetic transliteration and semantic translation
    · location and organization names

3/24 Outline (2/2)
• An integrated approach
  - phrase-based translation
    · advantage: frequently used NE phrases
    · disadvantage: less frequent words
  - word-based translation
    · traditional statistical machine translation techniques such as IBM Model 1 (Brown et al., 1993)
    · disadvantage: many-to-many phrase translations
  - transliteration modules
    · advantage: out-of-vocabulary, unknown words
    · disadvantage: frequently used words
• Aligning NEs across parallel corpora

4/24 Related Work (1/2)
• Huang et al., 2003:
  - used a bilingual dictionary to extract NE pairs and deployed it iteratively to extract more NEs.
• Moore, 2003:
  - relies on orthographic clues, so it is only suitable for language pairs with similar scripts and/or orthographic conventions.

5/24 Related Work (2/2)
• Arabic-related transliteration
  - Arbabi et al., 1998: developed a hybrid neural network and knowledge-based system to generate multiple English spellings for Arabic person names.
  - Stalls and Knight, 1998: Arabic-English back-transliteration.
  - Al-Onaizan and Knight, 2002: a spelling-based model that directly maps English letter sequences into Arabic letter sequences.

6/24
• Person names tend to be phonetically transliterated
  - the idiomatic translation that has been established
• For locations and organizations, the translation can be a mixture of translation and transliteration.

7/24 Data
• A parallel corpus
  - annotated using NE identifiers similar to the systems described in (Florian et al., 2004) for NE detection
  - used to acquire, separately:
    · the phrases for the phrase-based system
    · the translation matrix for the word-based system
    · training data for the transliteration system

8/24 Translation and Transliteration Modules
• Word-Based NE Translation
  - Basic multi-cost NE Alignment
  - Multi-cost NE Alignment by Content Words Elimination
• Phrase-Based Named Entity Translation
• Named Entity Transliteration
• System Integration and Decoding

9/24 Basic multi-cost NE Alignment (1/3)
• IBM Model 1 (Brown et al., 1993)
• Huang et al.
  - multi-cost aligning approach
  - The cost for aligning any source and target NE word is defined as:
• Ed(we, wf): a phonetic-based edit distance that employs an Editex-style (Zobel and Dart, 1996) distance measure

10/24 Basic multi-cost NE Alignment (2/3)
• The Editex distance d between two letters a and b is d(a, b) =
  - 0 if both are identical
  - 1 if they are in the same group
  - 2 otherwise
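The letter-level rule above can be sketched in a few lines of Python. This is only an illustration. The letter groups are the English Editex groups as usually cited from Zobel and Dart (1996), and the slides do not say which groups the authors used for romanized Arabic. The word-level function is a simplified Levenshtein-style recursion with a fixed insertion/deletion cost (`indel_cost`, an assumed parameter), not the full Editex recurrence.

```python
# Assumed letter groups (standard English Editex groups); a letter may
# belong to more than one group.
EDITEX_GROUPS = [
    set("aeiouy"), set("bp"), set("ckq"), set("dt"), set("lr"),
    set("mn"), set("gj"), set("fpv"), set("sxz"), set("csz"),
]


def letter_distance(a, b):
    """Return 0 for identical letters, 1 for same-group letters, 2 otherwise."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 0
    if any(a in group and b in group for group in EDITEX_GROUPS):
        return 1
    return 2


def phonetic_edit_distance(s, t, indel_cost=2):
    """Simplified edit distance using letter_distance as substitution cost.

    indel_cost is an assumption of this sketch, not a value from the slides.
    """
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * indel_cost]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + indel_cost,               # delete a from s
                cur[j - 1] + indel_cost,            # insert b into s
                prev[j - 1] + letter_distance(a, b) # substitute a with b
            ))
        prev = cur
    return prev[-1]
```

With these assumed groups, `phonetic_edit_distance("mohamed", "muhamed")` costs only one same-group substitution (o/u), which is the behavior a phonetic distance is meant to give for spelling variants of the same name.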

11/24 Basic multi-cost NE Alignment (3/3)

12/24 Multi-cost NE Alignment by Content Words Elimination (1/2)
• Content words might be aligned incorrectly to rare NE words
• A two-phase alignment approach
  - The first phase aligns the content words using a content-word-only translation matrix; the aligned content words are then removed.
  - The second phase applies multi-cost alignment to the remaining words.

13/24 Multi-cost NE Alignment by Content Words Elimination (2/2)
• Example
  - Wsi: the content words in the source sentence.
  - NEsi: the Named Entity source words.
  - Wti: the content words in the target sentence.
  - NEti: the Named Entity target words.
• Before content-word elimination
  - Source: Ws1 Ws2 NEs1 NEs2 Ws3 Ws4 Ws5
  - Target: Wt1 Wt2 Wt3 NEt1 NEt2 NEt3 Wt4 NEt4
• After content-word elimination
  - Source: NEs1 NEs2 Ws4 Ws5
  - Target: Wt3 NEt1 NEt2 NEt3 NEt4
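The elimination step in this example can be sketched as follows. The function name and the specific phase-one links (Ws1-Wt1, Ws2-Wt2, Ws3-Wt4) are assumptions chosen so that the sketch reproduces the example. The slides do not state which content words were aligned, only which words remain.

```python
def eliminate_content_words(source, target, content_alignment):
    """Drop every word that phase one aligned as a content-word pair.

    content_alignment is a list of (source_word, target_word) links that
    a content-word-only translation matrix is assumed to have produced.
    """
    aligned_source = {s for s, _ in content_alignment}
    aligned_target = {t for _, t in content_alignment}
    remaining_source = [w for w in source if w not in aligned_source]
    remaining_target = [w for w in target if w not in aligned_target]
    return remaining_source, remaining_target


source = "Ws1 Ws2 NEs1 NEs2 Ws3 Ws4 Ws5".split()
target = "Wt1 Wt2 Wt3 NEt1 NEt2 NEt3 Wt4 NEt4".split()

# Hypothetical phase-one output: confidently aligned content words.
phase_one_links = [("Ws1", "Wt1"), ("Ws2", "Wt2"), ("Ws3", "Wt4")]

remaining_source, remaining_target = eliminate_content_words(
    source, target, phase_one_links
)
print(remaining_source)
print(remaining_target)
```

Only the words left over (NE words plus any unaligned content words such as Ws4, Ws5, and Wt3) would then go through multi-cost alignment, which reduces the chance of a content word being linked to a rare NE word.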

14/24 Phrase Based Named Entity Translation (1/2)
• Uses the approach of Tillmann (2003) for block generation, with modifications suitable for NE phrase extraction.
• A block is defined to be any pair of source and target phrases.
• This approach starts from a word alignment generated by HMM Viterbi training (Vogel et al., 1996), which is done in both directions between source and target.

15/24 Phrase Based Named Entity Translation (2/2)

16/24 Named Entity Transliteration (1/3)
• Handles out-of-vocabulary (OOV) words that are not covered by the word-based or phrase-based models.
• These source and target sequences construct the blocks, which enables the modeling of vowel insertion.

17/24 Named Entity Transliteration (2/3)
• Arabic name -> “Shoukry”
• The system tries to model bi-grams from the source language to n-grams in the target language as follows:
  - $k -> shouk
  - kr -> kr
  - ry -> ry
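The bigram-to-n-gram mapping above can be illustrated with a toy lookup. Several things here are assumptions of this sketch rather than facts from the slides: the romanized source form "$kry", the hard-coded `BLOCKS` table (a real system would learn weighted blocks), and the rule for merging overlapping pieces.

```python
# Toy block table taken from the three mappings listed on the slide.
BLOCKS = {"$k": "shouk", "kr": "kr", "ry": "ry"}


def source_bigrams(word):
    """Return the overlapping letter bigrams of a romanized source word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def transliterate(word, blocks):
    """Map each source bigram to its target n-gram and stitch the pieces.

    Consecutive bigrams share one source letter, so this sketch assumes
    each later piece repeats one target letter and drops it when merging.
    """
    pieces = [blocks[bigram] for bigram in source_bigrams(word)]
    result = pieces[0]
    for piece in pieces[1:]:
        result += piece[1:]
    return result


print(source_bigrams("$kry"))
print(transliterate("$kry", BLOCKS))
```

The point of the example is that the first block maps two source letters to five target letters, which is how vowels missing from the Arabic spelling can be inserted on the English side.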

18/24 Named Entity Transliteration (3/3)
• Use the translation matrix from the word-based alignment models.
  - Translations with probabilities below a certain threshold are filtered out.
  - Pairs whose distance between the romanized Arabic and the English exceeds a threshold are also filtered out.
  - The remaining highly confident name pairs are used to train a letter-to-letter translation matrix using HMM Viterbi training (Vogel et al., 1996).
  - For a source block s and a target block t, the model gives P(t | s).
  - A Weighted Finite State Transducer (WFST) translates any source sequence to a target sequence.
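The two filtering steps can be sketched as below. The threshold values, the candidate triples, and the use of plain Levenshtein distance are all assumptions made for illustration. The slides do not give the thresholds or specify the distance measure used at this stage.

```python
def levenshtein(a, b):
    """Plain edit distance with unit costs (a stand-in distance measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb)  # substitution (0 if letters match)
            ))
        prev = cur
    return prev[-1]


def filter_name_pairs(candidates, min_prob=0.5, max_distance=3):
    """Keep (romanized_arabic, english) pairs that pass both filters.

    candidates holds (romanized_arabic, english, translation_probability)
    triples; min_prob and max_distance are hypothetical thresholds.
    """
    kept = []
    for romanized, english, prob in candidates:
        if prob < min_prob:
            continue  # low-probability translation
        if levenshtein(romanized.lower(), english.lower()) > max_distance:
            continue  # spellings too far apart to be a transliteration
        kept.append((romanized, english))
    return kept


# Hypothetical entries from a word-based translation matrix.
candidates = [
    ("mhmd", "mohamed", 0.9),  # likely a transliterated name
    ("mhmd", "ahmed", 0.2),    # low probability
    ("ktAb", "book", 0.8),     # semantic translation, not a transliteration
]
print(filter_name_pairs(candidates))
```

The distance filter is what separates transliterated pairs from ordinary translations: a probable translation such as "book" is dropped because its spelling is unrelated to the romanized source word.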

19/24 System Integration and Decoding
• Used a dynamic programming beam search decoder similar to the decoder described by Tillmann (2003).
• Monolingual target data -> NE phrases
  - The first language model is a trigram language model on NE phrases.
  - The second language model is a class-based language model with a class for unknown NEs.

20/24 Experimental Setup (1/4)
• Three NE categories: names of persons, organizations, and locations.
• Trained on a news-domain parallel corpus containing 2.8 million Arabic words and 3.4 million English words.
• Monolingual English data was annotated with NE types.

21/24 Experimental Setup (2/4)
• A test set was constructed manually.
• The BLEU score (Papineni et al., 2002) with a single reference translation was used for evaluation.
• BLEU-3, which uses up to 3-grams, is used since a three-word phrase is a reasonable length for various NE types.
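For reference, a single-reference BLEU-3 score can be sketched as below. This is a minimal, unsmoothed, phrase-level version of the standard BLEU definition (geometric mean of modified n-gram precisions times a brevity penalty), written only to make the metric concrete. It is not the authors' evaluation script, and the function names are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=3):
    """Unsmoothed BLEU with one reference and n-grams up to max_n."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: any empty precision gives zero
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "united nations security council".split()
print(bleu("united nations security council".split(), reference))
print(bleu("united nations council".split(), reference))
```

Without smoothing, any phrase shorter than three words has no 3-grams and scores zero at the phrase level, so in practice BLEU is normally accumulated over the whole test set rather than averaged per phrase.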

22/24 Experimental Setup (3/4)

23/24 Experimental Setup (4/4)

24/24 Conclusion and Future Work
• We have presented an integrated system that can handle various NE categories. It requires only the regular parallel and monolingual corpora that are typically used to train any statistical machine translation system, along with an NE identifier.
• We will evaluate the effect of the system on CLIR and MT tasks.