An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.

An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson Research Center ACL 2005 Workshop (on Feature Engineering for Machine Learning in NLP)

2/24 Outline (1/2)   Named Entities (NEs) translation is crucial for effective – –cross-language information retrieval – –Machine Translation   NEs (only focus on person names, location names and organization names) – –might be phonetically transliterated   persons names – –might also be mixed between phonetic transliteration and semantic translation   locations and organizations names

3/24 Outline (2/2)   an integrated approach – –phrase-based translation   advantage: frequently used NE phrases   disadvantage: less frequent words – –word-based translation   traditional statistical machine translation techniques such as IBM Model1 (Brown et al., 1993)   disadvantage: many-to-many phrase translations – –transliteration modules   advantage: out of vocabulary, unknown words   disadvantage: frequently used words   aligning NEs across parallel corpora

4/24 Related Work (1/2)   Huang et al., 2003: – –used a bilingual dictionary to extract NE pairs and deployed it iteratively to extract more NEs.   Moore, 2003: – –relies on orthographic clues, only suitable for language pairs with similar scripts and/or orthographic conventions

5/24 Related Work (2/2)   Arabic-related transliteration – –Arbabi et al., 1998: developed a hybrid neural network and knowledge-based system to generate multiple English spellings for Arabic person names. – –Stalls and Knight, 1998: Arabic-English back transliteration. – –Al-Onaizan and Knight, 2002: spelling- based model which directly maps English letter sequences into Arabic letter sequences.

6/24   Persons names tend to be phonetically transliterated – –the idiomatic translation that has been established  For locations and organizations, the translation can be a mixture of translation and transliteration.

7/24 Data   A parallel corpus – –using NE identifiers similar to the systems described in (Florian et al., 2004) for NE detection. – –to separately acquire the phrases for the phrase based system – –the translation matrix for the word based system – –training data for the transliteration system

8/24 Translation and Transliteration Modules   Word Based NE Translation – –Basic multi-cost NE Alignment – –Multi-cost NE Alignment by Content Words Elimination   Phrase Based Named Entity Translation   Named Entity Transliteration   System Integration and Decoding

9/24 Basic multi-cost NE Alignment (1/3)   IBM Model1 (Brown et. al, 1993)   Huang et al. 2003 – –multi-cost aligning approach – –The cost for aligning any source and target NE word is defined as:  Ed(we,wf): this phonetic-based edit distance employs an Editex style (Zobel and Dart, 1996) distance measure

10/24 Basic multi-cost NE Alignment (2/3)   The Editex distance (d) between two letters a and b is: d(a,b) = – –0 if both are identical – –1 if they are in the same group – –2 otherwise

11/24 Basic multi-cost NE Alignment (3/3)

12/24 Multi-cost NE Alignment by Content Words Elimination (1/2)   Content words might be aligned incorrectly to rare NE words   A two-phase alignment approach – –The first phase is aligning the content words using a content-word-only translation matrix -> remove – –Remaining words -> multi-cost alignment

13/24 Multi-cost NE Alignment by Content Words Elimination (2/2)   Ex – –Wsi: content words in the source sentence. – –NEsi: the Named Entity source words. – –Wti: the content words in the target sentence. – –NEti: the Named Entity target words. – –Source: Ws1 Ws2 NEs1 NEs2 Ws3 Ws4 Ws5 – –Target: Wt1 Wt2 Wt3 NEt1 NEt2 NEt3 Wt4 NEt4 – –Source: NEs1 NEs2 Ws4 Ws5 – –Target: Wt3 NEt1 NEt2 NEt3 NEt4

14/24 Phrase Based Named Entity Translation (1/2)   Tillman (Tillmann, 2003) for block generation with modifications suitable for NE phrase extraction.   A block is defined to be any pair of source and target phrases.   This approach starts from a word alignment generated by HMM Viterbi training (Vogel et. Al, 1996), which is done in both directions between source and target.

15/24 Phrase Based Named Entity Translation (2/2)

16/24 Named Entity Transliteration (1/3)   Out Of Vocabulary (OOV) words that are not covered by the word or phrase based models.   These source and target sequences construct the blocks which enables the modeling of vowels insertion.

17/24 Named Entity Transliteration (2/3)   Arabic name -> “Shoukry”   The system tries to model bi-grams from the source language to n-grams in the target language as follows: – –$k -> shouk – –kr -> kr – –ry -> ry

18/24 Named Entity Transliteration (3/3)   Use the translation matrix, from the word based alignment models. – –Translations with probabilities less than a certain threshold are filtered out. – –Distance between both romanized Arabic and English -> greater than the threshold are also filtered out. – –The remaining highly confident name pairs are used to train a letter to letter translation matrix using HMM Viterbi training (Vogel et al., 1996). – –a source block s and a target block t, P(t | s) – –a Weighted Finite State Transducer (WFST) for translating any source sequence to a target sequence

19/24 System Integration and Decoding   Used a dynamic programming beam search decoder similar to the decoder described by Tillman (Tillmann, 2003).   Monolingual target data -> NE phrases – –The first language model is a trigram language on NE phrases. – –The second language model is a class based language model with a class for unknown NEs.

20/24 Experimental Setup (1/4)   three NE categories, namely names of persons, organizations, and locations.   trained on a news domain parallel corpus containing 2.8 million Arabic words and 3.4 million words.   Monolingual English data was annotated with NE types

21/24 Experimental Setup (2/4)   manually constructed a test set  The BLEU score (Papineni et al., 2002) with a single reference translation was deployed for evaluation.  BLEU-3 which uses up to 3-grams is deployed since three words phrase is a reasonable length for various NE types.

22/24 Experimental Setup (3/4)

23/24 Experimental Setup (4/4)

24/24 Conclusion and Future Work   We have presented an integrated system that can handle various NE categories and requires the regular parallel and monolingual corpora which are typically used in the training of any statistical machine translation system along with NEs identifier.   We will evaluate the effect of the system on CLIR and MT tasks.

An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.

Similar presentations

Presentation on theme: "An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.

Similar presentations

Presentation on theme: "An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson."— Presentation transcript:

Similar presentations

About project

Feedback