
1 Automatic Transliteration of Proper Nouns from Arabic to English
Mehdi M. Kashani, Fred Popowich, Anoop Sarkar
Simon Fraser University, Vancouver, BC
July 22, 2007
Second Workshop on Computational Approaches to Arabic Script-based Languages

2 Overview
Problem Definition and Challenges
Related Work
Our Approach
Evaluation
Discussion

3 Transliteration
Translation tools facilitate dialogue across cultures: source language → target language.
Transliteration is a subtask dealing with transcribing a word written in one writing system into another writing system.
Forward transliteration: محمد → Mohammed, Mohammad, Mohamed, Muhammad, …
Backward transliteration: روبرت → Robert
Our task: Arabic to English (for machine translation).

4 Challenges
Not a 1-to-1 relationship: کاترين can be the equivalent of both Catherine and Katharine. Context can disambiguate: Katharine Hepburn.
Lack of diacritics in Arabic writing: long vowels (ا و ی) are always written explicitly, but short vowels are omitted, e.g. مُحَمِد is written محمد, as if Mohammed were written Mhmmd.
Lack of certain sounds in Arabic: Popowich → Bobowij.
Different pronunciations depending on a letter's position in the word: how is ي pronounced at the beginning of a word, and how is it usually pronounced in the middle or at the end?

5 Convention
Arabic is written in a cursive script (though not all letters connect): ابراهيم → ا ب ر ا ه ی م
Arabic is written right to left.
From now on, Arabic words are shown letter by letter and from left to right: ابراهيم → ا ب ر ا ه ی م → e b r a h i m

6 Overview
Problem Definition and Challenges
Related Work
Our Approach
Evaluation
Discussion

7 Related Work
Stalls and Knight (1998): Arabic to English using a noisy channel model over phonemes.
Al-Onaizan and Knight (2002): combining phonetic- and spelling-based methods; they show that a spelling-based approach works better than a phonetic one.
Using parallel corpora (Samy et al., 2005) or comparable corpora (Sproat et al., 2006; Klementiev and Roth, 2006) to discover transliterations; not very useful for the machine translation task.

8 Overview
Problem Definition and Challenges
Related Work
Our Approach
Evaluation
Discussion

9 Our Approach
Consists of three phases.
Phase 1 (generative): ignore diacritics and simply turn the Arabic letters into English letters. م ح م د → m h mm d
Phase 2 (generative): use the best candidates from phase 1 to guess the omitted short vowels. م ح م د & m h mm d → mo ha mm d
Phase 3 (comparative): compare the best candidates from phase 2 with entries in a monolingual dictionary. mo ha mm d → mohammd → mohammed, muhammed, …

10 Training Data Preparation
Extract name pairs from two different sources:
named entities annotated in the LDC Arabic Treebank 3
an Arabic-English parallel news corpus tagged by an entity tagger
In total, 9660 pairs are prepared.

11 Tools
GIZA++ is used for alignment: an implementation of IBM Model 4. Its output files are used to rearrange letters, and its alignment score is used to filter out noise.
Cambridge Language Model Toolkit.
To use these tools, our words are treated as "sentences" and our letters are treated as "words".

12 Preprocessing
Noise filtering: GIZA++ is run on the character-level training data; pairs with low alignment scores are filtered out as noise, reducing the 9660 pairs to 4255 pairs.
Normalizing the training data: convert names to lower case, put a space between the letters of each word, and add a prefix (B) and suffix (E) to each name (the example below is written as if we were actually dealing with English).
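As an illustration of this normalization, here is a minimal sketch in Python; the function name and the exact marker strings are assumptions for illustration, not the authors' code.

```python
def normalize_name(name: str) -> str:
    """Lower-case a name, split it into space-separated letters, and
    wrap it in the boundary markers B (prefix) and E (suffix), so that
    GIZA++ and the LM toolkit can treat letters as 'words'."""
    return " ".join(["B"] + list(name.lower()) + ["E"])

# Example, as if we were actually dealing with English:
print(normalize_name("Mohammed"))  # B m o h a m m e d E
```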

13 Preprocessing
Run GIZA++ with Arabic as the source and English as the target. The most frequent sequences of English letters aligned to the same Arabic letter are added to the alphabet, and the new alphabet is applied to the training data. For example, the two m's of "m o h a m m e d" aligned to the single Arabic letter م become one symbol, "mm".
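A minimal sketch of how such multi-letter symbols could be applied to the English side of the training data; the symbol list and the function name are illustrative assumptions.

```python
# Hypothetical multi-letter symbols: frequent English letter sequences
# that GIZA++ aligned to a single Arabic letter.
MERGED_SYMBOLS = {"mm", "sh", "th"}

def apply_alphabet(letters: list) -> list:
    """Greedily merge adjacent letters that form a known multi-letter symbol."""
    out, i = [], 0
    while i < len(letters):
        pair = "".join(letters[i:i + 2])
        if pair in MERGED_SYMBOLS:
            out.append(pair)
            i += 2
        else:
            out.append(letters[i])
            i += 1
    return out

print(apply_alphabet(list("mohammed")))  # ['m', 'o', 'h', 'a', 'mm', 'e', 'd']
```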

14 Phase 1
Run GIZA++ with Arabic as the source and English as the target. Remove the English letters aligned to null from the training set, e.g. "m o h a mm e d" aligned to م ح م د becomes "m h mm d".

15 Phase 1
Translation model: run GIZA++ with English as the source and Arabic as the target.
Language model: run the Cambridge LM toolkit on the English training set. Use unigram and bigram models for the Viterbi search and a trigram model for rescoring.
E* = argmax_E P(A|E) P(E), where A = a_0 … a_I and E = e_0 … e_J, and the search score is a product of terms P(e_j | e_j-1) P(a_i | e_j).
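A minimal sketch of the kind of bigram Viterbi search this phase performs; the toy probability tables, the symbol inventory, and the function name are assumptions for illustration, not the trained models.

```python
import math

# Toy translation probabilities P(a|e) and bigram LM probabilities P(e|prev)
# (illustrative values only).
trans = {("m", "م"): 0.9, ("mm", "م"): 0.1, ("h", "ح"): 0.8, ("d", "د"): 0.9}
bigram = {("B", "m"): 0.5, ("m", "h"): 0.3, ("h", "mm"): 0.2, ("mm", "d"): 0.4}

def viterbi(arabic, symbols):
    """Return the best English symbol sequence E maximizing P(A|E)P(E),
    keeping one best hypothesis per last emitted symbol."""
    beams = {"B": (0.0, [])}          # last symbol -> (log-score, sequence)
    for a in arabic:
        new_beams = {}
        for prev, (score, seq) in beams.items():
            for e in symbols:
                p = bigram.get((prev, e), 1e-6) * trans.get((e, a), 1e-6)
                cand = (score + math.log(p), seq + [e])
                if e not in new_beams or cand[0] > new_beams[e][0]:
                    new_beams[e] = cand
        beams = new_beams
    return max(beams.values())

print(viterbi(["م", "ح", "م", "د"], ["m", "mm", "h", "d"]))
# e.g. (log-score, ['m', 'h', 'mm', 'd'])
```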

16 Phase 1
Beam search decoding is used, with relative threshold pruning. The k best candidates are returned.
[Figure: beam-search lattice for محمد, from B-prefixed start letters (e.g. Bm) through intermediate letters to E-suffixed end letters (e.g. dE).]
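A minimal sketch of beam search with relative threshold pruning and a k-best output; the scoring callback and the threshold value are illustrative assumptions.

```python
import math

def beam_search(arabic, symbols, score_fn, threshold=math.log(0.05), k=5):
    """Expand hypotheses letter by letter, prune any hypothesis whose
    log-score falls more than |threshold| below the current best at that
    step, and return the k best complete candidates."""
    beam = [(0.0, [])]                          # (log-score, partial sequence)
    for a in arabic:
        expanded = [(s + score_fn(seq, e, a), seq + [e])
                    for s, seq in beam for e in symbols]
        best = max(s for s, _ in expanded)
        beam = sorted((h for h in expanded if h[0] >= best + threshold),
                      reverse=True)
    return beam[:k]

# Toy usage with a uniform scoring function (purely illustrative):
print(beam_search(["م", "ح"], ["m", "h"], lambda seq, e, a: math.log(0.5), k=3))
```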

17 Phase 2
Instead of removing the English letters aligned to null, they are concatenated to their first immediate neighbor, forming new letters (phrases). New translation and language models are created using this new training set, e.g. the null-aligned vowels in "m o h a mm e d" ↔ م ح م د are absorbed into neighboring letters.

18 Phase 2
Use the phase 1 candidates. A phase 1 candidate is e_0 | e_1 | … | e_n and the phase 2 phrases are p_0 | p_1 | … | p_n. All probabilities P(a_i | p_i) where p_i is not prefixed by the given e_i are set to zero. The rest is similar to phase 1.
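A minimal sketch of how a phase 1 candidate could constrain the phase 2 translation table; the position-indexed probability layout and the function name are assumptions for illustration.

```python
def constrain_to_candidate(trans_probs, candidate):
    """Set P(a_i | p_i) to zero whenever the phrase p_i at position i
    is not prefixed by the phase 1 letter e_i chosen at that position.

    trans_probs: dict (phrase, arabic_letter, position) -> probability
    candidate:   phase 1 letters e_0 ... e_n
    """
    return {(p, a, i): (prob if p.startswith(candidate[i]) else 0.0)
            for (p, a, i), prob in trans_probs.items()}

# Toy usage: at position 0 the phase 1 candidate chose "m", so "no" is ruled out.
probs = {("mo", "م", 0): 0.4, ("no", "م", 0): 0.2}
print(constrain_to_candidate(probs, ["m", "h", "mm", "d"]))
```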

19 Phase 2
The same decoding technique is applied. For each candidate of phase 1, l new names are generated, giving kl candidates overall.
New combined score: NewScore = log(S1) + log(S2).
[Figure: phase 2 lattice for م/m, ح/h, م/mm, د/d, with vowelled phrase candidates such as Bmo, ha, mme, dE.]

20 Phase 3
A dictionary of 94646 first and last names, from the US Census Bureau and the OAK System.
All entries are stripped of their vowels: Francisco → frncsc.
Stripped versions of the candidates are compared to the stripped versions of the dictionary entries. If they match, the distance between the original names is computed using the Levenshtein (edit) distance.
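A minimal sketch of this dictionary lookup, assuming a small in-memory name list; the function names and the toy dictionary are illustrative assumptions.

```python
VOWELS = set("aeiou")

def strip_vowels(name: str) -> str:
    return "".join(c for c in name.lower() if c not in VOWELS)

def edit_distance(s: str, t: str) -> int:
    """Standard Levenshtein distance by dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def dictionary_matches(candidate, dictionary):
    """Names whose vowel-stripped form equals the vowel-stripped candidate,
    paired with the edit distance between the original (unstripped) names."""
    key = strip_vowels(candidate)
    return [(entry, edit_distance(candidate, entry))
            for entry in dictionary if strip_vowels(entry) == key]

print(dictionary_matches("mohammd", ["mohammed", "muhammed", "robert"]))
# [('mohammed', 1), ('muhammed', 2)]
```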

21 Phase 3

22 Word Filtering
To avoid adding every output that the HMM generates, a word filtering step is necessary.
Web filtering: requires online queries for each execution; not suitable for most offline tasks.
Language model filtering: requires a rich and up-to-date language model. The Google unigram model is used: over 13 million words with a frequency over 200 on the internet. A huge FSA is built, and only HMM candidates accepted by the FSA remain in the system.
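A minimal sketch of the language-model filtering, with a Python set standing in for the FSA; the unigram file name and format and the frequency-threshold handling are assumptions for illustration.

```python
def load_vocabulary(path: str, min_count: int = 200) -> set:
    """Read a unigram list (word<TAB>count per line) and keep words whose
    count meets the threshold; the resulting set plays the role of the FSA."""
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            if int(count) >= min_count:
                vocab.add(word.lower())
    return vocab

def filter_candidates(candidates, vocab):
    """Keep only candidates 'accepted' by the vocabulary, i.e. present in it."""
    return [c for c in candidates if c.lower() in vocab]

# Hypothetical usage:
# vocab = load_vocabulary("google_unigrams.tsv")
# print(filter_candidates(["mohammed", "mohammd"], vocab))
```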

23 Score
Final Score = αS + βD + γR, where
S is the combined Viterbi score from the last two phases,
D is the Levenshtein distance,
R is the number of repetitions.
All kl outputs from phase 2 are kept among the final outputs, to accommodate names not found in the dictionary (LD = 0).
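A minimal sketch of this weighted combination; the weight values and their signs are illustrative assumptions, not the tuned coefficients from the paper.

```python
def final_score(S: float, D: int, R: int,
                alpha: float = 1.0, beta: float = -1.0, gamma: float = 0.5) -> float:
    """alpha*S + beta*D + gamma*R, where S is the combined Viterbi score from
    the last two phases, D the Levenshtein distance to the matched dictionary
    entry, and R the number of repetitions. Weights here are illustrative."""
    return alpha * S + beta * D + gamma * R

# Toy usage: a candidate matching a dictionary entry at distance 1, repeated 3 times.
print(final_score(S=-4.2, D=1, R=3))  # -3.7
```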

24 Overview
Problem Definition and Challenges
Related Work
Our Approach
Evaluation
Discussion

25 Test Data Preparation
Extracted from Arabic Treebank 2 part 2: 1167 transliteration pairs.
The first 300 pairs form the development test set; the second 300 pairs form the blind test set.
Explicit translations and wrong pairs are filtered out manually, leaving 273 pairs in the development test set and 291 pairs in the blind test set.

26 Distribution of Names
Distribution of seen and unseen names:
            Seen  Unseen  Total
Dev Set      164     109    273
Blind Set    192      99    291
Number of alternatives for names:
            One  Two  Three  Four
Dev Set     161   85     22     5
Blind Set   185   79     20     7

27 Performance on Dev
                   Top 1  Top 2  Top 5  Top 10  Top 20
Single-phase HMM     44%    59%    73%     81%     85%
Double-phase HMM     45%    60%    72%     84%     88%
HMM+Dict.            52%    64%    73%     84%     88%

28 Performance on Blind
                   Top 1  Top 2  Top 5  Top 10  Top 20
Single-phase HMM     38%    54%    72%     80%     83%
Double-phase HMM     41%    57%    75%     82%     85%
HMM+Dict.            46%    61%    76%     84%     86%

29 Overview
Problem Definition and Challenges
Related Work
Our Approach
Evaluation
Discussion

30 Discussion
Does the use of a dictionary help a lot?
You can never have enough training data; rare alignments such as N i e t z s c h e → ن ی ت ش ه remain difficult.
Issues with names of different origins; how much this matters depends on the task.
The approach is appropriate for incorporation into an MT system.
Issues introduced in the introduction: absence of short vowels (3), ambiguity resolution (4).

31 Questions?

