1 Cognates and Word Alignment in Bitexts
Greg Kondrak, University of Alberta

2 Outline
– Background
– Improving LCSR
– Cognates vs. word alignment links
– Experiments & results

3 Motivation
– Claim: words that are orthographically similar are more likely to be mutual translations than words that are not.
– Reason: the existence of cognates, which are usually both orthographically and semantically similar.
– Use: taking cognates into account can improve word alignment and translation models.

4 Objective
– Evaluation of orthographic similarity measures in the context of word alignment in bitexts.

5 MT applications
– sentence alignment
– word alignment
– improving translation models
– inducing translation lexicons
– aid in manual alignment

6 Cognates
– Similar in orthography or pronunciation.
– Often mutual translations.
– May include: genetic cognates, lexical loans, names, numbers, punctuation.

7 The task of cognate identification
– Input: two words
– Output: the likelihood that they are cognate
– One method: compute their orthographic/phonetic/semantic similarity

8 Scope
The measures we consider:
– are language-independent
– are orthography-based
– operate on the level of individual letters
– use a binary letter identity function

9 Similarity measures
– Prefix method
– Dice coefficient
– Longest Common Subsequence Ratio (LCSR)
– Edit distance
– Phonetic alignment
– Many other methods

10 IDENT
– 1 if the two words are identical, 0 otherwise
– The simplest similarity measure
– e.g. IDENT(colour, couleur) = 0

11 PREFIX
– The ratio of the length of the longest common prefix of the two words to the length of the longer word
– e.g. PREFIX(colour, couleur) = 2/7 ≈ 0.29
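
A minimal sketch of this measure (the function name and the handling of empty inputs are my own choices, not details from the talk):

```python
def prefix_sim(x: str, y: str) -> float:
    """PREFIX: length of the longest common prefix divided by the length of the longer word."""
    common = 0
    for a, b in zip(x, y):
        if a != b:
            break
        common += 1
    return common / max(len(x), len(y)) if (x or y) else 0.0

# Slide example: the shared prefix of "colour" and "couleur" is "co".
print(prefix_sim("colour", "couleur"))  # 2/7 ≈ 0.2857
```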

12 DICE coefficient
– Twice the number of shared letter bigrams divided by the total number of letter bigrams in the two words
– e.g. DICE(colour, couleur) = 2·3 / (5+6) = 6/11 ≈ 0.55
– bigrams of colour: co ol lo ou ur; bigrams of couleur: co ou ul le eu ur; shared: co, ou, ur
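
A minimal sketch using a multiset of bigrams; the slide does not say how repeated bigrams are treated, so counting them with multiplicity is an assumption:

```python
from collections import Counter

def bigrams(word: str) -> Counter:
    """Multiset of letter bigrams of a word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

def dice(x: str, y: str) -> float:
    """DICE: 2 * |shared bigrams| / (|bigrams of x| + |bigrams of y|).
    Repeated bigrams are counted with multiplicity here."""
    bx, by = bigrams(x), bigrams(y)
    shared = sum((bx & by).values())          # multiset intersection
    total = sum(bx.values()) + sum(by.values())
    return 2 * shared / total if total else 0.0

# Slide example: shared bigrams of "colour" and "couleur" are co, ou, ur.
print(dice("colour", "couleur"))  # 2*3 / (5+6) = 6/11 ≈ 0.545
```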

13 Longest Common Subsequence Ratio (LCSR)
– The ratio of the length of the longest common subsequence of the two words to the length of the longer word.
– e.g. LCSR(colour, couleur) = 5/7 ≈ 0.71 (the LCS is c-o-l-u-r)
– Alignment: co-lo-ur / coul-eur
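
LCSR can be computed with the textbook dynamic-programming recurrence for LCS length; the sketch below is illustrative, not the talk's own code:

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence, by standard dynamic programming."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcsr(x: str, y: str) -> float:
    """LCSR: LCS length divided by the length of the longer word."""
    if not x and not y:
        return 0.0
    return lcs_length(x, y) / max(len(x), len(y))

# Slide example: the LCS of "colour" and "couleur" is c-o-l-u-r (length 5).
print(lcsr("colour", "couleur"))  # 5/7 ≈ 0.714
```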

14 LCSR
– Method of choice in several papers
– Weak point: insensitive to word length
– Example: LCSR(walls, allés) = 0.8 and LCSR(sanctuary, sanctuaire) = 0.8, although the longer pair is much less likely to score so high by chance
– Sometimes a minimum word length is imposed
– Is there a principled solution?

15 The random model
– Assumption: strings are generated randomly from a given distribution of letters.
– Problem: what is the probability of seeing k matches between two strings of length m and n?

16 A special case
– Assumption: k = 0 (no matches)
– t: alphabet size
– S(n, i): Stirling number of the second kind

17 The problem
– What is the probability of seeing k matches between two strings of length m and n?
– An exact analytical formula is unlikely to exist.
– A very similar problem has been studied in bioinformatics as the statistical significance of alignment scores.
– Approximations developed in bioinformatics are not applicable to words because of length differences.

18 Solutions for the general case
Sampling (a minimal sketch follows this list):
– Not reliable for small probability values
– Works well only for low k/n ratios (the uninteresting case)
– Depends on a particular alphabet size and letter frequencies
– Provides no insight
Inexact approximation:
– Works well for high k/n ratios (the interesting case)
– Easy to use
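
Reusing lcs_length from the LCSR sketch above, the sampling approach can be illustrated with a small Monte Carlo estimate. The uniform letter distribution, alphabet, and trial count are illustrative assumptions, and "k matches" is read here as an LCS of length at least k:

```python
import random
import string

def sample_match_probability(m: int, n: int, k: int,
                             alphabet: str = string.ascii_lowercase,
                             trials: int = 100_000) -> float:
    """Monte Carlo estimate of P(LCS >= k) for two random strings of lengths
    m and n drawn uniformly from the given alphabet.  Unreliable when the true
    probability is very small, because few or no trials produce a hit."""
    hits = 0
    for _ in range(trials):
        x = "".join(random.choices(alphabet, k=m))
        y = "".join(random.choices(alphabet, k=n))
        if lcs_length(x, y) >= k:       # lcs_length from the LCSR sketch above
            hits += 1
    return hits / trials

# Illustrative call: how often do random 6- and 7-letter strings share an LCS of length 3?
print(sample_match_probability(6, 7, 3, trials=10_000))
```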

19 Formula 1 [equation shown as an image on the slide]
– probability of a match

20 Formula 1
– Exact for k = m = n
– Inexact in general
– Reason: an implicit independence assumption
– A lower bound for the actual probability
– A good approximation for high k/n ratios
– Runs into numerical problems for larger n

21 Formula 2
– The expected number of matching pairs of k-letter substrings.
– Approximates the required probability for high k/n ratios.

22 Formula 2
– Does not work for low k/n ratios.
– Not monotonic.
– Simpler than Formula 1.
– More robust against numerical underflow for very long words.
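
The equation for Formula 2 is not reproduced in this transcript. As a hedged illustration of the kind of quantity described: if "k-letter substrings" is read as k-letter subsequences (the objects an LCS is built from) and any two letters match independently with probability p, linearity of expectation gives C(m, k) · C(n, k) · p^k matching pairs on average. The sketch computes that quantity; the exact formula from the talk may differ.

```python
from math import comb

def expected_matching_pairs(m: int, n: int, k: int, p: float) -> float:
    """Expected number of pairs of k-letter subsequences (one from each of two
    random words of lengths m and n) that match position by position, when any
    two letters match with probability p.  A stand-in for the slide's Formula 2,
    whose exact form is not shown in this transcript."""
    return comb(m, k) * comb(n, k) * p ** k

# With a uniform 26-letter alphabet, p = 1/26.
p = 1 / 26
print(expected_matching_pairs(5, 5, 4, p))    # walls / allés case: ~5.5e-5
print(expected_matching_pairs(9, 10, 8, p))   # sanctuary / sanctuaire case: ~1.9e-9
```

Note how the expected count for the walls/allés case is orders of magnitude larger than for sanctuary/sanctuaire even though both pairs have LCSR 0.8, which is exactly the length sensitivity the random model is meant to capture.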

23 Comparison of the two formulas
– Both are exact for k = m = n.
– For k close to max(m, n), both formulas are good approximations and their values are very close.
– Both can be computed quickly using dynamic programming.

24 LCSF
– A new similarity measure based on Formula 2.
– LCSR(X, Y) = k/n, where k is the LCS length and n is the length of the longer word
– LCSF(X, Y) = [definition derived from Formula 2, shown as an equation on the slide]
– LCSF is as fast as LCSR because the values, which depend only on k and n, can be pre-computed and stored.
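
The transcript does not include the LCSF definition itself, only that it is derived from Formula 2 and that the values depending only on k and n can be cached. The sketch below illustrates that caching idea with a placeholder score (the negative log of the Formula-2-style expected count from the previous sketch, with both word lengths taken as n); this stand-in is my assumption, not the measure actually proposed in the talk.

```python
from functools import lru_cache
from math import comb, log

P_MATCH = 1 / 26   # illustrative match probability for a uniform 26-letter alphabet

@lru_cache(maxsize=None)
def lcsf_score(k: int, n: int) -> float:
    """Placeholder LCSF-style score depending only on k (LCS length) and
    n (length of the longer word): the negative log of the Formula-2-style
    expected count with both lengths set to n.  An assumed stand-in, not the
    talk's actual definition.  lru_cache plays the role of the precomputed
    (k, n) table mentioned on the slide."""
    expected = comb(n, k) * comb(n, k) * P_MATCH ** k
    return -log(expected) if expected > 0 else float("inf")

def lcsf(x: str, y: str) -> float:
    """Score a word pair using only the cached (k, n) values."""
    k = lcs_length(x, y)               # from the LCSR sketch above
    n = max(len(x), len(y))
    return lcsf_score(k, n)

# Unlike LCSR, this kind of score separates the two 0.8 cases from the earlier slide:
print(lcsf("walls", "allés"), lcsf("sanctuary", "sanctuaire"))
```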

25 Evaluation: motivation
– Intrinsic evaluation of orthographic similarity is difficult and subjective.
– My idea: extrinsic evaluation on cognates and on word-aligned bitexts.
– Most cross-language cognates are orthographically similar, and vice versa.
– Cognation is binary and not subjective.

26 Cognates vs. alignment links
– Manual identification of cognates is tedious.
– Manually word-aligned bitexts are available, but only some of the links are between cognates.
– Question #1: can we use manually constructed word alignment links instead?

27 Manual vs. automatic alignment links
– Automatically word-aligned bitexts are easily obtainable, but a good fraction of the links are wrong.
– Question #2: can we use machine-generated word alignment links instead?

28 Evaluation methodology
– Assumption: a word-aligned bitext
– Treat aligned sentences as bags of words
– Compute similarity for all word pairs
– Order word pairs by their similarity value
– Compute precision against a gold standard: either a cognate list or the alignment links
(a minimal sketch of this procedure follows below)
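
A compact sketch of this procedure, reusing the lcsr function from the earlier sketch; the whitespace tokenization, the top-n cutoff, and the toy data are illustrative assumptions, not details from the talk:

```python
def ranked_precision(sentence_pairs, gold_pairs, similarity, top_n):
    """Score all word pairs from aligned sentences (treated as bags of words),
    sort them by similarity, and compute precision of the top_n pairs against
    a gold-standard set of (source_word, target_word) pairs."""
    scored = []
    for src_sent, tgt_sent in sentence_pairs:
        for s in src_sent.split():
            for t in tgt_sent.split():
                scored.append((similarity(s, t), s, t))
    scored.sort(reverse=True)                      # highest similarity first
    top = scored[:top_n]
    correct = sum(1 for _, s, t in top if (s, t) in gold_pairs)
    return correct / len(top) if top else 0.0

# Toy usage with a single aligned sentence pair and a one-pair "gold standard".
sents = [("the colour of the wall", "la couleur de le mur")]
gold = {("colour", "couleur")}
print(ranked_precision(sents, gold, lcsr, top_n=1))   # 1.0 in this toy case
```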

29 Test data
– Blinker bitext (French-English): 250 Bible verse pairs; manual word alignment; all cognates manually identified
– Hansards (French-English): 500 sentences; manual and automatic word alignment
– Romanian-English: 248 sentences; manually aligned

30 Blinker results

31 Hansards results

32 Romanian-English results

33 Contributions
– We showed that word alignment links can be used instead of cognates for evaluating word similarity measures.
– We proposed a new similarity measure which outperforms LCSR.

34 Future work
– Extend our approach to length normalization to edit distance and other similarity measures.
– Incorporate cognate information into statistical MT models as an additional feature function.

35 Thank you

36 Applications
– Recognition of cognates: historical linguistics, machine translation, sentence and word alignment
– Confusable drug names
– Edit distance tasks: spelling error correction

37 Improved word alignment quality
– GIZA trained on 50,000 sentences from the Hansards.
– Tested on 500 manually aligned sentences.
– 10% reduction of the error rate when cognates are added.

38 Blinker results

39 Problems with links (1)
– The lion (1) killed enough for his cubs and strangled the prey for his mate (2) …
– Le lion (1) déchirait pour ses petits, étranglait pour ses lionnes (2) …

40 Problems with links (2)
– But let justice (1) roll on like a river, righteousness (2) like a never-failing stream.
– Mais que la droiture (1) soit comme un courant d'eau, et la justice (2) comme un torrent qui jamais ne tarit.

