Cognates and Word Alignment in Bitexts
Greg Kondrak, University of Alberta


2 Outline
Background
Improving LCSR
Cognates vs. word alignment links
Experiments & results

3 Motivation
Claim: words that are orthographically similar are more likely to be mutual translations than words that are not.
Reason: the existence of cognates, which are usually both orthographically and semantically similar.
Use: considering cognates can improve word alignment and translation models.

4 Objective
Evaluation of orthographic similarity measures in the context of word alignment in bitexts.

5 MT applications
sentence alignment
word alignment
improving translation models
inducing translation lexicons
aid in manual alignment

6 Cognates
Similar in orthography or pronunciation.
Often mutual translations.
May include:
–genetic cognates
–lexical loans
–names
–numbers
–punctuation

7 The task of cognate identification
Input: two words
Output: the likelihood that they are cognate
One method: compute their orthographic/phonetic/semantic similarity

8 Scope
The measures that we consider:
–are language-independent
–are orthography-based
–operate on the level of individual letters
–use a binary identity function between letters

9 Similarity measures
Prefix method
Dice coefficient
Longest Common Subsequence Ratio (LCSR)
Edit distance
Phonetic alignment
Many other methods

10 IDENT
1 if two words are identical, 0 otherwise
The simplest similarity measure
e.g. IDENT(colour, couleur) = 0

11 PREFIX
The ratio of the length of the longest common prefix of two words to the length of the longer word
e.g. PREFIX(colour, couleur) = 2/7 ≈ 0.29
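The PREFIX measure can be sketched in a few lines of Python (the function name `prefix_ratio` is my own, not from the talk):

```python
def prefix_ratio(x: str, y: str) -> float:
    """Length of the longest common prefix, divided by the length of the longer word."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n / max(len(x), len(y))

# PREFIX(colour, couleur): common prefix "co" (2 letters), longer word has 7
print(prefix_ratio("colour", "couleur"))  # 2/7 ≈ 0.29
```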

12 DICE coefficient
Twice the number of shared letter bigrams, divided by the total number of letter bigrams in both words
e.g. DICE(colour, couleur) = 6/11 ≈ 0.55
colour: co ol lo ou ur; couleur: co ou ul le eu ur
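A minimal sketch of the Dice coefficient over letter bigrams (the function name `dice` is illustrative):

```python
from collections import Counter

def dice(x: str, y: str) -> float:
    """Twice the number of shared letter bigrams, divided by the total bigram count."""
    bx = Counter(x[i:i + 2] for i in range(len(x) - 1))
    by = Counter(y[i:i + 2] for i in range(len(y) - 1))
    shared = sum((bx & by).values())  # bigrams common to both words, with multiplicity
    return 2 * shared / (sum(bx.values()) + sum(by.values()))

# colour -> co ol lo ou ur (5); couleur -> co ou ul le eu ur (6)
# shared: co, ou, ur  =>  2*3 / (5+6) = 6/11
print(dice("colour", "couleur"))  # ≈ 0.55
```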

13 Longest Common Subsequence Ratio (LCSR)
The ratio of the length of the longest common subsequence of two words to the length of the longer word.
e.g. LCSR(colour, couleur) = 5/7 ≈ 0.71 (the longest common subsequence is c-o-l-u-r)
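LCSR reduces to the textbook longest-common-subsequence dynamic program; a minimal sketch (function names `lcs_length` and `lcsr` are my own):

```python
def lcs_length(x: str, y: str) -> int:
    """Standard O(mn) dynamic program for the longest common subsequence length."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcsr(x: str, y: str) -> float:
    """LCS length divided by the length of the longer word."""
    return lcs_length(x, y) / max(len(x), len(y))

# LCS(colour, couleur) = "colur", so LCSR = 5/7
print(lcsr("colour", "couleur"))  # ≈ 0.71
```

Running it on the next slide's examples reproduces the length-insensitivity problem: `lcsr("walls", "allés")` and `lcsr("sanctuary", "sanctuaire")` both come out to 0.8.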

14 LCSR
Method of choice in several papers
Weak point: insensitive to word length
Example:
–LCSR(walls, allés) = 0.8
–LCSR(sanctuary, sanctuaire) = 0.8
Sometimes a minimum word length is imposed
Is there a principled solution?

15 The random model
Assumption: strings are generated randomly from a given distribution of letters.
Problem: what is the probability of seeing k matches between two strings of length m and n?

16 A special case
Assumption: k = 0 (no matches)
t – alphabet size
S(n,i) – Stirling number of the second kind

17 The problem
What is the probability of seeing k matches between two strings of length m and n?
An exact analytical formula is unlikely to exist.
A very similar problem has been studied in bioinformatics as the statistical significance of alignment scores.
Approximations developed in bioinformatics are not applicable to words because of length differences.

18 Solutions for the general case
Sampling
–Not reliable for small probability values
–Works well for low k/n ratios (uninteresting)
–Depends on a given alphabet size and letter frequencies
–No insight
Inexact approximation
–Works well for high k/n ratios (interesting)
–Easy to use

19 Formula 1
– probability of a match

20 Formula 1
Exact for k = m = n
Inexact in general
Reason: an implicit independence assumption
A lower bound for the actual probability
A good approximation for high k/n ratios
Runs into numerical problems for larger n

21 Formula 2
The expected number of pairs of k-letter substrings.
Approximates the required probability for high k/n ratios.

22 Formula 2
Does not work for low k/n ratios.
Not monotonic.
Simpler than Formula 1.
More robust against numerical underflow for very long words.

23 Comparison of the two formulas
Both are exact for k = m = n
For k close to max(m,n):
–both formulas are good approximations
–their values are very close
Both can be computed quickly using dynamic programming.

24 LCSF
A new similarity measure based on Formula 2.
LCSR(X,Y) = k/n
LCSF(X,Y) =
LCSF is as fast as LCSR, because the values, which depend only on k and n, can be pre-computed and stored.

25 Evaluation - motivation
Intrinsic evaluation of orthographic similarity is difficult and subjective.
My idea: extrinsic evaluation on cognates and word-aligned bitexts.
–Most cross-language cognates are orthographically similar, and vice versa.
–Cognation is binary and not subjective.

26 Cognates vs. alignment links
Manual identification of cognates is tedious.
Manually word-aligned bitexts are available, but only some of the links are between cognates.
Question #1: can we use manually-constructed word alignment links instead?

27 Manual vs. automatic alignment links
Automatically word-aligned bitexts are easily obtainable, but a good fraction of the links are wrong.
Question #2: can we use machine-generated word alignment links instead?

28 Evaluation methodology
Assumption: a word-aligned bitext
Treat aligned sentences as bags of words
Compute similarity for all word pairs
Order word pairs by their similarity value
Compute precision against a gold standard
–either a cognate list or alignment links
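The steps above can be sketched as follows; `ranked_precision` and the similarity function passed to it are illustrative names, not from the talk:

```python
from itertools import product

def ranked_precision(sentence_pairs, gold_pairs, similarity, top_n):
    """Score every cross-language word pair within each aligned sentence pair,
    rank all pairs by similarity, and compute the precision of the top-n pairs
    against a gold standard (a cognate list or a set of alignment links)."""
    scored = []
    for src_sent, tgt_sent in sentence_pairs:
        # treat each aligned sentence pair as two bags of words
        for pair in product(set(src_sent), set(tgt_sent)):
            scored.append((similarity(*pair), pair))
    scored.sort(reverse=True)  # most similar pairs first
    top = [pair for _, pair in scored[:top_n]]
    return sum(1 for pair in top if pair in gold_pairs) / top_n
```

Any of the measures above (PREFIX, DICE, LCSR, LCSF) can be plugged in as `similarity`, which is what makes this an extrinsic comparison of the measures.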

29 Test data
Blinker bitext (French-English)
–250 Bible verse pairs
–manual word alignment
–all cognates manually identified
Hansards (French-English)
–500 sentences
–manual and automatic word alignment
Romanian-English
–248 sentences
–manually aligned

30 Blinker results

31 Hansards results

32 Romanian-English results

33 Contributions
We showed that word alignment links can be used instead of cognates for evaluating word similarity measures.
We proposed a new similarity measure that outperforms LCSR.

34 Future work
Extend our length-normalization approach to edit distance and other similarity measures.
Incorporate cognate information into statistical MT models as an additional feature function.

35 Thank you

36 Applications
Recognition of cognates
–Historical linguistics
–Machine translation
–Sentence and word alignment
Confusable drug names
Edit distance tasks
–Spelling error correction

37 Improved word alignment quality
GIZA trained on 50,000 sentences from the Hansards.
Tested on 500 manually aligned sentences.
10% reduction in the error rate when cognates are added.

38 Blinker results

39 Problems with links (1)
The lion (1) killed enough for his cubs and strangled the prey for his mate (2) …
Le lion (1) déchirait pour ses petits, étranglait pour ses lionnes (2) …

40 Problems with links (2)
But let justice (1) roll on like a river, righteousness (2) like a never-failing stream.
Mais que la droiture (1) soit comme un courant de eau, et la justice (2) comme un torrent qui jamais ne tarit.