1 Duluth Word Alignment System
Bridget Thomson McInnes, Ted Pedersen
University of Minnesota Duluth, Computer Science Department
31 May 2003



2 Duluth Word Alignment System
Perl implementation of IBM Model 2
Learns a probabilistic model from sentence-aligned parallel corpora
– The parallel text consists of a source language text and its translation into some target language
– Determines the word alignments of the sentence pairs
Missing data problem
– No examples of word alignments in the training data
– Use the Expectation Maximization (EM) Algorithm
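The missing data problem above can be illustrated with a small EM loop. This is only a sketch in Python, not the Perl system described in the slides, and it is simplified to the translation probabilities alone (IBM Model 1 style, with no position term and no null word). The function name `train_translation_probs` and the toy sentence pairs are assumptions made for illustration.

```python
from collections import defaultdict

def train_translation_probs(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM from sentence pairs.

    pairs: list of (source_words, target_words) tuples.
    Simplification: translation probabilities only, no position model.
    """
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    # t[(target_word, source_word)] starts out uniform.
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (target, source) links
        total = defaultdict(float)   # expected count of links per source word
        # E-step: spread each target word fractionally over the source words.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(float, {(tw, sw): c / total[sw]
                                for (tw, sw), c in count.items()})
    return t

# Toy data (hypothetical): no word alignments are given, only sentence pairs.
pairs = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
t = train_translation_probs(pairs)
```

Because "la" co-occurs with "the" in both pairs, the expected counts shift toward that link over the iterations even though no alignment was ever observed.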

3 IBM Model 2
Takes into account
– The probability of the two words being translations of each other
– How likely it is for words at particular positions in a sentence pair to be alignments of each other
Example
La(1) grande(2) maison(3)
The(1) big(2) house(3)
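A minimal sketch of how the two factors can be combined: each candidate link is scored as a translation probability times a position probability, and the best-scoring source position is kept for every target word. This is written in Python for illustration and is not the authors' Perl code. The function `best_alignment`, the table layouts, and all probability values are made up for the example.

```python
def best_alignment(src, tgt, t, a):
    """Pick, for each target position j, the source position i that
    maximizes t(tgt_j | src_i) * a(i | j, len(src), len(tgt)).

    t: dict keyed by (target_word, source_word)
    a: dict keyed by (i, j, source_length, target_length), 1-based positions
    """
    l, m = len(src), len(tgt)
    links = []
    for j, tw in enumerate(tgt, start=1):
        scores = {
            i: t.get((tw, sw), 0.0) * a.get((i, j, l, m), 0.0)
            for i, sw in enumerate(src, start=1)
        }
        links.append((max(scores, key=scores.get), j))
    return links

src = ["La", "grande", "maison"]
tgt = ["The", "big", "house"]

# Made-up translation probabilities; "big" is deliberately ambiguous.
t = {
    ("The", "La"): 0.7,
    ("big", "grande"): 0.4,
    ("big", "maison"): 0.4,
    ("house", "grande"): 0.1,
    ("house", "maison"): 0.5,
}
# Made-up position probabilities that favor the diagonal.
a = {(i, j, 3, 3): (0.6 if i == j else 0.2)
     for i in range(1, 4) for j in range(1, 4)}

links = best_alignment(src, tgt, t, a)
print(links)  # [(1, 1), (2, 2), (3, 3)]
```

In this toy case the translation table alone cannot decide whether "big" goes with "grande" or "maison"; the position term breaks the tie.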

4 Distortion Factor
How far away from the original (source) position can the word move
Example:
Source sentence : 1 2 3 4 5 6 7
Target sentence : 1 2 3 4 5 6 7
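One way to read the distortion factor is as a window on positions: a word at position j may only align to positions within d of j. The slides do not give the exact definition, so the absolute-difference window below is an assumption, and `candidate_positions` is a hypothetical helper written in Python for illustration.

```python
def candidate_positions(j, src_len, distortion):
    """Return the 1-based source positions allowed for target position j,
    assuming a window of |i - j| <= distortion (an assumed definition).

    The list is empty when j lies more than `distortion` past the end
    of the source sentence.
    """
    low = max(1, j - distortion)
    high = min(src_len, j + distortion)
    return list(range(low, high + 1))

print(candidate_positions(4, 7, 0))  # [4]
print(candidate_positions(4, 7, 2))  # [2, 3, 4, 5, 6]
```

Under this reading, a distortion factor of 0 forces every word to stay in place, and a factor of 6 on a seven-word sentence allows any position.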

5 Types of Alignments
Sure and Probable alignments
– Sure : Alignment judged to be very likely
– Probable : Alignment judged to be less certain
– Our system does not make this distinction; we take the highest-scoring alignment regardless of its value
No-null and Null alignments
– Null alignments : source words that do not align to any word in the target sentence
– Our system does not include null alignments
One-to-One and One-to-Many alignments
– Our system includes one-to-many as well as one-to-one alignments

6 Alignments
[Figure: source words S1 S2 S3 S4 S5 linked to target words T1 T2 T3 T4 T5, illustrating One to One, One to Many and Many to One alignments]
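The three alignment types can be told apart by counting how many links each word takes part in. The sketch below, in Python, is an illustration rather than part of the Duluth system; `classify_links` and the example links are hypothetical.

```python
from collections import defaultdict

def classify_links(links):
    """Label each (source_pos, target_pos) link by its alignment type."""
    by_source = defaultdict(set)  # source position -> linked target positions
    by_target = defaultdict(set)  # target position -> linked source positions
    for s, t in links:
        by_source[s].add(t)
        by_target[t].add(s)
    kinds = {}
    for s, t in links:
        many_targets = len(by_source[s]) > 1
        many_sources = len(by_target[t]) > 1
        if many_targets and many_sources:
            kinds[(s, t)] = "many-to-many"
        elif many_targets:
            kinds[(s, t)] = "one-to-many"
        elif many_sources:
            kinds[(s, t)] = "many-to-one"
        else:
            kinds[(s, t)] = "one-to-one"
    return kinds

# S1-T1 is one to one, S2 links to T2 and T3, S3 and S4 both link to T4.
kinds = classify_links([(1, 1), (2, 2), (2, 3), (3, 4), (4, 4)])
```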

7 Data
English – French
– Trained on a 5% subset of the Aligned Hansards of the 36th Parliament of Canada
  Approximately 50,000 of the 1,200,000 available sentence pairs
  Mixture of House and Senate debates
  We wanted to train the models on comparably sized data sets
– Tested on 447 manually word-aligned sentence pairs
Romanian – English
– Trained on all available training data (49,284 sentence pairs)
– Tested on 248 manually word-aligned sentence pairs

8 Results
Model      Precision  Recall  F-measure
UMD-RE-0   .41        .36     .39
UMD-RE-2   .53        .47     .50
UMD-RE-4   .55        .49     .51
UMD-RE-6   .54        .47     .50
UMD-EF-0   .43        .17     .24
UMD-EF-2   .53        .21     .30
UMD-EF-4   .54        .22     .31
UMD-EF-6   .55        .22     .31
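For reference, precision, recall and F-measure over alignment links can be computed as below. This Python sketch treats both the system output and the gold standard as plain sets of (source position, target position) links; it does not model the Sure/Probable distinction, and it is not the scoring script used to produce the table. The name `evaluate` and the example links are assumptions.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f_measure) for two collections of links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical case: every proposed link is right, but half the gold links are missed.
predicted = [(1, 1), (2, 2)]
gold = [(1, 1), (2, 2), (3, 2), (3, 3)]
precision, recall, f_measure = evaluate(predicted, gold)
```

Proposing fewer links than the gold standard contains caps recall no matter how accurate the proposed links are, which is the pattern visible in the UMD-EF rows.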

9 Precision and Recall Results
Precision for the two language pairs was similar
– This may reflect the fact that we used approximately the same amount of training data for each of the models
Recall for the English-French data was low
– This system does not find alignments in which many English words align to one French word
– This reduced the number of alignments made by the system relative to the number of alignments in the gold standard

10 Distortion Results
Precision and recall were not significantly affected by the distortion factor once it was greater than 0
– A distortion factor of 0 resulted in lower precision and recall than a distortion factor of 2, 4 or 6
– Distortion factors of 2, 4 and 6 resulted in approximately the same precision and recall values for each of the language pairs
– That distortion factors of 4 and 6 add no information beyond a distortion factor of 2 suggests that word movement is limited

11 Conclusions on Training Data
Small amount of training data
– We wanted to compare the Romanian-English and English-French results
– Although the Romanian-English data differed from the English-French data, the results were comparable
– We would like to increase the training data to determine how much the results could improve

12 Conclusions
Considering modifying the existing Perl implementation to allow for larger training data
Database approach
– Berkeley DB
– NDBM
Re-implementing the algorithm in Perl Data Language
– Perl module that is optimized for matrix and scientific computing

