A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department.


1 A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department University of California - Riverside barai@cs.ucr.edu

2 Overview Pattern matching method for compiling a bilingual lexicon of nouns Information tagging of languages Word frequency and position information for low and high frequency words are represented in two forms for pattern matching Anchor points and noise elimination techniques are introduced Compilation of domain-specific noun phrases

3 Bilingual lexicon compilation without sentence alignment Automatically compiling a bilingual lexicon of nouns and proper nouns can contribute significantly to breaking bottlenecks: –Machine translation –Machine-aided translation Domain-specific terms are hard to translate because they often do not appear in dictionaries

4 Algorithm Abstract 1. Tag the English half of the parallel text 2. Compute the positional difference vector of each word 3. Match pairs of positional difference vectors, giving scores 4. Select a primary lexicon using the scores 5. Find anchor points using the primary lexicon 6. Compute a position binary vector for each word using the anchor points 7. Match binary vectors to yield a secondary lexicon

5 Problems with finding high frequency bilingual word pairs Sentence alignment between languages is not exact Chunks of text may appear in one language but not the other Dynamic Time Warping techniques may be used for pattern recognition but may be too slow.

6 Positional difference signals Sliding window

7 Solution to finding high frequency bilingual word pairs Tagging to identify nouns –Nouns tend to have consistent translations Positional difference vectors –A word and its translated counterparts usually have some correspondence to their frequency and positions but it may not be linear Matching positional difference vectors –DTW was found to be a good way to match word vectors of shifted or warped forms Statistical filters
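The matching step on this slide can be illustrated with a textbook dynamic time warping (DTW) cost. This is a minimal sketch, not the scoring used in the talk: the function name `dtw_distance`, the absolute-difference local cost, and the sample vectors are all assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Textbook DTW cost between two numeric sequences (assumed local cost: |x - y|)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: diagonal match, step in a, step in b
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    return cost[n][m]


# A stretched copy of a vector still aligns at zero cost,
# while a shifted-in-value vector does not.
warped = dtw_distance([1, 2, 3], [1, 2, 2, 3])
different = dtw_distance([1, 2, 3], [2, 3, 4])
```

A lower cost suggests that two words follow a similar occurrence pattern even when one text is stretched relative to the other, which is the property the slide attributes to DTW.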

8 Positional difference vectors Each word is represented as a binary variable –1 = word/phrase match –0 = word/phrase non-match Corpora represented as a bit vector/string –Example: Noun: water Text: “I like to drink water. Water is …” Bit String: “0000110…”
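The "water" example can be reproduced with a short sketch. The helper names `occurrence_bits` and `positional_differences` are hypothetical, the tokenizer is a simple regular expression, and reading "positional difference" as the gap between successive occurrences is an assumption, since the slide does not define it.

```python
import re


def occurrence_bits(text, word):
    """One bit per token: 1 where the token equals the word (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    target = word.lower()
    return [1 if tok == target else 0 for tok in tokens]


def positional_differences(bits):
    """Gaps between successive occurrences (assumed meaning of 'positional difference')."""
    positions = [i for i, bit in enumerate(bits) if bit]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


bits = occurrence_bits("I like to drink water. Water is", "water")
bit_string = "".join(str(bit) for bit in bits)  # "0000110", as on the slide
gaps = positional_differences(bits)             # [1]: the two occurrences are adjacent
```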

9 Statistical filters –To improve computation speed, use Euclidean distance to measure pairs If the distance is higher than a certain threshold, filter the pair out Look only at the Euclidean distances of the means and standard deviations: –Low frequency words are not considered
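A pre-filter of this kind might look like the sketch below. It assumes each word is summarized by the mean and population standard deviation of its positional difference vector; `passes_filter` and the threshold value are hypothetical, since the slide gives no number.

```python
import math
import statistics


def summary(vec):
    """(mean, population standard deviation) of a positional difference vector."""
    return statistics.mean(vec), statistics.pstdev(vec)


def passes_filter(vec_a, vec_b, threshold):
    """Keep a candidate pair only if its summary statistics are close enough."""
    mean_a, std_a = summary(vec_a)
    mean_b, std_b = summary(vec_b)
    # Euclidean distance between the two (mean, std) points
    distance = math.hypot(mean_a - mean_b, std_a - std_b)
    return distance <= threshold


# The threshold 1.0 is an illustrative assumption, not a value from the talk.
keep = passes_filter([10, 12, 11], [11, 12, 10], threshold=1.0)
drop = passes_filter([10, 12, 11], [100, 300, 200], threshold=1.0)
```

Only pairs that survive this cheap check would then go through the more expensive DTW comparison.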

10 Finding low frequency bilingual word pairs Secondary lexicons need to be computed Find anchor points on the DTW paths which divide the texts into multiple aligned segments Anchor points are more reliable than tracking all of the words in a given text Eliminate noise by keeping highly reliable points and discarding the rest

11 Dynamic time warping path The line can be thought of as a text alignment path Its departure from the diagonal illustrates that the texts of this corpus are not identical and not linearly aligned
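An alignment path like the one plotted on this slide can be recovered by backtracking through the DTW cost table. The sketch below uses the textbook formulation with an assumed absolute-difference cost; `dtw_path` is a hypothetical name, and the talk's own path reconstruction may differ.

```python
def dtw_path(a, b):
    """Return the DTW alignment path as a list of (index_in_a, index_in_b) pairs."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Walk back from the end, always moving to the cheapest predecessor cell.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (cost[i - 1][j], i - 1, j),          # advance in a only
            (cost[i][j - 1], i, j - 1),          # advance in b only
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


path = dtw_path([1, 2, 3], [1, 2, 2, 3])
```

For identical sequences the path is the diagonal; the repeated value in the second sequence above pulls the path off the diagonal at that point, which mirrors the departure described on the slide.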

12 DTW path reconstruction & anchor points obtained

13 Unsupervised algorithm The constraints in the conditions below are chosen roughly in proportion to the corpus size so that the filtered picture looks close to a clean, diagonal line If chosen then supervised scenario

14 Finding low frequency bilingual word pairs Many nouns and proper nouns were not translated in the previous stages of the algorithm Frequency too low Non-linear segment binary vectors –Represent positional and frequency information of low frequency words by a binary vector for fast matching –Segments are smaller than the entire text –Example: the lexicon for the word “prosperity” Position Vectors: –English: –Chinese: Find the segments in which each occurs: –English: i = 20, 27, 41, 47, 193, 321, 360 –Chinese: i = 14, 29, 41, 47, 193, 275, 321, 360
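One way to build such a segment binary vector is sketched below, assuming the anchor points are given as sorted word positions that cut the text into consecutive segments. The function name and the sample positions are hypothetical; the slide's own position vectors for "prosperity" are not shown, so they are not reproduced here.

```python
from bisect import bisect_right


def segment_binary_vector(word_positions, anchor_positions):
    """Bit i is 1 if the word occurs anywhere in segment i.

    anchor_positions must be sorted; k anchors split the text into k + 1 segments.
    """
    vector = [0] * (len(anchor_positions) + 1)
    for pos in word_positions:
        # Number of anchors at or before pos = index of the segment containing pos
        vector[bisect_right(anchor_positions, pos)] = 1
    return vector


# Illustrative values only: anchors at word positions 10, 20 and 30.
vector = segment_binary_vector([3, 25, 27], [10, 20, 30])
segments = [i for i, bit in enumerate(vector) if bit]
```

The list of segment indices plays the role of the "i = ..." lists on the slide: two words are candidate translations when their lists overlap heavily.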

15 Finding low frequency bilingual word pairs Binary vector correlation measure –Confidence measure: –Example: From previous … Equation: m = mutual information score t = confidence measure
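The equations did not survive in this transcript, so the sketch below falls back on commonly used forms as an assumption: pointwise mutual information for m and a t-score for t, both estimated from segment co-occurrence. The total segment count `n_segments=400` is a made-up value, and the talk's actual formulas may differ.

```python
import math


def correlation_scores(segs_a, segs_b, n_segments):
    """Return (m, t) for two words given the segments in which each occurs.

    Assumed definitions (not confirmed by the slide):
      m = log2( P(a, b) / (P(a) * P(b)) )
      t = (P(a, b) - P(a) * P(b)) / sqrt(P(a, b) / n_segments)
    """
    a, b = set(segs_a), set(segs_b)
    p_a = len(a) / n_segments
    p_b = len(b) / n_segments
    p_ab = len(a & b) / n_segments
    if p_ab == 0:
        # The words never share a segment: no evidence of correspondence.
        return float("-inf"), float("-inf")
    m = math.log2(p_ab / (p_a * p_b))
    t = (p_ab - p_a * p_b) / math.sqrt(p_ab / n_segments)
    return m, t


# Segment lists from the "prosperity" example; n_segments is hypothetical.
english = [20, 27, 41, 47, 193, 321, 360]
chinese = [14, 29, 41, 47, 193, 275, 321, 360]
m, t = correlation_scores(english, chinese, n_segments=400)
```

Under these assumed definitions, the five shared segments (41, 47, 193, 321, 360) give positive values for both scores, which is the pattern expected of a true translation pair.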

16 Methods/Results Evaluation by three human judges (E1 - E3) –E1 = Cantonese –E2 = Mandarin –E3 = Both Languages –Accuracy: Algorithm: 73.1% Human Judges: 66.0 – 87.5%

17 Results (Cont.) Finding Chinese words Compound noun translations Slang Collocations e.g. houses & housing project Proper names e.g. Benjamin Arai Tagging errors caused translation mistakes Many mistakes due to insufficient data

18 Summary The algorithm bypasses the sentence alignment step to find a bilingual lexicon of nouns and proper nouns Compared to other word alignment algorithms, it does not need a priori information

19 Future work The automated searching of valid lexicon matches has great potential for language translation –Noun and proper noun matching using subsets of bit vectors –Noun and proper noun filtering for translation gaps Automated noun phrase and compound word identification has potential for increasing lexicon matching accuracy Increase total text translation accuracy without human intervention

20 Any questions?

