A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department.

A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department University of California - Riverside barai@cs.ucr.edu

Overview
Pattern matching method for compiling a bilingual lexicon of nouns
Information tagging of languages
Word frequency and position information for low and high frequency words is represented in two forms for pattern matching
Anchor points and noise elimination techniques are introduced
Compilation of domain-specific noun phrases

Bilingual lexicon compilation without sentence alignment
Automatically compiling a bilingual lexicon of nouns and proper nouns can contribute significantly to breaking bottlenecks:
–Machine translation
–Machine-aided translation
Domain-specific terms are hard to translate because they often do not appear in dictionaries

Algorithm Abstract
1. Tag the English half of the parallel text
2. Compute the positional difference vector of each word
3. Match pairs of positional difference vectors, giving scores
4. Select a primary lexicon using the scores
5. Find anchor points using the primary lexicon
6. Compute a position binary vector for each word using the anchor points
7. Match binary vectors to yield a secondary lexicon

Problems with finding high frequency bilingual word pairs
Sentence alignment between languages is not exact
Chunks of text may appear in one language but not the other
Dynamic Time Warping (DTW) techniques may be used for pattern recognition but may be too slow

Positional difference signals Sliding window

Solution to finding high frequency bilingual word pairs
Tagging to identify nouns
–Nouns tend to have consistent translations
Positional difference vectors
–A word and its translated counterpart usually correspond in frequency and position, but the correspondence may not be linear
Matching positional difference vectors
–DTW was found to be a good way to match word vectors of shifted or warped forms
Statistical filters
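The matching step can be sketched with a textbook dynamic time warping distance. This is a minimal illustration, not the presented system: the function name `dtw_distance`, the absolute-difference local cost, and the toy vectors are all assumptions.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance between two numeric vectors.

    The local cost is the absolute difference (an assumption; other costs work too).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Toy vectors: the second is a "warped" copy of the first (one value repeated).
same = dtw_distance([1, 2, 3], [1, 2, 3])
warped = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

A warped copy scores as well as an identical one, which is why DTW tolerates text that is stretched or compressed in one language. The quadratic table is also why matching every candidate pair this way can be slow.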

Positional difference vectors
Each word is represented as a binary variable
–1 = word/phrase match
–0 = word/phrase non-match
Corpora represented as a bit vector/string
–Example:
Noun: water
Text: “I like to drink water. Water is …”
Bit String: “0000110…”
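A small sketch of the representation above, assuming lowercase whitespace-style tokenization. The helper names are hypothetical, the example sentence is extended for illustration, and the gap-based `positional_differences` is an assumed reading of the slide title; the slide itself only shows the bit-string form.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def bit_string(tokens, word):
    """One character per token: '1' where the token matches the word, else '0'."""
    return "".join("1" if t == word else "0" for t in tokens)


def positional_differences(tokens, word):
    """Gaps, in tokens, between successive occurrences of the word."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


tokens = tokenize("I like to drink water. Water is good, and water is cheap.")
bits = bit_string(tokens, "water")
diffs = positional_differences(tokens, "water")
```

The bit string keeps absolute positions, while the gap vector is shorter and insensitive to where in the text the first occurrence falls.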

Statistical filters
–To improve computation speed, use Euclidean distance to measure pairs
If the distance is higher than a certain threshold, filter the pair out
Look only at the Euclidean distances of the means and standard deviations:
–Low frequency words are not considered
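One way to read this filter, as a hedged sketch: summarize each word's vector by its mean and standard deviation, then discard a candidate pair when the Euclidean distance between the two summaries exceeds a threshold. The function names, the use of the population standard deviation, and the threshold value are assumptions.

```python
import math
from statistics import mean, pstdev


def summary(vec):
    """(mean, population standard deviation) of a word's vector."""
    return mean(vec), pstdev(vec)


def passes_filter(vec_a, vec_b, threshold):
    """Keep a pair only if its (mean, std) summaries are within the threshold."""
    mean_a, std_a = summary(vec_a)
    mean_b, std_b = summary(vec_b)
    return math.hypot(mean_a - mean_b, std_a - std_b) <= threshold


# Hypothetical vectors and threshold, for illustration only.
similar = passes_filter([10, 12, 11], [11, 12, 10], threshold=1.0)
different = passes_filter([10, 12, 11], [100, 300, 200], threshold=1.0)
```

Because the summaries cost only one pass per word, the filter can run before the more expensive DTW comparison and prune most pairs cheaply.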

Finding low frequency bilingual word pairs
Secondary lexicons need to be computed
Find anchor points on the DTW paths which divide the texts into multiple aligned segments
Anchor points are more reliable than tracking all of the words in a given text
Eliminate noise by keeping highly reliable points and discarding the rest

Dynamic time warping path
The line can be thought of as a text alignment path
Its departure from the diagonal illustrates that the texts of this corpus are neither identical nor linearly aligned
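To make the idea of an alignment path concrete, here is a minimal sketch that recovers the path by backtracking through a standard DTW cost table. It is an illustration under assumptions (absolute-difference cost, hypothetical name `dtw_path`, toy input), not the reconstruction procedure used in the presented work.

```python
def dtw_path(a, b):
    """Return the DTW alignment path as a list of (index_in_a, index_in_b) pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])
    # Walk back from the end, always moving to the cheapest predecessor cell.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        predecessors = [(cost[i - 1][j - 1], i - 1, j - 1),
                        (cost[i - 1][j], i - 1, j),
                        (cost[i][j - 1], i, j - 1)]
        _, i, j = min(predecessors)
    path.reverse()
    return path


straight = dtw_path([1, 2, 3], [1, 2, 3])
bent = dtw_path([1, 2, 3], [1, 2, 2, 3])
```

Identical inputs give a path that stays on the diagonal; the extra element in the second input forces one step off it, which is the kind of departure the slide describes.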

DTW path reconstruction & anchor points obtained

Unsupervised algorithm
The constraints in the conditions below are chosen roughly in proportion to the corpus size so that the filtered picture looks close to a clean, diagonal line
If chosen then supervised scenario

Finding low frequency bilingual word pairs
Many nouns and proper nouns were not translated in the previous stages of the algorithm
Frequency too low
Non-linear segment binary vectors
–Represent positional and frequency information of low frequency words by a binary vector for fast matching
–Segments are smaller than the entire text
–Example: The lexicon for the word “prosperity”
Position Vectors:
–English:
–Chinese:
Find the segments in which each occurs:
–English: i = 20, 27, 41, 47, 193, 321, 360
–Chinese: i = 14, 29, 41, 47, 193, 275, 321, 360
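A sketch of how segment binary vectors could be built, assuming anchor points are given as sorted word offsets that split a text into consecutive segments. The helper names, the toy anchors, and the total segment count of 400 are assumptions; only the two segment lists for "prosperity" come from the slide.

```python
from bisect import bisect_right


def segment_indices(word_positions, anchor_positions):
    """Map word positions to the sorted indices of the segments containing them."""
    return sorted({bisect_right(anchor_positions, p) for p in word_positions})


def binary_vector(segments, num_segments):
    """1 for every segment in which the word occurs, 0 elsewhere."""
    present = set(segments)
    return [1 if i in present else 0 for i in range(num_segments)]


# Hypothetical anchors at word offsets 10, 20 and 30 give four segments (0..3).
toy_segments = segment_indices([5, 12, 25, 27], [10, 20, 30])

# Segment indices for "prosperity" as listed on the slide.
english = [20, 27, 41, 47, 193, 321, 360]
chinese = [14, 29, 41, 47, 193, 275, 321, 360]
NUM_SEGMENTS = 400  # assumed; the slide does not give the segment count
english_vec = binary_vector(english, NUM_SEGMENTS)
chinese_vec = binary_vector(chinese, NUM_SEGMENTS)
shared = [i for i in range(NUM_SEGMENTS) if english_vec[i] and chinese_vec[i]]
```

For the slide's example the two vectors share five segments (41, 47, 193, 321, 360), which is the overlap a correlation measure can then score.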

Finding low frequency bilingual word pairs
Binary vector correlation measure
–Confidence measure:
–Example: From previous …
Equation:
m = mutual information score
t = confidence measure
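The slide's equation did not survive in this transcript, so the sketch below substitutes a common formulation rather than reproducing it: pointwise mutual information over segment co-occurrence for m, and a t-score approximation for t. The formulas, the function name, and the segment count K = 400 are all assumptions; the segment lists are the "prosperity" example from the previous slide.

```python
import math


def correlation_scores(segs_a, segs_b, num_segments):
    """Return (m, t) for two words given the segments each occurs in.

    m = log2( P(a, b) / (P(a) * P(b)) )            # pointwise mutual information
    t = (P(a, b) - P(a) * P(b)) / sqrt(P(a, b) / K)  # t-score approximation
    Both formulas are assumed stand-ins, not taken from the slide.
    """
    a, b = set(segs_a), set(segs_b)
    p_a = len(a) / num_segments
    p_b = len(b) / num_segments
    p_ab = len(a & b) / num_segments
    if p_ab == 0:
        # No shared segment: treat the pair as having no supporting evidence.
        return float("-inf"), float("-inf")
    m = math.log2(p_ab / (p_a * p_b))
    t = (p_ab - p_a * p_b) / math.sqrt(p_ab / num_segments)
    return m, t


english = [20, 27, 41, 47, 193, 321, 360]
chinese = [14, 29, 41, 47, 193, 275, 321, 360]
m, t = correlation_scores(english, chinese, num_segments=400)  # K is assumed
```

Using both scores together is a design choice worth noting: mutual information alone tends to overrate pairs that co-occur only once or twice, and a confidence score penalizes such thin evidence.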

Methods/Results
Evaluation by three human judges (E1 - E3)
–E1 = Cantonese
–E2 = Mandarin
–E3 = Both Languages
–Accuracy:
Algorithm: 73.1%
Human Judges: 66.0 – 87.5%

Results (Cont.)
Finding Chinese words
Compound noun translations
Slang
Collocations, e.g. houses & housing project
Proper names, e.g. Benjamin Arai
Tagging errors caused translation mistakes
Many mistakes due to insufficient data

Summary
The algorithm bypasses the sentence alignment step to find a bilingual lexicon of nouns and proper nouns
Compared to other word alignment algorithms, it does not need a priori information

Future work
The automated searching of valid lexicon matches has great potential for language translation
–Noun and proper noun matching using subsets of bit vectors
–Noun and proper noun filtering for translation gaps
Automated noun phrase and compound word identification has potential to increase lexicon matching accuracy
Increase total text translation accuracy without human intervention

Any questions?
