
1 Fuzzy Translation of Cross-Lingual Spelling Variants (SIGIR’03)
Intelligent Database Systems Lab, Department of Information Management
國立雲林科技大學 National Yunlin University of Science and Technology
Advisor: Dr. Hsu
Student: Sheng-Hsuan Wang

2 Outline
Motivation
Objective
Introduction
Method & Data
Findings
Discussion & Conclusions

3 Motivation
CLIR (cross-language information retrieval) performance is limited.
Some terms are not in translation dictionaries.
Fuzzy matching, e.g. the n-gram method.

4 Objective
A two-step fuzzy translation technique for cross-lingual spelling variants, to improve CLIR performance:
1. Transformation rule based translation (TRT).
2. Translation of the intermediate forms into the target language using fuzzy matching.

5 Introduction
Technical terms and proper names are important text elements, but are not generally found in the electronic translation dictionaries used by MT and CLIR.
Non-identical translatable spelling variant forms, e.g. Chernobyl – Tshernobyl.
Similarity measure
N-gram
Fuzzy matching
Transliteration

6 Introduction
The technique in this paper: transformation rule based translation (TRT).
Close to transliteration, but with no phonetic elements.
Suitable for cross-lingual spelling variants.
Example: Spanish embriologia => English embryology (i => y, ia => y).
Problem: how can such rules be found automatically?
Equivalent term pairs are extracted from a translation dictionary and aligned pairwise.
Edit distance.

7 Introduction: two-step fuzzy translation
1. Source words are translated into intermediate forms based on TRT, in order to render a source word more similar to its target equivalent.
2. The intermediate forms are translated into target language equivalents through approximate string matching, i.e. fuzzy matching (n-gram based matching).

8 Method & Data - Overview
[Diagram: Translation dictionary -> TRT -> Intermediate form -> N-gram matching. Sample candidate pairs: (emcriologia, embryology), (emariolagia, embryology), (embrialagia, embryology), …]
Translation strategies: high confidence factor (HCF), low confidence factor (LCF).
Example: konvektio => convection, using the rules o -> on (end), ko -> co (beginning), ekt -> ect (middle).

9 Method & Data - TRT
Candidate pairs from the translation dictionary: (emcriologia, embryology), (emariolagia, embryology), (embrialagia, embryology), …
Edit distance error values:
0: the same character at the same position
1: consonant-consonant or vowel-vowel substitution
1: insertion or deletion of a character
2: consonant-vowel or vowel-consonant substitution
Selection of proper terms by error value: pairs within the edit distance threshold are kept, e.g. (embriologia, embryology), (embriolagia, embryology), (embrialagia, embryology), …, and the pair with the minimum ED is chosen.
The one transformation with the smallest sum of error values is selected.
Rule: on -> o ugh n at middle position
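The error values listed on this slide can be turned into a weighted edit distance. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the vowel set (including "y") is an assumption that would need adjusting per source language.

```python
# Assumption: which letters count as vowels is not stated on the slide.
VOWELS = set("aeiouy")


def substitution_cost(a: str, b: str) -> int:
    """Error value for replacing character a with character b."""
    if a == b:
        return 0  # same character at the same position
    same_class = (a in VOWELS) == (b in VOWELS)
    return 1 if same_class else 2  # same class costs 1, mixed class costs 2


def weighted_edit_distance(source: str, target: str) -> int:
    """Dynamic-programming edit distance using the slide's error values."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i  # i deletions, each with error value 1
    for j in range(1, cols):
        d[0][j] = j  # j insertions, each with error value 1
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1,  # deletion
                d[i][j - 1] + 1,  # insertion
                d[i - 1][j - 1] + substitution_cost(source[i - 1], target[j - 1]),
            )
    return d[rows - 1][cols - 1]


print(weighted_edit_distance("embriologia", "embryology"))  # 3
```

Among several candidate pairs for the same target word, the pair with the smallest value would be the one kept for rule generation.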

10 Transformation Rule based Translation
Edit distance
Automatic generation of rules
  Extracting similar terms from a dictionary with an edit distance threshold.
  Selecting the proper terms with the smallest sum of error values.
  Generating the transformation rules.
Context information, frequency, and confidence factor
Sample rules
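To illustrate how rules might be generated from aligned term pairs, here is a rough sketch. It swaps in Python's `difflib.SequenceMatcher` for the edit-distance alignment described on the slides, omits the context characters that real rules carry, and does not compute a confidence factor. The function names and the second word pair are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_rules(source: str, target: str):
    """Return (source substring, target substring, position) for each difference."""
    rules = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical stretches produce no rule
        if i1 == 0:
            position = "beginning"
        elif i2 == len(source):
            position = "end"
        else:
            position = "middle"
        rules.append((source[i1:i2], target[j1:j2], position))
    return rules


def count_rules(pairs):
    """Count how often each rule occurs across all aligned term pairs."""
    counts = Counter()
    for source, target in pairs:
        counts.update(extract_rules(source, target))
    return counts


print(extract_rules("embriologia", "embryology"))
# [('i', 'y', 'middle'), ('ia', 'y', 'end')]
```

The frequency counts from `count_rules` are the kind of statistic a frequency threshold could later be applied to.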

11 Edit Distance
$$ED(A, B) = \min\{N_{sub} + N_{ins} + N_{del}\}$$
$$d[i, j] = \min\{\, d[i-1, j] + 1,\; d[i, j-1] + 1,\; d[i-1, j-1] + cost \,\}$$
where cost = 0 if A[i] = B[j], and cost = 1 if A[i] ≠ B[j].
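The recurrence can be computed row by row. This is a minimal sketch of the standard unit-cost edit distance; the function name is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance: substitutions, insertions, deletions each cost 1."""
    previous = list(range(len(b) + 1))  # row d[0][*]: j insertions
    for i, char_a in enumerate(a, start=1):
        current = [i]  # d[i][0]: i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # d[i-1, j] + 1 (deletion)
                    current[j - 1] + 1,      # d[i, j-1] + 1 (insertion)
                    previous[j - 1] + cost,  # d[i-1, j-1] + cost
                )
            )
        previous = current
    return previous[-1]


print(edit_distance("chernobyl", "tshernobyl"))  # 2
```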

12 A sample of Spanish-to-English rules

13 Translation Resources
Multilingual medical dictionary by Andre Fairchild.
A Finnish list of medical terms (n=5970)
A Swedish list of medical terms (n=657)
Language pairs: Finnish-English, French-English, German-English, Spanish-English, Swedish-English

14 Target Word List and Source Words
Target word list: the index of CLEF’s LA Times collection, which contains 189000 words.
Source words
  First source word list: 217 word tuples (72 training word tuples, 145 test word tuples).
  Second source word list: 126 test word tuples.
Experiments dataset: 5 (languages) * (145 + 126) words = 1355 words

15 N-gram Matching
Similarity measure between the source and target words w1 and w2, where Ni refers to the set of n-grams derived from the word wi.
Digrams vs. trigrams: trigrams generally performed worse than digrams, but sometimes gave better results.
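The similarity formula itself is not present on this slide. As an illustration only, the sketch below assumes a set-overlap measure, |N1 ∩ N2| / |N1 ∪ N2|, over unpadded n-grams. The function names and the small target list are hypothetical.

```python
def ngrams(word: str, n: int = 2) -> set:
    """Set of character n-grams of a word (no padding, an assumption)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def sim(w1: str, w2: str, n: int = 2) -> float:
    """Assumed similarity: shared n-grams divided by all distinct n-grams."""
    n1, n2 = ngrams(w1, n), ngrams(w2, n)
    if not n1 and not n2:
        return 0.0  # words shorter than n have no n-grams
    return len(n1 & n2) / len(n1 | n2)


def rank_targets(source: str, targets, n: int = 2):
    """Sort target words by decreasing similarity to the source word."""
    return sorted(targets, key=lambda t: sim(source, t, n), reverse=True)


targets = ["connection", "election", "convection"]
print(rank_targets("convektion", targets))
# ['convection', 'connection', 'election']
```

Passing `n=3` switches the same functions from digrams to trigrams.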

16 Translation Strategies - High confidence factor (HCF) strategy
A relatively high confidence factor threshold, 50%, is used to minimize the number of incorrect transformations.
Reading order of the rules:
  Location of the rule in the source word: end, then beginning, then middle.
  Source string length: the longest first.
  Confidence factor: the highest first.
Example: konvektio => convection, via o -> on (end), ko -> co (beginning), ekt -> ect (middle).
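The reading order can be sketched as a sort followed by sequential rule application. This is an illustrative sketch under assumptions: the `Rule` structure, the function names, and all confidence values are hypothetical, and each rule is applied to the output of the previous one, which the slide does not specify.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    source: str
    target: str
    position: str      # "end", "beginning", or "middle"
    confidence: float  # hypothetical confidence factor in [0, 1]


POSITION_ORDER = {"end": 0, "beginning": 1, "middle": 2}


def apply_rule(word: str, rule: Rule) -> str:
    """Apply one rule at its position; return the word unchanged if it does not match."""
    if rule.position == "end":
        if word.endswith(rule.source):
            return word[:len(word) - len(rule.source)] + rule.target
    elif rule.position == "beginning":
        if word.startswith(rule.source):
            return rule.target + word[len(rule.source):]
    else:
        idx = word.find(rule.source, 1)  # skip a match at the very beginning
        if idx != -1 and idx + len(rule.source) < len(word):
            return word[:idx] + rule.target + word[idx + len(rule.source):]
    return word


def to_intermediate_form(word: str, rules: List[Rule], threshold: float = 0.5) -> str:
    """Apply rules at or above the threshold: end, beginning, middle; longest first; highest confidence first."""
    usable = [r for r in rules if r.confidence >= threshold]
    usable.sort(key=lambda r: (POSITION_ORDER[r.position], -len(r.source), -r.confidence))
    for rule in usable:
        word = apply_rule(word, rule)
    return word


rules = [
    Rule("ekt", "ect", "middle", 0.80),
    Rule("ko", "co", "beginning", 0.70),
    Rule("o", "on", "end", 0.60),
    Rule("v", "w", "middle", 0.20),  # below the 50% threshold, so skipped
]
print(to_intermediate_form("konvektio", rules))  # convection
```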

17 Translation Strategies - Low confidence factor (LCF) strategy
A confidence factor threshold of 10% was used to filter out unreliable rules.
Even more intermediate forms were obtained, but some of them may be incorrect transformations.
In both HCF and LCF, the rules whose frequency was < 50 were removed.
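The two strategies differ only in the confidence factor threshold, which a small filter can make concrete. This sketch uses invented rule records and values; treating the thresholds as inclusive is also an assumption.

```python
HCF_THRESHOLD = 0.50  # high confidence factor strategy
LCF_THRESHOLD = 0.10  # low confidence factor strategy
MIN_FREQUENCY = 50    # rules with frequency < 50 are removed in both strategies


def select_rules(rules, cf_threshold, min_frequency=MIN_FREQUENCY):
    """Keep rules meeting both the confidence factor and the frequency threshold."""
    return [
        r for r in rules
        if r["cf"] >= cf_threshold and r["frequency"] >= min_frequency
    ]


# Hypothetical rule records, for illustration only.
rules = [
    {"rule": "ko -> co (beginning)", "cf": 0.72, "frequency": 120},
    {"rule": "ekt -> ect (middle)", "cf": 0.35, "frequency": 80},
    {"rule": "v -> w (middle)", "cf": 0.05, "frequency": 300},
    {"rule": "o -> on (end)", "cf": 0.90, "frequency": 12},
]

hcf_rules = select_rules(rules, HCF_THRESHOLD)
lcf_rules = select_rules(rules, LCF_THRESHOLD)
print(len(hcf_rules), len(lcf_rules))  # 1 2
```

LCF admits every rule that HCF admits plus the less reliable ones, which is why it yields more intermediate forms.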

18 Evaluation
For each word, precision was calculated by considering the position of the correct equivalent (pce) in the ranked result list of n-gram matching.
When several words share the same SIM value:
  Worst position precision: the correct equivalent is counted as the last word of the set.
  Average position precision: the correct equivalent is counted as the middle of the set of words.
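The handling of tied SIM values can be sketched as follows. This only computes the two positions; how a position is converted into a precision value is not stated on the slide. The function name and scores are hypothetical, and "middle of the set" is interpreted here as the mean of the first and last tied ranks.

```python
def tied_positions(scores: dict, correct: str):
    """Return (worst, average) 1-based rank of the correct word among ties."""
    target = scores[correct]
    higher = sum(1 for s in scores.values() if s > target)
    tied = sum(1 for s in scores.values() if s == target)  # includes the correct word
    best = higher + 1
    worst = higher + tied          # correct word counted as last of the tied set
    average = (best + worst) / 2   # assumed meaning of "middle of the set"
    return worst, average


# Hypothetical SIM values for five target words.
scores = {"a": 0.9, "b": 0.6, "c": 0.6, "d": 0.6, "e": 0.2}
print(tied_positions(scores, "c"))  # (4, 3.0)
```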

19 Findings
Test word types (four subject categories plus miscellaneous)
  Medical, biological, and chemical terms (Bio terms), n=90
  Place names, n=55
  Economics, n=31
  Technology, n=36
  Miscellaneous, n=59
Five language pairs: Finnish-English, French-English, German-English, Spanish-English, Swedish-English

20 Findings – 1/3

21 Findings – 2/3

22 Findings – 3/3

23 Discussion & Conclusion
Technical terms and proper names are often untranslatable due to the limited coverage of translation dictionaries.
This study: two-step fuzzy translation
  Automatically generated transformation rules (TRT)
  Fuzzy matching
Two translation strategies were tested: HCF and LCF.
Digram and trigram matching were tested in combination with TRT.

24 Discussion & Conclusion
The effectiveness of fuzzy translation depends on:
  The frequency of identical terms shared by a source and a target language.
  The extent of variation in the spelling variants between a source and a target language.
Fuzzy translation is well suited for language pairs with a high percentage of similar but non-identical terms.

25 Personal opinion
How can we apply these ideas in our lab?
TRT?

