
1 Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology, Department of Information Management
Advisor: Dr. Hsu
Student: Sheng-Hsuan Wang
Acquisition of English-Japanese proper nouns from noisy-parallel newswire articles using KATAKANA matching
Toshiba Corp. R&D Center

2 Outline
Motivation, Objective, Introduction, Background, Method, Simulations, Discussion, Conclusion

3 Motivation
Limitation of statistical approaches

4 Objective
Superiority of linguistic approaches

5 Introduction
A tool for extracting bilingual knowledge from noisy-parallel English-Japanese text.
It relies on dynamic programming, phonetic similarities, and partial matching of English-Japanese word pairs.
It extracts a small, reliable bilingual lexicon of anchor points, which is then used to establish further bilingual correspondences.

6 Introduction
Types of bilingual knowledge acquisition from parallel corpora:
Statistical: internal distributional evidence of bilingual word pairs.
Linguistic: external evidence provided by bilingual lexicons, used to establish anchor points between pairs of bilingual phrases.

7 Background
The challenge of establishing a bilingual correspondence between English and Katakana:
English to Katakana loses information: `r' and `l', or `b' and `v', are not distinguished.
Katakana to English introduces redundant vowel sounds: `fra' in “Frankfurt” is written `フラ', which transliterates as `fura'.

8 Background
How previous research dealt with these problems:
Transcribe both words into intermediate representations and match these.
The matching knowledge may be biased towards English pronunciation.
Example: “Chirac” => “シラク”, where `シ' is pronounced `shi'.

9 Background
A neutral intermediate representation allows for partial matching.
When two intermediate representations match above a certain threshold, the words are taken to be in a translation relation.
Example: “パレスチナ” with “Palestine”, “Palestinian”, “Palestinians”.

10 Method
NPT (Nearest Phonetic Transliteration)
Takes each Katakana word and converts it to a phonetic string representing all English spelling combinations of the word.
Example: “ブルンジ”, which is “Burundi” in English. With rules such as `ル -> rloue', the word becomes “buorlouenmgesdjgiou”.
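
The conversion on this slide can be sketched as a per-character table lookup. This is a minimal illustration, not the authors' rule set: only the ル entry is stated on the slide, the other three entries are assumptions obtained by splitting the slide's example string “buorlouenmgesdjgiou” around “rloue”, and the names `NPT_TABLE` and `to_npt` are hypothetical.

```python
# Hypothetical kana -> candidate-letter table.
# Only the "ル" entry appears on the slide; the other entries are
# ASSUMPTIONS inferred by segmenting the example string for "ブルンジ".
NPT_TABLE = {
    "ブ": "buo",
    "ル": "rloue",
    "ン": "nm",
    "ジ": "gesdjgiou",
}


def to_npt(katakana: str) -> str:
    """Concatenate the candidate letters of each Katakana character.

    Raises KeyError for characters missing from the (tiny) table.
    """
    return "".join(NPT_TABLE[kana] for kana in katakana)


if __name__ == "__main__":
    print(to_npt("ブルンジ"))
```

A real table would need an entry for every Katakana character and probably for multi-character combinations; the slides do not list them.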


12 Method – NPT_score
Example inputs: “Burundi” and “buorlouenmgesdjgiou”
npt: NPT string
e: English string
md: maximum depth
d: depth count
s: score
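
The transcript lists the variables of NPT_score but not the procedure itself, so the following is only one plausible reading, stated as an assumption: walk through the English string `e`, and for each letter look ahead at most `md` characters in the NPT string `npt`; a hit at depth `d` adds one to the score `s` and moves the search position past the hit. Normalizing by the length of `e` and the default `md = 5` are choices made for this sketch, not values from the slides.

```python
def npt_score(npt: str, e: str, md: int = 5) -> float:
    """Fraction of letters of `e` found, in order, in `npt`.

    ASSUMED procedure (not taken from the slides): each letter may be
    matched at most `md` characters ahead of the current position.
    """
    e = e.lower()
    if not e:
        return 0.0
    pos = 0  # current position in the NPT string
    s = 0    # score: number of matched English letters
    for ch in e:
        window = npt[pos:pos + md]
        d = window.find(ch)  # depth of the match inside the window, -1 if none
        if d != -1:
            s += 1
            pos += d + 1
    return s / len(e)


if __name__ == "__main__":
    npt = "buorlouenmgesdjgiou"
    print(npt_score(npt, "Burundi"))
    print(npt_score(npt, "Bob"))
```

Under these assumptions every letter of “Burundi” is found within the window, while “Bob” matches only partially, which is the kind of graded score a threshold can then be applied to.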

13 Method
Several heuristics save search time and detect substrings:
Require the first letter to be upper case when collecting candidate proper nouns from the English text.
Limit the minimum length of Katakana words available for matching.
Example: “クリスマス” (= “Christmas”) and “Mass”.
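
The two filtering heuristics can be sketched with regular expressions. The function names and the minimum length of 3 are hypothetical; the slides do not state the length actually used.

```python
import re

# Runs of Katakana letters (U+30A1..U+30FA) plus the long-vowel mark (U+30FC).
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")
# Words whose first letter is upper case.
CAPITALIZED = re.compile(r"\b[A-Z][A-Za-z]*\b")


def candidate_proper_nouns(english_text: str) -> list[str]:
    """Return every capitalized word as a candidate proper noun."""
    return CAPITALIZED.findall(english_text)


def katakana_words(japanese_text: str, min_len: int = 3) -> list[str]:
    """Return Katakana runs of at least `min_len` characters.

    `min_len = 3` is an ASSUMED value for illustration only.
    """
    return [w for w in KATAKANA_RUN.findall(japanese_text) if len(w) >= min_len]


if __name__ == "__main__":
    print(candidate_proper_nouns("The president of Burundi visited Paris."))
    print(katakana_words("クリスマスのミサ"))
```

The capitalization filter also keeps sentence-initial words such as “The”; the scoring step is what has to reject those.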

14 Simulations
Two corpora of English and Japanese headline newswire articles.
The test corpus had 150 aligned articles:
1730 English paragraphs and 771 Japanese paragraphs
871 Katakana words
9742 potential English proper nouns
65 comparisons for each Katakana word in each article (roughly 9742 / 150).

15 Simulations
Baselines:
Soundex algorithm
K&H: converts the Katakana and the English word to a simplified disjunctive phonetic form. Does not allow either partial matches or matching of substrings.
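
For reference, the Soundex baseline can be sketched as below. This is the standard American Soundex coding; how the experiment applied it to Katakana is not described in the slides, so comparing a romanized form such as “burunji” is an assumption made here for illustration.

```python
# Standard American Soundex digit classes.
CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of `word` ("" if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            out += code
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (out + "000")[:4]


if __name__ == "__main__":
    print(soundex("Burundi"), soundex("burunji"))
```

Because the code is a fixed four-character key compared for equality, it offers no graded score and no substring matching: “Burundi” and the assumed romanization “burunji” already receive different codes.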

16 Results
F-measure: 81%, 58%, 39%

17 Discussion
NPT yielded the best result overall.
A higher threshold gives higher precision.
K&H can't handle partial matches, and its intermediate form may lose information.
Partial matching and finding substrings identify cognatively connected translation pairs: “インドネシア” => “Indonesia”, “Indonesian”, “Indonesians”, “Indonesias”.

18 Conclusion
Back-transliterating from Katakana to English is unexpectedly difficult.
The set of matching rules is quite small and could be improved.
Future research: induce the rules automatically from a corpus of examples.

