Extracting Parallel Texts from Massive Web Documents
Chikayama Taura lab. M2 Dai Saito
2007/4/20


1 Extracting Parallel Texts from Massive Web Documents
Chikayama Taura lab. M2 Dai Saito

2 Purpose
Parallel corpus: a set of parallel texts
Parallel texts: pairs of texts that are translations of each other
Goal: construct parallel corpora from the Web
Example:
  English: One thing was certain, that the WHITE kitten had had nothing to do with it.
  日本語: 一つ確実なのは、白い子ネコはなんの関係もなかったということ。
  English: --it was the black kitten's fault entirely.
  日本語: ――もうなにもかも、黒い子ネコのせいだったのです。

3 Parallel Texts
A useful resource for:
  Statistical machine translation
  Dictionary construction
But existing corpora are not enough:
  Genre: public documents, software manuals
  Language: limited (English-French)
  Amount: small
  Large human resources required

4 Parallel Texts from the Web
Extracting parallel texts from massive Web documents:
  Very large amount of text
  Varied languages
  Small human resources

5 Problems
To construct a parallel corpus from the Web:
1. Extract candidate pairs
2. Judge whether they really are parallel texts
Problems:
  How to detect parallel texts automatically
  How to reduce calculation cost

6 Agenda
Introduction
Related work
Proposal
  Detect parallel texts
  Extract candidate pairs
Experiment
Conclusion

7 STRAND [Resnik et al. 03]
URL matching:
1. Remove language-specific substrings (LSSs) from the URLs (Japanese: ja, jp, jpn, euc, sjis, …)
2. Match the LSS-removed URLs
3. Make a detailed comparison of the matched pages
Example pair:
  http://www.hostname.com/index.html.en
  http://www.hostname.com/index.html.ja
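Steps 1 and 2 of the URL matching can be sketched as follows. This is a minimal illustration, not STRAND's actual implementation: the LSS list contains only the substrings shown on the slide plus "en" from the example URLs, and the rule that an LSS must be delimited by non-alphanumeric characters is an assumption made here so that "jp" inside "jpeg" is left alone. The names `strip_lss` and `candidate_groups` are hypothetical.

```python
import re
from collections import defaultdict

# Japanese LSSs come from the slide; "en" is taken from the example URLs.
# A real list would be longer.
LSS = ["jpn", "sjis", "euc", "ja", "jp", "en"]

# Assumption: an LSS only counts when it is not part of a longer
# alphanumeric token (so "jp" in "jpeg" does not match).
_LSS_RE = re.compile(r"(?<![A-Za-z0-9])(?:%s)(?![A-Za-z0-9])" % "|".join(LSS))


def strip_lss(url):
    """Return the URL with every delimited LSS removed (step 1)."""
    return _LSS_RE.sub("", url)


def candidate_groups(urls):
    """Group URLs that become identical after LSS removal (step 2)."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_lss(url)].append(url)
    return [group for group in groups.values() if len(group) > 1]


urls = [
    "http://www.hostname.com/index.html.en",
    "http://www.hostname.com/index.html.ja",
    "http://www.hostname.com/about.html",
]
print(candidate_groups(urls))
```

Each returned group is only a candidate; step 3 (the detailed comparison) still has to confirm that the pages are translations.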

8 DOM Tree Alignment [Lei et al. 06]
Convert HTML to a DOM tree
Search for linked pages using the "alt" text and the link name, e.g. links labeled "English version", "In English", etc.
Parallel link: a pair of identical hyperlinks that appear in both parallel texts
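The link search can be illustrated with Python's standard `html.parser`. This sketch only covers finding links whose text or image `alt` matches a label; it does not attempt the DOM tree alignment itself. The two phrases are from the slide, while the substring matching rule and the names `LanguageLinkFinder` and `find_language_links` are assumptions.

```python
from html.parser import HTMLParser

# Labels hinting at a translated version (phrases from the slide).
LABELS = ("english version", "in english")


class LanguageLinkFinder(HTMLParser):
    """Collect hrefs of <a> elements whose text or image alt matches LABELS."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text and alt strings seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href, self._text = attrs["href"], []
        elif tag == "img" and self._href is not None:
            self._text.append(attrs.get("alt") or "")

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join("".join(self._text).split()).lower()
            if any(phrase in label for phrase in LABELS):
                self.links.append(self._href)
            self._href = None


def find_language_links(html):
    finder = LanguageLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.links


page = (
    '<p><a href="/en/">English version</a> '
    '<a href="/x"><img src="f.png" alt="In English"></a> '
    '<a href="/news">News</a></p>'
)
print(find_language_links(page))
```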

9 Agenda
Introduction
Related work
Proposal
  Detect parallel texts
  Extract candidate pairs
Experiment
Conclusion

10 Outline
[Diagram: the Web, a crawler, and the two stages "Extract candidate pairs" and "Detect parallel texts"]

11 Detecting parallel texts
Low comparison cost, without HTML information [Fukushima et al. 06]:
1. Words (nouns)
2. Semantic IDs
3. Comparison

12 Semantic ID Conversion
Construct a graph from dictionaries
Japanese and English texts are treated at the same level
Number of semantic IDs: about 10,000
Example:
  1: Sense, 感覚, 意味
  2: Movie, Film, 映画
  3: Hobby, Taste, 趣味, 味
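One way to read "constructing a graph from dictionaries" is that words linked by dictionary entries form a group and each group receives one ID. The sketch below takes that reading and assigns IDs per connected component with a small union-find. Both the connected-component interpretation and the individual dictionary pairs in `ENTRIES` are assumptions chosen to reproduce the three groups on the slide; they are not the author's data.

```python
def assign_semantic_ids(entries):
    """Give every word in the same connected component the same integer ID.

    entries: iterable of (word_a, word_b) dictionary pairs, treated as
    undirected edges. IDs start at 1 in order of first appearance.
    """
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for a, b in entries:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    ids, root_to_id = {}, {}
    for word in parent:
        root = find(word)
        ids[word] = root_to_id.setdefault(root, len(root_to_id) + 1)
    return ids


# Hypothetical dictionary pairs that yield the groups shown on the slide.
ENTRIES = [
    ("Sense", "感覚"), ("Sense", "意味"),
    ("Movie", "映画"), ("Film", "映画"),
    ("Hobby", "趣味"), ("Taste", "趣味"), ("Taste", "味"),
]
semantic_id = assign_semantic_ids(ENTRIES)
print(semantic_id["Film"], semantic_id["映画"])
```

With this mapping an English word and its Japanese translation share an ID, which is what lets texts in both languages be compared "at the same level".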

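13 Texts to Vector
Dictionary: テキスト (text) → 955, 辞書 (dictionary) → 1704, 数列 (number sequence) → 3173
Input: 辞書を使ってテキストを数列に変える。 ("Use a dictionary to convert a text into a number sequence.")
IDs in order of appearance: 1704, 955, 3173
After sorting: (955, 1704, 3173), plus position information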
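The conversion on this slide can be sketched in a few lines. The sketch assumes that noun extraction has already been done and that words missing from the dictionary are simply dropped; the position information mentioned on the slide is omitted because its format is not shown. The function name `text_to_vector` is hypothetical.

```python
def text_to_vector(nouns, semantic_id):
    """Map nouns to semantic IDs, drop unknown words, and sort ascending."""
    return sorted(semantic_id[word] for word in nouns if word in semantic_id)


# IDs as shown on the slide.
semantic_id = {"テキスト": 955, "辞書": 1704, "数列": 3173}

# Nouns of 「辞書を使ってテキストを数列に変える。」 in order of appearance.
nouns = ["辞書", "テキスト", "数列"]
print(text_to_vector(nouns, semantic_id))  # [955, 1704, 3173]
```

Sorting is what makes the later comparison cheap: two sorted vectors can be compared in a single linear pass.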

14 Comparison
tscore (translation score)
T1: (106, 335, 455, 567, 1704, 3173, 7421)
T2: (335, 567, 567, 1704, 4014, 5449, 7421)
Scanning both sorted vectors, the score goes 0, 1, 2, 3, 4 (matching IDs: 335, 567, 1704, 7421)
tscore = 4/(7+7)
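The example can be reproduced with a merge-style scan over the two sorted vectors. The general formula used here, matches divided by the sum of the two lengths, is inferred from the single example 4/(7+7); pairing each occurrence of an ID at most once is likewise an assumption that happens to agree with the example (567 appears twice in T2 but is counted once).

```python
def count_matches(t1, t2):
    """Count IDs shared by two sorted lists, pairing each occurrence once."""
    i = j = matches = 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            matches += 1
            i += 1
            j += 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    return matches


def tscore(t1, t2):
    """matches / (len(t1) + len(t2)), following the 4/(7+7) example."""
    total = len(t1) + len(t2)
    return count_matches(t1, t2) / total if total else 0.0


T1 = [106, 335, 455, 567, 1704, 3173, 7421]
T2 = [335, 567, 567, 1704, 4014, 5449, 7421]
print(count_matches(T1, T2))  # 4
print(tscore(T1, T2))         # 4/14, about 0.286
```

Under this normalization two identical vectors score 0.5, so tscore ranges from 0 to 0.5.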

15 tscore threshold
Fry Corpus [Fry 05], 400 pairs
[Figure: F-measure versus tscore threshold]
tscore threshold: 0.102
Speed: 200,000 pairs/sec

16 Agenda
Introduction
Related work
Proposal
  Detect parallel texts
  Extract candidate pairs
Experiment
Conclusion

17 Extract candidate pairs
Calculation cost of each comparison
Calculation cost of extracting parallel texts
  Number of comparisons: n^2
URL matching is too strict
  Japanese and English: 90,000,000 URLs → 4,000 URL pairs → 1,000 real pairs

18 Calculation Cost Reduction
→ Reduce the number of comparisons
Distance score: tscore
Compare only texts that are close to each other
The two texts of a parallel pair should be at the same distance from a sample text
[Diagram: an English text, a Japanese text, and a sample text]

19 Calculation Cost Reduction
Flow:
1. Select sample texts (far fewer than n)
2. Calculate the distance score between each text and the sample texts
3. Classify each text by its top m scores
4. Compare only texts in the same group
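The four steps of this flow can be sketched as below. Several points are assumptions: "classify top m score" is read as placing each text into the groups of its m best-scoring samples, samples are drawn at random from the texts themselves, and English and Japanese texts are kept in one list for brevity. The `tscore` here is a compact multiset version of the score from the Comparison slide, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def tscore(t1, t2):
    """Shared IDs (counted as a multiset) over the combined length."""
    shared = sum((Counter(t1) & Counter(t2)).values())
    total = len(t1) + len(t2)
    return shared / total if total else 0.0


def group_by_samples(texts, samples, m=3):
    """Steps 2-3: put each text into the groups of its m best-scoring samples."""
    groups = defaultdict(list)
    for idx, text in enumerate(texts):
        scores = [tscore(text, sample) for sample in samples]
        ranked = sorted(range(len(samples)), key=lambda k: scores[k], reverse=True)
        for k in ranked[:m]:
            groups[k].append(idx)
    return groups


def pairs_within_groups(groups):
    """Step 4: only texts that share a group are compared."""
    pairs = set()
    for members in groups.values():
        pairs.update(combinations(members, 2))
    return pairs


def candidate_pairs(texts, n_samples, m=3, seed=0):
    samples = random.Random(seed).sample(texts, n_samples)  # step 1
    return pairs_within_groups(group_by_samples(texts, samples, m))


texts = [[1, 2, 3], [1, 2, 4], [10, 11, 12], [10, 11, 13]]
samples = [[1, 2, 3], [10, 11, 12]]
groups = group_by_samples(texts, samples, m=1)
print(sorted(pairs_within_groups(groups)))  # [(0, 1), (2, 3)]
```

In this toy case only 2 of the 6 possible pairs remain to be compared; the saving comes at the risk that a true pair lands in different groups.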

20 Sampling
Number of samples: a trade-off between
  Calculation cost
  Accuracy (low risk of mislabeling)
Methods to select samples:
  Random
  k-means

21 k-means
1. Select k samples
2. Classify all texts
3. Calculate centers
4. Re-classify
[Illustration with k=2]
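The four steps correspond to the usual k-means loop. The sketch below keeps the loop generic: the similarity function and the way a center is built are passed in as parameters, since the slide does not fix them at this point. The toy run uses plain numbers with k=2 purely for illustration; the function name and parameters are hypothetical.

```python
def kmeans(items, centers, similarity, make_center, iterations=10):
    """Generic k-means loop.

    centers: the k initial samples (step 1).
    similarity(item, center): larger means closer.
    make_center(members): builds a new center from a non-empty list of items.
    Returns (labels, centers).
    """
    centers = list(centers)
    labels = None
    for _ in range(iterations):
        # Steps 2 and 4: assign every item to its most similar center.
        new_labels = [
            max(range(len(centers)), key=lambda c: similarity(item, centers[c]))
            for item in items
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Step 3: recompute each center from its members; keep empty ones as-is.
        for c in range(len(centers)):
            members = [it for it, lab in zip(items, labels) if lab == c]
            if members:
                centers[c] = make_center(members)
    return labels, centers


points = [1.0, 2.0, 10.0, 11.0]
labels, centers = kmeans(
    points,
    centers=[1.0, 10.0],
    similarity=lambda x, c: -abs(x - c),
    make_center=lambda members: sum(members) / len(members),
)
print(labels, centers)  # [0, 0, 1, 1] [1.5, 10.5]
```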

22 Calculation of tscore in k-means
Normal:
  Text1: (106, 335, 455, 567, 1704, 3173, 7421)
  Text2: (335, 567, 567, 1704, 4014, 5449, 7421)
  tscore = 4/(7+7)
k-means:
  Text1: (106, 335, 455, 567, 1704, 3173, 7421)
  Average1: ((567, 0.2), (4014, 0.14), (7421, 0.5), …)
  tscore = (0.2+0.5)
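In the k-means case the score against a cluster average appears to be the sum of the weights of those IDs that also occur in the text: Text1 contains 567 and 7421 but not 4014, giving 0.2 + 0.5. The sketch below implements that reading. Counting each ID once even if it repeats in the text is an assumption, and `AVERAGE1` holds only the three entries visible on the slide; the elided ones are left out, and how the weights themselves are computed is not shown in the source.

```python
def tscore_to_average(text, average):
    """Sum the weights of the average vector's IDs that occur in the text."""
    return sum(average.get(sid, 0.0) for sid in set(text))


TEXT1 = [106, 335, 455, 567, 1704, 3173, 7421]

# Only the entries shown on the slide; the remaining ones are elided there.
AVERAGE1 = {567: 0.2, 4014: 0.14, 7421: 0.5}

print(tscore_to_average(TEXT1, AVERAGE1))
```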

23 Converting HTML on the Web
Guess the language and encoding (English, SJIS, EUC-JP, UTF-8)
Convert the character code
Remove HTML tags
Morphological analysis → pick up nouns
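The first three steps can be sketched with the Python standard library. This is a rough illustration under stated assumptions: the encoding is guessed by trial decoding in a fixed order (pure ASCII is labeled "English"), which is much cruder than a real detector, and `script`/`style` contents are skipped when removing tags. Morphological analysis needs an external analyzer and is not shown. The names `decode_page` and `strip_tags` are hypothetical.

```python
from html.parser import HTMLParser

# Encodings named on the slide; the trial order is an assumption.
CANDIDATES = ("utf-8", "euc_jp", "shift_jis")


def decode_page(raw):
    """Return (label, text): 'English' for pure ASCII, else the first codec that decodes."""
    try:
        return "English", raw.decode("ascii")
    except UnicodeDecodeError:
        pass
    for codec in CANDIDATES:
        try:
            return codec, raw.decode(codec)
        except UnicodeDecodeError:
            continue
    return "unknown", raw.decode("utf-8", errors="replace")


class _TextExtractor(HTMLParser):
    """Keep text nodes; skip the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


label, html = decode_page("<p>日本語</p>".encode("euc_jp"))
print(label, strip_tags(html))
```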

24 Agenda
Introduction
Related work
Proposal
  Detect parallel texts
  Extract candidate pairs
Experiment
Conclusion

25 Experiment
Calculation cost
Accuracy vs. calculation time
Clustering with k-means

26 Environment
Dataset: Fry Corpus [Fry 05], a corpus of Japanese-English news pages
  HTML converted to semantic IDs in advance
Machine:
  CPU: Xeon 2.4GHz Dual
  Memory: 2GB
  OS: Linux (Debian)

27 Calculation Cost
Fry Corpus, 200 - 6400 pairs
Compared: normal all-to-all vs. random sampling (Top3)
As the number of texts grows, the gap becomes wider
Low cost with n^2 samples

28 Accuracy vs. Calculation time
Fry Corpus, 400 pairs, random sampling
As the number of samples grows:
  Misclassification ratio → high
  Execution time → low
Trade-off between misclassification ratio and execution time

29 Sample selection with k-means
Accuracy and execution time with k-means
Flow:
1. Random sampling (number of samples: √n)
2. Calculate the centers and re-sample
3. Measure the misclassification ratio and execution time

30 Evaluation of k-means
Low miss-classification ratio → High biased
size  method   miss classification  calculation time [sec]
200   random   21                   0.15
200   k-means  4                    0.32
400   random   51                   0.54
400   k-means  7                    1.18

31 Agenda
Introduction
Related work
Proposal
  Detect parallel texts
  Extract candidate pairs
Experiment
Conclusion

32 Conclusion and Future work
Parallel texts from the Web
  Detecting parallel texts
  Extracting candidate pairs
    Random sampling
    k-means

33 Future work
Better clustering methods
  Hierarchical
Dimension reduction
  About 10,000 dimensions is too many
Processing real HTML texts from the Web

34 Thank you for your attention!

