2007/4/201 Extracting Parallel Texts from Massive Web Documents Chikayama Taura lab. M2 Dai Saito.

2007/4/202 Purpose
  Parallel corpus: a set of parallel texts
  Parallel texts: translated pairs of texts
  Goal: construct parallel corpora from the Web
  Example (English / Japanese):
    One thing was certain, that the WHITE kitten had had nothing to do with it.
    一つ確実なのは、白い子ネコはなんの関係もなかったということ。
    --it was the black kitten's fault entirely.
    ――もうなにもかも、黒い子ネコのせいだったのです。

2007/4/203 Parallel Texts
  Useful resource for:
    Statistical machine translation
    Dictionary construction
  But existing corpora are not enough:
    Genre: public documents, software manuals
    Language: limited (English-French)
    Amount: small
    Large human resources needed

2007/4/204 Parallel Texts from the Web
  Extracting parallel texts from massive Web documents:
    Very large amount of text
    Varied languages
    Small human resources

2007/4/205 Problems
  To construct a parallel corpus from the Web:
    1. Extract candidate pairs
    2. Judge whether they really are parallel texts
  Questions:
    How to detect parallel texts automatically
    How to reduce calculation cost

2007/4/206 Agenda Introduction Related work Proposal Detect parallel texts Extract candidate pairs Experiment Conclusion

2007/4/207 STRAND [Resnik et al. 03]
  URL matching:
    1. Remove language-specific substrings (LSSs) (Japanese: ja, jp, jpn, euc, sjis, …)
    2. Match the LSS-removed URLs
    3. Make a detailed comparison
  Example:
    http://www.hostname.com/index.html.en
    http://www.hostname.com/index.html.ja
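The URL-matching steps above can be sketched roughly as follows. This is a minimal illustration, not STRAND's actual implementation: the English substrings ("en", "eng"), the token-boundary rule, and the function names strip_lss and match_urls are assumptions made for the example. Only the Japanese substrings come from the slide.

```python
import re
from collections import defaultdict

# Language-specific substrings (LSSs). The Japanese ones are listed on the
# slide; "en" and "eng" are assumptions added for this illustration.
LSS = ["jpn", "sjis", "euc", "ja", "jp", "eng", "en"]

# Assumption: an LSS only counts when it is a whole token, i.e. not glued
# to other letters or digits (so "en" inside "agenda" is left alone).
_LSS_RE = re.compile(
    r"(?<![A-Za-z0-9])(?:%s)(?![A-Za-z0-9])" % "|".join(LSS),
    re.IGNORECASE,
)

def strip_lss(url):
    """Step 1: remove language-specific substrings from a URL."""
    return _LSS_RE.sub("", url)

def match_urls(urls):
    """Step 2: group URLs that become identical once LSSs are removed."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_lss(url)].append(url)
    # Only groups with at least two URLs are candidates for step 3.
    return [group for group in groups.values() if len(group) > 1]

candidates = match_urls([
    "http://www.hostname.com/index.html.en",
    "http://www.hostname.com/index.html.ja",
    "http://www.hostname.com/about.html",
])
print(candidates)
```

The detailed comparison of step 3 is not covered here; the sketch only produces the candidate groups.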

2007/4/208 DOM Tree Alignment [Lei et al. 06]
  HTML → DOM tree
  Searching linked pages by "alt" attribute and link name (“English version”, “In English”, etc.)
  Parallel link: a pair of the same hyperlinks in parallel texts

2007/4/2010 Outline
  [Figure: overview diagram showing the Web, a crawler, candidate-pair extraction, and parallel-text detection]

2007/4/2011 Detecting parallel texts
  Low comparison cost, without HTML information [Fukushima et al. 06]
    1. Word (noun)
    2. Semantic ID
    3. Comparison

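2007/4/2012 Semantic ID Conversion
  Constructing a graph from dictionaries
  Treating Japanese and English texts at the same level
  Number of semantic IDs: about 10,000
  [Figure: dictionary graph linking English and Japanese words (Sense, 感覚, 意味; Movie, Film, 映画; Hobby, Taste, 趣味, 味), labeled with semantic IDs 1, 2, 3]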

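2007/4/2013 Texts to Vector
  Example sentence: 辞書を使ってテキストを数列に変える。 ("Convert a text into a number sequence using a dictionary.")
  Dictionary entries: テキスト (text) → 955, 辞書 (dictionary) → 1704, 数列 (number sequence) → 3173
  IDs in order of appearance: 1704, 955, 3173
  After sorting: (955, 1704, 3173), plus position information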
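A minimal sketch of this conversion, assuming the nouns have already been extracted by a morphological analyzer. The toy dictionary holds only the three entries shown on the slide, and the name to_vector is hypothetical. The position information mentioned on the slide is not modeled.

```python
# Toy dictionary: noun -> semantic ID (the three entries from the slide).
SEMANTIC_ID = {"テキスト": 955, "辞書": 1704, "数列": 3173}

def to_vector(nouns, semantic_id=SEMANTIC_ID):
    """Map nouns to semantic IDs, skip nouns missing from the dictionary,
    and return the IDs in ascending order."""
    ids = [semantic_id[noun] for noun in nouns if noun in semantic_id]
    return sorted(ids)

# Nouns in the order they appear in the example sentence.
print(to_vector(["辞書", "テキスト", "数列"]))  # [955, 1704, 3173]
```

Sorting once at conversion time is what allows two vectors to be compared later in a single linear pass.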

2007/4/2014 Comparison: tscore (translation score)
  T1: (106, 335, 455, 567, 1704, 3173, 7421)
  T2: (335, 567, 567, 1704, 4014, 5449, 7421)
  Matching IDs: 335, 567, 1704, 7421 (score = 4)
  tscore = 4/(7+7)
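The worked example above can be reproduced with a single merge-style pass over the two sorted vectors. This is a sketch under one assumption the slide does not state explicitly: a repeated ID (567 appears twice in T2) is matched at most once per occurrence in the other text.

```python
def tscore(t1, t2):
    """Number of IDs shared by two sorted ID lists, divided by the sum of
    their lengths. Each occurrence is matched at most once (assumption)."""
    i = j = matches = 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            matches += 1
            i += 1
            j += 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    total = len(t1) + len(t2)
    return matches / total if total else 0.0

T1 = [106, 335, 455, 567, 1704, 3173, 7421]
T2 = [335, 567, 567, 1704, 4014, 5449, 7421]
print(tscore(T1, T2))  # 4 / (7 + 7)
```

Because both lists are sorted, the comparison costs time linear in the combined length, which fits the low per-pair cost the slides aim for.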

2007/4/2015 tscore threshold
  Dataset: Fry Corpus [Fry 05], 400 pairs
  Speed: 200,000 pairs/sec
  tscore threshold: 0.102
  [Figure: F-measure vs. tscore threshold]

2007/4/2017 Extract candidate pairs
  Calculation cost of each comparison
  Calculation cost of extracting parallel texts
    Number of comparisons: n^2
  URL matching is too strict
    Japanese and English: 90,000,000 URLs → 4,000 URL pairs → 1,000 real pairs

2007/4/2018 Calculation Cost Reduction
  → Reduce the number of comparisons
  Distance score: tscore
  Compare only texts that are close to each other
  Both texts of a parallel pair should be at the same distance from a given sample text
  [Figure: an English text, a Japanese text, and a sample text]

2007/4/2019 Calculation Cost Reduction: Flow
  1. Select sample texts (far fewer than n)
  2. Calculate the distance score between each text and the sample texts
  3. Classify each text by its top m scores
  4. Compare only texts in the same group
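One way to read the flow above is sketched below. It assumes that "classify top m score" means placing each text in the groups of its m highest-scoring samples, and that samples are picked uniformly at random. The names group_by_samples and candidate_pairs are hypothetical, and tscore is a compact multiset version of the score defined earlier.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations

def tscore(t1, t2):
    """Shared IDs (multiset intersection) over the sum of the lengths."""
    total = len(t1) + len(t2)
    shared = sum((Counter(t1) & Counter(t2)).values())
    return shared / total if total else 0.0

def group_by_samples(texts, num_samples, m, seed=0):
    """Steps 1-3: pick samples, score every text against them, and put each
    text into the groups of its top-m samples."""
    rng = random.Random(seed)
    sample_ids = rng.sample(range(len(texts)), num_samples)
    groups = defaultdict(list)
    for idx, text in enumerate(texts):
        ranked = sorted(sample_ids,
                        key=lambda s: tscore(text, texts[s]),
                        reverse=True)
        for s in ranked[:m]:
            groups[s].append(idx)
    return groups

def candidate_pairs(groups):
    """Step 4: only texts that share a group are compared with each other."""
    pairs = set()
    for members in groups.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

With m greater than 1 a text can land in several groups, which lowers the risk of separating a true pair at the cost of more comparisons.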

2007/4/2020 Sampling
  Number of samples: calculation cost vs. accuracy (low risk of mislabeling)
  Methods to select samples: random, k-means

2007/4/2021 k-means (illustrated with k=2)
  1. Select k samples
  2. Classify all texts
  3. Calculate centers
  4. Re-classify

2007/4/2022 Calculation of tscore in k-means
  Normal (text vs. text):
    Text1: (106, 335, 455, 567, 1704, 3173, 7421)
    Text2: (335, 567, 567, 1704, 4014, 5449, 7421)
    tscore = 4/(7+7)
  k-means (text vs. cluster average):
    Text1: (106, 335, 455, 567, 1704, 3173, 7421)
    Average1: ((567, 0.2), (4014, 0.14), (7421, 0.5), …)
    tscore = 0.2 + 0.5
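The text-versus-average case above can be sketched as a sum of the weights of those average entries whose ID also occurs in the text. The dictionary representation and the name centroid_tscore are assumptions for illustration. How the weights of the average are computed is not stated on the slide, so the sketch takes them as given and uses only the three entries shown.

```python
def centroid_tscore(text_ids, centroid):
    """Sum the weights of the centroid's IDs that also appear in the text.

    `centroid` maps semantic ID -> weight (representation is an assumption).
    """
    present = set(text_ids)
    return sum(weight for sid, weight in centroid.items() if sid in present)

text1 = [106, 335, 455, 567, 1704, 3173, 7421]
average1 = {567: 0.2, 4014: 0.14, 7421: 0.5}  # only the entries on the slide

# 567 and 7421 occur in text1, 4014 does not, so the score is 0.2 + 0.5.
print(centroid_tscore(text1, average1))
```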

2007/4/2023 Converting HTML on the Web
  Guess the language and character code (English, SJIS, EUC-JP, UTF-8)
  Convert the character code
  Remove HTML tags
  Morphological analysis → pick up nouns
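The first three steps above can be approximated with the Python standard library alone, as in the sketch below. Guessing the character code by trial decoding is a crude stand-in for whatever detection the authors used, and the names decode_guess and strip_tags are hypothetical. Morphological analysis needs an external analyzer and is left out.

```python
from html.parser import HTMLParser

# Candidate encodings tried in order (a simplifying assumption).
CANDIDATE_ENCODINGS = ["utf-8", "euc_jp", "shift_jis"]

def decode_guess(raw):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Fall back to UTF-8 with replacement characters.
    return raw.decode("utf-8", errors="replace"), "utf-8"

class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def strip_tags(html):
    """Remove HTML tags and return the visible text joined by spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Trial decoding can misidentify short or ambiguous byte sequences, so a real crawler would likely also consult the HTTP headers or the page's declared charset.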

2007/4/2025 Experiment
  Calculation cost
  Accuracy vs. calculation time
  Clustering: k-means

2007/4/2026 Environment
  Dataset: Fry Corpus [Fry 05], a corpus of Japanese-English news pages
  HTML converted to semantic IDs in advance
  Machine:
    CPU: Xeon 2.4GHz Dual
    Memory: 2GB
    OS: Linux (Debian)

2007/4/2027 Calculation Cost
  Fry Corpus, 200 - 6400 pairs
  Normal all-to-all vs. random sampling (Top3)
  As the number of texts grows, the gap becomes wider
  Low cost with n^2 samples

2007/4/2028 Accuracy vs. Calculation time
  Fry Corpus, 400 pairs, random sampling
  As the number of samples grows:
    Misclassification ratio → high
    Execution time → low
  Trade-off between misclassification ratio and execution time

2007/4/2029 Sample selection with k-means
  Accuracy and execution time with k-means
  Flow:
    1. Random sampling (number of samples: √n)
    2. Calculate the centers and re-sample
    3. Measure misclassification ratio and execution time

2007/4/2030 Evaluation of k-means
  Low misclassification ratio → High biased

  size  method   misclassification  calculation time [sec]
  200   random   21                 0.15
  200   k-means  4                  0.32
  400   random   51                 0.54
  400   k-means  7                  1.18

2007/4/2032 Conclusion and Future work
  Parallel texts from the Web
    Detecting parallel texts
    Extracting candidate pairs: random sampling, k-means

2007/4/2033 Future work
  Better clustering methods: hierarchical
  Dimension reduction: about 10,000 dimensions is too many
  Processing real HTML texts from the Web

2007/4/2034 Thank you for your attention!
