Large Scale Crawling the Web for Parallel Texts
Chikayama Taura lab., M1 Dai Saito
2006/12/08

1 Large Scale Crawling the Web for Parallel Texts
Chikayama Taura lab., M1 Dai Saito (2006/12/08)

2 Parallel Texts
- Parallel texts: translated pairs of multilingual texts
- Parallel corpus: a set of parallel texts
- Example (English / Japanese):
  - "One thing was certain, that the WHITE kitten had had nothing to do with it." / 一つ確実なのは、白い子ネコはなんの関係もなかったということ。
  - "--it was the black kitten's fault entirely." / ――もうなにもかも、黒い子ネコのせいだったのです。

3 Parallel Texts
- A useful resource for:
  - Statistical machine translation
  - Dictionary construction
- But existing corpora are small:
  - Genre: public documents, software manuals
  - Language: English-French
  - Number: not enough
  - Building them needs human resources

4 Parallel Texts from the Web
- Crawling parallel texts from the Web
  - A very large number of texts exist
  - Varied languages are used
  - Little human effort is needed
- Problems
  - How to detect parallel texts automatically
  - Calculation cost

5 Parallel Texts from the Web
[Figure: pages from the Web pass through two filtering steps. Step ① yields "maybe parallel" and "not parallel"; step ② yields "parallel texts" and "not parallel".]

6 Agenda: Introduction; Related work; Proposal; Detecting parallel texts; Large scale crawling; Experiment; Conclusion

7 STRAND [Resnik et al. 03]
URL Matching
1. Remove language-specific substrings (LSSs) from the URLs (Japanese: ja, jp, jpn, euc, sjis, …)
2. Match the URLs with the LSSs removed
3. Make a detailed comparison of the matched pages
Example pair:
  http://www.hostname.com/index.html.en
  http://www.hostname.com/index.html.ja
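The first two steps of URL matching can be sketched as follows. This is a minimal illustration, not the STRAND implementation: the English LSS list, the choice to mask only the URL path, and the names `LSS`, `normalize` and `match_urls` are assumptions made for the example.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical LSS lists. The slide names only a few Japanese substrings;
# "japanese" and the English list are assumptions for this sketch.
LSS = {
    "ja": {"ja", "jp", "jpn", "euc", "sjis", "japanese"},
    "en": {"en", "eng", "english"},
}
TOKEN_SPLIT = re.compile(r"([./_\-])")  # keep the delimiters when splitting


def normalize(url):
    """Return (language, key) with LSS tokens in the path masked by '*'.

    Returns None when the path has no LSS or mixes both languages.
    """
    split = urlsplit(url)
    parts = TOKEN_SPLIT.split(split.path)
    langs = set()
    for i, part in enumerate(parts):
        for lang, substrings in LSS.items():
            if part.lower() in substrings:
                langs.add(lang)
                parts[i] = "*"
    if len(langs) != 1:
        return None
    return langs.pop(), split.netloc + "".join(parts)


def match_urls(urls):
    """Pair URLs whose masked forms are identical (one English, one Japanese)."""
    groups = defaultdict(dict)
    for url in urls:
        result = normalize(url)
        if result is not None:
            lang, key = result
            groups[key].setdefault(lang, url)
    return sorted((g["en"], g["ja"]) for g in groups.values() if len(g) == 2)


pairs = match_urls([
    "http://www.hostname.com/index.html.en",
    "http://www.hostname.com/index.html.ja",
    "http://www.hostname.com/about.html",
])
print(pairs)
```

Pairs found this way are only candidates; step 3 (the detailed comparison of page contents) is still needed to confirm them.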

8 URL Matching Experiment
URL matching over the URLs of crawled pages: 90,000,000 URLs, English ⇔ Japanese, looking only at the URL.
Examples: japanese.php / english.php, index.html.ja / index.html.en

Japanese    English    Pairs
japanese    english    1833
/ja/        /en/       73
_ja         _en        3
ja_         en_        488
.ja         .en        1405
ja.         en.        271
Total                  4073

90,000,000 URLs → about 4,000 pairs. Is URL matching too strict? Useless pages are also included.

9 DOM Tree Alignment [Lei et al. 06]
- Searching linked pages: use the "alt" tag and the link name (links such as "English version", "In English", etc.)
- HTML → DOM tree
- Parallel link: a pair of the same hyperlinks in parallel texts

10 Pros and Cons
- URL Matching
  - ○ High speed and easy to implement
  - × Finds only a small number of pages
- DOM Tree
  - ○ High accuracy and small storage
  - × Execution speed is slow

11 Agenda: Introduction; Related work; Proposal; Detecting parallel texts; Large scale crawling; Experiment; Conclusion

12 Detecting Parallel Texts [Fukushima 06]
- Reduces the comparison cost, without using HTML information
- word (noun) → semantic ID → comparison

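13 Semantic ID Conversion
- Construct a graph from dictionaries
- Treat Japanese and English texts on the same level
- Number of semantic IDs: about 10,000
[Figure: a dictionary graph linking the words Sense, 感覚 (sense), 意味 (meaning), Movie, Film, 映画 (movie), Hobby, Taste, 趣味 (hobby), 味 (taste), grouped under semantic IDs 1, 2, 3]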
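One plausible reading of "constructing a graph from dictionaries" is that words linked by dictionary entries form connected components, and each component gets one semantic ID. The sketch below follows that reading with a union-find structure. The slides do not describe the actual construction, so the grouping rule, the function name `build_semantic_ids` and the toy dictionary entries are all assumptions.

```python
def build_semantic_ids(dictionary_pairs):
    """Assign one ID per connected component of the word graph.

    dictionary_pairs: iterable of (word, translation) edges.
    Assumption: words connected through dictionary entries share an ID.
    """
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for a, b in dictionary_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    ids_by_root = {}
    semantic_id = {}
    for word in list(parent):
        root = find(word)
        semantic_id[word] = ids_by_root.setdefault(root, len(ids_by_root) + 1)
    return semantic_id


# Toy bilingual dictionary (hypothetical entries based on the slide's words).
entries = [
    ("sense", "感覚"), ("sense", "意味"),
    ("movie", "映画"), ("film", "映画"),
    ("hobby", "趣味"), ("taste", "趣味"), ("taste", "味"),
]
ids = build_semantic_ids(entries)
print(ids["film"] == ids["movie"], ids["taste"] == ids["hobby"])
```

With a real dictionary, chains of loosely related translations could merge unrelated senses into one component, so a practical construction would probably need some pruning.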

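14 Texts to Vector
- Each word is looked up to get its semantic ID: テキスト (text) → 955, 辞書 (dictionary) → 1704, 数列 (number sequence) → 3173
- Example sentence: 辞書を使ってテキストを数列に変える。 ("Convert a text into a number sequence using a dictionary.")
- IDs in order of appearance: 1704, 955, 3173
- After sorting, with position information added: (955, 1704, 3173)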
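The conversion on this slide can be sketched as below. The sketch assumes the nouns have already been extracted from the text (the slides do not name a morphological analyzer) and stores the position information as a parallel list, which is an assumption about the layout. `text_to_vector` is a hypothetical name.

```python
def text_to_vector(nouns, semantic_id):
    """Map nouns to semantic IDs, then sort the IDs.

    Returns (sorted_ids, positions), where positions[k] is the index in the
    original noun sequence of the word that produced sorted_ids[k].
    Words missing from the dictionary are skipped.
    """
    pairs = [(semantic_id[w], pos) for pos, w in enumerate(nouns) if w in semantic_id]
    pairs.sort()
    return [i for i, _ in pairs], [p for _, p in pairs]


# IDs taken from the slide's example.
semantic_id = {"テキスト": 955, "辞書": 1704, "数列": 3173}
vector, positions = text_to_vector(["辞書", "テキスト", "数列"], semantic_id)
print(vector, positions)
```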

15 Comparison
- tscore (translation score)
- T1: (106, 335, 455, 567, 1704, 3173, 7421)
- T2: (335, 567, 567, 1704, 4014, 5449, 7421)
- score = 0, 1, 2 (successive values shown on the slide)
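The slides do not give the tscore formula, so the following is not tscore itself. It is a stand-in that shows why sorted ID vectors are cheap to compare: one linear merge scan counts the shared IDs. The names `shared_id_count` and `overlap_ratio`, and the normalization by the longer vector, are assumptions.

```python
def shared_id_count(t1, t2):
    """Count IDs common to two sorted vectors with a single merge scan."""
    i = j = count = 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            count += 1
            i += 1
            j += 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    return count


def overlap_ratio(t1, t2):
    """Shared-ID count normalized by the longer vector (an assumed normalization)."""
    longest = max(len(t1), len(t2))
    return shared_id_count(t1, t2) / longest if longest else 0.0


T1 = (106, 335, 455, 567, 1704, 3173, 7421)
T2 = (335, 567, 567, 1704, 4014, 5449, 7421)
print(shared_id_count(T1, T2), overlap_ratio(T1, T2))
```

The scan costs O(len(T1) + len(T2)) per pair, which is consistent with the high comparison throughput the talk reports, although the real score may weight matches differently.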

16 tscore threshold
- Evaluated on the Fry Corpus [Fry 05]
- [Figure: F-measure plotted against the tscore threshold]
- tscore threshold: 0.102
- Speed: 250,000 pairs/sec

17 Agenda: Introduction; Related work; Proposal; Detecting parallel texts; Large scale crawling; Experiment; Conclusion

18 Large Scale Crawling
- Calculation cost of each comparison
- Calculation cost of the entire crawl
- Number of comparisons:
  - URL matching is too strict
  - The alt tag or link name cannot be applied to all parallel pages

19 HTML on the Web to Natural Language
- Guess the language and encoding: English, SJIS, EUC-JP, UTF-8
- Convert the character code
- Remove HTML tags
- For crawling, […] or […] tags are used; the […] tag may be useful
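The three preprocessing steps (guess the encoding, convert, remove tags) can be sketched with the Python standard library. Trying encodings in a fixed order is a naive guess used only for illustration, since byte sequences can be valid in more than one Japanese encoding. The order, the handling of `script` and `style` elements, and the names `decode_page`, `TextExtractor` and `html_to_text` are assumptions.

```python
from html.parser import HTMLParser

# Assumed trial order; a real crawler would use a proper encoding detector.
CANDIDATE_ENCODINGS = ["ascii", "utf-8", "euc_jp", "shift_jis"]


def decode_page(raw):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace"), "unknown"


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(raw):
    """Decode raw page bytes and strip the markup."""
    html, encoding = decode_page(raw)
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks), encoding


print(html_to_text(b"<p>Hello <b>world</b></p>"))
```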

20 Calculation Cost Reduction
- Use a distance score between vectors and compare only nearby vectors
  - Distance score: tscore
- Label every text with its nearest sample text
- If the distance score shows that two texts are far apart, they are not parallel texts

21 Calculation Cost Reduction
Flow
1. Select sample texts (far fewer than n)
2. When crawling, calculate the distance score between each text and the sample texts
3. Classify each text by its top m scores
4. Compare only texts in the same group
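The flow above can be sketched as follows. This is an illustration under assumptions: a Jaccard overlap stands in for tscore, the samples are drawn at random, the number of samples is the integer square root of the number of texts, and `candidate_pairs` is a hypothetical name.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def overlap(a, b):
    """Stand-in similarity (Jaccard overlap of ID sets); not the real tscore."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)


def candidate_pairs(vectors, m=1, seed=0, score=overlap):
    """Return index pairs of texts that share at least one of their top-m samples."""
    n = len(vectors)
    rng = random.Random(seed)
    samples = rng.sample(range(n), max(1, math.isqrt(n)))   # step 1
    groups = defaultdict(set)
    for idx, vec in enumerate(vectors):
        # Step 2: score the text against every sample.
        ranked = sorted(samples, key=lambda s: score(vec, vectors[s]), reverse=True)
        # Step 3: put the text into the groups of its top-m samples.
        for s in ranked[:m]:
            groups[s].add(idx)
    pairs = set()
    for members in groups.values():                          # step 4
        pairs.update(combinations(sorted(members), 2))
    return pairs


vectors = [[1, 2, 3], [1, 2, 3], [100, 101], [100, 101]]
print(sorted(candidate_pairs(vectors)))
```

Only pairs inside a group are compared in detail, so the number of comparisons drops from all pairs to roughly the sum of the squared group sizes. The price is the risk that a truly parallel pair is labeled with different samples and never compared.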

22 Sampling
- Number of samples: a trade-off between accuracy (risk of mislabeling) and calculation cost
- Group sizes should be equal
- Large groups are divided into smaller ones recursively

23 Crawling link pages
- Evaluating whether two links are the same
  - DOM tree [Lei et al. 06]
  - Evaluation function
    - Position of the tag
    - Pages on the same host
    - Diff of URLs
      - hoge.html.en -> fuga.html.en : hoge - fuga
      - hoge.html.ja -> fuga.html.ja : hoge - fuga
- The same links from parallel texts are expected to lead to parallel texts
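The "diff of URLs" idea can be illustrated with a small sketch: strip the tokens that the source and target URLs share at both ends, and compare what is left for the two languages. The tokenization on "." and "/" and the names `url_diff` and `same_link` are assumptions. The slides list this as only one input to the evaluation function.

```python
import re


def url_diff(source, target):
    """Return the token runs that differ between a page URL and a link target."""
    a = re.split(r"[./]", source)
    b = re.split(r"[./]", target)
    while a and b and a[0] == b[0]:      # drop the common prefix
        a.pop(0)
        b.pop(0)
    while a and b and a[-1] == b[-1]:    # drop the common suffix
        a.pop()
        b.pop()
    return tuple(a), tuple(b)


def same_link(src1, dst1, src2, dst2):
    """True when both links change their source URL in the same way."""
    return url_diff(src1, dst1) == url_diff(src2, dst2)


print(url_diff("hoge.html.en", "fuga.html.en"))
print(same_link("hoge.html.en", "fuga.html.en", "hoge.html.ja", "fuga.html.ja"))
```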

24 Agenda: Introduction; Related work; Proposal; Detecting parallel texts; Large scale crawling; Experiment; Conclusion

25 Evaluation of tscore
- Fry Corpus [Fry 05]: 200 (Japanese) x 200 (English)
- Flow
  1. Convert all texts to vectors
  2. Calculate the distance score for all pairs (40,000)
  3. Check that the scores of the real parallel texts are high
- The score of each parallel pair should be the top score
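The evaluation flow can be sketched as a ranking check: for each Japanese text, score it against every English text and see where its true partner ranks. The sketch below assumes that the texts are aligned by index, checks one direction only, and uses a Jaccard overlap in place of tscore. `count_misses` and `jaccard` are hypothetical names.

```python
def jaccard(a, b):
    """Stand-in similarity; the real evaluation used tscore and other scores."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)


def count_misses(ja_vectors, en_vectors, score=jaccard, cutoffs=(1, 3)):
    """Count texts whose true partner (same index) ranks below each cutoff.

    Ties are resolved in favor of the true partner.
    """
    misses = {k: 0 for k in cutoffs}
    for i, ja in enumerate(ja_vectors):
        scores = [score(ja, en) for en in en_vectors]
        rank = 1 + sum(s > scores[i] for s in scores)
        for k in cutoffs:
            if rank > k:
                misses[k] += 1
    return misses


ja = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
en = [[1, 2, 3], [4, 5], [4, 5, 6]]
print(count_misses(ja, en))
```

In the toy data the second Japanese text scores higher against the third English text than against its own partner, so it counts as "not top" but still falls within the top 3.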

26 Evaluation of tscore
Other distance scores, illustrated on the vectors (3,1,0,2,0) and (3,0,0,1,2):
- AND: 2
- NOT XOR: 3
- AND - XOR: 0
- EUCLID
- COS
The ID list (1,1,1,2,4,4, …) becomes the count vector (3,1,0,2, …), which is sparse.
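The baseline scores can be reproduced on the slide's two vectors. The presence-based reading below (a dimension counts as "on" when its value is nonzero) is an assumption, chosen because it yields the slide's values 2, 3 and 0. The function names are hypothetical.

```python
import math


def presence_scores(u, v):
    """AND, NOT XOR and AND - XOR over the nonzero pattern of two count vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)            # nonzero in both
    neither = sum(1 for a, b in zip(u, v) if not a and not b)  # zero in both
    one = len(u) - both - neither                              # nonzero in exactly one
    return {"AND": both, "NOT XOR": both + neither, "AND - XOR": both - one}


def euclid(u, v):
    """Euclidean distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


u, v = (3, 1, 0, 2, 0), (3, 0, 0, 1, 2)
print(presence_scores(u, v))  # {'AND': 2, 'NOT XOR': 3, 'AND - XOR': 0}
print(euclid(u, v), cosine(u, v))
```

With about 10,000 semantic IDs and short texts, most dimensions are zero in both vectors, which may explain why a score that rewards shared zeros, such as NOT XOR, discriminates poorly.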

27 Evaluation of tscore
Number of misses (200 + 200 texts)

Score        Not Top    Not in Top 3
AND          32         11
NOT XOR      347        188
AND - XOR    64         9
EUCLID       274        191
COS          93         32
TSCORE       2          2

28 Calculation Time
- Fry Corpus, with 200, 400, 800, 1600 and 3200 texts
- [Figure: calculation time, NORMAL vs. tscore (Top 3)]
- Number of samples: √(number of all texts)
- Mislabeling: 11 (in 200 pairs)

29 Agenda: Introduction; Related work; Proposal; Detecting parallel texts; Large scale crawling; Experiment; Conclusion

30 Conclusion and Future work
- Conclusion: parallel texts from the Web
  - Detecting parallel texts
  - Large scale crawling
- Future work
  - Crawling many texts from the Web
  - Crawling with parallel link structure
  - Detecting parallel texts in real HTML texts
  - Proper sampling

31 Thank you for your attention!

