CSCI 572’s Class Project Measuring the performance of parallel crawlers in different modes Huy Pham PhD – Computer Science Spring 2011.


1 CSCI 572’s Class Project Measuring the performance of parallel crawlers in different modes Huy Pham PhD – Computer Science Spring 2011

2 Project inspired by the research paper on parallel crawlers

[Figure: Two parallel crawlers] Site S1 is crawled by crawler C1 and site S2 is crawled by crawler C2.

In Firewall mode, crawlers ignore inter-partition links (C1 ignores g and C2 ignores d). Firewall mode produces no overlap and runs quickly because the crawlers do not communicate, but some pages can be missed because inter-partition links are dropped.

In Cross-over mode, crawlers also follow inter-partition links, so they download more pages than in Firewall mode, but overlap becomes an issue (g and d get downloaded twice).
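The difference between Firewall and Cross-over mode can be illustrated with a small simulation. This is only a sketch under stated assumptions: the link graph, the page names other than g and d, the seeds a and f, and the function names `crawl` and `run` are all hypothetical and are not taken from the project or the paper. Page h is added as an S2 page that is linked only from S1, to show how Firewall mode can miss data.

```python
from collections import deque

# Hypothetical toy link graph (an assumption, not the graph from the slides).
# S1 owns a, b, c, d; S2 owns f, g, h.
LINKS = {
    "a": ["b", "c"],
    "b": ["g", "h"],  # inter-partition links S1 -> S2; h is linked only from here
    "c": ["d"],
    "d": [],
    "f": ["g", "d"],  # f -> d is an inter-partition link S2 -> S1
    "g": [],
    "h": [],
}
SITE = {"a": "S1", "b": "S1", "c": "S1", "d": "S1",
        "f": "S2", "g": "S2", "h": "S2"}


def crawl(seed, home_site, follow_inter_partition):
    """Breadth-first crawl by one crawler; returns pages in download order."""
    seen = {seed}
    queue = deque([seed])
    downloaded = []
    while queue:
        page = queue.popleft()
        downloaded.append(page)
        for target in LINKS[page]:
            if target in seen:
                continue
            # Firewall mode: drop links that leave the crawler's own partition.
            if SITE[target] != home_site and not follow_inter_partition:
                continue
            seen.add(target)
            queue.append(target)
    return downloaded


def run(mode):
    """Run C1 (seed a) and C2 (seed f) independently; return (coverage, overlap)."""
    follow = mode == "crossover"
    c1 = crawl("a", "S1", follow)
    c2 = crawl("f", "S2", follow)
    return set(c1) | set(c2), set(c1) & set(c2)


if __name__ == "__main__":
    for mode in ("firewall", "crossover"):
        coverage, overlap = run(mode)
        print(mode, "coverage:", sorted(coverage), "overlap:", sorted(overlap))
```

On this toy graph, Firewall mode never reaches h and has no overlap, while Cross-over mode reaches every page but downloads g and d twice.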

3 Continued

In Exchange mode, crawlers periodically and incrementally exchange inter-partition links, which avoids overlap and increases coverage.

Implementation: crawl two websites in parallel, the USC Viterbi School of Engineering and the USC School of Letters, Arts and Sciences. Each site has its own data, and the two also share many links (generally to each other and to the main USC website). Data from the USC website will be ignored in Firewall mode, overlap will occur in Cross-over mode when the two sites point to each other, and Exchange mode is expected to perform best of the three.

[Diagram labels: Viterbi, LAS, Nutch crawler, Solr indexing, DBMS]
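Exchange mode can be sketched in the same style. This is a hedged illustration, not the project's Nutch-based implementation: the toy graph, the crawler names in `OWNER`, the `batch_size` parameter (standing in for "periodically"), and the function `crawl_exchange` are all assumptions made for the example. Each crawler downloads only pages in its own partition and forwards inter-partition links to the crawler that owns the target.

```python
from collections import deque

# Hypothetical toy link graph (an assumption, not data from the project).
LINKS = {
    "a": ["b", "c"],
    "b": ["g", "h"],  # inter-partition links S1 -> S2
    "c": ["d"],
    "d": [],
    "f": ["g", "d"],  # f -> d is an inter-partition link S2 -> S1
    "g": [],
    "h": [],
}
SITE = {"a": "S1", "b": "S1", "c": "S1", "d": "S1",
        "f": "S2", "g": "S2", "h": "S2"}
OWNER = {"S1": "C1", "S2": "C2"}  # which crawler is responsible for which site


def crawl_exchange(seeds, batch_size=2):
    """Simulate Exchange mode; returns the pages each crawler downloaded."""
    frontier = {c: deque([s]) for c, s in seeds.items()}
    seen = {c: {s} for c, s in seeds.items()}
    outbox = {c: [] for c in seeds}
    downloaded = {c: [] for c in seeds}

    while any(frontier.values()):
        # Phase 1: each crawler downloads up to batch_size pages of its own.
        for crawler, queue in frontier.items():
            for _ in range(min(batch_size, len(queue))):
                page = queue.popleft()
                downloaded[crawler].append(page)
                for target in LINKS[page]:
                    owner = OWNER[SITE[target]]
                    if owner != crawler:
                        # Do not follow the link; hand it to the owning crawler.
                        outbox[crawler].append((owner, target))
                    elif target not in seen[crawler]:
                        seen[crawler].add(target)
                        queue.append(target)
        # Phase 2: periodic exchange of the collected inter-partition links.
        for crawler, messages in outbox.items():
            for owner, target in messages:
                if target not in seen[owner]:
                    seen[owner].add(target)
                    frontier[owner].append(target)
            messages.clear()
    return downloaded


if __name__ == "__main__":
    result = crawl_exchange({"C1": "a", "C2": "f"})
    for crawler, pages in result.items():
        print(crawler, pages)
```

On this toy graph every page is downloaded exactly once, including h, which is discovered only through a link forwarded from C1 to C2. The cost is the communication in the exchange phase, which Firewall mode avoids entirely.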

4 Evaluation

