Distributed Web Crawlers: Implementation

1

2 Implementation All of the following experiments were conducted on 40M web pages downloaded by Stanford's WebBase crawler in Dec. 1999, over a period of two weeks. The web image projected from this crawl may be biased, but it represents the pages a parallel crawler would fetch.

3 Firewall Mode & Coverage Firewall mode: –Every c-proc collects pages only from its predetermined partition and follows only intra-partition links. –Minimal communication overhead, but may suffer quality and coverage problems.
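To make the firewall rule concrete, here is a minimal Python sketch of the link filter a c-proc would apply, assuming the site-hash partitioning described on the next slide. The hash choice (MD5 of the hostname) and the function names are illustrative, not taken from the paper.

```python
import hashlib
from urllib.parse import urlparse

def site_hash(url: str, n_partitions: int) -> int:
    """Assign a URL to a partition by hashing its site (hostname) only."""
    site = urlparse(url).netloc
    return int(hashlib.md5(site.encode("utf-8")).hexdigest(), 16) % n_partitions

def firewall_filter(discovered_links, my_partition: int, n_partitions: int):
    """Firewall mode: keep only intra-partition links; inter-partition links
    are simply dropped, which is the source of the coverage loss."""
    return [u for u in discovered_links
            if site_hash(u, n_partitions) == my_partition]
```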

4 Firewall Mode & Coverage Experimental setup: the 40M pages are treated as the entire web, partitioned by site hash. Each c-proc was given five random seed sites from its own partition (5n seeds for the overall crawler).

5 Results

6 Results (2)

7 Conclusions When a small number of c-procs run in parallel, firewall mode provides good coverage, and the crawler may start with a relatively small number of seed URLs. This mode is not a good choice when coverage is important, especially when many c-procs run in parallel.

8 Example Suppose we want to download 1B pages over one month, with a 10 Mbps Internet link per c-proc machine: –We need to download 10^9 × 10^4 = 10^13 bytes. –The required download rate is about 34 Mbps, so we need 4 c-procs; from Fig. 4 we conclude that coverage will be about 80%. –With only a week, we would need a download rate of 140 Mbps = 14 c-procs, which would cover only about 50%.
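A quick back-of-the-envelope helper reproducing the slide's sizing argument. The 10 KB average page size and the 30-day month are assumptions, so the computed rates land near (not exactly at) the slide's 34 Mbps and 140 Mbps figures, while the c-proc counts come out the same.

```python
import math

def required_cprocs(pages: int, days: float, page_bytes: int = 10_000,
                    link_mbps: float = 10.0):
    """Bandwidth needed to fetch `pages` in `days`, and the number of
    c-proc machines (each with `link_mbps` of bandwidth) required."""
    mbps = pages * page_bytes * 8 / (days * 86_400) / 1e6
    return round(mbps, 1), math.ceil(mbps / link_mbps)

print(required_cprocs(10**9, 30))  # ~(30.9 Mbps, 4 c-procs)  -> one month
print(required_cprocs(10**9, 7))   # ~(132.3 Mbps, 14 c-procs) -> one week
```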

9 Cross-over & Overlap This mode may yield improved coverage, since a c-proc follows inter-partition links once it runs out of links in its own partition. It also incurs overlap, because a page can be downloaded by several c-procs. => The crawler increases coverage at the expense of overlap.
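A minimal sketch of the cross-over scheduling rule: a c-proc prefers links from its own partition and dips into foreign links only when its own queue runs dry. The two-queue layout and the `partition_of` callback are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Iterable, Optional

def enqueue_links(links: Iterable[str], my_partition: int,
                  partition_of: Callable[[str], int],
                  own_queue: deque, foreign_queue: deque) -> None:
    """Sort newly discovered links into intra- and inter-partition queues."""
    for url in links:
        (own_queue if partition_of(url) == my_partition else foreign_queue).append(url)

def next_url(own_queue: deque, foreign_queue: deque) -> Optional[str]:
    """Cross-over mode: follow a foreign link only when the own queue is empty.
    This is what creates overlap, since another c-proc owns that page too."""
    if own_queue:
        return own_queue.popleft()
    if foreign_queue:
        return foreign_queue.popleft()
    return None
```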

10 Cross-over & Overlap Same setup as before: the 40M pages are treated as the entire web, partitioned by site hash, and each c-proc was given five random seed sites from its own partition (5n for the overall crawler). Overlap is measured at various coverage points.
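For reference, a tiny sketch of the two metrics being traded off here. The formulas (overlap = (N - I) / I, coverage = I / U) follow the definitions of the underlying Parallel Crawlers paper as I recall them; treat the exact form as an assumption rather than a quotation of the slides.

```python
def overlap(total_downloads: int, unique_pages: int) -> float:
    """Redundant downloads per unique page: (N - I) / I."""
    return (total_downloads - unique_pages) / unique_pages

def coverage(unique_pages: int, universe_size: int = 40_000_000) -> float:
    """Fraction of the 40M-page 'entire web' actually obtained: I / U."""
    return unique_pages / universe_size
```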

11 Results

12 Conclusions While this mode is much better than an independent crawl, it still incurs quite significant overlap. For example, with 4 c-procs running, overlap reaches almost 2.5 in order to obtain coverage close to 1. For this reason this mode is not recommended unless coverage is important and no communication between c-procs is available.

13 Exchange Mode & Communication In this section we study the communication overhead of an exchange-mode crawler and how to reduce it by replication. We split the 40M pages into n partitions based on the site-hash value and run n c-procs in exchange mode.
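A minimal sketch of the exchange-mode routing rule: a discovered link that belongs to another partition is forwarded to the owning c-proc instead of being followed or dropped. The `send_to` transport and the `partition_of` callback are stand-ins, not the paper's actual interfaces.

```python
from typing import Callable, Iterable, List

def route_links(links: Iterable[str], my_partition: int,
                partition_of: Callable[[str], int],
                local_queue: List[str],
                send_to: Callable[[int, str], None]) -> None:
    """Exchange mode: keep our own links, forward the rest to their owners."""
    for url in links:
        owner = partition_of(url)
        if owner == my_partition:
            local_queue.append(url)   # crawl it ourselves
        else:
            send_to(owner, url)       # this forwarding is the communication overhead
```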

14 Results

15 Conclusions The site-hash based partitioning scheme significantly reduces communication overhead compared with the URL-hash based scheme. On average we need to transfer less than 10% of the discovered links (up to about one link per page).
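The intuition is that most links on a page point back into the same site, so hashing by site keeps them intra-partition. The toy measurement below, with illustrative hash functions and an assumed `(page_url, links)` input shape, estimates the fraction of links that would have to be exchanged under each scheme.

```python
import hashlib
from urllib.parse import urlparse

def _bucket(text: str, n: int) -> int:
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16) % n

def url_hash(url: str, n: int) -> int:
    return _bucket(url, n)                    # partition by the whole URL

def site_hash(url: str, n: int) -> int:
    return _bucket(urlparse(url).netloc, n)   # partition by hostname only

def exchanged_fraction(pages, partition_of, n: int) -> float:
    """pages: iterable of (page_url, [link_urls]).
    Fraction of discovered links that leave the discovering c-proc's
    partition and would therefore need to be exchanged."""
    total = crossing = 0
    for page_url, links in pages:
        src = partition_of(page_url, n)
        for link in links:
            total += 1
            crossing += partition_of(link, n) != src
    return crossing / total if total else 0.0
```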

16 Conclusions (2) The network bandwidth used for URL exchange is relatively small. The average URL length is about 40 bytes, while an average page is about 10 KB, so this transfer consumes about 0.4% of total network bandwidth.
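A one-line sanity check of the 0.4% figure, using the slide's own averages (40-byte URLs, 10 KB pages, roughly one exchanged URL per downloaded page).

```python
url_bytes, page_bytes = 40, 10_000
print(f"URL exchange share of bandwidth: {url_bytes / page_bytes:.1%}")  # -> 0.4%
```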

17 Conclusions (3) The overhead of this exchange is nonetheless quite significant, because each transmission goes through the TCP/IP network stack on both sides and incurs two switches between kernel and user mode. Batch communication, studied later in this deck, amortizes this cost; see the sketch below.
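A minimal sketch of that batching idea: outgoing URLs are buffered per destination c-proc and sent in one large message, so the stack traversal and kernel/user switches are paid once per batch rather than once per URL. The batch size and the `send_batch` transport are assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, List

class BatchedExchanger:
    """Buffer URL exchanges per destination and flush them in large batches."""

    def __init__(self, send_batch: Callable[[int, List[str]], None],
                 batch_size: int = 10_000):
        self.send_batch = send_batch
        self.batch_size = batch_size
        self.buffers: Dict[int, List[str]] = defaultdict(list)

    def queue(self, dest: int, url: str) -> None:
        buf = self.buffers[dest]
        buf.append(url)
        if len(buf) >= self.batch_size:
            self.flush(dest)

    def flush(self, dest: int) -> None:
        if self.buffers[dest]:
            self.send_batch(dest, self.buffers[dest])
            self.buffers[dest] = []
```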

18 Reducing Overhead by Replication

19 Conclusions Based on this result, replicating the 10,000-100,000 most popular URLs in each c-proc gives the best results: it minimizes communication overhead while keeping replication overhead low.
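A minimal sketch of how replication cuts the exchange traffic, as I understand the scheme: the k most popular URLs (e.g., from a previous crawl) are replicated at every c-proc ahead of time, so links pointing to them never need to be forwarded. The function names and the set representation are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Set

def most_popular_urls(backlinks: Counter, k: int = 100_000) -> Set[str]:
    """Pick the k URLs with the most backlinks to replicate at every c-proc."""
    return {url for url, _ in backlinks.most_common(k)}

def links_to_exchange(outgoing: Iterable[str], replicated: Set[str]) -> List[str]:
    """Only links NOT in the replicated set still need to be sent to their owner."""
    return [u for u in outgoing if u not in replicated]
```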

20 Quality & Batch Communication In this section we study the quality issue: –As mentioned, a parallel crawler can be worse than a single-process crawler if every c-proc makes decisions based solely on its own local information.

21 Quality & Batch Communication (2) Throughout this section we regard a page's importance I(p) as the number of backlinks it has. –This is the most common importance metric. –The backlink counts known to each c-proc obviously depend on how often the c-procs exchange information.
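A minimal sketch of this importance metric: I(p) is simply the number of backlinks a page has in the portion of the link graph seen so far. The `(page_url, outgoing_links)` input shape is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def backlink_counts(pages: Iterable[Tuple[str, List[str]]]) -> Counter:
    """Return I(p) = number of backlinks observed for each linked-to URL.
    In a parallel crawl each c-proc only sees its own partition's links
    unless backlink messages are exchanged between c-procs."""
    counts: Counter = Counter()
    for _page, links in pages:
        counts.update(links)
    return counts
```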

22 Quality at Different Exchange Rates

23 Conclusions As the number of c-procs increases, quality becomes worse unless they exchange backlink messages often. The quality of a firewall-mode crawler is worse than that of a single-process crawler when downloading a small fraction of the pages; however, there is no difference when downloading larger fractions.

24 Quality and Communication Overhead

25 Conclusions Communication overhead does not increase linearly. A large number of URL exchanges is not necessary to achieve high quality, especially when downloading a large portion of the web (Fig. 9).

26 Final Example Say we plan to operate a medium-scale search engine that covers 20% of the web (240M pages). We plan to refresh the index once a month, and our machines have a 1 Mbps connection to the Internet. –We need about 7.44 Mbps of download bandwidth, so at least 8 c-procs must run in parallel.
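The same back-of-the-envelope sizing as in the earlier example, applied to these numbers. The 10 KB page size and the 30-day month are assumptions, which is why the result lands near the slide's 7.44 Mbps rather than exactly on it; the c-proc count is the same.

```python
import math

pages, page_bytes, days, link_mbps = 240_000_000, 10_000, 30, 1.0
required_mbps = pages * page_bytes * 8 / (days * 86_400) / 1e6
print(round(required_mbps, 2), "Mbps ->", math.ceil(required_mbps / link_mbps), "c-procs")
# ~7.41 Mbps -> 8 c-procs
```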

27 Related charts

28 Final Conclusions When a small number of c-procs run in parallel, firewall mode provides good coverage. Given the simplicity of this mode, it is a good option to consider unless: –More than 4 c-procs are required (Fig. 4). –Only a small subset of the web is needed and quality is important (Fig. 9).

29 Final Conclusions (2) An exchange-mode crawler consumes little network bandwidth and minimizes overhead if batch communication is used. Quality is maximized even with fewer than 100 URL exchanges. Replicating the 10,000-100,000 most popular URLs reduces communication overhead by roughly 40%; further replication contributes little (Fig. 8).

30 References Junghoo Cho, Hector Garcia-Molina. Parallel Crawlers. October 2001. Mike Burner. Crawling Towards Eternity. Web Techniques Magazine, May 1998.

