1 Crawling
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading: 20.1, 20.2 and 20.3

2 Spidering
Spidering = "walking" over a graph, 24h a day, 7 days a week. What about the graph? It is the Web's BowTie: a directed graph G = (N, E) where
- N changes (inserts, deletes): >> 50 * 10^9 nodes
- E changes (inserts, deletes): > 10 links per node
so roughly 10 * 50 * 10^9 = 500 * 10^9 1-entries in the adjacency matrix.

3 Crawling Issues
How to crawl?
- Quality: "best" pages first
- Efficiency: avoid duplication (or near-duplication)
- Etiquette: robots.txt, server-load concerns (minimize load); see the robots.txt check sketched below
How much to crawl? How much to index?
- Coverage: how big is the Web? How much of it do we cover?
- Relative coverage: how much do competitors have?
How often to crawl?
- Freshness: how much has changed?
How to parallelize the process?
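
The etiquette point can be illustrated with Python's standard urllib.robotparser; this is a minimal sketch, not part of the slides, and the crawler name and URLs are placeholders.

    # Check robots.txt before fetching a page (crawler etiquette).
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()  # download and parse robots.txt

    # can_fetch() tells whether the named user agent may fetch the URL.
    if rp.can_fetch("MyCrawler", "https://www.example.com/some/page.html"):
        print("allowed to fetch")
    else:
        print("disallowed by robots.txt")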

4 Page selection
Given a page P, define how "good" P is. Several metrics:
- BFS, DFS, Random
- Popularity-driven (PageRank, full vs. partial)
- Topic-driven or focused crawling
- Combined
A frontier supporting these orderings is sketched below.
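
A small sketch, not from the slides, of a crawl frontier that can serve the orderings above behind one interface; score is a stand-in for whatever popularity or topic measure is plugged in.

    # Crawl frontier: FIFO/LIFO for BFS/DFS, a heap for priority-driven policies.
    import heapq
    from collections import deque

    class Frontier:
        def __init__(self, policy="BFS"):
            self.policy = policy
            self.fifo = deque()  # BFS (popleft) / DFS (pop)
            self.heap = []       # popularity- or topic-driven ordering

        def push(self, url, score=0.0):
            if self.policy in ("BFS", "DFS"):
                self.fifo.append(url)
            else:
                heapq.heappush(self.heap, (-score, url))  # highest score first

        def pop(self):
            if self.policy == "BFS":
                return self.fifo.popleft()
            if self.policy == "DFS":
                return self.fifo.pop()
            return heapq.heappop(self.heap)[1]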

5 Is this page a new one?
Check whether the page has been parsed or downloaded before:
- after 20 million pages, we have "seen" over 200 million URLs
- each URL is at least 100 bytes on average, so overall we have about 20 GB of URLs
Options: compress the URLs in main memory, or use disk
- Bloom filter (Archive); a minimal sketch follows below
- Disk access with caching (Mercator, AltaVista)
- Also, two-level indexing with front-coding compression
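
A minimal Bloom-filter sketch for the "already seen?" test, not taken from the slides; the bit-array size, the number of hash functions and the SHA-1-based hashing are illustrative choices.

    # Bloom filter: compact, in-memory, with false positives but no false negatives.
    import hashlib

    class BloomFilter:
        def __init__(self, n_bits=8 * 1024 * 1024, n_hashes=4):
            self.n_bits = n_bits
            self.n_hashes = n_hashes
            self.bits = bytearray(n_bits // 8)

        def _positions(self, url):
            # Derive n_hashes bit positions from salted SHA-1 digests.
            for i in range(self.n_hashes):
                h = hashlib.sha1(f"{i}:{url}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.n_bits

        def add(self, url):
            for p in self._positions(url):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, url):
            # May answer True for a URL never added (false positive),
            # never False for a URL that was added.
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(url))

    seen = BloomFilter()
    if "http://www.di.unipi.it/" not in seen:
        seen.add("http://www.di.unipi.it/")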

6 Crawler "cycle of life"

Link Extractor:
  while( <…> ) {
    <extract …>
  }

Downloaders:
  while( <…> ) {
    <store page(u) in a proper archive, possibly compressed>
  }

Crawler Manager:
  while( <…> ) {
    foreach u extracted {
      if ( (u ∉ "Already Seen Page") || (u ∈ "Already Seen Page" && <…>) ) {
        <…>
      }
    }
  }

(Slide diagram: the Link Extractor, Crawler Manager and Downloaders are connected through the structures labeled PQ, AR and PR. A runnable sketch of the cycle follows below.)
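
A single-threaded Python sketch of this cycle, not from the slides; it assumes PQ, AR and PR stand for the priority queue of URLs to visit, the repository of URLs assigned to downloaders, and the repository of fetched pages, and fetch and parse_links are hypothetical stand-ins for the real downloading and parsing code.

    # The three components share PQ (URLs to visit), AR (URLs assigned to
    # downloaders) and PR (fetched pages); "seen" is the Already-Seen-Page set.
    from collections import deque

    PQ, AR, PR = deque(), deque(), deque()
    seen = set()

    def crawler_manager():
        while PQ:
            u = PQ.popleft()          # in a real system: highest-priority URL
            if u not in seen:         # or: seen, but a fresher copy is wanted
                seen.add(u)
                AR.append(u)          # hand u over to some downloader

    def downloaders(fetch):
        while AR:
            u = AR.popleft()
            PR.append((u, fetch(u)))  # store page(u), possibly compressed

    def link_extractor(parse_links):
        while PR:
            u, page = PR.popleft()
            PQ.extend(parse_links(page))  # extracted links feed the next round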

7 Parallel Crawlers
The Web is too big to be crawled by a single crawler; the work should be divided while avoiding duplication.
- Dynamic assignment: a central coordinator dynamically assigns URLs to crawlers; extracted links are sent back to the central coordinator (a bottleneck?)
- Static assignment: the Web is statically partitioned and assigned to crawlers; each crawler crawls only its part of the Web

8 Two problems with static assignment
Let D be the number of downloaders; hash(URL) maps a URL to {0, ..., D-1}, and downloader x fetches the URLs U such that hash(U) = x (a sketch follows below).
- Load balancing the number of URLs assigned to each downloader: static schemes based on hosts may fail (www.geocities.com/… vs. www.di.unipi.it/), and dynamic "relocation" schemes may be complicated.
- Managing fault tolerance: what about the death of a downloader? D → D-1, new hash!!! What about a new downloader? D → D+1, new hash!!!
Which hash would you use?
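
A sketch of the host-based static scheme, not from the slides; the MD5-based hash and the function name are illustrative.

    # Static assignment: downloader x fetches the URLs whose host hashes to x.
    # Hashing the host keeps each site on one downloader, but a huge host
    # (e.g. www.geocities.com) still unbalances the load, and changing D
    # remaps almost every host -- exactly the two problems above.
    import hashlib
    from urllib.parse import urlparse

    def assign(url, D):
        host = urlparse(url).netloc
        h = int.from_bytes(hashlib.md5(host.encode()).digest()[:8], "big")
        return h % D

    print(assign("http://www.di.unipi.it/", D=4))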

9 A nice technique: Consistent Hashing
A tool for: spidering, Web caching, P2P, routers, load balancing, distributed file systems.
- Items and servers are mapped to the unit circle.
- Item K is assigned to the first server N such that ID(N) ≥ ID(K).
- What if a downloader goes down? What if a new downloader appears?
- Each server gets replicated log S times.
Properties:
- [monotone] adding a new server moves points only from one old server to the new one
- [balance] the probability that an item goes to a given server is ≤ O(1)/S
- [load] any server gets ≤ (I/S) log S items w.h.p. (with I items and S servers)
- [scale] you can copy each server more times...
(A minimal ring is sketched below.)
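
A minimal consistent-hashing ring, not from the slides, following the rule above (an item goes to the first server whose ID is at least the item's ID, wrapping around the circle); the MD5-based IDs and the fixed replica count stand in for the log S virtual copies per server.

    # Consistent hashing: servers (downloaders) and items (URLs) share one ID space.
    import bisect
    import hashlib

    def _id(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    class Ring:
        def __init__(self, servers, replicas=8):
            # Each server appears 'replicas' times on the circle (virtual nodes).
            self.ring = sorted((_id(f"{s}#{r}"), s)
                               for s in servers for r in range(replicas))
            self.ids = [i for i, _ in self.ring]

        def server_for(self, item):
            # First server with ID >= ID(item), wrapping around the circle.
            i = bisect.bisect_left(self.ids, _id(item)) % len(self.ids)
            return self.ring[i][1]

    ring = Ring(["downloader-0", "downloader-1", "downloader-2"])
    print(ring.server_for("http://www.di.unipi.it/"))
    # Removing or adding a downloader only moves the items between it and
    # its neighbour on the circle; all other assignments stay put.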

10 Examples: Open Source
Nutch (also used by WikiSearch): http://nutch.apache.org/

