Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

1 Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Crawling

2 A small example Paolo Ferragina, Web Algorithmics

3 Is this page a new one? Check whether the page has been parsed/downloaded before: duplicate match, near-duplicate match.
Options:
Hashing on URLs
- After 50 bln pages, we have "seen" over 500 bln URLs
- Each URL is at least 100 bytes on average
- Overall we have about 50 Pb just for the URLs
- Disk access with caching (e.g. Altavista): > 5 ms per URL check
- > 5 ms × 5·10^13 URL checks ⇒ 8000 years with 1 PC ⇒ 30 days with 10^5 PCs
Bloom Filter (Archive)
- For 50 bln URLs ⇒ about 50 Tb [cf. ~1/1000 of hashing]
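As a sanity check, a minimal back-of-envelope computation using the slide's own figures (5·10^13 URL checks at 5 ms each, 10^5 PCs) reproduces the 8000-years / 30-days estimate:

```python
# All figures below are the slide's assumptions, not measurements.
url_checks = 5e13                  # URL-membership checks to perform
seek_time_s = 5e-3                 # > 5 ms per disk-based URL check
seconds_per_year = 365 * 24 * 3600

one_pc_years = url_checks * seek_time_s / seconds_per_year
print(f"single PC: ~{one_pc_years:,.0f} years")             # ~ 8000 years
print(f"10^5 PCs: ~{one_pc_years / 1e5 * 365:.0f} days")     # ~ 29-30 days
```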

4 Is the page new? [Bloom Filter, 1970]
Create a binary array B[1, m]. Consider a family of k hash functions, each mapping a key (URL) to a position (an integer) in B. [Figure: the bit array B, with k = 3 hash functions setting a key's three positions to 1.]
Pros: no need to store the keys, less complex; ≈ 50·X bln bits in the array (for 50 bln URLs and X bits per key) versus the billions of chars needed to store the URLs explicitly (previous slide).
Cons: false positives.
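A minimal Bloom-filter sketch of this idea; the array size m = 10,000 and k = 3 are illustrative, and the k hash functions are simulated here by salting SHA-1, which is just one convenient choice:

```python
import hashlib

class BloomFilter:
    """Bit array B[0..m-1] plus k hash functions; no keys are stored."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key):
        # k positions in [0, m), derived from k salted hashes of the key
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        # True may be a false positive; False is always correct
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(key))

bf = BloomFilter(m=10_000, k=3)
bf.add("http://example.com/page1")
print("http://example.com/page1" in bf)   # True
print("http://example.com/other" in bf)   # False (with high probability)
```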

5 Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

6 Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
m/n = 30 bits per key. The false-positive probability is ε = 0.62^(m/n), which for m/n = 30 is about 6·10^-7. The error probability is minimized by choosing k = (m/n)·ln 2 hash functions. The Bloom filter is advantageous whenever (m/n) ≪ (key length in bits + log n).
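A quick numeric check of these formulas for m/n = 30 bits per key:

```python
from math import log

m_over_n = 30                      # bits per key (m/n)
k_opt = m_over_n * log(2)          # optimal number of hash functions, ~ 20.8
eps = 0.62 ** m_over_n             # false-positive rate at the optimal k
print(f"k ~ {k_opt:.1f}, false-positive rate ~ {eps:.1e}")   # ~ 5.9e-07
```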

7 Parallel Crawlers
The Web is too big to be crawled by a single crawler: the work must be divided among many crawlers while avoiding duplication.
Dynamic assignment: a central coordinator dynamically assigns URLs to the crawlers; it requires communication between the coordinator and the crawling threads.
Static assignment: the Web is statically partitioned and each part is assigned to one crawler; a crawler only crawls its own part of the Web, so no coordinator (and no communication) is needed.
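A minimal sketch of dynamic assignment, assuming a single-process coordinator modeled by a shared queue; the seed URLs are hypothetical, and fetching/link extraction are stubbed out:

```python
import queue, threading

frontier = queue.Queue()          # coordinator's shared URL frontier
seen = set()                      # URLs already assigned (exact-duplicate check)
lock = threading.Lock()

def crawler(worker_id):
    while True:
        url = frontier.get()      # coordinator hands out the next URL
        if url is None:           # sentinel: no more work
            break
        print(f"worker {worker_id} fetches {url}")
        # ... fetch the page, parse it, then push unseen out-links back:
        for link in []:           # stub: extracted out-links would go here
            with lock:
                if link not in seen:
                    seen.add(link)
                    frontier.put(link)
        frontier.task_done()

for u in ["http://a.example/", "http://b.example/"]:
    seen.add(u); frontier.put(u)
workers = [threading.Thread(target=crawler, args=(i,)) for i in range(3)]
for w in workers: w.start()
frontier.join()
for _ in workers: frontier.put(None)
for w in workers: w.join()
```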

8 Two problems with static assignment
Let D be the number of downloaders; hash(URL) maps a URL to {0, ..., D-1}, and downloader x fetches the URLs U such that hash(U) = x. Which hash would you use?
Load balancing the #URLs assigned to the downloaders:
- static schemes based on hosts may fail
- dynamic "relocation" schemes may be complicated
Managing fault tolerance:
- What about the death of a downloader? D → D-1, new hash!!!
- What about a new downloader? D → D+1, new hash!!!
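A small sketch (with hypothetical URLs and MD5 as the hash) showing why hash(URL) mod D is fragile: when a downloader dies and D drops from 3 to 2, roughly two thirds of the URLs change owner, so almost the whole partition has to be rebuilt:

```python
import hashlib

def owner(url, D):
    # map a URL to a downloader id in {0, ..., D-1}
    h = int(hashlib.md5(url.encode()).hexdigest(), 16)
    return h % D

urls = [f"http://site{i}.example/page{j}" for i in range(20) for j in range(50)]

moved = sum(owner(u, 3) != owner(u, 2) for u in urls)
print(f"{moved}/{len(urls)} URLs change downloader when D: 3 -> 2")
# roughly 2/3 of the URLs move; with consistent hashing only the dead
# downloader's ~1/3 share would need to be reassigned
```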

9 A nice technique: Consistent Hashing
Items and servers are mapped to the unit circle via a hash function ID(); item K is assigned to the first server N such that ID(N) ≥ ID(K).
What if a downloader goes down? What if a new downloader appears?
A tool for: spidering, Web caches, P2P, routers, load balancing, distributed file systems.
Each server gets replicated log S times.
[monotone] adding a new server moves items only from one old server to the new one
[balance] the probability that an item goes to a given server is ≤ O(1)/S
[load] any server gets ≤ (I/S) log S items w.h.p.
[scale] you can replicate each server more times, ...
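A minimal consistent-hashing sketch along the lines above; the downloader names, the 32-bit integer ring standing in for the unit circle, and the replica count are illustrative assumptions:

```python
import bisect, hashlib

def ID(x):
    # position on the circle, here a 32-bit integer ring
    return int(hashlib.md5(x.encode()).hexdigest(), 16) % (1 << 32)

class ConsistentHash:
    def __init__(self, servers, replicas=8):
        # each server is replicated (~log S) times on the ring
        self.ring = sorted((ID(f"{s}#{r}"), s) for s in servers for r in range(replicas))

    def owner(self, key):
        # first server position >= ID(key), wrapping around the circle
        i = bisect.bisect_left(self.ring, (ID(key),))
        return self.ring[i % len(self.ring)][1]

ch = ConsistentHash(["downloader-0", "downloader-1", "downloader-2"])
print(ch.owner("http://example.com/some/page"))
# removing "downloader-1" only reassigns the URLs it owned; the rest stay put
```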

10 Open Source: Nutch (+ Hadoop), also used by WikiSearch

