
Search engine structure: Web Crawler, Page archive, Page Analyzer, Control, Query resolver, Ranker, Indexer (text, structure, and auxiliary indexes).




1 Search engine structure: Web Crawler, Page archive, Page Analyzer, Control, Query resolver, Ranker, Indexer (text, structure, and auxiliary indexes)

2 Information Retrieval: Crawling

3 The Web’s Characteristics. Size: billions of pages are available; at 5–40 KB per page, that is hundreds of terabytes, and the size grows every day! Change: 8% new pages and 25% new links per week; pages have a lifetime of about 10 days.

4 Spidering: 24h, 7 days “walking” over a graph, getting data. What about the graph? A bow-tie directed graph G = (N, E). N changes (inserts, deletes): > 8 × 10^9 nodes. E changes (inserts, deletes): > 10 links per node, hence 10 × 8 × 10^9 = 8 × 10^10 1-entries in the adjacency matrix.

5 A Picture of the Web Graph. Q: sparse or not sparse? 21 million pages, 150 million links.

6 A special sorting (Stanford, Berkeley)

7 A Picture of the Web Graph

8 Crawler “cycle of life”. Link Extractor: while( <there are pages to parse> ){ <extract links from the page> }. Downloaders: while( <there are URLs to fetch> ){ <store page(u) in a proper archive, possibly compressed> }. Crawler Manager: while( <there are extracted links> ){ foreach u extracted { if ( (u ∉ “Already Seen Pages”) || ( u ∈ “Already Seen Pages” && <u needs refresh> ) ) { <send u to the downloaders> } } }. The modules communicate through the queues/archives PQ, PR, and AR.
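The cycle above can be sketched as a single-threaded loop. This is a minimal illustration, not the real multi-module crawler: `fetch`, `extract_links`, and `needs_refresh` are hypothetical callables supplied by the caller, and the priority queue is a plain FIFO deque.

```python
from collections import deque

def crawler_loop(seed_urls, fetch, extract_links, needs_refresh, max_pages=100):
    """Sketch of the Crawler Manager / Downloader / Link Extractor cycle.
    fetch(u) -> page, extract_links(page) -> URLs, needs_refresh(u) -> bool
    are assumptions supplied by the caller."""
    pq = deque(seed_urls)   # PQ: queue of URLs still to download
    already_seen = set()    # the "Already Seen Pages" structure
    archive = {}            # AR: page archive (possibly compressed in practice)
    while pq and len(archive) < max_pages:
        u = pq.popleft()
        if u in already_seen and not needs_refresh(u):
            continue        # seen before and still fresh: skip
        already_seen.add(u)
        page = fetch(u)     # Downloader: store page(u) in the archive
        archive[u] = page
        for link in extract_links(page):   # Link Extractor
            if link not in already_seen:
                pq.append(link)            # hand new links back to the manager
    return archive
```

With a toy in-memory link graph, seeding the loop on one page visits everything reachable from it exactly once.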

9 Crawling Issues. How to crawl? Quality: “best” pages first. Efficiency: avoid duplication (or near-duplication). Etiquette: robots.txt and server-load concerns (minimize load). How much to crawl, and how much to index? Coverage: how big is the Web, and how much of it do we cover? Relative coverage: how much do competitors have? How often to crawl? Freshness: how much has changed? And how to parallelize the process?

10 Page selection. Given a page P, define how “good” P is. Several metrics: BFS, DFS, random; popularity-driven (PageRank, full vs. partial); topic-driven (focused crawling); combined.
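The BFS/DFS metrics above differ only in the frontier discipline. A small sketch over a toy adjacency map (the graph and URLs here are assumptions for illustration):

```python
from collections import deque

def crawl_order(graph, seed, strategy="BFS"):
    """Visit order over a link graph under different frontier disciplines.
    graph maps URL -> list of out-links (a toy adjacency map)."""
    frontier = deque([seed])
    seen, order = {seed}, []
    while frontier:
        # BFS pops the oldest URL, DFS the newest; everything else is identical
        u = frontier.popleft() if strategy == "BFS" else frontier.pop()
        order.append(u)
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return order
```

On the same graph, BFS visits pages level by level (the property [Najork 01] found correlates with quality), while DFS dives along one chain of links first.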

11 BFS: “…BFS-order discovers the highest quality pages during the early stages of the crawl” (328 million URLs in the testbed) [Najork 01].

12 Is this page a new one? Check whether the file has been parsed or downloaded before: after 20 million pages, we have “seen” over 200 million URLs; each URL is 50 to 75 bytes on average, so overall we have about 10 GB of URLs. Options: compress the URLs in main memory, or use disk. Bloom filter (Archive); disk access with caching (Mercator, AltaVista).
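A minimal sketch of the Bloom-filter option: a fixed bit array answers “have we seen this URL?” in constant memory, with false positives possible (a never-seen URL may be skipped) but no false negatives. The sizes below are illustrative, not the parameters used by the Archive crawler.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for URL seen-checks; m_bits and k are assumptions."""
    def __init__(self, m_bits=1 << 20, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, url):
        # k independent bit positions derived from salted hashes of the URL
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        # all k bits set => "probably seen"; any bit clear => definitely new
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))
```

At 200 million URLs, storing the raw strings costs ~10 GB, while a Bloom filter with ~10 bits per URL fits in ~250 MB at a small false-positive rate; that trade-off is why it appears on this slide.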

13 Parallel Crawlers. The Web is too big to be crawled by a single crawler; the work should be divided while avoiding duplication. Dynamic assignment: a central coordinator dynamically assigns URLs to crawlers, and extracted links are sent back to the coordinator. Static assignment: the Web is statically partitioned and assigned to crawlers, and each crawler only crawls its part of the Web.

14 Two problems. Let D be the number of downloaders; hash(URL) maps a URL to [0, D), and downloader x fetches the URLs U such that hash(U) ∈ [x−1, x). (1) Load balancing the #URLs assigned to downloaders: static schemes based on hosts may fail (www.geocities.com/… vs. www.di.unipi.it/), and dynamic “relocation” schemes may be complicated. (2) Managing fault tolerance: what about the death of a downloader? D → D−1, new hash!!! What about a new downloader? D → D+1, new hash!!!
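The “new hash!!!” problem is easy to see numerically. With plain modular hashing, shrinking D from 10 to 9 downloaders remaps roughly 90% of all URLs to a different downloader, even though only 10% of the capacity disappeared. A sketch with synthetic URLs (names and counts are assumptions for illustration):

```python
import hashlib

def bucket(url, d):
    """Assign a URL to one of d downloaders by simple modular hashing."""
    h = int.from_bytes(hashlib.sha256(url.encode()).digest()[:8], "big")
    return h % d

# One downloader dies: D goes from 10 to 9. Count how many URLs move.
urls = [f"http://example.org/page{i}" for i in range(10_000)]
moved = sum(bucket(u, 10) != bucket(u, 9) for u in urls)
```

For random hash values, hash % 10 equals hash % 9 only about one time in ten, so nearly the whole URL space must be reshuffled; this churn is what consistent hashing (next slide) is designed to avoid.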

15 A nice technique: Consistent Hashing. A tool for: spidering, Web caches, P2P, routers’ load balance, distributed FS. Items and servers are mapped to IDs (hashes of m bits) on a unit circle; item k is assigned to the first server with ID ≥ k. What if a downloader goes down? What if a new downloader appears? Theorem. Given S servers and I items, map Θ(log S) copies of each server and the I items on the unit circle. Then [load] any server gets ≤ (I/S) log S items, and [spread] any URL is stored in ≤ Θ(log S) servers.
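A minimal sketch of the construction, assuming hypothetical downloader names and a fixed number of virtual copies per server (playing the role of the Θ(log S) copies in the theorem): every server is hashed onto the circle several times, and an item goes to the first server ID at or after its own hash, wrapping around.

```python
import bisect
import hashlib

def _h(key):
    # 64-bit hash standing in for a point on the unit circle
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHash:
    """Toy consistent-hashing ring; `copies` virtual points per server."""
    def __init__(self, servers, copies=16):
        self.ring = sorted((_h(f"{s}#{c}"), s)
                           for s in servers for c in range(copies))
        self.keys = [k for k, _ in self.ring]

    def server_for(self, item):
        # first server ID >= hash(item), wrapping around the circle
        i = bisect.bisect_left(self.keys, _h(item)) % len(self.ring)
        return self.ring[i][1]
```

The payoff is exactly the downloader-death scenario from the previous slide: removing one server only reassigns the items that were on its arcs; every other item keeps its old server, because the remaining servers’ points on the circle are unchanged.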

16 Examples, open source: Nutch, also used by Overture (http://www.nutch.org); Heritrix, used by Archive.org (http://archive-crawler.sourceforge.net/index.html).

