
1 Mercator: A scalable, extensible Web crawler
Allan Heydon and Marc Najork, World Wide Web, 1999
2006. 5. 23, Young Geun Han

2 Contents
- Introduction
- Related work
- Architecture of a scalable Web crawler
- Extensibility
- Crawler traps and other hazards
- Results of an extended crawl
- Conclusions

3 1. Introduction
The motivations of this work:
- Due to the competitive nature of the search engine business, Web crawler design is not well documented in the literature.
- The authors also wanted to collect statistics about the Web.
This paper presents Mercator, a scalable, extensible Web crawler.
- Scalable: Mercator is designed to scale up to the entire Web. The authors achieve scalability by implementing their data structures so that they use a bounded amount of memory, regardless of the size of the crawl: the vast majority of the data structures are stored on disk, and only small parts of them are kept in memory for efficiency.
- Extensible: Mercator is designed in a modular way, with the expectation that new functionality will be added by third parties.

4 2. Related work (1)
- Web crawlers are almost as old as the Web itself: the first crawler, Matthew Gray's Wanderer, appeared in 1993 (roughly coinciding with the first release of NCSA Mosaic).
- Google search engine [Brin and Page 1998; Google]
  - A distributed system that uses multiple machines for crawling.
  - The crawler consists of five functional components (URL server, crawlers, Store Server, indexer, URL resolver) running as separate processes:
    - The URL server reads URLs out of a file and forwards them to multiple crawler processes.
    - Each crawler process runs on a different machine and uses asynchronous I/O to fetch data from up to 300 Web servers in parallel.
    - The crawlers transmit downloaded pages to a single Store Server process, which compresses the pages and stores them to disk (the repository).
    - The indexer reads pages from disk, extracts links from HTML pages, and saves them to a different disk file (the anchors file).
    - The URL resolver reads the link file, resolves the URLs into absolute form, and saves the absolute URLs.
[Diagram: Google's crawling architecture, also showing the indexer-side barrels, sorter, and doc index.]

5 2. Related work (2)
- Internet Archive [Burner 1997; InternetArchive]
  - The Internet Archive also uses multiple machines to crawl the Web.
  - Each crawler process is assigned up to 64 sites to crawl.
  - Each crawler reads a list of seed URLs and uses asynchronous I/O to fetch pages from its per-site queues in parallel.
  - When a page is downloaded, the crawler extracts the links contained in it and adds them to the appropriate site queue.
  - Using a batch process, it periodically merges "cross-site" URLs into the site-specific seed sets, filtering out duplicates in the process.
- SPHINX [Miller and Bharat 1998]
  - The SPHINX system provides some customizability features (a mechanism for limiting which pages are crawled, and document processing code).
  - SPHINX is targeted towards site-specific crawling, and is therefore not designed to be scalable.

6 3. Architecture of a scalable Web crawler
- The basic algorithm of any Web crawler takes a list of seed URLs as input and repeats the following steps (a sketch follows after this list):
  - Remove a URL from the URL list.
  - Determine the IP address of its host name.
  - Download the corresponding document.
  - Extract any links contained in it.
  - For each of the extracted links, ensure that it is an absolute URL.
  - Add it to the list of URLs to download, provided it has not been encountered before.
- Functional components:
  - a component (the URL frontier) for storing the list of URLs to download
  - a component for resolving host names into IP addresses
  - a component for downloading documents using the HTTP protocol
  - a component for extracting links from HTML documents
  - a component for determining whether a URL has been encountered before
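A minimal sketch of this loop in Java (Mercator's implementation language). Everything here is illustrative: the regex-based link extraction and the in-memory seen set stand in for the real components described on the following slides.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Minimal sketch of the basic crawl loop. A real crawler uses a proper
    // HTML parser and politeness policies; the regex only illustrates step 4.
    public class BasicCrawler {
        private static final Pattern HREF = Pattern.compile("href=[\"']([^\"']+)[\"']");

        public static void main(String[] args) throws Exception {
            Queue<URL> frontier = new ArrayDeque<>();   // URLs still to download
            Set<String> seen = new HashSet<>();         // URL-seen test (in memory)
            URL seed = new URL("http://example.com/");
            frontier.add(seed);
            seen.add(seed.toString());

            while (!frontier.isEmpty()) {
                URL url = frontier.remove();            // 1. remove a URL from the list
                StringBuilder page = new StringBuilder();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream()))) { // 2./3. resolve and download
                    for (String line; (line = in.readLine()) != null; ) page.append(line);
                }
                Matcher m = HREF.matcher(page);
                while (m.find()) {                      // 4. extract links
                    URL link = new URL(url, m.group(1)); // 5. make the URL absolute
                    if (seen.add(link.toString())) {     // 6. skip URLs seen before
                        frontier.add(link);
                    }
                }
            }
        }
    }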

7 3.1 Mercator's components (1)
[Figure 1. Mercator's main components: the URL frontier (backed by queue files), the DNS resolver, the protocol modules (HTTP, FTP, Gopher), the RIS, the content-seen test (backed by the doc FPs and a log), the processing modules (link extractor, tag counter, GIF stats), the URL filter, and the URL-seen test (backed by the URL set). The numbers 1 through 8 in the figure mark the steps of the worker-thread loop described on the next slide.]

8 3.1 Mercator's components (2)
Each worker thread repeatedly executes the following loop (the step numbers correspond to Figure 1):
1. Remove an absolute URL from the shared URL frontier for downloading.
2. The protocol module's fetch method downloads the document from the Internet into a per-thread RewindInputStream (RIS).
3. The worker thread invokes the content-seen test to determine whether this document has been seen before.
4. Based on the downloaded document's MIME type, the worker invokes the process method of each processing module associated with that MIME type (a dispatch sketch follows below).
5. Each extracted link is converted into an absolute URL.
6. The URL is tested against a user-supplied URL filter to determine if it should be downloaded.
7. The worker performs the URL-seen test, which checks if the URL has been seen before.
8. If the URL is new, it is added to the frontier.
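A sketch of step 4, dispatching by MIME type. The ProcessingModule interface and MimeDispatcher class are illustrative assumptions, not Mercator's actual API; they only show how a document can be routed to every module registered for its type.

    import java.util.List;
    import java.util.Map;

    // Hypothetical processing-module hook: each module sees the document bytes.
    interface ProcessingModule {
        void process(byte[] document, String url);
    }

    class MimeDispatcher {
        private final Map<String, List<ProcessingModule>> modulesByType;

        MimeDispatcher(Map<String, List<ProcessingModule>> modulesByType) {
            this.modulesByType = modulesByType;
        }

        void dispatch(String mimeType, byte[] document, String url) {
            // Invoke the process method of every module registered for this type.
            for (ProcessingModule m : modulesByType.getOrDefault(mimeType, List.of())) {
                m.process(document, url);
            }
        }
    }

For example, "text/html" might map to the link extractor and tag counter modules, and "image/gif" to the GIF statistics module.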

9 3.2 The URL frontier
- The URL frontier is the data structure that contains all the URLs that remain to be downloaded.
[Diagram: two frontier designs are contrasted. With a single FIFO queue, consecutive URLs at the head (e.g., http://naver.com/a.html, http://naver.com/b.html) often belong to the same Web server, so that server is hit by several threads at once. With a collection of distinct FIFO subqueues, all URLs of one host (e.g., Naver, Daum, SSU) sit in the same subqueue, so at most one connection is open to each server.]
- To implement the politeness constraint, the default version of Mercator's URL frontier is implemented as a collection of distinct FIFO subqueues.
- There is one FIFO subqueue per worker thread.
- When a new URL is added, the FIFO subqueue in which it is placed is determined by the URL's canonical host name, as sketched below.
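A sketch of this subqueue scheme, under the assumption that a simple hash of the canonical host name picks the subqueue (the slide does not specify the exact assignment function):

    import java.net.URL;
    import java.util.ArrayDeque;
    import java.util.Queue;

    // One FIFO subqueue per worker thread; a URL's subqueue is chosen from its
    // canonical host name, so all URLs of one server go to the same thread.
    class PoliteFrontier {
        private final Queue<URL>[] subqueues;

        @SuppressWarnings("unchecked")
        PoliteFrontier(int numWorkerThreads) {
            subqueues = new Queue[numWorkerThreads];
            for (int i = 0; i < numWorkerThreads; i++) {
                subqueues[i] = new ArrayDeque<>();
            }
        }

        synchronized void add(URL url) {
            // Lower-casing stands in for full host canonicalization.
            String host = url.getHost().toLowerCase();
            int index = Math.floorMod(host.hashCode(), subqueues.length);
            subqueues[index].add(url);
        }

        synchronized URL remove(int workerThreadId) {
            // Each worker thread drains only its own subqueue, which keeps at
            // most one connection open to any given server.
            return subqueues[workerThreadId].poll();
        }
    }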

10 3.3 The HTTP protocol module
- The purpose of a protocol module is to fetch the document corresponding to a given URL using the appropriate network protocol. Network protocols supported by Mercator include HTTP, FTP, and Gopher.
- Mercator implements the Robots Exclusion Protocol. To avoid downloading the robots exclusion file (robots.txt) on every request, Mercator's HTTP protocol module maintains a fixed-size cache mapping host names to their robots exclusion rules. Example:

  Host name       | Robots exclusion rules (User-agent, Disallow) | LRU value (date)
  www.naver.com   | *, /tmp/                                      | 2006.05.23/09:00 (1)
  www.daum.net    | googlebot, /cafe/                             | 2006.05.23/09:20 (2)
  www.ssu.ac.kr   | (none)                                        | 2006.05.23/10:00 (3)
  www.google.com  | *, /calendar/                                 | 2006.05.23/10:10 (4)

  The cache holds 2^18 entries and uses an LRU replacement strategy (a sketch follows below).
- Mercator uses its own "lean and mean" HTTP protocol module: its requests time out after 1 minute, and it has minimal synchronization and allocation overhead.
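A sketch of such a fixed-size LRU cache using the JDK's LinkedHashMap in access order. RobotsRules is a hypothetical stand-in for a parsed robots.txt entry, not a Mercator class.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Fixed-size host -> rules cache with LRU eviction.
    class RobotsCache extends LinkedHashMap<String, RobotsRules> {
        private static final int MAX_ENTRIES = 1 << 18;  // 2^18 entries, as in Mercator

        RobotsCache() {
            super(1024, 0.75f, true);  // access-order iteration gives LRU behavior
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, RobotsRules> eldest) {
            return size() > MAX_ENTRIES;  // evict the least recently used host
        }
    }

    // Hypothetical parsed robots.txt entry (one rule shown for brevity).
    class RobotsRules {
        final String userAgent;
        final String disallowPrefix;  // e.g., "*" and "/tmp/" for www.naver.com above

        RobotsRules(String userAgent, String disallowPrefix) {
            this.userAgent = userAgent;
            this.disallowPrefix = disallowPrefix;
        }
    }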

11 3.4 Rewind input stream
- Mercator's design allows the same document to be processed by multiple processing modules.
- To avoid reading a document over the network multiple times, Mercator caches the document locally using an abstraction called a RewindInputStream (RIS).
- A RIS caches small documents (64 KB or less) entirely in memory, while larger documents are temporarily written to a backing file (the amount of data buffered is capped at 1 MB).
- A RIS also provides a method for rewinding its position to the beginning of the stream, and various lexing methods that make it easy to build MIME-type-specific parsers.
[Diagram: a worker thread takes a URL from the frontier, initializes the RIS from the HTTP protocol module, and passes the RIS to each processing module in turn (link extractor, tag counter, GIF stats), rewinding the stream between modules.]
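A much-simplified sketch of the RIS abstraction, showing only the in-memory path; the real RIS spills documents larger than 64 KB to a backing file.

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Simplified RewindInputStream: caches the document once, then lets each
    // processing module re-read it from the start without touching the network.
    class RewindInputStream extends InputStream {
        private final byte[] buffer;
        private ByteArrayInputStream in;

        RewindInputStream(InputStream source, int limit) throws IOException {
            buffer = source.readNBytes(limit);  // cache up to `limit` bytes locally
            rewind();
        }

        // Reset the stream to its beginning so another module can re-read it.
        void rewind() {
            in = new ByteArrayInputStream(buffer);
        }

        @Override
        public int read() {
            return in.read();
        }
    }

A worker would initialize the RIS once, hand it to the link extractor, call rewind(), hand it to the tag counter, and so on.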

12 3.5 Content-seen test (1)
- A Web crawler may download the same document contents multiple times:
  - Many documents are available under multiple, different URLs.
  - There are also many cases in which documents are mirrored on multiple servers.
[Diagram: the same index.html is reachable as www.ssu.ac.kr/index.html, www3.ssu.ac.kr/index.html, and it.ssu.ac.kr/index.html on server A, and as a mirrored www.ssu.ac.kr/index.html on server B.]
- To prevent processing a document more than once, a Web crawler may wish to perform a content-seen test to decide if the document has already been processed.
- To save space and time, Mercator uses a data structure called the document fingerprint set that stores a 64-bit checksum of the contents of each downloaded document.
- Mercator computes the checksum using Broder's implementation [Broder 1993] of Rabin's fingerprinting algorithm [Rabin 1981].
- Fingerprints offer provably strong probabilistic guarantees that two different strings will not have the same fingerprint.
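The JDK has no Rabin fingerprint implementation, and Rabin fingerprints carry collision guarantees that ordinary hashes lack, so the sketch below substitutes 64-bit FNV-1a purely to show the shape of the computation (document bytes in, 64-bit checksum out):

    // Stand-in for Broder's Rabin fingerprint implementation. FNV-1a lacks
    // Rabin's provable collision guarantees; it only illustrates the interface.
    final class Fingerprint {
        static long fingerprint64(byte[] content) {
            long hash = 0xcbf29ce484222325L;   // FNV-1a offset basis
            for (byte b : content) {
                hash ^= (b & 0xff);
                hash *= 0x100000001b3L;        // FNV-1a prime
            }
            return hash;
        }
    }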

13 3.5 Content-seen test (2)
- Mercator maintains two independent sets of fingerprints:
  - a small hash table kept in memory, and
  - a large sorted list kept in a single disk file (accessed via Java's random access file API and protected by a readers-writer lock).
- The test proceeds as follows (sketched below):
  1. Check whether the fingerprint is in the in-memory hash table.
  2. If not, check whether it is in the disk file, using an in-memory index of the disk file to locate it.
  3. If the content has not been seen, add the new fingerprint to the in-memory table.
  4. When the hash table fills up, merge its contents with the fingerprints on disk.
  5. As part of the merge, update the in-memory index of the disk file.
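A sketch of this two-tier structure, with a sorted in-memory array standing in for the sorted disk file (the real one lives on disk behind a random access file and an index):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Two-tier fingerprint set: small in-memory table plus a large sorted list.
    class DocumentFingerprintSet {
        private final Set<Long> inMemory = new HashSet<>();  // recent fingerprints
        private long[] onDisk = new long[0];                 // sorted "disk" list
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private static final int MEMORY_LIMIT = 1 << 16;     // illustrative size

        // Returns true if fp was already present (content seen before).
        boolean testAndAdd(long fp) {
            lock.readLock().lock();
            try {
                if (inMemory.contains(fp)) return true;                // 1. check memory
                if (Arrays.binarySearch(onDisk, fp) >= 0) return true; // 2. check "disk"
            } finally {
                lock.readLock().unlock();
            }
            lock.writeLock().lock();
            try {
                inMemory.add(fp);                                      // 3. record new FP
                if (inMemory.size() > MEMORY_LIMIT) merge();           // 4./5. merge on fill-up
            } finally {
                lock.writeLock().unlock();
            }
            return false;
        }

        private void merge() {
            long[] merged = new long[onDisk.length + inMemory.size()];
            System.arraycopy(onDisk, 0, merged, 0, onDisk.length);
            int i = onDisk.length;
            for (long fp : inMemory) merged[i++] = fp;
            Arrays.sort(merged);   // keep the list sorted for binary search
            onDisk = merged;
            inMemory.clear();
        }
    }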

14 3.6 URL filters
- The URL filtering mechanism provides a customizable way to control the set of URLs that are downloaded.
- The URL filter class has a single crawl method that takes a URL and returns a boolean value indicating whether or not to crawl that URL.
[Diagram: the link extractor passes each URL through the URL filter before the URL-seen test and the frontier.]
- Example: a filter restricted to the domain www.ssu.ac.kr returns False for www.naver.com, True for www.ssu.ac.kr, and False for www.daum.net.
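A sketch of the filter hook. The interface and the DomainFilter class are illustrative, not Mercator's actual class names; DomainFilter mirrors the example above.

    import java.net.URL;

    // Customizable crawl-control hook: one method, URL in, boolean out.
    interface UrlFilter {
        boolean crawl(URL url);
    }

    // Only crawl URLs whose host matches a single configured domain.
    class DomainFilter implements UrlFilter {
        private final String domain;

        DomainFilter(String domain) {
            this.domain = domain;  // e.g., "www.ssu.ac.kr"
        }

        @Override
        public boolean crawl(URL url) {
            return url.getHost().equalsIgnoreCase(domain);
        }
    }

With new DomainFilter("www.ssu.ac.kr"), crawl(new URL("http://www.naver.com/")) returns false and crawl(new URL("http://www.ssu.ac.kr/")) returns true.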

15 3.7 Domain name resolution
- Before contacting a Web server, a Web crawler must use DNS to map the host name into an IP address.
- Mercator first tried to alleviate the DNS bottleneck by caching DNS results, but that was only partially effective: the Java interface to DNS lookups, like the DNS interface on most Unix systems, is synchronized, so only one lookup can be outstanding at a time.
- To remove the bottleneck, Mercator uses its own multi-threaded DNS resolver that can resolve host names much more rapidly than either the Java or Unix resolver (a sketch of the idea follows below).
- Before this change, DNS lookups accounted for 87% of each thread's elapsed time; the custom resolver reduced that figure to 25%.
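The sketch below only illustrates the concurrency idea of keeping many lookups outstanding at once by issuing each from its own pool thread; Mercator's actual resolver is its own multi-threaded DNS implementation, not a wrapper around the JDK call shown here.

    import java.net.InetAddress;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Overlap DNS lookups by running each one on a pool thread, so a slow
    // lookup blocks only its own thread rather than the whole crawler.
    class MultiThreadedResolver {
        private final ExecutorService pool = Executors.newFixedThreadPool(64);

        Future<InetAddress> resolve(String hostName) {
            return pool.submit(() -> InetAddress.getByName(hostName));
        }

        void shutdown() {
            pool.shutdown();
        }
    }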

16 3.8 URL-seen test (1)
- To avoid downloading and processing a document multiple times, a URL-seen test must be performed on each extracted link (deferring the test until a URL is removed from the frontier would result in a much larger frontier).
- To perform the URL-seen test, all of the URLs seen by Mercator are stored in canonical form in a large table called the URL set.
- To save space, Mercator does not store the textual representation of each URL in the URL set, but rather a fixed-size checksum.
- To reduce the number of operations on the backing disk file, Mercator keeps an in-memory cache of popular URLs, together with a table of recently-added URLs.

17 3.8 URL-seen test (2)
- Unlike the document fingerprints, the stream of URLs has a non-trivial amount of locality (URL locality).
- Using an in-memory cache of 2^18 entries and an LRU-like clock replacement policy, the requests break down as follows:
  - 66.2% hit the in-memory cache,
  - 9.5% hit the table of recently-added URLs,
  - 8% hit the buffer of the random access file implementation,
  - leaving about 16% of requests to miss and go to disk.
- As a result, each URL set membership test induces one-sixth as many kernel calls as a membership test on the document fingerprint set (each membership test on the URL set results in an average of 0.16 seek and 0.17 read kernel calls).
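A sketch of the layered lookup order implied by these numbers; every type here is a simplified stand-in (a HashMap for the 2^18-entry clock cache, HashSets for the recently-added table and the disk file):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Layered URL-seen test: most probes are answered by the cache or the
    // recently-added table, so only a small fraction reach the "disk" tier.
    class UrlSeenTest {
        private final Map<Long, Boolean> cache = new HashMap<>(); // stands in for the clock cache
        private final Set<Long> recentlyAdded = new HashSet<>();  // URLs added since the last merge
        private final Set<Long> diskUrlSet = new HashSet<>();     // stands in for the backing disk file

        // Returns true if the URL checksum was seen before; otherwise records it.
        boolean seenBefore(long checksum) {
            if (cache.containsKey(checksum)) return true;      // ~66.2% of requests
            if (recentlyAdded.contains(checksum)) return true; // ~9.5%
            if (diskUrlSet.contains(checksum)) {               // the remainder
                cache.put(checksum, Boolean.TRUE);             // promote popular URLs
                return true;
            }
            recentlyAdded.add(checksum);                       // genuinely new URL
            return false;
        }
    }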

18 3.8 URL-seen test (3)
- Host name locality arises because many links found in Web pages are to different documents on the same server.
- To preserve this locality, Mercator computes the checksum of a URL by merging two independent fingerprints:
  - the fingerprint of the URL's host name, and
  - the fingerprint of the complete URL.
- These two fingerprints are merged so that the high-order bits of the checksum derive from the host name fingerprint (a sketch follows below). As a result, checksums for URLs with the same host component are numerically close together.
- The host name locality in the stream of URLs thus translates into access locality on the URL set's backing disk file, allowing the kernel's file system buffers to service read requests from memory more often.
- On extended crawls, this technique results in a significant reduction in disk load, and thus a significant performance improvement.
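A sketch of the merge, assuming an even 32/32 bit split; the slide only says that the high-order bits derive from the host name fingerprint, so the exact split is an assumption here.

    // Host-locality-preserving checksum: URLs on the same host end up
    // numerically adjacent in the sorted URL set.
    final class UrlChecksum {
        static long merge(long hostFingerprint, long urlFingerprint) {
            return (hostFingerprint & 0xFFFFFFFF00000000L)  // high 32 bits: host name
                 | (urlFingerprint >>> 32);                 // low 32 bits: full URL
        }
    }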

19 3.9 Synchronous vs. asynchronous I/O
- Google and Internet Archive crawlers:
  - use single-threaded crawling processes and asynchronous I/O to perform multiple downloads in parallel, and
  - are designed from the ground up to scale to multiple machines.
- Mercator:
  - uses a multi-threaded process in which each thread performs synchronous I/O, which leads to a much simpler program structure;
  - it would not be too difficult to adapt Mercator to run on multiple machines.
[Diagram: in the Google and Archive crawlers, each machine runs one single-threaded process that multiplexes many Web servers; in Mercator, one machine runs many worker threads, each talking synchronously to one Web server at a time.]

20 3.10 Checkpointing
- To make it possible to complete a crawl of the entire Web, Mercator writes regular snapshots of its state to disk.
- An interrupted or aborted crawl can easily be restarted from the latest checkpoint.
- Mercator's core classes and all user-supplied modules are required to implement the checkpointing interface.
- Checkpoints are coordinated using a global readers-writer lock (a sketch follows below):
  - Each worker thread acquires a read share of the lock while processing a downloaded document.
  - Once a day, Mercator's main thread acquires the write lock; once it has done so, it arranges for the checkpoint methods to be called.
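A sketch of this coordination using java.util.concurrent's ReentrantReadWriteLock; the Checkpointable interface is an illustrative stand-in for Mercator's actual checkpointing interface.

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical stand-in for the checkpointing interface that core classes
    // and user-supplied modules must implement.
    interface Checkpointable {
        void checkpoint() throws Exception;
    }

    class CheckpointCoordinator {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // Called by each worker thread around the processing of one document.
        void processDocument(Runnable work) {
            lock.readLock().lock();
            try {
                work.run();
            } finally {
                lock.readLock().unlock();
            }
        }

        // Called (e.g., once a day) by the main thread; the write lock is not
        // granted until every in-flight document has finished processing.
        void checkpointAll(Iterable<Checkpointable> modules) throws Exception {
            lock.writeLock().lock();
            try {
                for (Checkpointable m : modules) {
                    m.checkpoint();  // each module snapshots its state to disk
                }
            } finally {
                lock.writeLock().unlock();
            }
        }
    }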

