Design and Implementation of a High-Performance Distributed Web Crawler. Vladislav Shkapenyuk and Torsten Suel. Proc. 18th Data Engineering Conf., pp. 357-368, 2002.

Design and Implementation of a High-Performance Distributed Web Crawler. Vladislav Shkapenyuk and Torsten Suel. Proc. 18th Data Engineering Conf., pp. 357-368, 2002. Presented June 02, 2006 by Jeonghye Sohn.

2 Contents
System Architecture
- Crawl Manager
- Downloaders and DNS Resolvers
- Crawling Application
- Scaling the System
Implementation Details and Algorithmic Techniques
- Application Parsing and Network Performance
- URL Handling
- Domain-Based Throttling

3 System Architecture (1): Crawling System
- The crawling system consists of several specialized components: a crawl manager, one or more downloaders, and one or more DNS resolvers.
- All of these components can run on different machines (and operating systems) and can be replicated to increase system performance.
- The crawl manager is responsible for receiving the URL input stream from the applications and forwarding it to the available downloaders and DNS resolvers, while enforcing rules about robot exclusion and crawl speed.
- A downloader is a high-performance asynchronous HTTP client capable of downloading hundreds of web pages in parallel.
- A DNS resolver is an optimized stub DNS resolver that forwards queries to local DNS servers.

4 System Architecture (2): Basic Crawl Loop
- The application issues URL requests to the manager.
- The manager schedules each URL on a downloader.
- The downloader fetches the file and puts it on disk.
- The application parses the new files for hyperlinks.
- The application sends data to the storage component (indexing is done later), as condensed in the sketch below.
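The steps above can be condensed into a toy single-process loop. This is only an illustrative sketch: the real system splits these roles across separate components and machines, and all helper logic below is hypothetical, not from the paper.

```python
import re
import urllib.request
from collections import deque

# Toy version of the cycle: issue a URL, fetch it, parse it for links,
# and feed newly discovered links back into the frontier.
def crawl_cycle(seed_urls, max_pages=10):
    frontier = deque(seed_urls)   # the application's URL queue
    seen = set(seed_urls)
    while frontier and max_pages > 0:
        url = frontier.popleft()  # "application issues request to manager"
        try:
            page = urllib.request.urlopen(url, timeout=10).read()  # "downloader gets file"
        except Exception:
            continue
        max_pages -= 1
        # "application parses new files for hyperlinks"
        for link in re.findall(rb'href="(http[^"]+)"', page):
            link = link.decode("utf-8", "ignore")
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen
```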

5 System Architecture (3): Crawl Manager
- The manager uses Berkeley DB and STL for its external and internal data structures.
- The manager receives requests for URLs from the application; each request is a pointer to a file containing several hundred or thousand URLs, located on a disk accessible via NFS.
- After loading the URLs of a request file, the manager queries the DNS resolvers for the IP addresses of the servers, unless a recent address is already cached.
- The manager performs robot exclusion by generating download requests for the robots.txt files and parsing them.
- Finally, after parsing the robots files and removing excluded URLs, the requested URLs are sent in batches to the downloaders (see the sketch below).
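A rough sketch of the robot-exclusion and batching step, assuming an in-memory per-host robots.txt cache; the function names, agent string, and batch size are illustrative, not taken from the paper.

```python
import urllib.parse
import urllib.robotparser

# Hypothetical sketch: cache one robots.txt parser per host, filter a list
# of URLs through it, and yield the surviving URLs in fixed-size batches.
_robots_cache = {}

def allowed(url, agent="*"):
    host = urllib.parse.urlsplit(url).netloc
    if host not in _robots_cache:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        try:
            rp.read()             # in the real system a downloader fetches this file
        except Exception:
            pass                  # unreachable robots.txt is treated conservatively
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch(agent, url)

def batches_for_downloader(urls, batch_size=100):
    passed = [u for u in urls if allowed(u)]
    for i in range(0, len(passed), batch_size):
        yield passed[i:i + batch_size]
```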

6 System Architecture (4): Downloaders and DNS Resolvers
- The downloader is an optimized HTTP client written in Python (everything else is in C++).
- The downloader component fetches files from the web by opening up to 1000 connections to different servers and polling these connections for arriving data.
- The data is then marshaled into files located in a directory determined by the application and accessible via NFS.
- The DNS resolver uses an asynchronous DNS library, since DNS resolution used to be a significant bottleneck in crawler design due to the synchronous nature of many DNS interfaces.
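A minimal sketch of an asynchronous downloader in the same spirit, written with Python's asyncio and the third-party aiohttp library rather than the paper's custom polling client; the connection limit, timeout, and file-naming scheme are illustrative assumptions.

```python
import asyncio
import hashlib
import aiohttp

# Hypothetical sketch: fetch many URLs concurrently, bound the number of
# open connections, and marshal each response body into a file on disk.
async def download_all(urls, out_dir=".", max_connections=1000):
    connector = aiohttp.TCPConnector(limit=max_connections)
    async with aiohttp.ClientSession(connector=connector) as session:
        async def fetch(url):
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
                    body = await resp.read()
            except Exception:
                return
            name = hashlib.md5(url.encode()).hexdigest()
            with open(f"{out_dir}/{name}.html", "wb") as f:
                f.write(body)
        await asyncio.gather(*(fetch(u) for u in urls))

# Usage (illustrative): asyncio.run(download_all(["http://example.com/"]))
```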

7 System Architecture (5): Crawl Application
- The crawling application is a breadth-first crawl starting out at a set of seed URLs.
- The application does the parsing and handling of URLs (has this page already been downloaded?).
- The downloaded files are then forwarded to a storage manager for compression and storage in a repository, as sketched below.
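A tiny sketch of the hand-off to the storage manager, assuming zlib compression into a single append-only repository file; the record format is invented for illustration and is not the paper's storage layout.

```python
import zlib

# Hypothetical sketch: compress a downloaded page and append it to a
# repository file, prefixing each record with its length for later scanning.
def store_page(repo_path, url, body):
    record = url.encode() + b"\n" + zlib.compress(body)
    with open(repo_path, "ab") as repo:
        repo.write(len(record).to_bytes(8, "big"))
        repo.write(record)
```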

8 System Architecture (6): Scaling the System
- Small system: 3-5 workstations, peaking at a few hundred pages/sec.
- Can scale up by adding downloaders and DNS resolvers.
- At higher page rates, the crawling application becomes the bottleneck.
- At about 8 downloaders, the manager becomes the bottleneck.
- Need to replicate the application and the manager.
- A hash-based technique (as in the Internet Archive crawler) partitions URLs and hosts among the application parts.
- Data transfer is batched and done via the file system (NFS).

9 System Architecture (7): Scaling the System
- A possible scaled-up version of the system uses two crawl managers, with 8 downloaders and 3 DNS resolvers each, and four crawling application components.
- We partition the space of all possible URLs into 4 subsets using a hash function, such that each application component is responsible for processing and requesting one subset.
- If, during parsing, a component encounters a hyperlink belonging to a different subset, that URL is simply forwarded to the appropriate application component (as determined by the hash value), as in the sketch below.
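A minimal sketch of this partitioning, assuming the hash is computed over the hostname (as slide 10 suggests); all names below are illustrative.

```python
import hashlib
from urllib.parse import urlsplit

# Hypothetical sketch: route each URL to one of k application components by
# hashing its hostname, so all URLs of a site stay with one component.
def component_for(url, num_components=4):
    host = urlsplit(url).netloc.lower()
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_components

def route_links(links, my_id, forward, num_components=4):
    """Keep links owned by this component; forward the rest to their owners."""
    mine = []
    for url in links:
        owner = component_for(url, num_components)
        if owner == my_id:
            mine.append(url)
        else:
            forward(owner, url)   # send to the owning application component
    return mine
```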

10 Scaling Up
- About 20 machines, roughly 1500 pages/sec.
- Throughput depends on the crawl strategy.
- URLs are hashed to nodes based on site.

11 Implementation Details and Algorithmic Techniques (1): Application Parsing and Network Performance
- The crawling application parses pages using the Perl Compatible Regular Expressions (PCRE) library.
- The downloaders store the pages on disk via NFS.
- Later, the application reads the files for parsing, and a storage manager copies them to a separate permanent repository, also via NFS; this NFS traffic eventually becomes a bottleneck.
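A simplified sketch of the link-extraction step, using Python's re module in place of PCRE; the regular expression is deliberately naive (double-quoted hrefs only) and is not the paper's actual pattern.

```python
import re
from urllib.parse import urljoin

# Hypothetical sketch: pull href targets out of raw HTML bytes and resolve
# them against the page's base URL before handing them to the URL handler.
HREF_RE = re.compile(rb'href\s*=\s*"([^"#]+)"', re.IGNORECASE)

def extract_links(base_url, html_bytes):
    links = []
    for match in HREF_RE.finditer(html_bytes):
        target = match.group(1).decode("utf-8", "ignore").strip()
        links.append(urljoin(base_url, target))
    return links
```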

12 Implementation Details and Algorithmic Techniques (2): URL Handling
URL-seen problem:
- Need to check whether a URL has been parsed or downloaded before.
- After 20 million pages, we have "seen" over 100 million URLs.
- Each URL is 50 to 75 bytes on average, so a naive representation of the URLs would quickly grow beyond memory size.
Solutions: compress URLs in main memory, or use disk.
- Bloom filter (the crawler of the Internet Archive): gives a very compact representation, but also gives false positives.
- Lossless compression can reduce URL size to below 10 bytes, though this is still too high for large crawls.
- Disk access with caching (Mercator): cache recently seen and frequently encountered URLs, resulting in a cache hit rate of almost 85%.
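A minimal Bloom-filter sketch for the URL-seen test; the bit-array size and hash count below are illustrative, not the Internet Archive crawler's actual parameters.

```python
import hashlib

# Hypothetical sketch: a Bloom filter answers "seen before?" very compactly,
# but can return false positives, so some never-seen URLs get skipped.
class BloomFilter:
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, url):
        # derive several bit positions from salted hashes of the URL
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```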

13 Implementation Details and Algorithmic Techniques (3): URL Handling
We perform the lookups and insertions as a bulk (offline) operation on disk.
Implementation of the URL-seen check:
- While fewer than a few million URLs have been seen, keep them in main memory.
- Then write the URLs to a file in alphabetic, prefix-compressed order.
- Collect new URLs in memory and periodically perform a bulk check by merging the new URLs into the file on disk (see the sketch below).
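A simplified sketch of the bulk merge, assuming the seen-URL file holds one URL per line in sorted order; prefix compression is omitted for brevity, and the function name is illustrative.

```python
import os

# Hypothetical sketch: sort the new URLs, merge them against the sorted
# seen-URL file in a single sequential pass, report the ones not already
# present, and write the merged file back.
def bulk_merge(seen_file, new_urls):
    new_sorted = sorted(set(new_urls))
    unseen = []
    tmp_file = seen_file + ".tmp"
    open(seen_file, "a").close()               # make sure the file exists
    with open(seen_file) as old, open(tmp_file, "w") as out:
        old_iter = iter(old)
        old_url = next(old_iter, None)
        for url in new_sorted:
            # copy smaller old entries through unchanged
            while old_url is not None and old_url.rstrip("\n") < url:
                out.write(old_url)
                old_url = next(old_iter, None)
            if old_url is not None and old_url.rstrip("\n") == url:
                continue                       # already seen
            unseen.append(url)
            out.write(url + "\n")
        while old_url is not None:             # copy the remaining tail
            out.write(old_url)
            old_url = next(old_iter, None)
    os.replace(tmp_file, seen_file)
    return unseen
```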

14 Implementation Details and Algorithmic Techniques (4): URL Handling
- The merge is performed by spawning a separate thread, so that the application can continue parsing new files; the next merge is started only an hour after the previous one has completed.
- Thus, the system gracefully adapts as the merge operations take longer while the structure grows.
- Using this method, lookups and insertions were never a bottleneck.

15 Implementation Details and Algorithmic Techniques (5): Domain-Based Throttling
- Some domains may have a fairly slow network connection but a large number of web servers, and could be impacted by a crawler.
- Another problem we encountered is that larger organizations, such as universities, have intrusion detection systems that may raise an alarm if too many servers on campus are contacted in a short period of time, even if timeouts are observed between accesses to the same server.
- Finally, our crawler enforces timeouts between accesses based on hostname rather than IP address, and does not detect whether web servers are co-located on the same machine.
- We decided to address domain-based throttling in the crawl application, since this seemed to be the easiest way.

16 Implementation Details and Algorithmic Techniques (6): Domain-Based Throttling
- We first note that fetching URLs in the order they were parsed out of the pages is a very bad idea, since there is a lot of second-level domain locality in the links.
- If we "scramble" the URLs into a random order, URLs from each domain will be spread out evenly.
- We can do this in a simple deterministic way with provable load-balancing properties (see the sketch below):
  - Put the hostname of each URL into reverse order (e.g., com.amazon.www) before inserting the URL into the data structures.
  - After checking for duplicates, take the sorted list of new URLs and perform a k-way unshuffle permutation, say for k = 1000, before sending them to the manager.
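A small sketch of the hostname reversal and the k-way unshuffle; the helper names are illustrative.

```python
from urllib.parse import urlsplit

# Hypothetical sketch of the two steps above: reverse the hostname so that
# sorting groups URLs of the same domain together, then apply a k-way
# unshuffle so consecutive output URLs come from different domains.
def reverse_hostname(url):
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.netloc.split(".")))  # www.amazon.com -> com.amazon.www
    return reversed_host + parts.path

def k_way_unshuffle(sorted_urls, k=1000):
    # deal the sorted list into k "piles": every k-th element starting at
    # offset 0, then offset 1, ..., offset k-1
    out = []
    for offset in range(k):
        out.extend(sorted_urls[offset::k])
    return out

# Usage (illustrative): spread = k_way_unshuffle(sorted(urls, key=reverse_hostname), k=1000)
```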