
Design and Implementation of a High-Performance Distributed Web Crawler (Vladislav Shkapenyuk, Torsten Suel). Presented by 문인철, Real-Time Laboratory




1 Design and Implementation of a High-Performance Distributed Web Crawler
Vladislav Shkapenyuk, Torsten Suel
Presented by 문인철, Real-Time Laboratory, moonpfe@realtime.ssu.ac.kr

2 CREST (Center for Real-Time Embedded System Technology), Soongsil Univ., Korea, http://realtime.ssu.ac.kr
Table of Contents
1. Introduction
1.1 Crawling Applications
1.2 Basic Crawler Structure
1.3 Requirements for a Crawler
1.4 Content of this Paper

3 1. Introduction (1/2)
Web search technology
– Crawling strategies, storage, indexing and ranking techniques, and structural analysis of the web and the web graph
Highly efficient crawling systems are needed
– Explosion in the size of the WWW
– Must download the hundreds of millions of web pages indexed by the major search engines
– Trade-offs: size vs. currency, quality vs. response time

4 1. Introduction (2/2)
A crawler for a large search engine has to address two issues:
1. It has to have a good crawling strategy.
2. It needs a highly optimized system architecture that can download a large number of pages per second, e.g. the Mercator system of AltaVista.
In this paper:
– We describe the design and implementation of an optimized crawling system on a network of workstations, performing a breadth-first crawl.
– We are interested in the I/O and network efficiency aspects of the system and in scalability issues.

5 1.1 Crawling Applications (1/2)
Crawling strategies:
Breadth-First Crawler
– Starts out at a small set of pages and then explores other pages by following links in a "breadth-first-like" fashion.
Recrawling Pages for Updates
– After pages are initially acquired, they may have to be periodically recrawled and checked for updates.
– Heuristics: recrawl important pages, sites, or domains more frequently.
Focused Crawling
– Focuses only on certain types of pages, e.g. pages on a particular topic, images, or MP3 files.
– The goal of a focused crawler is to find many pages of interest without using a lot of bandwidth.
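The breadth-first strategy above can be sketched with a FIFO frontier queue and a set of already-seen URLs. This is a minimal illustration, not the paper's implementation; `fetch_links` is a hypothetical stand-in for downloading a page and extracting its out-links.

```python
from collections import deque

def breadth_first_crawl(seeds, fetch_links, max_pages=100):
    """Explore outward from a seed set, following links in FIFO order.

    fetch_links(url) stands in for an HTTP fetch plus link extraction;
    a real crawler would download the page here.
    """
    frontier = deque(seeds)   # FIFO queue gives breadth-first order
    seen = set(seeds)         # avoid re-enqueueing known URLs
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled
```

On a toy link graph, pages one link away from the seed are visited before pages two links away, which is the "breadth first-like" order the slide describes.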

6 1.1 Crawling Applications (2/2)
Random Walking and Sampling
– Uses random walks on the web graph to sample pages or to estimate the size and quality of search engines.
Crawling the "Hidden Web"
– The Hidden Web consists of dynamic pages that can only be retrieved by posting appropriate queries and/or filling out forms on web pages.
– The goal is automatic access to the Hidden Web.
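A random walk on the web graph can be sketched as repeatedly moving to a uniformly chosen out-link. This is only an illustration of the idea; the restart-at-start rule for dead ends and the `graph` dictionary are assumptions, not details from the paper.

```python
import random

def random_walk_sample(graph, start, steps, rng=None):
    """Sample pages by a random walk on a link graph.

    graph maps a page to its list of out-links. At each step we move to
    a uniformly chosen out-link; on a dead end we restart at `start`
    (one simple convention among several).
    """
    rng = rng or random.Random()
    visited = [start]
    current = start
    for _ in range(steps):
        links = graph.get(current, [])
        current = rng.choice(links) if links else start
        visited.append(current)
    return visited
```

The multiset of visited pages then serves as the sample used to estimate properties such as search engine coverage.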

7 1.2 Basic Crawler Structure (1/2)
[Figure: basic crawler architecture]

8 1.2 Basic Crawler Structure (2/2)
Two main components of the crawler:
Crawling application
– Decides which pages to request next, given the current state and the previously crawled pages, and issues a stream of requests (URLs) to the crawling system.
– Implements the crawling strategy.
Crawling system
– Downloads the requested pages and supplies them to the crawling application for analysis and storage.
– Handles robot exclusion, speed control, and DNS resolution.
Both the crawling system and the application can be replicated for higher performance.
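The two-component split can be sketched as two small classes: the application chooses what to request next, the system performs the downloads. Class and method names here are hypothetical, and the injected `fetch` callable replaces the real HTTP, DNS, and robot-exclusion machinery.

```python
class CrawlingSystem:
    """Downloads requested pages. A real system would also handle
    robot exclusion, speed control, and DNS resolution here."""

    def __init__(self, fetch):
        self.fetch = fetch  # injected downloader, e.g. an HTTP client

    def download(self, url):
        # Returns (page content, extracted out-links).
        return self.fetch(url)


class CrawlingApplication:
    """Decides which URL to request next (the crawling strategy)."""

    def __init__(self, system, seeds):
        self.system = system
        self.frontier = list(seeds)
        self.seen = set(seeds)
        self.pages = {}

    def run(self, limit=10):
        while self.frontier and len(self.pages) < limit:
            url = self.frontier.pop(0)  # FIFO: breadth-first strategy
            content, links = self.system.download(url)
            self.pages[url] = content   # hand off for analysis/storage
            for link in links:
                if link not in self.seen:
                    self.seen.add(link)
                    self.frontier.append(link)
        return self.pages
```

Because the two sides communicate only through a stream of URL requests and downloaded pages, either side can be replicated independently, which is the point the slide makes about higher performance.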

9 1.3 Requirements for a Crawler (1/2)
Flexibility
– Use the system in a variety of scenarios, with as few modifications as possible.
Low Cost and High Performance
– Scale to several hundred pages per second and hundreds of millions of pages per run, while running on low-cost hardware.
Robustness
– Tolerate bad HTML, strange server behavior, and odd configurations.
– Tolerate crashes and network interruptions without losing the data already crawled.

10 1.3 Requirements for a Crawler (2/2)
Etiquette and Speed Control
– Follow robot exclusion (robots.txt and robots meta tags).
– Avoid putting too much load on a single server, e.g. a 30-second interval between requests.
– Throttle the speed at the domain level.
Manageability and Reconfigurability
– An appropriate interface is needed to monitor the crawl.
– The administrator should be able to control the crawl: adjust speed, add and remove components, shut down the system.
– After a crash or shutdown, we may want to continue the crawl using a different machine configuration.
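The per-server speed control above can be sketched as a scheduler that tracks the last request time per host and refuses to fetch again until a minimum interval has passed. This is a hedged illustration, not the paper's mechanism; the 30-second default mirrors the figure on the slide, and a real crawler would also consult robots.txt before fetching at all.

```python
import time
from urllib.parse import urlparse

class PolitenessScheduler:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval=30.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock        # injectable clock, useful for testing
        self.last_request = {}    # host -> time of last request

    def ready(self, url):
        """True if the url's host may be contacted again now."""
        host = urlparse(url).netloc
        last = self.last_request.get(host)
        return last is None or self.clock() - last >= self.min_interval

    def record(self, url):
        """Note that a request to the url's host was just issued."""
        self.last_request[urlparse(url).netloc] = self.clock()
```

The crawler would call `ready` before each fetch and `record` after it; URLs whose host is not yet ready stay in the frontier, which throttles each server without stalling the crawl as a whole.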

11 1.4 Content of this Paper
Section 2 describes the architecture of our system and its major components.
Section 3 describes the data structures and algorithmic techniques that were used in more detail.
Section 4 presents preliminary experimental results.
Section 5 compares our design to that of other systems we know of.
Section 6 offers some concluding remarks.

