1 A Brief Look at Web Crawlers Bin Tan 03/15/07

2 Web Crawlers
“… is a program or automated script which browses the World Wide Web in a methodical, automated manner”
Uses:
- Creating an archive / index from the visited web pages to support offline browsing / search / mining
- Automating maintenance tasks on a website
- Harvesting specific information from web pages

3 High-level architecture
(Diagram; labeled components: Seeds, Frontier)
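The seeds-and-frontier loop can be sketched in a few lines. This is a minimal illustration, not the architecture from the slide's diagram: the function name `crawl`, the caller-supplied `fetch_links` callback, and the toy link graph with its example URLs are all assumptions made for the sketch, and no network access is involved.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    frontier = deque(seeds)   # URLs waiting to be downloaded
    seen = set(seeds)         # URLs already queued, so none is queued twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        # fetch_links stands in for "download the page and parse its links".
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Toy link graph standing in for the web (hypothetical URLs).
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}

order = crawl(["http://a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

Replacing the FIFO queue with a stack or a priority queue changes the visiting order without touching the rest of the loop.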

4 How easy is it to write a program to crawl all web pages?

5 All sorts of real problems:
- Managing multiple download threads is nontrivial
- If you make requests to a server at short intervals, you’ll overload it
- Pages may be missing; servers may be down or sluggish
- You may be trapped in dynamically generated pages
- Web pages may use ill-formed HTML

6 This is only a small-scale crawl… (Shkapenyuk and Suel, 2002): "While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability."

7 Data characteristics in large-scale crawls
Large volume, fast changes, dynamic page generation: a wide selection of possibly crawlable URLs
Edwards et al.: "Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained."

8 Selection policy: which page to download
Need to prioritize according to some page importance metric:
- Depth-first
- Breadth-first
- Partial PageRank calculation
- OPIC (On-line Page Importance Computation)
- Length of per-site queues
- In focused crawling, prediction of similarity between page text and query
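As one illustration of importance-driven selection, here is a heavily simplified sketch in the spirit of OPIC: each page holds some "cash", the crawler always picks the uncrawled page with the most cash, and a crawled page splits its cash equally among the pages it links to. The function name, the toy graph, and the crawl-each-page-once simplification are assumptions of this sketch; the published algorithm also keeps a credit history and allows recrawls.

```python
def opic_crawl_order(graph, seeds, max_pages):
    """Greedy crawl order driven by OPIC-style cash (simplified sketch)."""
    cash = {url: 1.0 / len(seeds) for url in seeds}  # initial cash on seeds
    crawled = []
    done = set()
    while len(crawled) < max_pages:
        pending = [u for u in cash if u not in done]
        if not pending:
            break
        # Pick the richest known page; on ties the earliest discovered wins.
        url = max(pending, key=lambda u: cash[u])
        crawled.append(url)
        done.add(url)
        links = graph.get(url, [])
        if links:
            share = cash[url] / len(links)
            for link in links:
                cash[link] = cash.get(link, 0.0) + share
        cash[url] = 0.0  # the page has handed out all of its cash
    return crawled


# Hypothetical link graph: page "d" is linked from a, b and c.
GRAPH = {"a": ["b", "c", "d"], "b": ["d"], "c": ["d"], "d": []}

opic_order = opic_crawl_order(GRAPH, ["a"], max_pages=10)
print(opic_order)
```

On this graph a plain breadth-first crawl would visit `c` before `d`; the cash-based order reaches `d` earlier because two crawled pages have already paid into it.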

9 Revisit policy: when to check for changes to the pages
Pages are frequently updated, created or deleted
Cost functions to optimize:
- Freshness (0 for stale pages, 1 for fresh pages): to be maximized
- Age (amount of time for which a page has been stale): to be minimized
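The two measures can be made concrete for a single page. The sketch below assumes we know when the local copy was crawled and when the live page changed; the function name and the numeric timestamps are hypothetical.

```python
def freshness_and_age(crawl_time, change_times, now):
    """Return (freshness, age) of a local copy at time `now`.

    freshness is 1 if no change happened after the crawl, else 0.
    age is 0 for a fresh copy, else the time since the first missed change.
    """
    missed = [t for t in change_times if crawl_time < t <= now]
    if not missed:
        return 1, 0.0
    return 0, now - min(missed)


# The page changed at t=5, 12 and 20; the copy was crawled at t=10.
stale = freshness_and_age(crawl_time=10, change_times=[5, 12, 20], now=25)
# No change happened after the crawl at t=10.
fresh = freshness_and_age(crawl_time=10, change_times=[5], now=25)
print(stale, fresh)
```

A revisit policy is then judged by the average of these values over all pages and over time.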

10 Revisit Policy (cont.)
- Uniform policy: revisiting all pages in the collection with the same frequency
- Proportional policy: revisiting more often the pages that change more frequently
- The optimal method for keeping average freshness high includes ignoring the pages that change too often; the optimal method for keeping average age low is to use access frequencies that increase monotonically (and sub-linearly) with the rate of change of each page
- Numerical methods are used for the calculation, based on the distribution of page changes
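A small numeric comparison of the uniform and proportional policies, under an assumption the slide does not state: each page changes as a Poisson process with rate λ and is revisited at regular intervals with frequency f. In that model the time-averaged freshness of a page is (1 - e^(-λ/f)) / (λ/f). The rates and the visit budget below are made-up values.

```python
import math


def expected_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page under the Poisson assumption."""
    if visit_rate <= 0:
        return 0.0
    r = change_rate / visit_rate
    if r == 0:
        return 1.0
    return (1 - math.exp(-r)) / r


def average_freshness(change_rates, visit_rates):
    pairs = zip(change_rates, visit_rates)
    return sum(expected_freshness(c, v) for c, v in pairs) / len(change_rates)


def uniform_policy(change_rates, budget):
    """Same visit frequency for every page."""
    return [budget / len(change_rates)] * len(change_rates)


def proportional_policy(change_rates, budget):
    """Visit frequency proportional to each page's change rate."""
    total = sum(change_rates)
    return [budget * c / total for c in change_rates]


rates = [1.0, 9.0]   # changes per day: one slow page, one fast page
budget = 10.0        # total visits per day across both pages

uniform = average_freshness(rates, uniform_policy(rates, budget))
proportional = average_freshness(rates, proportional_policy(rates, budget))
print(round(uniform, 3), round(proportional, 3))
```

With these numbers the uniform policy comes out ahead: the proportional policy spends most of the budget on the fast page, which goes stale again almost immediately.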

11 Politeness policy: how to avoid overloading websites
- Badly-behaved crawlers can be a nuisance
- Robots exclusion protocol (robots.txt)
- Interval/delay between connections (10 sec – 5 min):
  - fixed
  - proportional to page downloading time
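Both mechanisms can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, the user agent name `examplebot`, and the factor of 10 in the proportional delay are assumptions for illustration; the 10 s and 300 s bounds mirror the range on the slide.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory (no network access).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def make_parser(robots_txt):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def politeness_delay(parser, user_agent, download_seconds,
                     factor=10.0, minimum=10.0, maximum=300.0):
    """Seconds to wait before the next request to the same host."""
    declared = parser.crawl_delay(user_agent)
    if declared is not None:
        return float(declared)  # fixed delay requested by the site
    # Otherwise wait in proportion to how long the last download took,
    # clamped to the 10 s .. 5 min range.
    return min(max(factor * download_seconds, minimum), maximum)


robots = make_parser(ROBOTS_TXT)
print(robots.can_fetch("examplebot", "http://example.com/private/data.html"))
print(robots.can_fetch("examplebot", "http://example.com/index.html"))
print(politeness_delay(robots, "examplebot", download_seconds=2.0))
```

A slow server leads to long download times and therefore long delays, so the proportional variant backs off automatically when a host is struggling.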

12 Parallelization policy: how to coordinate distributed web crawlers Nutch: "A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages"

13 Crawling the deep web
- Many web spiders run by popular search engines ignore URLs with a query string
- Google’s Sitemap protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling
- Also: mod-oai is an Apache module that allows web crawlers to efficiently discover new, modified, and deleted web resources from a web server by using OAI-PMH, a protocol which is widely used in the digital libraries community
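On the crawler side, reading a Sitemap file amounts to parsing a small XML document. The sketch below uses the standard library and the sitemaps.org 0.9 namespace; the sample document and its URLs are made up, and the second entry shows a query-string URL that a crawler might otherwise skip.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; a real crawler would download this file.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2007-03-01</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12</loc>
  </url>
</urlset>
"""


def sitemap_entries(xml_text):
    """Return a list of (url, lastmod or None) pairs from a sitemap."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc, lastmod))
    return entries


entries = sitemap_entries(SAMPLE_SITEMAP)
print(entries)
```

The optional `lastmod` value can feed a revisit policy, since it tells the crawler which pages changed without downloading them.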

14 Example Web Crawler Software
- wget
- Heritrix
- Nutch
- others

15 Wget
- Command-line tool, non-extensible
- Config: recursive downloading
- Config: spanning hosts
- Breadth-first for HTTP, depth-first for FTP
- Config: include/exclude filters
- Updates outdated pages based on timestamps
- Supports robots.txt protocol
- Config: connection delay
- Single-threaded

16 Heritrix
Heritrix is the Internet Archive’s web crawler, designed specifically for web archiving
License: LGPL
Written in Java


18 Features
- Highly modular; easily extensible
- Scales to large data volume
- Implemented selection policies:
  - Breadth-first with options to throttle activity against particular hosts and to bias towards finishing hosts in progress or cycling among all hosts with pending URLs
  - Domain sensitive: allows specifying an upper bound on the number of pages downloaded per site
  - Adaptive revisiting: repeatedly visit all encountered URLs (wait time between visits configurable)
- Implements fixed / proportional connection delay
- Detailed documentation
- Web-based UI for crawler administration


20 Nutch
Nutch is an effort to build an open source search engine, based on Lucene for the search and index component.
License: Apache 2.0
Written in Java

21 Features
- Modular; extensible
- Breadth-first
- Includes parsing and indexing components
- Implements a MapReduce facility and a distributed file system (Hadoop)

22 Recrawl command lines
# The generate/fetch/update cycle
for ((i=1; i <= depth; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done

23 Appendix: Parsers
HTML:
- lynx -dump
- Beautiful Soup (Python)
- tidylib (C)
PDF:
- xpdf
Others:
- Nutch plugins
- Office API (Windows)
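For a crawler, the key parsing task is pulling links out of HTML that may be ill-formed. The sketch below uses Python's standard `html.parser` as a dependency-free stand-in for the tools listed above (it is not Beautiful Soup); the class name, function name, and sample markup are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html_text):
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


# Sloppy markup: unclosed <a> and <p> tags, an unquoted attribute value.
SLOPPY_HTML = '<html><body><a href="/a">A<p><a href=b.html>B</body>'

links = extract_links("http://example.com/dir/page.html", SLOPPY_HTML)
print(links)
```

The parser reports start tags as it meets them and does not require the document to be well-formed, so the unclosed tags do not stop link extraction.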
