A Brief Look at Web Crawlers Bin Tan 03/15/07
Web Crawlers: a web crawler “… is a program or automated script which browses the World Wide Web in a methodical, automated manner”. Uses: creating an archive / index from the visited web pages to support offline browsing / search / mining; automating maintenance tasks on a website; harvesting specific information from web pages.
High-level architecture [diagram; recoverable labels: Seeds, Frontier]
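The relationship between seeds and frontier can be sketched in a few lines. This is a minimal illustration, not the architecture from the slide's diagram: the function `crawl`, the injected `fetch_links` callback and the toy link graph `TOY_WEB` are all hypothetical names, and a real crawler would download pages over HTTP instead of reading a dictionary.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    frontier = deque(seeds)   # URLs waiting to be downloaded
    seen = set(seeds)         # URLs already queued, to avoid duplicates
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        # "Download" the page and extract its out-links.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy in-memory "web" standing in for real HTTP fetching (hypothetical).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

def toy_fetch_links(url):
    return TOY_WEB.get(url, [])

if __name__ == "__main__":
    print(crawl(["a"], toy_fetch_links))
```

The seeds initialize the frontier, and every newly discovered link is appended to it; swapping the `deque` for a stack or a priority queue changes the crawl order without touching the rest of the loop.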
How easy is it to write a program to crawl all uiuc.edu web pages?
All sorts of real problems: managing multiple download threads is nontrivial; if you make requests to a server at short intervals, you’ll overload it; pages may be missing; servers may be down or sluggish; you may get trapped in dynamically generated pages; web pages may use ill-formed HTML.
This is only a small-scale crawl… (Shkapenyuk and Suel, 2002): "While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability."
Data characteristics in large-scale crawls: large volume, fast changes and dynamic page generation produce a wide selection of possibly crawlable URLs. Edwards et al.: "Given that the bandwidth for conducting crawls is neither infinite nor free it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained."
Selection policy: which page to download. Need to prioritize according to some page importance metric: depth-first; breadth-first; partial PageRank calculation; OPIC (On-line Page Importance Computation); length of per-site queues; in focused crawling, prediction of similarity between page text and query.
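Whatever importance metric is chosen, the frontier has to hand out the most important URL first. The sketch below shows one way to do that with a heap; the class `PriorityFrontier`, the helper `order_by_score` and the scores themselves are hypothetical, and the score stands in for any of the metrics listed above (a partial PageRank value, an OPIC score, a similarity estimate).

```python
import heapq
import itertools

class PriorityFrontier:
    """Frontier that always yields the URL with the highest score first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: insertion order

    def push(self, url, score):
        # heapq is a min-heap, so negate the score to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

def order_by_score(scored_urls):
    """Return the URLs in the order the frontier would release them."""
    frontier = PriorityFrontier()
    for url, score in scored_urls:
        frontier.push(url, score)
    return [frontier.pop() for _ in range(len(frontier))]

if __name__ == "__main__":
    print(order_by_score([("x", 0.1), ("y", 0.9), ("z", 0.5)]))
```

Breadth-first and depth-first fall out as special cases if the score is the negated or the plain discovery depth.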
Revisit policy: when to check for changes to the pages. Pages are frequently updated, created or deleted. Objective functions: freshness (0 for stale pages, 1 for fresh pages), to be maximized; age (amount of time for which a page has been stale), to be minimized.
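The two measures can be written down directly. In this sketch the parameter `changed_at` is an assumption made for illustration: it is the time of the first change to the live page since the local copy was crawled, or `None` if the page has not changed; the function names are likewise hypothetical.

```python
def freshness(changed_at, now):
    """1 if the local copy still matches the live page at time `now`, else 0."""
    return 1 if changed_at is None or changed_at > now else 0

def age(changed_at, now):
    """How long the local copy has been stale at time `now` (0 while fresh)."""
    if freshness(changed_at, now) == 1:
        return 0
    return now - changed_at

def average(values):
    """Collection-level measure: the mean over all pages."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0

if __name__ == "__main__":
    # Three pages observed at time 10: unchanged, changed at 4, changed at 8.
    changes = [None, 4, 8]
    print(average(freshness(c, 10) for c in changes))
    print(average(age(c, 10) for c in changes))
```

Freshness only records whether a copy is stale, while age also records for how long, which is why the two measures can favor different revisit schedules.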
Revisit Policy (cont.): Uniform policy: revisiting all pages in the collection with the same frequency. Proportional policy: revisiting more often the pages that change more frequently. The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal method for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page. Numerical methods are used for the calculation, based on the distribution of page changes.
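A small calculation gives a feel for the two policies. The sketch below assumes, as is common in the literature but not stated on the slide, that each page changes according to a Poisson process with rate λ and is revisited at regular intervals; under that assumption the time-averaged freshness of a page revisited f times per unit time is (1 − e^(−λ/f)) / (λ/f). The function names, the budget and the example rates are made up for illustration.

```python
import math

def expected_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page (Poisson changes, regular visits)."""
    if visit_rate == 0:
        return 0.0            # never revisited: assumed stale in the long run
    r = change_rate / visit_rate
    if r == 0:
        return 1.0            # page never changes
    return (1 - math.exp(-r)) / r

def uniform_policy(change_rates, budget):
    """Same visit rate for every page."""
    n = len(change_rates)
    return [budget / n] * n

def proportional_policy(change_rates, budget):
    """Visit rate proportional to each page's change rate."""
    total = sum(change_rates)
    return [budget * lam / total for lam in change_rates]

def average_freshness(change_rates, visit_rates):
    pairs = zip(change_rates, visit_rates)
    return sum(expected_freshness(lam, f) for lam, f in pairs) / len(change_rates)

if __name__ == "__main__":
    rates = [1.0, 9.0]   # changes per unit time (hypothetical)
    budget = 2.0         # total visits per unit time (hypothetical)
    print(average_freshness(rates, uniform_policy(rates, budget)))
    print(average_freshness(rates, proportional_policy(rates, budget)))
```

With these numbers the model gives an average freshness of about 0.37 for the uniform policy and about 0.20 for the proportional one, in line with the point that chasing rapidly changing pages does not pay off for freshness.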
Politeness policy: how to avoid overloading websites. Badly behaved crawlers can be a nuisance. Robots exclusion protocol (robots.txt). Interval/delay between connections (10 sec – 5 min): fixed, or proportional to page downloading time.
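Both politeness mechanisms are easy to prototype. The sketch below uses Python's standard `urllib.robotparser` on an in-memory robots.txt, so no network access is needed; the sample rules, the user-agent name `examplebot`, and the `polite_delay` helper with its factor of 5 are assumptions for illustration. Only the 10 sec to 5 min range comes from the slide.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

def make_parser(robots_text):
    """Build a parser from robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def polite_delay(download_seconds, factor=5.0, min_delay=10.0, max_delay=300.0):
    """Delay proportional to the last download time, clamped to [min, max]."""
    return max(min_delay, min(max_delay, factor * download_seconds))

if __name__ == "__main__":
    robots = make_parser(ROBOTS_TXT)
    print(robots.can_fetch("examplebot", "http://example.com/index.html"))
    print(robots.can_fetch("examplebot", "http://example.com/private/a.html"))
    print(robots.crawl_delay("examplebot"))
    print(polite_delay(30.0))
```

A proportional delay backs off automatically from slow servers, whereas a fixed delay treats a sluggish host and a fast one the same.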
Parallelization policy: how to coordinate distributed web crawlers. Nutch: "A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages."
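The slide does not say how the coordination is done. One common approach, shown here purely as an illustrative assumption, is to assign every URL to a crawler process by hashing its host name, so that all pages of a site go to the same process and per-host politeness can be enforced locally. The function name `assign_crawler` and the example URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index in [0, num_crawlers) based on its host."""
    host = urlsplit(url).hostname or ""
    # A stable hash (unlike the built-in hash()) gives the same answer
    # in every process and on every run.
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers

if __name__ == "__main__":
    for u in ["http://www.uiuc.edu/a.html",
              "http://www.uiuc.edu/b/c.html",
              "http://example.org/"]:
        print(u, "->", assign_crawler(u, 4))
```

The drawback of plain modulo hashing is that changing the number of crawlers reassigns most hosts.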
Crawling the deep web: Many web spiders run by popular search engines ignore URLs with a query string. Google’s Sitemap protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. Also: mod-oai is an Apache module that allows web crawlers to efficiently discover new, modified, and deleted web resources from a web server by using OAI-PMH, a protocol which is widely used in the digital libraries community.
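From the crawler's side, a sitemap is just an XML list of URLs to add to the frontier. The sketch below parses one with the standard library; the sample document and the function name `sitemap_urls` are made up for illustration, while the namespace and the `urlset` / `url` / `loc` / `lastmod` element names are those of the Sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap; note the second URL carries a query string.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2007-03-01</lastmod>
  </url>
  <url>
    <loc>http://www.example.com/catalog?item=12</loc>
  </url>
</urlset>
"""

def sitemap_urls(xml_text):
    """Return (url, lastmod) pairs; lastmod is None when not given."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        if loc is None:
            continue  # skip malformed entries without a location
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        entries.append((loc.strip(), lastmod))
    return entries

if __name__ == "__main__":
    for loc, lastmod in sitemap_urls(SAMPLE_SITEMAP):
        print(loc, lastmod)
```

The optional `lastmod` value can also feed the revisit policy, since it tells the crawler which pages claim to have changed.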
Example Web Crawler Software: Wget, Heritrix, Nutch, others
Wget: command-line tool, non-extensible; config: recursive downloading; config: spanning hosts; breadth-first for HTTP, depth-first for FTP; config: include/exclude filters; updates outdated pages based on timestamps; supports the robots.txt protocol; config: connection delay; single-threaded.
Heritrix: Heritrix is the Internet Archive’s web crawler, which was specially designed for web archiving. License: LGPL. Written in Java.
Features: highly modular; easily extensible; scales to large data volume. Implemented selection policies: breadth-first, with options to throttle activity against particular hosts and to bias towards finishing hosts in progress or cycling among all hosts with pending URLs; domain sensitive: allows specifying an upper bound on the number of pages downloaded per site; adaptive revisiting: repeatedly visit all encountered URLs (wait time between visits configurable). Implements fixed / proportional connection delay. Detailed documentation. Web-based UI for crawler administration.
Nutch: Nutch is an effort to build an open source search engine based on Lucene for the search and index component. License: Apache 2.0. Written in Java.
Features: modular; extensible; breadth-first; includes parsing and indexing components; implements a MapReduce facility and a distributed file system (Hadoop).
Recrawl command lines
# The generate/fetch/update cycle
for ((i=1; i <= depth; i++))
do
  bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
  segment=`ls -d $segments_dir/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $webdb_dir $segment
done