Web Crawling and Automatic Discovery
Donna Bergmark
March 14, 2002
Web Resource Discovery
Surfing: serendipity
Search: specific information
[Diagram: a crawler collects page text for keyword indexing; an inverted keyword list supports page lookup]
Hence, crawlers are needed for discovery of Web resources
Definition
Spider = robot = crawler
Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.
Some History
First crawlers appeared in 1994. Why? Web growth:
–April 1993: 62 registered web servers
–In 1994, Web (HTTP) traffic grew 15x faster than the Internet itself
Lycos was announced as a search engine in 1994.
So, why not write a robot?
You'd think a crawler would be easy to write:
–Pick up the next URL
–Connect to the server
–GET the URL
–When the page arrives, extract its links (optionally do other stuff)
–REPEAT
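In fact, that naive loop fits in a page of code. Here is a minimal sketch in Java (the seed URL is a placeholder, and the regex-based link extraction is a deliberate simplification; a real crawler would use an HTML parser and handle network errors):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NaiveCrawler {
        // Crude link extractor: only absolute http links in href attributes.
        private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

        public static void main(String[] args) throws Exception {
            Queue<String> frontier = new ArrayDeque<>(); // URLs waiting to be fetched
            Set<String> seen = new HashSet<>();          // avoid refetching a URL
            frontier.add("http://example.org/");         // placeholder seed URL
            HttpClient client = HttpClient.newHttpClient();

            while (!frontier.isEmpty()) {
                String url = frontier.poll();            // 1. pick up the next URL
                if (!seen.add(url)) continue;
                HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
                HttpResponse<String> resp =              // 2-3. connect and GET the URL
                    client.send(req, HttpResponse.BodyHandlers.ofString());
                Matcher m = HREF.matcher(resp.body());   // 4. get the page's links
                while (m.find()) frontier.add(m.group(1));
            }                                            // 5. REPEAT
        }
    }

As the rest of the talk shows, every step of this loop hides a real issue.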
Crawler Issues
The URL itself
Politeness
Visit order
Robot traps
The hidden Web
System considerations
Standard for Robot Exclusion
Martijn Koster (1994)
A robots.txt file maintained by the webmaster
Forbids access to pages and directories
Commonly excluded: /cgi-bin/
Adherence is voluntary for the crawler
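For illustration, a robots.txt that excludes the commonly blocked /cgi-bin/ directory might look like this (the /private/ path is a made-up example):

    User-agent: *          # applies to all crawlers
    Disallow: /cgi-bin/    # keep robots out of CGI scripts
    Disallow: /private/    # hypothetical directory, for illustration

A polite crawler fetches this file before anything else on the host and skips the forbidden paths.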
The Four Laws of Web Robotics
1. A crawler must identify itself
2. A crawler must obey robots.txt
3. A crawler must not hog resources
4. A crawler must report errors
Visit Order
The frontier: URLs that have been seen but not yet visited
Breadth-first: FIFO queue
Depth-first: LIFO queue
Best-first: priority queue
Random
Refresh rate
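The choice of frontier data structure is what determines the visit order, as this small Java sketch shows (the score() function is a made-up stand-in for a real relevance estimate):

    import java.util.ArrayDeque;
    import java.util.Comparator;
    import java.util.Deque;
    import java.util.PriorityQueue;

    public class FrontierDemo {
        // Stand-in relevance estimate; a real best-first crawler would
        // score URLs by anchor text, source-page quality, etc.
        static double score(String url) { return url.length(); }

        public static void main(String[] args) {
            Deque<String> frontier = new ArrayDeque<>();

            // Breadth-first: enqueue at the tail, dequeue from the head (FIFO).
            frontier.addLast("http://example.org/a");
            String next = frontier.pollFirst();

            // Depth-first: push and pop at the head (LIFO).
            frontier.addFirst("http://example.org/b");
            next = frontier.pollFirst();

            // Best-first: a priority queue ordered by estimated relevance,
            // highest score first.
            PriorityQueue<String> best = new PriorityQueue<>(
                Comparator.comparingDouble((String u) -> -score(u)));
            best.add("http://example.org/c");
            next = best.poll();
            System.out.println(next);
        }
    }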
Robot Traps
Cycles in the Web graph
Infinite links on a page
Traps set out by the webmaster
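One common defense, sketched below, is to refuse URLs that look like trap output: remember what has been visited, cap the URL path depth, and cap pages per host. Both thresholds here are arbitrary choices for the example, not taken from any particular crawler:

    import java.net.URI;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class TrapGuard {
        private final Set<String> visited = new HashSet<>();   // breaks cycles
        private final Map<String, Integer> perHost = new HashMap<>();
        private static final int MAX_PER_HOST = 1000;  // arbitrary pages-per-host cap
        private static final int MAX_PATH_DEPTH = 15;  // arbitrary URL depth cap

        // Returns true if this URL is safe to put on the frontier.
        public boolean admit(String url) {
            if (!visited.add(url)) return false;       // already seen: a cycle
            URI u = URI.create(url);
            if (u.getPath() != null
                    && u.getPath().split("/").length > MAX_PATH_DEPTH)
                return false;                          // suspiciously deep path
            int n = perHost.merge(u.getHost(), 1, Integer::sum);
            return n <= MAX_PER_HOST;                  // host producing too many URLs
        }
    }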
The Hidden Web
Dynamic pages are increasing
Subscription pages
Username-and-password pages
Research is in progress on how crawlers can "get into" the hidden Web
System Issues
Crawlers are complicated systems
Efficiency is of utmost importance
Crawlers are demanding of system and network resources
Mercator - 1
Written in Java
One file configures a crawl:
–How many threads
–What analyzers to use
–What filters to use
–How to place links on the frontier
–How long to run
Mercator - 2
Tell it what seed URL[s] to start with
Can add your own code:
–Extend one or more of Mercator's base classes
–Add totally new classes called by your own
Is very efficient in its memory usage:
–URLs are hashed
–Documents are fingerprinted
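Fingerprinting lets a crawler detect duplicate documents without storing their full text. Mercator's actual scheme is not reproduced here; the sketch below uses an MD5 digest purely to illustrate the idea of a content-seen test:

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HashSet;
    import java.util.Set;

    public class ContentSeenTest {
        private final Set<String> fingerprints = new HashSet<>();

        // Returns true if this page body has been seen before (a duplicate),
        // storing only a short digest rather than the whole document.
        public boolean isDuplicate(String body) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(body.getBytes(StandardCharsets.UTF_8));
            String fp = new BigInteger(1, digest).toString(16);
            return !fingerprints.add(fp);
        }
    }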
Mercator - 3
An industrial-strength crawler:
–Multi-threaded for parallel crawls
–Polite: one thread per server
–Implements its own host lookup
–Uses its own DNS resolver
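The one-thread-per-server policy can be approximated with one queue per host, as in this sketch (class and method names are invented for the example):

    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class PoliteFrontier {
        // One FIFO queue per host; each host's queue is drained by a single
        // worker at a time, so no server ever sees concurrent requests.
        private final Map<String, Queue<String>> byHost = new ConcurrentHashMap<>();

        public void add(String host, String url) {
            byHost.computeIfAbsent(host, h -> new ConcurrentLinkedQueue<>()).add(url);
        }

        // Hand an entire host queue to one worker thread; other hosts can
        // still be crawled in parallel by other threads.
        public Queue<String> claimHost(String host) {
            return byHost.remove(host);
        }
    }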
The Web as a Graph
Crawling is meant to traverse the Web
Remove some edges to create a tree:
–I.e., do not revisit URLs
You can only crawl forwards:
–I.e., back-links must be made explicit
PageRank scores pages by this link structure (see the sketch below)
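To illustrate how link structure yields a ranking, here is a toy PageRank power iteration in Java (the three-page graph is made up; 0.85 is the conventional damping factor):

    import java.util.Arrays;

    public class PageRankDemo {
        public static void main(String[] args) {
            // Toy Web graph: adj[i] lists the pages that page i links to.
            int[][] adj = { {1, 2}, {2}, {0} };
            int n = adj.length;
            double d = 0.85;                       // conventional damping factor
            double[] rank = new double[n];
            Arrays.fill(rank, 1.0 / n);            // start with uniform rank

            for (int iter = 0; iter < 50; iter++) {
                double[] next = new double[n];
                Arrays.fill(next, (1 - d) / n);    // random-jump term
                for (int i = 0; i < n; i++)
                    for (int j : adj[i])           // share rank along out-links
                        next[j] += d * rank[i] / adj[i].length;
                rank = next;
            }
            System.out.println(Arrays.toString(rank));
        }
    }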
The Web is a BIG Graph
The "diameter" of the Web is large
Even the static part cannot be crawled completely
New technology: the focused crawl
Conclusion
Clearly, crawling is not simple
It was a hot research topic of the late 1990s
Good technologies resulted from it
Focused crawling is where crawling is going next (a hot topic of the early 2000s)