
1  Web Crawling and Automatic Discovery
Donna Bergmark, Cornell Information Systems
bergmark@cs.cornell.edu
CS502 Web Information Systems, March 26, 2003

2  Web Resource Discovery
Finding information on the Web:
- Surfing (random strategy; the goal is serendipity)
- Searching (inverted indices; specific information)
- Crawling (follow links; "all" the information)
Uses for crawling:
- Find stuff
- Gather stuff
- Check stuff

3  Definition
Spider = robot = crawler. Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.

4  Crawlers and Internet History
1991: HTTP
1992: 26 servers
1993: 60+ servers; self-registration; Archie
1994 (early): first crawlers
1996: search engines abound
1998: focused crawling
1999: web graph studies
2002: use for digital libraries

5  So, Why Not Write a Robot?
You'd think a crawler would be easy to write:
1. Pick up the next URL
2. Connect to the server
3. GET the URL
4. When the page arrives, get its links (optionally do other stuff)
5. REPEAT
A minimal sketch of this loop follows.
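Here is a minimal sketch of that naive loop, using only Python's standard library. The names (`naive_crawl`, `frontier`, `seen`, `LinkParser`) are illustrative, not taken from any particular crawler; politeness, error handling, and robot exclusion are deliberately omitted, which is exactly why the rest of the talk exists.

```python
# Minimal sketch of the naive crawl loop above (stdlib only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def naive_crawl(seed, max_pages=10):
    frontier = deque([seed])            # URLs waiting to be fetched
    seen = {seed}                       # URLs already enqueued
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()        # 1. pick up the next URL
        page = urlopen(url).read()      # 2-3. connect to the server, GET the URL
        parser = LinkParser()           # 4. when the page arrives, get its links
        parser.feed(page.decode("utf-8", errors="replace"))
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        # (optionally do other stuff with the page here)   5. REPEAT
```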

6  The Central Crawler Function
The crawler maintains a queue of URLs per server (Server 1 queue, Server 2 queue, Server 3 queue, ...). For each URL it must:
1. Resolve the URL to an IP address via DNS
2. Connect a socket to the server; send the HTTP request
3. Wait for the response: an HTML page
A socket-level sketch of this fetch step follows.
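A sketch of the central fetch step at the socket level, assuming plain HTTP on port 80. HTTP/1.0 is used so the server closes the connection when the response is complete; a real crawler would add timeouts and redirect handling.

```python
# Resolve the host via DNS, connect a socket, send an HTTP
# request, and wait for the response.
import socket
from urllib.parse import urlsplit

def fetch(url):
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    ip = socket.gethostbyname(host)                     # URL -> IP address via DNS
    with socket.create_connection((ip, port)) as sock:  # connect a socket to the server
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))           # send the HTTP request
        chunks = []
        while True:                                     # wait for the response
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)             # raw response: headers + HTML page
```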

7  Handling the HTTP Response
After FETCH, ask: has this document been seen before?
If no, process the document:
- Extract the text
- Extract the links
A sketch of the seen-before test follows.
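One common way to answer "document seen before?" is to fingerprint the page content with a hash and keep a set of fingerprints. Mercator's authors describe a similar content-seen test; this sketch is only an illustration, not Mercator's exact design.

```python
# Content-seen test: hash the page bytes and remember the digests.
import hashlib

seen_fingerprints = set()

def seen_before(page_bytes):
    """Return True if this exact content was fetched earlier."""
    digest = hashlib.sha1(page_bytes).digest()
    if digest in seen_fingerprints:
        return True
    seen_fingerprints.add(digest)
    return False
```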

8  Link Extraction
- Finding the links is easy (a sequential scan of the HTML)
- The links must be cleaned up and canonicalized
- They must be filtered
- They must be checked against robot exclusion rules
- They must be checked for duplicates
A canonicalization sketch follows.
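A sketch of the clean-up and filter steps. The exact canonical form is a design choice; this version resolves relative links, lowercases the host, drops fragments, keeps only non-default ports, and filters out non-HTTP schemes.

```python
# Canonicalize one extracted link relative to the page it came from.
from urllib.parse import urljoin, urlsplit, urlunsplit

def canonicalize(base_url, href):
    absolute = urljoin(base_url, href)        # resolve relative links
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):
        return None                           # filter: non-HTTP schemes
    host = (parts.hostname or "").lower()     # hostnames are case-insensitive
    default = {"http": 80, "https": 443}[parts.scheme]
    if parts.port and parts.port != default:
        host = f"{host}:{parts.port}"         # keep only non-default ports
    path = parts.path or "/"
    # drop the #fragment entirely; it never reaches the server
    return urlunsplit((parts.scheme, host, path, parts.query, ""))
```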

9  Update the Frontier
FETCH feeds PROCESS; the links extracted during processing (URL1, URL2, URL3, ...) are appended to the FRONTIER, the set of URLs waiting to be crawled.

10  Crawler Issues
- System considerations
- The URL itself
- Politeness
- Visit order
- Robot traps
- The hidden web

11  Standard for Robot Exclusion
- Proposed by Martijn Koster (1994)
- http://any-server:80/robots.txt
- Maintained by the webmaster
- Forbids access to pages or directories
- Commonly excluded: /cgi-bin/
- Adherence is voluntary for the crawler
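Python's standard library includes a robots.txt parser, so the check itself is short. Since adherence is voluntary, the crawler has to remember to call it. The server name and user-agent string below are placeholders.

```python
# Checking robot exclusion with the stdlib parser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # http://any-server:80/robots.txt
rp.read()

if rp.can_fetch("MyCrawler", "http://example.com/cgi-bin/search"):
    print("allowed to fetch")
else:
    print("excluded by robots.txt")           # e.g. /cgi-bin/ is commonly excluded
```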

12  Visit Order
The visit order is determined by how the frontier is organized:
- Breadth-first: FIFO queue
- Depth-first: LIFO queue
- Best-first: priority queue
- Random
- Refresh rate (how often already-seen pages are revisited)
Sketches of the three queue disciplines follow.
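The three main disciplines map directly onto standard data structures. A sketch (class names are illustrative):

```python
# The visit order is just the frontier's queue discipline.
import heapq
from collections import deque

class FIFOFrontier:                  # breadth-first
    def __init__(self): self.q = deque()
    def add(self, url): self.q.append(url)
    def next(self): return self.q.popleft()

class LIFOFrontier:                  # depth-first
    def __init__(self): self.q = []
    def add(self, url): self.q.append(url)
    def next(self): return self.q.pop()

class PriorityFrontier:              # best-first
    def __init__(self): self.q = []
    def add(self, url, score):
        heapq.heappush(self.q, (-score, url))   # negate: highest score first
    def next(self):
        return heapq.heappop(self.q)[1]
```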

13  Robot Traps
- Cycles in the Web graph
- Infinite links on a page
- Traps set out by the webmaster
Some common defensive heuristics are sketched below.
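Crawlers typically defend against traps with simple limits. The thresholds below are arbitrary examples, not recommended values, and the path-depth estimate is deliberately rough.

```python
# Defensive heuristics against robot traps.
MAX_URL_LENGTH = 256        # trap-generated URLs tend to grow without bound
MAX_PATH_DEPTH = 12         # .../a/b/a/b/a/b/... cycles get absurdly deep
MAX_PAGES_PER_HOST = 5000   # caps the damage any single trap can do

def looks_like_trap(url, pages_fetched_from_host):
    path_depth = url.count("/") - 2     # rough: ignore the '//' in http://
    return (len(url) > MAX_URL_LENGTH
            or path_depth > MAX_PATH_DEPTH
            or pages_fetched_from_host > MAX_PAGES_PER_HOST)
```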

14  The Hidden Web
- Dynamic pages are increasing
- Subscription pages
- Username-and-password pages
- Research is in progress on how crawlers can "get into" the hidden web

15  MERCATOR

16  Mercator Features
- One file configures a crawl
- Written in Java
- You can add your own code:
  - Extend one or more of Mercator's base classes
  - Add totally new classes called by your own
- Industrial-strength crawler: uses its own DNS and java.net package

17  The Web Is a BIG Graph
- The Web has a large "diameter"
- Even the static part cannot be crawled completely
- New technology: the focused crawl

18  Crawling and Crawlers
The Web overlays the Internet, and a crawl, starting from a seed URL, overlays the Web.

19  Focused Crawling

20  Focused Crawling
[Diagram: starting from root R, a breadth-first crawl visits every reachable page (1-7), while a focused crawl visits only the on-topic pages (1-5) and prunes the off-topic branches (marked X).]

21  Focused Crawling
Recall the diagram of a focused crawl (root R, on-topic pages 1-5, pruned branches X). A simple way to implement one is with two "knobs", defined on the next slide.

22  Focusing the Crawl
- Threshold: a page is on-topic if its correlation to the closest centroid is above this value
- Cutoff: follow links from pages whose "distance" from the closest on-topic ancestor is less than this value
A sketch of both knobs follows.
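A sketch of the two knobs. It assumes pages are represented as term-frequency dictionaries and that "correlation" is cosine similarity to the closest centroid; the THRESHOLD and CUTOFF values are illustrative, not from the talk.

```python
# Focused-crawl decision: on-topic test (threshold) plus
# tunneling limit (cutoff).
import math

THRESHOLD = 0.3   # on-topic if correlation to closest centroid >= this
CUTOFF = 1        # follow links while distance from on-topic ancestor < this

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def crawl_decision(page_vector, centroids, parent_distance):
    corr = max(cosine(page_vector, c) for c in centroids)
    on_topic = corr >= THRESHOLD
    # distance resets to 0 on an on-topic page, else grows by one hop
    distance = 0 if on_topic else parent_distance + 1
    follow_links = distance < CUTOFF    # "tunnel" through off-topic pages
    return on_topic, follow_links, distance
```

With CUTOFF = 1, an off-topic page one hop from an on-topic ancestor is still fetched, but its own links are not followed, which matches the illustration on the next slide.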

23  Illustration
[Diagram: a crawl tree with cutoff = 1; pages with correlation >= threshold are on-topic, and links are followed at most one hop beyond them.]

24  [Figure comparing "closest" and "furthest" measures]

25  Correlation vs. Crawl Length
[Plot of correlation against crawl length.]

26  Fall 2002 Student Project
[Architecture diagram: a query and a centroid are fed to Mercator; term vectors, centroids, and a dictionary drive the crawl, producing collection URLs and an HTML collection description (Chebyshev P.s).]

27  Conclusion
- We covered crawling: history, technology, deployment
- Focused crawling with tunneling
- We have a good experimental setup for exploring automatic collection synthesis

28  http://mercator.comm.nsdlib.org

