Web Crawling and Automatic Discovery. Donna Bergmark, March 14, 2002.

1 Web Crawling and Automatic Discovery. Donna Bergmark, March 14, 2002

2 Web Resource Discovery
Surfing → serendipity
Search → specific information
Inverted keyword list → page lookup
Crawler → text for keyword indexing
Hence, crawlers are needed for discovery of Web resources

3 Definition
Spider = robot = crawler
Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.

4 Some History
The first crawlers appeared in 1994. Why? Web growth:
April 1993: 62 registered web servers
In 1994, Web (HTTP) traffic grew 15x faster than the Internet itself
Lycos was announced as a search engine in 1994

5 So, why not write a robot? You’d think a crawler would be easy to write (a sketch of this loop follows below):
Pick up the next URL
Connect to the server
GET the URL
When the page arrives, get its links (optionally do other stuff)
REPEAT
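
A minimal sketch in Java of the loop the slide describes, assuming a Java 11+ HttpClient; the seed URL and the regex-based link extraction are illustrative choices, and politeness and duplicate detection are deliberately left out, since they are the subject of the next slides.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveCrawler {
    // Crude link extraction; a real crawler would use an HTML parser.
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add("http://www.example.com/");   // hypothetical seed URL
        HttpClient client = HttpClient.newHttpClient();

        while (!frontier.isEmpty()) {
            String url = frontier.poll();                       // pick up the next URL
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =                     // connect to the server, GET the URL
                client.send(request, HttpResponse.BodyHandlers.ofString());

            Matcher m = LINK.matcher(response.body());          // when the page arrives, get its links
            while (m.find()) {
                frontier.add(m.group(1));                       // ...and REPEAT
            }
            // (optionally do other stuff with the page here, e.g. index its text)
        }
    }
}
```

As written, the loop never checks whether it has seen a URL before and fetches as fast as it can, which is exactly why the issues on the next slide matter.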

6 Crawler Issues
The URL itself
Politeness
Visit order
Robot traps
The hidden Web
System considerations

7 Standard for Robot Exclusion
Martijn Koster (1994)
http://any-server:80/robots.txt
Maintained by the webmaster
Forbids access to pages and directories
Commonly excluded: /cgi-bin/
Adherence is voluntary for the crawler (an example file follows below)
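
A hypothetical robots.txt of the kind the slide describes; the directory names and the crawler name BadBot are invented for illustration.

```
# http://any-server:80/robots.txt  (maintained by the webmaster)
User-agent: *            # applies to every crawler
Disallow: /cgi-bin/      # commonly excluded
Disallow: /private/      # hypothetical directory the webmaster wants hidden

User-agent: BadBot       # a specific crawler can be singled out
Disallow: /              # ...and forbidden from the whole site
```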

8 The Four Laws of Web Robotics (a sketch applying them follows below)
A crawler must identify itself
A crawler must obey robots.txt
A crawler must not hog resources
A crawler must report errors
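
A rough Java sketch of how the four rules might look in a fetch routine; the user-agent string, the one-second delay, and the isAllowedByRobotsTxt placeholder are assumptions, not something the slides prescribe.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PoliteFetcher {
    private final HttpClient client = HttpClient.newHttpClient();

    String fetch(String url) {
        try {
            if (!isAllowedByRobotsTxt(url)) {          // Law 2: obey robots.txt
                return null;
            }
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    // Law 1: identify yourself (hypothetical crawler name and contact address)
                    .header("User-Agent", "ExampleCrawler/0.1 (crawler-admin@example.org)")
                    .GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            Thread.sleep(1000);                        // Law 3: don't hog resources; pause between requests
            return response.body();
        } catch (Exception e) {
            System.err.println("Error fetching " + url + ": " + e);  // Law 4: report errors
            return null;
        }
    }

    private boolean isAllowedByRobotsTxt(String url) {
        // Placeholder: a real crawler would fetch and parse /robots.txt for the URL's host.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(new PoliteFetcher().fetch("http://www.example.com/"));
    }
}
```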

9 Visit Order
The frontier: the queue of URLs waiting to be visited
Breadth-first: FIFO queue
Depth-first: LIFO queue
Best-first: priority queue
Random
Refresh rate
(a sketch of the three queue disciplines follows below)
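
The three visit orders map directly onto standard collections; a minimal Java sketch, where the score used for best-first ordering is an assumed placeholder (in practice it would be some estimate of a page's value to the crawl).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Deque;
import java.util.PriorityQueue;

public class Frontier {
    // Breadth-first: FIFO queue -- addLast() to enqueue, pollFirst() to dequeue.
    Deque<String> fifo = new ArrayDeque<>();

    // Depth-first: LIFO queue -- addFirst() to push, pollFirst() to pop.
    Deque<String> lifo = new ArrayDeque<>();

    // Best-first: priority queue ordered by an estimated score, highest first.
    PriorityQueue<ScoredUrl> best =
            new PriorityQueue<>(Comparator.comparingDouble((ScoredUrl u) -> u.score()).reversed());

    record ScoredUrl(String url, double score) {}

    public static void main(String[] args) {
        Frontier f = new Frontier();
        f.fifo.addLast("http://www.example.com/a");   // hypothetical URLs
        f.lifo.addFirst("http://www.example.com/b");
        f.best.add(new ScoredUrl("http://www.example.com/c", 0.9));
        System.out.println(Arrays.asList(
                f.fifo.pollFirst(), f.lifo.pollFirst(), f.best.poll().url()));
    }
}
```

The same ArrayDeque serves for both breadth-first and depth-first; only the end you add to changes, which is why a crawler can switch visit order with a one-line change.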

10 Robot Traps
Cycles in the Web graph
Infinite links on a page
Traps set out by the webmaster

11 The Hidden Web
Dynamic pages are increasing
Subscription pages
Username and password pages
Research is in progress on how crawlers can “get into” the hidden Web

12 System Issues
Crawlers are complicated systems
Efficiency is of the utmost importance
Crawlers are demanding of system and network resources

14 Mercator - 1
Written in Java
One file configures a crawl:
– How many threads
– What analyzers to use
– What filters to use
– How to place links on the frontier
– How long to run

15 Mercator - 2
Tell it what seed URL(s) to start with
Can add your own code:
– Extend one or more of Mercator's base classes
– Add totally new classes called by your own
Is very efficient in its memory usage (a generic illustration follows below):
– URLs are hashed
– Documents are fingerprinted
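
A generic Java illustration of the idea behind "URLs are hashed, documents are fingerprinted": fixed-length digests stand in for the full strings, so duplicate checks stay cheap in memory. This is not Mercator's actual code, and the use of SHA-1 here is an assumption made only for the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class SeenBefore {
    private final Set<String> seenUrls = new HashSet<>();      // digests of URLs already scheduled
    private final Set<String> seenContent = new HashSet<>();   // fingerprints of documents already seen

    // Returns true the first time a URL is offered, false on any repeat.
    boolean urlIsNew(String url) throws Exception {
        return seenUrls.add(digest(url));
    }

    // Returns true the first time a document body is seen, false for a duplicate page.
    boolean contentIsNew(String body) throws Exception {
        return seenContent.add(digest(body));
    }

    private static String digest(String s) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] hash = md.digest(s.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}
```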

16 Mercator - 3
Industrial-strength crawler:
– Multi-threaded for parallel crawls
– Polite: one thread per server
– Mercator implements its own host lookup
– Mercator uses its own DNS

17 The Web as a Graph
Crawling is meant to traverse the Web
Remove some edges to create a tree
– i.e., do not revisit URLs
You can only crawl forwards
– i.e., explicit back-links are needed
Page rank (a toy sketch follows below)
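
Page rank, the slide's last bullet, is the best-known link-based ordering signal; below is a toy power-iteration sketch in Java, where the three-page link graph and the damping factor of 0.85 are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class TinyPageRank {
    public static void main(String[] args) {
        // A toy three-page web: page 0 links to 1 and 2, page 1 links to 2, page 2 links back to 0.
        Map<Integer, List<Integer>> links = Map.of(
                0, List.of(1, 2),
                1, List.of(2),
                2, List.of(0));
        int n = links.size();
        double d = 0.85;                       // damping factor (assumed value)
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);            // start with a uniform distribution

        for (int iter = 0; iter < 50; iter++) {             // power iteration
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);
            for (var e : links.entrySet()) {
                for (int target : e.getValue()) {
                    // Each page spreads its rank evenly over its out-links.
                    next[target] += d * rank[e.getKey()] / e.getValue().size();
                }
            }
            rank = next;
        }
        System.out.println(Arrays.toString(rank));
    }
}
```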

18 The Web is a BIG Graph
The “diameter” of the Web
Cannot crawl even the static part completely
New technology: the focused crawl

20 Conclusion
Clearly, crawling is not simple
A hot topic of late-1990s research
Good technologies as a result
Focused crawling is where crawling is going next (a hot topic of the early 2000s)

