The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University.


1 The Evolution of the Web and Implications for an Incremental Crawler Junghoo Cho Stanford University

2 What is a Crawler? [Diagram: crawler loop over the web: init (from the initial urls), get next url, get page, extract urls; data stores labeled "to visit urls", "visited urls", and "web pages"]
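The loop in the diagram can be sketched roughly as below. This is a minimal illustration, not the crawler from the talk: the function name `crawl`, the `max_pages` cap, and the injected `get_page` / `extract_urls` callables are all assumptions made here so the example runs without network access.

```python
from collections import deque


def crawl(initial_urls, get_page, extract_urls, max_pages=100):
    """Basic crawl loop: init, get next url, get page, extract urls.

    get_page(url) returns the page content or None (hypothetical fetcher).
    extract_urls(page) returns the links found in a page (hypothetical parser).
    """
    to_visit = deque(initial_urls)  # "to visit urls", seeded by init
    visited = set()                 # "visited urls"
    pages = {}                      # "web pages" collected so far

    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()    # get next url (FIFO gives breadth-first order)
        if url in visited:
            continue
        visited.add(url)

        page = get_page(url)        # get page
        if page is None:
            continue
        pages[url] = page

        for link in extract_urls(page):  # extract urls
            if link not in visited:
                to_visit.append(link)

    return pages


# Toy in-memory "web": each page is simply the list of links it contains.
toy_web = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": []}
pages = crawl(["a"], toy_web.get, lambda links: links)
```

Using a FIFO queue makes the visit order breadth-first; swapping the queue discipline changes which pages are reached first without changing the loop itself.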

3 Crawling Issues (1): load at visited web sites; load at crawlers; scope of the crawl.

4 Crawling Issues (2) Typical crawler: periodic, batch, shadowing. Incremental crawling: maintain pages “fresh”; avoid crawling from scratch. How do we crawl?

5 Outline: web evolution experiments; freshness metrics; design issues and comparison.

6 Web Evolution Experiment How often does a web page change? What is the lifespan of a page? How long does it take for 50% of the web to change?

7 Experimental Setup: February 17 to June 24, 1999. 270 sites visited (with permission): identified 400 sites with highest “page rank”; contacted administrators. 720,000 pages collected: 3,000 pages from each site daily; start at root, visit breadth first (get new & old pages); ran only 9pm - 6am, 10 seconds between site requests.
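The politeness constraints on this slide (run only 9pm to 6am, 10 seconds between requests to a site) could be checked with something like the sketch below. The function and constant names are hypothetical, and the use of the crawler's local clock is an assumption; the slide does not say which time zone the window refers to.

```python
from datetime import datetime, time

CRAWL_START = time(21, 0)    # 9pm, from the slide
CRAWL_END = time(6, 0)       # 6am, from the slide
REQUEST_DELAY_SECONDS = 10   # gap between requests to one site, from the slide


def in_crawl_window(now):
    """True if `now` falls in the overnight window, which wraps past midnight."""
    t = now.time()
    return t >= CRAWL_START or t < CRAWL_END


def seconds_to_wait(last_request, now):
    """Seconds still to wait before the same site may be requested again."""
    elapsed = (now - last_request).total_seconds()
    return max(0.0, REQUEST_DELAY_SECONDS - elapsed)
```

Because the window spans midnight, the check is an `or` of two comparisons rather than a single range test.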

8 How Often Does a Page Change? Example: 50 visits to page, 5 changes → average change interval = 50/5 = 10 days. Is this correct? [Figure: timeline of page visits spaced 1 day apart, with changes marked]
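The arithmetic in this example can be written out as a short sketch. It assumes each daily visit yields a snapshot (for instance a checksum) that is compared with the previous one; the function name and the snapshot representation are illustrative, not taken from the talk.

```python
def average_change_interval(snapshots, days_between_visits=1):
    """Naive estimate: days observed divided by the number of detected changes.

    `snapshots` holds one comparable value per visit (e.g. a checksum).
    Returns None when no change was detected during the observation period.
    """
    changes = sum(1 for prev, cur in zip(snapshots, snapshots[1:]) if prev != cur)
    if changes == 0:
        return None
    return len(snapshots) * days_between_visits / changes


# 50 daily visits during which the content changed 5 times.
daily_checksums = [day // 9 for day in range(50)]
estimate = average_change_interval(daily_checksums)
```

A daily snapshot can register at most one change per day, so for pages that change more often than they are visited this naive estimate can only overstate the interval.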

9 Average Change Interval [Chart; axis label: fraction of pages]

10 Average Change Interval: By Domain [Chart; axis label: fraction of pages]

11 How Long Does a Page Live? [Figure: four panels, each showing a page lifetime against the experiment duration]

12 Page Lifespans [Chart; axis label: fraction of pages]

13 Page Lifespans (Method 1 used) [Chart; axis label: fraction of pages]

14 Time for a 50% Change [Chart; axes: days, fraction of unchanged pages]

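15 Change Metrics: Freshness [SIGMOD 2000]
Freshness of element $e_i$ at time $t$ is
$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$
Freshness of the database $S$ at time $t$ is
$$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$
[Diagram: elements $e_i$ on the web and their copies in the database]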
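As a rough illustration of the freshness definition, the sketch below compares a local collection with the live web at a single instant. Representing both sides as dictionaries from page id to content is an assumption made for the example, as are the function names.

```python
def element_freshness(local_copy, live_copy):
    """F(e_i; t): 1 if the local copy matches the live page, 0 otherwise."""
    return 1 if local_copy == live_copy else 0


def database_freshness(local, live):
    """F(S; t): mean of the element freshness over the N local elements."""
    if not local:
        raise ValueError("freshness is undefined for an empty database")
    total = sum(element_freshness(content, live.get(key))
                for key, content in local.items())
    return total / len(local)


# Toy snapshot: two of the four local copies still match the live web.
local_db = {"p1": "v1", "p2": "v1", "p3": "v2", "p4": "v1"}
live_web = {"p1": "v1", "p2": "v2", "p3": "v2", "p4": "v3"}
freshness_now = database_freshness(local_db, live_web)
```

Freshness is a fraction between 0 and 1, so the toy value here means half of the stored pages are up to date at that instant.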

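16 Change Metrics: Age [SIGMOD 2000]
Age of element $e_i$ at time $t$ is
$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$
Age of the database $S$ at time $t$ is
$$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$
[Diagram: elements $e_i$ on the web and their copies in the database]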
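The age definition can be sketched the same way. One assumption is made loudly here: "modification time" is read as the time at which the live page changed and the local copy became out of date, passed in as `stale_since` (or `None` when the copy is still up to date). The names are illustrative only.

```python
def element_age(t, stale_since):
    """A(e_i; t): 0 if up to date, else time elapsed since the page was modified.

    `stale_since` is the modification time that made the local copy stale
    (an interpretation assumed for this sketch), or None if it is up to date.
    """
    return 0.0 if stale_since is None else float(t - stale_since)


def database_age(t, stale_since_by_element):
    """A(S; t): mean of the element ages over the N elements."""
    if not stale_since_by_element:
        raise ValueError("age is undefined for an empty database")
    total = sum(element_age(t, s) for s in stale_since_by_element.values())
    return total / len(stale_since_by_element)


# At t = 10 days: p2 has been stale since day 4, p3 since day 7.
age_now = database_age(10, {"p1": None, "p2": 4, "p3": 7, "p4": None})
```

Unlike freshness, age keeps growing for as long as a stale copy is left unrefreshed, so it penalizes long-neglected pages more heavily.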

17 Crawler Types: in-place vs. shadow; steady vs. batch. [Diagrams: web elements $e_i$ copied into the database, with a separate shadow database; timeline marking crawler on / crawler off]

18 Comparison: Batch vs. Steady [Charts: batch mode in-place crawler; steady in-place crawler; marked periods: crawler running]

19 Shadowing Steady Crawler [Chart labels: crawler’s collection; current collection; without shadowing]

20 Shadowing Batch Crawler [Chart labels: crawler’s collection; current collection; without shadowing]

21 Experimental Data: Freshness. Pages change on average every 4 months; batch crawler works one week out of 4. [Figure residue, labels lost: 1, 2, 0.63, 0.50]

22 Uniform vs. Variable In-place, steady crawler; Based on our experimental data [Pages change at different frequencies, as measured in experiment.] [SIGMOD 2000]

23 Summary: steady, in-place crawling with variable visit frequencies improves freshness! Improvement depends on how the web changes.

24 The End. The paper proposes an architecture. Thank you for your attention.

