1 Crawling the Web: Discovery and Maintenance of Large-Scale Web Data
Junghoo Cho
Stanford University

2 What is a Crawler?
[Diagram: crawl loop over the web: init → get next url → get page → extract urls; data stores: initial urls, to visit urls, visited urls, web pages]
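The loop in the diagram can be sketched in a few lines of Python. This is only an illustration of the labeled steps, not the WebBase implementation: `crawl`, `get_page`, `extract_urls`, `max_pages`, and the toy in-memory "web" are all assumed names introduced here so the example runs without network access.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(initial_urls: Iterable[str],
          get_page: Callable[[str], str],
          extract_urls: Callable[[str], List[str]],
          max_pages: int = 1000) -> Dict[str, str]:
    """Breadth-first crawl loop: init, get next url, get page, extract urls."""
    to_visit = deque(initial_urls)   # "to visit urls", seeded with "initial urls"
    visited = set()                  # "visited urls"
    pages: Dict[str, str] = {}       # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # get next url
        if url in visited:
            continue
        visited.add(url)
        page = get_page(url)         # get page
        pages[url] = page
        for link in extract_urls(page):   # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages


# Toy stand-in for the web: each "page" is just a space-separated list of links.
toy_web = {"a": "b c", "b": "c", "c": "a"}
pages = crawl(["a"], toy_web.__getitem__, str.split)
```

In a real crawler `get_page` would issue an HTTP request and `extract_urls` would parse HTML; both are passed in as functions here so the loop itself stays visible.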

3 Applications
Internet Search Engines
–Google, AltaVista
Comparison Shopping Services
–mySimon, BizRate
Data mining
–Stanford WebBase, IBM WebFountain

4 WebBase Crawler
WebBase Project
BackRub Crawler, PageRank
Google
New WebBase Crawler
–20,000 lines in C/C++
–130M pages collected

5 Crawling Issues (1)
Load at visited web sites
–Space out requests to a site
–Limit number of requests to a site per day
–Limit depth of crawl
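The three limits listed on this slide can be combined into one per-site check. The sketch below is a hypothetical helper, not part of the crawler described in the talk: the class name `PolitenessPolicy`, its parameters, and the default values are assumptions chosen for illustration, and the caller supplies the current time so the logic can be exercised without waiting.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit

SECONDS_PER_DAY = 86400


class PolitenessPolicy:
    """Per-site limits: delay between requests, requests per day, crawl depth."""

    def __init__(self, min_delay_s: float = 10.0,
                 max_per_day: int = 3000, max_depth: int = 5) -> None:
        self.min_delay_s = min_delay_s
        self.max_per_day = max_per_day
        self.max_depth = max_depth
        self._last_request: Dict[str, float] = {}          # site -> last fetch time
        self._daily_count: Dict[Tuple[str, int], int] = {}  # (site, day) -> fetches

    def may_fetch(self, url: str, depth: int, now: float) -> bool:
        site = urlsplit(url).netloc
        if depth > self.max_depth:                  # limit depth of crawl
            return False
        day = int(now // SECONDS_PER_DAY)
        if self._daily_count.get((site, day), 0) >= self.max_per_day:
            return False                            # limit requests per day
        last = self._last_request.get(site)
        if last is not None and now - last < self.min_delay_s:
            return False                            # space out requests
        return True

    def record_fetch(self, url: str, now: float) -> None:
        site = urlsplit(url).netloc
        day = int(now // SECONDS_PER_DAY)
        self._last_request[site] = now
        self._daily_count[(site, day)] = self._daily_count.get((site, day), 0) + 1
```

A crawler would call `may_fetch` before each request and `record_fetch` after it, passing something like `time.time()` as `now`.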

6 Crawling Issues (2)
Load at crawler
–Parallelize
[Diagram: the crawl loop with its stores (initial urls, to visit urls, visited urls, web pages) next to a second crawl loop (init → get next url → get page → extract urls) marked "?"]

7 Crawling Issues (3)
Scope of crawl
–Not enough space for “all” pages
–Not enough time to visit “all” pages
Solution: Visit “important” pages
[Diagram labels: visited pages, Intel]

8 Crawling Issues (4)
Replication
–Pages mirrored at multiple locations

9 Crawling Issues (5)
Incremental crawling
–How do we avoid crawling from scratch?
–How do we keep pages “fresh”?

10 Summary of My Research
Load on sites [PAWS00]
Parallel crawler [Tech Report 01]
Page selection [WWW7]
Replicated page detection [SIGMOD00]
Page freshness [SIGMOD00]
Crawler architecture [VLDB00]

11 Outline of This Talk
How can we keep pages fresh?
How does the Web change?
What do we mean by “fresh” pages?
How should we refresh pages?

12 Web Evolution Experiment
How often does a Web page change?
How long does a page stay on the Web?
How long does it take for 50% of the Web to change?
How do we model Web changes?

13 Experimental Setup
February 17 to June 24, 1999
270 sites visited (with permission)
–identified 400 sites with highest “PageRank”
–contacted administrators
720,000 pages collected
–3,000 pages from each site daily
–start at root, visit breadth first (get new & old pages)
–ran only 9pm - 6am, 10 seconds between site requests

14 Average Change Interval
[Figure: fraction of pages vs. average change interval]

15 Change Interval – By Domain
[Figure: fraction of pages vs. average change interval, by domain]

16 Modeling Web Evolution
Poisson process with rate $\lambda$
$T$ is time to next event
$f_T(t) = \lambda e^{-\lambda t}$ $(t > 0)$
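Two consequences of the exponential density are worth making concrete: the probability that a page changes at least once within $t$ is $1 - e^{-\lambda t}$, and the mean time to the next change is $1/\lambda$. The helper names below are assumptions introduced for this sketch.

```python
import math


def prob_changed_within(lam: float, t: float) -> float:
    """P(T <= t) for an exponential time-to-next-change with rate lam."""
    return 1.0 - math.exp(-lam * t)


def mean_time_to_change(lam: float) -> float:
    """Expected value of T, i.e. the average change interval."""
    return 1.0 / lam


# A page that changes every 10 days on average has lam = 0.1 per day.
lam = 0.1
p_10_days = prob_changed_within(lam, 10.0)   # probability of a change within 10 days
```

Under this model a page with a 10-day average change interval has changed within 10 days only about 63% of the time, because some intervals are much longer than the average.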

17 Change Interval of Pages
For pages that change every 10 days on average
[Figure: fraction of changes with given interval vs. interval in days, with the Poisson model prediction]

18 Change Metrics
Freshness
–Freshness of element $e_i$ at time $t$ is
$$F(e_i; t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}$$
–Freshness of the database $S$ at time $t$ is
$$F(S; t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i; t)$$
(Assume “equal importance” of pages)
[Diagram: elements $e_i$ on the web and their copies in the database]

19 Change Metrics
Age
–Age of element $e_i$ at time $t$ is
$$A(e_i; t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - (\text{modification time of } e_i) & \text{otherwise} \end{cases}$$
–Age of the database $S$ at time $t$ is
$$A(S; t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i; t)$$
(Assume “equal importance” of pages)
[Diagram: elements $e_i$ on the web and their copies in the database]
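Both metrics can be computed directly from one number per element. In this sketch each local copy is represented by `stale_since`: `None` when the copy is up-to-date, otherwise the time at which the source page was modified. That representation and the function names are assumptions made for illustration.

```python
from typing import Optional, Sequence


def element_freshness(stale_since: Optional[float]) -> int:
    """1 if the copy is up-to-date, 0 otherwise."""
    return 1 if stale_since is None else 0


def element_age(stale_since: Optional[float], t: float) -> float:
    """0 if up-to-date, else time elapsed since the source was modified."""
    return 0.0 if stale_since is None else t - stale_since


def database_freshness(copies: Sequence[Optional[float]]) -> float:
    """Average freshness over all N elements (equal importance)."""
    return sum(element_freshness(s) for s in copies) / len(copies)


def database_age(copies: Sequence[Optional[float]], t: float) -> float:
    """Average age over all N elements (equal importance)."""
    return sum(element_age(s, t) for s in copies) / len(copies)


# Four copies at time t = 10: two up-to-date, two stale since t = 7 and t = 4.
copies = [None, None, 7.0, 4.0]
f_now = database_freshness(copies)      # (1 + 1 + 0 + 0) / 4
a_now = database_age(copies, t=10.0)    # (0 + 0 + 3 + 6) / 4
```

Freshness only counts how many copies are stale, while age also penalizes copies that have been stale for a long time.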

20 Change Metrics
[Figure: $F(e_i)$ (between 0 and 1) and $A(e_i)$ (starting at 0) plotted against time, with update and refresh events marked]
Time averages:
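The formulas for the time averages are not reproduced in this transcript. Assuming "time average" has its usual meaning, a long-run average of the instantaneous metric, they would take the following form (the bar notation is an assumption of this sketch):

```latex
\bar{F}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i; \tau)\, d\tau
\qquad
\bar{A}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t A(e_i; \tau)\, d\tau
```

The same definition applies to the database as a whole by replacing $e_i$ with $S$.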

21 Refresh Order
Fixed order
–Explicit list of URLs to visit
Random order
–Start from seed URLs & follow links
Purely random
–Refresh pages on demand, as requested by user
[Diagram: elements $e_i$ on the web and their copies in the database]

22 Freshness vs. Revisit Frequency
$r = \lambda / f$ = average change frequency / average visit frequency
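For intuition about how freshness depends on $r$, consider one special case: a page that changes as a Poisson process with rate $\lambda$ and is refreshed at fixed intervals $1/f$. The copy is still fresh at time $s$ after a refresh with probability $e^{-\lambda s}$, and averaging over one interval gives $(1 - e^{-r})/r$. The sketch below evaluates that expression; it is a derivation under these stated assumptions and is not claimed to reproduce every curve on the slide.

```python
import math


def fixed_order_freshness(r: float) -> float:
    """Time-averaged freshness for r = lam / f under fixed-interval refresh."""
    if r == 0.0:
        return 1.0                      # page never changes, copy is always fresh
    return -math.expm1(-r) / r          # (1 - e^{-r}) / r, computed stably


for r in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"r = {r:>4}: freshness = {fixed_order_freshness(r):.3f}")
```

Freshness drops as $r$ grows: when a page changes as often as it is visited ($r = 1$), its copy is fresh roughly 63% of the time.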

23 Age vs. Revisit Frequency
$r = \lambda / f$ = average change frequency / average visit frequency
[Figure: age, normalized as Age / time to refresh all N elements, plotted against $r$]
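The same special case (Poisson changes with rate $\lambda$, refresh at fixed intervals $I = 1/f$) gives a closed form for the normalized age. At time $s$ after a refresh the expected age is $s - (1 - e^{-\lambda s})/\lambda$; averaging over $[0, I]$ and dividing by $I$ yields $\tfrac{1}{2} - \tfrac{1}{r} + \tfrac{1 - e^{-r}}{r^2}$. The function name is an assumption, and this is a sketch of one policy rather than a reproduction of the plot.

```python
import math


def normalized_age(r: float) -> float:
    """Time-averaged age divided by the refresh interval, for r = lam / f."""
    if r < 1e-4:
        return r / 6.0                  # leading term of the series, avoids cancellation
    return 0.5 - 1.0 / r + (-math.expm1(-r)) / (r * r)


for r in (0.1, 1.0, 10.0, 100.0):
    print(f"r = {r:>5}: age / interval = {normalized_age(r):.4f}")
```

As $r$ grows the value approaches 1/2: a page that changes almost immediately after every refresh is, on average, stale for half the refresh interval.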

24 Trick Question
Two page database
e1 changes daily
e2 changes once a week
Can visit one page per week
How should we visit pages?
–e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform]
–e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional]
–e1 e1 e1 e1 e1 e1 ...
–e2 e2 e2 e2 e2 e2 ...
–?
[Diagram: e1 and e2 on the web and their copies in the database]

25 Proportional Often Not Good!
Visit fast changing e1 → get 1/2 day of freshness
Visit slow changing e2 → get 1/2 week of freshness
Visiting e2 is a better deal!
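The comparison can be checked numerically. The sketch below scores the four visiting patterns from the trick question, assuming Poisson changes (7 per week for e1, 1 per week for e2), fixed-interval refresh, and a total budget of one visit per week. The slide's "1/2 day" and "1/2 week" figures come from a simpler argument, so the numbers here illustrate the same ordering rather than the slide's exact values.

```python
import math


def avg_freshness(lam: float, f: float) -> float:
    """Time-averaged freshness of a page with change rate lam visited at rate f."""
    if f <= 0.0:
        return 0.0                      # never refreshed: stale in the long run
    r = lam / f
    return -math.expm1(-r) / r          # (1 - e^{-r}) / r


LAM_E1, LAM_E2 = 7.0, 1.0               # changes per week
policies = {                            # (visits/week to e1, visits/week to e2)
    "uniform": (0.5, 0.5),
    "proportional": (7 / 8, 1 / 8),
    "only e1": (1.0, 0.0),
    "only e2": (0.0, 1.0),
}
scores = {
    name: (avg_freshness(LAM_E1, f1) + avg_freshness(LAM_E2, f2)) / 2
    for name, (f1, f2) in policies.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>12}: {score:.3f}")
```

Under these assumptions the proportional policy scores below the uniform one, and spending the whole budget on the slowly changing page scores highest.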

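26 Optimal Refresh Frequency
Problem
Given the change frequencies $\lambda_i$ and the average visit frequency $f$, find the visit frequencies $f_i$ that maximize the freshness of the database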

27 Solution
Compute the derivative of each page's freshness with respect to its visit frequency $f_i$
Lagrange multiplier method
All of these derivatives are equal at the optimum
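A numerical version of the Lagrange multiplier method can be sketched as follows. It assumes the Poisson, fixed-interval freshness $(f/\lambda)(1 - e^{-\lambda/f})$ for each page, whose derivative with respect to $f$ is $(1 - (1 + r)e^{-r})/\lambda$ with $r = \lambda/f$; it then searches for a multiplier $\mu$ such that every visited page has derivative $\mu$ and the frequencies sum to the budget. All function names are assumptions, and this is not presented as the author's solution procedure.

```python
import math
from typing import List, Sequence


def marginal_gain(lam: float, f: float) -> float:
    """Derivative of (f/lam) * (1 - exp(-lam/f)) with respect to f."""
    if f <= 0.0:
        return 1.0 / lam                # limit as f -> 0
    r = lam / f
    return (1.0 - (1.0 + r) * math.exp(-r)) / lam


def frequency_at(lam: float, mu: float, f_cap: float) -> float:
    """Visit frequency in [0, f_cap] whose marginal gain equals mu."""
    if mu >= 1.0 / lam:
        return 0.0                      # even the first visit gains less than mu
    if marginal_gain(lam, f_cap) >= mu:
        return f_cap
    lo, hi = 0.0, f_cap                 # marginal_gain decreases as f grows
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if marginal_gain(lam, mid) > mu:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def optimal_frequencies(lams: Sequence[float], budget: float) -> List[float]:
    """Bisect on the multiplier mu until the frequencies fit the budget."""
    lo, hi = 0.0, max(1.0 / lam for lam in lams)
    for _ in range(100):
        mu = (lo + hi) / 2.0
        total = sum(frequency_at(lam, mu, budget) for lam in lams)
        if total > budget:
            lo = mu                     # spending too much: raise the multiplier
        else:
            hi = mu
    return [frequency_at(lam, hi, budget) for lam in lams]


# Two-page example: e1 changes 7 times/week, e2 once/week, one visit per week.
print(optimal_frequencies([7.0, 1.0], 1.0))
```

For the two-page example the solver gives e1 no visits at all and spends the whole budget on e2, which matches the earlier observation that visiting e2 is the better deal.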

28 Optimal Refresh Frequency
Shape of curve is the same in all cases
Holds for any change frequency distribution

29 Optimal Refresh for Age
Shape of curve is the same in all cases
Holds for any change frequency distribution

30 Comparing Policies
Based on statistics from the experiment and a revisit frequency of once a month

31 Topics to Follow
Weighted Freshness
Non-Poisson Model
Change Frequency Estimation

32 Not Every Page is Equal!
In general,
[Diagram: pages e1 and e2, labeled “Accessed by users 20 times/day” and “Accessed by users 10 times/day”]
Some pages are “more important”

33 Weighted Freshness
[Figure: $f$ plotted against $\lambda$, with curves for $w = 1$ and $w = 2$]
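The definition of weighted freshness is not reproduced in this transcript. A natural reading, assumed here, is a weighted average of per-page freshness in which each page $e_i$ carries an importance weight $w_i$ (for example, proportional to how often users access it). The function name and the example weights are hypothetical.

```python
from typing import Sequence


def weighted_freshness(fresh: Sequence[bool], weights: Sequence[float]) -> float:
    """Weighted average of 0/1 freshness values; weights express importance."""
    if len(fresh) != len(weights):
        raise ValueError("fresh and weights must have the same length")
    total = sum(weights)
    return sum(w for is_fresh, w in zip(fresh, weights) if is_fresh) / total


# One fresh page with weight 2 and one stale page with weight 1.
score = weighted_freshness([True, False], [2.0, 1.0])
```

With equal weights this reduces to the unweighted freshness of the database.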

34 Non-Poisson Model
[Figure: fraction of changes with given interval vs. interval in days; curves: Poisson model, heavy-tail distribution]

35 Optimal Revisit Frequency for Heavy-Tail Distribution
[Figure: $f$ plotted against $\lambda$]

36 Principle of Diminishing Return
$T$: time to next change
Distribution of $T$: continuous, differentiable
Every page changes
Definition of change rate $\lambda$

37 Change Frequency Estimation
How to estimate change frequency?
–Naïve Estimator: X/T
–X: number of detected changes
–T: monitoring period
–2 changes in 10 days: 0.2 times/day
Incomplete change history
[Diagram: timeline in 1-day steps marking when the page was visited, when the page changed, and when a change was detected]
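The effect of an incomplete change history can be seen in a small simulation. The sketch below assumes a page that changes as a Poisson process and is visited once a day, so several changes within one day are detected as a single change; the function name, the rate, the monitoring period, and the seed are all illustrative choices.

```python
import math
import random
from typing import Tuple


def simulate_daily_visits(lam: float, days: int, seed: int = 0) -> Tuple[int, int]:
    """Return (actual changes, changes detected by one visit per day)."""
    rng = random.Random(seed)
    actual_changes = 0
    days_with_change = set()
    t = rng.expovariate(lam)            # time of the first change
    while t < days:
        actual_changes += 1
        days_with_change.add(int(t))    # a daily visit only sees "changed or not"
        t += rng.expovariate(lam)       # exponential gap to the next change
    return actual_changes, len(days_with_change)


DAYS = 20000
actual, detected = simulate_daily_visits(lam=1.0, days=DAYS)
naive_estimate = detected / DAYS        # X / T
print(f"true rate 1.0/day, naive estimate {naive_estimate:.3f}/day")
```

With a true rate of one change per day, the naïve estimate settles near $1 - e^{-1} \approx 0.63$ changes per day, because the daily visits miss every change after the first one in a day.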

38 Improved Estimator
Based on the Poisson model
–X: number of detected changes
–N: number of accesses
–f: access frequency
3 changes in 10 days: 0.36 times/day
→ Accounts for “missed” changes
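The estimator's formula is not reproduced in this transcript. One Poisson-based estimator that matches the slide's example is sketched below: with accesses $1/f$ apart, the probability of seeing no change between two accesses is $e^{-\lambda/f}$, the observed fraction of such accesses is $(N - X)/N$, and solving for $\lambda$ gives $-f \ln((N - X)/N)$. The function names are assumptions, and the author's published estimator may include further corrections.

```python
import math


def naive_estimate(x: int, t: float) -> float:
    """Detected changes divided by the monitoring period."""
    return x / t


def poisson_estimate(x: int, n: int, f: float) -> float:
    """Solve exp(-lam / f) = (n - x) / n for lam."""
    if x >= n:
        return math.inf                 # a change was seen at every access
    return -f * math.log((n - x) / n)


# Daily accesses (f = 1 per day) for 10 days, 3 detected changes.
print(f"naive:   {naive_estimate(3, 10.0):.2f} times/day")
print(f"Poisson: {poisson_estimate(3, 10, 1.0):.2f} times/day")
```

The Poisson-based value is higher than the naïve one because it allows for changes that happened between accesses but could not be observed separately.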

39 Improved Estimator
Bias
Efficiency
Consistency

40 Improvement Significant?
Application to a Web crawler
–Visit pages once every week for 5 weeks
–Estimate change frequency
–Adjust revisit frequency based on the estimate
»Uniform: do not adjust
»Naïve: based on the naïve estimator
»Ours: based on our improved estimator

41 Improvement from Our Estimator
Policy    Detected changes    Ratio to uniform
Uniform   2,147,589           100%
Naïve     4,145,582           193%
Ours      4,892,116           228%
(9,200,000 visits in total)

42 Other Estimators
Irregular access interval
Last-modified date
Categorization

43 Summary
Web evolution experiment
Change metric
Refresh policy
Frequency estimator

44 Contribution
Freshness [SIGMOD00]
Page selection [WWW7]
Replicated page detection [SIGMOD00]
Load on sites [PAWS00]
Parallel crawler [Tech Report 01]
Crawler architecture [VLDB00]

45 The End
Thank you for your attention
For more information visit http://www-db.stanford.edu/~cho/
