The Content and Access Dynamics of a Busy Web Server: Findings and Implications
Venkata N. Padmanabhan (Microsoft Research), Lili Qiu (Cornell University)

1 The Content and Access Dynamics of a Busy Web Server: Findings and Implications
  Venkata N. Padmanabhan, Microsoft Research
  Lili Qiu, Cornell University
  SIGCOMM 2000, Stockholm, Sweden, August 30, 2000

2 Outline
  - Motivation
  - Related Work
  - Overview
  - Content Dynamics
  - Access Dynamics
  - Summary & Implications
  - Future Work

3 Motivation
  - Solid understanding of Web workload is critical for designing robust and scalable systems
  - Each of the Web components provides a unique perspective on the functioning of the Web
  [Figure: clients, proxies, the Internet, replicas, and servers]

4 Motivation (Cont.)
  - Distinguishing features of our work
    - Study the MSNBC Web site: a large news server, consistently ranked among the busiest sites on the Web
    - Study content & access dynamics
      - The dynamics of file modification and creation
      - The dynamics of user accesses

5 Related Work
  - Server-based studies
    - [ABC+96] observed
      - File popularity follows Zipf's distribution ($\alpha \approx 1$)
      - Temporal locality in file accesses
    - [AW96] found 10 invariants
      - 10% of files account for 90% of accesses
    - [MS97]
      - Long latencies are not necessarily due to server overloading or CGI traffic
    - [AJ99] studied 1998 World Cup traces
      - Significant volume of cache consistency traffic

6 Related Work (Cont.)
  - Proxy workload characterization
    - Page popularity follows a Zipf-like distribution, i.e. request frequency $\propto 1/i^{\alpha}$ ($\alpha < 1$) [BCF+99]
    - Hit rate of proxy caches is no more than 50% [DMF97, GB97]
    - A substantial fraction of misses arises from first-time accesses [VDA+99]
    - Significance of organizational membership [WVS+99]
  - Client-based studies
    - [CBC95] and [BBB+98] report change in file popularity and temporal locality

7 Overview
  - MSNBC server site
    - A large news site
    - Server cluster with 40 nodes
    - 25 million accesses a day (HTML content alone)
    - Period studied: Aug. – Oct. 99 & Dec. 17, 98 flash crowd
  - Server logs
    - HTTP access logs
    - Content Replication System (CRS) logs
    - HTML content logs
  - Data analysis
    - Content dynamics
    - Access dynamics

8 Major Findings
  - Content dynamics
    - Modification history is a rough predictor
    - Frequent but minimal file modifications
  - Access dynamics
    - Set of popular files remains stable for days
    - Domain membership has a significant bearing on client accesses, except during a flash crowd of global interest
    - Zipf-like distribution of file popularity, but with a much larger $\alpha$ than at proxies
    - Accesses to old documents account for most first-time misses → hard to anticipate such accesses and eliminate these first-time misses

9 Content Dynamics
  - Period studied: 10/1/99 – 10/28/99
  - CDF of modification intervals
    - Distinct knees in the CDF at one hour and one day
  - Predictive power of modification history
    - Modification history is a rough predictor of future modification interval
  - Extent of change upon file modification
    - Most file modifications are minimal → delta encoding can be very useful

10 CDF of Modification Intervals
  - Distinct knees in the CDF at one hour and one day

11 Predictive Power of Modification History
  - Has significant bearing on cache consistency control algorithms, such as adaptive TTL
  - Prediction algorithm studied
    - Estimate the future modification interval as the mean of the past x samples
  - Performance metrics
    - Correlation coefficient between the predicted and actual values
    - Error in prediction
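A minimal sketch of the predictor and its two metrics, for illustration only. The talk does not give code; the function names, the sample data, and the choice to express error as a percentage of the actual interval are assumptions.

```python
from statistics import mean

def predict_intervals(intervals, window):
    """Predict each interval as the mean of the preceding `window` samples.

    Returns aligned (predicted, actual) lists covering only positions
    that have a full window of history.
    """
    predicted, actual = [], []
    for k in range(window, len(intervals)):
        predicted.append(mean(intervals[k - window:k]))
        actual.append(intervals[k])
    return predicted, actual

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def percentage_errors(predicted, actual):
    """Absolute error as a percentage of the actual value (assumed definition)."""
    return [abs(p - a) / a * 100.0 for p, a in zip(predicted, actual)]

# Hypothetical modification intervals in seconds (not taken from the traces).
intervals = [3600, 3500, 3700, 3600, 7200, 3600, 3400, 3800]
pred, act = predict_intervals(intervals, window=2)
```

Calling `correlation(pred, act)` and taking the mean or median of `percentage_errors(pred, act)` gives the two kinds of metric named on the slide for a chosen window size.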

12 Correlation Coefficient
  - A larger averaging window size helps to predict the future modification interval up to a certain point.

13 Error in Prediction
  - Averaging window: 16 samples
  - Mean error: 226%
  - Median error: 45%
  - Percentage error in predicting file modification interval
  - Modification history yields a rough predictor → need alternative mechanism (e.g. callback-based invalidation) as backup

14 Extent of Change Upon File Modifications
  - Compute delta using the vdelta algorithm
  - Metric $\delta$ defined as $\delta = \dfrac{|\mathrm{vdelta}(v_1, v_2)|}{(|v_1| + |v_2|)/2}$
  - Results
    - In 77% of cases, $\delta \le 1\%$
    - In 96% of cases, $\delta \le 10\%$
  - Modification between successive versions is small → delta encoding can be very useful
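A rough sketch of the metric: delta size divided by the mean size of the two versions. vdelta is not available in the Python standard library, so this uses zlib with the old version as a preset dictionary as a stand-in. That substitution, the function names, and the sample data are assumptions, and the resulting numbers are not comparable to those in the talk.

```python
import random
import zlib

def delta_size(old: bytes, new: bytes) -> int:
    """Size of `new` compressed with `old` as a preset dictionary.

    Stand-in for |vdelta(v1, v2)|; this is NOT the vdelta algorithm.
    zlib only exploits the last 32 KB of the dictionary.
    """
    comp = zlib.compressobj(level=9, zdict=old)
    return len(comp.compress(new) + comp.flush())

def delta_ratio(old: bytes, new: bytes) -> float:
    """Delta size divided by the mean size of the two versions."""
    return delta_size(old, new) / ((len(old) + len(new)) / 2)

# Hypothetical file versions: v2 is v1 with a small edit, v3 is unrelated.
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(4000))
v2 = v1[:2000] + b"UPDATED" + v1[2000:]
v3 = bytes(rng.getrandbits(8) for _ in range(4000))

small_change = delta_ratio(v1, v2)
unrelated = delta_ratio(v1, v3)
```

A small edit yields a ratio far below that of an unrelated file, which is the property that makes delta encoding attractive when successive versions differ only slightly.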

15 Access Dynamics
  - Correlation between content and access dynamics
    - Impact of age on file popularity
    - Causes of first-time misses
  - Spatial locality in client accesses
    - Domain membership is significant except when there is a “hot” event of global interest
  - Temporal stability of file popularity
    - The set of popular documents mostly remains stable over a timescale of days
  - Distribution of file popularity
    - Zipf-like distribution but with a much larger $\alpha$ than at proxies

16 Impact of Age on Popularity
  - For most documents, accesses are concentrated soon after creation

17 Causes of First-time Misses
  - Up to 40% of cache misses are due to first-time misses [VDA+99]

    Date          New files (%)   Old files (%)
    Oct. 8, 99    23.16           76.84
    Oct. 9, 99    13.22           86.78
    Oct. 10, 99   13.25           86.75
    Oct. 11, 99   18.75           81.28

  - Accesses to old documents account for most first-time misses → hard to anticipate such accesses & eliminate first-time misses

18 Temporal Stability of File Popularity
  - Methodology
    - Consider the traces from a pair of days
    - Pick the top n popular documents from each day
    - Compute the overlap
  - Results
    - One day apart: significant overlap ($\ge 80\%$)
    - Two months apart: smaller overlap (20-80%)
    - Ten months apart: very small overlap (mostly below 20%)
  - The set of popular documents remains stable for days
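The three methodology steps can be sketched as follows. The trace representation (a list of requested document names per day), the function names, and normalizing the overlap by n are assumptions made for illustration.

```python
from collections import Counter

def top_n(requests, n):
    """Return the set of the n most requested documents in one day's trace."""
    return {doc for doc, _ in Counter(requests).most_common(n)}

def overlap(day_a, day_b, n):
    """Fraction of the top-n documents that appear in both days' top-n sets."""
    return len(top_n(day_a, n) & top_n(day_b, n)) / n

# Hypothetical traces: each entry is one request for the named document.
day1 = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"]
day2 = ["a"] * 6 + ["c"] * 4 + ["e"] * 3 + ["b"]

share = overlap(day1, day2, 3)  # top-3 sets are {a, b, c} and {a, c, e}
```

Repeating the computation for pairs of days at increasing separation shows how quickly the popular set drifts.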

19 Spatial Locality in Client Accesses
  - Domain membership is significant except when there is a “hot” event of global interest

20 The Applicability of Zipf's Law to Web Requests
  - Web requests follow a Zipf-like distribution
    - Request frequency $\propto 1/i^{\alpha}$, where i is a document's ranking
  - The value of $\alpha$ is much larger in MSNBC traces
    - 1.4 – 1.8 in MSNBC traces
    - Smaller than or close to 1 in the proxy traces
    - Close to 1 in the small departmental server logs [ABC+96]
    - Highest when there is a hot event
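One common way to estimate the exponent α from per-document request counts is a least-squares fit of log(frequency) against log(rank). The talk does not say how α was fitted, so the method, the function name, and the synthetic data below are assumptions.

```python
import math

def estimate_alpha(counts):
    """Estimate the Zipf exponent as the negated log-log regression slope."""
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return -sxy / sxx

# Synthetic counts that follow frequency ∝ 1/i^1.5 exactly (not trace data).
synthetic = [1000.0 / rank ** 1.5 for rank in range(1, 201)]
alpha_hat = estimate_alpha(synthetic)
```

On real traces the fit is noisier, especially in the tail of rarely requested documents, so the estimate depends on which ranks are included.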

21 Impact of Larger $\alpha$
  - Accesses in MSNBC traces are much more concentrated
  - 90% of the accesses are accounted for by
    - Top 2-4% of files in MSNBC traces
    - Top 36% of files in proxy traces (Microsoft proxies and the proxies studied in [BCF+99])
    - Top 10% of files in small departmental server logs reported in [AW96]
  - Popular news sites like MSNBC see much more concentrated accesses → reverse caching and replication can be very effective!
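To see why a larger α concentrates accesses, the sketch below computes what fraction of top-ranked files covers a given share of requests under an idealized Zipf-like law. The function name, the file count, and the α values are assumptions; an ideal model is not expected to reproduce the measured percentages above.

```python
def files_needed(alpha, num_files, share=0.9):
    """Fraction of top-ranked files covering `share` of accesses
    when request frequency is proportional to 1/rank**alpha."""
    weights = [1.0 / rank ** alpha for rank in range(1, num_files + 1)]
    target = share * sum(weights)
    running = 0.0
    for k, weight in enumerate(weights, start=1):
        running += weight
        if running >= target:
            return k / num_files
    return 1.0

# Hypothetical comparison: a server-like exponent versus a proxy-like one.
concentrated = files_needed(1.6, 10000)
spread_out = files_needed(0.8, 10000)
```

The higher exponent needs a much smaller fraction of files to reach the same share, which is the intuition behind the case for reverse caching and replication.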

22 Summary of Results & Implications
  - Fact: Past modification history, when averaged over a sufficiently large window, yields a rough predictor
    Implication: Guide for setting TTL, but need alternative mechanism (e.g. callback-based invalidation) as backup
  - Fact: Modification between successive versions is small
    Implication: Delta encoding can be very useful

23 Summary of Results & Implications (Cont.)
  - Fact: The set of popular documents remains stable over a timescale of days
    Implication: Prefetch/push previously popular files that have undergone modification
  - Fact: File popularity follows a Zipf-like distribution, but with a much larger $\alpha$ than at proxies
    Implication: Potential of reverse caching & replication
  - Fact: Accesses to old documents account for most first-time accesses
    Implication: Hard to anticipate such accesses, and eliminate first-time misses

24 Future Work
  - Study data sets from other large server sites
    - Different types of Web servers may have very different workloads
    - More studies such as ours will be needed
  - Develop efficient cache consistency algorithms

25 Acknowledgement
  - Jason Bender and Ian Marriott
  - Erich Nahum
  - Kiem-Phong Vo
  - Damon Cole, Susan Dumais, Niccole Golden, Chris Haslam, Eric Horvitz, Geoff Voelker
  - Anonymous reviewers

