Group 3: Olena Hunsicker and Divya Josyula

1 Group 3: Olena Hunsicker and Divya Josyula
“Rate of Change and other Metrics: a Live Study of the World Wide Web”
Fred Douglis, Anja Feldmann, Balachander Krishnamurthy, Jeffrey Mogul
Group 3: Olena Hunsicker and Divya Josyula
CS 791/891 "Web Syndication Formats" ODU Spring 2008

2 Presentation Overview
- Motivation behind the research
- Internet in 1997
- What is a Web cache?
- Traces
- Statistics
- Analyzing the results
  - Access rate
  - Modification times
  - Ages
  - Modification intervals
  - Duplication
  - Semantic differences
- Conclusion

3 Motivation Behind the Research
Assumptions:
1. A significant share of Web resources is accessed more than once (locality of reference)
2. “Those resources don’t change between accesses” [1] (stability of value)
Goals:
- Validate these assumptions
- Measure the benefits of using a shared proxy server
- Calculate the rate and nature of changes of Web resources
- Determine how these metrics depend on:
  - Access rate
  - Resource size
  - Content type
  - Age at the time of reference
  - Internet top-level domain (TLD)
- Measure the frequency of duplicates on the Web

4 Historic Overview
Table 1. Changes on the Web from 1997 to 2007
1997 | 2007
19.5 million hosts [4] | 200 million hosts
1 million websites | >92 million websites [5]
Dial-up | DSL/cable Internet

5 What is a Web cache?
Advantages: reduces latency and network traffic
Proxy servers don’t cache documents that require authorization or include a no-cache header
Delta-encoding reduces cache misses
Diagram: client browser -> proxy server -> origin server
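
The caching rule on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the behavior of any particular proxy: the function name `is_cacheable` and the dict-based header representation are hypothetical, and real proxies apply many more rules than the two named on the slide.

```python
def is_cacheable(request_headers, response_headers):
    """Return True if a shared proxy may store the response (simplified)."""
    # Header names are case-insensitive, so normalize them first.
    req = {k.lower(): v for k, v in request_headers.items()}
    resp = {k.lower(): v for k, v in response_headers.items()}

    # Rule 1: documents that require authorization are not cached.
    if "authorization" in req:
        return False

    # Rule 2: responses that carry a no-cache directive are not cached.
    directives = [d.strip().lower()
                  for d in resp.get("cache-control", "").split(",")]
    if "no-cache" in directives:
        return False
    if resp.get("pragma", "").strip().lower() == "no-cache":
        return False

    return True
```

For instance, `is_cacheable({}, {"Cache-Control": "no-cache"})` returns `False`, while a plain HTML response with no such headers is treated as cacheable.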

6 Traces
Static: Web “crawling”; doesn’t provide dynamic access information
Dynamic: analyzing the proxy or Web server log; can reflect access times and modification dates

7 Traces (cont)
Amount of data: 19 GBytes
Time limits: 17 days
Where: gateway between AT&T Labs-Research and the Internet
Type of data: full contents of all HTTP requests and responses

8 Traces (cont)
Used only 200 “OK” and 304 “Not Modified” HTTP responses

9 Traces (cont)
79% of status-200 responses included a Last-Modified header

> telnet 80 | tee a1.out
Trying
Connected to xenon.cs.odu.edu.
Escape character is '^]'.
GET /~ohunsick/index.html HTTP/1.1
Host:

HTTP/ OK
Date: Sun, 27 Jan :12:38 GMT
Server: Apache/2.2.0
Last-Modified: Sat, 10 Nov :22:47 GMT
ETag: "5caedb-d56-d553dbc0"
Accept-Ranges: bytes
Content-Length: 3414
Content-Type: text/html

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso " />
<meta name="keywords" content="Olena Hunsicker" />

10 Traces (cont)
If a status-200 response didn’t include a Last-Modified header and the content changed, assume that the resource was dynamically generated and use the Date header

> telnet 80 | tee a2.out
Trying
Connected to xenon.cs.odu.edu.
Escape character is '^]'.
GET /~ohunsick/index.html HTTP/1.1
Host:
If-Modified-Since: Sat, 10 Nov :22:47 GMT

HTTP/ Not Modified
Date: Sun, 27 Jan :16:28 GMT
Server: Apache/2.2.0
ETag: "5caedb-d56-d553dbc0"
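
The header fallback described on this slide can be sketched as follows. This is only an illustration: the function name `modification_time` and the dict of headers are assumptions, and the check that the content actually changed is left out for brevity.

```python
from email.utils import parsedate_to_datetime


def modification_time(headers):
    """Pick the timestamp to treat as the modification time of a 200 response."""
    # Prefer Last-Modified; fall back to Date, which the slide suggests
    # for resources that look dynamically generated.
    value = headers.get("Last-Modified") or headers.get("Date")
    return parsedate_to_datetime(value) if value else None
```

The dates passed in are ordinary HTTP date strings such as `"Mon, 01 Jan 2001 00:00:00 GMT"` (a made-up value, not one from the trace).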

11 Statistics
Table 2. Content type distribution (% by count)
Content-Type | Accesses | Resources
Images (jpeg & gif) | 69% | 64%
Text/html | 20% | 24%
Application/octet-stream + others | 11% | 12%

12 Results: Access Rate
22% of the distinct resources in the AT&T trace were accessed more than once and returned multiple 200 “OK” or 304 “Not Modified” responses
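
A figure like the 22% above could be computed from a trace roughly as follows. This is a sketch under assumptions: the trace is modeled as hypothetical `(url, status)` pairs, and `repeat_access_fraction` is an illustrative name, not something from the study.

```python
from collections import Counter


def repeat_access_fraction(records):
    """Fraction of distinct resources that were accessed more than once.

    records: iterable of (url, status) pairs taken from a trace.
    """
    # Keep only the responses the study used: 200 "OK" and 304 "Not Modified".
    counts = Counter(url for url, status in records if status in (200, 304))
    if not counts:
        return 0.0
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated / len(counts)
```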

13 Results: Change Ratio
Change ratio for a resource = (# new instances of the resource) / (total # references)
Resources accessed more than once: 13% modified
Resources accessed 2 or more times: 16.5% modified
Overall: 15.4% of all resources were modified between accesses
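
The change ratio defined above can be illustrated with a short Python sketch. The assumptions here are not from the slides: each reference is represented by an identifier of the instance that was returned (for example a checksum or a Last-Modified value), and the first reference is not counted as a new instance.

```python
def change_ratio(instance_ids):
    """Change ratio = number of new instances / total number of references.

    instance_ids: identifiers of the instance returned for each reference,
    in time order.
    """
    if not instance_ids:
        return 0.0
    # A reference sees a new instance when its identifier differs from
    # the identifier seen on the previous reference.
    changes = sum(1 for prev, cur in zip(instance_ids, instance_ids[1:])
                  if cur != prev)
    return changes / len(instance_ids)
```

A resource referenced four times that changed once, such as `["v1", "v1", "v2", "v2"]`, gets a change ratio of 1/4 under these assumptions.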

14 Results: Change Ratio (cont)
Fig. 2. Cumulative distribution of change ratio for the AT&T trace [1]
Panels: grouped by content type; HTML only, grouped by # of references

15 Results: Age
Age = Request time - Last-Modified time
Fig. 3. Grouping data by number of references and resource size
Thus, frequency of access and resource size do not affect the age
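
The age formula can be written out in Python as a small sketch. The function name `resource_age` is an assumption, and the request time is assumed to be a timezone-aware `datetime`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def resource_age(request_time, last_modified_header):
    """Age = request time - Last-Modified time, returned as a timedelta."""
    # Parse the HTTP date string from the Last-Modified header.
    last_modified = parsedate_to_datetime(last_modified_header)
    return request_time - last_modified
```

For example, a resource last modified on `"Mon, 01 Jan 2001 00:00:00 GMT"` (a made-up value) and requested at `datetime(2001, 1, 3, tzinfo=timezone.utc)` has an age of two days.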

16 Results: Age (cont)
Fig. 4. Grouping data by top-level domain (TLD) (edu, com, gov) and by content type

17 Results: Age (cont)
Fig. 5. Grouping data by number of references
Panels: all content types; HTML only
Conclusion: frequently accessed resources are younger

18 Results: Modification Interval
Definition: elapsed time between modifications of a resource
Benefit: helps a cache maintain data consistency
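
As a rough illustration of the definition, modification intervals can be derived from the Last-Modified timestamps observed for one resource. This is a sketch under assumptions: `modification_intervals` is a hypothetical helper, and modifications that happened between two observed timestamps without being seen in the trace are necessarily missed.

```python
from datetime import datetime, timedelta


def modification_intervals(last_modified_times):
    """Elapsed time between successive observed modifications of a resource."""
    # Repeated timestamps mean the resource did not change between
    # references, so only distinct values are kept.
    distinct = sorted(set(last_modified_times))
    return [later - earlier for earlier, later in zip(distinct, distinct[1:])]
```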

19 Results: Modification interval (cont)
Statistics and results:
- Measurement by varying the number of accesses: the interval shrinks as the frequency of access increases
- Measurement by varying content type: HTML resources change more often than static content types
Content type | Interval
HTML | 15 minutes
Application/octet-stream | 1 hour
Images (gif/jpeg) | 1 day

20 Results: Duplication
A resource can have many replicas available under different URLs on the same or different machines
Benefits of identifying replicas:
- Reduce the storage size of the cache
- Reduce the number of accesses to the resource
The extent of duplication is an important aspect for the HTTP Distribution and Replication Protocol

21 Results: Duplication (cont)
Fig. 6. Number of hosts versus number of replicas

22 Results: Duplication (cont)
Observations:
18% of the full-body responses accessing an instance of a particular resource were identical to at least one instance of a different resource
Possible causes:
- Multiple URLs point to the same resource, for example: if you go to , you will end up at
- The same image is embedded in two different HTML resources
- Different resources with the same links in their content
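
One way to find such replicas is to group response bodies by a content digest. The sketch below is an illustration only: the use of SHA-256, the function name `find_replicas`, and the example URLs are assumptions, not details taken from the study.

```python
import hashlib
from collections import defaultdict


def find_replicas(bodies_by_url):
    """Group URLs whose response bodies are byte-for-byte identical.

    bodies_by_url: dict mapping a URL to its response body (bytes).
    """
    groups = defaultdict(list)
    for url, body in bodies_by_url.items():
        # Identical bodies produce identical digests.
        groups[hashlib.sha256(body).hexdigest()].append(url)
    # Only digests shared by more than one URL indicate duplication.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

A cache that keys stored bodies by digest in this way needs to keep only one copy per group.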

23 Results: Semantic Differences
Semantically interesting items should:
- have a recognizable pattern (phone numbers, <href ...>, <img ...>, addresses)
- occur reasonably often
The string “ ” is not necessarily, but likely, a phone number

24 Results: Semantic Differences (cont)
Churn = (# of forms that changed) / (total # of forms)
For example, an instance of a resource has 8 phone numbers, and the next instance changes 4 of them:
Churn = 4/8 * 100% = 50%
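
The churn calculation can be sketched in Python for one kind of form. Several things here are assumptions rather than the study's method: the regular expression `PHONE_7` is a crude, hypothetical pattern for 7-digit phone numbers, and a form counts as changed when it appears in the old instance but not in the new one.

```python
import re

# Hypothetical pattern for 7-digit phone numbers such as 555-1234.
PHONE_7 = re.compile(r"\b\d{3}-\d{4}\b")


def churn(old_text, new_text, pattern=PHONE_7):
    """Percentage of forms in the old instance that are gone from the new one."""
    old_forms = set(pattern.findall(old_text))
    if not old_forms:
        return 0.0
    new_forms = set(pattern.findall(new_text))
    # Forms present before but missing now are counted as changed.
    changed = old_forms - new_forms
    return 100.0 * len(changed) / len(old_forms)
```

With 8 phone numbers in the old instance and 4 of them replaced in the new one, this reproduces the 50% from the example above.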

25 Results: Semantic Differences (cont)
Table 3. Percentage of instances having a given value of churn [1]
churn HREF IMG 10-digit phone 7-digit phone
100% 3.3 4.7 1.4 0.9 3.2
>=75% 5.6 6.2 1.5 1.0 4.9
>=50% 9.7 12.6 2.1 6.3
>=25% 17.8 24.6 2.6 1.6 7.1
0% 41.2 48.6 96.5 98.0 90.2
Example: in 4.9% of instances, at least 75% of the recognizable 7-digit phone numbers changed between instances

26 Conclusion
- Many resources change frequently
- Frequency of access, resource age, and frequency of modification:
  - depend on content type and TLD
  - do not depend on resource size
- The assumptions about locality of reference and stability of value for Web caching are valid only for a subset of the resources on the Web

27 Questions
1. Earlier studies on servers at Boston University and Harvard University found that the most popular resources change less frequently than others. Why were their results different?
2. When can multiple URLs refer to the same resource located on the same server?
3. The researchers used this formula to calculate the age of a resource: Age = Response time - Last-Modified timestamp. How is it different from the Age header in an HTTP response?

28 References
[1] Fred Douglis, Anja Feldmann, Balachander Krishnamurthy, Jeffrey Mogul (1997). “Rate of Change and other Metrics: a Live Study of the World Wide Web”.
[2] Craig E. Wills, Mikhail Mikhailov (1999). “Towards a Better Understanding of Web Resources and Server Responses for Improved Caching”.
[3] Paul James (2006). “HTTP caching”.
[4] “History of the Internet” (2007).
[5] “Brief Timeline of the Internet”.

