
A large-scale study of the evolution of Web pages D. Fetterly, M. Manasse, M. Najork and L. Wiener SPE Vol.34 No.2 pages 213-237, Feb. 2004 Apr. 18. 2006.


1 A large-scale study of the evolution of Web pages
D. Fetterly, M. Manasse, M. Najork and L. Wiener
SPE Vol. 34 No. 2, pages 213-237, Feb. 2004
Apr. 18, 2006
So Jeong Han

2 Contents
Results
Conclusions

3 Results(1) - Document Size Analysis
Fig 2. Document size (bytes) versus top-level domain, for pages returned with HTTP status code 200.
x: document size bucket (2^(x-1) to 2^x bytes). The distribution is centered at x = 14 with a standard deviation of 1.
Domain    Description
all       4 ~ 32 KB (66.5%)
.com      52.5% of all
.org      8% of all
.gov      1.1% of all
.edu      2 ~ 16 KB (64.9%)
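The size buckets on the x-axis can be illustrated with a short sketch. The function name `size_bucket` is hypothetical, and the choice of which boundary is inclusive is an assumption; the slide only gives the range 2^(x-1) to 2^x bytes.

```python
def size_bucket(num_bytes: int) -> int:
    """Return x such that 2**(x-1) <= num_bytes < 2**x (0 for an empty document).

    Assumption: the lower boundary is inclusive. The slide does not say
    which side of the range is closed.
    """
    if num_bytes < 0:
        raise ValueError("document size cannot be negative")
    # For a positive integer n, n.bit_length() is the smallest x with n < 2**x.
    return num_bytes.bit_length()


if __name__ == "__main__":
    for size in (4096, 8192, 32767):
        print(size, "bytes -> bucket", size_bucket(size))
```

Under this convention, documents from 4 KB up to just under 32 KB fall into buckets 13 through 15, that is, within one bucket of x = 14.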

4 Results(2) - Document Size Analysis
Fig 3. Document size (words) versus top-level domain.

5 Results(3) - Status Code Analysis
Fig 4. Distribution of HTTP status codes over crawl generations: for each crawl generation, the percentage of page retrievals falling into each category of status code. The y-axis starts at 85%.
Objective: why do downloads fail? A URL's lifetime is limited.
Category    Description
200         Successful (OK)
3xx         Redirection (moved)
4xx         Client errors
other       500 status code (server error)
RobotExcl   Blocked
network     Network-related error (DNS, TCP)
Heisenberg effect
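The categories in the table can be expressed as a small classifier. This is only a sketch: the function `categorize` and its two boolean flags are hypothetical, and mapping every remaining HTTP status to "other" is an assumption (the slide lists only the 500 status code there).

```python
from typing import Optional


def categorize(status: Optional[int],
               robot_excluded: bool = False,
               network_error: bool = False) -> str:
    """Map one retrieval attempt to a category name from the slide's table."""
    if robot_excluded:
        return "RobotExcl"   # blocked, so no request was made
    if network_error or status is None:
        return "network"     # DNS or TCP failure, no HTTP status available
    if status == 200:
        return "200"
    if 300 <= status < 400:
        return "3xx"
    if 400 <= status < 500:
        return "4xx"
    return "other"           # assumption: everything else, including 500


if __name__ == "__main__":
    print(categorize(200), categorize(301), categorize(404), categorize(500))
```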

6 Results(4) - TLD Analysis
Fig 5. Download success per top-level domain.
Pages in .jp, .de, and .edu tend to remain available.
The decline in the curves bears out the limited lifetime of Web pages.

7 Results(5) - TLD Analysis
Fig 6. Last successfully downloaded Web page per TLD, for viewing the lifetime of URLs from different domains.
Pages in .cn expire sooner than average.
Figure annotations: "Downloaded during the final crawl", "Unreachable after crawl 1".

8 Results(6) - Change Analysis
Fig 7. Distribution of change.
Fig 8. Distribution of change, scaled to show the low-percentage categories.
Cumulative percentage distribution.

9 Results(7)
Fig 9. Type of markup change.
The markup of 1,468,671 pages changed.
Observing link evolution of this type lets a crawler recognize session IDs and remove them, which avoids recrawling the same content.
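Session-ID removal can be sketched as URL normalization. The slide does not say how the crawler recognizes session IDs, so the parameter names in `SESSION_PARAMS` and the function `strip_session_id` are purely illustrative assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters treated as session IDs.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_id(url: str) -> str:
    """Return the URL with assumed session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    a = strip_session_id("http://example.com/page?x=1&sid=abc")
    b = strip_session_id("http://example.com/page?x=1&sid=xyz")
    print(a, a == b)
```

Two links that differ only in the session ID normalize to the same URL, so the page would be fetched once.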

10 Results(8)
Fig 10. Type of markup change, normalized by hosts.
Were the results in Fig 9 influenced by URLs or by hosts? Here each type of change is counted per host.
The percentage of changes attributed to URL queries shrinks compared with Fig 9 (48% -> 4%).
Such changes tend to appear many times on the same page: when session IDs are embedded in link URLs, they tend to appear in all relative links on the page.

11 Results(9)
Fig 11. Breakdown of tags that changed, for changes that were additions or deletions of tags.
Tag    Description
!      Comment (23.1%)
A      90% of A are 'HREF' (48.1%)
IMG    52% of IMG are 'SRC'

12 Results(10) - TLD, Change Analysis
Fig 12. Clustered rates of change by TLD. Each bar is divided into six regions (change clusters), from the no-change cluster to the complete-change cluster.
Domain    Description
.com      Changes more frequently than .gov and .edu
.de       Higher rate and degree of change than others
The .de domain problem:
Pages are automatically generated (keyword 'stuffing').
Distinct host names are used as a front to a single server.
The aim is to trick link-based ranking algorithms and draw visitors to an adult Web site.

13 Results(11) - TLD, Change Analysis
Solution:
Resolved the symbolic host names of all the URLs in our data set.
Singled out each IP address with more than 1,000 symbolic host names mapping to it.
Eliminated about 60%.
Fig 13. Clustered rates of change by TLD after excluding automatically generated keyword-spam documents.
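The filtering step can be sketched as follows, assuming the host-to-IP mapping has already been resolved and is available as a dictionary. The function name `crowded_ips` and the input shape are hypothetical; only the threshold of 1,000 host names comes from the slide.

```python
from collections import defaultdict
from typing import Dict, Set


def crowded_ips(host_to_ip: Dict[str, str], threshold: int = 1000) -> Set[str]:
    """Return IP addresses with more than `threshold` distinct host names."""
    hosts_by_ip = defaultdict(set)
    for host, ip in host_to_ip.items():
        hosts_by_ip[ip].add(host)
    return {ip for ip, hosts in hosts_by_ip.items() if len(hosts) > threshold}


if __name__ == "__main__":
    # Toy data with a low threshold, purely for illustration.
    mapping = {
        "a.example.de": "10.0.0.1",
        "b.example.de": "10.0.0.1",
        "c.example.de": "10.0.0.1",
        "www.example.org": "10.0.0.2",
    }
    print(crowded_ips(mapping, threshold=2))
```

Documents hosted on the returned addresses would then be excluded before recomputing the per-TLD change clusters.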

14 Results(12) - TLD, Change Analysis
Fig 14. Clustered rates of change by TLD, omitting the no-change cluster, after excluding automatically generated keyword-spam documents.
Conclusion:
Adult content continues to skew our results.
The shingling technique might not be well adapted to writing systems such as Chinese or Kanji, which do not employ inter-word spacing.
The extent of change is quite consistent with other TLDs.

15 Results(13) - Change Analysis
Fig 15. Distribution of change, scaled to show the low-percentage categories, after excluding automatically generated keyword-spam documents.
Compared with Fig 8 and Fig 12: bucket 0 is cut in half, and the right side is monotonic.
Next, consider whether the length of pages impacts their rate of change.

16 Results(14)
Fig 16. Clustered rates of change by document size (bytes).
Document size is strongly related to rate of change.
Small documents are less likely to change.
Large documents (32 KB and above) change much more frequently than smaller ones (4 KB and below).

17 Results(15)
Fig 17. Clustered rates of change by the number of words per document.
The sensitivity of our shingling technique depends on the number of words in a document.
The 'all-or-nothing' similarity metric gives a relatively coarse measure.
Large documents are more likely to change than smaller ones.
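The dependence on document length can be illustrated with a simplified shingling sketch. This is not the study's implementation: the shingle width `w`, the vector length `k`, and the use of seeded SHA-1 hashes as a min-hash are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import hashlib
from typing import List, Optional, Set, Tuple


def shingles(text: str, w: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of w-word shingles; a shorter text yields one shingle."""
    words = text.split()
    if not words:
        return set()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def _hash(seed: int, shingle: Tuple[str, ...]) -> int:
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def feature_vector(text: str, k: int = 8, w: int = 5) -> List[Optional[int]]:
    """Min-hash style vector: the smallest hash of any shingle, per seed."""
    sh = shingles(text, w)
    if not sh:
        return [None] * k
    return [min(_hash(seed, s) for s in sh) for seed in range(k)]


def unchanged_features(a: List[Optional[int]], b: List[Optional[int]]) -> int:
    """Count positions where two feature vectors agree exactly."""
    return sum(1 for x, y in zip(a, b) if x == y)


if __name__ == "__main__":
    old = feature_vector("breaking news today")
    new = feature_vector("breaking news tonight")
    print(unchanged_features(old, new))
```

Each feature either matches or does not, so a document with only a few shingles can lose every feature after a one-word edit, while a long document typically keeps most of them.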

18 Results(16)
Fig 18. Clustered rates of change by the number of words per document, omitting the no-change cluster.

19 Results(17)
Fig 19. Clustered rates of change by top-level domain and number of words per document.
.com & .net: stronger effect for larger documents (than .gov and .edu).
-- Commercial Web sites: appearance of freshness
-- Educational & governmental Web sites: archival purpose

20 Results(18)
Fig 20. Distribution of the standard deviations of the rate of change in a given document over its lifetime.
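The per-document statistic can be sketched as below, assuming each document has a list of week-to-week change values. The function name `change_stddev`, the input representation, and the use of the population standard deviation are assumptions; the slide does not specify them.

```python
from statistics import pstdev
from typing import Sequence


def change_stddev(weekly_changes: Sequence[float]) -> float:
    """Population standard deviation of one document's weekly change values."""
    if not weekly_changes:
        raise ValueError("need at least one week-to-week observation")
    return pstdev(weekly_changes)


if __name__ == "__main__":
    steady = [3, 3, 3, 3]      # changes by the same amount every week
    erratic = [0, 9, 0, 9]     # alternates between no change and large change
    print(change_stddev(steady), change_stddev(erratic))
```

A value near zero means the document changes by about the same amount every week, whatever that amount is.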

21 Results(19)
Plate 1. Logarithmic histogram of intra-document changes over three successive weeks, showing the absolute number of changes.
Axes: the number of pre-images in a document unchanged from week n-1 to n, and the number of pre-images in a document unchanged from week n to n+1.
(85,85): Web pages don't change much over a 3-week interval; this value is 10,000 times higher than any other feature.

22 Results(20)
Plate 2. Logarithmic histogram of intra-document changes over three successive weeks, normalized to show the conditional probabilities of changes.
This indicates once again that past change is a strong predictor of future change.
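The normalization from absolute counts to conditional probabilities can be sketched as a row-wise division. The layout is an assumption: `counts[i][j]` is taken to be the number of documents with i pre-images unchanged from week n-1 to n and j unchanged from week n to n+1, and the function name `conditional` is hypothetical.

```python
from typing import List, Sequence


def conditional(counts: Sequence[Sequence[int]]) -> List[List[float]]:
    """Divide each row by its total so that row i holds P(j | i)."""
    result = []
    for row in counts:
        total = sum(row)
        # Rows with no observations stay at zero instead of dividing by zero.
        result.append([c / total if total else 0.0 for c in row])
    return result


if __name__ == "__main__":
    toy_counts = [
        [1, 3],   # documents that changed in the first interval
        [0, 0],   # no documents observed for this row
    ]
    print(conditional(toy_counts))
```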

23 Conclusion(1)
Purpose: measuring the rate and degree of Web page change.
Method:
Crawled 151 million pages once a week for 11 weeks.
Saved salient information about each downloaded document, including a feature vector of the text without markup.
Saved the full text of 0.1% of all downloaded pages.

24 Conclusion(2)
Conclusion (we found...):
Web pages change mostly in their markup or in trivial ways.
Relation with TLD: strong for the frequency of change of a document, weaker for the degree of change.
Document size relates to both the frequency and the degree of change: large documents (.com & .net) change more often and more extensively.
Past change predicts future change -> implications for Web crawlers.
'German anomaly': a fast-changing page is not necessarily a worthwhile one.
Fin.

