How to Evaluate the Effectiveness of URL Normalizations Sang Ho Lee, Sung Jin Kim, Hyo Sook Jeong in Proceedings of the Third International Conference.


1 How to Evaluate the Effectiveness of URL Normalizations Sang Ho Lee, Sung Jin Kim, Hyo Sook Jeong in Proceedings of the Third International Conference on Human.Society@Internet, HSI

2 Contents
• Abstract
• Introduction
• URL Normalizations
• Evaluation of a URL Normalization Method
• Empirical Evaluation
• Conclusions and Future Work

3 Abstract
• Syntactically different URLs can represent the same web page
• Duplicate URLs cause applications to process a large number of identical web pages unnecessarily
• URL normalization helps eliminate duplicate URLs
• This paper presents a method for evaluating the effectiveness of a URL normalization method

4 Introduction
• URL (Uniform Resource Locator)
  • A string that represents a web resource (here, a web page)
• Equivalent URLs
  • Two or more URLs are equivalent if they locate the same web page
• Failing to recognize that two equivalent URLs are equivalent gives rise to a large amount of processing overhead

5 Introduction (2)
• False negative
  • Determining equivalent URLs not to be equivalent
• False positive
  • Determining non-equivalent URLs to be equivalent

6 Introduction (3)
• URL normalizations [5]
  • Transform syntactically different but equivalent URLs into a syntactically identical string
• The three types of URL normalizations
  • Syntax-based normalization
  • Scheme-based normalization
  • Protocol-based normalization
• The first two types reduce false negatives while strictly avoiding false positives
• The standards community does not give specific methods for protocol-based normalization [6]

7 Introduction (4)
• Extended normalization methods (1) [6]
  • Changing letters in the path component to lower-case (or to upper-case) letters
    • http://acm.org/PUBS/journals.html -> http://acm.org/pubs/journals.html
  • Attaching or eliminating the “www” prefix in the host component, so that URLs with and without the prefix become identical
    • http://ssu.ac.kr -> http://www.ssu.ac.kr
  • Eliminating the trailing slash from URLs
    • http://www.acm.org/pubs/ -> http://www.acm.org/pubs
  • Eliminating default page names in the path component
    • http://www.acm.org/index.htm -> http://www.acm.org/
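The four extended methods on this slide can be sketched as small string transformations. This is a minimal illustration, not the authors' implementation: the function names, the `DEFAULT_PAGES` list, and the choice to show the “www” step in the attaching direction are all assumptions made for the example, and the host handling assumes a plain host name without user information.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed list of default page names; the slide only shows "index.htm".
DEFAULT_PAGES = ("index.htm", "index.html", "default.htm", "default.html")


def lowercase_path(url):
    """Convert the path component to lower-case letters."""
    p = urlsplit(url)
    return urlunsplit(p._replace(path=p.path.lower()))


def add_www(url):
    """Attach the "www." prefix to the host if it is missing."""
    p = urlsplit(url)
    host = p.netloc
    if not host.lower().startswith("www."):
        host = "www." + host
    return urlunsplit(p._replace(netloc=host))


def strip_trailing_slash(url):
    """Remove the last slash of the path, but keep the root path "/"."""
    p = urlsplit(url)
    path = p.path
    if len(path) > 1 and path.endswith("/"):
        path = path[:-1]
    return urlunsplit(p._replace(path=path))


def strip_default_page(url):
    """Drop a default page name from the end of the path."""
    p = urlsplit(url)
    head, _, last = p.path.rpartition("/")
    if last.lower() in DEFAULT_PAGES:
        return urlunsplit(p._replace(path=head + "/"))
    return url


print(lowercase_path("http://acm.org/PUBS/journals.html"))
print(add_www("http://ssu.ac.kr"))
print(strip_trailing_slash("http://www.acm.org/pubs/"))
print(strip_default_page("http://www.acm.org/index.htm"))
```

Each function can produce a false positive: for example, a server with a case-sensitive file system may serve different pages for `/PUBS/` and `/pubs/`.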

8 Introduction (5)
• Extended normalization methods (2)
  • Allow false positives
    • May lose, gain, or change web pages unintentionally
  • Reduce the total number of URLs in operation
• This paper presents a scheme to evaluate the effectiveness of URL normalization methods
  • URL reduction rate
  • Web page loss/gain/change rate
  • Evaluated on 94 million URLs (20,799 web sites in Korea)
  • Helps select normalization methods

9 URL Normalizations
• URL components
  • scheme: protocol (here, Hypertext Transfer Protocol)
  • authority: user information, host, port
  • path: directories
  • query: parameter names and values
  • fragment: a particular part of a document
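The five components can be inspected with Python's standard `urllib.parse.urlsplit`. The URL below is a made-up example chosen only so that every component is non-empty.

```python
from urllib.parse import urlsplit

# Hypothetical URL that exercises every component.
url = "http://user@example.com:8080/a/b/list.htm?id=7&lang=en#chap1"
parts = urlsplit(url)

print("scheme:   ", parts.scheme)
print("authority:", parts.netloc)
print("  user:   ", parts.username)
print("  host:   ", parts.hostname)
print("  port:   ", parts.port)
print("path:     ", parts.path)
print("query:    ", parts.query)
print("fragment: ", parts.fragment)
```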

10 Standard URL Normalizations
• A process that transforms a URL into a canonical form
• Syntax-based normalization
  • Characters in the scheme and host components are converted to lower-case letters
    • HTTP://EXAMPLE.com -> http://example.com
  • Percent-encoded unreserved characters (i.e., uppercase and lowercase letters, decimal digits, …) should be decoded
    • http://example.com/%7Esmith -> http://example.com/~smith
  • Path segments “.” and “..” are removed appropriately
    • http://example.com/a/b/./../c.htm -> http://example.com/a/c.htm

11 Standard URL Normalizations (2)
• Scheme-based normalization
  • The default port number is removed from the URL
    • http://example.com:80/ -> http://example.com/
  • If the path is empty, it is transformed into “/”
    • http://example.com -> http://example.com/
  • The fragment is removed from the URL
    • http://example.com/list.htm#chap1 -> http://example.com/list.htm
• Protocol-based normalization
  • Appropriate only when equivalence is indicated by both
    • the result of accessing the resources
    • the common conventions of the scheme’s dereference algorithm
  • http://example.com/a/b -> http://example.com/a/b/
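The syntax-based and scheme-based steps from the two slides above can be combined into one function. The following is a minimal sketch under stated assumptions: the names `normalize_standard`, `decode_unreserved`, `remove_dot_segments`, and `DEFAULT_PORTS` are invented for the example, only absolute `http`/`https` URLs are considered, and IPv6 hosts are not handled. Protocol-based normalization is left out because it requires accessing the resource.

```python
import re
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz0123456789-._~")
DEFAULT_PORTS = {"http": 80, "https": 443}  # assumed scope of the sketch


def decode_unreserved(text):
    """Decode %XX escapes only when they encode an unreserved character."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def remove_dot_segments(path):
    """Resolve "." and ".." segments of an absolute path."""
    out = []
    segments = path.split("/")
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            if len(out) > 1:  # never pop the leading empty segment (root)
                out.pop()
            continue
        out.append(seg)
    if segments[-1] in (".", ".."):
        out.append("")  # keep the trailing slash, e.g. "/a/b/.." -> "/a/"
    return "/".join(out)


def normalize_standard(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.hostname or ""  # urlsplit returns the host lower-cased
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo += ":" + parts.password
        netloc = userinfo + "@" + netloc
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc += f":{parts.port}"  # keep only non-default ports
    path = remove_dot_segments(decode_unreserved(parts.path)) or "/"
    query = decode_unreserved(parts.query)
    return urlunsplit((scheme, netloc, path, query, ""))  # drop the fragment


for u in ("HTTP://EXAMPLE.com",
          "http://example.com/%7Esmith",
          "http://example.com/a/b/./../c.htm",
          "http://example.com:80/",
          "http://example.com/list.htm#chap1"):
    print(u, "->", normalize_standard(u))
```

Because the function applies every step, `HTTP://EXAMPLE.com` comes out with a trailing “/”, which is the combined effect of lower-casing and the empty-path rule.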

12 Extended URL Normalizations
• Standard normalization
  • No false positives
  • High possibility of false negatives
• Web applications (such as web crawlers)
  • Handle a huge number of URLs
  • Reducing false negatives reduces the number of URLs that need to be considered
    • http://www.acm.org/
    • http://www.acm.org/index.html
• Extended URL normalization
  • Significantly reduces the possibility of false negatives
  • Allows false positives to a limited degree
• How can the effectiveness of an extended normalization method be evaluated precisely?

13 Evaluation of a URL Normalization Method
• Two points of view
  • How many URLs are eliminated
  • How many pages are lost, gained, or changed
• Suppose
  • A given URL u1 in its original form is transformed into a URL u2 in canonical form
  • u1 and u2 locate web pages p1 and p2 on the web, respectively
• There are ten cases in total to consider

14 Evaluation of a URL Normalization Method (2)
• Lose a web page (cases 2, 4, 9)
• Gain a web page (case 8) or get a different page (case 7)
• False positive (cases 2, 4, 7, 8, 9)

15 Evaluation of a URL Normalization Method (3)
• (1) Page p1 exists on the web
  • (A) Page p2 does not exist (cases 4, 9)
    • False positive; one page (p1) is lost
  • (B) Page p2 exists, and p1 and p2 are the same page (cases 1, 6)
    • No false positive; one page request is saved
  • (C) Page p2 exists, and p1 and p2 are not the same page (cases 2, 7)
    • False positive; a loss (case 2) or a loss and a gain (case 7)
• (2) Page p1 does not exist
  • (A) URL u2 is already known to us (cases 3, 5)
    • No pages are lost; one page request is saved
  • (B) URL u2 is not known to us (cases 8, 10)
    • One web page is gained (case 8) or nothing is lost (case 10)
    • The number of page requests remains unchanged
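The decision structure on this slide can be written as a small function. This is only a sketch of the branches as listed: the function name, the outcome labels, and the idea of comparing pages by their content are assumptions for the example. The slide does not say what separates the two numbered cases inside a branch, so each branch returns one combined label rather than a case number.

```python
from typing import Optional


def classify_outcome(p1: Optional[str], p2: Optional[str],
                     u2_known: bool) -> str:
    """Classify the effect of normalizing u1 into u2.

    p1 and p2 are the page contents located by u1 and u2, or None when
    the page does not exist. u2_known says whether u2 was already in
    the URL set before normalization (only used when p1 is missing).
    """
    if p1 is not None:
        if p2 is None:
            return "loss"                         # branch (1)(A)
        if p1 == p2:
            return "same page, request saved"     # branch (1)(B)
        return "loss, or loss and gain"           # branch (1)(C)
    if u2_known:
        return "nothing lost, request saved"      # branch (2)(A)
    if p2 is not None:
        return "gain"                             # branch (2)(B), page found
    return "nothing lost or gained"               # branch (2)(B), no page


print(classify_outcome("home page", None, False))
print(classify_outcome("home page", "home page", True))
print(classify_outcome("home page", "other page", False))
print(classify_outcome(None, "other page", False))
```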

16 Evaluation of a URL Normalization Method (4)
• To evaluate the effectiveness of a URL normalization, we propose a number of metrics
• Let N be the total number of URLs considered
  • Page loss rate = total number of lost pages / N
  • Page gain rate = total number of gained pages / N
  • Page change rate = total number of changed pages / N
  • Page non-loss rate = total number of non-loss pages / N
• Reduction of URLs
  • URL reduction rate = 1 - (number of unique URLs after normalization / number of unique URLs before normalization)
  • If we normalize 100 distinct URLs into 90 distinct URLs, the URL reduction rate is 0.1 (1 - 90/100, or 10%)
• A good normalization method has
  • A high URL reduction rate
  • Low page loss/gain/change rates
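The metrics above are simple ratios, which the following sketch computes. The function names, the outcome labels, and the sample data are hypothetical; in particular, it assumes each considered URL has already been given exactly one outcome label, which the slide does not specify.

```python
from collections import Counter


def url_reduction_rate(before, after):
    """1 - (unique URLs after normalization / unique URLs before)."""
    unique_before = len(set(before))
    if unique_before == 0:
        raise ValueError("need at least one URL before normalization")
    return 1 - len(set(after)) / unique_before


def page_rates(outcomes):
    """Fraction of the N considered URLs that carry each outcome label."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    counts = Counter(outcomes)
    return {kind: counts[kind] / n
            for kind in ("loss", "gain", "change", "non-loss")}


# The slide's example: 100 distinct URLs normalize into 90 distinct URLs.
before = [f"http://example.com/page{i}" for i in range(100)]
after = before[:90] + before[:10]  # ten URLs collapse onto existing ones
reduction = url_reduction_rate(before, after)
print(round(reduction, 4))

# Made-up outcome labels for N = 10 URLs.
outcomes = ["loss"] * 2 + ["gain"] + ["change"] + ["non-loss"] * 6
rates = page_rates(outcomes)
print(rates)
```

The reduction rate is rounded before printing because `1 - 90/100` is not exactly `0.1` in binary floating point.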

