How to Evaluate the Effectiveness of URL Normalizations Sang Ho Lee, Sung Jin Kim, Hyo Sook Jeong in Proceedings of the Third International Conference.

How to Evaluate the Effectiveness of URL Normalizations Sang Ho Lee, Sung Jin Kim, Hyo Sook Jeong in Proceedings of the Third International Conference on HIS

Contents
- Abstract
- Introduction
- URL Normalizations
- Evaluation of a URL Normalization Method
- Empirical Evaluation
- Conclusions and Future Work

Abstract
- Syntactically different URLs can represent the same web page
- Duplicate representations force applications to process the same web pages many times unnecessarily
- URL normalization helps eliminate duplicate URLs
- This paper presents a method for evaluating the effectiveness of a URL normalization method

Introduction
- URL (Uniform Resource Locator)
  - A string that represents a web resource (here, a web page)
- Equivalent URLs
  - Two or more URLs that locate the same web page
- Failing to recognize that two equivalent URLs are equivalent causes a large amount of processing overhead

Introduction (2)
- False negative
  - Determining equivalent URLs not to be equivalent
- False positive
  - Determining non-equivalent URLs to be equivalent

Introduction (3)
- URL normalizations [5]
  - Transform syntactically different but equivalent URLs into a syntactically identical string
- The three types of URL normalization
  - Syntax-based normalization
  - Scheme-based normalization
  - Protocol-based normalization
- The first two types reduce false negatives while strictly avoiding false positives
- The standards community does not give specific methods for protocol-based normalization [6]

Introduction (4)
- Extended normalization methods (1) [6]
  - Changing letters in the path component to lower case or to upper case
  - Adding the “www” prefix to host names that lack it, or removing it from host names that have it
  - Removing the trailing slash from URLs
  - Removing default page names from the path component
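The four extended methods listed above can be sketched as small, independent functions. This is only an illustration, not the authors' implementation: the `example.com` URLs, the function names, and the `DEFAULT_PAGES` list are assumptions made for the example, and real servers configure their own default page names.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default page names (hypothetical; servers configure their own).
DEFAULT_PAGES = ("index.html", "index.htm", "default.htm")

def lowercase_path(url):
    """Change letters in the path component to lower case."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=parts.path.lower()))

def strip_www(url):
    """Remove a leading "www." from the host (user information is not handled)."""
    parts = urlsplit(url)
    netloc = parts.netloc
    if netloc.lower().startswith("www."):
        netloc = netloc[4:]
    return urlunsplit(parts._replace(netloc=netloc))

def strip_trailing_slash(url):
    """Remove the last slash of the path; the root path "/" is kept (an assumption)."""
    parts = urlsplit(url)
    path = parts.path
    if len(path) > 1 and path.endswith("/"):
        path = path[:-1]
    return urlunsplit(parts._replace(path=path))

def strip_default_page(url):
    """Remove a default page name from the end of the path."""
    parts = urlsplit(url)
    head, _, last = parts.path.rpartition("/")
    path = head + "/" if last.lower() in DEFAULT_PAGES else parts.path
    return urlunsplit(parts._replace(path=path))

print(lowercase_path("http://www.example.com/Dir/Page.HTML"))
print(strip_www("http://www.example.com/a"))
print(strip_trailing_slash("http://example.com/dir/"))
print(strip_default_page("http://example.com/dir/index.html"))
```

Each function can map two non-equivalent URLs to the same string, which is exactly the false-positive risk that the evaluation scheme is meant to measure.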

Introduction (5)
- Extended normalization methods (2)
  - Allow false positives
  - May lose, gain, or change web pages unintentionally
  - Reduce the total number of URLs in operation
- This paper presents a scheme to evaluate the effectiveness of URL normalization methods
  - URL reduction rate
  - Web page loss/gain/change rate
- 94 million URLs (20,799 web sites in Korea)
- Helps select normalization methods

URL Normalizations
- URL components
  - scheme: protocol (here, Hypertext Transfer Protocol)
  - authority: user information, host, port
  - path: directories
  - query: parameter names, values
  - fragment: particular part of a document
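As a quick illustration of these five components, the sketch below splits a URL with Python's standard `urllib.parse.urlsplit`. The URL itself is hypothetical and was chosen only so that every component is present.

```python
from urllib.parse import urlsplit

# Hypothetical URL containing all five components.
url = "http://user@www.example.com:8080/dir/page.html?name=value#section"
parts = urlsplit(url)

print(parts.scheme)    # scheme: the protocol
print(parts.netloc)    # authority: user information, host, port
print(parts.path)      # path: directories and page name
print(parts.query)     # query: parameter names and values
print(parts.fragment)  # fragment: a particular part of the document
```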

Standard URL Normalizations
- A process that transforms a URL into a canonical form
- Syntax-based normalization
  - Characters in the scheme and host components are converted to lower case
  - All percent-encoded unreserved characters (i.e., uppercase and lowercase letters, decimal digits, …) should be decoded
  - The path segments “.” and “..” are removed appropriately
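The three syntax-based steps can be sketched as follows. This is a simplified illustration under stated assumptions, not a complete RFC 3986 implementation: the helper names and the `example.com` URL are made up for the example, and `remove_dot_segments` assumes an absolute path (one that starts with "/").

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters: letters, digits, and "-", ".", "_", "~".
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

def decode_unreserved(text):
    """Decode %XX escapes only when they encode an unreserved character."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, text)

def remove_dot_segments(path):
    """Resolve "." and ".." segments of an absolute path."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == "..":
            if len(output) > 1:  # never pop the leading "" that marks the root
                output.pop()
        elif segment != ".":
            output.append(segment)
    if segments[-1] in (".", ".."):
        output.append("")  # the result names a directory, so keep its slash
    return "/".join(output)

def syntax_normalize(url):
    parts = urlsplit(url)
    # Lower-case the host but leave any user information untouched.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    path = remove_dot_segments(decode_unreserved(parts.path))
    return urlunsplit(
        (parts.scheme.lower(), netloc, path, parts.query, parts.fragment)
    )

print(syntax_normalize("HTTP://WWW.Example.COM/a/b/../%7Euser/./index.html"))
```

Escapes of reserved characters such as `%2F` are deliberately left alone, since decoding them could change which resource the URL identifies.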

Standard URL Normalizations (2)
- Scheme-based normalization
  - The default port number is removed from the URL
  - If the path string is null, it is transformed into “/”
  - The fragment is removed from the URL
- Protocol-based normalization
  - Uses the result of accessing the resources
  - Relies on the common conventions of the scheme’s dereference algorithm
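A minimal sketch of the three scheme-based rules, assuming HTTP and HTTPS with their well-known default ports. The function name, the `DEFAULT_PORTS` table, and the `example.com` URLs are illustrative assumptions. Protocol-based normalization is not shown because it depends on actually accessing the resource.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports for the two schemes considered in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}

def scheme_normalize(url):
    parts = urlsplit(url)
    netloc = parts.netloc
    default_port = DEFAULT_PORTS.get(parts.scheme)
    # Rule 1: drop the port when it is the default for the scheme.
    if default_port is not None and netloc.endswith(":" + str(default_port)):
        netloc = netloc[: -len(str(default_port)) - 1]
    # Rule 2: an empty path becomes "/".
    path = parts.path or "/"
    # Rule 3: the fragment is dropped.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))

print(scheme_normalize("http://www.example.com:80"))
print(scheme_normalize("http://www.example.com/a.html#top"))
print(scheme_normalize("http://www.example.com:8080/"))
```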

Extended URL Normalizations
- Standard normalization
  - No false positives
  - High possibility of false negatives
- In web applications (such as web crawlers)
  - A huge number of URLs must be handled
  - Reducing the possibility of false negatives reduces the number of URLs that need to be considered
- Extended URL normalization
  - Significantly reduces the possibility of false negatives
  - Allows false positives to a limited degree
- How can the effectiveness of an extended normalization method be evaluated precisely?

Evaluation of a URL Normalization Method
- Two different points of view
  - How many URLs are eliminated
  - How many pages are lost, gained, or changed
- Suppose
  - A given URL u1 in its original form is transformed into a URL u2 in canonical form
  - u1 and u2 locate web pages p1 and p2 on the web, respectively
- There are ten cases to consider in total

Evaluation of a URL Normalization Method (2)
- Lose a web page (2, 4, 9)
- Gain a web page (8) or get a different page (7)
- False positive (2, 4, 7, 8, 9)

Evaluation of a URL Normalization Method (3)
- (1) Page p1 exists on the web
  - (A) Page p2 does not exist (4, 9)
    - False positive; one page, p1, is lost
  - (B) Page p2 exists, and p1 and p2 are the same page (1, 6)
    - No false positive; one page request is saved
  - (C) Page p2 exists, and p1 and p2 are not the same page (7, 2)
    - False positive; loss (2) or loss and gain (7)
- (2) Page p1 does not exist
  - (A) URL u2 is already known to us (3, 5)
    - No pages are lost; one page request is saved
  - (B) URL u2 is not known to us (8, 10)
    - Gain one web page (8) or lose nothing (10)
    - The number of page requests remains unchanged
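The case analysis above can be read as a small decision procedure. The sketch below is one hedged interpretation, not the authors' code: the function name and the returned labels are made up, and the way it separates case 2 (loss only) from case 7 (loss and gain) by whether u2 was already known is an assumption, since the slide does not state that distinction.

```python
def classify(p1_exists, p2_exists, same_page, u2_known):
    """Return the effect of replacing u1 (page p1) with u2 (page p2)."""
    if p1_exists:
        if not p2_exists:
            return "loss"    # (1A): p1 can no longer be reached
        if same_page:
            return "none"    # (1B): same page, one page request saved
        # (1C): p1 and p2 differ, so p1 is lost.
        # ASSUMPTION: if u2 was already known, p2 would have been fetched
        # anyway (loss only); otherwise p2 is obtained in place of p1.
        return "loss" if u2_known else "change"
    if u2_known:
        return "none"        # (2A): nothing lost, one page request saved
    # (2B): u2 is new; a page is gained only if p2 actually exists.
    return "gain" if p2_exists else "none"

print(classify(True, False, False, False))   # p1 exists, p2 does not
print(classify(False, True, False, False))   # p1 missing, new u2 with a page
```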

Evaluation of a URL Normalization Method (4)
- To evaluate the effectiveness of a URL normalization, we propose a number of metrics
- Let N be the total number of URLs that are considered
  - Page loss rate = the total number of lost pages / N
  - Page gain rate = the total number of gained pages / N
  - Page change rate = the total number of changed pages / N
  - Page non-loss rate = the total number of non-loss pages / N
- URL reduction
  - URL reduction rate = 1 - (the number of unique URLs after normalization / the number of unique URLs before normalization)
  - If we normalize 100 distinct URLs into 90 distinct URLs, the URL reduction rate is 0.1 (1 - 90/100, or 10%)
- A good normalization method has
  - A high URL reduction rate
  - Low page loss, gain, and change rates
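The metrics above translate directly into code. In the sketch below the function names are assumptions, the URL lists are synthetic, and the page counts passed to `page_rates` are made-up numbers used only to show the arithmetic. The first call reproduces the 100-to-90 example from the slide.

```python
def url_reduction_rate(urls_before, urls_after):
    """1 - (unique URLs after normalization / unique URLs before)."""
    return 1 - len(set(urls_after)) / len(set(urls_before))

def page_rates(n, lost, gained, changed, non_loss):
    """Each rate is the corresponding page count divided by N."""
    return {
        "loss": lost / n,
        "gain": gained / n,
        "change": changed / n,
        "non_loss": non_loss / n,
    }

# 100 distinct synthetic URLs that normalize into 90 distinct URLs.
before = [f"http://example.com/page{i}" for i in range(100)]
after = before[:90] + before[:10]  # the last 10 collapse onto existing URLs

print(round(url_reduction_rate(before, after), 3))

# Made-up counts, for illustration only.
print(page_rates(n=100, lost=2, gained=1, changed=1, non_loss=96))
```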