Do Not Crawl in the DUST: Different URLs with Similar Text
Ziv Bar-Yossef and Idit Keidar
Presented by Renu Garg, M.Tech (CN)

Outline
◦ Introduction
◦ Solution & the DustBuster algorithm
◦ Experimental results

Introduction

What is DUST? The web is abundant with DUST: Different URLs with Similar Text. Examples:
◦ Standard canonization
◦ Domain names and virtual hosts
◦ Aliases and symbolic links
◦ Parameters with little effect on content (e.g., Print=1)
◦ URL transformations

Why is DUST Bad?
◦ Expensive to crawl: the same document is accessed via multiple URLs.
◦ Forces us to shingle: an expensive technique used to discover similar documents.
◦ Ranking algorithms suffer: references to a document are split among its aliases.
◦ Multiple identical results: the same document is returned several times in the search results.
◦ Any algorithm based on URLs suffers.

Solution to DUST

How Do We Fight DUST Today? (1) Standard Conversions
◦ Domain name aliases
◦ Default file names: index.html, default.htm
◦ File path canonizations: "dirname/../" → "", "//" → "/"
◦ Escape sequences: "%7E" → "~"
◦ Site-specific DUST, which is harder to find:
  – "story_" → "story?id="
  – "news.google.com" → "google.com/news"
  – "labs" → "laboratories"
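A minimal sketch in Python of the standard, site-independent conversions above, using only the standard library; the example URL is hypothetical, not from the paper.

from urllib.parse import urlsplit, urlunsplit, unquote
import posixpath

DEFAULT_FILES = {"index.html", "index.htm", "default.htm"}

def canonize(url: str) -> str:
    # Apply a few standard, site-independent DUST conversions.
    scheme, netloc, path, query, fragment = urlsplit(url)
    netloc = netloc.lower()          # domain names are case-insensitive
    path = unquote(path)             # escape sequences: "%7E" -> "~"
    path = posixpath.normpath(path)  # "dirname/../" -> "", "//" -> "/"
    if path == ".":
        path = "/"
    head, _, tail = path.rpartition("/")
    if tail in DEFAULT_FILES:        # default file names: ".../index.html" -> ".../"
        path = head + "/"
    return urlunsplit((scheme, netloc, path, query, fragment))

print(canonize("http://Example.com/a/b/../%7Euser//index.html"))
# -> http://example.com/a/~user/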

How Do We Fight DUST Today? (2) Shingles
◦ Shingles are document sketches, used to compare documents for similarity.
◦ Pr(shingles are equal) = document similarity, so we can compare documents by comparing their shingles.
◦ To calculate a shingle:
  – take all m-word sequences
  – hash each of them with h_i
  – choose the minimum
  – that is your shingle
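A minimal min-hash shingle sketch along these lines; salting a standard hash stands in for the independent hash functions h_i, and the two sample documents are invented for illustration.

import hashlib

def shingles(text, m=4):
    # All m-word sequences of the document.
    words = text.split()
    return [" ".join(words[i:i + m]) for i in range(len(words) - m + 1)]

def h(shingle, salt):
    # Salted hash standing in for the i-th hash function h_i.
    d = hashlib.blake2b(f"{salt}:{shingle}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big")

def sketch(text, num_hashes=64, m=4):
    # For each h_i, keep the minimum hash over all shingles.
    sh = shingles(text, m)
    assert sh, "document shorter than m words"
    return [min(h(s, i) for s in sh) for i in range(num_hashes)]

def similarity(a, b):
    # The fraction of agreeing min-hashes estimates document resemblance.
    return sum(x == y for x, y in zip(a, b)) / len(a)

d1 = "do not crawl in the dust different urls with similar text"
d2 = "do not crawl in the dust different urls and similar text"
print(similarity(sketch(d1), sketch(d2)))  # approximates the shingle-set overlap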

Types of DUST
◦ Alias DUST: simple substring substitutions
  – "story_1259" → "story?id=1259"
  – "news.google.com" → "google.com/news"
  – "/index.html" → ""
◦ Parameter DUST
  – standard URL structure: protocol://domain.name/path/name?para=val&pa=va
  – some parameters do not affect content: they can be removed, or changed to a default value
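A hedged illustration of the two rule types; the host site.com and the parameter names "Print"/"sessionid" are hypothetical, while the "story_" substitution comes from the slide.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def apply_alias_rule(url, alpha, beta):
    # Alias DUST: substitute the substring alpha with beta.
    return url.replace(alpha, beta, 1)

def drop_parameters(url, irrelevant):
    # Parameter DUST: remove parameters that do not affect content.
    scheme, netloc, path, query, frag = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query) if k not in irrelevant]
    return urlunsplit((scheme, netloc, path, urlencode(kept), frag))

print(apply_alias_rule("http://site.com/story_1259", "story_", "story?id="))
# -> http://site.com/story?id=1259
print(drop_parameters("http://site.com/a?id=7&Print=1", {"Print", "sessionid"}))
# -> http://site.com/a?id=7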

DUST Rules!
◦ A DUST rule transforms one URL to another. Example: 'index.html' → ''
◦ Valid DUST rule: r is a valid DUST rule w.r.t. a site S if for every URL u ∈ S:
  – r(u) is a valid URL
  – r(u) and u have "similar" contents
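A minimal sketch of what checking this definition could look like: sample URLs from the site, apply the rule, and compare the fetched contents with a pluggable similarity predicate (e.g., the shingle sketch above). This is an illustration, not the paper's validation procedure.

import urllib.request
from typing import Callable, Optional

def fetch(url: str) -> Optional[bytes]:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()
    except (OSError, ValueError):   # network errors, malformed URLs
        return None

def rule_is_valid(rule: Callable[[str], str], sample,
                  similar: Callable[[bytes, bytes], bool]) -> bool:
    # r(u) must be a valid URL and have contents similar to u's.
    for u in sample:
        a, b = fetch(u), fetch(rule(u))
        if a is None or b is None or not similar(a, b):
            return False
    return True

# e.g. a rough stand-in for the rule 'index.html' -> '':
strip_index = lambda u: u.replace("/index.html", "/", 1)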

DUSTBUSTER ALGORITHM
DustBuster uncovers DUST: it discovers rules that transform a given URL to others that are likely to have similar content. DustBuster has four phases:
◦ The first phase uses the URL list alone to generate a short list of likely DUST rules.
◦ The second phase removes redundancies from this list.
◦ The third phase generates likely parameter-substitution rules.
◦ The last phase validates or refutes each rule in the list by fetching a small sample of pages.

DUST Algorithm
Given: a list of URLs from a site S (a previous crawl log or a web server log).
Want: to find valid DUST rules w.r.t. S
◦ as many as possible
◦ including site-specific ones
◦ while minimizing the number of fetches
Applications:
◦ site-specific canonization
◦ more efficient crawling

DUST Algorithm

Function DetectLikelyRules(URLList L)
  create table ST (substring, prefix, suffix, size range/doc sketch)
  create table IT (substring1, substring2)
  create table RT (substring1, substring2, support size)
  for each record r ∈ L do
    for ℓ = 0 to S do
      for each substring a of r.url of length ℓ do
        p := prefix of r.url preceding a
        s := suffix of r.url succeeding a
        add (a, p, s, r.size range/r.doc sketch) to ST
  group tuples in ST into buckets by (prefix, suffix)
  for each bucket B do
    if (|B| = 1 OR |B| > T) continue
    for each pair of distinct tuples t1, t2 ∈ B do
      if (LikelySimilar(t1, t2))
        add (t1.substring, t2.substring) to IT
  group tuples in IT into rule supports by (substring1, substring2)
  for each rule support R do
    t := first tuple in R
    add tuple (t.substring1, t.substring2, |R|) to RT
  sort RT by support size
  return all rules in RT whose support size is ≥ MS
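A hedged Python transcription of this phase. In-memory dicts stand in for the tables; max_len stands in for the substring-length bound (S in the listing), and T and MS are the bucket-size and minimum-support thresholds. LikelySimilar is reduced to sketch equality for brevity.

from collections import defaultdict

def likely_similar(k1, k2):
    # Stand-in for the size-range / document-sketch comparison.
    return k1 == k2

def detect_likely_rules(records, max_len=8, T=100, MS=3):
    # records: (url, sketch) pairs; sketch plays the role of
    # the size range / doc sketch column of table ST.
    buckets = defaultdict(list)            # ST grouped by (prefix, suffix)
    for url, sk in records:
        for l in range(max_len + 1):       # substrings a of each length l
            for i in range(len(url) - l + 1):
                a, p, s = url[i:i + l], url[:i], url[i + l:]
                buckets[(p, s)].append((a, sk))
    support = defaultdict(int)             # (substring1, substring2) -> |support|
    for bucket in buckets.values():
        if len(bucket) == 1 or len(bucket) > T:
            continue                       # skip singleton and oversized buckets
        for i, (a1, k1) in enumerate(bucket):
            for a2, k2 in bucket[i + 1:]:
                if a1 != a2 and likely_similar(k1, k2):
                    support[(a1, a2)] += 1
    rules = [(a1, a2, n) for (a1, a2), n in support.items() if n >= MS]
    return sorted(rules, key=lambda r: -r[2])   # sorted by support size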

How to Detect Likely DUST Rules? The Large Support Principle
α: a string, e.g. α = "story_"
u: a URL that contains α as a substring
Envelope of α in u: a pair of strings (p, s)
◦ p: the prefix of u preceding α
◦ s: the suffix of u succeeding α
◦ Example: if u ends with "story_2659", then s = "2659"
E(α): all envelopes of α in URLs that appear in the input URL list
Support(r): all instances (u, v) of rule r
The support of a valid DUST rule is large.
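A small Python sketch of computing E(α) over a toy URL list; the site.com URLs are invented stand-ins consistent with the slide's "story_" / "2659" example.

def envelopes(alpha, urls):
    # E(alpha): all (p, s) such that u = p + alpha + s for some input URL u.
    E = set()
    for u in urls:
        start = 0
        while (i := u.find(alpha, start)) != -1:
            E.add((u[:i], u[i + len(alpha):]))
            start = i + 1
    return E

urls = ["http://site.com/story_2659", "http://site.com/story?id=2659"]
print(envelopes("story_", urls))
# -> {('http://site.com/', '2659')}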

Envelopes Example (shown as a figure on the original slide)

Rule Support: An Equivalent View
α → β: an alias DUST rule, e.g. α = "story_", β = "story?id="
Lemma: |Support(α → β)| = |E(α) ∩ E(β)|
Proof:
◦ bucket(p, s) = { α | (p, s) ∈ E(α) }
◦ Observation: (u, v) is an instance of α → β if and only if u = pαs and v = pβs for some (p, s)
◦ Hence, (u, v) is an instance of α → β iff (p, s) ∈ E(α) ∩ E(β)
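A quick check of the lemma on a toy URL list (hypothetical site.com URLs; envelopes() here takes the first occurrence only, for brevity): the support of "story_" → "story?id=" equals the size of the envelope intersection.

def envelopes(alpha, urls):
    # First-occurrence envelopes of alpha in each URL.
    return {(u[:i], u[i + len(alpha):]) for u in urls
            if (i := u.find(alpha)) != -1}

urls = [
    "http://site.com/story_2659", "http://site.com/story?id=2659",
    "http://site.com/story_17",   "http://site.com/story?id=17",
    "http://site.com/story_42",   # no matching alias in the list
]
common = envelopes("story_", urls) & envelopes("story?id=", urls)
print(len(common))   # 2 = |Support("story_" -> "story?id=")|
for p, s in sorted(common):
    print(p + "story_" + s, "->", p + "story?id=" + s)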


Experimental Results

Experimental Setup
DustBuster was run on four web sites:
◦ a dynamic forum
◦ an academic site
◦ a large news site (cnn.com)
◦ a smaller news site (nydailynews.com)
Rules were detected from a log of about 20,000 unique URLs; for each site, four logs from different time periods were used.

DUST Distribution
◦ DUST: 47.1%
◦ Images: 25.7%
◦ Soft errors: 7.6%
◦ Exact copies: 17.9%
◦ Miscellaneous: 1.8%

DUST Distribution
47.1% of the duplicates in the test log are eliminated by DustBuster's canonization algorithm. The remaining duplicates fall into several categories: (1) duplicate images and icons; (2) replicated documents; (3) soft errors (pages with no meaningful content, results pages, etc.).

Reduction in Crawl Size

Web Site                   Reduction Achieved
Academic site              18%
Small news site            26%
Large news site            6%
Forum site (using logs)    4.7%

Conclusions
◦ DustBuster is an efficient algorithm for finding DUST rules.
◦ It benefits ranking algorithms.
◦ It has very low storage and space requirements.
◦ It increases the effectiveness of crawling.
◦ It reduces indexing overhead.

More Related Work
◦ Mirror detection [Bharat, Broder 99], [Bharat, Broder, Dean, Henzinger 00], [Cho, Shivakumar, Garcia-Molina 00], [Liang 01]
◦ Identifying plagiarized documents [Hoad, Zobel 03]
◦ Finding near-replicas [Shivakumar, Garcia-Molina 98], [Di Iorio, Diligenti, Gori, Maggini, Pucci 03]
◦ Copy detection [Brin, Davis, Garcia-Molina 95], [Garcia-Molina, Gravano, Shivakumar 96], [Shivakumar, Garcia-Molina 96]

References
Ziv Bar-Yossef and Idit Keidar: Do Not Crawl in the DUST: Different URLs with Similar Text.