Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!

Similar presentations
Song Intersection by Approximate Nearest Neighbours Michael Casey, Goldsmiths Malcolm Slaney, Yahoo! Inc.

Hashing.
Similarity and Distance Sketching, Locality Sensitive Hashing
Near-Duplicates Detection
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
The course Project #1: Dictionary search with 1 error The problem consists of building a data structure to index a large dictionary of strings.
CS 361A1 CS 361A (Advanced Data Structures and Algorithms) Lecture 18 (Nov 30, 2005) Fingerprints, Min-Hashing, and Document Similarity Rajeev Motwani.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 11 June 1, 2005
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Ph.D. SeminarUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Detecting Near Duplicates for Web Crawling Authors: Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma. Presented by Chintan Udeshi 6/28/ Udeshi-CS572.
1 Lecture 18 Syntactic Web Clustering CS
Advanced Algorithms for Massive Datasets Basics of Hashing.
Near Duplicate Detection
Estimating Set Expression Cardinalities over Data Streams Sumit Ganguly Minos Garofalakis Rajeev Rastogi Internet Management Research Department Bell Labs,
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.
Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3.
Lecture 11 oct 7 Goals: hashing hash functions chaining closed hashing application of hashing.
Finding Similar Items.
1Bloom Filters Lookup questions: Does item “ x ” exist in a set or multiset? Data set may be very big or expensive to access. Filter lookup questions with.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Near-duplicates detection Comparison of the two algorithms seen in class Romain Colle.
Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma
Hashtables David Kauchak cs302 Spring Administrative Talk today at lunch Midterm must take it by Friday at 6pm No assignment over the break.
CS276 Lecture 16 Web characteristics II: Web size measurement Near-duplicate detection.
FINDING NEAR DUPLICATE WEB PAGES: A LARGE- SCALE EVALUATION OF ALGORITHMS - Monika Henzinger Speaker Ketan Akade 1.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Chris Manning and Pandu Nayak Crawling.
Hashing Chapter 20. Hash Table A hash table is a data structure that allows fast find, insert, and delete operations (most of the time). The simplest.
Brief (non-technical) history Full-text index search engines Altavista, Excite, Infoseek, Inktomi, ca Taxonomies populated with web page Yahoo.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
Locality Sensitive Hashing Basics and applications.
May 30, 2016Department of Computer Sciences, UT Austin1 Using Bloom Filters to Refine Web Search Results Navendu Jain Mike Dahlin University of Texas at.
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing.
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms Author: Monika Henzinger Presenter: Chao Yan.
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Applying Syntactic Similarity Algorithms.
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Hashtables David Kauchak cs302 Spring Administrative Midterm must take it by Friday at 6pm No assignment over the break.
CS6045: Advanced Algorithms Data Structures. Hashing Tables Motivation: symbol tables –A compiler uses a symbol table to relate symbols to associated.
DATA MINING LECTURE 6 Sketching, Locality Sensitive Hashing.
CSC 413/513: Intro to Algorithms Hash Tables. ● Hash table: ■ Given a table T and a record x, with key (= symbol) and satellite data, we need to support:
Big Data Infrastructure
CS276 Information Retrieval and Web Search
Locality-sensitive hashing and its applications
Near Duplicate Detection
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Locality-sensitive hashing and its applications
Sketching, Locality Sensitive Hashing
Lecture 16: Web search/Crawling/Link Analysis
Finding Similar Items: Locality Sensitive Hashing
Theory of Locality Sensitive Hashing
Lecture 11: Nearest Neighbor Search
COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results
Detecting Phrase-Level Duplication on the World Wide Web
RUM Conjecture of Database Access Method
CS5112: Algorithms and Data Structures for Applications
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Minwise Hashing and Efficient Search
On the resemblance and containment of documents (MinHash)
Hash Functions for Network Applications (II)
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
DATA STRUCTURES-COLLISION TECHNIQUES
Three Essential Techniques for Similar Documents
Collision Resolution: Open Addressing Extendible Hashing
Presentation transcript:


Duplicate documents. The web is full of duplicated content: there are few exact duplicates but many near duplicates. E.g., the last-modified date may be the only difference between two copies of a page. Sec. 19.6

Near-Duplicate Detection. Problem: given a large collection of documents, identify the near-duplicate ones. Web search engines face a proliferation of near-duplicate documents: legitimate (mirrors, local copies, updates, …), malicious (spam, spider traps, dynamic URLs, …), and mistaken (spider errors). About 30% of web pages were found to be near-duplicates [1997].

Desiderata. Storage: keep only small sketches of each document. Computation: the fastest possible. Stream processing: once the sketch is computed, the source is no longer available. Error guarantees: at this problem scale even small biases have a large impact, so we need formal guarantees; heuristics will not do.

Natural Approaches. Fingerprinting only works for exact matches: Karp-Rabin (rolling hash) gives collision-probability guarantees; MD5 gives cryptographically secure string hashes. The edit-distance metric for approximate string matching is expensive even for one pair of documents, and infeasible for billions of web documents. Random sampling (sample substrings: phrases, sentences, etc.) hopes that similar documents yield similar samples, but even samples of the same document will differ.

Karp-Rabin Fingerprints. Consider an m-bit string with a leading 1: A = 1 a1 a2 … am. Basic values: choose a prime p in the universe U ≈ 2^64. Fingerprint: f(A) = A mod p. Rolling hash: given B = 1 a2 … am a(m+1), f(B) = [2·(A − 2^m − a1·2^(m−1)) + 2^m + a(m+1)] mod p. Prob[false hit] = Prob[p divides (A − B)] = #div(A − B) / #primes(U) ≤ log(A + B) / #primes(U) ≈ (m log U) / U.
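The rolling computation above can be sketched in Python. This is a toy illustration over byte strings rather than the slide's bit strings, and it uses a fixed Mersenne prime as p (in practice p would be drawn at random to obtain the collision-probability guarantee):

```python
# A minimal Karp-Rabin rolling-hash sketch, assuming byte-string input.
# The prime P and radix B are illustrative choices, not from the slides.

P = (1 << 61) - 1  # illustrative Mersenne prime used as the modulus
B = 256            # radix: one symbol per byte

def fingerprint(s: bytes) -> int:
    """f(A) = A mod p, reading s as a base-256 number."""
    h = 0
    for c in s:
        h = (h * B + c) % P
    return h

def rolling_fingerprints(text: bytes, m: int):
    """Yield f(window) for every length-m window; each step is O(1)."""
    h = fingerprint(text[:m])
    top = pow(B, m - 1, P)  # weight of the outgoing symbol
    yield h
    for i in range(m, len(text)):
        # drop the leftmost symbol, shift, append the new symbol
        h = ((h - text[i - m] * top) * B + text[i]) % P
        yield h

# Sanity check: rolled values match from-scratch fingerprints.
text = b"the quick brown fox jumps over the lazy dog"
rolled = list(rolling_fingerprints(text, 8))
direct = [fingerprint(text[i:i + 8]) for i in range(len(text) - 8 + 1)]
assert rolled == direct
```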

Basic Idea [Broder 1997]: Shingling. Dissect each document into q-grams (shingles), represent documents by their shingle sets, and reduce the problem to set intersection [Jaccard]. Two documents are near-duplicates if their shingle sets intersect enough.
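The shingling idea can be sketched in a few lines of Python. This is a toy illustration: the choice q = 4 character shingles and the example strings are ours, not from the slides:

```python
# A minimal shingling + exact-Jaccard sketch (character q-grams; word
# q-grams work the same way).

def shingles(doc: str, q: int = 4) -> set:
    """The set of q-grams (shingles) of doc."""
    return {doc[i:i + q] for i in range(len(doc) - q + 1)}

def jaccard(a: set, b: set) -> float:
    """sim(A, B) = |A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b)

# Two near-duplicate strings: most shingles are shared.
d1 = "the last modified date is the only difference"
d2 = "the last modified time is the only difference"
sim = jaccard(shingles(d1), shingles(d2))
```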

#1. Doc Similarity ⇒ Set Intersection. Represent Doc A and Doc B by their shingle sets S_A and S_B. The Jaccard measure of their similarity is sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|. Claim: A and B are near-duplicates if sim(S_A, S_B) is high. To cope with "set intersection" at scale we use: fingerprints of shingles (for space/time efficiency) and min-hashing to estimate intersection sizes (for further efficiency).

#2. Sets of 64-bit Fingerprints. Pipeline: Doc → shingling → multiset of shingles → fingerprinting → multiset of fingerprints. Use Karp-Rabin fingerprints over the q-gram shingles (each of 8q bits). In practice, use 64-bit fingerprints, i.e., U = 2^64, so Prob[collision] ≈ (8q · 64)/2^64 ≪ 1. This reduces the space for storing the multisets and the time to intersect them, but…

#3. Sketch of a document. The sets are large, so their intersection is still too costly to compute. Instead, create a "sketch vector" (of size ~200) for each shingle set. Documents that share ≥ t (say 80%) of their sketch elements are claimed to be near-duplicates. Sec. 19.6

Sketching by Min-Hashing. Consider S_A, S_B ⊆ {0, …, p−1}. Pick a random permutation π of the whole set (such as π(x) = ax + b mod p). Define α = min{π(S_A)} and β = min{π(S_B)}, the minimal elements under the permutation π. Lemma: Prob[α = β] = |S_A ∩ S_B| / |S_A ∪ S_B| = sim(S_A, S_B).

Strengthening it… Similarity sketch: sk(A) = the k minimal elements under π(S_A). Alternatively, take k permutations and keep the min of each. Note: a larger k reduces the variance of the estimate.
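The k-permutations variant can be sketched as follows, with each permutation of the form π(x) = (a·x + b) mod p as on the previous slide. The prime p, the value k = 200, and the toy sets are illustrative choices; restricting the sets to a small subset of the universe makes the linear maps approximate permutations, which is the usual practical shortcut:

```python
import random

P = (1 << 61) - 1  # illustrative prime defining the universe {0, …, p−1}

def make_perms(k: int, seed: int = 0):
    """k random (a, b) pairs, each encoding π(x) = (a·x + b) mod p."""
    rnd = random.Random(seed)
    return [(rnd.randrange(1, P), rnd.randrange(P)) for _ in range(k)]

def minhash_sketch(fingerprints: set, perms) -> list:
    """sketch[i] = min over the set of π_i(x) = (a_i·x + b_i) mod p."""
    return [min((a * x + b) % P for x in fingerprints) for a, b in perms]

def estimated_jaccard(sk1: list, sk2: list) -> float:
    """Fraction of sketch coordinates where the two documents agree."""
    return sum(u == v for u, v in zip(sk1, sk2)) / len(sk1)

# Two overlapping toy sets: true Jaccard = 50/150 = 1/3.
A = set(range(100))
B = set(range(50, 150))
perms = make_perms(200)
est = estimated_jaccard(minhash_sketch(A, perms), minhash_sketch(B, perms))
```

With k = 200 coordinates the estimate concentrates around the true Jaccard value, matching the lemma's collision probability.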

Computing Sketch[i] for Doc1. Start with the 64-bit fingerprints f(shingles), permute them with π_i, and pick the min value. Sec. 19.6

Test if Doc1.Sketch[i] = Doc2.Sketch[i]. Are these equal? Test for 200 random permutations: π1, π2, …, π200. Sec. 19.6

However… Doc1.Sketch[i] = Doc2.Sketch[i] iff the shingle with the minimum value (under π_i) in the union of Doc1 and Doc2 is common to both, i.e., it lies in the intersection. Claim: this happens with probability size_of_intersection / size_of_union. Sec. 19.6

#4. Detecting all duplicates. Brute force (quadratic time): compare sk(A) vs. sk(B) for all pairs of docs A and B. But (num docs)^2 comparisons are too much computation, even if executed in internal memory. Locality-sensitive hashing (LSH) for sk(A) ∩ sk(B): sample h elements of sk(A) as an ID (this may induce false positives); create t such IDs (to reduce the false negatives). If at least one ID matches another one (w.r.t. the same h-selection), then A and B are probably near-duplicates (hence compare them).
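The LSH step can be sketched as follows. Here each ID is taken as a band of h consecutive sketch elements (one common way to realize the h-selection; the slide leaves the sampling scheme open), and the toy sketches are ours:

```python
from collections import defaultdict

def candidate_pairs(sketches: dict, h: int, t: int) -> set:
    """sketches: docID -> sketch of length h*t.
    Returns pairs of docs whose IDs collide in at least one band."""
    candidates = set()
    for band in range(t):
        buckets = defaultdict(list)
        for doc, sk in sketches.items():
            band_id = tuple(sk[band * h:(band + 1) * h])  # the h-sample "ID"
            buckets[band_id].append(doc)
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    candidates.add(tuple(sorted((docs[i], docs[j]))))
    return candidates

# Toy sketches: A and B agree on most coordinates, C agrees with neither.
sks = {"A": [1, 2, 3, 4, 5, 6],
       "B": [1, 2, 3, 4, 9, 6],
       "C": [7, 8, 9, 0, 1, 2]}
pairs = candidate_pairs(sks, h=2, t=3)
```

Only the candidate pairs surviving this filter are then compared sketch against sketch.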

#4. How do you implement this? GOAL: if at least one ID matches another one (w.r.t. the same h-selection), then A and B are probably near-duplicates (hence compare them). SOL 1: create t hash tables (with chaining), one per ID [recall that an ID is an h-sample of sk()]. Insert each docID in the slot of each of its IDs (using some hash function), then scan every bucket and check which docIDs share ≥ 1 ID. SOL 2: sort by each ID, and then check the consecutive equal ones.
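SOL 2 for a single ID position can be sketched as a sort followed by one linear scan over runs of equal IDs. The (band_id, docID) pairs in the example are illustrative:

```python
# SOL 2 sketch: sort (ID, docID) pairs by ID; one scan then reports every
# group of documents sharing the same ID (candidate near-duplicates).

def docs_sharing_ids(ids: list) -> list:
    """ids: list of (band_id, docID) pairs. Returns groups of docIDs
    that share an ID, found by sorting and scanning consecutive runs."""
    ids.sort(key=lambda p: p[0])
    groups = []
    i = 0
    while i < len(ids):
        j = i
        while j < len(ids) and ids[j][0] == ids[i][0]:
            j += 1
        if j - i > 1:  # at least two docs share this ID
            groups.append([doc for _, doc in ids[i:j]])
        i = j
    return groups

pairs = [((1, 2), "A"), ((7, 8), "C"), ((1, 2), "B")]
groups = docs_sharing_ids(pairs)
```

Sorting costs O(n log n) per ID but needs no hash tables, which is convenient when the pairs are processed on disk.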