1
**Near-Duplicates Detection**

Naama Kraus

Slides are based on the book Introduction to Information Retrieval by Manning, Raghavan and Schütze. Some slides are courtesy of Kira Radinsky.

2
**Why duplicate detection?**

About 30-40% of the pages on the Web are (near) duplicates of other pages, e.g., mirror sites.

Search engines try to avoid indexing duplicate pages:
- Save storage
- Save processing time
- Avoid returning duplicate pages in search results
- Improve the user’s search experience

The goal: detect duplicate pages.

The web is full of duplicated content. Strict duplicate detection means exact match, which is not as common, but there are many, many cases of near duplicates, e.g., two copies of a page whose only difference is the last-modified date. There is little point in indexing (nearly) the same content over and over again, and there is certainly no reason to return the same content multiple times in search result pages. How can near-duplicate pages be identified in a scalable and reliable manner?

3
**Exact-duplicates detection**

A naïve approach: detect exact duplicates.
- Map each page to some fingerprint, e.g., a 64-bit value
- If two web pages have an equal fingerprint, check whether their content is equal
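A minimal sketch of this fingerprint-then-verify idea, assuming pages are given as a dictionary from URL to text. The choice of BLAKE2b truncated to 8 bytes and the function names `fingerprint64` and `exact_duplicate_groups` are illustrative assumptions, not something the slides specify.

```python
import hashlib
from collections import defaultdict

def fingerprint64(text: str) -> int:
    """64-bit fingerprint of a page (assumption: BLAKE2b truncated to 8 bytes)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def exact_duplicate_groups(pages: dict) -> list:
    """Group URLs whose content is byte-for-byte equal."""
    # Step 1: bucket pages by fingerprint.
    buckets = defaultdict(list)
    for url, text in pages.items():
        buckets[fingerprint64(text)].append(url)
    # Step 2: inside each bucket, confirm equality on the actual content,
    # since two different pages could in principle share a fingerprint.
    groups = []
    for urls in buckets.values():
        by_content = defaultdict(list)
        for url in urls:
            by_content[pages[url]].append(url)
        groups.extend(sorted(g) for g in by_content.values() if len(g) > 1)
    return groups

pages = {"a": "hello world", "b": "hello world", "c": "hello world!"}
print(exact_duplicate_groups(pages))
```

Only pages that land in the same bucket are ever compared in full, which is what keeps the exact-match check cheap.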

4
**Near-duplicates What about near-duplicates? The challenge**

Pages that are almost identical. Common on the Web. E.g., only date differs. Eliminating near duplicates is desired! The challenge How to efficiently detect near duplicates? Exhaustively comparing all pairs of web pages wouldn’t scale.

5
**Shingling**

The k-shingles of a document d are defined to be the set of all consecutive sequences of k terms in d, where k is a positive integer.

E.g., the 4-shingles of “My name is Inigo Montoya. You killed my father. Prepare to die”:

{
- my name is inigo
- name is inigo montoya
- is inigo montoya you
- inigo montoya you killed
- montoya you killed my
- you killed my father
- killed my father prepare
- my father prepare to
- father prepare to die

}
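The definition can be sketched in a few lines of Python. The tokenization (lowercasing and splitting on non-word characters) is an assumption chosen to reproduce the example above; the slides do not prescribe one.

```python
import re

def shingles(text: str, k: int = 4) -> set:
    """Return the set of k-shingles (k consecutive terms) of a document."""
    # Assumed tokenization: lowercase, keep runs of word characters.
    terms = re.findall(r"\w+", text.lower())
    # A document with fewer than k terms has no k-shingles.
    return {" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)}

quote = "My name is Inigo Montoya. You killed my father. Prepare to die"
for s in sorted(shingles(quote, 4)):
    print(s)
```

A document with n terms yields at most n - k + 1 distinct shingles; the 12-term example gives 9.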

6
**Computing Similarity**

Intuition: two documents are near-duplicates if their shingle sets are ‘nearly the same’.

Measure similarity using the Jaccard coefficient, the degree of overlap between two sets:
- Denote by $S(d)$ the set of shingles of document $d$
- $J(S(d_1), S(d_2)) = |S(d_1) \cap S(d_2)| \, / \, |S(d_1) \cup S(d_2)|$
- If $J$ exceeds a preset threshold (e.g., 0.9), declare $d_1, d_2$ near duplicates

Issue: the computation is costly and done pairwise. How can we compute Jaccard efficiently?

Metric: the Jaccard distance $1 - J$ is symmetric, reflexive, and satisfies the triangle inequality.
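As a small illustration, the coefficient and the threshold test can be written directly on Python sets. The function names and the convention that two empty sets have similarity 1.0 are assumptions made for this sketch.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two sets."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)

def near_duplicates(a: set, b: set, threshold: float = 0.9) -> bool:
    """Declare near duplicates when the coefficient exceeds the threshold."""
    return jaccard(a, b) > threshold

s1 = {1, 2, 3}
s2 = {2, 3, 4, 5}
print(jaccard(s1, s2))          # 2 shared elements out of 5 in the union
print(near_duplicates(s1, s2))
```

Computing this for every pair of documents is exactly the quadratic cost the slide flags as the issue.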

7
**Hashing shingles**

- Map each shingle into an integer hash value over a large space, say 64 bits
- $H(d_i)$ denotes the set of hash values derived from $S(d_i)$
- Need to detect pairs whose sets $H()$ have a large overlap
- How to do this efficiently? In the next slides …

8
**Permuting**

- Let $p$ be a random permutation over the hash value space
- Let $P(d_i)$ denote the set of permuted hash values in $H(d_i)$
- Let $x_i$ be the smallest integer in $P(d_i)$
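A hedged sketch of computing $x_i$ for one permutation. Drawing a uniformly random permutation of the whole 64-bit space is impractical, so this example substitutes the bijection x -> (a·x + b) mod 2^64 with odd a; that substitution, the BLAKE2b-based `hash64`, and all function names are assumptions of this sketch, not the slides' construction.

```python
import hashlib
import random

MASK64 = (1 << 64) - 1

def hash64(shingle: str) -> int:
    """Map a shingle to a 64-bit integer (assumed hash: truncated BLAKE2b)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def make_permutation(seed: int):
    """Return a bijection on [0, 2^64): x -> (a*x + b) mod 2^64 with a odd.

    This is a true permutation of the space, but only a stand-in for a
    uniformly random one.
    """
    rng = random.Random(seed)
    a = rng.getrandbits(64) | 1  # odd multiplier is invertible mod 2^64
    b = rng.getrandbits(64)
    return lambda x: (a * x + b) & MASK64

def min_hash(shingle_set: set, perm) -> int:
    """x_i: the smallest permuted hash value of a non-empty shingle set."""
    return min(perm(hash64(s)) for s in shingle_set)

p = make_permutation(seed=1)
print(min_hash({"my name is inigo", "name is inigo montoya"}, p))
```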

9
**Illustration**

Document 1:
- Start with 64-bit H(shingles), placed on a number line up to $2^{64}$
- Permute on the number line with $p$
- Pick the min value

10
**Key Theorem**

Theorem: $J(S(d_i), S(d_j)) = P(x_i = x_j)$, where $x_i, x_j$ are taken under the same permutation.

Intuition: if the shingle sets of two documents are ‘nearly the same’ and we randomly permute, then there is a high probability that the minimal values are equal.
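The theorem can be checked empirically on a small universe, where drawing truly random permutations is cheap. The sets, universe size, trial count, and the function name `collision_rate` below are illustrative assumptions; the true Jaccard coefficient of the two sets is 2/5.

```python
import random

def collision_rate(s1: set, s2: set, universe_size: int,
                   trials: int = 20000, seed: int = 0) -> float:
    """Fraction of random permutations under which both sets share the same minimum."""
    rng = random.Random(seed)
    image = list(range(universe_size))
    hits = 0
    for _ in range(trials):
        rng.shuffle(image)                 # image[e] is where element e is sent
        x1 = min(image[e] for e in s1)     # min permuted value of the first set
        x2 = min(image[e] for e in s2)     # min permuted value of the second set
        hits += (x1 == x2)
    return hits / trials

s1 = {0, 1, 2}
s2 = {1, 2, 3, 4}
print("Jaccard:", len(s1 & s2) / len(s1 | s2))
print("Empirical P(x1 = x2):", collision_rate(s1, s2, universe_size=10))
```

The empirical rate should land close to 0.4, with the gap shrinking as the number of trials grows.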

11
**Proof (1)**

View the sets $S_1, S_2$ as columns of a matrix $A$:
- One row for each element in the universe
- $a_{ij} = 1$ indicates the presence of item $i$ in set $j$

Example (columns $S_1$, $S_2$): Jaccard($S_1$, $S_2$) = 2/5 = 0.4

12
**Proof (2)**

- Let $p$ be a random permutation of the rows of $A$
- Denote by $P(S_j)$ the column that results from applying $p$ to the $j$-th column
- Let $x_i$ be the index of the first row in which the column $P(S_i)$ has a 1

Illustration: the permuted columns $P(S_1)$, $P(S_2)$.

13
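**Proof (3)**

For columns $S_i$, $S_j$, there are four types of rows:

| Type | $S_i$ | $S_j$ |
|------|-------|-------|
| A    | 1     | 1     |
| B    | 1     | 0     |
| C    | 0     | 1     |
| D    | 0     | 0     |

Let A = # of rows of type A (and likewise B, C, D).

Clearly, $J(S_1, S_2) = A/(A+B+C)$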
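To make the row-type counting concrete, here is a small sketch that classifies rows and evaluates A/(A+B+C). The two example columns are hypothetical, chosen only so that the result is 2/5 like the value quoted in Proof (1); they are not the slide's original matrix.

```python
from collections import Counter

ROW_TYPES = {(1, 1): "A", (1, 0): "B", (0, 1): "C", (0, 0): "D"}

def row_type_counts(col_i: list, col_j: list) -> Counter:
    """Count rows of type A, B, C, D for two 0/1 columns of equal length."""
    return Counter(ROW_TYPES[(a, b)] for a, b in zip(col_i, col_j))

def jaccard_from_columns(col_i: list, col_j: list) -> float:
    """J = A / (A + B + C); type D rows play no role."""
    c = row_type_counts(col_i, col_j)
    denom = c["A"] + c["B"] + c["C"]
    return c["A"] / denom if denom else 1.0  # assumed convention for empty sets

# Hypothetical columns: row types are A, C, A, B, C, D.
s1 = [1, 0, 1, 1, 0, 0]
s2 = [1, 1, 1, 0, 1, 0]
print(row_type_counts(s1, s2))
print(jaccard_from_columns(s1, s2))
```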

14
**Proof (4)**

Previous slide: $J(S_1, S_2) = A/(A+B+C)$

Claim: $P(x_i = x_j) = A/(A+B+C)$

Why?
- Look down the columns $S_i$, $S_j$ until the first non-type-D row, i.e., look for $x_i$ or $x_j$ (the smaller of the two, or both if they are equal)
- $x_i = x_j$ if and only if this row is a type A row
- As we picked a random permutation, the probability that the first non-type-D row is of type A is $A/(A+B+C)$

Hence $P(x_i = x_j) = J(S_1, S_2)$

15
**Sketches**

Thus our Jaccard coefficient test is probabilistic: we need to estimate $P(x_i = x_j)$.

Method:
- Pick $k$ (~200) random row permutations $P$
- $\text{Sketch}_{d_i}$ = the list of $x_i$ values, one for each permutation; the list is of length $k$
- Jaccard estimation: the fraction of permutations where the sketch values agree, $|\text{Sketch}_{d_i} \cap \text{Sketch}_{d_j}| \, / \, k$ (counting the positions at which the two lists agree)
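A minimal sketch of this estimator. Instead of explicit permutations it uses k independently salted 64-bit hashes of each shingle as stand-ins for the k random orderings; that substitution, the BLAKE2b `salt` usage, and the function names are assumptions of this example.

```python
import hashlib

def salted_hash64(shingle: str, i: int) -> int:
    """64-bit hash of a shingle under the i-th salt (stand-in for permutation i)."""
    h = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                        salt=i.to_bytes(8, "big"))
    return int.from_bytes(h.digest(), "big")

def sketch(shingle_set: set, k: int = 200) -> list:
    """List of k min-hash values, one per simulated permutation."""
    return [min(salted_hash64(s, i) for s in shingle_set) for i in range(k)]

def estimate_jaccard(sk1: list, sk2: list) -> float:
    """Fraction of positions at which the two sketches agree."""
    agree = sum(x == y for x, y in zip(sk1, sk2))
    return agree / len(sk1)

a = {f"shingle-{n}" for n in range(100)}
b = {f"shingle-{n}" for n in range(50, 150)}   # true Jaccard = 50/150
print(estimate_jaccard(sketch(a), sketch(b)))
```

With k = 200 the estimate is noisy (standard error of a few hundredths), so it approximates the true coefficient without matching it exactly.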

16
**Example**

[Table: sketches of S1, S2, S3 per permutation, with the resulting pairwise similarities. Surviving row: Perm 1 = (12345) gives sketch values 1, 2, 1.]

17
**Algorithm for Clustering Near-Duplicate Documents**

1. Compute the sketch of each document
2. From each sketch, produce a list of <shingle, docID> pairs
3. Group all pairs by shingle value
4. For any shingle that is shared by more than one document, output a triplet <smaller-docID, larger-docID, 1> for each pair of docIDs sharing that shingle
5. Sort and aggregate the list of triplets, producing final triplets of the form <smaller-docID, larger-docID, # common shingles>
6. Join any pair of documents whose number of common shingles exceeds a chosen threshold, using a “Union-Find” algorithm
7. Each resulting connected component of the UF algorithm is a cluster of near-duplicate documents

The implementation nicely fits the “map-reduce” programming paradigm.
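The steps above can be sketched in a single process as follows, assuming the sketches are already computed and passed in as a dictionary from docID to a list of sketch values. Keying the grouping on (position, value) rather than on the value alone is an assumption of this example, as are the function name and the toy data.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cluster_near_duplicates(sketches: dict, threshold: int) -> list:
    """Cluster docIDs whose sketches share more than `threshold` entries."""
    # Steps 2-3: emit (sketch entry, docID) pairs and group them by entry.
    groups = defaultdict(list)
    for doc_id, sk in sketches.items():
        for pos, value in enumerate(sk):
            groups[(pos, value)].append(doc_id)
    # Steps 4-5: one count per shared entry, aggregated per docID pair.
    common = Counter()
    for docs in groups.values():
        for small, large in combinations(sorted(docs), 2):
            common[(small, large)] += 1
    # Step 6: union-find with path halving.
    parent = {d: d for d in sketches}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (a, b), n in common.items():
        if n > threshold:
            parent[find(a)] = find(b)
    # Step 7: each connected component is a cluster.
    clusters = defaultdict(list)
    for d in sketches:
        clusters[find(d)].append(d)
    return sorted(sorted(c) for c in clusters.values())

toy = {1: [1, 2, 3, 4], 2: [1, 2, 3, 9], 3: [7, 8, 9, 9], 4: [1, 2, 5, 9]}
print(cluster_near_duplicates(toy, threshold=2))
```

In the toy data, documents 1 and 4 share only two entries, yet they end up in the same cluster because each is joined to document 2; union-find makes the grouping transitive.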

18
**Implementation Trick**

Permuting the universe even once is prohibitive.

Row hashing:
- Pick $P$ hash functions $h_k$
- Ordering under $h_k$ gives a random permutation of the rows

One-pass implementation:
- For each column $C_i$ and hash function $h_k$, keep a slot for the min-hash value
- Initialize all $\text{slot}(C_i, h_k)$ to infinity
- Scan the rows in arbitrary order looking for 1’s
- Suppose row $R_j$ has a 1 in column $C_i$: for each $h_k$, if $h_k(j) < \text{slot}(C_i, h_k)$, then $\text{slot}(C_i, h_k) \leftarrow h_k(j)$

19
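**Example**

Hash functions: $h(x) = x \bmod 5$, $g(x) = (2x+1) \bmod 5$

| Row | C1 | C2 | h | g | C1 slots (h, g) | C2 slots (h, g) |
|-----|----|----|---|---|-----------------|-----------------|
| R1 | 1 | 0 | h(1) = 1 | g(1) = 3 | 1, 3 | ∞, ∞ |
| R2 | 0 | 1 | h(2) = 2 | g(2) = 0 | 1, 3 | 2, 0 |
| R3 | 1 | 1 | h(3) = 3 | g(3) = 2 | 1, 2 | 2, 0 |
| R4 | 1 | 0 | h(4) = 4 | g(4) = 4 | 1, 2 | 2, 0 |
| R5 | 0 | 1 | h(5) = 0 | g(5) = 1 | 1, 2 | 0, 0 |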
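The one-pass procedure from the Implementation Trick slide can be sketched and run on this example. The function name `one_pass_minhash` and the representation of rows as (index, bits) pairs are assumptions of this sketch; the matrix and the hash functions h and g are the ones given in the example.

```python
import math

def one_pass_minhash(rows, hash_funcs, num_cols):
    """Return slots[i][k]: the min of hash_funcs[k](j) over rows j with a 1 in column i."""
    # Initialize every slot to infinity.
    slots = [[math.inf] * len(hash_funcs) for _ in range(num_cols)]
    for j, bits in rows:                      # rows may be scanned in any order
        hashed = [h(j) for h in hash_funcs]   # hash the row index once per function
        for i, bit in enumerate(bits):
            if bit:
                for k, value in enumerate(hashed):
                    if value < slots[i][k]:
                        slots[i][k] = value
    return slots

def h(x):
    return x % 5

def g(x):
    return (2 * x + 1) % 5

rows = [(1, (1, 0)), (2, (0, 1)), (3, (1, 1)), (4, (1, 0)), (5, (0, 1))]
slots = one_pass_minhash(rows, [h, g], num_cols=2)
print("C1 slots:", slots[0])   # [1, 2]
print("C2 slots:", slots[1])   # [0, 0]
```

The matrix is never permuted: each row index is hashed once per function, and only one slot per (column, hash function) pair is kept in memory.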
