
CS 361A (Advanced Data Structures and Algorithms), Lecture 18 (Nov 30, 2005): Fingerprints, Min-Hashing, and Document Similarity. Rajeev Motwani.


1. CS 361A (Advanced Data Structures and Algorithms), Lecture 18 (Nov 30, 2005): Fingerprints, Min-Hashing, and Document Similarity. Rajeev Motwani

2. Game Plan for Week
- Fingerprints
- Document Similarity
- Shingling
- Min-Hashing
- Min-Wise Independent Permutations

3. Fingerprints
- Setting: a set of large objects (e.g., URLs)
- Goal
  - avoid storing large objects explicitly
  - quick-and-dirty equality testing
- Fingerprints?
  - Short tags for objects
  - Distinct fingerprints $\Rightarrow$ distinct objects
  - Distinct objects $\Rightarrow$ probably distinct fingerprints

4. Formalization
- Fingerprint length $k$ $\Rightarrow$ fingerprint space size $N = 2^k$
- Fingerprint function family $F = \{ f : \text{objects} \to \{0,1\}^k \}$
- Random $f \in_R F$
  - $f(A) \ne f(B) \Rightarrow A \ne B$
  - Collisions: $P[\, f(A) = f(B) \mid A \ne B \,]$ should be small (ideally $2^{O(-k)}$)
- Typical Application
  - Adversarial object-set $S$ with $|S| = n \ll 2^k$
  - Goal: $|f(S)| = |S|$ with high probability
  - About $n^2$ pairwise collisions are possible $\Rightarrow$ need $2^k > n^2$ (to avoid the Birthday Paradox)

5. Example: URL Fingerprints
- Search Engines
  - Manage large numbers of URL strings
  - Long, variable-length strings (embedded objects/database queries)
- Desiderata
  - Small, fixed-length encodings that are, hopefully, unique
  - Some scenarios
    o Exact string irrelevant
    o Only need ability to distinguish distinct URLs
  - Even otherwise, unique IDs are useful for indexing
- Numbers?
  - 4 billion webpages $\Rightarrow$ $n = 2^{32}$
  - $N$ on the order of $n^2$ $\Rightarrow$ $k = 64$
  - Fingerprints $\Rightarrow$ 8-byte representation
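The sizing rule can be checked numerically. The following is a minimal sketch, assuming an idealized fingerprint function that gives every object an independent, uniformly random k-bit value (an assumption made only for illustration; both helper names are hypothetical). It computes the union bound on the collision probability and the smallest k with 2^k >= n^2.

```python
from math import comb


def collision_upper_bound(n: int, k: int) -> float:
    """Union bound on Pr[some pair among n objects collides] for an
    idealized uniform k-bit fingerprint: C(n, 2) / 2^k."""
    return comb(n, 2) / 2 ** k


def bits_for_n_squared(n: int) -> int:
    """Smallest k such that 2^k >= n^2 (integer arithmetic only)."""
    return (n * n - 1).bit_length()


if __name__ == "__main__":
    n = 2 ** 32  # the "4 billion webpages" figure from the slide
    print(bits_for_n_squared(n))            # number of bits suggested by the n^2 rule
    print(collision_upper_bound(n, 64))     # bound at exactly N = n^2
    print(collision_upper_bound(n, 80))     # bound with 16 extra bits
```

Under this idealized model, choosing N exactly equal to n^2 only brings the union bound just below 1/2, so a few extra bits are needed if a genuinely small failure probability is wanted.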

6. Fingerprinting vs Hashing
- Hashing $h : \text{objects} \to \{0,1\}^k$
  - Set membership testing for a set $S$ of size $n$
  - Desire uniform distribution over bin addresses $\{0,1\}^k$
  - Minimize collisions per bin to reduce lookup time
  - Minimize hash table size $\Rightarrow$ $n \approx N = 2^k$
- Fingerprinting $f : \text{objects} \to \{0,1\}^k$
  - Object equality testing over a set $S$ of size $n$
  - Distribution over $\{0,1\}^k$ is irrelevant
  - Avoid collisions altogether
  - Tolerate larger $k$: typically $N > n^2$

7. Fingerprinting Strings
- Typical application, but the techniques extend to combinatorial objects (database tuples, trees/graphs)
- Obvious techniques
  - Checksum: no worst-case collision probability guarantees
  - MD5: cryptographic string hash
    o relatively slow
    o avoids leaking information about the original string
- Rabin's Scheme
  - Algebraic technique: polynomial arithmetic
  - Efficient: needs (1 table lookup + 1 xor + 1 shift) per byte
  - other nice properties…

8. Rabin Fingerprints
- Consider an $m$-bit string $A = a_1 a_2 \ldots a_m$
- Assume $a_1 = 1$ and fixed-length strings (wlog)
- Encoding
  - Strings $\Rightarrow$ polynomials of degree $m-1$ over $Z_2$
  - $A(x) = a_1 x^{m-1} + a_2 x^{m-2} + \cdots + a_{m-1} x + a_m$
- Fingerprints
  - $P(x)$: random, irreducible degree-$k$ polynomial over $Z_2$ (easy to sample such polynomials)
  - Irreducible means it cannot be factored: $x^2 + x + 1$ is irreducible, whereas $x^2 + 1 = (x+1)^2$ factors
  - $f(A) = A(x) \bmod P(x)$
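A small sketch of the fingerprint computation may help. It assumes polynomials over Z_2 are stored as Python integers (bit i is the coefficient of x^i) and uses a brute-force irreducibility test that is only practical for tiny degrees. The function names are hypothetical, and a production implementation would use the per-byte table lookup mentioned earlier and a randomly sampled degree-k polynomial.

```python
def gf2_mod(a: int, p: int) -> int:
    """Remainder of polynomial a modulo polynomial p over Z_2.
    Bit i of each integer is the coefficient of x^i."""
    if p == 0:
        raise ValueError("modulus polynomial must be non-zero")
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        # Cancel the leading term of a with a shifted copy of p (xor = add in Z_2).
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def is_irreducible(p: int) -> bool:
    """Brute force: p has no divisor of degree 1 .. deg(p) // 2."""
    deg_p = p.bit_length() - 1
    if deg_p < 1:
        return False
    for d in range(2, 1 << (deg_p // 2 + 1)):
        if gf2_mod(p, d) == 0:
            return False
    return True


def fingerprint(bits: str, p: int) -> int:
    """f(A) = A(x) mod P(x); the first character of bits is a_1,
    the coefficient of x^(m-1)."""
    return gf2_mod(int(bits, 2), p)


if __name__ == "__main__":
    print(is_irreducible(0b111))   # x^2 + x + 1
    print(is_irreducible(0b101))   # x^2 + 1 = (x + 1)^2
    print(bin(fingerprint("1101", 0b1011)))  # modulus x^3 + x + 1
```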

9. Analysis
- Fix $S$: $n$ strings of length $m$
- Consider $Q_S(x) = \prod_{A \ne B \in S} \bigl(A(x) - B(x)\bigr)$
  - Collision $f(A) = f(B)$ $\Leftrightarrow$ $A(x) \equiv B(x) \pmod{P(x)}$ $\Rightarrow$ $Q_S(x) \equiv 0 \pmod{P(x)}$
  - Therefore $P(x)$ is a factor of $Q_S(x)$
- Collision Probability?
  - $\deg(Q_S) \le n^2 m$
  - number of irreducible degree-$k$ factors of $Q_S(x)$ is $< n^2 m / k$
  - Fact: number of irreducible degree-$k$ polynomials $> (2^k - 2^{k/2})/k$
  - $\Pr[\text{random } P(x) \text{ divides } Q_S(x)] < n^2 m / (2^k - 2^{k/2}) \approx n^2 m / 2^k$
- $\Pr[\text{fingerprints not distinct}] <$ …
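The counting steps on this slide chain into a single bound. The derivation below is a sketch of that arithmetic, assuming $Q_S$ denotes the product of the pairwise differences of the polynomials in $S$ (an assumption, chosen because it matches the stated degree $n^2 m$ and the implication that a collision forces $P(x)$ to divide $Q_S(x)$).

```latex
\begin{align*}
Q_S(x) &= \prod_{A \ne B \in S} \bigl(A(x) - B(x)\bigr),
  \qquad \deg Q_S \le n^2 m, \\
\#\{\text{irreducible degree-}k\text{ factors of } Q_S\}
  &\le \frac{\deg Q_S}{k} \le \frac{n^2 m}{k}, \\
\Pr\bigl[\,P(x) \mid Q_S(x)\,\bigr]
  &< \frac{n^2 m / k}{(2^k - 2^{k/2})/k}
   = \frac{n^2 m}{2^k - 2^{k/2}}
   \approx \frac{n^2 m}{2^k}.
\end{align*}
```

The bound is only meaningful when $2^k$ is well above $n^2 m$, that is, when $k$ exceeds $2\log_2 n + \log_2 m$ by a comfortable margin.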

10. Beneficial Properties
- Hardware-level implementation
  - $Z_2$-polynomials are the same as bit strings
  - simple shift-register operations
- Distributivity: $f(A + B) = f(A) + f(B)$ over $Z_2$
- Let $\circ$ denote concatenation
  - $f(A \circ B) = f(f(A) \circ B)$
  - $f(A \circ B) = A(x)\,x^{m} + B(x) \bmod P(x)$, where $m$ is the length of $B$
- Fingerprint sliding windows over strings at low incremental cost
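These identities are easy to spot-check on small inputs. The sketch below assumes the same integer encoding of Z_2 polynomials (bit i is the coefficient of x^i), uses the fixed modulus x^3 + x + 1 purely as an example, and introduces hypothetical helper names. It also shows the incremental (Horner-style) update that makes extending a fingerprint by one bit cheap.

```python
def gf2_mod(a: int, p: int) -> int:
    """Remainder of polynomial a modulo p over Z_2 (ints as bit vectors)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def concat(a: int, b: int, b_len: int) -> int:
    """Polynomial of the concatenated string: A(x) * x^b_len + B(x).
    Requires 0 <= b < 2^b_len."""
    return (a << b_len) | b


def fingerprint_bitwise(bits: str, p: int) -> int:
    """Fingerprint built one bit at a time: fp <- (fp * x + bit) mod P."""
    fp = 0
    for ch in bits:
        fp = gf2_mod((fp << 1) | int(ch), p)
    return fp


if __name__ == "__main__":
    P = 0b1011  # x^3 + x + 1, used here only as a small example modulus
    A, B, B_LEN = 0b110101, 0b0111, 4
    # Distributivity: addition over Z_2 is xor.
    print(gf2_mod(A ^ B, P) == gf2_mod(A, P) ^ gf2_mod(B, P))
    # Concatenation: only the fingerprint of the prefix is needed.
    print(gf2_mod(concat(A, B, B_LEN), P)
          == gf2_mod(concat(gf2_mod(A, P), B, B_LEN), P))
    # Incremental computation agrees with the direct one.
    print(fingerprint_bitwise("110101", P) == gf2_mod(A, P))
```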

11. Duplicate Document Detection
- Problem
  - Given: large collection of arbitrary documents
  - Identify: near-duplicate documents
- Web search engines
  - Proliferation of near-duplicate documents
    o Legitimate: mirrors, local copies, updates, …
    o Malicious: spam, spider-traps, dynamic URLs, …
    o Mistaken: spider errors
  - 30% of web pages are near-duplicates [Broder et al 1997]
  - Cost: RAM/disk, search quality, unhappy users
- Enterprise search: even larger amount of duplication
- SCAM: plagiarism detection [Shivakumar et al 1998]

12. Natural Approaches
- Fingerprinting?
  - only works for exact matches
  - here we must identify even near-duplicates
- Random Sampling?
  - sample substrings (phrases, sentences, etc.)
  - hope: similar documents $\Rightarrow$ similar samples
  - No: even two samples of the same document will differ
- Edit-distance?
  - metric for approximate string matching
  - expensive, even for one pair of strings
  - impossible at web scale (billions of documents)

13. Desiderata
- Storage: only small sketches of each document
- Computation: O(n log n) time on n documents
- Stream Processing: once the sketch is computed, the source is unavailable
- Error Guarantees
  - problem scale $\Rightarrow$ small biases have large impact
  - need formal guarantees; heuristics will not do

14. Basic Idea [Broder 1997]
- Shingling
  - dissect document into q-grams (shingles)
  - represent documents by shingle-sets
  - near-duplicates $\Rightarrow$ shingle-set intersection is large
  - reduce problem to set intersection
- Set Intersection
  - fingerprints of shingles
  - min-hash to estimate intersection sizes

15. Shingling
- Shingle: q contiguous tokens/words (q-gram)
- Consider the following "document": a rose is a rose is a rose
- Choose q = 4 $\Rightarrow$ get the multi-set of shingles
  - a rose is a
  - rose is a rose
  - is a rose is
  - a rose is a
  - rose is a rose
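The shingling step can be sketched in a few lines. This version assumes whitespace tokenization (the slide does not specify a tokenizer) and returns shingles in document order, keeping duplicates so that the result is a multi-set; the function name is hypothetical.

```python
def shingles(text: str, q: int) -> list[str]:
    """All runs of q contiguous whitespace-separated tokens, in order.
    Duplicates are kept; a document shorter than q tokens yields []."""
    tokens = text.split()
    return [" ".join(tokens[i:i + q]) for i in range(len(tokens) - q + 1)]


if __name__ == "__main__":
    for s in shingles("a rose is a rose is a rose", 4):
        print(s)
```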

16. [Diagram: Doc → shingling → Multiset of Shingles → fingerprint → Multiset of Fingerprints]
- Documents $\Rightarrow$ sets of 64-bit fingerprints
- Fingerprints?
  - Use Rabin fingerprints
  - Fingerprint space $U = [0, \ldots, N-1]$
  - In practice, use 64-bit fingerprints, i.e., $N = 2^{64}$
  - Result: uniformity in length of strings

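17. Similarity of Documents
- [Diagram: Doc A → $S_A$, Doc B → $S_B$]
- Jaccard measure: similarity of $S_A, S_B \subseteq U = [0 \ldots N-1]$
  - $\mathrm{sim}(S_A, S_B) = \dfrac{|S_A \cap S_B|}{|S_A \cup S_B|}$
  - $\mathrm{con}(S_A, S_B) = \dfrac{|S_A \cap S_B|}{|S_A|}$
- Claim: A and B are near-duplicates if $\mathrm{sim}(S_A, S_B)$ is high
- Claim: A is contained in B if $\mathrm{con}(S_A, S_B)$ is high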

18. Remarks
- Multiplicities of q-grams: could retain or ignore
  - trade off efficiency against precision
- Shingle size $q \in [3 \ldots 10]$
  - Short shingles $\Rightarrow$ increase similarity of unrelated documents
    o With q = 1, $\mathrm{sim}(S_A, S_B) = 1$ when A is a permutation of B
    o Need larger q to be sensitive to permutation changes
  - Long shingles $\Rightarrow$ small random changes have larger impact
- Similarity Measure
  - Similarity is non-transitive, non-metric
  - But dissimilarity $1 - \mathrm{sim}(S_A, S_B)$ is a metric [Charikar 02]
- [Ukkonen 92]: relates q-gram and edit distance

19. Example
- A = "a rose is a rose is a rose"
- B = "a rose is a flower which is a rose"
- Preserving multiplicity
  - q = 1 $\Rightarrow$ $\mathrm{sim}(S_A, S_B) = 0.7$
    o $S_A$ = {a, a, a, is, is, rose, rose, rose}
    o $S_B$ = {a, a, a, is, is, rose, rose, flower, which}
  - q = 2 $\Rightarrow$ $\mathrm{sim}(S_A, S_B) = 0.5$
  - q = 3 $\Rightarrow$ $\mathrm{sim}(S_A, S_B) = 0.3$
- Disregarding multiplicity
  - q = 1 $\Rightarrow$ $\mathrm{sim}(S_A, S_B) = 0.6$
  - q = 2 $\Rightarrow$ $\mathrm{sim}(S_A, S_B) = 0.5$
  - q = 3 $\Rightarrow$ $\mathrm{sim}(S_A, S_B) = 3/7 \approx 0.4286$
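The six similarity values in this example can be reproduced directly. The sketch below assumes whitespace tokenization and treats the multiplicity-preserving case as multiset intersection and union (element-wise minimum and maximum of counts), which is one natural reading of "preserving multiplicity". The function names are hypothetical.

```python
from collections import Counter


def shingles(text: str, q: int) -> list[str]:
    """q-grams of whitespace-separated tokens, duplicates kept."""
    tokens = text.split()
    return [" ".join(tokens[i:i + q]) for i in range(len(tokens) - q + 1)]


def sim_multiset(a: str, b: str, q: int) -> float:
    """Jaccard similarity on multisets: counts combine by min / max."""
    ca, cb = Counter(shingles(a, q)), Counter(shingles(b, q))
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 0.0


def sim_set(a: str, b: str, q: int) -> float:
    """Jaccard similarity on plain sets (multiplicity disregarded)."""
    sa, sb = set(shingles(a, q)), set(shingles(b, q))
    union = len(sa | sb)
    return len(sa & sb) / union if union else 0.0


if __name__ == "__main__":
    A = "a rose is a rose is a rose"
    B = "a rose is a flower which is a rose"
    for q in (1, 2, 3):
        print(q, round(sim_multiset(A, B, q), 4), round(sim_set(A, B, q), 4))
```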

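20. Min-Hashing
- Consider $S_A, S_B \subseteq U$
  - Pick a random permutation $\pi$ of $U$
  - Define the min-hash elements $\pi^{-1}(\min\{\pi(S_A)\})$ and $\pi^{-1}(\min\{\pi(S_B)\})$
  - Meaning? The minimal element of each set under permutation $\pi$
- Lemma: $\Pr[\text{the two min-hash elements are equal}] = \dfrac{|S_A \cap S_B|}{|S_A \cup S_B|} = \mathrm{sim}(S_A, S_B)$
  - Let $\delta = \min\{\pi(S_A \cup S_B)\}$
  - Claim: the two min-hash elements are equal $\Leftrightarrow$ $\pi^{-1}(\delta) \in S_A \cap S_B$
  - Clearly, $\pi^{-1}(\delta)$ is equally likely to be any element of $S_A \cup S_B$, which gives the lemma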
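For a tiny universe, the min-hash lemma (the probability that two sets share the same minimal element under a random permutation equals their Jaccard similarity) can be verified exactly by enumerating every permutation. This is a sketch for illustration only: the function names are hypothetical, a permutation is represented as a tuple `pi` with `pi[x]` the image of `x`, and the enumeration is feasible only for a handful of elements.

```python
from fractions import Fraction
from itertools import permutations


def min_hash(s: set[int], pi: tuple[int, ...]) -> int:
    """Element of s whose image under the permutation pi is smallest."""
    return min(s, key=lambda x: pi[x])


def match_probability(sa: set[int], sb: set[int], universe_size: int) -> Fraction:
    """Exact Pr[min_hash(sa) == min_hash(sb)] over all permutations
    of {0, ..., universe_size - 1}."""
    hits = total = 0
    for pi in permutations(range(universe_size)):
        total += 1
        hits += min_hash(sa, pi) == min_hash(sb, pi)
    return Fraction(hits, total)


if __name__ == "__main__":
    SA, SB = {0, 1, 2}, {1, 2, 3, 4}
    jaccard = Fraction(len(SA & SB), len(SA | SB))
    print(match_probability(SA, SB, 6), jaccard)
```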

21. Min-Hashing
- Similarity Sketches
  - Succinct representation of fingerprint sets $S_A$
  - Allows efficient estimation of $\mathrm{sim}(S_A, S_B)$
  - Basic idea: use min-hash of fingerprints
- $\mathrm{sk}(A)$ = $k$ minimal elements under $\pi(S_A)$
- Claim: $E[\,\mathrm{sim}(\mathrm{sk}(A), \mathrm{sk}(B))\,] = \mathrm{sim}(S_A, S_B)$
  - For each element of $\mathrm{sk}(A)$, $\mathrm{sk}(B)$ …
- Observe
  - sketch-similarity is an unbiased estimator of similarity
  - to reduce variance, use larger $k$

22. Remarks
- Implementation
  - shingle/fingerprint/sketch documents in streams
  - Issue: cost of pairwise comparison of sketches?
    o cluster sketch-streams [Broder et al, Guha et al]
    o Open? Hashing sketches to identify similarity
- [Broder-Mitzenmacher 99]: Min-Hash is the only unbiased estimator
- [Indyk-Motwani 99]: Locality-Sensitive Hash
  - collisions more likely for similar items
  - Min-Hash is a special case

23. Multiple Permutations
- Better Variance Reduction
  - Instead of larger $k$, stick with $k = 1$
  - Multiple, independent permutations
- Sketch Construction
  - Pick $p$ random permutations of $U$: $\pi_1, \pi_2, \ldots, \pi_p$
  - $\mathrm{sk}(A)$ = minimal elements under $\pi_1(S_A), \ldots, \pi_p(S_A)$
- Claim: $E[\,\mathrm{sim}(\mathrm{sk}(A), \mathrm{sk}(B))\,] = \mathrm{sim}(S_A, S_B)$
  - Earlier lemma $\Rightarrow$ true for $p = 1$
  - Linearity of expectations
  - Variance reduction: independence of $\pi_1, \ldots, \pi_p$
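A sketch of the multiple-permutation construction follows. Truly random permutations are replaced here by linear maps x -> (a*x + b) mod M with M prime, which are permutations of [0, M) but not min-wise independent, so the estimate is only approximate. This substitution, the modulus 2^61 - 1, the requirement that fingerprints lie in [0, M), and all function names are assumptions made for illustration. The sketch stores the minimum image under each map, which matches exactly when the underlying minimal elements match because each map is a bijection.

```python
import random

MODULUS = (1 << 61) - 1  # a Mersenne prime; assumed to exceed every fingerprint


def make_hashes(p: int, seed: int = 0) -> list[tuple[int, int]]:
    """p pairs (a, b) with a != 0, each defining x -> (a*x + b) mod MODULUS."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MODULUS), rng.randrange(MODULUS)) for _ in range(p)]


def sketch(fingerprints: set[int], hashes: list[tuple[int, int]]) -> list[int]:
    """One minimum per hash function (the k = 1 min-hash, repeated p times)."""
    return [min((a * x + b) % MODULUS for x in fingerprints) for a, b in hashes]


def estimate_sim(sk_a: list[int], sk_b: list[int]) -> float:
    """Fraction of coordinates on which the two sketches agree."""
    return sum(u == v for u, v in zip(sk_a, sk_b)) / len(sk_a)


if __name__ == "__main__":
    rng = random.Random(1)
    universe = rng.sample(range(10 ** 9), 3000)
    s_a, s_b = set(universe[:2000]), set(universe[1000:])  # true Jaccard = 1/3
    hashes = make_hashes(200)
    print(estimate_sim(sketch(s_a, hashes), sketch(s_b, hashes)))
```

Each document is reduced to p integers, and two documents are compared in O(p) time regardless of their length.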

24. Min-Wise Independent Permutations
- Problem
  - A truly random $\pi$ over $U = [0 \ldots N-1]$ is infeasible
  - But do we really need true randomness?
- Solution
  - Poly-size family of permutations $F \subseteq S_N$ over $U$
  - Choosing/representing a random $\pi \in F$ is easy
- Min-Wise Independence (MWI) Property: for all sets $X \subseteq U$, for all $x \in X$, and $\pi$ chosen uniformly from $F$,
  $\Pr[\min\{\pi(X)\} = \pi(x)] = \dfrac{1}{|X|}$
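The min-wise independence property can be tested by brute force on a very small universe. The checker below is a sketch with a hypothetical name; it takes a family of permutations, each given as a tuple `pi` with `pi[x]` the image of `x`, and verifies that every element of every non-empty subset is the minimum with probability exactly 1/|X|. It confirms that the full symmetric group has the property and that the family of cyclic shifts does not.

```python
from fractions import Fraction
from itertools import combinations, permutations


def is_min_wise_independent(family: list[tuple[int, ...]], n: int) -> bool:
    """True iff, for pi uniform over family, every x in every non-empty
    X subset of {0..n-1} satisfies Pr[min pi(X) = pi(x)] = 1/|X|."""
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            for x in subset:
                hits = sum(
                    1 for pi in family
                    if min(pi[y] for y in subset) == pi[x]
                )
                if Fraction(hits, len(family)) != Fraction(1, size):
                    return False
    return True


if __name__ == "__main__":
    n = 4
    all_perms = list(permutations(range(n)))
    shifts = [tuple((x + b) % n for x in range(n)) for b in range(n)]
    print(is_min_wise_independent(all_perms, n))
    print(is_min_wise_independent(shifts, n))
```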

25. Minimum-Size MWI Families
- [Broder et al 98]
  - Upper/lower bounds of $\mathrm{lcm}(1, 2, \ldots, N)$
  - Problem: exponential in $N$
- Approximate MWI Families
  - Relax to $\Pr[\min\{\pi(X)\} = \pi(x)] = \dfrac{1 \pm \epsilon}{|X|}$
  - Non-constructive: polynomial-size
  - Constructive: size $N^{O(\log 1/\epsilon)}$ [Indyk 99]
- In practice, 2-universal hashes work well!

26. References I
- M. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Harvard University (1981).
- A. Broder. Some applications of Rabin's fingerprinting method. Sequences II (1993).
- A. Broder. On the Resemblance and Containment of Documents. SEQUENCES 1997.
- A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic Clustering of the Web. WWW 1997.
- N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. WebDB 1998.
- A. Broder. Identifying and Filtering Near-Duplicate Documents. CPM 2000.

27. References II
- E. Ukkonen. Approximate String Matching with q-grams and Maximal Matches. Theoretical Computer Science (1992).
- A. Broder and M. Mitzenmacher. Completeness and Robustness Properties of Min-Wise Independent Permutations.
- A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-Wise Independent Permutations. JCSS (2000).
- P. Indyk. A Small Approximately min-wise Independent Family of Hash Functions. SODA 1999.
- P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
- A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.
- M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. STOC 2002.

