Three Essential Techniques for Similar Documents

Similar presentations
Similarity and Distance Sketching, Locality Sensitive Hashing

Data Mining of Very Large Data
Near-Duplicates Detection
Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18.
High Dimensional Search Min-Hashing Locality Sensitive Hashing
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Randomized / Hashing Algorithms
Min-Hashing, Locality Sensitive Hashing Clustering
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
Distance Measures LS Families of Hash Functions S-Curves
Near Duplicate Detection
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
Finding Similar Items.
1 Finding Similar Pairs Divide-Compute-Merge Locality-Sensitive Hashing Applications.
1 Locality-Sensitive Hashing Basic Technique Hamming-LSH Applications.
Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing Locality-Sensitive Hashing.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing.
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing.
Document duplication (exact or approximate) Paolo Ferragina Dipartimento di Informatica Università di Pisa Slides only!
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Randomized / Hashing Algorithms Shannon Quinn (with thanks to William Cohen of Carnegie Mellon University, and J. Leskovec, A. Rajaraman, and J. Ullman.
DATA MINING LECTURE 6 Sketching, Locality Sensitive Hashing.
Jeffrey D. Ullman Stanford University. 2  Generalized LSH is based on some kind of “distance” between points.  Similar points are “close.”  Example:
Shingling Minhashing Locality-Sensitive Hashing
Big Data Infrastructure
Locality-sensitive hashing and its applications
Chapter 14 Genetic Algorithms.
Copyright © Cengage Learning. All rights reserved.
The Acceptance Problem for TMs
CS276A Text Information Retrieval, Mining, and Exploitation
Definition, Domain,Range,Even,Odd,Trig,Inverse
Near Duplicate Detection
Finding Similar Items Jeffrey D. Ullman Application: Similar Documents
Clustering Algorithms
PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Jeffrey D. Ullman Stanford University.
Chapter 2 Sets and Functions.
Streaming & sampling.
Sketching, Locality Sensitive Hashing
Haim Kaplan and Uri Zwick
Hashing Alexandra Stefan.
Chapter 25 Comparing Counts.
Finding Similar Items: Locality Sensitive Hashing
Finding Similar Items: Locality Sensitive Hashing
Shingling Minhashing Locality-Sensitive Hashing
Theory of Locality Sensitive Hashing
Evaluation of Relational Operations
More LSH Application: Entity Resolution Application: Similar News Articles Distance Measures LS Families of Hash Functions LSH for Cosine Distance Jeffrey.
More LSH LS Families of Hash Functions LSH for Cosine Distance Special Approaches for High Jaccard Similarity Jeffrey D. Ullman Stanford University.
Fitting Curve Models to Edges
Hash-Based Improvements to A-Priori
RS – Reed Solomon List Decoding.
Indexing and Hashing Basic Concepts Ordered Indices
Locality Sensitive Hashing
Copyright © Cengage Learning. All rights reserved.
Chapter 26 Comparing Counts.
Minwise Hashing and Efficient Search
President’s Day Lecture: Advanced Nearest Neighbor Search
CS 3343: Analysis of Algorithms
The Selection Problem.
Calibration and homographies
Error Correction Coding
Near Neighbor Search in High Dimensional Data (1)
Complexity Theory: Foundations
Lecture-Hashing.
Locality-Sensitive Hashing
Presentation transcript:

Three Essential Techniques for Similar Documents Shingling: convert documents, emails, etc., to sets. Minhashing: convert large sets to short signatures, while preserving similarity. Locality-sensitive hashing: focus on pairs of signatures likely to be similar.

The Big Picture Document → Shingling → the set of strings of length k that appear in the document → Minhashing → signatures (short integer vectors that represent the sets and reflect their similarity) → Locality-sensitive hashing → candidate pairs (those pairs of signatures that we need to test for similarity).

Minhashing

Four Types of Rows Given columns C1 and C2, a row falls into one of four types according to its bits in the two columns:
  type a: C1 = 1, C2 = 1
  type b: C1 = 1, C2 = 0
  type c: C1 = 0, C2 = 1
  type d: C1 = 0, C2 = 0
Also, let a = # rows of type a, etc. Note Sim(C1, C2) = a/(a+b+c).

Min-hashing Define a hash function h as follows: permute the rows of the matrix randomly (important: the same permutation for all the columns/vectors!). Let C be a column (= a document). Then h(C) = the number of the first row (in the permuted order) in which column C has a 1.

Minhash Signatures Use several (e.g., 100) independent hash functions to create a signature.

Minhashing Example (figure: an input matrix of 0/1 columns, several row permutations, and the resulting signature matrix M.)

Surprising Property The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2). Both are a/(a+b+c)! Why? Look down the permuted columns C1 and C2 until we see a 1. If it's a type-a row, then h(C1) = h(C2); if it's a type-b or type-c row, then not.
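To make the property concrete, here is a minimal Python sketch (not from the slides; the two example columns and the number of rows are made up) that draws many random row permutations and checks that the fraction of permutations with h(C1) = h(C2) is close to Sim(C1, C2) = a/(a+b+c).

```python
import random

def minhash(column, perm):
    """Index (in permuted order) of the first row in which the column has a 1."""
    return min(perm[row] for row in column)

# Columns represented as the set of row numbers that hold a 1.
C1 = {0, 2, 3, 6, 8}
C2 = {0, 3, 4, 6, 9}
n_rows = 10

a = len(C1 & C2)            # rows of type a (1 in both columns)
bc = len(C1 ^ C2)           # rows of type b or c (1 in exactly one column)
true_sim = a / (a + bc)     # Jaccard similarity a/(a+b+c)

rows = list(range(n_rows))
trials, agree = 20_000, 0
for _ in range(trials):
    random.shuffle(rows)                                  # a random permutation of the rows
    perm = {row: pos for pos, row in enumerate(rows)}
    if minhash(C1, perm) == minhash(C2, perm):
        agree += 1

print(f"Sim(C1, C2) = {true_sim:.3f}")
print(f"Estimated Pr[h(C1) = h(C2)] = {agree / trials:.3f}")
```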

Similarity for Signatures The similarity of signatures is the fraction of the hash functions in which they agree. Because of the minhash property, the similarity of columns is the same as the expected similarity of their signatures.

Min Hashing – Example (figure: the same input matrix and signature matrix M.) Comparing the true column similarities with those estimated from the signatures:
  pair:     1-3    2-4    1-2    3-4
  col/col:  0.75   0.75   0      0
  sig/sig:  0.67   1.00   0      0

Implementation – (1) Suppose 1 billion rows. It is hard to pick a random permutation of 1…1 billion: representing a random permutation requires 1 billion entries, and accessing rows in permuted order leads to thrashing.

Implementation – (2) Simulate the effect of a random permutation by a random hash function. A good approximation to permuting rows: pick (say) 100 hash functions h1, h2, … For rows r and s, if hi(r) < hi(s), then r appears before s in permutation i. We will use the same name for the hash function and the corresponding min-hash function.

Implementation – (3) For each column c and each hash function hi, keep a "slot" M(i, c), initialized to infinity. Intent: M(i, c) will become the smallest value of hi(r) for which column c has 1 in row r. I.e., hi(r) gives the order of rows for the i-th permutation. Sort the input matrix so it is ordered by rows, so we can iterate by reading rows sequentially from disk.

Implementation – (4)
for each row r:
    for each column c:
        if c has 1 in row r:
            for each hash function hi:
                if hi(r) is a smaller value than M(i, c):
                    M(i, c) := hi(r)
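A runnable Python version of this pseudocode might look as follows. It is a sketch: the concrete form of the hash functions, hi(r) = (ai·r + bi) mod p with p a large prime, is an assumption of this example, not something the slide specifies.

```python
import random

INF = float("inf")
PRIME = 2_147_483_647   # a large prime (assumed choice for this example)

def make_hash_functions(k):
    """k random hash functions h_i(r) = (a_i * r + b_i) mod PRIME."""
    return [(random.randrange(1, PRIME), random.randrange(PRIME)) for _ in range(k)]

def compute_signatures(rows, n_cols, hash_funcs):
    """rows: iterable of (row_number, collection of columns with a 1 in that row)."""
    M = [[INF] * n_cols for _ in hash_funcs]          # slot M[i][c], initialized to infinity
    for r, cols_with_one in rows:
        hashed = [(a * r + b) % PRIME for (a, b) in hash_funcs]
        for c in cols_with_one:
            for i, h_r in enumerate(hashed):
                if h_r < M[i][c]:                     # keep the smallest h_i(r) seen so far
                    M[i][c] = h_r
    return M
```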

Example Use h(x) = x mod 5 and g(x) = (2x+1) mod 5 on the input matrix
  Row  C1  C2
   1    1   0
   2    0   1
   3    1   1
   4    1   0
   5    0   1
Processing the rows in order, the slots evolve as follows (first value = column C1, second = column C2; '-' = still infinity):
  after row 1 (h(1)=1, g(1)=3):  h: 1, -   g: 3, -
  after row 2 (h(2)=2, g(2)=0):  h: 1, 2   g: 3, 0
  after row 3 (h(3)=3, g(3)=2):  h: 1, 2   g: 2, 0
  after row 4 (h(4)=4, g(4)=4):  h: 1, 2   g: 2, 0
  after row 5 (h(5)=0, g(5)=1):  h: 1, 0   g: 2, 0
So the final signatures are Sig(C1) = (1, 2) and Sig(C2) = (0, 0).
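The trace above can be reproduced with a few lines of Python (a sketch; only the data from the example is used):

```python
# Rows 1..5 of the example; h(x) = x mod 5, g(x) = (2x + 1) mod 5.
matrix = {1: (1, 0), 2: (0, 1), 3: (1, 1), 4: (1, 0), 5: (0, 1)}   # row -> (C1, C2)
hash_funcs = [lambda x: x % 5, lambda x: (2 * x + 1) % 5]

INF = float("inf")
M = [[INF, INF] for _ in hash_funcs]          # one slot per (hash function, column)
for r, bits in matrix.items():
    for c, bit in enumerate(bits):
        if bit == 1:
            for i, h in enumerate(hash_funcs):
                M[i][c] = min(M[i][c], h(r))

print(M)   # [[1, 0], [2, 0]]  ->  Sig(C1) = (1, 2), Sig(C2) = (0, 0)
```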

Implementation – (5) Often, data is given by column, not row. E.g., columns = documents, rows = shingles. If so, sort matrix once so it is by row. And always compute hi (r ) only once for each row.

Locality-Sensitive Hashing

Finding Similar Pairs Suppose we have, in main memory, data representing a large number of objects. They may be the objects themselves, or they may be signatures, as in minhashing. We want to compare each to each, finding those pairs that are sufficiently similar.

Checking All Pairs is Hard While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns. Example: 10^6 columns implies about 5×10^11 column-comparisons. At 1 microsecond per comparison, that is 5×10^5 seconds, roughly 6 days.

Locality-Sensitive Hashing General idea: use a function f(x, y) that tells whether or not x and y are a candidate pair: a pair of elements whose similarity must be evaluated. For minhash matrices: hash columns to many buckets, and make elements of the same bucket candidate pairs.

The Big Picture Document → Shingling → the set of strings of length k that appear in the document → Minhashing → signatures (short integer vectors that represent the sets and reflect their similarity) → Locality-sensitive hashing → candidate pairs (those pairs of signatures that we need to test for similarity).

Candidate Generation From Minhash Signatures Pick a similarity threshold s, e.g., s = 0.8. Goal: find documents with Jaccard similarity at least s. Columns c and d are a candidate pair if their signatures agree in at least a fraction s of the rows, i.e., M(i, c) = M(i, d) for at least a fraction s of the values of i.
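As a small sketch (assuming the signature matrix is stored as M[i][c], as in the implementation slides above), the candidate test is just an agreement count:

```python
def signature_agreement(M, c, d):
    """Fraction of minhash rows in which signature columns c and d agree."""
    return sum(M[i][c] == M[i][d] for i in range(len(M))) / len(M)

def is_candidate(M, c, d, s=0.8):
    """Columns c and d are a candidate pair if they agree in >= fraction s of the rows."""
    return signature_agreement(M, c, d) >= s
```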

LSH for Minhash Signatures Big idea: hash columns of signature matrix M several times. Arrange that (only) similar columns are likely to hash to the same bucket. Candidate pairs are those that hash at least once to the same bucket.

Partition Into Bands (figure: the signature matrix M divided into b bands of r rows each; the r values of one column within one band form a mini-signature.)

Buckets (figure: the columns of one band of matrix M hashed into buckets.) Columns 2 and 6 land in the same bucket, so they are probably identical in this band; columns 6 and 7 land in different buckets, so they are surely different in it.

Partition into Bands – (2) Divide matrix M into b bands of r rows. For each band, hash its portion of each column to a hash table with k buckets. Make k as large as possible. Candidate column pairs are those columns that hash to the same bucket for ≥ 1 band. Tune b and r to catch most similar pairs, but few nonsimilar pairs.
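A compact Python sketch of this banding step (assumptions: M is the signature matrix indexed as M[i][c], b·r equals the number of rows of M, and using the band tuple itself as the bucket key plays the role of "a hash table with k buckets" with k very large):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(M, b, r):
    """Columns that hash to the same bucket in at least one of the b bands."""
    n_cols = len(M[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            band_slice = tuple(M[band * r + i][c] for i in range(r))  # this band's portion of column c
            buckets[band_slice].append(c)
        for cols in buckets.values():
            candidates.update(combinations(sorted(cols), 2))          # all pairs sharing a bucket
    return candidates
```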

Simplifying Assumption There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band. Hereafter, we assume that "same bucket" means "identical in that band." The assumption is needed only to simplify the analysis, not for the correctness of the algorithm.

Example of Bands Suppose 100,000 columns (documents) and 100 min-hash signature entries per document, so the signatures take 40 MB. The 5,000,000,000 pairs of signatures can take a while to compare. Let's choose b = 20 and r = 5: 20 bands of 5 signature rows each. Goal: find pairs of documents that are at least 80% similar.

Suppose C1, C2 are 80% Similar Probability that C1, C2 are identical in one particular band: (0.8)^5 = 0.328. Probability that C1, C2 are not identical in any of the 20 bands: (1 - 0.328)^20 = 0.00035. I.e., about 1/3000 of the 80%-similar column pairs are false negatives; we would find 99.965% of the truly similar pairs of documents.

Suppose C1, C2 are Only 30% Similar Probability that C1, C2 are identical in any one particular band: (0.3)^5 = 0.00243. Probability that C1, C2 are identical in ≥ 1 of the 20 bands: ≤ 20 × 0.00243 ≈ 0.0486. In other words, approximately 4.86% of pairs of docs with similarity 30% end up becoming candidate pairs: false positives.

LSH Involves a Tradeoff Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives. Example: if we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up.

Analysis of LSH – What We Want (figure: the ideal step function. Plotting the probability of sharing a bucket against the similarity s of two sets, we would like probability 1 whenever s > t and no chance whenever s < t, for some threshold t.)

What One Band of One Row Gives You (figure: with a single minhash row, the probability of sharing a bucket grows linearly with the similarity s of the two sets. Remember: the probability of equal hash values equals the similarity, so there is no sharp threshold at t.)

b Bands, r rows/band

What b Bands of r Rows Gives You (figure: the S-curve of the probability of sharing a bucket versus the similarity s of two sets.) Reading the formula from the inside out: s^r is the probability that all r rows of a band are equal; 1 - s^r that some row of the band is unequal; (1 - s^r)^b that no band is identical; and 1 - (1 - s^r)^b that at least one band is identical, making the pair a candidate. The threshold of the S-curve is at approximately t ≈ (1/b)^(1/r).

Example: b = 20, r = 5
  s    1-(1-s^r)^b
  .2   .006
  .3   .047
  .4   .186
  .5   .470
  .6   .802
  .7   .975
  .8   .9996
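The table can be reproduced directly (values printed below are rounded slightly differently, e.g., .0064 for s = .2):

```python
b, r = 20, 5
for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"s = {s:.1f}  ->  1-(1-s^r)^b = {1 - (1 - s**r) ** b:.4f}")
```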

LSH Summary Tune to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures. Check in main memory that candidate pairs really do have similar signatures. Optional: In another pass through data, check that the remaining candidate pairs really represent similar sets .

Three Essential Techniques for Similar Documents Shingling: convert documents, emails, etc., to sets. Minhashing: convert large sets to short signatures, while preserving similarity. Locality-sensitive hashing: focus on pairs of signatures likely to be similar.

The Big Picture Document → Shingling → the set of strings of length k that appear in the document → Minhashing → signatures (short integer vectors that represent the sets and reflect their similarity) → Locality-sensitive hashing → candidate pairs (those pairs of signatures that we need to test for similarity).

Key Idea Behind LSH Given documents (i.e., shingle sets) D1 and D2, suppose we can find a hash function h such that: if sim(D1, D2) is high, then with high probability h(D1) = h(D2); if sim(D1, D2) is low, then with high probability h(D1) ≠ h(D2). Then we could hash documents into buckets and expect that "most" pairs of near-duplicate documents would hash into the same bucket. Finally, compare the pairs of docs in each bucket to see if they are really near-duplicates.

Minhashing Clearly, the hash function depends on the similarity metric, and not all similarity metrics have a suitable hash function. Fortunately, there is a suitable hash function for Jaccard similarity: min-hashing.

Theory of LSH

Theory of LSH We have used LSH to find similar documents; in reality, columns with high Jaccard similarity in large, sparse matrices (e.g., customer/item purchase histories). Can we use LSH for other distance measures, e.g., Euclidean distance or cosine distance? Let's generalize what we've learned!

Families of Hash Functions A "hash function" here is any function that takes two elements and says whether or not they are "equal" (really, whether they are candidates for similarity checking). Shorthand: h(x) = h(y) means "h says x and y are equal." A family of hash functions is any set of functions of this kind: a (large) set of related hash functions generated by some mechanism, from which we can efficiently pick a hash function at random. Example of a family: for min-hash signatures, we get one minhash function for each permutation of rows.

LS Families of Hash Functions Suppose we have a space S of points with a distance measure d. A family H of hash functions is said to be (d1,d2,p1,p2)-sensitive if for any x and y in S : If d(x,y) < d1, then prob. over all h in H, that h(x) = h(y) is at least p1. If d(x,y) > d2, then prob. over all h in H, that h(x) = h(y) is at most p2.

LS Families: Illustration (figure: collision probability as a function of distance. For distances below d1 the probability that h(x) = h(y) is high, at least p1; for distances above d2 it is low, at most p2; between d1 and d2 there is no guarantee.)

Example: LS Family Let S = sets, d = Jaccard distance, H is formed from the minhash functions for all permutations. Then Prob[h(x)=h(y)] = 1-d(x,y). Restates theorem about Jaccard similarity and minhashing in terms of Jaccard distance rather than similarities.

Example: LS Family – (2) Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d. If the distance is < 1/3 (so the similarity is > 2/3), then the probability that the minhash values agree is > 2/3; symmetrically, if the distance is > 2/3 (similarity < 1/3), the probability of agreement is < 1/3.

Comments For Jaccard similarity, minhashing gives us a (d1,d2,(1-d1),(1-d2))-sensitive family for any d1 < d2. The theory leaves unknown what happens to pairs that are at distance between d1 and d2. Consequence: no guarantees about fraction of false positives in that range.

Amplifying a LS-Family Can we reproduce the “S-curve” effect we saw before for any LS family? The “bands” technique we learned for signature matrices carries over to this more general setting. Goal: the “S-curve” effect seen there. AND construction like “rows in a band.” OR construction like “many bands.”

AND of Hash Functions Given family H, construct family H’ consisting of r functions from H. For h = [h1,…,hr] in H’, h(x)=h(y) if and only if hi(x)=hi(y) for all i. Theorem: If H is (d1,d2,p1,p2)-sensitive, then H’ is (d1,d2,(p1)r,(p2)r)-sensitive. Proof: Use fact that hi ’s are independent.

OR of Hash Functions Given family H, construct family H’ consisting of b functions from H. For h = [h1,…,hb] in H’, h(x)=h(y) if and only if hi(x)=hi(y) for some i. Theorem: If H is (d1,d2,p1,p2)-sensitive, then H’ is (d1,d2,1-(1-p1)b,1-(1-p2)b)-sensitive.
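One way to see the AND and OR constructions in code is to treat a "hash function" as a predicate eq(x, y) and a family as a generator of such predicates. The sketch below is illustrative (the names, the predicate representation, and the small example universe are assumptions, not from the slides); it builds both constructions generically and instantiates them with the minhash family over small sets.

```python
import random

def and_construction(family, r):
    """H': pick r functions from the family; x, y are 'equal' only if all agree."""
    def make():
        hs = [family() for _ in range(r)]
        return lambda x, y: all(h(x, y) for h in hs)
    return make

def or_construction(family, b):
    """H': pick b functions from the family; x, y are 'equal' if any one agrees."""
    def make():
        hs = [family() for _ in range(b)]
        return lambda x, y: any(h(x, y) for h in hs)
    return make

def minhash_family(universe):
    """One member per random permutation of the universe of rows."""
    items = sorted(universe)
    def make():
        perm = {e: i for i, e in enumerate(random.sample(items, len(items)))}
        h = lambda s: min(perm[e] for e in s)
        return lambda x, y: h(x) == h(y)
    return make

base = minhash_family(range(20))
amplified = or_construction(and_construction(base, r=5), b=20)   # r-way AND, then b-way OR
# Jaccard similarity of these two sets is 0.6, so this is True with probability ~1-(1-0.6^5)^20.
print(amplified()({1, 2, 3, 4}, {1, 2, 3, 5}))
```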

Effect of AND and OR Constructions AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not. OR makes all probabilities grow, but by choosing b correctly, we can make the upper probability approach 1 while the lower does not.

Composing Constructions As for the signature matrix, we can use the AND construction followed by the OR construction, or vice versa, or any alternating sequence of ANDs and ORs. An r-way AND construction followed by a b-way OR construction is exactly what we did with minhashing: take points x and y such that Pr[h(x) = h(y)] = p; a single h from H makes (x, y) a candidate pair with probability p, while the composed construction makes (x, y) a candidate pair with probability 1 - (1 - p^r)^b: the S-curve!

AND-OR Composition Each of the two probabilities p is transformed into 1 - (1 - p^r)^b, the "S-curve" studied before. Example: take H and construct H' by the AND construction with r = 4; then, from H', construct H'' by the OR construction with b = 4.

Table for the Function 1-(1-p^4)^4
  p    1-(1-p^4)^4
  .2   .0064
  .3   .0320
  .4   .0985
  .5   .2275
  .6   .4260
  .7   .6666
  .8   .8785
  .9   .9860
Example: transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.8785,.0064)-sensitive family.

OR-AND Composition Apply a b-way OR construction followed by an r-way AND construction. Each of the two probabilities p is transformed into (1-(1-p)^b)^r: the same S-curve, mirrored horizontally and vertically. Example: take H and construct H' by the OR construction with b = 4; then, from H', construct H'' by the AND construction with r = 4.

Table for the Function (1-(1-p)^4)^4
  p    (1-(1-p)^4)^4
  .1   .0140
  .2   .1215
  .3   .3334
  .4   .5740
  .5   .7725
  .6   .9015
  .7   .9680
  .8   .9936
Example: transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9936,.1215)-sensitive family.

Cascading Constructions Example: Apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction. Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9999996,.0008715)-sensitive family. Note this family uses 256 of the original hash functions.
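The numbers on this slide can be checked with a short calculation: AND-OR with r = b = 4 gives 1-(1-p^4)^4, OR-AND gives (1-(1-p)^4)^4, and the cascade applies OR-AND first, then AND-OR.

```python
def and_or(p, r=4, b=4):
    return 1 - (1 - p**r) ** b

def or_and(p, b=4, r=4):
    return (1 - (1 - p) ** b) ** r

for p in (0.2, 0.8):
    print(p, round(and_or(or_and(p)), 7))
# 0.2 -> ~0.0008715 and 0.8 -> ~0.9999996, matching the (.2,.8,.9999996,.0008715) family.
```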

General Use of S-Curves For each S-curve 1-(1-p^r)^b, there is a threshold t for which 1-(1-t^r)^b = t. Probabilities above t are increased; probabilities below t are decreased. You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t. Iterate as you like.

Summary Pick any two distances x < y. Start with an (x, y, (1-x), (1-y))-sensitive family. Apply constructions to produce an (x, y, p, q)-sensitive family, where p is almost 1 and q is almost 0. The closer to 1 and 0 we want p and q to be, the more hash functions must be used.

LSH for Cosine Distance For cosine distance, there is a technique analogous to minhashing for generating a (d1,d2,(1-d1/180),(1-d2/180))- sensitive family for any d1 and d2. Called random hyperplanes.

Random Hyperplanes Pick a random vector v, which determines a hash function hv with two buckets. hv(x) = +1 if v.x > 0; = -1 if v.x < 0. LS-family H = set of all functions derived from any vector. Claim: Prob[h(x)=h(y)] = 1 – (angle between x and y divided by 180).

Proof of Claim (figure: work in the plane of x and y, with angle θ between them. A hyperplane normal to a random v separates x from y, so that h(x) ≠ h(y), exactly when v·x and v·y have opposite signs, which happens for a θ/180 fraction of directions; otherwise h(x) = h(y).) Prob[h(x) ≠ h(y)] = θ/180.

Signatures for Cosine Distance Pick some number of vectors, and hash your data for each vector. The result is a signature (sketch) of +1's and -1's for each data point. It can be used for LSH like the minhash signatures for Jaccard distance, and amplified using AND/OR constructions.

How to Pick Random Vectors It is expensive to pick a truly random vector in M dimensions for large M: it takes M random numbers. But we need not pick from among all possible vectors v to form a component of a sketch. A more efficient approach: it suffices to consider only vectors v whose components are +1 and -1.
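A small NumPy sketch of such signatures, under assumptions not fixed by the slides (200 random ±1 vectors, data points as NumPy arrays, illustrative dimensions). The angle between two points is estimated from the fraction of sketch components on which they agree, using Prob[h(x) = h(y)] = 1 - θ/180:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors = 50, 200
V = rng.choice([-1.0, 1.0], size=(n_vectors, dim))   # random +/-1 vectors, one per hash function

def sketch(x):
    return np.where(V @ x > 0, 1, -1)                 # h_v(x) = sign(v . x) for every v

x = rng.normal(size=dim)
y = x + 0.5 * rng.normal(size=dim)                    # a nearby second point

agree = np.mean(sketch(x) == sketch(y))               # fraction of agreeing components
estimated_angle = 180 * (1 - agree)                   # invert Prob[h(x) = h(y)] = 1 - theta/180
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
true_angle = np.degrees(np.arccos(cosine))
print(f"estimated angle: {estimated_angle:.1f}  true angle: {true_angle:.1f} (degrees)")
```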

LSH for Euclidean Distance Simple idea: hash functions correspond to lines. Partition the line into buckets of size a. Hash each point to the bucket containing its projection onto the line. Nearby points have nearby projections, so they usually land in the same bucket; distant points are rarely in the same bucket.
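A sketch of one such hash function in Python (the random unit vector, the random offset, and the bucket width a are illustrative choices for this example; the offset just randomizes where the bucket boundaries fall):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_line_hash(dim, a):
    """One hash function: project onto a random line and report the bucket index."""
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)                          # the randomly chosen line (unit vector)
    offset = rng.uniform(0, a)
    def h(x):
        return int(np.floor((x @ v + offset) / a))  # bucket of size a containing the projection
    return h

h = make_line_hash(dim=3, a=1.0)
p = np.array([0.0, 0.0, 0.0])
q = p + np.array([0.1, 0.0, 0.0])    # nearby point: usually shares p's bucket
far = p + np.array([5.0, 5.0, 0.0])  # distant point: rarely shares it
print(h(p), h(q), h(far))
```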

Projection of Points (figure: two points at distance d, at angle θ to a randomly chosen line that is partitioned into buckets of width a; their projections onto the line are d·cos θ apart.) If d << a, then the chance that the points fall in the same bucket is at least 1 - d/a. If d >> a, then θ must be close to 90° for there to be any chance the points go to the same bucket.

An LS-Family for Euclidean Distance If the points are at distance ≥ 2a, then we need cos θ ≤ 1/2, i.e., 60° ≤ θ ≤ 90°, for there to be any chance the points go in the same bucket; that happens with probability at most 1/3. If the points are at distance ≤ a/2, then there is at least a 1/2 chance they share a bucket. This yields an (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions.

Fixup: Euclidean Distance For previous distance measures, we could start with an (x, y, p, q)-sensitive family for any x < y, and drive p and q to 1 and 0 by AND/OR constructions. Here, we seem to need y > 4x.

Fixup – (2) But as long as x < y, the probability of points at distance x falling in the same bucket is greater than the probability of points at distance y doing so. Thus, the hash family formed by projecting onto lines is an (x, y, p, q)-sensitive family for some p > q. Then, amplify by AND/OR constructions.