Three Essential Techniques for Similar Documents

Shingling: convert documents, emails, etc., to sets.
Minhashing: convert large sets to short signatures, while preserving similarity.
Locality-sensitive hashing: focus on pairs of signatures likely to be similar.

The Big Picture
Document -> Shingling -> Minhashing -> Locality-sensitive hashing -> Candidate pairs
Shingling produces the set of strings of length k that appear in the document.
Minhashing produces signatures: short integer vectors that represent the sets and reflect their similarity.
Locality-sensitive hashing produces candidate pairs: those pairs of signatures that we need to test for similarity.
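A minimal sketch of the shingling step, assuming character-level shingles and a plain Python string as the document. The function name `shingles` is hypothetical and chosen only for illustration.

```python
def shingles(text: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-shingles) of text."""
    # A document shorter than k yields an empty set.
    return {text[i:i + k] for i in range(len(text) - k + 1)}


doc = "abcab"
print(sorted(shingles(doc, 2)))  # ['ab', 'bc', 'ca']
```

The shingle "ab" occurs twice in the string but only once in the set, which is what lets set similarity be applied to documents.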

Minhashing

Four Types of Rows
Given columns C1 and C2, rows may be classified as:
Type  C1  C2
a     1   1
b     1   0
c     0   1
d     0   0
Also, a = # rows of type a, etc.
Note: Sim(C1, C2) = a / (a + b + c).
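The row-type counts can be computed directly from two 0/1 columns. The sketch below assumes columns are Python lists of equal length. The function names and the sample columns are illustrative, and returning 0.0 for two all-zero columns is a convention chosen here, not something the slides specify.

```python
def row_type_counts(c1, c2):
    """Count rows of type a (1,1), b (1,0), c (0,1) and d (0,0)."""
    a = sum(1 for x, y in zip(c1, c2) if (x, y) == (1, 1))
    b = sum(1 for x, y in zip(c1, c2) if (x, y) == (1, 0))
    c = sum(1 for x, y in zip(c1, c2) if (x, y) == (0, 1))
    d = sum(1 for x, y in zip(c1, c2) if (x, y) == (0, 0))
    return a, b, c, d


def jaccard_from_columns(c1, c2):
    """Sim(C1, C2) = a / (a + b + c); type-d rows are ignored."""
    a, b, c, _ = row_type_counts(c1, c2)
    total = a + b + c
    return a / total if total else 0.0  # assumed convention for empty sets


c1 = [1, 1, 0, 0, 1]
c2 = [1, 0, 1, 0, 1]
print(row_type_counts(c1, c2))       # (2, 1, 1, 1)
print(jaccard_from_columns(c1, c2))  # 0.5
```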

Min-hashing
Define a hash function h as follows:
Permute the rows of the matrix randomly. Important: use the same permutation for all the vectors!
Let C be a column (= a document).
h(C) = the number of the first (in the permuted order) row in which column C has 1.

Minhash Signatures Use several (e.g., 100) independent hash functions to create a signature.

Minhashing Example
[Figure: row permutations applied to an input matrix, with the resulting signature matrix M.]

Surprising Property
The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2). Both are a / (a + b + c)!
Why? Look down the permuted columns C1 and C2 until we see a 1 in either column. If it’s a type-a row, then h(C1) = h(C2). If it’s a type-b or type-c row, then not.
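For a tiny matrix the property can be checked exactly by enumerating every permutation of the rows. This is only a sanity check on an assumed pair of 5-row columns, with an illustrative helper name `minhash`; it is not a practical way to compute minhashes.

```python
from fractions import Fraction
from itertools import permutations


def minhash(column, order):
    """Return the first row, in the permuted order, where column has a 1."""
    for row in order:
        if column[row] == 1:
            return row
    return None  # column is all zeros


# Assumed example columns: a = 2 type-a rows, b = 1, c = 1, d = 1.
c1 = [1, 1, 0, 0, 1]
c2 = [1, 0, 1, 0, 1]

orders = list(permutations(range(len(c1))))  # all 5! = 120 row orders
agree = sum(1 for order in orders if minhash(c1, order) == minhash(c2, order))
probability = Fraction(agree, len(orders))

print(probability)  # 1/2, the same as a / (a + b + c) = 2 / 4
```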

Similarity for Signatures
The similarity of signatures is the fraction of the hash functions in which they agree. Because of the minhash property, the similarity of columns is the same as the expected similarity of their signatures.
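Signature similarity is just the fraction of positions that agree. A minimal sketch, assuming signatures are equal-length Python lists; the function name and the sample signatures are made up for illustration.

```python
def signature_similarity(sig1, sig2):
    """Fraction of hash functions (positions) on which two signatures agree."""
    if len(sig1) != len(sig2) or not sig1:
        raise ValueError("signatures must be non-empty and of equal length")
    agreements = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return agreements / len(sig1)


print(signature_similarity([2, 1, 2, 1], [2, 1, 4, 1]))  # 0.75
```

With only four hash functions the estimate is coarse; it can only take the values 0, 0.25, 0.5, 0.75 and 1, which is one reason to use on the order of 100 hash functions.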

Min Hashing – Example
[Figure: the input matrix and signature matrix M, with a table comparing column/column similarities (Col/Col) to signature/signature similarities (Sig/Sig).]

Implementation – (1) Suppose 1 billion rows.
Hard to pick a random permutation from 1…billion. Representing a random permutation requires 1 billion entries. Accessing rows in permuted order leads to thrashing.

Implementation – (2)
Simulate the effect of a random permutation with a random hash function.
A good approximation to permuting rows: pick 100 (?) hash functions h1, h2, …
For rows r and s, if hi(r) < hi(s), then r appears before s in permutation i.
We will use the same name for the hash function and the corresponding min-hash function.

Implementation – (3)
For each column c and each hash function hi, keep a “slot” M(i, c). Initialize it to infinity.
Intent: M(i, c) will become the smallest value of hi(r) for which column c has 1 in row r. I.e., hi(r) gives the order of rows for the i-th permutation.
Sort the input matrix so it is ordered by rows, so that we can iterate by reading rows sequentially from disk.

Implementation – (4)
for each row r
    for each column c
        if c has 1 in row r
            for each hash function hi do
                if hi(r) is a smaller value than M(i, c) then
                    M(i, c) := hi(r);

Example
h(x) = x mod 5, g(x) = (2x + 1) mod 5
[Table: input matrix with columns Row, C1, C2.]
Trace of the signature slots after each row is processed:
            Sig1  Sig2
h(1) = 1    1     -
g(1) = 3    3     -
h(2) = 2    1     2
g(2) = 0    3     0
h(3) = 3    1     2
g(3) = 2    2     0
h(4) = 4    1     2
g(4) = 4    2     0
h(5) = 0    1     0
g(5) = 1    2     0
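The row-by-row algorithm can be sketched in Python as follows, using the two hash functions from the example. The input matrix is an assumption: C1 has 1s in rows 1, 3, 4 and C2 has 1s in rows 2, 3, 5. It is consistent with the trace, but the trace does not fully determine rows 3 and 4. The function name `minhash_signatures` is illustrative.

```python
import math


def h(x):
    return x % 5


def g(x):
    return (2 * x + 1) % 5


def minhash_signatures(rows, hash_funcs, n_cols):
    """rows: iterable of (row_number, list of 0/1 bits, one per column)."""
    # M[i][c] starts at infinity and ends as min of hash_funcs[i](r)
    # over the rows r in which column c has a 1.
    M = [[math.inf] * n_cols for _ in hash_funcs]
    for r, bits in rows:
        hashes = [f(r) for f in hash_funcs]  # computed once per row
        for c, bit in enumerate(bits):
            if bit == 1:
                for i, value in enumerate(hashes):
                    if value < M[i][c]:
                        M[i][c] = value
    return M


# Assumed input matrix (row number, [C1, C2]).
rows = [(1, [1, 0]), (2, [0, 1]), (3, [1, 1]), (4, [1, 0]), (5, [0, 1])]
M = minhash_signatures(rows, [h, g], n_cols=2)
print(M)  # [[1, 0], [2, 0]]: Sig1 = (1, 2), Sig2 = (0, 0)
```

Each inner list is one hash function's row of M, so Sig1 is read down the first column and Sig2 down the second, matching the last line of the trace.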

Implementation – (5) Often, data is given by column, not row.
E.g., columns = documents, rows = shingles. If so, sort matrix once so it is by row. And always compute hi (r ) only once for each row.

Locality-Sensitive Hashing

Finding Similar Pairs
Suppose we have, in main memory, data representing a large number of objects. These may be the objects themselves, or signatures as in minhashing.
We want to compare each to each, finding those pairs that are sufficiently similar.

Checking All Pairs is Hard
While the signatures of all columns may fit in main memory, comparing the signatures of all pairs of columns is quadratic in the number of columns.
Example: 10^6 columns implies 5 × 10^11 column-comparisons. At 1 microsecond/comparison: 6 days.
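The arithmetic behind the example can be reproduced in a few lines. The variable names are illustrative; the only inputs are the column count and the per-comparison cost stated above.

```python
from math import comb

n_columns = 10**6
pairs = comb(n_columns, 2)            # number of unordered pairs of columns
seconds = pairs * 1e-6                # at 1 microsecond per comparison
days = seconds / 86400                # 86,400 seconds per day

print(pairs)           # 499999500000, i.e. about 5 * 10^11
print(round(days, 2))  # 5.79, i.e. about 6 days
```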

Locality-Sensitive Hashing
General idea: Use a function f(x,y) that tells whether or not x and y are a candidate pair: a pair of elements whose similarity must be evaluated.
For minhash matrices: Hash columns to many buckets, and make elements of the same bucket candidate pairs.

Candidate Generation From Minhash Signatures
Pick a similarity threshold s, e.g., s = 0.8.
Goal: Find documents with Jaccard similarity at least s.
Columns c and d are a candidate pair if their signatures agree in at least fraction s of the rows. I.e., M(i, c) = M(i, d) for at least fraction s of the values of i.

LSH for Minhash Signatures
Big idea: hash columns of signature matrix M several times. Arrange that (only) similar columns are likely to hash to the same bucket. Candidate pairs are those that hash at least once to the same bucket.

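Partition Into Bands
[Figure: signature matrix M divided into b bands of r rows per band; one column of M is one signature.]

Buckets
[Figure: one band of matrix M (b bands, r rows each) with its columns hashed into buckets.]
Columns 2 and 6 are probably identical. Columns 6 and 7 are surely different.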

Partition into Bands – (2)
Divide matrix M into b bands of r rows. For each band, hash its portion of each column to a hash table with k buckets. Make k as large as possible. Candidate column pairs are those columns that hash to the same bucket for ≥ 1 band. Tune b and r to catch most similar pairs, but few nonsimilar pairs.
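A minimal sketch of the banding step, assuming signatures are held in a dict from a column id to a list of b*r minhash values. Instead of hashing each band to one of k buckets, it uses the band's tuple of values directly as the bucket key, so two columns share a bucket only when they are identical in that band. The function name and the toy signatures are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures, b, r):
    """Return the set of column pairs that share a bucket in at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # this band's r values
            buckets[key].append(col)
        for cols in buckets.values():
            # Every pair of columns in the same bucket becomes a candidate.
            candidates.update(combinations(sorted(cols), 2))
    return candidates


signatures = {
    "A": [1, 2, 3, 4],
    "B": [1, 2, 9, 9],  # agrees with A in band 0 only
    "C": [7, 7, 7, 7],
}
print(lsh_candidates(signatures, b=2, r=2))  # {('A', 'B')}
```

Separate buckets are kept per band, so columns with the same values in different bands are never matched against each other.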

Simplifying Assumption
There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band. Hereafter, we assume that “same bucket” means “identical in that band.” Assumption needed only to simplify analysis, not for correctness of algorithm

Example of Bands
Suppose 100,000 columns and 100 min-hash values per document.
Therefore, signatures take 40 MB (assuming 4-byte integers).
5,000,000,000 pairs of signatures can take a while to compare.
Let’s choose b = 20, r = 5: 20 bands, 5 min-hash values per band.
Goal: find pairs of documents that are at least 80% similar.

Suppose C1, C2 are 80% Similar
Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328.
Probability C1, C2 are not identical in any of the 20 bands: (1 - 0.328)^20 = 0.00035.
I.e., about 1/3000th of the 80%-similar column pairs are false negatives.
We would find 99.965% of the pairs of truly similar documents.

Suppose C1, C2 Only 30% Similar
Probability C1, C2 identical in any one particular band: (0.3)^5 = 0.00243.
Probability C1, C2 identical in ≥ 1 of 20 bands: ≤ 20 × 0.00243 = 0.0486.
In other words, approximately 4.86% of the pairs of docs with similarity 30% end up becoming candidate pairs. These are false positives.

LSH Involves a Tradeoff
Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives. Example: if we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up.
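The tradeoff can be checked numerically. The sketch below compares 20 bands with 15 bands, both with 5 rows per band. It treats 80%-similar pairs that are missed as false negatives and 30%-similar pairs that become candidates as false positives. The function name is illustrative.

```python
def candidate_probability(s, r, b):
    """Probability that a pair with similarity s agrees in at least one band."""
    return 1 - (1 - s**r) ** b


results = {}
for b in (20, 15):
    false_negative = 1 - candidate_probability(0.8, 5, b)  # 80%-similar, missed
    false_positive = candidate_probability(0.3, 5, b)      # 30%-similar, kept
    results[b] = (false_negative, false_positive)
    print(f"b={b}: false negatives {false_negative:.5f}, "
          f"false positives {false_positive:.4f}")
```

With fewer bands a pair has fewer chances to collide, so the false-positive rate drops and the false-negative rate rises.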

Analysis of LSH – What We Want
[Figure: ideal step function. Horizontal axis: similarity s of two sets, with threshold t marked. Vertical axis: probability of sharing a bucket. No chance if s < t; probability = 1 if s > t.]

What One Band of One Row Gives You
Remember: probability of equal hash-values = similarity.
[Figure: straight line. Horizontal axis: similarity s of two sets, with threshold t marked. Vertical axis: probability of sharing a bucket.]

b Bands, r rows/band

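What b Bands of r Rows Gives You
s^r: all rows of a band are equal
1 - s^r: some row of a band unequal
(1 - s^r)^b: no bands identical
1 - (1 - s^r)^b: at least one band identical
Threshold: t ~ (1/b)^(1/r)
[Figure: S-curve. Horizontal axis: similarity s of two sets, with threshold t marked. Vertical axis: probability of sharing a bucket.]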

Example: b = 20; r = 5
s     1-(1-s^r)^b
.2    .006
.3    .047
.4    .186
.5    .470
.6    .802
.7    .975
.8    .9996
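The table can be regenerated from the formula, together with the approximate threshold (1/b)^(1/r). A short sketch; `s_curve` is an illustrative name.

```python
def s_curve(s, r, b):
    """Probability of becoming a candidate pair: 1 - (1 - s^r)^b."""
    return 1 - (1 - s**r) ** b


r, b = 5, 20
for tenth in range(2, 9):
    s = tenth / 10
    print(f"{s:.1f}  {s_curve(s, r, b):.4f}")

threshold = (1 / b) ** (1 / r)  # approximate location of the steep rise
print(f"threshold ~ {threshold:.3f}")
```

The computed threshold falls between s = .5 and s = .6, which is where the table's values climb most quickly.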

LSH Summary
Tune to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures.
Check in main memory that candidate pairs really do have similar signatures.
Optional: In another pass through the data, check that the remaining candidate pairs really represent similar sets.

Three Essential Techniques for Similar Documents
Shingling: convert documents, emails, etc., to sets.
Minhashing: convert large sets to short signatures, while preserving similarity.
Locality-sensitive hashing: focus on pairs of signatures likely to be similar.

Key Idea Behind LSH
Given documents (i.e., shingle sets) D1 and D2, suppose we can find a hash function h such that:
if sim(D1, D2) is high, then with high probability h(D1) = h(D2);
if sim(D1, D2) is low, then with high probability h(D1) != h(D2).
Then we could hash documents into buckets, and expect that “most” pairs of near-duplicate documents would hash into the same bucket.
Compare pairs of docs in each bucket to see if they are really near-duplicates.

Minhashing
Clearly, the hash function depends on the similarity metric. Not all similarity metrics have a suitable hash function.
Fortunately, there is a suitable hash function for Jaccard similarity: min-hashing.

Theory of LSH

Theory of LSH
We have used LSH to find similar documents. In reality, we found columns in large sparse matrices with high Jaccard similarity, e.g., customer/item purchase histories.
Can we use LSH for other distance measures, e.g., Euclidean distance or cosine distance?
Let’s generalize what we’ve learned!

Families of Hash Functions
1. A “hash function” is any function that takes two elements and says whether or not they are “equal” (really, are candidates for similarity checking). Shorthand: h(x) = h(y) means “h says x and y are equal.”
2. A family of hash functions is any set of functions as in (1).
An example of a family of hash functions: for min-hash signatures, we got a minhash function for each permutation of rows.
A family is a (large) set of related hash functions generated by some mechanism. We should be able to efficiently pick a hash function at random from such a family.

LS Families of Hash Functions
Suppose we have a space S of points with a distance measure d.
A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S:
If d(x, y) < d1, then the probability over all h in H that h(x) = h(y) is at least p1.
If d(x, y) > d2, then the probability over all h in H that h(x) = h(y) is at most p2.

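LS Families: Illustration
[Figure: probability that h(x) = h(y) as a function of the distance d(x, y). High probability, at least p1, below d1; low probability, at most p2, above d2; unknown (???) between d1 and d2.]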

Example: LS Family Let S = sets, d = Jaccard distance, H is formed from the minhash functions for all permutations. Then Prob[h(x)=h(y)] = 1-d(x,y). Restates theorem about Jaccard similarity and minhashing in terms of Jaccard distance rather than similarities.

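Example: LS Family – (2)
Claim: H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d.
If distance < 1/3 (so similarity > 2/3), then the probability that minhash values agree is > 2/3.
If distance > 2/3 (so similarity < 1/3), then the probability that minhash values agree is < 1/3.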

Comments For Jaccard similarity, minhashing gives us a (d1,d2,(1-d1),(1-d2))-sensitive family for any d1 < d2. The theory leaves unknown what happens to pairs that are at distance between d1 and d2. Consequence: no guarantees about fraction of false positives in that range.

Amplifying a LS-Family
Can we reproduce the “S-curve” effect we saw before for any LS family? The “bands” technique we learned for signature matrices carries over to this more general setting. Goal: the “S-curve” effect seen there. AND construction like “rows in a band.” OR construction like “many bands.”

AND of Hash Functions
Given family H, construct family H’ whose members each consist of r functions from H.
For h = [h1,…,hr] in H’, h(x) = h(y) if and only if hi(x) = hi(y) for all i.
Theorem: If H is (d1, d2, p1, p2)-sensitive, then H’ is (d1, d2, (p1)^r, (p2)^r)-sensitive.
Proof: Use the fact that the hi’s are independent.

OR of Hash Functions
Given family H, construct family H’ whose members each consist of b functions from H.
For h = [h1,…,hb] in H’, h(x) = h(y) if and only if hi(x) = hi(y) for some i.
Theorem: If H is (d1, d2, p1, p2)-sensitive, then H’ is (d1, d2, 1-(1-p1)^b, 1-(1-p2)^b)-sensitive.

Effect of AND and OR Constructions
AND makes all probabilities shrink, but by choosing r correctly, we can make the lower probability approach 0 while the higher does not. OR makes all probabilities grow, but by choosing b correctly, we can make the upper probability approach 1 while the lower does not.

Composing Constructions
As for the signature matrix, we can use the AND construction followed by the OR construction. Or vice-versa. Or any sequence of AND’s and OR’s alternating.
r-way AND construction followed by b-way OR construction: exactly what we did with minhashing.
Take points x and y s.t. Pr[h(x) = h(y)] = p. H will make (x, y) a candidate pair with prob. p.
This construction will make (x, y) a candidate pair with probability 1-(1-p^r)^b. The S-curve!

AND-OR Composition
Each of the two probabilities p is transformed into 1-(1-p^r)^b. The “S-curve” studied before.
Example: Take H and construct H’ by the AND construction with r = 4. Then, from H’, construct H’’ by the OR construction with b = 4.

Table for Function 1-(1-p^4)^4
p     1-(1-p^4)^4
.2    .0064
.3    .0320
.4    .0985
.5    .2275
.6    .4260
.7    .6666
.8    .8785
.9    .9860
Example: Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.8785,.0064)-sensitive family.

OR-AND Composition
Apply a b-way OR construction followed by an r-way AND construction.
Each of the two probabilities p is transformed into (1-(1-p)^b)^r. The same S-curve, mirrored horizontally and vertically.
Example: Take H and construct H’ by the OR construction with b = 4. Then, from H’, construct H’’ by the AND construction with r = 4.

Table for Function (1-(1-p)^4)^4
p     (1-(1-p)^4)^4
.1    .0140
.2    .1215
.3    .3334
.4    .5740
.5    .7725
.6    .9015
.7    .9680
.8    .9936
Example: Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9936,.1215)-sensitive family.
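Both compositions are one-line functions of the collision probability p, so the two tables are easy to regenerate. A sketch with illustrative function names, using r = b = 4 as in the examples.

```python
def and_or(p, r=4, b=4):
    """r-way AND followed by b-way OR: 1 - (1 - p^r)^b."""
    return 1 - (1 - p**r) ** b


def or_and(p, b=4, r=4):
    """b-way OR followed by r-way AND: (1 - (1 - p)^b)^r."""
    return (1 - (1 - p) ** b) ** r


for tenth in range(1, 10):
    p = tenth / 10
    print(f"{p:.1f}  AND-OR {and_or(p):.4f}  OR-AND {or_and(p):.4f}")
```

Applied to the (.2, .8, .8, .2)-sensitive family, AND-OR pushes the low probability toward 0, while OR-AND pushes the high probability toward 1.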

Cascading Constructions
Example: Apply the (4,4) OR-AND construction followed by the (4,4) AND-OR construction.
Transforms a (.2,.8,.8,.2)-sensitive family into a (.2,.8,.9999996,.0008715)-sensitive family.
Note this family uses 4 × 4 × 4 × 4 = 256 of the original hash functions.
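The cascade is function composition on the two probabilities. The sketch below, with illustrative names, applies the (4,4) OR-AND construction and then the (4,4) AND-OR construction to p1 = .8 and p2 = .2.

```python
def and_or(p, r=4, b=4):
    return 1 - (1 - p**r) ** b


def or_and(p, b=4, r=4):
    return (1 - (1 - p) ** b) ** r


def cascade(p):
    """(4,4) OR-AND first, then (4,4) AND-OR on the result."""
    return and_or(or_and(p))


high = cascade(0.8)  # probability for pairs at distance below .2
low = cascade(0.2)   # probability for pairs at distance above .8
print(f"{high:.7f}  {low:.7f}")
```

Each of the four stages multiplies the number of base hash functions by 4, which is where the count of 256 comes from.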

General Use of S-Curves
For each S-curve 1-(1-p^r)^b, there is a threshold t for which 1-(1-t^r)^b = t.
Above t, probabilities are increased; below t, they are decreased.
You improve the sensitivity as long as the low probability is less than t, and the high probability is greater than t.
Iterate as you like.
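The threshold t has no simple closed form, but it can be found numerically. A sketch using bisection, assuming r and b are both at least 2 so that the curve lies below the diagonal near 0 and above it near 1. The function names and the tolerance are illustrative choices.

```python
def and_or(p, r, b):
    return 1 - (1 - p**r) ** b


def threshold(r, b, tol=1e-12):
    """Find t in (0, 1) with and_or(t, r, b) = t by bisection."""
    lo, hi = 1e-6, 1 - 1e-6
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if and_or(mid, r, b) < mid:
            lo = mid  # curve still below the diagonal: t is to the right
        else:
            hi = mid  # curve at or above the diagonal: t is to the left
    return lo


t = threshold(4, 4)
print(round(t, 3))
```

For r = b = 4 the result lies between .7 and .8, so the construction improves a family whose low probability is .2 and whose high probability is .8.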

Summary
Pick any two distances x < y.
Start with a (x, y, (1-x), (1-y))-sensitive family.
Apply constructions to produce a (x, y, p, q)-sensitive family, where p is almost 1 and q is almost 0.
The closer to 0 and 1 we get, the more hash functions must be used.

LSH for Cosine Distance
For cosine distance (the angle between two vectors, measured in degrees), there is a technique analogous to minhashing for generating a (d1, d2, (1-d1/180), (1-d2/180))-sensitive family for any d1 and d2.
Called random hyperplanes.

Random Hyperplanes
Pick a random vector v, which determines a hash function hv with two buckets:
hv(x) = +1 if v.x > 0; hv(x) = -1 if v.x < 0.
LS-family H = set of all functions derived from any vector.
Claim: Prob[h(x) = h(y)] = 1 - (angle between x and y divided by 180).

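Proof of Claim
Look in the plane of x and y, and let θ be the angle between them.
[Figure: vectors x and y at angle θ, with example hyperplanes normal to v. A hyperplane that passes between x and y gives h(x) != h(y) (the red case); a hyperplane that leaves x and y on the same side gives h(x) = h(y).]
Prob[Red case] = θ/180.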

Signatures for Cosine Distance
Pick some number of vectors, and hash your data for each vector.
The result is a signature (sketch) of +1’s and –1’s for each data point.
It can be used for LSH like the minhash signatures for Jaccard distance, and amplified using AND/OR constructions.
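A minimal sketch of random-hyperplane sketches. It assumes random directions are drawn with independent Gaussian components and that a dot product of exactly 0 maps to -1 (the definition above leaves that case open). The function names, the seed, and the example vectors are illustrative.

```python
import math
import random


def sketch(x, vectors):
    """One +1/-1 entry per random vector v: the sign of the dot product v.x."""
    return [1 if sum(vi * xi for vi, xi in zip(v, x)) > 0 else -1
            for v in vectors]


def estimated_angle(sk1, sk2):
    """Fraction of disagreeing positions, scaled to degrees."""
    disagreements = sum(1 for a, b in zip(sk1, sk2) if a != b)
    return 180 * disagreements / len(sk1)


rng = random.Random(0)
dim, n_vectors = 3, 2000
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_vectors)]

x = [1.0, 0.0, 0.0]
y = [1.0, 1.0, 0.0]
dot = sum(a * b for a, b in zip(x, y))
true_angle = math.degrees(math.acos(dot / (math.hypot(*x) * math.hypot(*y))))

estimate = estimated_angle(sketch(x, vectors), sketch(y, vectors))
print(true_angle, estimate)  # the estimate should be close to the true angle
```

The estimate is noisy; its error shrinks roughly with the square root of the number of vectors.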

How to pick random vectors
It is expensive to pick a random vector in M dimensions for large M: it takes M random numbers.
A more efficient approach: we need not pick from among all possible vectors v to form a component of a sketch. It suffices to consider only vectors v consisting of +1 and –1 components.

LSH for Euclidean Distance
Simple idea: hash functions correspond to lines.
Partition the line into buckets of size a.
Hash each point to the bucket containing its projection onto the line.
The projections of nearby points are always close; distant points are rarely in the same bucket.

Projection of Points
[Figure: two points at distance d projected onto a randomly chosen line that is divided into buckets of width a. The segment joining the points makes angle θ with the line, so the projections are d cos θ apart.]
If d << a, then the chance the points are in the same bucket is at least 1 - d/a.
If d >> a, θ must be close to 90° for there to be any chance the points go to the same bucket.

An LS-Family for Euclidean Distance
If points are distance > 2a apart, then we need 60° < θ < 90° for there to be a chance that the points go in the same bucket, because the projected distance d cos θ is below a only when cos θ < 1/2. I.e., at most 1/3 probability.
If points are distance < a/2, then there is at least 1/2 chance they share a bucket.
Yields a (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions.
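The two bounds can be checked empirically. The sketch below assumes points in the plane, a line direction drawn uniformly at random, and a random shift of the bucket boundaries along the line. The shift, the function names, the seed, and the trial count are assumptions made for illustration.

```python
import math
import random


def make_line_hash(a, rng):
    """Hash a 2-D point to the index of the width-a bucket holding its projection."""
    theta = rng.uniform(0, math.pi)      # direction of the random line
    ux, uy = math.cos(theta), math.sin(theta)
    offset = rng.uniform(0, a)           # assumed random shift of the boundaries

    def h(point):
        projection = point[0] * ux + point[1] * uy
        return math.floor((projection + offset) / a)

    return h


def collision_rate(p, q, a, trials, rng):
    """Fraction of random line hashes that put p and q in the same bucket."""
    hits = 0
    for _ in range(trials):
        h = make_line_hash(a, rng)
        if h(p) == h(q):
            hits += 1
    return hits / trials


rng = random.Random(1)
a = 1.0
near = collision_rate((0.0, 0.0), (a / 2, 0.0), a, 20000, rng)  # distance a/2
far = collision_rate((0.0, 0.0), (2 * a, 0.0), a, 20000, rng)   # distance 2a
print(near, far)
```

Under these assumptions the near pair should collide more than half the time and the far pair less than a third of the time, in line with the (a/2, 2a, 1/2, 1/3) claim.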

Fixup: Euclidean Distance
For previous distance measures, we could start with an (x, y, p, q)-sensitive family for any x < y, and drive p and q to 1 and 0 by AND/OR constructions. Here, we seem to need y > 4x.

Fixup – (2) But as long as x < y, the probability of points at distance x falling in the same bucket is greater than the probability of points at distance y doing so. Thus, the hash family formed by projecting onto lines is an (x, y, p, q)-sensitive family for some p > q. Then, amplify by AND/OR constructions.
