
1 Locality Sensitive Hashing Basics and applications

2 A well-known problem
Given a large collection of documents, identify the near-duplicate ones.
Web search engines face a proliferation of near-duplicate documents:
- legitimate: mirrors, local copies, updates, …
- malicious: spam, spider-traps, dynamic URLs, …
About 30% of web pages were estimated to be near-duplicates [1997].

3 Natural approaches
- Fingerprinting only works for exact matches:
  - Karp-Rabin (rolling hash): collision-probability guarantees
  - MD5: cryptographically-secure string hashes
- The edit-distance metric for approximate string matching is
  - expensive, even for one pair of documents
  - impossible for billions of web documents
- Random sampling: sample substrings (phrases, sentences, etc.)
  - hope: similar documents → similar samples
  - but even samples of the same document will differ

4 Basic idea: shingling [Broder 1997]
Dissect each document into q-grams (shingles).
T = "I leave and study in Pisa, …": with q=3 the 3-grams are
"I leave and", "leave and study", "and study in", "study in Pisa", …
Represent each document by its set of hash[shingle] values → the problem reduces to set intersection among sets of integers.
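
To make the idea concrete, here is a minimal Python sketch of shingling; the helper names and the md5-based 64-bit hashing are illustrative assumptions, not from the slides:

```python
import hashlib

def shingles(text, q=3):
    """Return the set of hashed word-level q-grams (shingles) of a text."""
    words = text.split()
    grams = (" ".join(words[i:i + q]) for i in range(len(words) - q + 1))
    # Hash each shingle to a 64-bit integer: a document becomes a set of ints.
    return {int.from_bytes(hashlib.md5(g.encode()).digest()[:8], "big")
            for g in grams}

S_A = shingles("I leave and study in Pisa where I also have fun")
S_B = shingles("I live and study in Pisa where I have fun")
print(len(S_A & S_B), len(S_A | S_B))   # raw set-intersection statistics
```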

5 Basic idea: shingling [Broder 1997]
Set intersection → Jaccard similarity: for the shingle sets S_A and S_B of documents A and B (figure: two overlapping sets),
sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
Claim: A and B are near-duplicates if sim(S_A, S_B) is high.

6 Sketching of a document
From each shingle set we build a "sketch vector" (~200 components).
Postulate: documents that share ≥ t components of their sketch vectors are claimed to be near-duplicates.

7 Sketching by min-hashing
Consider S_A, S_B ⊆ P = {0, …, p-1}.
Pick a random permutation π of the whole set P (such as π(x) = ax + b mod p).
Let α = min{π(S_A)} and β = min{π(S_B)}.
Lemma: Pr[α = β] = |S_A ∩ S_B| / |S_A ∪ S_B| = sim(S_A, S_B)
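
A one-line argument for the lemma (standard, not spelled out on the slide): the minimum of π over S_A ∪ S_B is equally likely to be the image of any element, and α = β exactly when that element lies in the intersection:

```latex
\Pr[\alpha = \beta]
  = \Pr\big[\min \pi(S_A \cup S_B) \in \pi(S_A \cap S_B)\big]
  = \frac{|S_A \cap S_B|}{|S_A \cup S_B|} = \mathrm{sim}(S_A, S_B)
```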

8 Strengthening it…
Similarity sketch sk(A): the d minimal elements under π(S_A), or take d permutations and the min of each.
Note: we can reduce the variance of the estimate by using a larger d.
Typically d is a few hundred mins (~200).
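
A minimal Python sketch of the d-permutation variant; the linear maps x → (ax + b) mod P and all parameter values are assumptions in the spirit of the previous slide:

```python
import random

P = (1 << 61) - 1   # a large prime; the universe of hashed shingles

def make_perms(d, seed=0):
    """d random 'permutations' of the form x -> (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]

def min_hash_sketch(S, perms):
    """Sketch = the minimum of each permutation over the shingle set S."""
    return [min((a * x + b) % P for x in S) for (a, b) in perms]

def estimated_sim(sk_a, sk_b):
    """Fraction of equal components ~ Jaccard similarity (next slides)."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)

perms = make_perms(d=200)
S_A, S_B = {1, 2, 3, 4, 5}, {3, 4, 5, 6}           # toy shingle sets
print(estimated_sim(min_hash_sketch(S_A, perms),
                    min_hash_sketch(S_B, perms)))  # ~ 3/6 = 0.5
```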

9 Computing Sketch[i] for Doc1
Start with the 64-bit values f(shingle), i.e., integers in {0, …, 2^64 - 1}.
Permute them with π_i and pick the min value: that is Sketch[i].

10 Test whether Doc1.Sketch[i] = Doc2.Sketch[i]
Claim: this happens with probability size_of_intersection / size_of_union (the Jaccard similarity of the two shingle sets).
Use ~200 random permutations (min-hashes), thus creating one 200-dim vector per document, and evaluate the fraction of shared components.

11 It's even more difficult…
So we have squeezed a few KBs of data (a web page) into a few hundred bytes. But we still need a brute-force comparison (quadratic time) to find all nearly-duplicate documents, and this is too much even when executed in RAM.

12 Locality Sensitive Hashing: the case of the Hamming distance
How to quickly compute the fraction of different components in d-dim vectors, i.e., the Hamming distance between d-dim vectors:
fraction of different components = HammingDist(p, q) / d

13 A warm-up
Consider binary (sketch) vectors, living in the hypercube {0,1}^d.
Hamming distance: D(p,q) = number of coordinates on which p and q differ.
Define a hash function h by choosing a set I of k random coordinates:
h(p) = p_|I = the projection of p on I.
Example: if p = 01011 (d=5) and we pick I = {1,4} (k=2), then h(p) = 01.
Note the similarity with the Bloom filter.
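
A minimal sketch of the projection hash (note: 0-indexed coordinates here, unlike the 1-indexed set I = {1,4} on the slide):

```python
import random

def projection_hash(d, k, seed=None):
    """Pick a set I of k random coordinates; h(p) is p restricted to I."""
    rng = random.Random(seed)
    I = sorted(rng.sample(range(d), k))
    return lambda p: tuple(p[i] for i in I)

p = (0, 1, 0, 1, 1)                # p = 01011, d = 5
h = projection_hash(d=5, k=2, seed=7)
print(h(p))                        # the projection of p on the chosen I
```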

14 A key property
Pr[picking an equal component] = (d - D(p,q)) / d, hence Pr[h(p) = h(q)] = ((d - D(p,q)) / d)^k.
We can vary this probability by changing k (figure: Pr vs. distance, curves for k=1 and k=2; larger k pushes the collision probability of distant points down).
What about false negatives?

15 Reiterate
Repeat the k-projection L times, obtaining h_1, …, h_L; declare a «match» if at least one h_i matches.
Example: d=5, k=2, p = 01011 and q = 00101
I_1 = {2,4}: h_1(p) = 11 and h_1(q) = 00
I_2 = {1,4}: h_2(p) = 01 and h_2(q) = 00
I_3 = {1,5}: h_3(p) = 01 and h_3(q) = 01
We set g(x) = <h_1(x), …, h_L(x)>; here h_3 agrees, so p and q match!
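
Continuing the sketch above (reusing the hypothetical projection_hash), the match rule with L independent projections:

```python
def lsh_match(p, q, hashes):
    """Declare a match iff at least one of the L projections agrees."""
    return any(h(p) == h(q) for h in hashes)

p = (0, 1, 0, 1, 1)                                             # 01011
q = (0, 0, 1, 0, 1)                                             # 00101
hashes = [projection_hash(d=5, k=2, seed=s) for s in range(3)]  # L = 3
print(lsh_match(p, q, hashes))     # True whenever some I_i avoids all
                                   # coordinates on which p and q differ
```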

16 Measuring the error probability
g() consists of the L independent hashes h_i, so, if s is the fraction of equal components of p and q,
Pr[g(p) matches g(q)] = 1 - Pr[h_i(p) ≠ h_i(q) for all i = 1, …, L] = 1 - (1 - s^k)^L
(figure: this probability plotted against s is an S-shaped curve whose sharp rise occurs around s = (1/L)^{1/k})
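
The resulting S-curve can be tabulated directly; a small numeric check with arbitrary k and L:

```python
def match_prob(s, k, L):
    """Pr[g(p) matches g(q)] when a fraction s of the components agree."""
    return 1 - (1 - s ** k) ** L

k, L = 5, 20
print(f"threshold ~ {(1 / L) ** (1 / k):.3f}")   # where the curve rises
for s in (0.2, 0.4, 0.5, 0.6, 0.8):
    print(f"s = {s:.1f} -> Pr[match] = {match_prob(s, k, L):.3f}")
```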

17 Find groups of similar items
SOL 1: buckets provide the candidate similar items; «merge» similar sets if they share items.
(figure: a point p is hashed by h_1(p), h_2(p), …, h_L(p) into tables T_1, T_2, …, T_L)
Points falling in the same bucket are possibly similar objects.

18 Find groups of similar items
SOL 1: buckets provide the candidate similar items (see the sketch below).
SOL 2: sort the items by each h_i() and pick the equal ones as similar candidates; repeat L times, for all h_i().
«Merge» candidate sets if they share items. What about clustering? Check the candidates!
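
A minimal sketch of SOL 1 with one table per projection; the names are illustrative, and the returned candidates must still be verified against the true distance:

```python
from collections import defaultdict

def build_tables(points, hashes):
    """One hash table T_i per projection h_i; buckets hold point ids."""
    tables = [defaultdict(list) for _ in hashes]
    for pid, p in enumerate(points):
        for T, h in zip(tables, hashes):
            T[h(p)].append(pid)
    return tables

def candidates(q, tables, hashes):
    """Union of q's buckets over the L tables = candidate similar items."""
    out = set()
    for T, h in zip(tables, hashes):
        out.update(T.get(h(q), ()))
    return out
```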

19 LSH versus K-means
- What about optimality? K-means is locally optimal [recently, some researchers showed how to introduce some guarantees].
- What about the sim-cost? K-means compares items in Θ(d) time and space [and notice that d may be in the millions or billions].
- What about the cost per iteration and the number of iterations? Typically K-means requires few iterations, each costing K · n · d time, for a total of I · K · n · d.
- What about K? In principle one has to iterate over K = 1, …, n.
LSH needs sort(n) time, hence, on disk, few passes over the data, and it comes with guaranteed error bounds.

20 On-line queries too
Given a query q, check the buckets h_j(q) for j = 1, …, L.
(figure: q is hashed by h_1(q), h_2(q), …, h_L(q) into tables T_1, T_2, …, T_L)
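
A usage sketch of the on-line query, reusing the hypothetical build_tables / candidates / projection_hash from above:

```python
points = [(0, 1, 0, 1, 1), (0, 0, 1, 0, 1), (1, 1, 0, 1, 1)]
hashes = [projection_hash(d=5, k=2, seed=s) for s in range(10)]  # L = 10
tables = build_tables(points, hashes)

q = (0, 1, 0, 1, 0)
print(candidates(q, tables, hashes))   # only these buckets are checked,
                                       # not the whole point set
```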

21 Locality Sensitive Hashing and its applications
More problems, indeed

22 Another classic problem
The problem: given U users, find groups of similar users (or users similar to a query user Q).
Features: personal data, preferences, purchases, navigational behavior, followers/following or +1s, …
A feature is typically a numerical value, binary or real. Example with 5 binary features:
     1 2 3 4 5
U1:  0 1 0 1 1
U2:  0 1 1 0 0
U3:  0 1 1 1 1
Hamming distance: number of different components.
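
For instance, the pairwise Hamming distances of the three users above (a trivial check):

```python
U1, U2, U3 = "01011", "01100", "01111"

def hamming(u, v):
    """Number of components on which u and v differ."""
    return sum(a != b for a, b in zip(u, v))

print(hamming(U1, U2), hamming(U1, U3), hamming(U2, U3))   # 3 1 2
```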

23 More than the Hamming distance
(figure: q is the query point, P* is its nearest neighbor)

24 Approximation helps
(figure: query q, its nearest neighbor p*, and a ball of radius r around q)

25 A slightly different problem: approximate nearest neighbor
Given an error parameter ε > 0: for a query q whose nearest neighbor is p', return a point p such that
D(q, p) ≤ (1 + ε) · D(q, p')
Justification: mapping objects to a metric space is heuristic anyway, and we get a tremendous performance improvement.

26 A workable approach
Given an error parameter ε > 0 and a distance threshold t > 0, a (t, ε)-approximate NN query works as follows:
- if no point p has D(q, p) < t, return FAILURE;
- else, return any p' with D(q, p') < (1 + ε) t.
Application to approximate NN: assume the maximum distance is T and run in parallel for t = 1, (1+ε), (1+ε)^2, …, T.
Time/space overhead: O(log_{1+ε} T).
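
A sketch of the reduction, done sequentially here for simplicity (the slide runs the structures in parallel); solve_tnn stands for a hypothetical (t, ε)-approximate query structure that returns a point or None on FAILURE:

```python
def approx_nn(q, solve_tnn, T, eps):
    """Probe geometrically growing thresholds t = 1, (1+eps), (1+eps)^2, ...
    The first success returns a point p with D(q, p) < (1 + eps) * t."""
    t = 1.0
    while t <= T:
        p = solve_tnn(q, t)        # None signals FAILURE at threshold t
        if p is not None:
            return p
        t *= 1 + eps
    return None                    # no point within distance T
```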

27 Locality Sensitive Hashing and its applications
The analysis

28 LSH analysis
For a fixed threshold r, we distinguish between points that are
- near: D(p,q) < r, and
- far: D(p,q) > (1 + ε) r.
A locality-sensitive hash h should guarantee that
- near points are hashed together with Pr[h(a) = h(b)] ≥ P_1, and
- far points may be mapped together, but Pr[h(a) = h(c)] ≤ P_2,
where, of course, P_1 > P_2. (figure: b near a, c far from a)

29 What about the Hamming distance?
Family: h_i(p) = p_|{c_1, …, c_k}, where the coordinates c_i are chosen randomly.
If D(a,b) ≤ r, then Pr[h_i(a) = h_i(b)] = (1 - D(a,b)/d)^k ≥ (1 - r/d)^k = (p_1)^k = P_1.
If D(a,c) > (1 + ε) r, then Pr[h_i(a) = h_i(c)] = (1 - D(a,c)/d)^k < (1 - r(1+ε)/d)^k = (p_2)^k = P_2.
Where, of course, p_1 > p_2 (and hence P_1 > P_2).

30 LSH analysis
The LSH algorithm with the L mappings h_i() correctly solves the (r, ε)-NN problem on a query point q if the following hold:
I. the total number of points far from q that belong to the visited buckets h_i(q) is at most a constant times L;
II. if there exists a p* near q, then h_i(p*) = h_i(q) for some i (p* is in a visited bucket).
Theorem: take k = log_{1/p_2} n and L = n^ρ with ρ = ln p_1 / ln p_2; then the two properties above hold with probability at least 0.298.
Repeating the process Θ(1/δ) times ensures a success probability of at least 1 - δ.
Space ≈ n · L = n^{1+ρ}, with ρ = ln p_1 / ln p_2 < 1; query time ≈ L = n^ρ buckets accessed.
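
The theorem's parameter choice computed numerically, with illustrative values for n, p_1, p_2:

```python
import math

def lsh_params(n, p1, p2):
    """k = log_{1/p2} n and L = n^rho with rho = ln p1 / ln p2 < 1."""
    k = math.ceil(math.log(n) / math.log(1 / p2))
    rho = math.log(p1) / math.log(p2)
    L = math.ceil(n ** rho)
    return k, L, rho

print(lsh_params(n=1_000_000, p1=0.9, p2=0.5))   # e.g. k = 20, L = 9
```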

31 Proof
Let p* be a point near q: D(q, p*) < r.
FAR(q) = the set of points p such that D(q, p) > (1 + ε) r.
BUCKET_i(q) = the set of points p such that h_i(p) = h_i(q).
Define the following events:
E1 = the number of far points in the visited buckets is ≤ 3L;
E2 = p* occurs in some visited bucket, i.e., ∃ j such that h_j(q) = h_j(p*).

32 Bad collisions: more than 3L
Let p be a point in FAR(q): Pr[h_j(p) = h_j(q), for a fixed j] < P_2 = (p_2)^k.
Given that k = log_{1/p_2} n, a fixed far point p satisfies h_j(p) = h_j(q) with probability (p_2)^k = 1/n.
So the expected number X of far points over the L visited buckets is at most L · n · (1/n) = L.
By Markov's inequality, Pr[X > 3 E[X]] ≤ 1/3, and it follows that Pr[not E1] = Pr[more than 3L far points in the visited buckets] ≤ 1/3.

33 Good collisions: p* occurs
For any h_j, Pr[h_j(p*) = h_j(q)] ≥ P_1 = (p_1)^k = (p_1)^{log_{1/p_2} n} = n^{-ln p_1 / ln p_2} = 1/L, given that L = n^{ln p_1 / ln p_2}.
So we have that Pr[not E2] = Pr[not finding p* in any of q's visited buckets] = ∏_j (1 - Pr[h_j(p*) = h_j(q)]) ≤ (1 - 1/L)^L ≤ 1/e.
Finally, Pr[E1 and E2] ≥ 1 - Pr[not E1 OR not E2] ≥ 1 - (Pr[not E1] + Pr[not E2]) ≥ 1 - 1/3 - 1/e ≈ 0.298.

