Locality Sensitive Hashing


1 Locality Sensitive Hashing
Petra Kohoutková, Martin Kyselák

2 Outline
Motivation: NN query, randomized approximate NN, open problems
Definition: LSH basic principles, definition, illustration
Algorithm: basic idea, parameters, complexity
Examples: specific LSH functions for several distance measures

3 Motivation Nearest neighbor queries: the goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. Several efficient algorithms are known for the case when the dimension d is low, e.g., kd-trees. However, these solutions suffer from either space or query time exponential in d. Thus, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high d. This phenomenon is often called "the curse of dimensionality."

4 Motivation II Approximate NN: return a point whose distance from the query is at most c times the distance from the query to its nearest point; c > 1 is called the approximation factor. Randomized c-approximate R-near neighbor ((c, R)-NN) problem: given a set P of points in a d-dimensional space, and parameters R > 0, δ > 0, construct a data structure such that, given any query point q, if there exists an R-near neighbor of q in P, it reports some cR-near neighbor of q in P with probability 1 − δ. Locality Sensitive Hashing (LSH): the key idea is to use hash functions such that the probability of collision is much higher for objects that are close to each other than for those that are far apart. Then, one can determine near neighbors by hashing the query point and retrieving the elements stored in buckets containing that point. The aim is a complexity linear in d and sublinear in n.

5 LSH Definition The LSH algorithm relies on the existence of locality-sensitive hash functions. Let H be a family of hash functions mapping R^d to some universe U. For any two points p and q, consider a process in which we choose a function h from H uniformly at random, and analyze the probability that h(p) = h(q).

6 LSH Definition II Definition: A family H of functions h: R^d → U is called (R, cR, P1, P2)-sensitive if, for any p, q: if |p − q| ≤ R, then Pr[h(p) = h(q)] ≥ P1; if |p − q| ≥ cR, then Pr[h(p) = h(q)] ≤ P2. In order for a locality-sensitive hash (LSH) family to be useful, it has to satisfy P1 > P2.

7 Example: Hamming Distance
Consider a data set of binary strings of length d (e.g., 000100 for d = 6), compared by the Hamming distance D. In this case, we can use a simple family of functions H that contains all projections of the input point onto one of the coordinates: H = {h_i | h_i: {0,1}^d → {0,1}, h_i(p) = p_i} (p_i is the i-th bit of p). Then the collision probability is Pr[h(p) = h(q)] = 1 − D(p,q)/d. For example, for two strings with d = 6 that differ in exactly one position (D(p,q) = 1), Pr[h(p) = h(q)] = 5/6. Hence H is (1, 2, 5/6, 4/6)-sensitive.
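
A minimal runnable sketch (in Python) of this bit-sampling family; since the slide's concrete example strings did not survive, p and q below are illustrative 6-bit strings that differ in exactly one position:

import random

# Bit-sampling LSH family for Hamming distance: h_i(p) = p[i],
# with the index i chosen uniformly at random from {0, ..., d-1}.
# p and q are illustrative stand-ins: d = 6, D(p, q) = 1.
p = "010101"
q = "011101"
d = len(p)

trials = 100_000
collisions = 0
for _ in range(trials):
    i = random.randrange(d)   # pick h_i uniformly at random from H
    if p[i] == q[i]:          # collision: h_i(p) == h_i(q)
        collisions += 1

print(collisions / trials)    # approx. 1 - D(p, q)/d = 5/6 ~ 0.833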

8 LSH Algorithm An LSH family H can be used to design an efficient algorithm for approximate NN search. However, one typically cannot use H directly, since the gap between the probabilities P1 and P2 could be quite small. Given a family H of hash functions with parameters (R, cR, P1, P2), we amplify the gap between the high probability P1 and the low probability P2 by concatenating several functions. In particular, for parameters k and L (specified later), we choose L functions g_j, j = 1,…,L, by setting g_j(q) = (h_1,j(q), h_2,j(q), …, h_k,j(q)), where the h_t,j (1 ≤ t ≤ k, 1 ≤ j ≤ L) are chosen independently and uniformly at random from H. These are the actual functions that we use to hash the data points.
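
A hedged sketch of this amplification step for the bit-sampling family above; make_g and the parameter values are illustrative, not from the slides:

import random

# Concatenate k randomly chosen h's into one composite function g_j:
# g_j(p) = (h_1(p), ..., h_k(p)). For close points the collision
# probability becomes P1^k, for far points P2^k, widening the gap.
def make_g(d, k, rng):
    coords = [rng.randrange(d) for _ in range(k)]   # k random h's
    return lambda p: tuple(p[i] for i in coords)

rng = random.Random(0)
d, k, L = 6, 3, 4
g = [make_g(d, k, rng) for _ in range(L)]           # L independent g_j's
print(g[0]("010101"))                               # e.g. ('0', '1', '0')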

9 LSH Algorithm – Parameters
(A. Andoni, P. Indyk: Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions)

10 LSH Algorithm II Preprocessing:
Choose L functions g_j, j = 1,…,L, by setting g_j = (h_1,j, h_2,j, …, h_k,j), where h_1,j,…,h_k,j are chosen at random from the LSH family H. Construct L hash tables, where, for each j = 1,…,L, the j-th hash table contains the dataset points hashed using the function g_j. Query algorithm for a query point q: for each j = 1, 2,…, L, retrieve the points from the bucket g_j(q) in the j-th hash table; for each retrieved point, compute its distance from q, and report the point if it is a correct answer (i.e., a cR-near neighbor of q).
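
A compact sketch of the whole preprocessing/query scheme, again for the Hamming case; the class name LSHIndex and the chosen parameter values are illustrative assumptions:

import random
from collections import defaultdict

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

class LSHIndex:
    def __init__(self, d, k, L, seed=0):
        rng = random.Random(seed)
        # L composite functions g_j, each sampling k random coordinates
        self.coords = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]   # L hash tables

    def _key(self, p, j):
        return tuple(p[i] for i in self.coords[j])            # g_j(p)

    def add(self, p):
        for j in range(len(self.tables)):
            self.tables[j][self._key(p, j)].append(p)

    def query(self, q, R, c):
        # Scan q's bucket in each table; report any cR-near neighbor.
        for j in range(len(self.tables)):
            for p in self.tables[j][self._key(q, j)]:
                if hamming(p, q) <= c * R:
                    return p
        return None

index = LSHIndex(d=6, k=2, L=4)
for s in ["010101", "111000", "000011"]:
    index.add(s)
print(index.query("011101", R=1, c=2))   # likely returns "010101"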

11 LSH Algorithm III The LSH algorithm gives a solution to the randomized c-approximate R-near neighbor problem, with parameters R and δ for some constant failure probability δ < 1. The value of δ depends on the choice of the parameters k and L. Conversely, for each δ, one can provide parameters k and L so that the error probability is smaller than δ. The query time is also dependent on k and L. It could be as high as Θ(n) in the worst case, but, for many natural data sets, a proper choice of parameters results in a sublinear query time O(dn^ρ), ρ < 1.
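
A hedged illustration of the standard parameter choice behind these bounds (following the usual LSH analysis): with ρ = ln(1/P1)/ln(1/P2), one sets k ≈ log_{1/P2} n and L ≈ n^ρ, which yields the O(dn^ρ) query time above. P1 and P2 are taken from the Hamming example; n is an illustrative dataset size:

import math

P1, P2 = 5/6, 4/6      # from the Hamming example (R = 1, c = 2, d = 6)
n = 1_000_000          # illustrative dataset size

rho = math.log(1 / P1) / math.log(1 / P2)       # exponent of the query time
k = math.ceil(math.log(n) / math.log(1 / P2))   # length of each g_j
L = math.ceil(n ** rho)                         # number of hash tables

print(rho, k, L)   # rho ~ 0.45: query time roughly O(d * n^0.45)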

12 LSH Functions
for Hamming distance
for l1 distance
for Euclidean (l2) distance
for Jaccard's coefficient
for Arccos measure
for general metric space?

13 LSH for Hamming Distance
The Hamming distance of strings p, q is equal to the number of positions where p and q differ. Define a family of functions H that contains all projections of the input point onto one of the coordinates: H = {h_i | h_i: {0,1}^d → {0,1}, h_i(p) = p_i} (p_i is the i-th bit of p).

14 LSH for l1 Distance The l1 distance of vectors x, y is defined as l1(x, y) = |x1 − y1| + … + |xd − yd|. Fix a real w ≫ R, and impose a randomly shifted grid with cells of width w; each cell defines a bucket. More specifically, pick random real numbers s1,…,sd from [0, w) and define h_s1,…,sd(x) = (⌊(x1 − s1)/w⌋, …, ⌊(xd − sd)/w⌋).
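
A minimal Python sketch of this randomly shifted grid (make_grid_hash and the sample values are illustrative):

import math
import random

# Randomly shifted grid for l1: shift each coordinate by s_i ~ U[0, w)
# and floor to a cell index; the resulting tuple of cell indices is
# the bucket.
def make_grid_hash(d, w, rng):
    shifts = [rng.uniform(0, w) for _ in range(d)]
    return lambda x: tuple(math.floor((x[i] - shifts[i]) / w)
                           for i in range(d))

rng = random.Random(0)
h = make_grid_hash(d=2, w=4.0, rng=rng)
print(h((1.0, 2.0)), h((1.5, 2.5)))   # nearby points often share a cell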

15 LSH for Euclidean (l2) Distance
l2(x, y) = (|x1 − y1|² + … + |xd − yd|²)^(1/2). Pick a random projection of R^d onto a 1-dimensional line and chop the line into segments of length w, shifted by a random value b from [0, w): h_r,b(x) = ⌊(r·x + b)/w⌋, where the projection vector r from R^d is constructed by picking each coordinate of r from the Gaussian distribution. (Illustration: two sample functions with w = 3: r = (3, 1), b = 2 and r = (4, −1), b = 2.)
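
A minimal sketch of this projection-based function (make_l2_hash is an illustrative name; the Gaussian choice of r follows the slide):

import math
import random

# l2 LSH: project x onto a random Gaussian vector r, add a random
# shift b ~ U[0, w), and quantize into segments of length w.
def make_l2_hash(d, w, rng):
    r = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0, w)
    return lambda x: math.floor((sum(ri * xi for ri, xi in zip(r, x)) + b) / w)

rng = random.Random(0)
h = make_l2_hash(d=2, w=3.0, rng=rng)
print(h((3.0, 1.0)), h((3.1, 1.1)))   # close points usually collide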

16 LSH for Jaccard’s Coefficient
Jaccard’s coefficient for sets is defined as Pick a random permutation π on the ground universe U. Then, define hπ(A) = min{π(a) | a is in A}. The probability of collision is Prπ[hπ(A)= hπ(B)] = 1 - d(A, B) A π1 hπ1(A) = hπ1(B) = π2 hπ2(A) = hπ2(B) = B
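
A minimal MinHash sketch; the sets A and B and the 20-element universe are illustrative. The empirical collision rate should approach J(A, B) = 0.6:

import random

# MinHash: a random permutation pi of the universe induces
# h_pi(S) = min over s in S of pi(s); the collision rate over many
# random permutations estimates the Jaccard coefficient J(A, B).
universe = list(range(20))
A, B = {1, 2, 3, 4}, {2, 3, 4, 5}          # J(A, B) = 3/5

trials = 10_000
collisions = 0
for _ in range(trials):
    order = random.sample(universe, len(universe))  # random permutation
    pi = {u: rank for rank, u in enumerate(order)}
    if min(pi[a] for a in A) == min(pi[b] for b in B):
        collisions += 1

print(collisions / trials)   # ~0.6 = J(A, B) = 1 - d(A, B)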

17 LSH for Arccos Measure The arccos measure for vectors p, q is defined as θ(p, q) = arccos(p·q / (|p| |q|)). A family of LSH functions is then defined as follows: H = {h_u | h_u(p) = sign(u·p)}, where u is a random unit-length vector. The probability of collision is Pr[h_u(p) = h_u(q)] = 1 − θ(p, q)/π.
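
A minimal sketch of this sign-of-projection hash; drawing u from a Gaussian makes its direction uniform, which is equivalent to a random unit vector here:

import math
import random

# Random-hyperplane hash: h_u(p) = sign(u . p). The empirical collision
# rate should approach 1 - theta(p, q)/pi.
def theta(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    return math.acos(dot / (math.hypot(*p) * math.hypot(*q)))

p, q = (1.0, 0.0), (1.0, 1.0)   # 45 degrees apart

trials = 100_000
collisions = 0
for _ in range(trials):
    u = (random.gauss(0, 1), random.gauss(0, 1))
    same_side = (u[0] * p[0] + u[1] * p[1] >= 0) == (u[0] * q[0] + u[1] * q[1] >= 0)
    if same_side:
        collisions += 1

print(collisions / trials, 1 - theta(p, q) / math.pi)   # both ~0.75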

18 LSH for General Metric Space
We know nothing about the distance measure; we can only use distances between objects. Idea: use some randomly (?) picked objects as pivots, and define buckets as Voronoi regions. Whether this yields a provably locality-sensitive family remains open.
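
A hedged sketch of the pivot idea (make_pivot_hash and the toy 1-D metric are illustrative; whether such a family is provably locality-sensitive is exactly the open question above):

import random

# Pivot-based bucketing: hash each object to the index of its nearest
# pivot, i.e., to the Voronoi region it falls into. Only distances
# between objects are used, so any metric works.
def make_pivot_hash(objects, n_pivots, dist, rng):
    pivots = rng.sample(objects, n_pivots)
    return lambda x: min(range(n_pivots), key=lambda i: dist(x, pivots[i]))

rng = random.Random(0)
points = [0.5, 1.0, 5.2, 5.3, 9.9]
h = make_pivot_hash(points, n_pivots=2, dist=lambda a, b: abs(a - b), rng=rng)
print([h(x) for x in points])   # nearby points tend to share a bucket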

