
1 Similarity Search in High Dimensions via Hashing. Aristides Gionis, Piotr Indyk, and Rajeev Motwani, Department of Computer Science, Stanford University. Presented by Jiyun Byun, Vision Research Lab, ECE, UCSB.

2 Outline: Introduction; Locality Sensitive Hashing; Analysis; Experiments; Concluding Remarks.

3 Introduction. Nearest neighbor search (NNS) suffers from the curse of dimensionality. Responses: the experimental approach (use heuristics), the analytical approach, and the approximate approach. ε-Nearest Neighbor Search (ε-NNS). Goal: for any given query q ∈ R^d, return a point p ∈ P with d(q,p) ≤ (1+ε)·d(q,P), where d(q,P) is the distance from q to its closest point in P. Rationale: right answers are much closer than irrelevant ones, so this is a time/quality trade-off.
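In symbols, the ε-NNS guarantee stated above can be written as follows (a standard formulation consistent with the slide's definition):

```latex
% \varepsilon-NNS: for a query q and dataset P \subset \mathbb{R}^d,
% return a point p \in P such that
d(q, p) \le (1 + \varepsilon)\, d(q, P),
\qquad \text{where } d(q, P) = \min_{p' \in P} d(q, p').
```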

4 Locality Sensitive Hashing (LSH). Collision probability depends on the distance between points: higher collision probability for close objects, smaller collision probability for those that are far apart. Given a query point, hash it using a set of hash functions and inspect the entries in each resulting bucket.

5 Locality Sensitive Hashing

6 Locality Sensitive Hashing (LSH): Setting. C: the largest coordinate value among all points in the given dataset P of dimension d (P ⊂ R^d). Embed P into the Hamming cube {0,1}^d' of dimension d' = C·d: v(p) = Unary_C(x_1) … Unary_C(x_d), i.e., use the unary code for each point along each dimension. The embedding is isometric, d_1(p,q) = d_H(v(p),v(q)), so it preserves the distance between points.
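A minimal sketch of the unary embedding, assuming integer coordinates in {0,…,C} (the function names are illustrative, not from the paper):

```python
def unary_embed(point, C):
    """Embed a point with integer coordinates in {0,...,C} into the
    Hamming cube {0,1}^(C*d): coordinate x becomes x ones followed by
    (C - x) zeros, concatenated across all d dimensions."""
    bits = []
    for x in point:
        bits.extend([1] * x + [0] * (C - x))
    return bits

def hamming_distance(u, v):
    """Number of bit positions on which u and v differ."""
    return sum(a != b for a, b in zip(u, v))

# The embedding is isometric: L1 distance equals Hamming distance.
p, q, C = [3, 0, 2], [1, 2, 2], 4
assert hamming_distance(unary_embed(p, C), unary_embed(q, C)) == \
       sum(abs(a - b) for a, b in zip(p, q))  # d_1(p,q) = d_H(v(p),v(q))
```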

7 Locality Sensitive Hashing (LSH): Hash functions (1/2). Build hash functions on the Hamming cube in d' dimensions. Choose L subsets of the dimensions, I_1, I_2, …, I_L, where each I_j consists of k elements of {1,…,d'} found by sampling uniformly at random with replacement. Project each point onto each I_j: g_j(p) = the projection of p onto I_j, obtained by concatenating the bit values of p for the dimensions in I_j. Store p in the buckets g_j(p), j = 1,…,L.
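A sketch of this construction in Python (the parameters k and L follow the slide; the data structures are illustrative):

```python
import random

def build_indices(points_bits, k, L, d_prime, seed=0):
    """Build L LSH indices over points in the Hamming cube {0,1}^d'.
    Each index samples k dimensions uniformly at random with
    replacement; g_j(p) concatenates p's bits on those dimensions."""
    rng = random.Random(seed)
    subsets = [[rng.randrange(d_prime) for _ in range(k)] for _ in range(L)]
    tables = [dict() for _ in range(L)]  # bucket key g_j(p) -> point ids
    for pid, bits in enumerate(points_bits):
        for j, I_j in enumerate(subsets):
            key = tuple(bits[i] for i in I_j)          # g_j(p)
            tables[j].setdefault(key, []).append(pid)  # store p in bucket
    return subsets, tables
```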

8 Locality Sensitive Hashing (LSH): Hash functions (2/2). Two levels of hashing: the LSH function maps a point p to bucket g_j(p); a standard hash function maps the contents of those buckets into a hash table of size M. B: bucket capacity; α: memory utilization parameter.
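A sketch of the second level of hashing; the specific hash function and the relation of M to n, B, and α here are illustrative assumptions, not the paper's exact choices:

```python
def second_level_hash(key, M, seed=1):
    """Map an LSH bucket key g_j(p) (a tuple of bits) to one of M slots
    in a standard hash table, so only non-empty buckets consume memory."""
    h = seed
    for bit in key:
        h = (h * 31 + bit) % M  # simple polynomial rolling hash
    return h

# Usage sketch: slot = second_level_hash(g_j_of_p, M). Each bucket holds
# at most B points (B = bucket capacity); M would be chosen in proportion
# to n/B via the memory utilization parameter alpha.
```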

9 Query processing. Search the buckets g_1(q), …, g_L(q) until c·L points are found or all L indices are searched. Approximate K-NNS: output the K points closest to q (fewer if less than K points are found). (r,ε)-neighbor with parameter r: output a point within distance r(1+ε) of q, if one exists.
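A sketch of query processing against the indices built above (hamming_distance, subsets, and tables come from the earlier sketches; the stopping constant c is an assumption):

```python
def query(q_bits, subsets, tables, points_bits, K, c=2):
    """Approximate K-NNS: collect candidates from buckets g_j(q),
    stopping once c*L points are seen or all L indices are searched,
    then return the K candidates closest to q."""
    L = len(subsets)
    candidates = set()
    for I_j, table in zip(subsets, tables):
        key = tuple(q_bits[i] for i in I_j)        # g_j(q)
        candidates.update(table.get(key, []))
        if len(candidates) >= c * L:               # interrupt the search
            break
    ranked = sorted(candidates,
                    key=lambda pid: hamming_distance(q_bits, points_bits[pid]))
    return ranked[:K]  # fewer than K if not enough candidates were found
```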

10 Analysis. A family of hash functions is (r_1, r_2, p_1, p_2)-sensitive, where r_1 < r_2 and p_1 > p_2, if d(q,p) ≤ r_1 implies Pr[h(q) = h(p)] ≥ p_1 and d(q,p) ≥ r_2 implies Pr[h(q) = h(p)] ≤ p_2. The family of single-bit projections in the Hamming cube H^d' is (r, r(1+ε), 1 − r/d', 1 − r(1+ε)/d')-sensitive: if d_H(q,p) = r (p and q differ on r bits), then Pr[h(q) ≠ h(p)] = r/d'.
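A numeric check of the single-bit-projection probabilities (the values of d', r, and ε are illustrative):

```python
def single_bit_collision_prob(r, d_prime):
    """Pr[h(q) = h(p)] for a uniformly random single-bit projection,
    when p and q differ on exactly r of the d' bits."""
    return 1.0 - r / d_prime

d_prime, r, eps = 1000, 50, 0.5
p1 = single_bit_collision_prob(r, d_prime)              # 1 - r/d'
p2 = single_bit_collision_prob(r * (1 + eps), d_prime)  # 1 - r(1+eps)/d'
print(p1, p2)  # 0.95 0.925 -> (r, r(1+eps), p1, p2)-sensitive, p1 > p2
```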

11 LSH solves the (r,ε)-neighbor problem. Determine whether there exists a point within distance r of the query point q, or whether all points are at least distance r(1+ε) away from q. In the former case, return a point within distance r(1+ε) of q. Repeat the construction to boost the probability of success.
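Concatenating k bits drives the two collision probabilities apart, and repeating with L independent indices boosts the chance that a near point collides with q in at least one of them. A sketch, reusing the illustrative p1 and p2 from above:

```python
p1, p2 = 0.95, 0.925  # illustrative values from the previous sketch
k, L = 10, 30         # illustrative parameter choices

# A near point (distance <= r) collides with q in one fixed index with
# probability p1**k; a far point with the smaller probability p2**k.
p_near_one_index = p1 ** k
p_far_one_index = p2 ** k

# Probability that the near point collides with q in at least one index:
p_near_any = 1 - (1 - p_near_one_index) ** L
print(p_near_one_index, p_far_one_index, p_near_any)
```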

12 ε-NN problem. For a given query point q, return a point p from the dataset P by running multiple instances of the (r,ε)-neighbor solution, with radii r_0, r_0(1+ε), r_0(1+ε)^2, …, up to r_max.
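The geometric radius schedule, as a short sketch (the r_0 and r_max values are illustrative):

```python
def radius_schedule(r0, r_max, eps):
    """Radii r0, r0*(1+eps), r0*(1+eps)^2, ... up to r_max; one
    (r, eps)-neighbor instance is run per radius."""
    radii = []
    r = r0
    while r < r_max:
        radii.append(r)
        r *= (1 + eps)
    radii.append(r_max)
    return radii

print(radius_schedule(1.0, 10.0, 0.5))  # [1.0, 1.5, 2.25, ..., 10.0]
```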

13 Experiments (1/3): Datasets. Color histograms (Corel Draw): n = 20,000; d = 8,…,64. Texture features (aerial photos): n = 270,000; d = 60. Query sets. Disk layout: each second-level bucket is directly mapped to a disk block.

14 Experiments (2/3): Profiles. [Figure: interpoint distance profiles for the color histogram and texture feature datasets; normalized frequency vs. interpoint distance.]

15 Experiments (3/3): Performance measures. Speed: average number of disk blocks accessed. Effective error: compares d_LSH(q), the distance from q to the neighbor returned by LSH, with d*(q), the distance from q to the true nearest neighbor. Miss ratio: the fraction of queries for which no answer was found.
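A sketch of these measures; the exact averaging in the paper's effective-error formula is paraphrased here as the mean ratio d_LSH/d* over answered queries, minus one so that a perfect answer scores zero (an assumption about the precise formula):

```python
def performance_measures(d_lsh, d_star):
    """d_lsh[i]: distance from query i to the LSH answer (None if no
    answer was found); d_star[i]: true nearest-neighbor distance.
    Returns (effective error, miss ratio)."""
    answered = [(a, b) for a, b in zip(d_lsh, d_star) if a is not None]
    # Effective error: how much farther the LSH answer is, on average.
    error = sum(a / b for a, b in answered) / len(answered) - 1.0
    # Miss ratio: fraction of queries with no answer at all.
    miss_ratio = 1.0 - len(answered) / len(d_lsh)
    return error, miss_ratio
```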

16 Experiments: color histogram (1/4). [Figure: error vs. number of indices (L).]

17 Experiments: color histogram (2/4). Dependence on n. [Figure: disk accesses vs. number of database points, for approximate 1-NNS and 10-NNS.]

18 Experiments: color histogram (3/4). Miss ratios. [Figure: miss ratio vs. number of database points, for approximate 1-NNS and 10-NNS.]

19 Experiments: color histogram (4/4). Dependence on d. [Figure: disk accesses vs. number of dimensions, for approximate 1-NNS and 10-NNS.]

20 Experiments: texture features (1/2). [Figure: number of indices vs. error.]

21 Experiments: texture features (2/2). [Figure: number of indices vs. size.]

22 Concluding remarks. Locality Sensitive Hashing: a fast approximation technique; dynamic and join versions. Future work: hybrid techniques combining tree-based and hashing-based approaches.

