Nearest Neighbor Search in High Dimensions Seminar in Algorithms and Geometry Mica Arie-Nachimson and Daniel Glasner April 2009.

1 Nearest Neighbor Search in High Dimensions. Seminar in Algorithms and Geometry. Mica Arie-Nachimson and Daniel Glasner, April 2009

2 Talk Outline
Nearest neighbor problem
–Motivation
Classical nearest neighbor methods
–KD-trees
Efficient search in high dimensions (main results: Indyk and Motwani, 1998; Gionis, Indyk and Motwani, 1999)
–Bucketing method
–Locality Sensitive Hashing
Conclusion

3 Nearest Neighbor Problem
Input: A set P of points in R^d (or any metric space).
Output: Given a query point q, find the point p* in P which is closest to q.
[Figure: query point q and its nearest neighbor p*]

4 What is it good for? Many things! Examples: Optical Character Recognition, Spell Checking, Computer Vision, DNA sequencing, Data compression.

5 What is it good for? [Figure: the digits 2 2 2 2 3 3 1 7 8 7 4 2 placed in feature space, with a query]

6 What is it good for? [Figure: the words about, boat, bat, abate, able, scout, shout placed in feature space, with the query "abaut"]

7 What is it good for? And many more…

8 Approximate Nearest Neighbor: ε-NN

9 Approximate Nearest Neighbor: ε-NN
Input: A set P of points in R^d (or any metric space).
Given a query point q, let:
–p* be the point in P closest to q
–r* be the distance ||p*-q||
Output: Some point p' with distance at most (1+ε)r* from q.
[Figure: q, p*, and the balls of radius r* and (1+ε)r* around q]

11 Approximate vs. Exact Nearest Neighbor: Many applications give similar results with approximate NN. Example from Computer Vision:

12 Retiling (slide from Lihi Zelnik-Manor)

13 Exact NNS: ~27 sec. Approximate NNS: ~0.6 sec. (Slide from Lihi Zelnik-Manor)

14 Solution Method
Input: A set P of n points in R^d.
Method: Construct a data structure to answer nearest neighbor queries.
Complexity:
–Preprocessing: space and time to construct the data structure
–Query: time to return an answer

15 Solution Method
Naïve approach:
–Preprocessing O(nd)
–Query time O(nd)
Reasonable requirements:
–Preprocessing time and space poly(nd)
–Query time sublinear in n
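The naïve approach and the ε-NN condition can be sketched in a few lines. This is an illustration only, not code from the talk: the function names and the sample points (borrowed from the d=2 KD-tree example) are assumptions.

```python
import math

def nearest_neighbor(points, q):
    """Exact NN by linear scan: O(n*d) work per query, no index is built."""
    return min(points, key=lambda p: math.dist(p, q))

def is_eps_nn(points, q, candidate, eps):
    """True if candidate is within (1+eps)*r* of q, where r* is the exact NN distance."""
    r_star = math.dist(nearest_neighbor(points, q), q)
    return math.dist(candidate, q) <= (1 + eps) * r_star

# Hypothetical sample data, chosen only for illustration.
P = [(1, 6), (6, 8), (9, 9), (12, 5), (17, 4), (20, 10), (23, 2)]
q = (10, 6)
p_star = nearest_neighbor(P, q)
```

Here (9, 9) is an acceptable answer for ε = 0.5 but not for ε = 0.1, which shows how ε trades accuracy for freedom in the answer.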

17 Classical nearest neighbor methods
Tree structures
–kd-trees
Voronoi diagrams
–Preprocessing poly(n), exp(d)
–Query log(n), exp(d)
Difficult problem in high dimensions
–The solutions still work, but are exp(d)…

18 KD-tree, d=1 (binary search tree)
[Figure: the points 7, 8, 10, 12, 13, 15, 18 on a line from 5 to 20. The root splits them into {7,8,10,12} and {13,15,18}; the next level gives {7,8}, {10,12}, {13,15}, {18}.]

19 KD-tree, d=1 (binary search tree)
[Figure: the same tree with query 17; min dist = 1]

20 KD-tree, d=1 (binary search tree)
[Figure: the same tree with query 16; the search first finds min dist = 2, then improves to min dist = 1]

21 KD-tree, d>1: alternate between dimensions. Example: d=2
[Figure: the points (1,6), (6,8), (9,9), (12,5), (17,4), (20,10), (23,2) in the plane, and the corresponding tree splitting on x, then y, then x]

22 KD-tree, d>1: alternate between dimensions. Example: d=2
[Figure: the resulting partition of the plane]

23 KD-tree, d>1: alternate between dimensions. Example: d=2, NN search
Animated gif from http://en.wikipedia.org/wiki/File:KDTree-animation.gif

24 KD-tree: complexity
Preprocessing O(nd)
Query
–O(log n) if points are randomly distributed
–worst case O(k·n^(1-1/k)), where k is the dimension; almost linear when n is close to k
–In the worst case the whole tree needs to be searched
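A minimal KD-tree sketch, assuming tuples as points and a median split that alternates between dimensions. It re-sorts at every level for brevity, so it is not the most efficient construction, and the function names are illustrative rather than taken from the talk.

```python
import math

def build(points, depth=0):
    """Build a kd-tree; a node is (point, left, right) and the split axis alternates with depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (pts[mid], build(pts[:mid], depth + 1), build(pts[mid + 1:], depth + 1))

def nearest(node, q, depth=0, best=None):
    """Return the stored point closest to q, pruning subtrees that cannot beat the current best."""
    if node is None:
        return best
    point, left, right = node
    if best is None or math.dist(point, q) < math.dist(best, q):
        best = point
    axis = depth % len(q)
    diff = q[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, q, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the best distance so far.
    if abs(diff) < math.dist(best, q):
        best = nearest(far, q, depth + 1, best)
    return best

# The seven points from the d=2 example slide.
P = [(1, 6), (6, 8), (9, 9), (12, 5), (17, 4), (20, 10), (23, 2)]
tree = build(P)
```

The pruning test is where high dimensions hurt: when many splitting planes lie closer than the current best distance, the search visits most of the tree.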

26 Sublinear solutions
Method    | Preprocessing                         | Query time
Bucketing | n^O(1/ε^2)                            | O(log n)
LSH       | O(n^(1+1/(1+ε))) [n^(3/2) when ε=1]   | O(n^(1/(1+ε))) [sqrt(n) when ε=1]
Linear in d. Not counting log n factors.
Solve ε-NN by reduction.

27 r-PLEB: Point Location in Equal Balls
Given n balls of radius r, for every query q, find a ball that contains q, if one exists. If q doesn't reside in any ball, return NO.
[Figure: q lies in the ball around p_1; return p_1]

28 r-PLEB: Point Location in Equal Balls
[Figure: q lies in no ball; return NO]

29 Reduction from ε-NN to r-PLEB
The two problems are connected
–r-PLEB is like a decision problem for ε-NN

32 Reduction from ε-NN to r-PLEB: Naïve Approach
Set R = ratio between the largest and the smallest distance between two points of P.
Define r = {(1+ε)^0, (1+ε)^1, …, R}.
For each r_i construct an r_i-PLEB.
Given q, find the smallest r_i which gives a YES.
–Use binary search to find it.

33 Reduction from ε-NN to r-PLEB: Naïve Approach
[Figure: the nested structures r_1-PLEB, r_2-PLEB, r_3-PLEB]

34 Reduction from ε-NN to r-PLEB: Naïve Approach
Correctness
–Stopped at r_i = (1+ε)^k
–r_(i+1) = (1+ε)^(k+1)
–(1+ε)^k ≤ r* ≤ (1+ε)^(k+1)
[Figure: r_1-PLEB, r_2-PLEB, r_3-PLEB]

35 Reduction from ε-NN to r-PLEB: Naïve Approach
Reduction overhead:
Space: O(log_(1+ε) R) r-PLEB constructions
–Size of {(1+ε)^0, (1+ε)^1, …, R} is log_(1+ε) R
Query: O(log log_(1+ε) R) calls to r-PLEB
Dependency on R
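The naïve reduction can be sketched as a binary search over geometrically growing radii. This is an illustration under stated assumptions: `pleb` is a brute-force stand-in for a real r-PLEB structure, the radii are scaled by the smallest pairwise distance, and the query's true nearest-neighbor distance is assumed to lie between the smallest and largest pairwise distances. All names are hypothetical.

```python
import math

def pleb(points, r, q):
    """Brute-force stand-in for an r-PLEB structure: a point within r of q, or None for NO."""
    for p in points:
        if math.dist(p, q) <= r:
            return p
    return None

def eps_nn_by_reduction(points, q, eps):
    """Binary search for the smallest radius d_min*(1+eps)^j whose PLEB answers YES."""
    dists = [math.dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]]
    d_min, d_max = min(dists), max(dists)
    steps = math.ceil(math.log(d_max / d_min, 1 + eps)) + 1
    radii = [d_min * (1 + eps) ** j for j in range(steps + 1)]
    lo, hi, answer = 0, len(radii) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = pleb(points, radii[mid], q)
        if hit is not None:
            answer, hi = hit, mid - 1   # YES: try a smaller radius
        else:
            lo = mid + 1                # NO: need a larger radius
    return answer
```

If the search stops at radius r_j with r_(j-1) answering NO, the returned point is within r_j < (1+ε)·r* of q. A real implementation would prebuild one PLEB structure per radius, which is exactly the log_(1+ε) R space overhead listed above.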

36 Reduction from ε-NN to r-PLEB: Better Approach (Har-Peled 2001)
Set r_med as the radius which gives n/2 connected components (C.C.).

38 Reduction from ε-NN to r-PLEB: Better Approach
Set r_med as the radius which gives n/2 connected components (C.C.).
Set r_top = 4n·r_med·log n / ε.
[Figure: balls of radius r_med and r_top around the points]

39 Reduction from ε-NN to r-PLEB: Better Approach
If q ∉ B(p_i, r_med) for all i, but q ∈ B(p_i, r_top) for some i: set R = r_top/r_med and perform binary search on r = {(1+ε)^0, (1+ε)^1, …, R}.
–R is independent of the input points.
If q ∉ B(p_i, r_med) and q ∉ B(p_i, r_top) ∀i, then q is "far away".
–Enough to choose one point from each C.C. and continue recursively with these points (accumulating error ≤ 1+ε/3).
If q ∈ B(p_i, r_med) for some i, then continue recursively on the C.C.
[Figure: balls of radius r_med around the points]

48 Reduction from ε-NN to r-PLEB: Better Approach
Complexity overhead: how many r-PLEB queries do the three cases above cost?
–Binary search case: O(log log R) = O(log(n/ε))
–Recursive cases: 2 queries, then continue with half of the points
Total: O(log n)

49 (r,ε)-PLEB: Point Location in Equal Balls
Given n balls of radius r, for query q:
–If q resides in a ball of radius r, return the ball.
–If q doesn't reside in any ball, even after enlarging the radius to (1+ε)r, return NO.
–If q resides only in the "border" of a ball (farther than r from its center but within (1+ε)r), return either the ball or NO.
[Figure: q inside the ball around p_1; return p_1]

50 (r,ε)-PLEB: Point Location in Equal Balls
[Figure: q outside every ball; return NO]

51 (r,ε)-PLEB: Point Location in Equal Balls
[Figure: q in the border of a ball; return YES or NO]

53 Bucketing Method (Indyk and Motwani, 1998)
Apply a grid with cells of side rε/sqrt(d).
Every ball is covered by at most k cubes.
–Can show that k ≤ C^d/ε^d for some constant C < 5
kn cubes cover all balls.
Finite number of cubes: can use a hash table.
–Key: cube, Value: a ball it covers
Space required: O(nk)
[Figure: r-PLEB balls overlaid with the grid]

57 Bucketing Method
Given query q:
–Compute the cube it resides in [O(d)]
–Find the ball this cube intersects [O(1)]
–The center of this ball is a valid (r,ε)-PLEB answer for q

58 Bucketing Method
[Figure: a grid cell of side rε/sqrt(d) has diagonal rε, so every point of a cube that intersects a ball of radius r is within (1+ε)r of the ball's center]

59 Bucketing Method
[Figure: the regions where the answer is YES, NO, and YES or NO]

60 Bucketing Method: Complexity
Space required: O(nk) = O(n·(1/ε^d))
Query time: O(d)
If d = O(log n) [or n = O(2^d)]
–Space required: n^O(log(1/ε))
Else use dimensionality reduction in l_2 from d to ε^(-2)·log(n) [Johnson-Lindenstrauss lemma]
–Space: n^O(1/ε^2)
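A small sketch of the bucketing idea, assuming a grid anchored at the origin and a Python dict as the hash table. It enumerates the cells in each ball's bounding box, so it is only practical in low dimension; the function names are illustrative.

```python
import itertools
import math

def build_grid(points, r, eps):
    """Map every grid cell that intersects some ball B(p, r) to one such center p."""
    d = len(points[0])
    side = r * eps / math.sqrt(d)      # a cell of this side has diagonal r*eps
    table = {}
    for p in points:
        ranges = [range(math.floor((c - r) / side), math.floor((c + r) / side) + 1)
                  for c in p]
        for cell in itertools.product(*ranges):
            # Per-coordinate distance from p to the closest point of the cell.
            gap = [max(i * side - c, 0.0, c - (i + 1) * side) for i, c in zip(cell, p)]
            if math.hypot(*gap) <= r:
                table.setdefault(cell, p)
    return table, side

def query(table, side, q):
    """O(d) work: locate q's cell and look it up; None means NO."""
    cell = tuple(math.floor(c / side) for c in q)
    return table.get(cell)
```

A returned center is at most r plus one cell diagonal from q, that is (1+ε)r, and a query inside some ball of radius r always lands in a stored cell. These are the two guarantees (r,ε)-PLEB asks for.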

61 Break

63 Locality Sensitive Hashing (Indyk & Motwani 98; Gionis, Indyk & Motwani 99)
A solution for (r,ε)-PLEB.
Probabilistic construction; a query succeeds with high probability.
Use random hash functions g: X → U (some finite range).
Preserve "separation" of "near" and "far" points with high probability.

64 Locality Sensitive Hashing
If ||p-q|| ≤ r, then Pr[g(p)=g(q)] is "high"
If ||p-q|| > (1+ε)r, then Pr[g(p)=g(q)] is "low"
[Figure: a ball of radius r and the hash tables for g_1, g_2, g_3, …]

65 A locality sensitive family
A family H of functions h: X → U is called (P_1, P_2, r, (1+ε)r)-sensitive for metric d_X, if for any p, q:
–if ||p-q|| < r then Pr[h(p)=h(q)] > P_1
–if ||p-q|| > (1+ε)r then Pr[h(p)=h(q)] < P_2
For this notion to be useful we require P_1 > P_2

66 Intuition
–if ||p-q|| < r then Pr[h(p)=h(q)] > P_1
–if ||p-q|| > (1+ε)r then Pr[h(p)=h(q)] < P_2
[Figure: two hash functions h_1 and h_2 partitioning the points; illustration from Lihi Zelnik-Manor]

67 Claim
If there is a (P_1, P_2, r, (1+ε)r)-sensitive family for d_X, then there exists an algorithm for (r,ε)-PLEB in d_X with
–Space: O(dn + n^(1+ρ))
–Query: O(d·n^ρ)
where ρ = ln(1/P_1) / ln(1/P_2).
When ε = 1: space O(dn + n^(3/2)), query O(d·sqrt(n))

68 Algorithm – preprocessing
For i = 1,…,L
–Uniformly select k functions from H
–Set g_i(p) = (h_1(p), h_2(p), …, h_k(p))
[Figure: each h_i: R^d → {0,1} splits the space into a 0 side and a 1 side; two example points get g_i = (0,0,...,1) and g_i = (1,0,…,0)]

69 Algorithm – preprocessing
For i = 1,…,L
–Uniformly select k functions from H
–Set g_i(p) = (h_1(p), h_2(p), …, h_k(p))
–Compute g_i(p) for all p ∈ P
–Store the resulting values in a hash table

70 Algorithm - query
S ← ∅, i ← 1
While |S| ≤ 2L and i ≤ L:
–S ← S ∪ {points in bucket g_i(q) of table i}
–If ∃ p ∈ S s.t. ||p-q|| ≤ (1+ε)r, return p and exit.
–i++
Return NO.
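The preprocessing and query steps can be sketched for binary vectors under Hamming distance, assuming the bit-sampling family h_i(b) = b_i that the talk introduces for the Hamming cube. The function names, the seeded generator, and the way candidates are counted are illustrative choices, not the authors' implementation.

```python
import random

def hamming(a, b):
    """Number of coordinates where a and b disagree."""
    return sum(x != y for x, y in zip(a, b))

def build_lsh(points, k, L, seed=0):
    """Build L tables; table i keys each point by g_i(p), k randomly sampled coordinates."""
    rng = random.Random(seed)
    dim = len(points[0])
    tables = []
    for _ in range(L):
        coords = [rng.randrange(dim) for _ in range(k)]  # k functions h drawn uniformly from H
        buckets = {}
        for p in points:
            buckets.setdefault(tuple(p[c] for c in coords), []).append(p)
        tables.append((coords, buckets))
    return tables

def query_lsh(tables, q, r, eps):
    """Scan bucket g_i(q) of each table; give up once more than 2L far points were seen."""
    seen = 0
    for coords, buckets in tables:
        for p in buckets.get(tuple(q[c] for c in coords), []):
            if hamming(p, q) <= (1 + eps) * r:
                return p
            seen += 1
        if seen > 2 * len(tables):
            break
    return None  # NO
```

Any point returned is within (1+ε)r of q by construction; only the YES guarantee for near points is probabilistic, and it depends on the choice of k and L.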

71 Correctness
Property I: if ||q-p*|| ≤ r then g_i(p*) = g_i(q) for some i ∈ {1,...,L}
Property II: the number of points p ∈ P s.t. ||q-p|| ≥ (1+ε)r and g_i(p) = g_i(q) is less than 2L
We show that Pr[I & II hold] ≥ ½ - 1/e

72 Correctness
Choose:
–k = log_(1/P_2) n
–L = n^ρ, where ρ = ln(1/P_1) / ln(1/P_2)

73 Complexity
k = log_(1/P_2) n
L = n^ρ, where ρ = ln(1/P_1) / ln(1/P_2)
Space: L·n (hash tables) + d·n (data points) = O(n^(1+ρ) + dn)
Query: L hash function evaluations + O(L) distance calculations = O(d·n^ρ), not counting log n factors

74 Significance of k and L
[Plot: Pr[g(p) = g(q)] as a function of ||p-q||]

75 Significance of k and L
[Plot: Pr[g_i(p) = g_i(q) for some i ∈ {1,...,L}] as a function of ||p-q||]
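The effect of k and L can be checked numerically. The sketch below assumes the k functions in each g_i and the L tables are drawn independently, so a pair that collides under a single h with probability p collides under one g with probability p^k and under at least one of L tables with probability 1 - (1 - p^k)^L. Function names and the sample values are illustrative.

```python
import math

def choose_k_L(n, p1, p2):
    """k = log_{1/p2} n and L = n^rho with rho = ln(1/p1)/ln(1/p2), rounded up."""
    rho = math.log(1 / p1) / math.log(1 / p2)
    k = math.ceil(math.log(n, 1 / p2))
    L = math.ceil(n ** rho)
    return k, L, rho

def collide_one(p, k):
    """Pr[g(p) = g(q)] when each of the k functions collides with probability p."""
    return p ** k

def collide_any(p, k, L):
    """Pr[g_i(p) = g_i(q) for some i in 1..L]."""
    return 1 - (1 - p ** k) ** L

# Hypothetical values: n = 1000 points, P_1 = 0.9, P_2 = 0.5.
k, L, rho = choose_k_L(1000, 0.9, 0.5)
near = collide_any(0.9, k, L)   # chance that a near point shares a bucket with q
far = collide_any(0.5, k, L)    # chance that a given far point does
```

Raising k pushes the far-point collision probability down, and raising L brings the near-point probability back up. That is the sharpening of the curve between the two plots.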

76 Application
Perform NNS in R^d with l_1 distance.
Reduce the problem to NNS in H^d', the Hamming cube of dimension d'.
H^d' = binary strings of length d'.
d_Ham(s_1, s_2) = number of coordinates where s_1 and s_2 disagree.

77 Embedding l_1^d in H^d'
W.l.o.g. all coordinates of all points in P are positive integers ≤ C.
Map integer i ∈ {1,...,C} to (1,1,...,1,0,0,...,0): i ones followed by C-i zeros.
Map a vector by mapping each coordinate.
Example (C = 5): {(5,3,2), (2,4,1)} → {(11111,11100,11000), (11000,11110,10000)}

78 Embedding l_1^d in H^d'
Distances are preserved.
Actual computations are performed in the original space, with O(log C) overhead.
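The unary embedding is short enough to check directly. The sketch below uses strings of '0' and '1' for points of the Hamming cube; the function names are illustrative.

```python
def unary(i, C):
    """Map an integer 1 <= i <= C to i ones followed by C - i zeros."""
    return "1" * i + "0" * (C - i)

def embed(vec, C):
    """Embed an integer vector into the Hamming cube of dimension d' = C * len(vec)."""
    return "".join(unary(x, C) for x in vec)

def hamming(a, b):
    """Number of positions where the two bit strings disagree."""
    return sum(x != y for x, y in zip(a, b))

def l1(u, v):
    """l_1 distance between two integer vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

# The example from the slide, with C = 5.
u, v = (5, 3, 2), (2, 4, 1)
```

For the slide's example both distances equal 5: each coordinate contributes exactly |u_i - v_i| disagreeing bits.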

79 A sensitive family for the Hamming cube
H_d' = {h_i : h_i(b_1,…,b_d') = b_i for i = 1,…,d'}
–If d_Ham(s_1,s_2) < r, what is Pr[h(s_1)=h(s_2)]? At least 1 - r/d'.
–If d_Ham(s_1,s_2) > (1+ε)r, what is Pr[h(s_1)=h(s_2)]? At most 1 - (1+ε)r/d'.
H_d' is (1 - r/d', 1 - (1+ε)r/d', r, (1+ε)r)-sensitive.
Question: what are these projections in the original space?

80 Corollary
We can bound ρ ≤ 1/(1+ε)
–Space: O(dn + n^(1+1/(1+ε)))
–Query: O(d·n^(1/(1+ε)))
When ε = 1: space O(dn + n^(3/2)), query O(d·sqrt(n))
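The bound on ρ can be spot-checked numerically for the bit-sampling family, taking P_1 = 1 - r/d' and P_2 = 1 - (1+ε)r/d' and assuming (1+ε)r < d' so that both probabilities are positive. This is a numerical illustration with made-up parameter values, not a proof.

```python
import math

def rho_bit_sampling(r, eps, d_prime):
    """rho = ln(1/P_1) / ln(1/P_2) for the bit-sampling family on the Hamming cube."""
    p1 = 1 - r / d_prime
    p2 = 1 - (1 + eps) * r / d_prime
    return math.log(1 / p1) / math.log(1 / p2)

# Hypothetical parameters: radius 10 in a cube of dimension 1000.
for eps in (0.5, 1.0, 2.0):
    print(eps, rho_bit_sampling(10, eps, 1000), 1 / (1 + eps))
```

In each row the computed ρ sits slightly below 1/(1+ε), consistent with the corollary.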

81 Recent results
In Euclidean space:
–ρ ≤ 1/(1+ε)^2 + O(log log n / log^(1/3) n) [Andoni & Indyk 2008]
–ρ ≥ 0.462/(1+ε)^2 [Motwani, Naor & Panigrahy 2006]
LSH family for l_s, s ∈ (0,2] [Datar, Immorlica, Indyk & Mirrokni 2004]
And many more.

82 Conclusion NNS is an important problem with many applications. The problem can be efficiently solved in low dimensions. We saw some efficient approximate solutions in high dimensions, which are applicable to many metrics.

