
1 CS 361A 1 CS 361A (Advanced Data Structures and Algorithms), Lecture 19 (Dec 5, 2005)
Nearest Neighbors: Dimensionality Reduction and Locality-Sensitive Hashing
Rajeev Motwani

2 CS 361A 2 Metric Space
Metric Space (M,D)
–For points p,q in M, D(p,q) is the distance from p to q
–The only reasonable model for high-dimensional geometric space
Defining Properties
–Reflexive: D(p,q) = 0 if and only if p = q
–Symmetric: D(p,q) = D(q,p)
–Triangle Inequality: D(p,q) is at most D(p,r) + D(r,q)
Interesting Cases
–M: points in d-dimensional space
–D: Hamming or Euclidean L_p norms

3 CS 361A 3 High-Dimensional Near Neighbors
Nearest Neighbors Data Structure
–Given – N points P = {p_1, …, p_N} in metric space (M,D)
–Queries – “Which point p ∈ P is closest to point q?”
–Complexity – Tradeoff preprocessing space with query time
Applications
–vector quantization
–multimedia databases
–data mining
–machine learning
–…

4 CS 361A 4 Known Results
Query Time      | Storage          | Technique          | Paper
dN              | dN               | Brute-Force        |
2^d log N       | N^(2^(d+1))      | Voronoi Diagram    | Dobkin-Lipton 76
d^(d/2) log N   | N^(d/2)          | Random Sampling    | Clarkson 88
d^5 log N       | N^d              | Combination        | Meiser 93
log^(d-1) N     | N log^(d-1) N    | Parametric Search  | Agarwal-Matousek 92
Some expressions are approximate
Bottom-line – exponential dependence on d

5 CS 361A 5 Approximate Nearest Neighbor
Exact Algorithms
–Benchmark – brute-force needs space O(N), query time O(N)
–Known Results – exponential dependence on dimension
–Theory/Practice – no better than brute-force search
Approximate Near-Neighbors
–Given – N points P = {p_1, …, p_N} in metric space (M,D)
–Given – error parameter ε > 0
–Goal – for query q with nearest neighbor p, return any point p’ such that D(q,p’) ≤ (1+ε)·D(q,p)
Justification
–Mapping objects to metric space is heuristic anyway
–Get tremendous performance improvement

6 CS 361A 6 Results for Approximate NN
Query Time              | Storage            | Technique                           | Paper
d^d ε^(-d) log N        | dN                 | Balanced Trees                      | Arya et al 94
d^2 polylog(N,d)        | (Nd)^(2d)          | Random Projection                   | Kleinberg 97
dN polylog(N,d)         | dN polylog(N,d)    | Random Projection                   | Kleinberg 97
log^3 N                 | N^(1/ε^2)          | Search Trees + Dimension Reduction  | Indyk-Motwani 98
dN^(1/(1+ε)) log^2 N    | N^(1+1/(1+ε)) log N| Locality-Sensitive Hashing          | Indyk-Motwani 98
                        |                    | External Memory LSH                 | Gionis-Indyk-Motwani 99
Will show main ideas of the last 3 results
Some expressions are approximate

7 CS 361A 7 Approximate r-Near Neighbors
Given – N points P = {p_1, …, p_N} in metric space (M,D)
Given – error parameter ε > 0, distance threshold r > 0
Query
–If no point p with D(q,p) < r, return FAILURE
–Else, return any p’ with D(q,p’) < (1+ε)r
Application
–Solving Approximate Nearest Neighbor
–Assume maximum distance is R
–Run in parallel for r = 1, (1+ε), (1+ε)^2, …, R (sketched below)
–Time/space – O(log R) overhead
–[Indyk-Motwani] – reduce to O(polylog N) overhead
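A minimal sketch (not from the slides) of the reduction just described: run the r-near-neighbor structure over a geometric sequence of radii and return the answer for the smallest radius that succeeds. Here `near_neighbor(q, r)` is a hypothetical oracle with exactly the query semantics above, and the smallest interesting distance is assumed to be 1.

```python
# Reduce Approximate Nearest Neighbor to Approximate r-Near Neighbor by
# trying radii 1, (1+eps), (1+eps)^2, ..., R -- O(log_{1+eps} R) oracle calls.
def approx_nearest_neighbor(q, near_neighbor, R, eps):
    r = 1.0
    while r <= R * (1 + eps):
        p = near_neighbor(q, r)     # hypothetical r-near-neighbor oracle
        if p is not None:
            return p                # within (1+eps)*r of q; r is (almost) minimal
        r *= (1 + eps)
    return None                     # no point within the maximum distance R
```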

8 CS 361A 8 Hamming Metric
Hamming Space
–Points in M: bit-vectors {0,1}^d (can generalize to {0,1,2,…,q}^d)
–Hamming Distance: D(p,q) = # of positions where p,q differ
Remarks
–Simplest high-dimensional setting
–Still useful in practice
–In theory, as hard (or easy) as Euclidean space
–Trivial in low dimensions
Example
–Hypercube in d=3 dimensions
–{000, 001, 010, 011, 100, 101, 110, 111}
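A tiny Python illustration of the Hamming distance on the d = 3 hypercube above (representing points as bit-strings is my own choice):

```python
def hamming(p, q):
    """Number of positions where bit-vectors p and q differ."""
    assert len(p) == len(q)
    return sum(1 for a, b in zip(p, q) if a != b)

# The d=3 hypercube from the slide:
cube = ["000", "001", "010", "011", "100", "101", "110", "111"]
assert hamming("000", "111") == 3
assert hamming("011", "010") == 1
```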

9 CS 361A 9 Dimensionality Reduction
Overall Idea
–Map from high to low dimensions
–Preserve distances approximately
–Solve Nearest Neighbors in new space
–Performance improvement at cost of approximation error
Mapping?
–Hash function family H = {H_1, …, H_m}
–Each H_i: {0,1}^d → {0,1}^t with t << d
–Pick H_R from H uniformly at random
–Map each point in P using the same H_R
–Solve NN problem on H_R(P) = {H_R(p_1), …, H_R(p_N)}

10 CS 361A 10 Reduction for Hamming Spaces
Theorem: For any r and small ε > 0, there is a hash family H such that for any p,q and a random H_R ∈ H, with probability > 1−δ:
–D(p,q) < r  ⟹  D(H_R(p), H_R(q)) < (c + ε/12)·t
–D(p,q) > (1+ε)r  ⟹  D(H_R(p), H_R(q)) > (c + ε/12)·t
provided t ≥ C·ε^(−2)·log(1/δ) for some constant C, where c = ½(1 − e^(−1)) is the constant fixed in the analysis below.

11 CS 361A 11 Remarks
For fixed threshold r, can distinguish between
–Near: D(p,q) < r
–Far: D(p,q) > (1+ε)r
For N points, need error probability δ < 1/N^2 (union bound over all pairs), i.e. t = O(ε^(−2) log N)
Yet, can reduce to O(log N)-dimensional space, while approximately preserving distances
Works even if points not known in advance

12 CS 361A 12 Hash Family
Projection Function
–Let S be an ordered multiset of s indexes from {1,…,d}
–p|S: {0,1}^d → {0,1}^s projects p onto an s-dimensional subspace
–Example: d=5, p=01100; s=3, S={2,2,4} ⟹ p|S = 110
Choosing hash function H_R in H
–Repeat for i = 1,…,t
 –Pick S_i randomly (with replacement) from {1,…,d}
 –Pick a random hash function f_i: {0,1}^s → {0,1}
 –h_i(p) = f_i(p|S_i)
–H_R(p) = (h_1(p), h_2(p), …, h_t(p))
Remark – note similarity to Bloom Filters
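A Python sketch of this hash family; the class name and the lazy-dictionary realization of the random functions f_i are my own, and indices are 0-based in the code while the slide uses 1-based indices.

```python
import random

class HashFamilyHR:
    """H_R(p) = (h_1(p), ..., h_t(p)) with h_i(p) = f_i(p|S_i)."""
    def __init__(self, d, s, t, seed=0):
        rng = random.Random(seed)
        # Each S_i: s coordinates sampled with replacement from {0,...,d-1}.
        self.S = [[rng.randrange(d) for _ in range(s)] for _ in range(t)]
        self.f = [{} for _ in range(t)]        # lazy random functions f_i
        self.rng = rng

    def _h(self, i, p):
        proj = "".join(p[j] for j in self.S[i])        # p | S_i
        if proj not in self.f[i]:
            self.f[i][proj] = self.rng.randrange(2)    # fresh random bit
        return self.f[i][proj]

    def __call__(self, p):
        return tuple(self._h(i, p) for i in range(len(self.S)))

# Projection example matching the slide: d=5, p=01100, S={2,2,4} -> p|S = 110
p, S = "01100", [2, 2, 4]                     # S is 1-indexed as on the slide
assert "".join(p[j - 1] for j in S) == "110"
```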

13 CS 361A 13 Illustration of Hashing
(Figure: a point p ∈ {0,1}^d is projected onto index sets S_1,…,S_t to give p|S_1,…,p|S_t; each projection is fed through a random function f_1,…,f_t to produce h_1(p),…,h_t(p), which together form H_R(p).)

14 CS 361A 14 Analysis I
Choose random index-set S
Claim: For any p,q: Pr[p|S = q|S] = (1 − D(p,q)/d)^s
Why?
–p,q differ in D(p,q) bit positions
–Need all s indexes of S to avoid these positions
–Sampling with replacement from {1,…,d}

15 CS 361A 15 Analysis II
Choose s = d/r
Since 1−x < e^(−x) for |x| < 1, we obtain
–Pr[p|S = q|S] = (1 − D(p,q)/d)^(d/r) ≤ e^(−D(p,q)/r)
Thus
–D(p,q) < r  ⟹  Pr[p|S = q|S] ≥ (1 − r/d)^(d/r) ≈ e^(−1)
–D(p,q) > (1+ε)r  ⟹  Pr[p|S = q|S] ≤ e^(−(1+ε))

16 CS 361A 16 Analysis III
Recall h_i(p) = f_i(p|S_i)
Thus
–Pr[h_i(p) ≠ h_i(q)] = ½·(1 − Pr[p|S_i = q|S_i])   (f_i gives an independent random bit unless the projections agree)
Choosing c = ½(1 − e^(−1))
–D(p,q) < r  ⟹  Pr[h_i(p) ≠ h_i(q)] ≤ c
–D(p,q) > (1+ε)r  ⟹  Pr[h_i(p) ≠ h_i(q)] ≥ c + ε/6 (for small ε)

17 CS 361A 17 Analysis IV
Recall H_R(p) = (h_1(p), h_2(p), …, h_t(p))
D(H_R(p), H_R(q)) = number of i’s where h_i(p), h_i(q) differ
By linearity of expectations, E[D(H_R(p), H_R(q))] = t·Pr[h_i(p) ≠ h_i(q)]
–so the expectation is at most c·t for near pairs and at least (c + ε/6)·t for far pairs
Theorem almost proved
For a high-probability bound, need the Chernoff Bound

18 CS 361A 18 Chernoff Bound
Consider Bernoulli random variables X_1, X_2, …, X_n
–Values are 0-1
–Pr[X_i = 1] = x and Pr[X_i = 0] = 1−x
Define X = X_1 + X_2 + … + X_n with E[X] = nx
Theorem: For independent X_1,…,X_n, for any 0 < β < 1,
 Pr[ |X − nx| > β·nx ] < 2·e^(−β²·nx/3)
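A quick simulation (my own illustration, not part of the lecture) comparing the empirical deviation probability with a common form of the bound, 2·exp(−β²nx/3); the exact constant in the exponent is an assumption.

```python
import random, math

def deviation_rate(n, x, beta, trials=5000, seed=1):
    """Fraction of trials where X = X_1+...+X_n deviates from nx by > beta*nx."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        X = sum(1 for _ in range(n) if rng.random() < x)
        if abs(X - n * x) > beta * n * x:
            bad += 1
    return bad / trials

n, x, beta = 400, 0.3, 0.25
print("empirical:", deviation_rate(n, x, beta))
print("bound    :", 2 * math.exp(-beta * beta * n * x / 3))
```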

19 CS 361A 19 Analysis V
Define
–X_i = 0 if h_i(p) = h_i(q), and 1 otherwise
–n = t
–Then X = X_1 + X_2 + … + X_t = D(H_R(p), H_R(q))
Case 1 [D(p,q) < r ⟹ x ≤ c]
–Chernoff ⟹ Pr[ X > (c + ε/12)·t ] ≤ 2·e^(−Ω(ε²t))
Case 2 [D(p,q) > (1+ε)r ⟹ x ≥ c + ε/6]
–Chernoff ⟹ Pr[ X < (c + ε/12)·t ] ≤ 2·e^(−Ω(ε²t))
Observe – sloppy bounding of constants in Case 2

20 CS 361A 20 Putting it all together
Recall t = C·ε^(−2)·log(1/δ) and the Chernoff bounds from Analysis V
Thus, the error probability is at most 2·e^(−Ω(ε²t)) ≤ δ
Choosing C = 1200/c
Theorem is proved!!

21 CS 361A 21 Algorithm I
Set error probability δ = 1/poly(N), so t = O(ε^(−2) log N)
Select hash H_R and map points p → H_R(p)
Processing query q
–Compute H_R(q)
–Find nearest neighbor H_R(p) for H_R(q)
–If D(H_R(q), H_R(p)) < (c + ε/12)·t then return p, else FAILURE
Remarks
–Brute-force for finding H_R(p) implies query time O(N·t) = O(ε^(−2)·N·log N)
–Need another approach for the lower-dimensional search
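A sketch of Algorithm I in Python (my own rendering): the acceptance threshold (c + ε/12)·t follows the reconstructed analysis above, and `H_R` can be any mapping such as the HashFamilyHR sketch after slide 12.

```python
def hamming_tuples(u, v):
    return sum(1 for a, b in zip(u, v) if a != b)

def build_index(points, H_R):
    """Map every point once with the same random H_R."""
    return [(p, H_R(p)) for p in points]

def query(q, index, H_R, t, c, eps):
    hq = H_R(q)
    best_p, best_d = None, None
    for p, hp in index:                           # brute force: O(N * t)
        d = hamming_tuples(hq, hp)
        if best_d is None or d < best_d:
            best_p, best_d = p, d
    if best_p is not None and best_d < (c + eps / 12.0) * t:
        return best_p                             # approximate r-near neighbor
    return None                                   # FAILURE
```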

22 CS 361A 22 Algorithm II
Fact – Exact nearest neighbors in {0,1}^t requires
–Space O(2^t)
–Query time O(t)
How?
–Precompute/store answers to all queries
–Number of possible queries is 2^t
Since t = O(ε^(−2) log N), the table size 2^t is N^O(1/ε²)
Theorem – In Hamming space {0,1}^d, can solve approximate nearest neighbor with:
–Space N^O(1/ε²)
–Query time O(t) = O(ε^(−2) log N), plus the time to compute H_R(q)

23 CS 361A 23 Different Metric
Many applications have “sparse” points
–Many dimensions but few 1’s
–Example – points = documents, dimensions = words
–Better to view as “sets”
Previous approach would require large s
For sets A,B, define sim(A,B) = |A ∩ B| / |A ∪ B|
Observe
–A = B ⟹ sim(A,B) = 1
–A,B disjoint ⟹ sim(A,B) = 0
Question – Handling D(A,B) = 1 − sim(A,B)?

24 CS 361A 24 Min-Hash
Random permutations π_1,…,π_t of the universe (dimensions)
Define mapping h_j(A) = min_{a ∈ A} π_j(a)
Fact: Pr[h_j(A) = h_j(B)] = sim(A,B)
Proof? – already seen!!
Overall hash-function H_R(A) = (h_1(A), h_2(A), …, h_t(A))
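A small MinHash illustration in Python (my own). Instead of storing explicit permutations, each π_j assigns independent random priorities to the universe, which induces a uniformly random ordering; h_j(A) is the element of A with the smallest priority.

```python
import random

def minhash_signature(A, priorities):
    """H_R(A) = (h_1(A), ..., h_t(A)) for a set A."""
    return tuple(min(A, key=lambda a: pri[a]) for pri in priorities)

universe = list(range(100))
t = 200
rng = random.Random(0)
priorities = [{a: rng.random() for a in universe} for _ in range(t)]

A = set(range(0, 60))
B = set(range(30, 90))
sigA = minhash_signature(A, priorities)
sigB = minhash_signature(B, priorities)
agree = sum(1 for x, y in zip(sigA, sigB) if x == y) / t
print("estimated sim:", agree, " true sim:", len(A & B) / len(A | B))
```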

25 CS 361A 25 Min-Hash Analysis
Select t = O(ε^(−2) log N)
Hamming Distance
–D(H_R(A), H_R(B)) = number of j’s such that h_j(A) ≠ h_j(B)
Theorem: For any A,B, D(H_R(A), H_R(B)) is concentrated around t·(1 − sim(A,B)) with high probability
Proof? – Exercise (apply Chernoff Bound)
Obtain – ANN algorithm similar to earlier result

26 CS 361A 26 Generalization
Goal
–abstract the technique used for Hamming space
–enable application to other metric spaces
–handle Dynamic ANN
Dynamic Approximate r-Near Neighbors
–Fix – threshold r
–Query – if any point is within distance r of q, return any point within distance (1+ε)r
–Allow insertions/deletions of points in P
Recall – earlier method required preprocessing all possible queries in hash-range-space…

27 CS 361A 27 Locality-Sensitive Hashing
Fix – metric space (M,D), threshold r, error ε
Choose – probability parameters Q_1 > Q_2 > 0
Definition – Hash family H = {h: M → S} for (M,D) is called (r, (1+ε)r, Q_1, Q_2)-sensitive if, for random h and for any p,q in M:
–D(p,q) < r  ⟹  Pr[h(p) = h(q)] ≥ Q_1
–D(p,q) > (1+ε)r  ⟹  Pr[h(p) = h(q)] ≤ Q_2
Intuition
–p,q are near ⟹ likely to collide
–p,q are far ⟹ unlikely to collide

28 CS 361A 28 Examples
Hamming Space M = {0,1}^d
–point p = b_1…b_d
–H = {h_i(b_1…b_d) = b_i, for i = 1…d}
–sampling one bit at random
–Pr[h_i(q) = h_i(p)] = 1 − D(p,q)/d
Set Similarity D(A,B) = 1 − sim(A,B)
–Recall sim(A,B) = |A ∩ B| / |A ∪ B|
–H = {min-hash functions h_π(A) = min_{a ∈ A} π(a), over permutations π}
–Pr[h(A) = h(B)] = 1 − D(A,B)

29 CS 361A 29 Multi-Index Hashing
Overall Idea
–Fix LSH family H
–Boost the Q_1, Q_2 gap by defining G = H^k
–Using l independent functions from G, each point hashes into l buckets
Intuition
–r-near neighbors likely to collide in some bucket
–few non-near pairs in any bucket
Define
–G = { g | g(p) = h_1(p) h_2(p) … h_k(p) }
–Hamming metric ⟹ sample k random bits

30 CS 361A 30 Example (l = 4)
(Figure: four hash tables g_1, g_2, g_3, g_4, each concatenating basic hashes h_1,…,h_k; points p, q, r fall into one bucket per table.)

31 CS 361A 31 Overall Scheme
Preprocessing
–Prepare hash table for the range of G
–Select l hash functions g_1, g_2, …, g_l
Insert(p) – add p to buckets g_1(p), g_2(p), …, g_l(p)
Delete(p) – remove p from buckets g_1(p), g_2(p), …, g_l(p)
Query(q)
–Check buckets g_1(q), g_2(q), …, g_l(q)
–Report nearest of (say) first 3·l points
Complexity
–Assume – computing D(p,q) needs O(d) time
–Assume – storing p needs O(d) space
–Insert/Delete/Query Time – O(d·l·k)
–Preprocessing/Storage – O(dN + N·l·k)
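A compact Python sketch of this scheme for the Hamming bit-sampling family; the class name and parameter handling are my own, and a real implementation would also de-duplicate candidates across tables.

```python
import random
from collections import defaultdict

class MultiIndexLSH:
    """l hash tables; g_j concatenates k randomly sampled bits of a bit-string."""
    def __init__(self, d, k, l, seed=0):
        rng = random.Random(seed)
        self.G = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        self.tables = [defaultdict(set) for _ in range(l)]

    def _key(self, j, p):
        return "".join(p[i] for i in self.G[j])        # g_j(p)

    def insert(self, p):
        for j, table in enumerate(self.tables):
            table[self._key(j, p)].add(p)

    def delete(self, p):
        for j, table in enumerate(self.tables):
            table[self._key(j, p)].discard(p)

    def query(self, q):
        hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
        candidates = []
        for j, table in enumerate(self.tables):
            candidates.extend(table[self._key(j, q)])
            if len(candidates) >= 3 * len(self.tables):
                break                                   # inspect first ~3*l points
        return min(candidates, key=lambda p: hamming(p, q)) if candidates else None
```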

32 CS 361A 32 Collision Probability vs. Distance
(Figure: collision probability as a function of D(p,q), ranging from 1 down to 0; it stays above Q_1 for distances below r and falls below Q_2 beyond (1+ε)r.)

33 CS 361A 33 Multi-Index versus Error
Set l = N^z where z = log(1/Q_1) / log(1/Q_2)
Theorem: For l = N^z, any query returns an r-near neighbor correctly with probability at least 1/6
Consequently (ignoring k = O(log N) factors)
–Time O(dN^z)
–Space O(N^(1+z))
–Hamming Metric ⟹ z ≤ 1/(1+ε)
–Boost Probability – use several parallel hash-tables
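A small helper (my own illustration) computing the parameters k, z, and l from Q_1, Q_2, and N as in the analysis on the next slides: k makes a far point collide with q in one table with probability about 1/N, and l = N^z tables catch a near neighbor with constant probability.

```python
import math

def lsh_parameters(N, Q1, Q2):
    k = math.ceil(math.log(N) / math.log(1.0 / Q2))    # k = log_{1/Q2} N
    z = math.log(1.0 / Q1) / math.log(1.0 / Q2)
    l = math.ceil(N ** z)                              # number of hash tables
    return k, l, z

# Hamming bit-sampling family with d=100, r=10, eps=1 (illustrative values):
d, r, eps = 100, 10, 1.0
Q1 = 1 - r / d                   # collision probability at distance r
Q2 = 1 - (1 + eps) * r / d       # collision probability at distance (1+eps)*r
print(lsh_parameters(N=10**6, Q1=Q1, Q2=Q2))
```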

34 CS 361A 34 Analysis
Define (for fixed query q)
–p* – any point with D(q,p*) < r
–FAR(q) – all p with D(q,p) > (1+ε)r
–BUCKET(q,j) – all p with g_j(p) = g_j(q)
–Event E_size: the l buckets BUCKET(q,1),…,BUCKET(q,l) contain at most 3·l points of FAR(q) (⟹ query cost bounded by O(d·l))
–Event E_NN: g_j(p*) = g_j(q) for some j (⟹ nearest point in the l buckets is an r-near neighbor)
Analysis
–Show: Pr[E_size] = x > 2/3 and Pr[E_NN] = y > 1/2
–Thus: Pr[not(E_size & E_NN)] < (1−x) + (1−y) < 5/6

35 CS 361A 35 Analysis – Bad Collisions
Choose k = log_{1/Q_2} N
Fact: for any p ∈ FAR(q), Pr[g_j(p) = g_j(q)] ≤ Q_2^k = 1/N
Clearly, E[ |FAR(q) ∩ BUCKET(q,j)| ] ≤ 1, so the expected number of far points over all l buckets is at most l
Markov Inequality – Pr[X > a·E[X]] < 1/a for any a > 0
Lemma 1: Pr[E_size] > 2/3 (take a = 3)

36 CS 361A 36 Analysis – Good Collisions
Observe: Pr[g_j(p*) = g_j(q)] ≥ Q_1^k = Q_1^(log_{1/Q_2} N) = N^(−z)
Since l = N^z, Pr[ g_j(p*) ≠ g_j(q) for all j ] ≤ (1 − N^(−z))^(N^z) ≤ 1/e < 1/2
Lemma 2: Pr[E_NN] > 1/2

37 CS 361A 37 Euclidean Norms
Recall
–x = (x_1, x_2, …, x_d) and y = (y_1, y_2, …, y_d) in R^d
–L_1-norm: D(x,y) = Σ_i |x_i − y_i|
–L_p-norm (for p > 1): D(x,y) = ( Σ_i |x_i − y_i|^p )^(1/p)

38 CS 361A 38 Extension to L_1-Norm
Round coordinates to {1,…,M}
Embed L_1-{1,…,M}^d into Hamming-{0,1}^(dM)
Unary Mapping: x_i ↦ 1^(x_i) 0^(M−x_i), concatenated over the d coordinates (sketched below)
Apply algorithm for Hamming Spaces
–Error due to rounding ≈ 1/M
–Space/Time overhead due to mapping of d → dM
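A sketch of the unary mapping in Python (helper names are mine); it checks that the Hamming distance between the images equals the L_1 distance between the original points.

```python
def unary_embed(x, M):
    """Map a point with integer coordinates in {1,...,M} to {0,1}^(d*M)."""
    return "".join("1" * xi + "0" * (M - xi) for xi in x)

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

x, y, M = [3, 1, 5], [1, 4, 5], 5
l1 = sum(abs(a - b) for a, b in zip(x, y))
assert hamming(unary_embed(x, M), unary_embed(y, M)) == l1
```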

39 CS 361A 39 Extension to L_2-Norm
Observe
–Little difference between L_1-norm and L_2-norm for high d
–Additional error is small
More generally – L_p, for 1 ≤ p ≤ 2
–[Figiel et al 1977, Johnson-Schechtman 1982]
–Can embed L_p into L_1
–Dimensions d → O(d)
–Distances preserved within factor (1+ε)
–Key Idea – random rotation of space

40 CS 361A 40 Improved Bounds
[Indyk-Motwani 1998]
–For any L_p-norm
–Query Time – O(log^3 N)
–Space – N^O(1/ε²)
Problem – impractical
Today – only a high-level sketch

41 CS 361A 41 Better Reduction
Recall
–Reduced Approximate Nearest Neighbors to Approximate r-Near Neighbors
–Space/Time Overhead – O(log R)
–R = max distance in metric space
Ring-Cover Trees
–Removed dependence on R
–Reduced overhead to O(polylog N)

42 CS 361A 42 Approximate r-Near Neighbors
Idea
–Impose a regular grid on R^d
–Decompose into cubes of side length s
–Label cubes with points at distance < r
Data Structure
–Query q – determine the cube containing q
–Cube labels – candidate r-near neighbors
Goals
–Small s ⟹ lower error
–Fewer cubes ⟹ smaller storage
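A rough Python sketch of the grid idea (the cell size s and the box-shaped labeling region are my simplifications; the box over-labels slightly but contains every cube within distance r of a point).

```python
from collections import defaultdict
from itertools import product
import math

def build_grid(points, r, s):
    """Label every cube near a data point with that point as a candidate."""
    grid = defaultdict(list)
    reach = math.ceil(r / s)                     # how many cells away to label
    for p in points:
        base = tuple(int(math.floor(c / s)) for c in p)
        for offset in product(range(-reach, reach + 1), repeat=len(p)):
            cell = tuple(b + o for b, o in zip(base, offset))
            grid[cell].append(p)                 # p is a candidate for this cube
    return grid

def query(q, grid, s):
    cell = tuple(int(math.floor(c / s)) for c in q)
    return grid.get(cell, [])                    # candidate r-near neighbors
```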

43 CS 361A 43
(Figure: grid cells around points p_1, p_2, p_3, with each cell labeled by the nearby points.)

44 CS 361A 44 Grid Analysis
Assume r = 1
Choose cube side length s = ε/d^(1/p), so Cube Diameter = ε
Number of labeled cubes ≈ N·O(d^(1/p)/ε)^d
Theorem – For any L_p-norm, can solve Approx r-Near Neighbor using
–Space – N·(O(1/ε))^d
–Time – O(d)

45 CS 361A 45 Dimensionality Reduction
[Johnson-Lindenstrauss 84, Frankl-Maehara 88]
For any 0 < ε < 1, can map points in P into a subspace of dimension O(ε^(−2) log N) while preserving all inter-point distances to within a factor (1 ± ε)
Proof idea – project onto random lines
Result for NN
–Space – N^O(1/ε²)
–Time – O(polylog N)
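A minimal Johnson-Lindenstrauss-style sketch using a random Gaussian projection (the target-dimension constant 8 is an illustrative assumption, not from the slides).

```python
import numpy as np

def jl_project(X, eps, seed=0):
    """Project N points (rows of X) into O(eps^-2 log N) dimensions."""
    N, d = X.shape
    t = int(np.ceil(8 * np.log(N) / eps**2))     # target dimension
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(d, t)) / np.sqrt(t)     # random projection matrix
    return X @ R                                 # pairwise distances preserved up to (1 +- eps)

X = np.random.default_rng(1).normal(size=(100, 10000))
Y = jl_project(X, eps=0.5)
print(X.shape, "->", Y.shape)
```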

46 CS 361A 46 References
–Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. P. Indyk and R. Motwani. STOC 1998.
–Similarity Search in High Dimensions via Hashing. A. Gionis, P. Indyk, and R. Motwani. VLDB 1999.

