Nearest Neighbor Search in High-Dimensional Spaces
Alexandr Andoni (Microsoft Research)


Nearest Neighbor Search (NNS)
Preprocess: a set D of points in R^d.
Query: given a new point q, report a point p ∈ D with the smallest distance to q.
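As a baseline, the problem statement above can be sketched as a linear scan (a minimal Python sketch; the function name is mine, not from the slides):

```python
import math

def nearest_neighbor(D, q):
    """Brute-force NNS: return the point p in D with the smallest Euclidean distance to q."""
    return min(D, key=lambda p: math.dist(p, q))

D = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(nearest_neighbor(D, (0.9, 1.2)))  # → (1.0, 1.0)
```

This is exactly the O(dn) "no indexing" approach that the high-dimensional discussion below contrasts against.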

Motivation
Generic setup:
  Points model objects (e.g. images)
  Distance models a (dis)similarity measure
Application areas: machine learning, data mining, speech recognition, image/video/music clustering, bioinformatics, etc.
Distance can be: Euclidean, Hamming, ℓ∞, edit distance, Ulam, earth-mover distance, etc.
A primitive for other problems: finding the closest pair in a set D, MST, clustering, …

Plan for today
1. NNS for "basic" distances: LSH
2. NNS for "advanced" distances: embeddings

2D case
Compute the Voronoi diagram; given a query q, perform point location.
Performance:
  Space: O(n)
  Query time: O(log n)

High-dimensional case
All exact algorithms degrade rapidly with the dimension d. When d is high, the state of the art is unsatisfactory: even in practice, query time tends to be linear in n.

  Algorithm                   Query time    Space
  Full indexing               O(d·log n)    n^O(d) (Voronoi diagram size)
  No indexing (linear scan)   O(dn)         O(dn)

Approximate NNS
r-near neighbor: given a new point q, report a point p ∈ D s.t. ||p−q|| ≤ r.
c-approximate: report a point at distance ≤ cr, as long as there exists a point at distance ≤ r.

Approximation Algorithms for NNS
A vast literature:
  With exp(d) space or Ω(n) time: [Arya-Mount et al.], [Kleinberg'97], [Har-Peled'02], …
  With poly(n) space and o(n) time: [Kushilevitz-Ostrovsky-Rabani'98], [Indyk-Motwani'98], [Indyk'98,'01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A-Indyk'06], …

The landscape: algorithms

  Space            Time         Comment          Reference
  n^(4/ε²) + nd    O(d·log n)   c = 1+ε          [KOR'98, IM'98]
    (space: poly(n); query: logarithmic)
  n^(1+ρ) + nd     dn^ρ         ρ ≈ 1/c          [IM'98, Cha'02, DIIM'04]
                                ρ = 1/c² + o(1)  [AI'06]
    (space: small poly, close to linear; query: poly, sublinear)
  nd·log n         dn^ρ         ρ = 2.09/c       [Ind'01, Pan'06]
                                ρ = O(1/c²)      [AI'06]
    (space: near-linear; query: poly, sublinear)

Locality-Sensitive Hashing [Indyk-Motwani'98]
A random hash function g: R^d → Z s.t. for any points p, q:
  If ||p−q|| ≤ r, then Pr[g(p)=g(q)] is "high" (≥ P1)
  If ||p−q|| > cr, then Pr[g(p)=g(q)] is "small" (≤ P2)
Use several hash tables: n^ρ of them, where ρ = log(1/P1) / log(1/P2).
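A minimal sketch of the multi-table LSH scheme for Hamming distance (bit sampling, the classic [Indyk-Motwani'98] family); the parameter names k (sampled bits per hash) and L (number of tables) are mine:

```python
import random

def make_hash(d, k, rng):
    """Bit-sampling hash for Hamming space: g(p) = p restricted to k random coordinates."""
    coords = rng.sample(range(d), k)
    return lambda p: tuple(p[i] for i in coords)

def build_tables(points, d, k, L, seed=0):
    """Build L hash tables; near points collide in some table with good probability."""
    rng = random.Random(seed)
    tables = []
    for _ in range(L):
        g = make_hash(d, k, rng)
        table = {}
        for p in points:
            table.setdefault(g(p), []).append(p)
        tables.append((g, table))
    return tables

def query(tables, q):
    """Collect all points colliding with q in any table; return the closest candidate."""
    cands = {p for g, t in tables for p in t.get(g(q), [])}
    return min(cands, key=lambda p: sum(a != b for a, b in zip(p, q))) if cands else None
```

Points within Hamming distance r share a projection with probability at least (1 − r/d)^k per table, so repeating over L independent tables boosts the chance that some table reports a near neighbor.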

Example of hash functions: grids [Datar-Immorlica-Indyk-Mirrokni'04]
Pick a regular grid, shifted and rotated at random.
Hash function: g(p) = index of the grid cell containing p.
Gives ρ ≈ 1/c.
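The randomly shifted grid can be sketched as follows (a simplification: the random rotation from the slide is omitted, and the cell width w is a free parameter of my choosing):

```python
import random

def shifted_grid_hash(d, w, seed=0):
    """Randomly shifted regular grid with cell width w: g(p) = index of p's cell."""
    rng = random.Random(seed)
    shifts = [rng.uniform(0, w) for _ in range(d)]
    return lambda p: tuple(int((x + s) // w) for x, s in zip(p, shifts))

g = shifted_grid_hash(2, 10.0, seed=1)
# Points much closer together than w usually land in the same cell...
print(g((0.1, 0.1)) == g((0.2, 0.2)))   # → True
# ...while points separated by more than 2w in some coordinate never do.
print(g((0.0, 0.0)) == g((25.0, 0.0)))  # → False
```

The random shift is what makes the collision probability a smooth, monotone function of the distance, which is exactly the LSH property defined above.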

Near-Optimal LSH [A-Indyk'06]
Regular grid → grid of balls:
  p can hit empty space, so take more such grids until p is in a ball.
  Need (too) many grids of balls → start by reducing the dimension to t.
Choice of the reduced dimension t? A tradeoff between the number of hash tables, n^ρ, and the time to hash, t^O(t).
Total query time: dn^(1/c² + o(1)).

Proof idea
Claim: ρ = log(1/P(r)) / log(1/P(cr)) → 1/c², where P(r) = probability of collision when ||p−q|| = r.
Intuitive proof:
  Ignore the effects of reducing dimension.
  P(r) = (volume of the intersection) / (volume of the union) of the balls around p and q.
  P(r) ≈ probability that a random point u lies beyond the separating hyperplane.
  The x-coordinate of u has a nearly Gaussian distribution → P(r) ≈ exp(−A·r²).
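Plugging P(r) ≈ exp(−A·r²) into the standard LSH exponent ρ = log(1/P1)/log(1/P2) gives the claim in one line:

```latex
\rho \;=\; \frac{\log\bigl(1/P(r)\bigr)}{\log\bigl(1/P(cr)\bigr)}
     \;\approx\; \frac{A\,r^2}{A\,(cr)^2} \;=\; \frac{1}{c^2}.
```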

The landscape: lower bounds

  Space            Time         Comment          Reference
  n^(4/ε²) + nd    O(d·log n)   c = 1+ε          [KOR'98, IM'98]
                                n^(o(1/ε²)) space requires ω(1) memory lookups   [AIP'06]
  n^(1+ρ) + nd     dn^ρ         ρ ≈ 1/c          [IM'98, Cha'02, DIIM'04]
                                ρ = 1/c² + o(1)  [AI'06]
                                ρ ≥ 1/c²         [MNP'06, OWZ'10]
  nd·log n         dn^ρ         ρ = 2.09/c       [Ind'01, Pan'06]
                                ρ = O(1/c²)      [AI'06]
                                n^(1+o(1/c²)) space requires ω(1) memory lookups [PTW'08, PTW'10]

Open Question #1: design a space partitioning of R^t that is
  efficient: point location in poly(t) time
  qualitative: regions are "sphere-like", i.e.
    Pr[needle of length c is cut] ≥ c² · Pr[needle of length 1 is cut]

LSH beyond NNS
Approximating kernel spaces (obliviously):
  Problem: for x, y ∈ R^d, define the inner product K(x,y) = e^(−||x−y||). Implicitly, this means K(x,y) = ϕ(x)·ϕ(y). Can we obtain an explicit and efficient ϕ? Approximately? (Can't do it exactly.)
  Yes, for some kernels, via LSH [A-Indyk'07, Rahimi-Recht'07, A'09]: map ϕ'(x) = (r1(g1(x)), r2(g2(x)), …), where the gi's are LSH functions on R^d and the ri's map hash values to random ±1.
  Get: a ±ε approximation in O(ε⁻² · log n) dimensions.
Sketching (≈ dimensionality reduction in a computational space) [KOR'98, …]
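The companion construction of [Rahimi-Recht'07] can be sketched for the Gaussian kernel K(x,y) = exp(−γ||x−y||²) (note: a different kernel than the e^(−||x−y||) above, and all names/parameters here are my own, not from the slides):

```python
import math, random

def random_fourier_features(d, D, gamma, seed=0):
    """Return phi with phi(x).phi(y) ≈ exp(-gamma * ||x - y||^2) in expectation."""
    rng = random.Random(seed)
    # Frequencies w ~ N(0, 2*gamma*I) (the Fourier transform of the Gaussian kernel),
    # phases b ~ Uniform[0, 2*pi).
    W = [[rng.gauss(0.0, math.sqrt(2 * gamma)) for _ in range(d)] for _ in range(D)]
    B = [rng.uniform(0.0, 2 * math.pi) for _ in range(D)]
    def phi(x):
        return [math.sqrt(2.0 / D) * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(W, B)]
    return phi

phi = random_fourier_features(3, 2000, 0.5, seed=1)
x, y = (0.1, 0.2, 0.3), (0.2, 0.0, 0.4)
approx = sum(a * b for a, b in zip(phi(x), phi(y)))      # inner product of features
exact = math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))  # true kernel value
```

With D features the estimation error concentrates at roughly O(1/√D), matching the ±ε in O(ε⁻²·log n) dimensions flavor of the slide.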

Plan for today
1. NNS for basic distances
2. NNS for advanced distances: embeddings
3. NNS beyond LSH

Distances, so far
LSH is good for Hamming and Euclidean:

  Metric          Space          Time    Upper bound                       Lower bound
  Hamming (ℓ1)    n^(1+ρ) + nd   dn^ρ    ρ = 1/c [IM'98, Cha'02, DIIM'04]  ρ ≥ 1/c [MNP'06, OWZ'10, PTW'08,'10]
  Euclidean (ℓ2)  n^(1+ρ) + nd   dn^ρ    ρ ≈ 1/c² [AI'06]                  ρ ≥ 1/c² [MNP'06, OWZ'10, PTW'08,'10]

How about other distances (not ℓp's)?

Earth-Mover Distance (EMD)
Given two sets A, B of points, EMD(A,B) = the cost of a min-cost bipartite matching between A and B.
Points can lie in the plane, in ℓ2^d, …
Applications: image search.
(Images courtesy of Kristen Grauman, UT Austin.)

Reductions via embeddings
ℓ1 = real space with distance ||x−y||_1 = ∑_i |x_i − y_i|.
For each X ∈ M, associate a vector f(X) such that, for all X, Y ∈ M, ||f(X) − f(Y)||_2 approximates the original distance between X and Y, up to some distortion (approximation factor).
Then we can use NNS for Euclidean space! One can also consider other "easy" distances between f(X) and f(Y). The most popular host: ℓ1 (≡ Hamming).

Earth-Mover Distance over 2D into ℓ1 [Cha02, IT03]
Sets of size s in a [1…s]×[1…s] box. Embedding of a set A:
  impose a randomly shifted grid;
  each grid cell c gives one coordinate: f(A)_c = #points of A in cell c;
  subpartition the grid recursively, assigning new coordinates for each new cell (on all levels).
Distortion: O(log s).
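A simplified sketch of this grid embedding (one random shift reused across all levels, point sets in [0, s)², sets represented as sparse coordinate dictionaries; the O(log s) distortion analysis is in [Cha02, IT03] and is not reproduced here):

```python
import random

def emd_embed(A, s, seed=0):
    """Embed a 2D point set A in [0, s)^2 into sparse l1: at each grid level,
    one coordinate per cell, weighted by the cell's side length."""
    rng = random.Random(seed)
    sx, sy = rng.uniform(0, s), rng.uniform(0, s)  # random shift of the grid
    f, width, level = {}, float(s), 0
    while width >= 1:
        for (x, y) in A:
            cell = (level, int((x + sx) // width), int((y + sy) // width))
            f[cell] = f.get(cell, 0.0) + width  # side length * (#points in cell)
        width /= 2
        level += 1
    return f

def l1(u, v):
    """l1 distance between two sparse vectors stored as dicts."""
    return sum(abs(u.get(k, 0.0) - v.get(k, 0.0)) for k in set(u) | set(v))

A = [(1.0, 1.0), (6.0, 6.0)]
B = [(1.0, 3.0), (6.0, 6.0)]
print(l1(emd_embed(A, 8), emd_embed(A, 8)))      # identical sets → 0.0
print(l1(emd_embed(A, 8), emd_embed(B, 8)) > 0)  # sets one move apart → True
```

Intuitively, a point that moves a small distance only changes counts in small (low-weight) cells, so the ℓ1 difference tracks the matching cost, level by level.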

Embeddings of various metrics into ℓ1

  Metric                                                   Upper bound              Lower bound
  Earth-mover distance (s-sized sets in the 2D plane)      O(log s) [Cha02, IT03]   Ω(√log s) [NS07]
  Earth-mover distance (s-sized sets in {0,1}^d)           O(log s·log d) [AIK08]   Ω(log s) [KN05]
  Edit distance over {0,1}^d (= #indels to transform x→y)  2^Õ(√log d) [OR05]       Ω(log d) [KN05, KR06]
  Ulam (edit distance between non-repetitive strings)      O(log d) [CK06]          Ω̃(log d) [AK07]
  Block edit distance                                      Õ(log d) [MS00, CM07]    4/3 [Cor03]

Open Question #3: improve the distortion of embedding EMD, W2, edit distance into ℓ1.

Really beyond LSH
NNS for ℓ∞ [I'98]: space n^(1+ρ), query time O(d·log n), approximation c ≈ log_ρ log d
  via decision trees
  cannot do better via (deterministic) decision trees
NNS for mixed norms, e.g. ℓ2(ℓ1) [I'04, AIK'09, A'09]
Embedding into mixed norms [AIK'09]:
  Ulam O(1)-embeds into ℓ2²(ℓ∞(ℓ1)) of small dimension
  yields NNS with O(log log d) approximation
  (vs. Ω̃(log d) if we embedded into each separate norm!)
Open Question #4: embed EMD, edit distance into mixed norms?

Summary: high-dimensional NNS
Locality-sensitive hashing:
  for Hamming and Euclidean spaces
  provably (near-)optimal NNS in some regimes
  applications beyond NNS: kernels, sketches
Beyond LSH:
  non-normed distances: via embeddings into ℓ1
  algorithms for ℓp and mixed norms (of ℓp's)
Some open questions:
  design a qualitative, efficient LSH / space partitioning (in Euclidean space)
  embed "harder" distances (like EMD, edit distance) into ℓ1 or mixed norms (of ℓp's)?
  is there an LSH for ℓ∞?
  NNS for any norm, e.g. the trace norm?