Algorithms for Nearest Neighbor Search. Piotr Indyk, MIT.


Nearest Neighbor Search Given: a set P of n points in R^d. Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P.

Outline of this talk
– Variants
– Motivation
– Main memory algorithms: quadtrees, kd-trees, Locality-Sensitive Hashing
– Secondary storage algorithms: R-tree (and its variants), VA-file

Variants of nearest neighbor
– Near neighbor (range search): find one/all points in P within distance r from q
– Spatial join: given two sets P, Q, find all pairs p in P, q in Q such that p is within distance r from q
– Approximate near neighbor: find one/all points p' in P whose distance to q is at most (1+ε) times the distance from q to its nearest neighbor

Motivation Depends on the value of d:
– low d: graphics, vision, GIS, etc.
– high d: similarity search in databases (text, images, etc.); finding pairs of similar objects (e.g., copyright violation detection); a useful subroutine for clustering

Algorithms
– Main memory (Computational Geometry): linear scan; tree-based (quadtree, kd-tree); hashing-based (Locality-Sensitive Hashing)
– Secondary storage (Databases): R-tree (and numerous variants); Vector Approximation File (VA-file)

Quadtree The simplest spatial structure on Earth!

Quadtree ctd. Split the space into 2^d equal subsquares. Repeat until done:
– only one pixel left
– only one point left
– only a few points left
Variants:
– split only one dimension at a time
– k-d-trees (in a moment)

Range search Near neighbor (range search):
– put the root on the stack
– repeat: pop the next node T from the stack; for each child C of T:
  – if C is a leaf, examine the point(s) in C
  – if C intersects the ball of radius r around q, add C to the stack
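The stack-based range search above, together with the splitting rule from the previous slide, fits in a few dozen lines. This is an illustrative 2-D sketch in Python; all class and function names are mine, not from the talk:

```python
# A minimal 2-D quadtree with the slide's stack-based range search.
import math

class Quadtree:
    def __init__(self, cx, cy, half, capacity=1):
        self.cx, self.cy, self.half = cx, cy, half   # square center and half-width
        self.capacity = capacity                     # "only a few points left" rule
        self.points = []
        self.children = None                         # None while this node is a leaf

    def insert(self, p):
        if self.children is None:
            self.points.append(p)
            if len(self.points) > self.capacity and self.half > 1e-9:
                self._split()
            return
        self._child_for(p).insert(p)

    def _split(self):
        h = self.half / 2
        # 2^d = 4 equal subsquares, centers at (-1,-1), (-1,1), (1,-1), (1,1)
        self.children = [Quadtree(self.cx + dx * h, self.cy + dy * h, h, self.capacity)
                         for dx in (-1, 1) for dy in (-1, 1)]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        return self.children[(2 if p[0] >= self.cx else 0) +
                             (1 if p[1] >= self.cy else 0)]

    def intersects_ball(self, q, r):
        # squared distance from q to this square, compared against r^2
        dx = max(abs(q[0] - self.cx) - self.half, 0)
        dy = max(abs(q[1] - self.cy) - self.half, 0)
        return dx * dx + dy * dy <= r * r

def range_search(root, q, r):
    """Return all points within distance r of q, via the slide's stack loop."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.children is None:                    # leaf: examine its point(s)
            found.extend(p for p in node.points if math.dist(p, q) <= r)
        else:                                        # push only intersecting children
            stack.extend(c for c in node.children if c.intersects_ball(q, r))
    return found
```

The same loop, started with an infinite radius that shrinks as points are found, gives the nearest-neighbor search described two slides later.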

Near neighbor ctd

Nearest neighbor Start a range search with r = ∞. Whenever a point is found, update r. Only investigate nodes with respect to the current r.

Quadtree ctd. A simple data structure; versatile, easy to implement. So why doesn't this talk end here?
– Empty spaces: if the points form sparse clouds, it takes a while to reach them
– Space exponential in the dimension
– Time exponential in the dimension, e.g., for points on the hypercube

Space issues: example

K-d-trees [Bentley’75] Main ideas:
– only one-dimensional splits
– instead of splitting in the middle, choose the split “carefully” (many variations)
– near(est) neighbor queries: as for quadtrees
Advantages:
– no (or fewer) empty spaces
– only linear space
Exponential query time is still possible.
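For concreteness, here is a minimal k-d tree in Python with one-dimensional median splits and a branch-and-bound nearest-neighbor query. The names and the median-split policy are illustrative choices, not Bentley's original formulation:

```python
# A compact k-d tree: one-dimensional splits, cycling through the axes.
import math

def build_kdtree(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])                 # split dimension for this level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                        # choose the split at the median
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, q, best=None):
    """Branch-and-bound nearest-neighbor query, as for quadtrees."""
    if node is None:
        return best
    if best is None or math.dist(q, node["point"]) < math.dist(q, best):
        best = node["point"]
    diff = q[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, q, best)
    # Cross the splitting plane only if the ball around q reaches it --
    # this pruning is what can still degenerate to exponential time.
    if abs(diff) < math.dist(q, best):
        best = nearest(far, q, best)
    return best
```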

Exponential query time What does it mean exactly?
– Unless we do something really stupid, the query time is at most dn
– Therefore, the actual query time is min[dn, exponential(d)]
This is still quite bad, though, when the dimension is moderately high. Unfortunately, it seems inevitable (both in theory and in practice).

Approximate nearest neighbor Can be done using (augmented) k-d trees, by interrupting the search earlier [Arya et al.’94]. Still exponential time (in the worst case)! Try a different approach:
– for exact queries, we can use binary search trees or hashing
– can we adapt hashing to nearest neighbor search?

Locality-Sensitive Hashing [Indyk-Motwani’98] Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q, we have:
– Pr[h(p)=h(q)] is “high” if p is “close” to q
– Pr[h(p)=h(q)] is “low” if p is “far” from q

Do such functions exist? Consider the hypercube, i.e.:
– points from {0,1}^d
– Hamming distance D(p,q) = number of positions on which p and q differ
Define the hash function h by choosing a set I of k random coordinates and setting h(p) = projection of p on I.

Example Take d=10, k=2, I={2,5}. Then h(p) consists of the bits of p at positions 2 and 5; e.g., h(p)=11 for any p with a 1 in both of those positions.

h’s are locality-sensitive Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k. We can vary the probability by changing k. (Plot: Pr vs. distance, with curves for k=1 and k=2.)
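The formula above is easy to check empirically. In this hypothetical Python sketch (names are mine), the k coordinates are drawn independently with replacement, so Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k holds exactly:

```python
# Empirical check of Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k on the Hamming cube.
import random

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def make_hash(d, k, rng):
    # choose k coordinates independently (with replacement), so the
    # slide's collision-probability formula is exact
    I = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[i] for i in I)

def collision_rate(p, q, k, trials=20000, seed=0):
    rng = random.Random(seed)
    d = len(p)
    hits = 0
    for _ in range(trials):
        h = make_hash(d, k, rng)        # fresh random h each trial
        if h(p) == h(q):
            hits += 1
    return hits / trials

p = (0,) * 10
q = (1, 1) + (0,) * 8                              # D(p,q) = 2, d = 10
predicted = (1 - hamming(p, q) / len(p)) ** 2      # k = 2
observed = collision_rate(p, q, k=2)
```

With D(p,q)=2, d=10, and k=2, the predicted collision probability is (0.8)^2 = 0.64, and the observed rate lands close to it.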

How can we use LSH? Choose several hash functions h_1...h_l. Initialize a hash array for each h_i. Store each point p in the bucket h_i(p) of the i-th hash array, for i=1...l. In order to answer a query q:
– for each i=1...l, retrieve the points in bucket h_i(q)
– return the closest point found
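The whole scheme above, for the Hamming cube, can be sketched as follows. The class name and parameter choices are illustrative, not the paper's reference implementation:

```python
# LSH index: l hash tables, each keyed by a k-coordinate projection.
import random
from collections import defaultdict

class LSHIndex:
    def __init__(self, points, k, l, seed=0):
        rng = random.Random(seed)
        d = len(points[0])
        # l independent hash functions h_1..h_l, each projecting onto
        # k coordinates chosen at random (with replacement)
        self.projections = [[rng.randrange(d) for _ in range(k)]
                            for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]
        for p in points:
            for proj, table in zip(self.projections, self.tables):
                table[self._key(p, proj)].append(p)

    @staticmethod
    def _key(p, proj):
        return tuple(p[i] for i in proj)

    def query(self, q):
        # retrieve the points stored in bucket h_i(q), for i = 1..l
        candidates = set()
        for proj, table in zip(self.projections, self.tables):
            candidates.update(table.get(self._key(q, proj), []))
        if not candidates:
            return None
        # return the closest candidate (Hamming distance)
        return min(candidates,
                   key=lambda p: sum(a != b for a, b in zip(p, q)))
```

Note the query inspects only the l probed buckets, never the whole data set; the choice of k and l trades accuracy against the number of candidates retrieved.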

What does this algorithm do? By a proper choice of the parameters k and l, we can make, for any p, the probability that h_i(p)=h_i(q) for some i look like a smooth threshold in the distance. Can control:
– the position of the slope
– how steep it is

The LSH algorithm Therefore, we can solve (approximately) the near neighbor problem with a given parameter r. Worst-case analysis guarantees dn^{1/(1+ε)} query time. Practical evaluation indicates much better behavior [GIM’99, HGI’00, Buh’00, BT’00]. Drawbacks:
– works best for the Hamming distance (although it can be generalized to Euclidean space)
– requires the radius r to be fixed in advance

Secondary storage A disk seek takes about as long as transferring hundreds of KBs, so grouping the data is crucial. A different approach is required:
– in main memory, any reduction in the number of inspected points was good
– on disk, this is not the case!

Disk-based algorithms
– R-tree [Guttman’84]: the departing point for many variations; over 600 citations! (according to CiteSeer); “optimistic” approach: try to answer queries in logarithmic time
– Vector Approximation File [WSB’98]: “pessimistic” approach: if we need to scan the whole data set, we’d better do it fast
– LSH also works on disk

R-tree “Bottom-up” approach (the k-d-tree was “top-down”):
– Start with a set of points/rectangles
– Partition the set into groups of small cardinality
– For each group, find the minimum rectangle containing the objects from this group
– Repeat
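One round of this bottom-up construction can be sketched in Python. The simple sort-by-x grouping below is an illustrative stand-in; real R-tree bulk-loading (e.g. Sort-Tile-Recursive) groups more carefully:

```python
# One level of bottom-up R-tree construction over 2-D points.

def mbr(objects):
    """Minimum bounding rectangle of a list of (x, y) points,
    returned as ((min_x, min_y), (max_x, max_y))."""
    xs = [p[0] for p in objects]
    ys = [p[1] for p in objects]
    return (min(xs), min(ys)), (max(xs), max(ys))

def group_level(points, fanout):
    """Partition into groups of <= fanout points (here: consecutive
    runs in x-order) and wrap each group in its MBR."""
    points = sorted(points)                      # simple x-ordering heuristic
    groups = [points[i:i + fanout] for i in range(0, len(points), fanout)]
    return [(mbr(g), g) for g in groups]
    # Repeating the same step on the MBRs themselves yields the
    # next level of the tree, up to the root.
```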

R-tree ctd.

Advantages:
– Supports near(est) neighbor search (similarly to before)
– Works for points and rectangles
– Avoids empty spaces
– Many variants: X-tree, SS-tree, SR-tree, etc.
– Works well for low dimensions
Not so great for high dimensions.

VA-file [Weber, Schek, Blott’98] Approach:
– In high-dimensional spaces, all tree-based indexing structures examine a large fraction of their leaves
– If we need to visit so many nodes anyway, it is better to scan the whole data set and avoid performing seeks altogether (1 seek = the transfer of a few hundred KB)

VA-file ctd. Natural question: how do we speed up a linear scan? Answer: use approximation.
– Use only i bits per dimension (and speed up the scan by a factor of 32/i)
– Identify all points which could be returned as an answer
– Verify the candidate points using the original data set
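The two-phase scan can be sketched as follows, assuming coordinates in [0,1); all names are illustrative, not from the original paper:

```python
# VA-file style scan: quantize, filter on approximations, then verify.
import math

def quantize(x, bits):
    """Map x in [0,1) to one of 2**bits cells along its dimension."""
    return min(int(x * (1 << bits)), (1 << bits) - 1)

def build_va_file(points, bits):
    """Compact approximation of the data set: `bits` bits per dimension."""
    return [tuple(quantize(x, bits) for x in p) for p in points]

def va_range_query(points, va, q, r, bits):
    cell = 1.0 / (1 << bits)
    qa = tuple(quantize(x, bits) for x in q)
    # Phase 1: cheap scan of the approximations. A point can be within
    # distance r of q only if, in every dimension, its cell is at most
    # r plus one cell width from q's cell (a conservative lower bound,
    # so no true answer is ever dismissed here).
    candidates = [i for i, a in enumerate(va)
                  if all(abs(ai - qi) * cell <= r + cell
                         for ai, qi in zip(a, qa))]
    # Phase 2: verify the candidates against the full-precision data.
    return [points[i] for i in candidates if math.dist(points[i], q) <= r]
```

Phase 1 touches only the small approximation file, which is what makes the sequential scan fast; phase 2 fetches full vectors only for the surviving candidates.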

Time to sum up The “curse of dimensionality” is indeed a curse. In main memory, we can perform sublinear-time search using trees or hashing. In secondary storage, a linear scan is pretty much all we can do (for high d). Personal thought: if linear search is all we can do, we are not doing too well... Maybe it is time to buy a few GB of RAM... but in the end, everything depends on your data set.

Resources Surveys:
– Berchtold & Keim
– Theodoridis
– Agarwal et al. (range searching)

Resources
– Source code
– References: see the surveys, plus the very recent [Buh’00, BT’00] (J. Buhler et al.) and [HGI’00] (Haveliwala et al.)

Contact If you have any questions, feel free to email me. Thank you!