Similarity Search in High Dimensions via Hashing. Paper by: Aristides Gionis, Piotr Indyk, Rajeev Motwani.

Similar presentations
Nonparametric Methods: Nearest Neighbors

Indexing DNA Sequences Using q-Grams
Shortest Vector In A Lattice is NP-Hard to approximate
Nearest Neighbor Search in High Dimensions Seminar in Algorithms and Geometry Mica Arie-Nachimson and Daniel Glasner April 2009.
Algorithmic High-Dimensional Geometry 1 Alex Andoni (Microsoft Research SVC)
Segmented Hash: An Efficient Hash Table Implementation for High Performance Networking Subsystems Sailesh Kumar Patrick Crowley.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
Big Data Lecture 6: Locality Sensitive Hashing (LSH)
Searching on Multi-Dimensional Data
MIT CSAIL Vision interfaces Towards efficient matching with random hashing methods… Kristen Grauman Gregory Shakhnarovich Trevor Darrell.
Multidimensional Data. Many applications of databases are "geographic" = 2­dimensional data. Others involve large numbers of dimensions. Example: data.
Multidimensional Data Rtrees Bitmap indexes. R-Trees For “regions” (typically rectangles) but can represent points. Supports NN, “where­am­I” queries.
Similarity Search in High Dimensions via Hashing
Mining Time Series.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
Data Structures and Functional Programming Algorithms for Big Data Ramin Zabih Cornell University Fall 2012.
Models and Security Requirements for IDS. Overview The system and attack model Security requirements for IDS –Sensitivity –Detection Analysis methodology.
Modern Information Retrieval
Given by: Erez Eyal Uri Klein Lecture Outline Exact Nearest Neighbor search Exact Nearest Neighbor search Definition Definition Low dimensions Low dimensions.
CS 361A1 CS 361A (Advanced Data Structures and Algorithms) Lecture 19 (Dec 5, 2005) Nearest Neighbors: Dimensionality Reduction and Locality-Sensitive.
1 Lecture 18 Syntactic Web Clustering CS
FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12.
Similarity Search in High Dimensions via Hashing Aristides Gionis, Protr Indyk and Rajeev Motwani Department of Computer Science Stanford University presented.
Hash Tables1 Part E Hash Tables  
Computing Sketches of Matrices Efficiently & (Privacy Preserving) Data Mining Petros Drineas Rensselaer Polytechnic Institute (joint.
1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman
Nearest Neighbor Retrieval Using Distance-Based Hashing Michalis Potamias and Panagiotis Papapetrou supervised by Prof George Kollios A method is proposed.
Hashing General idea: Get a large array
Introducing Hashing Chapter 21 Copyright ©2012 by Pearson Education, Inc. All rights reserved.
Distance Indexing on Road Networks A summary Andrew Chiang CS 4440.
Indexing Techniques Mei-Chen Yeh.
Approximation algorithms for large-scale kernel methods Taher Dameh School of Computing Science Simon Fraser University March 29 th, 2010.
CHP - 9 File Structures. INTRODUCTION In some of the previous chapters, we have discussed representations of and operations on data structures. These.
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
Nearest Neighbor Paul Hsiung March 16, Quick Review of NN Set of points P Query point q Distance metric d Find p in P such that d(p,q) < d(p’,q)
CSC 211 Data Structures Lecture 13
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
NEAREST NEIGHBORS ALGORITHM Lecturer: Yishay Mansour Presentation: Adi Haviv and Guy Lev 1.
Searching Given distinct keys k 1, k 2, …, k n and a collection of n records of the form »(k 1,I 1 ), (k 2,I 2 ), …, (k n, I n ) Search Problem - For key.
March 23 & 28, Hashing. 2 What is Hashing? A Hash function is a function h(K) which transforms a key K into an address. Hashing is like indexing.
Hashing 8 April Example Consider a situation where we want to make a list of records for students currently doing the BSU CS degree, with each.
2005/12/021 Content-Based Image Retrieval Using Grey Relational Analysis Dept. of Computer Engineering Tatung University Presenter: Tienwei Tsai ( 蔡殿偉.
AGC DSP AGC DSP Professor A G Constantinides©1 Signal Spaces The purpose of this part of the course is to introduce the basic concepts behind generalised.
1 Embedding and Similarity Search for Point Sets under Translation Minkyoung Cho and David M. Mount University of Maryland SoCG 2008.
Geometric Problems in High Dimensions: Sketching Piotr Indyk.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality Piotr Indyk, Rajeev Motwani The 30 th annual ACM symposium on theory of computing.
1 Chapter 7 Skip Lists and Hashing Part 2: Hashing.
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.
1 CSCD 326 Data Structures I Hashing. 2 Hashing Background Goal: provide a constant time complexity method of searching for stored data The best traditional.
Mining Data Streams (Part 1)
SIMILARITY SEARCH The Metric Space Approach
Hashing - Hash Maps and Hash Functions
Sublinear Algorithmic Tools 3
Lecture 11: Nearest Neighbor Search
K Nearest Neighbor Classification
Randomized Algorithms CS648
Near(est) Neighbor in High Dimensions
Lecture 16: Earth-Mover Distance
Indexing and Hashing Basic Concepts Ordered Indices
Yair Bartal Lee-Ad Gottlieb Hebrew U. Ariel University
Locality Sensitive Hashing
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
CS5112: Algorithms and Data Structures for Applications
Minwise Hashing and Efficient Search
Error Correction Coding
Presentation transcript:

Similarity Search in High Dimensions via Hashing. Paper by: Aristides Gionis, Piotr Indyk, Rajeev Motwani

• The approach of similarity searching is intended for use in high dimensions.  • The idea behind the approach is that, since the selection of features and the choice of distance metric are rather heuristic, finding an approximate nearest neighbor should suffice for most practical purposes.

• The basic idea is to hash the points of the database so that the probability of collision is much higher for objects that are close to each other than for those that are far apart.  • The need arises from the so-called 'curse of dimensionality' affecting large, high-dimensional databases: in that setting, all known exact search techniques effectively reduce to a linear scan.

• The similarity search problem is to find the nearest (most similar) object to a given query in a given collection of objects.  • Typically the objects of interest are represented as points in R^d and a distance metric is used to measure the similarity of the objects.  • The basic problem is to build an index that supports similarity searching for query objects.

• The problem arises because the existing methods are not entirely satisfactory for large d.  • The approach is based on the observation that for most applications an exact answer is not necessary.  • It also provides the user with a time-quality trade-off.  • The above statements rest on the assumption that searching for approximate answers is faster than finding exact answers.

• The technique is to use locality-sensitive hashing rather than conventional space-partitioning indexing.  • The idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects that are close to each other.  • One can then determine near neighbors by hashing the query point and retrieving the elements stored in the buckets containing it.

• LSH (locality-sensitive hashing) previously achieved a worst-case time of O(dn^(1/ε)) for approximate nearest neighbor search over an n-point database.  • In the presented paper, the worst-case running time is improved by the new technique to O(dn^(1/(1+ε))), which is a significant improvement.

Preliminaries  • l_p^d is used to denote the Euclidean space R^d under the l_p norm, i.e., where the length of a vector (x_1, …, x_d) is defined as (|x_1|^p + … + |x_d|^p)^(1/p).  • Further, d(p,q) denotes the distance between the points p and q in l_p^d.  • We use H^d to represent the Hamming metric space of dimension d.  • We use d_H(p,q) to denote the Hamming distance, i.e., the number of bits in which p and q differ.
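
To make the definitions concrete, here is a minimal Python sketch (not part of the original slides; the function names lp_distance and hamming_distance are illustrative) computing the l_p distance and the Hamming distance defined above.

    def lp_distance(p, q, norm=1):
        # d(p, q) in l_p^d: (|p_1 - q_1|^p + ... + |p_d - q_d|^p)^(1/p)
        return sum(abs(a - b) ** norm for a, b in zip(p, q)) ** (1.0 / norm)

    def hamming_distance(p, q):
        # d_H(p, q): number of coordinate positions in which the bit vectors differ
        return sum(1 for a, b in zip(p, q) if a != b)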

• The general version of the problem is to find the K nearest points in the given database, where K > 1 (the K-NNS problem).  • Our algorithm generalizes to finding the K (> 1) approximate nearest neighbors.  • Here we wish to find K points p_1, …, p_K such that the distance of p_i to the query q is at most (1+ε) times the distance from the i-th nearest point to q.

The Algorithm  • The distance is measured under the l_1 norm.  • All the coordinates of the points in P are positive integers.

Locality-Sensitive Hashing  • The new algorithm is in many respects more natural than the earlier ones: it does not require a bucket to store only one point.  • It has a better running time.  • The analysis is generalized to the case of secondary memory.

• Let C be the largest coordinate value over all points in P.  • Then we can embed P into the Hamming cube H^d' with d' = C·d by transforming each point p = (x_1, …, x_d) into the binary vector v_p = Unary_C(x_1) … Unary_C(x_d), where Unary_C(x) denotes the unary representation of x, i.e., a sequence of x zeroes followed by C−x ones.
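
A minimal sketch of the unary embedding described on this slide, following the slide's convention of x zeroes followed by C−x ones (the helper names unary and embed are illustrative, not from the paper). Note that the embedding turns the l_1 distance between points into the Hamming distance between their bit vectors.

    def unary(x, C):
        # Unary_C(x): x zeroes followed by C - x ones
        return [0] * x + [1] * (C - x)

    def embed(point, C):
        # Map p = (x_1, ..., x_d) to the binary vector v_p of length d' = C * d
        bits = []
        for x in point:
            bits.extend(unary(x, C))
        return bits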

• For an integer l, choose l subsets I_1, …, I_l of {1, …, d'}.  • Let p|I denote the projection of the vector v_p onto the coordinate positions given by I, obtained by concatenating the bits in those positions.  • Denote g_j(p) = p|I_j.  • For preprocessing, we store each p ∈ P in the bucket g_j(p), for j = 1, …, l.  • As the total number of buckets may be large, we compress the buckets by resorting to standard hashing.
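
The preprocessing step might look like the following sketch, which assumes the embed() helper above; points are identified by their position in the input list, and Python dictionaries stand in for the "standard hashing" that compresses the bucket space (the fixed-size table of size M and the bucket-overflow rule of the next slide are not modeled). The parameter k, the number of sampled bit positions per index, is an assumption of this sketch.

    import random

    def build_index(points, C, l, k):
        d_prime = C * len(points[0])                 # dimension of the Hamming cube H^d'
        # I_1, ..., I_l: random subsets of {0, ..., d'-1}, each of size k
        subsets = [random.sample(range(d_prime), k) for _ in range(l)]
        tables = [dict() for _ in range(l)]          # one hash table per index
        for pid, p in enumerate(points):
            v = embed(p, C)
            for j in range(l):
                g = tuple(v[i] for i in subsets[j])  # g_j(p) = p|I_j
                tables[j].setdefault(g, []).append(pid)
        return subsets, tables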

• Thus we use two levels of hashing.  • The LSH function maps a point into a bucket g_j(p), while a standard hash function maps the contents of these buckets into a hash table of size M.  • If a bucket in a given index is full, a new point is simply not added to it, since with very high probability it will be added to some other index.  • This saves the overhead of maintaining a link structure.

• To process a query q, we search the buckets g_1(q), …, g_l(q) until we either encounter at least c·l points or have used all l indices.  • The number of disk accesses is therefore always upper bounded by l, the number of indices.  • Let p_1, …, p_t be the points encountered in the process.  • For the output we return the K nearest of these points, or fewer in case the search did not find that many points.
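
Query processing, continuing the sketch above (it assumes the build_index(), embed() and lp_distance() helpers; the stopping constant c and the number of neighbors K are as on the slide).

    def query(q, points, subsets, tables, C, K, c):
        l = len(tables)
        v = embed(q, C)
        candidates = []
        for j in range(l):
            g = tuple(v[i] for i in subsets[j])      # bucket g_j(q)
            candidates.extend(tables[j].get(g, []))
            if len(candidates) >= c * l:             # stop after c*l candidate points
                break
        # rank the distinct candidates by their l_1 distance to q
        ranked = sorted(set(candidates), key=lambda pid: lp_distance(points[pid], q, 1))
        return ranked[:K]                            # may contain fewer than K points

For example, query(q, points, subsets, tables, C, K=5, c=4) returns up to 5 of the retrieved candidates closest to q under the l_1 norm.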

• The principle behind our method is that the probability of collision of two points p and q is closely related to the distance between them.  • In particular, the larger the distance, the smaller the collision probability.
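
As a concrete illustration (a standard property of bit sampling on the Hamming cube, stated here for intuition rather than taken from the slides): if a hash function h returns a single uniformly random bit position of v_p, then

    Pr[ h(p) = h(q) ] = 1 - d_H(v_p, v_q) / d'

so the collision probability decreases linearly with the Hamming distance, and projecting onto k sampled positions, as g_j does, sharpens the gap to (1 - d_H(v_p, v_q)/d')^k.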