Locality Sensitive Hashing


Locality Sensitive Hashing Petra Kohoutková, Martin Kyselák

Outline
Motivation: NN query, randomized approximate NN, open problems
Definition: LSH basic principles, definition, illustration
Algorithm: basic idea, parameters, complexity
Examples: specific LSH functions for several distance measures

Motivation Nearest neighbor queries: the goal is to preprocess a dataset of objects (e.g., images) so that later, given a new query object, one can quickly return the dataset object that is most similar to the query. Several efficient algorithms are known for the case when the dimension d is low, e.g., kd-trees. However, these solutions suffer from either space or query time that is exponential in d. Thus, all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high d. This phenomenon is often called “the curse of dimensionality.”

Motivation II Approximate NN: return a point whose distance from the query is at most c times the distance from the query to its nearest point; c > 1 is called the approximation factor. Randomized c-approximate R-near neighbor ((c, R)-NN) problem: given a set P of points in a d-dimensional space and parameters R > 0, δ > 0, construct a data structure such that, given any query point q, if there exists an R-near neighbor of q in P, it reports some cR-near neighbor of q in P with probability 1 – δ. Locality Sensitive Hashing (LSH): the key idea is to use hash functions such that the probability of collision is much higher for objects that are close to each other than for those that are far apart. One can then find near neighbors by hashing the query point and retrieving the elements stored in the buckets containing that point. The aim is a query time linear in d and sublinear in n.

LSH Definition The LSH algorithm relies on the existence of locality-sensitive hash functions. Let H be a family of hash functions mapping Rd to some universe U. For any two points p and q, consider a process in which we choose a function h from H uniformly at random, and analyze the probability that h(p)= h(q).

LSH Definition II Definition: A family H of functions h: Rd → U is called (R, cR, P1, P2)-sensitive if, for any p, q:
If |p-q| ≤ R, then Pr[h(p) = h(q)] ≥ P1.
If |p-q| ≥ cR, then Pr[h(p) = h(q)] ≤ P2.
In order for a locality-sensitive hash (LSH) family to be useful, it has to satisfy P1 > P2.

Example: Hamming Distance Consider a data set of binary strings (000100, 000101, …) compared by the Hamming distance D. In this case we can use a simple family of functions H which contains all projections of the input point onto one of the coordinates: H = {hi | hi : {0,1}d → {0,1}, hi(p) = pi}, where pi is the i-th bit of p. The collision probability is then Pr[h(p) = h(q)] = 1 - D(p,q)/d. For example, for p = 000100 and q = 000101 (which differ in one bit), Pr[h(p) = h(q)] = 5/6; with R = 1 and c = 2, the family H is (1, 2, 5/6, 4/6)-sensitive.
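
As a quick sanity check (not part of the original slides), the Python sketch below samples a bit-sampling hash function at random and empirically estimates the collision probability for the example strings; the helper name sample_hash is an illustrative choice.

```python
import random

# Bit-sampling LSH family for Hamming distance: h_i(p) returns the i-th bit of p.
def sample_hash(d):
    """Draw h_i uniformly at random from H = {h_1, ..., h_d}."""
    i = random.randrange(d)
    return lambda p: p[i]

p, q = "000100", "000101"
d = len(p)

# Empirical estimate of Pr[h(p) = h(q)]; it should approach 1 - D(p,q)/d = 5/6.
trials = 100_000
collisions = 0
for _ in range(trials):
    h = sample_hash(d)
    if h(p) == h(q):
        collisions += 1
print(collisions / trials)  # ≈ 0.833
```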

LSH Algorithm An LSH family H can be used to design an efficient algorithm for approximate NN search. However, one typically cannot use H directly, since the gap between the probabilities P1 and P2 can be quite small. Given a family H of hash functions with parameters (R, cR, P1, P2), we amplify the gap between the high probability P1 and the low probability P2 by concatenating several functions. In particular, for parameters k and L (specified later), we choose L functions gj, j = 1,…,L, by setting gj(q) = (h1,j(q), h2,j(q), …, hk,j(q)), where the ht,j (1 ≤ t ≤ k, 1 ≤ j ≤ L) are chosen independently and uniformly at random from H. These are the actual functions that we use to hash the data points.

LSH Algorithm – Parameters (see A. Andoni, P. Indyk: Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions).

LSH Algorithm II Preprocessing: Choose L functions gj, j = 1,…,L, by setting gj = (h1,j, h2,j, …, hk,j), where h1,j,…,hk,j are chosen at random from the LSH family H. Construct L hash tables, where, for each j = 1,…,L, the j-th hash table contains the dataset points hashed using the function gj. Query algorithm for a query point q: for each j = 1, 2,…, L, retrieve the points from the bucket gj(q) in the j-th hash table; for each retrieved point, compute its distance from q and report the point if it is a correct answer (i.e., a cR-near neighbor of q).
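
Below is a minimal sketch of the preprocessing and query steps just described, instantiated with the bit-sampling family for Hamming distance; the class name HammingLSH and all parameter values are illustrative assumptions, not code from the slides.

```python
import random
from collections import defaultdict

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

class HammingLSH:
    """Minimal LSH index for binary strings under Hamming distance (bit sampling)."""

    def __init__(self, dim, k, L):
        # Each g_j concatenates k randomly chosen bit positions; L independent tables.
        self.g = [tuple(random.randrange(dim) for _ in range(k)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, j, p):
        return tuple(p[i] for i in self.g[j])

    def index(self, points):
        # Preprocessing: store every dataset point in the bucket g_j(p) of the j-th table.
        for p in points:
            for j, table in enumerate(self.tables):
                table[self._key(j, p)].append(p)

    def query(self, q, R, c):
        # Query: scan the L buckets g_j(q) and report any retrieved point within cR of q.
        for j, table in enumerate(self.tables):
            for p in table[self._key(j, q)]:
                if hamming(p, q) <= c * R:
                    return p
        return None

# Example usage with illustrative parameters.
data = ["000100", "111111", "000000", "101010"]
lsh = HammingLSH(dim=6, k=2, L=4)
lsh.index(data)
print(lsh.query("000101", R=1, c=2))  # likely (not guaranteed) to return a 2-near neighbor such as "000100"
```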

LSH Algorithm III The LSH algorithm gives a solution to the randomized c-approximate R-near neighbor problem, with parameters R and δ, for some constant failure probability δ < 1. The value of δ depends on the choice of the parameters k and L. Conversely, for each δ, one can choose parameters k and L so that the error probability is smaller than δ. The query time also depends on k and L. It could be as high as Θ(n) in the worst case, but, for many natural data sets, a proper choice of parameters results in a sublinear query time O(dn^ρ) for some ρ < 1.

LSH Functions
for Hamming distance
for l1 distance
for Euclidean (l2) distance
for Jaccard’s coefficient
for Arccos measure
… for general metric space?

LSH for Hamming Distance The Hamming distance of strings p, q is equal to the number of positions where p and q differ. Define a family of functions H which contains all projections of the input point onto one of the coordinates: H = {hi | hi : {0,1}d → {0,1}, hi(p) = pi}, where pi is the i-th bit of p.

LSH for l1 Distance The l1 distance of vectors x, y is defined as l1(x,y) = |x1-y1| + … + |xd-yd|. Fix a real w >> R and impose a randomly shifted grid with cells of width w; each cell defines a bucket. More specifically, pick random real numbers s1,…,sd from [0, w) and define hs1,…,sd(x) = (⌊(x1 – s1)/w⌋, …, ⌊(xd – sd)/w⌋).
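
A small Python sketch of one hash function from this randomly shifted grid family; the function name make_l1_hash and the concrete cell width are illustrative choices, not values from the slides.

```python
import math
import random

def make_l1_hash(d, w):
    """Sample one hash function from the randomly-shifted-grid family for l1 distance.
    w is the cell width (chosen >> R); s_1,...,s_d are uniform shifts in [0, w)."""
    s = [random.uniform(0, w) for _ in range(d)]
    return lambda x: tuple(math.floor((x[i] - s[i]) / w) for i in range(d))

h = make_l1_hash(d=2, w=4.0)
print(h((1.0, 2.5)), h((1.2, 2.9)))  # nearby points usually fall in the same grid cell
```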

LSH for Euclidean (l2) Distance The l2 distance is defined as l2(x,y) = (|x1-y1|^2 + … + |xd-yd|^2)^(1/2). Pick a random projection of Rd onto a 1-dimensional line and chop the line into segments of length w, shifted by a random value b from [0, w): hr,b(x) = ⌊(r·x + b)/w⌋, where the projection vector r from Rd is constructed by picking each coordinate of r from the Gaussian distribution. (The slide’s illustration uses w = 3 with r = (3, 1), b = 2 and r = (4, -1), b = 2.)
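
The same idea in a short Python sketch; make_l2_hash and the sample values below are illustrative assumptions, not the authors’ code.

```python
import math
import random

def make_l2_hash(d, w):
    """Sample h_{r,b}(x) = floor((r·x + b)/w): r has i.i.d. Gaussian coordinates,
    b is uniform in [0, w). Points close in l2 are likely to land in the same bin."""
    r = [random.gauss(0.0, 1.0) for _ in range(d)]
    b = random.uniform(0, w)
    return lambda x: math.floor((sum(ri * xi for ri, xi in zip(r, x)) + b) / w)

h = make_l2_hash(d=2, w=3.0)
print(h((3.0, 1.0)), h((3.1, 0.9)))  # nearby points usually hash to the same segment
```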

LSH for Jaccard’s Coefficient Jaccard’s coefficient for sets is defined as s(A, B) = |A ∩ B| / |A ∪ B|; the corresponding Jaccard distance is d(A, B) = 1 – s(A, B). Pick a random permutation π on the ground universe U and define hπ(A) = min{π(a) | a ∈ A}. The probability of collision is then Prπ[hπ(A) = hπ(B)] = 1 – d(A, B). (The slide illustrates this with two permutations π1, π2 applied to example sets A and B.)
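
A brief Python sketch of this min-wise hashing scheme, estimating the collision probability for two example sets; the function name make_minhash and the example sets are illustrative.

```python
import random

def make_minhash(universe):
    """Sample one hash function: a random permutation pi of the ground universe U;
    h_pi(A) = min over a in A of pi(a)."""
    order = random.sample(sorted(universe), len(universe))
    pi = {x: rank for rank, x in enumerate(order)}
    return lambda A: min(pi[a] for a in A)

U = set(range(100))
A = {1, 2, 3, 4, 5}
B = {3, 4, 5, 6, 7}

# Empirical collision probability approaches |A ∩ B| / |A ∪ B| = 3/7 ≈ 0.43.
trials = 20_000
hits = 0
for _ in range(trials):
    h = make_minhash(U)
    if h(A) == h(B):
        hits += 1
print(hits / trials)
```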

LSH for Arccos Measure The arccos measure for vectors p, q is the angle between them, θ(p, q) = arccos(p·q / (|p| |q|)). A family of LSH functions is then defined as H = {hu | hu(p) = sign(u·p)}, where u is a random unit-length vector; the collision probability is Pr[hu(p) = hu(q)] = 1 – θ(p, q)/π.
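
A short Python sketch of this sign-of-random-projection family; make_arccos_hash is an illustrative name, and the vectors in the usage example are made up.

```python
import math
import random

def make_arccos_hash(d):
    """Sample h_u(p) = sign(u · p) for a random direction u (Gaussian coordinates,
    normalized to unit length)."""
    u = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in u))
    u = [c / norm for c in u]
    return lambda p: 1 if sum(ui * pi for ui, pi in zip(u, p)) >= 0 else -1

# Collision probability is 1 - theta(p, q)/pi, where theta is the angle between p and q.
h = make_arccos_hash(d=3)
print(h((1.0, 0.2, 0.0)), h((1.0, 0.3, 0.1)))  # nearly parallel vectors usually agree
```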

LSH for General Metric Space Here we know nothing about the distance measure; we can only use distances between objects. Idea: use some randomly (?) picked objects as pivots and define the buckets as their Voronoi regions. Whether this yields a useful locality-sensitive family remains an open question.
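
As a rough illustration only (this is my own sketch of the open idea above, not an established LSH construction), one could pick random pivots and hash every object to its nearest pivot, i.e. to its Voronoi region:

```python
import random

def make_pivot_hash(dataset, distance, num_pivots):
    """Pick num_pivots random objects as pivots and map each object to the index of
    its closest pivot (its Voronoi region). Whether this gives a locality-sensitive
    family for a given metric is exactly the open question raised above."""
    pivots = random.sample(dataset, num_pivots)
    return lambda x: min(range(num_pivots), key=lambda i: distance(x, pivots[i]))

# Toy usage with a one-dimensional metric; nearby values often share a bucket.
data = [random.random() for _ in range(100)]
h = make_pivot_hash(data, distance=lambda a, b: abs(a - b), num_pivots=4)
print(h(0.5), h(0.51))
```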