Similarity Search in High Dimensions via Hashing
Aristides Gionis, Piotr Indyk and Rajeev Motwani
Department of Computer Science, Stanford University


Similarity Search in High Dimensions via Hashing
Aristides Gionis, Piotr Indyk and Rajeev Motwani, Department of Computer Science, Stanford University
Presented by Jiyun Byun, Vision Research Lab, ECE, UCSB

Outline
- Introduction
- Locality Sensitive Hashing
- Analysis
- Experiments
- Concluding Remarks

Introduction
- Nearest neighbor search (NNS) suffers from the curse of dimensionality
  - experimental approach: use heuristics
  - analytical approach
- Approximate approach: ε-Nearest Neighbor Search (ε-NNS)
  - Goal: for any query q ∈ R^d, return a point p ∈ P with d(q,p) ≤ (1+ε)·d(q,P), where d(q,P) is the distance from q to its closest point in P
  - works because right answers are usually much closer to q than irrelevant ones
  - time/quality trade-off
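The ε-NNS guarantee can be checked against a brute-force scan; a minimal sketch (the sample points, `dist` helper, and function names are illustrative, not from the paper):

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_eps_nn(q, p, P, eps):
    """Check the epsilon-NNS guarantee: d(q, p) <= (1 + eps) * d(q, P)."""
    d_star = min(dist(q, x) for x in P)  # exact distance from q to its closest point in P
    return dist(q, p) <= (1 + eps) * d_star

P = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
q = (0.0, 1.0)
print(is_eps_nn(q, (0.0, 0.0), P, 0.5))  # True: (0,0) is the exact nearest neighbor
print(is_eps_nn(q, (3.0, 4.0), P, 0.5))  # False: distance ~4.24 exceeds 1.5 * 1.0
```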

Locality Sensitive Hashing (LSH)
- Collision probability depends on the distance between points
  - higher collision probability for close points
  - lower collision probability for points that are far apart
- Given a query point, hash it using a set of hash functions, then inspect the entries in each matching bucket

Locality Sensitive Hashing

Locality Sensitive Hashing (LSH): Setting
- C: the largest coordinate value among all points in the given dataset P ⊂ R^d
- Embed P into the Hamming cube {0,1}^d' of dimension d' = Cd
  - v(p) = Unary_C(x_1) ... Unary_C(x_d): the unary code of each coordinate, concatenated over the d dimensions
- The embedding is isometric: d_1(p,q) = d_H(v(p), v(q)), so it preserves the distance between points
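The unary embedding and its isometry can be verified in a few lines; a sketch assuming non-negative integer coordinates bounded by C (function names are my own):

```python
def unary(x, C):
    """Unary code of integer coordinate x (0 <= x <= C): x ones followed by C - x zeros."""
    return [1] * x + [0] * (C - x)

def embed(p, C):
    """Embed a point with integer coordinates into the Hamming cube {0,1}^(C*d)."""
    return [b for x in p for b in unary(x, C)]

def l1(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

p, q, C = (2, 5), (4, 1), 5
# isometry: d_1(p, q) == d_H(v(p), v(q))
assert l1(p, q) == hamming(embed(p, C), embed(q, C))
```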

Locality Sensitive Hashing (LSH): Hash functions (1/2)
- Build hash functions on the Hamming cube in d' dimensions
- Choose L subsets of the dimensions: I_1, I_2, ..., I_L
  - each I_j consists of k elements of {1, ..., d'}, found by sampling uniformly at random with replacement
- Project each point onto each I_j: g_j(p) is obtained by concatenating the bit values of p at the dimensions in I_j
- Store p in buckets g_j(p), j = 1, ..., L
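A minimal sketch of this index-building step (the function and variable names are my own, not the paper's):

```python
import random

def build_tables(points_bits, k, L, seed=0):
    """points_bits: dict of point id -> bit vector v(p) of length d'.
    Returns L pairs (I_j, buckets): the k sampled dimensions and the
    buckets mapping each projection g_j(p) to the points stored there."""
    rng = random.Random(seed)
    d_prime = len(next(iter(points_bits.values())))
    tables = []
    for _ in range(L):
        # k dimensions sampled uniformly at random, with replacement
        I_j = [rng.randrange(d_prime) for _ in range(k)]
        buckets = {}
        for pid, v in points_bits.items():
            g = tuple(v[i] for i in I_j)  # projection g_j(p)
            buckets.setdefault(g, []).append(pid)
        tables.append((I_j, buckets))
    return tables
```

Because the sampling is with replacement, a dimension may appear more than once in the same I_j, exactly as in the scheme above.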

Locality Sensitive Hashing (LSH): Hash functions (2/2)
- Two levels of hashing
  - the LSH function maps a point p to bucket g_j(p)
  - a standard hash function maps the contents of the buckets into a hash table of size M
- B: bucket capacity
- α: memory utilization parameter

Query processing
- Search buckets g_1(q), ..., g_L(q) until c·L points are found or all L indices are searched
- Approximate K-NNS: output the K candidate points closest to q (fewer if less than K points are found)
- ε-neighbor version: the same search with distance parameter r
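A sketch of the candidate-collection loop, assuming tables stored as a list of (sampled dimensions, buckets) pairs (the names are illustrative):

```python
def query_candidates(tables, q_bits, max_candidates):
    """Probe bucket g_j(q) in each of the L tables, collecting distinct
    candidates until max_candidates (= c*L) points are found or every
    table has been searched."""
    candidates, seen = [], set()
    for I_j, buckets in tables:
        g = tuple(q_bits[i] for i in I_j)  # projection g_j(q)
        for pid in buckets.get(g, []):
            if pid not in seen:
                seen.add(pid)
                candidates.append(pid)
                if len(candidates) >= max_candidates:
                    return candidates
    return candidates
```

The approximate K-NNS answer is then the K candidates closest to q under the true distance.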

Analysis
- A family of hash functions is (r_1, r_2, p_1, p_2)-sensitive if points within distance r_1 collide with probability at least p_1 and points at distance at least r_2 collide with probability at most p_2, where r_1 < r_2 and p_1 > p_2
- The family of single-bit projections in the Hamming cube {0,1}^d' is (r, r(1+ε), 1 - r/d', 1 - r(1+ε)/d')-sensitive
  - if d_H(q,p) = r (r bits on which p and q differ), then Pr[h(q) ≠ h(p)] = r/d', so Pr[h(q) = h(p)] = 1 - r/d'
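The two collision probabilities can be checked numerically; a sketch with illustrative parameter values (not taken from the paper):

```python
import random

d_prime, r, eps = 100, 10, 0.5
p1 = 1 - r / d_prime              # collision probability at distance r
p2 = 1 - r * (1 + eps) / d_prime  # collision probability at distance r(1+eps)
assert p1 > p2  # 0.9 > 0.85: close points collide more often

# A single-bit projection h collides on q and p iff the sampled bit agrees,
# so Pr[h(q) = h(p)] is exactly the fraction of agreeing bits.
rng = random.Random(1)
q = [rng.randint(0, 1) for _ in range(d_prime)]
p = q[:]
for i in rng.sample(range(d_prime), r):  # flip exactly r bits: d_H(p, q) = r
    p[i] ^= 1
agree = sum(qb == pb for qb, pb in zip(q, p)) / d_prime
assert abs(agree - p1) < 1e-9
```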

LSH solves the (r,ε)-neighbor problem
- Determine whether there exists a point within distance r of the query point q, or whether all points are at least distance r(1+ε) away from q
- In the former case, return a point within distance r(1+ε) of q
- Repeat the construction to boost the success probability

ε-NN problem
- For a given query point q, return a point p from the dataset P
- Reduce to multiple instances of the (r,ε)-neighbor problem with increasing radii: (r_0, ε)-neighbor, (r_0(1+ε), ε)-neighbor, (r_0(1+ε)^2, ε)-neighbor, ..., up to r_max
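The radii above form a geometric schedule; a minimal sketch (the function name and the exact stopping rule at r_max are my own):

```python
def radius_schedule(r0, eps, r_max):
    """Radii r0, r0*(1+eps), r0*(1+eps)^2, ..., capped at r_max.
    Each radius defines one (r, eps)-neighbor instance to query in turn."""
    radii, r = [], r0
    while r < r_max:
        radii.append(r)
        r *= 1 + eps
    radii.append(r_max)
    return radii

print(radius_schedule(1.0, 1.0, 10.0))  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

The first instance that reports a neighbor yields the ε-NN answer.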

Experiments (1/3)
- Datasets
  - color histograms (Corel Draw): n = 20,000; d = 8, ..., 64
  - texture features (aerial photos): n = 270,000; d = 60
- Query sets
- Disk layout: each second-level bucket is mapped directly to a disk block

Experiments (2/3): dataset profiles
[Figure: normalized frequency vs. interpoint distance, for the color-histogram and texture-feature datasets]

Experiments (3/3): performance measures
- Speed: average number of disk blocks accessed per query
- Effective error: average over queries of d_LSH(q)/d*(q), where d_LSH(q) is the distance to the nearest neighbor found by LSH and d*(q) is the true nearest-neighbor distance
- Miss ratio: the fraction of queries for which no answer was found
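Both measures are easy to compute from per-query results; a sketch in which `None` marks a query with no answer (this representation is my own, not the paper's):

```python
def effective_error(lsh_dists, true_dists):
    """Average of d_LSH(q) / d*(q) over queries where LSH found an answer."""
    ratios = [dl / dt for dl, dt in zip(lsh_dists, true_dists) if dl is not None]
    return sum(ratios) / len(ratios)

def miss_ratio(lsh_dists):
    """Fraction of queries for which no answer was found."""
    return sum(d is None for d in lsh_dists) / len(lsh_dists)

lsh = [1.0, 2.0, None, 1.5]   # distances returned by LSH (None = miss)
true = [1.0, 1.0, 1.0, 1.0]   # true nearest-neighbor distances
print(effective_error(lsh, true))  # 1.5
print(miss_ratio(lsh))             # 0.25
```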

Experiments: color histogram (1/4)
[Figure: error vs. number of indices (L)]

Experiments: color histogram (2/4): dependence on n
[Figure: disk accesses vs. number of database points, for approximate 1-NNS and 10-NNS]

Experiments: color histogram (3/4): miss ratios
[Figure: miss ratio vs. number of database points, for approximate 1-NNS and 10-NNS]

Experiments: color histogram (4/4): dependence on d
[Figure: disk accesses vs. number of dimensions, for approximate 1-NNS and 10-NNS]

Experiments: texture features (1/2)
[Figure: number of indices vs. error]

Experiments: texture features (2/2)
[Figure: number of indices vs. size]

Concluding remarks
- Locality Sensitive Hashing gives fast approximate nearest-neighbor search
- Supports a dynamic/join version
- Future work: hybrid techniques combining tree-based and hashing-based methods