Nearest Neighbor Retrieval Using Distance-Based Hashing
Michalis Potamias and Panagiotis Papapetrou, supervised by Prof. George Kollios

Similar presentations
Song Intersection by Approximate Nearest Neighbours Michael Casey, Goldsmiths Malcolm Slaney, Yahoo! Inc.

Indexing DNA Sequences Using q-Grams
Davide Mottin, Senjuti Basu Roy, Alice Marascu, Yannis Velegrakis, Themis Palpanas, Gautam Das A Probabilistic Optimization Framework for the Empty-Answer.
Aggregating local image descriptors into compact codes
Efficient Processing of Top- k Queries in Uncertain Databases Ke Yi, AT&T Labs Feifei Li, Boston University Divesh Srivastava, AT&T Labs George Kollios,
Nearest Neighbor Search in High Dimensions Seminar in Algorithms and Geometry Mica Arie-Nachimson and Daniel Glasner April 2009.
Fast Algorithms For Hierarchical Range Histogram Constructions
Supervised Learning Techniques over Twitter Data Kleisarchaki Sofia.
Yasuhiro Fujiwara (NTT Cyber Space Labs)
©Silberschatz, Korth and Sudarshan, Database System Concepts, Chapter 12: Indexing and Hashing. Basic Concepts, Ordered Indices, B+-Tree Index Files, B-Tree.
Big Data Lecture 6: Locality Sensitive Hashing (LSH)
Searching on Multi-Dimensional Data
MIT CSAIL Vision interfaces Towards efficient matching with random hashing methods… Kristen Grauman Gregory Shakhnarovich Trevor Darrell.
Efficiently searching for similar images (Kristen Grauman)
Cse 521: design and analysis of algorithms Time & place T, Th pm in CSE 203 People Prof: James Lee TA: Thach Nguyen Book.
Similarity Search in High Dimensions via Hashing
VLSH: Voronoi-based Locality Sensitive Hashing Sung-eui Yoon Authors: Lin Loi, Jae-Pil Heo, Junghwan Lee, and Sung-Eui Yoon KAIST
Mutual Information Mathematical Biology Seminar
Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.
Evaluating Hypotheses
Similarity Search in High Dimensions via Hashing Aristides Gionis, Piotr Indyk and Rajeev Motwani Department of Computer Science Stanford University presented.
1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman
KNN, LVQ, SOM. Instance Based Learning K-Nearest Neighbor Algorithm (LVQ) Learning Vector Quantization (SOM) Self Organizing Maps.
6/29/20151 Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut.
J. Cheng et al., CVPR14 Hyunchul Yang (양현철)
Trip Planning Queries F. Li, D. Cheng, M. Hadjieleftheriou, G. Kollios, S.-H. Teng Boston University.
Evaluating Performance for Data Mining Techniques
Indexing Techniques Mei-Chen Yeh.
Approximation algorithms for large-scale kernel methods Taher Dameh School of Computing Science Simon Fraser University March 29 th, 2010.
1 Machine Learning: Lecture 5 Experimental Evaluation of Learning Algorithms (Based on Chapter 5 of Mitchell T., Machine Learning, 1997)
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps Tan Apaydin – The Ohio State University Guadalupe Canahuate – The Ohio.
Nearest Neighbor Paul Hsiung March 16, Quick Review of NN Set of points P Query point q Distance metric d Find p in P such that d(p,q) < d(p’,q)
A Novel Local Patch Framework for Fixing Supervised Learning Models Yilei Wang 1, Bingzheng Wei 2, Jun Yan 2, Yang Hu 2, Zhi-Hong Deng 1, Zheng Chen 2.
Efficient Metric Index For Similarity Search Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen.
NEAREST NEIGHBORS ALGORITHM Lecturer: Yishay Mansour Presentation: Adi Haviv and Guy Lev 1.
Spatio-temporal Pattern Queries M. Hadjieleftheriou G. Kollios P. Bakalov V. J. Tsotras.
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Piotr Indyk, Rajeev Motwani.
Query Sensitive Embeddings Vassilis Athitsos, Marios Hadjieleftheriou, George Kollios, Stan Sclaroff.
Fast Similarity Search in Image Databases CSE 6367 – Computer Vision Vassilis Athitsos University of Texas at Arlington.
1 CSI5388 Current Approaches to Evaluation (Based on Chapter 5 of Mitchell T., Machine Learning, 1997)
An Approximate Nearest Neighbor Retrieval Scheme for Computationally Intensive Distance Measures Pratyush Bhatt MS by Research(CVIT)
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality Piotr Indyk, Rajeev Motwani The 30 th annual ACM symposium on theory of computing.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Tomáš Skopal 1, Benjamin Bustos 2 1 Charles University in Prague, Czech Republic 2 University of Chile, Santiago, Chile On Index-free Similarity Search.
Eick: kNN kNN: A Non-parametric Classification and Prediction Technique Goals of this set of transparencies: 1.Introduce kNN---a popular non-parameric.
Similarity Measurement and Detection of Video Sequences Chu-Hong HOI Supervisor: Prof. Michael R. LYU Marker: Prof. Yiu Sang MOON 25 April, 2003 Dept.
1 Learning Embeddings for Similarity-Based Retrieval Vassilis Athitsos Computer Science Department Boston University.
Privacy Preserving Outlier Detection using Locality Sensitive Hashing
A Binary Linear Programming Formulation of the Graph Edit Distance Presented by Shihao Ji Duke University Machine Learning Group July 17, 2006 Authors:
Spatial Data Management
SIMILARITY SEARCH The Metric Space Approach
Probabilistic Data Management
Spatio-temporal Pattern Queries
Lecture 11: Nearest Neighbor Search
K Nearest Neighbor Classification
Randomized Algorithms CS648
Near(est) Neighbor in High Dimensions
Locality Sensitive Hashing
University of Crete Department Computer Science CS-562
Searching Similar Segments over Textual Event Sequences
cse 521: design and analysis of algorithms
CS5112: Algorithms and Data Structures for Applications
Minwise Hashing and Efficient Search
President’s Day Lecture: Advanced Nearest Neighbor Search
Topological Signatures For Fast Mobility Analysis
LSH-based Motion Estimation
Machine Learning: Lecture 5
Presentation transcript:

Nearest Neighbor Retrieval Using Distance-Based Hashing
Michalis Potamias and Panagiotis Papapetrou, supervised by Prof. George Kollios

A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as Locality Sensitive Hashing (LSH), have been successfully applied for similarity indexing in vector spaces and in string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multi-bit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, a well-known method for distance-based indexing.

Hierarchical DBH
- Rank queries according to D(Q, N(Q)).
- Divide the space into disjoint subsets (equi-height) and train a separate index for each subset.
- Reduce hash cost by using a small number of "pseudoline" points.

The Problem
- NEAREST NEIGHBOR: given a database S and a distance function D, the task is: for a previously unseen query q, locate a point p of the database such that the distance between q and every point o of the database is greater than or equal to the distance between p and q.
- COST MODEL: minimize the number of distance computations. Computing D may be very expensive, e.g., Dynamic Time Warping for time series or edit distance variants for DNA alignment.
- PROBLEM DEFINITION: define an index structure that answers nearest neighbor queries efficiently.
- A SOLUTION: brute force! Try them all and get the exact answer.
- OUR SOLUTION: are we willing to trade accuracy for efficiency?

Hash-Based Indexing
The idea:
1. Come up with hash functions that hash similar objects to similar buckets.
2. Hash every database object to some buckets.
3. At query time, apply the same hash functions to the query.
4. Filter: retrieve the collisions; the rest of the database is pruned.
5. Refine: compute actual distances and return the object with the smallest distance as the nearest neighbor.

Locality Sensitive Hashing
- Locality sensitive family of functions: objects that are close (within distance r1) collide with probability at least p1, while objects that are far apart (beyond distance r2) collide with probability at most p2, with p1 > p2.
- Amplify the gap between p1 and p2: randomly pick l hash vectors of k functions each, g_i(x) = (h_{i,1}(x), ..., h_{i,k}(x)).
- Probability of collision in at least one of the l hash tables: 1 - (1 - p^k)^l, where p is the collision probability under a single function.

H Using Pseudoline Projections (H_DBH)
- Works on an arbitrary space, but is not locality sensitive!
- Define a line projection function that maps an arbitrary space onto the real line R using two pivot objects x1 and x2:
  F_{x1,x2}(x) = (D(x, x1)^2 + D(x1, x2)^2 - D(x, x2)^2) / (2 D(x1, x2)).
- Real valued to discrete valued: h_{x1,x2,t1,t2}(x) = 1 if F_{x1,x2}(x) falls in [t1, t2], and 0 otherwise.
- Hash tables should be balanced; thus [t1, t2] is chosen from the set V of intervals that capture about half of the database.

Accuracy vs. Efficiency
- Accuracy: how often is the actual nearest neighbor retrieved?
- Efficiency: how much time does nearest neighbor retrieval take?
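To make the pseudoline hash family and the filter-and-refine pipeline concrete, here is a minimal Python sketch. It is an illustration, not the authors' implementation: the names (DBHIndex, pseudoline_projection, euclidean) are mine, the Euclidean distance is only a stand-in so the example runs end to end (DBH itself treats D as a black box), and each binary hash uses a randomly chosen balanced interval [t1, t2] as described above.

```python
# Sketch of distance-based hashing (DBH): l tables, each keyed by k balanced
# binary pseudoline hashes, queried with a filter-and-refine step.
import random
from collections import defaultdict


def euclidean(a, b):
    # Stand-in black-box distance; any distance measure could be plugged in.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def pseudoline_projection(x, x1, x2, D):
    # F_{x1,x2}(x) = (D(x,x1)^2 + D(x1,x2)^2 - D(x,x2)^2) / (2 D(x1,x2))
    d12 = D(x1, x2)
    return (D(x, x1) ** 2 + d12 ** 2 - D(x, x2) ** 2) / (2 * d12)


class DBHIndex:
    def __init__(self, database, D, k=3, l=5, seed=0):
        self.db, self.D, self.k, self.l = database, D, k, l
        rng = random.Random(seed)
        self.tables = []
        for _ in range(l):
            funcs = []
            for _ in range(k):
                x1, x2 = rng.sample(self.db, 2)
                # Choose [t1, t2] so that roughly half of the database hashes
                # to 1 (a balanced binary hash), as the slides require.
                proj = sorted(pseudoline_projection(x, x1, x2, D) for x in self.db)
                lo = rng.randrange(len(proj) // 2)
                hi = lo + len(proj) // 2
                funcs.append((x1, x2, proj[lo], proj[hi - 1]))
            table = defaultdict(list)
            for obj in self.db:
                table[self._key(obj, funcs)].append(obj)
            self.tables.append((funcs, table))

    def _key(self, x, funcs):
        # Concatenate the k binary hashes into one k-bit bucket key.
        return tuple(
            1 if t1 <= pseudoline_projection(x, x1, x2, self.D) <= t2 else 0
            for (x1, x2, t1, t2) in funcs
        )

    def query(self, q):
        # Filter: objects colliding with q in at least one of the l tables.
        candidates = set()
        for funcs, table in self.tables:
            candidates.update(table.get(self._key(q, funcs), ()))
        # Refine: compute actual distances only on the candidates.
        return min(candidates, key=lambda o: self.D(q, o)) if candidates else None


if __name__ == "__main__":
    random.seed(1)
    data = [tuple(random.uniform(0, 1) for _ in range(4)) for _ in range(500)]
    index = DBHIndex(data, euclidean, k=3, l=5)
    q = tuple(random.uniform(0, 1) for _ in range(4))
    print("approximate NN:", index.query(q))
    print("exact NN:      ", min(data, key=lambda o: euclidean(q, o)))
```

The per-query work in this sketch has exactly the two parts quantified in the analysis below: the distance computations needed to hash the query (HashCost) and the distances to the colliding candidates (LookupCost).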
Analysis
- Probability of collision between any two objects X and Y under a randomly chosen hash function: C(X, Y) = Pr_{h in H_DBH}[h(X) = h(Y)].
- The same probability on a k-bit hash table (k independently chosen functions): C(X, Y)^k.
- Probability of collision in at least one of the l hash tables: 1 - (1 - C(X, Y)^k)^l.
- Accuracy, i.e. the probability over all queries Q that we will retrieve the nearest neighbor N(Q): the expectation of 1 - (1 - C(Q, N(Q))^k)^l over the query distribution.
- LookupCost: the expected number of database objects that collide with the query in at least one of the l hash tables, i.e. the sum over all database objects X of 1 - (1 - C(Q, X)^k)^l.
- HashCost: the number of distance computations needed to evaluate the h functions, at most 2kl per query (two pivot distances per binary function), fewer when pseudoline points are shared.
- Total cost per query: LookupCost + HashCost.
- Efficiency (over all queries): the expectation of the total cost per query, taken over the query distribution.

Use sampling to estimate accuracy and efficiency:
1. Sample queries.
2. Sample database objects.
3. Sample hash functions.
4. Compute the integrals (expectations) from these samples.

Finding the Optimal k and l
- Given a required accuracy (say 90%), for k = 1, 2, ... compute the smallest l that yields the required accuracy.
- Typically, the optimal k is the last k for which efficiency still improves.

[Figure: histogram of the number of queries by C(Q, N(Q)).]

Additional Optimizations

Experiments

[Figure: D(Q, N(Q)).]

Conclusion
- General purpose: the distance function is a black box and no metric properties are required.
- Statistical analysis is possible.
- Even when the exact nearest neighbor is not returned, a very close neighbor is returned; for many applications that is fine.
- Not sublinear in the size of the database.
- The analysis is statistical (not probabilistic) and needs "representative" sample sets: on the hands dataset, actual performance was different from the simulation because the training set was not representative.
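To make the sampling-based analysis and the k/l selection procedure concrete, the sketch below (again an illustration of the approach described above, not the authors' code) estimates accuracy and per-query cost from sampled queries, sampled database objects, and sampled hash functions, then searches for the smallest l that reaches a target accuracy for each k. The toy Euclidean data, all identifiers, and the single median threshold per binary hash (instead of a balanced interval) are simplifying assumptions.

```python
# Sampling-based estimation of DBH accuracy and cost, plus the k/l search.
import random

random.seed(0)
DIM, N_DB, N_Q, N_H = 4, 300, 40, 200          # sizes of the samples
D = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

db = [tuple(random.random() for _ in range(DIM)) for _ in range(N_DB)]
queries = [tuple(random.random() for _ in range(DIM)) for _ in range(N_Q)]
nn = [min(db, key=lambda o: D(q, o)) for q in queries]   # true NN of each sampled query


def sample_hash():
    # One balanced binary pseudoline hash: project onto two random pivots and
    # threshold at the median database projection (simplification of [t1, t2]).
    x1, x2 = random.sample(db, 2)
    d12 = D(x1, x2)
    proj = lambda x: (D(x, x1) ** 2 + d12 ** 2 - D(x, x2) ** 2) / (2 * d12)
    t = sorted(proj(x) for x in db)[N_DB // 2]
    return lambda x: int(proj(x) >= t)


hs = [sample_hash() for _ in range(N_H)]                  # sampled hash functions
bits = {x: [h(x) for h in hs] for x in set(db) | set(queries)}


def C(x, y):
    # Empirical collision probability of x and y under one random binary hash.
    return sum(a == b for a, b in zip(bits[x], bits[y])) / N_H


C_nn = [C(q, n) for q, n in zip(queries, nn)]             # C(Q, N(Q)) per query
C_db = [[C(q, x) for x in db] for q in queries]           # C(Q, X) per query/object


def accuracy(k, l):
    # Mean over sampled queries of 1 - (1 - C(Q, N(Q))^k)^l.
    return sum(1 - (1 - c ** k) ** l for c in C_nn) / N_Q


def cost(k, l):
    # LookupCost (expected colliding objects) + HashCost (<= 2*k*l distances).
    lookup = sum(1 - (1 - c ** k) ** l for row in C_db for c in row) / N_Q
    return lookup + 2 * k * l


target, best = 0.90, None
for k in range(1, 8):
    # Smallest l that reaches the target accuracy for this k (if any).
    l = next((l for l in range(1, 500) if accuracy(k, l) >= target), None)
    if l is None:
        break
    c = cost(k, l)
    print(f"k={k}  l={l}  accuracy~{accuracy(k, l):.2f}  cost~{c:.1f}")
    if best is not None and c >= best[2]:
        break                                   # efficiency stopped improving
    best = (k, l, c)

print("chosen (k, l):", best[:2] if best else None)
```

As the conclusion notes, estimates of this kind are only as good as the samples: when the sampled queries and objects are representative, the printed table tracks the accuracy/efficiency trade-off, but on unrepresentative training sets (as with the hands dataset) actual performance can differ from the simulation.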