SIMILARITY SEARCH The Metric Space Approach

Similar presentations
Classification using instance-based learning. 3 March 2000, Advanced Knowledge Management. Introduction (lazy vs. eager learning); notion of similarity.

Indexing DNA Sequences Using q-Grams
Chapter 9 Approximating Eigenvalues
Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.
Aggregating local image descriptors into compact codes
CMU SCS : Multimedia Databases and Data Mining Lecture #7: Spatial Access Methods - Metric trees C. Faloutsos.
CS6800 Advanced Theory of Computation
PARTITIONAL CLUSTERING
Efficient access to TIN Regular square grid TIN Efficient access to TIN Let q := (x, y) be a point. We want to estimate an elevation at a point q: 1. should.
Searching on Multi-Dimensional Data
Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Stephen D. Bay 1 and Mark Schwabacher 2 1 Institute for.
Machine Learning: Lecture 7. Instance-Based Learning (IBL) (Based on Chapter 8 of Mitchell T., Machine Learning, 1997)
Indexing the imprecise positions of moving objects Xiaofeng Ding and Yansheng Lu Department of Computer Science Huazhong University of Science & Technology.
SASH Spatial Approximation Sample Hierarchy
Content Based Image Clustering and Image Retrieval Using Multiple Instance Learning Using Multiple Instance Learning Xin Chen Advisor: Chengcui Zhang Department.
Efficient Statistical Pruning for Maximum Likelihood Decoding Radhika Gowaikar Babak Hassibi California Institute of Technology July 3, 2003.
ACM GIS An Interactive Framework for Raster Data Spatial Joins Wan Bae (Computer Science, University of Denver) Petr Vojtěchovský (Mathematics,
Scalable and Distributed Similarity Search in Metric Spaces Michal Batko Claudio Gennaro Pavel Zezula.
Evaluating Hypotheses
Basic Data Mining Techniques
Visual Querying By Color Perceptive Regions Alberto del Bimbo, M. Mugnaini, P. Pala, and F. Turco University of Florence, Italy Pattern Recognition, 1998.
KNN, LVQ, SOM. Instance Based Learning K-Nearest Neighbor Algorithm (LVQ) Learning Vector Quantization (SOM) Self Organizing Maps.
What is Cluster Analysis?
Efficient Algorithms for Motif Search. Sudha Balla, Sanguthevar Rajasekaran, University of Connecticut.
Active Learning for Class Imbalance Problem
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
The many facets of approximate similarity search Marco Patella and Paolo Ciaccia DEIS, University of Bologna - Italy.
A Quantitative Analysis and Performance Study For Similar-Search Methods In High-Dimensional Space. Presented by Umang Shah, Koushik.
SIMILARITY SEARCH The Metric Space Approach Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko.
Efficient Metric Index For Similarity Search Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen.
Similarity Searching in High Dimensions via Hashing. Paper by: Aristides Gionis, Piotr Indyk, Rajeev Motwani.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Externally growing self-organizing maps and its application to database visualization and exploration.
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality Piotr Indyk, Rajeev Motwani The 30 th annual ACM symposium on theory of computing.
Clustering Clustering is a technique for finding similarity groups in data, called clusters. I.e., it groups data instances that are similar to (near)
Multi-object Similarity Query Evaluation Michal Batko.
Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors by Dennis DeCoste and Dominic Mazzoni International.
Indexing Time Series. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Time Series databases Text databases.
Privacy Preserving Outlier Detection using Locality Sensitive Hashing
The inference and accuracy We learned how to estimate the probability that the percentage of some subjects in the sample would be in a given interval by.
Copyright © Cengage Learning. All rights reserved. 3 Discrete Random Variables and Probability Distributions.
Chapter 24 Survey Methods and Sampling Techniques
Strategies for Spatial Joins
Chapter 7. Classification and Prediction
Random Testing: Theoretical Results and Practical Implications IEEE TRANSACTIONS ON SOFTWARE ENGINEERING 2012 Andrea Arcuri, Member, IEEE, Muhammad.
Database Management System
C4.5 algorithm Let the classes be denoted {C1, C2,…, Ck}. There are three possibilities for the content of the set of training samples T in the given node.
Instance Based Learning
A paper on Join Synopses for Approximate Query Answering
Data Mining K-means Algorithm
KD Tree A binary search tree where every node is a
REMOTE SENSING Multispectral Image Classification
Fast Sequence Alignments
Indexing and Hashing Basic Concepts Ordered Indices
Yair Bartal Lee-Ad Gottlieb Hebrew U. Ariel University
15-826: Multimedia Databases and Data Mining
Geology Geomath Chapter 7 - Statistics tom.h.wilson
Locality Sensitive Hashing
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Probabilistic Data Management
Instance Based Learning
Distributed Probabilistic Range-Aggregate Query on Uncertain Data
Machine Learning: UNIT-4 CHAPTER-1
Nearest Neighbors CSC 576: Data Mining.
CS5112: Algorithms and Data Structures for Applications
FEATURE WEIGHTING THROUGH A GENERALIZED LEAST SQUARES ESTIMATOR
Minwise Hashing and Efficient Search
President’s Day Lecture: Advanced Nearest Neighbor Search
Presentation transcript:

SIMILARITY SEARCH The Metric Space Approach Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

Table of Contents
Part I: Metric searching in a nutshell
- Foundations of metric space searching
- Survey of existing approaches
Part II: Metric searching in large collections
- Centralized index structures
- Approximate similarity search
- Parallel and distributed indexes
P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach, Part I, Chapter 1

Approximate similarity search
Approximate similarity search overcomes the problems of exact similarity search with traditional access methods:
- only moderate improvement of performance with respect to a sequential scan
- the dimensionality curse
Exact similarity search returns mathematically precise result sets. Similarity is subjective, so in some cases approximate result sets also satisfy the user.
Approximate similarity search processes queries faster at the price of imprecision in the returned result sets. This is useful, for instance, in interactive systems, where similarity search is an iterative process and temporary results are used to formulate new queries. Improvements of up to two orders of magnitude are possible.

Approximate similarity search
Approximation strategies:
- Relaxed pruning conditions: data regions overlapping the query region can be discarded, depending on the specific strategy
- Early termination of the search algorithm: the search algorithm might stop before all regions have been accessed

Approximate Similarity Search
- relative error approximation (pruning condition): range and k-NN search queries
- good fraction approximation
- small chance improvement approximation
- proximity-based approximation
- PAC nearest neighbor searching
- performance trials

Relative error approximation
Let oN be the nearest neighbour of q. If
d(q, oA) ≤ (1 + ε) · d(q, oN),
then oA is the (1+ε)-approximate nearest neighbour of q. This can be generalized to the k-th nearest neighbour.
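The (1+ε) condition above can be sketched as a small predicate; the function name and arguments are illustrative, not from the book:

```python
def is_approximate_nn(d_q_oa: float, d_q_on: float, epsilon: float) -> bool:
    """Return True if oA qualifies as a (1+epsilon)-approximate nearest
    neighbour of q, given d(q, oA) and the true NN distance d(q, oN)."""
    return d_q_oa <= (1 + epsilon) * d_q_on
```

With epsilon = 0 the predicate reduces to the exact nearest-neighbour condition.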

Relative error approximation
Exact pruning strategy:
[Figure: data ball with centre p and radius rp, query ball with centre q and radius rq]

Relative error approximation
Approximate pruning strategy:
[Figure: data ball (p, rp) and query ball (q, rq), with the query radius reduced to rq/(1+ε)]
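A minimal sketch of the two pruning tests for ball regions, assuming the standard triangle-inequality overlap bound; function names are illustrative:

```python
def can_prune_exact(d_qp: float, r_p: float, r_q: float) -> bool:
    """Exact test: prune the data ball (p, r_p) when it is disjoint from
    the query ball (q, r_q), so no object in the region can qualify."""
    return d_qp > r_q + r_p

def can_prune_approx(d_qp: float, r_p: float, r_q: float,
                     epsilon: float) -> bool:
    """Relaxed test: shrink the query radius to r_q/(1+epsilon), so regions
    that only barely overlap the query ball are discarded as well."""
    return d_qp > r_q / (1 + epsilon) + r_p
```

With epsilon = 0 the relaxed test coincides with the exact one; larger epsilon prunes more regions at the risk of missing qualifying objects.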

Approximate Similarity Search
- relative error approximation (pruning condition): range and k-NN search queries
- good fraction approximation (stop condition): k-NN search queries
- small chance improvement approximation
- proximity-based approximation
- PAC nearest neighbor searching
- performance trials

Good fraction approximation
The k-NN algorithm determines the final result by iteratively reducing the distances of the current result set. When the current result set belongs to a specific fraction of the objects closest to the query, the approximate algorithm stops.
Example: stop when the current result set belongs to the 10% of the objects closest to the query.

Good fraction approximation
For this strategy we use the distance distribution, defined as
Fq(x) = Pr{ d(q, o) ≤ x }.
The distance distribution Fq(x) specifies the probability that the distance of a random object o from q is smaller than x. It is easy to see that Fq(x) gives, in probabilistic terms, the fraction of the database corresponding to the set of objects whose distance from q is smaller than x.

Good fraction approximation
[Figure: query object q and the current k-th nearest neighbour ok]

Good fraction approximation
When Fq(d(ok, q)) < r, all objects of the current result set belong to the fraction r of the dataset.

Good fraction approximation
Fq(x) is difficult to handle, since we would need to compute it for all possible queries. It was proven that the overall distance distribution F(x), defined over pairs of random objects p and o as
F(x) = Pr{ d(p, o) ≤ x },
can be used in practice instead of Fq(x), since they have statistically the same behaviour. F(x) can easily be estimated as a discrete function and maintained in main memory.
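A sketch of how F(x) can be estimated as a discrete function from sampled object pairs and then used in the good-fraction stop condition; all names and parameters here are illustrative, not from the book:

```python
import bisect
import random

def estimate_overall_distribution(objects, distance, samples=10_000, seed=0):
    """Sample distances between random object pairs and return them sorted;
    this sorted list is the discrete estimate of F(x)."""
    rng = random.Random(seed)
    return sorted(distance(rng.choice(objects), rng.choice(objects))
                  for _ in range(samples))

def empirical_f(sorted_dists, x):
    """Empirical F(x): fraction of sampled pair distances smaller than x."""
    return bisect.bisect_left(sorted_dists, x) / len(sorted_dists)

def good_fraction_stop(sorted_dists, d_k, rho):
    """Stop the k-NN search when the current k-th neighbour distance d_k
    already falls within the fraction rho of closest objects."""
    return empirical_f(sorted_dists, d_k) < rho
```

For example, with rho = 0.1 the search halts as soon as the whole current result set lies within the estimated closest 10% of the dataset.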

Approximate Similarity Search
- relative error approximation (pruning condition): range and k-NN search queries
- good fraction approximation (stop condition): k-NN search queries
- small chance improvement approximation (stop condition)
- proximity-based approximation
- PAC nearest neighbor searching
- performance trials

Small chance improvement approximation
The M-tree's k-NN algorithm determines the final result by improving the current result set. At each step of the algorithm the temporary result is improved and the distance of the k-th element decreases. When the improvement of the temporary result set slows down, the algorithm can stop.

Small chance improvement approximation

Small chance improvement approximation
The function f(x) is not known a priori. A regression curve φ(x), which approximates f(x), is computed using the least squares method while the algorithm proceeds. Through the derivative of φ(x) it is possible to decide when the algorithm has to stop.

Small chance improvement approximation
The regression curve has the form
φ(x) = c1 + c2 · φ1(x),
where c1 and c2 are chosen so that the sum of squared errors Σj (φ(xj) − f(xj))² is minimum. We have used both φ1(x) = ln(x) and φ1(x) = 1/x.
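The least squares fit above, for the φ1(x) = ln(x) case, can be sketched directly; function names are illustrative:

```python
import math

def fit_regression(xs, ys):
    """Least-squares fit of phi(x) = c1 + c2*ln(x) to the points (xs, ys),
    via the closed-form solution for simple linear regression on ln(x)."""
    ts = [math.log(x) for x in xs]
    n = len(xs)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    c2 = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
          / sum((t - t_mean) ** 2 for t in ts))
    c1 = y_mean - c2 * t_mean
    return c1, c2

def derivative(c2, x):
    """d/dx of c1 + c2*ln(x): the search can stop once this slope
    falls below a chosen small threshold."""
    return c2 / x
```

As the search proceeds, the fit is recomputed on the observed improvements and the derivative is compared against the stop threshold.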

Regression curves

Approximate Similarity Search
- relative error approximation (pruning condition): range and k-NN search queries
- good fraction approximation (stop condition): k-NN search queries
- small chance improvement approximation (stop condition)
- proximity-based approximation (pruning condition)
- PAC nearest neighbor searching
- performance trials

Proximity-based approximation
Regions whose probability of containing qualifying objects is below a certain threshold are pruned, even if they overlap the query region. The proximity between two regions is defined as the probability that a randomly chosen object appears in both regions. This resulted in a performance increase of two orders of magnitude, both for range queries and nearest neighbour queries.
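The book derives proximity analytically from the distance distribution; as a hedged illustration of the definition only, a Monte Carlo estimate over sampled objects can be sketched (all names and parameters illustrative):

```python
import random

def proximity(center_a, r_a, center_b, r_b, distance, sample,
              trials=10_000, seed=0):
    """Estimate the probability that a randomly drawn object falls inside
    both ball regions (center_a, r_a) and (center_b, r_b)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        o = sample(rng)  # draw a random object from the data distribution
        if distance(o, center_a) <= r_a and distance(o, center_b) <= r_b:
            hits += 1
    return hits / trials
```

A region is then pruned when its proximity to the query region falls below the chosen threshold.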

Proximity-based approximation
[Figure: query region (q, R) overlapping data regions 1, 2 and 3]

Approximate Similarity Search
- relative error approximation (pruning condition): range and k-NN search queries
- good fraction approximation (stop condition): k-NN search queries
- small chance improvement approximation (stop condition)
- proximity-based approximation (pruning condition)
- PAC nearest neighbor searching (pruning & stop condition): 1-NN search queries
- performance trials

PAC nearest neighbour searching
It uses, at the same time, a relaxed branching condition and a stop condition:
- the relaxed branching condition is the same used for relative error approximation, to find a (1+ε)-approximate nearest neighbour
- in addition, it halts prematurely when the probability that we have found the (1+ε)-approximate nearest neighbour is above the threshold δ
It can only be used for 1-NN search queries.

PAC nearest neighbour searching
Let us suppose that the nearest neighbour found so far is oA, and let εact be the actual error on the distance of oA. The algorithm stops if
Pr{ εact > ε } ≤ δ.
The above probability is obtained by computing the distribution of the distance of the nearest neighbour.

PAC nearest neighbour searching
Distribution of the distance of the nearest neighbour in X (of cardinality n) with respect to q:
Gq(x) = Pr{ d(q, oN) ≤ x } = 1 − (1 − Fq(x))^n.
Given that εact = d(q, oA)/d(q, oN) − 1, the algorithm halts when
Gq( d(q, oA)/(1 + ε) ) ≤ δ.
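Combining the distribution of the nearest-neighbour distance with the halting test, the stop condition can be sketched as follows; f_q stands for an assumed callable giving Fq, and all names are illustrative:

```python
def nn_distance_distribution(f_q, x, n):
    """G_q(x) = 1 - (1 - F_q(x))**n: probability that the nearest of n
    random objects lies within distance x of the query q."""
    return 1 - (1 - f_q(x)) ** n

def pac_should_stop(f_q, d_q_oa, n, epsilon, delta):
    """Halt when the probability that some object is closer than
    d(q, oA)/(1+epsilon) -- i.e. that the actual error exceeds epsilon --
    is at most delta."""
    return nn_distance_distribution(f_q, d_q_oa / (1 + epsilon), n) <= delta
```

Larger delta or epsilon makes the test succeed earlier, trading accuracy for speed.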

Approximate Similarity Search
- relative error approximation (pruning condition): range and k-NN search queries
- good fraction approximation (stop condition): k-NN search queries
- small chance improvement approximation (stop condition)
- proximity-based approximation (pruning condition)
- PAC nearest neighbor searching (pruning & stop condition): 1-NN search queries
- performance trials

Comparison tests
- Tests on a dataset of 11,000 objects; objects are vectors of 45 dimensions
- We compared the five approximation approaches
- Range queries tested with the relative error and proximity methods
- Nearest-neighbour queries tested with all methods

Comparisons: range queries

Comparisons: range queries

Comparisons: NN queries

Comparisons: NN queries

Comparisons: NN queries

Comparisons: NN queries

Comparisons: NN queries

Conclusions: Approximate similarity search in metric spaces
- These techniques for approximate similarity search can be applied to generic metric spaces; vector spaces are a special case of metric space
- High accuracy of approximate results is generally obtained together with a large improvement in efficiency
- The best performance is obtained with the good fraction approximation method
- The proximity-based approach is slightly worse than good fraction approximation, but can be used for both range queries and k-NN queries