On Index-free Similarity Search in Metric Spaces. Tomáš Skopal (Charles University in Prague, Czech Republic), Benjamin Bustos (University of Chile, Santiago, Chile).

Presentation transcript:

On Index-free Similarity Search in Metric Spaces
Tomáš Skopal (1), Benjamin Bustos (2)
(1) Charles University in Prague, Czech Republic
(2) University of Chile, Santiago, Chile
DEXA 2009, Linz, Austria

Outline
- Metric approach to similarity search
- Motivation for index-free similarity search
- D-file (+ D-cache)
- Experiments
- Conclusion

Similarity search
- Multimedia databases, time series, bioinformatics, ...
- Content-based similarity search (query by example):
  - k nearest neighbors query ("give me the 3 most similar")
  - range query ("give me the very similar ones, over 80%")
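The two query types can be illustrated with a plain sequential scan. This is a minimal sketch, not code from the presentation: the function names `range_query`, `knn_query` and `l1`, and the toy data, are assumptions made for illustration, and the range query is expressed with a distance radius rather than a similarity percentage.

```python
import heapq

def range_query(db, q, radius, dist):
    """Return all objects whose distance to q is at most radius."""
    return [o for o in db if dist(q, o) <= radius]

def knn_query(db, q, k, dist):
    """Return the k objects closest to q."""
    return heapq.nsmallest(k, db, key=lambda o: dist(q, o))

def l1(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical toy database of 2D points.
db = [(0, 0), (1, 1), (5, 5), (2, 0)]
print(range_query(db, (0, 0), 2, l1))  # [(0, 0), (1, 1), (2, 0)]
print(knn_query(db, (0, 0), 1, l1))    # [(0, 0)]
```

Both functions call `dist` once per database object, which is exactly the cost that the filtering techniques on the following slides try to reduce.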

Metric approach to similarity search
- The similarity δ (actually a distance) is computationally expensive: often O(m^2), sometimes even O(2^m), with respect to the size m of a compared object.
- Querying by a sequential scan over a database of n objects is thus expensive.
- The goal: minimize the number of distance computations δ for a query.
- The way: using metric distances (the metric postulates) allows the data space to be partitioned.
- The search is then performed in only some of the partitions → efficient search.
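For reference, the metric postulates mentioned on this slide are the standard axioms of a metric. The slide does not list them; the symbols x, y, z below are introduced here and stand for arbitrary objects.

```latex
\begin{align*}
\delta(x, y) &= 0 \iff x = y && \text{(identity)} \\
\delta(x, y) &> 0 \quad \text{for } x \neq y && \text{(non-negativity)} \\
\delta(x, y) &= \delta(y, x) && \text{(symmetry)} \\
\delta(x, z) &\le \delta(x, y) + \delta(y, z) && \text{(triangle inequality)}
\end{align*}
```

The triangle inequality is the postulate that makes lower-bound filtering possible.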

Using lower-bound distances for filtering database objects
- A cheap determination of a tight lower bound on the distance δ(*,*) provides a mechanism for quickly filtering irrelevant objects from the search.
- This filtering is used in various forms by metric access methods. Here X stands for a database object and P for a pivot object.
[Figure: query ball centered at Q with radius r, pivot P, and database object X]
The task: check whether X is inside the query ball.
- We know δ(Q,P).
- We know δ(P,X).
- We do not know δ(Q,X).
- We do not have to compute δ(Q,X): by the triangle inequality, δ(Q,P) − δ(X,P) is a lower bound on it, and this lower bound is larger than r, so X cannot be in the query ball and is ignored.
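The filtering test above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the absolute value is a generalization of the slide's δ(Q,P) − δ(X,P), which covers the case where the pivot is closer to Q than to X as well.

```python
def lower_bound(d_qp, d_px):
    """Triangle-inequality lower bound on dist(Q, X) via a pivot P."""
    return abs(d_qp - d_px)

def can_discard(d_qp, d_px, radius):
    """True if X is provably outside the query ball (Q, radius)."""
    return lower_bound(d_qp, d_px) > radius

# Hypothetical numbers: dist(Q, P) = 10, dist(P, X) = 3, radius r = 5.
print(can_discard(10, 3, 5))  # True: lower bound 7 exceeds r
print(can_discard(10, 7, 5))  # False: lower bound 3, dist(Q, X) must be computed
```

A `False` result does not mean X is in the query ball; it only means the lower bound is not strong enough to exclude it.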

Index-based metric access methods
- All metric access methods (MAMs) are index-based, i.e., preprocessing of the database is always needed.
- Index construction usually takes between O(kn) and O(n^2).
[Figures: M-tree, PM-tree, GNAT]

Motivation for index-free search
Indexing is not desirable (or even possible) if:
- we have a highly changeable database: more inserts/deletes/updates than searches, e.g., streaming databases, archives, logs, sensory databases
- we perform isolated searches: a database is created for a few queries and then discarded, e.g., in data mining tasks
- we switch between distances (changing similarity): the distance function is tuned at query time, e.g., weighting of object features is applied dynamically

D-file
- Just the original database using a sequential scan, BUT
- it uses the D-cache:
  - a memory-resident structure that maintains the distances computed during previous queries
  - provides lower bounds on requested distances, which can be used to filter some of the database objects when querying
  - O(1) complexity for a lower-bound retrieval
- No preprocessing (indexing) of the database

D-file – range query
[Figure: simple sequential search compared with sequential search enhanced by D-cache filtering, for a query Q and database objects Oi]

D-cache
- Every time the D-file computes a distance δ(*,*), it is stored in the D-cache.
- The D-cache can be viewed as a sparse matrix in which queries denote rows, database objects denote columns, and a cell contains the value of δ(Q,O).
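The sparse-matrix view can be sketched with nested dictionaries. This is an assumption-laden illustration, not the authors' implementation: the class name `DistanceMatrix`, its methods, and the identifiers `"q1"` and `"o7"` are hypothetical, and the sketch ignores the size limit of a real D-cache.

```python
from collections import defaultdict

class DistanceMatrix:
    """Sparse matrix of distances: rows are queries, columns are objects."""

    def __init__(self):
        # query_id -> {object_id: distance}; only computed cells are stored
        self._rows = defaultdict(dict)

    def store(self, query_id, object_id, distance):
        """Record a distance the moment it has been computed."""
        self._rows[query_id][object_id] = distance

    def get(self, query_id, object_id):
        """Return the cached distance, or None if the cell is empty."""
        return self._rows.get(query_id, {}).get(object_id)

    def row(self, query_id):
        """Return a copy of all cached distances for one query."""
        return dict(self._rows.get(query_id, {}))

m = DistanceMatrix()
m.store("q1", "o7", 2.5)
print(m.get("q1", "o7"))  # 2.5
print(m.get("q1", "o8"))  # None
```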

D-cache
The D-cache has two functionalities:
- it allows the exact distance δ(Q,O) to be retrieved, if it is there
- the main functionality: it provides a tight lower bound on δ(Q,O)
How to obtain a lower bound?
- Prior to a new query Q, determine some old queries DP_i^Q (acting as dynamic pivots) and compute the distances δ(Q, DP_i^Q).
- When a lower bound on δ(Q,O) is required, search the D-cache for available distances δ(DP_i^Q, O) and take max_i(δ(DP_i^Q, O) − δ(Q, DP_i^Q)); that is our tight lower-bound distance.
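The procedure above can be sketched as a range query over a sequential scan. This is a minimal sketch under stated assumptions, not the authors' code: the cache is a plain unbounded dictionary, the function names and parameters are hypothetical, the pivots are passed in explicitly rather than chosen by a policy, and the bound uses an absolute value, which generalizes the slide's δ(DP_i^Q, O) − δ(Q, DP_i^Q).

```python
def lower_bound(cache, pivot_dists, obj_id):
    """Best lower bound on dist(Q, O) from cached pivot-to-object distances.

    cache: dict mapping (pivot_id, obj_id) -> distance from earlier queries
    pivot_dists: dict mapping pivot_id -> dist(Q, pivot), computed up front
    """
    best = 0.0
    for pivot_id, d_qp in pivot_dists.items():
        d_po = cache.get((pivot_id, obj_id))
        if d_po is not None:
            best = max(best, abs(d_po - d_qp))
    return best

def dfile_range_query(db, q_id, q, radius, dist, cache, pivots):
    """Sequential scan that skips objects ruled out by the lower bound.

    db: dict obj_id -> object; pivots: dict pivot_id -> old query object.
    Returns (result ids, number of distance computations on db objects).
    """
    pivot_dists = {p_id: dist(q, p) for p_id, p in pivots.items()}
    result, computed = [], 0
    for obj_id, obj in db.items():
        if lower_bound(cache, pivot_dists, obj_id) > radius:
            continue  # filtered without computing dist(q, obj)
        d = dist(q, obj)
        computed += 1
        cache[(q_id, obj_id)] = d  # every computed distance is cached
        if d <= radius:
            result.append(obj_id)
    return result, computed

# Hypothetical 1D example: the second query reuses the first as a pivot.
dist = lambda a, b: abs(a - b)
db = {0: 0, 1: 1, 2: 10, 3: 11}
cache = {}
first = dfile_range_query(db, "q1", 0, 2, dist, cache, pivots={})
second = dfile_range_query(db, "q2", 1, 2, dist, cache, pivots={"q1": 0})
print(first)   # ([0, 1], 4)
print(second)  # ([0, 1], 2)
```

The second query returns the same answer with two distance computations on database objects instead of four. The one extra computation of the distance to the pivot is not included in that count.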

D-cache
How to choose the old queries (dynamic pivots)?
- "Recent" policy
  - simple: we just choose the k previous queries
  - motivation: the recently added distances are likely to still sit in the D-cache
- "Internal" policy
  - advanced: we select k of the previous queries that are probably close
  - we avoid computing any distance between new and old queries; we just estimate the distance using distances from the D-cache
  - motivation: a close query (pivot) produces tighter lower bounds

D-cache implementation
- Cell cache
  - a simple hash table used to determine individual cell values, based on (id1, id2)
  - used for Recent pivot selection
- Row cache
  - an inverted list (a list of objects belonging to old queries)
  - used to determine the mediators when using Internal pivot selection
- Replacement policies
  - because the size of the D-cache is limited, both the Cell cache and the Row cache apply an LRU (least recently used) policy for replacing distances
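A size-limited cell cache with LRU replacement can be sketched with Python's `collections.OrderedDict`. The class name `CellCache`, its methods, and the capacity value are assumptions for illustration; the slides do not say how the hash table or the replacement policy are actually implemented.

```python
from collections import OrderedDict

class CellCache:
    """Bounded hash table of distances keyed by (id1, id2), LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._cells = OrderedDict()  # oldest entry first, newest last

    def put(self, id1, id2, distance):
        key = (id1, id2)
        self._cells[key] = distance
        self._cells.move_to_end(key)  # mark as most recently used
        if len(self._cells) > self.capacity:
            self._cells.popitem(last=False)  # evict least recently used

    def get(self, id1, id2):
        key = (id1, id2)
        if key not in self._cells:
            return None
        self._cells.move_to_end(key)  # a hit refreshes the entry
        return self._cells[key]

c = CellCache(capacity=2)
c.put("q1", "o1", 1.0)
c.put("q1", "o2", 2.0)
c.get("q1", "o1")          # refresh (q1, o1)
c.put("q1", "o3", 3.0)     # evicts (q1, o2), the least recently used
print(c.get("q1", "o2"))   # None
print(c.get("q1", "o1"))   # 1.0
```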

Experiments: datasets
- A subset of Corel features: 65,[…] […]-dimensional vectors of color moments, with δ = L1 distance.
- A synthetic Polygons set: 500,000 randomly generated 2D polygons varying in the number of vertices from 10 to 15, with δ = Hausdorff distance (the maximum distance of a point set to the nearest point in the other set).
- A subset of GenBank file rel147, namely 50,000 protein sequences of lengths from 50 to 100, with δ = edit distance.
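The three distance functions named above can be sketched as follows. These are textbook formulations written for illustration, not the code used in the experiments. In particular, treating a polygon as the set of its vertices and using the Euclidean distance between points in the Hausdorff distance are assumptions, and the edit distance is the unit-cost Levenshtein variant.

```python
import math

def l1_distance(a, b):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(s, t):
        # farthest point of s from its nearest neighbor in t
        return max(min(math.dist(p, q) for q in t) for p in s)
    return max(directed(a, b), directed(b, a))

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, O(len(s) * len(t))."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(l1_distance((1, 2), (3, 5)))                            # 5
print(hausdorff_distance([(0, 0), (1, 0)], [(0, 0), (4, 0)]))  # 3.0
print(edit_distance("kitten", "sitting"))                     # 3
```

The quadratic cost of the edit distance is an instance of the O(m^2) distance computations mentioned earlier in the talk.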

Experiments
- D-file was compared with three metric access methods (M-tree, PM-tree, GNAT) and the trivial sequential scan.
- We observed the number of distance computations spent on:
  - indexing
  - querying
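One simple way to observe the number of distance computations is to wrap the distance function in a counter. This is a hypothetical measurement helper, not the authors' experimental setup; the names `CountingDistance` and `l1` and the toy data are assumptions.

```python
class CountingDistance:
    """Callable wrapper that counts how often a distance is computed."""

    def __init__(self, dist):
        self.dist = dist
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return self.dist(a, b)

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

counted = CountingDistance(l1)
db = [(0, 0), (1, 1), (5, 5)]
hits = [o for o in db if counted((0, 0), o) <= 2]  # sequential range scan
print(hits)           # [(0, 0), (1, 1)]
print(counted.calls)  # 3: one computation per database object
```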

Experiments
[Figure: construction costs]

Experiments
[Figures: unknown queries (query objects outside the dataset)]

Experiments
[Figures: database queries (used when browsing, etc.)]

Conclusion
- D-file: an index-free metric access method
  - requires no indexing
  - suitable for highly changeable databases, isolated searches, or when changing the similarity
- D-cache
  - a structure used by the D-file to cheaply determine lower-bound distances
  - uses distances computed and cached during the processing of previous queries

Future work
We plan to include the D-cache in index-based metric access methods as well, to improve the efficiency of:
- index construction
- simple queries (range and kNN)
- advanced operations (similarity joins)
- and so on...
Thank you for your attention!