
On Index-free Similarity Search in Metric Spaces
Tomáš Skopal (1), Benjamin Bustos (2)
(1) Charles University in Prague, Czech Republic; (2) University of Chile, Santiago, Chile
DEXA 2009, Linz, Austria

Outline
- Metric approach to similarity search
- Motivation for index-free similarity search
- D-file (+ D-cache)
- Experiments
- Conclusion

Similarity search
- Multimedia databases, time series, bioinformatics, ...
- Content-based similarity search (query by example)
- Range query: give me the very similar ones (over 80%)
- k nearest neighbors query: give me the 3 most similar
[Figure: example objects labeled 0.1, 0.15, 0.3, 0.6, 0.8]

Metric approach to similarity search
- The similarity δ (actually a distance) is computationally expensive: often O(m^2), sometimes even O(2^m), with respect to the size m of a compared object.
- Querying by a sequential scan over a database of n objects is thus expensive.
- The goal: minimize the number of distance computations δ for a query.
- The way: using metric distances (metric postulates) allows the data space to be partitioned; the search is then performed in just several partitions → efficient search.

Using lower-bound distances for filtering database objects
- A cheap determination of a tight lower bound on δ(*,*) provides a mechanism to quickly filter irrelevant objects from the search.
- This filtering is used in various forms by metric access methods, where X stands for a database object and P for a pivot object.
The task: check whether X is inside the query ball with center Q and radius r.
- We know δ(Q,P).
- We know δ(P,X).
- We do not know δ(Q,X).
- We do not have to compute δ(Q,X): its lower bound δ(Q,P) − δ(X,P) is larger than r, so X surely cannot be in the query ball and is ignored.
[Figure: query ball (Q, r), pivot P, database object X]
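The filtering rule on this slide can be sketched in a few lines. This is only an illustration, not the authors' code: the function names `euclidean` and `can_skip`, the 2D points, and the Euclidean stand-in metric are assumptions, and the bound is written with an absolute value, which follows from the triangle inequality and covers both orderings of δ(Q,P) and δ(P,X).

```python
import math

def euclidean(a, b):
    """Stand-in metric (assumption); any distance obeying the metric postulates works."""
    return math.dist(a, b)

def can_skip(d_qp, d_px, r):
    """Return True if the triangle inequality proves X is outside the query ball.

    |d(Q,P) - d(P,X)| is a lower bound on d(Q,X). If that bound already
    exceeds r, d(Q,X) never has to be computed.
    """
    return abs(d_qp - d_px) > r

# Toy data (hypothetical): query Q, pivot P, two database objects.
Q, P = (0.0, 0.0), (10.0, 0.0)
X_far, X_near = (9.0, 1.0), (1.0, 0.0)
r = 2.0

d_qp = euclidean(Q, P)                                # computed once per query
skip_far = can_skip(d_qp, euclidean(P, X_far), r)     # bound is about 8.59 > r
skip_near = can_skip(d_qp, euclidean(P, X_near), r)   # bound is 1.0 <= r
```

In a real access method the distances δ(P,X) would be read from storage rather than recomputed; they are computed inline here only to keep the sketch self-contained.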

Index-based metric access methods
- All metric access methods (MAMs) are index-based, i.e., preprocessing of the database is always needed.
- Index construction usually costs between O(kn) and O(n^2).
[Figure: M-tree, PM-tree, GNAT]

Motivation for index-free search
Indexing is not desirable (or even possible) if:
- we have a highly changeable database: more inserts/deletes/updates than searches, e.g., streaming databases, archives, logs, sensory databases
- we perform isolated searches: a database is created for a few queries and then discarded, e.g., in data mining tasks
- we switch between distances (changing similarity): the distance function is tuned at query time, e.g., weighting of object features is applied dynamically

D-file
- Just the original database searched by a sequential scan, BUT it uses the D-cache.
- D-cache:
  - a memory-resident structure that maintains the distances computed during previous queries
  - provides lower bounds of requested distances, which can be used to filter some of the database objects when querying
  - O(1) complexity for a lower-bound retrieval
- No preprocessing (indexing) of the database.

D-file – range query
[Figure: simple sequential search compared with sequential search enhanced by D-cache filtering; query Q, database objects Oi]
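The two variants compared on this slide could look roughly as follows. This is a minimal sketch under stated assumptions: `range_query`, its parameters, and the 1-D toy data are hypothetical, and the `lower_bound` callable stands in for whatever the D-cache would return.

```python
def range_query(query, database, radius, distance, lower_bound=None):
    """Sequential scan over `database` (dict: object id -> object).

    If `lower_bound` is given, objects whose lower bound exceeds `radius`
    are skipped without computing the real distance.
    Returns (matching ids, number of distance computations).
    """
    results, computed = [], 0
    for obj_id, obj in database.items():
        if lower_bound is not None and lower_bound(obj_id) > radius:
            continue  # filtered: the distance is never computed
        computed += 1
        if distance(query, obj) <= radius:
            results.append(obj_id)
    return results, computed

# Toy 1-D "objects" with the absolute difference as the metric (assumption).
db = {i: float(i) for i in range(10)}
dist = lambda a, b: abs(a - b)

# Simple sequential search: every distance is computed.
plain, n_plain = range_query(4.2, db, 1.0, dist)

# Enhanced search: bounds as if derived from one cached pivot at 0.0
# with a known d(Q, P) = 4.2 (hypothetical values).
lb = lambda obj_id: abs(4.2 - dist(0.0, db[obj_id]))
fast, n_fast = range_query(4.2, db, 1.0, dist, lower_bound=lb)
```

Both calls return the same result set; only the number of distance computations differs, which is the cost measure used throughout the slides.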

D-cache
- Every time the D-file computes a distance δ(*,*), it is stored in the D-cache.
- The D-cache can be viewed as a sparse matrix in which queries denote rows, database objects denote columns, and a cell contains the value of δ(Q,O).

D-cache
The D-cache has two functionalities:
- it allows retrieval of the exact distance δ(Q,O), if it is there
- the main functionality: it provides a tight lower bound to δ(Q,O)
How to obtain a lower bound?
- Prior to a new query Q, determine some old queries DP_i^Q (acting as dynamic pivots) and compute the distances δ(Q, DP_i^Q).
- When a lower bound to δ(Q,O) is required, search the D-cache for available distances δ(DP_i^Q, O) and take max_i ( δ(DP_i^Q, O) − δ(Q, DP_i^Q) ); that is our tight lower-bound distance.
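The lower-bound retrieval described above can be sketched as follows. The class `DCache`, its method names, and the string ids are assumptions made for illustration; the cache is modeled as a plain dictionary with no size limit, and the bound falls back to 0.0 when no cached cell is available.

```python
class DCache:
    """Sparse query-by-object matrix of distances, kept as a dict (sketch)."""

    def __init__(self):
        self.cells = {}  # (query_id, object_id) -> distance

    def store(self, query_id, object_id, dist):
        self.cells[(query_id, object_id)] = dist

    def exact(self, query_id, object_id):
        """Return the cached distance, or None if the cell is empty."""
        return self.cells.get((query_id, object_id))

    def lower_bound(self, pivot_dists, object_id):
        """Lower bound on d(Q, O) from dynamic pivots.

        `pivot_dists` maps an old query id DP_i to d(Q, DP_i).
        Returns max_i (d(DP_i, O) - d(Q, DP_i)) over cached cells,
        or 0.0 if no pivot has a cached distance to the object.
        """
        best = 0.0
        for pivot_id, d_q_pivot in pivot_dists.items():
            d_pivot_obj = self.cells.get((pivot_id, object_id))
            if d_pivot_obj is not None:
                best = max(best, d_pivot_obj - d_q_pivot)
        return best

# Hypothetical distances computed during two earlier queries q1 and q2.
cache = DCache()
cache.store("q1", "o7", 9.0)
cache.store("q2", "o7", 4.0)

# Distances from the new query Q to its dynamic pivots.
pivot_dists = {"q1": 2.0, "q2": 3.5}
bound = cache.lower_bound(pivot_dists, "o7")  # max(9.0 - 2.0, 4.0 - 3.5)
```

With k pivots, one lookup costs k dictionary accesses, which is constant for a fixed k.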

D-cache
How to choose the old queries (dynamic pivots)?
- “Recent” policy
  - simple: we just choose the k previous queries
  - motivation: the recently added distances are likely to still sit in the D-cache
- “Internal” policy
  - advanced: we select k of the previous queries which are probably close
  - we avoid computing any distance between the new and old queries; we just estimate the distance using distances from the D-cache
  - motivation: a close query (pivot) produces tighter lower bounds

D-cache implementation
- Cell cache
  - a simple hash table used to determine individual cell values, based on (id1, id2)
  - used for Recent pivot selection
- Row cache
  - an inverted list (list of objects belonging to old queries)
  - used to determine the mediators when using Internal pivot selection
- Replacement policies
  - because the size of the D-cache is limited, both the Cell cache and the Row cache apply the LRU distance replacement policy
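A size-limited cell cache with LRU replacement might be sketched like this. It is not the authors' implementation: the class name `LRUCellCache`, its methods, and the use of Python's `collections.OrderedDict` as the underlying hash table are assumptions, and the Row cache is left out.

```python
from collections import OrderedDict

class LRUCellCache:
    """Hash table keyed by (id1, id2) with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cells = OrderedDict()  # oldest entry first, newest last

    def get(self, id1, id2):
        key = (id1, id2)
        if key not in self.cells:
            return None
        self.cells.move_to_end(key)  # mark as most recently used
        return self.cells[key]

    def put(self, id1, id2, dist):
        key = (id1, id2)
        self.cells[key] = dist
        self.cells.move_to_end(key)
        if len(self.cells) > self.capacity:
            self.cells.popitem(last=False)  # evict the least recently used cell

cache = LRUCellCache(capacity=2)
cache.put("q1", "o1", 0.3)
cache.put("q1", "o2", 0.7)
cache.get("q1", "o1")        # touch: (q1, o1) becomes the most recent cell
cache.put("q2", "o1", 0.5)   # capacity exceeded: (q1, o2) is evicted
```

Evicting a cell only loosens later lower bounds; it never makes them incorrect, because a missing cell simply contributes nothing to the maximum.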

Experiments: datasets
- A subset of Corel features: 65,615 32-dimensional vectors of color moments, with δ = L1 distance.
- A synthetic Polygons set: 500,000 randomly generated 2D polygons varying in the number of vertices from 10 to 15, with δ = Hausdorff distance (maximum distance of a point set to the nearest point in the other set).
- A subset of GenBank file rel147, namely 50,000 protein sequences of lengths from 50 to 100, with δ = edit distance.
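For reference, the three distances named above can be written compactly. These are textbook formulations, not the code used in the experiments: the function names are assumptions, the Hausdorff distance is taken in its symmetric form over vertex sets with Euclidean point distance, and the edit distance uses unit costs for insertion, deletion, and substitution.

```python
import math

def l1(a, b):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two point sets."""
    def directed(S, T):
        # Largest distance from a point of S to its nearest point in T.
        return max(min(math.dist(s, t) for t in T) for s in S)
    return max(directed(A, B), directed(B, A))

def edit_distance(s, t):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution or match
        prev = cur
    return prev[-1]

d_vec = l1((1, 2), (4, 0))
d_poly = hausdorff([(0, 0), (1, 0)], [(0, 0), (4, 0)])
d_seq = edit_distance("kitten", "sitting")
```

The edit distance costs O(|s|·|t|) time, which matches the O(m^2) cost class the slides give as the reason for minimizing distance computations.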

Experiments
- The D-file was compared with 3 metric access methods and the trivial sequential scan: M-tree, PM-tree, GNAT, seq. scan.
- We observed the number of distance computations spent on indexing and on querying.

Experiments: construction costs

Experiments: unknown queries (query objects outside the dataset)

Experiments: database queries (used when browsing, etc.)

Conclusion
- D-file: an index-free metric access method
  - requires no indexing
  - suitable for highly changeable databases, isolated searches, or when changing the similarity
- D-cache: a structure used by the D-file to cheaply determine lower-bound distances
  - uses distances computed and cached during the processing of previous queries

Future work
We plan to include the D-cache also in index-based metric access methods to improve the efficiency of:
- index construction
- simple queries (range and kNN)
- advanced operations (similarity joins)
- and so on...
Thank you for your attention!
