# Clustered Pivot Tables for I/O-optimized Similarity Search


Clustered Pivot Tables for I/O-optimized Similarity Search

Juraj Moško, Jakub Lokoč, Tomáš Skopal

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague

SISAP 2011, Lipari

Presentation outline
- Similarity search in metric spaces
- Pivot tables
- Clustered pivot tables
  - Static variant
  - Dynamic variant
- Experiments

Similarity search
- Suitable for unstructured data; the query object is often not in the database
- Similarity is often modeled by a metric distance
- Expensive distance functions: EMD, SQFD, DTW, …
- Metric indexing
  - Based on lower-bounding
  - If |d(p, q) - d(p, o)| > r for a pivot p, query object q, and query radius r, filter out object o
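The filtering rule follows from the triangle inequality: d(q, o) >= |d(p, q) - d(p, o)|, so when the right-hand side exceeds r, object o cannot lie within the query radius. A minimal sketch of the test, assuming Euclidean distance as a stand-in for an expensive metric; the function name `can_filter` and the sample points are illustrative only.

```python
import math

def can_filter(d_pq: float, d_po: float, r: float) -> bool:
    """Return True when o is provably outside the query ball (q, r).

    d_pq and d_po are the distances from pivot p to the query q and
    to the object o. By the triangle inequality,
    d(q, o) >= |d(p, q) - d(p, o)|, so a lower bound above r is
    enough to discard o without computing d(q, o).
    """
    return abs(d_pq - d_po) > r

# Illustrative points on a line; math.dist is the Euclidean distance.
p, q, o = (0.0, 0.0), (3.0, 0.0), (10.0, 0.0)
far_is_filtered = can_filter(math.dist(p, q), math.dist(p, o), r=2.0)
near_is_filtered = can_filter(math.dist(p, q), math.dist(p, (4.0, 0.0)), r=2.0)
```

Only the precomputed d(p, o) and one distance d(p, q) per pivot are needed, which is why the test is cheap compared with evaluating d(q, o) for every object.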

Pivot tables
- Simple yet efficient main-memory metric index
- Given k static pivots P_i and a database S of n objects O_j, the pivot table stores all distances d(P_i, O_j) in a matrix of size k × n
- Pivot table = two structures: distance matrix + data file
- Cheap filtering of non-relevant objects (lower-bounding)
- Non-filtered objects are refined by the original expensive distance function
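The structure described on this slide can be sketched in a few lines. This is an illustrative in-memory version under assumptions, not the authors' implementation: Euclidean distance stands in for the expensive metric, and the names `build_pivot_table` and `range_query` are made up for the example.

```python
import math

def build_pivot_table(pivots, objects, dist=math.dist):
    """Return the k x n matrix of distances d(P_i, O_j)."""
    return [[dist(p, o) for o in objects] for p in pivots]

def range_query(q, r, pivots, objects, table, dist=math.dist):
    """Return (indices of objects within r of q, number of refinements)."""
    q_to_pivots = [dist(q, p) for p in pivots]
    result, refined = [], 0
    for j, o in enumerate(objects):
        # Best lower bound on d(q, o) over all pivots.
        lower = max(abs(q_to_pivots[i] - table[i][j])
                    for i in range(len(pivots)))
        if lower > r:
            continue              # filtered without touching the object
        refined += 1              # the expensive distance is computed here
        if dist(q, o) <= r:
            result.append(j)
    return result, refined

pivots = [(0.0, 0.0), (10.0, 0.0)]
objects = [(1.0, 0.0), (2.0, 0.0), (9.0, 0.0), (5.0, 5.0)]
table = build_pivot_table(pivots, objects)
hits, refined = range_query((1.5, 0.0), 1.0, pivots, objects, table)
```

In this toy run two of the four objects are discarded by the matrix alone, so only two real distance computations are needed.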

Clustered pivot tables
- What if the pivot table does not fit into main memory?
- Solution 1: just slice the data file
  - (+) simple to construct
  - (-) sequential scan => high I/O cost
- Solution 2: reorganize and slice the data file
  - (+) similar objects in one page (page = cluster) => higher probability that all objects of a page are filtered => lower I/O cost
  - (-) metric clustering is expensive

Metric clustering? M-tree!
- Dynamic, persistent, balanced structure
- A leaf node represents a cluster of similar objects
- Many construction strategies that consider the quality of the M-tree hierarchy, with complexity < O(n^2)
  - Single/Multi/Hybrid-way leaf selection
  - Slim-down algorithm
  - Reinsertions

Static CPT
- Data file = objects serialized from M-tree leaves
- Classic pivot table with reorganized input
- Fixed page size in a paged data file
- Preserve the M-tree?
  - Future re-indexing
  - Query processing
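A rough sketch of the static construction, under stated assumptions: the M-tree itself is not built here, `leaves` is simply a list of already clustered object groups standing in for its leaves, and `build_static_cpt` is a hypothetical name. Objects are serialized leaf by leaf, cut into fixed-size pages, and the distance matrix is stored in the same order.

```python
import math

def build_static_cpt(leaves, pivots, page_size, dist=math.dist):
    """Serialize leaf contents into fixed-size pages plus a distance matrix.

    Row j of the matrix holds the pivot distances of the j-th
    serialized object, which lives on page j // page_size.
    """
    order = [obj for leaf in leaves for obj in leaf]
    pages = [order[i:i + page_size] for i in range(0, len(order), page_size)]
    matrix = [[dist(p, obj) for p in pivots] for obj in order]
    return pages, matrix

# Stand-in for M-tree leaves: three clusters of unequal size.
leaves = [[(0.0, 0.0), (1.0, 0.0)], [(5.0, 5.0)], [(9.0, 9.0), (9.0, 8.0), (8.0, 9.0)]]
pages, matrix = build_static_cpt(leaves, pivots=[(0.0, 0.0)], page_size=2)
```

Because the page size is fixed, a page may span the boundary between two leaves, as the second page does in this example; that is the price of full page utilization in this sketch.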

Dynamic CPT
- Data file = set of M-tree leaves
- Distance matrix connected to the M-tree leaves
- Internal fragmentation
  - M-tree leaves contain different numbers of data objects; utilization is not 100%
- Dynamic operations do not degenerate the created clusters

CPT - Querying
- Filtering based on lower-bounding
- If all data objects from one page are filtered out, the page is not loaded from the data file into memory => I/O optimization
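The page-skipping idea can be illustrated with a small simulation. The sketch assumes the distance matrix sits in memory while the pages would live on disk; reading a page is simulated by incrementing a counter. Function names and data are illustrative, and Euclidean distance again stands in for the real metric.

```python
import math

def build_matrix(pivots, pages, dist=math.dist):
    """Per page, per object: list of distances to every pivot."""
    return [[[dist(p, o) for p in pivots] for o in page] for page in pages]

def range_query_paged(q, r, pivots, pages, matrix, dist=math.dist):
    """Return (objects within r of q, number of pages that had to be read)."""
    qd = [dist(q, p) for p in pivots]
    result, loaded = [], 0
    for page, rows in zip(pages, matrix):
        survivors = [i for i, row in enumerate(rows)
                     if max(abs(a - b) for a, b in zip(qd, row)) <= r]
        if not survivors:
            continue              # every object filtered: page is never read
        loaded += 1               # simulated page read
        result.extend(page[i] for i in survivors if dist(q, page[i]) <= r)
    return result, loaded

pivots = [(0.0, 0.0)]
sliced = [[(1.0, 0.0), (9.0, 0.0)], [(1.2, 0.0), (9.5, 0.0)]]      # arbitrary order
clustered = [[(1.0, 0.0), (1.2, 0.0)], [(9.0, 0.0), (9.5, 0.0)]]   # page = cluster
q, r = (1.1, 0.0), 0.5
res_sliced, io_sliced = range_query_paged(q, r, pivots, sliced, build_matrix(pivots, sliced))
res_clustered, io_clustered = range_query_paged(q, r, pivots, clustered, build_matrix(pivots, clustered))
```

Both layouts return the same two objects, but the arbitrary layout reads two pages while the clustered layout reads one, which is the effect the clustering is meant to produce.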

CPT - Querying problems
- Problem 1: the LAESA kNN algorithm sorts DB objects according to their lower bound to the query object, which is not optimal for I/O cost
- Solution: CPT does not sort objects => objects are processed sequentially
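A sequential kNN pass might look like the following sketch. It is an assumption-laden illustration rather than the authors' algorithm: pages are visited in storage order, the dynamic radius is the distance to the current k-th neighbour, and a page is read only if at least one of its lower bounds does not exceed that radius. The name `knn_sequential` is hypothetical.

```python
import heapq
import math

def knn_sequential(q, k, pivots, pages, matrix, dist=math.dist):
    """Return (sorted list of (distance, object), number of pages read)."""
    qd = [dist(q, p) for p in pivots]
    heap = []                     # max-heap of neighbours via negated distance
    radius = math.inf             # dynamic radius, finite once k objects are seen
    loaded = 0
    for page, rows in zip(pages, matrix):
        bounds = [max(abs(a - b) for a, b in zip(qd, row)) for row in rows]
        if all(lb > radius for lb in bounds):
            continue              # whole page filtered, no read
        loaded += 1               # simulated page read
        for obj, lb in zip(page, bounds):
            if lb > radius:
                continue          # radius may have shrunk within this page
            d = dist(q, obj)
            if len(heap) < k:
                heapq.heappush(heap, (-d, obj))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, obj))
            if len(heap) == k:
                radius = -heap[0][0]
    return sorted((-nd, obj) for nd, obj in heap), loaded

pivots = [(0.0, 0.0)]
pages = [[(1.0, 0.0), (1.2, 0.0)], [(9.0, 0.0), (9.5, 0.0)]]
matrix = [[[math.dist(p, o) for p in pivots] for o in page] for page in pages]
neighbours, io_cost = knn_sequential((1.1, 0.0), 2, pivots, pages, matrix)
```

Since no sorting takes place, the I/O cost depends on where the near objects sit in the file: with the two pages swapped, the same query reads both pages, because the radius stays large until the near cluster is reached.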

CPT - Querying problems
- Problem 2: in CPT the dynamic radius decreases more slowly during kNN processing
- Solution: the first bunch of objects is not clustered
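One way to picture the solution is a file layout whose head mixes objects from all clusters, so that a sequential kNN scan sees a representative sample early and the radius tightens before the clustered pages are reached. The slide does not say how the first bunch is chosen; the round-robin rule below, the name `layout_with_unclustered_head`, and the integer stand-ins for objects are all assumptions made for illustration.

```python
def layout_with_unclustered_head(clusters, head_size, page_size):
    """Return pages whose first head_size objects are drawn across clusters.

    The head is filled round-robin, one object per cluster in turn
    (an assumed selection rule); the remaining objects follow
    cluster by cluster.
    """
    remaining = [list(c) for c in clusters]
    head = []
    while len(head) < head_size and any(remaining):
        for cluster in remaining:
            if cluster and len(head) < head_size:
                head.append(cluster.pop(0))
    body = [obj for cluster in remaining for obj in cluster]
    order = head + body
    return [order[i:i + page_size] for i in range(0, len(order), page_size)]

# Integers stand in for data objects; each inner list is one cluster.
clusters = [[1, 2, 3], [10, 11, 12], [20, 21, 22]]
pages = layout_with_unclustered_head(clusters, head_size=3, page_size=3)
```

Here the first page holds one object from every cluster, and the later pages keep cluster members adjacent.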


Experiments (1)
- 2 real datasets: subset of CoPhIR, subset of Corel
- 2 synthetic datasets: Cloud, PolygonSet
- We considered several M-tree variants
  - Single/Multi-way leaf selection
  - Reinsertions
- Measured I/O cost: CPT vs. PT vs. M-tree

Experiments (2)

Experiments (3)

Conclusion
- We have designed an I/O-optimized method for persistent pivot tables
- Future work
  - Thorough experiments on SSD disks
  - Use other metric clustering techniques

Thank you
