Clustered Pivot Tables for I/O-optimized Similarity Search Juraj Moško, Jakub Lokoč, Tomáš Skopal Department of Software Engineering Faculty of Mathematics.

1 Clustered Pivot Tables for I/O-optimized Similarity Search
Juraj Moško, Jakub Lokoč, Tomáš Skopal
Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague
SISAP 2011, Lipari

2 Presentation outline
- Similarity search in metric spaces
- Pivot tables
- Clustered pivot tables
  - Static variant
  - Dynamic variant
- Experiments

3 Similarity search
- Suitable for unstructured data; the query object is often not in the database
- Similarity is often modeled by a metric distance
- Expensive distance functions: EMD, SQFD, DTW, …
- Metric indexing is based on lower-bounding: if |d(p, q) - d(p, o)| > r, filter out object o (p is a pivot, q the query object, r the query radius)
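
The filtering rule on this slide follows from the triangle inequality. A minimal sketch of it, assuming a Euclidean distance as a cheap stand-in for the expensive metrics named above; the function names are hypothetical and not taken from the talk:

```python
import math

def euclidean(a, b):
    """Stand-in metric; EMD, SQFD or DTW would be far costlier to evaluate."""
    return math.dist(a, b)

def can_filter(d_pq, d_po, r):
    """True if object o can be discarded without computing d(q, o).

    The triangle inequality gives |d(p, q) - d(p, o)| <= d(q, o), so a lower
    bound above the query radius r proves that o lies outside the query ball.
    """
    return abs(d_pq - d_po) > r

p, q, o = (0.0, 0.0), (1.0, 0.0), (5.0, 0.0)
d_pq = euclidean(p, q)  # computed once per query
d_po = euclidean(p, o)  # precomputed when the index is built
filtered = can_filter(d_pq, d_po, r=2.0)
```

Here the lower bound is |1 - 5| = 4, which exceeds r = 2, so o is discarded without ever evaluating d(q, o).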

4 Pivot tables
- Simple yet efficient main-memory metric index
- Given k static pivots P_i and a database S of n objects O_j, the pivot table stores all distances d(P_i, O_j) in a k x n matrix
- Pivot tables consist of two structures: the distance matrix and the data file
- Cheap filtering of non-relevant objects (lower-bounding)
- Non-filtered objects are refined by the original expensive distance function
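
The filter-and-refine scheme described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the data set is a small in-memory list, `math.dist` stands in for an expensive metric, and the function names are hypothetical.

```python
import math

def build_pivot_table(pivots, objects, dist):
    """Return a k x n matrix with table[i][j] = dist(pivots[i], objects[j])."""
    return [[dist(p, o) for o in objects] for p in pivots]

def range_query(q, r, pivots, objects, table, dist):
    """Return (indices of objects within r of q, number of refinements)."""
    q_to_pivots = [dist(q, p) for p in pivots]
    result, refinements = [], 0
    for j, o in enumerate(objects):
        # Tightest lower bound on dist(q, o) over all pivots.
        lower = max(abs(qp - row[j]) for qp, row in zip(q_to_pivots, table))
        if lower > r:
            continue  # filtered out cheaply, no distance computation
        refinements += 1
        if dist(q, o) <= r:  # refinement with the original distance
            result.append(j)
    return result, refinements

objects = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
pivots = [(0.0, 0.0), (10.0, 10.0)]
table = build_pivot_table(pivots, objects, math.dist)
hits, refinements = range_query((1.0, 0.0), 1.5, pivots, objects, table, math.dist)
```

In this toy run the two distant objects are filtered by their lower bounds, so only two of the four objects reach the refinement step.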

5 Clustered pivot tables
What if the pivot table does not fit into main memory?
Solution 1: just slice the data file
  + simple to construct
  - sequential scan => high I/O cost
Solution 2: reorganize and slice the data file
  + similar objects in one page (page = cluster) => higher probability that all objects are filtered => lower I/O cost
  - metric clustering is expensive

6 Metric clustering? M-tree!
- Dynamic, persistent, balanced structure
- A leaf node represents a cluster of similar objects
- Many construction strategies considering the quality of the M-tree hierarchy, with complexity < O(n^2)
  - Single/Multi/Hybrid-way leaf selection
  - Slim-down algorithm
  - Reinsertions

7 Static CPT
- Data file = objects serialized from M-tree leaves
- Classic pivot table with reorganized input
- Fixed page size in a paged data file
- Preserve the M-tree?
  - Future re-indexing
  - Query processing
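
One way to picture the static layout is sketched below. It assumes the clusters are already available as lists of object ids (standing in for M-tree leaves, which are not built here) and that pages are cut at a fixed object count; both the function names and the exact slicing policy are assumptions, since the slide does not spell them out.

```python
def serialize_clusters(clusters, page_capacity):
    """Lay out clustered object ids in fixed-size pages.

    Returns (order, pages): `order` is the object sequence of the reorganized
    data file, `pages` is a list of half-open (start, end) ranges into it.
    """
    order = [obj_id for cluster in clusters for obj_id in cluster]
    pages = [
        (start, min(start + page_capacity, len(order)))
        for start in range(0, len(order), page_capacity)
    ]
    return order, pages

def reorder_columns(table, order):
    """Permute pivot-table columns so column j describes the j-th stored object."""
    return [[row[j] for j in order] for row in table]

clusters = [[3, 0], [4, 1, 2]]  # hypothetical leaf contents
order, pages = serialize_clusters(clusters, page_capacity=2)
reordered = reorder_columns([[10, 11, 12, 13, 14]], order)
```

With a fixed capacity, a cluster larger than one page spills into the next page, as the second cluster does here; this is a property of the sketch's slicing policy.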

8 Dynamic CPT
- Data file = set of M-tree leaves
- The distance matrix is connected to the M-tree leaves
- Internal fragmentation: M-tree leaves contain different numbers of data objects, so utilization is not 100%
- Dynamic operations do not degenerate the created clusters

9 CPT - Querying
- Filtering based on lower-bounding
- If all data objects from one page are filtered out, the page is not loaded from the data file into memory => I/O optimization
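
The page-skipping idea can be sketched as a range query that touches the data file only when at least one object on a page survives filtering. The `load_page` callback, the page ranges and the in-memory "data file" are hypothetical stand-ins for real disk I/O.

```python
import math

def cpt_range_query(q, r, pivots, table, pages, load_page, dist):
    """Range query that loads a page only if some object on it survives.

    table[i][j] is the precomputed distance from pivot i to the j-th object in
    data-file order; pages holds half-open (start, end) ranges of that order.
    """
    q_to_pivots = [dist(q, p) for p in pivots]
    result, pages_loaded = [], 0
    for page_no, (start, end) in enumerate(pages):
        survivors = [
            j for j in range(start, end)
            if max(abs(qp - row[j]) for qp, row in zip(q_to_pivots, table)) <= r
        ]
        if not survivors:
            continue  # whole page filtered out: no I/O for this page
        page = load_page(page_no)
        pages_loaded += 1
        for j in survivors:
            if dist(q, page[j - start]) <= r:
                result.append(j)
    return result, pages_loaded

data = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]  # toy data file
pages = [(0, 2), (2, 4)]
pivots = [(0.0, 0.0), (10.0, 10.0)]
table = [[math.dist(p, o) for o in data] for p in pivots]
load_page = lambda n: data[pages[n][0]:pages[n][1]]
result, pages_loaded = cpt_range_query((1.0, 0.0), 1.5, pivots, table, pages, load_page, math.dist)
```

Because both objects on the second page are filtered by their lower bounds, only the first page is loaded.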

10 CPT - Querying problems
- Problem 1: the LAESA kNN algorithm sorts DB objects according to their lower bound to the query object, which is not optimal for I/O cost
- Solution: CPT does not sort objects => objects are processed sequentially
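
A sequential kNN scan with a shrinking dynamic radius might look like the sketch below. It is kept in memory for brevity, so page loading is omitted, and it is an illustration of the general idea rather than the authors' algorithm; all names are hypothetical.

```python
import heapq
import math

def knn_sequential(q, k, pivots, objects, table, dist):
    """Scan objects in storage order; return (sorted (distance, index) pairs, distances computed)."""
    q_to_pivots = [dist(q, p) for p in pivots]
    heap = []          # max-heap of the k best so far, stored as (-distance, index)
    radius = math.inf  # dynamic radius: distance to the current k-th neighbor
    computed = 0
    for j, o in enumerate(objects):
        lower = max(abs(qp - row[j]) for qp, row in zip(q_to_pivots, table))
        if lower > radius:
            continue  # cannot enter the current top k
        computed += 1
        d = dist(q, o)
        if len(heap) < k:
            heapq.heappush(heap, (-d, j))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, j))
        if len(heap) == k:
            radius = -heap[0][0]
    return sorted((-neg_d, j) for neg_d, j in heap), computed

objects = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
pivots = [(0.0, 0.0), (10.0, 10.0)]
table = [[math.dist(p, o) for o in objects] for p in pivots]
neighbors, computed = knn_sequential((1.0, 0.0), 2, pivots, objects, table, math.dist)
```

No sorting by lower bound takes place: objects are visited in the order in which they are stored, which is what keeps page accesses sequential.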

11 CPT - Querying problems
- Problem 2: in CPT the dynamic radius decreases more slowly during kNN processing
- Solution: the first bunch of objects is not clustered
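
The slide does not say how the unclustered first bunch is chosen. The sketch below assumes a uniform random sample placed at the front of the data file, with the remaining objects kept in cluster order; the function name, the sampling policy and the seed parameter are all assumptions made for illustration.

```python
import random

def layout_with_unclustered_prefix(clusters, prefix_size, seed=0):
    """Return object ids in data-file order: a random sample first, clusters after."""
    rng = random.Random(seed)
    all_ids = [obj_id for cluster in clusters for obj_id in cluster]
    prefix = rng.sample(all_ids, prefix_size)  # assumed policy: uniform sample
    chosen = set(prefix)
    rest = [obj_id for obj_id in all_ids if obj_id not in chosen]
    return prefix + rest

clusters = [[0, 1, 2], [3, 4, 5]]  # hypothetical cluster contents
order = layout_with_unclustered_prefix(clusters, prefix_size=2)
```

The intuition is that a prefix drawn from across the whole data set is more likely to contain objects reasonably close to any query, so a sequential kNN scan can tighten its radius before it reaches the clustered pages.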

13 Experiments (1)
- 2 real datasets: a subset of CoPhIR and a subset of Corel
- 2 synthetic datasets: Cloud and PolygonSet
- We considered several M-tree variants
  - Single/Multi-way leaf selection
  - Reinsertions
- Measured I/O cost: CPT vs. PT vs. M-tree

14 Experiments (2)

15 Experiments (3)

16 Conclusion
- We have designed an I/O-optimized method for persistent pivot tables
- Future work
  - Thorough experiments on SSDs
  - Use of other metric clustering techniques

17 Thank you

