Published by Sonny Pedley. Modified over 2 years ago.

1
Clustered Pivot Tables for I/O-optimized Similarity Search
Juraj Moško, Jakub Lokoč, Tomáš Skopal
Department of Software Engineering
Faculty of Mathematics and Physics
Charles University in Prague
SISAP 2011, Lipari

2
Presentation outline
- Similarity search in metric spaces
- Pivot tables
- Clustered pivot tables
  - Static variant
  - Dynamic variant
- Experiments

3
Similarity search
- Suitable for unstructured data, query often not in DB
- Similarity is often modeled by a metric distance
  - Expensive distance functions: EMD, SQFD, DTW, …
- Metric indexing
  - Based on lower-bounding
  - If abs(d(p, q) - d(p, o)) > r, filter out object o
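The lower-bounding rule on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function name `can_filter` and the 1-D toy metric are assumptions chosen to keep the example small.

```python
def can_filter(d_pq: float, d_po: float, r: float) -> bool:
    """Return True if object o can be discarded without computing d(q, o).

    By the triangle inequality, abs(d(p, q) - d(p, o)) <= d(q, o), so when
    the left-hand side already exceeds r, o cannot lie within radius r of q.
    """
    return abs(d_pq - d_po) > r


def d(x: float, y: float) -> float:
    # Toy metric (assumption): absolute difference on the real line.
    return abs(x - y)


p, q, r = 0.0, 10.0, 2.0            # pivot, query object, query radius
objects = [1.0, 9.0, 11.5, 20.0]

# Only d(p, o) is needed here; in an index it would be precomputed.
candidates = [o for o in objects if not can_filter(d(p, q), d(p, o), r)]
```

Objects that survive the filter are only candidates; they still have to be checked with the real distance function.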

4
Pivot tables
- Simple yet efficient main-memory metric index
- Given k static pivots P_i and a database S of n objects O_j, the pivot table stores all the distances d(P_i, O_j) in a matrix of size k × n
- Pivot table = two structures: distance matrix + data file
- Cheap filtering of non-relevant objects (lower-bounding)
- Non-filtered objects are refined by the original expensive distance function
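As a rough sketch of the structure described above, the following builds the k × n distance matrix and answers a range query by filtering first and refining the survivors. The names `build_pivot_table` and `range_query`, the L1 metric, and the sample points are illustrative assumptions, not taken from the presentation.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Dist = Callable[[Point, Point], float]


def l1(a: Point, b: Point) -> float:
    # Stand-in for an expensive metric (assumption): Manhattan distance.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def build_pivot_table(pivots: Sequence[Point], data: Sequence[Point],
                      dist: Dist) -> List[List[float]]:
    # table[i][j] = d(P_i, O_j): k rows (pivots), n columns (objects).
    return [[dist(p, o) for o in data] for p in pivots]


def range_query(q: Point, r: float, pivots: Sequence[Point],
                data: Sequence[Point], table: List[List[float]],
                dist: Dist) -> Tuple[List[Point], int]:
    q_dists = [dist(p, q) for p in pivots]
    result, refined = [], 0
    for j, o in enumerate(data):
        # Best lower bound on d(q, o) over all pivots.
        lb = max(abs(q_dists[i] - table[i][j]) for i in range(len(pivots)))
        if lb > r:
            continue                 # filtered without touching the object
        refined += 1                 # survivor: pay for the real distance
        if dist(q, o) <= r:
            result.append(o)
    return result, refined


pivots = [(0.0, 0.0), (10.0, 0.0)]
data = [(1.0, 1.0), (2.0, 2.0), (8.0, 1.0), (9.0, 9.0), (5.0, 5.0)]
table = build_pivot_table(pivots, data, l1)
hits, refined = range_query((1.0, 2.0), 2.0, pivots, data, table, l1)
```

In this toy run only two of the five objects reach the refinement step; the other three are discarded from the matrix alone.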

5
Clustered pivot tables
- What if the pivot table does not fit into main memory?
- Solution 1: just slice the data file
  - (+) simple to construct
  - (-) sequential scan => high I/O cost
- Solution 2: reorganize and slice the data file
  - (+) similar objects in one page (page = cluster) => higher probability that all objects are filtered => lower I/O cost
  - (-) metric clustering is expensive

6
Metric clustering? M-tree!
- Dynamic, persistent, balanced structure
- Leaf node represents a cluster of similar objects
- Many construction strategies considering quality of the M-tree hierarchy, with complexity < O(n^2)
  - Single/Multi/Hybrid-way leaf selection
  - Slim-down algorithm
  - Reinsertions

7
Static CPT
- Data file = objects serialized from M-tree leaves
- Classic pivot table reorganizing input
- Fixed page size in a paged data file
- Preserve M-tree?
  - Future re-indexing
  - Query processing

8
Dynamic CPT
- Data file = set of M-tree leaves
- Distance matrix connected to the M-tree leaves
- Internal fragmentation
  - M-tree leaves contain different numbers of data objects; utilization is not 100%
- Dynamic operations do not degenerate created clusters

9
CPT - Querying
- Filtering based on lower-bounding
- If all data objects from one page are filtered out, the page is not loaded from the data file into memory => I/O optimization
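The page-skipping idea can be sketched as follows. This is a simplified model under stated assumptions: the distance matrix is kept in memory and grouped per page, `load_page` is a hypothetical callback standing in for a disk read, and the data are 1-D numbers with an absolute-difference metric.

```python
from typing import Callable, List, Sequence, Tuple


def dist(a: float, b: float) -> float:
    # Toy metric (assumption): absolute difference on the real line.
    return abs(a - b)


def cpt_range_query(q: float, r: float, pivots: Sequence[float],
                    matrix: List[List[List[float]]],
                    load_page: Callable[[int], List[float]]
                    ) -> Tuple[List[float], int]:
    """matrix[page][j][i] = d(pivot i, j-th object stored on that page)."""
    q_dists = [dist(p, q) for p in pivots]
    result: List[float] = []
    pages_loaded = 0
    for page_no, rows in enumerate(matrix):
        # Filter using only the in-memory distances.
        survivors = [j for j, row in enumerate(rows)
                     if max(abs(qd - od) for qd, od in zip(q_dists, row)) <= r]
        if not survivors:
            continue                  # whole page filtered: no I/O at all
        objects = load_page(page_no)  # simulated page read
        pages_loaded += 1
        for j in survivors:
            if dist(q, objects[j]) <= r:
                result.append(objects[j])
    return result, pages_loaded


# Three pages, each holding one cluster of similar objects.
disk = [[1.0, 2.0, 3.0], [50.0, 51.0, 52.0], [97.0, 98.0, 99.0]]
pivots = [0.0, 100.0]
matrix = [[[dist(p, o) for p in pivots] for o in page] for page in disk]
result, pages_loaded = cpt_range_query(51.5, 1.0, pivots, matrix,
                                       lambda n: disk[n])
```

Because each page holds similar objects, two of the three pages are rejected as a whole and only one page is read.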

10
CPT - Querying problems
- Problem 1: the LAESA kNN algorithm sorts DB objects according to their lower bound to the query object, which is not optimal for I/O cost
- Solution: CPT does not sort objects => objects are processed sequentially
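A toy way to see why the sorted order hurts I/O is to count how often consecutive objects fall on different pages. The sketch below assumes a buffer that holds a single page, so every change of page counts as one read; the function name `page_switches` and the lower-bound values are made up for illustration.

```python
from typing import List, Sequence


def page_switches(order: Sequence[int], page_of: Sequence[int]) -> int:
    """Count page reads for a visiting order, assuming a one-page buffer."""
    switches = 0
    current = None
    for obj_id in order:
        if page_of[obj_id] != current:
            switches += 1
            current = page_of[obj_id]
    return switches


page_of = [0, 0, 1, 1, 2, 2]                      # object id -> page number
lower_bounds = [0.5, 3.0, 1.0, 2.5, 0.2, 4.0]     # hypothetical values

# LAESA-style order: ascending lower bound to the query object.
sorted_order: List[int] = sorted(range(len(page_of)),
                                 key=lambda j: lower_bounds[j])
# CPT-style order: objects in the order they are stored.
sequential_order = list(range(len(page_of)))

reads_sorted = page_switches(sorted_order, page_of)
reads_sequential = page_switches(sequential_order, page_of)
```

With these numbers the sorted order jumps between pages five times, while the sequential order touches each of the three pages once.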

11
CPT - Querying problems
- Problem 2: in CPT the dynamic radius decreases more slowly during kNN processing
- Solution: the first bunch of objects is not clustered
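The effect of an unclustered first bunch can be illustrated with a small sequential kNN sketch. Everything here is an assumption made for the example: 1-D data, a single pivot, an absolute-difference metric, and the two hand-built page layouts. The function `cpt_knn` is hypothetical and only mimics the idea of a dynamic radius with page skipping.

```python
import heapq
from typing import List, Sequence, Tuple


def dist(a: float, b: float) -> float:
    # Toy metric (assumption): absolute difference on the real line.
    return abs(a - b)


def cpt_knn(q: float, k: int, pivots: Sequence[float],
            pages: List[List[float]]) -> Tuple[List[float], int]:
    """Process pages in stored order; skip pages whose objects all have a
    lower bound above the current dynamic radius."""
    # In-memory distance matrix, grouped per page.
    matrix = [[[dist(p, o) for p in pivots] for o in page] for page in pages]
    q_dists = [dist(p, q) for p in pivots]
    heap: List[Tuple[float, float]] = []   # max-heap via negated distances
    radius = float("inf")                  # distance to current k-th neighbour
    pages_loaded = 0
    for page, rows in zip(pages, matrix):
        # Filtering uses the radius known when the page is reached.
        survivors = [j for j, row in enumerate(rows)
                     if max(abs(a - b) for a, b in zip(q_dists, row)) <= radius]
        if not survivors:
            continue
        pages_loaded += 1                  # simulated page read
        for j in survivors:
            d = dist(q, page[j])
            if len(heap) < k:
                heapq.heappush(heap, (-d, page[j]))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, page[j]))
            if len(heap) == k:
                radius = -heap[0][0]
    neighbours = [o for _, o in sorted((-nd, o) for nd, o in heap)]
    return neighbours, pages_loaded


clustered = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0],
             [20.0, 21.0, 22.0], [30.0, 31.0, 32.0]]
# Same objects, but the first page mixes one object from every cluster.
seeded = [[3.0, 12.0, 22.0, 32.0], [1.0, 2.0], [10.0, 11.0],
          [20.0, 21.0], [30.0, 31.0]]

nn_clustered, loads_clustered = cpt_knn(30.4, 2, [0.0], clustered)
nn_seeded, loads_seeded = cpt_knn(30.4, 2, [0.0], seeded)
```

Both layouts return the same two neighbours, but in this toy case the fully clustered layout reads all four pages, whereas the layout with a mixed first page shrinks the radius early and reads only two.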

13
Experiments (1)
- 2 real datasets
  - subset of CoPhIR, subset of Corel
- 2 synthetic datasets
  - Cloud, PolygonSet
- We considered several M-tree variants
  - Single/Multi-way leaf selection
  - Reinsertions
- Measured I/O cost
  - CPT vs. PT vs. M-tree

14
Experiments (2)

15
Experiments (3)

16
Conclusion
- We have designed an I/O-optimized method for persistent pivot tables
- Future work
  - Thorough experiments on SSD disks
  - Use other metric clustering techniques

17
Thank you
