Published by Sonny Pedley. Modified over 3 years ago.

1
Clustered Pivot Tables for I/O-optimized Similarity Search
  Juraj Moško, Jakub Lokoč, Tomáš Skopal
  Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague
  SISAP 2011, Lipari

2
Presentation outline
  Similarity search in metric spaces
  Pivot tables
  Clustered pivot tables
    Static variant
    Dynamic variant
  Experiments

3
Similarity search
  Suitable for unstructured data; the query is often not in the DB
  Similarity is often modeled by a metric distance d
  Expensive distance functions: EMD, SQFD, DTW, …
  Metric indexing
    Based on lower-bounding
    For a pivot p, query q, and query radius r: if |d(p, q) - d(p, o)| > r, filter out object o
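The filtering rule on this slide follows from the triangle inequality: |d(p, q) - d(p, o)| never exceeds d(q, o), so if it is already larger than r, the object cannot be within the radius. A minimal sketch follows; the function names `lower_bound` and `can_filter` and the 1-D example metric are illustrative assumptions, not taken from the slides.

```python
def lower_bound(d_pq: float, d_po: float) -> float:
    """Lower bound on d(q, o) derived from one pivot p (triangle inequality)."""
    return abs(d_pq - d_po)


def can_filter(d_pq: float, d_po: float, r: float) -> bool:
    """True when the pivot alone proves that d(q, o) > r."""
    return lower_bound(d_pq, d_po) > r


# Illustrative metric (assumption): points on the real line, d(x, y) = |x - y|.
def d(x: float, y: float) -> float:
    return abs(x - y)


p, q, o, r = 0.0, 10.0, 2.0, 5.0
# The object is discarded without ever computing d(q, o).
discarded = can_filter(d(p, q), d(p, o), r)
```

Only d(p, q) is computed at query time; d(p, o) is assumed to have been stored when the index was built.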

4
Pivot tables
  Simple yet efficient main-memory metric index
  Given k static pivots P_i and a database S of n objects O_j, the pivot table stores all distances d(P_i, O_j) in a matrix of size k x n
  Pivot table = two structures: distance matrix + data file
  Cheap filtering of non-relevant objects (lower-bounding)
  Non-filtered objects are refined by the original expensive distance function
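As a rough illustration of the structure described above, the following sketch builds the k x n distance matrix and answers a range query by filtering with the stored distances and refining the survivors. The function names, the use of Euclidean distance (`math.dist`), and the toy data are assumptions made for the example.

```python
import math


def build_pivot_table(pivots, objects, dist):
    """Return a k x n matrix with table[i][j] = dist(pivots[i], objects[j])."""
    return [[dist(p, o) for o in objects] for p in pivots]


def range_query(q, r, pivots, objects, table, dist):
    """Return (indices of objects within r of q, number of refinements)."""
    q_to_pivots = [dist(q, p) for p in pivots]
    hits, refined = [], 0
    for j, o in enumerate(objects):
        # Cheap filtering: any pivot whose lower bound exceeds r discards o.
        if any(abs(q_to_pivots[i] - table[i][j]) > r for i in range(len(pivots))):
            continue
        # Refinement with the original (expensive) distance function.
        refined += 1
        if dist(q, o) <= r:
            hits.append(j)
    return hits, refined


objects = [(0, 0), (1, 0), (5, 5), (9, 9)]
pivots = [(0, 0), (10, 10)]
table = build_pivot_table(pivots, objects, math.dist)
hits, refined = range_query((0.5, 0), 1.0, pivots, objects, table, math.dist)
```

In this toy run two of the four objects are discarded by the pivots alone, so the expensive distance is evaluated only for the remaining two.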

5
Clustered pivot tables
  What if the pivot table does not fit into main memory?
  Solution 1: just slice the data file
    + simple to construct
    - sequential scan => high I/O cost
  Solution 2: reorganize and slice the data file
    + similar objects in one page (page = cluster) => higher probability that all objects in a page are filtered => lower I/O cost
    - metric clustering is expensive

6
Metric clustering? M-tree!
  Dynamic, persistent, balanced structure
  A leaf node represents a cluster of similar objects
  Many construction strategies considering the quality of the M-tree hierarchy, with complexity < O(n^2)
    Single/Multi/Hybrid-way leaf selection
    Slim-down algorithm
    Reinsertions

7
Static CPT
  Data file = objects serialized from M-tree leaves
  Classic pivot table, reorganizing the input
  Fixed page size in a paged data file
  Preserve the M-tree?
    Future re-indexing
    Query processing
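One way to picture the static variant is the sketch below: objects are written out leaf by leaf and the resulting sequence is cut into pages of a fixed size. The function name `serialize_leaves`, the representation of leaves as plain lists, and the choice to let a page straddle two leaves are assumptions of this example; the slides do not specify these details.

```python
def serialize_leaves(leaves, page_size):
    """Flatten leaf clusters in leaf order, then cut into fixed-size pages.

    `leaves` is assumed to be a list of lists, one inner list per M-tree leaf.
    """
    ordered = [obj for leaf in leaves for obj in leaf]
    pages = [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]
    return ordered, pages


leaves = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3"]]
ordered, pages = serialize_leaves(leaves, page_size=4)
# pages == [["a1", "a2", "a3", "b1"], ["b2", "c1", "c2", "c3"]]
```

The distance matrix would then be built over `ordered`, so that matrix columns and data file positions stay aligned.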

8
Dynamic CPT
  Data file = set of M-tree leaves
  Distance matrix connected to the M-tree leaves
  Internal fragmentation
    M-tree leaves contain different numbers of data objects; utilization is not 100%
  Dynamic operations do not degenerate the created clusters

9
CPT - Querying
  Filtering based on lower-bounding
  If all data objects from one page are filtered out, the page is not loaded from the data file into memory => I/O optimization
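The page-skipping idea can be sketched as follows. The distance matrix is assumed to be held in memory and grouped by page, and indexing into `pages` stands in for a disk read. All names and the toy data are assumptions; the example compares a clustered layout with an interleaved one for the same query.

```python
import math


def build_matrix(pivots, pages, dist):
    """matrix[page][slot] = list of distances from each pivot to that object."""
    return [[[dist(p, o) for p in pivots] for o in page] for page in pages]


def range_query_paged(q, r, pivots, pages, matrix, dist):
    """Return (objects within r of q, number of pages loaded)."""
    dq = [dist(q, p) for p in pivots]
    hits, pages_loaded = [], 0
    for page_no, rows in enumerate(matrix):
        survivors = [slot for slot, row in enumerate(rows)
                     if all(abs(a - b) <= r for a, b in zip(dq, row))]
        if not survivors:
            continue  # every object on this page was filtered: no I/O
        page = pages[page_no]  # stands in for loading the page from disk
        pages_loaded += 1
        hits.extend(page[s] for s in survivors if dist(q, page[s]) <= r)
    return hits, pages_loaded


pivots = [(0, 0), (10, 10)]
q, r = (0.5, 0), 1.0
clustered = [[(0, 0), (1, 0)], [(5, 5), (9, 9)]]
interleaved = [[(0, 0), (5, 5)], [(1, 0), (9, 9)]]
res_c = range_query_paged(q, r, pivots, clustered,
                          build_matrix(pivots, clustered, math.dist), math.dist)
res_i = range_query_paged(q, r, pivots, interleaved,
                          build_matrix(pivots, interleaved, math.dist), math.dist)
```

With the clustered layout only one page is read, while the interleaved layout forces both pages to be read for the same answer.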

10
CPT - Querying problems
  Problem 1: the LAESA kNN algorithm sorts DB objects according to their lower bound to the query object, which is not optimal for I/O cost
  Solution: CPT does not sort objects => objects are processed sequentially

11
CPT - Querying problems
  Problem 2: in CPT the dynamic radius decreases more slowly during kNN processing
  Solution: the first bunch of objects is not clustered
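To see why the processing order matters for the dynamic radius, consider the sketch below of a sequential kNN scan that processes objects in stored order (no sorting by lower bound) and shrinks the radius to the current k-th best distance. The function name, the 1-D metric, and both orderings are assumptions for illustration; the sketch does not reproduce the authors' exact procedure for the unclustered first bunch.

```python
import heapq
import math


def knn_sequential(q, k, pivots, objects, table, dist):
    """Scan objects in stored order; return (sorted (distance, index) list, refinements)."""
    dq = [dist(q, p) for p in pivots]
    heap = []  # max-heap via negated distances: entries are (-distance, index)
    refined = 0
    for j, o in enumerate(objects):
        # Dynamic radius: k-th best distance so far, infinite until k are found.
        radius = -heap[0][0] if len(heap) == k else math.inf
        lb = max(abs(dq[i] - table[i][j]) for i in range(len(pivots)))
        if lb > radius:
            continue  # filtered by the pivots
        refined += 1
        d = dist(q, o)
        if len(heap) < k:
            heapq.heappush(heap, (-d, j))
        elif d < radius:
            heapq.heapreplace(heap, (-d, j))
    return sorted((-nd, j) for nd, j in heap), refined


def absdist(x, y):
    return abs(x - y)


pivots, q = [0.0], 10.0
far_first = [100.0, 101.0, 102.0, 50.0, 51.0, 11.0, 10.5]
near_first = [10.5, 11.0, 100.0, 101.0, 102.0, 50.0, 51.0]
tables = {name: [[absdist(p, o) for o in objs] for p in pivots]
          for name, objs in (("far", far_first), ("near", near_first))}
res_far = knn_sequential(q, 1, pivots, far_first, tables["far"], absdist)
res_near = knn_sequential(q, 1, pivots, near_first, tables["near"], absdist)
```

Both runs return the same nearest neighbor, but the scan that meets far-away clusters first needs four refinements while the other needs one, because its radius stays large for longer.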

13
Experiments (1)
  2 real datasets: subset of CoPhIR, subset of Corel
  2 synthetic datasets: Cloud, PolygonSet
  We considered several M-tree variants
    Single/Multi-way leaf selection
    Reinsertions
  Measured I/O cost
    CPT vs. PT vs. M-tree

14
Experiments (2)

15
Experiments (3)

16
Conclusion
  We have designed an I/O-optimized method for persistent pivot tables
  Future work
    Thorough experiments on SSD disks
    Use other metric clustering techniques

17
Thank you
