Published by Maurice Gidney. Modified about 1 year ago.

2
Fast Parallel Similarity Search in Multimedia Databases (Best Paper of ACM SIGMOD '97 international conference)

3
Introduction
Similarity queries are one of the most important query types in multimedia databases. A promising and widely used approach is to map the multimedia objects to points in a d-dimensional feature space; similarity is then defined as the proximity of their feature vectors in that space.
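To make the idea of proximity in feature space concrete, here is a minimal sketch of a nearest-neighbor lookup. It assumes Euclidean distance as the proximity measure (the slide does not name one), and the feature vectors and the helper names `euclidean` and `nearest_neighbor` are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_neighbor(query, vectors):
    """Return the index of the feature vector closest to the query."""
    return min(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))

# Hypothetical 3-dimensional feature vectors, one per multimedia object.
features = [(0.1, 0.9, 0.3), (0.8, 0.2, 0.5), (0.15, 0.85, 0.35)]
best = nearest_neighbor((0.12, 0.88, 0.3), features)
print(best)
```

This linear scan reads every vector; the declustering techniques on the following slides aim to spread that reading work across several disks.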

6
Similarity search in high-dimensional data spaces is an inherently computationally intensive problem, so the use of parallelism is crucial for improving performance.
The core problem in designing a parallel nearest-neighbor algorithm is to determine an adequate distribution of the data to the disks.
The goal is to distribute the data that has to be read when executing a query as equally as possible among the disks. This is the declustering problem.

8
A bucket can be characterized by its position in the d-dimensional space: (c_0, c_1, …, c_{d-1}).
A declustering algorithm can therefore be described as a mapping from the bucket characterization (c_0, c_1, …, c_{d-1}) to a disk number.
[Figure: buckets labeled with their binary coordinates (00, 01, 10, 11)]

9
Many algorithms solving the declustering problem have been proposed.
1. Disk Modulo method: DM(c_0, c_1, …, c_{d-1}) = (c_0 + c_1 + … + c_{d-1}) mod n, where n is the number of disks

10
2. FX method (supports partial match queries): FX(c_0, c_1, …, c_{d-1}) = (c_0 XOR c_1 XOR … XOR c_{d-1}) mod n
3. Hilbert method (the Hilbert curve maps a d-dimensional space to a 1-dimensional space): HI(c_0, c_1, …, c_{d-1}) = Hilbert(c_0, c_1, …, c_{d-1}) mod n
Unfortunately, these methods do not provide an adequate data distribution for similarity queries in high-dimensional feature spaces.
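The first two mappings can be sketched in a few lines. This is an illustration only: the Disk Modulo formula is assumed to be the usual sum of the bucket coordinates modulo n, the function names are hypothetical, and the Hilbert method is left out because it needs a full Hilbert-curve implementation.

```python
from functools import reduce
from operator import xor

def disk_modulo(coords, n):
    """Assumed Disk Modulo mapping: sum of the bucket coordinates mod n."""
    return sum(coords) % n

def fx(coords, n):
    """FX mapping: XOR of all bucket coordinates, then mod n."""
    return reduce(xor, coords, 0) % n

# Bucket (1, 2, 3) with n = 4 disks.
print(disk_modulo((1, 2, 3), 4))  # 6 mod 4
print(fx((1, 2, 3), 4))           # (1 XOR 2 XOR 3) mod 4
```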

11
In high-dimensional spaces, it is not feasible to consider more than a binary partition of each dimension:
1. In 2 dimensions, if the space is divided 100 times in both the x and y directions, the number of buckets is 100 * 100 = 10,000. In 16 dimensions, a complete binary partition already produces 2^16 = 65,536 partitions.
2. A finer partitioning would produce many underfilled buckets.
Thus, the bucket coordinates (c_0, c_1, …, c_{d-1}) can be seen as binary values, and the bucket number is defined as: bucket number = Σ_{i=0}^{d-1} c_i * 2^i
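As a small sketch of the bucket numbering, the binary coordinates can be packed into an integer and unpacked again. The helper names `bucket_number` and `bucket_coords` are hypothetical; coordinate c_i is taken to be bit i of the bucket number.

```python
def bucket_number(coords):
    """Pack binary coordinates (c_0, ..., c_{d-1}) into sum(c_i * 2**i)."""
    return sum(c << i for i, c in enumerate(coords))

def bucket_coords(number, d):
    """Unpack a bucket number into its d binary coordinates."""
    return tuple((number >> i) & 1 for i in range(d))

print(bucket_number((1, 0, 1)))   # 1*1 + 0*2 + 1*4
print(bucket_coords(5, 3))
print(2 ** 16)                    # buckets in a complete binary partition, d = 16
```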

12
An important property of high-dimensional data space: most data items are located near the (d-1)-dimensional surface of the data space. (Here "near" means that the distance from the point to the surface is less than 0.1.)
Probability of lying near a surface in 2 dimensions: 1 - (1 - 0.2)^2 = 0.36 = 36%
In general: P_s(d) = 1 - (1 - 0.2)^d
The probability grows rapidly with increasing dimension and reaches more than 97% for a dimensionality of 16.
[Figure: P_s(d) plotted against the dimension d]
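The formula is easy to evaluate numerically. The sketch below assumes, as the formula itself implies, points distributed uniformly in a unit hypercube; the function name `surface_probability` and the `near` parameter are hypothetical.

```python
def surface_probability(d, near=0.1):
    """P_s(d) = 1 - (1 - 2*near)**d for uniform points in the unit hypercube."""
    return 1 - (1 - 2 * near) ** d

for d in (2, 5, 10, 16):
    print(d, round(surface_probability(d), 4))
```

The inner region that is not near any surface has side length 1 - 2 * 0.1 = 0.8 in every dimension, which is why its volume shrinks as 0.8^d.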

13
If the radius of the NN-sphere is less than 0.5, only the bucket containing the query point is accessed. If the radius of the NN-sphere is 0.6, two other buckets are involved. To obtain a good speed-up, the three buckets involved in the search should be distributed to different disks. This observation holds for most queries, since the query point is very likely to lie on a lower-dimensional surface.

15
Definition: direct and indirect neighbors
Given two buckets b and c:
b and c are direct neighbors, b ~_d c, if and only if b XOR c = 2^j for some j in {0, …, d-1}.
b and c are indirect neighbors, b ~_i c, if and only if b XOR c = 2^j + 2^k for some j ≠ k in {0, …, d-1}.

16
Intuitively, two buckets b and c are direct neighbors if their coordinates differ in one dimension and the remaining (d-1) coordinates are identical. The XOR of two direct neighbors results in a bit string of the form 0*10*.
Two buckets b and c are indirect neighbors if their coordinates differ in two dimensions and the remaining (d-2) coordinates are identical. The XOR of two indirect neighbors results in a bit string of the form 0*10*10*.
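The two bit-string patterns amount to counting the 1-bits in the XOR of the bucket numbers. A minimal sketch, with hypothetical function names and buckets given as integers:

```python
def differing_dimensions(b, c):
    """Number of coordinates in which buckets b and c differ."""
    return bin(b ^ c).count("1")

def is_direct_neighbor(b, c):
    """True if b XOR c has exactly one bit set (pattern 0*10*)."""
    return differing_dimensions(b, c) == 1

def is_indirect_neighbor(b, c):
    """True if b XOR c has exactly two bits set (pattern 0*10*10*)."""
    return differing_dimensions(b, c) == 2

print(is_direct_neighbor(0b000, 0b100))    # differ only in dimension 2
print(is_indirect_neighbor(0b011, 0b000))  # differ in dimensions 0 and 1
```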

17
Near-optimal declustering: A declustering algorithm DA is near-optimal if and only if, for any two buckets b and c and for any dimension d of the data space:
b ~_d c ⇒ DA(b) ≠ DA(c), and b ~_i c ⇒ DA(b) ≠ DA(c)
The Disk modulo, FX, and Hilbert declustering techniques are not near-optimal.
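The definition can be checked by brute force for small d. The sketch below is not taken from the slides: `conflicts`, `is_near_optimal`, and `disk_modulo_binary` are hypothetical helpers, and Disk Modulo is assumed to be the sum of the binary coordinates modulo n.

```python
from itertools import combinations

def conflicts(da, d):
    """List neighbor pairs (b, c), b < c, that da maps to the same disk."""
    bad = []
    for b in range(2 ** d):
        for k in (1, 2):  # flip 1 bit (direct) or 2 bits (indirect)
            for bits in combinations(range(d), k):
                c = b
                for i in bits:
                    c ^= 1 << i
                if b < c and da(b) == da(c):
                    bad.append((b, c))
    return bad

def is_near_optimal(da, d):
    """True if no direct or indirect neighbors share a disk."""
    return not conflicts(da, d)

def disk_modulo_binary(n):
    """Assumed Disk Modulo on binary coordinates: popcount mod n."""
    return lambda b: bin(b).count("1") % n

print(conflicts(disk_modulo_binary(4), 2))
print(is_near_optimal(lambda b: b, 2))  # one disk per bucket
```

In this two-dimensional case buckets 01 and 10 are indirect neighbors whose coordinate sums are both 1, so the assumed Disk Modulo mapping places them on the same disk.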

18
[Figure: disk numbers assigned to the buckets by Disk modulo, FX, Hilbert, and Near-Optimal Declustering]

19
Near-optimal declustering can be treated as a graph coloring problem:
Graph G = (V, E), where V is the set of buckets and E = {(b, c) | b ~_d c or b ~_i c} is the set of direct and indirect neighborhood relationships.
Coloring/Declustering Algorithm:
function col(c: integer): integer
var i: integer;
begin
  col := 0;
  for i := 0 to dimension-1 do
    if bit_set(i, c) then
      col := col XOR (i+1);
    endif
  endfor
end
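A runnable Python sketch of the `col` function above follows. Passing `dimension` as an explicit parameter and testing bit i with a shift are choices made for this illustration; the slide treats `dimension` as a global and uses a `bit_set` helper.

```python
def col(c, dimension):
    """Color (disk number) of bucket c: XOR of (i + 1) over all set bits i."""
    color = 0
    for i in range(dimension):
        if (c >> i) & 1:      # bit i of the bucket number is set
            color ^= i + 1
    return color

# Colors of the four buckets of a 2-dimensional binary partition.
print([col(c, 2) for c in range(4)])
```

Two direct neighbors differ in one bit i, so their colors differ by XOR with i + 1, which is never 0. Two indirect neighbors differ in bits i and j with i ≠ j, so their colors differ by (i + 1) XOR (j + 1), which is also nonzero.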

20
How many colors are needed to solve the graph coloring problem? Equivalently, how many disks are needed for near-optimal declustering?
Answer:
Experiments show that near-optimal declustering provides an almost linear speed-up and a constant scale-up.
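One way to explore the question is to count how many distinct colors the `col` function from the previous slide actually produces for a given dimension. The sketch below only reports what that particular coloring uses; it is not a proof of the minimum, and `colors_used` is a hypothetical helper.

```python
def col(c, dimension):
    """Color of bucket c: XOR of (i + 1) over all set bits i."""
    color = 0
    for i in range(dimension):
        if (c >> i) & 1:
            color ^= i + 1
    return color

def colors_used(dimension):
    """Number of distinct colors col assigns over all 2**dimension buckets."""
    return len({col(c, dimension) for c in range(2 ** dimension)})

for d in range(1, 9):
    print(d, colors_used(d))
```

For the dimensions tried here, the count is the smallest power of two greater than d.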
