Fast Parallel Similarity Search in Multimedia Databases (Best Paper of ACM SIGMOD '97 international conference)

Introduction: Similarity queries are one of the most important query types in multimedia databases. A promising and widely used approach is to map the multimedia objects to points in a d-dimensional feature space; similarity is then defined as the proximity of their feature vectors in that space.

Similarity search in high-dimensional data spaces is an inherently computationally intensive problem, so the use of parallelism is crucial for improving performance. The core problem in designing a parallel nearest-neighbor algorithm is to determine an adequate distribution of the data to the disks (the declustering problem). The goal is to distribute the data that has to be read when executing a query as equally as possible among the disks.

Buckets may be characterized by their position in the d-dimensional space: $(c_0, c_1, \dots, c_{d-1})$. A declustering algorithm can therefore be described as a mapping from the bucket characterization $(c_0, c_1, \dots, c_{d-1})$ to a disk number.

Many algorithms solving the declustering problem have been proposed.
1. Disk Modulo method: $DM(c_0, c_1, \dots, c_{d-1}) = \left(\sum_{i=0}^{d-1} c_i\right) \bmod n$, where n is the number of disks.

2. FX method (supports partial match queries): $FX(c_0, c_1, \dots, c_{d-1}) = \left(\bigoplus_{i=0}^{d-1} c_i\right) \bmod n$, where $\bigoplus$ is the bitwise XOR.
3. Hilbert method (the Hilbert curve maps a d-dimensional space to a 1-dimensional space): $HI(c_0, c_1, \dots, c_{d-1}) = \mathrm{Hilbert}(c_0, c_1, \dots, c_{d-1}) \bmod n$.
Unfortunately, these methods do not provide an adequate data distribution for similarity queries in high-dimensional feature spaces.
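The Disk Modulo and FX mappings can be sketched in a few lines of Python. This is an illustration only: the function names are made up here, the Disk Modulo formula is assumed to be the usual "sum of the coordinates modulo n" (the slide text does not preserve it), and the Hilbert method is left out because it needs a d-dimensional Hilbert curve implementation.

```python
from functools import reduce
from operator import xor

def disk_modulo(coords, n):
    """Disk Modulo (assumed definition): sum of the bucket coordinates mod n disks."""
    return sum(coords) % n

def fx(coords, n):
    """FX: bitwise XOR of the bucket coordinates mod n disks."""
    return reduce(xor, coords, 0) % n

bucket = (3, 5, 6)              # hypothetical bucket coordinates (c_0, c_1, c_2)
print(disk_modulo(bucket, 4))   # (3 + 5 + 6) mod 4 = 2
print(fx(bucket, 4))            # (3 XOR 5 XOR 6) mod 4 = 0
```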

In high-dimensional spaces, it is not possible to consider more than a binary partition of each dimension:
1. In 2 dimensions, if the space is divided 100 times in both the x and y directions, the number of buckets is 100 * 100 = 10,000. In 16 dimensions, a complete binary partition already produces $2^{16} = 65{,}536$ partitions.
2. A finer partitioning would produce many underfilled buckets.
Thus, the bucket coordinates $(c_0, c_1, \dots, c_{d-1})$ can be seen as binary values, and the bucket number is defined as $\sum_{i=0}^{d-1} c_i \cdot 2^i$.
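A minimal sketch of the binary bucket numbering, assuming the data space is the unit cube and that a coordinate of 0.5 or more falls into the upper half (both are assumptions, as are the function names):

```python
def bucket_coords(point):
    """Binary coordinates: c_i = 1 if the point is in the upper half of dimension i."""
    return [1 if x >= 0.5 else 0 for x in point]

def bucket_number(coords):
    """Bucket number: sum of c_i * 2^i over all dimensions."""
    return sum(c << i for i, c in enumerate(coords))

point = (0.7, 0.2, 0.9)          # hypothetical 3-dimensional feature vector
coords = bucket_coords(point)
print(coords)                    # [1, 0, 1]
print(bucket_number(coords))     # 1*1 + 0*2 + 1*4 = 5
print(100 * 100, 2 ** 16)        # 10000 65536
```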

An important property of high-dimensional data spaces: most data items are located near the (d-1)-dimensional surface of the data space. (Let "near" mean that the distance of the point to the surface is less than 0.1.) For points distributed uniformly in the unit data space, the probability of lying near a surface in 2 dimensions is $1 - (1 - 0.2)^2 = 0.36 = 36\%$. In general, $P_s(d) = 1 - (1 - 0.2)^d$. The probability grows rapidly with increasing dimension and reaches more than 97% for a dimensionality of 16.
[Figure: $P_s(d)$ plotted against the dimension d.]
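The growth of this probability is easy to tabulate. The sketch below assumes points distributed uniformly in the unit cube, which is what the formula implies; the function name and the `eps` parameter are illustrative.

```python
def surface_probability(d, eps=0.1):
    """P_s(d) = 1 - (1 - 2*eps)^d: chance that a uniform point lies within eps of the surface."""
    return 1 - (1 - 2 * eps) ** d

for d in (2, 4, 8, 16):
    print(d, round(surface_probability(d), 4))
# d = 2 gives 0.36 and d = 16 gives 0.9719
```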

If the radius of the NN-sphere is less than 0.5, only the bucket containing the query point is accessed. If the radius of the NN-sphere is 0.6, two other buckets are involved. To obtain a good speed-up, the three buckets involved in the search should be distributed to different disks. This observation holds for most queries, since the query point is very likely to lie near a lower-dimensional surface.
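The effect of the radius can be reproduced with a small sketch. It assumes a 2-dimensional unit data space with a binary partition and a query point in the corner (0, 0); the slide's figure is not preserved, so this setup and the function names are assumptions chosen to match the numbers in the text.

```python
import math
from itertools import product

def min_distance(query, coords):
    """Euclidean distance from the query point to the closest point of a bucket."""
    total = 0.0
    for q, c in zip(query, coords):
        lo, hi = (0.0, 0.5) if c == 0 else (0.5, 1.0)
        gap = max(lo - q, 0.0, q - hi)   # 0 if q is inside the bucket's extent
        total += gap * gap
    return math.sqrt(total)

def buckets_hit(query, radius):
    """All binary buckets intersected by the NN-sphere around the query point."""
    d = len(query)
    return [c for c in product((0, 1), repeat=d)
            if min_distance(query, c) <= radius]

print(buckets_hit((0.0, 0.0), 0.4))   # [(0, 0)]
print(buckets_hit((0.0, 0.0), 0.6))   # [(0, 0), (0, 1), (1, 0)]
```

The diagonal bucket (1, 1) is only reached once the radius exceeds about 0.707, so the buckets touched first are the ones that differ from the query bucket in a single dimension.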

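Definition: direct and indirect neighbors. Given two buckets b and c, written as bucket numbers (bit strings):
b and c are direct neighbors, $b \sim_d c$, if and only if $b \oplus c = 2^j$ for some $j \in \{0, \dots, d-1\}$.
b and c are indirect neighbors, $b \sim_i c$, if and only if $b \oplus c = 2^j + 2^k$ for some $j \neq k$ with $j, k \in \{0, \dots, d-1\}$.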

Intuitively, two buckets b and c are direct neighbors if their coordinates differ in exactly one dimension and the remaining (d-1) coordinates are identical. The XOR of two direct neighbors is a bit string of the form 0*10* (exactly one bit set). Two buckets b and c are indirect neighbors if their coordinates differ in exactly two dimensions and the remaining (d-2) coordinates are identical. The XOR of two indirect neighbors is a bit string of the form 0*10*10* (exactly two bits set).
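Since the two relations only depend on how many bits are set in the XOR of the bucket numbers, they can be tested with a population count. A minimal sketch, with illustrative function names:

```python
def differing_dims(b, c):
    """Number of dimensions in which the binary bucket numbers b and c differ."""
    return bin(b ^ c).count("1")

def is_direct_neighbor(b, c):
    return differing_dims(b, c) == 1

def is_indirect_neighbor(b, c):
    return differing_dims(b, c) == 2

print(is_direct_neighbor(0b0101, 0b0111))     # True:  XOR = 0010
print(is_indirect_neighbor(0b0101, 0b0011))   # True:  XOR = 0110
print(is_direct_neighbor(0b0101, 0b0011))     # False
```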

Near-optimal declustering: A declustering algorithm DA is near-optimal if and only if, for any two buckets b and c and for any dimension d of the data space:
$b \sim_d c \Rightarrow DA(b) \neq DA(c)$ and $b \sim_i c \Rightarrow DA(b) \neq DA(c)$.
The Disk Modulo, FX, and Hilbert declustering techniques are not near-optimal.
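For a small dimension the near-optimality condition can be checked by brute force. The sketch below does this for Disk Modulo and FX on binary bucket coordinates; the choice of d = 3 and n = 4 disks, the "sum mod n" form of Disk Modulo, and all names are assumptions made for illustration. Hilbert is not checked here.

```python
from itertools import combinations

def is_near_optimal(assign, d):
    """True if no pair of direct or indirect neighbors is mapped to the same disk."""
    for b, c in combinations(range(2 ** d), 2):
        if bin(b ^ c).count("1") in (1, 2) and assign(b) == assign(c):
            return False
    return True

def bits(b, d):
    """Binary bucket coordinates (c_0, ..., c_{d-1}) of bucket number b."""
    return [(b >> i) & 1 for i in range(d)]

d, n = 3, 4

def disk_modulo(b):
    return sum(bits(b, d)) % n            # sum of coordinates mod n

def fx(b):
    parity = 0
    for c in bits(b, d):
        parity ^= c                       # XOR of binary coordinates
    return parity % n

print(is_near_optimal(disk_modulo, d))    # False: buckets 001 and 010 collide
print(is_near_optimal(fx, d))             # False: buckets 000 and 011 collide
```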

[Figure: disk numbers assigned to the buckets of a small example grid under Disk Modulo, FX, Hilbert, and near-optimal declustering.]

Near-optimal declustering as a graph coloring problem: Graph G = (V, E), where V is the set of buckets and $E = \{ (b, c) \mid b \sim_d c \text{ or } b \sim_i c \}$ is the set of direct and indirect neighborhood relationships.

Coloring/declustering algorithm:

    function col(c: integer): integer
    var i: integer;
    begin
      col := 0;
      for i := 0 to dimension-1 do
        if bit_set(i, c) then col := col XOR (i+1); endif
      endfor
    end
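The col pseudocode translates almost directly to Python. In this sketch the dimension is passed as an explicit parameter (an assumption; the pseudocode treats it as a global), and the brute-force check at the end is only an illustration for a small dimension.

```python
def col(c, dimension):
    """Color (disk number) of bucket c: XOR of (i + 1) over all set bits i of c."""
    color = 0
    for i in range(dimension):
        if (c >> i) & 1:          # bit_set(i, c)
            color ^= i + 1
    return color

print(col(0b101, 3))   # bits 0 and 2 are set: 1 XOR 3 = 2

# Brute-force check for d = 6: direct and indirect neighbors never share a color.
d = 6
ok = all(col(b, d) != col(c, d)
         for b in range(2 ** d) for c in range(2 ** d)
         if bin(b ^ c).count("1") in (1, 2))
print(ok)              # True
```

The check succeeds because two direct neighbors differ in one bit i, so their colors differ by XOR with i + 1, which is never 0; two indirect neighbors differ by XOR with (i + 1) XOR (j + 1), which is non-zero whenever i and j are different.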

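Question: How many colors are needed to solve the graph coloring problem, that is, how many disks are needed for near-optimal declustering?
Answer: the col function uses $2^{\lceil \log_2(d+1) \rceil}$ colors, the smallest power of two that is at least d+1, so that many disks suffice.
Experiments show that near-optimal declustering provides an almost linear speed-up and a constant scale-up.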
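The number of colors that col actually uses can be counted directly. The sketch below is an empirical check for small dimensions, not the slide's own answer: it compares the count with the smallest power of two that is at least d + 1. The helper names are illustrative.

```python
def col(c, dimension):
    """Color of bucket c: XOR of (i + 1) over all set bits i of c."""
    color = 0
    for i in range(dimension):
        if (c >> i) & 1:
            color ^= i + 1
    return color

def colors_used(d):
    """Number of distinct colors col assigns over all 2^d buckets."""
    return len({col(b, d) for b in range(2 ** d)})

def next_power_of_two(d):
    """Smallest power of two that is at least d + 1."""
    return 1 << d.bit_length()

for d in range(1, 11):
    print(d, colors_used(d), next_power_of_two(d))
```

At least d + 1 colors are needed in any case, because a bucket and its d direct neighbors are all mutual neighbors (two direct neighbors of the same bucket differ in exactly two dimensions). So col uses the minimum number of disks whenever d + 1 is itself a power of two.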
