Presentation on theme: "iDistance -- Indexing the Distance: An Efficient Approach to KNN Indexing. C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish."— Presentation transcript:
iDistance -- Indexing the Distance: An Efficient Approach to KNN Indexing. C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. Indexing the distance: an efficient method to KNN processing, VLDB 2001.
Similarity queries: similarity range and KNN queries. Similarity range query: given a query point q and a radius r, find all data points within distance r of q. KNN query: given a query point q, find the K data points nearest to q. (Figure: a range query of radius r, and the distance to the Kth NN.) Query Requirement
SS-tree: R-tree-based index structure; uses bounding spheres in internal nodes. Metric tree: R-tree-based, but uses metric distances and bounding spheres. VA-file: uses compression via bit strings for sequential filtering of unwanted data points. P-Sphere tree: two-level index structure; clusters and duplicates data based on sample queries; designed for approximate KNN. A-tree: R-tree-based, but uses relative bounding boxes. Problem: these are hard to integrate into existing DBMSs. Other Methods
Basic Definition. Euclidean distance: dist(p, q) = sqrt(sum_j (p_j - q_j)^2). Relationship between data points -- Theorem 1: Let q be the query object, O_i the reference point for partition i, and p an arbitrary point in partition i. If dist(p, q) <= querydist(q), then by the triangle inequality dist(O_i, q) - querydist(q) <= dist(O_i, p) <= dist(O_i, q) + querydist(q).
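Theorem 1 is what turns a multi-dimensional query into a one-dimensional key interval. A minimal sketch of the bound (function names are illustrative, not from the paper):

```python
import math

def dist(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def candidate_key_range(ref, q, query_dist):
    """Theorem 1: any answer p in this partition satisfies
    dist(ref, q) - query_dist <= dist(ref, p) <= dist(ref, q) + query_dist,
    so only distances (i.e. B+-tree keys) in this interval need inspection."""
    d = dist(ref, q)
    return (d - query_dist, d + query_dist)
```

For example, with reference point (0, 0), query (3, 4), and query radius 1, only points whose distance to the reference lies in [4, 6] can possibly answer the query.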
Basic Concept of iDistance: indexing points based on similarity. A point p in partition i, with reference/anchor point S_i, is mapped to the one-dimensional key y = i * c + dist(S_i, p), where c is a constant large enough to keep the key ranges of different partitions disjoint. (Figure: reference points S_1, S_2, S_3, ..., S_k, S_k+1 and their disjoint key ranges on a line.)
Data points are partitioned into clusters/partitions. Each partition has a reference point to which every data point in the partition refers. Data points are indexed by their similarity (metric distance) to that reference point using a classical B+-tree. Iterative range queries are used in KNN searching. iDistance
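The partitioning and key mapping above can be sketched as follows. This is an illustration, not the paper's implementation: a sorted list stands in for the B+-tree, the class and constant names are made up, and points are assigned to their nearest reference point.

```python
import bisect
import math

C = 10_000.0  # assumed constant separating partition key ranges

def dist(p, q):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class IDistanceIndex:
    """Stand-in for the B+-tree: a sorted list of (key, point) pairs."""
    def __init__(self, refs):
        self.refs = refs      # one reference point per partition
        self.entries = []

    def insert(self, point):
        # assign the point to the partition of its nearest reference point
        i = min(range(len(self.refs)), key=lambda j: dist(self.refs[j], point))
        key = i * C + dist(self.refs[i], point)   # y = i * c + dist(S_i, p)
        bisect.insort(self.entries, (key, point))

    def range_scan(self, lo, hi):
        """All points whose key falls in [lo, hi] -- a B+-tree range query."""
        a = bisect.bisect_left(self.entries, (lo, ()))
        b = bisect.bisect_right(self.entries, (hi, (float("inf"),)))
        return [p for _, p in self.entries[a:b]]
```

A point at distance d from reference point S_i lands at key i * C + d, so each partition occupies its own contiguous key interval.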
The search region is enlarged until K NNs are found; each enlargement maps to range scans in the B+-tree. (Figure: a growing query sphere intersecting partitions S_1 and S_2.) KNN Searching...
For each partition S_i whose annulus [dist(S_i, q) - r, dist(S_i, q) + r] overlaps the partition's key range [dist_min(S_i), dist_max(S_i)], scan the corresponding interval of the B+-tree; the search radius r is increased iteratively. (Figure: query sphere of increasing radius r around q, and the resulting search intervals for S_1 and S_2 on the key axis.) KNN Searching
Inefficient situation: when K = 3, a query sphere with radius r retrieves the 3 NNs, but among them only the nearest, o_1, can be guaranteed. Hence the search continues with enlarged r until r > dist(q, o_3). (Figure: query sphere of radius r around q in partition S, containing o_1, o_2, o_3.) Over Search?
Stopping Criterion -- Theorem 2: the KNN search algorithm terminates when the K NNs are found and the answers are correct. Case 1: dist(furthest(KNN'), q) < r -- every point within the Kth distance has already been scanned, so the candidates are the true KNNs and the search stops. Case 2: dist(furthest(KNN'), q) > r -- a closer point may still lie outside the current search sphere, so r must be enlarged further. (Figure: in case 2, the Kth candidate lies outside radius r.)
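The iterative search with this stopping criterion can be sketched end to end. This is a simplified, self-contained illustration: a list of (reference point, points) pairs stands in for the B+-tree partitions, the radius step `delta` is an assumed tuning parameter, and the data set is assumed to contain at least k points.

```python
import heapq
import math

def dist(p, q):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_search(q, k, partitions, delta=0.1):
    """iDistance-style KNN search sketch: grow the radius r by delta until
    Theorem 2's stopping condition holds (K candidates found and the Kth
    candidate's distance to q is at most r)."""
    r = 0.0
    best = []   # max-heap via negated distances: (-dist(p, q), p)
    seen = set()
    while True:
        r += delta
        for ref, points in partitions:
            d_ref = dist(ref, q)
            for p in points:
                # Theorem 1 filter: any answer within r of q must have a key
                # (distance to ref) inside [d_ref - r, d_ref + r].
                if p not in seen and d_ref - r <= dist(ref, p) <= d_ref + r:
                    seen.add(p)
                    heapq.heappush(best, (-dist(p, q), p))
                    if len(best) > k:
                        heapq.heappop(best)   # drop the furthest candidate
        # Case 1 of Theorem 2: the Kth candidate lies within the search sphere,
        # so no unseen point can be closer -- terminate.
        if len(best) == k and -best[0][0] <= r:
            return sorted((-d, p) for d, p in best)
```

A real implementation would replace the inner loop with B+-tree range scans over the key intervals, but the pruning and the termination test are the same.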
Reference point choices: (centroid of hyperplane, closest distance) and (external point, closest distance). Space-based Partitioning: Equal-partitioning
Reference point choices: (centroid of hyperplane, furthest distance) and (external point, furthest distance). Space-based Partitioning: Equal-partitioning from furthest points
Using an external point as reference reduces the search area. Effect of Reference Points on Query Space
Using (centroid, furthest distance) can greatly reduce the search area. The area bounded by these arcs is the affected search area. Effect on Query Space
Using cluster centroids as reference points. (Figure: clustered points in the unit square with centroid reference points.) Data-based Partitioning I
Using edge points as reference points. (Figure: the same clusters with the reference points moved to the cluster edges.) Data-based Partitioning II
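Data-based partitioning needs cluster representatives to serve as reference points. One common way to obtain centroids, sketched here under the assumption that a few Lloyd (k-means) iterations suffice (the paper does not prescribe a specific clustering algorithm; names are illustrative):

```python
import math
import random

def dist(p, q):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans_reference_points(points, k, iters=10, seed=0):
    """Run a few Lloyd iterations and use the resulting cluster
    centroids as iDistance reference points (Data-based Partitioning I)."""
    rng = random.Random(seed)
    refs = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist(refs[j], p))
            clusters[i].append(p)
        # keep the old reference point if a cluster ends up empty
        refs = [centroid(c) if c else refs[i] for i, c in enumerate(clusters)]
    return refs
```

For Data-based Partitioning II, each centroid would then be pushed outward to an edge point of its cluster before being used as the reference point.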
100K uniform data set, using (external point, furthest distance). Effect of search radius on query accuracy at dimensions 8, 16, and 30. Performance Study: Effect of Search Radius
10-NN queries on 100K uniform data sets, using (external point, furthest distance). Effect of search radius on query cost. I/O Cost vs Search Radius
10-NN queries on a 100K 30-d uniform data set, with different reference points. Effect of Reference Points
KNN queries on a 100K 30-d clustered data set. Effect of query radius on query accuracy for different numbers of partitions. Effect of # of Partitions on Accuracy (Clustered Data)
10-NN queries on a 100K 30-d clustered data set. Effect of the number of partitions on I/O and CPU costs. Effect of # of Partitions on I/O and CPU Cost
KNN queries on 100K and 500K 30-d clustered data sets. Effect of query radius on query accuracy for different data set sizes. Effect of Data Sizes
10-NN queries on 100K and 500K 30-d clustered data sets. Effect of query radius on query cost for different data set sizes. Effect of Clustered Data Sets
10-NN queries on a 100K 30-d clustered data set. Effect of reference points: cluster edge vs cluster centroid. Effect of Reference Points on Clustered Data Sets
10-NN queries on 100K and 500K 30-d clustered data sets. Query cost at varying query accuracy for different data set sizes. iDistance Ideal for Approximate KNN?
10-NN queries on 100K 30-d clustered data sets. C. Yu, B. C. Ooi, K.-L. Tan. Progressive KNN search using B+-trees. Performance Study: Comparing iMinMax and iDistance
iDistance vs A-tree
Summary of iDistance: iDistance is simple but efficient. It is a metric-based index. The index can be integrated into existing systems easily.