Presentation on theme: " iDistance -- Indexing the Distance An Efficient Approach to KNN Indexing C. Yu, B. C. Ooi, K.-L. Tan, H.V. Jagadish. Indexing the distance:"— Presentation transcript:
1 iDistance -- Indexing the Distance: An Efficient Approach to KNN Indexing
C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. Indexing the distance: an efficient method to KNN processing, VLDB 2001.
Now, let me move on to a slightly different problem: indexing for similarity search. To support similarity queries efficiently, I proposed a NOVEL index called iDistance. In this talk, I shall concentrate on KNN search. KNN search is more complex than similarity range search, which is the multi-dimensional implementation of conceptual similarity search! The content of this talk is based on the recent paper I co-authored with my supervisor, my adviser and a colleague in Michigan.
2 Query Requirement
Similarity queries: similarity range and KNN queries.
Similarity range query: given a query point, find all data points within a given distance r of the query point.
KNN query: given a query point, find the K nearest neighbours, in distance, to the point.
[Figure labels: r, Kth NN]
So, similarity queries refer to similarity RANGE and K NEAREST NEIGHBOR queries. A similarity RANGE query, sometimes simply referred to as a RANGE query, is a query where all data points within distance r of the given query point are RETRIEVED. The query is a sphere with radius r, and it is used to check for ALL points falling within it. A K nearest neighbor query, or KNN query, is a more complicated form of similarity query, in that the sphere's radius is NOT FIXED beforehand. Given a query point, the sphere is enlarged SLOWLY till all the K NEAREST NEIGHBORS are found. In a way, we can VISUALIZE it as issuing MULTIPLE range queries.
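The two query types can be sketched by brute force, with no index at all. This is only an illustration of the "KNN as multiple range queries" view; the function names and the starting radius `r0` and step `delta_r` are assumptions made for the example.

```python
import math

def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(points, q, r):
    """Similarity range query: every point within distance r of q."""
    return [p for p in points if dist(p, q) <= r]

def knn_by_growing_radius(points, q, k, r0=0.1, delta_r=0.1):
    """KNN query answered by repeated range queries with a growing radius.

    r0 and delta_r are hypothetical tuning values, not taken from the talk.
    """
    if k > len(points):
        raise ValueError("k exceeds the number of data points")
    r = r0
    while True:
        hits = range_query(points, q, r)
        if len(hits) >= k:
            # Every point outside the sphere is farther away than r, so the
            # k closest hits are the true k nearest neighbours.
            return sorted(hits, key=lambda p: dist(p, q))[:k], r
        r += delta_r

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
neighbours, radius = knn_by_growing_radius(points, (0.0, 0.0), k=2, r0=0.5, delta_r=0.5)
```

Here the first range query (radius 0.5) finds only one point, so the sphere is enlarged once before both neighbours are inside it.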
3 Other Methods
SS-tree: R-tree based index structure; uses bounding spheres in internal nodes
Metric-tree: R-tree based, but uses metric distance and bounding spheres
VA-file: uses compression via bit strings for sequential filtering of unwanted data points
Psphere-tree: two-level index structure; uses clusters and duplicates data based on sample queries; it is for approximate KNN
A-tree: R-tree based, but uses relative bounding boxes
Problems: hard to integrate into existing DBMSs
Similarity search is a hot topic at the moment, as there is NO winner yet as to which index structure is the BEST, and the application areas are wide (MEDICAL databases and MULTIMEDIA databases, for example). In this slide, we list a few recent and better KNOWN indexes.
The SS-tree was proposed in 1995 (ICDE'95). It is actually an R-tree, but instead of using bounding boxes, it uses bounding spheres. Several advantages can be noted: ONE, the notion of distance is CAPTURED. TWO, a sphere description requires less storage, and hence more entries can be stored in a node. However, some papers have shown that its performance still suffers from the dimensionality curse, due to the high overlap between internal nodes.
The metric-tree (VLDB'97) is quite similar to the SS-tree in concept; it is designed for KNN search. Each entry in an internal node is a covering object (known as a routing object, Or, with radius r) and has a reference to its parent object. It uses the kth nearest object obtained so far to prune the search of sibling subtrees.
The VA-file (VLDB'98) is NOT a tree structure. It partitions the domain along each dimension into fixed ranges, and for each range a bit string is used to represent that range. The high-dimensional vector is therefore approximated by lossy bit strings (it is LOSSY, as information is lost, and hence false positives may be FETCHED). Surprisingly, it has been shown that, due to the number of dimensions, 4 bits are good enough to produce impressive performance. This is partly also due to the linear scan of the MUCH smaller VA-file.
The Psphere-tree was proposed at last year's VLDB. It is a two-level indexing structure. It preprocesses the data set to derive various parameters for tree construction. Sample queries are used to determine the representative centroids of clusters, and for each, X nearest data points are kept in the cluster. A data point may appear in multiple clusters! The main problems of such a method are that it works only on static databases (due to heavy preprocessing and the effectiveness of fixed-size clusters), and that queries that do not follow the distribution of the sample queries may NOT perform WELL.
As we all know, today's database management systems are huge in size, and it will NOT be easy to touch the kernel. Therefore, the FOUR examples above will require great implementation effort SHOULD they be integrated into existing DBMSs.
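The lossy bit-string idea behind the VA-file can be illustrated with a small sketch. It is not the VA-file authors' implementation: it assumes a uniform grid over the unit domain, 4 bits per dimension, and hypothetical function names. It shows why false positives can occur: only a lower bound on the true distance is known from the approximation.

```python
import math

def approximate(vector, bits=4, lo=0.0, hi=1.0):
    """Replace each coordinate by the index of its grid cell (0 .. 2**bits - 1)."""
    cells = 1 << bits
    width = (hi - lo) / cells
    return [min(int((x - lo) / width), cells - 1) for x in vector]

def cell_bounds(cell, bits=4, lo=0.0, hi=1.0):
    """Lower and upper edge of one grid cell along one dimension."""
    width = (hi - lo) / (1 << bits)
    return lo + cell * width, lo + (cell + 1) * width

def lower_bound_dist(q, approx, bits=4):
    """Smallest possible Euclidean distance from q to any vector in the cell."""
    total = 0.0
    for x, cell in zip(q, approx):
        low, high = cell_bounds(cell, bits)
        gap = low - x if x < low else x - high if x > high else 0.0
        total += gap * gap
    return math.sqrt(total)

v = (0.1, 0.95)
q = (0.5, 0.5)
a = approximate(v)                      # two 4-bit cell numbers instead of two floats
bound = lower_bound_dist(q, a)
exact = math.sqrt(sum((x - y) ** 2 for x, y in zip(v, q)))
```

During a sequential scan, a vector whose lower bound already exceeds the current Kth-nearest distance can be skipped without fetching the full vector; the rest must be fetched and checked exactly.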
4 Basic Definition
Euclidean distance:
Relationship between data points:
Theorem 1: Let q be the query object, Oi be the reference point for partition i, and p an arbitrary point in partition i. If dist(p, q) <= querydist(q) holds, then it follows that dist(Oi, q) - querydist(q) <= dist(Oi, p) <= dist(Oi, q) + querydist(q).
In our work, we use Euclidean distance to measure the distance or similarity between two data points, and we also have the following relationships between data points. The last one is very important, as it allows us to make iDistance GEOMETRY-aware. From the pairwise relationships between the data points and the query point, we now state the theorem, which says that if a data point is within the query distance of the query point, then we can find that point via its reference point.
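The bound in Theorem 1 can be derived in two lines from the triangle inequality. The slide's own formulas are not preserved in the transcript, so the statement of the Euclidean distance below and the assumption that the "last" relationship is the triangle inequality are reconstructions, not quotations. Writing r for querydist(q) and assuming dist(p, q) <= r:

```latex
\begin{align*}
\mathrm{dist}(p,q) &= \sqrt{\sum_{j=1}^{d} (p_j - q_j)^2} \\
\mathrm{dist}(O_i,p) &\le \mathrm{dist}(O_i,q) + \mathrm{dist}(q,p) \le \mathrm{dist}(O_i,q) + r \\
\mathrm{dist}(O_i,q) &\le \mathrm{dist}(O_i,p) + \mathrm{dist}(p,q) \le \mathrm{dist}(O_i,p) + r
\end{align*}
```

Rearranging the last line gives dist(Oi, q) - r <= dist(Oi, p), and combining it with the second line gives exactly the two-sided bound of Theorem 1.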
5 Basic Concept of iDistance
Indexing points based on similarity: y = i * c + dist(Si, p)
Reference/anchor points
[Figure labels: d, S1, S2, S3; number line marked S1, S2, S3, ..., Sk, Sk+1 with spacing c, and S1+d]
My index is based on the fact that if we can define the relationship between a data point and a reference point, then I can retrieve this data point based on the relationship between the data point and the reference point, and the relationship between the reference point and the query point.
The basic principle of iDistance is to partition the data set based on the distance of the data points to some representative reference points. Once I know the distance, I can transform such a relationship into a single-dimensional value with respect to each reference point. The data point is represented by the formula y = i * c + dist(Si, p), where c is the max distance/radius.
NOW, suppose we have S1, ..., Sk as our reference points, and we associate data points with their closest reference points. Then the iDistance value of a data point can be obtained from its distance to its reference point, and these data points are in turn represented as values in the single-dimensional space.
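The mapping y = i * c + dist(Si, p) can be sketched in a few lines. The function name, the sample reference points and the value of c are assumptions for illustration; partitions are numbered from 0 here, and c is chosen larger than any distance that can occur in the unit square so that the key ranges of different partitions do not overlap.

```python
import math

def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def idistance_key(p, refs, c):
    """Return (i, y): the closest reference point's index and y = i * c + dist(S_i, p)."""
    i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
    return i, i * c + dist(refs[i], p)

refs = [(0.0, 0.0), (1.0, 1.0)]   # hypothetical reference points S_0, S_1
c = 2.0                            # assumed constant; exceeds the unit square's diagonal
i, y = idistance_key((0.0, 0.5), refs, c)   # closest to S_0: y = 0 * 2 + 0.5
j, z = idistance_key((1.0, 0.5), refs, c)   # closest to S_1: y = 1 * 2 + 0.5
```

Both sample points are 0.5 away from their reference point, yet they receive different keys because each partition owns its own interval of width c on the number line.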
6 iDistance
Data points are partitioned into clusters/partitions.
For each partition, there is a Reference Point that every data point in the partition makes reference to.
Data points are indexed based on similarity (metric distance) to such a point using a CLASSICAL B+-tree.
Iterative range queries are used in KNN searching.
In summary, our idea is very simple. First, we select reference points, either based on SAMPLING, data distribution, or some fixed strategies. Second, all data points are mapped into single-dimensional values based on such reference points. Third, these transformed data points are indexed using the CLASSICAL B+-tree, and the KNN query is answered using iterative range queries!
If queried: It works as the index is geometry-aware! Relationships between objects are captured via reference objects and the inequality relationship!
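The three steps can be sketched with a sorted list standing in for the B+-tree. That substitution, the class name `TinyIDistance`, and the sample data are assumptions made to keep the example short; a real deployment would use the DBMS's own B+-tree.

```python
import bisect
import math

def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class TinyIDistance:
    """Sorted list of (key, point) pairs; a stand-in for the B+-tree."""

    def __init__(self, points, refs, c):
        self.refs, self.c = refs, c
        entries = []
        for p in points:
            # Step 2: map each point to a one-dimensional key via its closest reference point.
            i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
            entries.append((i * c + dist(refs[i], p), p))
        entries.sort()                      # Step 3: index the keys in sorted order
        self.keys = [k for k, _ in entries]
        self.points = [p for _, p in entries]

    def key_range(self, lo, hi):
        """Return the points whose key lies in [lo, hi], like a B+-tree leaf scan."""
        start = bisect.bisect_left(self.keys, lo)
        end = bisect.bisect_right(self.keys, hi)
        return self.points[start:end]

refs = [(0.0, 0.0), (1.0, 1.0)]             # Step 1: reference points (chosen by hand here)
idx = TinyIDistance([(0.0, 0.5), (0.0, 0.1), (1.0, 0.5), (0.9, 1.0)], refs, c=2.0)
near_s0 = idx.key_range(0.0, 1.0)           # keys owned by partition 0
near_s1 = idx.key_range(2.0, 2.3)           # points within 0.3 of S_1
```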
7 KNN Searching
Searching region is enlarged till getting K NN.
[Figure labels: S1, S2; a range in B+-tree]
This figure illustrates how a query is enlarged till all KNNs are obtained. The shaded regions illustrate the ranges that need to be searched in the B+-tree. In this example, we search left and right, till the range that intersects the query sphere is checked. Note that, for each intersection (the maximum is limited by the number of reference points), we start with a traversal and search left and right of that value as we enlarge the search region.
If queried only: In the original proposal, the leaf nodes of B+-trees are linked left to right. However, recently, in order to support more applications, the leaf nodes are linked both left and right. For your reference, please refer to Raghu's textbook; Raghu Ramakrishnan is a professor at Wisconsin.
8 KNN Searching
Increasing search radius: r
[Figure labels: S1, S2, q, r, dist(S1, q), dist(S2, q), Dis_min(S1), Dis_max(S1), Dis_min(S2), Dis_max(S2)]
For each partition or reference point, we maintain an additional small structure to keep information about it. Information such as the maximum radius (Dis_max) is used in searching.
This example illustrates the EFFECTIVE search space. The bottom line (point) is the data range in the B+-tree. When we transform the query sphere into search ranges, we get two search ranges in this case: one for S1 and the other for S2. Note that the search of S2 is constrained by Dis_max (the maximum radius, defined by the furthest object from S2), and the search starts from RIGHT to LEFT. The search is efficient, as the effect is similar to any search of HIERARCHICAL indexes that capture spatial relationships. The index is geometry-aware. Further, ONE can SEE that the search is a normal but ITERATIVE range search of the B+-tree!
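Transforming a query sphere into per-partition key ranges can be sketched as follows, using the bound from Theorem 1 clipped to each partition's [Dis_min, Dis_max]. The function name and the sample values are assumptions; the small per-partition structure is modelled as two plain lists.

```python
import math

def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def search_ranges(q, r, refs, dist_min, dist_max, c):
    """Return (partition, low key, high key) for every partition the sphere touches."""
    ranges = []
    for i, s in enumerate(refs):
        d = dist(s, q)
        lo = max(dist_min[i], d - r)   # Theorem 1 lower bound, clipped to Dis_min
        hi = min(dist_max[i], d + r)   # Theorem 1 upper bound, clipped to Dis_max
        if lo <= hi:                   # otherwise the sphere misses this partition
            ranges.append((i, i * c + lo, i * c + hi))
    return ranges

refs = [(0.0, 0.0), (1.0, 1.0)]
dist_min, dist_max = [0.1, 0.1], [0.5, 0.6]   # hypothetical per-partition radii
q = (0.0, 0.4)
small = search_ranges(q, 0.2, refs, dist_min, dist_max, c=2.0)
wider = search_ranges(q, 0.6, refs, dist_min, dist_max, c=2.0)
```

With the small radius only the first partition is touched. With the wider radius the second partition appears as well, and its range ends at Dis_max, which is why that scan proceeds from right to left as the radius keeps growing.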
9 KNN Searching
[Figure label: Q2]
As noted in the previous slide, apart from the B+-tree, we maintain a small structure that stores information about each reference point and the radius that defines its data space. During KNN search, as the search sphere is ENLARGED, we have to check this structure to see if the data space of a new reference point is touched. If so, we need to do another traversal of the B+-tree. For example, initially we search Q1 using query point Q, and as we enlarge the query sphere, we have to search the Q2 range in the B+-tree from right to left.
If queried: The right-to-left search is because we first intersect the objects that are furthest away from Q2.
NOTE: this slide can be removed now...
10 Over Search?
Inefficient situation: When K = 3, the query sphere with radius r will retrieve the 3 NNs. Among them, only the NN o1 can be guaranteed. Hence the search continues with enlarged r till r > dist(q, o3).
[Figure labels: S, q, r, o1, o2, o3, dist(S, q)]
There is no doubt iDistance is efficient. However, there are situations where iDistance "over" searches, in the sense that after it gets the answers, it has to continue till the STOPPING CRITERION becomes TRUE. The answers come from areas not covered by the query sphere, but within the same distance range to the reference point. That is, the KNN search CANNOT guarantee that it has found the KNNs until a CERTAIN condition is TRUE (which we shall see later).
For example, if K = 3, we can get all answers when we first search the tree using a search radius of r. However, we CANNOT guarantee that we have all K NNs, and hence the search MUST continue with a bigger r till r is greater than the distance of the furthest of these objects from q.
11 Stopping Criterion
Theorem 2: The KNN search algorithm terminates when K NNs are found and the answers are correct.
Case 1: dist(furthest(KNN'), q) < r
Case 2: dist(furthest(KNN'), q) > r
[Figure labels: r, Kth ? (in Case 2)]
The theorem says that our search algorithm is correct, as it terminates when the correct KNNs are obtained. Let us see WHY. We actually have TWO cases to consider.
Case 1 occurs when all K points fall within the search region. If this is the case, the algorithm stops, since the distance of the furthest object from the query is less than r. The answers must be correct, since any other point has a distance to q bigger than r (it lies outside the search sphere). Of course, for each intersection, we have to finish searching that range!
Let us consider the second case. Due to the pinkish shaded area that needs to be searched as well, there may be points in the answer set S whose distance to q is greater than r. That is, there could be points nearer to q that have not been examined yet, and hence the search continues with a slightly enlarged search space (r + delta r). This continues till Case 1 becomes true.
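The search loop and its stopping criterion can be put together in one sketch. Several simplifications are assumed: a sorted list replaces the B+-tree, each round rescans its ranges from scratch instead of extending the previous scan left and right, only Dis_max is kept per partition, and `delta_r` is a hypothetical step size.

```python
import bisect
import math

def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build(points, refs, c):
    """Return sorted (key, point) entries and each partition's maximum radius."""
    entries, dist_max = [], [0.0] * len(refs)
    for p in points:
        i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
        d = dist(refs[i], p)
        dist_max[i] = max(dist_max[i], d)
        entries.append((i * c + d, p))
    entries.sort()
    return entries, dist_max

def knn(q, k, entries, dist_max, refs, c, delta_r=0.05):
    """Iterative range search; stops when the Kth candidate lies within radius r."""
    if k > len(entries):
        raise ValueError("k exceeds the number of data points")
    keys = [key for key, _ in entries]
    r = 0.0
    while True:
        r += delta_r
        candidates = []
        for i, s in enumerate(refs):
            d = dist(s, q)
            lo, hi = max(0.0, d - r), min(dist_max[i], d + r)
            if lo > hi:
                continue                    # sphere does not touch this partition
            start = bisect.bisect_left(keys, i * c + lo)
            end = bisect.bisect_right(keys, i * c + hi)
            candidates.extend(p for _, p in entries[start:end])
        best = sorted(candidates, key=lambda p: dist(p, q))[:k]
        # Case 1: the furthest of the K candidates is inside the sphere, so no
        # unexamined point can be closer. Otherwise (Case 2) keep enlarging r.
        if len(best) == k and dist(best[-1], q) <= r:
            return best

points = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9), (0.8, 0.7), (0.5, 0.5)]
refs = [(0.25, 0.25), (0.75, 0.75)]
entries, dist_max = build(points, refs, c=2.0)
answer = knn((0.0, 0.0), 2, entries, dist_max, refs, c=2.0)
```

In this small run the point (0.5, 0.5) is retrieved early because it shares a distance ring with the query region, although it is not among the two nearest neighbours. That is the over-search situation: the loop has to keep enlarging r until both true neighbours lie within the sphere.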
12 Space-based Partitioning: Equal-partitioning
[Figure captions: (centroid of hyperplane, closest distance); (external point, closest distance)]
Having proved that our algorithm stops, let us now consider how we can improve the search performance. To a large extent, the performance depends on how we partition the data space BY selecting the reference points. I shall examine some strategies here.
Straightforward partitioning strategies are based on selecting OBVIOUS data points as the reference points. For example, we can PICK the centroid of each hyperplane. In fact, we can pick some other point along the line from the centroid to the common top of the pyramid (FIRST FIGURE).
To better distribute the data points so that fewer data points are mapped into the same value, we should pick lines that are short on average. This motivates us to consider POINTS outside the data space. That is, we use EXTERNAL data points as our reference points (show the SECOND FIGURE).
If queried: The difference between the two is the number of points falling on the lines (representative values). The second approach will have flatter and shorter lines.
13 Space-based Partitioning: Equal-partitioning from furthest points
[Figure captions: (centroid of hyper-plane, furthest distance); (external point, furthest distance)]
Here, we shall take a look at two other cases of reference points. In this example, we choose the furthest internal/external points as reference points. The diagrams are similar to the previous diagrams we saw, but the reference points are on the opposite pyramids! By choosing different reference points, we can see that the number of objects being mapped to the same value (line) is different!
In fact, the pyramid tree space partitioning is ANOTHER subcase of our space partitioning. Unlike existing methods, our indexes are flexible ENOUGH to adopt different partitioning to handle different data distributions!
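The four space-based choices (centroid or external point, closest or furthest distance) can be sketched for a unit hypercube. The assumptions are loud here: the data space is taken to be [0, 1]^d, "hyperplane" is read as one of its 2d faces, an external point is obtained by pushing a face centroid outward by a hypothetical `offset`, and the function names are invented for the example.

```python
import math

def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def face_reference_points(d, offset=0.0):
    """One reference point per face of the unit hypercube.

    offset = 0 gives the face centroids; offset > 0 pushes each point
    outward along its axis, giving external reference points.
    """
    refs = []
    for axis in range(d):
        for side in (0.0, 1.0):
            ref = [0.5] * d
            ref[axis] = side - offset if side == 0.0 else side + offset
            refs.append(tuple(ref))
    return refs

def assign(p, refs, furthest=False):
    """Index of the closest (or, if furthest=True, the furthest) reference point."""
    pick = max if furthest else min
    return pick(range(len(refs)), key=lambda j: dist(refs[j], p))

centroids = face_reference_points(2)
external = face_reference_points(2, offset=0.5)
p = (0.1, 0.5)
closest = assign(p, centroids)                    # face on the near side
opposite = assign(p, centroids, furthest=True)    # face on the opposite side
```

With the furthest-distance rule, a point near one face is indexed against the reference point on the opposite face, which matches the remark that the reference points sit on the opposite pyramids.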
14 Effect of Reference Points on Query Space
Using external point to reduce searching area
Now, let us see the effect of using an external point on the query space. In this example, suppose the green circle is the region we have to search for KNNs; we can then restrict the search space to the area bounded by the THICK lines. It is obvious that for iDistance to work effectively, we must reduce the redundant search area!
15 Effect on Query Space
The area bounded by these arcs is the affected searching area.
Using (centroid, furthest distance) can greatly reduce the search area.
For the same search query, if we use the furthest external centroid as the reference point, we can further reduce the search area!
16 Data-based Partitioning I
Using cluster centroids as reference points
[Figure: axis ticks 1.0, 0.70, 0.31 and 0.20, 0.67, 1.0]
Very often, data form clusters. So, the question is: why do we NOT partition the data space based on clusters? Again, to iDistance, it is just another optimization strategy.
To select reference points based on data clusters, we could choose the centroid of a cluster as a reference point. NOTICE that EVEN when we do that, the data spaces DO NOT overlap, as illustrated in the diagram. Why? This is because we choose the partition for each database point based on the closest reference point! Hence NO data point BELONGS to multiple data subspaces or partitions!
17 Data-based Partitioning II
Using edge points as reference points
[Figure: axis ticks 1.0, 0.70, 0.31 and 0.20, 0.67, 1.0]
Alternatively, for each cluster, we can use a data point along the hyperplane as the reference point. An obvious choice is the CORNER or EDGE point!
There are other alternatives or cases to consider, such as using the tip of the pyramid as the reference point. The main criterion is to spread the distribution as much as possible, and to have the data subspaces as small as possible. However, this depends a lot on the distribution, hence data sampling is required to determine which is the better partitioning technique.
18 Performance Study: Effect of Search Radius
[Figure: Effect of search radius on query accuracy; panels Dimension = 8, Dimension = 16, Dimension = 30; 100K uniform data set; using (external point, furthest distance)]
I have done an extensive performance study using data sets of different distributions, different data sizes and different numbers of dimensions. I implemented the index in C on a SUN SPARC machine. Here, I shall only present some representative results.
First, we want to see the effect of search radius on the accuracy of answers. We gradually enlarge the search radius and find the data points that contribute to the KNNs. We record the answers obtained so far against the final K to get the accuracy.
Actually, apart from the data points that are NOT meant to be in the search sphere (due to the stripe, as in Case 2 of the Stopping Criterion slide), the data points that fall inside the search region are optimal. That is, if we had to enlarge the search sphere in the data space without any index structure, we should get the same NNs.
NOTE: people may ask why we never compared against other methods. Say that we did, and have compared against the A-tree; the results will be included in the paper. The A-tree (like the R-tree, but using relative coordinates) has been shown to be more efficient than the VA-file. Our preliminary results show that if the same approach to implementation is taken (node addresses are kept elsewhere, and smaller addresses are used), iDistance is more efficient.
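The accuracy measure described here, answers obtained so far against the final K, can be written down in a couple of lines. The function name and the sample points are assumptions; the talk does not show how the measurement was coded.

```python
def accuracy(found_so_far, final_knn):
    """Fraction of the final K nearest neighbours already present in the current answers."""
    final = set(final_knn)
    return len(final.intersection(found_so_far)) / len(final)

final_knn = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (0.4, 0.4)]
at_small_radius = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9)]   # one candidate is a false drop
score = accuracy(at_small_radius, final_knn)
```

A candidate that is later displaced from the answer set, such as (0.9, 0.9) above, does not count toward the accuracy at that radius.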
19 I/O Cost vs Search Radius
[Figure: Effect of search radius on query cost; 10-NN queries on 100K uniform data sets; using (external point, furthest distance)]
In this experiment, we want to see the efficiency of iDistance for different numbers of dimensions: 8, 16 and 30. As the number of dimensions increases, the query cost increases, and so does the search radius required to get the answers. This is due to the expansion of the data space with increasing dimensionality, and the increase in the number of leaf nodes due to the increase in vector storage.
The radius shown is the radius required to retrieve all KNNs correctly. The program terminates as soon as the conditions defined in Theorem 2 become true and all data points within the same "search area" have been examined.
20 Effect of Reference Points
[Figure: Different Reference Points; 10-NN queries on 100K 30-d uniform data set]
NOTE: Maybe get rid of this SLIDE
21 Effect of Clustered # of Partitions on Accuracy
[Figure: Effect of query radius on query accuracy for different partition number; KNN queries on 100K 30-d clustered data set]
Having seen the performance of iDistance on uniform data sets, let us now see the effect of clustered data sets. For this experiment we used 100K 30-dimensional clustered data sets. I ran various experiments using different K (1, 10, 20, 100) and two different numbers of partitions: 20 and 50. The difference in performance is in fact VERY subtle. Bigger K and a smaller number of partitions result in less accuracy, but the difference is NOT significant!
22 Effect of # of Partitions on I/O and CPU Cost
[Figure: Effect of # of partitions on I/O and CPU Costs; 10-NN queries on 100K 30-d clustered data set]
For the same data set, we now want to examine the efficiency of the index in terms of I/O and CPU costs. We note that a BIGGER number of clusters, or smaller clusters in other words, yields better performance. This is because a smaller partition has a smaller data space, and hence fewer data points share the same "idistance" value. So, fewer false drops result!
An interesting observation is that BOTH I/O and CPU costs exhibit the same trend. They depend on the number of pages read and the number of data points being examined!
23 Effect of Data Sizes
[Figure: Effect of query radius on query accuracy for different size of data sets; KNN queries on 100K, 500K 30-d clustered data sets]
NEXT, we want to see the EFFECT of data size. Here we increased the number of data points to 500K, and again, we observe that the accuracy is HARDLY affected by the data size. In fact, a bigger data size provides better accuracy, due to the density of the data space!
24 Effect of Clustered Data Sets
[Figure: Effect of query radius on query cost for different size of data set; 10-KNN query on 100K, 500K 30-d clustered data sets]
Let us examine the performance gain of iDistance over the sequential scan. For large data sets, the gain is widened, although the percentage of the gain is fairly similar in both cases. Here, we can REALLY appreciate the performance of iDistance.
Note that the performance can be FURTHER improved, as in other indexes where NO real pointers are stored or bigger NODES are used.
25 Effect of Reference Points on Clustered Data Sets
[Figure: Effect of Reference Points: Cluster Edge vs Cluster Centroid; 10-KNN query on 100K 30-d clustered data set]
Now, let us further examine the effect of clustered data sets. In this experiment, we used two different approaches to selecting reference points. Surprisingly, using the centroid of a cluster does not perform as well as using the edge or corner point near the cluster.
WHY? This is due to the query's effective search space. Remember, when we use the centroid, we have a spherical search space, and all points falling on the same ring have the same "idistance" value. But when we use the EDGE, the line is flatter and shorter in length, and hence there are FEWER false drops. FEWER FALSE drops mean fewer pages to be fetched and fewer objects to be examined!
If queried: The effectiveness of the partitioning strategy is based on the density of data points per "idistance" value, and hence a method with small overall density is likely to cause fewer FALSE drops and hence give BETTER performance.
Density == average # of data points falling on the SAME idistance value
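The density measure can be computed directly from a list of keys. Because keys are real numbers, the sketch groups them by rounding to a fixed number of decimals; that rounding, the `ndigits` parameter and the function name are assumptions for the example, since the talk does not say how "the SAME idistance value" was determined.

```python
from collections import Counter

def density(keys, ndigits=3):
    """Average number of data points per distinct (rounded) idistance value."""
    counts = Counter(round(k, ndigits) for k in keys)
    return sum(counts.values()) / len(counts)

ring_heavy = [0.5, 0.5, 0.5004, 2.1]     # three points share one ring
spread_out = [0.1, 0.5, 2.1, 2.5]        # every point has its own value
d_ring = density(ring_heavy)
d_spread = density(spread_out)
```

A lower density means fewer points share a key, so a range scan over the keys returns fewer candidates that have to be checked and discarded.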
26 iDistance ideal for Approximate KNN?
[Figure: Query cost for variant query accuracy on different size of data set; 10-KNN query on 100K, 500K 30-d clustered data sets]
Here, we show the accuracy against the I/O cost. We note that due to the COMPLEXITY of KNN search, and the fact that in MOST applications small errors can be TOLERATED or are UNNOTICEABLE (such as image retrieval), APPROXIMATE KNN search becomes an acceptable compromise and an acceptable approach to KNN queries.
From the results, we TAKE NOTE that iDistance is VERY VERY efficient if we can relax the requirement on accuracy, for example to 80-90%. Further, the ITERATIVE approach adopted by iDistance is a good candidate for INTERNET-based PROGRESSIVE query processing: initial, less ACCURATE KNNs are given, and they are improved as more processing is DONE.
27 Performance Study -- Compare iMinMax and iDistance
[Figure: 10-KNN query on 100K 30-d clustered data sets]
C. Yu, B. C. Ooi, K. L. Tan. Progressive KNN search Using B+-trees.
I also extended iMinMax for approximate KNN search. Here, we performed some tests using 30-dimensional clustered data sets, and I compare it against two variants of iDistance.
Due to the way iMinMax was designed, it is NOT as efficient as iDistance! However, it is STILL comparatively very efficient, and UNLIKE most KNN or RANGE indexes, iMinMax can now support both KNN and window search. This is important, as most indexes designed for window queries do NOT capture similarity relationships, and most KNN indexes do NOT index based on attributes and hence cannot support searches based on individual attributes.
30 Summary of iDistance
iDistance is simple, but efficient.
It is a Metric based Index.
The index can be integrated into existing systems easily.
In summary, I have proposed a NOVEL index for KNN search called iDistance. It is simple and EFFICIENT, and it can be integrated easily into existing DBMSs or used at the application layer (as in the Geofoto.com image search engine)!