When Is Nearest Neighbors Indexable? Uri Shaft (Oracle Corp.) Raghu Ramakrishnan (UW-Madison)


1 When Is Nearest Neighbors Indexable? Uri Shaft (Oracle Corp.) Raghu Ramakrishnan (UW-Madison)

2 Motivation - Scalability Experiments Dozens of papers describe experiments on index scalability as the number of dimensions increases. –Constants: number of data points; data and query distribution; index structure / search algorithm. –Variable: number of dimensions. –Measurement: performance of the index.

3 Example From PODS 1997

4 [figure]

5 Motivation In many cases the conclusion is that the empirical evidence suggests the index structures do scale with dimensionality. We would like to investigate these claims mathematically – supply a proof of scalability or non-scalability.

6 Historical Context Continues the work done in When Is Nearest Neighbors Meaningful? (ICDT 1999). The previous work was about the behavior of distance distributions; this work is about the behavior of indexing structures under similar conditions.

7 Contents Vanishing Variance property Convex Description index structures Indexing Theorem –The performance of CD index does not scale for VV workloads using Euclidean distances. Conclusion Future Work

8 Vanishing Variance Same definition used in the ICDT 1999 work (although not named in that work). In 1999 we showed that such workloads become meaningless – the ratios of distances between the query and the various data points become arbitrarily close to 1, so the contrast between nearest and farthest neighbor vanishes. We use the same result here.

9 Vanishing Variance A scalability experiment contains a series of workloads W1, W2, …, Wm, … –m is the number of dimensions –each workload Wm has n data points and a query point (drawn from the same distribution) –the distance from the query to a data point is distributed as Dm Vanishing Variance: lim(m→∞) Var( Dm / E[Dm] ) = 0
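The Vanishing Variance condition is easy to observe empirically. The sketch below (the function name rel_variance and the uniform workload are illustrative choices, not from the talk) estimates Var(Dm / E[Dm]) for uniform data and shows it shrinking as the dimension grows:

```python
import math
import random

def rel_variance(dim, n=500, seed=0):
    """Estimate Var(D / E[D]) for Euclidean distances from one random
    query to n uniform points in [0,1]^dim."""
    rng = random.Random(seed)
    q = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))))
    mean = sum(dists) / n
    # Variance of the normalized distance D / E[D]
    return sum((d / mean - 1.0) ** 2 for d in dists) / n

# The relative variance drops sharply with dimension for this workload:
print(rel_variance(2), rel_variance(200))
```

With a workload like this, rel_variance(200) is orders of magnitude smaller than rel_variance(2), matching the limit in the definition.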

10 Contents Vanishing Variance property Convex Description index structures Indexing Theorem –The performance of CD index does not scale for VV workloads using Euclidean distances. Conclusion Future Work

11 Convex Description Index Data points are distributed into buckets (e.g., disk pages). Access to a bucket is all or nothing. We allow redundancy. A bucket contains at least two data points. Each bucket is associated with a description – a convex region containing all data points in the bucket. The search algorithm accesses at least all buckets whose convex region is closer to the query than the nearest neighbor. The cost of a search is the number of data points retrieved.
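A minimal sketch of this access pattern, assuming axis-aligned bounding rectangles as the convex descriptions (the names mindist and nn_search are illustrative, not from the talk):

```python
import math

def mindist(rect, q):
    """Min Euclidean distance from query q to an axis-aligned rectangle
    rect = (lo, hi), the bucket's convex description."""
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                         for x, lo, hi in zip(q, *rect)))

def nn_search(buckets, q):
    """NN search over (rect, points) buckets. A CD index must read every
    bucket whose region is closer than the nearest neighbor found."""
    best, cost = float("inf"), 0
    for rect, pts in sorted(buckets, key=lambda b: mindist(b[0], q)):
        if mindist(rect, q) >= best:
            break                # region no closer than current NN: skip
        cost += len(pts)         # all-or-nothing bucket access
        for p in pts:
            best = min(best, math.dist(p, q))
    return best, cost
```

Here cost counts retrieved data points, matching the cost model on the slide; the lower bound on accessed buckets is what the theorem later exploits.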

12 Example: R-Tree Buckets are disk pages. Under normal construction, buckets contain more than two data points each. Bucket descriptions are convex and contain all data points (bounding rectangles). The search algorithm accesses all buckets whose convex region is closer than the nearest neighbor (and usually a few more).

13 Convex Description Indexes All R-Tree variants X-Tree M-Tree kdb-Tree SS-Tree and SR-Tree Many more

14 Other indexes (non-CD) Probability structures (P-Tree, VLDB 2000) –Access is based on clusters; a near-enough bucket may not be accessed. Projection indexes (like the VA-file) –Compression structures. –All data points are accessed in pieces, not in buckets.

15 Contents Vanishing Variance property Convex Description index structures Indexing Theorem –The performance of CD index does not scale for VV workloads using Euclidean distances. Conclusion Future Work

16 Indexing Theorem If: –the scalability experiment uses a series of workloads with Vanishing Variance –the distance metric is Euclidean –the indexing structure is Convex Description Then: –the expected cost of a query converges to the number of data points – i.e., a linear scan of the data
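The theorem's prediction can be checked with a small self-contained simulation (the function name scan_fraction, the bucketing by sorted order, and the uniform workload are illustrative assumptions, not from the talk): partition points into buckets, take each bucket's minimum bounding rectangle as its convex description, and count the fraction of buckets that lie closer to the query than the true nearest neighbor, i.e., buckets a CD index must read.

```python
import math
import random

def scan_fraction(dim, n=200, bucket_size=4, seed=0):
    """Fraction of MBR buckets whose region is closer to the query than
    the nearest neighbor - the buckets a CD index is forced to access."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    q = [rng.random() for _ in range(dim)]
    nn = min(math.dist(p, q) for p in pts)
    pts.sort()  # crude locality: group lexicographically sorted points
    buckets = [pts[i:i + bucket_size] for i in range(0, n, bucket_size)]
    hit = 0
    for b in buckets:
        lo = [min(p[j] for p in b) for j in range(dim)]
        hi = [max(p[j] for p in b) for j in range(dim)]
        md = math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                           for x, l, h in zip(q, lo, hi)))
        if md < nn:
            hit += 1
    return hit / len(buckets)
```

In low dimensions only a few buckets qualify, while in high dimensions essentially every bucket does, so the expected cost approaches a linear scan as the theorem states.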

17 Sketch of Proof Because of Vanishing Variance, the ratios of distances between the query and the various data points become arbitrarily close to 1. When using the Euclidean distance, we can look at an arbitrary data bucket and a query point, choose two data points from the bucket, and form a triangle:

18 [Figure: a bucket containing data points D1 and D2, a query point Q, and Y the midpoint of the segment D1D2.] The distances from Q to D1, D2, …, Dn are all about the same, say d. The distance from Q to Y is noticeably smaller – for a near-equilateral triangle it is about (√3/2)d ≈ 0.87d. Since Y lies in the bucket's convex region, the distance from Q to the data bucket is less than the distance to the nearest neighbor.
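The triangle step above is elementary to verify: if |QD1| = |QD2| = |D1D2| = d, the Pythagorean theorem applied to the midpoint Y of D1D2 gives |QY| = √(d² − (d/2)²) = (√3/2)d < d (the function name midpoint_dist is illustrative).

```python
import math

def midpoint_dist(d):
    """Distance from the apex Q of an equilateral triangle with side d
    to the midpoint Y of the opposite side D1 D2."""
    return math.sqrt(d * d - (d / 2) ** 2)

# For any side length, the midpoint is strictly closer than the vertices,
# so the convex region containing D1 and D2 is closer than either point.
```

Because the distance ratios converge to 1, every bucket's triangle is eventually near-equilateral, and this strict gap forces every bucket to be accessed.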

19 Contents Vanishing Variance property Convex Description index structures Indexing Theorem –The performance of CD index does not scale for VV workloads using Euclidean distances. Conclusion Future Work

20 Conclusion Dozens of papers describe experiments on index scalability as the number of dimensions increases. We wanted to investigate these claims mathematically – supply a proof of scalability or non-scalability. We proved that the index structures in many of these experiments do not scale in dimensionality.

21 Conclusion Use this theorem to channel indexing research into more useful and practical avenues. Review previous results accordingly.

22 Future Work Remove the restriction of at least two data points per bucket. –An easy exercise; need to take into account the cost of traversing a hierarchical data structure. Investigate other Lp metrics. Investigate projection indexes under the Euclidean metric (it looks like they do not scale either).

23 Future Work Find a scalable indexing structure for uniform data and the L∞ metric. –Hint: use compression. Find the number of data points needed for an R-Tree to be practical on uniform data with the L2 metric. –Approx:

24 Questions

