When Is Nearest Neighbors Indexable? Uri Shaft (Oracle Corp.) Raghu Ramakrishnan (UW-Madison)

Motivation - Scalability Experiments
Dozens of papers describe experiments on index scalability as the number of dimensions increases.
- Constants: number of data points; data and query distribution; index structure / search algorithm
- Variable: number of dimensions
- Measurement: performance of the index

Example From PODS 1997

Motivation
In many cases the conclusion is that the empirical evidence suggests the index structures do scale with dimensionality. We would like to investigate these claims mathematically, i.e., supply a proof of scalability or non-scalability.

Historical Context
Continues the work done in "When Is 'Nearest Neighbor' Meaningful?" (ICDT 1999). The previous work studied the behavior of distance distributions; this work studies the behavior of indexing structures under similar conditions.

Contents
- Vanishing Variance property
- Convex Description index structures
- Indexing Theorem: the performance of a CD index does not scale for VV workloads using Euclidean distances
- Conclusion
- Future Work

Vanishing Variance
Same definition as used in the ICDT 1999 work (although it was not named there). In 1999 we showed that such workloads become meaningless: the ratios of distances between the query and the various data points become arbitrarily close to 1, so the contrast between the nearest and farthest neighbors vanishes. We use the same result here.

Vanishing Variance
A scalability experiment contains a series of workloads W_1, W_2, …, W_m, …
- m is the number of dimensions
- each workload W_m has n data points and a query point (all drawn from the same distribution)
- the distance distribution is denoted D_m
Vanishing Variance: lim_{m→∞} Var(D_m / E[D_m]) = 0
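The condition can be probed empirically. Below is a minimal sketch (not from the talk) that assumes an i.i.d. uniform workload, a standard example of Vanishing Variance; all variable names are mine:

```python
import numpy as np

# Empirical check of Vanishing Variance for i.i.d. uniform workloads:
# Var(D_m / E[D_m]) should shrink as the dimension m grows, and the
# farthest-to-nearest distance ratio should approach 1.
rng = np.random.default_rng(0)
n = 2000  # data points per workload

for m in (2, 16, 128, 1024):
    data = rng.random((n, m))                  # workload W_m
    query = rng.random(m)
    d = np.linalg.norm(data - query, axis=1)   # samples of D_m
    print(f"m={m:5d}  Var(D_m/E[D_m])={np.var(d / d.mean()):.5f}  "
          f"max/min={d.max() / d.min():.3f}")
```

As m grows, the normalized variance drops toward 0 and the distance contrast disappears, which is exactly the regime the theorem below addresses.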

Convex Description Index
- Data points are distributed into buckets (e.g. disk pages). Access to a bucket is all or nothing. We allow redundancy.
- A bucket contains at least two data points.
- Each bucket is associated with a description: a convex region containing all the data points in the bucket.
- The search algorithm accesses at least all buckets whose convex region is closer to the query than the nearest neighbor.
- The cost of a search is the number of data points retrieved.

Example: R-Tree
Buckets are disk pages. Under normal construction, buckets contain more than two data points each. Bucket descriptions are convex and contain all the data points (bounding rectangles). The search algorithm accesses all buckets whose convex region is closer than the nearest neighbor (and probably a few more).
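This access rule and cost model can be sketched in code. The following is a minimal illustration, not the paper's algorithm: buckets are described by bounding rectangles, visited in order of their minimum distance to the query, and read whenever their region is closer than the best nearest neighbor found so far. `mindist` and `cd_search_cost` are hypothetical helper names.

```python
import numpy as np

def mindist(q, lo, hi):
    """Distance from query q to the nearest point of the box [lo, hi]."""
    return np.linalg.norm(q - np.clip(q, lo, hi))

def cd_search_cost(q, buckets):
    """Return (nn_distance, points_read) for a Convex-Description search.

    buckets: list of point arrays; each bucket's convex description is
    its bounding rectangle.  A bucket is read (all or nothing) whenever
    its region is closer than the best NN distance seen so far.
    """
    best, cost = np.inf, 0
    descr = [(pts, pts.min(axis=0), pts.max(axis=0)) for pts in buckets]
    for pts, lo, hi in sorted(descr, key=lambda b: mindist(q, b[1], b[2])):
        if mindist(q, lo, hi) >= best:
            break                     # no remaining bucket can contain the NN
        cost += len(pts)              # all-or-nothing bucket access
        best = min(best, np.linalg.norm(pts - q, axis=1).min())
    return best, cost
```

With well-separated low-dimensional clusters only the nearby bucket is read; the theorem below says that under Vanishing Variance the cost approaches the full data size.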

Convex Description Indexes
- All R-Tree variants
- X-Tree
- M-Tree
- K-D-B-Tree
- SS-Tree and SR-Tree
- Many more

Other Indexes (non-CD)
- Probability structures (P-Tree, VLDB 2000): access is based on clusters; a near-enough bucket may not be accessed.
- Projection indexes (like the VA-file): compression structures; all data points are accessed in pieces, not in buckets.

Indexing Theorem
If:
- the scalability experiment uses a series of workloads with Vanishing Variance,
- the distance metric is Euclidean, and
- the indexing structure is a Convex Description index,
then the expected cost of a query converges to the number of data points, i.e., a linear scan of the data.
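The theorem's prediction can be observed in a small simulation. This is a sketch under an assumed uniform workload (one instance of Vanishing Variance), with buckets formed by clustering on the first coordinate; `frac_points_read` is a hypothetical helper name:

```python
import numpy as np

def frac_points_read(m, n=512, bucket_size=8, seed=2):
    """Fraction of data points a correct CD search must read."""
    rng = np.random.default_rng(seed)
    data = rng.random((n, m))
    q = rng.random(m)
    d_nn = np.linalg.norm(data - q, axis=1).min()
    # Cluster points along the first coordinate, bucket them in runs,
    # and describe each bucket by its bounding rectangle.
    data = data[np.argsort(data[:, 0])]
    read = 0
    for i in range(0, n, bucket_size):
        pts = data[i:i + bucket_size]
        lo, hi = pts.min(axis=0), pts.max(axis=0)
        if np.linalg.norm(q - np.clip(q, lo, hi)) < d_nn:
            read += len(pts)   # region closer than the NN: must be accessed
    return read / n

for m in (2, 16, 256):
    print(f"m={m:4d}  fraction of data read: {frac_points_read(m):.2f}")
```

In low dimension only a few buckets lie closer than the nearest neighbor; in high dimension essentially every bucket does, so the search degenerates to a linear scan.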

Sketch of Proof
Because of Vanishing Variance, the ratio of distances between the query and the various data points becomes arbitrarily close to 1. When using Euclidean distance, we can look at an arbitrary data bucket and a query point, choose two data points from the bucket, and form a triangle:

[Figure: a bucket containing D_1 and D_2; Y is the point of the segment D_1 D_2 closest to the query Q.]
The distances from Q to D_1, D_2, …, D_n are all about the same. The distance from Q to Y is much smaller. Therefore, the distance from Q to the data bucket is less than the distance to the nearest neighbor.
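This triangle step can be illustrated numerically (a sketch of the phenomenon, not the paper's proof): once distances concentrate, the distance from Q to the segment between two bucket points drops below the nearest-neighbor distance, forcing the bucket to be accessed. `dist_to_segment` is a hypothetical helper name.

```python
import numpy as np

def dist_to_segment(q, a, b):
    """Euclidean distance from q to the segment [a, b] (via the foot point Y)."""
    ab = b - a
    t = np.clip(np.dot(q - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(a + t * ab - q)

rng = np.random.default_rng(1)
n = 500
for m in (2, 64, 4096):
    data = rng.random((n, m))
    q = rng.random(m)
    d_nn = np.linalg.norm(data - q, axis=1).min()
    seg = dist_to_segment(q, data[0], data[1])   # an arbitrary two-point "bucket"
    print(f"m={m:5d}  d(Q, segment)={seg:8.3f}  d(Q, NN)={d_nn:8.3f}  "
          f"bucket closer than NN: {seg < d_nn}")
```

In low dimension the nearest neighbor is typically far closer than an arbitrary segment; in high dimension the inequality flips, which is the heart of the theorem.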

Conclusion
Dozens of papers describe experiments about index scalability with increasing dimensions. We wanted to investigate these claims mathematically, i.e., supply a proof of scalability or non-scalability. We proved that the indexes in many of these experiments do not scale with dimensionality.

Conclusion
Use this theorem to channel indexing research into more useful and practical avenues, and review previous results accordingly.

Future Work
- Remove the restriction of at least two data points per bucket. An easy exercise; one needs to take into account the cost of traversing a hierarchical data structure.
- Investigate other L_p metrics.
- Investigate projection indexes using the Euclidean metric (it looks like they do not scale either).

Future Work (continued)
- Find a scalable indexing structure for uniform data and the L_∞ metric. Hint: use compression.
- Find the number of data points needed for an R-Tree to be practical on uniform data with the L_2 metric. Approx: [formula missing from transcript]

Questions