
1 Multimedia and Time-Series Data When Is “Nearest Neighbor” Meaningful? Group members: Terry Chan, Edward Chu, Dominic Leung, David Mak, Henry Yeung, Jason Yeung CSIS 7101 Advanced Database

2 Contribution  As dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor  The distinction between nearest and farthest neighbors may blur with as few as 15 dimensions  A linear scan often outperforms specialized nearest neighbor indexing techniques

3 Part One Introduction to Nearest Neighbor

4 What is the nearest neighbor (NN) problem?  Given a collection of data points and a query point in an m-dimensional metric space, find the data point that is closest to the query point

5 Nearest neighbor algorithm
1. Make two sets of nodes, set A and set B, and put all nodes into set B
2. Put your starting node into set A
3. Pick the node that is closest to the last node placed in set A and is not yet in set A; put this closest neighboring node into set A
4. Repeat step 3 until all nodes are in set A
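The four steps above can be sketched in Python (a minimal, illustrative version; the function name and example points are our own):

```python
import math

def nearest_neighbor_order(points, start=0):
    """Greedy NN chain: repeatedly move to the closest not-yet-visited point."""
    unvisited = set(range(len(points)))   # set B: all nodes
    order = [start]                       # set A: starts with the starting node
    unvisited.remove(start)
    while unvisited:                      # repeat until all nodes are in set A
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(points[i], last))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

# Four points on a line: the greedy chain visits them left to right
print(nearest_neighbor_order([(0, 0), (3, 0), (1, 0), (2, 0)]))  # → [0, 2, 3, 1]
```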

6 Query point and its nearest neighbor [figure: data points, the query point, and its NN]

7 Practical Applications of NN Search  Medical Imaging  Molecular Biology  Spatial and Multimedia databases

8 Adaptable Similarity Approach  In a multimedia setting, given an image database  one may want to retrieve all images that are similar to a given query image  the data domain is high-dimensional

9 Color-oriented similarity of images  On Image Database

10 Shape-oriented similarity of images  Aims at the level of individual pixels

11 Shape-oriented similarity of 3-D objects  On 3-D protein database

12 Approximation-based Shape similarity of 3-D surface segments  Measures the similarity of 3-D segments by using geometric approximations

13 Exception Case  The difference between the distance to the nearest neighbor and the distance to any other point in the data set is small  We call this an unstable query

14 Part Two Nearest Neighbor in High-Dimensional Space

15 Unstable query  A nearest neighbor query is unstable for a given ε if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor

16 NN in High-Dimensional Space  The paper proves that the concept of NN becomes meaningless as dimensionality (m) increases  If a pre-condition holds: as m increases, the difference in distance between the query point and all data points becomes negligible (i.e., the query becomes unstable)
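This effect can be seen in a small Monte Carlo sketch (our own illustrative code, not the paper's experiment): for IID uniform data, the ratio of the farthest to the nearest distance collapses toward 1 as m grows.

```python
import math
import random

def contrast(m, n=1000, seed=0):
    """Ratio of farthest to nearest distance from one query point
    to n IID uniform data points in m dimensions."""
    rng = random.Random(seed)
    q = [rng.random() for _ in range(m)]
    dists = [math.dist([rng.random() for _ in range(m)], q) for _ in range(n)]
    return max(dists) / min(dists)

# The farthest/nearest ratio shrinks toward 1 as dimensionality grows
for m in (2, 15, 100, 1000):
    print(m, round(contrast(m), 2))
```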

17 Assumptions for the pre-condition to hold  The data distribution and query distribution are IID in all dimensions  It also holds for unique dimensions with correlation between all dimensions

18 What is IID?  Independent and identically distributed  It means that the distribution of values in each dimension is identical (i.e., all values are uniformly distributed, or all dimensions have the same skew) and independent

19 High-dimensional indexing can be meaningful  When the dimensions of both the query point and data points follow identical distributions but are completely dependent (e.g., the value in D1 = the value in D2 = …)  The result is a set of data points and a query point on a diagonal line  The underlying query can actually be converted to a 1-D NN problem
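A minimal sketch of this reduction (names and data are our own, for illustration): with fully dependent dimensions, every point is determined by a single value, so the m-D query returns the same answer as a 1-D query.

```python
import math

def nn_index(points, q):
    """Index of the data point closest to query q (brute force)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], q))

def nn_index_1d(values, qv):
    return min(range(len(values)), key=lambda i: abs(values[i] - qv))

m = 50
values = [0.1, 0.4, 0.9]                # one degree of freedom per point
points = [[v] * m for v in values]      # every dimension repeats that value
query = [0.5] * m                       # the query also lies on the diagonal

# The 50-D answer matches the 1-D answer
assert nn_index(points, query) == nn_index_1d(values, 0.5) == 1
```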

20 Graphical View [figure: X, Y, Z axes]  All dimensions have the same value  All data points lie on the diagonal

21 High-dimensional indexing can be meaningful (Cont’d)  The underlying dimensionality is much lower than the actual dimensionality  E.g., it is a 3-D data set, but the data always have the same Z coordinate

22 High-dimensional indexing can be meaningful (Cont’d)  When the query point is within some small distance of a data point (instead of being required to be identical to a data point)  The result set of the query is all points within the closest cluster, not just the nearest point

23 NN query in clustered data [figure: query point and its nearest cluster]  E.g., data falls into discrete classes or clusters in some potentially high-dimensional feature space

24 Distribution of distances in clustered data [figure: points in the same cluster are close (NN meaningful); points in other clusters are all far]
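An illustrative sketch of that distance distribution (our own synthetic setup): with two tight, well-separated clusters, the nearest distance stays far below the distance to the other cluster even at 100 dimensions, so the NN query remains stable.

```python
import math
import random

# Two tight Gaussian clusters in 100 dimensions, far apart
rng = random.Random(1)
m = 100
centers = ([0.0] * m, [10.0] * m)
points = [[c + rng.gauss(0, 0.1) for c in centers[i % 2]] for i in range(200)]
query = [rng.gauss(0, 0.1) for _ in range(m)]   # query lands near cluster 0

dists = sorted(math.dist(p, query) for p in points)
# Nearest distance is much smaller than the distance to the other cluster
print(dists[0], dists[-1])
```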

25 Experimental studies of NN  Want to find out the rate of convergence  Based on three synthetic workloads and one real data set  NN can become unstable with as few as 10 to 20 dimensions  The graph shows an exponential trend  In reality, the dimensionality might be 1000

26 Correlated Distributions [figures: recursive and uniform workloads (NN not meaningful); two-degrees-of-freedom workload (NN meaningful)]

27 Part Three Linear Scan is powerful …

28 NN indexing VS linear scan  A linear scan can handily beat NN indexing  NN indexing is meaningful when the data consists of small, well-formed clusters  And the query is guaranteed to land in or very near one of these clusters

29 Why linear scan  Reading a set of sequentially arranged disk pages is much faster than unordered retrieval of the same pages  Fetching a large number of data pages through a multi-dimensional index usually results in unordered retrieval
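A linear scan is just one sequential pass that keeps the best candidate seen so far; a minimal sketch (the function name is ours):

```python
import math

def linear_scan_nn(points, q):
    """One sequential pass over the data, keeping the best candidate so far."""
    best_i, best_d = -1, float("inf")
    for i, p in enumerate(points):       # data is read in storage order
        d = math.dist(p, q)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

# The point at index 2 is closest to the query (0.9, 0.9)
print(linear_scan_nn([(0, 0), (5, 5), (1, 1)], (0.9, 0.9)))
```

Because the pass touches every record exactly once and in storage order, it benefits from the sequential I/O advantage the slide describes.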

30 Linear scan outperforms  Both the SS-tree and the R*-tree at 10 dimensions in all cases  The SR-tree in all cases on a 16-dimensional synthetic data set

31 Justification  All the reported performance studies examined situations in which the distance from the query point to its NN differed little from the distances to other data points  In reality, it might be different

32 Other related work  Dimensionality Curse  Fractal Dimensions

33 Dimensionality Curse  A vague indication that high dimensionality causes problems in some situations  Examples:  The NN problem  “Boundary effects” not taken into account in NN queries in the high-dimensional case

34 Fractal Dimensions  A measure of how "complicated" a self-similar figure (data set) is  NN queries become stable when the fractal dimensionality is low  In reality, many real data sets do not exhibit fractal behavior

35 Conclusion  The effect of dimensionality on NN queries  High-dimensional indexing can be meaningful  How to evaluate an NN workload  A linear scan outperforms NN processing techniques on some meaningful workloads

36 References  Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft. When Is “Nearest Neighbor” Meaningful?  Thomas Seidl. Adaptable Similarity Search in 3-D Spatial Database Systems  http://www.dbs.informatik.uni-muenchen.de/Forschung/Similarity/

