
1 Multimedia and Time-Series Data When Is “Nearest Neighbor” Meaningful? Group members: Terry Chan, Edward Chu, Dominic Leung, David Mak, Henry Yeung, Jason Yeung CSIS 7101 Advanced Database

2 Contribution  As dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor  The distinction between nearest and farthest neighbors may blur with as few as 15 dimensions  A linear scan often outperforms specialized nearest neighbor indexing techniques

3 Part One Introduction to Nearest Neighbor

4 What is the nearest neighbor (NN) problem?  Given a collection of data points and a query point in an m-dimensional metric space, find the data point that is closest to the query point

5 Nearest neighbor algorithm
1. Make two sets of nodes, set A and set B, and put all nodes into set B
2. Put your starting node into set A
3. Pick the node that is closest to the last node placed in set A and is not yet in set A; put this closest neighboring node into set A
4. Repeat step 3 until all nodes are in set A
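The four steps above can be sketched in Python (a minimal, illustrative version; the function name and example points are our own):

```python
import math

def nearest_neighbor_order(points, start=0):
    """Greedy NN chain: repeatedly move to the closest not-yet-visited point."""
    unvisited = set(range(len(points)))   # set B: all nodes
    order = [start]                       # set A: starts with the starting node
    unvisited.remove(start)
    while unvisited:                      # repeat until all nodes are in set A
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(points[i], last))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

# Four points on a line: the greedy chain visits them left to right
print(nearest_neighbor_order([(0, 0), (3, 0), (1, 0), (2, 0)]))  # → [0, 2, 3, 1]
```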

6 Query point and its nearest neighbor [figure: data points, the query point, and its NN]

7 Practical Applications of NN Search  Medical Imaging  Molecular Biology  Spatial and Multimedia databases

8 Adaptable Similarity Approach  In a multimedia setting, given an image database  one may want to retrieve all images that are similar to a given query image  the data domain is high-dimensional

9 Color-oriented similarity of images  On Image Database

10 Shape-oriented similarity of images  Aims at the level of individual pixels

11 Shape-oriented similarity of 3-D objects  On 3-D protein database

12 Approximation-based Shape similarity of 3-D surface segments  Measures the similarity of 3-D segments by using geometric approximations

13 Exception Case  The difference between the distance to the nearest neighbor and the distance to any other point in the data set is small  We call this an unstable query

14 Part Two Nearest Neighbor in High-Dimensional Space

15 Unstable query  A nearest neighbor query is unstable for a given ε if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor

16 NN in High-Dimensional Space  The paper proves that the concept of NN becomes meaningless as dimensionality (m) increases  If a pre-condition holds: as m increases, the difference in distance between the query point and all data points becomes negligible (i.e., the query becomes unstable)
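This effect can be seen in a small Monte Carlo sketch (our own illustrative code, not the paper's experiment): for IID uniform data, the ratio of the farthest to the nearest distance collapses toward 1 as m grows.

```python
import math
import random

def contrast(m, n=1000, seed=0):
    """Ratio of farthest to nearest distance from one query point
    to n IID uniform data points in m dimensions."""
    rng = random.Random(seed)
    q = [rng.random() for _ in range(m)]
    dists = [math.dist([rng.random() for _ in range(m)], q) for _ in range(n)]
    return max(dists) / min(dists)

# The farthest/nearest ratio shrinks toward 1 as dimensionality grows
for m in (2, 15, 100, 1000):
    print(m, round(contrast(m), 2))
```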

17 Assumptions for the pre-condition to hold  The data distribution and query distribution are IID in all dimensions  It also holds for unique dimensions with correlation between all dimensions

18 What is IID?  Independent and identically distributed  It means that the distribution of values in each dimension is identical (i.e., all values are uniformly distributed, or all dimensions have the same skew) and independent

19 High-dimensional indexing can be meaningful  When the dimensions of both the query point and data points follow identical distributions but are completely dependent (e.g., the value in D1 = the value in D2 = …)  The result is a set of data points and a query point on a diagonal line  The underlying query can actually be converted to a 1-D NN problem
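A minimal sketch of this reduction (names and data are our own, for illustration): with fully dependent dimensions, every point is determined by a single value, so the m-D query returns the same answer as a 1-D query.

```python
import math

def nn_index(points, q):
    """Index of the data point closest to query q (brute force)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], q))

def nn_index_1d(values, qv):
    return min(range(len(values)), key=lambda i: abs(values[i] - qv))

m = 50
values = [0.1, 0.4, 0.9]                # one degree of freedom per point
points = [[v] * m for v in values]      # every dimension repeats that value
query = [0.5] * m                       # the query also lies on the diagonal

# The 50-D answer matches the 1-D answer
assert nn_index(points, query) == nn_index_1d(values, 0.5) == 1
```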

20 Graphical View [figure: X, Y, Z axes]  All dimensions have the same value  All data points lie on the diagonal

21 High-dimensional indexing can be meaningful (Cont’d)  The underlying dimensionality is much lower than the actual dimensionality  E.g., it is a 3-D data set, but the data always have the same Z coordinate

22 High-dimensional indexing can be meaningful (Cont’d)  When the query point is within some small distance of a data point (instead of being required to be identical to a data point)  The result set of the query is all points within the closest cluster, not just the nearest point

23 NN query in clustered data [figure: query point and its nearest cluster]  E.g., data falls into discrete classes or clusters in some potentially high-dimensional feature space

24 Distribution of distances in clustered data [figure: points in the same cluster are close (NN meaningful); points in other clusters are all far]
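An illustrative sketch of that distance distribution (our own synthetic setup): with two tight, well-separated clusters, the nearest distance stays far below the distance to the other cluster even at 100 dimensions, so the NN query remains stable.

```python
import math
import random

# Two tight Gaussian clusters in 100 dimensions, far apart
rng = random.Random(1)
m = 100
centers = ([0.0] * m, [10.0] * m)
points = [[c + rng.gauss(0, 0.1) for c in centers[i % 2]] for i in range(200)]
query = [rng.gauss(0, 0.1) for _ in range(m)]   # query lands near cluster 0

dists = sorted(math.dist(p, query) for p in points)
# Nearest distance is much smaller than the distance to the other cluster
print(dists[0], dists[-1])
```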

25 Experimental studies of NN  Want to find out the rate of convergence  Based on three synthetic workloads and one real data set  NN can become unstable with as few as 10 to 20 dimensions  The graph shows an exponential trend  In reality, the dimensionality might be 1000

26 Correlated Distributions [figures: recursive and uniform workloads (NN not meaningful); two-degrees-of-freedom workload (NN meaningful)]

27 Part Three Linear Scan is powerful …

28 NN indexing VS linear scan  A linear scan can handily beat NN indexing  NN indexing is meaningful when the data consists of small, well-formed clusters  And the query is guaranteed to land in or very near one of these clusters

29 Why linear scan  Reading a set of sequentially arranged disk pages is much faster than unordered retrieval of the same pages  Fetching a large number of data pages through a multi-dimensional index usually results in unordered retrieval
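A linear scan is just one sequential pass that keeps the best candidate seen so far; a minimal sketch (the function name is ours):

```python
import math

def linear_scan_nn(points, q):
    """One sequential pass over the data, keeping the best candidate so far."""
    best_i, best_d = -1, float("inf")
    for i, p in enumerate(points):       # data is read in storage order
        d = math.dist(p, q)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

# The point at index 2 is closest to the query (0.9, 0.9)
print(linear_scan_nn([(0, 0), (5, 5), (1, 1)], (0.9, 0.9)))
```

Because the pass touches every record exactly once and in storage order, it benefits from the sequential I/O advantage the slide describes.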

30 Linear scan outperforms  Both the SS-tree and the R*-tree at 10 dimensions in all cases  The SR-tree in all cases on a 16-dimensional synthetic data set

31 Justification  All the reported performance studies examined situations in which the distance from the query point to its NN differed little from the distances to other data points  In reality, it might be different

32 Other related work  Dimensionality Curse  Fractal Dimensions

33 Dimensionality Curse  A vague indication that high dimensionality causes problems in some situations  Examples:  The NN problem  “Boundary effects” not taken into account in NN queries in the high-dimensional case

34 Fractal Dimensions  A measure of how "complicated" a self-similar figure (data set) is  NN queries become stable when the fractal dimensionality is low  In reality, many real data sets do not exhibit fractal behavior

35 Conclusion  The effect of dimensionality on NN queries  High-dimensional indexing can be meaningful  How to evaluate an NN workload  A linear scan outperforms NN processing techniques on some meaningful workloads

36 References  Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft. When Is “Nearest Neighbor” Meaningful?  Thomas Seidl. Adaptable Similarity Search in 3-D Spatial Database Systems  http://www.dbs.informatik.uni-muenchen.de/Forschung/Similarity/

