Multimedia and Time-Series Data: When Is “Nearest Neighbor” Meaningful? Group members: Terry Chan, Edward Chu, Dominic Leung, David Mak, Henry Yeung, Jason Yeung.

Presentation transcript:

Multimedia and Time-Series Data: When Is “Nearest Neighbor” Meaningful?
Group members: Terry Chan, Edward Chu, Dominic Leung, David Mak, Henry Yeung, Jason Yeung
CSIS 7101 Advanced Database

Contribution
- As dimensionality increases, the distance to the nearest neighbor approaches the distance to the farthest neighbor
- The distinction between nearest and farthest neighbors may blur with as few as 15 dimensions
- In these settings, linear scan almost always outperforms nearest neighbor processing techniques

Part One: Introducing Nearest Neighbor

What is the nearest neighbor (NN) problem?
- Given a collection of data points and a query point in an m-dimensional metric space, find the data point that is closest to the query point
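The problem statement above can be sketched as a brute-force linear scan. This is a minimal illustration rather than anything taken from the slides: it assumes Euclidean distance, and the function name `nearest_neighbor` and the sample points are hypothetical.

```python
import math

def nearest_neighbor(points, query):
    """Return (index, distance) of the point closest to `query`.

    Assumes Euclidean distance; any metric could be substituted.
    """
    if not points:
        raise ValueError("points must be non-empty")
    best_index, best_dist = -1, math.inf
    for i, p in enumerate(points):
        d = math.dist(p, query)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

# Hypothetical 2-D data set and query point.
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
index, dist = nearest_neighbor(points, (0.9, 1.2))
print(index, round(dist, 3))
```

Every point is examined exactly once, which is the baseline that index structures try to beat.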

Nearest neighbor algorithm
1. Make two sets of nodes, set A and set B, and put all nodes into set B
2. Move your starting node into set A
3. Pick the node that is not yet in set A and is closest to the last node placed in set A; move this node into set A
4. Repeat step 3 until all nodes are in set A
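The four steps can be sketched in a few lines. Note that this procedure visits every node in greedy order, in the style of a nearest-neighbor tour heuristic, rather than answering a single NN query. The function name, the tie-breaking rule (lowest index wins), and the sample nodes are assumptions made for illustration.

```python
import math

def greedy_nearest_neighbor_order(nodes, start):
    """Return node indices in the order they are moved from set B to set A."""
    visited = [start]                              # set A, in visiting order
    unvisited = set(range(len(nodes))) - {start}   # set B
    while unvisited:
        last = nodes[visited[-1]]
        # Closest remaining node to the last node placed in set A;
        # ties are broken by index (an assumption, the slide is silent here).
        nxt = min(unvisited, key=lambda i: (math.dist(nodes[i], last), i))
        unvisited.remove(nxt)
        visited.append(nxt)
    return visited

# Hypothetical nodes on a line.
nodes = [(0, 0), (5, 0), (1, 0), (2, 0)]
order = greedy_nearest_neighbor_order(nodes, 0)
print(order)
```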

[Figure: a query point and its nearest neighbor (NN)]

Practical Applications of NN Search
- Medical imaging
- Molecular biology
- Spatial and multimedia databases

Adaptable Similarity Approach
- Given an image database in a multimedia database system, one may want to retrieve all images that are similar to a given query image
- The data domain is high-dimensional

Color-oriented similarity of images
- On an image database

Shape-oriented similarity of images
- Aims at the level of individual pixels

Shape-oriented similarity of 3-D objects
- On a 3-D protein database

Approximation-based shape similarity of 3-D surface segments
- Measures the similarity of 3-D segments by using geometric approximation

Exception Case
- The distance from the query point to its nearest neighbor differs little from its distance to any other point in the data set
- We call this an unstable query

Part Two: Nearest Neighbor in High-Dimensional Space

Unstable query
- A nearest neighbor query is unstable for a given ε if the distance from the query point to most data points is less than (1 + ε) times the distance from the query point to its nearest neighbor
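The definition can be turned into a small check. This is a sketch under stated assumptions: "most" is interpreted as "more than a fraction `majority` of the points" (0.5 by default), Euclidean distance is used, and the function name and sample data are hypothetical.

```python
import math

def is_unstable(points, query, epsilon, majority=0.5):
    """True if more than `majority` of the points lie within
    (1 + epsilon) times the nearest-neighbor distance of `query`."""
    dists = [math.dist(p, query) for p in points]
    d_min = min(dists)  # distance to the nearest neighbor
    # <= rather than < so that an exact match (d_min == 0) still counts itself.
    within = sum(1 for d in dists if d <= (1 + epsilon) * d_min)
    return within / len(dists) > majority

query = (0.0, 0.0)
# One point is clearly closer than the rest: the query is stable.
separated = [(1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
# All points are at nearly the same distance: the query is unstable.
equidistant = [(10.0, 0.0), (0.0, 10.5), (-11.0, 0.0), (0.0, -10.2)]

print(is_unstable(separated, query, epsilon=0.5))
print(is_unstable(equidistant, query, epsilon=0.5))
```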

NN in High-Dimensional Space
- The paper proves that the concept of NN becomes meaningless as the dimensionality (m) increases
- This holds if a pre-condition is met: as m increases, the difference in distance between the query point and all data points becomes negligible (i.e., the query becomes unstable)

Assumptions for the pre-condition to hold
- The data distribution and query distribution are IID in all dimensions
- Unique dimensions with correlation between all dimensions

What is IID?
- Independent and identically distributed
- The distribution of values in each dimension is identical (e.g., all values are uniformly distributed, or all dimensions have the same skew) and independent of the other dimensions
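A small simulation can illustrate how the gap between the nearest and farthest neighbor shrinks in the IID case. It is only a sketch: it assumes uniform values in [0, 1) for every dimension, Euclidean distance, and a hypothetical helper `relative_contrast`; the point count and seed are arbitrary choices, not values from the paper.

```python
import math
import random

def relative_contrast(dim, n_points=500, seed=0):
    """Ratio of farthest to nearest distance from one random query
    to n_points random points, all IID uniform in [0, 1)^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return max(dists) / min(dists)

# The ratio moves toward 1 as the dimensionality grows.
for dim in (2, 10, 50, 200):
    print(dim, round(relative_contrast(dim), 2))
```

A ratio close to 1 means the farthest point is barely farther than the nearest one, which is the unstable situation described above.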

High-dimensional indexing can be meaningful
- When the dimensions of both the query point and the data points follow identical distributions but are completely dependent (i.e., value in D1 = value in D2 = …)
- The result is a set of data points and a query point that all lie on a diagonal line
- The underlying query can then be converted to a 1-D NN problem
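The reduction to one dimension can be checked directly: for points on the diagonal, the m-dimensional Euclidean distance is sqrt(m) times the 1-D distance, so the ordering of neighbors is unchanged. The sketch below assumes Euclidean distance; the helper name and the sample values are hypothetical.

```python
import math

def nn_index(points, query):
    """Index of the point closest to `query` (Euclidean, linear scan)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))

m = 8                                  # nominal dimensionality
values = [0.9, 0.15, 0.4, 0.62]        # one value per data point
query_value = 0.5

# Completely dependent dimensions: every coordinate repeats the same value.
diag_points = [(v,) * m for v in values]
diag_query = (query_value,) * m

nn_high = nn_index(diag_points, diag_query)
nn_1d = min(range(len(values)), key=lambda i: abs(values[i] - query_value))
print(nn_high, nn_1d)
```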

Graphical View
[Figure: X, Y and Z axes]
- All dimensions have the same value
- All data points lie on the diagonal

High-dimensional indexing can be meaningful (Cont’d)
- The underlying dimensionality is much lower than the actual dimensionality
- E.g., a 3-D data set in which all data points have the same Z coordinate

High-dimensional indexing can be meaningful (Cont’d)
- When the query point is within some small distance of a data point (instead of being required to be identical to a data point)
- The query then returns all points within the closest cluster, not just the nearest point

NN query in clustered data
[Figure: a query point and its nearest cluster]
- E.g., data fall into discrete classes or clusters in some potentially high-dimensional feature space

Distribution of distances in clustered data
[Figure: distribution of distances from the query point]
- Points that are close are in the same cluster (NN meaningful)
- Points in other clusters are all far away

Experimental studies of NN
- Goal: find out the rate of convergence
- Based on three synthetic workloads and one real data set
- NN can become unstable at low dimensionality (the Contribution slide cites as few as 15 dimensions)
- The graph is exponential
- In reality, the number of dimensions might be 1000

Correlated Distributions
- Recursive and uniform workloads (NN not meaningful)
- Two degrees of freedom workload (NN meaningful)

Part Three: Linear Scan Is Powerful …

NN indexing vs. linear scan
- Linear scan can handily beat NN indexing
- NN indexing is meaningful when the data consist of small, well-formed clusters
- And the query is guaranteed to land in or very near one of these clusters

Why linear scan?
- Reading a set of sequentially arranged disk pages is much faster than unordered retrieval of the same pages
- Fetching a large number of data pages through a multi-dimensional index usually results in unordered retrieval
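The trade-off can be made concrete with a toy cost model. Everything in it is an assumption for illustration: the per-page timings are hypothetical placeholders, not measurements, and the model ignores caching and CPU cost.

```python
def linear_scan_cost(total_pages, seq_page_ms):
    """Time to read every page sequentially."""
    return total_pages * seq_page_ms

def index_cost(pages_touched, random_page_ms):
    """Time to fetch the pages an index visits, in unordered fashion."""
    return pages_touched * random_page_ms

def break_even_fraction(seq_page_ms, random_page_ms):
    """Fraction of pages an index may touch before linear scan is cheaper."""
    return seq_page_ms / random_page_ms

# Hypothetical timings: unordered page reads 10x slower than sequential ones.
SEQ_MS, RANDOM_MS, TOTAL_PAGES = 1.0, 10.0, 10_000

print(break_even_fraction(SEQ_MS, RANDOM_MS))
print(linear_scan_cost(TOTAL_PAGES, SEQ_MS))
print(index_cost(2_000, RANDOM_MS))
```

Under these placeholder numbers, an index that touches 20% of the pages costs twice as much as scanning everything, which is why an index must be highly selective to pay off.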

Linear scan outperforms
- Both the SS-tree and the R*-tree at 10 dimensions in all cases
- The SR-tree in all cases on the 16-dimensional synthetic data set

Justification
- All the reported performance studies examined situations in which the distance between the query point and its NN differed little from the distance to other data points
- In reality, it might be different

Other related work
- Dimensionality Curse
- Fractal Dimensions

Dimensionality Curse
- A vague indication that high dimensionality causes problems in some situations
- Examples:
  - The NN problem
  - “Boundary effects” not taken into account for NN queries in the high-dimensional case

Fractal Dimensions
- A measure of how "complicated" a self-similar figure (data set) is
- NN queries become stable when the fractal dimensionality is low
- In reality, real data sets do not exhibit fractal behavior

Conclusion
- The effect of dimensionality on NN queries
- High-dimensional indexing can be meaningful
- Evaluating NN workloads
- Linear scan outperforms NN processing techniques on some meaningful workloads

Reference
- Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft. When Is “Nearest Neighbor” Meaningful?
- Thomas Seidl. Adaptable Similarity Search in 3-D Spatial Database System
- muenchen.de/Forschung/Similarity/