Metric based KNN indexing. Lecturer: Prof Ooi Beng Chin. Presenters: Frankie Chan (HT00-3550Y), Tan Zhenqiang (HT01-6163J).


Outline: Introduction, Examples, SR-Tree, MVP-Tree, Future work, Conclusion, References

Introduction. Metric-based queries consider the relative distance of an object or point from a given query point. The most commonly used metric is the Euclidean metric.

Metric based queries. Variations: used with join queries (e.g. find the 3 closest restaurants for each of 2 different theaters); used with spatial queries (e.g. find the KNN to the east of a location).
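
As a baseline for the index structures that follow, the query variations above can be sketched with a plain linear scan. This is only an illustration: the function names (`knn`, `knn_join`, `knn_east`) and all coordinates are invented here, and "east" is assumed to mean a larger first coordinate.

```python
import heapq
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, points, k, dist=euclidean):
    """Return the k points closest to `query`, nearest first (linear scan)."""
    return heapq.nsmallest(k, points, key=lambda p: dist(query, p))

def knn_join(queries, points, k, dist=euclidean):
    """KNN join: for every query point, return its own k nearest points."""
    return {q: knn(q, points, k, dist) for q in queries}

def knn_east(location, points, k, dist=euclidean):
    """KNN restricted to points east of `location` (assumed: larger x)."""
    east = [p for p in points if p[0] > location[0]]
    return knn(location, east, k, dist)

# Hypothetical data, made up for this example.
restaurants = [(1, 1), (2, 5), (4, 4), (6, 1), (7, 7), (9, 3)]
theaters = [(0, 0), (8, 4)]

closest = knn_join(theaters, restaurants, k=3)
for theater, nearest in closest.items():
    print(theater, "->", nearest)
print(knn_east((5, 3), restaurants, k=2))
```

A linear scan computes one distance per data point for every query, which is the cost the tree structures below try to avoid.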

SR-Tree. “The SR-tree: An Indexing Structure for High-Dimensional Nearest Neighbor Queries.” Norio Katayama (Multimedia Info. Research Div.) and Shin’ichi Satoh (Software Research Div.), National Institute of Informatics.

SR-Tree: Introduction. SR-tree stands for “Sphere/Rectangle-Tree”. The SR-tree is an extension of the R*-tree and the SS-tree. A region of the SR-tree is specified by the intersection of a bounding sphere and a bounding rectangle.
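
A minimal sketch of what "intersection of a bounding sphere and a bounding rectangle" means in practice. The function names are assumptions made for this example. Taking the larger of the two per-shape distances as the distance to the region is how the SR-tree paper is usually summarised; treat it here as a valid lower bound rather than a quotation of the authors' code.

```python
import math

def in_sr_region(p, center, radius, low, high):
    """A point is in the region only if both the sphere and the rectangle contain it."""
    in_sphere = math.dist(p, center) <= radius
    in_rect = all(l <= x <= h for x, l, h in zip(p, low, high))
    return in_sphere and in_rect

def mindist_sphere(q, center, radius):
    """Smallest possible distance from q to any point of the sphere."""
    return max(0.0, math.dist(q, center) - radius)

def mindist_rect(q, low, high):
    """Smallest possible distance from q to any point of the rectangle."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, low, high)))

def mindist_sr_region(q, center, radius, low, high):
    """Lower bound on the distance from q to the intersection of the two shapes."""
    return max(mindist_sphere(q, center, radius), mindist_rect(q, low, high))

# Hypothetical region: unit sphere at the origin intersected with the square [-1, 1]^2.
center, radius = (0.0, 0.0), 1.0
low, high = (-1.0, -1.0), (1.0, 1.0)

print(in_sr_region((0.5, 0.5), center, radius, low, high))   # inside both shapes
print(in_sr_region((0.9, 0.9), center, radius, low, high))   # inside the square only
print(mindist_sr_region((2.0, 2.0), center, radius, low, high))
```

For the query (2, 2) the sphere gives the tighter bound, because the corner of the square sticks out of the sphere in that direction.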

The R*-Tree Structure

The SS-Tree Structure

Definitions (SR-Tree). The diameter of a bounded region means: the diameter of the bounding sphere for the SS-Tree; the diagonal of the bounding rectangle for the R*-Tree.

Properties. Bounding rectangles divide points into small-volume regions, but tend to have longer diameters than bounding spheres, especially in high-dimensional space. Bounding spheres divide points into short-diameter regions, but tend to have larger volumes than bounding rectangles.

Properties. The SR-tree combines the use of a bounding sphere and a bounding rectangle, as their properties are complementary.
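
The complementary behaviour can be checked numerically. The sketch below is an assumption-laden illustration, not the experiment from the paper: it uses 1000 uniform random points in 16 dimensions and a centroid-centred bounding sphere (as in the SS-tree) rather than a minimum enclosing sphere.

```python
import math
import random

def bounding_rect(points):
    """Axis-aligned bounding rectangle as (low corner, high corner)."""
    dims = range(len(points[0]))
    low = [min(p[i] for p in points) for i in dims]
    high = [max(p[i] for p in points) for i in dims]
    return low, high

def bounding_sphere(points):
    """Sphere centred on the centroid, just large enough to cover every point."""
    dims = range(len(points[0]))
    center = [sum(p[i] for p in points) / len(points) for i in dims]
    radius = max(math.dist(center, p) for p in points)
    return center, radius

def rect_volume(low, high):
    return math.prod(h - l for l, h in zip(low, high))

def rect_diagonal(low, high):
    return math.dist(low, high)

def sphere_volume(radius, d):
    """Volume of a d-dimensional ball."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d

random.seed(0)
D = 16
points = [[random.random() for _ in range(D)] for _ in range(1000)]
low, high = bounding_rect(points)
center, radius = bounding_sphere(points)

print("rectangle: diagonal", rect_diagonal(low, high), "volume", rect_volume(low, high))
print("sphere:    diameter", 2 * radius, "volume", sphere_volume(radius, D))
```

On this data the rectangle has the longer diameter and the sphere has the larger volume, which is the trade-off the two Properties slides describe.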

The SR-Tree Structure

Bounded regions

Indexing Structure The structure of the leaf L : The structure of the node N :

Insertion Algorithm. The SR-tree insertion algorithm is based on the SS-tree’s centroid-based algorithm: descend the tree, choosing at each level the subtree whose centroid is nearest to the new entry. The SR-tree algorithm updates both bounding spheres (computed differently from the SS-tree) and bounding rectangles (same as the R*-tree).
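
The descent step can be sketched as follows. The dictionary-based node layout (`name`, `centroid`, `children`) is invented for this example, and node splitting and region updates are deliberately left out.

```python
import math

def choose_subtree(children, point):
    """Pick the child whose centroid is nearest to the new point."""
    return min(children, key=lambda c: math.dist(c["centroid"], point))

def descend(root, point):
    """Follow nearest-centroid children from the root down to a leaf."""
    node, path = root, [root["name"]]
    while node["children"]:
        node = choose_subtree(node["children"], point)
        path.append(node["name"])
    return path

# Hypothetical two-level tree.
tree = {
    "name": "root", "centroid": (5.0, 5.0), "children": [
        {"name": "A", "centroid": (1.0, 1.0), "children": []},
        {"name": "B", "centroid": (8.0, 8.0), "children": [
            {"name": "B1", "centroid": (7.0, 7.0), "children": []},
            {"name": "B2", "centroid": (9.0, 9.0), "children": []},
        ]},
    ],
}

print(descend(tree, (8.6, 8.9)))
```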

Insertion Algorithm. Bounding sphere computation: center x = (x_1, x_2, ..., x_D) and radius r.
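
The formulas on this slide did not survive in the transcript. The sketch below follows the computation as commonly described for the SR-tree paper, so treat it as an assumption: the center is the weighted centroid of the children (weights being the number of points under each child), and the radius uses, for each child, the smaller of "distance to the child's sphere plus its radius" and "distance to the farthest corner of the child's rectangle". The field names are made up for this example.

```python
import math

def maxdist_rect(p, low, high):
    """Distance from p to the farthest corner of the rectangle [low, high]."""
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2
                         for x, l, h in zip(p, low, high)))

def parent_sphere(children):
    """Bounding sphere of a parent from its children's spheres, rectangles and weights."""
    total = sum(c["weight"] for c in children)
    dims = range(len(children[0]["center"]))
    # Center: centroid of all underlying points, via weighted child centers.
    center = [sum(c["center"][i] * c["weight"] for c in children) / total
              for i in dims]
    # Radius: each child is covered by both its sphere and its rectangle,
    # so the smaller of the two reach estimates is sufficient.
    radius = max(
        min(math.dist(center, c["center"]) + c["radius"],
            maxdist_rect(center, c["low"], c["high"]))
        for c in children
    )
    return center, radius

# Hypothetical children.
children = [
    {"center": (0.0, 0.0), "radius": 1.5, "low": (-1.0, -1.0), "high": (1.0, 1.0), "weight": 3},
    {"center": (4.0, 0.0), "radius": 1.5, "low": (3.0, -1.0), "high": (5.0, 1.0), "weight": 1},
]
center, radius = parent_sphere(children)
print(center, radius)
```

In this example the rectangle term is the smaller one for both children, which is why the resulting sphere is tighter than one built from the child spheres alone.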

Deletion Algorithm. The SR-tree deletion algorithm is similar to that of the R-tree. If deleting an entry does not cause leaf/node under-utilisation, just remove it. Otherwise, remove the under-utilised leaf/node and reinsert all orphaned entries.

Nearest Neighbour Search. Algorithm: ordered depth-first search. It first finds a number of points nearest to the query to form a candidate set. It then revises the candidate set whenever it visits a leaf whose region overlaps the range of the candidate set. After it has visited all such leaves, the final candidate set is the search result.

Definition (NN search). Minimum Distance (MINDIST): the Euclidean distance from the query point to the bounded region. Minimax Distance (MINMAXDIST): the minimum, taken over each of the n axes, of the maximum distance between the query point and the points on the nearer face of the region along that axis.

Definition (NN search)
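
For a bounding rectangle, the two measures can be sketched as below, following the definitions in Roussopoulos et al. (listed in the references). That paper works with squared distances; this sketch returns true distances, and the function names are chosen for this example.

```python
import math

def mindist(p, low, high):
    """Distance from p to the nearest point of the rectangle (0 if p is inside)."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, low, high)))

def minmaxdist(p, low, high):
    """Smallest, over all axes, of the distance to the farthest point on the
    nearer face along that axis. Some object must lie within this distance."""
    # Nearer and farther bound of the rectangle along each axis.
    near = [l if x <= (l + h) / 2 else h for x, l, h in zip(p, low, high)]
    far = [l if x >= (l + h) / 2 else h for x, l, h in zip(p, low, high)]
    far_sq = [(x - f) ** 2 for x, f in zip(p, far)]
    total_far = sum(far_sq)
    # For axis k: nearer bound on axis k, farther bound on every other axis.
    best = min(total_far - far_sq[k] + (p[k] - near[k]) ** 2
               for k in range(len(p)))
    return math.sqrt(best)

# Hypothetical query point and rectangle.
p, low, high = (0.0, 0.0), (1.0, 1.0), (3.0, 2.0)
print(mindist(p, low, high), minmaxdist(p, low, high))
```

MINDIST is an optimistic bound (nothing in the region can be closer), while MINMAXDIST is a pessimistic one (the region is guaranteed to hold an object at least this close, provided the rectangle is a minimum bounding rectangle).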

Search pruning. (1) A region R1 whose MINDIST is greater than the MINMAXDIST of another region R2 is discarded, because it cannot contain the NN (downward pruning). (2) An object O whose actual distance from the query point P is greater than the MINMAXDIST of a region is discarded (upward pruning). (3) A region whose MINDIST is greater than the actual distance from the query point P to an object O is discarded (upward pruning).

Recursive procedure (leaf node)
    If Node.type = LEAF then
        For I := 1 to Node.count
            dist := objectDist(Pt, Node.branch[I].region)
            If (dist < Nearest.dist) then
                Nearest.dist := dist
                Nearest.region := Node.branch[I].region

Recursive procedure (non-leaf node)
    Else  /* non-leaf node => order, sort, prune & visit node */
        genBranchList(Pt, Node, branchList)
        sortBranchList(branchList)
        /* perform downward pruning */
        last := pruneBranchList(Node, Pt, Nearest, branchList)
        For I := 1 to last
            newNode := Node.branch[branchList[I]]
            nearestNeighborSearch(newNode, Pt, Nearest)
            /* perform upward pruning */
            last := pruneBranchList(Node, Pt, Nearest, branchList)
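
A runnable sketch of the recursive procedure, simplified under stated assumptions: regions are plain rectangles, branches are ordered by MINDIST, and only pruning rule 3 (MINDIST against the best actual distance so far) is applied. The MINMAXDIST-based rules are omitted for brevity. The `Node` class and the `visited` argument are invented for this example.

```python
import math

def mindist(p, low, high):
    """Distance from p to the nearest point of the rectangle [low, high]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, low, high)))

class Node:
    def __init__(self, name, low, high, children=None, points=None):
        self.name, self.low, self.high = name, low, high
        self.children = children or []
        self.points = points  # list of points for a leaf, None otherwise

def nn_search(node, query, best=(math.inf, None), visited=None):
    """Ordered depth-first NN search. Returns (distance, point)."""
    if visited is not None:
        visited.append(node.name)
    if node.points is not None:
        # Leaf: compare against the actual objects.
        for pt in node.points:
            d = math.dist(query, pt)
            if d < best[0]:
                best = (d, pt)
        return best
    # Non-leaf: build the branch list and sort it by MINDIST.
    branches = sorted(node.children,
                      key=lambda c: mindist(query, c.low, c.high))
    for child in branches:
        # Pruning: this branch and all later ones cannot beat the current best.
        if mindist(query, child.low, child.high) >= best[0]:
            break
        best = nn_search(child, query, best, visited)
    return best

# Hypothetical tree with three leaves.
leaf1 = Node("leaf1", (0, 0), (2, 2), points=[(0, 0), (1, 2), (2, 1)])
leaf2 = Node("leaf2", (5, 5), (7, 7), points=[(5, 5), (6, 7)])
leaf3 = Node("leaf3", (8, 0), (9, 1), points=[(8, 0), (9, 1)])
root = Node("root", (0, 0), (9, 7), children=[leaf1, leaf2, leaf3])

visited = []
print(nn_search(root, (3, 3), visited=visited), visited)
```

For the query (3, 3) only `leaf1` is read: once a neighbour at distance sqrt(5) is known, the other two leaves have a larger MINDIST and are skipped.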

Performance Analysis - Insertion: insertion cost of R*-tree, SS-tree and SR-tree (uniform dataset).

Performance Analysis - Query: performance of VAMSplit R-tree, SS-tree and SR-tree (uniform dataset).

Performance Analysis - Query: performance of VAMSplit R-tree, SS-tree and SR-tree (real dataset).

SR-tree average volume & diameter: average volume and diameter of the leaf-level regions of R*-tree, SS-tree and SR-tree (real dataset).

Strengths. The SR-Tree divides points into regions with small volumes and short diameters. Dividing points into smaller regions improves disjointness. Smaller volumes and diameters enhance the performance of nearest neighbour queries.

Weaknesses. The SR-tree suffers from the fanout problem (fanout: the branching factor, i.e. the maximum number of entries per node). The size of each node entry grows as dimensionality increases. The resulting reduction in fanout may require more nodes to be read on queries, possibly affecting query performance.
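
A back-of-the-envelope illustration of the fanout problem. Every number here is an assumption made for the example (8-byte coordinates, 8-byte child pointer, 8-byte point count, 8192-byte blocks) and not a figure from the paper. What matters is the ratio: an SR-tree entry stores a sphere and a rectangle, so it is roughly the sum of an SS-tree entry and an R*-tree entry.

```python
def entry_sizes(dim, coord_bytes=8, pointer_bytes=8, count_bytes=8):
    """Approximate bytes per internal-node entry under the assumed layout."""
    rect = 2 * dim * coord_bytes        # low corner and high corner
    sphere = (dim + 1) * coord_bytes    # center and radius
    return {
        "R*-tree": rect + pointer_bytes,
        "SS-tree": sphere + count_bytes + pointer_bytes,
        "SR-tree": sphere + rect + count_bytes + pointer_bytes,
    }

def fanout(block_bytes, entry_bytes):
    """Maximum number of entries that fit in one block."""
    return block_bytes // entry_bytes

BLOCK = 8192  # assumed block size in bytes
for dim in (2, 16, 64):
    sizes = entry_sizes(dim)
    print(dim, {name: fanout(BLOCK, size) for name, size in sizes.items()})
```

Under these assumptions the SR-tree always has the smallest fanout of the three, and all three fanouts shrink as the dimensionality grows.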

MVP-Tree. “Indexing Large Metric Spaces For Similarity Search Queries.” Tolga Bozkaya (Oracle Corporation) and Meral Ozsoyoglu (Dept. Comp Eng & Sci, Case Western Reserve Univ.).

Outline. Main idea. Algorithm basis. How to build an mvp-tree from the given data. How to do similarity search in an mvp-tree. Performance analysis and comparison based on experiments.

Main Idea. Triangle inequality. Distance-based indexing. Adopt more vantage points and levels. Pre-computed distances are kept in leaf nodes. Use pre-computed distances to prune query branches.

Algorithm Basis. vp1, vp2, vp3 are vantage points. Q is the given query point. p1 belongs to the point set. The more vantage points, the more unnecessary query branches are pruned. Normally, the larger the distance between two vantage points, the better.
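
The pruning argument rests on the triangle inequality: for any vantage point vp, |d(Q, vp) - d(p, vp)| ≤ d(Q, p). If that lower bound already exceeds the query range r, p can be discarded without computing d(Q, p). The sketch below uses made-up points and helper names to show why a second vantage point helps.

```python
import math

def lower_bound(d_q_vps, d_p_vps):
    """Best triangle-inequality lower bound on d(Q, p) over all vantage points."""
    return max(abs(dq - dp) for dq, dp in zip(d_q_vps, d_p_vps))

def can_prune(d_q_vps, d_p_vps, r):
    """True if p is provably farther than r from Q."""
    return lower_bound(d_q_vps, d_p_vps) > r

# Hypothetical points.
vp1, vp2 = (0.0, 0.0), (10.0, 0.0)
Q, p1 = (1.0, 8.0), (8.0, 1.0)
r = 3.0

# d(p1, vp) would be pre-computed at build time; d(Q, vp) once per query.
dq = [math.dist(Q, vp1), math.dist(Q, vp2)]
dp = [math.dist(p1, vp1), math.dist(p1, vp2)]

print(can_prune(dq[:1], dp[:1], r))   # one vantage point
print(can_prune(dq, dp, r))           # two vantage points
print(math.dist(Q, p1))               # true distance, for comparison
```

Q and p1 are equally far from vp1, so vp1 alone says nothing, while vp2 separates them. This is the sense in which more, and more widely spread, vantage points prune more.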

How to construct mvp-tree(m,k,p)
1) If |S| = 0, create an empty tree and quit.
2) If |S| ≤ k, create a leaf node L, put all the data in L, and quit.
3) Choose the first vantage point Svp1 and keep the distances to it in arrays.
4) Divide S into m groups of equal cardinality, based on the distances from Svp1 to the points in S, and keep these distances as well.
5) For each of the groups above:
5.1) Choose the last point in the previous group as the new vantage point.
5.2) Divide the group into m sub-groups of equal cardinality, based on the distances from the new vantage point to the points in the group, and keep these distances as well.
6) Recursively create an mvp-tree on each of the m^v sub-groups, following steps 1) to 5).
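
A simplified sketch of the construction steps, with several assumptions that go beyond the slide: exactly two vantage points per node, the first point of the input taken as the first vantage point, the point farthest from it taken as the second, and a dictionary-based node layout invented for this example. Each entry is a `(point, path)` pair, where `path` collects up to p distances to ancestor vantage points.

```python
import math
import random

def split(items, m):
    """Cut a sorted list into at most m consecutive groups of near-equal size."""
    size = max(1, math.ceil(len(items) / m))
    return [items[i:i + size] for i in range(0, len(items), size)]

def build(entries, m=2, k=4, p=4, dist=math.dist):
    if not entries:
        return None
    vp1, rest = entries[0][0], entries[1:]
    if len(entries) <= k:
        # Leaf: keep exact distances from both vantage points to every data point.
        rest.sort(key=lambda e: dist(vp1, e[0]))
        vp2 = rest.pop()[0] if rest else None
        return {"vps": [vp1, vp2], "children": None,
                "data": [e[0] for e in rest],
                "D1": [dist(vp1, e[0]) for e in rest],
                "D2": [dist(vp2, e[0]) for e in rest],
                "PATH": [e[1] for e in rest]}

    def extend(entry, vp):
        # Record the distance to this vantage point while the path has room.
        pt, path = entry
        return (pt, path + [dist(vp, pt)]) if len(path) < p else entry

    rest = sorted((extend(e, vp1) for e in rest), key=lambda e: dist(vp1, e[0]))
    vp2 = rest.pop()[0]                      # farthest point from vp1
    groups = split(rest, m)
    cut1 = [dist(vp1, g[-1][0]) for g in groups]   # max distance to vp1 per group
    cut2, children = [], []
    for g in groups:
        g = sorted((extend(e, vp2) for e in g), key=lambda e: dist(vp2, e[0]))
        subs = split(g, m)
        cut2.append([dist(vp2, s[-1][0]) for s in subs])
        children.append([build(s, m, k, p, dist) for s in subs])
    return {"vps": [vp1, vp2], "cut1": cut1, "cut2": cut2, "children": children}

def size(node):
    """Number of points stored in the tree (vantage points included)."""
    if node is None:
        return 0
    n = sum(vp is not None for vp in node["vps"])
    if node["children"] is None:
        return n + len(node["data"])
    return n + sum(size(c) for row in node["children"] for c in row)

random.seed(1)
points = [(random.random(), random.random()) for _ in range(60)]
tree = build([(pt, []) for pt in points], m=3, k=5, p=4)
print(size(tree), len(tree["children"]), tree["cut1"])
```

Because each level sorts its points and cuts the sorted list into equal parts, the resulting tree is balanced by construction, at the price of needing the whole data set up front.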

Example. Sv1: first vantage point (level 1). Sv2: vantage points (level 2). D[1..k]: the distances between the data points in a leaf node and the vantage points. x.PATH[p]: the distances between data point x and the first p vantage points along the path from the root to the leaf node that holds it.

How to do similarity search
Depth-first process. Q is the given query object; r is the query range (distance).
1) For i = 1 to m: if d(Q, Svi) ≤ r, then Svi is in the answer set. (Svi is the i-th vantage point in the current node.)
2) If the current node is a leaf node: for all data points Si in the node, if [d(Q, Sv) - r ≤ d(Si, Sv) ≤ d(Q, Sv) + r] holds for all vantage points Sv, and (PATH[i] - r ≤ Si.PATH[i] ≤ PATH[i] + r) holds for all i = 1..p, then compute d(Q, Si). If d(Q, Si) ≤ r, then Si is in the answer set.
3) If the current node is an internal node: for all i = 1..m, if d(Q, Svi) + r ≤ Mi, then recursively search the first branch. (Mi is the maximum of the distances between Svi and the points in its child node.)
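
The leaf-level step of the search is where the pre-computed distances pay off. The sketch below covers only that step, on a hand-built leaf; the leaf layout (`vps`, `data`, `D1`, `D2`, `PATH`), the helper `make_leaf` and all coordinates are assumptions made for this example.

```python
import math

def make_leaf(vp1, vp2, data, ancestors, dist=math.dist):
    """Build a leaf with its pre-computed distance arrays."""
    return {"vps": (vp1, vp2), "data": data,
            "D1": [dist(vp1, x) for x in data],
            "D2": [dist(vp2, x) for x in data],
            "PATH": [[dist(a, x) for a in ancestors] for x in data]}

def search_leaf(leaf, q, r, q_path, dist=math.dist):
    """Range search in one leaf. Returns (answers, distance computations used)."""
    vp1, vp2 = leaf["vps"]
    dq1, dq2 = dist(q, vp1), dist(q, vp2)
    computed = 2
    answers = [vp for vp, d in ((vp1, dq1), (vp2, dq2)) if d <= r]
    for pt, d1, d2, path in zip(leaf["data"], leaf["D1"], leaf["D2"], leaf["PATH"]):
        if abs(dq1 - d1) > r or abs(dq2 - d2) > r:
            continue  # ruled out by a vantage point of this leaf
        if any(abs(qd - pd) > r for qd, pd in zip(q_path, path)):
            continue  # ruled out by an ancestor vantage point
        computed += 1  # only now is a real distance computation needed
        if dist(q, pt) <= r:
            answers.append(pt)
    return answers, computed

# Hypothetical leaf below one ancestor vantage point.
ancestors = [(5.0, 10.0)]
leaf = make_leaf((0.0, 0.0), (10.0, 0.0),
                 [(1.0, 1.0), (2.0, 8.0), (9.0, 1.0), (5.0, 5.0), (6.0, 4.0)],
                 ancestors)
q, r = (5.0, 4.0), 1.5
q_path = [math.dist(q, a) for a in ancestors]  # gathered while descending

answers, computed = search_leaf(leaf, q, r, q_path)
print(answers, computed)
```

Three of the five data points are rejected by simple subtractions on stored values, so the leaf costs four distance computations instead of seven.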

Comparison Experiment: performance results for queries on the data set where data points form several physical clusters.

Performance Analysis. Experiments comparing mvp-trees with vp-trees (which use only one vantage point at each level and do not make use of pre-computed distances) show that the mvp-tree outperforms the vp-tree by 20% to 80% for varying query ranges and different distance distributions. For small query ranges, experiments with Euclidean vectors showed that mvp-trees require 40% to 80% fewer distance computations than vp-trees. For larger query ranges, the percentage difference decreases gradually, yet mvp-trees still perform better, making up to 30% fewer distance computations for the largest query ranges used. Experiments on gray-level images, using a data set of 1151 images, show that mvp-trees performed up to 20-30% fewer distance computations.

Strengths. The design builds on the triangle inequality: the more vantage points, the more unnecessary query branches are pruned; the longer the distances among vantage points, the better; and pre-computed distances are reused as much as possible. The mvp-tree is flatter than the vp-tree. It is also balanced because of the way it is constructed. Experiments show that it is more efficient than the vp-tree and the M-tree.

Weaknesses
1. Construction cost: O(n log_m n) distance computations.
2. The additional storage cost is very high: every data point in a leaf node must carry an array of size p.
3. Updating and inserting data points may lead to reconstruction of the mvp-tree.
4. If insertions cause the tree structure to become skewed (that is, the new data points change the distance distribution of the whole data set), global restructuring may have to be done, possibly during off hours of operation.
5. As the mvp-tree is created from an initial set of data objects in a top-down fashion, it is a rather static index structure.

Conclusion. Metric-based indexing can be effective for high-dimensional and non-uniform datasets (e.g. image/video similarity indexing). Future work: an algorithm that performs well in both dynamic and static database environments; analysing the use of metrics together with other attributes to enable range queries.

References
N. Katayama and S. Satoh. The SR-tree: An Indexing Structure for High-Dimensional Nearest Neighbor Queries. Proc. ACM SIGMOD Int. Conf. on Management of Data, Tucson, Arizona.
R. Kurniawati, J. S. Jin, and J. A. Shepherd. The SS+-tree: An Improved Index Structure for Similarity Searches in a High-Dimensional Feature Space. SPIE: Storage and Retrieval for Image and Video Databases V.
N. Beckmann, H.P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. In Proc. ACM SIGMOD Intl. Symp. on the Management of Data.
N. Roussopoulos, S. Kelley, and F. Vincent. Nearest Neighbor Queries. Proc. ACM SIGMOD, San Jose, USA, pages 71-79, May.
Tolga Bozkaya and Meral Ozsoyoglu. Indexing Large Metric Spaces For Similarity Search Queries. Association for Computing Machinery Transactions on Database Systems, pages 1-34.
Roberto Figueira Santos Filho, Agma Traina, Caetano Traina Jr and Christos Faloutsos. Similarity Search Without Tears: The OMNI-Family Of All-Purpose Access Methods. ICDE 2001.