Introduction to The NSP-Tree: A Space-Partitioning Based Indexing Method Gang Qian University of Central Oklahoma November 2006.



Summary
- Overview
- Motivation and Existing Work
- NSP-Tree Structure, Algorithms and Performance
- Conclusion and Future Work

Overview
- The NSP-tree is a disk-based index structure
  - Similar to the B-tree/B+-tree
- It is designed to index a large number of vectors with non-ordered discrete components
  - Domains with discrete values that are not naturally ordered are very common, e.g., gender, profession, genome bases
- It is used to speed up similarity queries over the indexed data
  - Unlike an exact query, a similarity query searches for data items that are similar to the given query item

Motivation
- Traditional database technology is mature
  - Data model: Relational Data Model
  - Design: ER/EER Diagrams
  - Query: SQL
  - Data integrity: Transaction Processing
  - Index: B-tree/B+-tree
  - Some hard unsolved issues still exist, e.g., multidimensional query optimization

- New problems arise with the increasing demand for managing non-traditional data types
  - Multimedia data
  - Scientific data
  - Spatial data
  - Temporal data
  - Biological data, etc.
- With the new data types, exact queries are no longer useful
  - Similarity queries become more and more important

Vector Model
- The vector model is a very useful tool for supporting these new data types
  - Many non-traditional data types are vectors or can easily be converted into vectors, e.g., feature vectors for images
  - Vectors can be regarded as points in a high-dimensional data space
  - Therefore, the distance between a pair of vectors (e.g., the Euclidean distance) is a natural quantitative measure of the (dis)similarity between the two data objects that the vectors represent
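As a minimal sketch of distance as a dissimilarity measure, the snippet below computes the Euclidean distance between feature vectors. The three 3-dimensional "image" vectors are made-up values for illustration only, not data from the presentation.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical feature vectors (illustrative values only).
image_a = (0.0, 0.0, 0.0)
image_b = (3.0, 4.0, 0.0)
image_c = (0.1, 0.0, 0.0)

d_ab = euclidean_distance(image_a, image_b)
d_ac = euclidean_distance(image_a, image_c)

# A smaller distance means the two objects are considered more similar.
print(d_ab, d_ac)
```

Under this measure, image_c is more similar to image_a than image_b is.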

- The problem of managing non-traditional databases becomes the problem of managing vector databases
- Designing index structures to support efficient similarity queries on vectors is an open research area in vector databases
  - For example, the NSP-tree is designed to index vectors with discrete, non-ordered components, e.g., genome sequence data

Existing Work
- A number of index structures have been proposed for vectors with continuous numerical components
  - E.g., the R-tree and its variants: SS-tree, SR-tree, X-tree, Hybrid tree, etc.
- Due to the volume of the data, almost all proposed index structures are disk-based

- The basic structure of these indices is very similar to that of the B+-tree
  - Hierarchical tree structure
  - Each tree node occupies exactly one disk block and has a minimum utilization requirement
  - Vectors are stored in leaf nodes
  - Non-leaf nodes contain routing information that is used for tree construction and searching
- Routing information is usually represented by some type of minimum bounding shape
  - Minimum Bounding Rectangle (MBR), Minimum Bounding Sphere (MBS), etc.

Example: R-Tree Structure
[Figure adapted from “The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries” (SIGMOD 1997)]

- Such an index tree grows in a bottom-up fashion
  - Vectors are incrementally inserted into the tree
  - When a leaf node is full, it is split into two leaves
  - The split of a child may cause the split of its parent
  - Node splits may propagate all the way up to the root, in which case the root itself is split and a new root is created
- Search works top-down from the root
  - Search performance is usually measured by the total number of disk blocks (nodes) accessed
  - Search efficiency comes from pruning branches that are not within the search range
  - Unlike in a brute-force linear search, vectors in irrelevant branches are not visited
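The top-down search with pruning can be sketched as follows. This is an illustrative in-memory model, not an implementation from any of the cited papers: the Node class, the tiny hand-built two-level tree, and the use of a rectangular query window (rather than a distance-based range) are all assumptions made to keep the example short. The counter stands in for the number of disk blocks accessed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    mbr: tuple                                     # ((x_lo, x_hi), (y_lo, y_hi))
    children: list = field(default_factory=list)   # non-leaf: child nodes
    vectors: list = field(default_factory=list)    # leaf: stored vectors


def intersects(mbr, query):
    """True if the two axis-aligned rectangles share at least one point."""
    return all(lo <= q_hi and q_lo <= hi
               for (lo, hi), (q_lo, q_hi) in zip(mbr, query))


def inside(vector, query):
    """True if the vector lies inside the query rectangle."""
    return all(q_lo <= x <= q_hi for x, (q_lo, q_hi) in zip(vector, query))


def range_search(node, query, stats):
    """Top-down search; every visited node counts as one block access."""
    stats["nodes_accessed"] += 1
    if not node.children:                          # leaf node
        return [v for v in node.vectors if inside(v, query)]
    results = []
    for child in node.children:
        if intersects(child.mbr, query):           # otherwise the branch is pruned
            results.extend(range_search(child, query, stats))
    return results


# Hand-built example tree (hypothetical data).
leaf1 = Node(mbr=((0.0, 0.4), (0.0, 0.4)), vectors=[(0.1, 0.2), (0.3, 0.4)])
leaf2 = Node(mbr=((0.6, 0.9), (0.5, 0.9)), vectors=[(0.6, 0.5), (0.9, 0.9)])
root = Node(mbr=((0.0, 0.9), (0.0, 0.9)), children=[leaf1, leaf2])

stats = {"nodes_accessed": 0}
found = range_search(root, ((0.0, 0.2), (0.0, 0.5)), stats)
print(found, stats)
```

Here leaf2 is never read because its MBR does not intersect the query window, so only two of the three nodes are accessed.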

- Unfortunately, the index trees mentioned on the previous slides cannot be used directly for vectors with non-ordered discrete components
- The ND-tree was proposed to index such vectors
  - See “The ND-Tree: A Dynamic Indexing Technique for Multidimensional Non-ordered Discrete Data Spaces” (VLDB 2003)

Discrete Space Concepts
- The structure of the ND-tree is very similar to those of the R-tree variants
- However, all the underlying geometrical concepts are redefined to accommodate discrete vectors
  Euclidean/Continuous Space    Discrete Space
  Vector                        Discrete Vector
  Rectangle                     Discrete Rectangle
  Area                          Discrete Area
  Euclidean Distance            Hamming Distance
  ……
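To make the discrete counterpart of distance concrete, here is a small sketch of the Hamming distance (the number of positions at which two vectors differ) together with a brute-force similarity range query. The four short genome-like strings are invented for illustration; an index such as the ND-tree exists precisely to avoid this kind of full scan.

```python
def hamming_distance(u, v):
    """Number of dimensions on which two equal-length discrete vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return sum(1 for a, b in zip(u, v) if a != b)


def range_query_linear(vectors, query, radius):
    """Brute-force range query: every stored vector is examined."""
    return [v for v in vectors if hamming_distance(v, query) <= radius]


# Hypothetical 4-dimensional vectors over the alphabet {a, c, g, t}.
genome_fragments = ["acgt", "acgg", "ttgt", "ggca"]
hits = range_query_linear(genome_fragments, "acgt", 1)
print(hits)
```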

Example: Discrete Rectangles
- Introduced to bound vectors with non-ordered discrete components
- A normal rectangle can be regarded as the Cartesian product of ranges, one for every dimension in the data space
  - E.g., [0.1, 0.2] × [0.7, 0.8] is a two-dimensional rectangle
- A discrete rectangle is defined as the Cartesian product of sets of discrete values, one for every dimension
  - E.g., {a, g} × {t, c, g} is a two-dimensional discrete rectangle that covers every vector whose first component is a or g and whose second component is t, c, or g
- Discrete Minimum Bounding Rectangles (DMBRs) store the routing information for the ND-tree
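A discrete rectangle can be modeled as one set of values per dimension. The sketch below is an assumption-laden illustration, not code from the ND-tree paper: the function names are hypothetical, and a rectangle is simply a list of frozensets. It shows containment, computing a DMBR for a group of vectors, and an overlap test.

```python
def contains(rect, vector):
    """True if every component of the vector is in the rectangle's value set
    for that dimension (vector and rectangle must have the same length)."""
    return all(value in allowed for value, allowed in zip(vector, rect))


def dmbr(vectors):
    """Smallest discrete rectangle that covers all the given vectors."""
    dims = len(vectors[0])
    return [frozenset(v[i] for v in vectors) for i in range(dims)]


def overlaps(r1, r2):
    """Two discrete rectangles overlap only if their value sets intersect
    on every dimension."""
    return all(a & b for a, b in zip(r1, r2))


# The rectangle {a, g} x {t, c, g} from the slide.
rect = [frozenset("ag"), frozenset("tcg")]

print(contains(rect, ("a", "t")))   # first component in {a, g}, second in {t, c, g}
print(contains(rect, ("c", "t")))   # first component c is not in {a, g}
print(dmbr([("a", "t"), ("g", "c")]))
```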

Problem of the ND-tree
- Overlap in an index tree may dramatically degrade its search performance
- The construction of the ND-tree cannot totally avoid overlap among the DMBRs in the tree
  - The ND-tree works well when the data is randomly distributed
  - However, for certain data sets, overlap cannot be avoided, for example, skewed data sets based on the Zipf distribution
  - To guarantee the minimum disk utilization, the split algorithm may NOT be able to find an overlap-free split for an overflowing node

Basic Idea of the NSP-Tree
- There are three factors that affect search performance
  - Disk utilization
  - Overlap
  - Fan-out: the maximum number of children of a tree node
- Since overlap cannot be totally avoided when there is a minimum disk utilization requirement, the design of the NSP-tree drops that requirement so that an overlap-free structure can be guaranteed

Space-Partitioning Indexing Methods
- The idea of an overlap-free index structure is not new
  - What makes the NSP-tree new is that it can handle non-ordered discrete data with an overlap-free structure
- There is a category of index trees that have this feature
  - KDB-tree
  - hB-tree
  - LSD-tree, etc.
- They are called space-partitioning indexing methods
  - R-tree variants are called data-partitioning indexing methods
- All previous space-partitioning indices support only vectors with continuous numeric components

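[Figure: “Space-partitioning Information” (a tree whose nodes are labeled d, the split dimension, and v, the split point on the split dimension, with branches for <= and >) shown next to the corresponding “Partitioned Data Space”. Node labels in the figure: d:1 v:0.6; d:2 v:0.3; d:2 v:0.6; d:1 v:0.2; d:1 v:0.4; d:2 v:0.2; d:1 v:0.75.]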
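The role of the split dimension d and split point v can be sketched with a tiny binary partitioning tree for continuous data. The tree below is hypothetical: it reuses the labels d:1 v:0.6 and d:2 v:0.3, but its shape and the region names A, B, and C are assumptions, since the figure's layout is not available in this transcript.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Split:
    d: int                        # split dimension (1-based)
    v: float                      # split point on the split dimension
    low: Union["Split", str]      # subtree or region name for point[d] <= v
    high: Union["Split", str]     # subtree or region name for point[d] > v


def locate(node, point):
    """Follow the splits until a region name (a leaf) is reached."""
    while isinstance(node, Split):
        node = node.low if point[node.d - 1] <= node.v else node.high
    return node


# Hypothetical partitioning of the unit square into regions A, B, and C.
tree = Split(d=1, v=0.6,
             low=Split(d=2, v=0.3, low="A", high="B"),
             high="C")

print(locate(tree, (0.5, 0.2)), locate(tree, (0.5, 0.9)), locate(tree, (0.7, 0.1)))
```

Because each point follows exactly one branch at every split, the regions never overlap.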

Space-Partitioning vs. Data-Partitioning
                                   Space-Partitioning    Data-Partitioning
  Objects that can be indexed      Vectors only          Vectors and spatial objects
  Minimum utilization requirement  No                    Yes
  Guaranteed overlap-free          Yes                   No
  Fan-out                          Large                 Small

NSP-Tree Structure
- Similar to those of the B+-tree and the R-tree, but with no minimum disk utilization requirement
  - Each node occupies one disk block
  - Vectors are stored in leaf nodes
  - Space-partitioning information is stored in non-leaf nodes
- The space concept in the NSP-tree is discrete
  - A discrete data space is defined as the Cartesian product of the sets of all possible values on every dimension
  - Due to the non-ordered nature of the values, a split point on a split dimension is no longer enough to describe a split
  - It is necessary to record explicitly how the values on a dimension are separated into two groups
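One way to picture a split in a non-ordered discrete space is to store the split dimension together with the two groups of values explicitly. The DiscreteSplit class below is a hypothetical illustration of that idea, not the NSP-tree's actual on-disk representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscreteSplit:
    dimension: int             # 0-based index of the split dimension
    left_values: frozenset     # values routed to the first group
    right_values: frozenset    # values routed to the second group

    def side(self, vector):
        """Return which group the vector falls into on the split dimension."""
        value = vector[self.dimension]
        if value in self.left_values:
            return "left"
        if value in self.right_values:
            return "right"
        raise ValueError(f"value {value!r} is not covered by this split")


# Split the first dimension of a genome-like space: {a, g} versus {c, t}.
split = DiscreteSplit(0, frozenset("ag"), frozenset("ct"))
print(split.side("gtc"), split.side("cta"))
```

With ordered values a single number v would separate the two groups; here no such number exists, so the groups themselves must be recorded.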

Structure of The NSP-Tree

Routing Information: Split History Tree (SPT)

- Conceptually, each node corresponds to a subspace of the discrete data space
  - A subspace is defined as the Cartesian product of subsets of the values on every dimension
  - There is no overlap among the subspaces of the children on the same level
  - The subspace of a parent node contains the subspaces of all its children

Eliminating Dead Space
- One disadvantage of a pure space-partitioning approach is that the subspaces do not necessarily minimally bound the vectors in them
  - See next slide
- To further improve the pruning power, DMBRs are used as additional routing information in the tree
- However, the use of DMBRs reduces the fan-out of the tree
  - More space in a node is needed to store the DMBRs
  - We found that the benefits of using DMBRs usually outweigh the disadvantage of the reduced fan-out

[Figure: a subspace that is not minimum bounding, the actual minimum bounding rectangle inside it, and the dead space between them; the figure also carries the labels r and Q.]
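The effect of dead space on pruning can be sketched with the minimum Hamming distance from a query vector to a discrete rectangle, which is the number of dimensions on which the query's value falls outside the rectangle's value set. The subspace, DMBR, and query below are made-up values, and the function names are hypothetical.

```python
def min_hamming_distance(rect, query):
    """Lower bound on the Hamming distance from the query to any vector
    inside the discrete rectangle."""
    return sum(1 for value, allowed in zip(query, rect) if value not in allowed)


def can_prune(rect, query, radius):
    """A branch can be skipped if even its closest possible vector is too far."""
    return min_hamming_distance(rect, query) > radius


# Hypothetical node: its subspace is large, but the stored vectors
# (e.g. "aac" and "agc") occupy only a small part of it.
subspace = [frozenset("acgt"), frozenset("ag"), frozenset("acgt")]
dmbr = [frozenset("a"), frozenset("ag"), frozenset("c")]

query, radius = "tat", 1
print(can_prune(subspace, query, radius))   # the query lies inside the subspace
print(can_prune(dmbr, query, radius))       # the query is 2 away from the DMBR
```

The query falls into the node's dead space: the subspace alone forces a visit, while the tighter DMBR shows that no stored vector can be within the search range.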

Tree Construction Algorithms
- An NSP-tree grows incrementally
  - Vectors are inserted one by one
  - Insertion starts from the root and goes down the tree until a suitable leaf node is found for the new vector
  - The tree grows in a bottom-up fashion
- There are two important algorithms used in the insertion procedure
  - ChooseSubtree
  - SplitNode

- ChooseSubtree
  - Starting from the root, it is invoked on non-leaf nodes
  - Given the vector to insert, the algorithm decides which child node to follow based on whether the child's subspace contains the new vector
  - Due to the overlap-free property, there is at most one child that can contain the new vector
- SplitNode
  - Splits an overflowing node into two nodes
  - The split is guaranteed to be overlap-free
  - It also tries to maximize disk utilization by choosing the most balanced split
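The two algorithms can be sketched roughly as follows. The slides do not give the actual heuristics, so everything here is an assumption: the function names, the greedy balancing rule, and the representation of a child as a (subspace, node) pair are illustrative choices, not the NSP-tree's published algorithms. The sketch only preserves the two stated properties: the split is overlap-free, and the most balanced candidate is preferred.

```python
from collections import Counter


def choose_subtree(children, vector):
    """Return the child whose subspace contains the vector, or None.
    Each child is a (subspace, node) pair; a subspace is one value set
    per dimension."""
    for subspace, child in children:
        if all(value in allowed for value, allowed in zip(vector, subspace)):
            return child
    return None


def split_node(vectors):
    """Pick a dimension and a two-way grouping of its values that divides
    the vectors as evenly as this greedy heuristic can manage.
    Returns (dimension, left_values, right_values), or None if no
    dimension has at least two distinct values."""
    best = None
    for dim in range(len(vectors[0])):
        counts = Counter(v[dim] for v in vectors)
        if len(counts) < 2:
            continue                      # cannot split on this dimension
        left, right, n_left, n_right = set(), set(), 0, 0
        # Greedy: hand each value, most frequent first, to the lighter side.
        for value, count in counts.most_common():
            if n_left <= n_right:
                left.add(value)
                n_left += count
            else:
                right.add(value)
                n_right += count
        imbalance = abs(n_left - n_right)
        if best is None or imbalance < best[0]:
            best = (imbalance, dim, frozenset(left), frozenset(right))
    return None if best is None else best[1:]


# Hypothetical overflowing leaf with six 2-dimensional vectors.
vectors = ["aa", "ac", "ga", "tc", "ta", "cc"]
dim, left, right = split_node(vectors)
left_vectors = [v for v in vectors if v[dim] in left]
right_vectors = [v for v in vectors if v[dim] in right]

# The two value groups are disjoint, so the resulting subspaces cannot overlap.
children = [([left, frozenset("ac")], "L"), ([right, frozenset("ac")], "R")]
print(dim, sorted(left), sorted(right), choose_subtree(children, "ga"))
```

Because the two value groups on the split dimension are disjoint, any vector matches at most one child, which is what lets ChooseSubtree follow a single path.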

- There are other algorithms for the NSP-tree
  - Generating and maintaining DMBRs
  - Query
  - Deletion, etc.

Query Performance

Disk Utilization

Summary
- The NSP-tree is the first indexing method that uses the space-partitioning approach to index vectors with non-ordered discrete components
- The benefit of using an overlap-free tree structure is obvious when the data distribution is skewed
- With proper heuristics, the disadvantage of removing the minimum disk utilization requirement can be minimized
- In general, the benefit of using DMBRs to eliminate dead space (and hence increase the pruning power) outweighs the disadvantage of the reduced fan-out

Future Work
- Bulk-loading the NSP-tree and the ND-tree
  - Insert more than one vector at a time
- Supporting approximate similarity queries
  - Beat the curse of high dimensionality
- Supporting queries based on the edit distance
  - Besides the Hamming distance, the edit distance is another widely used distance measure for discrete vectors
- Aggregating all the technology into a viable bioinformatics search engine