Lars Arge1, Mark de Berg2, Herman Haverkort3 and Ke Yi1

Slides:



Advertisements
Similar presentations
I/O-Algorithms Lars Arge Fall 2014 September 25, 2014.
Advertisements

Searching on Multi-Dimensional Data
A Simple and Efficient Algorithm for R-Tree Packing Scott T. Leutenegger, Mario A. Lopez, Jeffrey Edgington STR Sunho Cho Jeonghun Ahn 1.
B+-trees. Model of Computation Data stored on disk(s) Minimum transfer unit: a page = b bytes or B records (or block) N records -> N/B = n pages I/O complexity:
External Memory Geometric Data Structures
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
COMP 451/651 Indexes Chapter 1.
I/O-Efficient Construction of Constrained Delaunay Triangulations Pankaj K. Agarwal, Lars Arge, and Ke Yi Duke University.
2-dimensional indexing structure
I/O-Algorithms Lars Arge University of Aarhus February 21, 2005.
I/O-Algorithms Lars Arge Aarhus University February 27, 2007.
Micha Streppel TU Eindhoven  NCIM-Groep, the Netherlands and Ke Yi AT&T Labs, USA  HKUST, Hong Kong.
I/O-Algorithms Lars Arge Spring 2011 March 8, 2011.
Spatial Indexing SAMs. Spatial Indexing Point Access Methods can index only points. What about regions? Z-ordering and quadtrees Use the transformation.
Multiple-key indexes Index on one attribute provides pointer to an index on the other. If V is a value of the first attribute, then the index we reach.
Optimal Planar Point Enclosure Indexing Lars Arge, Vasilis Samoladas and Ke Yi Department of Computer Science Duke University Technical University of Crete.
B+-tree and Hashing.
I/O-Algorithms Lars Arge Aarhus University February 13, 2007.
I/O-Algorithms Lars Arge Aarhus University March 16, 2006.
I/O-Algorithms Lars Arge Spring 2009 February 2, 2009.
Accessing Spatial Data
I/O-Algorithms Lars Arge Aarhus University February 16, 2006.
I/O-Algorithms Lars Arge Aarhus University February 7, 2005.
I/O-Algorithms Lars Arge University of Aarhus February 13, 2005.
I/O-Algorithms Lars Arge Spring 2009 April 28, 2009.
I/O-Algorithms Lars Arge University of Aarhus March 1, 2005.
I/O-Algorithms Lars Arge Spring 2009 March 3, 2009.
I/O-Algorithms Lars Arge Aarhus University February 6, 2007.
I/O-Algorithms Lars Arge Aarhus University March 5, 2008.
I/O-Efficient Structures for Orthogonal Range Max and Stabbing Max Queries Second Year Project Presentation Ke Yi Advisor: Lars Arge Committee: Pankaj.
I/O-Algorithms Lars Arge Aarhus University March 9, 2006.
I/O-Algorithms Lars Arge Aarhus University February 14, 2008.
Spatial Indexing I Point Access Methods.
I/O-Algorithms Lars Arge Aarhus University March 6, 2007.
I/O-Algorithms Lars Arge University of Aarhus March 7, 2005.
1 Geometric index structures April 15, 2004 Based on GUW Chapter , [Arge01] Sections 1, 2.1 (persistent B- trees), 3-4 (static versions.
Primary Indexes Dense Indexes
R-Trees 2-dimensional indexing structure. R-trees 2-dimensional version of the B-tree: B-tree of maximum degree 8; degree between 3 and 8 Internal nodes.
Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.
Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.
CS4432: Database Systems II
R-Trees: A Dynamic Index Structure for Spatial Data Antonin Guttman.
R-Trees Extension of B+-trees.  Collection of d-dimensional rectangles.  A point in d-dimensions is a trivial rectangle.
Data Structures for Computer Graphics Point Based Representations and Data Structures Lectured by Vlastimil Havran.
Trees for spatial data representation and searching
1 SD-Rtree: A Scalable Distributed Rtree Witold Litwin & Cédric du Mouza & Philippe Rigaux.
The X-Tree An Index Structure for High Dimensional Data Stefan Berchtold, Daniel A Keim, Hans Peter Kriegel Institute of Computer Science Munich, Germany.
External Memory Algorithms for Geometric Problems Piotr Indyk (slides partially by Lars Arge and Jeff Vitter)
B-trees and kd-trees Piotr Indyk (slides partially by Lars Arge from Duke U)
Bin Yao Spring 2014 (Slides were made available by Feifei Li) Advanced Topics in Data Management.
Mehdi Mohammadi March Western Michigan University Department of Computer Science CS Advanced Data Structure.
CSIS7101 – Advanced Database Technologies Spatio-Temporal Data (Part 1) On Indexing Mobile Objects Kwong Chi Ho Leo Wong Chi Kwong Simon Lui, Tak Sing.
Lars Arge Presented by Or Ozery. I/O Model Previously defined: N = # of elements in input M = # of elements that fit into memory B = # of elements per.
Lecture 2: External Memory Indexing Structures CS6931 Database Seminar.
Computational Geometry Piyush Kumar (Lecture 5: Range Searching) Welcome to CIS5930.
Bin Yao (Slides made available by Feifei Li) R-tree: Indexing Structure for Data in Multi- dimensional Space.
Lecture 11COMPSCI.220.FS.T Balancing an AVLTree Two mirror-symmetric pairs of cases to rebalance the tree if after the insertion of a new key to.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
Lecture 3: External Memory Indexing Structures (Contd) CS6931 Database Seminar.
Spatial Indexing Techniques Introduction to Spatial Computing CSE 5ISC Some slides adapted from Spatial Databases: A Tour by Shashi Shekhar Prentice Hall.
R-Trees: A Dynamic Index Structure For Spatial Searching Antonin Guttman.
External Memory Geometric Data Structures Lars Arge Duke University June 27, 2002 Summer School on Massive Datasets.
1 CSIS 7101: CSIS 7101: Spatial Data (Part 1) The R*-tree : An Efficient and Robust Access Method for Points and Rectangles Rollo Chan Chu Chung Man Mak.
STR: A Simple and Efficient Algorithm for R-Tree Packing.
A Spatial Index Structure for High Dimensional Point Data Wei Wang, Jiong Yang, and Richard Muntz Data Mining Lab Department of Computer Science University.
1 R-Trees Guttman. 2 Introduction Range queries in multiple dimensions: Computer Aided Design (CAD) Geo-data applications Support special data objects.
Spatial Indexing I Point Access Methods.
Advanced Topics in Data Management
R-tree: Indexing Structure for Data in Multi-dimensional Space
Presentation transcript:

The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree Lars Arge1, Mark de Berg2, Herman Haverkort3 and Ke Yi1 Department of Computer Science Duke University TU Eindhoven Institute of Information and Computing Sciences Utrecht University

Problem Definition Input: N rectangles in the plane Window query Q Output: All rectangles intersecting Q Applications Spatial databases GIS CAD Computer vision Robotics …

R-Tree Definition [Guttman84]: Fanout: Ө(B) B: disk block size Advantages: Little redundancy Multi-purpose Easy to update Fanout: Ө(B) B: disk block size G F E B A H I A B C D E F G H I C D

How to Build an R-Tree Repeated insertions [Guttman84] R+-tree [Sellis et al. 87] R*-tree [Beckmann et al. 90] Bulkloading Hilbert R-Tree [Kamel and Faloutos 94] Top-down Greedy Split [Garcia et al. 98] Advantages: Much faster than repeated insertions Better space utilization Usually produce R-trees with higher quality

R-Tree Variant: Hilbert R-Tree Hilbert Curve To build a Hilbert R-Tree (cost: O(N/B logM/BN) I/Os) Sort the rectangles by the Hilbert values of their centers Build a B-tree on top 4D Hilbert R-tree

R-Tree Variant: TGS R-Tree (Top-down Greedy Split) To build a TGS R-tree Start from the root and build the tree top-down To build one node, use binary cuts until the desired fan-out is reached To make a binary cut, consider 4 orderings of the rectangles: xmin, ymin, xmax, ymax In each ordering, consider the B cutting positions Choose the one that minimizes the sum of the areas of the two resulted bounding boxes Typical bulk-load cost: O(N/B log2N) I/Os

Our Results None of existing R-tree variants has worst-case query performance guarantee! In the worst-case, a query can visit all nodes in the tree even when the output size is zero Priority R-Tree The first R-tree variant that answers a query by visiting nodes in the worst case T: Output size It is optimal! There exists a dataset such that for any R-tree, there is an empty query that visits nodes. [Kanth and Singh 99, Agarwal et al. 02]

Roadmap Pseudo-PR-Tree Has the desired worst-case guarantee Not a real R-tree Transform a pseudo-PR-Tree into a PR-tree A real R-tree Maintain the worst-case guarantee Experiments PR-tree Hilbert R-tree (2D and 4D) TGS-R-tree

Building a Pseudo-PR-Tree priority leaves root Step 1: take out B extreme rectangles from each direction and put them into priority leaves

Building a Pseudo-PR-Tree Step 2: Divide by the xmin coordinates and build subtrees recursively. Division is performed using xmin, ymin, xmax, ymax in a round-robin fashion, like a 4D kd-tree root Analysis sketch: # nodes with at least one priority leaf completely reported: O(T/B) # nodes with no priority leaf completely reported:

Pseudo-PR-Tree to a Real R-tree

Query Complexity Remains Unchanged Next level: # nodes visited on leaf level

PR-Tree: Bulkload & Updates O(N/B∙log2N) I/Os→O(N/B∙logM/BN) I/Os, using “grid method” [Agarwal et al. 01] The same as Hilbert R-tree, but with a larger constant Updates Can use any previous heuristic to update in O(logBN) I/Os Without worst-case query guarantee Use logarithmic method Insert: O(logBN + 1/B · logM/BN log2(N/M)) I/Os Delete: O(logBN) I/Os Extending to d-dimensions Query bound: O((N/B)1-1/d + T/B), still optimal Bulkload & update bounds remain the same

Experiments Implemented with TPIE Priority R-tree Hilbert R-tree 4D Hilbert R-tree TGS R-tree Real-life data TIGER datasets 16 million rectangles Synthetic data Varying from normal to extreme data 10 million rectangles

Experiments with Real-Life Data Query performance on the TIGER datasets Shown: # I/Os spent in answering a query T/B

Experiments with Synthetic Data: SIZE Each side of a rectangle is uniformly distributed in [0, max_side] Queries are squares with area 1%

Experiments with Synthetic Data: ASPECT Fix the area, vary aspect ratio

Experiments with Synthetic Data: SKEWED Randomly place points, then do y’=yc on the y-coordinates

Experiments with Synthetic Data: CLUSTER

Conclusions In theory The PR-tree is the first R-tree variant that answers a window query in I/Os worst-case, which is optimal In practice Roughly the same as previous best R-trees on real-life and relatively nicely distributed data Outperforms them significantly on more extreme data Future work How previous heuristics may affect the performance of the PR-tree in the dynamic case

Lower Bound Construction Each bounding box intersects at least queries N/B bounding boxes queries There exists a query that intersects at least bounding boxes

Pseudo-PR-Tree: Query Complexity Nodes v visited where all rectangles in at least one of the priority leaves of v’s parent are reported: O(T/B) Let v be a node visited but none of the priority leaves at its parent are reported completely, consider v’s parent u 2D 4D Q ymin = ymax(Q) xmax = xmin(Q)

Pseudo-PR-Tree: Query Complexity The cell in the 4D kd-tree of u is intersected by two different 3-dimensional hyper-planes The intersection of each pair of such 3-dimensional hyper-planes is a 2-dimensional hyper-plane Lemma: # of cells in a d-dimensional kd-tree that intersect an axis-parallel f-dimensional hyper-plane is O((N/B)f/d) So, # such cells in a 4D kd-tree: Total # nodes visited: u

Experiments with Real-Life Data Datasets: TIGER/Line data Bulk-loading: