Presentation transcript:

1 R-Trees Guttman

2 Introduction Range queries in multiple dimensions: Computer Aided Design (CAD), geo-data applications. Support for special data objects (boxes). The index structure is dynamic.

3 R-Tree Balanced (similar to a B+ tree). I is an n-dimensional rectangle of the form (I_0, I_1, ..., I_{n-1}), where each I_i is a range. Leaf node index entries: (I, tuple_id). Non-leaf node entries: (I, child_ptr). M is the maximum number of entries per node; m ≤ M/2 is the minimum number of entries per node.

4 Invariants
1. Every leaf (non-leaf) node has between m and M records (children), except for the root.
2. The root has at least two children unless it is a leaf.
3. For each leaf (non-leaf) entry, I is the smallest rectangle that contains the data object (children).
4. All leaves appear at the same level.
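
A minimal Python sketch of the node layout and the invariants above. It is an illustration under stated assumptions, not code from Guttman's paper: a rectangle is assumed to be a tuple of (low, high) intervals, one per dimension, a node is a plain dict, and the names mbr and check are hypothetical.

```python
def mbr(rects):
    """Smallest rectangle enclosing every rectangle in the list `rects`."""
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in range(len(rects[0])))

def check(node, m, M, is_root=True):
    """Return the height below `node` if the invariants hold, else raise ValueError.

    A node is {"leaf": bool, "entries": [(rect, child_or_tuple_id), ...]}.
    """
    n = len(node["entries"])
    if n > M:
        raise ValueError("more than M entries")
    if is_root and not node["leaf"] and n < 2:
        raise ValueError("non-leaf root needs at least two children")
    if not is_root and n < m:
        raise ValueError("fewer than m entries")
    if node["leaf"]:
        # Invariant 3 for leaves needs the data objects themselves,
        # so it is not checked in this sketch.
        return 0
    heights = set()
    for rect, child in node["entries"]:
        heights.add(check(child, m, M, is_root=False))
        if rect != mbr([r for r, _ in child["entries"]]):
            raise ValueError("entry rectangle is not the tight bounding box")
    if len(heights) != 1:
        raise ValueError("leaves are not all at the same level")
    return heights.pop() + 1
```

Calling check(root, m, M) on a well-formed tree returns its height (0 for a tree that is a single leaf).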

5 Example

6 Searching Given a search rectangle S:
1. Start at the root and locate all child nodes whose rectangle I intersects S (via linear search).
2. Search the subtrees of those child nodes.
3. When you get to the leaves, return the entries whose rectangles intersect S.
Searches may require inspecting several paths. The worst-case running time is poor: a search may have to visit every node.
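
The search procedure can be sketched in a few lines of Python. This is an illustrative sketch, assuming rectangles are tuples of closed (low, high) intervals and nodes are dicts; the function names are hypothetical.

```python
def intersects(a, b):
    """True if rectangles a and b overlap in every dimension (closed intervals)."""
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))

def search(node, S):
    """Return the tuple ids of all leaf entries whose rectangle intersects S.

    A node is {"leaf": bool, "entries": [(rect, child_or_tuple_id), ...]}.
    """
    hits = []
    for rect, item in node["entries"]:        # linear scan of the node
        if intersects(rect, S):
            if node["leaf"]:
                hits.append(item)
            else:
                hits.extend(search(item, S))  # may descend into several subtrees
    return hits
```

Because sibling rectangles may overlap, more than one branch can satisfy the intersection test, which is why a single query can follow several root-to-leaf paths.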

7 S = R16

8 Insertion Insertion is done at the leaves. Where to put a new index entry E with rectangle R?
1. Start at the root.
2. Go down the tree by choosing the child whose rectangle needs the least enlargement to include R. In case of a tie, choose the child with the smallest area.
3. If there is room in the chosen leaf node, insert E. Otherwise split the node (illustrated later).
4. Adjust the tree. If the root was split into nodes N_1 and N_2, create a new root with N_1 and N_2 as children.
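
Steps 1 and 2 (descending to a leaf) can be sketched as follows. This is a hedged illustration: the dict-based node layout and the names area, union, enlargement and choose_leaf are assumptions made for the example.

```python
def area(r):
    """Product of the side lengths of rectangle r."""
    a = 1
    for lo, hi in r:
        a *= hi - lo
    return a

def union(a, b):
    """Smallest rectangle covering both a and b."""
    return tuple((min(alo, blo), max(ahi, bhi))
                 for (alo, ahi), (blo, bhi) in zip(a, b))

def enlargement(r, new):
    """Extra area r needs in order to include `new`."""
    return area(union(r, new)) - area(r)

def choose_leaf(node, R):
    """Walk from `node` down to the leaf that should receive rectangle R."""
    while not node["leaf"]:
        # Least enlargement first; smallest area breaks ties.
        _, node = min(node["entries"],
                      key=lambda e: (enlargement(e[0], R), area(e[0])))
    return node
```

The tuple key encodes the tie-break directly: Python compares the enlargement first and only looks at the area when enlargements are equal.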

9 Adjusting the tree
1. N = leaf node. If there was a split, then NN is the other node.
2. If N is the root, stop. Otherwise P = N's parent and E_N is its entry for N. Adjust the rectangle of E_N to tightly enclose N.
3. If NN exists, add entry E_NN to P. E_NN points to NN and its rectangle tightly encloses NN.
4. If necessary, split P.
5. Set N = P (and NN = the second node produced by splitting P, if P was split) and go to step 2.
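
The rectangle-tightening part of this procedure can be sketched as below. This is a deliberately partial illustration: it assumes the insertion path from root to leaf was recorded as a list of dict-based nodes, it ignores split propagation (steps 3 and 4), and the names mbr and tighten_path are hypothetical.

```python
def mbr(rects):
    """Smallest rectangle enclosing every rectangle in the list `rects`."""
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in range(len(rects[0])))

def tighten_path(path):
    """Recompute parent entry rectangles bottom-up along `path` (root first, leaf last)."""
    for parent, child in zip(reversed(path[:-1]), reversed(path[1:])):
        for k, (_, node) in enumerate(parent["entries"]):
            if node is child:
                # E_N: make the parent's rectangle tightly enclose the child.
                tight = mbr([r for r, _ in child["entries"]])
                parent["entries"][k] = (tight, child)
```

Processing the pairs bottom-up matters: a parent's rectangle must be recomputed only after its child's own entries have been tightened.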

10 Splitting Nodes A well-designed splitting strategy should minimize the total area of the two nodes and minimize the overlap between the two nodes.

11 Splitting Nodes – Exhaustive Search Try all possible groupings. The result is optimal, but the running time is bad: the number of groupings grows exponentially with M.

12 Splitting Nodes – Quadratic Algorithm
1. Find the pair of entries E_1 and E_2 that maximizes area(J) - area(E_1) - area(E_2), where J is the rectangle covering E_1 and E_2.
2. Put E_1 in one group and E_2 in the other.
3. If one group has M-m+1 entries, put the remaining entries into the other group and stop. If all entries have been distributed, stop.
4. For each remaining entry E, calculate d_1 and d_2, where d_i is the area increase of the covering rectangle of group i when E is added.
5. Find the E with maximum |d_1 - d_2| and add it to the group whose area will increase the least.
6. Repeat starting with step 3.

13 Greedy continued The algorithm is quadratic in M and linear in the number of dimensions, but it is not optimal.
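
A compact Python sketch of the quadratic split, under stated assumptions: the input is the list of M+1 rectangles of the overflowing node, rectangles are tuples of (low, high) intervals, and ties in step 5 simply go to the first group (a simplification of this sketch). The function names are hypothetical.

```python
from itertools import combinations

def area(r):
    a = 1
    for lo, hi in r:
        a *= hi - lo
    return a

def union(a, b):
    return tuple((min(alo, blo), max(ahi, bhi))
                 for (alo, ahi), (blo, bhi) in zip(a, b))

def quadratic_split(rects, m):
    """Split a list of M+1 rectangles into two groups of at least m each."""
    # Steps 1-2: the seeds are the pair that wastes the most area together.
    i, j = max(combinations(range(len(rects)), 2),
               key=lambda p: area(union(rects[p[0]], rects[p[1]]))
                             - area(rects[p[0]]) - area(rects[p[1]]))
    groups = [[rects[i]], [rects[j]]]
    covers = [rects[i], rects[j]]
    rest = [r for k, r in enumerate(rects) if k not in (i, j)]
    limit = len(rects) - m              # equals M-m+1 for M+1 input rectangles

    def growth(r):
        """Area increase of each group's covering rectangle if r is added."""
        return [area(union(c, r)) - area(c) for c in covers]

    while rest:
        # Step 3: a full group forces everything left into the other one.
        for g in (0, 1):
            if len(groups[g]) == limit:
                groups[1 - g].extend(rest)
                return groups
        # Steps 4-5: take the entry with the strongest preference.
        r = max(rest, key=lambda x: abs(growth(x)[0] - growth(x)[1]))
        d = growth(r)
        g = 0 if d[0] <= d[1] else 1
        groups[g].append(r)
        covers[g] = union(covers[g], r)
        rest.remove(r)
    return groups
```

The seed search examines every pair of entries, which is where the quadratic cost in M comes from.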

14 Splitting Nodes – Linear Algorithm
1. For each dimension, find the entry whose rectangle has the highest low side and the entry whose rectangle has the lowest high side, and record their separation.
2. Normalize each separation by dividing it by the width of the entire set along that dimension.
3. Put the two entries with the largest normalized separation into different groups.
4. Randomly, but evenly, divide the rest of the entries between the two groups.
5. The algorithm is linear and makes almost no attempt at optimality.
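
The seed selection of the linear algorithm can be sketched as below. The sketch follows the seed-picking rule of Guttman's paper as best understood here (per dimension, pair the rectangle with the highest low side with the one with the lowest high side); the function name and the fallback to the first two entries are assumptions of this example.

```python
def linear_pick_seeds(rects):
    """Return the indices of the two seed rectangles for a linear split."""
    best = None
    idx = range(len(rects))
    for d in range(len(rects[0])):
        hi_low = max(idx, key=lambda k: rects[k][d][0])    # highest low side
        lo_high = min(idx, key=lambda k: rects[k][d][1])   # lowest high side
        width = max(r[d][1] for r in rects) - min(r[d][0] for r in rects)
        if hi_low == lo_high or width == 0:
            continue                     # no usable pair along this dimension
        # Normalized separation between the two extreme rectangles.
        sep = (rects[hi_low][d][0] - rects[lo_high][d][1]) / width
        if best is None or sep > best[0]:
            best = (sep, lo_high, hi_low)
    # Fallback (assumption of this sketch): use the first two entries.
    return best[1:] if best else (0, 1)
```

Each dimension is scanned a constant number of times, so seed selection is linear in both M and the number of dimensions.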

15 The R*-tree: An Efficient and Robust Access Method for Points and Rectangles Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger

16 Introduction R-trees use heuristics to minimize the areas of all enclosing rectangles of their nodes. Why? Why not minimize the overlap of rectangles? Minimize the margin (sum of the lengths along each dimension) of each rectangle, i.e. make it as square as possible? Optimize storage utilization? All of the above?

17 Minimizing Covering Rectangle Dead space is the area covered by the covering rectangle that is not covered by any of the rectangles it encloses. Minimizing dead space reduces the number of paths to traverse during a search, especially if no data matches the search.

18 Minimizing Overlap Also reduces number of paths to be traversed during a search, especially when there is data that matches the search criteria.

19 Minimizing Margin Minimizing margin produces “square-like” rectangles. Squares are easier to pack, so this tends to produce smaller covering rectangles in higher levels.
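
The three geometric measures discussed so far (area, overlap, margin) are easy to state in code. A small sketch, assuming rectangles are tuples of (low, high) intervals and taking margin as the sum of the side lengths, as defined on the introduction slide; the example rectangles are made up.

```python
def area(r):
    """Product of the side lengths."""
    a = 1
    for lo, hi in r:
        a *= hi - lo
    return a

def margin(r):
    """Sum of the side lengths along each dimension."""
    return sum(hi - lo for lo, hi in r)

def overlap(a, b):
    """Area of the intersection of a and b (0 if they do not overlap)."""
    result = 1
    for (alo, ahi), (blo, bhi) in zip(a, b):
        side = min(ahi, bhi) - max(alo, blo)
        if side <= 0:
            return 0
        result *= side
    return result

square = ((0, 4), (0, 4))
sliver = ((0, 16), (0, 1))
# Same area, very different margin: the square is the "better" shape.
print(area(square), margin(square))   # 16 8
print(area(sliver), margin(sliver))   # 16 17
```

Area alone cannot tell the two rectangles apart, which is why the R*-tree also looks at margin.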

20 Storage Utilization Higher storage utilization reduces the height of the tree, so searches are faster. Searches with large query rectangles benefit because there are more matches per page.

21 Problems with Guttman’s Quadratic Split Distributing entries during a split favors the larger rectangle, since it will probably need the least enlargement to add an additional item. When one group gets M-m+1 entries, all the rest are put in the other node without regard to their geometry.

22 R*-tree - ChooseSubtree Let E_1, ..., E_p be the rectangles of the entries of the current node. ChooseSubtree(level) finds the best place to insert a new entry at the appropriate level.
CS1: Set N to be the root.
CS2: If N is at the correct level, return N.
CS3: If N’s children are leaves, choose the entry whose overlap cost will increase the least. If N’s children are not leaves, choose the entry whose rectangle will need the least enlargement.
CS4: Set N to be the child whose entry was selected and repeat from CS2.
Ties are broken by choosing the entry whose rectangle needs the least enlargement, and after that the entry whose rectangle has the smallest area.
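
The choice made in CS3 can be sketched as follows. This is an illustration under assumptions: the overlap cost of an entry is taken to be the summed intersection area with its siblings, rectangles are tuples of (low, high) intervals, and choose_entry is a hypothetical name.

```python
def area(r):
    a = 1
    for lo, hi in r:
        a *= hi - lo
    return a

def union(a, b):
    return tuple((min(alo, blo), max(ahi, bhi))
                 for (alo, ahi), (blo, bhi) in zip(a, b))

def overlap(a, b):
    result = 1
    for (alo, ahi), (blo, bhi) in zip(a, b):
        side = min(ahi, bhi) - max(alo, blo)
        if side <= 0:
            return 0
        result *= side
    return result

def choose_entry(rects, new, children_are_leaves):
    """Index of the entry of the current node that should receive `new`."""
    def cost(k):
        grown = union(rects[k], new)
        area_enl = area(grown) - area(rects[k])
        if not children_are_leaves:
            return (area_enl, area(rects[k]))
        others = rects[:k] + rects[k + 1:]
        # Increase of the overlap with all sibling rectangles.
        ov_enl = sum(overlap(grown, o) - overlap(rects[k], o) for o in others)
        return (ov_enl, area_enl, area(rects[k]))
    return min(range(len(rects)), key=cost)

# The two criteria can disagree: X needs less area, but growing it overlaps Z.
X, Z = ((30, 60), (0, 10)), ((55, 70), (12, 30))
new = ((40, 50), (10, 20))
print(choose_entry([X, Z], new, children_are_leaves=False))  # 0
print(choose_entry([X, Z], new, children_are_leaves=True))   # 1
```

Computing the overlap increase compares each candidate against all of its siblings, so it costs more than the plain area test; this sketch applies it only one level above the leaves, as in CS3.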

23 ChooseSubtree analysis The only difference from Guttman’s algorithm is to calculate overlap cost for leaves. This creates slightly better insert performance.

24 Optimizing Splits For each dimension, entries are sorted by low value, and then by high value. For each sort we create d = M-2m+2 distributions. In the k-th distribution (1 ≤ k ≤ d), the first group has the first (m-1)+k entries and the second group has the remaining entries. We also have the following measures (R_i is the bounding rectangle of group i):
1. area-value = area[R_1] + area[R_2]
2. margin-value = margin[R_1] + margin[R_2]
3. overlap-value = area[R_1 ∩ R_2]

25 Optimizing Splits
Split:
S1: call ChooseSplitAxis to find the axis perpendicular to which the split is performed.
S2: call ChooseSplitIndex to find the best distribution. Use this distribution to create the two groups.
ChooseSplitAxis:
CSA1: for each dimension, compute the sum of the margin-values of all distributions produced.
CSA2: return the dimension that has the minimum sum.
ChooseSplitIndex:
CSI1: for the chosen split axis, choose the distribution with the minimum overlap-value. Break ties by choosing the distribution with the minimum area-value.
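
The split can be sketched in Python as below. It is a hedged illustration, not the authors' implementation: the input is assumed to be the M+1 rectangles of the overflowing node as tuples of (low, high) intervals, and all function names are hypothetical.

```python
def cover(rects):
    """Bounding rectangle of a list of rectangles."""
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in range(len(rects[0])))

def area(r):
    a = 1
    for lo, hi in r:
        a *= hi - lo
    return a

def margin(r):
    return sum(hi - lo for lo, hi in r)

def overlap(a, b):
    result = 1
    for (alo, ahi), (blo, bhi) in zip(a, b):
        side = min(ahi, bhi) - max(alo, blo)
        if side <= 0:
            return 0
        result *= side
    return result

def rstar_split(rects, m):
    """Split M+1 rectangles into two groups using the margin/overlap/area rules."""
    M = len(rects) - 1

    def distributions(axis):
        for bound in (0, 1):                        # sort by low value, then by high value
            order = sorted(rects, key=lambda r: r[axis][bound])
            for k in range(1, M - 2 * m + 3):       # k = 1 .. M-2m+2
                cut = (m - 1) + k
                yield order[:cut], order[cut:]

    # ChooseSplitAxis: smallest sum of margin-values over all distributions.
    axis = min(range(len(rects[0])),
               key=lambda a: sum(margin(cover(g1)) + margin(cover(g2))
                                 for g1, g2 in distributions(a)))
    # ChooseSplitIndex: least overlap-value, ties broken by least area-value.
    return min(distributions(axis),
               key=lambda g: (overlap(cover(g[0]), cover(g[1])),
                              area(cover(g[0])) + area(cover(g[1]))))
```

Every distribution leaves at least m entries in each group, so the minimum fill invariant holds by construction.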

26 Analyzing Splits The split algorithm was chosen based on performance and not on any particular theory. Its cost is dominated by sorting the entries, which is O(M log M) for each dimension. m = 40% of M yields good performance (the same value of m is also near-optimal for Guttman’s quadratic split algorithm).

27 Forced Reinsert Splits improve local organization of tree. Can the improvement be made less local? Why not perform reinsert during inserts?

28 R* Insert
InsertData
ID1: call Insert with the leaf level as the parameter.
Insert(level)
I1: call ChooseSubtree(level) to find the node N (at the appropriate level) into which we place the new entry.
I2: if there is room in N, insert the new entry; otherwise call OverflowTreatment with N’s level as the parameter.
I3: if OverflowTreatment caused a split, propagate OverflowTreatment up the tree (if necessary).
I4: if the root was split, create a new root.
I5: adjust all covering rectangles in the insertion path.

29 R* Insert
OverflowTreatment(level)
OT1: If the level is not the root level and this is the first call of OverflowTreatment for this level during the insertion of one rectangle, call Reinsert with level as the parameter. Otherwise call Split.
Reinsert(level)
RI1: In decreasing order, sort the entries E_i of N based on the distance from the center of E_i to the center of N’s bounding rectangle.
RI2: Remove the first p entries of N and adjust N’s bounding rectangle.
RI3: Call Insert(level) on the p entries in reversed sort order (close reinsert).
Experimentally, a good value of p is 30% of M.
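
Steps RI1 to RI3 amount to a sort and a slice. A minimal sketch, assuming rectangles are tuples of (low, high) intervals and Euclidean distance between centers; the function names are hypothetical, and the actual reinsertion into the tree is left out.

```python
import math

def cover(rects):
    """Bounding rectangle of a list of rectangles."""
    return tuple((min(r[d][0] for r in rects), max(r[d][1] for r in rects))
                 for d in range(len(rects[0])))

def center(r):
    return tuple((lo + hi) / 2 for lo, hi in r)

def pick_for_reinsert(rects, p):
    """Return (kept, reinsert_order) for a close reinsert of p entries."""
    c = center(cover(rects))
    # RI1: sort by distance to the node's center, farthest first.
    by_dist = sorted(rects, key=lambda r: math.dist(center(r), c), reverse=True)
    # RI2: the p farthest entries leave the node.
    removed, kept = by_dist[:p], by_dist[p:]
    # RI3: close reinsert starts with the nearest of the removed entries.
    return kept, removed[::-1]
```

With p set to about 30% of M, as suggested above, a node with M = 10 would give up its three outermost entries before any split is attempted.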

30 Insert Analysis Experimentally, R* insert reduces the number of splits that have to be performed, and space utilization increases. Close reinsert tends to favor the original node. Outer rectangles may be inserted elsewhere, making the original node more quadratic (square-like). Forced reinsert can reduce overlap between neighboring nodes.