Donghui Zhang, Tian Xia Northeastern University


1 A Novel Improvement to the R*-tree Spatial Index using Gain/Loss Metrics
Donghui Zhang, Tian Xia, Northeastern University. 7/19/2019 ACM GIS'04, Washington, DC

2 Outline
Background and motivation. Definitions of some metrics. Algorithm sketches. Experimental results. Conclusion and future work.

3 The R*-tree
Recursively cluster objects into minimum bounding rectangles (MBRs). Organize the MBRs into a dynamic, disk-based, balanced tree structure, similar to the B+-tree.
[Figure: example R*-tree over the points P1 to P8, grouped into the MBRs R1 to R6.]

4 Forced reinsertion in the R*-tree
Delayed split: if a disk page overflows, some entries (say p of them) are removed from the page and reinserted into the tree. The entries whose distances to the center of the page's MBR are the largest are picked.
[Figure: a page with objects a, b, c and d, labeled 7 and 5.]
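The selection rule on this slide can be sketched in a few lines of Python. This is only an illustration, assuming 2-D point objects stored as (x, y) tuples; the names `mbr` and `pick_farthest` are hypothetical and do not come from the presentation.

```python
from math import hypot

def mbr(points):
    """Return the minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def pick_farthest(points, p):
    """Pick the p points farthest from the center of the page's MBR."""
    xmin, ymin, xmax, ymax = mbr(points)
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    by_distance = sorted(
        points,
        key=lambda pt: hypot(pt[0] - cx, pt[1] - cy),
        reverse=True,  # farthest first
    )
    return by_distance[:p]

# Hypothetical overflowing page with five point objects.
page = [(0, 0), (4, 0), (4, 3), (2, 1), (10, 2)]
print(pick_farthest(page, 2))
```

The rule looks only at distance to the center, not at how much the MBR shrinks once the chosen entries are gone.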

5 Two goals
Reduce the MBR area. Keep the shape of the MBR close to a square.
Rationale: the tree is likely to have less overlap, which improves range query performance. Previous research has shown that if the shapes of the MBRs are more like the shape of the range query, the query touches fewer disk pages.
Observation: the R*-tree's way of picking objects can be improved.

6 A better choice
[Figure: a page with objects a, b, c and d, labeled 7 and 5.]

7 Outline
Background and motivation. Definitions of some metrics (Quality / Gain / Loss; p-boundary / minP-boundary). Algorithm sketches. Experimental results. Conclusion and future work.

8 Three constraints
With the same area, a square has higher quality than a non-square rectangle. With the same shape, a smaller rectangle has higher quality than a bigger one. When a rectangle is shrunk to another rectangle, the quality always increases.

9 Quality
Definition: Given a rectangle r with width w and height h, the quality of r is [formula missing in the transcript], where α ∈ [0, 1], e.g. α = 0.5.

10 Quality (example)
Given α = 0.5: a 0.25 × 1 rectangle has Q = 2; a 1 × 1 square has Q = 1; a 0.5 × 0.5 square has Q = 4.
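The quality formula itself did not survive in this transcript, so the Python sketch below uses a stand-in. The function `quality` is a hypothetical choice, not the authors' definition. It was picked only because it satisfies the three constraints stated earlier and reproduces the three example values on this slide, reading them as Q = 2 for 0.25 × 1, Q = 1 for 1 × 1 and Q = 4 for 0.5 × 0.5.

```python
def quality(w, h, alpha=0.5):
    """Hypothetical quality of a w-by-h rectangle.

    ASSUMPTION: Q = 1 / (area^(1 - alpha) * longest_side^(2 * alpha)).
    This is a stand-in consistent with the slide's examples, not the
    formula from the presentation (which is missing from the transcript).
    """
    if w <= 0 or h <= 0:
        raise ValueError("width and height must be positive")
    area = w * h
    longest = max(w, h)
    return 1.0 / (area ** (1 - alpha) * longest ** (2 * alpha))

# The three rectangles from the example slide, with alpha = 0.5.
for w, h in [(0.25, 1.0), (1.0, 1.0), (0.5, 0.5)]:
    print(f"{w} x {h}: Q = {quality(w, h):.2f}")
```

With alpha = 0 this stand-in depends only on the area, and with alpha = 1 only on the longest side, so alpha trades off the "small area" goal against the "square shape" goal.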

11 Extension to d dimensions
Given a d-dimensional rectangle r whose edges have lengths h1, …, hd, the quality of r is [formula missing in the transcript].

12 Gain / Loss
Definition: the gain of shrinking rectangle r1 to r2 is [formula missing in the transcript].
Symmetrically, the loss of expanding r2 to r1 is defined as the gain of shrinking r1 to r2.

13 Extension to a set of objects
The quality of S is the quality of MBR(S). The gain of removing a subset P from S is the gain of shrinking MBR(S) to MBR(S-P).
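The set-level definitions can be illustrated with a short Python sketch. Two pieces here are assumptions, because neither formula is preserved in the transcript: the `quality` function is a hypothetical stand-in that matches the example values on the quality slide, and the gain is taken to be the quality after shrinking minus the quality before. Objects are assumed to be rectangles stored as (xmin, ymin, xmax, ymax) whose MBRs have positive width and height.

```python
def mbr(rects):
    """MBR of a set of rectangles given as (xmin, ymin, xmax, ymax)."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )

def quality(rect, alpha=0.5):
    # ASSUMPTION: hypothetical stand-in, not the presentation's formula.
    w, h = rect[2] - rect[0], rect[3] - rect[1]
    return 1.0 / ((w * h) ** (1 - alpha) * max(w, h) ** (2 * alpha))

def gain_of_removing(S, P, alpha=0.5):
    """Gain of shrinking MBR(S) to MBR(S - P).

    ASSUMPTION: gain = quality(MBR(S - P)) - quality(MBR(S)).
    """
    remaining = [r for r in S if r not in P]
    return quality(mbr(remaining), alpha) - quality(mbr(S), alpha)

# Hypothetical page: two nearby rectangles and one outlier to the right.
S = [(0, 0, 1, 1), (1, 1, 2, 2), (3, 0, 4, 1)]
P = [(3, 0, 4, 1)]
print(mbr(S), mbr([r for r in S if r not in P]))
print(gain_of_removing(S, P))
```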

14 Outline
Background and motivation. Definitions of some metrics (Quality / Gain / Loss; p-boundary and minP-boundary). Algorithm sketches. Experimental results. Conclusion and future work.

15 p-boundary
If a page overflows, p objects will be removed from the page. The p-boundary is the optimal set of p such objects, i.e. the set that achieves the largest gain Gp.
[Figure: a page with objects a, b, c and d.]

16 minP-boundary
Given a threshold β (e.g. 0.8), the minP-boundary is the smallest set that achieves a gain no less than β Gp. Computation of the minP-boundary depends on that of the p-boundary.
[Figure: a page with objects c, d, e and f.]
Speaker notes: Reinsertion is costly, explain here. Give numbers for example: 100 vs. 20.

17 Modifications on the R*-tree
The forced reinsertion algorithm picks the minP-boundary to reinsert. Reinsertion is not always enforced. Apply Gain/Loss metrics to the process of choosing a subtree to accommodate a new entry.

18 Outline
Background and motivation. Definitions of some metrics. Algorithm sketches. Experimental results. Conclusion and future work.

19 Straightforward solution 1
Enumerate all possible combinations of p objects in a page. With n objects in the page, the number of combinations is C(n, p) = n! / (p! (n - p)!), which can be exponential in n.
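A brute-force version of this enumeration might look like the Python sketch below. It assumes distinct 2-D points, a hypothetical stand-in `quality` function (the presentation's formula is missing from the transcript), and gain defined as quality after minus quality before; the names are illustrative only. `math.comb` gives the number of combinations visited.

```python
from itertools import combinations
from math import comb, inf

def mbr(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def quality(rect, alpha=0.5):
    # ASSUMPTION: hypothetical stand-in, not the presentation's formula.
    w, h = rect[2] - rect[0], rect[3] - rect[1]
    if w <= 0 or h <= 0:
        return inf  # degenerate MBR (a line or a point)
    return 1.0 / ((w * h) ** (1 - alpha) * max(w, h) ** (2 * alpha))

def brute_force_p_boundary(points, p, alpha=0.5):
    """Try every p-subset and keep the one with the largest gain."""
    base = quality(mbr(points), alpha)
    best_gain, best_set = -inf, None
    for removed in combinations(points, p):
        remaining = [pt for pt in points if pt not in removed]
        # ASSUMPTION: gain = quality after removal - quality before.
        gain = quality(mbr(remaining), alpha) - base
        if gain > best_gain:
            best_gain, best_set = gain, removed
    return best_set, best_gain

# Hypothetical page: a unit-square cluster plus two outliers.
page = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (5, 0.5), (0.5, 6)]
print(comb(len(page), 2), "combinations")
print(brute_force_p_boundary(page, 2))
```

Even at this size the loop visits 21 subsets; with p fixed at a fraction of the page capacity the count grows quickly, which is the motivation for the level-based algorithms that follow.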

20 Straightforward solution 2
Enumerate all possible combinations among outside objects. The number of combinations could be exponential in p.
[Figure: a page with objects a to l.] With p = 2, only a, b, d, e, l and k may appear in the p-boundary.
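One way to read "outside objects" is sketched below in Python: when only p objects are removed, an object can end up outside the shrunk MBR only if it is among the p most extreme objects on at least one of the four sides. The sketch assumes 2-D points with distinct x and distinct y coordinates; `boundary_candidates` is a hypothetical name.

```python
def boundary_candidates(points, p):
    """Points that are among the p most extreme on some side of the MBR."""
    if p <= 0:
        return set()
    by_x = sorted(points, key=lambda pt: pt[0])
    by_y = sorted(points, key=lambda pt: pt[1])
    # Left, right, bottom and top sides each contribute at most p points.
    return set(by_x[:p]) | set(by_x[-p:]) | set(by_y[:p]) | set(by_y[-p:])

# Hypothetical page of ten points with distinct coordinates.
page = [(0, 5), (1, 9), (2, 0), (3, 3), (4, 4),
        (5, 6), (9, 1), (8, 8), (6, 2), (7, 7)]
candidates = boundary_candidates(page, 2)
print(sorted(candidates))
```

Each side contributes at most p points, so the candidate set never exceeds 4p in two dimensions, and it is often smaller because corner objects are shared between sides.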

21 Our idea
Shrink the borders by levels, instead of using straightforward brute-force search methods. How can the levels be stored so that one level can be found and removed efficiently?
[Figure: a page with objects a to l; border coordinates 5 to 10 on one axis and 20 to 25 on the other.]
Speaker notes: What are the duplicates? Example for "level".

22 Outline
Background and motivation. Definitions of some metrics. Algorithm sketches (Border structure; Exhaustive algorithms; Greedy algorithms). Experimental results. Conclusion and future work.

23 Border structure
In 2-dimensional space, the number of objects that may appear in the p-boundary is at most 4p. Organize the set of candidates by the border structure, which consists of: four binary trees (TOP, LEFT, RIGHT, BOTTOM); an object array; four coordinate arrays.
[Figure: example with p = 4 and objects a to l; border coordinates 5 to 10 and 20 to 25; binary tree node labels: a, b c, d e a, h b, k d k l.]
Object array: a b c d e h k l
Coordinate arrays: LX=[5,6,7], LY=[20,21,22], HX=[10,9,8], HY=[25,24,23]
Speaker note: Why do some binary trees not have 4 objects?

24 Outline
Background and motivation. Definitions of some metrics. Algorithm sketches (Border structure; Exhaustive algorithms; Greedy algorithms). Experimental results. Conclusion and future work.

25 Exhaustive search of p-boundary (Algorithm pick-p)
Shrink the borders by levels.
[Figure: a page with objects a to l; border coordinates 5 to 10 and 20 to 25.]
LX=[5,6,7], LY=[20,21,22], HX=[10,9,8], HY=[25,24,23]

26 Exhaustive search of p-boundary (Algorithm pick-p)
A rectangle is valid if: no more than p objects lie outside it; and it is the MBR of all the objects inside it.

27 Exhaustive search of p-boundary (Algorithm pick-p)
Compute the gain of shrinking the original MBR to a valid rectangle, and keep the largest gain Gp and the corresponding combination.

28 Exhaustive search of minP-boundary (Algorithm pick-minP)
To find the optimal minP-boundary, store the intermediate results. For each distinct gain, keep only the combination that removes the smallest number of objects. At the end, choose the combination whose gain is the largest, no less than β Gp.
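The final selection step could be sketched in Python as below. The sketch follows the definition on the minP-boundary slide (the smallest set whose gain is no less than the threshold times Gp) and breaks ties in favor of the larger gain. The dictionary layout and the name `select_min_p` are assumptions made for illustration; the presentation does not specify how the intermediate results are stored.

```python
def select_min_p(best_by_gain, beta=0.8):
    """Pick the smallest stored combination whose gain is >= beta * Gp.

    best_by_gain maps each distinct gain to the smallest combination
    (a tuple of object ids) found with that gain. ASSUMPTION: Gp is the
    largest stored gain, since every stored set removes at most p objects.
    """
    g_p = max(best_by_gain)
    threshold = beta * g_p
    eligible = {g: objs for g, objs in best_by_gain.items() if g >= threshold}
    # Fewest removed objects first; among equal sizes, the larger gain.
    gain = min(eligible, key=lambda g: (len(eligible[g]), -g))
    return eligible[gain], gain

# Hypothetical intermediate results from an exhaustive search with p = 4.
results = {
    10.0: ("a", "b", "c", "d"),
    9.0: ("a", "b"),
    8.5: ("a",),
    3.0: ("c",),
}
print(select_min_p(results, beta=0.8))
```

In this made-up example a single object already recovers 85% of the best gain, so only one entry would be reinserted instead of four.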

29 Outline
Background and motivation. Definitions of some metrics. Algorithm sketches (Border structure; Exhaustive algorithms; Greedy algorithms). Experimental results. Conclusion and future work.

30 Greedy algorithms
Idea: always pick the border which, if we remove one level, will result in the largest average gain per removed object.
[Figure: example with p = 2 and objects a, b, c and d.]
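The greedy idea can be sketched in Python under some stated assumptions: objects are 2-D points, a "level" is the set of points lying exactly on one border of the current MBR, the `quality` function is a hypothetical stand-in (the presentation's formula is missing from the transcript), and gain is quality after minus quality before. This is an illustration without look-ahead, not the authors' implementation.

```python
from math import inf

def mbr(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def quality(rect, alpha=0.5):
    # ASSUMPTION: hypothetical stand-in, not the presentation's formula.
    w, h = rect[2] - rect[0], rect[3] - rect[1]
    if w <= 0 or h <= 0:
        return inf  # degenerate MBR
    return 1.0 / ((w * h) ** (1 - alpha) * max(w, h) ** (2 * alpha))

def greedy_pick(points, p, alpha=0.5):
    """Repeatedly remove the border level with the best average gain."""
    remaining, removed = list(points), []
    while len(removed) < p:
        box = mbr(remaining)
        base = quality(box, alpha)
        if base == inf:
            break  # nothing left to improve
        best = None  # (average gain per removed object, level)
        # Four borders: left, bottom, right, top as (axis, coordinate).
        for axis, extreme in ((0, box[0]), (1, box[1]), (0, box[2]), (1, box[3])):
            level = [pt for pt in remaining if pt[axis] == extreme]
            rest = [pt for pt in remaining if pt[axis] != extreme]
            if not rest or len(removed) + len(level) > p:
                continue  # would empty the page or exceed the budget p
            avg = (quality(mbr(rest), alpha) - base) / len(level)
            if best is None or avg > best[0]:
                best = (avg, level)
        if best is None:
            break
        removed.extend(best[1])
        remaining = [pt for pt in remaining if pt not in best[1]]
    return removed

# Hypothetical page: a unit-square cluster plus one outlier on the right.
page = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (5, 0.5)]
print(greedy_pick(page, 1))
```

Each round inspects only four levels, so the cost per removed level is linear in the page size here; a border structure like the one described earlier would avoid rescanning the page.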

31 Greedy algorithms
Look-ahead: compute the average gains of removing 1, …, m levels, and pick the one with the largest average gain.
[Figure: example with p = 3; animation of removing one object.]

32 Outline
Background and motivation. Definitions of some metrics. Algorithm sketches. Experimental results. Conclusion and future work.

33 Experiments – setup
Real datasets: the Postal dataset, 123,593 points (postal addresses in the Northeast area of the USA); the Street dataset, 131,461 line segments (streets of Los Angeles).
p = 30% of the node capacity (1KB). α = 0.5, β = 0.9 and m = 5.

34 Experiments – Gain comparison
[Figure: comparison of gains under various numbers of objects.]
[Figure: comparison of running time of various algorithms.]
Speaker notes: Add slide before, explain the algorithms, make the figure larger, original->R*, what we are measuring, how do we get the objects, what p is. Mention focusing on one rectangle.

35 Experiments – Index construction
[Figure: number of disk I/Os, Greedy vs. Original, on the Postal dataset and on the Street dataset.]

36 Experiments – Range query by varying query size
Page size: 1K, buffer size: 128K.
[Figures: the Postal dataset; the Street dataset.]

37 Experiments – Range query by varying page size
Query size: 0.01%, buffer size: 128K.
[Figures: the Postal dataset; the Street dataset.]

38 Experiments – Range query by varying buffer size
Page size: 1K, query size: 0.01%.
[Figures: the Postal dataset; the Street dataset.]

39 Outline
Background and motivation. Definitions of some metrics. Algorithm sketches. Experimental results. Conclusion and future work.

40 Conclusions and future work
Defined novel quality/gain/loss metrics that consider both area and shape. Defined the minP-boundary and proposed algorithms to find it. Integrated the approach with the R*-tree and achieved up to 20% improvement in range query performance.
Future work: examining the idea of promoting outlier objects to index nodes.

41 Thank you!

