Published by Bret Burger. Modified over 2 years ago.

1
Partition Based Spatial-Merge Join
Presented by: Tony Tong (09049620), Cleo Tsang (09049630), June Yau (09030360), Chelsie Chan (10104740), Oengus Lam (10104790)

2
Agenda
1. Problem
2. Definition of Spatial Join
3. PBSM Algorithm
   3.1 Filter Step
   3.2 Refinement Step
   3.3 Number of Partitions
   3.4 Spatial Partitioning Function
4. Performance
   4.1 Indexed Nested Loops Join
   4.2 R-tree Based Join Algorithm
   4.3 Methodology
   4.4 None of Indices Pre-exist
   4.5 In the Presence of Pre-existing Index
   4.6 CPU Costs
5. Conclusion

3
1. Problem
– In a spatial database system such as a GIS, join queries over spatial objects consume a large amount of memory
– No precomputed data is available for the datasets
– Intermediate results usually have no index
– Goal: solve this join problem efficiently

4
2. Definition of Spatial Join
An operation that combines two or more datasets based on their spatial relationship
Q: Find all pairs of rivers and cities that intersect
[Figure: rivers r1 and r2, cities c1 to c5, and the join result pairs]

5
3. PBSM Algorithm
Partition Based Spatial-Merge Join (PBSM) operates in 2 steps:
– Filter Step
– Refinement Step
Input R (ID is the unique identifier, OID)
ID   River_Name       Length
r1   Margaret River   60km
r2   Brisbane River   344km
Input C
ID   City_Name   County
c1   Perth       WA
c2   Brisbane    QLD

6
3.1 Filter Step
Purpose:
– To find all objects whose MBR intersects the query rectangle
For each input (R and C):
– Create a Minimum Bounding Rectangle (MBR) for each object
– The MBR gives a rough estimate of the object's search region
– Store a key-pointer element (OID + MBR) for each object in a new input (R kp and C kp)
[Figure: inputs R kp and C kp, with the MBRs of r1, r2 and c1 to c5 as key-pointer elements]
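A minimal sketch of how a key-pointer element (OID + MBR) could be built from an object's vertices. The names `MBR`, `KeyPointer` and `make_key_pointer`, and the sample coordinates, are illustrative assumptions and do not come from the slides.

```python
from typing import NamedTuple, Sequence, Tuple


class MBR(NamedTuple):
    """Minimum bounding rectangle: lower-left (xl, yl), upper-right (xu, yu)."""
    xl: float
    yl: float
    xu: float
    yu: float


class KeyPointer(NamedTuple):
    """Key-pointer element: the object's OID plus its MBR."""
    oid: str
    mbr: MBR


def make_key_pointer(oid: str, points: Sequence[Tuple[float, float]]) -> KeyPointer:
    """Build a key-pointer element from the vertices of a spatial object."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return KeyPointer(oid, MBR(min(xs), min(ys), max(xs), max(ys)))


# Hypothetical river geometry given as a polyline of three vertices.
river = make_key_pointer("r1", [(1.0, 4.0), (2.5, 3.0), (4.0, 3.5)])
print(river)
```

Only the small key-pointer elements take part in the filter step; the full tuples are read again in the refinement step.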

7
3.1 Filter Step
Spatial Join (1st Scenario): R kp and C kp fit into main memory
Plane-Sweeping Technique
– Sort each input (R kp and C kp) by MBR.xl
– Select the MBR with the smallest MBR.xl from either input (e.g. R kp)
– Scan along the x-axis from MBR.xl to MBR.xu to check whether MBR r intersects MBR c
Example:
– Start with the first entry r1 and sweep a vertical line
– If MBR r1 intersects MBR c2, add (OID r1, OID c2) to the result set
– If MBR r1 intersects MBR c1, add (OID r1, OID c1) to the result set
– Scan until MBR.xu, then start with the next entry
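The plane sweep can be sketched roughly as follows, assuming each key-pointer element is a tuple `(oid, xl, yl, xu, yu)`. The function name `plane_sweep_join` and the sample rectangles are hypothetical; they are not the data shown in the slide figure.

```python
from typing import List, Tuple

Entry = Tuple[str, float, float, float, float]  # (oid, xl, yl, xu, yu)


def plane_sweep_join(r_kp: List[Entry], c_kp: List[Entry]) -> List[Tuple[str, str]]:
    """Return (OID r, OID c) pairs whose MBRs intersect, using a sweep along x."""
    r_sorted = sorted(r_kp, key=lambda e: e[1])  # sort by MBR.xl
    c_sorted = sorted(c_kp, key=lambda e: e[1])
    result = []
    i = j = 0
    while i < len(r_sorted) and j < len(c_sorted):
        if r_sorted[i][1] <= c_sorted[j][1]:
            # r has the smallest xl: scan C entries that start before r.xu.
            r = r_sorted[i]
            k = j
            while k < len(c_sorted) and c_sorted[k][1] <= r[3]:
                c = c_sorted[k]
                if r[2] <= c[4] and c[2] <= r[4]:  # y-intervals overlap
                    result.append((r[0], c[0]))
                k += 1
            i += 1
        else:
            # c has the smallest xl: scan R entries that start before c.xu.
            c = c_sorted[j]
            k = i
            while k < len(r_sorted) and r_sorted[k][1] <= c[3]:
                r = r_sorted[k]
                if r[2] <= c[4] and c[2] <= r[4]:
                    result.append((r[0], c[0]))
                k += 1
            j += 1
    return result


rivers = [("r1", 0, 0, 4, 2), ("r2", 5, 5, 9, 6)]
cities = [("c1", 1, 1, 2, 3), ("c2", 3, 1.5, 6, 5.5), ("c3", 7, 0, 8, 1)]
pairs = plane_sweep_join(rivers, cities)
print(pairs)
```

Each entry is used as the sweep entry exactly once, so a pair is reported at most once within a single sweep.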

8
3.1 Filter Step
Spatial Join (2nd Scenario): R kp and C kp do not fit into main memory
Spatial Partitioning Technique
– Partitions are sized so that each partition of both inputs (R kp and C kp) fits into memory simultaneously
– The plane-sweeping technique is applied within each partition to produce the preliminary spatial join result pairs
[Figure: r1, r2 and c1 to c5 spread over Partitions 0 to 3]

9
3.2 Refinement Step
Purpose
– #1: To eliminate duplicates induced by partitioning
– #2: To examine the actual R and C tuples and check whether their attributes satisfy the join condition
[Figure: r2, c4 and c5 across Partitions 0 to 3, with result pairs listed for Partition 1 and Partition 3]

10
3.2 Refinement Step
Procedure
– #1: Sort the OID pairs
  Primary sort key: OID R
  Secondary sort key: OID C
– #2: Read the R tuples first, then the C tuples
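A small sketch of the sort-and-deduplicate part of the refinement step, assuming the filter step returns plain `(OID R, OID C)` tuples. The function name `refinement_order` and the candidate pairs are made up for illustration; the exact-geometry check on the fetched tuples is not shown.

```python
from itertools import groupby
from typing import Dict, Iterable, List, Tuple


def refinement_order(candidates: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Drop duplicate pairs, sort by (OID R, OID C), and group by OID R.

    Grouping by OID R means each R tuple needs to be read only once
    before its candidate C tuples are examined.
    """
    unique_sorted = sorted(set(candidates))  # tuple order = primary R, secondary C
    return {
        oid_r: [oid_c for _, oid_c in group]
        for oid_r, group in groupby(unique_sorted, key=lambda pair: pair[0])
    }


# Hypothetical filter output: ("r2", "c4") was found in two partitions.
candidates = [("r2", "c5"), ("r2", "c4"), ("r1", "c1"), ("r2", "c4")]
plan = refinement_order(candidates)
print(plan)
```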

11
3.3 Number of Partitions
The number of partitions P is computed as:
$$P = \left\lceil \frac{(\|R\| + \|C\|) \times Size_{key\text{-}ptr}}{M} \right\rceil$$
where
P: number of partitions
$\|R\|$: cardinality of R
$\|C\|$: cardinality of C
$Size_{key\text{-}ptr}$: size of a key-pointer element (in bytes)
M: size of main memory (in bytes)
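As a worked example, the sketch below assumes P is the total size of all key-pointer elements divided by the memory size, rounded up. The 40-byte key-pointer size and the 8 MB memory budget are assumptions chosen for illustration; the cardinalities are those of the Road and Hydrography data sets.

```python
import math


def number_of_partitions(card_r: int, card_c: int,
                         size_key_ptr: int, memory_bytes: int) -> int:
    """Smallest P such that the key-pointer elements of both inputs,
    split evenly over P partitions, fit into main memory."""
    total_bytes = (card_r + card_c) * size_key_ptr
    return math.ceil(total_bytes / memory_bytes)


# Assumed values: 40-byte key-pointer elements, 8 MB of memory for the join.
p = number_of_partitions(456_613, 122_149, 40, 8 * 1024 * 1024)
print(p)
```

With these assumed values the key-pointer elements occupy about 23 MB, so three partitions are needed.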

12
3.4 Spatial Partitioning Function
Problem: spatial features are non-uniformly distributed and clustered
– With the regular partitioning method, partition sizes differ greatly
[Figure: universe divided regularly into Partitions 0 to 3]

13
3.4 Spatial Partitioning Function
Tile-based Partitioning Method
– Step 1: Regular decomposition of the universe into NT tiles, where NT ≥ P
– Step 2: Apply a tile-to-partition mapping scheme (Round Robin or Hashing)
[Figure: regular partitioning method (Partitions 0 to 3) compared with tile-based partitioning using the round-robin mapping scheme]
Tile0/Part0   Tile1/Part1   Tile2/Part2    Tile3/Part0
Tile4/Part1   Tile5/Part2   Tile6/Part0    Tile7/Part1
Tile8/Part2   Tile9/Part0   Tile10/Part1   Tile11/Part2
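The two steps can be sketched as follows for a 4 x 3 grid of tiles and 3 partitions, as in the tile layout above. The row-major tile numbering, the function names, the universe extent and the sample MBRs are assumptions made for illustration; MBRs are assumed to lie inside the universe.

```python
from typing import Set, Tuple

Rect = Tuple[float, float, float, float]  # (xl, yl, xu, yu)


def overlapped_tiles(mbr: Rect, universe: Rect, cols: int, rows: int) -> Set[int]:
    """Step 1: numbers of the tiles (row-major) that the MBR overlaps."""
    uxl, uyl, uxu, uyu = universe
    tile_w = (uxu - uxl) / cols
    tile_h = (uyu - uyl) / rows
    xl, yl, xu, yu = mbr
    # Clamp so that an MBR touching the upper universe border stays in the grid.
    c0 = min(int((xl - uxl) / tile_w), cols - 1)
    c1 = min(int((xu - uxl) / tile_w), cols - 1)
    r0 = min(int((yl - uyl) / tile_h), rows - 1)
    r1 = min(int((yu - uyl) / tile_h), rows - 1)
    return {r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)}


def partitions_for(mbr: Rect, universe: Rect, cols: int, rows: int,
                   num_partitions: int) -> Set[int]:
    """Step 2: round-robin mapping, tile t goes to partition t mod P."""
    return {t % num_partitions
            for t in overlapped_tiles(mbr, universe, cols, rows)}


universe = (0.0, 0.0, 4.0, 3.0)  # 12 tiles of size 1 x 1
wide = partitions_for((0.5, 0.2, 1.5, 0.8), universe, 4, 3, 3)
small = overlapped_tiles((2.2, 1.1, 2.8, 1.9), universe, 4, 3)
print(wide, small)
```

An MBR that overlaps tiles mapped to different partitions is written to each of those partitions, which is where the duplicates removed in the refinement step come from.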

14
3.4 Spatial Partitioning Function
What is the PERFECT spatial partitioning function?
– It assigns an equal number of tuples to each partition
Considerations:
– Number of tiles
– Tile-to-partition mapping scheme (Round Robin or Hashing)
Data sets used for investigation:
– Tiger Road Data (62.4 MB, 456,613 tuples)
– Sequoia Polygon Data (21.9 MB, 58,115 tuples)

15
3.4 Spatial Partitioning Function
Observation: the partitioning function improves as the number of tiles increases (more uniform distribution)
The PERFECT partitioning function has a coefficient of variation = 0
[Figure: Spatial Partitioning Function Alternatives: Tiger Road Data]
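The coefficient of variation of the partition sizes can be computed as the standard deviation divided by the mean. The sketch below uses the population standard deviation, which is an assumption (the slides do not say which variant was used), and the partition sizes are made up.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(partition_sizes: Sequence[int]) -> float:
    """Standard deviation of the partition sizes divided by their mean."""
    mean = statistics.mean(partition_sizes)
    return statistics.pstdev(partition_sizes) / mean


balanced = coefficient_of_variation([100, 100, 100, 100])
skewed = coefficient_of_variation([10, 190, 100, 100])
print(balanced, round(skewed, 4))
```

Equal-sized partitions give a value of 0; the further the sizes drift apart, the larger the value.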

16
3.4 Spatial Partitioning Function
Observation: replication overhead increases as the number of tiles increases
Number of tiles = an integral multiple of the number of partitions
[Figures: Replication Overhead: Tiger Road Data (16 Partitions); Replication Overhead: Sequoia Polygon Data (16 Partitions)]

17
Scenario: No. of tiles = 9, P = 3, tile-to-partition mapping scheme = Round Robin
Tile0/Part0   Tile1/Part1   Tile2/Part2
Tile3/Part0   Tile4/Part1   Tile5/Part2
Tile6/Part0   Tile7/Part1   Tile8/Part2
[Figure: objects r1, c1 and c2 placed on the tiles]
– Each entire column is mapped to a single partition
– This affects the replication caused by partitioning and therefore the replication overhead
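The effect in this scenario can be reproduced with a few lines, assuming row-major tile numbering and the round-robin rule "tile t goes to partition t mod P". The helper names are hypothetical.

```python
from typing import List, Set


def round_robin_grid(cols: int, rows: int, num_partitions: int) -> List[List[int]]:
    """Partition number of every tile, laid out row by row."""
    return [[(r * cols + c) % num_partitions for c in range(cols)]
            for r in range(rows)]


def column_partitions(grid: List[List[int]]) -> List[Set[int]]:
    """Set of partitions that appear in each column of the tile grid."""
    return [{row[c] for row in grid} for c in range(len(grid[0]))]


grid_3x3 = round_robin_grid(3, 3, 3)  # the scenario above: 9 tiles, P = 3
grid_4x3 = round_robin_grid(4, 3, 3)  # 12 tiles, P = 3
print(column_partitions(grid_3x3))
print(column_partitions(grid_4x3))
```

In the 3 x 3 grid every column holds a single partition, so an object that spans vertically adjacent tiles stays in one partition. In the 4 x 3 grid every column mixes all three partitions.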

18
3.4 Spatial Partitioning Function
Observation: replication overhead increases as the number of tiles increases
Number of tiles = an integral multiple of the number of partitions
[Figures: Replication Overhead: Tiger Road Data (16 Partitions); Replication Overhead: Sequoia Polygon Data (16 Partitions)]

19
4. Performance
– PBSM Join (1024 tiles) vs. Indexed Nested Loops Join
– PBSM Join (1024 tiles) vs. R-tree Based Join

20
4.1 Indexed Nested Loops Join
Build an index on R (the smaller input):
– Read the extent R
– Extract the key-pointers (OID, MBR)
– Sort the key-pointers by MBR
– Build an R-tree on the key-pointers
Scan C:
– For each C tuple, probe the index and fetch each matching R tuple

21
4.2 R-tree Based Join Algorithm
– Build an R-tree index on both R and C
– Find pairs of MBRs whose intersection is not empty
– Visit the roots
– Move down the levels until the leaf nodes are reached
– Find the OID pairs whose MBRs have a non-empty intersection

22
4.3 Methodology
Database System: Paradise
Machine: Sun SPARC-10/51
– 64 MB of memory
– SunOS Release 4.1.3
– One Seagate 2GB disk

23
TIGER file
– Road, Hydrography and Rail data of the United States, etc.
– 2 join queries:
  – Road with Hydrography
  – Road with Rail
Data Type     # of Objects   Total Size   R-tree Size
Road          456,613        62.4 MB      24.0 MB
Hydrography   122,149        25.2 MB      6.5 MB
Rail          16,844         2.4 MB       1.0 MB

24
Sequoia 2000 Storage Benchmark
– Polygons: regions of homogeneous landuse characteristics in California
– Islands: holes in the polygon data
Data Type   # of Objects   Total Size   R-tree Size
Polygons    58,115         21.9 MB      3.0 MB
Islands     21,007         6.2 MB       1.1 MB

25
4.4 NONE OF INDICES PRE-EXIST

26
TIGER: Join Road with Hydrography
PBSM is 48-98% faster than the R-tree based join and 93-300% faster than Indexed Nested Loops.

27
TIGER: Join Road with Rail
The Rail data is 2.4 MB (index: 1.0 MB) and fits in the buffer pool, so Indexed Nested Loops performs better than the R-tree based join.

28
Clustered Data
– Data is stored contiguously, i.e. not randomly distributed
– In real life, data is mostly in sequential order
– Less computationally expensive

29
Clustered TIGER: Join Road with Hydrography
PBSM is 40% faster than the R-tree based join and 60-80% faster than Indexed Nested Loops.

30
Costs
Index Building Cost
– Cost of extracting the key-pointers from the input
– Cost of sorting the key-pointers
– Cost of building the index from the sorted key-pointers
– If the input is clustered: no sorting of key-pointers, only the cost of building the index
Tree Joining Cost
Refinement Step Cost

31
Sequoia Data
PBSM is 13-27% faster than the R-tree based join and 17-114% faster than Indexed Nested Loops.

32
Summary
– PBSM performs better than both the R-tree based and the Indexed Nested Loops based algorithms
– When the sizes of the 2 inputs differ significantly, Indexed Nested Loops performs better than the R-tree based algorithm
– All algorithms improve if the join inputs are clustered

33
4.5 IN THE PRESENCE OF PRE-EXISTING INDEX

34
TIGER: Join Road with Hydrography
When indices pre-exist on both inputs, the R-tree based algorithm has the best performance.

35
TIGER: Join Road with Rail
When an index exists only on the smaller input, PBSM performs best.

36
4.6 CPU Costs
CPU cost > I/O cost
System
– CPU intensive
– Much less I/O is needed

37
5. Underlying Principle
– Divide and Conquer
– Optimization on memory size

38
6. Recap of this presentation
– Efficient PBSM algorithm
– Comparison among different algorithms
– Performance analysis
  – Clustered data
  – Indexed data

39
Questions?
