1
ICDT 2005
Optimal Distributed Declustering using Replication
Keith Frikken, Purdue University
Jan 5, 2005

2
Declustering Data
Declustering data over multiple disks to improve performance for range queries has been well studied
Applications include:
  – Spatio-temporal databases
  – Image and video data
  – Scientific simulation datasets

3
Goal
Divide data uniformly along dimensions to create tiles
Put the records contained in each tile on different disks so that I/O can be parallelized
Assumptions:
  – Data can be tiled in such a way
  – Disks have constant retrieval times
Assigning tiles to disks is similar to a coloring problem (disks are colors)
A range query is answered optimally if no disk performs more than $\lceil \#\text{tiles} / \#\text{disks} \rceil$ I/O retrievals
Two approaches:
  – Coloring schemes
  – Replication

4
Notations
$k$ is the number of disks
$m$ is the number of tiles in a query
$r$ is the level of replication (i.e., $r$ is 2)
$Q$ is the set of all range queries
$\mathrm{ret}(q)$ is the actual retrieval time of $q$
Optimal retrieval time for a query $q$ is $o_q = \lceil m/k \rceil$
Additive error: $\varepsilon = \max_{q \in Q} \{\mathrm{ret}(q) - o_q\}$
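A small worked instance of the notation may help. The numbers below are made up for illustration and do not come from the talk; the optimal time is taken to be the ceiling of $m/k$, which is the usual convention for declustering.

```latex
\[
k = 4,\quad m = 9 \quad\Longrightarrow\quad o_q = \left\lceil \tfrac{9}{4} \right\rceil = 3 .
\]
\[
\text{If the busiest disk holds 4 of the 9 tiles, then } \mathrm{ret}(q) = 4
\text{ and } \mathrm{ret}(q) - o_q = 1 .
\]
```

The additive error $\varepsilon$ of a scheme is the largest such difference over every range query in $Q$.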

5
Coloring schemes
Disk Modulo (DM) [Du and Sobolewski, 1982]
Fieldwise XOR (FX) [Kim and Pramanik, 1988]
Cyclic Schemes (RPHM, GFIB, EXH) [Prabhakar et al, 1998]
Golden Ratio Sequences (GRS) [Bhatia et al, 2000]
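To make the idea of a coloring scheme and its additive error concrete, here is a minimal Python sketch. It assumes the common two-dimensional textbook forms of Disk Modulo, `(x + y) mod k`, and Fieldwise XOR, `(x XOR y) mod k`; the function names and the brute-force error computation are illustrative and not taken from the talk.

```python
from itertools import product
from math import ceil


def disk_modulo(x, y, k):
    # Assumed 2-D form of Disk Modulo: sum of tile coordinates mod k.
    return (x + y) % k


def fieldwise_xor(x, y, k):
    # Assumed 2-D form of Fieldwise XOR: bitwise XOR of coordinates mod k.
    return (x ^ y) % k


def additive_error(scheme, n, k):
    """Brute-force additive error of `scheme` over all range queries
    on an n-by-n grid of tiles (only practical for small n)."""
    worst = 0
    for x0, y0 in product(range(n), repeat=2):
        for x1 in range(x0, n):
            for y1 in range(y0, n):
                counts = [0] * k
                for x in range(x0, x1 + 1):
                    for y in range(y0, y1 + 1):
                        counts[scheme(x, y, k)] += 1
                m = (x1 - x0 + 1) * (y1 - y0 + 1)
                # Without replication, retrieval time is the busiest disk's load.
                worst = max(worst, max(counts) - ceil(m / k))
    return worst


if __name__ == "__main__":
    for k in (2, 4):
        print(k, additive_error(disk_modulo, 6, k), additive_error(fieldwise_xor, 6, k))
```

With two disks both schemes reduce to a checkerboard, so every range query is answered optimally; with more disks small square queries already show a nonzero error.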

6
Other schemes
[Atallah and Prabhakar, 2000] developed a scheme for two-dimensional grids with $k = 2^n$ disks that has an additive error of $O(\log k)$
[Sinha et al, 2001] proved lower bounds on the additive error of $\Omega(\log k)$ and $\Omega(\log^{(d-1)/2} k)$ for 2 dimensions and $d$ ($> 2$) dimensions respectively
[Chen and Cheng, 2002] showed that an additive error of $O(\log^{d-1} k)$ is achievable for any number of dimensions ($> 2$)

7
Replication
Placing records on multiple disks can further improve the performance of declustering schemes
Two problems:
  – How to schedule a query (i.e., which tiles are retrieved from each disk)
  – How to use replication to balance load
Approaches:
  – Chained Declustering [Hsiao and DeWitt, 1990]
  – Random Duplicate Allocation (RDA) [Sanders et al, 2000], [Sanders, 2001], and [Czumaj and Scheideler, 2003]

8
Replication Results
Chained Declustering
  – Fast scheduling algorithm: $O(m + k)$ time to test whether a specific retrieval time is possible [Aerts et al, 2000]
RDA
  – If $m \ge c\,k \log k$ then optimal with high probability [Czumaj and Scheideler, 2003]
  – Fast scheduling algorithm: $O(\Delta k^{O(1)})$ time [Czumaj and Scheideler, 2003]
Hybrid techniques [Chen and Cheng, 2002]
  – Use GRS with a second random disk

9
Our Results
We define a new class of schemes called the shift schemes
Deterministic
Any query with at least $k(k-1)\varepsilon$ tiles can be answered optimally
Queries can be scheduled in $O(m + k \log \varepsilon)$ time
If a single disk fails, any query with at least $k(k-1)\varepsilon$ tiles can still be answered optimally
Experimental performance is similar to RDA (better in many cases)

10
Shift Scheme Definition
Use any strong coloring scheme
Use a modified chained declustering
  – Defined by a shift value $s$ (where $\gcd(s, k) = 1$)
Base scheme is defined by a function $f(x, y)$
  – Second color is $(f(x, y) + s) \bmod k$

11
Shift Scheme Definition
Example: each tile lists its two disks as primary,second. The entries are consistent with $k = 5$ and $s = 3$.
0,3  1,4  2,0  3,1  4,2
2,0  3,1  4,2  0,3  1,4
4,2  0,3  1,4  2,0  3,1
1,4  2,0  3,1  4,2  0,3
3,1  4,2  0,3  1,4  2,0
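The example grid on this slide can be regenerated with a few lines of Python. This is a sketch under assumptions: the base coloring `(x + 2*y) mod 5` and the shift `s = 3` were inferred from the grid entries rather than stated in the talk, and `shift_scheme` is a hypothetical helper name.

```python
from math import gcd


def shift_scheme(f, s, k):
    """Return a function mapping a tile (x, y) to its (primary, second) disk.

    f is the base coloring, s the shift value; gcd(s, k) must be 1.
    """
    if gcd(s, k) != 1:
        raise ValueError("shift value s must be coprime to k")

    def disks(x, y):
        primary = f(x, y) % k
        return primary, (primary + s) % k

    return disks


K, S = 5, 3  # inferred from the slide's grid, not stated explicitly


def base(x, y):
    # Assumed base coloring that reproduces the primary disks in the grid.
    return (x + 2 * y) % K


disks = shift_scheme(base, S, K)
grid = [[disks(x, y) for x in range(K)] for y in range(K)]

if __name__ == "__main__":
    for row in grid:
        print("  ".join(f"{p},{q}" for p, q in row))
```

Because `s` is coprime to `k`, following a disk to its replica disk repeatedly visits every disk once, which is what lets chained-declustering style scheduling carry over.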

12
Scheduling
A modification of the chained declustering scheduling algorithm can schedule queries in $O(m + k \log \varepsilon)$ time
Essentially, use the previous algorithm to test whether a specific load is possible, and do a binary search on the possible loads
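The "feasibility test plus binary search" idea can be sketched as follows. This is not the algorithm of Aerts et al. or the one from the talk; it is a simple stand-in feasibility test, assuming each tile can be read only from its primary disk `i` or its replica disk `(i + s) mod k`. The names `feasible` and `schedule` are hypothetical.

```python
from math import ceil


def feasible(t, load):
    """t[i] = tiles whose primary disk is ring position i; position i may
    hand d[i] <= t[i] of them to position i + 1 (cyclically).
    Return the least such shifts keeping every disk at <= load, or None."""
    k = len(t)
    if sum(t) > k * load:
        return None
    d_prev = 0
    d = []
    # Pass 1 finds the least consistent shift for the last position,
    # pass 2 recomputes all shifts starting from that value.
    for _ in range(2):
        d = []
        for i in range(k):
            d_prev = max(0, d_prev + t[i] - load)
            d.append(d_prev)
    if any(di > ti for di, ti in zip(d, t)):
        return None
    return d


def schedule(counts, s):
    """counts[i] = tiles of the query with primary disk i; s = shift value.
    Return (best load, {disk: tiles read from its replica disk instead})."""
    k = len(counts)
    order = [(j * s) % k for j in range(k)]  # ring order 0, s, 2s, ...
    t = [counts[i] for i in order]
    lo, hi = ceil(sum(t) / k), max(t)
    while lo < hi:  # binary search on the candidate loads
        mid = (lo + hi) // 2
        if feasible(t, mid) is not None:
            hi = mid
        else:
            lo = mid + 1
    shifts = feasible(t, lo)
    return lo, {order[j]: shifts[j] for j in range(k)}


if __name__ == "__main__":
    print(schedule([3, 3, 0], 1))
    print(schedule([4, 0, 0, 0, 1], 3))
```

Counting tiles per primary disk takes $O(m)$, each feasibility test is $O(k)$, and the search range runs from the optimal load up to the heaviest primary load, so the sketch has the same shape as the bound quoted on the slide.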

13
Bound (1)
There are $k$ disks ($D_0, \ldots, D_{k-1}$)
Disk $D_i$ has $t_i$ tiles initially (as the primary disk)
The number of tiles is $m = t_0 + \cdots + t_{k-1}$
$D_i$ shifts $d_i$ tiles to $D_{i+1}$, with $d_i \le t_i$
The goal is to minimize the most tiles at a disk, i.e., $\max_{0 \le i \le k-1} \{d_{i-1} + t_i - d_i\}$

14
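Bound (2)
Recall:
  – $o = \lceil m/k \rceil$
  – $\max_{0 \le i \le k-1} \{t_i\} \le o + \varepsilon$
Suppose $m \ge k(k-1)\varepsilon$
Then:
  – $o \ge (k-1)\varepsilon$
  – Surplus is bounded by $(k-1)\varepsilon$
  – $\max_{0 \le i \le k-1} \{d_i\} \le (k-1)\varepsilon \le o$
Two cases:
  – If disk has a surplus
  – If disk has a shortage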
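The inequalities on this slide can be spelled out step by step. The derivation below is a reader's reconstruction, not the talk's own proof. It assumes the hypothesis is $m \ge k(k-1)\varepsilon$, that $t_i \le o + \varepsilon$ for every disk, and that "surplus" means the total number of tiles held above $o$, written $S$ here (a symbol introduced only for this sketch).

```latex
\[
o = \left\lceil \frac{m}{k} \right\rceil \;\ge\; \frac{m}{k}
  \;\ge\; \frac{k(k-1)\varepsilon}{k} \;=\; (k-1)\varepsilon .
\]
\[
S \;=\; \sum_{i=0}^{k-1} \max(0,\; t_i - o), \qquad
0 \le \max(0,\; t_i - o) \le \varepsilon \ \text{ for every } i .
\]
\[
\text{Not all } k \text{ disks can have } t_i > o,
\text{ since } \sum_{i} t_i = m \le k\,o .
\]
\[
\text{Hence at most } k-1 \text{ terms of } S \text{ are positive, so }
S \;\le\; (k-1)\varepsilon \;\le\; o .
\]
```

Under these assumptions no disk ever needs to pass on more than $(k-1)\varepsilon$ tiles, which is consistent with the bound on $\max_i \{d_i\}$ stated on the slide.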

15
[Figure: 32 disks]

16
[Figure: 64 disks]

17
[Figure: 128 disks]

18
[Figure: 32 disks, 3 dimensions]

19
Generalizations
Permutations
Higher levels of replication
Survivability
  – If the level of replication is $r$, the scheme can handle any $r - 1$ failures
  – When $r = 2$ and a single disk fails:
      Fast scheduling is still possible
      Large queries are still optimal

20
Summary
Shift schemes are a new class of schemes
  – Optimal for large enough queries
  – Efficient scheduling algorithm
  – Resilient to disk failures
Future Work
  – Better analysis of the scheme
  – Choosing shift values
