# Optimal Distributed Declustering using Replication

Keith Frikken, Purdue University. ICDT 2005, Jan 5, 2005.


## Slide 2: Declustering Data

Declustering data over multiple disks to improve performance for range queries has been well studied. Applications include:

- Spatio-temporal databases
- Image and video data
- Scientific simulation datasets

## Slide 3: Goal

- Divide the data uniformly along each dimension to create tiles
- Put the records contained in each tile on different disks so that I/O can be parallelized
- Assumptions:
  - Data can be tiled in such a way
  - Disks have constant retrieval times
- Assigning tiles to disks is similar to a coloring problem (disks are colors)
- A range query is answered optimally if the number of I/O retrievals from every disk is at most (# of tiles)/(# of disks)
- Two approaches:
  - Coloring schemes
  - Replication

## Slide 4: Notations

- $k$ is the number of disks
- $m$ is the number of tiles in a query
- $r$ is the level of replication (here, $r = 2$)
- $Q$ is the set of all range queries
- $ret(q)$ is the actual retrieval time of $q$
- The optimal retrieval time for a query $q$ is $o_q = m/k$
- The additive error is $\varepsilon = \max_{q \in Q} \{ret(q) - o_q\}$
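The additive error can be made concrete with a brute-force sketch for the unreplicated case on a small two-dimensional grid. Everything here is illustrative: the function names are hypothetical, `color` is any tile-to-disk function, and the optimal time is taken as $\lceil m/k \rceil$ on the assumption that tile counts are integral (the slide writes $m/k$).

```python
from collections import Counter
from itertools import product
from math import ceil

def retrieval_time(color, k, x0, x1, y0, y1):
    """Largest number of tiles any single disk holds in [x0..x1] x [y0..y1]."""
    loads = Counter(color(x, y) % k
                    for x in range(x0, x1 + 1)
                    for y in range(y0, y1 + 1))
    return max(loads.values())

def additive_error(color, k, n):
    """Max over all range queries on an n x n grid of ret(q) - ceil(m/k)."""
    worst = 0
    for x0, y0 in product(range(n), repeat=2):
        for x1 in range(x0, n):
            for y1 in range(y0, n):
                m = (x1 - x0 + 1) * (y1 - y0 + 1)   # tiles in the query
                ret = retrieval_time(color, k, x0, x1, y0, y1)
                worst = max(worst, ret - ceil(m / k))
    return worst

# A checkerboard over 2 disks never exceeds the optimum on a 3 x 3 grid,
# while putting every tile on one disk does.
checkerboard_error = additive_error(lambda x, y: (x + y) % 2, 2, 3)
single_disk_error = additive_error(lambda x, y: 0, 2, 2)
```

The enumeration is quadratic in the number of tiles per dimension pair, so it is only practical for small grids; it is meant to pin down the definition, not to evaluate real schemes.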

## Slide 5: Coloring schemes

- Disk Modulo (DM) [Du and Sobolewski, 1982]
- Fieldwise XOR (FX) [Kim and Pramanik, 1988]
- Cyclic Schemes (RPHM, GFIB, EXH) [Prabhakar et al, 1998]
- Golden Ratio Sequences (GRS) [Bhatia et al, 2000]
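For orientation, the first three schemes can be written as one-line tile-to-disk functions in two dimensions. This is a sketch from the commonly cited definitions, not from the slides: the function names are hypothetical, and the cyclic form `(x + h*y) mod k` with a skip value `h` is stated as an assumption (RPHM, GFIB and EXH differ in how that skip value is chosen).

```python
def disk_modulo(x, y, k):
    """DM: add the tile coordinates, then reduce modulo the number of disks."""
    return (x + y) % k

def fieldwise_xor(x, y, k):
    """FX: XOR the tile coordinates bitwise, then reduce modulo k."""
    return (x ^ y) % k

def cyclic(x, y, k, h):
    """Cyclic scheme with skip value h (assumed form; h is chosen per variant)."""
    return (x + h * y) % k

def grid(assign, n):
    """Disk numbers for an n x n grid, one inner list per row y."""
    return [[assign(x, y) for x in range(n)] for y in range(n)]

dm_rows = grid(lambda x, y: disk_modulo(x, y, 4), 4)
```

With `h = 1` the cyclic form reduces to Disk Modulo, which is one way to see DM as a special case of the cyclic family.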

## Slide 6: Other schemes

- [Atallah and Prabhakar, 2000] developed a scheme for two-dimensional grids with $k = 2^n$ disks that has an additive error of $O(\log k)$
- [Sinha et al, 2001] proved lower bounds on the additive error of $\Omega(\log k)$ for 2 dimensions and $\Omega(\log^{(d-1)/2} k)$ for $d > 2$ dimensions
- [Chen and Cheng, 2002] showed that an additive error of $O(\log^{d-1} k)$ is achievable for any number of dimensions $d > 2$

## Slide 7: Replication

Placing records on multiple disks can further improve the performance of declustering schemes.

- Two problems:
  - How to schedule a query (i.e., which tiles are retrieved from each disk)
  - How to use replication to balance load
- Approaches:
  - Chained Declustering [Hsiao and DeWitt, 1990]
  - Random Duplicate Allocation (RDA) [Sanders et al, 2000], [Sanders, 2001], and [Czumaj and Scheideler, 2003]

## Slide 8: Replication Results

- Chained Declustering
  - Fast scheduling algorithm: $O(m+k)$ time to test whether a specific retrieval time is possible [Aerts et al, 2000]
- RDA
  - If $m \ge c\,k \log k$, then optimal with high probability [Czumaj and Scheideler, 2003]
  - Fast scheduling algorithm: $O(\Delta k^{O(1)})$ time [Czumaj and Scheideler, 2003]
- Hybrid techniques [Chen and Cheng, 2002]
  - Use GRS with a second, random disk

## Slide 9: Our Results

We define a new class of schemes called the shift schemes.

- Deterministic
- Any query with at least $k(k-1)\varepsilon$ tiles can be answered in an optimal fashion
- Queries can be scheduled in $O(m + k \log \varepsilon)$ time
- If a single disk fails, any query with at least $k(k-1)\varepsilon$ tiles can still be answered optimally
- Experimental performance is similar to RDA (better in many cases)

## Slide 10: Shift Scheme Definition

- Use any strong coloring scheme
- Use a modified chained declustering
  - Defined by a shift value $s$ (where $\gcd(s,k)=1$)
- The base scheme is defined by a function $f(x,y)$
  - The second color is $(f(x,y)+s) \bmod k$

## Slide 11: Shift Scheme Definition (example)

Example grid; each tile lists its first color and its second color:

|     |     |     |     |     |
|-----|-----|-----|-----|-----|
| 0,3 | 1,4 | 2,0 | 3,1 | 4,2 |
| 2,0 | 3,1 | 4,2 | 0,3 | 1,4 |
| 4,2 | 0,3 | 1,4 | 2,0 | 3,1 |
| 1,4 | 2,0 | 3,1 | 4,2 | 0,3 |
| 3,1 | 4,2 | 0,3 | 1,4 | 2,0 |
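The example grid can be regenerated with a few lines of code. The parameters are inferred from the grid itself rather than stated on the slide: $k = 5$ disks, shift $s = 3$, and a base function $f(x,y) = (x + 2y) \bmod 5$ with $x$ as the column and $y$ as the row. `shift_scheme` is a hypothetical helper name.

```python
from math import gcd

def shift_scheme(f, k, s):
    """Return a function mapping a tile (x, y) to its (first, second) disk."""
    if gcd(s, k) != 1:
        raise ValueError("shift value s must be coprime to k")
    def copies(x, y):
        first = f(x, y) % k
        return first, (first + s) % k
    return copies

# Base scheme inferred from the slide's grid (an assumption, not stated there).
base = lambda x, y: (x + 2 * y) % 5
copies = shift_scheme(base, k=5, s=3)

example = [[copies(x, y) for x in range(5)] for y in range(5)]
for row in example:
    print(" ".join(f"{a},{b}" for a, b in row))
```

Requiring $\gcd(s,k)=1$ means that repeatedly adding $s$ visits every disk before returning to the start, so the first-to-second-disk links form a single ring over all $k$ disks.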

## Slide 12: Scheduling

- A modification of the chained declustering scheduling algorithm can schedule queries in $O(m + k \log \varepsilon)$ time
- Essentially, use the previous algorithm to test whether a specific load is possible, and do a binary search on the possible loads
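The "test a load, then binary search" structure can be sketched as follows. This is not the algorithm of [Aerts et al, 2000]; it is a simple stand-in feasibility test under the assumption that the disks form a ring in which each disk may hand some of its own tiles only to the next disk. All names are hypothetical, and `t` lists the per-disk tile counts of one query in ring order.

```python
from math import ceil

def ring_order(k, s):
    """Disks in the order reached by repeatedly adding s (one cycle if gcd(s,k)=1)."""
    return [(j * s) % k for j in range(k)]

def shifts_for_load(t, load):
    """Smallest shifts d (d[i] tiles passed from position i to i+1) that keep every
    position at or below `load`, or None if no such shifts exist."""
    k = len(t)
    if sum(t) > k * load:
        return None                      # not enough total capacity
    d = [0] * k
    carry = 0
    for _ in range(2):                   # second lap accounts for wrap-around
        for i in range(k):
            carry = max(0, carry + t[i] - load)
            d[i] = carry
    if any(d[i] > t[i] for i in range(k)):
        return None                      # a disk would have to pass on foreign tiles
    return d

def min_max_load(t):
    """Binary search for the smallest feasible load; returns (load, shifts)."""
    lo, hi = ceil(sum(t) / len(t)), max(t)
    while lo < hi:
        mid = (lo + hi) // 2
        if shifts_for_load(t, mid) is None:
            lo = mid + 1
        else:
            hi = mid
    return lo, shifts_for_load(t, lo)
```

Each feasibility test is O(k) once the tile counts are known, and the search runs between $\lceil m/k \rceil$ and the largest primary count. To use it with a shift value $s$, reorder the per-disk counts with `ring_order(k, s)` first.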

## Slide 13: Bound (1)

- There are $k$ disks ($D_0, \ldots, D_{k-1}$)
- Disk $D_i$ has $t_i$ tiles initially (as the primary disk)
- The number of tiles is $m = t_0 + \cdots + t_{k-1}$
- $D_i$ shifts $d_i$ tiles to $D_{i+1}$, where $d_i \le t_i$
- The goal is to minimize the largest number of tiles at any disk, i.e., $\max_{0 \le i \le k-1} \{d_{i-1} + t_i - d_i\}$

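## Slide 14: Bound (2)

- Recall:
  - $o = m/k$
  - $\max_{0 \le i \le k-1} \{t_i\} \le o + \varepsilon$
- Suppose $m \ge k(k-1)\varepsilon$. Then:
  - $o \ge (k-1)\varepsilon$
  - The surplus is bounded by $(k-1)\varepsilon$
  - $\max_{0 \le i \le k-1} \{d_i\} \le (k-1)\varepsilon \le o$
- Two cases:
  - If a disk has a surplus
  - If a disk has a shortage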
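One way to see why the threshold suffices is sketched below. It is not necessarily the argument used in the paper; it assumes indices are taken modulo $k$, that $o \ge m/k$, and it introduces the helper symbol $c_i$ for the surplus (or, if negative, the shortage) of disk $D_i$.

```latex
\begin{align*}
c_i &= t_i - o \le \varepsilon, \qquad \sum_{i=0}^{k-1} c_i = m - k\,o \le 0,\\
d_i &= \max\Bigl(0,\ \max_{1 \le j \le k-1}\ \sum_{l=0}^{j-1} c_{i-l}\Bigr)
      \le (k-1)\,\varepsilon \le o,\\
d_i &\ge d_{i-1} + c_i \;\Longrightarrow\; d_{i-1} + t_i - d_i \le o,\\
d_{i-1} + c_i &\le o + (t_i - o) = t_i \;\Longrightarrow\; d_i \le t_i .
\end{align*}
```

The second line picks the smallest shifts that push every surplus forward along the ring; since each $c_i \le \varepsilon$ and a full lap sums to at most zero, no shift exceeds $(k-1)\varepsilon$. The third line says every disk ends at load at most $o$, and the fourth that no disk is asked to pass on more tiles than it holds as primary.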

## Slides 15-18: Figures

- Slide 15: 32 disks
- Slide 16: 64 disks
- Slide 17: 128 disks
- Slide 18: 32 disks, 3 dimensions

## Slide 19: Generalizations

- Permutations
- Higher levels of replication
- Survivability
  - If the level of replication is $r$, any $r-1$ failures can be handled
  - When $r = 2$ and a single disk fails:
    - Fast scheduling is still possible
    - Large queries are still optimal
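For the $r = 2$ case, a single failure turns the ring of disks into a chain, which makes a load test a single greedy pass. The sketch below illustrates that idea under stated assumptions; it is not taken from the paper. `t` holds per-disk primary tile counts in ring order, `failed` is the ring position of the failed disk, and the function name is hypothetical.

```python
def schedule_after_failure(t, failed, load):
    """Shifts d (d[i] tiles read from the next ring position instead of i) that keep
    every surviving disk at or below `load`, or None if that load is impossible."""
    k = len(t)
    d = [0] * k
    d[failed] = t[failed]          # the failed disk's tiles must all come from its successor
    carry = t[failed]
    # Walk the survivors starting just after the failed disk.
    for j in range(1, k):
        pos = (failed + j) % k
        need = max(0, carry + t[pos] - load)
        if need > t[pos]:
            return None            # would have to pass on tiles it does not hold as primary
        d[pos] = need
        carry = need
    if carry != 0:
        return None                # the last survivor's successor is the failed disk
    return d

# Four disks with two tiles each; the disk at ring position 3 fails.
plan = schedule_after_failure([2, 2, 2, 2], failed=3, load=3)
```

Taking the smallest possible shift at each step is safe because every constraint is a lower bound on the next shift, so if the greedy pass fails, any larger choice fails as well.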

## Slide 20: Summary

- Shift schemes are a new class of schemes
  - Optimal for large enough queries
  - Efficient scheduling algorithm
  - Resilient to disk failures
- Future work
  - Better analysis of the scheme
  - Choosing shift values
