1 Koh-i-Noor Mark Manasse, Chandu Thekkath Microsoft Research Silicon Valley 1/29/2014
2 Motivation Large-scale storage systems can be expensive to build and maintain: there is the cost of managing the storage and the cost of provisioning the storage. Total cost of ownership is dominated by management cost, but provisioning 10,000 spindles' worth of useful data is expensive, especially with redundant storage.
3 Rationale / Mathematics I Hotmail has storage needs approaching a petabyte. Currently built from difficult-to-manage components: 1,000 10-disk RAID arrays. Requires significant backup effort, incessant replacement of failed drives, etc. Reliability could be improved by mirroring. But 10,000 mirrored drives is a lot of work to manage. For either system, difficult to expand gracefully, since the storage is in small, isolated clumps.
The Galois field of order $p^k$ (for $p$ prime) is formed by considering polynomials in $(\mathbb{Z}/p\mathbb{Z})[x]$ modulo a primitive polynomial of degree $k$. Facts: Any primitive polynomial will do; all the resulting fields are isomorphic. We write $\mathrm{GF}(p^k)$ to denote one such field. $x$ is a generator of the field. Everything you know about algebra is still true.
4 Reducing the Total Cost of Ownership Reduce management cost: Use automatic management, automatic load balancing, incremental system expansion, fast backup. (Focus of earlier research and the Sepal project.) Reduce provisioning cost: Use redundancy schemes that tolerate multiple failed components without the cost of duplicating them as is done in mirroring or triplexing.
5 Rationale / Mathematics II There are existing products that improve storage management. It would be difficult to use them to build petabyte storage systems that can support something like Hotmail. Thousands of these components would be needed, and even with squadrons of Perl and Java programmers it would be difficult to configure and manage such a system. These products lack formal management interfaces that can be used by external programs or each other to reliably manage the system.
A Vandermonde matrix is of the form whose rows are $(1, a, a^2, \ldots)$, $(1, b, b^2, \ldots)$, ..., $(1, z, z^2, \ldots)$. The determinant of a Vandermonde matrix is, up to sign, $(a-b)(a-c)\cdots(a-z)(b-c)\cdots(b-z)\cdots(y-z)$.
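The determinant identity can be checked numerically. The following is a minimal sketch over the integers rather than a Galois field; the helper names are illustrative and not from the slides. It uses the sign convention $\prod_{i<j}(x_j - x_i)$, which differs from the slide's ordering by at most a sign (and over $\mathrm{GF}(2^k)$ there is no sign at all).

```python
from itertools import permutations
from math import prod

def det(m):
    """Determinant by the Leibniz permutation expansion (fine for tiny matrices)."""
    n = len(m)
    total = 0
    for perm in permutations(range(n)):
        # Parity of the permutation, from its inversion count.
        inversions = sum(perm[i] > perm[j] for i in range(n) for j in range(i + 1, n))
        sign = -1 if inversions % 2 else 1
        total += sign * prod(m[i][perm[i]] for i in range(n))
    return total

def vandermonde(xs):
    """Row i holds the powers 1, x_i, x_i^2, ... of the i-th element."""
    return [[x ** j for j in range(len(xs))] for x in xs]

def product_of_differences(xs):
    n = len(xs)
    return prod(xs[j] - xs[i] for i in range(n) for j in range(i + 1, n))

xs = [2, 3, 5, 7]
assert det(vandermonde(xs)) == product_of_differences(xs)
```

A repeated element makes one difference zero, so the determinant vanishes; distinct elements always give an invertible matrix.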
6 Large-Scale Mirroring Assume disk mean-time-to-failure of 50 years. Mirroring needs 20,000 disks for 10,000 data disks. Expect 8 disks to fail every week. Assume 1 day to repair a disk by copying from mirror. Mean time before data is unavailable (because of a double failure) is 45 years. Cost of storage is double what is actually needed.
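The weekly failure estimate follows from simple arithmetic. A small sketch, assuming failures are independent and spread evenly over the disk lifetime (the function name is illustrative):

```python
def expected_failures_per_week(n_disks, mttf_years):
    """Expected disk failures per week for n_disks with the given MTTF."""
    failures_per_year = n_disks / mttf_years
    return failures_per_year / 52.0

weekly = expected_failures_per_week(n_disks=20_000, mttf_years=50)
print(f"{weekly:.1f} failures per week")
```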
7 Rationale / Mathematics III Goal: build a distributed block-level storage system that is highly available, scalable, and easy to manage. Design each component of the system to automate or eliminate many management decisions that are today made by human beings. Automatically tolerate and recover from exceptional conditions such as component failures that traditionally require human intervention.
Facts about determinants: $\det(AB) = \det(A)\det(B)$; $\det(A \mid k\mathbf{a}) = k \det(A \mid \mathbf{a})$.
Facts about GF(256): $a + b = a \oplus b$ (XOR). Every element other than 0 is $x^k$ for some $0 \le k < 255$. If $a = x^k$ and $b = x^j$, then $ab = x^{k+j}$. With a 512-byte table of logarithms and a 1024-byte table of anti-logarithms, multiplication becomes trivial. A byte of data, viewed as an element of GF(256), multiplied and added with other bytes still occupies a byte of storage.
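Table-driven multiplication in GF(256) can be sketched as follows. The primitive polynomial 0x11D is an assumption (the slides do not name one here), and the table sizes are chosen for simplicity rather than to match the 512-byte and 1024-byte layout mentioned on the slide.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed; any primitive polynomial works)

EXP = [0] * 510   # anti-logarithms, doubled so that k + j needs no "mod 255"
LOG = [0] * 256   # logarithms; LOG[0] is unused because 0 has no logarithm

value = 1
for k in range(255):
    EXP[k] = EXP[k + 255] = value
    LOG[value] = k
    value <<= 1            # multiply by x
    if value & 0x100:      # degree reached 8: reduce modulo POLY
        value ^= POLY

def gf_add(a, b):
    """Addition in GF(256) is bytewise XOR."""
    return a ^ b

def gf_mul(a, b):
    """Multiply by adding logarithms: x^k * x^j = x^(k+j)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]
```

Because every nonzero byte is a power of x, a product is one table lookup per operand, one integer addition, and one anti-log lookup.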
8 Improving Time to Data Unavailability Consider 10,000 disks, each with 50 years MTTF. ~50 years with mirroring at 2x provisioning cost: may be too short a period at too high a price. ~½M years with triplexing at 3x provisioning cost: very long at very high price. An alternative: use Reed-Solomon erasure codes. 40 clusters of 256 disks each. Can tolerate triple (or more) failures. ~50K years before data unavailability. ~1.03x provisioning cost.
9 Rationale IV Suppose the instantaneous probability of a disk being in a failed state is $p$ (computed as MTTR/MTTF). The probability of $k$ given disks all being failed is $p^k$. The probability of finding $k$ disks failed out of $j$ is $q = \binom{j}{k} p^k$. Cluster MTTF $= \mathrm{MTTR}/q$; with $N$ total disks there are $N/j$ clusters, so System MTTF $\approx k!\,\mathrm{MTTF}^k / \bigl(N\,(j\,\mathrm{MTTR})^{k-1}\bigr)$ (if $j \gg k$).
RAID sets: $N=10{,}000$, $k=2$, $j=10$, MTTF = 50 years, MTTR = 1 day, System MTTF $= 50^2 \cdot 365^1 / 50000 \approx 18$ years.
Duplexing: $N=20{,}000$, $k=2$, $j=2$, System MTTF $= 50^2 \cdot 365^1 / 20000 \approx 45$ years.
Triplexing: $N=30{,}000$, $k=3$, $j=3$, System MTTF $= 50^3 \cdot 365^2 / 30000 \approx 555{,}000$ years.
Erasure codes: $N=10{,}000$, $k=4$, $j=256$, System MTTF $\approx 50^4 \cdot 365^3 \cdot 24 / (256^3 \cdot 10000) \approx 43{,}500$ years. Is that enough?
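The closed-form approximation can be evaluated directly. This is a sketch under the slide's stated assumptions (independent failures, $j \gg k$); the function and parameter names are illustrative. The slide's per-configuration expressions do not all carry the same constant factors as the general formula, so the outputs should be read as order-of-magnitude estimates.

```python
from math import factorial

def system_mttf_years(n_disks, k, j, mttf_years=50.0, mttr_days=1.0):
    """k! * MTTF^k / (N * (j * MTTR)^(k-1)), with all times in years.

    n_disks: total disks N; k: failures that make data unavailable;
    j: disks per cluster.
    """
    mttr_years = mttr_days / 365.0
    return factorial(k) * mttf_years ** k / (n_disks * (j * mttr_years) ** (k - 1))

configurations = {
    "RAID sets":     dict(n_disks=10_000, k=2, j=10),
    "Duplexing":     dict(n_disks=20_000, k=2, j=2),
    "Erasure codes": dict(n_disks=10_000, k=4, j=256),
}
for name, params in configurations.items():
    print(f"{name}: {system_mttf_years(**params):,.0f} years")
```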
10 Inductive step proving the determinant of a Vandermonde matrix is the product of differences. Determinant here is 1. Expand on the last column; after removing common factors, what's left is $V_k$.
11 Reed-Solomon Erasure Codes
1. We use an $n \times (n+k)$ coding matrix to store data on $n$ data disks and $k$ check disks ($k=3$ in our example).
2. Suppose data disks 2, 3 and check disk 3 fail.
3. Omitting failed rows, we get an invertible $n \times n$ matrix $R$.
4. Multiplying both sides by $R^{-1}$, we recover all the data.
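The four steps can be sketched end to end on one byte per disk. Several things here are assumptions rather than the authors' implementation: the primitive polynomial 0x11D, the check coefficients $x^{cj}$ for check disk $c$ and data disk $j$ (both counted from 0), a coding matrix stored as $n+k$ rows of length $n$, and a plain Gauss-Jordan solve in place of an explicit $R^{-1}$.

```python
POLY = 0x11D  # assumed primitive polynomial for GF(256)

def gf_mul(a, b):
    """Multiply in GF(256) by shift-and-add, reducing modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result

def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    return gf_pow(a, 254)  # a^255 = 1 for every nonzero a

def coding_rows(n, k):
    """n identity rows (data disks) followed by k check rows with entries x^(c*j)."""
    rows = [[int(i == j) for j in range(n)] for i in range(n)]
    rows += [[gf_pow(2, c * j) for j in range(n)] for c in range(k)]
    return rows

def mat_vec(rows, vec):
    out = []
    for row in rows:
        acc = 0
        for coeff, v in zip(row, vec):
            acc ^= gf_mul(coeff, v)
        out.append(acc)
    return out

def solve(rows, rhs):
    """Gauss-Jordan elimination over GF(256): returns d with rows * d = rhs."""
    n = len(rows)
    aug = [list(r) + [b] for r, b in zip(rows, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(inv, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [v ^ gf_mul(f, w) for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

data = [0x11, 0x22, 0x33, 0x44, 0x55]      # one byte per data disk (n = 5)
rows = coding_rows(n=5, k=3)
stored = mat_vec(rows, data)               # 5 data bytes followed by 3 check bytes
failed = {1, 2, 7}                         # data disks 2 and 3, check disk 3
alive = [i for i in range(len(rows)) if i not in failed]
recovered = solve([rows[i] for i in alive], [stored[i] for i in alive])
assert recovered == data
```

In a real array the same solve is applied independently to every byte offset, so the elimination is done once per failure pattern and the resulting coefficients are reused.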
12 Rationale / Mathematics Va The matrices were transposed so that the data would be in column vectors, which fits better on the slide. The correction matrix in the example uses my special erasure code for $k = 3$. The correction rows are taken from a Vandermonde matrix, as you might recognize had you been reading the other half of the slides.
For general $k$, we take a Vandermonde matrix using elements 0, 1, …, 255. The product of element differences is non-zero, because all the individual element differences are non-zero. Make it tall enough for all the data disks. Diagonalize the square part, and use the remaining columns for check disks. Row operations change the determinant in simple ways.
13 Rationale / Mathematics Vb Computing check digits is easy, and well-parallelized. To update a block on data disk $j$, compute the XOR of the old block with the new block. Multicast the log of the XOR together with $j$ to the check disks (and have them start reading the block off disk). Each check disk multiplies the XORed data block by the $j$th entry in its column, XORs that with the old check value, and writes the new check block. Rotate which disks are check disks, making the load uniform. No difference in latency compared to RAID-5 (but double the disk bandwidth).
For $k=3$, we can just take the identity matrix plus three columns of Vandermonde, using $1, x, x^2$. Computation of check digits is faster: the logarithm for the $k$th check disk and $j$th data disk is $kj$. We omit non-failed data disks in computing determinants. What remains is a small minor of the whole matrix. 1×1 works because all entries are non-zero. 3×3 works because it's a transposed Vandermonde matrix. 2×2 works because it's either a Vandermonde matrix, or a Vandermonde matrix with columns multiplied by powers of $x$.
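The incremental update can be sketched on single bytes and compared against a full recomputation. Assumptions: the primitive polynomial 0x11D, check coefficients $x^{cj}$ with disks counted from 0, and direct multiplication instead of shipping the logarithm of the XOR as the slide describes. All names are illustrative.

```python
POLY = 0x11D  # assumed primitive polynomial for GF(256)

def gf_mul(a, b):
    """Multiply in GF(256) by shift-and-add, reducing modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result

def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def check_blocks(data, k):
    """Full recomputation: check c is the sum over j of x^(c*j) * data[j]."""
    checks = []
    for c in range(k):
        acc = 0
        for j, d in enumerate(data):
            acc ^= gf_mul(gf_pow(2, c * j), d)
        checks.append(acc)
    return checks

def update(data, checks, j, new_value):
    """Write new_value to data disk j, touching only that disk and the check disks."""
    delta = data[j] ^ new_value            # XOR of old and new contents
    data[j] = new_value
    for c in range(len(checks)):
        checks[c] ^= gf_mul(gf_pow(2, c * j), delta)

data = [0x10, 0x20, 0x30, 0x40]
checks = check_blocks(data, k=3)
update(data, checks, j=2, new_value=0xAB)
assert checks == check_blocks(data, k=3)
```

The update reads and writes one data block and each check block, independent of how many data disks the cluster holds.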
14 Parallel Reconstruction of a Failed Disk Given the inverse matrix, a failed disk is the sum (exclusive or) of products of the individual bytes on data and check disks with the elements of the matrix. The example shows reconstruction of disk 2, in a system with 4 data disks and 2 check disks; disks 2 and 3 failed.
$D_2 = \bigl(D_1 \cdot (1+x+x^2+x^3+x^7) \oplus D_4 \cdot x\bigr) \oplus \bigl(C_1 \cdot (1+x^2+x^4+x^5+x^6+x^7) \oplus C_2 \cdot (x+x^3+x^4+x^5+x^6)\bigr)$
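The four reconstruction coefficients can be checked numerically. The sketch below assumes a coding layout that the slide does not spell out but that is consistent with its coefficients: $C_1 = D_1 \oplus D_2 \oplus D_3 \oplus D_4$ and $C_2 = D_1 \oplus x D_2 \oplus x^2 D_3 \oplus x^3 D_4$, in GF(256) modulo $x^8+x^4+x^3+x^2+1$ (0x11D). The function names are illustrative.

```python
from itertools import product

POLY = 0x11D  # assumed: x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a, b):
    """Multiply in GF(256) by shift-and-add, reducing modulo POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result

def checks(d1, d2, d3, d4):
    """Assumed layout: C1 is the plain XOR, C2 weights disk j by x^(j-1)."""
    c1 = d1 ^ d2 ^ d3 ^ d4
    c2 = d1 ^ gf_mul(0x02, d2) ^ gf_mul(0x04, d3) ^ gf_mul(0x08, d4)
    return c1, c2

# Coefficients from the slide, written as bit masks of the polynomials.
COEF_D1 = 0b10001111  # 1 + x + x^2 + x^3 + x^7
COEF_D4 = 0b00000010  # x
COEF_C1 = 0b11110101  # 1 + x^2 + x^4 + x^5 + x^6 + x^7
COEF_C2 = 0b01111010  # x + x^3 + x^4 + x^5 + x^6

def reconstruct_d2(d1, d4, c1, c2):
    """Rebuild disk 2 from the survivors when disks 2 and 3 have failed."""
    left = gf_mul(COEF_D1, d1) ^ gf_mul(COEF_D4, d4)
    right = gf_mul(COEF_C1, c1) ^ gf_mul(COEF_C2, c2)
    return left ^ right

for d1, d2, d3, d4 in product(range(0, 256, 51), repeat=4):
    c1, c2 = checks(d1, d2, d3, d4)
    assert reconstruct_d2(d1, d4, c1, c2) == d2
```

Each of the four products is independent, so they can be computed in parallel at the surviving disks before the XORs combine them.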
15 Rationale / Mathematics VI The reconstruction on the previous slide requires: reading all of the surviving disks; parallel multiplication by precomputed entries from the inverse matrix; XOR of pairs of disk entries working up a binary tree. A network of small, cheap switches provides the necessary connectivity. Put hot spare CPUs and drives into the network; skip up the tree at nodes with no useful partner.
Given the nature of the matrix, the inverse matrix can be constructed via Gaussian elimination very quickly for small values of $k$. With more complicated erasure coding (2D parity, for example), it's harder to see how to wire the network to accommodate hot spares without requiring more from the interconnect.
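The pairwise XOR up a binary tree can be sketched as a level-by-level reduction. This is only an in-memory illustration of the combining order, assuming at least one block and equal block lengths; it says nothing about how the switch network is actually wired.

```python
def tree_xor(blocks):
    """XOR equal-length byte blocks pairwise, level by level, up a binary tree."""
    level = list(blocks)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            left, right = level[i], level[i + 1]
            next_level.append(bytes(a ^ b for a, b in zip(left, right)))
        if len(level) % 2:
            # A node with no partner skips up to the next level unchanged.
            next_level.append(level[-1])
        level = next_level
    return level[0]

blocks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
print(tree_xor(blocks).hex())
```

With $m$ surviving disks the tree has about $\log_2 m$ levels, and every pair within a level can be combined at the same time.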
16 Related Work Reed-Solomon error correcting codes have been used for single disks and CDs. Reed-Solomon and Even-Odd erasure codes have been proposed: For software RAID [ABC], For wide area storage systems [LANL]. Our usage of the code is different. Parallel reconstruction is novel. Patents in progress on both.
17 Potential deficiencies This could be a bad idea, depending on what you're trying to do: The bandwidth in at the top of the tree isn't enough to keep all the disk arms busy with small reads and writes. This could be a bad disk for building databases. The total disk bandwidth at least doubles during degraded operation, which might affect many more disks than with mirroring or standard RAID-5. Failures may not be independent, invalidating predicted reliability. CPUs (which, if you attach multiple disks, should serve disks in different pods) may fail too often, in ways that rebooting doesn't fix. Power supplies, cables, switches, … may fail, which we haven't accounted for. Writes are going to be relatively expensive.
18 Comparisons and future work Other people have already built small erasure-coded arrays. Is there enough new here? We think so. Real systems need: backup, geographical redundancy.