Disk Failures Xiaqing He ID: 204 Dr. Lin

Content
1) RAID stands for "redundant array of independent disks"
2) Several schemes to recover from disk crashes:
-- Mirroring: RAID level 1
-- Parity checks: RAID level 4
-- An improvement: RAID level 5
-- RAID level 6

Mirroring
The simplest scheme to recover from disk crashes.
How mirroring works:
-- make two or more copies of the data on different disks
Benefits:
-- the data survives if one disk fails;
-- reads can be divided among the disks, so several blocks can be accessed at once

Mirroring (cont'd)
When can the data be lost?
-- only if the second (mirror/redundant) disk crashes while the first (data) disk is being repaired.
Probability. Suppose:
-- one disk: mean time to failure = 10 years;
-- so one of the two disks fails, on average, once every 5 years;
-- replacing the failed disk takes 3 hours = 1/2,920 year.
So:
-- the probability that the mirror disk fails during the repair = 1/10 * 1/2,920 = 1/29,200;
-- the probability of data loss per year with mirroring = 1/5 * 1/29,200 = 1/146,000, i.e. a mean time to data loss of 146,000 years.
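The arithmetic above can be checked with exact fractions. This is a minimal sketch, assuming a 365-day year and independent disk failures; the function name mirrored_loss_probability_per_year is made up for illustration.

```python
from fractions import Fraction

def mirrored_loss_probability_per_year(mttf_years, repair_hours):
    """Yearly probability of losing data on a two-disk mirror (simplified model)."""
    # Repair time expressed in years, assuming 365 days of 24 hours.
    repair_years = Fraction(repair_hours, 365 * 24)
    # One of the two disks fails, on average, twice per MTTF period.
    first_failure_rate = Fraction(2, mttf_years)
    # Probability that the surviving disk fails while the repair is under way.
    second_failure_during_repair = Fraction(1, mttf_years) * repair_years
    return first_failure_rate * second_failure_during_repair

p = mirrored_loss_probability_per_year(10, 3)
print(p)  # 1/146000
```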

Parity Blocks
Why change?
-- disadvantage of mirroring: it uses as many redundant disks as data disks
What's new?
-- RAID level 4: uses only one redundant disk
How does this one redundant disk work?
-- modulo-2 sum;
-- the jth bit of the redundant disk is the modulo-2 sum of the jth bits of all the data disks.
Example

Parity Blocks (cont'd): Example
Data disks:
Disk 1: 11110000
Disk 2: 10101010
Disk 3: 00111000
Redundant disk:
Disk 4: 01100010
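The modulo-2 sum is a bitwise XOR, so the redundant block in this example can be recomputed in a few lines. This is a sketch only; blocks are represented as bit strings to match the slide, and the helper name parity_block is an assumption.

```python
from functools import reduce

def parity_block(blocks):
    """Bitwise modulo-2 sum (XOR) of equal-length bit strings."""
    width = len(blocks[0])
    value = reduce(lambda a, b: a ^ b, (int(block, 2) for block in blocks))
    return format(value, f"0{width}b")

data_disks = ["11110000", "10101010", "00111000"]
print(parity_block(data_disks))  # 01100010
```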

RAID 4 (cont'd)
Reading
-- the same as reading a block from any disk;
Writing
1) change the block on the data disk;
2) change the corresponding block of the redundant disk.
Why?
-- the redundant disk must continue to hold the parity checks for the corresponding blocks of all the data disks

RAID 4 (cont'd): writing
For a total of N data disks:
1) Naive way:
-- read the corresponding blocks of the other N-1 data disks and compute their modulo-2 sum together with the new block;
-- rewrite the block of the redundant disk with this modulo-2 sum.
2) Better way:
-- take the modulo-2 sum of the old and new versions of the data block being rewritten;
-- flip the bits of the redundant block in the positions where this modulo-2 sum has 1's.

RAID 4 (cont'd): writing, example
Data disks:
Disk 1: 11110000
Disk 2: 10101010 -> 01100110
Disk 3: 00111000
To do:
-- modulo-2 sum of the old and new versions of disk 2: 11001100
-- so we need to flip positions 1, 2, 5, 6 of the redundant disk.
Redundant disk:
Disk 4: 01100010 -> 10101110
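The "better way" of writing can be sketched as follows, using the blocks from this example. The names xor_bits and update_parity are illustrative, and blocks are again bit strings rather than real disk blocks.

```python
def xor_bits(a, b):
    """Modulo-2 sum of two equal-length bit strings."""
    return format(int(a, 2) ^ int(b, 2), f"0{len(a)}b")

def update_parity(old_data, new_data, old_parity):
    """Return (delta, new_parity) without reading the other data disks."""
    delta = xor_bits(old_data, new_data)      # 1's mark the bits that changed
    return delta, xor_bits(old_parity, delta)  # flip exactly those parity bits

delta, new_parity = update_parity("10101010", "01100110", "01100010")
flipped = [i + 1 for i, bit in enumerate(delta) if bit == "1"]
print(delta)       # 11001100
print(flipped)     # [1, 2, 5, 6]
print(new_parity)  # 10101110
```

Only two blocks are read (old data, old parity) and two are written, regardless of how many data disks there are.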

RAID 4 (cont'd): failure recovery
Redundant disk crashes:
-- swap in a new one and recompute its contents from all the data disks;
One of the data disks crashes:
-- swap in a new one;
-- recompute its contents from the other disks, i.e. the remaining data disks and the redundant disk;
How to recompute? (The rule is the same in both cases, which is what makes the improvement in RAID 5 possible.)
-- take the modulo-2 sum of the corresponding bits of all the other disks
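Because the recovery rule is the same for data and redundant disks, one function covers both cases. This is a small sketch using the blocks from the parity example earlier in the deck (disks 1-3 data, disk 4 redundant); recover_block is a hypothetical helper name.

```python
from functools import reduce

def recover_block(surviving_blocks):
    """Rebuild a lost block as the modulo-2 sum of all surviving blocks."""
    width = len(surviving_blocks[0])
    value = reduce(lambda a, b: a ^ b, (int(block, 2) for block in surviving_blocks))
    return format(value, f"0{width}b")

# Data disk 2 crashed: use disks 1, 3 and the redundant disk 4.
print(recover_block(["11110000", "00111000", "01100010"]))  # 10101010

# Redundant disk 4 crashed: use data disks 1, 2 and 3.
print(recover_block(["11110000", "10101010", "00111000"]))  # 01100010
```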

An Improvement: RAID 5
Why is an improvement needed?
-- shortcoming of RAID level 4: the redundant disk is a bottleneck, because every update of any data disk must also read and write the redundant disk;
Principle of RAID level 5 (RAID 5):
-- treat each disk as the redundant disk for some of the blocks;
Why is this feasible?
-- the rule of failure recovery is the same for the redundant disk and the data disks: "take the modulo-2 sum of the corresponding bits of all the other disks";
-- so there is no need to treat one disk as the redundant disk and the others as data disks.

3) RAID 5 (cont'd)
How do we decide for which blocks a given disk acts as the redundant disk?
-- if there are n+1 disks, numbered 0 to n, then we can treat the ith cylinder of disk j as redundant if j is the remainder when i is divided by n+1;
Example

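3) RAID 5 (cont'd): example
n = 3, so there are 4 disks, numbered 0 to 3.
-- the first disk, numbered 0, is redundant for cylinders 4, 8, 12, ...;
-- the second disk, numbered 1, is redundant for cylinders 1, 5, 9, ...;
-- the third disk, numbered 2, is redundant for cylinders 2, 6, 10, ...;
-- and so on.
Suppose all 4 disks are equally likely to be written. For any one of the 4 disks, the probability of being involved in a write is:
1/4 (it holds the block being written) + 3/4 * 1/3 (another disk holds the block and this disk is redundant for that cylinder) = 1/2
In general, with m disks in total: 1/m + (m-1)/m * 1/(m-1) = 2/m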
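The cylinder-to-redundant-disk rule and the write-load arithmetic can be sketched together. The function names are illustrative, and the load formula assumes, as the slide does, that writes are spread uniformly over the disks.

```python
from fractions import Fraction

def redundant_disk(cylinder, num_disks):
    """Disk j is redundant for cylinder i when j = i mod (n + 1)."""
    return cylinder % num_disks

def write_involvement(num_disks):
    """Probability that a given disk takes part in a random block write."""
    m = num_disks
    holds_data = Fraction(1, m)                           # the written block is on this disk
    holds_parity = Fraction(m - 1, m) * Fraction(1, m - 1)  # this disk is redundant for it
    return holds_data + holds_parity

# Four disks (n = 3): which of cylinders 1..12 use disk 0 as the redundant disk?
print([i for i in range(1, 13) if redundant_disk(i, 4) == 0])  # [4, 8, 12]
print(write_involvement(4))                                    # 1/2
```

Under RAID 4 the single redundant disk would be involved in every write, so spreading the role lowers that disk's share from 1 to 2/m.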

4) Coping with multiple disk crashes
RAID 6: can deal with any number of disk crashes, provided enough redundant disks are used.
Example: a system of seven disks (four data disks, numbered 1-4, and three redundant disks, numbered 5-7).
How do we set up the 3*7 matrix? (Why 3 rows? Because there are 3 redundant disks.)
1) every column is a different combination of three 0's and 1's, excluding the combination of three 0's;
2) the column of each redundant disk has a single 1;
3) the column of each data disk has at least two 1's.

4) Coping with multiple disk crashes (cont'd)
Reading:
-- read from the data disks and ignore the redundant disks
Writing:
-- change the data disk;
-- change the corresponding bits of every redundant disk whose row of the matrix has a 1 in that data disk's column

4) Coping with multiple disk crashes (cont'd)
In a system with 4 data disks and 3 redundant disks, how can up to 2 disk crashes be corrected?
Suppose disks a and b have failed:
-- find some row r of the 3*7 matrix in which the columns for a and b differ (suppose a has 0 and b has 1);
-- compute the correct contents of b by taking the modulo-2 sum of the corresponding bits of all the disks other than b that have 1 in row r;
-- once b is correct, compute the correct contents of a from the other disks, all of which are now available.
Example

4) Coping with multiple disk crashes (cont'd): example
3*7 matrix (disks 1-4 are data disks, disks 5-7 are redundant disks)
Disk number:  1 2 3 4 5 6 7
Row 1:        1 1 1 0 1 0 0
Row 2:        1 1 0 1 0 1 0
Row 3:        1 0 1 1 0 0 1

4) Coping with multiple disk crashes (cont'd): example
First block of all the disks
Disk  Contents
1)    11110000
2)    10101010
3)    00111000
4)    01000001
5)    01100010
6)    00011011
7)    10001001
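Each redundant block here is the modulo-2 sum of the data disks that have a 1 in the same row of the 3*7 matrix from this example. The sketch below checks the table under that reading; MATRIX, xor_all and redundant_blocks are names chosen for illustration.

```python
from functools import reduce

# Rows of the 3*7 matrix; column k (1-based) belongs to disk k.
MATRIX = ["1110100", "1101010", "1011001"]

def xor_all(blocks):
    """Modulo-2 sum of equal-length bit strings."""
    width = len(blocks[0])
    return format(reduce(lambda a, b: a ^ b, (int(x, 2) for x in blocks)), f"0{width}b")

def redundant_blocks(matrix, data):
    """Compute the block of each redundant disk from the data disks in its row."""
    num_data = len(data)
    result = {}
    for row in matrix:
        # The redundant disk for this row is the non-data column holding a 1.
        redundant = next(d for d in range(num_data + 1, len(row) + 1) if row[d - 1] == "1")
        members = [data[d] for d in range(1, num_data + 1) if row[d - 1] == "1"]
        result[redundant] = xor_all(members)
    return result

data = {1: "11110000", 2: "10101010", 3: "00111000", 4: "01000001"}
print(redundant_blocks(MATRIX, data))
# {5: '01100010', 6: '00011011', 7: '10001001'}
```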

4) Coping with multiple disk crashes (cont'd): example
Two disks crash (disks 2 and 5):
Disk  Contents
1)    11110000
2)    ????????
3)    00111000
4)    01000001
5)    ????????
6)    00011011
7)    10001001

4) Coping with multiple disk crashes (cont'd): example
In the 3*7 matrix, row 2 is a row in which disks 2 and 5 have different values: disk 2 has 1 and disk 5 has 0.
So:
-- compute the first block of disk 2 as the modulo-2 sum of the corresponding bits of disks 1, 4, 6;
-- then compute the first block of disk 5 as the modulo-2 sum of the corresponding bits of disks 1, 2, 3.
Disk  Contents
1)    11110000
2)    ???????? => 10101010
3)    00111000
4)    01000001
5)    ???????? => 01100010
6)    00011011
7)    10001001
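The two-crash recovery can be sketched as a small loop. Instead of comparing the two failed columns directly, this version looks for a row that contains exactly one failed disk, which is the same condition as "the columns for a and b differ in that row, and the failed disk has the 1". The names MATRIX, xor_all and recover_two are illustrative, not part of the original slides.

```python
from functools import reduce

MATRIX = ["1110100", "1101010", "1011001"]  # column k (1-based) is disk k

def xor_all(blocks):
    """Modulo-2 sum of equal-length bit strings."""
    width = len(blocks[0])
    return format(reduce(lambda a, b: a ^ b, (int(x, 2) for x in blocks)), f"0{width}b")

def recover_two(surviving, a, b):
    """Rebuild failed disks a and b from a dict {disk number: block} of survivors."""
    disks = dict(surviving)
    failed = {a, b}
    while failed:
        progress = False
        for row in MATRIX:
            members = [d for d in range(1, 8) if row[d - 1] == "1"]
            missing = [d for d in members if d in failed]
            if len(missing) == 1:  # this row pins down exactly one failed disk
                target = missing[0]
                disks[target] = xor_all([disks[d] for d in members if d != target])
                failed.remove(target)
                progress = True
        if not progress:
            raise ValueError("cannot recover with this matrix")
    return disks

surviving = {1: "11110000", 3: "00111000", 4: "01000001",
             6: "00011011", 7: "10001001"}
fixed = recover_two(surviving, 2, 5)
print(fixed[2])  # 10101010
print(fixed[5])  # 01100010
```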
