
Fault-Tolerant Computing Systems #2: Hardware Fault Tolerance. Pattara Leelaprute, Computer Engineering Department, Kasetsart University


1 Fault-Tolerant Computing Systems #2: Hardware Fault Tolerance
Pattara Leelaprute, Computer Engineering Department, Kasetsart University
pattara.l@ku.ac.th

2 Hardware Fault Tolerance
Triple Modular Redundancy (TMR), proposed by von Neumann
- Can mask the failure of one hardware unit.
- No explicit actions (error detection, recovery, etc.) need to be performed when a fault occurs.
[Figure: input -> three module replicas -> voting element (majority vote) -> output]

3 Triple Modular Redundancy (1)
- Suitable for transient faults.
- The voting element does not remove the faulty unit after an error occurs.
- The reliability of TMR becomes lower than that of a simplex system once a failure occurs.
- Ex. majority(0,1,1) = 1, majority(1,0,0) = 0, majority(1,0) = ???
[Figure: input -> three module replicas -> voting element (majority vote) -> output]
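
The reliability claim on this slide can be checked with a small calculation. This is only a sketch under assumptions not stated on the slide: each module works independently with probability r, the voter is perfect, and a failed module always disagrees with the correct value. The function names are illustrative.

```python
def tmr_reliability(r: float) -> float:
    """Probability that at least two of three independent modules work.

    P(all three work) + P(exactly two work) = r^3 + 3 r^2 (1 - r).
    """
    return 3 * r**2 - 2 * r**3


def degraded_tmr_reliability(r: float) -> float:
    """After one module has failed and stays in the vote,
    both remaining modules must work to keep a correct majority."""
    return r * r


if __name__ == "__main__":
    for r in (0.99, 0.9, 0.6):
        print(r, tmr_reliability(r), degraded_tmr_reliability(r))
```

For any 0 < r < 1 the degraded value r*r is smaller than r, which is the simplex reliability under the same assumptions.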

4 Triple Modular Redundancy (2)
- Bit-wise voting: take the majority for each bit.
- The voting element has to be a simple and highly reliable unit.
- Tight synchronization is required (single clock).
- The generalization of TMR is N-modular redundancy (NMR).
[Figure: modules feeding voting elements; the voter is drawn with three AND gates (&) and one OR gate (+)]
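
Bit-wise majority voting can be sketched in a few lines. The sketch assumes the three module outputs are available as integers of the same width; the function name `vote` is illustrative and not taken from the slides.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bit-wise majority of three words.

    Each output bit is 1 iff at least two of the three input bits are 1,
    which is the expression (a AND b) OR (b AND c) OR (a AND c).
    """
    return (a & b) | (b & c) | (a & c)


if __name__ == "__main__":
    print(vote(0, 1, 1))                              # single-bit vote
    print(format(vote(0b1010, 0b1010, 0b0110), "04b"))  # one module disagrees
```

Because the vote is taken per bit, the three modules must present their outputs at the same time, which is why the slide asks for a single clock.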

5 Static Redundancy
The effect of a faulty element (component, circuit, system) is immediately masked by permanently connected and continually operating replicas of the element. TMR and NMR are static redundancy schemes.
Key points
- Permanently connected
- Continually operating
[Figure: module replicas and voting elements masking a fault]

6 Dynamic Redundancy
When a fault is detected, the fault or its effect is subsequently corrected by reconfiguration.
- The system consists of several units, but only one operates at a time.
- The other units are just spares.
[Figure: operating unit and spare modules; reconfiguration selects the operating unit]

7 Dynamic Redundancy
Cold standby
- Only one unit is powered up and operational.
- Spares are not powered on (they are still cold).
- A faulty unit is replaced by turning off its power and powering up a spare.
Hot standby
- All units operate simultaneously.
- Their outputs are then compared. If they are the same, one is selected arbitrarily. If not, a faulty unit has been detected and the system is reconfigured.
- Dual system: a matching circuit continuously compares the results of the two units.
How can the failed unit be detected?
[Figure: two modules feeding a compare circuit]
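
The dual-system comparison can be sketched as follows. This is an illustration only: the two units are modeled as plain Python callables, and `duplex_step` is a hypothetical name, not something defined in the slides.

```python
from typing import Callable, Optional, Tuple


def duplex_step(
    unit_a: Callable[[int], int],
    unit_b: Callable[[int], int],
    x: int,
) -> Tuple[Optional[int], bool]:
    """Run both units on the same input and compare their outputs.

    Returns (output, True) when the outputs match, so either one can be used.
    Returns (None, False) on a mismatch: a fault has been detected,
    but the comparison alone does not say which unit is the faulty one.
    """
    out_a = unit_a(x)
    out_b = unit_b(x)
    if out_a == out_b:
        return out_a, True
    return None, False


if __name__ == "__main__":
    healthy = lambda v: v + 1
    faulty = lambda v: v + 2   # stands in for a unit with a fault
    print(duplex_step(healthy, healthy, 1))
    print(duplex_step(healthy, faulty, 1))
```

The mismatch case is the reason for the question on the slide: identifying the failed unit needs extra information, for example a self-check inside each unit.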

8 Coding
Code
- One of the most important techniques for supporting hardware fault tolerance.
- Codeword, non-codeword. Ex. 0001 = a, 0010 = b, 0011 = c
Single parity check code
- Even parity
- Odd parity

9 Single Parity Check
- Add a check bit (the parity bit) to the information bits.
- The total number of 1s in the codeword is always even or always odd.
Odd parity check
- The parity bit is 1 iff the number of 1s in the data bits is even.
- Ex. codeword 010101 (parity bit and data bits): the number of 1s in the codeword is odd.
Even parity check
- The parity bit is 1 iff the number of 1s in the data bits is odd.
- Ex. codeword 110101 (parity bit and data bits): the number of 1s in the codeword is even.
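
A minimal sketch of parity-bit generation, following the two rules on this slide. It assumes the data bits are given as a string of '0' and '1' and that the parity bit is placed in front of the data bits; that position is an assumption made here, and the function names are illustrative.

```python
def parity_bit(data: str, odd: bool) -> str:
    """Return the parity bit for a string of '0'/'1' data bits.

    Odd parity:  the bit is 1 iff the number of 1s in the data is even.
    Even parity: the bit is 1 iff the number of 1s in the data is odd.
    """
    ones = data.count("1")
    if odd:
        return "1" if ones % 2 == 0 else "0"
    return "1" if ones % 2 == 1 else "0"


def encode(data: str, odd: bool) -> str:
    """Build a codeword by placing the parity bit in front of the data bits."""
    return parity_bit(data, odd) + data


if __name__ == "__main__":
    print(encode("10101", odd=True))
    print(encode("10101", odd=False))
```

With the data bits 10101 this produces 010101 under odd parity and 110101 under even parity, the two codewords shown on the slide.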

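10 Parity Check: How Does It Work?
Ex. of odd parity check: the parity bit is 1 iff the number of 1s in the data bits is even. The receiver checks whether the number of 1s in the codeword is odd.
- Sending side: 010101 (parity bit and data bits)
- Receiving ex. 1: 010001. The number of 1s is even, so the error is detected.
- Receiving ex. 2: 011001. The number of 1s is odd, so the check passes although two bits have changed. OK?
- The occurrence of a one-bit error can be detected.
- An error cannot be corrected (there is no way to locate the erroneous bit).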
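
The receiving-side check can be tried on the three words from this slide. This is a sketch, with codewords represented as strings of '0' and '1'; the helper names are illustrative.

```python
def passes_odd_parity(codeword: str) -> bool:
    """Receiver check for odd parity: the number of 1s must be odd."""
    return codeword.count("1") % 2 == 1


def bit_errors(sent: str, received: str) -> int:
    """Number of bit positions in which the two words differ."""
    return sum(s != r for s, r in zip(sent, received))


if __name__ == "__main__":
    sent = "010101"
    for received in ("010101", "010001", "011001"):
        print(received, bit_errors(sent, received), passes_odd_parity(received))
```

The word 010001 differs from the sent word in one bit and fails the check, while 011001 differs in two bits and still passes, so a single parity bit only detects an odd number of bit errors.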

11 Parity Check (Advanced)
Odd parity or even parity?
[Figure: sending side and receiving side, Step 1 and Step 2]

12 Coding (Hamming Distance)
Minimum distance
- The minimum Hamming distance between any pair of two different codewords.
- Ex. single parity check code: minimum distance = 2, so a 1-bit error can be detected.
With $d$ = minimum distance, $T_d$ = number of bit errors that can be detected, and $T_c$ = number of bit errors that can be corrected:
$T_d = d - 1$ (detection)
$T_c = \lfloor (d - 1)/2 \rfloor$ (correction)
Why?
[Figure: codewords and non-codewords separated by distance d]
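
The minimum distance and the two bounds can be computed by brute force for small codes. This is a sketch: the parity bit is placed in front of the data bits (an assumption made here), and the 3-bit repetition code is added only as a second illustration; it does not appear on the slide.

```python
from itertools import combinations, product


def hamming_distance(a: str, b: str) -> int:
    """Number of positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(code: list[str]) -> int:
    """Smallest Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))


def even_parity_code(k: int) -> list[str]:
    """All codewords of the single (even) parity check code with k data bits."""
    words = []
    for bits in product("01", repeat=k):
        data = "".join(bits)
        parity = str(data.count("1") % 2)  # makes the total number of 1s even
        words.append(parity + data)
    return words


def capabilities(d: int) -> tuple[int, int]:
    """Return (detectable errors, correctable errors) for minimum distance d."""
    return d - 1, (d - 1) // 2


if __name__ == "__main__":
    d_parity = minimum_distance(even_parity_code(3))
    print(d_parity, capabilities(d_parity))
    d_rep = minimum_distance(["000", "111"])
    print(d_rep, capabilities(d_rep))
```

For the parity code the result is d = 2, so one error can be detected and none corrected. For the repetition code d = 3, so two errors can be detected or one corrected.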

13 Self-Checking
A self-checking circuit can detect faults by itself.
Ex. self-checking parity checker
[Figure: functional circuit and checker; checker inputs x0 to x8, error indication outputs z1 and z2; blocks labeled A and B]
If odd parity is used:
- Codewords (0, 1), (1, 0): error free
- Non-codewords (0, 0), (1, 1): error (in A or B)
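
A two-output parity checker of this kind can be sketched with two XOR trees. The way the nine inputs are split between the two trees is an assumption made here (even-indexed and odd-indexed bits); the figure's exact grouping is not recoverable from the transcript, and the function names are illustrative.

```python
from functools import reduce
from operator import xor


def parity_checker(bits: list[int]) -> tuple[int, int]:
    """Two-output checker for a 9-bit word x0..x8.

    Each output is the XOR of one group of inputs. For an odd-parity word
    the XOR of all bits is 1, so the two outputs must differ.
    """
    z1 = reduce(xor, bits[0::2])  # x0, x2, x4, x6, x8 (assumed grouping)
    z2 = reduce(xor, bits[1::2])  # x1, x3, x5, x7     (assumed grouping)
    return z1, z2


def is_error_free(z: tuple[int, int]) -> bool:
    """(0, 1) and (1, 0) are codewords; (0, 0) and (1, 1) signal an error."""
    return z[0] != z[1]


if __name__ == "__main__":
    good = [1, 0, 0, 0, 0, 0, 0, 0, 0]   # one 1: odd parity
    bad = [1, 1, 0, 0, 0, 0, 0, 0, 0]    # one bit flipped
    print(parity_checker(good), is_error_free(parity_checker(good)))
    print(parity_checker(bad), is_error_free(parity_checker(bad)))
```

Using two outputs instead of a single error line means that an output stuck at 0 or stuck at 1 eventually shows up as (0, 0) or (1, 1) rather than silently reporting "no error".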

14 Self-Checking Circuit
[Figure: input -> circuit -> output, which is a codeword or a non-codeword]
F = set of faults. A non-codeword output means a fault.
Fault-secure
- Even if a fault f ∈ F occurs, an incorrect codeword will not be produced.
Self-testing
- When a fault f ∈ F occurs, there is an input that leads to a non-codeword output (which means the fault is detected).
Totally self-checking
- Fault-secure + self-testing

15 2-Rail Logic
Don't use NOT gates.
Each signal x is carried on two rails (x1, x0): x = 0 is encoded as (0, 1) and x = 1 as (1, 0).
[Figure: gates with two-rail inputs (x1, x0) and (y1, y0) and two-rail output (z1, z0)]
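
One common way to build 2-rail gates without inverters is sketched below. The encoding (x1 carries the value, x0 its complement) and the gate constructions are assumptions, since the figure itself is not recoverable from the transcript; all names are illustrative.

```python
# A 2-rail signal is a pair (x1, x0). Assumed encoding:
ZERO = (0, 1)   # logical 0
ONE = (1, 0)    # logical 1


def rail_not(x):
    """NOT needs no gate at all: just swap the two rails."""
    x1, x0 = x
    return (x0, x1)


def rail_and(x, y):
    """AND on the true rails, OR on the complement rails."""
    return (x[0] & y[0], x[1] | y[1])


def rail_or(x, y):
    """OR on the true rails, AND on the complement rails."""
    return (x[0] | y[0], x[1] & y[1])


def is_codeword(x) -> bool:
    """Only (0, 1) and (1, 0) are valid 2-rail codewords."""
    return x in (ZERO, ONE)


if __name__ == "__main__":
    print(rail_and(ONE, ZERO), rail_or(ONE, ZERO), rail_not(ONE))
    stuck = (0, 0)  # logical 1 whose x1 rail is stuck at 0
    print(rail_and(stuck, ONE), is_codeword(rail_and(stuck, ONE)))
```

Because only AND and OR gates are used, a single stuck rail can push outputs in one direction only, so the output is either the correct codeword or a non-codeword such as (0, 0).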

16 2-Rail Logic and Unidirectional Error
The effect of a fault on the output
Unidirectional error (definition): all erroneous signals are of only one of the following kinds:
- errors in which 1 → 0 occurs
- errors in which 0 → 1 occurs
2-rail logic
- An incorrect codeword will never be produced. Ex. (0,1) → (1,0) never occurs.
- However, a non-codeword may be produced.
- Hence the circuit is fault-secure.

17 Disk Shadowing
Maintaining a set of identical disk images on several separate disk devices.
Disk mirroring
- 2 disks with 2 disk controllers.
- Write to both disks, read from either disk.
- Tandem system: the first commercial fault-tolerant system.
[Figure: host connected to two disk controllers, each with its own disk]

18 RAID (Redundant Array of Inexpensive Disks)
Striping: divide the storage area into several parts called stripes, then distribute those stripes across several disks.
- Load balancing between disks maximizes throughput.
- Fault tolerance can be implemented at low cost.
[Figure: controller with several disks holding stripes D0 to D5]
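
The distribution of stripes over disks can be sketched as a simple address mapping. Round-robin placement is assumed here, and `locate_stripe` is a hypothetical helper name.

```python
def locate_stripe(stripe: int, num_disks: int) -> tuple[int, int]:
    """Map a logical stripe number to (disk index, row on that disk).

    Consecutive stripes go to consecutive disks, so a large sequential
    access is spread over all disks.
    """
    return stripe % num_disks, stripe // num_disks


if __name__ == "__main__":
    for stripe in range(6):
        disk, row = locate_stripe(stripe, num_disks=3)
        print(f"D{stripe} -> disk {disk}, row {row}")
```

With three disks, D0 to D2 land on row 0 of disks 0 to 2 and D3 to D5 on row 1, so reading D0 to D5 keeps all three disks busy at once.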

19 RAID-0: Striping
Only striping, no redundancy.
Advantage
- Good performance due to high data throughput.
Disadvantage
- No fault tolerance.
Usable storage capacity percentage = 100%
[Figure: controller with several disks holding stripes D0 to D5]

20 RAID-1: Mirroring
Write all data to N disks.
Advantage
- High fault tolerance (tolerates/masks the failure of N-1 disks).
- Faster on reads (compared to a single drive).
Disadvantage
- Slower on writes (compared to a single drive).
- Low utilization efficiency.
Usable storage capacity percentage = 100/N %
[Figure: controller with two disks, each holding D0 and D1]

21 RAID-4
Add one redundant parity disk.
Advantage
- Very good for reads (the same as RAID-0).
- High utilization efficiency.
- Tolerates/masks the failure of 1 disk.
Disadvantage
- Slow on writes (typically small random writes), due to the concentration of accesses on the parity disk.
Usable storage capacity percentage = 100*(N-1)/N %
[Figure: controller with N disks; the data disks hold D0 to D5, the parity disk holds P0~2 and P3~5]
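
How a parity disk masks the failure of one disk can be sketched with byte-wise XOR, the parity function normally used for RAID-4. The block contents below are made-up sample bytes, and the function names are illustrative.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks (used to compute the parity block)."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def rebuild(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the one missing data block from the survivors and the parity."""
    return xor_blocks(surviving + [parity])


def usable_percent(n_disks: int) -> float:
    """Usable capacity with one parity disk out of n_disks: 100*(N-1)/N."""
    return 100 * (n_disks - 1) / n_disks


if __name__ == "__main__":
    d0, d1, d2 = b"\x01\x02", b"\x0f\x00", b"\xf0\xff"   # sample data blocks
    p = xor_blocks([d0, d1, d2])                          # parity block P0~2
    print(rebuild([d0, d2], p) == d1)                     # disk holding d1 lost
    print(usable_percent(4))
```

Every small write also has to update the parity block, and in RAID-4 all parity blocks sit on the same disk, which is the write bottleneck named on the slide.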

22 RAID-5
Similar to RAID-4, but distributes the parity among the drives.
Advantage
- Very good for reads and writes (even small random writes): the parity disk no longer becomes a bottleneck.
- High utilization efficiency.
- Tolerates/masks the failure of 1 disk.
Disadvantage
- Slower than RAID-4 on reads: parity data must be skipped on each drive during reads.
Usable storage capacity percentage = 100*(N-1)/N %
[Figure: controller with N disks; data blocks D0 to D5 and parity blocks P0~2 and P3~5 are spread over all disks]
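
Distributing the parity among the drives can be sketched as a per-row layout function. The rotation rule used here (the parity moves one disk to the left on each row, data fills the remaining disks in order) is one of several layouts in use and is an assumption; `raid5_row` is a hypothetical name.

```python
def raid5_row(row: int, num_disks: int) -> list[str]:
    """Return the block labels stored on each disk for one stripe row."""
    parity_disk = (num_disks - 1 - row) % num_disks   # rotates with the row
    first = row * (num_disks - 1)                     # first data block of the row
    last = first + num_disks - 2                      # last data block of the row
    labels = []
    data = first
    for disk in range(num_disks):
        if disk == parity_disk:
            labels.append(f"P{first}~{last}")
        else:
            labels.append(f"D{data}")
            data += 1
    return labels


if __name__ == "__main__":
    for row in range(4):
        print(raid5_row(row, num_disks=4))
```

With four disks, row 0 holds D0, D1, D2, P0~2 and row 1 holds D3, D4, P3~5, D5, so parity updates for different rows go to different disks.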

