Fault-Tolerant Computing Systems #3 Fault-Tolerant Software


1 Fault-Tolerant Computing Systems #3 Fault-Tolerant Software
Pattara Leelaprute Computer Engineering Department Kasetsart University

2 Fault-Tolerant Software
Software design faults (bugs): software has design faults only (no operational faults; why?).
Approaches:
- Fault avoidance: review, testing, V&V (Verification & Validation). This only removes deterministic design faults. It is difficult, if not impossible, to guarantee that the software contains no design fault.
- Fault tolerance: at the system level or at the software design level.

3 Fault-Tolerant Software
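Single-Version: a single piece of software.
Multi-Version: based on the design diversity concept.
- Two or more different but functionally identical versions (variants) of a piece of software.
- Executed in sequence or in parallel.
- The versions are used as alternatives, in pairs, or in groups.
- Examples: Swedish State Railway; Airbus A310, Airbus A320/A330/A340.
Code fragment on the slide (pseudocode; the comments assume mod = 10). A counter is displayed digit by digit, and the counter is incremented between the two reads:
    while(1){                      // count = 25
      writeline(x[0]=count/10);    // 25/10 = 2, shows "2"
                                   // <-- count-up happens here, count = 26
      writeline(x[1]=count%mod);   // 26%10 = 6, shows "6"
    }
Copying the counter into a temporary first makes both digits come from the same value:
    temp=count;                    // count = 25
    writeline(x[0]=temp/10);       // 25/10 = 2, shows "2"
    writeline(x[1]=temp%mod);      // 25%10 = 5, shows "5"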

4 Single-Version Software Fault Tolerance
State-dependent faults (the most common kind):
- Unanticipated (they cannot be foreseen).
- Similar to transient hardware faults (they appear and disappear).
- Activated by a particular input sequence.
Only a single version of the software is used, based on the data diversity concept.
Techniques:
- N-Copy Programming
- Retry Block
- Checkpoint and Restart (Rollback Recovery)

5 Single-Version Software FT (Data Diversity)
The same software is executed repeatedly with different but logically equivalent input data, because faults in software are usually dependent on the input sequence.
Example: calculate sin(x). Write x = a + b, so that b = x − a, and use
  sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
  cos(a) = sin(π/2 − a)
to obtain
  sin(x) = sin(a)sin(π/2 − (x − a)) + sin(π/2 − a)sin(x − a)
         = sin(a)sin(π/2 − x + a) + sin(π/2 − a)sin(x − a)

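6 N-Copy Programming
The input x is re-expressed in N different ways (Re-express Data). Copy 1 to Copy N of the same software each process one re-expression, and a voter selects the output.
Copy i computes sin(a_i)sin(π/2 − x + a_i) + sin(π/2 − a_i)sin(x − a_i), using the identity sin(x) = sin(a)sin(π/2 − x + a) + sin(π/2 − a)sin(x − a).
Example: x = a + b, b = x − a. Let x = 5; then a1 = 2, b1 = 3; a2 = 3, b2 = 2; a3 = 4, b3 = 1; and so on.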
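The N-copy scheme on this slide can be sketched in Python as follows. This is a minimal illustration, not the lecturer's implementation: the function names, the rounding to 9 digits before voting, and the strict-majority rule are all assumptions made for the example.

```python
import math
from collections import Counter

def sin_reexpressed(x, a):
    """sin(x) computed as sin(a)sin(pi/2 - x + a) + sin(pi/2 - a)sin(x - a)."""
    return (math.sin(a) * math.sin(math.pi / 2 - x + a)
            + math.sin(math.pi / 2 - a) * math.sin(x - a))

def n_copy_sin(x, offsets, digits=9):
    """Run the same routine once per re-expression (one 'copy' each) and vote."""
    # Rounding is an assumption: it lets the voter treat results that differ
    # only by floating-point noise as equal.
    results = [round(sin_reexpressed(x, a), digits) for a in offsets]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise RuntimeError("voter found no majority among the copies")
    return value

# The slide's example: x = 5 with a1 = 2, a2 = 3, a3 = 4.
print(n_copy_sin(5.0, [2.0, 3.0, 4.0]))
```

All copies run the same code, so this only helps against faults that are triggered by particular input values, which is the premise of data diversity.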

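7 Retry Block
Flow: Re-express Data → Software → Acceptance Test.
- Acceptance test OK: output the result.
- Acceptance test NG: check "Deadline expired?". If No, re-express the data and run the software again. If Yes, report a fault.
Retry i computes sin(a_i)sin(π/2 − x + a_i) + sin(π/2 − a_i)sin(x − a_i).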
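A rough Python sketch of the retry block flow is given below. The injected fault in `faulty_sin`, the range-check acceptance test, and the one-second deadline are hypothetical choices made only to exercise the loop.

```python
import math
import time

def sin_via(x, a):
    return (math.sin(a) * math.sin(math.pi / 2 - x + a)
            + math.sin(math.pi / 2 - a) * math.sin(x - a))

def faulty_sin(x, a):
    # Hypothetical injected fault: one particular re-expression gives garbage.
    if a == 2.0:
        return 2.0
    return sin_via(x, a)

def in_range(x, result):
    # Acceptance test (assumed): a sine value must lie in [-1, 1].
    return -1.0 <= result <= 1.0

def retry_block(software, x, reexpressions, acceptance_test, deadline_s=1.0):
    start = time.monotonic()
    for a in reexpressions:                      # Re-express Data
        result = software(x, a)                  # Software
        if acceptance_test(x, result):           # Acceptance Test: OK
            return result
        if time.monotonic() - start > deadline_s:
            break                                # Deadline expired: Yes
    raise RuntimeError("fault: no acceptable result before the deadline")

print(retry_block(faulty_sin, 5.0, [2.0, 3.0, 4.0], in_range))
```

Here the first attempt (a = 2) fails the acceptance test, and the second re-expression (a = 3) produces the accepted result.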

8 Checkpoint and Restart
Checkpoint and Restart (Rollback Recovery)
- Use a checkpoint to mark a state of the system.
- If an error occurs, restart the operation from the normal state recorded before the error occurred.
- Good for transient faults (faults that occur only under specific conditions).
Sequence: Checkpoint → Error → Rollback → Restart.
Restart is either static or dynamic:
1. Static: like a restart button. Go back to the initial reset state. Selection is based on the operational situation.
2. Dynamic: checkpoints (snapshots) are created dynamically, so not all of the work has to be thrown away. Checkpoints are taken at a fixed interval, or at particular points chosen by some optimization rule.
Note the existence of unrecoverable actions (external events)!
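Dynamic checkpointing with rollback might look like the sketch below. The state layout, the fixed checkpoint interval of two steps, and the one-shot transient fault are assumptions chosen to keep the example small.

```python
import copy

def make_transient_step():
    """Build a step that fails once on item 30 (a hypothetical transient fault)."""
    pending = [True]
    def step(state, item):
        state["total"] += item            # state is modified before the fault hits
        if item == 30 and pending[0]:
            pending[0] = False
            raise RuntimeError("transient fault")
    return step

def run_with_checkpoints(items, step, interval=2, max_rollbacks=3):
    state = {"index": 0, "total": 0}
    saved = copy.deepcopy(state)          # first checkpoint: the initial state
    rollbacks = 0
    while state["index"] < len(items):
        try:
            step(state, items[state["index"]])
            state["index"] += 1
            if state["index"] % interval == 0:
                saved = copy.deepcopy(state)   # dynamic checkpoint, fixed interval
        except RuntimeError:
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise
            state = copy.deepcopy(saved)       # rollback, then restart from there
    return state["total"], rollbacks

print(run_with_checkpoints([10, 20, 30, 40], make_transient_step()))  # (100, 1)
```

Without the rollback, the half-applied step would leave item 30 counted twice. The sketch also shows why external events are a problem: anything the failed step had already sent outside the program could not be restored by copying `saved` back.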

9 Multi-Version Software Fault Tolerance
Design diversity: use two or more different but functionally identical (same specification) versions (variants) of a piece of software. Components built differently should fail differently.
Techniques:
- Recovery Blocks
- N-Version Programming
- Acceptance Voting & N Self-Checking System
- Certification Trails
Vocabulary: validity = legitimacy, correctness; assertion = a declaration or claim (a check).

10 Recovery Blocks
Recovery blocks combine checkpoint and restart with multi-version software.
- An acceptance test (AT) is used to detect errors.
- If an error is detected, the next variant is used.
Acceptance test: tests the validity of an output (an assertion).

11 Recovery Block
Flow: Input → Checkpoint → Version 1 → Acceptance Test (AT). If the AT is OK, the result is output. If NG, the checkpoint is restored and Version 2 runs, followed by its AT, and so on through the variants up to Version N. If Version N also fails its AT: ERROR.
- The checkpoint is needed to recover the state after a version fails, so that the next version has a valid operational starting point (in the case that an error is detected).
- Versions 2, 3, 4, ... can have degraded performance.
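The recovery block flow can be sketched in Python with an integer square root as the task. The two variants, the acceptance test, and the `state` dictionary are illustrative assumptions. The primary variant relies on floating point and returns a wrong root for some large inputs, while the alternate is a slower but exact binary search.

```python
import copy
import math

def isqrt_fast(state, x):
    r = int(math.sqrt(x))          # primary: fast, but loses precision for big x
    state["last_root"] = r
    return r

def isqrt_slow(state, x):
    lo, hi = 0, x + 1              # invariant: lo*lo <= x < hi*hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= x:
            lo = mid
        else:
            hi = mid
    state["last_root"] = lo
    return lo

def is_integer_root(x, r):
    # Acceptance test: an assertion on the validity of the output.
    return r >= 0 and r * r <= x < (r + 1) * (r + 1)

def recovery_block(variants, acceptance_test, state, x):
    checkpoint = copy.deepcopy(state)            # Checkpoint
    for variant in variants:
        try:
            result = variant(state, x)
            if acceptance_test(x, result):       # AT: OK
                return result
        except Exception:
            pass                                 # a crash counts as NG
        state.clear()
        state.update(copy.deepcopy(checkpoint))  # restore before the next variant
    raise RuntimeError("ERROR: every variant failed the acceptance test")

state = {}
print(recovery_block([isqrt_fast, isqrt_slow], is_integer_root, state, 9999999999999999))
```

For this input the primary returns 100000000, whose square exceeds the input, so the acceptance test rejects it and the alternate supplies 99999999.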

12 N-Version Programming
Operate multiple variants at the same time, and let the voter take the majority of their outputs.
Diagram: Input → Version 1, Version 2, Version 3 (in parallel) → Voter → Output
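A minimal Python sketch of N-version programming with a majority voter follows. The three versions and the off-by-one fault in `sum_faulty` are hypothetical. For simplicity the versions run one after another here, whereas the slide has them operating at the same time.

```python
from collections import Counter

def sum_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_formula(n):
    return n * (n + 1) // 2

def sum_faulty(n):
    return sum(range(n))        # hypothetical design fault: stops at n - 1

def n_version(versions, x):
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            pass                # a crashed version casts no vote
    if outputs:
        value, votes = Counter(outputs).most_common(1)[0]
        if votes > len(versions) // 2:      # strict majority of all versions
            return value
    raise RuntimeError("voter found no majority")

print(n_version([sum_loop, sum_formula, sum_faulty], 10))  # 55
```

The voter masks the single faulty version, but it cannot help if a majority of versions share the same design fault.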

13 Acceptance Voting & N Self-Checking System
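Acceptance Voting: use a separate AT for each version. The outputs of Version 1, Version 2, and Version 3 each pass through their own AT before reaching the voter.
N Self-Checking System: the versions are compared in pairs (Version 1 with Version 2, Version 3 with Version 4).
- If a pair does not agree, the responses of that pair are discarded.
- If a pair agrees, its result is compared again with that of the other pair.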
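Both schemes on this slide can be sketched in Python as below. The versions (different ways of finding a maximum), the plausibility test, and the rule for combining agreeing pairs are assumptions for illustration; the slide itself only names the building blocks.

```python
from collections import Counter

def max_builtin(xs):
    return max(xs)

def max_sorted(xs):
    return sorted(xs)[-1]

def max_loop(xs):
    best = xs[0]
    for v in xs[1:]:
        if v > best:
            best = v
    return best

def max_faulty(xs):
    return min(xs)              # hypothetical design fault: wrong direction

def plausible_max(xs, result):
    # Acceptance test (assumed): the maximum is an element and is >= the mean.
    return result in xs and result >= sum(xs) / len(xs)

def acceptance_voting(versions_with_tests, xs):
    accepted = []
    for version, test in versions_with_tests:   # each version has its own AT
        result = version(xs)
        if test(xs, result):
            accepted.append(result)
    if accepted:
        value, votes = Counter(accepted).most_common(1)[0]
        if votes > len(accepted) // 2:
            return value
    raise RuntimeError("no accepted majority")

def n_self_checking(pairs, xs):
    agreed = []
    for first, second in pairs:
        a, b = first(xs), second(xs)
        if a == b:                               # pair agrees: keep its result
            agreed.append(a)
    if not agreed or any(r != agreed[0] for r in agreed):
        raise RuntimeError("no agreeing pair, or pairs disagree with each other")
    return agreed[0]

data = [3, 9, 4]
print(acceptance_voting([(max_builtin, plausible_max),
                         (max_sorted, plausible_max),
                         (max_faulty, plausible_max)], data))        # 9
print(n_self_checking([(max_builtin, max_faulty),
                       (max_sorted, max_loop)], data))               # 9
```

In the first call the faulty output is filtered out by its acceptance test before voting. In the second, the pair containing the faulty version disagrees internally and is discarded.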

14 Certification Trails
An improvement of 2-Version Programming.
- The primary module leaves a certification trail (a trail of data at intermediate points in the computation, that is, information produced partway through the calculation).
- The secondary module uses the certification trail, so it can execute more quickly or have a simpler structure.
- The outputs of the primary and the secondary are compared. If they agree, the result is accepted; otherwise ERROR.
Diagram: Input → Version 1 → Certification Trail → Version 2; the outputs of Version 1 and Version 2 go to Compare → Output
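As an illustration, sorting is a convenient task for a certification trail. In the Python sketch below (an assumed example, not taken from the slides), the primary sorts and leaves the permutation it used as the trail. The secondary rebuilds the output from that trail in a single linear pass while checking that the trail is a valid permutation and that the result is in order.

```python
def primary_sort(xs):
    """Sort xs and leave the index permutation as the certification trail."""
    trail = sorted(range(len(xs)), key=lambda i: xs[i])
    return [xs[i] for i in trail], trail

def secondary_sort(xs, trail):
    """Simpler module: follow the trail, but verify it instead of trusting it."""
    if len(trail) != len(xs):
        raise ValueError("trail has the wrong length")
    seen = [False] * len(xs)
    out = []
    for i in trail:
        if not (0 <= i < len(xs)) or seen[i]:
            raise ValueError("trail is not a permutation of the input positions")
        seen[i] = True
        if out and xs[i] < out[-1]:
            raise ValueError("trail does not yield a sorted sequence")
        out.append(xs[i])
    return out

def certified_sort(xs):
    out1, trail = primary_sort(xs)
    out2 = secondary_sort(xs, trail)
    if out1 != out2:                      # Compare
        raise RuntimeError("ERROR: primary and secondary outputs disagree")
    return out1

print(certified_sort([3, 1, 2]))  # [1, 2, 3]
```

The secondary never performs a full sort, which is the speed and simplicity gain the slide refers to, yet a corrupted trail is still caught by its checks.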

