
1 Process Redundancy for Future-Generation CMP Fault Tolerance Dan Gibson ECE 753 Project Presentation

2 Overview
Motivating Example
Execution Redundancy – A Brief Tutorial
Chip Multiprocessors
Redundant Processes on CMPs – Analytic Results
Portcullis Prototype – Measured Results

3 Today: Reliable Hardware
tcsh(1)% ./add 2 2
2 + 2 = 4
tcsh(2)% ./add 3 7
3 + 7 = 10
tcsh(3)% ./add 2 2
2 + 2 = 4
tcsh(4)%

4 Tomorrow: Faulty Hardware
tcsh(1)% ./add 2 2
2 + 2 = 4
tcsh(2)% ./add 3 7
Segmentation Fault
tcsh(3)% ./add 2 2
2 + 2 = 5
tcsh(4)%

5 What Happened?
Transistors are Shrinking
– SRAM Cell Capacitance ↓
– Smaller Gates → Less Drive Strength
Chips are Not Shrinking
– Chips are Larger Relative to Transistor Size
– Long Wires → More Crosstalk
– More Complexity

8 What Can Be Done?
Build Reliable HW
– Complexity is Skyrocketing (Mistakes Inevitable)
– Hurts Performance: Smaller Devices are Faster, but Less Reliable; Larger Devices are Reliable, but Slow
Accept Unreliable HW, Make Reliable SW
– OK When MTBF is Reasonably Large (e.g. Databases)
– Not for the Fainthearted Programmer

9 One Solution: Execution Redundancy
Run the Same Code Many Times
– ‘Decide’ on the Correct Result (e.g. Vote)
– Many Flavors of Redundancy: 3MR, 5MR; Concurrent Modular Redundancy; Pair & Spare (e.g. early NonStop); Sift-out; etc.
Key Questions:
– Where to run redundant code?
– When to run redundant code?
– How to run redundant code?
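The ‘decide by vote’ idea on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk; the names `nmr_vote` and `run` are my own.

```python
from collections import Counter

def nmr_vote(run, args, n=3):
    """Run the same computation n times and 'decide' by majority vote."""
    results = [run(*args) for _ in range(n)]        # n redundant executions
    value, votes = Counter(results).most_common(1)[0]
    if votes <= n // 2:                             # no strict majority
        raise RuntimeError("no majority among %r" % (results,))
    return value

print(nmr_vote(lambda a, b: a + b, (2, 2)))
```

With n = 3 this is 3MR; raising n to 5 gives 5MR, trading more redundant work for tolerance of more simultaneous faults.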

10 Re-Execution
Where: Same Hardware
When: One After Another
How: Provide Common Inputs, Compare Outputs
(Diagram: over time, the inputs are run once, run again, … run N times, and the results are compared.)
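A serial re-execution loop like the one diagrammed here might look as follows. `re_execute` and its retry policy are illustrative assumptions, not part of the talk; the point is that the runs happen one after another on the same hardware.

```python
from collections import Counter

def re_execute(run, args, extra_trials=3):
    """Run twice, one after another; on disagreement, gather more
    serial trials and vote (a transient fault is outvoted)."""
    first = run(*args)
    second = run(*args)
    if first == second:
        return first                 # common case: both runs agree
    samples = [first, second] + [run(*args) for _ in range(extra_trials)]
    return Counter(samples).most_common(1)[0][0]
```

Because every run occupies the same hardware in turn, the (N-1) × 100% overhead on the next slide is paid in wall-clock time rather than in extra cores.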

11 Re-Execution: Fault Detection
(Diagram: the first run computes 2+2=4; the re-run ends in a Segmentation Fault and is flagged as faulty.)

12 Re-Execution: Fault Recovery
(Diagram: of the N runs, the faulted run is discarded; the remaining runs agree that 2+2=4, which is taken as the result.)

13 Re-Execution: Pros and Cons
Pros:
– Simple
– Tolerates Transient Faults
– Tolerates Some Intermittent Faults
– No Redundant Hardware Needed
Cons:
– Overhead: (N-1) × 100%, plus checking overhead
– No Tolerance for Permanent Faults

14 Lock-Step
Where: On Tightly-Coupled Redundant Hardware
When: Cycle-by-Cycle
How: Check Every Result
(Diagram: on each redundant unit, execution proceeds from inputs to outputs as a repeated run-one-unit / check-result sequence.)

15 Lock-Step: Fault Detection
(Diagram: three lock-stepped units compute 2+2; two produce 4, one produces 5. 5 != 4: Fault!)

16 Lock-Step: Fault Recovery
(Diagram: per-cycle voting over the units' results. Votes: 4 receives 2 votes, 5 receives 1 vote; ergo 2+2=4, and execution continues to the outputs.)
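The cycle-by-cycle voting in this recovery scheme can be mimicked in software over recorded per-cycle results. Real lock-step happens in hardware every cycle; `lockstep_vote` is purely an illustrative toy that votes on each position of the units' result traces.

```python
from collections import Counter

def lockstep_vote(unit_traces):
    """unit_traces: one list of per-cycle results per redundant unit.
    Vote on every cycle, as lock-step hardware checks every result."""
    voted = []
    for cycle in zip(*unit_traces):                  # align cycle-by-cycle
        value, votes = Counter(cycle).most_common(1)[0]
        if votes <= len(cycle) // 2:
            raise RuntimeError("no majority in cycle %r" % (cycle,))
        voted.append(value)                          # faulty unit is outvoted
    return voted
```

This also shows why lock-step is slow: every single cycle carries a comparison, whether or not a fault is present.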

17 Lock-Step: Pros and Cons
Pros:
– Tolerates Transient Faults
– Tolerates Some Intermittent Faults
– Tolerates Isolated Permanent Faults
– May Be Invisible to Software
Cons:
– Overhead: Requires Frequent Breaks in Execution → Slow
– Expensive (Area, $)
– Wasteful (Lock-Step Not Needed)
– ‘Soft’ Disagreements, e.g. Branch Prediction
Both IBM and HP use designs like this!

18 Trailing Checker
Where: On Tightly-Coupled (Sometimes) Redundant Hardware
When: Ad Hoc
How: Second Thread Checks the First
(Diagram: the main thread commences execution on the inputs; a checker thread starts behind it and validates results before the outputs.)

19 Trailing Checker: Fault Detection
(Diagram: the leading thread computes 2+2=5; the trailing checker computes 2+2=4 and detects the fault.)

20 Trailing Checker: Fault Recovery
(Diagram: the following thread corrects the leading thread.)
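The correct-on-disagreement behavior might be sketched as below, with the checker assumed non-faulty (a con the later slide calls out). `trail_check` is my own name, and a real design couples the two threads through the pipeline or cache rather than through a completed log.

```python
def trail_check(leader_log, recompute):
    """leader_log: (inputs, result) pairs the leading thread produced.
    The trailing checker recomputes each entry and overrides the
    leader on disagreement (checker assumed non-faulty)."""
    corrected = []
    for inputs, result in leader_log:
        good = recompute(inputs)                     # checker's recomputation
        corrected.append(result if result == good else good)
    return corrected
```

Because the checker only replays committed results, the leader runs at full speed in the common case; the cost appears only as the checker's (possibly simpler) redundant work.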

21 Trailing Checker: Pros and Cons
Pros:
– Tolerates Transient Faults
– Tolerates Some Intermittent Faults
– High Performance
– Checker Can Be Simpler than Main Thread
Cons:
– Overhead: Some HW Redundancy; Mutual Interruptability; Pipeline and/or Cache Modifications
– Checker Sometimes Assumed Non-Faulty
– SMT: Fault Correlation
Examples: DIVA, some SMT techniques

22 Insight
NMR Techniques have Desirable Properties
– Well Understood
– Simple; HW is Isolated
Trailing Checker has Desirable Properties
– Common-Case Performance: Redundancy is Parallel, but Checks Add Overhead
– Non-Blocking Synchronization
→ Asynchronous NMR Techniques

23 Asynchronous NMR
Where: On Loosely-Coupled Redundant Hardware
When: Ad Hoc
How: Check Some Results (e.g. I/O)
(Diagram: the inputs start N identical executions; each computes 2+2=4 and logs its output, and the logged outputs are compared.)
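Comparing only the logged outputs, rather than every cycle, might look like the following toy sketch. `compare_logs` is my own name; note that `zip()` naturally compares only the records that every replica has logged so far, mirroring the loose coupling between executions.

```python
from collections import Counter

def compare_logs(replica_logs):
    """replica_logs: per-replica list of logged output records (e.g. I/O).
    Only these records are voted on; the executions themselves run freely
    and may have progressed to different points."""
    agreed = []
    # zip stops at the shortest log: records not yet produced by some
    # replica are simply not compared yet.
    for records in zip(*replica_logs):
        value, votes = Counter(records).most_common(1)[0]
        if votes <= len(records) // 2:
            raise RuntimeError("no majority at output %r" % (records,))
        agreed.append(value)
    return agreed
```

The contrast with lock-step is that checks happen per output record, not per cycle, so the overhead scales with I/O rather than with execution length.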

24 Asynchronous NMR: Fault Detection
(Diagram: one of the N executions logs 2+2=5 while another logs 2+2=4; the comparison finds 5 != 4: Fault!)

25 Asynchronous NMR: Fault Recovery
(Diagram: voting over the logged outputs. Votes: 4 receives 2 votes, 5 receives 1 vote; ergo 2+2=4.)

26 Asynchronous NMR: Pros and Cons
Pros:
– Tolerates Transient Faults
– Tolerates Some Intermittent Faults
– Tolerates Permanent Faults
– High Performance: Leverages Parallelism
– Flexible
Cons:
– Overhead: Some HW Redundancy Needed for Performance; Software Involvement
– Flexible: Many Decisions
– Synchronization Needed: Performance/Complexity Tradeoff
Chip Multiprocessors to the rescue!

27 Chip Multiprocessors (CMPs)
Many Processors (aka Cores), One Chip
– Quick Inter-Core Communication
– Abundant Parallelism (More than SW Knows What to Do With!)
Resource Sharing
– Caches, Off-Chip Memories
– On-Chip Interconnect

28 CMPs of Today: Intel® Core 2 Duo Extreme Edition – Two Cores; Four-Core Coming (Very) Soon

29 CMPs of Today: Sun® Niagara (SunFire TX000) – Eight Cores, 32 Execution Contexts

30 CMPs of Today: Sun® Niagara 2 (In the Works) – Eight Cores, 64 Execution Contexts

31 CMPs of Today: IBM® Cell – Nine Cores: One Beefy PPC, Eight SPEs; Virtualization: Only 7 SPEs Exposed

32 CMPs of Tomorrow
Intel, IBM, Sun, AMD All Have CMPs Today
Intel Announced a 100+ Core CMP
– Technology Scaling Alone Will Enable 100s of Cores Inside of 10 Years
BUT:
– SW Is Largely Serial Code!
– How Can Future CMPs Utilize Core Abundance?
→ Combat Faults with Core-Level Redundancy

33 Execution Redundancy on CMPs
Combine Asynchronous NMR and CMPs:
– HW Provides: Redundancy, Isolation, and Detection (HW Support for Managed Redundancy)
– SW Manages Concurrency: Synchronization, Fault Recovery

34 HW Support for Managed Redundancy
Heterogeneous Processing Elements:
– Aggressive Cores: Scaled to the Limit of Technology; Susceptible to Faults; Small and Numerous
– Reliable Cores: Conservatively Sized; Much Higher MTTF, but Significantly Slower; Large and Few in Number

35 HW Support for Managed Redundancy
(Diagram-only slide; slides 36 and 37 continue the animation build.)

38 tcsh(1)% add 2 2

39 HW Support for Managed Redundancy
tcsh(1)% add 2 2
2+2=4

40 Managing Redundancy
Synergy Between the POSIX Process Boundary and Cores on a CMP
OS Runs Only on the Reliable Core
System Calls (e.g. I/O) Provide a Natural Opportunity to Perform Checks/Voting
– Interruption Overhead is Already Paid

41 Hardware Support
Isolate Faults to Aggressive Cores (Error Propagation OK)
System Call Proxy to Reliable Core
Localized Software-Initiated Reset

42 Software Management
OS Manages Redundancy (Redundant Process Control Module = RPCM)
RPCM Provides:
– Virtualization
– HW or SW Fault Detection
– SW Fault Recovery

43 Fault Detection
Scenario 1: Hardware-Detected Fault
Scenario 2: Software-Detected Fault (one replica outputs 2+2=5 while another outputs 2+2=4)

44 Fault Recovery
(Diagram: one replica computes 2+2=5, another 2+2=4.)
1) Reset Affected Core
2) Stop a Non-Faulty Process
3) Move a Copy of the Stopped Process to the Faulty Core
4) Resume Both
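The four steps above can be sketched with stand-in objects. `Core`, `Proc`, and `recover` are hypothetical stand-ins of my own, not the RPCM's real interface; the point is the ordering of reset, stop, copy, and resume.

```python
class Core:
    def __init__(self, cid):
        self.cid, self.proc = cid, None
    def reset(self):
        self.proc = None                 # localized software-initiated reset
    def schedule(self, proc):
        self.proc = proc

class Proc:
    def __init__(self, name):
        self.name, self.running = name, True
    def stop(self):
        self.running = False
    def resume(self):
        self.running = True
    def fork_copy(self):
        return Proc(self.name)           # duplicate of the process image

def recover(faulty_core, donor_core):
    faulty_core.reset()                  # 1) reset the affected core
    donor = donor_core.proc
    donor.stop()                         # 2) stop a non-faulty process
    copy = donor.fork_copy()             # 3) move a copy to the reset core
    faulty_core.schedule(copy)
    donor.resume()                       # 4) resume both
    copy.resume()
    return copy
```

After recovery the replica count is restored, so even a permanent fault on the aggressive core costs only a reset and a process copy.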

45 Virtualization
Processes Expect to Execute Alone!
– Cannot allow processes to interfere
– Must ensure identical executions
(Diagram: both replicas call getpid() and both are answered "Your PID is 4".)

46 Virtualization
Processes Expect to Execute Alone!
– Cannot allow processes to interfere
– Must ensure identical executions
(Diagram: both replicas call open("out.txt") and both are answered "Ok, FD = 3"; underneath, one actually opens out-0.txt and the other out-1.txt.)
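A per-replica file-virtualization table in the style of this slide might look as follows. The out.txt → out-0.txt naming follows the slide's example, but the `ReplicaView` class and its details are my invention, not the RPCM's actual mechanism.

```python
import os

class ReplicaView:
    """Per-replica syscall view: identical virtual FDs for every replica,
    private backing files underneath so replicas cannot interfere."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.next_vfd = 3                # 0-2 reserved for stdin/stdout/stderr
        self.backing = {}                # virtual FD -> private real path

    def open(self, path):
        base, ext = os.path.splitext(path)
        real = "%s-%d%s" % (base, self.replica_id, ext)   # out.txt -> out-0.txt
        vfd = self.next_vfd
        self.next_vfd += 1
        self.backing[vfd] = real
        return vfd                       # every replica sees the same FD
```

Because each replica gets the same virtual FD, the executions stay bit-identical from the process's point of view while their outputs remain isolated for comparison.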

47 Pros and Cons
Pros:
– Tolerates Transient, Intermittent, and Permanent Faults
– High Performance
– Flexible
– Invisible to User Software
Cons:
– Overhead: HW Redundancy
– OS Support Required: Redundancy Management, Virtualization
– Fault Propagation is Possible: Must Combat with Larger N

48 Analytic Results 1

49 Analytic Results 2

50 Portcullis Prototype
A Future CMP Isn’t Available Yet
– Simulating It Would Take a LONG Time
– Modifying the OS Would Take a LONG Time
Punt: Make a User-Level Prototype for Today’s Hardware

51 Portcullis Prototype
Emulate the RPCM
Trap System Calls
– Detect Faults
– Hide Other Processes
Allow the OS to Manage Sharing
Provide Virtualization for Redundancy

52 Portcullis Performance 1

53 Portcullis Performance 2

54 Portcullis Performance 3

55 Concluding Remarks
Execution Redundancy is a Large Field
– We Saw Four Techniques; Others Exist
CMPs Represent a Large Field
– We Saw ~4 and Designed 1
Product of Two Large Fields = Huge Field!
NMR ≠ (N-1) × 100% Overhead

56 Backup Slides

57 What’s Really Going On?
execve("/path/add", "2", "2")
brk(0) // get memory
access("/lib/glibc") // find glibc
open("/lib/glibc", "r")
mmap(1, 0xDEADBEEF) // import glibc
close()
stat("/lib/mylib")
open(…) stat(…) // find other libraries
open(…) stat(…) read(…) fstat(…) mmap(…) // import other libraries
close()
mprotect() // make libs executable
set_thread_area()
write(1, "2+2=4") // do output
munmap() // cleanup
exit_group()
(Everything before write() is process setup; everything after it is process teardown.)

58 What’s Really Going On?
write(1, "2+2=4");
– Map File Descriptor 1 to Actual File Descriptor 1'
– Add "2+2=4" to the Output Queue for 1'
– Compare "2+2=4" Against Other Processes' Output for 1'
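The queue-and-compare steps above might be sketched as below. `WriteProxy` is an illustrative stand-in for the RPCM with the FD mapping reduced to a comment; a real proxy would also have to decide when a queued record may actually reach the device.

```python
class WriteProxy:
    """Queue each replica's write() for a shared descriptor and compare
    corresponding records across replicas before anything is emitted."""
    def __init__(self, n_replicas):
        self.queues = [[] for _ in range(n_replicas)]   # one queue per replica

    def write(self, replica, data):
        # (Mapping virtual FD 1 to actual FD 1' is elided here.)
        self.queues[replica].append(data)               # queue output for 1'
        depth = len(self.queues[replica])
        # collect the record at this depth from every replica that has it
        peers = [q[depth - 1] for q in self.queues if len(q) >= depth]
        if len(peers) == len(self.queues):              # all replicas wrote
            if len(set(peers)) != 1:
                raise RuntimeError("output mismatch: %r" % (peers,))
            return peers[0]                             # agreed, safe to emit
        return None                                     # still waiting
```

The first replica to write merely queues its record; only when the last replica catches up is the record compared and released, which is exactly the ad hoc, I/O-time checking of asynchronous NMR.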

59 Tolerable Faults
– Detected as an Illegal Opcode Exception or an Eventual Disagree
– Detected as a Segmentation Fault, Bus Error, or Eventual Disagree
– Writeback to Wrong Address (e.g. Cache Tag Corruption): Redundant TLB, Conservative Tags; Otherwise System Failure

