(C) 2003 Daniel Sorin, Duke Architecture. Dynamic Verification of End-to-End Multiprocessor Invariants. Daniel J. Sorin (1), Mark D. Hill (2), David A. Wood (2).


1 (C) 2003 Daniel Sorin, Duke Architecture
Dynamic Verification of End-to-End Multiprocessor Invariants
Daniel J. Sorin (1), Mark D. Hill (2), David A. Wood (2)
(1) Department of Electrical & Computer Engineering, Duke University
(2) Computer Sciences Department, University of Wisconsin-Madison

2 DSN 2003 – Daniel Sorin slide 2 My Talk in One Slide
Commercial server availability is important
–System model: Symmetric Multiprocessor (SMP)
–Fault model: Mostly transient, some permanent
Recent work developed efficient checkpoint/recovery
–But we can only recover from hardware errors we detect
–Many hardware errors are hard to detect
Proposal: Dynamic verification of invariants
–Online checking of end-to-end system invariants
–Checking performed with distributed signature analysis
–Triggers recovery if invariant is violated

3 Outline
Background
–SMPs and availability
–Existing hardware error detection
Invariant checking with distributed signature analysis
Two invariant checkers
Evaluation
Conclusions

4 Symmetric Multiprocessor (SMP)
[Figure, two panels. System Model: four processors (P) on a shared wire bus. Cache Coherence Transaction: state labels I and M; issue request, wait for response, receive response.]

5 Symmetric Multiprocessor (SMP)
[Figure, two panels. System Model: four processors (P) connected by a switch. Cache Coherence Transaction: state labels I and M; issue request, wait for response, receive response.]
–Broadcast request not delivered to subset of nodes
–Broadcast requests delivered out of order to subset of nodes

6 Symmetric Multiprocessor (SMP)
[Figure, two panels. System Model: four processors (P) connected by a switch. Cache Coherence Transaction: state labels I and M with times t1, t2, t3; events: issue request, request arrives, response arrives.]
–Broadcast request not delivered to subset of nodes
–Broadcast requests delivered out of order to subset of nodes
–More chances for incorrect state transitions

7 Backward Error Recovery
Can improve availability with backward error recovery
If error detected, then recover to pre-fault state
Backward error recovery (BER) requires:
–Checkpoint/recovery mechanism
–Error detection mechanisms

8 SafetyNet Checkpoint/Recovery
SafetyNet: all-hardware scheme [ISCA 2002]
–Periodically take logical checkpoint of multiprocessor
  MP State: processor registers, caches, memory
–Incrementally log changes to caches and memory
–Consistent checkpointing performed in logical time
  E.g., every 3000 broadcast cache coherence requests
–Can tolerate >100,000 cycles of error detection latency
[Figure: timeline with checkpoints CP 1 to CP 4. Regions: validated execution; pending validation (still detecting errors); active execution.]

9 Error Detection
Error model: mostly due to transient faults
Example error detection mechanisms:
–Parity bit on cache line
–Checksum on incoming message
–Timeout on cache coherence transaction
But error detection for servers is still weak
Why?
–Error detection is often on critical path and must be fast
–Fast error detection can’t incorporate info from other nodes

10 Why Local Information Isn’t Sufficient
[Figure: P1, P2, P3, P4 connected by a switch; state labels Shared and Owned.]

11 Why Local Information Isn’t Sufficient
[Figure: P1, P2, P3, P4 connected by a switch; labels: Shared, Owned, Broadcast Request for Exclusive, fault!]

12 Why Local Information Isn’t Sufficient
[Figure: P1, P2, P3, P4 connected by a switch; labels: Shared, Owned, Broadcast Request for Exclusive, Invalid, Data Response, fault!]

13 Why Local Information Isn’t Sufficient
[Figure: P1, P2, P3, P4 connected by a switch; state labels Shared and Modified.]
Neither P1 nor P2 can detect that an error has occurred!

14 Outline
Background
End-to-end invariant checking
Two invariant checkers
Evaluation
Conclusions

15 Distributed Signature Analysis
Reduces long history of events into small signature
–Signatures map almost-uniquely to event histories
[Figure: events 1 through N at P1 are reduced to P1's signature, and events 1 through N at P2 to P2's signature; both signatures feed a Checker.]
Check periodically in logical time (every 3000 requests)

16 Designing Signature Analysis Schemes
Must devise two functions: Update and Check
Signature(Pi) = Update(Signature(Pi), Event)
Check(Signature(P1),…,Signature(PN)) = true if error
Simple example: check that message inflow=outflow
–Assume only unicast messages
–Update: +1 for receive, -1 for send
–Check: true if sum of all signatures doesn’t equal 0
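
The inflow=outflow example can be sketched in a few lines of Python. This is only an illustration of the Update and Check functions named on the slide: the event names "send" and "recv", the helper signature_of, and the use of unbounded Python integers in place of a fixed-width hardware register are all assumptions made for the sketch.

```python
from typing import Iterable, List


def update(signature: int, event: str) -> int:
    """Update from the slide: +1 for a receive, -1 for a send."""
    if event == "recv":  # event names are hypothetical
        return signature + 1
    if event == "send":
        return signature - 1
    raise ValueError(f"unknown event: {event}")


def check(signatures: Iterable[int]) -> bool:
    """Check from the slide: true (error) if the signatures do not sum to 0."""
    return sum(signatures) != 0


def signature_of(events: List[str]) -> int:
    """Hypothetical helper: fold one node's event history into its signature."""
    signature = 0
    for event in events:
        signature = update(signature, event)
    return signature


# Error-free interval: P1 sends two unicast messages, P2 receives both.
ok = [signature_of(["send", "send"]), signature_of(["recv", "recv"])]

# Faulty interval: one of P1's messages is dropped before reaching P2.
dropped = [signature_of(["send", "send"]), signature_of(["recv"])]
```

Here check(ok) is false and check(dropped) is true. Neither node can see the drop from its own signature alone; the error only appears when the system controller combines the signatures.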

17 Implementing Distributed Signature Analysis
All components cooperate to perform checking
–Component = cache controller or memory controller
Each component contains:
–Local signature register
–Logic to compute signature updates
System contains:
–System controller that performs check function
Use distributed signature analysis for dynamic verification
–Verify end-to-end invariants

18 Outline
Background
End-to-end invariant checking
Two invariant checkers
–Message invariant
–Cache coherence invariant
Evaluation
Conclusions

19 A Message-Level Invariant Checker
Context: symmetric multiprocessor (SMP)
–Cache coherence with broadcast snooping protocol
Invariant: all nodes see same total order of broadcast cache coherence requests
Update: for each incoming broadcast, “add” Address
–Not quite this simple (e.g., doesn’t detect reorderings)
Check: error if the signatures aren’t all equal
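
A minimal sketch of the simple "add Address" version of this checker, in Python. The 16-bit signature width, the example addresses, and the helper signature_of are assumptions for illustration; the scheme in the paper is more elaborate, as the slide itself notes.

```python
from typing import Iterable, List

SIG_BITS = 16                    # assumed signature register width
MASK = (1 << SIG_BITS) - 1


def update(signature: int, address: int) -> int:
    """Naive Update: add the address of each incoming broadcast, modulo 2^b."""
    return (signature + address) & MASK


def check(signatures: Iterable[int]) -> bool:
    """Check: true (error) if the signatures are not all equal."""
    return len(set(signatures)) > 1


def signature_of(addresses: List[int]) -> int:
    """Hypothetical helper: fold the broadcasts one node saw into a signature."""
    signature = 0
    for address in addresses:
        signature = update(signature, address)
    return signature


requests = [0x40, 0x80, 0xC0]

# Both nodes see every broadcast in the same order: no error reported.
same = [signature_of(requests), signature_of(requests)]

# Node 2 never receives the broadcast for 0x80: signatures differ.
missed = [signature_of(requests), signature_of([0x40, 0xC0])]

# Node 2 sees two broadcasts swapped: addition commutes, so this goes undetected.
reordered = [signature_of(requests), signature_of([0x80, 0x40, 0xC0])]
```

In this sketch check(missed) is true, while check(same) and check(reordered) are both false. The last case is the reordering weakness the slide mentions.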

20 Aliasing
Aliasing occurs if two histories have same signature
3 possible sources of aliasing
–Finite resources – b bits can only distinguish 2^b histories
–Fault in signature analysis hardware itself
–Inherent flaw in scheme
Examples of inherent aliasing in previous scheme
–Arrival of message with Address=0 doesn’t change signature
–Reordering of messages doesn’t change signature
–We solve aliasing issues in paper
  Tricks: hash more than 1 field of message, use LFSRs, etc.
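
To illustrate the kind of trick the slide alludes to, here is one possible order-sensitive Update, sketched in Python. It is not the scheme from the paper. It assumes a 16-bit Galois LFSR with tap mask 0xB400, a nonzero seed 0xACE1, and a second message field called src (the sender's node ID) that is hashed together with the address; all of these choices are hypothetical.

```python
from typing import List, Tuple

SIG_BITS = 16                   # assumed signature register width
MASK = (1 << SIG_BITS) - 1
TAPS = 0xB400                   # assumed tap mask for a 16-bit Galois LFSR
SEED = 0xACE1                   # assumed nonzero initial signature


def lfsr_step(state: int) -> int:
    """Advance the Galois LFSR by one shift."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= TAPS
    return state


def update(signature: int, address: int, src: int) -> int:
    """Shift the old signature, then XOR in a hash of two message fields."""
    mixed = (address ^ (src << 8)) & MASK   # hypothetical two-field hash
    return lfsr_step(signature) ^ mixed


def signature_of(messages: List[Tuple[int, int]]) -> int:
    """Hypothetical helper: fold (address, src) pairs into a signature."""
    signature = SEED
    for address, src in messages:
        signature = update(signature, address, src)
    return signature


in_order = signature_of([(0x40, 1), (0x80, 1)])
swapped = signature_of([(0x80, 1), (0x40, 1)])
zero_message = signature_of([(0x00, 0)])
```

Because the register is shifted before each message is folded in, swapping two different messages changes the result, and a message with Address=0 still moves the signature away from its seed. Aliasing from finite resources remains: 16 bits can distinguish at most 2^16 histories.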

21 A Cache Coherence Invariant Checker
Invariant: all coherence upgrades cause downgrades
–Upgrade: increase permissions to block (e.g., none → read)
–Downgrade: decrease permissions (e.g., write → read)
Update: add Address for upgrade, subtract Address for downgrade
Check: error if sum of all signatures doesn’t equal 0
Challenges
–Can be more than one downgrade per upgrade
–Upgrader doesn’t know how many downgraders exist
–See paper for solutions to these challenges
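
A minimal Python sketch of the add/subtract Update and the sum-to-zero Check. It covers only the simplest case, exactly one downgrade per upgrade, and so sidesteps the challenges listed on the slide. The 16-bit width, the event names "upgrade" and "downgrade", and the helper signature_of are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

SIG_BITS = 16                   # assumed signature register width
MOD = 1 << SIG_BITS


def update(signature: int, kind: str, address: int) -> int:
    """Update: add Address on an upgrade, subtract it on a downgrade."""
    if kind == "upgrade":       # event names are hypothetical
        return (signature + address) % MOD
    if kind == "downgrade":
        return (signature - address) % MOD
    raise ValueError(f"unknown event kind: {kind}")


def check(signatures: Iterable[int]) -> bool:
    """Check: true (error) if the signatures do not sum to 0 (modulo 2^b)."""
    return sum(signatures) % MOD != 0


def signature_of(events: List[Tuple[str, int]]) -> int:
    """Hypothetical helper: fold one cache's coherence events into a signature."""
    signature = 0
    for kind, address in events:
        signature = update(signature, kind, address)
    return signature


# P1 upgrades block 0x1000 and P2 performs the matching downgrade.
matched = [signature_of([("upgrade", 0x1000)]),
           signature_of([("downgrade", 0x1000)])]

# P2 never sees the request, so it never downgrades.
unmatched = [signature_of([("upgrade", 0x1000)]), signature_of([])]
```

Here check(matched) is false and check(unmatched) is true. The second case corresponds to the earlier Shared/Modified scenario, where neither cache can tell locally that anything went wrong.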

22 Outline
Background
End-to-end invariant checking
Two invariant checkers
Evaluation
Conclusions

23 Methodology
Full-system simulation of 16-processor machine
–Simics provides functional simulation of everything
–We added timing simulation for memory system & SafetyNet
Commercial workloads running on Solaris 8
–Database: IBM’s DB2 running online transaction processing
–Static web server: Apache
–Dynamic web server: Slashdot
–Java middleware

24 Detection Coverage
How do we know if our checkers work?
Inject errors periodically
–Corrupt messages
–Drop messages
–Reorder messages
–Improperly process cache coherence messages
Global invariant checkers detected all errors

25 Performance
[Figure: performance chart]
Error bars represent +/- one standard deviation

26 Conclusions
Goal: improve multiprocessor availability
How? Dynamic verification of end-to-end invariants
–Implemented with distributed signature analysis
Results
–Detects previously undetectable hardware errors
–Negligible performance overhead for error-free execution
Duke FaultFinder Project
–http://www.ee.duke.edu/~sorin/faultfinder
Wisconsin Multifacet Project
–http://www.cs.wisc.edu/multifacet/

