Presentation is loading. Please wait.

Presentation is loading. Please wait.

NC STATE UNIVERSITY ASPLOS-XII Understanding Prediction-Based Partial Redundant Threading for Low-Overhead, High-Coverage Fault Tolerance Vimal Reddy Sailashri.

Similar presentations


Presentation on theme: "NC STATE UNIVERSITY ASPLOS-XII Understanding Prediction-Based Partial Redundant Threading for Low-Overhead, High-Coverage Fault Tolerance Vimal Reddy Sailashri."— Presentation transcript:

1 NC STATE UNIVERSITY ASPLOS-XII Understanding Prediction-Based Partial Redundant Threading for Low-Overhead, High-Coverage Fault Tolerance Vimal Reddy Sailashri Parthasarathy † Eric Rotenberg Dept. of Electrical & Computer Engineering North Carolina State University Raleigh, NC † Architecture Modeling Infrastructure Group Intel Corporation Hudson, MA

2 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 2 Transient Fault Tolerance Transient faults –Temporary hardware faults –Worsening with shrinking technology Soft errors Noise Prominent solution: Redundant Multithreading –Full program duplication –Complete fault tolerance –High overheads

3 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 3 Full Redundant Execution Full duplication 100% fault coverage fault

4 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 4 Partial Redundant Threading (PRT) Only partially duplicate a program Shorter the redundant thread, lesser the overhead Approach taken to create partial thread affects fault tolerance

5 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 5 fault Conventional PRT

6 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 6 fault Conventional PRT Arbitrary duplication Non-duplicated portions lose fault coverage

7 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 7 Prediction-based PRT Confident prediction of A Freq. Case: Correct prediction (e.g., 99.9% of the time) fault

8 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 8 fault Prediction-based PRT Confident prediction of A Freq. Case: Correct prediction (e.g., 99.9% of the time)

9 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 9 fault Prediction-based PRT Confident prediction of A Rare case: Incorrect prediction (e.g., 0.1% of the time)

10 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 10 Prediction-based PRT Confident predictions are good proxies for redundant execution Predictions break thread inter-dependence Near-100% fault coverage

11 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 11 Relaxing checking constraints - S(p) and T(p) are removable

12 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 12 Relaxing checking constraints - S(p) and T(p) are removable - Such removal can shorten partial thread significantly - But, no predictions to replace them fault

13 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 13 PRT Spectrum Case study: PRT on Slipstream M. Gomaa T. N. Vijaykumar ISCA 2005 Z. Purser K. Sundermoorthy E. Rotenberg MICRO 2000 ASPLOS 2000 N. Wang S. J. Patel DSN 2004

14 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 14 Slipstream Overview Delay Buffer Branch + Value outcomes verify SMT Processor Verified Architectural State R-stream Unverified Architectural State A-stream Outcome mismatch copy restart

15 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 15 Slipstream Overview Delay Buffer Branch + Value outcomes verify Verified Architectural State R-streamA-stream Unverified Architectural State Slipstream Components remove Detect predictable instructions Predict based on repeated detection (confidence) SMT Processor monitor

16 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 16 Slipstream Overview Delay Buffer Branch + Value outcomes verify Verified Architectural State R-streamA-stream Unverified Architectural State Slipstream Components remove SMT Processor Predictions act as proxies Detect predictable instructions Predict based on repeated detection (confidence) monitor

17 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 17 Slipstream Overview Delay Buffer Branch + Value outcomes verify Verified Architectural State R-streamA-stream Unverified Architectural State Slipstream Components remove Confident branches Confident dead writes Confident silent writes Slices whose leaves are any of the above SMT Processor Predictions act as proxies Detect predictable instructions Predict based on repeated detection (confidence) monitor

18 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 18 Confident Branch Correct Branch Prediction (Taken) Main (R-stream)Partial (A-stream) (Not Taken) fault

19 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 19 Confident Branch Main (R-stream)Partial (A-stream) (Not Taken) fault Incorrect Branch Prediction (Not Taken)

20 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 20 Main (R-stream)Partial (A-stream) fault Correct Prediction (Dead) Confident Dead Write

21 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 21 Main (R-stream)Partial (A-stream) fault Incorrect Prediction (Not Dead) ? Confident Dead Write

22 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 22 Main (R-stream)Partial (A-stream) Confident Silent Write/Store 72

23 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 23 Main (R-stream)Partial (A-stream) Confident Silent Write/Store 33 33

24 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 24 Main (R-stream)Partial (A-stream) Confident Silent Write/Store V VV Correct Prediction (Silent) fault X V X

25 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 25 Main (R-stream)Partial (A-stream) Confident Silent Write/Store stale Incorrect Prediction (Not Silent) fault X

26 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 26 Slipstream Fault Detection Coverage We showed confident predictions detect faults –Additional coverage = Correctly predicted confident instr. + Their backward slices –Mispredictions vulnerable, but rare in Slipstream –Mispredictions + backward slices = only 0.1% instr. Hence, non-duplicated instr. well covered Slipstream has high fault detection coverage (99.9%) Prior research: Only duplicated instr. covered

27 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 27 Fault Coverage: Detection vs. Recovery Detection Coverage Represents ability to detect faults Slipstream: 99.9% of instr. Recovery Coverage Ability to rollback to ‘golden’ state on fault detection Slipstream’s recovery coverage?

28 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 28 Slipstream Fault Recovery fault detected Flawed Rollback (Conventional Slipstream) Correct Rollback

29 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 29 Reorder Buffer (ROB) fault head commit to arch. state tail Flush Restart Arch. State Corrupted Base Slipstream Recovery detected

30 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 30 Reorder Buffer (ROB) head commit to arch. state tail Flush to ROB head Restart fault Enhancement: ROB-head Recovery detected

31 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 31 Reorder Buffer (ROB) head commit to arch. state tail Flush to ROB head fault Enhancement: ROB-head Recovery detected Arch. State Corrupted Limited rollback distance: R-stream retires quickly – accelerated by leading A-stream

32 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 32 Reorder Buffer (ROB) head commit to arch. state tail Enhancement: ROB occupancy management Threshold Delay in retirement, hence performance hit Increased rollback distance detected

33 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 33 Reorder Buffer (ROB) head commit to arch. state tail fault Enhancement: History Buffer head tail Undo changes Performance friendly: R-stream mispredictions are rare

34 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 34 Main (R-stream)Partial (A-stream) Indirect Check of Silent Write/Store V Correct Prediction (Silent) fault X Base Slipstream Recovery

35 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 35 Main (R-stream)Partial (A-stream) Direct Check of Silent Write/Store V Correct Prediction (Silent) fault X existing value new value V Direct Check Base Slipstream Recovery Need Enhanced Recovery

36 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 36 Novel Framework to Analyze Coverage Each instr. considered “candidate faulty” Coverage = # of instr. checked before committal to arch. state Mispredicted instr. and backward slices marked unchecked

37 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 37 Prior WorkNew Analysis Checkers Non-checkers Coverage Analysis

38 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 38 Logically masked Masked? Coverage Analysis Rollback point Masked?

39 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 39 Clarification Analysis framework is a coverage measurement tool Not in the actual hardware

40 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 40 Results: Microarchitecture models L1 I & D caches 64KB, 4-way, 64B line, LRU, L1hit = 1 cycle, L1miss/L2hit = 10 cycles L2 unified cache 1MB, 8-way, 64B line, LRU, L1miss/L2miss = 100 cycles superscalar core dispatch/issue/retire bandwidth: 8 (4) reorder buffer (ROB): 256 (128) load/store queue: 64 (32) issue queue: 64 (32) cache ports (read/write): 4 (2)

41 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 41 Breakdown of Instructions SPEC2K Integer Benchmarks 65% duplicated 35% non-duplicated > 50% non-duplicated on some bench.

42 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 42 Breakdown of Instructions 23% confident predictions 12% backward slices of confident predictions See paper for more detailed breakdown

43 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 43 Fault Coverage Slip + Direct + Recovery Enhancements Prior Work (65%): Only dup. instr. New result (78%): Correct confident predictions covered ROB-head recovery improves with occupancy threshold: 95% to 98% Direct silent write/ store checks improve coverage (88%) 65% 78% 88% 95% 97% 98% 99% History buffer schemes provide high coverage (98% to 99%)

44 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 44 Slipstream Performance (SMT 8-wide) Average slowdown: Full redundant execution:14% Slipstream:1.3%

45 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 45 Performance Impact of Enhanced Recovery ROB-occupancy management delays retirement, causes slowdown (gradual decrease from RH0 to RH48) History buffer approach is performance friendly (negligible slowdown)

46 NC STATE UNIVERSITY ASPLOS-XII© 2006 Vimal Reddy 46 Conclusions Confident predictions can replace duplication –Slipstream case study : Redundant thread reduced by up to 57% while retaining near-100% coverage Prediction-based PRT offers a new avenue for efficient fault tolerant computing


Download ppt "NC STATE UNIVERSITY ASPLOS-XII Understanding Prediction-Based Partial Redundant Threading for Low-Overhead, High-Coverage Fault Tolerance Vimal Reddy Sailashri."

Similar presentations


Ads by Google