1 Redundant Multithreading Techniques for Transient Fault Detection
Shubu Mukherjee (Intel), Michael Kontz (HP), Steve Reinhardt (Intel consultant, U. of Michigan)
Versions of this work have been presented at ISCA 2000 and ISCA 2002
2 Transient Faults from Cosmic Rays & Alpha Particles
- decreasing feature size
- decreasing voltage (exponential dependence?)
- increasing number of transistors (Moore's Law)
- increasing system size (number of processors)
- no practical absorbent for cosmic rays
3 Fault Detection via Lockstepping (HP Himalaya)
Replicated microprocessors + cycle-by-cycle lockstepping
Input replication feeds both microprocessors; output comparison checks them
Memory covered by ECC; RAID array covered by parity; ServerNet covered by CRC
4 Fault Detection via Simultaneous Multithreading
Same structure, but with replicated threads in place of replicated microprocessors: can cycle-by-cycle lockstepping still work with threads?
Input replication; output comparison
Memory covered by ECC; RAID array covered by parity; ServerNet covered by CRC
5 Simultaneous Multithreading (SMT)
Multiple threads (Thread1, Thread2) share one instruction scheduler and one set of functional units
Examples: Alpha 21464, Intel Northwood
6 Redundant Multithreading (RMT)
RMT = Multithreading + Fault Detection
Multithreaded uniprocessor: Multithreading (MT) = Simultaneous Multithreading (SMT); RMT = Simultaneous & Redundant Threading (SRT)
Chip Multiprocessor (CMP): MT = multiple threads running on a CMP; RMT = Chip-Level Redundant Threading (CRT)
7 Outline
SRT concepts & design
Preferential Space Redundancy
SRT performance analysis: single- & multi-threaded workloads
Chip-Level Redundant Threading (CRT): concept & performance analysis
Summary
Current & future work
8 Overview
SRT = SMT + Fault Detection
Advantages:
- Piggybacks on an SMT processor with little extra hardware
- Better performance than complete replication
- Lower cost due to market volume of SMT & SRT
Challenges:
- Lockstepping is very difficult with SRT
- Must carefully fetch/schedule instructions from the redundant threads
9 Sphere of Replication
Two copies of each architecturally visible thread, co-scheduled on the SMT core as a leading thread and a trailing thread
Compare results: signal a fault if they differ
The memory system (incl. L1 caches) sits outside the sphere of replication: outputs are compared on the way out, inputs are replicated on the way in
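The sphere-of-replication idea can be sketched in a few lines. This is a minimal illustration, not the hardware design: `run_replicated` stands in for the two co-scheduled thread copies, and the comparison at the return stands in for the output comparison at the sphere boundary.

```python
# Hypothetical sketch of a "sphere of replication": everything inside is
# executed twice; values crossing the boundary are replicated on entry
# and compared on exit. All names here are illustrative.

def run_replicated(computation, replicated_input):
    """Run `computation` as a leading and a trailing copy on the same
    replicated input, and compare outputs at the sphere boundary."""
    leading_result = computation(replicated_input)   # leading thread
    trailing_result = computation(replicated_input)  # trailing thread
    if leading_result != trailing_result:
        raise RuntimeError("transient fault detected: outputs differ")
    return leading_result  # only checked values leave the sphere

# Fault-free case: both copies agree, so the value is released.
assert run_replicated(lambda x: x * 2 + 1, 20) == 41
```

A transient fault corresponds to one of the two calls computing a different value, which the comparison catches before the result escapes the sphere.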
10 Basic Pipeline
Fetch → Decode → Dispatch → Execute → Commit, with the Execute stage accessing the data cache
11 Load Value Queue (LVQ)
- Keeps the threads on the same path despite I/O or multiprocessor writes
- Out-of-order load issue possible
- The LVQ sits between the data cache and the trailing thread's execute stage
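A minimal sketch of the LVQ idea, with illustrative names: the leading thread's committed loads push (address, value) pairs, and the trailing thread pops values instead of re-reading memory, so an intervening I/O or multiprocessor write cannot make the two threads diverge.

```python
from collections import deque

class LoadValueQueue:
    """Illustrative LVQ sketch: load values flow from the leading
    thread to the trailing thread instead of a second cache access."""
    def __init__(self):
        self.queue = deque()

    def leading_load(self, memory, addr):
        value = memory[addr]            # the one real memory access
        self.queue.append((addr, value))
        return value

    def trailing_load(self, addr):
        q_addr, value = self.queue.popleft()
        assert q_addr == addr           # both threads follow the same path
        return value                    # no second memory access

memory = {0x100: 7}
lvq = LoadValueQueue()
lead = lvq.leading_load(memory, 0x100)
memory[0x100] = 99                      # e.g. another processor writes
trail = lvq.trailing_load(0x100)
assert lead == trail == 7               # inputs replicated despite the write
```

The key property shown is input replication: both thread copies see the value 7 even though memory changed between the two "accesses".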
12 Store Queue Comparator (STQ)
- Compares store outputs before they reach the data cache
- Catches faults before they propagate to the rest of the system
13 Store Queue Comparator (cont'd)
Extends the residence time of leading-thread stores:
- Size constrained by the cycle-time goal
- Base CPU statically partitions a single queue among threads
- Potential solution: per-thread store queues
Deadlock if the matching trailing store cannot commit:
- Several small but crucial changes to avoid this
Example: a leading store "st 5 [0x120]" waits in the store queue until the trailing thread's "st 5 [0x120]" arrives; address & data are compared before the store goes to the data cache
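The slide's "st 5 [0x120]" example can be sketched as follows. This is an illustrative model, not the hardware: a leading-thread store is held in the queue, and only a trailing store that matches on both address and data releases it to the cache.

```python
from collections import deque

class StoreQueueComparator:
    """Illustrative sketch of store comparison: a leading-thread store
    waits until the trailing copy arrives; address and data must match
    before anything leaves the sphere of replication."""
    def __init__(self, cache):
        self.cache = cache
        self.pending = deque()  # leading-thread stores awaiting check

    def leading_store(self, addr, data):
        self.pending.append((addr, data))  # held, not yet visible

    def trailing_store(self, addr, data):
        lead_addr, lead_data = self.pending.popleft()
        if (lead_addr, lead_data) != (addr, data):
            raise RuntimeError("transient fault detected in store")
        self.cache[addr] = data  # only a checked store reaches the cache

cache = {}
stq = StoreQueueComparator(cache)
stq.leading_store(0x120, 5)
assert 0x120 not in cache        # held until the trailing copy checks it
stq.trailing_store(0x120, 5)
assert cache[0x120] == 5
```

The extended residence time the slide mentions corresponds to how long an entry sits in `pending` waiting for its trailing counterpart, which is why queue capacity matters for performance.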
14 Branch Outcome Queue (BOQ)
- Forwards leading-thread branch targets to the trailing thread's fetch stage
- 100% prediction accuracy in the absence of faults
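The BOQ reduces to a FIFO of resolved targets; here is a minimal sketch with illustrative names. The trailing thread's "prediction" is really a replay of the leading thread's committed outcome, which is why it is perfect unless a fault intervenes.

```python
from collections import deque

# Illustrative branch outcome queue: resolved leading-thread branch
# targets become the trailing thread's predictions.
boq = deque()

def leading_branch_resolve(target):
    boq.append(target)      # a committed outcome, not a guess
    return target

def trailing_branch_predict():
    return boq.popleft()    # 100% accurate in the absence of faults

for taken_target in (0x200, 0x280, 0x200):
    leading_branch_resolve(taken_target)
assert [trailing_branch_predict() for _ in range(3)] == [0x200, 0x280, 0x200]
```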
15 Line Prediction Queue (LPQ)
- The Alpha 21464 fetches chunks using line predictions
- Chunk = contiguous block of 8 instructions
- The LPQ feeds the trailing thread's fetch stage
16 Line Prediction Queue (cont'd)
Generate a stream of "chunked" line predictions:
- Every leading-thread instruction carries its I-cache coordinates
- Commit logic merges these into fetch chunks for the LPQ
- Independent of the leading thread's own fetch chunks
- The commit-to-fetch dependence raised deadlock issues
Example (1F8: add, 1FC: load, 200: beq 280, 204: and, 208: bne 200, 200: add): chunk 1 ends at 1FC (end of cache line); chunk 2 ends at the taken branch at 208
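The merging rule on this slide can be sketched as a small function, under two assumptions not stated explicitly in the deck: 4-byte instructions and a chunk boundary at every I-cache line boundary or non-sequential PC (a taken branch). The function name and parameters are illustrative.

```python
def merge_into_chunks(pcs, line_words=8, inst_bytes=4):
    """Illustrative sketch: merge committed leading-thread PCs into
    fetch chunks for the LPQ. A chunk ends at an I-cache line boundary
    or at a non-sequential next PC (a taken branch)."""
    line_bytes = line_words * inst_bytes
    chunks, current = [], [pcs[0]]
    for prev, pc in zip(pcs, pcs[1:]):
        sequential = pc == prev + inst_bytes
        crosses_line = sequential and pc % line_bytes == 0
        if not sequential or crosses_line:
            chunks.append(current)   # close the current chunk
            current = []
        current.append(pc)
    chunks.append(current)
    return chunks

# The slide's example: 1F8 add, 1FC load, 200 beq 280, 204 and,
# 208 bne 200, 200 add.
assert merge_into_chunks([0x1F8, 0x1FC, 0x200, 0x204, 0x208, 0x200]) == [
    [0x1F8, 0x1FC],          # chunk 1: ends at the cache-line boundary
    [0x200, 0x204, 0x208],   # chunk 2: ends at the taken branch
    [0x200],                 # the branch target starts a new chunk
]
```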
17 Line Prediction Queue (cont'd)
Read-out on trailing-thread fetch is also complex:
- The base CPU's "thread chooser" gets multiple line predictions but ignores all but one
- Fetches must be retried on an I-cache miss
Tricky to keep the queue in sync with thread progress:
- Add a handshake to advance the queue head
- Roll back the head on an I-cache miss
- Track both the last attempted & last successful chunks
18 Outline
SRT concepts & design
Preferential Space Redundancy
SRT performance analysis: single- & multi-threaded workloads
Chip-Level Redundant Threading (CRT): concept & performance analysis
Summary
Current & future work
19 Preferential Space Redundancy
SRT combines two types of redundancy:
- Time: same physical resource, different time
- Space: different physical resource
Space redundancy is preferable: better coverage of permanent/long-duration faults
Bias towards space redundancy where possible
20 PSR Example: Clustered Execution
- Base CPU has two execution clusters (IQ 0 / Exec 0 and IQ 1 / Exec 1) with separate instruction queues and function units
- Instructions (e.g. "add r1,r2,r3") are steered to a cluster in the dispatch stage
21 PSR Example: Clustered Execution
- Leading-thread instructions record their cluster: a bit carried with the fetch chunk through the LPQ and attached to the trailing-thread instruction (e.g. "add r1,r2,r3 [0]")
- Dispatch sends the trailing instruction to the opposite cluster if possible
22 PSR Example: Clustered Execution
- 99.94% of instruction pairs use different clusters
- Full spatial redundancy for execution
- No performance impact (occasional slight gain)
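The steering policy in this example can be sketched as a tiny dispatch decision. This is an illustrative model with made-up parameter names, not the 21464's dispatch logic: the trailing instruction prefers the cluster the leading copy did not use, and falls back only when that cluster has no free issue slot.

```python
def psr_dispatch(leading_cluster, free_slots):
    """Illustrative preferential-space-redundancy steering: prefer the
    opposite cluster from the leading instruction's recorded bit.
    `free_slots[i]` is the number of free issue slots in cluster i."""
    preferred = 1 - leading_cluster     # the cluster the leader did NOT use
    if free_slots[preferred] > 0:
        return preferred                # space redundancy: different hardware
    return leading_cluster              # fall back to time redundancy

# Leading "add r1,r2,r3" ran on cluster 0; both clusters have room:
assert psr_dispatch(0, free_slots=[4, 4]) == 1
# Preferred cluster is full: fall back to the leading instruction's cluster.
assert psr_dispatch(0, free_slots=[4, 0]) == 0
```

The 99.94% figure on the slide corresponds to how rarely the fallback path is taken in practice.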
23 Outline
SRT concepts & design
Preferential Space Redundancy
SRT performance analysis: single- & multi-threaded workloads
Chip-Level Redundant Threading (CRT): concept & performance analysis
Summary
Current & future work
24 SRT Evaluation
Used SPEC CPU95, 15M instructions per thread:
- Constrained by the simulation environment
- 120M instructions for 4 redundant thread pairs
Eight-issue, four-context SMT CPU:
- 128-entry instruction queue
- 64-entry load and store queues (default: statically partitioned among active threads)
- 22-stage pipeline
- 64KB 2-way assoc. L1 caches
- 3MB 8-way assoc. L2 cache
25 SRT Performance: One Thread
- One logical thread, two hardware contexts
- Performance degradation = 30%
- Per-thread store queue buys an extra 4%
26 SRT Performance: Two Threads
- Two logical threads, four hardware contexts
- Average slowdown increases to 40%
- Only 32% with per-thread store queues
27 Outline
SRT concepts & design
Preferential Space Redundancy
SRT performance analysis: single- & multi-threaded workloads
Chip-Level Redundant Threading (CRT): concept & performance analysis
Summary
Current & future work
28 Chip-Level Redundant Threading
SRT is typically more efficient than splitting one processor into two half-size CPUs
What if you already have two CPUs? (IBM Power4, HP PA-8800 (Mako))
Conceptually easy to run these in lockstep:
- Benefit: full physical redundancy
- Costs: latency through centralized checker logic; overheads (misspeculation etc.) incurred twice
CRT combines the best of SRT & lockstepping; it requires multithreaded CMP cores
29 Chip-Level Redundant Threading
The two cores are cross-coupled: CPU A runs leading thread A and trailing thread B, while CPU B runs leading thread B and trailing thread A; each core's LPQ, LVQ, and store comparisons are fed by the other core's leading thread
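The cross-coupled arrangement can be sketched with two toy cores. This is an illustrative model (class and method names are made up): each core leads one program and trails the other, so checking traffic for program A flows from CPU A to CPU B and vice versa, and each program gets full physical redundancy.

```python
from collections import deque

class Core:
    """Illustrative CRT core: leads one program, trails the other."""
    def __init__(self, name):
        self.name = name
        self.outgoing = deque()   # leading-thread results sent to the partner

    def lead(self, program):
        # Pretend each item is a store (addr, data) leaving this core's
        # leading thread; it is forwarded for checking on the other core.
        for store in program:
            self.outgoing.append(store)

    def trail(self, program, partner):
        # Recompute each store and check it against the partner core's
        # leading-thread copy; a mismatch signals a transient fault.
        for store in program:
            if partner.outgoing.popleft() != store:
                raise RuntimeError("transient fault detected")

cpu_a, cpu_b = Core("A"), Core("B")
prog_a = [(0x10, 1), (0x14, 2)]
prog_b = [(0x20, 3)]
cpu_a.lead(prog_a); cpu_b.lead(prog_b)   # leading threads on their own cores
cpu_b.trail(prog_a, partner=cpu_a)       # trailing thread A runs on CPU B
cpu_a.trail(prog_b, partner=cpu_b)       # trailing thread B runs on CPU A
```

Because the trailing copy runs on a physically different core, every instruction pair gets the spatial redundancy that PSR could only approximate within one core.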
30 CRT Performance
With per-thread store queues, ~13% improvement over lockstepping with an 8-cycle checker latency
31 Summary & Conclusions
SRT is applicable in a real-world SMT design:
- ~30% slowdown, slightly worse with two threads
- Store queue capacity can limit performance
- Preferential space redundancy improves coverage
Chip-Level Redundant Threading = SRT for CMPs:
- Looser synchronization than lockstepping
- Frees up resources for other application threads
32 More Information
Publications:
- S.K. Reinhardt & S.S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," International Symposium on Computer Architecture (ISCA), 2000
- S.S. Mukherjee, M. Kontz, & S.K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," International Symposium on Computer Architecture (ISCA), 2002
Papers available from:
- http://www.cs.wisc.edu/~shubu
- http://www.eecs.umich.edu/~stever
Patents: Compaq/HP filed eight patent applications on SRT