
1 BulletProof: A Defect-Tolerant CMP Switch Architecture
HPCA, Austin, Texas, February 13, 2006
Kypros Constantinides ‡, Stephen Plaza ‡, Jason Blome ‡, Bin Zhang †, Valeria Bertacco ‡, Scott Mahlke ‡, Todd Austin ‡, Michael Orshansky †
‡ Advanced Computer Architecture Lab, University of Michigan
† Department of Electrical and Computer Engineering, University of Texas at Austin

2 Introduction
Reliability is a critical aspect of any computer design
System designers target very small failure rates
Today, reliability targets are met with fault-avoidance design techniques
– use of conservative design margins
For future process technologies, it will be impossible to avoid system failures with conservative design margins alone
– need defect-tolerant design techniques
[Figure: transistor reliability, shown as transistor lifetime (years), now vs. future]

3 Reliable System Design Space
Need for cost- and performance-efficient techniques that can provide high reliability in the presence of unreliable components – “BulletProof”
Design space (DESIGN FEATURE by TYPE OF DEFECT):
NO-DETECTION
– Manufacturing defect: untestable defects
– Wear-out defect: system fails in unpredictable way
– Transient error: system glitch manifests in unpredictable way
DETECTION
– Manufacturing defect: testing
– Wear-out defect: component terminates at first error
– Transient error: component terminates, hard-reset restore
DETECTION + CORRECTION
– Manufacturing defect: post-manufacturing recovery
– Wear-out defect: online defect recovery
– Transient error: transient fault recovery
DETECTION + CORRECTION + REPAIR
– Manufacturing defect: post-manufacturing reconfiguration
– Wear-out defect: online repair
Techniques placed in the design space: DMR, ECC, memory cache-line swap-out, memory-array spares, TMR, Diva, Razor, BulletProof
Legend: mainstream solutions, high-end solutions, specialized solutions, research-stage solutions

4 CMP Switch Architecture
Goal: a defect-tolerant CMP switch design
Baseline switch architecture is provided by Li-Shiuan Peh
Implements the routing and flow-control functions required for transmitting packets in a 2D torus network
Wormhole switch pipelined at the flit level (32-bit flits)
Dimension-order routing
Specified in Verilog and synthesized to a gate-level netlist
~9K logic gates and 1700 sequential elements
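The slide names dimension-order routing on a 2D torus without giving details. A minimal sketch of the idea follows, assuming X is resolved before Y and that each ring is traversed in its shorter direction; the function names and the coordinate convention are illustrative, not taken from the baseline switch.

```python
def torus_step(cur: int, dst: int, size: int) -> int:
    """Return -1, 0 or +1: the shorter-way step along one ring of `size` nodes."""
    if cur == dst:
        return 0
    forward = (dst - cur) % size  # hops needed when moving in the +1 direction
    return 1 if forward <= size - forward else -1


def dimension_order_route(src, dst, width, height):
    """List the nodes visited from src to dst, finishing X before touching Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:  # assumption: X dimension is routed first
        x = (x + torus_step(x, dst[0], width)) % width
        path.append((x, y))
    while y != dst[1]:
        y = (y + torus_step(y, dst[1], height)) % height
        path.append((x, y))
    return path


route = dimension_order_route((0, 0), (3, 1), width=4, height=4)
print(route)
```

On a 4x4 torus the route from (0, 0) to (3, 1) uses the wrap-around link in X, so it takes two hops instead of four.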

5 Soft Errors (SEU) Vulnerability
In earlier work we studied the vulnerability of the switch architecture to soft errors
– Only 3.2% of faults eventually cause an error
Age-related wear-out silicon defects are a more challenging reliability threat for future technologies
In this work we focus on solutions for in-field silicon defects
These solutions also provide soft-error tolerance to the design

6 Self-Repairing Systems
Defect-tolerant self-repairing systems need to support:
– Error Detection
– System Diagnosis (locate the origin of the error)
– System Repair
– System Recovery
Key idea:
– error detection must be performance efficient, because it continuously checks execution for errors
– diagnosis, repair and recovery are not performance critical, because they are invoked only when an error is detected (a rare scenario), so performance can be traded for more cost-efficient techniques

7 Traditional Defect-Tolerant Techniques
Traditional techniques for designing defect-tolerant systems:
– Triple Modular Redundancy (TMR)
  Forward recovery
  Applicable to both combinational and sequential logic
  Cannot tolerate more than one defective module
  Area and power overhead ~3X
– Error Correction Codes (ECC)
  Lower-overhead solution
  Applicable only to state-holding structures and buses
[Figure: TMR with three module copies (M) and a voter (V); ECC word layout R1 R2 D1 R3 D2 D3 D4 R4 D5 D6 D7 D8, where R are ECC bits and D are data bits]
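Both techniques can be illustrated in a few lines. The sketch below assumes a bitwise majority voter for TMR and reads the R/D layout on the slide as a standard Hamming(12,8) single-error-correcting code, with check bits at positions 1, 2, 4 and 8. That reading and all function names are assumptions made for illustration.

```python
PARITY_POS = (1, 2, 4, 8)                 # R1..R4 in the slide's layout
DATA_POS = (3, 5, 6, 7, 9, 10, 11, 12)    # D1..D8 in the slide's layout


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three module outputs; masks one faulty module."""
    return (a & b) | (a & c) | (b & c)


def hamming_encode(data_bits):
    """Encode 8 data bits (D1..D8) into a 12-bit codeword (position 1 first)."""
    word = [0] * 13  # index 0 is unused so indices match bit positions
    for pos, bit in zip(DATA_POS, data_bits):
        word[pos] = bit
    for p in PARITY_POS:
        parity = 0
        for pos in range(1, 13):
            if pos & p and pos != p:  # parity bit p covers positions with bit p set
                parity ^= word[pos]
        word[p] = parity
    return word[1:]


def hamming_correct(codeword):
    """Return (data_bits, syndrome); a syndrome of 1..12 is the flipped position."""
    word = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, 13):
        if word[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    if 1 <= syndrome <= 12:
        word[syndrome] ^= 1  # repair the single flipped bit
    return [word[pos] for pos in DATA_POS], syndrome
```

The voter needs three full copies of the logic, while the code adds 4 check bits to 8 data bits, which is why ECC is the cheaper option where it applies.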

8 Error Detection: Low-Cost Domain-Specific Technique
The synthesized netlist of the added components accounts for ~10% of the total switch area
Provides error detection for both hard and soft errors
[Figure: switch block diagram with labels Input Buffers, Header, Routing Logic, ARB, Cross-bar Controller, Cross-bar, Buffer Checker, CRC, CRC Checker, FLIT and Error]
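The transcript does not say which CRC the checkers use. As a rough sketch of end-to-end flit checking, the code below assumes a CRC-8 with polynomial 0x07 computed over a 32-bit flit; the polynomial, the CRC width and the function names are illustrative assumptions.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8, MSB first, initial value 0 (assumed parameters)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def attach_crc(flit: int):
    """Compute the CRC when a 32-bit flit enters the switch."""
    return flit, crc8(flit.to_bytes(4, "big"))


def check_flit(flit: int, crc: int) -> bool:
    """Recompute the CRC at the output; a mismatch signals an error."""
    return crc8(flit.to_bytes(4, "big")) == crc


flit, crc = attach_crc(0xDEADBEEF)
corrupted = flit ^ (1 << 5)  # model a single bit flipped inside the switch
print(check_flit(flit, crc), check_flit(corrupted, crc))
```

Because the check covers the data path from input to output, one checker per output can detect corruption introduced anywhere along that path, whether from a hard defect or a soft error.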

9 Adding Defect Resiliency With Lower Cost
Automatic Cluster Decomposition
Balanced recursive min-cut heuristic algorithm
Input: a) the design’s gate-level netlist, b) the number of partitions
Output: a partitioned netlist
Goal:
– Balance partition sizes: smaller partitions give higher resilience
– Minimize cut edges: reduces cost overhead and vulnerable logic
Partitions can contain both combinational and sequential logic
[Figure: example netlist with nodes A to J, before and after partitioning]
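The slide does not give the heuristic itself. The sketch below shows one simple way to do balanced recursive bisection: start from an arbitrary balanced split and greedily swap node pairs while the cut shrinks. This is a simplified stand-in, not the authors' algorithm; it treats the netlist as an undirected graph, and every name here is hypothetical.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) wires."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def cut_size(adj, left, right):
    """Number of edges with one end in `left` and the other in `right`."""
    return sum(1 for u in left for v in adj[u] if v in right)


def bisect(adj, nodes, left_size):
    """Split `nodes` into two sets of fixed sizes, greedily reducing the cut."""
    ordered = sorted(nodes)
    left, right = set(ordered[:left_size]), set(ordered[left_size:])
    improved = True
    while improved:
        improved = False
        best = cut_size(adj, left, right)
        for u in sorted(left):
            for v in sorted(right):
                new_left = (left - {u}) | {v}
                new_right = (right - {v}) | {u}
                if cut_size(adj, new_left, new_right) < best:
                    left, right, improved = new_left, new_right, True
                    break
            if improved:
                break  # rescan from the new, smaller cut
    return left, right


def partition(adj, nodes, k):
    """Recursively bisect `nodes` into k parts of near-equal size."""
    if k == 1:
        return [set(nodes)]
    k_left = k // 2
    left, right = bisect(adj, nodes, len(nodes) * k_left // k)
    return partition(adj, left, k_left) + partition(adj, right, k - k_left)


# Two triangles joined by a single wire (E-B); the ideal cut is that wire.
edges = [("A", "C"), ("C", "E"), ("A", "E"),
         ("B", "D"), ("D", "F"), ("B", "F"), ("E", "B")]
adj = build_adjacency(edges)
parts = partition(adj, set(adj), 2)
print(parts)
```

Swaps keep both sides at a fixed size, so balance is preserved by construction. A greedy search like this can stop in a local minimum on larger netlists.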

10 Partition Sparing – Silicon Protection Factor
Partition sparing:
– Only one spare is active for each partition of the switch
– Replace voting logic with spare-swapping logic
– Lower power overhead
– A defect is fatal if it hits the last spare of a partition or the spare-swapping logic
Silicon Protection Factor (SPF) = Mean Defects to Failure / Area Overhead
– The number of defects in a design is proportional to the design’s area
– Enables comparison of different defect-tolerant designs
[Figure: partitions {A, B, F}, {D, E, H} and {C, G, J, I}, each duplicated (1 extra spare per partition)]
[Figure: SPF – Defect Tolerance, annotated "15.8X more defects tolerated" and "7.6X more defects tolerated per unit area"]
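To make the SPF ratio concrete, here is a small Monte Carlo sketch of partition sparing. It assumes equal-sized partitions, defects that land uniformly over all physical copies, and no spare-swapping logic, so it will not reproduce the slide's numbers. All names and parameters are hypothetical.

```python
import random


def mean_defects_to_failure(num_partitions, spares, trials=2000, seed=0):
    """Average number of defects until some partition has no healthy copy left."""
    rng = random.Random(seed)
    copies = spares + 1
    total = 0
    for _ in range(trials):
        healthy = [True] * (num_partitions * copies)  # one flag per physical copy
        remaining = [copies] * num_partitions          # healthy copies per partition
        defects = 0
        failed = False
        while not failed:
            defects += 1
            hit = rng.randrange(len(healthy))  # uniform over area (assumption)
            if healthy[hit]:
                healthy[hit] = False
                part = hit // copies
                remaining[part] -= 1
                failed = remaining[part] == 0
        total += defects
    return total / trials


def silicon_protection_factor(num_partitions, spares):
    """SPF = mean defects to failure / area overhead (overhead = copies here)."""
    area_overhead = spares + 1  # ignores swapping logic (assumption)
    return mean_defects_to_failure(num_partitions, spares) / area_overhead


print(silicon_protection_factor(12, 0), silicon_protection_factor(12, 1))
```

With no spares the first defect is always fatal, so the unprotected design has an SPF of 1 under these assumptions. That is the baseline against which protected designs are compared.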

11 System Recovery
Add a Recovery Pointer to each input buffer
Recovery pointers advance 4 cycles after the input controller grants the requesting output channel
– Guarantees that the flit has been CRC checked
On error detection:
– All CRC checkers drop outgoing flits
– Switch pipeline is flushed
– Head pointers are set to recovery pointers
– Restart execution
[Figure: interconnect switch with a CRC checker on each routed flit and recovery logic driven by the error detection signal; an input buffer holding flits a to e with Tail, Head and Recovery Head pointers]
a: Correctly routed flit
b, c: In the switch pipeline
d: Next flit to be routed
e: Last flit buffered
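The pointer scheme can be sketched with a toy input buffer. The model below uses a plain list instead of a hardware circular buffer and tracks grant cycles explicitly; the class and method names are hypothetical, and only the 4-cycle delay comes from the slide.

```python
from collections import deque

CHECK_DELAY = 4  # cycles from grant until the flit is known to be CRC checked


class InputBuffer:
    """Toy input buffer with head and recovery pointers (linear, not circular)."""

    def __init__(self, flits):
        self.flits = list(flits)
        self.head = 0              # next flit to be routed
        self.recovery = 0          # oldest flit not yet confirmed as checked
        self.in_flight = deque()   # grant cycles of flits still in the pipeline

    def grant(self, cycle):
        """Send the flit at the head pointer into the switch pipeline."""
        flit = self.flits[self.head]
        self.head += 1
        self.in_flight.append(cycle)
        return flit

    def tick(self, cycle):
        """Advance the recovery pointer past flits granted CHECK_DELAY cycles ago."""
        while self.in_flight and cycle - self.in_flight[0] >= CHECK_DELAY:
            self.in_flight.popleft()
            self.recovery += 1

    def on_error(self):
        """Flush the pipeline and replay from the recovery pointer."""
        self.in_flight.clear()
        self.head = self.recovery


buf = InputBuffer("abcde")
buf.grant(0)   # a
buf.grant(3)   # b
buf.grant(4)   # c
buf.tick(5)    # a is now checked; b and c are still in the pipeline
buf.on_error()
print(buf.grant(6))  # replay restarts at b
```

This mirrors the slide's example: flit a is safely out, b and c are lost in the flushed pipeline, and resetting the head to the recovery pointer retransmits them.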

12 System Diagnosis and Repair
Iterative trial-and-error technique
– Recover to the last correct state of the switch
– For partition i, swap in the spare for the current copy and restart execution
– Error detected? If no, continue execution. If yes and i < # partitions, increase i and repeat; otherwise the defect is fatal
Built-In-Self-Test (BIST)
– For each partition keep automatically generated test vectors in ROM
– Apply test vectors to each partition through scan chains to locate the defective partition
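The trial-and-error loop can be sketched against a toy switch model. The model and its names are hypothetical. Restoring the original copy after an unsuccessful swap is an assumption, since the transcript does not say what happens to a swap that did not help.

```python
class SwitchModel:
    """Toy switch: each partition has one active copy out of `copies`."""

    def __init__(self, num_partitions, spares_per_partition, defective):
        self.active = [0] * num_partitions       # index of the copy in use
        self.copies = spares_per_partition + 1
        self.defective = set(defective)          # (partition, copy) pairs

    def replay_has_error(self):
        """Stand-in for recover, restart and watch the error detection signal."""
        return any((p, c) in self.defective for p, c in enumerate(self.active))


def diagnose_and_repair(switch):
    """Try swapping each partition's spare in turn; return the repaired index."""
    for i in range(len(switch.active)):
        original = switch.active[i]
        if original + 1 >= switch.copies:
            continue                      # no spare left for this partition
        switch.active[i] = original + 1   # swap in the spare
        if not switch.replay_has_error():
            return i                      # error gone: partition i was defective
        switch.active[i] = original       # assumption: undo an unhelpful swap
    return None                           # no swap helped: fatal defect


switch = SwitchModel(num_partitions=4, spares_per_partition=1, defective={(2, 0)})
print(diagnose_and_repair(switch), switch.active)
```

Diagnosis costs up to one replay per partition, which is acceptable because the loop only runs in the rare case that an error has been detected.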

13 Exploring Defect-Tolerant CMP Switch Designs
How do these techniques affect the system’s lifetime?
[Figure: design-space plot separating Pareto-optimal from Pareto sub-optimal designs, with directions marked "more robust designs" and "cheaper designs". Labeled design points:]
– 12 partitions (cmps), 2/5 spare input controllers, 1 spare per cmp. (rest), iterative replay: Area = 1.76X, SPF = 2.53
– 206 partitions, 1 spare per partition, iterative replay: Area = 2.3X, SPF = 7.6
– 206 partitions, 2 spares per partition, iterative replay: Area = 3.4X, SPF = 11.1
– 206 partitions, 1 spare per partition, Built-In-Self-Test: Area = 3.16X, SPF = 5.54
– 12 partitions (cmps), TMR: Area = 3.04X, SPF = 1.54
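Which of the labeled points are Pareto-optimal can be checked directly from the numbers on the slide. The sketch below assumes optimality is judged on two criteria only, lower area and higher SPF. The short design names are labels made up for this example.

```python
# (area overhead, SPF) pairs as printed on the slide
designs = {
    "12 partitions, mixed spares, iterative replay": (1.76, 2.53),
    "206 partitions, 1 spare, iterative replay": (2.3, 7.6),
    "206 partitions, 2 spares, iterative replay": (3.4, 11.1),
    "206 partitions, 1 spare, BIST": (3.16, 5.54),
    "12 partitions, TMR": (3.04, 1.54),
}


def pareto_optimal(points):
    """Names of designs not dominated on (lower area, higher SPF)."""
    optimal = []
    for name, (area, spf) in points.items():
        dominated = any(
            a <= area and s >= spf and (a < area or s > spf)
            for other, (a, s) in points.items()
            if other != name
        )
        if not dominated:
            optimal.append(name)
    return optimal


for name in pareto_optimal(designs):
    print(name)
```

Under these two criteria the three iterative-replay designs survive. The BIST and TMR points are each beaten by a design that is both smaller and more robust.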

14 “Bathtub Curve”: A model for semiconductor hard failures
The lifetime failure rate of semiconductor systems follows what is known as the bathtub curve
Trend for future process technologies:
– Failure rate during the grace period gets larger
– Breakdown period starts earlier in the system’s lifetime
[Figure: failure rate (FIT) over time across the infant, grace and breakdown periods, with the curve for future process technologies]

15 System Lifetime – A Post 65nm Technology Case Scenario
[Figure: failure rate (FIT), axis ticks from 12000 to 120000 in steps of 12000, for four designs: TMR (SPF = 1.54); 3/5 spare IC, 1 spare rest (SPF = 3.01); 1 spare (SPF = 7.63); 2 spares (SPF = 11.11). Annotation: "1 defect every two years"]
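To relate the "1 defect every two years" annotation to the FIT axis, the conversion can be written out. The sketch assumes the standard definition of FIT as failures per 10^9 device-hours and a year of 8760 hours; the function names are illustrative.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def fit_from_interval_years(years: float) -> float:
    """FIT rate corresponding to one defect every `years` years."""
    return 1e9 / (years * HOURS_PER_YEAR)


def years_between_defects(fit: float) -> float:
    """Mean time between defects, in years, at a given FIT rate."""
    return 1e9 / (fit * HOURS_PER_YEAR)


print(round(fit_from_interval_years(2)))      # one defect every two years
print(round(years_between_defects(12000), 2))  # lowest tick on the axis
```

Under that definition, one defect every two years corresponds to roughly 57,000 FIT, which falls inside the axis range shown in the figure.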

16 Conclusions – Future Work
Conclusions
– Traditional mechanisms are insufficient for tolerating moderate numbers of defects
– Domain-specific techniques along with resource sparing, iterative diagnosis and reconfiguration are more effective
– Decomposing the design into modest-sized partitions is the most effective granularity at which to apply redundancy
Future Work
– Use of spare components based on component wear-out profiles
– Explore low-cost defect-tolerant techniques for microprocessors

17 Questions?

