A Mechanism for Online Diagnosis of Hard Faults in Microprocessors
Fred A. Bower, Daniel J. Sorin, and Sule Ozev

Presentation transcript:

Overview: Motivation; Current Techniques; Proposed Mechanism for Online Fault Diagnosis; Results; Challenges; Conclusion

Background: Hard faults (electromigration, gate oxide breakdown). Transient faults (single event upset).

Motivation: process scaling

Current fault handling techniques: DIVA; redundancy

Hybrid approach: DIVA for error detection and correction; utilize redundancy

Online diagnosis: Track the units each instruction uses. On a DIVA error, error_count++ for those units. If (error_count > threshold): YES, deconfigure the unit; NO, no action.
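The flow on this slide can be sketched in software as a set of per-unit error counters. This is only a minimal illustration, assuming that each flagged instruction carries the list of units it used; the class name `OnlineDiagnoser`, the method `on_diva_error`, and the unit names are hypothetical and do not come from the slides or the paper.

```python
from collections import Counter


class OnlineDiagnoser:
    """Hypothetical software model of the per-unit error counters."""

    def __init__(self, threshold):
        self.threshold = threshold    # errors tolerated before deconfiguring
        self.error_count = Counter()  # one counter per tracked unit
        self.deconfigured = set()     # units that have been turned off

    def on_diva_error(self, units_used):
        """Charge one error to every unit the flagged instruction used.

        Returns the units that were deconfigured by this call.
        """
        newly_deconfigured = []
        for unit in units_used:
            if unit in self.deconfigured:
                continue
            self.error_count[unit] += 1
            if self.error_count[unit] > self.threshold:
                self.deconfigured.add(unit)
                newly_deconfigured.append(unit)
        return newly_deconfigured


# Example: every erroneous instruction used alu0 but a different ROB entry,
# so only the counter for alu0 crosses the threshold.
diagnoser = OnlineDiagnoser(threshold=3)
results = [diagnoser.on_diva_error(["alu0", f"rob{i}"]) for i in range(4)]
```

In this sketch the unit with the hard fault accumulates errors on every use, while units that merely shared an instruction with it are charged only occasionally, which is why the counters can separate them.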

Field Deconfigurable Units (FDU): units that can be turned off in case of a fault. Shown on the slide: ALU, DIVA checker, reorder buffer, reservation station.

Deconfiguring mechanism: deconfigure entries in a circular buffer; deconfigure entries in a tabular structure
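The slide does not spell out how entries are disabled, so the following is only one plausible sketch. It assumes that a circular buffer skips disabled entries when its allocation pointer advances, and that a tabular structure marks a disabled entry as permanently busy so it is never handed out. The class and method names are hypothetical, and occupancy tracking for the circular buffer is left out for brevity.

```python
class CircularBuffer:
    """Hypothetical circular structure (e.g. a reorder buffer)."""

    def __init__(self, size):
        self.size = size
        self.disabled = [False] * size
        self.tail = 0  # next candidate entry for allocation

    def deconfigure(self, index):
        self.disabled[index] = True

    def allocate(self):
        """Return the next usable entry, skipping disabled ones."""
        for _ in range(self.size):
            index = self.tail
            self.tail = (self.tail + 1) % self.size
            if not self.disabled[index]:
                return index
        raise RuntimeError("all entries are deconfigured")


class Table:
    """Hypothetical tabular structure (e.g. reservation stations)."""

    def __init__(self, size):
        self.busy = [False] * size
        self.disabled = [False] * size

    def deconfigure(self, index):
        self.disabled[index] = True
        self.busy[index] = True  # assumption: a disabled entry looks busy forever

    def allocate(self):
        """Return the first free entry, or None if every entry is busy."""
        for index, is_busy in enumerate(self.busy):
            if not is_busy:
                self.busy[index] = True
                return index
        return None

    def release(self, index):
        if not self.disabled[index]:  # disabled entries are never freed
            self.busy[index] = False


rob = CircularBuffer(4)
rob.deconfigure(1)
rob_order = [rob.allocate() for _ in range(5)]

stations = Table(3)
stations.deconfigure(0)
station_order = [stations.allocate() for _ in range(3)]
```

Both structures keep working at reduced capacity after a deconfiguration, which matches the slide's point that only the faulty entry is lost.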

Analysis
Evaluated: hard fault diagnosis latency; performance impact of losing a component to a hard fault.
Hardware overhead: DIVA (6% of an Alpha core); error counters (~1227 bits total); instruction resource usage (19 wires in total); deconfiguration logic. Can be reduced using coarse granularity.

Challenges: Error count threshold. Related to resource usage; heavily used resources have higher counters; pipeline flushes before threshold is reached.

Challenges: transient faults; independent resource usage. [Figure: errors reported by the DIVA checker for a hard fault and a transient fault, desired versus observed, over units labeled A to F]

Limitations
Certain structures cannot be protected: register file; issue logic; Common Data Bus (CDB).
Transient fault -> false deconfiguration. Possibly masked by the error counter.
Faults in the error counter or deconfiguration logic: periodically test counters; permanently configure or deconfigure the FDU upon error.
Window of vulnerability: DIVA produces errors until the counter saturates.

Conclusion
As transistors shrink, the hard fault rate increases.
Current reliability mechanisms: redundancy (TMR); thread-level redundancy; pre-shipment testing and deconfiguration; low-cost solutions such as DIVA.
Online diagnosis: low cost and low hardware overhead; uses FDUs along with DIVA to diagnose faults dynamically; increases yield, since affected chips can be binned to a lower performance bin.

Discussion
What are the advantages of this hybrid scheme over using just a DIVA checker?
As process technology gets smaller, can this mechanism increase the lifetime of the processor by a significant amount?
As transistors shrink, the number of cores will increase. Can this mechanism still be used, as opposed to turning off a faulty core?
How can we extend this mechanism to handle the issue logic, singleton resources, and the CDB?

Citations
Images:
Electron Migration. Digital image. Wikimedia.org. Wikimedia, 6 Mar. Web.
Gate Oxide Breakdown. Digital image. Attopsemi Technology. Attopsemi Technology, n.d. Web.
Sawant, Minal. Single Event Upset. Digital image. COTS. Microsemi, Jan. Web.
Sawant, Minal. Soft Error Rate. Digital image. CCCP. University of Michigan, 11 May. Web.
Carr, Robert. Simultaneous Multithreading. Digital image. Prezi. Prezi, 31 Oct. Web.
Wong, William. Out of Order Pipeline. Digital image. Electronic Design. Electronic Design, 19 Oct. Web.
Mark Brehob. EECS 470 Lecture Slides.
Papers:
Fred A. Bower, Daniel J. Sorin, and Sule Ozev. A Mechanism for Online Diagnosis of Hard Faults in Microprocessors. In Proc. of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), 2005.
T. M. Austin. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. In Proc. of the 32nd Annual IEEE/ACM Int'l Symposium on Microarchitecture, Nov.