Unreliable Silicon: Myth or Reality? Shubu Mukherjee Principal Engineer Director, SPEARS Group (SPEARS = Simulation & Pathfinding of Efficient And Reliable.

Presentation transcript:

Unreliable Silicon: Myth or Reality?
Shubu Mukherjee
Principal Engineer
Director, SPEARS Group (SPEARS = Simulation & Pathfinding of Efficient And Reliable Systems)
Intel Corporation
Workshop on Computer Architecture Research Directions (CARD)
Feb. 11th, 2007

2 What’s the Truth?
There are three versions of the truth:
  My truth
  Your truth
  The truth

3 The Truth: Silicon is Becoming Unreliable
  Time-dependent device degradation
  Extreme device variations
  [Chart labels: Wider; Cell Instability Is Increasing; Soft Error FIT/Chip (Logic & Mem)]

4 The End-User’s Truth
End-users
  Care deeply about reliable systems
  May not be able to determine why their system failed
  Expect the industry to produce reliable systems for them
Goal of silicon vendors
  Keep # silicon errors low enough (e.g., < 0.1% of all errors)
  Low enough that end-users don’t notice or don’t care
Point Risks
  Individual corruption or crash may be critical (e.g., Windows 98 crash during a Gates demo)
  End-users may demand chip replacement, even if the error was not permanent

5 The IT Manager or Vendor’s Truth
The Lightbulb Phenomenon
  A house with 48 lightbulbs, each with a 4-year MTTF
  Will replace a lightbulb every month
Negative Impact to Business: billions of dollars involved
  Increased total cost of ownership
  Product returns & replacement
  Loss of data and/or availability
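
The lightbulb arithmetic can be checked with a short sketch. It assumes that bulbs fail independently and at a constant rate, which the slide does not state explicitly; the function name `mean_time_between_replacements` is made up for this illustration.

```python
def mean_time_between_replacements(num_parts: int, mttf_per_part: float) -> float:
    """Average time between failures anywhere in a population of identical parts.

    Assumption (not from the slides): parts fail independently at a constant
    rate, so the population's failure rate is num_parts / mttf_per_part.
    """
    if num_parts <= 0 or mttf_per_part <= 0:
        raise ValueError("num_parts and mttf_per_part must be positive")
    return mttf_per_part / num_parts


bulbs = 48
mttf_months = 4 * 12  # a 4-year MTTF expressed in months

interval = mean_time_between_replacements(bulbs, mttf_months)
print(f"Expected time between replacements: {interval} month(s)")
```

The same scaling applies to any large population of parts: the more units in the field, the more often some unit fails, even when each individual unit has a long MTTF.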

6 The Designer’s Awakening
Shock
  “SER is the crabgrass in the lawn of computer design”
Denial
  “We will do the SER work two months before tapeout”
Anger
  “Our reliability target is too ambitious”
Acceptance
  “You can deny physics only for so long”
Designers have accepted silicon reliability as a challenge they will have to deal with

7 The Designer’s Challenge
Protection comes from
  Process – improved process technology
  Materials – shielding for alpha particles
  Circuits – rad-hard cells
  Architecture – ECC, parity, hardened gates, redundant execution
  Software – can provide detection & recovery at higher level
Companies constantly make trade-offs for reliability
  Cost of protection (performance & die size) vs. chip reliability
  Products must meet the end-users’ reliability expectations
Industry will produce reliably operating parts
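
As a minimal illustration of the architecture-level item, here is a sketch of a single even-parity bit over a data word. It is a toy software model, not how any particular processor implements parity; the helper names `parity_bit`, `store`, and `check` are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def store(word: int) -> tuple[int, int]:
    """Model a protected storage cell as (data, parity)."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """Return True if the word is consistent with its stored parity bit."""
    return parity_bit(word) == stored_parity


data, parity = store(0b10110010)

single_flip = data ^ (1 << 3)  # one bit upset
double_flip = data ^ 0b11      # two bits upset

print("clean word passes:     ", check(data, parity))
print("single flip is caught: ", not check(single_flip, parity))
print("double flip is missed: ", check(double_flip, parity))
```

Parity only reports that something is wrong; it cannot say which bit flipped, so recovery has to come from elsewhere (for example ECC or a higher-level retry). An even number of flips goes undetected, which is one reason the cost-versus-coverage trade-off on this slide matters.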

8 Industry Needs Help with Research
Academia has some misconceptions
  MTBF is only a rough estimate of an individual part’s life
  A system hang does not protect from data corruption
  Adding protection without correction does not reduce the overall error rate
  …
Research needed in different areas of silicon reliability
  How do we predict and/or measure error rate from radiation, wearout, & variability?
  How do we detect soft errors, wearout, variability on individual parts?
  Many traditional solutions exist, but how do we make them cheaper?
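
One way to see why MTBF says little about an individual part is to look at the spread of lifetimes around the mean. The sketch below assumes, purely for illustration, that lifetimes follow an exponential distribution (constant failure rate); the slides do not specify a distribution, and `fraction_failed_by` is a made-up helper name.

```python
import math


def fraction_failed_by(t: float, mttf: float) -> float:
    """Fraction of parts expected to have failed by time t.

    Assumption (not from the slides): exponentially distributed lifetimes,
    so the fraction failed is 1 - exp(-t / mttf).
    """
    if t < 0 or mttf <= 0:
        raise ValueError("t must be non-negative and mttf positive")
    return 1.0 - math.exp(-t / mttf)


mttf_years = 4.0
for years in (1.0, 4.0, 8.0):
    share = fraction_failed_by(years, mttf_years)
    print(f"failed within {years:.0f} year(s): {share:.1%}")
```

Under this assumption a sizeable share of parts fails well before the MTTF and some last far longer, so the mean describes the population rather than the life of any one part.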