Reduced Cost Reliability via Statistical Model Detection. Jon-Paul Anderson (PhD Student), Dr. Brent Nelson (Faculty), Dr. Mike Wirthlin (Faculty), Brigham Young University.


Reduced Cost Reliability via Statistical Model Detection
Jon-Paul Anderson, PhD Student
Dr. Brent Nelson, Faculty
Dr. Mike Wirthlin, Faculty
Brigham Young University

Alternative Mitigation Techniques
- Triple-Modular Redundancy (TMR) is expensive
  - Area: 3-5x
  - Timing: ~20%
  - Power: 3-5x
- Need reduced-cost mitigation techniques
  - Trade off reliability for area/timing/power
- Motivating example:
  - In-orbit experiment cannot be triplicated due to area cost
- Some mitigation is better than none
- Marking which data is suspect would be useful

Smart Detection
- Duplication + detection as a lower-cost alternative?
  - Duplication is 2/3 the size of TMR
  - Duplicate With Compare (DWC) only detects errors; it doesn't mask them
- Can DWC be modified to mask?
  - Use a 'smart detector' to attempt to mask errors.
Figure (DWC block diagram), labels: Input, Circuit copy A, Circuit copy B, Comparator, Output, Not_equal.

Smart Detector
Figure (block diagram), labels: Input, Circuit copy A, Circuit copy B, Mux, Smart Detector, Output, Not_equal.
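The difference between plain DWC and the mux-based arrangement can be sketched in a few lines of Python. This is only an illustration: the function names and the `pick_b` callback are hypothetical stand-ins for the smart detector, and forwarding copy A by default is an assumption, not something the slides specify.

```python
def dwc(out_a, out_b):
    """Duplicate With Compare: forward copy A and flag any mismatch.

    Detects errors but cannot mask them (assumed to forward copy A).
    """
    return out_a, out_a != out_b


def dwc_with_smart_detector(out_a, out_b, pick_b):
    """Mux between the two copies when they disagree.

    pick_b is a hypothetical callback standing in for the smart detector:
    it returns True when copy B is believed to be the fault-free one.
    """
    not_equal = out_a != out_b
    if not_equal and pick_b(out_a, out_b):
        return out_b, not_equal
    return out_a, not_equal
```

With plain DWC a mismatch can only be reported; with the mux the detector's guess decides which copy reaches the output, so a wrong guess passes the faulty value through.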

Statistical Smart Detector
- Statistical detection
  - Use a histogram of data values to try to determine which branch is without error
  - 3 possible outcomes
    - Correct detection
    - Incorrect detection
    - Ambiguous outcome

Simple Statistical Example
- Voltmeter has redundant probes
  - TMR would be too expensive, so we use two probes with a statistical model
- Three possible outcomes
  - One probe reads 1.2V and the other reads 5V
    Result: ambiguous detection
  - Correct circuit reads 3.3V and circuit in error reads 15V
    Result: statistical model chooses the correct voltage
  - Correct circuit reads 15V and circuit in error reads 3.3V
    Result: wrong voltage chosen

Voltage   Probability
3.3V      50%
1.2V      20%
5V        20%
15V       10%
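The three outcomes of the voltmeter example can be reproduced with a small sketch. It assumes the detector simply picks the reading with the higher probability in the table; the names `PRIOR` and `choose` are hypothetical.

```python
# Probability table from the voltmeter example.
PRIOR = {3.3: 0.50, 1.2: 0.20, 5.0: 0.20, 15.0: 0.10}


def choose(reading_a, reading_b, prior=PRIOR):
    """Return the more probable reading, or None if the choice is ambiguous.

    Assumption: the model picks whichever reading has the higher probability.
    """
    if reading_a == reading_b:
        return reading_a  # probes agree, nothing to decide
    p_a = prior.get(reading_a, 0.0)
    p_b = prior.get(reading_b, 0.0)
    if p_a > p_b:
        return reading_a
    if p_b > p_a:
        return reading_b
    return None  # equal probabilities: ambiguous


print(choose(1.2, 5.0))   # None: both readings have probability 20%
print(choose(3.3, 15.0))  # 3.3: correct if 3.3V was the true value
print(choose(15.0, 3.3))  # 3.3: wrong if 15V was the true value
```

The last call shows why the model can be wrong: a rare but correct value loses to a common but erroneous one.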

Statistical Example
Figure, first diagram: labels Probe1, Probe2, Mux, 2. Caption: statistical model with no history.
Figure, second diagram: labels Probe1, Probe2, Mux, Σ. Caption: statistical model with 8-deep sample history.

System under test
Figure: cascade of five half-band filter stages, H1(z) through H5(z), with decimation by 2 (↓2).
Labels in figure order:
- Sample rates: 100, 50, 25, 12.5, and 6.25 samples/symbol
- Filters: length-5, length-5, length-9, length-13, and length-9 half-band filters

Initial tests: stuck-at faults
- Downsampler was created in System Generator
  - Matlab was then used to create artificial stuck-at faults and tabulate the results
- Tests were run for stuck-at-1 and stuck-at-0 faults on every bit of the 20-bit result
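A stuck-at fault on one bit of the 20-bit result can be modeled by forcing that bit to a constant. The sketch below is in Python rather than Matlab and treats the result as an unsigned 20-bit pattern, which is an assumption; the helper names are hypothetical.

```python
WIDTH = 20                 # width of the result word, from the slide
MASK = (1 << WIDTH) - 1


def stuck_at(value, bit, stuck_value):
    """Force one bit of a WIDTH-bit word to 0 or 1 (stuck-at fault model)."""
    value &= MASK
    if stuck_value:
        return value | (1 << bit)
    return value & ~(1 << bit) & MASK


def count_corrupted(samples, bit, stuck_value):
    """Count how many samples actually change under the injected fault."""
    return sum(
        1 for s in samples if stuck_at(s, bit, stuck_value) != (s & MASK)
    )
```

Sweeping `bit` over `range(WIDTH)` for both stuck values gives one result per bit position, which is one way to tabulate such a test.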

Stuck-at results
Figure: plot of the stuck-at results; axis label: Bit position.

Fault simulator
- BYU/LANL fault injection tool
  - Based on the SLAAC-1V board, a PCI card with 3 Virtex 1000 FPGAs
  - Previously validated with radiation testing
  - Sensitive configuration bits are tabulated and then tested one by one
Figure: Virtex SEU Emulator (LANL/BYU); labels: DUT, "Golden Design", Real-Time Comparator.

Test Methodology
- Design is loaded onto the SLAAC board and the sensitive configuration bits are tabulated
  - Every bit in the DUT's configuration bitstream is flipped individually; if the output then differs from the golden copy, the bit is recorded as 'sensitive'.
- Random numbers are fed through a QPSK modulator in Matlab to generate the input vector.
- The vector is then run through the original design without injecting faults to gather a golden output.
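The tabulation step above amounts to a flip-one-bit-and-compare loop. The following sketch shows that loop against a toy software model; `run_design` and `toy_design` are hypothetical stand-ins for the hardware DUT and bear no relation to the real bitstream format.

```python
def find_sensitive_bits(run_design, config, inputs):
    """Flip each configuration bit in turn and compare with the golden output."""
    golden = run_design(config, inputs)
    sensitive = []
    for i in range(len(config)):
        faulty = list(config)
        faulty[i] ^= 1  # inject a single-bit upset
        if run_design(faulty, inputs) != golden:
            sensitive.append(i)
    return sensitive


def toy_design(config, inputs):
    """Hypothetical design: only configuration bits 0 and 1 affect the output."""
    gain = 1 + config[0]
    offset = config[1]
    return [gain * x + offset for x in inputs]


print(find_sensitive_bits(toy_design, [0, 0, 0, 0], [1, 2, 3]))  # [0, 1]
```

Bits 2 and 3 of the toy configuration are unused, so flipping them never changes the output and they are not recorded as sensitive.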

Test Methodology
- The histogram is generated from the golden data in Matlab by specifying the bin size.
  - If the bin size is too large, too many faults will map to the same bin, resulting in ambiguity.
  - Small bin sizes cause multiple bins to have the same counts, once again resulting in ambiguity.
  - To simplify the hardware, bin sizes are constrained to powers of 2.
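With power-of-2 bin sizes, finding a bin reduces to dropping low-order bits. The sketch below is in Python rather than Matlab and assumes unsigned 20-bit values (signed data would need an offset first); the function names are hypothetical.

```python
from collections import Counter


def bin_index(value, width=20, num_bins=1024):
    """Map a width-bit value to a bin by discarding its low-order bits."""
    if num_bins <= 0 or num_bins & (num_bins - 1):
        raise ValueError("num_bins must be a power of two")
    shift = width - (num_bins.bit_length() - 1)  # log2 of the bin size
    return (value & ((1 << width) - 1)) >> shift


def build_histogram(golden, width=20, num_bins=1024):
    """Count how many golden output samples fall into each bin."""
    return Counter(bin_index(v, width, num_bins) for v in golden)
```

For a 20-bit value and 1024 bins the shift is 10 bits, so each bin covers 1024 consecutive values; with 4096 bins the shift is 8. In hardware the shifted value can be used directly as a memory address, which is presumably why the constraint simplifies the design.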

Test Methodology
- The input vector is then run through the design for each sensitive bit and the output captured.
  - This design has [...] sensitive bits
  - The fault is inserted roughly halfway through the execution, so that part of each run is fault-free
- Matlab is then used to implement the smart detector and analyze the results.

Ambiguous contribution
- Ambiguous detection occurs in three ways
  - A and B map to bins with the same value
  - A and B map to the same bin
  - History has equal choices for A and B (example history: B B A A B B A A)
- Four possible ways to count ambiguous results
  - Don't count them at all
  - Record all as a wrong choice
  - Record all as a right choice
  - Record half as correct
    - Assuming the choice is fair, 50% of the time you should get it right
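The four counting methods differ only in how the ambiguous decisions enter the ratio. A minimal sketch, with a hypothetical function name and made-up counts used purely for illustration:

```python
def score(correct, wrong, ambiguous):
    """Fraction of right decisions under the four ways of counting ambiguity."""
    total = correct + wrong + ambiguous
    return {
        "not_counted": correct / (correct + wrong),
        "all_wrong": correct / total,                       # lower bound
        "half_right": (correct + 0.5 * ambiguous) / total,
        "all_right": (correct + ambiguous) / total,         # upper bound
    }


# Hypothetical counts, not taken from the experiment.
print(score(correct=60, wrong=10, ambiguous=30))
```

Counting half as right always lands midway between the two bounds, and the gap between the bounds equals the ambiguous fraction.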

Results: correct decisions for 1024 bins
Lower bound = all ambiguous counted as wrong; upper bound = all ambiguous counted as right.

Samples considered   Ambiguous   No ambiguous   All ambiguous   Half ambiguous   All ambiguous
for decision         decisions   counted        as wrong        as right         as right
[...]                [...]%      86.13%         56.73%          73.80%           90.87%
[...]                [...]%      88.60%         74.60%          82.50%           90.40%
[...]                [...]%      90.15%         81.29%          86.20%           91.12%
[...]                [...]%      90.96%         84.09%          87.87%           91.64%
[...]                [...]%      92.05%         86.84%          89.67%           92.50%

Results: correct decisions for 4096 bins
Lower bound = all ambiguous counted as wrong; upper bound = all ambiguous counted as right.

Samples considered   Ambiguous   No ambiguous   All ambiguous   Half ambiguous   All ambiguous
for decision         decisions   counted        as wrong        as right         as right
[...]                [...]%      85.49%         61.30%          75.45%           89.59%
[...]                [...]%      93.05%         84.12%          88.92%           93.71%
[...]                [...]%      95.29%         90.60%          93.06%           95.53%
[...]                [...]%      95.87%         92.53%          94.27%           96.02%
[...]                [...]%      96.57%         94.12%          95.39%           96.65%

Results: correct decisions for [...] bins
Lower bound = all ambiguous counted as wrong; upper bound = all ambiguous counted as right.

Samples considered   Ambiguous   No ambiguous   All ambiguous   Half ambiguous   All ambiguous
for decision         decisions   counted        as wrong        as right         as right
[...]                [...]%      88.49%         67.51%          79.36%           91.21%
[...]                [...]%      97.04%         92.78%          94.97%           97.17%
[...]                [...]%      97.69%         95.90%          96.81%           97.73%
[...]                [...]%      97.93%         96.77%          97.36%           97.96%
[...]                [...]%      98.25%         97.45%          97.86%           98.27%

Possible Hardware Implementation
- Smart detector consists of
  - Small number of BlockRAMs
  - Counter
  - Small amount of logic
Figure (block diagram), labels: A, B, =?, Difference, Single sample vote, Golden Histogram, FIFO, Accumulator, Decision. Resource annotations: Dual-port BlockRAM, Counter, Dual-port BlockRAM, Logic.
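A software model of the blocks named in the diagram might look like the sketch below. It rests on several assumptions that the slides do not confirm: a single-sample vote compares the golden-histogram counts of the bins that A and B fall into, votes are recorded only when the copies differ, the accumulator is a running sum over the FIFO, and copy A is forwarded when the outcome is ambiguous. The class and parameter names are hypothetical.

```python
from collections import deque


class SmartDetector:
    """Behavioral model: histogram lookup, single-sample vote, FIFO, accumulator."""

    def __init__(self, histogram, shift, depth=8):
        self.histogram = histogram              # bin index -> golden count
        self.shift = shift                      # log2 of the bin size
        self.fifo = deque([0] * depth, maxlen=depth)
        self.acc = 0                            # running sum of votes in the FIFO

    def step(self, a, b):
        """Return (output, not_equal, ambiguous) for one pair of samples."""
        if a == b:
            return a, False, False
        count_a = self.histogram.get(a >> self.shift, 0)
        count_b = self.histogram.get(b >> self.shift, 0)
        vote = (count_a > count_b) - (count_a < count_b)  # +1 favors A, -1 favors B
        self.acc += vote - self.fifo[0]         # add newest vote, drop oldest
        self.fifo.append(vote)                  # maxlen evicts the oldest entry
        if self.acc > 0:
            return a, True, False
        if self.acc < 0:
            return b, True, False
        return a, True, True                    # ambiguous: default to A (assumption)
```

Updating the accumulator with the difference between the newest and oldest vote avoids re-summing the FIFO on every sample, so the datapath needs only one small adder.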

Conclusion
- Smart detection using a simple histogram was discussed
  - High accuracy with very low resource costs
- Future work
  - Expand the statistical approach to more than just FIR filters.
  - Investigate machine learning techniques for an even more accurate smart model