Fault-Tolerant Softcore Processors Part I: Fault-Tolerant Instruction Memory Nathaniel Rollins Brigham Young University.

Slides:



Advertisements
Similar presentations
Giuseppe De Robertis - INFN Sez. di Bari 1 SEU – SET test structures.
Advertisements

Survey of Detection, Diagnosis, and Fault Tolerance Methods in FPGAs
Topics covered: CPU Architecture CSE 243: Introduction to Computer Architecture and Hardware/Software Interface.
Sana Rezgui 1, Jeffrey George 2, Gary Swift 3, Kevin Somervill 4, Carl Carmichael 1 and Gregory Allen 3, SEU Mitigation of a Soft Embedded Processor in.
 RAID stands for Redundant Array of Independent Disks  A system of arranging multiple disks for redundancy (or performance)  Term first coined in 1987.
10/14/2005Caltech1 Reliable State Machines Dr. Gary R Burke California Institute of Technology Jet Propulsion Laboratory.
Scrubbing Approaches for Kintex-7 FPGAs
Discussion of: “Terrestrial-based Radiation Upsets: A Cautionary Tale” CprE 583 Tony Kuker 12/06/05.
Multi-Bit Upsets in the Virtex Devices Heather Quinn, Paul Graham, Jim Krone, Michael Caffrey Los Alamos National Laboratory Gary Swift, Jeff George, Fayez.
Fault-Tolerant Systems Design Part 1.
HPEC 2012 Scrubbing Optimization via Availability Prediction (SOAP) for Reconfigurable Space Computing Quinn Martin Alan George.
Complex Upset Mitigation Applied to a Re-Configurable Embedded Processor EEL 6935 Lu Hao Wenqian Wu.
1 Fault Tolerant FPGA Co-processing Toolkit Oral defense in partial fulfillment of the requirements for the degree of Master of Science 2006 Oral defense.
ICAP CONTROLLER FOR HIGH-RELIABLE INTERNAL SCRUBBING Quinn Martin Steven Fingulin.
1/1/ / faculty of Electrical Engineering eindhoven university of technology Architectures of Digital Information Systems Part 1: Interrupts and DMA dr.ir.
Microprocessors. Von Neumann architecture Data and instructions in single read/write memory Contents of memory addressable by location, independent of.
1/1/ / faculty of Electrical Engineering eindhoven university of technology Introduction Part 3: Input/output and co-processors dr.ir. A.C. Verschueren.
Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &
3. Hardware Redundancy Reliable System Design 2010 by: Amir M. Rahmani.
DSD 2007 Concurrent Error Detection for FSMs Designed for Implementation with Embedded Memory Blocks of FPGAs Andrzej Krasniewski Institute of Telecommunications.
Micro-RDC Microelectronics Research Development Corporation A Programmable Scrubber for FPGAs ACKNOWLEDGMENT OF SUPPORT: This material is based upon work.
Configuration. Mirjana Stojanovic Process of loading bitstream of a design into the configuration memory. Bitstream is the transmission.
1 Fault-Tolerant Computing Systems #2 Hardware Fault Tolerance Pattara Leelaprute Computer Engineering Department Kasetsart University
With Scott Arnold & Ryan Nuzzaci An Adaptive Fault-Tolerant Memory System for FPGA- based Architectures in the Space Environment Dan Fay, Alex Shye, Sayantan.
Radiation Effects and Mitigation Strategies for modern FPGAs 10 th annual workshop for LHC and Future experiments Los Alamos National Laboratory, USA.
Nadpis 1 Nadpis 2 Nadpis 3 Jméno Příjmení Vysoké učení technické v Brně, Fakulta informačních technologií v Brně Božetěchova 2, Brno
N-Tier Client/Server Architectures Chapter 4 Server - RAID Copyright 2002, Dr. Ken Hoganson All rights reserved. OS Kernel Concept RAID – Redundant Array.
Reduced Cost Reliability via Statistical Model Detection Jon-Paul Anderson- PhD Student Dr. Brent Nelson- Faculty Dr. Mike Wirthlin- Faculty Brigham Young.
A comprehensive method for the evaluation of the sensitivity to SEUs of FPGA-based applications A comprehensive method for the evaluation of the sensitivity.
FPGA IRRADIATION and TESTING PLANS (Update) Ray Mountain, Marina Artuso, Bin Gui Syracuse University OUTLINE: 1.Core 2.Peripheral 3.Testing Procedures.
Computers Internal Communication. Basic Computer System MAIN MEMORY ALUCNTL..... BUS CONTROLLER Processor I/O moduleInterconnections BUS Memory.
FORMAL VERIFICATION OF ADVANCED SYNTHESIS OPTIMIZATIONS Anant Kumar Jain Pradish Mathews Mike Mahar.
Fault-Tolerant Systems Design Part 1.
MAPLD 2005/202 Pratt1 Improving FPGA Design Robustness with Partial TMR Brian Pratt 1,2 Michael Caffrey, Paul Graham 2 Eric Johnson, Keith Morgan, Michael.
Synthesis Of Fault Tolerant Circuits For FSMs & RAMs Rajiv Garg Pradish Mathews Darren Zacher.
Computer Architecture Lecture 2 System Buses. Program Concept Hardwired systems are inflexible General purpose hardware can do different tasks, given.
Wang-110 D/MAPLD SEU Mitigation Techniques for Xilinx Virtex-II Pro FPGA Mandy M. Wang JPL R&TD Mobility Avionics.
Rinoy Pazhekattu. Introduction  Most IPs today are designed using component-based design  Each component is its own IP that can be switched out for.
Stores the OS/data currently in use and software currently in use Memory Unit 21.
Fault-Tolerant Systems Design Part 1.
2011/IX/27SEU protection insertion in Verilog for the ABCN project 1 Filipe Sousa Francis Anghinolfi.
CPS3340 COMPUTER ARCHITECTURE Fall Semester, /3/2013 Lecture 9: Memory Unit Instructor: Ashraf Yaseen DEPARTMENT OF MATH & COMPUTER SCIENCE CENTRAL.
Using Memory to Cope with Simultaneous Transient Faults Authors: Universidade Federal do Rio Grande do Sul Programa de Pós-Graduação em Engenharia Elétrica.
Evaluating Logic Resources Utilization in an FPGA-Based TMR CPU
In-Place Decomposition for Robustness in FPGA Ju-Yueh Lee, Zhe Feng, and Lei He Electrical Engineering Dept., UCLA Presented by Ju-Yueh Lee Address comments.
A4 1 Barto "Sequential Circuit Design for Space-borne and Critical Electronics" Dr. Rod L. Barto Spacecraft Digital Electronics Richard B. Katz NASA Goddard.
Paper by F.L. Kastensmidt, G. Neuberger, L. Carro, R. Reis Talk by Nick Boyd 1.
Interrupts and Exception Handling. Execution We are quite aware of the Fetch, Execute process of the control unit of the CPU –Fetch and instruction as.
Xilinx V4 Single Event Effects (SEE) High-Speed Testing Melanie D. Berg/MEI – Principal Investigator Hak Kim, Mark Friendlich/MEI.
MAPLD 2005/213Kakarla & Katkoori Partial Evaluation Based Redundancy for SEU Mitigation in Combinational Circuits MAPLD 2005 Sujana Kakarla Srinivas Katkoori.
Wu, Jinyuan Fermilab May. 2014
Unit Microprocessor.
Architectures of Digital Information Systems Part 1: Interrupts and DMA dr.ir. A.C. Verschueren Eindhoven University of Technology Section of Digital.
CFTP ( Configurable Fault Tolerant Processor )
CoBo - Different Boundaries & Different Options of
Chapter 9 a Instruction Level Parallelism and Superscalar Processors
SEU Mitigation Techniques for Virtex FPGAs in Space Applications
Fault Tolerance In Operating System
Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &
Computer Architecture & Operations I
MAPLD 2005 BOF-L Mitigation Methods for
M. Aguirre1, J. N. Tombs1, F. Muñoz1, V. Baena1, A. Torralba1, A
Computer System Overview
Sequential circuits and Digital System Reliability
Morgan Kaufmann Publishers Computer Organization and Assembly Language
Evaluation of Power Costs in Triplicated FPGA Designs
Hardware Assisted Fault Tolerance Using Reconfigurable Logic
Xilinx Kintex7 SRAM-based FPGA
Computer Operation 6/22/2019.
Seminar on Enterprise Software
Presentation transcript:

Fault-Tolerant Softcore Processors Part I: Fault-Tolerant Instruction Memory Nathaniel Rollins Brigham Young University

Overview Strong interest in FT softcore processors in space  LEON processor used by European space program  Microblaze, PicoBlaze, 8051, ERC32, etc. Rad-hard processors are expensive, big, and slow Softcore processors are flexible, fast, and cheap Overall Goal: identify low cost SEU mitigation techniques for softcore processors  Goal of Part I study: Identify low cost SEU mitigation techniques for softcore processor instruction memories 2

Approach TMR is the most common mitigation technique  Expensive and slow Other hardware techniques  Detection isn’t good enough – must correct DWC alone isn’t good enough EDC alone isn’t good enough 3 BRAM1 BRAM2 BRAM3 voter ECC BRAM1 BRAM2 Compare Decode / Parity ECC BRAM Decode & Correct ECCEDC with DWC Scrubbing Study Approach  Compare different softcore processor instruction memory fault-tolerant techniques in terms of:  Area, speed, power, reliability  Remaining processor protection: plain TMR

Fault Model BYU/LANL SLAAC1V fault injection tool used to insert single bit upsets into Virtex FPGAs  BRAM bits in Virtex bitstream are treated differently  Task: upgrade fault injection tool to support: Upsets in BRAM Readback of BRAM bits Next studies use SEAKR XRTC board with Virtex4 FPGA  SEAKR board borrowed from LANL  Fault injection tool also upgraded to upset BRAMs and detect critical failures 4

Critical Failures Critical Failures: upsets that cannot be fixed with a reset (lead to a SEFI)  Different memory structures are susceptible to critical failures: BRAMs LUTRAMs SRLs Registers that are not tied to a global reset  Example: WE port on a BRAM 5 BRAM Data In WE Data Out Addr

Critical Failures Critical Failures: upsets that cannot be fixed with a reset (lead to a SEFI)  Example: WE port on a BRAM 6 BRAM 0x x3111 0x07 AF01E32D 39A13AA D10 210F F 64D1234D... Instruction memory should never be written to → BRAM is treated as a ROM input data lines tied low WE tied low

Critical Failures Critical Failures: upsets that cannot be fixed with a reset (lead to a SEFI)  Example: WE port on a BRAM 7 BRAM 0x x07 AF01E32D 39A13AA D10 210F F 64D1234D... Upsetting the WE port overwrites the BRAM contents

Critical Failures Critical Failures: upsets that cannot be fixed with a reset (lead to a SEFI)  Example: WE port on a BRAM 8 BRAM 0x x1D AF01E32D 39A13AA D10 210F Especially bad for processors since BRAM address continually increments

Critical Failures Critical Failures: upsets that cannot be fixed with a reset (lead to a SEFI)  Example: WE port on a BRAM Mitigation techniques need to eliminate critical failures 9 BRAM 0x x Resetting the device will restart the processor, but will not restore the BRAM contents (program is lost)!

Fault-Tolerant Techniques Original processor design: Xilinx PicoBlaze 10 ROM PicoBlaze Instruction Address Output Processor Fault-tolerance determined by examining the PC and current instruction as faults are injected Instruction memory fault-tolerant techniques:  TMR:  Single voter  Triple voter  Feedback  BLTMR  Scrubber  ECC:  SEC/DED  SEC/DED with DWC  SEC/DED with DWC and scrubbing  EDC & DWC:  CD with DWC  CD with DWC and scrubbing

Fault-Tolerant Techniques: TMR 11 Processor voter Processor voter v v v Pico ROMROM ROMROM v v v ROMROM ROMROM ROMROM ROMROM Top-Level TMR – 1 voterTop-Level TMR – 3 voters Feedback TMR BYU/LANL TMR Tool Pico ROMROM ROMROM BLTMR

FT Techniques: TMR with Scrubbing 12 BYU/Sandia BRAM scrubber with TMR  Each BRAM scrubbing WE must be independent of other BRAM WEs  Scrubbing address counters MUST be kept in sync  Scrubbing counter must be 2x slower than BRAM clock  Must prevent read/write address conflicts Without scrubbing overlapping errors will cause TMR to fail v FSM v v PicoBlaze v v v v v v Triplicated counter EN Eliminating critical failures is difficult when BRAM WEs are upset

FT Techniques: SEC/DED 13 Decode v encoded ROM (top half) Decode v v encoded ROM (bottom half) PicoBlaze v v v SEC/DED on 16-bit word:  Use (22, 6) code on 16-bit word  Use 2 BRAMS: 1 for top half encoded word (11 bits) 1 for bottom half encoded word (11 bits) Complete fault tolerance difficult when crossing from triplicated to non-triplicated  Logic and routing coming into and out of BRAMs are single point of failure  SEC/DED:  Detects and corrects any single-bit upset  Detects any double-bit upset  Triple+ upsets may or may not be detected

FT Techniques: SEC/DED with DWC 14 Improve SEC/DED reliability with DWC Still susceptible to critical failures when BRAM WE is upset 0 1 Decoder v encoded ROM (top half) Decoder encoded ROM (bottom half) SEC/DED Module v v v PicoBlaze v v v BRAM ado

FT Techniques: SEC/DED DWC Scrub 15 Scrubbing uses dual ported BRAMs  Scrub address counter runs ½ speed of BRAM clock Scrubbing cannot fix all errors (only single-bit/double-bit guaranteed)  Scrub trigger: single error correction(SEC) or double error detection (DED) on current instruction – more than 2 errors may or may not be caught  When triggered, a scrub copies entire BRAM contents of good BRAM into bad BRAM PicoBlaze v v v Decoder encoded ROM a do a di we do Triple CNT EN FSM x3 v v v encoded ROM a do a di we do SEC/DED Module instruction0 instruction1 instruction2 addr scrubAddr err we scrubData

FT Techniques: CD with DWC 16 Complement Duplicate (CD) duplicates and inverts (complements) the original BRAM contents  Detects errors by comparing the original with the complemented CD CD only detects upsets so DWC is used to correct upsets 0 1 CD check CD Module v v v PicoBlaze v v v BRAM ado BRAM ado CD BRAM ado  CD detects:  Any single-bit upset  66% double-bit upsets  Any multiple adjacent unidirectional upset

FT Techniques: CD DWC Scrub 17 Scrubbing uses dual ported BRAMs  Scrub address counter runs ½ speed of BRAM clock Scrubbing will fix critical failures  Scrubbing trigger: inverse of current instruction doesn’t match CD contents  When triggered, a scrub copies entire BRAM contents of good BRAM into bad BRAM  There are other scrubbing design strategies with CD – but this one removes all critical failures PicoBlaze v v v CD Check BRAM a do a di we do Triple CNT EN FSM x3 v v v CD BRAM a do a di we do CD Module instruction0 instruction1 instruction2 addr scrubAddr err we scrubData

FT Techniques: Results 18 DesignSlicesBRAM BitsClock Rate (MHz) Power (mW)Sensitive BitsCritical Failures Original voter2273.2x16803x x661.35x8473.4x3 3 voters2523.6x16803x x751.53x3680.0x3 Feedback2503.6x16803x x731.49x6842.4x3 BLTMR2974.2x16803x x761.55x5255.4x3 TMR Scrub3485.0x16803x x821.67x x0 SEC/DED3404.9x7701.4x x821.67x7114.1x16 SEC/DED DWC3735.3x x x891.82x4736.1x3 SEC/DED DWC Scrub x x x x3268.8x0 CD DWC2353.4x22404x x721.47x x2 CD DWC Scrub3955.6x22404x x901.84x x0 Clock and reset lines are NOT triplicated

Conclusions Reliability  For instruction memories, TMR with scrubbing provides the best protection Fewest sensitivities Eliminates critical failures  Scrubbing is required to eliminate critical failures Costs  TMR is more effective than SEC/DED and CD with DWC Better protection Lower area, speed, and power costs  SEC/DED and CD with DWC scrubbers are very expensive 19

FT Softcore Processors: Moving Forward Next General Studies:  Memory Study: BRAMs & LUTRAMs  Software fault-tolerant techniques study Create different fault models for SEAKR board  Multi-bit upset model  Temporal fault-tolerant techniques model Combinations of different fault-tolerant techniques 20 Memory Stack Reg File ALU IR PC CC Control Logic Control Flow Monitoring Checkpointing PicoBlaze Memory Stack Reg File ALU IR PC CC Control Logic SEC/DED DWC & Scrubbing TMR & Scrubbing PicoBlaze