P189/MAPLD2004: A Triple Module Redundancy Scheme for SEU Mitigation of Static Latch-Based FPGAs

Presentation transcript:

P189/MAPLD2004Carmichael
A Triple Module Redundancy Scheme for SEU Mitigation of Static Latch-Based FPGAs
Carl Carmichael (1), Brendan Bridgford (1), Gary Swift (2), Matt Napier (3)
(1) Xilinx Corporation, San Jose CA
(2) Jet Propulsion Laboratory, Pasadena CA
(3) Sandia National Laboratories, Albuquerque NM
"This work was carried out in part by the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration."
"Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology."

ABSTRACT
“Xilinx Triple Module Redundancy,” or XTMR, is an SEU mitigation technique and design methodology intended to remove all single points of failure within the configuration control cells and user logic elements, including those in the voting circuitry, and to prevent the propagation of single event transients, by “triplicating” all inputs, outputs, logic, clock domains and voters. Voters are also inserted on all state logic feedback paths, conferring full SEU and SET immunity while allowing for autonomous re-synchronization of just-reconfigured state logic to the redundant domains. This paper presents the fundamental philosophy of the XTMR method, the automated implementation of XTMR provided by the new release of the “Xilinx TMRTool”, and Single Event Effects testing and analysis of the combined SEU mitigation technique of XTMR and autonomous partial re-configuration (scrubbing). The SEE test analysis demonstrates that this combined SEU mitigation technique pushes the cross-section for functional error for any design in any orbit to at least one order of magnitude below the established cross-sections for device level Single Event Functional Interrupts (SEFI). This study has the potential to relieve many users of the requirement to perform independent SEE testing on individual design implementations.

XTMR SEU Mitigation
Xilinx Triple Module Redundancy (XTMR)
– Single point failures are eliminated by triplication of every logic node (gates & nets).
– XTMR confers SEU and SET immunity.
– XTMR does not protect against SEFIs!
– Any digital design can be XTMRed by:
  – “Triplication” of throughput (combinational & sequential) logic
  – “Triplication” of feedback logic and inserting majority voters
  – Adding redundant IO (outputs with minority voters)
  – Design cleanup (removing half-latches, SRL16s, etc.)

XTMR State-Machines
[Figure: state machine shown “Pre-TMR” and “Post-XTMR”]
XTMR provides autonomous re-synchronization of the separate redundant domains of a state machine by inserting majority voters at the origin of any registered feedback (“looped”) path. When a configuration upset disables one domain, the other two domains continue to operate, providing a correct majority representation of state data and functionality. When “scrubbing” fixes the configuration of the upset domain, the embedded redundant voters automatically correct the state of the upset domain without any external intervention. As long as the scrub rate is greater than the upset rate, upsets do not accumulate, so a single bit upset cannot disturb more than one redundant domain.
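The re-synchronization behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the state machine is a 4-bit counter, the upset domain is modelled as a register stuck at zero, and the names `majority`, `step` and `broken` are hypothetical rather than anything taken from the TMRTool.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def step(states, broken=None, width=4):
    """Advance three redundant counter domains by one clock.

    Each domain votes on all three registered states before computing
    its next state, so the voter sits on the feedback path.
    `broken` is the index of a domain whose logic is upset (assumed
    here to force its register to 0), or None if all domains are good.
    """
    mask = (1 << width) - 1
    nxt = []
    for domain in range(3):
        voted = majority(*states)      # each domain has its own voter
        value = (voted + 1) & mask     # next-state logic: increment
        if domain == broken:
            value = 0                  # hypothetical failure mode
        nxt.append(value)
    return nxt


states = [0, 0, 0]
for _ in range(3):                     # normal operation
    states = step(states)
for _ in range(3):                     # domain 1 upset, not yet scrubbed
    states = step(states, broken=1)
voted_during_upset = majority(*states)  # still the correct count
states = step(states)                  # scrub repaired domain 1
```

After the final step all three domains hold the same value again, even though nothing reloaded the repaired domain's register explicitly: the voted feedback did it.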

XTMR Inputs
– Effective SEU mitigation requires the use of triple redundant input pins for every input signal.
– Not triplicating global input signals (clk, rst, etc.) can seriously compromise SEU resistance.
– Triplication of input data paths can be traded for EDAC.
– SEU resistance is sometimes a trade-off for resource utilization.

XTMR Outputs with Minority Voters
– Outputs can be triplicated, using three pins for each output signal.
– Minority voters monitor each of the triplicated design modules.
– If one module is different from the others, its output pin is driven to High-Z.
– Voters are triplicated.
[Figure: domains TR0, TR1 and TR2, each driving an output pin (P) through its own minority voter. The convergence point is outside the FPGA, at the trace.]
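A minimal sketch of the minority-voter idea for a single-bit output follows. It assumes that High-Z can be modelled as `None` and that the board trace simply takes the value of whichever pins are still driving; the function names are hypothetical.

```python
def in_minority(own, other1, other2):
    """True when this domain disagrees with both of the other domains."""
    return own != other1 and own != other2


def drive_pins(tr0, tr1, tr2):
    """Return the three pin values; None models a pin driven to High-Z."""
    values = (tr0, tr1, tr2)
    pins = []
    for i in range(3):
        others = values[(i + 1) % 3], values[(i + 2) % 3]
        pins.append(None if in_minority(values[i], *others) else values[i])
    return pins


def trace_value(pins):
    """Value seen at the external convergence point (the trace)."""
    driven = {p for p in pins if p is not None}
    if len(driven) != 1:
        raise ValueError("no single driven value on the trace")
    return driven.pop()


pins = drive_pins(1, 0, 1)   # domain TR1 is upset
level = trace_value(pins)
```

Because the disagreeing pin releases the trace instead of fighting the other two, the remaining drivers set the level without contention.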

Xilinx TMRTool
The Xilinx TMRTool is a graphical application that automates the implementation of XTMR for FPGA designs. The designer has the flexibility to apply XTMR selectively at the instance, component, and hierarchical levels. Custom mitigation methods may be employed for specific portions of the design through user-created library macros. Designs are imported from a Xilinx netlist (NGO/NGC) and exported as a single standard EDIF project source.

XTMR SEE Testing
Validation of mitigation of architectural resources by superposition.
– Separate experiments were created to cover the major elements of the Virtex-II architecture:
  Configurable Logic Block
  – Combinatorial logic, sequential logic, arithmetic, multiplexing.
  – Design implementation is an array of state machines.
  Multipliers
  – Dedicated 18 x 18 bit multiply function blocks.
  – Design implementation is an array of Multiply and Accumulate functions.
  Block Memories
  – Synchronous dual port 18k bit RAM blocks.
  – Design implemented as a single large memory space for high speed store and fetch functions.
  Input Output Blocks
  – Multi-standard discrete & bi-directional un/registered device IO.
  – Design implemented as feed-thru channels from IOB to IOB.
  Digital Clock Managers
  – Clock frequency synthesis and phase delay realignment.
  – This will be tested in future work.

2V6000 Dynamic SEU Test
[Figure: test setup inside the target room. Labels: BEAM, Thinned DUT, Front Side, Back Side, Configuration Monitor/Strip Chart, Functional Monitor/Strip Chart.]

CLB Test Design
[Figure: block diagram. Labels: DUT, MODULE, mod0 to mod15, MUX 32x1x (twice), Configuration Manager Core, SelectMAP, SERVICE, Functional Monitor, FSM mod, Error Counters.]

CLB Test Functional Description
The CLB test “pre-TMR” design consists of 512 (32 bit) counters created as 16 modules of 32 counters per module. Each counter in the module increments by a different value. The output of each module is a multiplex of the 32 counters. The outputs of all the modules are again multiplexed to a single 16 bit bus. A 10 bit address bus is used to select the output of a specific counter and to select between the upper and lower 16 bit banks of the 32 bit module outputs. The Xilinx TMRTool software is used to process the design into a fully XTMR mitigated design. Both the TMR and pre-TMR designs undergo active scrubbing (partial reconfiguration for SEU correction) of the DUT configuration. All counters run continuously. Each counter is selected sequentially for sampling of its current state and operation. For each module sample taken, the actual and expected values are recorded, along with the sequential count of state errors and the running count of event errors, into a strip chart file on the host PC. When counters are observed to be permanently in the wrong state, the design is reset to regain a fully functioning test. The final error count is calculated as the number of events in which a counter either lost its state or moved to the wrong state.
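The addressing arithmetic above works out as 4 bits for 16 modules, 5 bits for 32 counters and 1 bit for the bank, which totals the stated 10 bits. The sketch below shows one way such an address could be decoded. The bit ordering (module in the top bits, bank in the lowest bit) is purely an assumption for illustration; the slides do not give the actual layout.

```python
def decode_address(addr):
    """Split a 10-bit address into (module, counter, bank).

    Assumed layout, not taken from the source:
    bits 9..6 = module, bits 5..1 = counter, bit 0 = bank.
    """
    bank = addr & 0x1
    counter = (addr >> 1) & 0x1F
    module = (addr >> 6) & 0xF
    return module, counter, bank


def sample(counters, addr):
    """Return the 16-bit word the output bus would show for `addr`."""
    module, counter, bank = decode_address(addr)
    value = counters[module][counter] & 0xFFFFFFFF
    return (value >> 16) & 0xFFFF if bank else value & 0xFFFF


# 16 modules of 32 counters, all zero except one example value.
counters = [[0] * 32 for _ in range(16)]
counters[3][5] = 0x12345678
addr_upper = (3 << 6) | (5 << 1) | 1
upper_word = sample(counters, addr_upper)
lower_word = sample(counters, addr_upper & ~1)
```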

Multiplier Test Design
[Figure: block diagram. Labels: DUT, MODULE, mod0 to mod15, MAC blocks (+1 x1, +1 x10, +1 x11), + x Constant, 36, MUX 3x2x, MUX 32x1x, Configuration Manager Core, SelectMAP, SERVICE, Functional Monitor, FSM mod, Error Counters.]

Multiplier Test Functional Description
The Multiplier test “pre-TMR” design consists of 48 (18x18x36 bit) Multiply and Accumulate (MAC) blocks created as 16 modules of 3 MACs per module. Each MAC in the module increments by 1 and multiplies by a different constant (1, 10, and 11, respectively). The output of each module is a multiplex of the 3 MACs and a select of the lower 32 bits and upper 4 bits of the 36 bit registered multiplier output. The outputs of all the modules are again multiplexed to a single 16 bit bus. An 8 bit address bus is used to select the output of a specific MAC and to select between the upper and lower 16 bit banks of the 32 bit module outputs. The Xilinx TMRTool software is used to process the design into a fully TMR mitigated design. Both the TMR and pre-TMR designs undergo active scrubbing (partial reconfiguration for SEU correction) of the DUT configuration. All MACs accumulate constantly. Each MAC is selected sequentially for a periodic sampling of its sequence. For each module sample taken, the actual and expected values are recorded, along with the sequential count of state errors and the running count of event errors, into a strip chart file on the host PC. When MACs are observed to be permanently in the wrong state, the design is reset to regain a fully functioning test. The final error count is calculated as the number of events in which a MAC lost its state or produced an incorrect result.

BRAM Test Design
[Figure: block diagram. Labels: DUT, k byte RAM, ADDRESS, DATA, 16, Configuration Manager Core, SelectMAP, SERVICE, Functional Monitor, FSM, Error Counters.]

BRAM Test Functional Description
The Block Memory test “pre-TMR” design consists of a single large 128k byte single port memory space created from 64 memory blocks of 16k bits each. The Xilinx TMRTool software is used to process the design into a fully TMR mitigated design. Both the TMR and pre-TMR designs undergo active scrubbing (partial reconfiguration for SEU correction) of the DUT configuration. Separate WRITE and READ routines are executed over all memory address locations. The data is derived from a decrement of the address value. The entire memory space is refreshed with a write operation and then the data is retrieved with a read operation. During the read operation the retrieved data is compared against the expected value. For each data sample taken, the actual and expected values are recorded with the running count of event errors into a strip chart file on the host PC. Each error event is measured for its total word error size in bits: 1, 32, 64, 512, 1024, etc. The final error count is calculated as the number of separate events of word errors.
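The write-then-read check described above can be sketched as follows. Several details are assumptions made only for illustration: a 16-bit data word, "decrement of the address" read as (address - 1) wrapped to the word width, and a small 1024-word list standing in for the memory.

```python
WIDTH = 16                      # assumed data word width
MASK = (1 << WIDTH) - 1


def expected(addr):
    """Test pattern: the address value decremented by one, wrapped to WIDTH bits."""
    return (addr - 1) & MASK


def write_pass(mem):
    """Refresh the whole memory space with the test pattern."""
    for addr in range(len(mem)):
        mem[addr] = expected(addr)


def read_pass(mem):
    """Compare every word with its expected value.

    Returns a list of (addr, actual, expected, bad_bits) for mismatches.
    """
    errors = []
    for addr, actual in enumerate(mem):
        exp = expected(addr)
        if actual != exp:
            bad_bits = bin(actual ^ exp).count("1")
            errors.append((addr, actual, exp, bad_bits))
    return errors


mem = [0] * 1024                # small stand-in for the 128k byte space
write_pass(mem)
clean = read_pass(mem)
mem[10] ^= 0b101                # inject a two-bit upset for illustration
errors = read_pass(mem)
```

Summing `bad_bits` over one event would give the kind of per-event error size in bits that the slide mentions.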

Configuration Error Detection and Correction Algorithm
[Flowchart labels: START, CONFIGURE, DONE, READBACK, CONFIG CRC, CHECK PORT, SEFI, SCRUB, READBACK, SCRUB CRC, decision CONFIG = SCRUB (YES/NO), CRC ERROR = 0, CRC ERROR +1, decision CRC ERROR = 2 (YES/NO), decision PREV = SCRUB (YES/NO), PREV CRC.]
– Configure target FPGA with configuration data stored in the configuration PROM(s).
– Read back configuration programming data from target FPGA and calculate 16 bit CRC. Store CRC value as “Config-CRC”.
– Perform a Write/Read check on the internal Frame Address Register of target FPGA.
– Scrub (background refresh) configuration data of target FPGA.
– Read back configuration programming data from target FPGA and calculate 16 bit CRC. Store CRC value as “Rdbk-CRC” and perform bit-for-bit error detection of configuration data.
– Compare “Rdbk-CRC” with “Config-CRC”.
– If CRC values mismatch a second time then assert SEFI_ERROR and RECONFIGURE.
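The listed steps can be sketched as a small control loop. Everything device-related here is hypothetical: `FakeFpga` is an in-memory stand-in rather than a real configuration API, `binascii.crc_hqx` (a CRC-CCITT variant) is used only as a convenient 16-bit CRC and is not claimed to be the polynomial the hardware uses, and the flowchart's PREV CRC branch is not modelled.

```python
import binascii


def crc16(data):
    """16-bit CRC stand-in (CRC-CCITT via binascii); polynomial is an assumption."""
    return binascii.crc_hqx(data, 0xFFFF)


class FakeFpga:
    """Hypothetical in-memory model of the target device."""

    def __init__(self, golden):
        self.golden = bytes(golden)        # image held in the PROM
        self.frames = bytearray(len(golden))
        self.scrub_works = True            # set False to model a SEFI
        self.port_ok = True                # Frame Address Register check

    def configure(self):
        self.frames = bytearray(self.golden)
        self.scrub_works = True

    def readback(self):
        return bytes(self.frames)

    def port_check(self):
        return self.port_ok

    def scrub(self):
        if self.scrub_works:
            self.frames[:] = self.golden

    def upset(self, bit):
        self.frames[bit // 8] ^= 1 << (bit % 8)


def scrub_cycle(dev, config_crc, crc_errors):
    """One pass of the loop. Returns (crc_errors, sefi_detected)."""
    if not dev.port_check():
        return crc_errors, True
    dev.scrub()
    if crc16(dev.readback()) == config_crc:
        return 0, False                    # match: clear the error count
    crc_errors += 1
    return crc_errors, crc_errors >= 2     # second mismatch: SEFI


dev = FakeFpga(bytes(range(64)))
dev.configure()
config_crc = crc16(dev.readback())

dev.upset(5)                               # ordinary upset, scrub repairs it
after_repair = scrub_cycle(dev, config_crc, 0)

dev.scrub_works = False                    # model a functional interrupt
dev.upset(7)
first_miss = scrub_cycle(dev, config_crc, 0)
second_miss = scrub_cycle(dev, config_crc, first_miss[0])
if second_miss[1]:
    dev.configure()                        # RECONFIGURE after SEFI_ERROR
```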

Previous SEE Test Methodology for Mitigation
The premise of the combined mitigation method of XTMR and scrubbing is that completely removing Single Event Functional Errors in the user logic leaves any user design with an overall error rate determined by the remaining Single Event Functional Interrupts. Therefore, a successful mitigation test is expected to produce zero errors other than SEFIs. Since the effectiveness of TMR depends on errors not accumulating in the configuration, earlier experiments attempted to maintain an upset rate that did not exceed the scrub rate. This methodology had two significant flaws:
– Testing at such low fluxes is impractical: it requires unreasonably long run times and so cannot reach sufficient fluence for acceptable statistical significance of the data.
– A zero error rate result is not useful for making any calculations or extrapolations.
These issues raise concerns over the validity of any results.

Improved SEE Test Methodology for Mitigation
The functional error rate of a mitigated system is expected to have a physical relationship to the upset rate. The expected relationship predicts the increasing probability of upsetting bit combinations that cause a mitigated (TMR) system to fail, as a function of bit upset rate:
$\mathrm{MER} = \frac{1}{2}\,\frac{N_B\,C_A}{T_S}\,R_U^{2}$
– MER = Mitigation Error Rate
– $N_B$ = Number of Relevant Bits
– $C_A$ = Average Cluster Size
– $T_S$ = Scrub Time
– $R_U$ = Upset Rate of Relevant Bits
Therefore, testing at extremely high fluxes, varied over several orders of magnitude, can reveal this functional relationship between mitigation error rate and bit upset rate. This function can then be extrapolated to make predictions at the much lower upset rates of earth orbits.
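The practical consequence of the squared term is that every decade of reduction in upset rate should lower the mitigation error rate by two decades. The sketch below evaluates the formula exactly as written on the slide; the parameter values are hypothetical placeholders, not measured quantities.

```python
import math


def mitigation_error_rate(n_b, c_a, t_s, r_u):
    """MER = (1/2) * (N_B * C_A / T_S) * R_U**2, as written on the slide."""
    return 0.5 * (n_b * c_a / t_s) * r_u ** 2


# Hypothetical values chosen only to show the scaling.
params = dict(n_b=1.0e6, c_a=2.0, t_s=0.5)
mer_high = mitigation_error_rate(r_u=1.0e-2, **params)
mer_low = mitigation_error_rate(r_u=1.0e-3, **params)
ratio = mer_high / mer_low      # a 10x lower upset rate gives ~100x lower MER
decades = math.log10(ratio)
```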

Plot Definitions
– Predicted SEFI cross-section: Static and dynamic SEE characterization of the Virtex-II FPGA revealed several Single Event Functional Interrupt modes: POR (2.5E-06), SMAP (1.72E-06), IOB (4.2E-06). These combined cross-sections represent the minimum functional error cross-section for a single Virtex-II (XQR2V6000) device on orbit.
– Worst Case Orbital Upset Rate: The CREME96 calculation of the worst case orbital upset rate for an XQR2V6000 is 7,740 bit-errors/day (9E-02 bit-errors/sec) in a GEO orbit at 36,000 km during the worst day of an Anomalously Large Solar Flare, accounting for both heavy ions and protons. In a 40 MeV Kr beam the same upset rate is achieved with a flux of 1.25E-01 p/cm²/s. The equivalent upset rates for all other orbits and solar conditions therefore lie to the LEFT of this line.
– Single Event Functional Interrupts: The average cross-section of the SEFI(s) observed while collecting the data represented in the plot. This cross-section is not flux dependent. Variations from the predicted value reflect the statistical significance of the total fluence accumulated during each test.
– Functional Errors: Data plot of the observed events in which the Device Under Test returned an incorrect result. The cross-section is the number of error events divided by the total fluence at the specified flux. TMR denotes that the DUT design was fully mitigated with XTMR and scrubbing. The unmitigated results were obtained with a functionally identical design without XTMR; scrubbing was also used for the unmitigated test.
– Extrapolation: A derived function describing mitigation failure as a function of upset rate. Extending the function predicts functional error cross-sections at worst case orbital upset rates to be less than the SEFI cross-sections.
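Two of the figures quoted above can be checked with simple arithmetic. The sketch assumes that "combined" means a plain sum of the three SEFI mode cross-sections and that they are in cm² per device; neither is stated explicitly on the slide.

```python
import math

# SEFI mode cross-sections as quoted on the slide (units assumed cm^2/device).
SEFI_MODES = {"POR": 2.5e-6, "SMAP": 1.72e-6, "IOB": 4.2e-6}
sefi_total = sum(SEFI_MODES.values())            # about 8.42e-6

# Worst case orbital upset rate: bit-errors/day converted to bit-errors/sec.
SECONDS_PER_DAY = 24 * 60 * 60
upsets_per_day = 7740
upsets_per_sec = upsets_per_day / SECONDS_PER_DAY  # about 0.0896

print(f"combined SEFI cross-section: {sefi_total:.2e}")
print(f"worst case upset rate: {upsets_per_sec:.2e} bit-errors/sec")
```

The second result rounds to the 9E-02 bit-errors/sec quoted on the slide.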

PLOT 1
[Figure: cross-section plot. Beam: 40 MeV Kr, LET = 22.3 MeV·cm²/mg. Axis: Configuration Bit Errors per Scrub Cycle, 3.5E-02 to 3.5E+03. Marked line: 36,000 km GEO orbit, worst day solar flare, 8,000 bit-errors/day, with all other orbits to its left. Annotation: SEFIs drive error rate for all designs and all orbits. Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.]

PLOT 2
[Figure: cross-section plot. Beam: 40 MeV Kr, LET = 22.3 MeV·cm²/mg. Axis: Configuration Bit Errors per Scrub Cycle, 3.5E-02 to 3.5E+03. Marked line: 36,000 km GEO orbit, worst day solar flare, 8,000 bit-errors/day, with all other orbits to its left. Annotation: SEFIs drive error rate for all designs and all orbits. Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.]

PLOT 3
[Figure: cross-section plot. Beam: 40 MeV Kr, LET = 22.3 MeV·cm²/mg. Axis: Configuration Bit Errors per Scrub Cycle, 3.5E-02 to 3.5E+03. Marked line: 36,000 km GEO orbit, worst day solar flare, 8,000 bit-errors/day, with all other orbits to its left. Annotation: SEFIs drive error rate for all designs and all orbits. Mitigation errors on orbit are always less than SEFI errors by orders of magnitude.]

SEE Test Analysis
The experiments were conducted over a flux range of 7E+00 to 4E+04 p/cm²/s. The flux rates have been normalized in the secondary (top) x-axis of the plots to “average bit upsets per scrub cycle” ($R_S$). Each experiment demonstrated a drop in failure cross-section over several orders of magnitude, crossing the SEFI cross-section at upset rates that are still several orders of magnitude above worst case orbital upset rates. Extrapolating this data for each experiment clearly demonstrates a mitigation error cross-section at least one order of magnitude below the SEFI cross-section at worst case orbital upset rates. By superposition of the data fit functions, the total effective mitigated error rate cross-section is
$\sigma_{\mathrm{TOTAL}} = \sigma_{\mathrm{BRAM}} + \sigma_{\mathrm{CLB}} + \sigma_{\mathrm{MULT}} + \sigma_{\mathrm{SEFI}}$
$\sigma_{\mathrm{TOTAL}} = 5.0\mathrm{E}{-8}\,(1.4\,R_S)^{2} + 5.0\mathrm{E}{-6}\,(0.7\,R_S)^{0.5} + [\ldots]\mathrm{E}{-6}\,(1.4\,R_S)^{0.35} + [\ldots]\mathrm{E}{-6}\ \ (\mathrm{cm}^2)$
Therefore, at the worst case orbital upset rate of 9E-2 upsets/sec ($R_S$ = 4.5E-2 upsets/scrub), the effective total cross-section for functional error is calculated as:
$\sigma_{\mathrm{TOTAL}} = 1.05\mathrm{E}{-5}$ (cm²/device) {Orbital Worst Case}
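As a rough check on how little the design-dependent terms contribute, the two fit terms whose coefficients are fully legible in this transcript can be evaluated at the worst case $R_S$. Reading them as the BRAM and CLB terms follows the order of the sum and is an assumption; the remaining two terms are left out because their coefficients are not legible here.

```python
import math

R_S = 4.5e-2                                  # upsets per scrub cycle, worst case

# Fit terms with legible coefficients (assumed to be BRAM and CLB, in order).
sigma_bram = 5.0e-8 * (1.4 * R_S) ** 2        # about 1.98e-10 cm^2
sigma_clb = 5.0e-6 * (0.7 * R_S) ** 0.5       # about 8.87e-7 cm^2

SIGMA_TOTAL_QUOTED = 1.05e-5                  # cm^2/device, from the slide
share = (sigma_bram + sigma_clb) / SIGMA_TOTAL_QUOTED

print(f"BRAM term: {sigma_bram:.3e} cm^2")
print(f"CLB term:  {sigma_clb:.3e} cm^2")
print(f"share of quoted total: {share:.1%}")
```

Under these assumptions the two terms together account for well under a tenth of the quoted total.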

Conclusions
– The efficiency and accuracy of validating mitigation techniques are greatly improved by testing at flux rates that overwhelm the mitigation, which demonstrates the upset rate dependency of the mitigation method.
– The static SEFI cross-section is the dominating factor for calculating orbital error rates for any Virtex-II design when mitigated with full XTMR and scrubbing.
– Future Work: The authors recognize an anomaly in the data fit functions in that they were not all expressed as a square function. It is anticipated that this is due to the complexity of the bit clusters of the experimental designs. Additional research is called for to derive the separate coefficients of the MER equation for each design and explain their functional associations.