Complex Upset Mitigation Applied to a Re-Configurable Embedded Processor
EEL 6935
Lu Hao, Wenqian Wu
Outline
– Issues of SRAM-based FPGAs used for space applications
– Upset mitigation solutions
– Resource usage and performance analysis
– Summary
System on Programmable Chip
A soft-core processor implemented in an SRAM-based FPGA is very attractive to spacecraft designers: a complete computer system can be created on a single FPGA chip.
MicroBlaze core
MicroBlaze is a soft processor core designed for Xilinx FPGAs. Many aspects of the MicroBlaze can be configured by the user: cache size, pipeline depth (3-stage or 5-stage), embedded peripherals, memory management unit, and bus interfaces.
[Figure labels: Local memory bus; On-chip peripheral bus]
Space application issues
Radiation environment
– In space, high-energy ionizing particles exist as part of the natural background, in addition to solar particle events and high-energy protons trapped in the Earth's magnetosphere (Van Allen radiation belts).
– This radiation poses a potential threat to electronic devices.
Single Event Upset (SEU)
– An SEU is a change of state caused by ions or electromagnetic radiation striking a sensitive node in a microelectronic device, such as a microprocessor, semiconductor memory, or power transistor.
– The state change results from the free charge created by ionization in or close to an important node of a logic element (e.g. a memory "bit").
FPGAs are susceptible to SEUs in:
– data/instructions stored in block memory
– configuration bits stored in distributed RAM
Upset mitigation is one of the key issues in SRAM-based FPGA design for space applications.
Proposed upset mitigation
To ensure reliable space applications based on SRAM FPGAs, the authors investigate three levels of upset mitigation:
– Functional-block design triplication
– Continuous external configuration scrubbing
– Independent internal BRAM scrubbing (also triplicated)
Tool, device and environment
– Tool: Xilinx TMR Tool, which makes it easy to trade off maximum radiation-effect immunity against area, pinout, and board layout considerations
– Device: Xilinx Virtex-II XQR2V6000 FPGA
– Program running on the MicroBlaze: integer-based FFT
– Test environment: Crocker Nuclear Laboratory at the University of California, Davis, using a 63.3 MeV proton beam
– Test board: two FPGAs, one the device under test (DUT), the other the service FPGA
DUT and Service FPGA
The service FPGA performs two functions:
1) configuration readback, and scrubbing of the DUT when there is a readback error
2) control and monitoring of the functional operation of the MicroBlaze running the FFT program
The program (FFT) is stored in internal BRAM each time the DUT is configured. Data is sent to the DUT's internal BRAM by the service FPGA. The results of the FFT program are returned to the service FPGA and compared with the expected results.
[Figure labels: Service FPGA; DUT; uBlaze; BRAM]
TMR
Triple Module Redundancy: three modules perform the same task, and the voter passes only the majority value to the output. If any one of the three modules fails, the other two mask the fault. If the voter fails, the complete system fails; the voter is therefore a critical component and, in a good TMR system, should be much more reliable than the other components.
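The majority vote described above can be sketched in a few lines of Python. This is only an illustration of the 2-of-3 principle on integer words; the function name `vote` and the sample values are made up for the example and do not come from the slides or from any Xilinx tool.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant words.

    Each output bit is 1 exactly when at least two of the inputs have a 1
    in that position, so a fault in any single module is masked.
    """
    return (a & b) | (a & c) | (b & c)


good = 0b10110010            # value produced by the two healthy modules
upset = good ^ 0b00001000    # third module with one bit flipped

masked = vote(good, upset, good)        # single fault: voter returns `good`
double_fault = vote(good, upset, upset)  # two agreeing faults win the vote
```

A single faulty module is outvoted, but two modules that fail in the same way are not, which is one reason the deck combines triplication with scrubbing instead of relying on the voter alone.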
External Configuration Scrubbing
Configuration scrubbing is the process of rewriting the configuration memory of an FPGA to correct any errors that may have accumulated since the device was last configured.
– The service FPGA detects readback errors and scrubs the configuration by reloading the bitstream to correct upsets.
– The process is transparent: normal device operation runs concurrently and without interruption.
– Configuration scrubbing frequency: 16 MHz, i.e. 4 scrub cycles per second
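As a rough software analogy of the readback-and-scrub loop, the sketch below models configuration memory as a list of frame words and compares it with a stored golden copy. The frame granularity, the names `readback_errors` and `scrub_cycle`, and the frame values are all assumptions made for illustration; the real service FPGA works on the device's configuration interface, not on Python lists.

```python
def readback_errors(config, golden):
    """Return the indices of frames whose readback differs from the golden copy."""
    return [i for i, (got, want) in enumerate(zip(config, golden)) if got != want]


def scrub_cycle(config, golden):
    """One scrub cycle: read back, and reload the bitstream if anything differs."""
    bad = readback_errors(config, golden)
    if bad:
        # Reload the stored bitstream over the live configuration, in place.
        config[:] = golden
    return bad


golden = [0xA5A5, 0x0F0F, 0x3C3C, 0xFFFF]  # made-up frame contents
config = list(golden)
config[2] ^= 0x0010                         # inject a single-bit upset

first = scrub_cycle(config, golden)   # detects the upset in frame 2 and repairs it
second = scrub_cycle(config, golden)  # nothing left to correct
```

The sketch only captures the detect-then-rewrite logic; it does not model the fact that the design keeps running while the scrub takes place.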
BRAM Triplication
– Port A: used by the MicroBlaze processor
– Port B: connected to the counter; used for error detection and correction
BRAM Triplication
TMR counter
– Allows continuous refreshing of the BRAM contents
– Cycles through the memory addresses by incrementing the BRAM address on the second port
– When the first port of the BRAM is not in use, rewrites the BRAM content at the current address with the voted value from the associated voter (TRV16)
BRAM
– Conventional BRAM
Associated voter (TRV16)
– Compares the three values at the same address of the three BRAMs, selects the majority, and writes it back to the corresponding address
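The interaction of counter, voter, and write-back can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the actual hardware: the class name `ScrubbedBram`, the `port_a_busy` flag, and the choice to advance the counter even when the write-back is skipped are all hypothetical.

```python
def vote(a, b, c):
    """Bitwise 2-of-3 majority, standing in for the TRV16 voter."""
    return (a & b) | (a & c) | (b & c)


class ScrubbedBram:
    """Three copies of one memory plus a port-B address counter (behavioural model)."""

    def __init__(self, contents):
        self.copies = [list(contents) for _ in range(3)]
        self.addr = 0  # address currently driven by the counter on port B

    def scrub_step(self, port_a_busy=False):
        """Vote on the current address; write back only if port A is idle."""
        a, b, c = (copy[self.addr] for copy in self.copies)
        voted = vote(a, b, c)
        if not port_a_busy:
            for copy in self.copies:
                copy[self.addr] = voted
        # Assumption: the counter advances whether or not the write happened.
        self.addr = (self.addr + 1) % len(self.copies[0])
        return voted


program = [0x10, 0x20, 0x30, 0x40]   # made-up memory contents
bram = ScrubbedBram(program)
bram.copies[1][2] ^= 0x04            # upset one bit in one copy
for _ in range(len(program)):        # one full pass of the counter
    bram.scrub_step()
```

After one full pass over the address range, the upset copy has been overwritten with the voted value, so single upsets do not accumulate into a double fault at the same address.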
Testing
Two mitigated versions of the MicroBlaze design have been implemented and tested:
– with the BRAM scrubber
– without the BRAM scrubber
Error types:
– Type 1 errors: FFT outputs were wrong.
  – Type 1a: corrected after a configuration scrub cycle
  – Type 1b: not corrected after a scrub cycle, even after a reset of the DUT design
– Type 2 errors: nonresponsiveness of the DUT, requiring a reset and synchronization
  – Type 2a: corrected by scrubbing, and hence referred to as a recovering reset
  – Type 2b: not corrected by scrubbing, and referred to as a runaway reset. A runaway reset is an uncorrected error condition that causes the functional monitor to attempt to reset the MicroBlaze processor every time the watchdog timer set for the handshaking between the two FPGAs reaches its limit value. This is the error type we focus on.
– Type 3 errors: occurrence of an exception or interrupt detection
Is BRAM code corruption the main cause of runaway resets?
[Panels: (No BRAM scrubber); (BRAM scrubber)]
Standalone test
To check whether BRAM code corruption is likely to be the cause of these runaway resets, the BRAM mitigation design was implemented in standalone mode and tested under proton beams at similar fluxes and at the same facility.
Runaway Resets Caused by BRAM Corruption
At a flux of 1.70×10^8, at least 17% (1.21× /6.82× ) of the runaway resets are due to errors in the BRAM code, while at a flux of 1.70×10^9, 23% of them are caused by code corruption.
Exceptions Caused by BRAM Runaway Resets
– Design 1: an average of 64% of the unrecovered resets (due to BRAM code corruption) were detected by exceptions (64% at flux 1 and 80% at flux 2).
– Design 2: exceptions were observed only after an increase of two orders of magnitude of the flux (1.70×10^9), and only 25% of the runaway resets were detected.
– Not all illegal states are detected by the exception mechanism. At the lower flux (1.70×10^8), although seven resets were observed, no exceptions were detected.
– The MicroBlaze was optimized to fit in Xilinx FPGAs, and its exception circuitry was designed to detect only major illegal operations.
Conclusion
Issues of SRAM-based FPGAs used for space applications
– Single Event Upsets (SEUs) can be caused by the radiation environment
– A fault-tolerant system is therefore needed
Complete upset mitigation solution implemented on a Xilinx Virtex-II FPGA
– Continuous external configuration scrubbing
– Functional-block design triplication
– Independent internal BRAM scrubbing (also triplicated)
Testing results
– BRAM code corruption is the main cause of runaway resets