Radiation Tolerance Studies using Fault Injection on the Readout Control FPGA Design of the ALICE TPC Detector Johan Alme Bergen University College, Norway.


1 Radiation Tolerance Studies using Fault Injection on the Readout Control FPGA Design of the ALICE TPC Detector
Johan Alme, Bergen University College, Norway, on behalf of the ALICE-TPC collaboration
TWEPP 2012, Oxford, UK, 17-21 September 2012

2 Overview
System overview
Fault injection
Purpose of the fault injection test
Test setup and test flow
Results
What did we learn?
TWEPP 2012, Johan Alme

3 ALICE detector
(Figure labels: ALICE detector, TPC detector)

4 ALICE TPC Readout Electronics
216 Readout Control Units (RCUs)
One RCU:
– RCU Motherboard
– Detector Control System (DCS) Board
– Source Interface Unit (SIU) card
– 2 branches of backplanes with up to 25 Front End Cards
Commercial FPGAs used
– The radiation environment is a concern

5 RCU main FPGA
The RCU main FPGA sits in the datapath
Data readout is handled by the Readout Node
– 92% CLBs
– 75% BRAM blocks (the remaining 25% of the BRAM cannot be used because of the Active Partial Reconfiguration)
– Result: TMR and other mitigation techniques are hardly applicable
(Figure labels: Readout Node, Control Node)

6 Reconfiguration Network
Consists of:
– A radiation tolerant flash memory, a radiation tolerant flash based FPGA and the DCS board (an embedded PC with Linux)
Corrects SEUs in the Xilinx Virtex-II Pro VP7
Why it works:
– Active Partial Reconfiguration
How it works:
– The RCU support FPGA reads one frame at a time from the flash memory and from the Xilinx configuration memory.
– The frames are compared bit by bit.
– If a difference is found, the faulty frame is overwritten.
Keyword: Flexibility
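
The compare-and-overwrite cycle described on this slide can be sketched in a few lines of Python. This is only an illustrative model: the function name `scrub`, the list-of-frames representation and the 4-byte frame size are assumptions made for the example, not details of the RCU support FPGA firmware.

```python
from typing import List


def scrub(golden: List[bytes], config: List[bytearray]) -> List[int]:
    """Compare each configuration frame with its reference copy and repair mismatches.

    Returns the indices of the frames that had to be overwritten.
    """
    repaired = []
    for index, (reference, frame) in enumerate(zip(golden, config)):
        # XOR is non-zero at every byte where the two frames differ in some bit.
        if any(a ^ b for a, b in zip(reference, frame)):
            frame[:] = reference  # overwrite the faulty frame in place
            repaired.append(index)
    return repaired


# Toy example (hypothetical sizes): three 4-byte frames, one bit upset in frame 1.
golden_frames = [bytes([0x00, 0xFF, 0x0F, 0xF0])] * 3
live_frames = [bytearray(frame) for frame in golden_frames]
live_frames[1][2] ^= 0x08  # simulate a single event upset
repaired_frames = scrub(golden_frames, live_frames)
print("repaired frames:", repaired_frames)
```

In this toy model the reference copy plays the role of the flash memory and `live_frames` plays the role of the Xilinx configuration memory.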

7 What is Fault Injection?
In the context of FPGA design:
– Fault injection means injecting bitflips in the configuration memory of the FPGA.
Purpose: Simulation of radiation related effects.
Pros:
– Low cost
– Simple
– Great tool for improving radiation tolerance during the development phase
Cons:
– The sensitivity of the technology itself cannot be measured
– A systematic test including all possible bit locations takes time
– Not all elements in the FPGA can be tested.

8 Purpose of the Fault Injection Test
1. Estimate the radiation sensitivity of the RCU main FPGA design
2. Estimate an expected rate of functional failures in the RCU main FPGA as a function of integrated luminosity
Two categories of functional failures are recognized:
– Reliability faults: System crashes leading to a stop in the data taking for the complete ALICE detector.
– Performance faults: Errors in the datastream from the RCU experiencing an SEU (loss of data, corrupted data).

9 Fault Injection – Test Setup
Injection of bitflips* is done using Active Partial Reconfiguration
Software for fault injection on the DCS board enables a test setup identical to real life
* K. Røed, J. Alme, D. Fehlker et al., Fault injection as a test method for an FPGA in charge of data readout for a large tracking detector, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 629, no. 1, pp. 260-268, 2011

10 Fault Injection – Test Procedure
Data from real events recorded at the ALICE detector are uploaded to the Front End Cards
One bitflip per event
All events & bitflips are logged
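
The test procedure (one bitflip per event, everything logged) can be illustrated with a small Python model. All names here are hypothetical: `toy_readout` stands in for reading an event out through the FPGA, the 16-byte memory stands in for the configuration memory, and restoring the flipped bit after each event is an assumption of this sketch rather than something stated on the slide.

```python
import random
from typing import Callable, List, Tuple


def flip_bit(memory: bytearray, bit_index: int) -> None:
    """Invert a single bit of the (simulated) configuration memory."""
    memory[bit_index // 8] ^= 1 << (bit_index % 8)


def run_campaign(
    memory: bytearray,
    n_events: int,
    read_out_event: Callable[[bytearray], str],
    rng: random.Random,
) -> List[Tuple[int, int, str]]:
    """Inject one random bitflip per event and log (event, bit, outcome)."""
    log = []
    total_bits = len(memory) * 8
    for event in range(n_events):
        bit = rng.randrange(total_bits)
        flip_bit(memory, bit)              # inject the fault
        outcome = read_out_event(memory)   # read out one event with the fault present
        log.append((event, bit, outcome))
        flip_bit(memory, bit)              # assumption: fault is repaired before the next event
    return log


def toy_readout(memory: bytearray) -> str:
    # Hypothetical sensitivity model: only bits in byte 0 affect the readout.
    return "fault" if memory[0] != 0 else "ok"


config_memory = bytearray(16)
campaign_log = run_campaign(config_memory, 200, toy_readout, random.Random(2012))
n_faults = sum(1 for _, _, outcome in campaign_log if outcome == "fault")
print(f"{n_faults} of {len(campaign_log)} injected bitflips caused a fault")
```

Keeping the bit position next to the outcome in the log is what later allows faults to be mapped back to regions of the configuration memory.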

11 RCU Main FPGA Configuration File
(Figure labels: GCLK, IOB, IOI & LE BRAM interconnect)
FPGA Editor: Gives an impression of the logic resources used in the RCU main FPGA
The black square is the embedded hardcore CPU.

12 Injected Bitflip Distribution
(Figure, two panels: "SEUs leading to reliability fault" and "SEUs leading to performance fault"; label in each panel: BRAM IC frames)
The plots show bitflips leading to observable functional faults.

13 Results (I)
Type of error     Total # faults   Fault/SEU [%]   SEUPI [SEUs/fault]
All               10341            5.02            ~19.9
Reliability       2210             1.07            ~93.5
Performance       8131             3.94            ~25.4
  Loss of data    2499             1.21            ~82.6
  Data corrupted  5632             2.73            ~36.6
Number of bitflips injected: 206151
– Coverage: 6.5%
Xilinx conservative estimate: SEUPI = 10 SEUs/fault
– Result is in the expected range
SEUPI for reliability faults = ~93.5 SEUs/fault
– Most functional faults are not critical for the operation of the ALICE detector!
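
The two derived columns of the table follow from the fault counts and the number of injected bitflips. The sketch below assumes that Fault/SEU is the fault count divided by the injected bitflips, and that SEUPI is the reciprocal of the rounded percentage; the latter is inferred from the fact that it matches the quoted values, not stated on the slide. Function and variable names are made up for the example.

```python
INJECTED_BITFLIPS = 206151

fault_counts = {
    "All": 10341,
    "Reliability": 2210,
    "Performance": 8131,
    "Loss of data": 2499,
    "Data corrupted": 5632,
}


def fault_percent(n_faults: int, n_injected: int = INJECTED_BITFLIPS) -> float:
    """Share of injected bitflips that led to a fault, in percent (two decimals)."""
    return round(100.0 * n_faults / n_injected, 2)


def seupi(n_faults: int, n_injected: int = INJECTED_BITFLIPS) -> float:
    """Average number of SEUs per fault, taken from the rounded percentage."""
    return 100.0 / fault_percent(n_faults, n_injected)


for name, count in fault_counts.items():
    print(f"{name:15s} {count:6d} {fault_percent(count):5.2f} % {seupi(count):6.1f} SEUs/fault")
```

The counts are also internally consistent: reliability plus performance faults give the total, and loss of data plus corrupted data give the performance faults.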

14 Results (II)
Same distribution for each fault type.
53 SEUs give*:
– >90% risk of any functional failure
– >15% risk of a reliability fault
* Run 2010 (09. Aug 2011): integrated luminosity 107.816 nb^-1 → number of SEUs = 107.816 nb^-1 × 0.49 SEUs/nb^-1 = 52.8 SEUs
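
The footnote arithmetic, and one possible way to turn a number of SEUs into a risk, can be sketched as follows. The risk model is an assumption of this example: it treats every SEU as an independent trial that causes a fault with the per-SEU fraction from the Results (I) table. The slide only quotes lower bounds, so this model need not reproduce the distribution behind them; it merely shows that both bounds hold under this assumption.

```python
INTEGRATED_LUMINOSITY_NB = 107.816  # nb^-1, from the slide footnote
SEUS_PER_NB = 0.49                  # mean SEU rate per nb^-1, from the slides

expected_seus = INTEGRATED_LUMINOSITY_NB * SEUS_PER_NB


def risk_of_at_least_one(n_seus: int, fault_fraction: float) -> float:
    """Probability of one or more faults, assuming independent SEUs."""
    return 1.0 - (1.0 - fault_fraction) ** n_seus


# Per-SEU fault fractions taken from the Results (I) table (5.02 % and 1.07 %).
risk_any_failure = risk_of_at_least_one(53, 0.0502)
risk_reliability = risk_of_at_least_one(53, 0.0107)

print(f"expected SEUs: {expected_seus:.1f}")
print(f"risk of any functional failure: {risk_any_failure:.0%}")
print(f"risk of a reliability fault:    {risk_reliability:.0%}")
```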

15 Functional faults vs Integrated Luminosity
The RCU support FPGA offers an opportunity to gather statistics of SEUs experienced during operation
May to August 2011*:
– Total number of SEUs (216 RCUs): 1552
– Clear correlation between SEU count and integrated luminosity
– Mean value: 0.49 SEU/nb^-1
Estimated number of functional failures in the period May to August:
– 16.6 reliability faults
– Error rate: 0.13 reliability faults/day
Logged and analyzed faults in the period 1 May to 16 June:
– ~5 faulty situations with high probability that an SEU is the cause
– Error rate: 0.11 reliability faults/day
* K. Røed, J. Alme, D. Fehlker et al., First measurement of single event upsets in the readout control FPGA of the ALICE TPC detector, Journal of Instrumentation, vol. 6, no. 12, p. C12022, 2011
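
The estimate on this slide combines the measured SEU count with the SEUPI for reliability faults from the fault injection test. The sketch below reproduces that arithmetic. The exact period boundaries (1 May to 31 August 2011, inclusive) are an assumption, since the slide only says "May to August"; the variable names are illustrative.

```python
from datetime import date

TOTAL_SEUS = 1552          # SEUs counted in 216 RCUs, from the slide
SEUPI_RELIABILITY = 93.5   # SEUs per reliability fault, from the fault injection test
OBSERVED_FAULTS = 5        # faulty situations logged 1 May to 16 June, from the slide


def days_inclusive(start: date, end: date) -> int:
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


# Assumed period boundaries for "May to August 2011".
estimate_days = days_inclusive(date(2011, 5, 1), date(2011, 8, 31))
observed_days = days_inclusive(date(2011, 5, 1), date(2011, 6, 16))

expected_faults = TOTAL_SEUS / SEUPI_RELIABILITY
predicted_rate = expected_faults / estimate_days
observed_rate = OBSERVED_FAULTS / observed_days

print(f"expected reliability faults: {expected_faults:.1f}")
print(f"predicted rate: {predicted_rate:.2f} faults/day")
print(f"observed rate:  {observed_rate:.2f} faults/day")
```

Under these assumptions the predicted and observed rates come out close to each other, which is the comparison the slide makes.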

16 What did we Learn? (present)
With the current RCU main FPGA design:
– Statistically, 93.5 SEUs are needed in the RCU main FPGA to crash the ALICE data taking
– Expected number of reliability faults (crashes):
– The fault injection study is important to understand and interpret error situations that happen during daily operations of the ALICE detector.
Fault injection is an excellent tool for increasing the robustness of the design against radiation related errors.
– Changes to design → Repeat fault injection

17 What did we Learn? (future)
Upgrades of the electronics are currently being discussed
– Focus: Higher data rate, "more physics"
The fault injection study shows that the functional failure rate can become a limiting factor if the integrated luminosity increases and the electronics are not upgraded
Conclusion: The radiation environment must be taken into account when upgrading the electronics.
Could FPGAs be recommended for the discussed upgrade or similar applications in the future?
– Given that close to no mitigation techniques are implemented in the RCU due to area constraints, the answer is YES.
– And: New FPGA series/tools are more mature concerning radiation tolerance
However:
– If using FPGAs, it might be wise to move as much as possible of the complex functionality outside the radiation area.

18 Thanks for Listening
Johan Alme (johan.alme@hib.no) (Bergen University College)
Ketil Røed (University of Oslo)
Dominik Fehlker, Kjetil Ullaland, Attiq Ur Rehman (University of Bergen)
Christian Lippmann, Magnus Mager (GSI Frankfurt)

