1 MAPLD 2004
SINGLE EVENT EFFECT (SEE) ANALYSIS, TEST, MITIGATION & IMPLEMENTATION OF THE XILINX VIRTEX-II INPUT OUTPUT BLOCK (IOB)
Mathew Napier(1), Jason Moore(2), Kurt Lanes(1), Sana Rezgui(2), Gary Swift(3)
(1) Sandia National Laboratories, Albuquerque, NM, USA
(2) Xilinx, San Jose, CA, USA
(3) JPL/Caltech, Pasadena, CA, USA
"This work was carried out in part by the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration."
"Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology."
2 Purpose & Outline
Analyze and evaluate the different types of TMR IOB mitigation structures. Discuss the trade-offs (SEE, electrical/timing, and resources) and how these trade-offs affect the operation and MTBF of a system.
OUTLINE
  IOB
  SEE IOB Mitigation
    Triple Module Redundant IOB
    JPL Dual-MR
  SEE Trade-offs
    Cross Section
    Signal Integrity and Timing
  System Implementation
    TMR, EDAC, I/O Count
    High-speed Interfaces
3 SEU Hazards for Xilinx Technology
Configuration Memory
  Configuration memory controls logic function and routing
  Configuration memory upsets cause:
    Changes to logic function
    Changes to routing
    Changes to IO configuration
Transient and Static Bit Errors
  Change data and control states
Single Event Functional Interrupt (SEFI)
  Power-on state machine upsets (POR upset)
    Cause a power-on reset to occur
  SelectMAP and JTAG
    Disable part configuration/scrub
Effective mitigation techniques exist for each of these error modes
[Figure labels: SRAM configuration memory controls logic function (look-up tables); internal registers store state data; SRAM configuration memory controls routing (switch matrix)]
4 Input Output Buffer (IOB)
IOBs are used to interconnect the Xilinx FPGA fabric with external devices.
Support a wide range of I/O operating standards.
  Differential: LVDS… ECL
  Single-ended: LVCMOS… HSTL
Silicon features greatly increase system performance.
  Flip-flops in the IOB
  Double Data Rate flip-flops
  Digital impedance control
An IOB consists of the following parts:
  Input path
    Two DDR registers
  Output path
    Two 3-state DDR registers
  Separate clocks for I & O
  Set and reset signals are shared
    Separated sync/async
    Separated Set/Reset attribute per register
[Figure: IOB block diagram. Labels: Reg, DDR mux, 3-State, OCK1, OCK2, Output, PAD, Input, ICK1, ICK2]
6 Xilinx Triple Module Redundancy (XTMR): Inputs
SEU immunity requires the use of triple-redundant input pins for every input signal.
  Not triplicating global input signals (clk, rst, etc.) can seriously compromise SEU resistance.
Triplication of input data paths can be traded for EDAC.
  Reduces I/O count
  SEU resistance is sometimes traded off for resource utilization.
Xilinx input capacitance is 10 pF per I/O, so the user needs to verify that interfacing parts can drive 30 pF at speed.
7 XTMR: Triplicated Outputs with Minority Voters
Outputs can be triplicated, using three pins for each output signal.
Minority voters monitor each of the triplicated design modules.
  If one module is different from the others, its output pin is driven to high-Z.
Voters are triplicated.
[Figure: domains TR0, TR1 and TR2 each drive a pin (P) through a minority voter; the convergence point is outside the FPGA, at the trace]
8 XTMR: Triplicated Output Operation - Datapath SEU
If a datapath SEU occurs, the minority voter places its pin in high-Z.
  The remaining valid outputs drive the output to the correct value.
If an SEU occurs on the minority voter, the worst it can do is disable a valid output.
To pass an incorrect output, two upsets would have to occur on the same path.
  Active scrubbing of the part will eliminate the accumulation of double SEUs in configuration logic.
[Figure: two views of TR0, TR1 and TR2 driving pins (P) through minority voters, with one pin at high-Z (Z)]
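The triplicated-output behaviour described on this slide can be sketched as a purely logical model. This is a minimal illustration, not the Xilinx implementation: the names `is_minority`, `xtmr_pins`, `trace_value` and `HIGH_Z` are hypothetical, and the model ignores all electrical effects (drive strength, contention voltage, timing).

```python
from typing import Optional, Sequence, Tuple

HIGH_Z = None  # assumption: None stands in for a tri-stated pin in this model


def is_minority(own: int, other_a: int, other_b: int) -> bool:
    """True when this domain disagrees with both other domains."""
    return own != other_a and own != other_b


def xtmr_pins(tr0: int, tr1: int, tr2: int) -> Tuple[Optional[int], ...]:
    """Drive three pins, tri-stating any domain that is in the minority."""
    domains = (tr0, tr1, tr2)
    pins = []
    for i, own in enumerate(domains):
        # The two domains other than the one driving this pin.
        other_a, other_b = (d for j, d in enumerate(domains) if j != i)
        pins.append(HIGH_Z if is_minority(own, other_a, other_b) else own)
    return tuple(pins)


def trace_value(pins: Sequence[Optional[int]]) -> Optional[int]:
    """Logic level at the external convergence point.

    Returns None if the driven pins disagree or if every pin floats.
    """
    driven = {p for p in pins if p is not HIGH_Z}
    return driven.pop() if len(driven) == 1 else None


# A single upset in domain TR1 tri-states only that pin.
print(xtmr_pins(1, 0, 1))
print(trace_value(xtmr_pins(1, 0, 1)))
```

In this model an upset in one domain removes that domain's driver from the trace, and the two remaining pins still agree on the value.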
9 XTMR: Duplicated Outputs with Minority Voters (JPL)
In this scheme (by Gary Swift at JPL), triplicated design domains are driven onto two pins.
Two minority voters monitor each of the triplicated design modules.
  If a module is different from the others, its output pin is driven to high-Z.
Voters are duplicated.
If an SEU occurs on the datapath without a pin, the outputs continue operating as normal.
[Figure: TR0, TR1 and TR2 feed two minority voters, each driving a pin (P); the convergence point is outside the FPGA, at the trace]
10 XTMR: Duplicated Output Operation - Datapath SEU (2)
If an SEU occurs on the datapath with a pin, that pin is driven to high-Z.
The main advantage of this technique is that it uses two rather than three pins, reducing pin count while maintaining SEU immunity.
If an SEU occurs on the minority voter, the worst it can do is disable a valid output, the same as XTMR.
[Figure: two views of TR0, TR1 and TR2 feeding two minority voters and pins (P), with one pin at high-Z (Z)]
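The duplicated-output scheme can be sketched in the same logical style. This is a minimal illustration under stated assumptions: the slides do not say which two domains own the pins, so the sketch assumes TR0 and TR1 drive them and TR2 is used only for voting. The function name `dual_mr_pins` is hypothetical.

```python
from typing import Optional, Tuple


def dual_mr_pins(tr0: int, tr1: int, tr2: int) -> Tuple[Optional[int], Optional[int]]:
    """Two output pins with minority voters; None models a high-Z pin.

    Assumption: TR0 and TR1 each drive a pin, TR2 only takes part in voting.
    """
    pin0 = None if (tr0 != tr1 and tr0 != tr2) else tr0
    pin1 = None if (tr1 != tr0 and tr1 != tr2) else tr1
    return pin0, pin1


# Upset in the domain without a pin: both pins keep driving.
print(dual_mr_pins(1, 1, 0))
# Upset in a domain with a pin: that pin goes to high-Z, the other still drives.
print(dual_mr_pins(0, 1, 1))
```

The model covers only datapath upsets ahead of the voter. It says nothing about a failure inside the IOB itself, which is the case the later signal-integrity slides examine.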
11 XTMR: Single Output Pin
If a design is pin-limited, you can elect not to triplicate some outputs.
A single majority voter can be placed in series with a single output.
  This adds output delay and leaves the output path susceptible to SEU.
[Figure: TR0, TR1 and TR2 feed a majority voter that drives a single OBUF]
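A majority voter reduces the three domain values to the value held by at least two of them. The sketch below uses the standard two-of-three Boolean expression; the function name `majority` is hypothetical, and the exhaustive loop simply checks that a single flipped domain never changes the voted result.

```python
from itertools import product


def majority(tr0: int, tr1: int, tr2: int) -> int:
    """Two-of-three majority vote on single-bit inputs."""
    return (tr0 & tr1) | (tr1 & tr2) | (tr0 & tr2)


# Exhaustive check: flipping any one domain of an agreeing triple
# leaves the voted output unchanged.
single_upset_masked = True
for value, flipped in product((0, 1), range(3)):
    domains = [value, value, value]
    domains[flipped] ^= 1
    if majority(*domains) != value:
        single_upset_masked = False

print(single_upset_masked)
```

The voter masks upsets in the domains feeding it, but the voter and the output buffer after it are a single, unprotected path, which is the susceptibility the slide warns about.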
12 XTMR Output Analysis
How many configuration bits are in the TMR I/O after the minority voter?
  Errors in these bits will change the IOB function and NOT be caught by the voter.
  How many one-bit upsets will really change the function?
Does an XTMR structure still function correctly with a stuck-at-high, stuck-at-low or inverted IOB failure? Can two I/Os overdrive the failed one?
  Voltage output high
  Voltage output low
  Rise/fall timing
  How does this change for different I/O types and switching speeds?
How to design a system that balances:
  SEE sensitivity
  System performance and speed
  Resource utilization
13 Schematic Analysis
Determine the number of Configuration Memory Cells (CMC) needed to configure unprotected and TMR I/O configurations by analyzing Xilinx schematics.
Guidelines/Assumptions
  Not all SEUs will be catastrophic; therefore there are two types of SEU failures (hard and soft).
  Hard failure: 100% certainty that, when it occurs, it will cause a system failure
    Causing the output to become inverted
    Causing the output to be stuck either high or low
    Changing the signaling standard to something completely different (e.g. LVCMOS to HSTL)
    Causing the output to be tri-stated
  Soft failure: uncertain as to the effect
    Changing the signaling standard to something similar (LVCMOS to LVTTL)
    Changing the drive strength or slew rate
    Changing the termination
14 Schematic Analysis Results
[Figure: CLB LUT, routing to IOB, IOB]
Schematic analysis of this path = 109 bits (but only 92 "essential")
  26 hard failures
  66 soft failures
15 TMR Output Results
[Figure: CLB and routing, IOB]
Schematic analysis of this configuration = 173 bits
  27 hard failures
  122 soft failures
TMR has a larger cross section than unprotected I/O. AC analysis will determine which type is more robust.
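The bit counts on the two schematic-analysis slides can be tabulated to make the comparison explicit. This is a small bookkeeping sketch using only the figures quoted on the slides; the class name `BitCount` and its fields are hypothetical, and the sum of hard and soft failures for the TMR case is plain arithmetic, not a figure stated in the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BitCount:
    total: int  # configuration bits found by schematic analysis
    hard: int   # bits whose upset is classed as a hard failure
    soft: int   # bits whose upset is classed as a soft failure

    @property
    def classified(self) -> int:
        """Bits classed as either a hard or a soft failure."""
        return self.hard + self.soft


unprotected = BitCount(total=109, hard=26, soft=66)
tmr = BitCount(total=173, hard=27, soft=122)

for name, count in (("unprotected", unprotected), ("TMR", tmr)):
    print(f"{name}: {count.classified} of {count.total} bits classified, "
          f"{count.hard} hard, {count.soft} soft")
```

For the unprotected path the hard and soft counts add up to the 92 bits the slide calls essential. The TMR configuration has nearly twice as many soft-failure bits but only one more hard-failure bit.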
16 SEE Mitigated IOB Signal Integrity and Timing
MEMEC Insight MB-2000 board used as the test platform to test the electrical and timing characteristics of XTMR.
Tied three I/Os together and ran through four different cases:
  Normal, stuck at high, stuck at low, inverted
For each case the following measurements were taken:
  Voh, Vol, Tr, Tf
  4 GHz scope pictures
I/O types evaluated included:
  1.8V/2.5V/3.3V LVCMOS & LVTTL, LVDCI (impedance control) & LVDS
  Fast and slow slew rate
HyperLynx simulations were performed on all of the above cases to verify correlation between measured and simulated data.
JPL's dual-redundant minority voter mitigation scheme will fail all of the above operating conditions if one of the I/Os fails.
17 SEE Mitigated IOB Signal Integrity and Timing
XTMR 1.8V LVCMOS, one output inverted
  Voh drops to 1.4 V from 1.8 V
  Vol rises to 0.4 V from 0 V
  Noise due to lack of termination
[Figure: scope captures, Normal and Inverted]
18 SEE Mitigated IOB Signal Integrity and Timing
HyperLynx IBIS model, LVCMOS 1.8V
Case           Source     Voh      Vol       Tr       Tf
Stuck at high  Measured   1.72 V   0.40 V    0.58 ns  0.51 ns
Stuck at high  Simulated  1.79 V   0.54 V    0.80 ns  0.60 ns
Stuck at low   Measured   1.44 V   -0.04 V   0.62 ns  0.52 ns
Stuck at low   Simulated  1.26 V   -0.06 V   0.60 ns  0.70 ns
Simulation data correlates with measured data.
[Figure: stuck-at-high and stuck-at-low simulation waveforms]
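The measured and simulated values on this slide can be compared side by side. The sketch below only subtracts the numbers quoted on the slide; the dictionary layout and the helper name `sim_minus_measured` are hypothetical, and no pass/fail threshold is implied because the presentation does not give one.

```python
# Values for 1.8V LVCMOS as quoted on the slide (volts and nanoseconds).
readings = {
    "stuck_at_high": {
        "measured": {"Voh": 1.72, "Vol": 0.40, "Tr": 0.58, "Tf": 0.51},
        "simulated": {"Voh": 1.79, "Vol": 0.54, "Tr": 0.80, "Tf": 0.60},
    },
    "stuck_at_low": {
        "measured": {"Voh": 1.44, "Vol": -0.04, "Tr": 0.62, "Tf": 0.52},
        "simulated": {"Voh": 1.26, "Vol": -0.06, "Tr": 0.60, "Tf": 0.70},
    },
}


def sim_minus_measured(case: str) -> dict:
    """Simulated minus measured value for each parameter, rounded to 2 places."""
    measured = readings[case]["measured"]
    simulated = readings[case]["simulated"]
    return {key: round(simulated[key] - measured[key], 2) for key in measured}


for case in readings:
    print(case, sim_minus_measured(case))
```

The voltage differences stay within 0.18 V, and the largest timing difference is the 0.22 ns on the stuck-at-high rise time.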
19 SEE Mitigated IOB Signal Integrity and Timing
Measured data spreadsheet
[Table not recovered. Column headings as extracted: Normal, Stuck At Low, Stuck At Low, INV]
SAH (stuck-at-high) failure limits the output-low voltage margin or violates the level.
20 CMC Failure Comparison
How does naked I/O compare to TMR in dynamic test in the beam and in fault injection?
  The test shows CMC sensitivity due to switching failures large enough to break the output switching state.
TMR displayed zero failures at 3.3V and 1.8V.
Naked I/O has a much larger CMC failure cross section than the TMR setup.
The I/O test design is only running at 30 MHz. TMR failures may show up at higher speeds.
[Figure label: Inverted]
21 System Goals & Implementation
Xilinx FPGA technology is a mission-enabling technology
  SEU goal: develop a design that produces SEU performance comparable to that of a fully hardened design while exploiting the capabilities of state-of-the-art CMOS process technologies
  SEU result: system upset rate is superior to that which could be achieved with unmitigated SEU hard logic
IMPLEMENTATION
  Command and control logic is implemented in SEU hard logic
  Processor memory includes parity protection
    Fail over to boot code
  SEU detection and recovery for SEU soft devices is automatic and occurs without ground intervention
  SEU-induced outages that do not require ground intervention are booked against mission availability
  Although not a specific requirement, good SEU performance under nominal solar flare conditions is desired
22 SEU Mitigation and Error Control
Mitigate IO upsets
  TMR of IO for clocks and address signals
  EDAC for data path signals
Mitigate configuration memory upsets
  TMR internal logic
  Configuration memory scrubbing to prevent error accumulation
Design approach does not include POR upset mitigation
  Use of shadow devices is effective against POR errors
  POR error rate is very low
The flight system makes extensive use of several techniques to exploit the advantages of nanometer CMOS technology while maintaining excellent SEU performance
  Multiple-bit Reed-Solomon forward error correction codes
  Single-bit error correcting codes
  Simple parity error detection
  Cyclic Redundancy Check for burst error correction
  Triple Modular Redundancy
  Error scrubbing
Mitigation technique is selected based upon error rate, vulnerability, system impact, and implementation complexity
Mitigation techniques provide coverage for dynamic SEU errors
Error correction techniques implemented for SEU mitigation improve the overall design robustness and reliability
23 Mitigation Overview: Sensor Data Processor (SDP)
Processes 8 Gbps of data.
Outputs 340 Mbits of processed data.
Architecture
  Fiber receiver and SERDES link, 4 channels at a maximum of 160 Mpix each.
  Four quadrant processors for data processing.
    Contains 640 Mbytes of SDRAM for data storage
    320-bit 85 MHz SDRAM, 1.8V
    Can generate up to 340 Mbits/s of source packet data
  One central Virtex for data networking
    De-mux data from SERDES chip outputs to 4 processing channels/quadrant Xilinx
    Controls frame summation rates and reference frame generation rates.
    Transfers source packets to downlink modules at up to 340 Mbits/s max
    USES compresses source packets.
24 Mitigation Overview: Sensor Data Processor (SDP)
[Figure: SDP block diagram. Recoverable labels: Fiber Input, SERDES, Osc, four XC2V3000 devices each with 640MB+ECC, one XC2V3000 Interface Control, 320, RS-ECC, ECC/CRC, ECC/TMR, TMR, PIX/Packet, I2C, JTAG, Gilgamesh, Voltage, Temp., I2C TIME, System CLK, Packets, SDP, PXS, CTM, To DLM/DLC]
25 SDP - SDRAM
SDRAM interface, 1 per quadrant Virtex
  20 1.8V Micron Mobile SDRAMs
  1.8V LVTTL I/O
  320-bit data bus: 240 pixel data, 80 ECC
    Data is Reed-Solomon encoded
  TMR'd outputs from Virtex: address, control and clock
    Address and control signals are AC terminated.
  TMR'd input to Virtex: clock feedback, used to de-skew the SDRAM clock
  Currently running at 85 MHz, designed to operate at 100 MHz
Test
  Measured TMR SDRAM Addr, RAS and CAS signals for the following cases:
    Inverted, stuck high, stuck low
  Measured Voh, Vol, Tr and Tf.
  Counted the number of Reed-Solomon errors, if any.
[Figure label: SDRAM ADDRESS & CONTROL]
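The bus figures on this slide imply a few derived numbers, worked through below. This is back-of-envelope arithmetic under assumptions the slide does not state: the 320-bit bus is taken to be split evenly across the 20 devices, and the peak rates assume one bus-wide transfer per clock with no refresh or command overhead. All constant and variable names are hypothetical.

```python
BUS_BITS = 320          # total data bus width (from the slide)
DATA_BITS = 240         # pixel data bits (from the slide)
ECC_BITS = 80           # ECC bits (from the slide)
DEVICES = 20            # SDRAM devices per quadrant (from the slide)
CLOCK_HZ = 85_000_000   # current operating frequency (from the slide)

# Assumption: the bus is divided evenly between the devices.
bits_per_device = BUS_BITS // DEVICES

# Share of the bus spent on error correction.
ecc_fraction = ECC_BITS / BUS_BITS

# Assumption: one transfer per clock, ignoring refresh and command cycles.
peak_bus_bps = BUS_BITS * CLOCK_HZ
peak_payload_bps = DATA_BITS * CLOCK_HZ

print(f"{bits_per_device} bits per device")
print(f"{ecc_fraction:.0%} of the bus carries ECC")
print(f"peak bus rate {peak_bus_bps / 1e9:.1f} Gbit/s, "
      f"peak payload {peak_payload_bps / 1e9:.1f} Gbit/s")
```

Under these assumptions a quarter of the bus width is spent on Reed-Solomon check bits, which is the cost of protecting the data path with EDAC instead of triplicating its pins.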
26 SDP - SDRAM (2)
[Figure: scope captures, SDRAM address normal and SDRAM address with one I/O inverted]
27 SDP - SDRAM (3)
No SDRAM errors for all three failure cases
28 Upset Rates for Various SEU Mitigated IO Configurations
29 Lessons Learned
Triple-redundant outputs for >2.5V LVCMOS or LVTTL achieve correct Vol and Voh levels for all failure cases.
For low-voltage I/O (<1.8V), thresholds are very close to margins for failure conditions and may violate other parts' specs.
For the SDRAM interface, 1.8V I/O tolerated all three failure cases at room temperature.
Double-redundant outputs will not meet the correct Vol and Voh levels under I/O failure.
Rise and/or fall times are lengthened due to I/O failure. This may cause more failures at higher speeds.
Recommendation
  If resources permit, XTMR output for all control signals is recommended regardless of I/O type.
  High-speed, jitter-sensitive or duty-cycle-sensitive device outputs need special consideration.
  EDAC on data buses is ideal for IOB failure protection.