
1 GLAST LAT Instrument: Summary of Progress
• Completed TVAC with no additional reboots
• Ran refresh rate test showing that the refresh rate was not an issue
• Reviewed historical data
• LAT level: found two reboots on EPU2 that had similar symptoms; scrub still in process
  – Eliminates SC as cause
  – One of those was at room ambient conditions
• Box level had no applicable NCRs
• Vendor level data package in review
• Reviewed all memory errors reported in telemetry; none related to this issue
• Met with BAE
• No similar problems observed before
  – Single bit errors are typically refresh interval too long or problems with 3.3V, but uncorrectable memory
• They are reviewing the symptoms
• Gunther asked for a list of potential causes and for copies of X-rays
• Waiting for BAE response
• Estimated data loss if EPU2 is used in orbit with existing reboot rates
• Started looking at single EPU processor performance
• Side effects
• Review of our memory chips and processor settings confirmed that the LAT processor chips are configured appropriately
• The bridge chip in the dataflow lab matches the flight processor chip, so we can enable the 60X machine checks and maintain consistency between the dataflow lab and flight processors

2 Reboots Overview
• EPU2 has rebooted three times at TVAC hot:
• EPU2 2008-01-15 18:00:22, Configuration 4
  – Boot type: watchdog
  – LIM mode: Quiescent (077016772), 00:18:12 minutes after power up
  – EPU2 box temperature, junction temperature: 35.4, 67
  – In primary boot for 5:19:40
  – Analysis: uncorrectable data error
• EPU2 2008-01-16 03:25:25, Configuration 4
  – Boot type: exception
  – LIM mode: Physics (077016781)
  – EPU2 box temperature, junction temperature: 38.5, 67
  – In primary boot for 5:18:26
  – Coherent dumps available
  – Analysis: the first and last of the 4 64-bit words in a cache line were replaced by zeroes, evidence of memory errors
• EPU2 2008-01-16 11:20:49, Configuration 6
  – Boot type: watchdog
  – LIM mode: Quiescent (077016787), during PowerOnCals.py, 00:08:31 minutes after power up
  – EPU2 box temperature, junction temperature: 35.0, 63
  – In primary boot for 3:35:14
  – Analysis: uncorrectable data error at address 0x38c9e0
• Additional test time in configuration 6 yielded no reboots
• 2/3 reboots show evidence of uncorrectable memory errors
• Memory scrub ran ~186 times on EPU2, but reported no correctable memory errors

3 Reboots (cont’d)
• Environment 2008-01-15 to 2008-01-18:
• Voltage: No significant variations
• Temperature:
  – Max box temperature: 39.3
  – Max junction temperature: 71
• LAT configuration:
  – Configs 4 and 6 have opposite SIU, PDU, GASU, EPU0/1
• On time for EPU2 during TVAC hot:
  – Total time in secondary boot: 15.5 hours
  – Config 4 = 15.8 hours = 5.1 in secondary + 10.7 in primary; 2 reboots
  – Config 6 = 6.2 hours = 2.6 in secondary + 3.6 in primary; 1 reboot
  – Config 8 = 0.4 hours in secondary
  – Config 6 = 5.8 hours in secondary

Configuration table (columns as listed on the slide): Config No., SIU Feed, DAQ Feed, VCHP Feed, Unreg Feed, GBM, SIU, PDU, GASU, EPU0, EPU1, EPU2, ACD HV
Config 2: R R R R P R R R - On HV2
Config 4: R P R P R R P R On - HV2
Config 6: P P P R P P R P - On HV1
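The on-time bookkeeping above can be cross-checked with a short script. This is only a sketch: the hours and reboot counts are transcribed from the slide, while the record layout, the `tally` helper, and the zero primary-boot hours for the two secondary-only entries are assumptions made for illustration. Note that the itemized secondary-boot entries sum to 13.9 hours rather than the 15.5 hours quoted as the total; the script does not resolve that difference.

```python
# Hours and reboot counts transcribed from the slide; the record layout is illustrative.
# Entries that list only secondary-boot time are assumed to have 0.0 primary-boot hours.
runs = [
    {"config": 4, "secondary": 5.1, "primary": 10.7, "reboots": 2},
    {"config": 6, "secondary": 2.6, "primary": 3.6, "reboots": 1},
    {"config": 8, "secondary": 0.4, "primary": 0.0, "reboots": 0},
    {"config": 6, "secondary": 5.8, "primary": 0.0, "reboots": 0},
]


def tally(records):
    """Sum on-time per boot mode and derive a reboot rate per primary-boot hour."""
    secondary = sum(r["secondary"] for r in records)
    primary = sum(r["primary"] for r in records)
    reboots = sum(r["reboots"] for r in records)
    return {
        "secondary_h": round(secondary, 1),
        "primary_h": round(primary, 1),
        "reboots": reboots,
        # All three reboots on the previous slide occurred while in primary boot.
        "reboots_per_primary_hour": reboots / primary,
    }


summary = tally(runs)
print(summary)
```

With only three events, the derived rate is a rough indicator rather than a statistically meaningful estimate.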

4 Reboots vs Temperature
[Figure: plot annotated with configurations 4, 6, 8, 1, 6 and markers for Reboot 1, Reboot 2, and Reboot 3]

5 Historical Reboots
• The list of reboots that was addressed by the Reboot Resolution Team was reviewed for any reboots with similar symptoms
• Two EPU2 reboots with associated memory errors were found
• EPU2 2006-08-29 06:09, Configuration 2
  – During LAT level TVAC
  – LIM mode: Charge Injection (077009297)
  – Boot type: Checkstop, possibly preceded by a watchdog
  – EPU2 box temperature: 41.2
  – Analysis: Single correctable and multibit uncorrectable errors
• EPU2 2006-04-10 21:34, Configuration 5
  – During integration room ambient testing
  – Boot type: Watchdog
  – LIM mode: Charge Injection
  – EPU2 box temperature approx 24
  – Analysis: Uncorrectable memory errors

6 Memory Errors
• Summary: reboots 1 and 3 show clear evidence of uncorrectable memory errors, reboot 2 shows zeroed memory
• Clearest history is in Reboot 3 (write through was enabled so the trace was coherent)
• Observed uncorrectable memory error while referencing the interrupt stack
• Subsequent dump of the interrupt stack found multiple uncorrectable memory errors in the same region as the first memory error
• Reboot 1 has less information since write through was not enabled
• Evidence of a correctable memory error
• The processor encountered uncorrectable memory errors in the very early stages of reporting the correctable memory error
• Timeline cannot be reconstructed in the same level of detail as reboot 3
• Reboot 2
• Immediate cause was execution of zeroed instruction memory
• Dump showed there were 2 64-bit words zeroed out in the same cache line, suggesting a connection to the memory error detection/correction function
• No memory errors associated with reading those zeroed words
• Saw single-bit and correctable nibble errors, but the processor crashed while it was in the later stages of reporting the errors in diagnostic telemetry
• Reboot 5
• Both single bit correctable and multi bit uncorrectable memory error indicators
• Reboot 4
• Stored memory check information shows uncorrectable data error in VxWorks interrupt dispatch routine
• Memory status register shows correctable single bit and nibble errors as well as uncorrectable memory errors
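The Reboot 2 signature (whole 64-bit words zeroed within one cache line) is the kind of pattern that can be searched for mechanically in a memory dump. The following is a minimal sketch, not the tool used by the team: the function name, the raw-bytes dump format, and the line-aligned base address are all assumptions. The 32-byte line size follows from the slides' description of 4 64-bit words per cache line.

```python
LINE_BYTES = 32  # 4 x 64-bit words per cache line, as described on the slides
WORD_BYTES = 8


def zeroed_words_by_line(dump: bytes, base_addr: int = 0):
    """Map cache-line address -> indices of all-zero 64-bit words in that line.

    Hypothetical helper: assumes `dump` is raw memory starting at a
    cache-line-aligned `base_addr`. Any trailing partial line is ignored.
    """
    hits = {}
    for off in range(0, len(dump) - LINE_BYTES + 1, LINE_BYTES):
        line = dump[off:off + LINE_BYTES]
        zero_idx = [
            i for i in range(LINE_BYTES // WORD_BYTES)
            if line[i * WORD_BYTES:(i + 1) * WORD_BYTES] == bytes(WORD_BYTES)
        ]
        if zero_idx:
            hits[base_addr + off] = zero_idx
    return hits


# Synthetic data: one intact line, then one with its first and last words zeroed.
good = bytes(range(1, 33))
bad = bytes(8) + bytes(range(1, 17)) + bytes(8)
result = zeroed_words_by_line(good + bad, base_addr=0x1000)
print({hex(addr): idx for addr, idx in result.items()})
```

A scan like this also flags memory that legitimately holds zeros, so any hits would need to be compared against the expected contents (for example, the loaded code image) before being treated as corruption.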

7 Remainder of TV Test Results
• No additional reboots seen for the remainder of TV (about 60 hours of EPU2 operations above 35 degrees)
[Figure: plot with regions labeled "Reboots at first hot plateau", "Second hot plateau", and "Return to ambient"]

8 Refresh test
• The refresh rate was set to 30 microseconds then to 7.5 microseconds at the end of the hot plateau
• Temperatures were
  – SIU box 37.5, junction temperature 79
  – EPU 1 box 37.5, junction temperature 67
  – EPU 2 box 39, junction temperature 67
• No memory errors or reboots were observed
• Refresh rate of 30 microsecond test conditions
• Started on 1/31 at 09:35 through 14:30
• Included 1 hour of muon runs (1/2 hour was the high rate run)
• Memory scrub interval was 5 minutes
• Refresh rate of 7.5 microsecond test conditions
• Started at 14:30 through loadshed at 21:15
• 3 hours of muon runs (1.5 hours at the high rate)
• Memory scrub interval was 5 minutes
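To put the two test windows in perspective, their durations and the approximate number of memory scrub passes in each can be worked out from the times above. This is a rough sketch that assumes both windows fall on the same day and that the scrub ran continuously at the stated 5-minute interval; the helper name and the dictionary labels are illustrative.

```python
from datetime import datetime

SCRUB_INTERVAL_MIN = 5  # memory scrub interval stated on the slide


def window_minutes(start: str, end: str) -> int:
    """Length of a same-day HH:MM window in whole minutes."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Start and end times transcribed from the slide; labels are illustrative.
windows = {
    "30 us refresh": ("09:35", "14:30"),
    "7.5 us refresh": ("14:30", "21:15"),
}

for label, (start, end) in windows.items():
    minutes = window_minutes(start, end)
    passes = minutes // SCRUB_INTERVAL_MIN  # assumes continuous scrubbing
    print(f"{label}: {minutes} min, about {passes} scrub passes")
```

Under these assumptions the first window lasts 295 minutes (about 59 scrub passes) and the second 405 minutes (about 81 scrub passes).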

9 BACKUP SLIDES

10 Spacecraft Considerations
• The spacecraft provides power and timing information to the EPU
• Power
  – Bus voltages and EPU internal converted voltages were examined
  – No significant variations were seen
  – Working with the spacecraft team to get higher fidelity (sampled more frequently) telemetry as further confirmation
• Timing information from the spacecraft consists of the 1pps signal and a time message
  – No GPS anomalies or troubleshooting reported at the time of the reboots
  – The reconstructed timeline indicates that we were not expecting the 1pps signal when the EPU crashed
  – There is no indication that a 1pps arrived at an unexpected time and induced the reboots

