Experimental Evaluation of System-Level Supervisory Approach for SEFIs Mitigation Mrs. Shazia Maqbool and Dr. Craig I Underwood Maqbool 1 MAPLD 2005/P181.



Overview
- Context
- Motivation
- Mitigation Scheme
  - Top Level Description
  - Protocol Gets Defined
  - Test Bed
  - Experimental Results
- Conclusions

Single Event Functional Interrupts (SEFIs)
- A type of anomaly in microcircuits caused by a single ion strike
- Occurs in a sensitive cross-section of the device
- The user does not have direct access to the fault location
- Signatures
  - An upset rate higher than expected
  - Non-responding device
  - In a communication network, a SEFI is an event that stops communication
  - Variations in device current consumption
- During a SEFI, the device is unavailable to the system
- The device is potentially recoverable
  - Recovery involves resetting or power cycling
  - System recovery requires restoring the device functionality, followed by recovery of its state

Mitigation Levels
- Techniques (diagram labels, in transcript order)
  - Using Radiation Hardening Processes
  - Built-in Fault Tolerance Features
  - Incorporating Redundancy within Device
  - Error Detection And Corrections (EDACs)
  - Redundancy Techniques, e.g. Voting, Lockstep etc.
  - Configuration Scrubbing
  - Data Handling Networks
- Levels (diagram labels, in transcript order)
  - Device Level
  - Unit Level
  - System/Architectural Level

Motivation
- Space applications demand more:
  - Computational power
  - Standardization
  - Reusability
- but less:
  - Mass, volume and power budget
  - Cost
  - Development time
- Candidate architectures are heavily based on state-of-the-art COTS technology
  - Reliability
  - Availability
- SEFIs and single event transients are becoming dominant radiation hazards
- A unit-level approach has usually been considered for SEFI mitigation

System Architecture
- A fast data network interlinks all units
  - Scalable
  - Distributed
  - Reusable
- SEFI mitigation is performed at system level
- A diagnosis and recovery (DAR) packet from each unit acts as an indicator of that unit's health status
- The supervisor intervenes when a packet does not arrive or does not match expectation
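The supervisor's decision rule described on this slide can be sketched in a few lines of Python. This is only an illustration: the `UnitHealth` record, the `check_unit` function, the payload comparison and the time-out value are all assumptions made for the example, not details taken from the presentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnitHealth:
    """Supervisor-side record for one monitored unit (hypothetical structure)."""
    expected_payload: bytes   # assumed: DAR content the supervisor expects
    timeout_s: float          # assumed: maximum allowed gap between DAR packets
    last_seen_s: float = 0.0  # time at which the last DAR packet arrived


def check_unit(unit: UnitHealth, now_s: float, packet: Optional[bytes]) -> str:
    """Classify one polling step as 'ok', 'mismatch', 'timeout' or 'waiting'."""
    if packet is not None:
        # A packet arrived: remember when, then compare with the expectation.
        unit.last_seen_s = now_s
        return "ok" if packet == unit.expected_payload else "mismatch"
    if now_s - unit.last_seen_s > unit.timeout_s:
        # No packet within the time-out window: the supervisor should intervene.
        return "timeout"
    return "waiting"
```

A result of "mismatch" or "timeout" corresponds to the two cases in which the supervisor intervenes.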

Why a System-Level Approach
- Cost-effective
- Adaptable
- Reusable
- The power-cycling requirement associated with SEFIs demands an external entity to hold state data and to initiate the recovery procedure
- In case of a permanent failure, the failed unit can be switched off
- Supervisory functions, network management and configuration management can be combined

On-Board Computer
- Possible sources of fault
  - Processor
  - Memory
  - Network interface
- Required underlying mitigations
  - EDAC
  - OPC (Over-Current Protection Circuitry)
- Block diagram "The OBC Subsystem" (labels): Processor, EDAC, Memory, Interface FPGA, OPC

SEFI Signatures (slide content not captured in the transcript)

Supervisory Protocol
- Two types of packets
  - Screech Packet
  - Diagnosis And Recovery (DAR) Packet
- DAR task
  - Performs testing of the processor
  - Collects the error count of the memory unit
  - Updates state data
  - Current consumption of the OBC module will be monitored
- Packet format (fields in order): System ID | Length | Flags | Diagnostic health data / Screech data
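The field order given on this slide (System ID, Length, Flags, then diagnostic health data or screech data) can be illustrated with a small packing routine. The slide does not state field widths or flag meanings, so the one-byte header fields, the interpretation of Length as the payload size, and the `FLAG_SCREECH` bit are all assumptions for the sake of a runnable sketch.

```python
import struct

# Assumed layout: system ID, length and flags are one unsigned byte each.
HEADER = struct.Struct("!BBB")
FLAG_SCREECH = 0x01  # hypothetical flag bit marking a screech packet


def pack_packet(system_id: int, flags: int, data: bytes) -> bytes:
    """Build header + payload; Length is assumed to count payload bytes only."""
    return HEADER.pack(system_id, len(data), flags) + data


def unpack_packet(raw: bytes):
    """Split a raw packet into (system_id, flags, data), checking the length."""
    system_id, length, flags = HEADER.unpack_from(raw)
    data = raw[HEADER.size:HEADER.size + length]
    if len(data) != length:
        raise ValueError("truncated packet")
    return system_id, flags, data
```

With one-byte fields the payload is limited to 255 bytes; a real design would choose widths to suit the largest DAR payload.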

Diagnosis And Recovery (DAR) Packet Flow
- Participants in the sequence diagram: OBC-Processor, OBC-Interface FPGA, Supervisor, Code Store
- Diagram labels (in transcript order; the original ordering of events is not recoverable from the transcript)
  - SODARP Marker
  - Start Sampling Current
  - Perform Test
  - DAR Packet Received
  - Compare with Stored Values
  - Enable Interrupts
  - Collect SEU Count
  - Send DAR Packet
  - Waiting for Supervisor Response
  - Command to Update Program Memory
  - Update Memory
  - DAR Process Starts
  - Disable Interrupts
  - Collect Current Value

Recovery Method

Fault Type                              Recovery Procedure
Screech                                 Reload program memory
Packet time-out (network problems)      Next slide
Packet time-out (processor problem)     Next slide
Current consumption variations          Power cycle and reload memory
SEU count exceeding threshold           Reload memory
Test task result mismatch               Reset and reload memory
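The fault-to-recovery mapping on this slide lends itself to a lookup table. The sketch below is illustrative only: the key names and the `recovery_steps` helper are invented for the example, and the two packet time-out cases are left out because the slide defers them to the following slide.

```python
from typing import List, Optional

# Fault type -> ordered recovery steps, as listed on the slide.
# Key names are hypothetical identifiers chosen for this sketch.
RECOVERY = {
    "screech": ["reload program memory"],
    "current_variation": ["power cycle", "reload memory"],
    "seu_count_exceeded": ["reload memory"],
    "test_mismatch": ["reset", "reload memory"],
}


def recovery_steps(fault: str) -> Optional[List[str]]:
    """Return the recovery steps for a fault, or None if it is not in the table."""
    steps = RECOVERY.get(fault)
    return list(steps) if steps is not None else None
```

Returning a copy of the step list keeps callers from modifying the shared table by accident.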

Recovery Method (2)
- After a processor reset or power cycle, the OBC should be allowed sufficient time for reinitialization
- The supervisor needs to keep a record of the recoveries applied
- Consecutive recovery cycles need to be avoided
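One way to read these three points together is as a recovery log with a hold-off period. The following is a minimal sketch under that reading; the `RecoveryLog` class, its method names and the reinitialization time are assumptions for illustration, not the presenters' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RecoveryLog:
    """Record of applied recoveries with a hold-off window (hypothetical)."""
    reinit_time_s: float  # assumed: time the OBC needs to reinitialize
    history: List[Tuple[float, str]] = field(default_factory=list)

    def may_recover(self, now_s: float) -> bool:
        """Allow a new recovery only once the previous one had time to finish."""
        if not self.history:
            return True
        last_time_s, _ = self.history[-1]
        return now_s - last_time_s >= self.reinit_time_s

    def record(self, now_s: float, action: str) -> None:
        """Append an applied recovery action to the log."""
        self.history.append((now_s, action))
```

The supervisor would call `may_recover` before acting on a fault and `record` afterwards, so that a slow reinitialization is not mistaken for a fresh fault.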

Test Bed
- Demonstration of the synchronization protocol
  - PC1 executes the OBC program
  - PC2 executes the supervisor program
- Figures: Synchronization Scheme 1, Synchronization Scheme 2

Synchronization Scheme 1

Configuration 1
- A UDP/IP packet from the supervisor arrives on Ethernet
- The RC 203 board passes it as it is to the OBC program
- The OBC program receives the packet and checks its source; if it is from the supervisor program, it sends a packet to the FPGA
- The FPGA sends the packet to the supervisor program on Ethernet

Configuration 2
- A UDP/IP packet from the supervisor arrives on Ethernet
- The RC 203 board passes only the data bytes
- The OBC program receives the data and sends data bytes to the FPGA
- The FPGA encodes the received data into a UDP/IP packet

Link labels in both diagrams: Ethernet, Parallel Port, Packet on Ethernet

Time Measurement Method
- The Ethereal graphical user interface (GUI) network protocol analyzer was used
  - It displays the time at which each packet was captured
  - It also displays the IP source and destination, the protocol type, and the source and destination ports for all captured packets
  - Selecting a packet from the list of captured packets shows the total bytes captured on the network medium, the Ethernet source and destination addresses, and the number of data bytes in the packet
- For synchronization scheme 1, time was measured from the moment the analyzer captured a packet sent by the supervisor to the moment it captured the return packet from the OBC
- For synchronization scheme 2, time was measured between two consecutive packets from the OBC
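The two timing rules above can be expressed as simple operations on a list of capture timestamps. The sketch below assumes the capture has already been exported as (time, source) pairs; the `Capture` type, the source labels and both function names are hypothetical and not part of the analyzer.

```python
from typing import List, Tuple

# (capture time in seconds, packet source: "supervisor" or "obc")
Capture = Tuple[float, str]


def round_trip_times(captures: List[Capture]) -> List[float]:
    """Scheme 1: time from each supervisor packet to the next OBC packet."""
    times: List[float] = []
    sent_at = None
    for t, src in captures:
        if src == "supervisor":
            sent_at = t
        elif src == "obc" and sent_at is not None:
            times.append(t - sent_at)
            sent_at = None  # wait for the next supervisor packet
    return times


def inter_packet_times(captures: List[Capture]) -> List[float]:
    """Scheme 2: time between consecutive OBC packets."""
    obc = [t for t, src in captures if src == "obc"]
    return [b - a for a, b in zip(obc, obc[1:])]
```

Averaging the values returned by either function over a run gives the kind of per-experiment figure reported in the results.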

Results (1)

Table columns:
- Data bytes
- Average time between the two packets in a pair (supervisor packet and OBC packet in response) (ms)
- Time required for 1 byte to travel through the system (µs): the difference between the time measured in run n with N bytes and the time measured in run n+1 with N+K bytes, divided by K

Table values (only fragments survive in the transcript): /18 = /36 = /28 = /400 =
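The per-byte figure defined in the last column is a difference quotient between two runs with different payload sizes. A minimal sketch follows, assuming times are given in milliseconds as in the second column and that the magnitude of the difference is what is wanted; the function name and the numbers in the usage line are hypothetical, since the measured values did not survive in the transcript.

```python
def per_byte_time_us(t_n_ms: float, n_bytes: int,
                     t_next_ms: float, next_bytes: int) -> float:
    """Per-byte travel time in microseconds from two runs.

    t_n_ms    : average time of the run with n_bytes data bytes (ms)
    t_next_ms : average time of the run with next_bytes = n_bytes + K (ms)
    """
    k = next_bytes - n_bytes
    if k <= 0:
        raise ValueError("second run must carry more data bytes")
    return (t_next_ms - t_n_ms) * 1000.0 / k


# Hypothetical values: 1.0 ms with 10 bytes, 1.5 ms with 20 bytes.
example_us = per_byte_time_us(1.0, 10, 1.5, 20)
```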

Results (2)

Time measured between a supervisor packet and a response packet from the OBC (µs). Values for the first three experiments are missing from the transcript.

Data bytes   Experiment                                             Time (µs)
18           Configuration 1 with RC200GetBlock Stall function
             Configuration 1 with RC200GetBlock function
             Configuration 2 with RC200GetBlock Stall function
             Ping program                                           270

Synchronization Scheme 2

Configuration for Synchronization Scheme 2
- The OBC sends data bytes to the interface FPGA (parallel port)
- The interface FPGA encodes the data into a UDP/IP packet and writes it on Ethernet

Experiment                                                              Time measured
Synchronization scheme 2: time between two consecutive
  packets from the OBC (18 data bytes)                                  261 µs
Synchronization scheme 1: time between the last OBC packet prior
  to the fault and the first packet after recovery
  - OBC program crashed and reinitialized manually                      12 s, 299 ms and 644 µs
  - FPGA cleared using FTU facility and OBC program
    reinitialized manually                                              15 s, 339 ms and 501 µs
Synchronization scheme 2: time between the last OBC packet prior
  to the fault and the first packet after recovery
  - OBC program crashed and reinitialized manually                      11 s, 316 ms and 272 µs
  - FPGA cleared using FTU facility and OBC program
    reinitialized manually                                              14 s, 971 ms and 559 µs
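To compare the recovery times quoted on this slide, it helps to convert the composite "seconds, milliseconds, microseconds" values into a single unit. The helper below is a small worked example; the function name is invented here, and pairing the values by their order of appearance within each scheme is an assumption about the flattened table.

```python
def to_us(seconds: int, millis: int, micros: int) -> int:
    """Convert a (s, ms, µs) triple into whole microseconds."""
    return seconds * 1_000_000 + millis * 1_000 + micros


# Recovery times as quoted on the slide, in order of appearance per scheme.
scheme1_us = [to_us(12, 299, 644), to_us(15, 339, 501)]
scheme2_us = [to_us(11, 316, 272), to_us(14, 971, 559)]

# Difference between the two schemes for each pair of values, in microseconds.
differences_us = [a - b for a, b in zip(scheme1_us, scheme2_us)]
```

All four recovery times are on the order of seconds because they include manual reinitialization, whereas the normal packet interval for scheme 2 is 261 µs.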

Conclusions
- A system-level approach has been presented to mitigate SEFIs in data handling architectures
  - Upset detection is not straightforward, which limits the effectiveness of currently available mitigation techniques
  - SEFI susceptibility is increasing in all major data handling device technologies
  - A system-level intelligent supervisor allows monitoring of a wide range of devices with minimal overhead
  - Synchronization is straightforward
- Two synchronization schemes have been demonstrated
  - A few simple experiments were performed to establish a time-out period for a packet from the OBC
  - Once this information was obtained, the system behaved as expected and synchronized packet communication was established between the OBC and the supervisor programs
  - In the event of a SEFI, the supervisor program needs to wait until the OBC program is up again; the time-out for this wait period will depend on the recovery latency