Exploiting HW+SW Partitioning for Reliable Embedded Systems Part 2.


Summary
1. Introduction: targeting the problem
2. The Possible Solution
   2.1. SW-Based Fault Detection Mechanisms
   2.2. Migrating SW-Based Fault Detection Mechanisms into HW
3. Experimental Evaluation
4. Final Considerations

1. Introduction: targeting the problem
The increasing number of computer-based critical applications raises questions about which techniques can guarantee a sufficient degree of reliability while keeping design and manufacturing costs reasonable.

1. Introduction: targeting the problem
Techniques commonly used (on-chip and system level) as stand-alone solutions:
Fault-Tolerance Techniques (HW, SW, Time or Info domains)
- Duplication/Voter, TMR
- Layout-Driven Fault Avoidance
- Watch-Dog Timer
- Consistency Checks
- Capability Checks
- Re-computation
- EDAC
Impacts on design: performance, weight, size/volume, power consumption, reliability.

1. Introduction: targeting the problem
HW techniques, disadvantages:
- High area overhead
- High development/fab cost
SW techniques, disadvantages:
- Significant performance degradation
- Memory overhead

2. The Possible Solution
Developing a hybrid methodology (HW+SW redundancies) able to detect errors at runtime in µprocessor-based SoCs may yield a very good cost/benefit ratio.

2. The Possible Solution
Returns:
- Minimization of area overhead and fab/development costs (benefits of SW-based redundancy techniques)
- Improvement of performance and minimization of memory overhead (benefits of HW-based redundancy techniques)
In summary: minimize fab cost and performance degradation while improving reliability.
Target faults:
- Control flow errors
- Data handling errors

2. The Possible Solution
The hybrid methodology (HW+SW redundancies) explores:
- I-IP core architecture
- Software-based techniques

2. The Possible Solution
HW+SW SoC FT architecture (figure): a µP IP, a Memory IP, a Custom IP, an I/O port and the WDT / I-IP are connected to the SoC bus.
- Memory IP: stores a hardened program.
- WDT / I-IP: observes the information flow traveling on the bus, computes and stores at run time the control flow signatures and the data read from memory, and drives the Mismatch signal.

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting data:
- Cerberus (Matteo et al.)
Faults affecting control:
- ECCA (Matteo et al.)
- CFCSS (McCluskey et al.)
- ECI (Miremadi et al.)

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting data: Cerberus (Matteo et al.)
Code modification for errors affecting data.

Original code:
    a = b;
Modified code:
    a0 = b0;
    a1 = b1;
    if (b0 != b1) error();

Original code:
    a = b + c;
Modified code:
    a0 = b0 + c0;
    a1 = b1 + c1;
    if ((b0 != b1) || (c0 != c1)) error();
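The duplication rule above can be tried out with a small C sketch. The type `dup_int`, the functions `dup_assign` and `dup_add`, and the flag `dup_error` are hypothetical names introduced only for illustration, and `report_error` stands in for the slide's `error()`. Declaring the copies `volatile` is an assumption, made so that an optimizing compiler does not merge the two copies and drop the check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical duplicated integer: two independent copies of one value.
 * volatile is an assumption so the compiler keeps both copies alive. */
typedef struct {
    volatile int c0;
    volatile int c1;
} dup_int;

/* Set when a consistency check fails. */
bool dup_error = false;

/* Stands in for error() in the slide. */
void report_error(void) { dup_error = true; }

/* a = b   becomes   a0 = b0; a1 = b1; if (b0 != b1) error(); */
void dup_assign(dup_int *a, const dup_int *b)
{
    a->c0 = b->c0;
    a->c1 = b->c1;
    if (b->c0 != b->c1)
        report_error();
}

/* a = b + c   becomes   a0 = b0 + c0; a1 = b1 + c1; check b and c. */
void dup_add(dup_int *a, const dup_int *b, const dup_int *c)
{
    a->c0 = b->c0 + c->c0;
    a->c1 = b->c1 + c->c1;
    if ((b->c0 != b->c1) || (c->c0 != c->c1))
        report_error();
}
```

Flipping one bit in a single copy of an operand before calling `dup_add` makes the two copies disagree, so the check sets `dup_error`.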

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting data: Cerberus (Matteo et al.)
Code transformation for errors affecting procedure parameters.

Original code:
    res = search(a);
    …
    int search(int p)
    {
        int q;
        …
        q = p + 1;
        …
        return(1);
    }

Modified code:
    search(a0, a1, &res0, &res1);
    …
    void search(int p0, int p1, int *r0, int *r1)
    {
        int q0, q1;
        …
        q0 = p0 + 1;
        q1 = p1 + 1;
        if (p0 != p1)
            error();
        …
        *r0 = 1;
        *r1 = 1;
        return;
    }

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting control: ECCA (Error Control-Flow Checking using Assertions) (Matteo et al.)
Example: detecting branches that are not allowed.

Original code:
    /* Basic Block beginning */
    …
    /* Basic Block end */

Modified code:
    /* Basic Block beginning #371 */
    ecf = 371;
    …
    if (ecf != 371) error();
    /* Basic Block end */

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting control: ECCA (Error Control-Flow Checking using Assertions) (Matteo et al.)
Code transformation for a test statement.

Original code:
    if (condition)
    {   /* Block A */
        …
    }
    else
    {   /* Block B */
        …
    }

Modified code:
    if (condition)
    {   /* Block A */
        if (!condition)
            error();
        …
    }
    else
    {   /* Block B */
        if (condition)
            error();
        …
    }
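A minimal C sketch of the test-statement transformation follows. The two branches are split into the hypothetical functions `block_a` and `block_b` so that an erroneous branch can be simulated by calling a block directly with the wrong condition value; `guarded_if` and the flag `cfe_detected` are likewise illustrative names, not part of the original technique.

```c
#include <assert.h>
#include <stdbool.h>

/* Set when a block is entered although the condition contradicts it. */
bool cfe_detected = false;

/* Block A must only run when the condition is true. */
int block_a(bool condition)
{
    if (!condition)
        cfe_detected = true;   /* stands in for error() */
    return 1;
}

/* Block B must only run when the condition is false. */
int block_b(bool condition)
{
    if (condition)
        cfe_detected = true;   /* stands in for error() */
    return 2;
}

/* Hardened test statement: each branch re-checks the condition. */
int guarded_if(bool condition)
{
    if (condition)
        return block_a(condition);
    else
        return block_b(condition);
}
```

In a fault-free run `guarded_if` never sets the flag; calling `block_b(true)` directly mimics a corrupted branch target and is caught by the re-check. On a real target the condition would typically be re-evaluated from `volatile` or duplicated operands so the compiler cannot remove the second test.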

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting control: ECCA (Error Control-Flow Checking using Assertions) (Matteo et al.)
In summary, to harden a given program this approach introduces the following assertions into each basic block v_j:
- Test assertion: checks the signature on entry to basic block v_j, verifying that the previously executed block v_i belongs to pred(v_j).
- Set assertion: updates the signature, setting it to the value B_j associated with v_j: $B_j = (B_i \oplus M_1) \oplus M_2$

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting control: ECCA (Error Control-Flow Checking using Assertions) (Matteo et al.)

    01: while(k1<DIM)
    02: {
    03:   if( B != M1 && B != M2 )
    04:     //Error detected
    05:   A1 = matrixA1[i1][k1];
    06:   B1 = matrixB1[k1][j1];
    07:   C1 += A1*B1;
    08:   matrixC1[i1][j1] = C1;
    09:   k1++;
    10:   Bj = (Bi ^ M1) ^ M2;
    11: }

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting control: CFCSS (McCluskey et al.)
Principle: modification of a basic block (figure).

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting control: CFCSS (McCluskey et al.)
Basically, the approach consists of six steps:
1) Divide the program into basic blocks. A basic block is a minimal set of ordered instructions whose execution begins at the first instruction and terminates at the last one. There is no branching instruction in a basic block except possibly the last one. A basic block terminates at either an instruction branching to another basic block or an instruction receiving transfer of control flow (CF) from two or more places in the program. Notation: (a) V = {v_i : i = 1, 2, …, n}: set of vertices denoting basic blocks; (b) E: set of edges denoting possible CF between basic blocks.
2) Construct a graph for the program according to the instruction flow (each node represents a basic block). A program can be represented by a program graph P, where the edges br_i,j are not necessarily explicit branch instructions; they also represent fall-through execution paths, jumps, subroutine calls, and returns. Fig. 2.5 is an example. Notation: P: program graph {V, E}.
3) Arbitrarily assign a signature to each node (compilation time).
4) Compute the signature difference between the source and the destination blocks.
5) Compute the new signature for each node (execution time).
6) Compare both signatures.

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting control: CFCSS (McCluskey et al.)
Sequence of instructions and its graph (figure). Detection of an illegal branch.
General form: $f(G, d_i) = G \oplus d_i$
Legal branch from v_1 to v_2: $G_2 = f(G_1, d_2) = G_1 \oplus d_2 = s_1 \oplus (s_1 \oplus s_2) = s_2$
Illegal branch from v_1 to v_4: $G_4 = f(G_1, d_4) = G_1 \oplus d_4 = G_1 \oplus (s_3 \oplus s_4) = s_1 \oplus s_3 \oplus s_4 \neq s_4$
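The two derivations above can be checked numerically with a short C sketch. The signature values, the node enumeration and the function names (`cfcss_diff`, `cfcss_update`, `cfcss_ok`) are assumptions chosen for illustration; the graph has only the two legal edges used in the formulas, v1 to v2 and v3 to v4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { V1, V2, V3, V4, NUM_NODES };

/* Arbitrary compile-time signatures s_i (illustrative values). */
static const uint8_t sig[NUM_NODES] = { 0x1, 0x2, 0x3, 0x4 };

/* Legal predecessor of each node: v1 -> v2 and v3 -> v4.
 * Entry nodes v1 and v3 list themselves, so their difference is 0. */
static const int pred[NUM_NODES] = { V1, V1, V3, V3 };

/* Signature difference d_j = s_pred(j) XOR s_j, known at compile time. */
uint8_t cfcss_diff(int node)
{
    return (uint8_t)(sig[pred[node]] ^ sig[node]);
}

/* Run-time update on entering a node: G = G XOR d_j. */
uint8_t cfcss_update(uint8_t G, int node)
{
    return (uint8_t)(G ^ cfcss_diff(node));
}

/* Check on entering a node: the run-time signature must equal s_j. */
bool cfcss_ok(uint8_t G, int node)
{
    return G == sig[node];
}
```

Starting from G = s1, entering v2 yields s2 and passes the check, while jumping from v1 straight into v4 yields s1 XOR s3 XOR s4, which differs from s4 whenever s1 and s3 differ.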

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting control: CFCSS (McCluskey et al.)
Detection of an illegal branch: a numerical example (figure).

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting control: CFCSS (McCluskey et al.)
Branch fan-in nodes: node v_1 and node v_3 have the same signature (figure).

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting control: CFCSS (McCluskey et al.)
Adjusting signature D: node v_1 and node v_3 have different signatures (figure).
$G_5 = f(G_1, d_5, D_1) = G_1 \oplus d_5 \oplus D_1 = s_1 \oplus (s_1 \oplus s_5) \oplus 000 = s_5$
$G_5 = f(G_3, d_5, D_3) = G_3 \oplus d_5 \oplus D_3 = s_3 \oplus (s_1 \oplus s_5) \oplus (s_1 \oplus s_3) = s_5$
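The fan-in case can be verified the same way. In this sketch the signature values and all identifiers (`S1`, `S3`, `S5`, `D5`, `ADJ_V1`, `ADJ_V3`, `cfcss_fanin_update`) are assumptions for illustration: v5 has the two legal predecessors v1 and v3, v1 is taken as the reference predecessor, and each predecessor sets its own adjusting signature D before branching.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative signatures; S2 belongs to a node that is NOT a predecessor of v5. */
enum { S1 = 0x1, S2 = 0x2, S3 = 0x3, S5 = 0x5 };

/* d5 is computed against the reference predecessor v1. */
enum { D5 = S1 ^ S5 };

/* Adjusting signatures: 0 in the reference predecessor, s1 XOR s3 in v3. */
enum { ADJ_V1 = 0x0, ADJ_V3 = S1 ^ S3 };

/* Run-time update on entering a fan-in node: G = G XOR d XOR D. */
uint8_t cfcss_fanin_update(uint8_t G, uint8_t d, uint8_t D)
{
    return (uint8_t)(G ^ d ^ D);
}
```

Both legal paths produce s5, while an entry from a node that is not a predecessor (here the one carrying `S2`) does not.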

2. The Possible Solution: SW-Based Fault Detection Mechanisms
Faults affecting control: ECI (Miremadi et al.)
Trap instructions are inserted in the program area, in the data area, and in the unused area of the memory. The ECIs are placed in main memory locations that the CPU does not use during normal execution, so the execution of an ECI indicates that a control flow error has occurred. The task of an ECI is to initiate a recovery process.

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
- The WDT / I-IP works in symbiosis with the processor, which is not modified.
- The WDT / I-IP continuously observes the execution information flowing on the bus and uses it to test and update signatures.
- If a mismatch is detected, the WDT / I-IP outputs a mismatch signal.

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Piece of code for control-flow fault detection (ECCA partitioning):

    01: while(k1<DIM)
    02: {
    03:   IIPtest( BB1 );
    04:   IIPtest( BB2 );
    05:   A1 = matrixA1[i1][k1];
    06:   B1 = matrixB1[k1][j1];
    07:   C1 += A1*B1;
    08:   matrixC1[i1][j1] = C1;
    09:   k1++;
    10:   IIPset( BB2 );
    11: }

Lines 03, 04 and 10 replace the software-only assertions:

    03:   if( B != M1 && B != M2 )
    04:     //Error detected
    10:   Bj = (Bi ^ M1) ^ M2;

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
WDT / I-IP architecture, three modules:
- Bus interface logic: detects signatures passing on the bus (adx, data).
- CAM memory: stores flow signatures.
- Consistency check logic: compares flow signatures and drives the Mismatch signal.

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
WDT / I-IP architecture (figure). Top-level inputs: Clk, Reset, Instruction_in, Ram_data_in, Ram_address_in.
- Module 1, Bus Interface Logic: Clk, Reset, Instruction_in, Ram_data_in, Ram_address_in, Data_memory_in, Data_memory_out, Adr_memory_out, Ctrl_rw_out, En_compare_out, Data_1_out, Data_2_out
- Module 2, CAM Memory: Clk, Reset, Data_memory_out, Data_memory_in, Adr_memory_in, Ctrl_rw_in
- Module 3, Consistency Check Logic: Clk, Reset, En_compare_out, Data_1_out, Data_2_out, Mismatch Signal
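To make the division of labor between the three modules concrete, here is a behavioral C model of one possible reading. It is a sketch under stated assumptions, not the PANDORA implementation: the addresses `IIP_TEST_ADDR` and `IIP_SET_ADDR`, the CAM depth, the type `iip_model` and the rule "at a set, the current signature must be among the signatures offered by the preceding tests" are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bus addresses the I-IP watches for. */
#define IIP_TEST_ADDR 0xFF00u
#define IIP_SET_ADDR  0xFF02u
#define IIP_CAM_SIZE  8u

typedef struct {
    uint8_t cam[IIP_CAM_SIZE]; /* CAM: legal predecessor signatures */
    size_t  cam_used;
    uint8_t current;           /* signature of the last executed block */
    bool    mismatch;          /* models the Mismatch signal */
} iip_model;

void iip_reset(iip_model *w, uint8_t initial_signature)
{
    w->cam_used = 0;
    w->current  = initial_signature;
    w->mismatch = false;
}

/* Consistency check logic: is the current signature in the CAM? */
static bool iip_consistent(const iip_model *w)
{
    for (size_t i = 0; i < w->cam_used; i++)
        if (w->cam[i] == w->current)
            return true;
    return false;
}

/* Bus interface logic: called for every write seen on the bus. */
void iip_bus_write(iip_model *w, uint16_t addr, uint8_t data)
{
    if (addr == IIP_TEST_ADDR) {
        if (w->cam_used < IIP_CAM_SIZE)
            w->cam[w->cam_used++] = data;   /* store a legal predecessor */
    } else if (addr == IIP_SET_ADDR) {
        if (!iip_consistent(w))
            w->mismatch = true;             /* raise Mismatch */
        w->current  = data;                 /* new block signature */
        w->cam_used = 0;
    }
    /* All other bus traffic is ignored by this model. */
}
```

With this reading, the loop body from the ECCA partitioning slide corresponds to two writes to `IIP_TEST_ADDR` followed by one write to `IIP_SET_ADDR`; entering the loop with a signature that neither test announced raises `mismatch`.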

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Consider now that the µprocessor-based SoC runs under an Operating System …
The application code accounts for only a fraction of the total execution time during system operation!

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Critical applications need operating systems (OS) that guarantee correct and safe behavior despite the occurrence of errors.
Faults can affect OS calls as well as the OS kernel: how does the system react to invalid or corrupted values handled by the kernel?

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
HW-SW Partitioning for Fault-Detection in Complex Systems (figure):
- µProcessor (DragonBall, ARM, Pentium, 8086, 68K) with its Status Register
- Memory (Application Code + Data): SW part
- Memory (Operating System: µCLinux, µCOS-II) with the Driver: SW part
- WDT / I-IP in programmable logic: HW part, producing the Error Indication
- Address + Data Bus connecting the SoC blocks, and a communication channel (Com Channel) between the HW and SW parts

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Status Information
MC68VZ328 block diagram (figure). Blocks on the internal bus: FLX68000 static CPU, CGM & power control, real-time clock, in-circuit emulation, interrupt controller, memory controller, bootstrap mode, 8/16-bit bus interface, 16-bit timers (2), 8-bit PWM 1, 16-bit PWM 2, SPI 1, SPI 2, UART 1 (IrDA 1.0), UART 2 (IrDA 1.0), LCD controller, GPIO ports. Highlighted: special function pins (CPU space).

2. The Possible Solution Migrating SW-Based Fault Detection Mechanism into HW Status Information

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Status Information
Special function pins (CPU space): FC2, FC1, FC0 (function code outputs)

FC2 FC1 FC0   Processor cycle type
0   0   0     Undefined, reserved
0   0   1     User data
0   1   0     User program
0   1   1     Undefined, reserved
1   0   0     Undefined, reserved
1   0   1     Supervisor data
1   1   0     Supervisor program
1   1   1     CPU space (interrupt acknowledge)
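A monitor such as the WDT / I-IP could decode these pins as sketched below. The enum `cycle_type` and the functions `decode_fc` and `is_supervisor_cycle` are hypothetical names; the encoding assumes the standard 68000 function-code assignment, in which 101 denotes supervisor data and 011 and 100 are reserved.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    CYCLE_RESERVED,
    CYCLE_USER_DATA,
    CYCLE_USER_PROGRAM,
    CYCLE_SUPERVISOR_DATA,
    CYCLE_SUPERVISOR_PROGRAM,
    CYCLE_CPU_SPACE
} cycle_type;

/* Map the three function-code pin levels (0 or 1) to a cycle type. */
cycle_type decode_fc(unsigned fc2, unsigned fc1, unsigned fc0)
{
    unsigned code = ((fc2 & 1u) << 2) | ((fc1 & 1u) << 1) | (fc0 & 1u);

    switch (code) {
    case 1u: return CYCLE_USER_DATA;
    case 2u: return CYCLE_USER_PROGRAM;
    case 5u: return CYCLE_SUPERVISOR_DATA;
    case 6u: return CYCLE_SUPERVISOR_PROGRAM;
    case 7u: return CYCLE_CPU_SPACE;
    default: return CYCLE_RESERVED;   /* 000, 011 and 100 */
    }
}

/* True when the bus cycle belongs to supervisor (kernel) execution. */
bool is_supervisor_cycle(cycle_type t)
{
    return t == CYCLE_SUPERVISOR_DATA || t == CYCLE_SUPERVISOR_PROGRAM;
}
```

For example, `is_supervisor_cycle(decode_fc(1, 0, 1))` is true, which is the kind of status information that lets a monitor tell kernel activity from application activity.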

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Status Information: A16 - A19 pins
FC2 = FC1 = FC0 = 1 also indicates CPU-space operations other than interrupt acknowledge cycles (e.g. co-processor communications). The different CPU spaces are then identified on pins A16 - A19, if properly decoded.

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Status Information
Interrupt control pins: IPL2, IPL1, IPL0 (interrupt priority level)
IPL2 IPL1 IPL0 = 000 is the lowest priority; the priority rises with the encoded value (001, …) up to the highest priority.

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Status Information
Event-Ticking Pins (ETPs) on the Pentium: PM0, PM1
The ETPs are associated with Model Specific Registers (MSRs) to monitor the number of cache misses, committed instructions, executed interrupts, taken branches, ...
MSRs: counters CTR0 and CTR1, programmed through the Control and Event Select Register (CESR).

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Status Information
Instructions used to program and read counters CTR0 and CTR1 through the Control and Event Select Register (CESR): WRMSR and RDMSR.
Both instructions are privileged: they may only be executed at CPL 0 (Current Privilege Level 0).

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Status Information
Event-Ticking Pins (ETPs) on the DragonBall core: d_i, s_u
- d_i: “0” = data; “1” = instruction; “z” = undefined.
- s_u: “0” = supervisor mode; “1” = user mode; “z” = undefined.
These pins were added to the processor core to serve as the interface with the I-IP (watch-dog).

2. The Possible Solution Migrating SW-Based Fault Detection Mechanism into HW Event-Ticking Pins – ETPs: d_i, s_u Status Information

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
The error detection coverage of the µC/OS-II OS has been measured, and the critical OS data structures that need improvement have been pointed out, in order to improve the final robustness of the µC/OS-II operating system.
(Juan Pardo, 2004, Fault Tolerant Systems Group, Polytechnic University of Valencia, Spain)

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
µC/OS-II Operating System
- It was selected because it has been widely used for several years, in particular for embedded applications.
- First version: µC/OS, 1992.
- Application areas: industrial robots, motor control, medical instruments, etc.
- It is 99% compliant with the Motor Industry Software Reliability Association (MISRA) C Coding Standards.
- All Modified Condition Decision Coverage (MCDC) code in µC/OS-II has been removed, improving code quality for RTCA / EUROCAE DO-178B Level A-certified environments for avionics applications.

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
µC/OS-II: Characteristics
- Portable: µC/OS-II is written in highly portable ANSI C, with target microprocessor-specific code written in assembly language.
- ROMable: it was designed for embedded applications. With the proper tool chain (i.e., C compiler, assembler, and linker/locator), µC/OS-II can be embedded as part of a product.
- Scalable: it is possible to use only the services needed by the application, which reduces the amount of memory (both RAM and ROM) required. Scalability is achieved through conditional compilation (full version: 8 KB).
- Preemptive: µC/OS-II is a fully preemptive real-time kernel, so it always runs the highest-priority task that is ready.
- Multitasking: µC/OS-II can manage up to 64 tasks. The current version of the software reserves 8 of these tasks for system use, which leaves up to 56 tasks for the application. Each task has a unique priority assigned to it, which means that µC/OS-II cannot do round-robin scheduling.

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
µC/OS-II: Characteristics
- Deterministic: the execution time of all µC/OS-II functions and services is deterministic, so it is always known how long µC/OS-II takes to execute a function or a service. Furthermore, the execution time of all µC/OS-II services does not depend on the number of tasks running in the application.
- Task stacks: each task requires its own stack. µC/OS-II allows each task to have a different stack size, which reduces the amount of RAM needed by the application.
- Services: system services such as mailboxes, queues, semaphores, fixed-sized memory partitions, time-related functions, etc.
- Interrupt management: interrupts can suspend the execution of a task. If a higher-priority task is awakened as a result of the interrupt, the highest-priority task runs as soon as all nested interrupts complete. Interrupts can be nested up to 255 levels deep.
- Robust and reliable: µC/OS-II is based on µC/OS, which has been used in hundreds of commercial applications since its introduction.

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Workload Design
Characteristics: worst-case application, with maximum consumption of system calls.
System calls: synchronization, semaphores, memory, queues, messages, task handling, timing management, etc.

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Workload Design: Addition of Consistency Checks
- The system workload runs continuously and consists of a series of tasks executing the application.
- Consistency checks are added to the application code and to the kernel to detect faults and invalid values at the kernel calls, in order to improve system robustness.
- The WDT / I-IP is the monitor.
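As an illustration of what a consistency check at a kernel call can look like, the sketch below validates a semaphore object before operating on it. Everything here is hypothetical and is not the µC/OS-II API: the structure `demo_sem`, the tag value `DEMO_SEM_TAG`, the bound `DEMO_SEM_MAX` and the functions `demo_sem_check` and `demo_sem_pend` are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SEM_TAG 0x5EA5u   /* hypothetical type tag stored in every semaphore */
#define DEMO_SEM_MAX 255u      /* hypothetical upper bound for the counter */

typedef struct {
    uint16_t tag;     /* must equal DEMO_SEM_TAG */
    uint16_t count;   /* must not exceed DEMO_SEM_MAX */
} demo_sem;

typedef enum {
    CHECK_OK,
    CHECK_NULL,
    CHECK_BAD_TAG,
    CHECK_BAD_COUNT
} check_result;

/* Consistency check on the kernel object passed to a call. */
check_result demo_sem_check(const demo_sem *sem)
{
    if (sem == NULL)
        return CHECK_NULL;
    if (sem->tag != DEMO_SEM_TAG)
        return CHECK_BAD_TAG;
    if (sem->count > DEMO_SEM_MAX)
        return CHECK_BAD_COUNT;
    return CHECK_OK;
}

/* Simplified, non-blocking "pend": refuses to touch a corrupted object. */
check_result demo_sem_pend(demo_sem *sem)
{
    check_result r = demo_sem_check(sem);
    if (r != CHECK_OK)
        return r;
    if (sem->count > 0)
        sem->count--;
    return CHECK_OK;
}
```

A bit-flip in the tag or an out-of-range counter is reported to the caller instead of being silently used by the kernel.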

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Workload Design

    // 1. Define necessary configuration constants for uC/OS-II
    #define OS_MAX_EVENTS     2
    #define OS_MAX_TASKS      20
    #define OS_MAX_QS         0
    #define OS_Q_EN           0
    #define OS_MBOX_EN        0
    #define OS_TICKS_PER_SEC  32

    // 2. Define necessary stack configuration constants
    #define STACK_CNT_512  1             // initial program stack
    #define STACK_CNT_1K   OS_MAX_TASKS  // task stacks

    // 3. This ensures that the above definitions are used
    #use "ucos2.lib"

    void RandomNumberTask(void *pdata);

    // Declare semaphore global so all tasks have access
    OS_EVENT* RandomSem;

    void main(){
        int i;
        // Initialize OS internals
        OSInit();
        // Initializing tasks
        for(i = 0; i < OS_MAX_TASKS; i++){
            // Create each of the system tasks
            OSTaskCreate(RandomNumberTask, NULL, 1024, i);
        }
        // semaphore to control access to random number generator
        RandomSem = OSSemCreate(1);
        // 4. Set number of system ticks per second
        OSSetTicksPerSec(OS_TICKS_PER_SEC);
        // Starting tasks: begin multi-tasking
        OSStart();
    }

    void RandomNumberTask(void *pdata)
    {
        // Declare as auto to ensure reentrancy.
        auto OS_TCB data;
        auto INT8U err;
        auto INT16U RNum;

        OSTaskQuery(OS_PRIO_SELF, &data);
        while(1){
            // Rand is not reentrant, so access must be controlled
            // via a semaphore.
            OSSemPend(RandomSem, 0, &err);   // OS call (task waits for signal)
            RNum = (int)(rand() * 100);
            OSSemPost(RandomSem);            // OS call (task sends a signal)
            printf("Task%02d's random #: %d\n", data.OSTCBPrio, RNum);
            // Wait 3 seconds in order to view output from each task.
            OSTimeDlySec(3);
        }
    }

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Workload Design
An indication is set at the instant when the processor enters supervisor mode (OS_ENTER_CRITICAL) and when it leaves this mode (OS_EXIT_CRITICAL). The signaling is done by writing to a specific memory address.

    OS_ENTER_CRITICAL
    /* Code implemented for GNU-GAS */
    asm ("
        move.l #0x0100, -(%a0)   | Write the hexadecimal 0x0100 in a0
        move.b #11, %a0          | Move the byte 11 to the address a0
    ");
    …
    asm ("
        move.l #0x0100, -(%a0)   | Write the hexadecimal 0x0100 in a0
        move.b #00, %a0          | Move the byte 00 to the address a0
    ");
    OS_EXIT_CRITICAL

2. The Possible Solution: Migrating SW-Based Fault Detection Mechanism into HW
Workload Design
System calls performed by Pend and Post through Semaphore, Mailbox and Queue, with the consistency check inserted after OS_ENTER_CRITICAL():

    /* PEND ON SEMAPHORE */
    UBYTE OSSemPend(OS_SEM *psem, UWORD timeout)
    {
        UBYTE x, y, bitx, bity;

        OS_ENTER_CRITICAL();
        /* Code implemented for GNU-GAS */
        asm ("
            move.l #0x0100, -(%a0)   | Write the hexadecimal 0x0100 in a0
            move.b #4, %a0           | Move the byte 4 to the address a0
        ");
        /* End */
        if (psem->OSSemCnt-- > 0) {
            OS_EXIT_CRITICAL();
            return (OS_NO_ERR);
        } else {
            OSTCBCur->OSTCBStat |= OS_STAT_SEM;
            OSTCBCur->OSTCBDly   = timeout;
            y    = OSTCBCur->OSTCBPrio >> 3;
            x    = OSTCBCur->OSTCBPrio & 0x07;
            bity = OSMapTbl[y];
            bitx = OSMapTbl[x];
            if ((OSRdyTbl[y] &= ~bitx) == 0)
                OSRdyGrp &= ~bity;
            psem->OSSemTbl[y] |= bitx;
            psem->OSSemGrp    |= bity;
            OS_EXIT_CRITICAL();
            OSSched();
            OS_ENTER_CRITICAL();
            if (OSTCBCur->OSTCBStat & OS_STAT_SEM) {
                if ((psem->OSSemTbl[y] &= ~bitx) == 0) {
                    psem->OSSemGrp &= ~bity;
                }
                OSTCBCur->OSTCBStat = OS_STAT_RDY;
                OS_EXIT_CRITICAL();
                return (OS_TIMEOUT);
            } else {
                OS_EXIT_CRITICAL();
                return (OS_NO_ERR);
            }
        }
    }

3. Experimental Evaluation
An Intel 8051-based SoC was inspected.
PANDORA I-IP: VHDL (~1500 lines).
(Matteo Sonza Reorda, Fault Tolerant Systems Group, Politecnico di Torino)

3. Experimental Evaluation
Fault detection capabilities were evaluated via HW-based fault injection experiments (FPGA environment).
Four benchmarks were considered: matrix multiplication, elliptical filter, FIR filter and Viterbi algorithm.

3. Experimental Evaluation
Detection capabilities: transient faults (30,000 bit-flips). Number of wrong answers evaluated (faults that escape detection).
Table (values not preserved in the transcript): rows Matrix, Ellipf, FIR, Viterbi; columns Plain [%] (Orig. SW), Pandora [%] (IP, HW+SW), ECCA [%] and CFCSS [%] (SW Sol.).

3. Experimental Evaluation
Memory overhead: additional code lines required to implement the hybrid technique (chart; series: Orig. SW, IP (HW+SW), SW Sol.).

3. Experimental Evaluation
Execution time overhead (chart; series: Orig. SW, IP (HW+SW), SW Sol.).

3. Experimental Evaluation
Area overhead:
- PANDORA size ≈ 992 gates
- 8051 size ≈ … gates
PANDORA introduces about 3.2% of area overhead. The area overhead is expected to decrease as processor size increases.

4. Final Considerations
Developing a hybrid methodology (HW+SW redundancies) able to detect errors at runtime in µprocessor-based SoCs may yield a very good cost/benefit ratio.

4. Final Considerations
Returns:
- Minimization of area overhead and fab/development costs (benefits of SW-based redundancy techniques)
- Improvement of performance and minimization of memory overhead (benefits of HW-based redundancy techniques)
In summary: minimize fab cost and performance degradation while improving reliability.
Target faults:
- Control flow errors
- Data handling errors

4. Final Considerations
The hybrid methodology (HW+SW redundancies) explores:
- I-IP core architecture
- Software-based techniques

4. Final Considerations
System architecture co-implemented in HW+SW to detect faults in control flow and application data. The main characteristics of this architecture:
- SW-embedded structures at the application code level.
- Partial migration of the SW-embedded structures into HW: a specific I-IP monitors the application processor like a watch-dog.
- Communication channel between the HW and SW entities: a driver embedded in the OS kernel and specific signals that connect the I-IP to the application processor.