Fault Tolerance in the Nonstop Cyclone System, by Scott Chan and Robert Jardine. Presented by Phuc Nguyen.


1 Fault Tolerance in the Nonstop Cyclone System. By Scott Chan and Robert Jardine. Presented by Phuc Nguyen.

2 Introduction: In a traditional system, a hardware, software, or operational failure may cause the whole system to fail and stop delivering service. In a Tandem Nonstop system, parts of the system may fail, but the rest of the system tolerates the fault and continues delivering service.

3 Tandem computers: The basic philosophy of Tandem is that no single hardware component failure should result in a system failure. Fault tolerance means remaining available even if major components fail. Nonstop means that errors are detected and recovered from without stopping the system. The Tandem Nonstop Cyclone system was introduced in 1989. It is the sixth generation of the Tandem Nonstop system, which was first introduced in 1976.

4 Tandem Nonstop Cyclone system: A multiprocessor mainframe. The system consists of 4 to 16 processors interconnected by dual high-speed buses (Dynabus). Sections of 4 processors are interconnected by dual fiber-optic cables (Dynabus+). Each processor has its own CPU, memory, and up to 4 I/O channels, and executes its own copy of the Guardian OS.

5 …Tandem Nonstop Cyclone system: Fault detection is performed primarily by the hardware. Fault recovery is performed by the message-based OS. The system can tolerate a fault in a processor, peripheral controller, power supply, or cooling system. Failed components can be serviced online without disrupting processing.

7 Design principles: Cyclone was designed for high performance, guided by four principles: fault detection by both hardware and software, fail-fast operation, high reliability, and continuous system operation.

8 Hardware architecture: All major components are duplicated. Redundant hardware components allow processes to continue executing when a component fails: if one fails, another takes over. Components operate concurrently, which both enhances system performance and provides fault tolerance.

9 Dual-bus path: Each processor is connected to all other processors through dual high-speed buses (Dynabuses). When one bus fails, the other bus takes over without interruption.

10 Dual-ported controllers: I/O controllers are dual-ported to separate processors, so each I/O device can be controlled by either of two processors. Only one processor controls the I/O device at a time. An ownership bit in each controller selects which port is the primary. If the owning processor fails, the other processor takes over control.
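
The ownership-bit idea can be illustrated with a toy model. This is only a sketch: the class name, method names, and CPU labels are hypothetical and are not taken from the Cyclone hardware or from Guardian.

```python
class DualPortedController:
    """Toy model of an I/O controller reachable through two ports (0 and 1)."""

    def __init__(self, cpu_on_port0, cpu_on_port1):
        self.ports = [cpu_on_port0, cpu_on_port1]
        self.ownership_bit = 0  # selects which port is currently the primary

    @property
    def owner(self):
        """The processor attached to the port selected by the ownership bit."""
        return self.ports[self.ownership_bit]

    def issue_io(self, cpu, request):
        """Accept a request only from the processor that owns the controller."""
        if cpu != self.owner:
            raise PermissionError(f"{cpu} does not own this controller")
        return f"{request} done via port {self.ownership_bit}"

    def take_ownership(self, cpu):
        """Called by the surviving processor after the owner has failed."""
        if cpu not in self.ports:
            raise ValueError(f"{cpu} is not attached to this controller")
        self.ownership_bit = self.ports.index(cpu)


ctrl = DualPortedController("cpu0", "cpu1")
ctrl.take_ownership("cpu1")            # cpu0 is assumed to have failed
print(ctrl.issue_io("cpu1", "read"))
```

Keeping ownership as a single bit in the controller means only one processor drives the device at a time, while the second port is ready for takeover.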

11 Dual-ported disks: Disks are dual-ported to separate controllers. If one controller fails, the disks can communicate with the system through the other port.

12 Guardian OS: Plays a vital role in detecting failures and performing recovery. Supports the management of processes, memory, interprocess communication, names, I/O, synchronization, and debugging. Detects failures of processors, buses, and I/O channels. Performs recovery.

13 …Guardian OS: Each processor executes a copy of the Guardian OS. Two mechanisms provide fault detection and recovery: process pairs and the I am alive protocol.

14 Process pair: A primary process and a backup process execute concurrently in two different processors. The primary performs the work and, from time to time, sends a checkpoint message (state information) to the backup. If the primary fails, the backup takes over from the last checkpoint it received.

15 …Process pair…: The primary is started first, and it then starts the backup. The primary sends checkpoint messages to the backup. Meanwhile, the backup runs a loop in which it receives checkpoint and error messages. If no error occurs, the backup updates its memory from the checkpoint message and continues the loop. If an error occurs, the backup takes over from the last point it updated.

16 …Process pair: When the backup becomes the primary, it creates a new backup and then resumes processing from the last checkpoint it received. If the backup fails, the primary creates a replacement backup.
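
The process-pair behaviour described on these slides can be sketched in a few lines. This is a minimal single-threaded simulation, not Guardian code: the class names, the `checkpoint_every` parameter, and the use of an exception to stand in for a processor failure are all assumptions made for illustration.

```python
import copy


class BackupProcess:
    """Holds the most recent checkpoint received from the primary."""

    def __init__(self):
        self.checkpoint = None

    def on_checkpoint(self, state):
        self.checkpoint = copy.deepcopy(state)  # snapshot, not a shared reference


class PrimaryProcess:
    """Sums a list of items, checkpointing to the backup every few items."""

    def __init__(self, state, backup, checkpoint_every=2):
        self.state = state
        self.backup = backup
        self.checkpoint_every = checkpoint_every

    def run(self, items, fail_at=None):
        while self.state["next"] < len(items):
            i = self.state["next"]
            if i == fail_at:
                raise RuntimeError("primary processor failed")  # simulated fault
            self.state["total"] += items[i]
            self.state["next"] = i + 1
            if self.state["next"] % self.checkpoint_every == 0:
                self.backup.on_checkpoint(self.state)
        return self.state["total"]


def run_with_takeover(items, fail_at=None):
    backup = BackupProcess()
    start = {"next": 0, "total": 0}
    backup.on_checkpoint(start)  # initial checkpoint, so the backup is never empty
    primary = PrimaryProcess(copy.deepcopy(start), backup)
    try:
        return primary.run(items, fail_at=fail_at)
    except RuntimeError:
        # The backup becomes the primary, creates its own backup, and
        # resumes from the last checkpoint it received.
        new_backup = BackupProcess()
        new_backup.on_checkpoint(backup.checkpoint)
        new_primary = PrimaryProcess(copy.deepcopy(backup.checkpoint), new_backup)
        return new_primary.run(items)


print(run_with_takeover([1, 2, 3, 4, 5], fail_at=3))
```

In this sketch, work done after the last checkpoint is lost with the failed primary and is redone by the new primary, which is why the final total is the same with or without the simulated failure.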

17 I am alive protocol: All processors continually check on each other. Every second, each processor transmits an I am alive message to the other processors. Each processor checks that it has received a message from every other processor. A missing message implies that a processor has failed. The backup process then takes over from the failed primary process.
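
A rough sketch of the detection side of this protocol is shown below. The one-second send interval comes from the slide; the two-second timeout, the function name, and the dictionary of timestamps are assumptions chosen for illustration only.

```python
def find_silent_processors(last_heard, now, timeout=2.0):
    """Return the ids of processors whose latest I-am-alive message is too old.

    last_heard maps processor id -> time (in seconds) of the last message
    received from it. `timeout` is a hypothetical threshold, not a Guardian value.
    """
    return sorted(cpu for cpu, t in last_heard.items() if now - t > timeout)


# Processor 2 has been silent for more than the timeout and is declared failed.
last_heard = {0: 10.0, 1: 9.5, 2: 6.0}
print(find_silent_processors(last_heard, now=10.2))
```

Each processor would run such a check against its own table of timestamps, and the processes backing up work on a silent processor would then take over.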

18 Tandem Nonstop Cyclone system: Some new features of Cyclone: superscalar execution of 2 instructions per clock cycle; dynamic branch prediction (strategy 2); direct memory access; multiple I/O channels; and multiple, physically separate sections connected by dual fiber-optic cables (Dynabus+).

19 Fault detection: Error detection in data paths is done by parity checking and parity prediction. Error detection in control paths is done by parity, illegal-state detection, and self-checking. If the processor hardware detects an error from which it cannot recover, it shuts itself down within 2 clock cycles, and the error is flagged in error identification registers.
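
To make parity checking and parity prediction concrete, here is a small software sketch. It assumes even parity and an 8-bit ripple-carry adder as the device that changes data; neither detail is stated in the slides, and the function names are hypothetical.

```python
def parity(word):
    """Even-parity bit: 1 if `word` has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def add_with_predicted_parity(a, b, width=8):
    """Ripple-carry add of two `width`-bit words.

    Returns (sum truncated to `width` bits, parity predicted for that sum).
    The prediction uses only the operand parities and the carries, because
    each sum bit equals a_i ^ b_i ^ carry_in_i.
    """
    carry, carry_parity, total = 0, 0, 0
    for i in range(width):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        carry_parity ^= carry                         # carry into bit i
        total |= (a_i ^ b_i ^ carry) << i
        carry = (a_i & b_i) | (carry & (a_i ^ b_i))   # carry out of bit i
    return total, parity(a) ^ parity(b) ^ carry_parity


total, predicted = add_with_predicted_parity(0x5A, 0x3C)
assert parity(total) == predicted            # fault-free result passes the check
assert parity(total ^ 0x04) != predicted     # a single flipped bit is caught
```

Parity checking guards a word that is merely stored or moved. Parity prediction extends the same check across a unit that transforms the data, by deriving the expected output parity independently of the output itself.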

21 …Fault detection…: Faults in each processor are detected by hardware and microcode methods. Parity checking is used to detect single-bit errors. Parity prediction is used on devices that change data. The hardware multiplier is protected by Recomputation with Shifted Operands (RESO).
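
The RESO idea can be sketched in software: compute the product once, compute it again with one operand shifted left, shift the second result back, and compare. The fault model below (one output bit stuck at 0) and all names are illustrative assumptions, not a description of the Cyclone multiplier.

```python
def make_multiplier(stuck_at_zero_bit=None):
    """Return a multiply function, optionally with one output bit stuck at 0."""
    def multiply(a, b):
        product = a * b
        if stuck_at_zero_bit is not None:
            product &= ~(1 << stuck_at_zero_bit)  # simulated hardware fault
        return product
    return multiply


def reso_multiply(multiply, a, b, shift=1):
    """Multiply twice, the second time with `a` shifted left, then compare."""
    first = multiply(a, b)
    second = multiply(a << shift, b) >> shift
    if first != second:
        raise ArithmeticError("RESO mismatch: multiplier fault detected")
    return first


print(reso_multiply(make_multiplier(), 3, 5))  # fault-free multiplier

try:
    reso_multiply(make_multiplier(stuck_at_zero_bit=2), 3, 5)
except ArithmeticError as err:
    print(err)
```

Shifting the operand moves the data onto different bit positions, so a fault tied to one position tends to corrupt the two results differently. In this simple model some operand values can still mask a fault, for example when the stuck bit would have been 0 in both results anyway.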

22 …Fault detection: Invalid-state checking or duplication-and-comparison is used in sequential state machines. Checksums protect multiple-word transmissions. Microcode and the OS perform consistency checks. If an error is detected and is unrecoverable, the processor executes a HALT instruction and transmits an error code to the Remote Maintenance Subsystem.
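
As an illustration of protecting a multiple-word transmission with a checksum, the sketch below appends a 16-bit additive checksum to a block of words. The slides do not say which checksum Cyclone uses, so the algorithm, word size, and function names here are assumptions.

```python
def checksum16(words):
    """Sum of all words, truncated to 16 bits."""
    return sum(words) & 0xFFFF


def send(words):
    """Build a frame: the data words followed by their checksum."""
    return list(words) + [checksum16(words)]


def receive(frame):
    """Verify the trailing checksum and return the data words."""
    *words, received = frame
    if checksum16(words) != received:
        raise ValueError("checksum mismatch: transmission error detected")
    return words


frame = send([0x1234, 0xABCD, 0x0001])
assert receive(frame) == [0x1234, 0xABCD, 0x0001]

frame[1] ^= 0x0010  # corrupt one bit in transit
try:
    receive(frame)
except ValueError as err:
    print(err)
```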

23 Fault recovery: Each processor has online recovery mechanisms. Caches, main store, and subsystem store can recover from data errors by reloading from an alternative copy. Main store and caches have spare RAMs that automatically replace failed RAMs.

24 Fault recovery: A single-error-correcting, double-error-detecting code protects the dynamic RAMs in main memory. An asynchronous microcode process periodically checks for correctable memory errors.
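
A single-error-correcting, double-error-detecting (SEC-DED) code can be demonstrated at small scale with an extended Hamming (8,4) code. This is a teaching sketch only: the slides do not give the word width or the code actually used in Cyclone memory, and the function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as 8 bits: index 0 is overall parity, 1..7 are Hamming positions."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity over positions 1..7
    return code


def decode(code):
    """Return (nibble, status); raise ValueError on an uncorrectable double error."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions holding a 1
    overall_bad = sum(code) % 2 == 1
    if syndrome == 0 and not overall_bad:
        status = "ok"
    elif overall_bad:
        code[syndrome] ^= 1                 # single error; syndrome 0 means bit 0 itself
        status = "corrected"
    else:
        raise ValueError("double-bit error detected")
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                                # inject a single-bit error
print(decode(word))
```

A single flipped bit produces a syndrome that names its position, so it can be corrected. Two flipped bits leave the overall parity intact but give a nonzero syndrome, so they are detected but not corrected.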

26 Diagnosis facilities: Each processor has a microprocessor that executes quick diagnostics, generates and collects signatures for quick fault detection and isolation, and serves as the interface to a separate fault-tolerant Remote Maintenance Subsystem. That subsystem monitors and logs events in a running system, supports diagnosis of failures anywhere in the system, and reports problems.

27 Conclusion: The Tandem Nonstop Cyclone system provides a high level of performance, system availability, and data integrity. Redundant hardware components and process pairs provide continuous service. Duplicating hardware components makes a Tandem Nonstop system more expensive than a traditional system of the same size.

