Introduction High-Availability Systems: An Example Pioneered FT in telephone switching applications. Aggressive availability goal: 2 hours downtime in.

Presentation transcript:

Introduction High-Availability Systems: An Example

AT&T, in the 1970s: pioneered fault tolerance (FT) in telephone switching applications. Aggressive availability goal: 2 hours of downtime in 40 years (i.e., 3 min/year), with less than 0.01% of calls handled incorrectly.
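The downtime goal can be checked with a little arithmetic. The sketch below converts "2 hours in 40 years" into minutes per year and into an availability fraction. The function names are illustrative, and the year length (365.25 days) is an assumption made here, not something the slides state.

```python
# Assumption: one year is 365.25 days (averaging over leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(total_downtime_hours: float, years: float) -> float:
    """Spread a total downtime budget evenly over the given number of years."""
    return total_downtime_hours * 60 / years


def availability(downtime_min_per_year: float) -> float:
    """Fraction of the year during which the system is up."""
    return 1.0 - downtime_min_per_year / MINUTES_PER_YEAR


budget = downtime_minutes_per_year(total_downtime_hours=2, years=40)
print(budget)                          # 3.0 minutes per year
print(f"{availability(budget):.7f}")   # availability as a fraction
```

Under these assumptions the goal corresponds to roughly "five nines" of availability.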

In 1978, Bell Labs collected data on historic trends in the causes of system downtime:
- 20% attributed to HW (good diagnostics and trouble-location programs can help minimize HW-induced downtime).
- 15% attributed to SW (SW deficiencies included improper translation of algorithms into code or improper specifications).
- 35% attributed to recovery deficiencies (these deficiencies can be caused by undetected faults or incorrect fault isolation).
- 30% attributed to human procedural error.
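As a purely illustrative exercise, the percentages above can be applied to a yearly downtime budget to see how many minutes each cause would account for. The slides do not say that the 3 min/year goal was allocated this way; the 3-minute figure and the function name below are assumptions used only to make the proportions concrete.

```python
# Shares of system downtime by cause, as reported in the 1978 data above.
DOWNTIME_SHARE_PERCENT = {
    "hardware": 20,
    "software": 15,
    "recovery deficiencies": 35,
    "human procedural error": 30,
}


def split_budget(budget_minutes: float) -> dict:
    """Split a downtime budget across causes in proportion to their share."""
    return {
        cause: budget_minutes * percent / 100
        for cause, percent in DOWNTIME_SHARE_PERCENT.items()
    }


# Hypothetical: apply the historic shares to a 3 min/year budget.
for cause, minutes in split_budget(3.0).items():
    print(f"{cause}: {minutes:.2f} min/year")
```

Note that hardware and software faults together account for only 35% of the downtime; recovery deficiencies and human error make up the rest.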

Other studies in the same direction...

There is some natural redundancy in the telephone switching network: a telephone user will redial after getting a wrong number or being disconnected. However, there is a user aggravation level that must be avoided: users will redial only as long as it does not happen too frequently.

Note, however, that the thresholds are different for failure to establish a call (moderately high frequency) and disconnection of an established call (very low frequency).

Levels of recovery in a telephone switching system:

Phase | Recovery action | Effect
1 | Initialize transient memory. | Affects temporary storage; no calls lost.
2 | Reconfigure peripheral HW; initialize all transient memory. | Lose calls in process of being established; calls in progress not affected.
3 | Verify memory operation; establish a workable processor configuration; verify program; configure peripheral HW; initialize all transient memory. |
4 | Establish a workable processor configuration; configure peripheral HW; initialize all memory. | All calls lost.
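The recovery levels can be pictured as an escalation ladder: try the least disruptive phase first and move to a more drastic one only if the system is still not healthy. The slides list the phases but do not spell out the escalation logic, so the loop below is an assumption, and `is_healthy_after` is a hypothetical callback standing in for whatever check the real system performed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RecoveryPhase:
    level: int
    actions: tuple
    effect: Optional[str]  # None where the transcript gives no effect


PHASES = (
    RecoveryPhase(1, ("initialize transient memory",),
                  "temporary storage affected; no calls lost"),
    RecoveryPhase(2, ("reconfigure peripheral HW",
                      "initialize all transient memory"),
                  "calls being established lost; calls in progress kept"),
    RecoveryPhase(3, ("verify memory operation",
                      "establish a workable processor configuration",
                      "verify program",
                      "configure peripheral HW",
                      "initialize all transient memory"),
                  None),  # effect not given in the source table
    RecoveryPhase(4, ("establish a workable processor configuration",
                      "configure peripheral HW",
                      "initialize all memory"),
                  "all calls lost"),
)


def recover(is_healthy_after: Callable[[RecoveryPhase], bool]) -> Optional[RecoveryPhase]:
    """Escalate through the phases; return the first one that restores health."""
    for phase in PHASES:
        if is_healthy_after(phase):
            return phase
    return None  # even the most drastic phase did not help


# Example: a fault that only clears once peripheral HW is reconfigured.
fixed_by = recover(lambda phase: "reconfigure peripheral HW" in phase.actions)
print(fixed_by.level, "-", fixed_by.effect)
```

Ordering the phases by how many calls they disturb reflects the different thresholds above: losing calls that are still being set up is tolerated far more readily than dropping established ones.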

Tasks of a Central Control Unit in a typical telephone switching system:
- Overall system control/administration: monitor calls, charge calls, generate reports.
- Call processing: establish (route) calls, disconnect calls.
- System maintenance: automatic isolation of faulty units, defensive SW strategies, support for rapid repair.

[Figure: typical switching system diagram, showing the Central Control (CC), the AU Bus Interface, the Program Store (PS), the Call Store (CS), and the Auxiliary Unit (AU) Bus.]

CC instructions reside in the program store (PS). Transient (temporary) information (e.g., telephone calls, routing, equipment configuration) is held in the call store (CS). The Auxiliary Unit (AU) Bus interfaces to disk and magnetic tape mass storage.

Duplex configuration for the switching computer. (Assuming that only one of each component is required for a functional system, there are 64 possible system configurations.)

[Figure: duplex configuration diagram, showing Central Control 1 and 2 (CC), AU Bus Interface 1 and 2, Program Store 1 and 2 (PS), Call Store 1 and 2 (CS), the Auxiliary Unit (AU) Bus, Program Store Buses PSB1 and PSB2, and Peripheral Unit Buses PUB1 and PUB2.]
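The figure of 64 configurations is consistent with six duplicated unit types, where a working system picks one copy of each: 2^6 = 64. The sketch below enumerates them. Which six units the count refers to is not stated in the transcript, so the names in `UNIT_TYPES` are an assumption for illustration.

```python
from itertools import product

# Assumption: six duplicated unit types. The transcript gives only the total
# (64); this particular list of names is hypothetical.
UNIT_TYPES = ("CC", "PS", "CS", "PS bus", "CS bus", "PU bus")
COPIES = (1, 2)  # each unit type exists in two copies (duplex)

# A configuration selects exactly one copy of every unit type.
configurations = [
    dict(zip(UNIT_TYPES, choice))
    for choice in product(COPIES, repeat=len(UNIT_TYPES))
]

print(len(configurations))   # 64
print(configurations[0])     # copy 1 of every unit
```

The reconfiguration logic can therefore survive several simultaneous faults, as long as at least one copy of each unit type still works.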

AT&T: How the system works
1- Both CCs operate in synchronism. Two matching circuits (the matchers) compare 24 bits of internal state at every 5.5 µs machine cycle.
2- There are 6 different sets of internal nodes that can be monitored, depending on the instruction being executed.

3- A mismatch generates an interrupt, which calls fault recognition programs to determine which part of the system is faulty.
4- After a fault has been detected and located, the system configuration logic attempts to establish various combinations of subunits.
5- A sanity program is then executed.
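The matching step can be sketched in software as a cycle-by-cycle comparison of the two processors' sampled state. This is only a simulation of the idea: the real matchers are hardware circuits, and the function names and the sample traces below are hypothetical.

```python
from typing import Iterable, Optional

MASK_24 = (1 << 24) - 1  # the matchers compare 24 bits of internal state


def states_match(sample_a: int, sample_b: int) -> bool:
    """Compare the 24 monitored bits sampled from each CC in one machine cycle."""
    return (sample_a & MASK_24) == (sample_b & MASK_24)


def first_mismatch(trace_a: Iterable[int], trace_b: Iterable[int]) -> Optional[int]:
    """Return the first machine cycle where the two CCs disagree, or None.

    In the real system a mismatch raises an interrupt that starts the fault
    recognition programs; here we only report the cycle number.
    """
    for cycle, (a, b) in enumerate(zip(trace_a, trace_b)):
        if not states_match(a, b):
            return cycle
    return None


# Hypothetical traces: CC 2 suffers a single bit flip in cycle 2.
cc1 = [0x00A1B2, 0x00A1B3, 0x00A1B4, 0x00A1B5]
cc2 = [0x00A1B2, 0x00A1B3, 0x00A1B4 ^ 0x000100, 0x00A1B5]
print(first_mismatch(cc1, cc2))  # 2
```

A mismatch only shows that the two CCs disagree, not which one is wrong, which is why the fault recognition programs in step 3 are needed.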

Information can be sampled by the matchers and retained for later examination by diagnostic programs.

In addition:
A- The PS employs a Hamming code on the 37 data bits.
B- There are parity check bits over the address and data buses: the CS has one parity bit over address and data, and another parity bit over the address alone.
C- Both the PS and the CS automatically retry operations upon error detection (time redundancy).
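Two of these coding ideas can be illustrated with short calculations. The first function applies the standard Hamming bound to find how many check bits single-error correction needs for 37 data bits; the transcript does not say how many check bits the real store used. The second computes the two call-store parity bits, assuming even parity and a hypothetical data width, neither of which is given in the slides.

```python
from typing import Tuple


def min_hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def parity(value: int) -> int:
    """Even-parity bit: 1 if the number of 1 bits in value is odd, else 0."""
    return bin(value).count("1") & 1


def cs_parity_bits(address: int, data: int, data_width: int) -> Tuple[int, int]:
    """Return (parity over address and data, parity over address only)."""
    combined = (address << data_width) | data
    return parity(combined), parity(address)


print(min_hamming_check_bits(37))  # 6

# Hypothetical word: 24-bit data width, arbitrary address and data values.
both, addr_only = cs_parity_bits(address=0b1011, data=0b110, data_width=24)
print(both, addr_only)  # 1 1
```

Keeping a separate parity bit on the address helps distinguish a bad address (the wrong word was accessed) from corrupted data in the right word.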

Summarizing some features of the FT system:
- Duplication of ALU.
- 30% of control logic devoted to self-checking.
- EDAC on disks.
- SW audits and acceptance tests.
- Sanity timer, which is quite important for RT systems. A sanity program is similar to a maze that the HW must traverse before the sanity timer times out. If a time-out occurs, the reconfiguration logic generates a new configuration to be tried.

- Integrity monitor (a supervisor that samples and stores valuable information for later evaluation for diagnostic purposes).
- Byte parity on datapaths.
- Parity checking where parity is preserved, duplication otherwise.
- Two parity bits on registers.
- Modified Hamming code on main memory.
- Maintenance channel for observability and controllability.
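The sanity timer described above behaves like a watchdog: each candidate configuration must get through the sanity program before the timer expires, otherwise the next configuration is tried. The sketch below simulates this with cycle counts instead of a real hardware timer. The `cycles_to_finish` callback, the configuration names, and the timer limit are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def find_sane_configuration(
    candidates: Iterable[str],
    cycles_to_finish: Callable[[str], Optional[int]],
    timer_limit: int,
) -> Optional[str]:
    """Return the first configuration whose sanity program beats the timer.

    cycles_to_finish(config) models running the sanity program: it returns
    the number of machine cycles needed to traverse the "maze", or None if
    the hardware never completes it.
    """
    for config in candidates:
        cycles = cycles_to_finish(config)
        if cycles is not None and cycles <= timer_limit:
            return config  # sanity program finished in time
        # Time-out: the reconfiguration logic moves on to the next candidate.
    return None


# Hypothetical behaviour: CC1 hangs, CC2 with PS1 is too slow, CC2 with PS2 works.
simulated = {"CC1+PS1": None, "CC2+PS1": 900, "CC2+PS2": 120}
chosen = find_sane_configuration(simulated, simulated.get, timer_limit=500)
print(chosen)  # CC2+PS2
```

Because the check is a deadline rather than a result comparison, it also catches hardware that hangs or runs wild, which matters for real-time systems.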