Introduction High-Availability Systems: An Example

1 Introduction High-Availability Systems: An Example
AT&T pioneered fault tolerance (FT) in telephone switching applications in the 1970s. The availability goal was aggressive: 2 hours of downtime in 40 years (i.e., 3 min/year), with less than 0.01% of the calls handled incorrectly.
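The downtime figure on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are made up for this example, and it assumes a 365-day year, which the slide does not specify. Under that assumption the goal corresponds to an availability of roughly 99.9994%.

```python
# Downtime budget quoted on the slide: 2 hours over 40 years.
DOWNTIME_HOURS = 2
PERIOD_YEARS = 40
# Assumption for this sketch: a 365-day year, leap days ignored.
MINUTES_PER_YEAR = 365 * 24 * 60

# Convert the budget to minutes of downtime per year.
downtime_min_per_year = DOWNTIME_HOURS * 60 / PERIOD_YEARS

# Availability is the fraction of the year the system is up.
availability = 1 - downtime_min_per_year / MINUTES_PER_YEAR

print(f"{downtime_min_per_year:.1f} min/year")
print(f"availability = {availability:.6%}")
```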

2 Introduction High-Availability Systems: An Example AT&T
In 1978, Bell Labs collected data on historic trends of causes of system downtime:
- 20% attributed to HW (good diagnostics and trouble-location programs can help minimize HW-induced downtime).
- 15% attributed to SW (SW deficiencies included improper translation of algorithms into code or improper specifications).
- 35% attributed to recovery deficiencies (these deficiencies can be caused by undetected faults or incorrect fault isolation).
- 30% attributed to human procedural error.

3 Introduction High-Availability Systems: An Example AT&T
Other studies point in the same direction...

4 Introduction High-Availability Systems: An Example AT&T
There is some natural redundancy in the telephone switching network: a telephone user will redial after reaching a wrong number or being disconnected.
However, there is a user aggravation level that must be avoided: users will redial only as long as this does not happen too frequently.

5 Introduction High-Availability Systems: An Example AT&T
Note, however, that the thresholds are different for failure to establish a call (moderately high frequency) and disconnection of an established call (very low frequency).
Levels of recovery in a telephone switching system:
Phase 1. Recovery action: Initialize transient memory. Effect: Affects temporary storage, no calls lost.
Phase 2. Recovery action: Reconfigure peripheral HW; initialize all transient memory. Effect: Lose calls in process of being established, calls in progress not affected.
Phase 3. Recovery action: Verify memory operation; establish a workable processor configuration; verify program; configure peripheral HW; initialize all transient memory.
Phase 4. Recovery action: Establish a workable processor configuration; configure peripheral HW; initialize all memory. Effect: All calls lost.

6 Introduction High-Availability Systems: An Example AT&T
Tasks of a Central Control Unit in a typical telephone switching system:
- Overall system control/administration: monitor calls, charge calls, generate reports.
- Call processing: establish (route) calls, disconnect calls.
- System maintenance: automatic isolation of faulty units, defensive SW strategies, support for rapid repair.

7 Introduction High-Availability Systems: An Example AT&T
Typical switching system diagram. Diagram labels: Central Control (CC), AU, Bus Interface, Program Store (PS), Call Store (CS), Auxiliary Unit (AU) Bus.

8 Introduction High-Availability Systems: An Example AT&T
- CC instructions reside in the program store (PS).
- Transient (temporary) info (e.g., telephone calls, routing, equipment configuration) is held in the call store (CS).
- The Auxiliary Unit (AU) Bus interfaces to disk and magnetic tape mass storage.

9 Introduction High-Availability Systems: An Example AT&T
Duplex configuration for switching computer. (Assuming that only one of each component is required for a functional system, there are 64 possible system configurations.)
Diagram labels: Central Control 1 and 2 (CC), AU 1 and 2, Bus Interface 1 and 2, Program Store 1 and 2 (PS), Call Store 1 and 2 (CS), Auxiliary Unit (AU) Bus, PSB1, PSB2, PUB1, PUB2.
PSB: Program Store Bus. PUB: Peripheral Unit Bus.
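The figure of 64 configurations is consistent with six duplicated unit types, each contributing either copy 1 or copy 2 (2^6 = 64). The sketch below enumerates that space. Note that the transcript does not say which six unit types are counted; the names in `UNITS` are a guess based on the diagram labels and are used here only for illustration.

```python
from itertools import product

# Assumption: six duplicated unit types. The slide gives only the total (64),
# not the list, so these names are illustrative guesses from the diagram.
UNITS = ("CC", "PS", "CS", "PSB", "PUB", "AU bus")
COPIES = (1, 2)  # each unit type exists in two copies

# One configuration picks exactly one copy of every unit type.
configurations = [
    dict(zip(UNITS, choice)) for choice in product(COPIES, repeat=len(UNITS))
]

print(len(configurations))  # 64
print(configurations[0])    # every unit on copy 1
```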

10 Introduction High-Availability Systems: An Example AT&T: How the system works...
1- Both CCs operate in synchronism. Two matched circuits compare 24 bits of internal state at every 5.5 µs machine cycle.
2- There are 6 different sets of internal nodes that can be monitored, depending on the instruction being executed.
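The comparison of the two CCs running in synchronism can be illustrated in software. This is a minimal sketch, not the actual matcher hardware: the function names and the representation of each cycle's sampled state as an integer are assumptions made for the example. Only the 24-bit width comes from the slide.

```python
STATE_BITS = 24                      # width compared per machine cycle (from the slide)
STATE_MASK = (1 << STATE_BITS) - 1   # keeps only the low 24 bits


def match(state_a: int, state_b: int) -> int:
    """Return a bitmask of the positions where the two 24-bit states differ.

    A result of 0 means the two sampled states agree.
    """
    return (state_a ^ state_b) & STATE_MASK


def run_lockstep(trace_a, trace_b):
    """Compare two per-cycle state traces (hypothetical inputs for this sketch).

    Returns (cycle, diff_mask) for the first mismatch, or None if all cycles agree.
    """
    for cycle, (a, b) in enumerate(zip(trace_a, trace_b)):
        diff = match(a, b)
        if diff:
            return cycle, diff
    return None


# Usage: the second trace has one flipped bit (bit 2) in cycle 1.
print(run_lockstep([1, 2, 3], [1, 6, 3]))
```

A mismatch only says that the two units disagree, not which one is faulty, which is why a separate fault-recognition step is needed afterwards.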

11 Introduction High-Availability Systems: An Example AT&T: How the system works...
3- A mismatch generates an interrupt which calls fault recognition programs to determine which part of the system is faulty.
4- After a fault has been detected and located, the system configuration logic attempts to establish various combinations of subunits.
5- A sanity program is then executed.
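Steps 4 and 5 can be sketched as a loop that tries candidate configurations until one passes the sanity program within a time limit. This is a software illustration only: `find_sane_configuration`, its parameters, and the dictionary representation of a configuration are hypothetical. It also checks the elapsed time after the sanity program returns, whereas a hardware sanity timer would interrupt a run that hangs.

```python
import time
from typing import Callable, Iterable, Optional


def find_sane_configuration(
    candidates: Iterable[dict],
    sanity_program: Callable[[dict], bool],
    timeout_s: float,
) -> Optional[dict]:
    """Try each candidate configuration in turn (hypothetical helper).

    A configuration is accepted only if the sanity program reports success
    and finishes within timeout_s seconds. Returns None if none qualifies.
    """
    for config in candidates:
        start = time.monotonic()
        passed = sanity_program(config)
        elapsed = time.monotonic() - start
        if passed and elapsed <= timeout_s:
            return config
        # Otherwise fall through and try the next configuration.
    return None


# Usage: pretend that only the configuration using CC copy 2 is healthy.
candidates = [{"CC": 1}, {"CC": 2}]
print(find_sane_configuration(candidates, lambda c: c["CC"] == 2, timeout_s=1.0))
```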

12 Introduction High-Availability Systems: An Example AT&T: How the system works...
A- The PS employs a Hamming code on the 37 data bits.
B- There are parity check bits over the address and data buses: the CS has one parity bit over address and data, and another parity bit over the address alone.
C- Both the PS and the CS automatically retry operations upon error detection (time redundancy).
In addition, information can be sampled by the matchers and retained for later examination by diagnostic programs.
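The two kinds of check bits mentioned here can be illustrated as follows. The sketch assumes even parity and a textbook single-error-correcting Hamming code; the slide states neither the parity polarity nor the number of check bits actually used, and all function names are invented for this example.

```python
def parity(value: int) -> int:
    """Even-parity bit: 1 if value has an odd number of 1 bits, else 0."""
    return bin(value).count("1") % 2


def cs_parity_bits(address: int, data: int) -> tuple:
    """Two check bits in the style described for the call store.

    First bit covers address and data together, second covers the address only.
    Even parity is an assumption of this sketch.
    """
    return parity(address) ^ parity(data), parity(address)


def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1.

    This is the check-bit count for a textbook single-error-correcting
    Hamming code, not necessarily the count used in the real program store.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


print(cs_parity_bits(0b1011, 0b0001))
print(hamming_check_bits(37))  # 6
```

Having a separate bit over the address alone makes it possible to tell an addressing error from a data error, since a fault confined to the data changes only the first bit.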

13 Introduction High-Availability Systems: An Example AT&T
Summarizing some features of the FT system:
- Duplication of ALU.
- 30% of control logic devoted to self-checking.
- EDAC on disks.
- SW audits (acceptance tests).
- Sanity timer: a sanity program is similar to a maze that the HW must traverse before the sanity timer times out. If a time-out occurs, the reconfiguration logic generates a new configuration to be tried. Quite important for RT systems!

14 Introduction High-Availability Systems: An Example AT&T
- Integrity monitor (supervisor; samples and stores valuable information for later evaluation for diagnostic purposes).
- Byte parity on datapaths.
- Parity checking where parity is preserved, duplication otherwise.
- Two parity bits on registers.
- Modified Hamming code on main memory.
- Maintenance channel for observability and controllability.

