
7. Fault Tolerance Through Dynamic or Standby Redundancy: 7.6 Reconfiguration in Multiprocessors


1 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.6 Reconfiguration in Multiprocessors
Focused on the detection of permanent and transient faults. Three approaches:
A) Adequate for scenarios where the applications executing on the multiprocessor need a specific (fixed) interconnect topology to operate. For instance: the original, failure-free topology is an N x N mesh; under failure, one tries to identify a working set of M x M processors, where M < N.
B) Adequate for scenarios where the applications executing on the multiprocessor do not need a specific interconnect topology to operate. For instance: consider the interconnection simply as a way to connect a large # of processors that communicate by sending messages across the links. In such a situation, fault tolerance can be achieved by graceful degradation: redistributing the computational load among the working processors and redirecting messages around the failed ones. Major research issue: novel fault-tolerant routing algorithms that allow messages to be routed among working processors while avoiding faulty ones.
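
The example in approach A can be made concrete with a small sketch. It assumes, beyond what the slide states, that the working M x M set must be a contiguous square block of the physical mesh; the function name `largest_working_submesh` and the fault representation are hypothetical. It uses the classic maximal-square dynamic program.

```python
def largest_working_submesh(n, faulty):
    """Return (M, top_left) for the largest fault-free M x M block of an n x n mesh.

    `faulty` is a set of (row, col) positions of failed processors.
    Assumption: the submesh must be a contiguous square of physical nodes.
    """
    # size[r][c] = side of the largest fault-free square whose bottom-right corner is (r, c)
    size = [[0] * n for _ in range(n)]
    best, best_corner = 0, None
    for r in range(n):
        for c in range(n):
            if (r, c) in faulty:
                continue  # a faulty node cannot belong to any square
            if r == 0 or c == 0:
                size[r][c] = 1
            else:
                size[r][c] = 1 + min(size[r - 1][c], size[r][c - 1], size[r - 1][c - 1])
            if size[r][c] > best:
                best = size[r][c]
                best_corner = (r - best + 1, c - best + 1)
    return best, best_corner


if __name__ == "__main__":
    # A single fault in a corner of a 4 x 4 mesh still leaves a 3 x 3 working block.
    print(largest_working_submesh(4, {(0, 0)}))
```

A real reconfiguration scheme may accept non-contiguous sets of rows and columns; this sketch only illustrates the idea of shrinking from N x N to M x M.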

2 7.6 Reconfiguration in Multiprocessors
C) Consider an augmented topology, obtained by adding spare processors and links to the original interconnection, such that under faults in processors and links the original topology can still be identified among the working processors and links. Reconfiguration is then performed by substituting faulty elements with spare ones, so that performance is kept unchanged.
Problem: a complete network would require O(n^2) connections and n - 1 communication ports per processor.
Solution (Research Goals): development of architectures with:
- low degree of nodes;
- modularity and scalability to a large # of processors;
- low diameter (maximum # of hops required to communicate between any pair of nodes);
- ability to support a large # of useful communication patterns, such as one-dimensional and two-dimensional arrays, on the topology.

3 7.6.1 Bus-Based Systems
The most common and cheapest form of interconnection. All the processors and memory modules are connected to a common resource.
Drawback: produces a lot of contention on the single resource (the bus).
Advantage: faulty processor or memory nodes can be removed easily from such a system. Faults in the bus are handled by the use of redundant buses.
Redundant bus system: all processors and memory modules are connected to a number of buses. A policy must therefore be implemented in HW, in the form of arbiters, to allocate the available buses to the processors requesting access to the shared memory. In multiple-bus systems, two levels of arbiters must be used: one at the memory module level, the other at the bus level.
Two sources of conflicts (N processors, M memory modules, B buses):
- More than one request can be made to the same memory module. Solution: use of M 1-out-of-N arbiters, one per memory module.
- The available bus capacity may be insufficient to accommodate all requests. Solution: a B-out-of-M arbiter that allocates the B buses.
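
The two arbitration levels can be sketched in software as follows. This is only an illustration: the slide describes hardware arbiters and does not give a priority policy, so the fixed lowest-index-first priority, the function name `arbitrate` and the data layout are all assumptions.

```python
def arbitrate(requests, working_buses):
    """Two-level arbitration for a multiple-bus system (illustrative sketch).

    requests:      dict {processor: requested memory module}
    working_buses: list of bus ids that are currently fault-free
    Returns a dict {processor: (memory module, bus)} of granted requests.
    Assumption: fixed priority, lowest index wins at both levels.
    """
    # Level 1: one 1-out-of-N arbiter per memory module picks a single processor.
    winners = {}
    for proc in sorted(requests):
        winners.setdefault(requests[proc], proc)

    # Level 2: a B-out-of-M arbiter gives the available buses to at most B modules.
    grants = {}
    for mem, bus in zip(sorted(winners), working_buses):
        grants[winners[mem]] = (mem, bus)
    return grants


if __name__ == "__main__":
    reqs = {0: 2, 1: 2, 2: 0, 3: 1}      # P0 and P1 conflict on module 2
    print(arbitrate(reqs, [0, 1]))        # both buses working
    print(arbitrate(reqs, [1]))           # bus 0 removed after a fault
```

Removing a faulty bus from `working_buses` shows the graceful degradation described above: the system keeps running, with fewer requests served per cycle.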

4 7.6.1 Bus-Based Systems
[Figure: multiple-bus system with processors P1 to P3 (each with a cache), memory modules M1 to M3, buses B1 and B2, memory arbiters (Am) and bus arbiters (Ab1, Ab2).]

5 7.6.2 Crossbar-Based Systems
Implemented as an N x N crossbar switch using N^2 switches. Allows simultaneous connection between all processor-memory pairs. Quite expensive for large N, but the lowest-contention approach.
Two means of providing higher reliability in such systems:
- Replication of the entire crossbar (i.e., the switches).
- Implementation of an extra row and column of switches, giving an (N + 1) x (N + 1) crossbar in which each port is connected to two rows or two columns of switches. When a switch fails, its entire row and column can be disconnected, and the spare row and column activated.
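
The spare row/column scheme can be sketched as an index remapping. The slide only says that each port connects to two rows or two columns; the sketch below assumes these are adjacent lines (port i reaches physical lines i and i + 1), and the names `physical_line` and `switch_for` are hypothetical.

```python
def physical_line(logical, failed_line):
    """Map a logical port to a physical row (or column) of the (N+1)-line crossbar.

    Assumption: port i is wired to physical lines i and i + 1, so every port
    at or beyond the disconnected line shifts by one toward the spare.
    """
    if failed_line is None or logical < failed_line:
        return logical
    return logical + 1


def switch_for(proc, mem, failed_switch=None):
    """Return the (row, column) of the switch that connects proc to mem."""
    if failed_switch is None:
        return (proc, mem)
    failed_row, failed_col = failed_switch
    return (physical_line(proc, failed_row), physical_line(mem, failed_col))


if __name__ == "__main__":
    # 3 x 3 logical crossbar on a 4 x 4 physical array; switch (1, 2) has failed.
    for p in range(3):
        print([switch_for(p, m, failed_switch=(1, 2)) for m in range(3)])
```

After the remapping, no connection uses row 1 or column 2, which matches the rule that the whole row and column of the faulty switch are disconnected.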

6 7.6.2 Crossbar-Based Systems
[Figure: crossbar connecting processors P1 to P3 (each with a cache) to memories M1 to M3, and the FT-crossbar obtained by means of an extra row and column of switches.]

7 7.6.3 Multistage Interconnection Networks (MIN)
A MIN forms a compromise: it allows a large number of simultaneous requests while using only log2(N) stages of N/2 switches (each with 2 x 2 ports). Routing tags describe a path through the network and provide distributed control of the network. Each tag indicates the setting of the switch in the next stage along the desired path.
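
Routing tags can be illustrated with destination-tag routing. The slides do not name a specific MIN, so the sketch below assumes an Omega (shuffle-exchange) network, where the tag is simply the destination address read from its most significant bit; the function names are hypothetical.

```python
def min_switch_count(n_ports):
    """Number of 2 x 2 switches in a MIN: log2(N) stages of N/2 switches."""
    stages = n_ports.bit_length() - 1      # log2(N) for N a power of two
    return stages * (n_ports // 2)


def omega_route(src, dst, n_bits):
    """Destination-tag routing in an assumed Omega network with 2**n_bits ports.

    At each stage the line index is shuffled (rotated left by one bit) and the
    switch output is selected by the next tag bit: 0 = upper, 1 = lower.
    Returns the line index reached after each stage.
    """
    mask = (1 << n_bits) - 1
    pos, path = src, []
    for stage in range(n_bits):
        tag_bit = (dst >> (n_bits - 1 - stage)) & 1
        # The shuffle moves the old top bit into the low position; the switch
        # setting then overwrites that low bit with the tag bit.
        pos = ((pos << 1) & mask) | tag_bit
        path.append(pos)
    return path


if __name__ == "__main__":
    print(min_switch_count(8))    # 12 switches, versus 64 for an 8 x 8 crossbar
    print(omega_route(5, 3, 3))   # lines visited from input 5 to output 3
```

Because every stage consumes exactly one tag bit, each switch can be set locally, which is the distributed control mentioned above.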

8 7.6.3 Multistage Interconnection Networks (MIN)
Three basic types of connection:
- One-to-one connection: passes information from one network port to another through a route called a path.
- Permutation connection: a set of one-to-one connections in which no two connections share the same destination.
- Broadcast connection: a single source carries a signal to multiple destinations.

9 7.6.3 Multistage Interconnection Networks (MIN)
Techniques for FT in MINs can be categorized by whether they involve modifying the topology of the system:
- Methods that do not modify the topology.
- Methods that do modify the topology: any fault tolerance scheme that modifies the network topology usually involves the detection and location of faults in the network, followed by reconfiguration.

10 7.6.3 Multistage Interconnection Networks (MIN)
Techniques for FT in MINs can be categorized by whether they involve modifying the topology of the system:
Methods that do not modify the topology:
- Use of error detecting codes *.
Methods that do modify the topology:
- Replication of the entire network *;
- Addition of an extra stage of switches **;
- Adding extra links between switches of the same stage **;
- Adding extra ports to switches **.
* : permanent and transient fault detection;
** : only permanent fault detection.

11 7.6.3 Multistage Interconnection Networks (MIN)
Addition of an extra stage of switches **: there are then two distinct paths of switches and links between any source and destination pair. Hence, a single faulty switch or link can be tolerated. See next slide...

12 7.6.3 Multistage Interconnection Networks (MIN)
[Figure: a MIN with inputs and outputs numbered 0 to 7, and the FT-MIN obtained by means of an extra stage of switches.]

13 7.6.3 Multistage Interconnection Networks (MIN)
- Adding extra links between switches of the same stage **;
- Adding extra ports to switches **.
See next slide...

14 7.6.3 Multistage Interconnection Networks (MIN)
- Adding extra links between switches of the same stage;
- Adding extra ports to switches.

15 7.6.4 Hypercube Networks
Hypercube computers offer a cost-effective and feasible approach to supercomputing by connecting a large number (P = 2^d) of processors, each with local memory, using direct links (Intel Scientific Computers, NCUBE, Ametek).
P: # of processors in the network.
d: # of dimensions of the network.
[Figure: a 16-processor hypercube.]

16 7.6.4 Hypercube Networks
Fault tolerance can be provided by means of two techniques:
a) HW Redundancy
b) SW Redundancy

17 7.6.4 Hypercube Networks: HW Redundancy
[Figure: hypercube reconfiguration through extra port. a) Conceptual addition of a spare to each processor through an extra port. b) Implementation of the extra-port connection using crossbar switches.]
Problem: spare-processor degree N = 2^d.
Solution: a VLSI IC implementing two 8 x 5 crossbar switches to route the spare processor.

18 7.6.4 Hypercube Networks: HW Redundancy
[Figure: hypercube reconfiguration by node sparing. a) Embedding of spares on a hypercube. b) Location of four faults on a hypercube. c) Reconfiguration graph.]

19 7.6.4 Hypercube Networks: SW Redundancy
With these techniques, FT can be achieved by graceful degradation: redistributing the computational load away from failed processors and redirecting messages around them. Under faults, predefined path-search algorithms reconfigure faulty hypercubes by redirecting messages through fault-free nodes.
Note: rerouting algorithms can fail once more than n-1 nodes are faulty in a 2^n-processor hypercube, because each node has only n neighbors and n faults can therefore isolate a fault-free node.

20 7.6.4 Hypercube Networks: SW Redundancy
[Figure: routing in fault-free and faulty hypercubes. a) E-cube routing on a fault-free hypercube. b) Adaptive routing on a faulty hypercube.]
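
The two routing modes can be sketched as follows. E-cube routing is shown in its usual form (correct the differing address bits in increasing dimension order). The slides do not specify which adaptive algorithm is used, so the fault-avoiding router here is a plain breadth-first search standing in for it; both function names are hypothetical.

```python
from collections import deque


def ecube_route(src, dst, d):
    """E-cube routing on a fault-free d-dimensional hypercube."""
    path, cur = [src], src
    for dim in range(d):
        if ((cur ^ dst) >> dim) & 1:   # addresses differ in this dimension
            cur ^= 1 << dim            # cross the link of that dimension
            path.append(cur)
    return path


def fault_avoiding_route(src, dst, d, faulty):
    """Shortest path through fault-free nodes, or None if dst is unreachable.

    Stand-in for adaptive routing: a breadth-first search over working nodes.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for dim in range(d):
            nxt = node ^ (1 << dim)
            if nxt not in faulty and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


if __name__ == "__main__":
    print(ecube_route(0b0000, 0b1011, 4))                   # fault-free route
    print(fault_avoiding_route(0b0000, 0b1011, 4, {1, 3}))  # detour around faults
    print(fault_avoiding_route(0, 7, 3, {1, 2, 4}))         # 3 faults isolate node 0
```

The last call shows why the fault limit matters: in a 3-dimensional cube, three faulty neighbors cut a working node off entirely, so no rerouting algorithm can succeed.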

21 7.6.5 Mesh Networks
A 16-processor network arranged as a two-dimensional mesh. A modification of the mesh is the torus, which also has end-around connections for the processors at the extremes of the X and Y directions. Commercial products: Ametek, Intel Scientific Computers, Cray T3D.

22 7.6.5 Mesh Networks
[Figure: mesh reconfiguration by row or column elimination.]
FT technique for mesh networks based on graceful degradation: deletion of the entire row or column of processors to which a faulty processor belongs. Such a scheme degrades performance significantly even for a single fault, since it removes a large number of working processors during reconfiguration.
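
A minimal sketch of row or column elimination follows. The slide does not say how to choose between deleting rows and deleting columns; picking whichever direction removes fewer lines is an assumption made here, as are the function name and data layout.

```python
def eliminate_rows_or_columns(n, faulty):
    """Return (kept_rows, kept_cols) of an n x n mesh after elimination.

    `faulty` is a set of (row, col) positions. Assumption: delete either all
    faulty rows or all faulty columns, whichever removes fewer lines.
    """
    bad_rows = {r for r, _ in faulty}
    bad_cols = {c for _, c in faulty}
    all_lines = list(range(n))
    if len(bad_rows) <= len(bad_cols):
        return [r for r in all_lines if r not in bad_rows], all_lines
    return all_lines, [c for c in all_lines if c not in bad_cols]


if __name__ == "__main__":
    rows, cols = eliminate_rows_or_columns(4, {(1, 2)})
    # One faulty processor costs a whole row: 12 of 16 processors remain.
    print(rows, cols, len(rows) * len(cols))
```

The single-fault case makes the cost visible: three working processors are discarded along with the faulty one.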

23 7.6.5 Mesh Networks
[Figure: mesh reconfiguration by logical mapping.]
Another approach adds some extra rows and columns to the N x N array of normal processors. Reconfiguration is then attempted along a specific direction by performing a global renaming of all processors, in which the logical indices (i,j) of a processor are mapped onto the physical indices (k,h) of a working processor using a mapping function.
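
One possible mapping function is sketched below. It is a simplification, not the scheme from the slides: it assumes only spare columns are present and that renaming proceeds along the row direction, so each logical column j is served by the j-th working processor of the same physical row. The name `build_logical_map` is hypothetical.

```python
def build_logical_map(n, spare_cols, faulty):
    """Map logical (i, j) of an n x n mesh to physical (k, h) working processors.

    Physical array: n rows by (n + spare_cols) columns.
    `faulty` is a set of physical (row, col) positions.
    Returns a dict, or None if some row has too few working processors.
    """
    mapping = {}
    for i in range(n):
        working = [h for h in range(n + spare_cols) if (i, h) not in faulty]
        if len(working) < n:
            return None            # renaming along this direction fails
        for j in range(n):
            mapping[(i, j)] = (i, working[j])
    return mapping


if __name__ == "__main__":
    m = build_logical_map(3, 1, {(1, 1)})
    print([m[(1, j)] for j in range(3)])   # row 1 skips the faulty column 1
```

When the mapping fails in one direction, a fuller scheme could retry along the other direction using spare rows; that step is left out here.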

24 7.6.6 Tree Networks
Tree-based machines are configured as a binary tree where each node is connected to two children and to its parent in a recursive manner. Such an architecture is excellent for recursive parallel algorithms that use divide-and-conquer techniques.
Tree Networks: connect P-1 processors with degree 3 and maximum distance of 2(log2(P) - 1).
While: Hypercube Networks: connect P processors with degree 4 and maximum distance of P^(1/2) (values for the 16-processor hypercube, d = 4; in general both the degree and the maximum distance of a hypercube are log2(P)).
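
The tree figures quoted above (P-1 processors, maximum distance 2(log P - 1)) can be checked with a short brute-force sketch. It assumes a complete binary tree whose nodes are numbered 1 to P-1 in heap order, so the parent of node k is k // 2; the function names are hypothetical.

```python
import math


def tree_distance(a, b):
    """Hops between two nodes of a heap-numbered binary tree."""
    hops = 0
    while a != b:
        # The larger index is never shallower, so move it up to its parent.
        if a > b:
            a //= 2
        else:
            b //= 2
        hops += 1
    return hops


def tree_diameter(p):
    """Maximum distance in a complete binary tree of P - 1 processors."""
    nodes = range(1, p)
    return max(tree_distance(a, b) for a in nodes for b in nodes)


if __name__ == "__main__":
    for p in (8, 16, 32):
        print(p - 1, tree_diameter(p), 2 * (int(math.log2(p)) - 1))
```

The longest route runs from a leaf in the left subtree up to the root and down to a leaf in the right subtree, which is where the factor of 2 comes from.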

25 7.6.6 Tree Networks
[Figure: a) Reconfiguration in tree networks.]
FT scheme to tolerate multiple faults: one spare is used per level of the tree, placed at the right of the tree. Each node of the tree (including the spares) connects to its own two children, to the two children of its left neighbor, and to one of the children of its right neighbor. When a processor fails, all links of the processors to its right are readjusted to their right neighbors.

26 7.6.6 Tree Networks
In order to decrease the complexity of the redundant connections, two decoupling networks can be used.
[Figure: b) Decoupling network in a tree.]

