1 Computing in the RAIN: A Reliable Array of Independent Nodes Group A3 Ka Hou Wong Jahanzeb Faizan Jonathan Sippel

2 Introduction Presenter: Ka Hou Wong

3 Introduction RAIN: a research collaboration between Caltech and the Jet Propulsion Laboratory. Goal: identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components.

4 Hardware Platform Heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces through a network of switches. [Figure: computers C0 through C9 connected through switches S0 through S3; C = computer, S = switch.]

5 Software Platform Collection of software modules that run in conjunction with operating system services and standard network protocols. [Figure: software stack with the application and MPI/PVM on top, the RAIN modules alongside TCP/IP beneath them, over network connections such as Ethernet, Myrinet, ATM, and Servernet.]

6 Key Building Blocks For Distributed Computer Systems Communication: fault-tolerant communication topologies and reliable communication protocols. Fault Management: group membership techniques. Storage: distributed data storage schemes based on error-control codes.

7 Features of RAIN Communication: provides fault tolerance in the network via bundled interfaces, link monitoring, and fault-tolerant interconnect topologies.

8 Features of RAIN (cont’d) Group membership: identifies healthy nodes that are participating in the cluster. Data storage: uses redundant storage schemes over multiple disks for fault tolerance.

9 Communication Presenter: Jahanzeb Faizan

10 Communication Fault-tolerant interconnect topologies Network interfaces

11 Fault-tolerant Interconnect Topologies Goal: to connect computer nodes to a network of switches in order to maximize the network’s resistance to partitioning. [Figure: computer nodes attached to a ring of switches.] How do you connect n nodes to a ring of n switches?

12 Naïve Approach Connect the computer nodes to the nearest switches in a regular fashion. [Figure: each computer node attached to its nearest switches on the ring.] 1-fault-tolerant: the network is easily partitioned with two switch failures.

13 Diameter Construction Approach Connect computer nodes to the switching network in the most non-local way possible Computer nodes are connected to maximally distant switches Nodes of degree 2 connected between switches should form a diameter

14 Diameter Construction Approach (cont’d) Construction (Diameters). Let the switch degree d_s = 4 and the compute-node degree d_c = 2. For all i, 0 ≤ i < n, label the compute nodes c_i and the switches s_i. Connect switch s_i to s_(i+1) mod n, i.e., in a ring. Connect node c_i to switches s_i and s_(i + ⌊n/2⌋ + 1) mod n. [Figure: example constructions for n = 7 and n = 8.] Can tolerate 3 faults of any kind without partitioning the network.
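
The construction can be written directly as a small adjacency-list builder. The sketch below is a minimal illustration in Python (the function name and node/switch labels are mine); it builds the switch ring and attaches each degree-2 compute node to two maximally distant switches exactly as in the formula above.

```python
def build_diameter_topology(n):
    """Diameters construction: n switches s0..s(n-1) in a ring, and n compute
    nodes c0..c(n-1), each attached to two maximally distant switches."""
    adj = {f"s{i}": set() for i in range(n)}
    adj.update({f"c{i}": set() for i in range(n)})

    def connect(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for i in range(n):
        # Switch ring: s_i -- s_(i+1) mod n.
        connect(f"s{i}", f"s{(i + 1) % n}")
        # Compute node c_i spans a diameter of the ring:
        # it is attached to s_i and s_(i + floor(n/2) + 1) mod n.
        connect(f"c{i}", f"s{i}")
        connect(f"c{i}", f"s{(i + n // 2 + 1) % n}")
    return adj
```

For n = 7 this attaches c_0 to switches s_0 and s_4, and every switch ends up with degree d_s = 4 (two ring links plus two compute nodes).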

15 Protocol for Link Failure Goal: monitoring of available paths. Requirements: correctness, bounded slack, stability.

16 Correctness Must correctly reflect the true state of the bidirectional channel between nodes A and B: if one side sees time-outs, both sides should mark the channel as being down.

17 Bounded Slack Ensure that both endpoints have a maximum slack of n transitions. [Figure: link-history timelines for nodes A and B over time, with U = link up and D = link down. Without bounded slack, node A sees many more transitions than node B; with it, nodes A and B see tightly coupled views of the channel.]

18 Stability Each real channel event (i.e., a time-out) should cause at most a bounded number of state transitions at each endpoint.

19 Consistent-History Protocol for Link Failures Monitors available paths in the network for proper functioning. A modified ping protocol guarantees that each side of the communication channel sees the same history (bounded slack).

20 The Protocol Reliable Message Passing Implementation: Sliding window protocol Existing reliable communication layer not needed Reliable messaging built on top of ping messages

21 The Protocol (cont’d) Protocol Sending and receiving of token using reliable messaging Tokens are sent on request Consistent history maintained Sending and receiving of Ping messages using unreliable messaging Detect when link is up or down Implemented by Pings or hardware feedback

22 Demonstration [Figure: protocol state machine. Each state is labeled Up or Down together with a token count t; transitions are labeled "trigger event / tokens sent", where T = token arrival event and tout = time-out event.]
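
The figure is easier to follow with a small simulation of the general idea. The Python sketch below is only a loose illustration of how token exchange can gate state changes so that both endpoints keep tightly coupled histories; the class and its fields are my own simplification, not the exact automaton on the slide.

```python
class LinkEndpoint:
    """Illustrative endpoint for a consistent-history style link monitor.

    Channel-state changes (Up <-> Down) are recorded only when this side
    holds a token, and the token is then forwarded with the change, so the
    two endpoints' histories stay tightly coupled.
    """

    def __init__(self, name, holds_token):
        self.name = name
        self.tokens = 1 if holds_token else 0   # t: token count
        self.up = True                          # current view of the channel
        self.history = []                       # recorded transitions ("U"/"D")

    def on_timeout(self):
        """tout: ping time-out observed; try to record a Down transition."""
        if self.up and self.tokens > 0:
            self.up = False
            self.tokens -= 1
            self.history.append("D")
            return 1      # tokens sent to the peer along with the state change
        return 0

    def on_token(self, peer_up):
        """T: token arrived carrying the peer's view; mirror it locally."""
        self.tokens += 1
        if self.up != peer_up:
            self.up = peer_up
            self.history.append("U" if peer_up else "D")
        return 0
```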

23 Group Membership Presenter: Jonathan Sippel

24 Group Membership Provides a level of agreement between non-faulty processes in a distributed application. Tolerates permanent and transient failures in both nodes and links. Based on two mechanisms: the Token Mechanism and the 911 Mechanism.

25 Token Mechanism Nodes in the membership are ordered in a logical ring Token passed at a regular interval from one node to the next Token carries the authoritative knowledge of the membership Node updates its local membership information according to the received token
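
As a rough illustration of this slide (class and field names are assumptions, not the RAIN implementation), the token can be modeled as a record carrying the membership list and a sequence number; each node adopts the token's membership as its local view, keeps a copy, and forwards the token to its successor on the logical ring.

```python
from dataclasses import dataclass

@dataclass
class Token:
    members: list      # authoritative membership, in logical-ring order
    seq: int = 0       # bumped on every hop; used later for 911 arbitration

class Node:
    def __init__(self, name):
        self.name = name
        self.membership = []     # local view of the membership
        self.local_copy = None   # local copy of the last token seen

    def on_token(self, token):
        """Adopt the token's membership, keep a copy, and forward the token.

        Assumes this node appears in the token's membership list."""
        self.membership = list(token.members)
        token.seq += 1
        self.local_copy = Token(list(token.members), token.seq)
        nxt = token.members[(token.members.index(self.name) + 1) % len(token.members)]
        return nxt, token        # caller delivers the token to the successor
```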

26 Token Mechanism (cont’d) Aggressive Failure Detection. [Figure: aggressive failure detection illustrated on a four-node ring A, B, C, D.]

27 Token Mechanism (cont’d) Conservative Failure Detection. [Figure: conservative failure detection illustrated on the same four-node ring A, B, C, D.]

28 911 Mechanism When is the 911 Mechanism used? Token Regeneration - Regenerate a token that is lost if a node or a link fails Dynamic Scalability - Add a new node to the system What is a 911 message? Request for the right to regenerate the lost token Must be approved by all the live nodes in the membership

29 Token Regeneration Only one node is allowed to regenerate the token. The token sequence number is used to guarantee mutual exclusivity and is incremented every time the token is passed from one node to the next. Each node makes a local copy of the token on receipt. The sequence number on the requesting node’s local copy of the token is added to the 911 message and compared to the sequence numbers on the local copies of the token held by the other live nodes. The 911 request is denied by any node with a more recent copy of the token (see the sketch below).
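
The arbitration rule on this slide reduces to a sequence-number comparison. A minimal sketch (the function name and message shape are assumptions), reusing the local token copies from the previous sketch:

```python
def approves_911(local_copy_seq, request_seq):
    """A live node approves the 911 request only if it does not hold a more
    recent copy of the token than the requester."""
    return request_seq >= local_copy_seq

# The requester may regenerate the token only if every live node approves:
# can_regenerate = all(approves_911(n.local_copy.seq, my_copy.seq) for n in live_nodes)
```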

30 Dynamic Scalability 911 message sent by a new node to join the group Receiving node Treats the message as a join request because the originating node is not in the membership Updates the membership the next time it receives the token and sends it to the new node

31 Data Storage The RAIN system provides a distributed storage system based on a class of erasure-correcting codes called array codes, which provide a mathematical means of representing data so that lost information can be recovered.

32 Data Storage (cont’d) Array codes With an (n, k) erasure-correcting code, k symbols of original data are represented with n symbols of encoded data With an m-erasure-correcting code, the original data can be recovered even if m symbols of the encoded data are lost A code is said to be Maximum Distance Separable (MDS) if m = n – k The only operations necessary to encode/decode an array code are simple binary XOR operations

33 Data Storage (cont’d) Data Placement Scheme for a (6, 4) Array Code (one column per node):

Parity:  A+C+d+e   F+B+c+d   E+A+b+c   D+F+a+b   C+E+f+a   B+D+e+f
Data:    F         E         D         C         B         A
Data:    f         e         d         c         b         a

34 Data Storage (cont’d) Data Placement Scheme for a (6, 4) Array Code with the last two columns (nodes) lost:

Parity:  A+C+d+e   F+B+c+d   E+A+b+c   D+F+a+b   ?   ?
Data:    F         E         D         C         ?   ?
Data:    f         e         d         c         ?   ?

The lost symbols are recovered from the surviving columns:
A = C + d + e + (A + C + d + e)
b = A + (E + A + b + c) + c + E
a = b + (D + F + a + b) + D + F
B = F + c + (F + B + c + d) + d
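
Because the code uses only XOR, the recovery on this slide can be checked in a few lines. The sketch below is a minimal illustration in Python using one-byte symbols and the column layout from the placement table above; the variable names and the random test data are mine, not part of the RAIN software.

```python
import os
from functools import reduce

# Arbitrary one-byte data symbols a..f and A..F (XOR works the same for any symbol size).
data = {name: os.urandom(1)[0] for name in "abcdefABCDEF"}
xor = lambda *vals: reduce(lambda x, y: x ^ y, vals)

# Parity symbol stored in each column (node), following the placement table.
parity = {
    1: xor(data["A"], data["C"], data["d"], data["e"]),   # stored with F, f
    2: xor(data["F"], data["B"], data["c"], data["d"]),   # stored with E, e
    3: xor(data["E"], data["A"], data["b"], data["c"]),   # stored with D, d
    4: xor(data["D"], data["F"], data["a"], data["b"]),   # stored with C, c
    5: xor(data["C"], data["E"], data["f"], data["a"]),   # stored with B, b (lost)
    6: xor(data["B"], data["D"], data["e"], data["f"]),   # stored with A, a (lost)
}

# Nodes 5 and 6 are lost, taking a, A, b, B with them.
# Recover them from surviving data symbols and parities 1-4, as on the slide.
A = xor(data["C"], data["d"], data["e"], parity[1])   # A = C + d + e + (A+C+d+e)
b = xor(A, parity[3], data["c"], data["E"])           # b = A + (E+A+b+c) + c + E
a = xor(b, parity[4], data["D"], data["F"])           # a = b + (D+F+a+b) + D + F
B = xor(data["F"], data["c"], parity[2], data["d"])   # B = F + c + (F+B+c+d) + d

assert (A, b, a, B) == (data["A"], data["b"], data["a"], data["B"])
```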

35 Data Storage (cont’d) Distributed store/retrieve operations For a store operation a block of data of size d is encoded into n symbols, each of size d/k, using an (n, k) MDS array code For a retrieve operation, symbols are collected from any k nodes and decoded The original data can be recovered with up to n – k node failures The encoding scheme provides for dynamic reconfigurability and load balancing
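
At the system level, a store is encode-and-scatter and a retrieve is gather-and-decode. The sketch below assumes a hypothetical encode()/decode() codec pair for an (n, k) MDS array code and a simple node interface with put()/get(); none of these names come from the RAIN software.

```python
def store(block: bytes, nodes, encode, n=6, k=4):
    """Encode a block of size d into n symbols of size ~d/k and place one
    encoded symbol on each node (encode() is an assumed (n, k) MDS codec)."""
    symbols = encode(block, n, k)
    for index, (node, symbol) in enumerate(zip(nodes, symbols)):
        node.put(index, symbol)

def retrieve(nodes, decode, n=6, k=4):
    """Gather symbols from any k reachable nodes and decode; up to n - k
    node failures are tolerated."""
    collected = {}
    for node in nodes:
        try:
            index, symbol = node.get()
        except ConnectionError:
            continue                    # skip failed or unreachable nodes
        collected[index] = symbol
        if len(collected) == k:
            break                       # any k symbols suffice for an MDS code
    return decode(collected, n, k)
```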

36 RAIN Contributions to Distributed Computing Systems Fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures Fault management techniques based on group membership Data storage schemes based on computationally efficient error-control codes

37 References Vasken Bohossian, Chenggong C. Fan, Paul S. LeMahieu, Marc D. Riedel, Lihao Xu, and Jehoshua Bruck, “Computing in the RAIN: A Reliable Array of Independent Nodes,” IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 2, February 2001. http://www.rainfinity.com/

