1 Clusters Part 4 - Systems
Lars Lundberg
The slides in this presentation cover Part 4 (Chapters 12-15) in Pfister's book. We will, however, only present slides for Chapter 12. This part is the most important one in Pfister's book!

2 High Availability
- What we today call high availability was previously called fault tolerance.
- Traditionally, fault tolerance was handled entirely in hardware: faults are handled by the hardware, and the software does not have to care.
- Cluster systems offer fault tolerance in software, i.e. they use standard hardware.

3 Classes of Availability

4 Measuring Availability
Availability is usually measured as the percentage of time that the system is available, assuming that the system is either fully available or not available at all.
Potential problems when measuring availability:
- What if the system is only partly available?
- Should we include periods when the system is not used?
- Should we include planned outages for maintenance etc.?
Planned outages can be a real problem in non-stop operation environments.
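
A minimal sketch (not from the book) of what the percentage measure means in practice: it converts an availability figure into the minutes of outage per year that the figure allows, assuming the system is either fully up or fully down.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def availability(uptime_minutes: float, total_minutes: float) -> float:
    """Availability as a percentage, assuming the system is either fully up or fully down."""
    return 100.0 * uptime_minutes / total_minutes

def downtime_per_year(availability_percent: float) -> float:
    """Minutes of outage per year implied by a given availability percentage."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)

if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% available -> about {downtime_per_year(pct):.0f} minutes of outage per year")
```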

5 High Availability vs. Continuous operation
If we separate the planned outages (maintenance, upgrades etc.) from the unplanned ones (crashes, faults etc.), we can make the distinction between:
- High availability (few and short unplanned outages)
- Continuous operation (few and short planned and unplanned outages)
High availability and continuous operation are not always equally important.

6 Reasons for unplanned outages
- Loss of power
- Application software
- Operating system software
- Subsystem software (e.g. databases)
- Hardware with moving parts (e.g. disks, fans, printers)
- I/O adapters
- Memory
- Processors, caches etc.

7 Outage Duration
- Hardware does not "break" as often as software, but when it does it takes longer to repair.
- Traditional hardware fault tolerance can recover from a fault faster than software fault-tolerant cluster systems.
- Very few clusters can recover from a fault in less than 30 seconds; it often takes much longer.

8 Definition of High Availability
A system is highly available if:
- No replaceable piece is a single point of failure.
- The system is sufficiently reliable that you are likely to be able to repair or replace any broken part before anything else breaks.
A single point of failure is a single element of hardware or software which, if it fails, brings down the entire system.

9 Summary of High Availability
- For 24x365 operation (24 hours a day, 365 days a year), you must consider things like cooling and power supply, and also provide careful system management.
- 24x365 operation also implies dealing with planned outages and disasters, not just breakage and errors.
- Disregarding power failures, software causes the largest number of outages.
- The longest unplanned outages are caused as much by hardware as by software (again disregarding power failures).

10 Summary of High Availability cont.
- Avoid single points of failure.
- Clusters can help with planned outages and with some unplanned errors in hardware and software.
- Hardware-based fault tolerance fails over instantaneously, but does not help with software errors or planned outages.
- There is no industry consensus on what "high availability" and "fault tolerance" mean.

11 Failover
[Figure: the client talks to Bozo while Alice watches Bozo's "I am OK" messages; when Bozo fails, the client is redirected to Alice.]
One computer (Alice) is watching another computer (Bozo); if Bozo dies, Alice takes over Bozo's work.
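
A minimal sketch of the watching idea, assuming the cluster software supplies two hypothetical callbacks: one reporting when the last "I am OK" message from Bozo arrived, and one that actually takes over Bozo's work.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds of silence before Bozo is declared dead

def watch_and_fail_over(last_heartbeat_from_bozo, take_over_bozos_work):
    """Alice's watchdog loop; both arguments are hypothetical callbacks
    supplied by the surrounding cluster software."""
    while True:
        silence = time.time() - last_heartbeat_from_bozo()
        if silence > HEARTBEAT_TIMEOUT:
            take_over_bozos_work()   # failover: Alice now does Bozo's work
            return
        time.sleep(1.0)
```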

12 Failover problems
If Alice tries to take over control at the same time as Bozo comes back up again, we will have two computers struggling for control at the same time. This can cause a lot of problems.

13 Avoiding planned outages
If we want to upgrade Bozo we can do the following (see the sketch after this list):
- Do a controlled (forced) failover to Alice
- Upgrade Bozo while Alice is taking care of business
- Do a failback to Bozo
- Alice can now also be upgraded
Consequently, one of the advantages of clusters is that we do not have to take the system down during upgrades and maintenance. Problems may, however, occur when:
- the upgrade includes a change of data format on disk, or
- the software runs in parallel across the cluster nodes.
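
A minimal sketch of the upgrade sequence above. The `cluster.fail_over`, `cluster.fail_back` and `upgrade_node` hooks are hypothetical placeholders for whatever the cluster software provides; only the order of operations is the point.

```python
def rolling_upgrade(cluster, upgrade_node):
    """Upgrade both nodes without taking the service down."""
    cluster.fail_over(work_of="bozo", to="alice")   # controlled (forced) failover
    upgrade_node("bozo")                            # Alice is taking care of business
    cluster.fail_back(work_of="bozo")               # Bozo takes its own work back
    cluster.fail_over(work_of="alice", to="bozo")   # now Alice can be upgraded too
    upgrade_node("alice")
    cluster.fail_back(work_of="alice")
```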

14 Moving resources when failing over
- When an application is moved from one node to another, the resources that it needs must also be moved, e.g. files and IP addresses.
- Early high-availability cluster systems left this problem to the user, i.e. the user had to write a number of shell scripts that were executed during a failover.
- One way to help the user is to define the dependencies between different applications and resources. The user then only has to define where a certain application should go, and the cluster software will move the necessary resources along with the application (see the sketch below).
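
A minimal sketch of dependency-driven resource movement, assuming the dependencies have been declared as a simple mapping (the resource names are made up). Bringing resources online in topological order guarantees that, for example, the disk volume is online before the files that live on it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency declarations: each resource maps to the resources
# it needs before it can be brought online.
dependencies = {
    "web_app":      {"ip_address", "shared_files"},
    "shared_files": {"disk_volume"},
    "ip_address":   set(),
    "disk_volume":  set(),
}

def bring_online(resource: str) -> None:
    print(f"bringing {resource} online")   # placeholder for the real action

# Dependencies come out of static_order() before the things that need them.
for resource in TopologicalSorter(dependencies).static_order():
    bring_online(resource)
```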

15 Potential problems when moving resources
- Resources may depend on individual cluster nodes, e.g. a certain disk may only be accessible from a certain node.
- The procedure for bringing resources on-line may depend on the node, e.g. a printer queue may already be defined on some nodes, and redefining it may cause problems.
- The information about the resource dependencies must be available and consistent throughout the cluster nodes, even when the node responsible for updating this information crashes.

16 Moving data - replication vs. switchover
Moving data from Bozo to Alice can be done in two ways:
- Replication (separate disks/shared nothing, see Figure 108):
  - Bozo and Alice have their own separate disks, and the changes made on Bozo are continuously sent to Alice.
  - As an alternative, the changes on Bozo could be sent in batches at certain time intervals.
- Switchover (shared disk, see Figure 109):
  - A disk (or other storage device) is connected to both Bozo and Alice, and when Bozo crashes, Alice takes control of the disk.
Switchover is often preferred in high-availability systems.

17 Replication vs. switchover
Replication advantages:
- It is easier to add a new node when using replication.
- It can be difficult to synchronize the disks in switchover configurations, e.g. the two systems must agree on disk partitions, volume names etc.
- In switchover the disks are all in one place. This limits the distance between the nodes, and it is also a problem if the room with the disks is flooded or hit by some other disaster.
- Replication can use simpler storage units because:
  - The disks do not need to support dual access
  - The disks themselves are not a single point of failure

18 Replication vs. switchover
Switchover advantages:
- It is easier to back up the disk.
- Less disk space is required.
- Less overhead: when using replication, Bozo must send a copy of each change to Alice, and Alice must write these updates to her local disks. This uses CPU and I/O capacity.
- If Bozo waits for Alice to signal that each update has been recorded correctly, performance will be degraded. If Bozo does not wait, data may be lost when a failure occurs (see the sketch below).
- Failback is easier.
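
A minimal sketch of that last trade-off, with the actual I/O operations as hypothetical placeholders: synchronous replication waits for Alice's acknowledgement (safe but slower), while asynchronous replication returns immediately (faster, but updates in flight are lost if Bozo crashes).

```python
def replicated_write_sync(write_local, send_to_alice, wait_for_ack, data):
    """Bozo blocks until Alice confirms the update: no data loss, lower performance."""
    write_local(data)
    send_to_alice(data)
    wait_for_ack()

def replicated_write_async(write_local, send_to_alice, data):
    """Bozo returns immediately: better performance, but updates still in flight
    are lost if Bozo crashes before Alice has recorded them."""
    write_local(data)
    send_to_alice(data)
```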

19 Avoiding corrupt data - transactions
- When Bozo crashes, it might corrupt data or leave it in an inconsistent state.
- Transactions are used to avoid this problem.
- Transactions are usually implemented by keeping a log file on stable storage (e.g. a mirrored disk).
- No matter what happens (assuming the stable storage stays stable), a consistent state of the data can be recreated from the log file.
- In replicated systems, transactions are implemented with a technique called two-phase commit (sketched below).
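
A minimal sketch of the coordinator side of two-phase commit; the participant interface (`prepare`, `commit`, `abort`) is assumed for illustration, not taken from the book.

```python
def two_phase_commit(participants, transaction) -> bool:
    """Run one transaction to completion on all replicas, or on none of them."""
    # Phase 1: every replica writes the transaction to its log on stable
    # storage and votes on whether it can commit.
    votes = [p.prepare(transaction) for p in participants]

    if all(votes):
        # Phase 2: everyone voted yes, so all replicas commit ...
        for p in participants:
            p.commit(transaction)
        return True
    # ... otherwise the transaction is rolled back everywhere.
    for p in participants:
        p.abort(transaction)
    return False
```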

20 Failing over communication
- When Alice takes over the job from Bozo, the communication from the client is redirected using IP takeover.
- IP takeover is achieved by resetting one (or more) of the communication adapters on Alice to respond to the IP address(es) that Bozo was using.
- Since most communication protocols retransmit after a time-out, the client computers never know the difference. However, the people at the client computers probably have to log in again, i.e. their sessions are usually aborted at failover.
- An alternative way of failing over communication is that each client has multiple server IP addresses: the primary server, the secondary server and so on. If the primary server does not respond, the client tries to contact the secondary server, and so on (see the sketch below).
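
A minimal sketch of the client-side alternative: the client simply tries the primary address first and then the secondary. The addresses and port are made-up examples.

```python
import socket

SERVER_ADDRESSES = ["192.0.2.10", "192.0.2.11"]   # primary (Bozo), secondary (Alice)
SERVICE_PORT = 5000                                # made-up example port

def connect_to_service(timeout: float = 5.0) -> socket.socket:
    """Try each server address in turn until one answers."""
    for address in SERVER_ADDRESSES:
        try:
            return socket.create_connection((address, SERVICE_PORT), timeout=timeout)
        except OSError:
            continue   # this server did not respond; try the next one
    raise ConnectionError("no server responded")
```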

21 Time for doing a failover
- The time to reach a fully operational state after a failover can be substantial. In the best case the time can be as low as tens of seconds.
- The failover times can be reduced by having pairs of processes (sketched below):
  - There is one process on Alice for each process on Bozo.
  - Every time the process on Bozo changes its state, that change is reflected in the process on Alice.
  - Tandem has claimed that by using this technique, sub-second failover is achievable.
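
A minimal sketch of a process pair, assuming a hypothetical `send_to_backup` channel between the nodes: the primary mirrors every state change to its backup, so the backup is already up to date when it has to take over.

```python
class PrimaryProcess:
    """Runs on Bozo: applies each state change locally and mirrors it to the backup."""
    def __init__(self, send_to_backup):
        self.state = {}
        self.send_to_backup = send_to_backup   # hypothetical channel to Alice

    def apply(self, key, value):
        self.state[key] = value
        self.send_to_backup(key, value)        # backup is kept in lock-step

class BackupProcess:
    """Runs on Alice: receives every change, so it is already up to date at failover."""
    def __init__(self):
        self.state = {}

    def on_update(self, key, value):
        self.state[key] = value
```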

22 Failover to where?
- This question becomes interesting when there are more than two nodes in the cluster.
- Simple add-on high-availability systems often use static schemes, e.g. if Bozo dies, put jobs A and B on Alice and the rest on Clara.
- Sophisticated cluster systems provide mechanisms for automatic load balancing (possibly also considering some user-defined priorities), as sketched below.
- Dynamic load balancing is easier in shared-disk clusters than in shared-nothing clusters. In shared-nothing clusters replication is used, and this makes the backup order more static.
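
A minimal sketch of dynamic placement: each of the dead node's jobs goes to the currently least-loaded surviving node. The load figures are made-up example data; a real system would also weigh user-defined priorities.

```python
# Made-up example data: current load on the surviving nodes and the load
# that each of Bozo's jobs adds.
node_load = {"alice": 0.40, "clara": 0.25}
bozo_jobs = {"A": 0.20, "B": 0.10}

for job, extra_load in bozo_jobs.items():
    target = min(node_load, key=node_load.get)   # currently least-loaded node
    node_load[target] += extra_load
    print(f"job {job} -> {target}")
```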

23 Global locks
- In a shared-disk system, one must handle the problem of system-wide locks when a node crashes.
- The processes on the node that crashed were probably holding resources that processes on other nodes will have to use. If the locks are not released, the entire system will lock up.
- There are two ways of handling this problem:
  - Letting each application keep track of the locks that it was holding
  - Letting a global lock manager keep track of the locks that the applications on the crashed node were holding (see the sketch below)
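
A minimal sketch of the second approach: a global lock manager that remembers which node holds each lock, so that all locks held by a crashed node can be released in one sweep.

```python
class GlobalLockManager:
    def __init__(self):
        self.lock_holder = {}   # resource -> node currently holding its lock

    def acquire(self, resource, node) -> bool:
        if resource in self.lock_holder:
            return False        # somebody else holds the lock
        self.lock_holder[resource] = node
        return True

    def release_all_held_by(self, crashed_node) -> None:
        """Called when a node is declared dead, so the cluster does not lock up."""
        held = [r for r, n in self.lock_holder.items() if n == crashed_node]
        for resource in held:
            del self.lock_holder[resource]
```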

24 Heartbeats
- Heartbeat messages are used to detect when a node is dead.
- Each node sends short messages to the other nodes, telling them that it is alive.
- If a heartbeat message does not arrive within a time-out period, the node is declared dead.
- One problem with this approach is that the message could be delayed for various reasons, and in that case a node which is declared dead may in fact be OK. This can cause a lot of problems.
- Another problem is that the node may be OK, but the communication link carrying the heartbeat is not. This could also lead to the dangerous conclusion that an OK node is dead.
- To improve the reliability of the heartbeat method, the cluster might send heartbeat signals on a number of different channels, e.g. a normal LAN, RS232 serial links, I/O links etc. (see the sketch below).
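
A minimal sketch of multi-channel heartbeat detection: a node is declared dead only when its heartbeat is overdue on every channel, which reduces false alarms caused by a single failed link. The per-channel timestamp table is assumed to be maintained elsewhere.

```python
import time

HEARTBEAT_TIMEOUT = 10.0   # seconds

def node_seems_dead(last_heartbeat, now=None):
    """`last_heartbeat` maps channel name -> time of the last heartbeat seen on it.
    The node is declared dead only if it is silent on every channel."""
    now = time.time() if now is None else now
    return all(now - t > HEARTBEAT_TIMEOUT for t in last_heartbeat.values())

# Example: the LAN heartbeat is late but the serial link is still alive,
# so the node is NOT declared dead.
print(node_seems_dead({"lan": time.time() - 60, "rs232": time.time() - 2}))   # False
```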

25 Actions when Bozo is declared dead
- Establish a new heartbeat chain that excludes Bozo.
- Inform parallel subsystems that were running on Bozo, such as databases, of what has occurred and is about to happen.
- Fence Bozo off from its resources (e.g. disks).
- Form a cluster-wide, consistent plan defining how Bozo's resources should be redistributed.
- Execute the plan, i.e. move the resources etc.
- Inform the subsystems that the resource reallocation has been completed.
- Resume normal operation.

26 Alternatives to heartbeats
- Instead of heartbeats, one can use the opposite approach: a liveness check.
- This means that Alice will, at certain points, ask Bozo if he is OK.
- A liveness check suffers from the same kind of problems as heartbeats, i.e. it is hard to guarantee a response within certain time limits.
- If a cluster node has reason to believe that the rest of the system thinks that it is dead, the node had better commit suicide. This could happen when a node detects that its own heartbeat signals have been delayed beyond the time-out limit (see the sketch below).
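
A minimal sketch of the "commit suicide" rule: if a node notices that its own outgoing heartbeats have been delayed past the cluster time-out, it halts itself before the other nodes start taking over its resources.

```python
import os
import signal
import time

HEARTBEAT_TIMEOUT = 10.0   # the same cluster-wide time-out the other nodes use

def check_own_heartbeat(time_last_heartbeat_was_sent: float) -> None:
    """If this node's own heartbeats are overdue, the rest of the cluster will
    soon declare it dead and take over its resources, so halt now."""
    if time.time() - time_last_heartbeat_was_sent > HEARTBEAT_TIMEOUT:
        os.kill(os.getpid(), signal.SIGKILL)   # better a dead node than two active ones
```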

27 IBM RS/6000 Cluster Technology (Phoenix)
- The purpose of Phoenix is to help the developer build cluster-parallel applications that are highly available, i.e. Phoenix is a development tool and does not do anything by itself.
- The product is highly scalable; it is designed for 512 nodes and has been run on clusters with more than 400 nodes.
- There are three core services in Phoenix (see Figure 111):
  - Topology Services: This service has no direct interface to the application. It manages heartbeats and maintains a dynamic map of the state of the other cluster nodes.
  - Group Services: The key interface that helps the application deal with high-availability issues when some event happens.
  - Event Manager: This service provides a way to inform a program running anywhere in the cluster when something interesting happens.

28 Microsoft's Clustering Services (MSCS)
- MSCS currently supports only two-node clusters; later versions will, however, support a larger number of nodes.
- MSCS is, unlike Phoenix, a self-contained high-availability cluster product.
- A key component of MSCS is the quorum resource, which is usually a disk. The purpose of the quorum resource is to make sure that only one of the two nodes thinks that it is in charge of the cluster.
- Each node has access to a dynamic, but cluster-wide consistent, configuration database.

29 Scaling
- The more nodes there are in a cluster, the less you pay for high availability, e.g. (a worked example follows below):
  - The additional cost for handling a node failure in a one-node system is 100%, i.e. we need two computers instead of one.
  - The additional cost for handling a node failure in a four-node system is 25%, i.e. we need five computers instead of four.
- One implication of this is that it is desirable to use computers that cannot individually fulfill the job requirements.
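
A worked version of the cost argument: with N nodes sharing the work, one spare node adds 1/N extra hardware cost.

```python
for nodes in (1, 2, 4, 8):
    extra_cost = 100.0 / nodes   # one spare node relative to N working nodes
    print(f"{nodes}-node cluster + 1 spare -> {extra_cost:.0f}% extra hardware cost")
```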

30 Disaster Recovery
- Disasters differ from ordinary failures in that they are distributed over an area, e.g. flooding of a room, earthquakes etc.
- Shared-disk switchover solutions will not work for disasters.
- Some crude and simple solutions are often used:
  - Sending away a backup tape to a remote location at certain intervals
  - Sending away a backup electronically to a remote location at certain intervals
- The key difference between disaster recovery and normal clustering is the distance between the nodes. This causes delays which can strongly affect performance.

31 SMP and CC-NUMA Availability
- If one processor node in an SMP or a CC-NUMA multiprocessor crashes, the entire system will crash.
- There are a number of reasons for this, e.g.:
  - The caches on the processor nodes may contain the only valid copy of a certain variable.
  - The data structures in the operating system are shared between the processors, and if a processor crashes it may corrupt the shared data.

