3 Selected Literature
- Gregory Pfister: In Search of Clusters, 2nd ed., Pearson 1998
- Documentation for the Windows Server 2008 Failover Cluster (on the Microsoft web pages)
- Sven Ahnert: Virtuelle Maschinen mit VMware und Microsoft, 2nd ed., Addison-Wesley 2007 (the 3rd edition is announced for June 26, 2009)
4 What is a Cluster?
- Wikipedia says: A computer cluster is a group of linked computers, working together closely so that in many respects they form a single computer.
- Gregory Pfister says: A cluster is a type of parallel or distributed system that
  - consists of a collection of interconnected whole computers,
  - and is utilized as a single, unified computing resource.
5 Features (Goals) of Clusters
- High Performance Computing
- Load Balancing
- High Availability
- Scalability
- Simplified System Management
- Single System Image
6 Basic Types of Clusters
- High Performance Computing (HPC) Clusters
- Load Balancing Clusters (aka Server Farms)
- High-Availability Clusters (aka Failover Clusters)
9 Microsoft Network Load Balancing (1)
ISA 2006 Server: Internet Security and Acceleration Server. Key features of the next generation of Microsoft ISA Server include Web antimalware, HTTPS inspection and the Network Inspection System.
12 Distributed Lock Manager
Linux clustering
- Both Red Hat and Oracle have developed clustering software for Linux.
- OCFS2, the Oracle Cluster File System, was added to the official Linux kernel with version , in January . The alpha-quality code warning on OCFS2 was removed in .
- Red Hat's cluster software, including their DLM and Global File System, was officially added to the Linux kernel with version , in November 2006.
- Both systems use a DLM modeled on the venerable VMS DLM. Oracle's DLM has a simpler API: the core function, dlmlock(), has eight parameters, whereas the VMS SYS$ENQ service and Red Hat's dlm_lock both have 11.
16 Selected HA Cluster Products (1)
- VMScluster (DEC 1984, today: HP): shared everything cluster with up to 96 nodes.
- IBM HACMP (High Availability Cluster Multiprocessing, 1991): up to 32 nodes (IBM System p with AIX or Linux).
- IBM Parallel Sysplex (1994): shared everything, up to 32 nodes (mainframes with z/OS).
- Solaris Cluster, aka Sun Cluster: up to 16 nodes.
17 Selected HA Cluster Products (2)
- Heartbeat (HA Linux project, started in 1997): no architectural limit for the number of nodes.
- Red Hat Cluster Suite: up to 128 nodes; uses a DLM.
- Windows Server 2008 Failover Cluster: formerly Microsoft Cluster Server (MSCS, since 1997); up to 16 nodes on x64 (8 nodes on x86).
- Oracle Real Application Cluster (RAC): two or more computers, each running an instance of the Oracle Database, concurrently access a single database; up to 100 nodes.

Notes on Heartbeat: The HA Linux project's main software product is Heartbeat, a GPL-licensed portable cluster management program for high-availability clustering. Its most important features are:
- no fixed maximum number of nodes: Heartbeat can be used to build large clusters as well as very simple ones
- resource monitoring: resources can be automatically restarted or moved to another node on failure
- fencing mechanism to remove failed nodes from the cluster
- sophisticated policy-based resource management, resource inter-dependencies and constraints
- time-based rules allow for different policies depending on time
- several resource scripts (for Apache, DB2, Oracle, PostgreSQL etc.) included
- GUI for configuring, controlling and monitoring resources and nodes

Notes on Red Hat Cluster Suite: Service locking and control is guaranteed through fencing and STONITH. More recent versions of Red Hat use a distributed lock manager (DLM) to allow fine-grained locking without a single point of failure. Earlier versions of the cluster suite relied on GULM (Grand Unified Lock Manager), which could be clustered but still presented a point of failure if the nodes acting as GULM servers were to fail. GULM as a locking manager is available but deprecated in Red Hat Cluster Suite 5.

Red Hat Cluster Suite: Technical Details
- Support for up to 128 nodes (16 on Red Hat Enterprise Linux 3 and 4)
- NFS (Unix) / CIFS (Windows) / GFS (multiple operating systems) file system failover support
- Service failover support
- Fully shared storage subsystem
- Comprehensive data integrity
- SCSI and Fibre Channel support
21 Cluster with Virtual Machines (1)
One physical machine as hot standby for several physical machines. [Figure; legend: physical, virtual, cluster]
22 Cluster with Virtual Machines (2)
Consolidation of several clusters. [Figure; legend: physical, virtual, cluster]
23 Cluster with Virtual Machines (3)
Clustering hosts (failing over whole VMs). [Figure; legend: physical, virtual, cluster]
24 iSCSI
Internet Small Computer Systems Interface
- is a storage area network (SAN) protocol,
- carries SCSI commands over IP networks (LAN, WAN, Internet),
- is an alternative to Fibre Channel (FC), using an existing network infrastructure.
An iSCSI client is called an iSCSI Initiator.
An iSCSI server is called an iSCSI Target.

Requirements for Failover Cluster: If you are using iSCSI, each clustered server must have one or more network adapters or host bus adapters that are dedicated to the cluster storage. The network you use for iSCSI cannot be used for network communication. In all clustered servers, the network adapters you use to connect to the iSCSI storage target should be identical, and we recommend that you use Gigabit Ethernet or higher. You cannot use teamed network adapters, because they are not supported with iSCSI. For more information about iSCSI, see the iSCSI FAQ on the Microsoft Web site.
25 iSCSI Initiator
An iSCSI initiator initiates a SCSI session, i.e. sends a SCSI command to the target.
- A Hardware Initiator (host bus adapter, HBA)
  - handles the iSCSI and TCP processing and Ethernet interrupts independently of the CPU.
- A Software Initiator
  - runs as a memory resident device driver,
  - uses an existing network card,
  - leaves all protocol handling to the main CPU.
26 iSCSI Target
An iSCSI target
- waits for iSCSI initiators' commands,
- provides required input/output data transfers.
Hardware Target: A storage array (SAN) may offer its disks via the iSCSI protocol.
A Software Target
- offers (parts of) the local disks to iSCSI initiators,
- uses an existing network card,
- leaves all protocol handling to the main CPU.
27 Logical Unit Number (LUN)
A Logical Unit Number (LUN)
- is the unit offered by iSCSI targets to iSCSI initiators,
- represents an individually addressable SCSI device,
- appears to an initiator like a locally attached device,
- may physically reside on a non-SCSI device, and/or be part of a RAID set,
- may restrict access to a single initiator,
- may be shared between several initiators (leaving the handling of access conflicts to the file system, the operating system, or some cluster software). Attention: many iSCSI target solutions do not offer this functionality.
28 CHAP Protocol
iSCSI
- optionally uses the Challenge-Handshake Authentication Protocol (CHAP) for authentication of initiators to the target,
- does not provide cryptographic protection for the data transferred.
CHAP
- uses a three-way handshake,
- bases the verification on a shared secret, which must be known to both the initiator and the target.
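The three-way handshake can be sketched in a few lines of Python. This is a minimal illustration, assuming the MD5 variant of CHAP as defined in RFC 1994 (the response is the hash of the identifier byte, the shared secret, and the challenge). The function names and the secret value are hypothetical, and a real iSCSI login wraps these values in login PDUs that are not modeled here.

```python
import hashlib
import hmac
import os


def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    """Compute the CHAP response: MD5 over identifier byte, secret, challenge."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()


def target_verify(identifier: int, secret: bytes, challenge: bytes,
                  response: bytes) -> bool:
    """Target side: recompute the expected response and compare."""
    expected = chap_response(identifier, secret, challenge)
    return hmac.compare_digest(expected, response)


# Step 1 (Challenge): the target sends an identifier and a random challenge.
identifier = 1
challenge = os.urandom(16)

# Step 2 (Response): the initiator answers using the shared secret.
# The secret itself never travels over the network.
response = chap_response(identifier, b"example-shared-secret", challenge)

# Step 3 (Success/Failure): the target checks the response.
accepted = target_verify(identifier, b"example-shared-secret", challenge, response)
rejected = target_verify(identifier, b"some-other-secret", challenge, response)
```

Note that only the authentication is covered: the data exchanged afterwards is not encrypted by CHAP.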
30 Preparing a Failover Cluster
In order to build a Windows Server 2008 Failover Cluster you need to:
- Install the Failover Cluster Feature (in Server Manager).
- Connect networks and storage.
  - Public network
  - Heartbeat network
  - Storage network (FC or iSCSI, unless you use SAS)
- Validate the hardware configuration (Cluster Validation Wizard in the Failover Cluster Management snap-in).
31 Preparing the Shared Storage All disks on a shared storage bus are automatically placed in an offline state when first mapped to a cluster node. This allows storage to be simultaneously mapped to all nodes in a cluster even before the cluster is created. No longer do nodes have to be booted one at a time, disks prepared on one and then the node shut down, another node booted, the disk configuration verified, and so on.
32 The Cluster Validation Wizard
- Run the Cluster Validation Wizard (in Failover Cluster Management).
- Adjust your configuration until the wizard does not report any errors.
- An error-free cluster validation is a prerequisite for obtaining Microsoft support for your cluster installation.
- A full test of the Wizard consists of:
  - System configuration
  - Inventory
  - Network
  - Storage
33 Initial Creation of a Windows Server 2008 Failover Cluster
Use the Create Cluster Wizard (in Failover Cluster Management) to create the cluster. You will have to specify
- which servers are to be part of the cluster,
- a name for the cluster,
- an IP address for the cluster.
Other parameters will be chosen automatically, and can be changed later.
35 Fencing
- (Node) Fencing is the act of forcefully disabling a cluster node (or at least keeping it from doing disk I/O: Disk Fencing).
- The cluster software decides when a node needs to be fenced.
- A node can be fenced, for example,
  - by disabling its port(s) on a Fibre Channel switch,
  - by (remotely) powering down the node,
  - by using SCSI-3 Persistent Reservations.
36 SAN Fabric Fencing
- Some Fibre Channel switches allow programs to fence a node by disabling the switch port(s) that it is connected to.
- The fencing software logs into the switch, and disables the specific port(s).
- Available, for example, for Brocade, and for Vixel FC switches.
- Used both by Red Hat Global File System (GFS) and the PolyServe File System (PSFS).
37 STONITH
- "Shoot the other node in the head".
- A special STONITH device (a Network Power Switch) allows a cluster node to power down other cluster nodes.
- Used, for example, in Heartbeat, the Linux HA project.
38 SCSI-3 Persistent Reservation
- Allows multiple nodes to access a SCSI device.
- Blocks other nodes from accessing the device.
- Supports multiple paths from host to disk.
- Reservations are persistent across SCSI bus resets, and node reboots.
- Uses reservations, and registration.
- To eject another system's registration, a node issues a pre-empt and abort command.
- Used by Sun Cluster, Veritas Advanced Cluster, Oracle Cluster. Implemented by EMC Symmetrix, Sun T3, and Hitachi storage systems.

Notes: SCSI-3 PR uses a concept of registration and reservation. Each participating system registers its own key with the SCSI-3 device. Registered systems can then establish a reservation. With this method, blocking write access is as simple as removing a registration from the device. A system wishing to eject another system issues a pre-empt and abort command, which removes the other node's registration. Once a node is ejected, it has no key registered, so it cannot eject others. This method effectively avoids the split-brain condition.

A quote from the SPC-3 (SCSI Primary Commands-3) specs: "The PERSISTENT RESERVE OUT and PERSISTENT RESERVE IN commands provide the basic mechanism for dynamic contention resolution in systems with multiple initiator ports accessing a logical unit."
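The registration and reservation logic described above can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: the class and method names are hypothetical, access is modeled as "registrants only", and none of the actual PERSISTENT RESERVE IN/OUT command encoding is represented.

```python
class PersistentReservationDevice:
    """Toy model of a logical unit with SCSI-3 style persistent reservations."""

    def __init__(self):
        self.registered_keys = set()    # keys of currently registered nodes
        self.reservation_holder = None  # key that holds the reservation

    def register(self, key):
        """A node registers its own key with the device."""
        self.registered_keys.add(key)

    def reserve(self, key):
        """A registered node establishes the reservation if none exists."""
        self._require_registered(key)
        if self.reservation_holder is None:
            self.reservation_holder = key
        return self.reservation_holder == key

    def preempt_and_abort(self, own_key, victim_key):
        """Eject another node by removing its registration."""
        self._require_registered(own_key)
        self.registered_keys.discard(victim_key)
        if self.reservation_holder == victim_key:
            self.reservation_holder = own_key

    def write(self, key, data):
        """Only registered nodes may write (simplified access rule)."""
        self._require_registered(key)
        return len(data)

    def _require_registered(self, key):
        if key not in self.registered_keys:
            raise PermissionError(f"key {key!r} is not registered")


disk = PersistentReservationDevice()
disk.register("node-a")
disk.register("node-b")
disk.reserve("node-a")

# node-a fences node-b: afterwards node-b can neither write nor eject node-a.
disk.preempt_and_abort("node-a", "node-b")
```

After the last call, any `write` or `preempt_and_abort` attempted with the key `"node-b"` raises `PermissionError`, which is the property that prevents an ejected node from fencing the survivors in turn.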
39 Fencing in Failover Cluster
- Windows Server 2008 Failover Cluster uses SCSI-3 Persistent Reservations.
- All shared storage solutions (e.g. iSCSI Targets) used in the cluster must use SCSI-3 commands, and in particular support persistent reservations.
- (Many open source iSCSI targets do not fulfill this requirement, e.g. OpenFiler, or FreeNAS target.)
40 A Cluster Validation Error The Cluster Validation Wizard may report the following error:
41 Cluster Partitioning (Split-Brain)
- Cluster Partitioning (Split-Brain) is the situation when the cluster nodes break up into groups which can communicate in their groups, and with the shared storage, but not between groups.
- Cluster Partitioning can lead to serious problems, including data corruption on the shared disks.
42 Quorum Schemes
Cluster Partitioning can be avoided by using a Quorum Scheme:
- A group of nodes is only allowed to run as a cluster when it has quorum.
- Quorum consists of a majority of votes.
- Votes can be contributed by
  - Nodes
  - Disks
  - File Shares
  each of which can provide one or more votes.
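The majority rule is simple arithmetic, and a short Python sketch shows why it prevents two partitions from both running as a cluster. The function name and the example vote counts are illustrative assumptions; "majority" is taken to mean strictly more than half of all configured votes.

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """A partition has quorum if it holds a strict majority of all votes."""
    return votes_present > total_votes // 2


# Four nodes plus one witness disk: five votes in total.
total = 5
partition_with_disk = has_quorum(2 + 1, total)  # two nodes and the disk
partition_without_disk = has_quorum(2, total)   # the other two nodes

# Four nodes without any witness: a 2/2 split leaves neither side with quorum.
even_split = has_quorum(2, 4)
```

Because a strict majority is required, at most one partition can ever hold quorum, whatever the split looks like.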
43 Votes in Failover Cluster
In Windows Server 2008 Failover Cluster votes can be contributed by
- a node,
- a disk (called the witness disk),
- a file share,
each of which provides exactly one vote.
A Witness Disk or File Share contains the cluster registry hive in the \Cluster directory. (The same information is also stored on each of the cluster nodes but may be out of date.)
44 Quorum Schemes in Windows Server 2008 Failover Cluster (1)
Windows Server 2008 Failover Cluster can use any of four different Quorum Schemes:
- Node Majority
  - Recommended for a cluster with an odd number of nodes.
- Node and Disk Majority
  - Recommended for a cluster with an even number of nodes.
45 Quorum Schemes in Windows Server 2008 Failover Cluster (2)
- Node and File Share Majority
  - Recommended for a multi-site cluster with an even number of nodes.
- No Majority: Disk Only
  - A group of nodes may run as a cluster if they have access to the witness disk.
  - The witness disk is a single point of failure.
  - Not recommended. (Only for backward compatibility with Windows Server 2003.)
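The recommendations for the three majority-based schemes can be summarized as a small decision function. This Python sketch only restates the rules of thumb listed above; the function name and the `multi_site` flag are hypothetical, and the "No Majority: Disk Only" scheme is deliberately never returned because it is not recommended.

```python
def recommended_quorum_scheme(node_count: int, multi_site: bool = False) -> str:
    """Pick a quorum scheme following the recommendations on the slides."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count % 2 == 1:
        # Odd number of nodes: the nodes alone can always form a majority.
        return "Node Majority"
    if multi_site:
        # Even number of nodes spread over sites: a file share is the witness.
        return "Node and File Share Majority"
    # Even number of nodes: a witness disk contributes the tie-breaking vote.
    return "Node and Disk Majority"


scheme_for_three = recommended_quorum_scheme(3)
scheme_for_four = recommended_quorum_scheme(4)
scheme_for_four_multi_site = recommended_quorum_scheme(4, multi_site=True)
```

In each even-node case the witness brings the total number of votes to an odd value, so a clean split can no longer end in a tie.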
47 Failover Cluster Terminology
- Resources
- Groups
- Services and Applications
- Dependencies
- Failover
- Failback
- Looks-Alive ("Basic resource health check", default interval: 5 sec.)
- Is-Alive ("Thorough resource health check", default interval: 1 min.)

Windows NT: By default, MSCS knows the following resource types:
- Generic application
- Generic service
- IIS virtual root
- Physical disk
- File share
- Network name
- Print spooler
- TCP/IP address
- Time service
- Distributed Transaction Coordinator
- Microsoft Message Queue Server

Windows 2000 additionally provides the following resource types:
- Distributed File System (Dfs)
- Dynamic Host Configuration Protocol (DHCP) service
- Network News Transfer Protocol (NNTP)
- Simple Mail Transfer Protocol (SMTP)
- Windows Internet Name Service (WINS)
48 Services and Applications
- DFS Namespace Server
- DHCP Server
- Distributed Transaction Coordinator (DTC)
- File Server
- Generic Application
- Generic Script
- Generic Service
- Internet Storage Name Service (iSNS) Server
- Message Queuing
- Other Server
- Print Server
- Virtual Machine (Hyper-V)
- WINS Server
49 Properties of Services and Applications
General:
- Name
- Preferred Owner(s) (Must be specified if failback is desired.)
Failover:
- Period (Default: 6 hours): Number of hours in which the Failover Threshold must not be exceeded.
- Threshold (Default: 2 [?, 2 for File Server]): Maximum number of times to attempt a restart or failover in the specified period. When this number is exceeded, the application is left in the failed state.
Failback:
- Prevent failback (Default)
- Allow failback
  - Immediately
  - Failback between (specify range of hours of the day)
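How Period and Threshold interact can be made concrete with a small Python sketch. It is an illustration only: the class name is hypothetical, and it assumes that failures are counted in a sliding window of `period_hours`, which the slides do not specify in detail.

```python
from collections import deque


class FailoverPolicy:
    """Decide between another failover attempt and the failed state."""

    def __init__(self, threshold: int = 2, period_hours: float = 6.0):
        self.threshold = threshold
        self.period_hours = period_hours
        self.failure_times = deque()  # timestamps (in hours) of recent failures

    def record_failure(self, now_hours: float) -> str:
        # Forget failures that are older than the configured period.
        while (self.failure_times
               and now_hours - self.failure_times[0] >= self.period_hours):
            self.failure_times.popleft()
        self.failure_times.append(now_hours)
        # More failures than the threshold within the period: give up.
        if len(self.failure_times) > self.threshold:
            return "failed"
        return "failover"


# Three failures within six hours exceed the default threshold of 2.
policy = FailoverPolicy()
outcomes = [policy.record_failure(t) for t in (0.0, 1.0, 2.0)]

# If the third failure happens after the first one has aged out,
# another failover is attempted.
relaxed = FailoverPolicy()
relaxed_outcomes = [relaxed.record_failure(t) for t in (0.0, 1.0, 6.5)]
```

With the defaults, the first sequence ends in the failed state on the third failure, while the second sequence keeps failing over.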
50 Resource Types
In addition to all services and applications mentioned before:
- File Share Quorum Witness
- IP Address
- IPv6 Address
- IPv6 Tunnel Address
- MSMQ Triggers
- Network Name
- NFS Share
- Physical Disk
- Volume Shadow Copy Service Task
51 Properties of Resources (1)
General:
- Resource Name
- Resource Type
Dependencies
Policies:
- Do not restart
- Restart (Default)
  - Threshold: Maximum number of restarts in the period. Default: 1
  - Period: Period for restarts. Default: 15 min.
  - Failover all resources in the service/application if restart fails? Default: yes
  - If restart fails, begin restarting again after ... Default: 1 hour
- Pending Timeout. Default: 3 minutes
52 Properties of Resources (2)
Advanced Policies:
- Possible Owners.
- Basic resource health check interval / Thorough resource health check interval
  - Default: Use standard time period for the resource type
  - Use specified time period (defaults: 5 sec. / 1 min.)
- Run resource in separate Resource Monitor. Default: no.
Further parameters depending on the type of the resource.
54 New Cluster Architecture
[Figure: architecture diagram. Labels, in the order they appear: CluAdmin.msc, Validate, ClusAPI, WMI, RHS.exe, CPrepSrv, ClusSvc.exe, ClusRes.dll (Disk Resource), Volume C:\, Volume F:\, User/Kernel boundary, ClusDisk.sys, PartMgr.sys, NetFT, Control path, Disk.sys, MS MPIO Filter, Storport, Miniport, HBA, Storage enclosure.]
Major change: ClusDisk is no longer in the disk fencing business.
56 Cluster Service Components (1)
- Database Manager
  - Manages the configuration database contained in the registry of each cluster node.
  - Coordinates updates of the database.
  - Makes sure that updates are atomic across the cluster nodes.
- Node Manager (or: Membership Manager)
  - Maintains cluster membership.
  - The node managers of all cluster nodes communicate in order to determine the failure of a node.
- Event Processor
  - Is responsible for communicating events to the applications, and to other components of the cluster service.
57 Cluster Service Components (2)
- Communication Manager
  - Is responsible for the communication between the cluster services on the cluster nodes, e.g. related to
    - negotiating the entrance of a node into the cluster,
    - information about resource states,
    - failover and failback operations.
- Global Update Manager
  - Component for distributing update requests to all cluster nodes.
- Resource/Failover Manager: is responsible for
  - managing the dependencies between resources,
  - starting and stopping resources,
  - initializing failover and failback.
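Managing dependencies essentially means bringing resources online in an order in which every resource comes after the resources it depends on, and taking them offline in the reverse order. The following Python sketch shows this with a topological sort from the standard library (`graphlib`, Python 3.9 or later). The dependency graph is an illustrative assumption built from resource type names used earlier, not a configuration taken from a real cluster.

```python
from graphlib import TopologicalSorter

# Each resource maps to the set of resources it depends on (assumed example).
dependencies = {
    "File Share": {"Network Name", "Physical Disk"},
    "Network Name": {"IP Address"},
    "IP Address": set(),
    "Physical Disk": set(),
}

# Online order: dependencies first, dependents afterwards.
online_order = list(TopologicalSorter(dependencies).static_order())

# Offline order: the exact reverse, so dependents are stopped first.
offline_order = list(reversed(online_order))
```

A cyclic dependency would make `static_order()` raise `graphlib.CycleError`, which mirrors the fact that a group with circular dependencies could never be started.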
58 Resource Monitors
- Resource Monitors handle the communication between the cluster service and resources.
- A Resource Monitor is a separate process, using resource specific DLLs.
- A Resource Monitor uses one "poller thread" per 16 resources for performing the LooksAlive and IsAlive tests.
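The "one poller thread per 16 resources" rule amounts to a ceiling division. A minimal Python sketch, with a hypothetical function name and assuming that a partially filled group of resources still needs its own thread:

```python
import math


def poller_threads(resource_count: int, resources_per_thread: int = 16) -> int:
    """Number of poller threads a Resource Monitor needs for its resources."""
    if resource_count < 0:
        raise ValueError("resource_count must not be negative")
    return math.ceil(resource_count / resources_per_thread)


threads_for_16 = poller_threads(16)
threads_for_17 = poller_threads(17)
threads_for_40 = poller_threads(40)
```

So the seventeenth resource in a Resource Monitor is the one that causes a second poller thread to be used.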
59 Routines in a Resource DLL
The resource API for writing custom resource DLLs defines two types of functions:
- Callback routines, which can be called from the DLL:
  - LogEvent
  - SetResourceStatus
- Entry-point routines, which are called by the resource monitor:
  - Startup (called once for every resource type)
  - Open (executed when creating a new resource)
  - Online (limit: 300 ms, or asynchronously in a worker thread)
  - LooksAlive (limit: 300 ms, recommended: < 50 ms)
  - IsAlive (limit: 400 ms, recommended: < 100 ms, or asynchronously)
  - Offline (limit: 300 ms, or asynchronously in a worker thread)
  - Terminate (on error in offline or pending-timeout)
  - Close (executed when deleting a resource)
  - ResourceControl, and ResourceTypeControl (for "private properties")