FT NT: A Tutorial on Microsoft Cluster Server™ (formerly “Wolfpack”)


1 FT NT: A Tutorial on Microsoft Cluster Server™ (formerly “Wolfpack”)
June 24, 1997 FT NT: A Tutorial on Microsoft Cluster Server™ (formerly “Wolfpack”) Joe Barrera Jim Gray Microsoft Research {joebar, microsoft.com Copyright (c) 1996, 1997 Microsoft Corp.

2 Outline Why FT and Why Clusters Cluster Abstractions
June 24, 1997 Outline Why FT and Why Clusters Cluster Abstractions Cluster Architecture Cluster Implementation Application Support Q&A Copyright (c) 1996, 1997 Microsoft Corp.

3 DEPENDABILITY: The 3 ITIES
June 24, 1997 DEPENDABILITY: The 3 ITIES. RELIABILITY / INTEGRITY: does the right thing (also: large MTTF). AVAILABILITY: does it now (also: small MTTR). System Availability = MTTF / (MTTF + MTTR). Example: if 90% of terminals are up & 99% of the DB is up, only about 89% of transactions are serviced on time. Holistic vs. Reductionist view. (Figure: Security, Integrity, Reliability, Availability.) Copyright (c) 1996, 1997 Microsoft Corp.
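The 89% figure follows from multiplying the component availabilities, with each component's availability given by the standard MTTF/MTTR ratio:

```latex
A \;=\; \frac{\mathrm{MTTF}}{\mathrm{MTTF}+\mathrm{MTTR}},
\qquad
A_{\text{system}} \;=\; A_{\text{terminals}} \times A_{\text{DB}}
\;=\; 0.90 \times 0.99 \;\approx\; 0.89 .
```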

4 To Get 10 Year MTTF, Must Attack All These Areas
Case Study - Japan: "Survey on Computer Security", Japan Info Dev Corp., March (trans: Eiichi Watanabe). June 24, 1997. (Pie chart of outage causes: Vendor 42%, Telecomm lines 12%, Environment 11.2%, Application software 25%, Operations 9.3%.) MTTF by cause: Vendor (hardware and software), months; Application software, months; Communications lines, years; Operations, years; Environment, years; overall, 10 weeks. 1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES. To Get 10 Year MTTF, Must Attack All These Areas. Copyright (c) 1996, 1997 Microsoft Corp.

5 Case Studies - Tandem Trends
June 24, 1997 MTTF improved. Outage causes shifted from Hardware & Maintenance (from 50% down to 10%) to Software (62%) & Operations (15%). NOTE: systematic under-reporting of Environment and Operations errors. (Figure: trend of outage causes over time -- Hardware, Maintenance, Environment, Operations, Application Software.) Copyright (c) 1996, 1997 Microsoft Corp.

6 June 24, 1997 Summary of FT Studies Current Situation: ~4-year MTTF => Fault Tolerance Works. Hardware is GREAT (maintenance and MTTF). Software masks most hardware faults. Many hidden software outages in operations: New Software. Utilities. Must make all software ONLINE. Software seems to define a 30-year MTTF ceiling. Reasonable Goal: 100-year MTTF class 4 today => class 6 tomorrow. Copyright (c) 1996, 1997 Microsoft Corp.

7 Fault Tolerance vs Disaster Tolerance
June 24, 1997 Fault Tolerance vs Disaster Tolerance Fault-Tolerance: masks local faults RAID disks Uninterruptible Power Supplies Cluster Failover Disaster Tolerance: masks site failures Protects against fire, flood, sabotage, ... Redundant system and service at remote site. There have been a variety of technologies introduced to address your growing need for high-availability servers. The simplest of these is Data Mirroring, which continuously duplicates all disk writes onto a mirrored set of disks, possibly at a remote disaster recovery site. Today you can get Data Mirroring products for Windows NT Server from a few vendors, including Octopus and Vinca. These solutions provide excellent protection for your data, even in the event of a metropolis-wide disaster. However, they’re not high-availability solutions that have the ability to detect all types of hardware or software failure, and they have at best limited abilities to automatically restart applications. (For example, users must manually reconnect to the new server, plus any applications running on the recovery server are canceled as if it had been the server that failed.) Server Mirroring like Novell SFT III (Server Fault Tolerance) is a high-availability capability that both protects your data and provides for automatic detection of failures plus restart of selected applications. Server Mirroring provides excellent reliability, but at a very high cost since it requires an idle “standby” server that does no productive work except when the primary server fails. There are also very few applications which can take advantage of proprietary server mirroring solutions like Novell SFT III. At the high end are true “fault tolerant” systems like the excellent “NonStop” systems from Tandem. These systems are able to detect and almost instantly recover from virtually any single hardware or software failure. Most bank transactions, for example, run on this type of system. This level of reliability comes with a very high price tag, however, and each solution is based on a proprietary, single-vendor set of hardware. And, finally, there’s another high-availability technology which seems to offer the best of all these capabilities: clustering... Copyright (c) 1996, 1997 Microsoft Corp.

8 The Microsoft “Vision”: Plug & Play Dependability
June 24, 1997 The Microsoft “Vision”: Plug & Play Dependability Transactions for reliability Clusters: for availability Security All built into the OS Integrity Security Integrity / Reliability Availability Copyright (c) 1996, 1997 Microsoft Corp.

9 Cluster Goals Manageability Availability Scalability
June 24, 1997 Cluster Goals Manageability Manage nodes as a single system Perform server maintenance without affecting users Mask faults, so repair is non-disruptive Availability Restart failed applications & servers un-availability ~ MTTR / MTBF , so quick repair. Detect/warn administrators of failures Scalability Add nodes for incremental processing storage bandwidth Unlike the other limited and/or expensive high-availability technologies, clustering can provide a very cost-effective solution for resource availability, server manageability, and system scalability. The software which connects the nodes in a cluster can detect hardware and software failures, and typically has the ability to notify or even warn administrators of impending failures. In the event of an application or server failure, the cluster software can automatically restart the application on another still-running node in the cluster, usually in just a few seconds with minimal effect on users’ productivity. Most clusters include tools which allow administrators to manage the cluster as a single environment. For example, administrators can manually invoke the “failover” feature to move applications off of a server which needs maintenance, or to balance the workload across the available computing resources. Clustering can take many forms. A cluster may be nothing more than a set of standard desktop personal computers interconnected by an Ethernet. At the high end of the spectrum, the hardware structure may consist of high-performance SMP systems interconnected via a high-performance communications and I/O bus. In both cases, adding processing power and I/O bandwidth is done in small steps by the addition of another commodity system. Additional systems can be added to the cluster as needed to process more complex or an increasing number of requests from the clients. Thus, clusters can provide high availability, more flexible manageability, an incremental approach to scalability, and, because they can potentially use standard hardware and do not require idle “standby” servers, excellent cost effectiveness. Copyright (c) 1996, 1997 Microsoft Corp.

10 Fault Model Failures are independent So, single fault tolerance is a big win Hardware fails fast (blue-screen) Software fails-fast (or goes to sleep) Software often repaired by reboot: Heisenbugs Operations tasks: major source of outage Utility operations Software upgrades

11 Cluster: Servers Combined to Improve Availability & Scalability
June 24, 1997 Cluster: Servers Combined to Improve Availability & Scalability Cluster: A group of independent systems working together as a single system. Clients see scalable & FT services (single system image). Node: A server in a cluster. May be an SMP server. Interconnect: Communications link used for intra-cluster status info such as “heartbeats”. Can be Ethernet. Client PCs Printers Server A Disk array A Disk array B Server B A cluster is a set of loosely coupled, independent computer systems that behave as a single system. Each system, or “node”, in a cluster can be a single-processor server or an SMP (symmetric multi processing) server. Nodes are connected by an additional communications link called an “interconnect” which can be anything from a standard ethernet cable up to specialized, high-speed technology optimized for cluster communications. Each node in a cluster typically has its own disk resources, plus there may be disk resources which are shared by two or more of the cluster nodes. Clients see a cluster as if it were a single high-performance, highly reliable server. System managers see a cluster much as they see a single server. Cluster technology is readily adaptable to low cost, industry-standard computer technology and interconnects. Interconnect Copyright (c) 1996, 1997 Microsoft Corp.

12 Microsoft Cluster Server™
June 24, 1997 Microsoft Cluster Server™ 2-node availability Summer 97 (20,000 Beta Testers now) Commoditize fault-tolerance (high availability) Commodity hardware (no special hardware) Easy to set up and manage Lots of applications work out of the box. 16-node scalability later (next year?) Copyright (c) 1996, 1997 Microsoft Corp.

13 Failover Example Server 1 Server 2 Server 1 Server 2 Browser Web site
June 24, 1997 Failover Example Browser Server 1 Server 2 Server 1 Server 2 Web site Web site Database Database Let’s say that there’s a web site running on Server A, and a database running on Server B. And let’s say these are mission-critical applications, supporting the on-line ordering system used by your customers, who access this system over standard IP networks using standard browsers on standard desktops. [MOUSE CLICK] What happens if one of these servers crashes? In your business today, when a server crashes, how long does it take for your operators to detect a crash and restart the workload? An hour? A day? More? A Wolfpack cluster will automatically detect the crash in a split-second, and then automatically restart the applications and data on the surviving server in the cluster, typically in around 30 seconds. Web site files Database files Copyright (c) 1996, 1997 Microsoft Corp.

14 ! MS Press Failover Demo Client/Server Software failure Admin shutdown
Resource states shown: Pending, Partial, Failed, Offline. Failure scenarios demonstrated: client/server software failure, admin shutdown, server failure.

15 Windows NT Server Cluster
Demo Configuration June 24, 1997 Server “Alice” SMP Pentium® Pro Processors Windows NT Server with Wolfpack Microsoft Internet Information Server Microsoft SQL Server Server “Betty” Interconnect standard Ethernet Local Disks Local Disks SCSI Disk Cabinet Shared Disks Windows NT Server Cluster This is the basic two-node cluster supported in the initial release of Wolfpack. Each of the two cluster nodes -- Alice and Betty -- owns a private system volume plus jointly sharing one or more failover disk arrays. They are connected via dual Ethernet adaptors. The servers are running SQL Server 6.5 with a special prototype software to interface it to Wolfpack (a resource DLL). The two servers are also running the IIS web server. To show that all the servers can be operated in a lights-out configuration, and that many pairs of servers can be managed from a single operations console, we have configured an Administrative console running the Wolfpack Cluster Administrator and the Microsoft SQL Server Enterprise Manager. A client accesses the cluster via an Internet Explorer application. This is the real application used by Microsoft employees to order books from the MS Press book store over Microsoft’s Intranet. A copy of the database and the application has been captured for this demo and is running as the MSPress demo on Alice and Betty. Administrator Windows NT Workstation Cluster Admin SQL Enterprise Mgr Client Windows NT Workstation Internet Explorer MS Press OLTP app Copyright (c) 1996, 1997 Microsoft Corp.

16 Windows NT Server Cluster
Demo Administration June 24, 1997 Server “Alice” Runs SQL Trace Runs Globe Server “Betty” Run SQL Trace Local Disks Local Disks SCSI Disk Cabinet Shared Disks Windows NT Server Cluster To show that all the servers can be operated in a lights-out configuration, and that many pairs of servers can be managed from a single operations console, we configured an Administrative console running the Wolfpack Cluster Administrator and the Microsoft SQL Server Enterprise Manager. The Cluster Administrator allows the operator to configure, monitor, and operate one or more clusters. Each cluster consists of a set of resource groups containing resources. Each resource has a set of properties. The operator can add, edit, and query resources and resource groups. The Cluster Administrator monitors the cluster, orchestrates failover, and reports these events to all cluster nodes and to active cluster administrative processes. Operators can request that resource groups move from one node to another. Similarly, SQL Server Enterprise Manager can configure and operate many SQL Servers from a single operations console. To demonstrate the node activity, a SQL trace window is configured for the Alice and Betty servers. This window shows the local SQL activity for that server. To show the currently active IIS and SQL server for the application, a program that displays a spinning globe AVI file is configured as a resource. This service resource migrates with the resource group as the group fails over to one or another node. Cluster Admin Console Windows GUI Shows cluster resource status Replicates status to all servers Define apps & related resources Define resource dependencies Orchestrates recovery order SQL Enterprise Mgr Windows GUI Shows server status Manages many servers Start, stop manage DBs Client Copyright (c) 1996, 1997 Microsoft Corp.

17 Generic Stateless Application Rotating Globe
Mplay32 is a generic app registered with MSCS. MSCS restarts it on failure; a move/restart takes ~2 seconds. It fails over after 4 failures (= process exits) in 3 minutes (a settable default).
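That restart-versus-failover rule boils down to counting failures in a sliding window. A minimal sketch of such a policy in C (illustrative names and structure, not MSCS internals):

```c
#include <stdbool.h>
#include <time.h>

/* Illustrative restart policy: fail over once a resource has failed
 * `threshold` times within a window of `period_s` seconds
 * (the slide's default: 4 failures in 3 minutes).  Not MSCS code. */
typedef struct {
    int    threshold;      /* failures in the window that trigger failover */
    int    period_s;       /* width of the failure window, in seconds      */
    int    count;          /* failures seen in the current window          */
    time_t window_start;   /* when the current window began                */
} restart_policy;

/* Record one failure; return true if the resource should fail over
 * (move to another node) instead of being restarted locally. */
static bool should_fail_over(restart_policy *p, time_t now)
{
    if (now - p->window_start > p->period_s) {  /* window expired: reset */
        p->window_start = now;
        p->count = 0;
    }
    p->count += 1;
    return p->count >= p->threshold;            /* e.g. the 4th failure  */
}

int main(void)
{
    restart_policy mplay = { 4, 180, 0, time(NULL) };
    for (int i = 0; i < 5; i++) {
        if (should_fail_over(&mplay, time(NULL)))
            return 0;   /* fourth rapid failure: move the group */
        /* otherwise: restart the process locally */
    }
    return 1;
}
```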

18 Demo Moving or Failing Over An Application
June 24, 1997 Alice Fails or Operator Requests move AVI Application X AVI Application X Local Disks Local Disks SCSI Disk Cabinet Shared Disks Windows NT Server Cluster Wolfpack was first demonstrated in March. In that demo, Alice was running a web site using the built-in Internet Information Server. A client was accessing the web site, displaying a web page that had two objects: an AVI clip of a spinning globe, and an ActiveX control which, when clicked, displayed the name and IP address of the web server. The demo then did the following: 1. Used the ActiveX control to show that Alice was the web server. 2. Brought up the Wolfpack Administrator’s console and did an “administrative failover” of the web server application from Alice to Betty. This type of failover happens quickly -- in this case it took about 5 seconds -- since there’s no need for the software to verify whether there’s been a failure. 3. After a brief pause, the globe continued spinning, and now the ActiveX control displayed Betty’s name, but with the same IP address previously shown for Alice. That showed that the IP address had been transferred to Betty as part of the failover process. 4. Then Betty was forced to fail by pushing its “Reset” button. Wolfpack detected the failure and did failover processing, restarting the web page on Alice, recovering the web site disk files, and re-establishing the client connection. In this case, the globe stopped spinning for about 20 seconds. At that time, the ActiveX control then showed that Alice was the web site … still with the same IP address that had now been transferred back from Betty. This demonstrates how Wolfpack NT Clusters improve the availability of any server application, with no application modification. Wolfpack moves the service to a new node and restarts the service if a node fails. Copyright (c) 1996, 1997 Microsoft Corp.

19 Generic Stateful Application NotePad
Notepad saves state on shared disk Failure before save => lost changes Failover or move (disk & state move)

20 Demo Step 1: Alice Delivering Service
June 24, 1997 Demo Step 1: Alice Delivering Service SQL Activity No SQL Activity SQL SQL Local Disks Local Disks ODBC ODBC SCSI Disk Cabinet Shared Disks IIS IIS Windows NT Server Cluster IP Today you are going to see SQL Server do a similar trick: it will failover and failback. Except that SQL Server is cluster-aware and is cooperating with Wolfpack to bring databases online and offline. The client is using Internet Explorer to query the MSPress bookstore catalog and to order books. The client sends HTTP requests (HTML web pages) to the server and gets web pages back as answers. The IIS web server translates these HTTP requests into ODBC requests to SQL using the Internet Database Connectivity mechanism (an ISAPI DLL in IIS). These ODBC calls go to the local SQL Server. The Alice Server is currently serving the bookstore application. It has ownership of the database disks and the IP address for the application. Client requests are routed to the IIS, which translates the request to ODBC and passes it to the Alice SQL server. The SQL Server processes the request, and returns an ODBC response to the IIS’s ISAPI HTTPtoODBC DLL. The DLL translates the ODBC response to a web page and sends it via HTTP to the client. The SQL Trace screen on Alice shows this activity. The spinning globe on Alice shows that Alice is primary. The Cluster Administrator shows that Alice is serving the resource group. It shows the various resources in the group: the IP address, the IIS, the SQL server, the SQL Server Database, and the database disks. The SQL Enterprise Manager for Alice shows the MSPress1 database is owned by Alice. This is the initial and final state of the demonstration. Many other configurations are possible. HTTP Copyright (c) 1996, 1997 Microsoft Corp.

21 Windows NT Server Cluster
2: Request Move to Betty June 24, 1997 IIS SQL ODBC No SQL Activity IP SQL Activity IIS SQL ODBC IP Local Disks Local Disks SCSI Disk Cabinet Shared Disks Windows NT Server Cluster The operator, using the Cluster Administrator, requests Alice to transfer the resource group to Betty. This is a controlled switch -- something that might be done to shed load from Alice or to allow a hardware or software upgrade on Alice. The Cluster Administrator window shows the resource group failing over. First the disks and the IP address fail over. Within five seconds, the spinning globe appears on Betty’s screen indicating that she is now serving the resource group. The Wolfpack cluster software has told Alice to take the database offline and Betty to bring the database online. The SQL Server on Betty connects to the MSPress database (it disappears from the database list on Alice and appears in the database list on Betty). Betty’s SQL Server runs recovery on the database, and then begins offering service. This takeover time is dominated by the time needed to redo any SQL work done since the last checkpoint. The takeover time can range from a few seconds to a minute or two depending on how recent the SQL checkpoint was and how much update activity has transpired. SQL Server 7.0 is implementing new algorithms that will have much improved checkpoint and recovery times. In this demo, there is relatively little update activity, so the takeover is almost instant -- 5 to 10 seconds total takeover time. If the Internet Explorer submits a query during the takeover window, the request gets an error since IIS or SQL Server may not yet be prepared to offer service. By resending the request a few seconds later, the client can vector it to the correct server and get the correct service. The demonstration is contrived to show this: the failover is requested and then the client immediately submits the request. Sometimes the failover is too fast; but usually the client gets a “404” error on the first submission and gets a correct answer when the client resends the request. HTTP Copyright (c) 1996, 1997 Microsoft Corp.

22 3: Betty Delivering Service
June 24, 1997 No SQL Activity IP . SQL Activity SQL SQL Local Disks Local Disks ODBC SCSI Disk Cabinet ODBC Shared Disks IIS IIS Windows NT Server Cluster The controlled takeover by Betty is complete. The Cluster Administrator shows the resource group (containing the IP address, the IIS service, the SQL Service, the MSPress1 database, and the disks) all owned by Betty. The Globe on Betty is spinning. Client requests are vectored to Betty and the SQL Trace window on Betty shows the SQL activity. SQL Enterprise Manager shows that the MSPress1 database has moved to Betty (from Alice). Alice is still “up” and still able to offer service to her local databases. Copyright (c) 1996, 1997 Microsoft Corp.

23 4: Power Fail Betty, Alice Takeover
June 24, 1997 No SQL Activity IP SQL Activity IIS SQL ODBC IP IIS SQL ODBC Local Disks Local Disks SCSI Disk Cabinet Shared Disks Windows NT Server Cluster Now un-expected failover is demonstrated in the most dramatic way. Betty is power-failed (we neglected to buy a UPS for her (a big mistake!)). So, now Betty is busy doing automatic reboot and will be unable to offer any service for several minutes. The Cluster Administrator on Alice notices Betty’s failure within a few seconds. The cluster service acquires the disks -- acquiring the “quorum disk” breaks any ties and allows Alice to know that Betty is really failed. Alice now takes ownership of the resource group(s) that were primaried by Betty when she failed. This causes the disks, the IP addresses, and the services in the resource groups to be brought online at Alice. The cluster software tells the Alice SQL server to bring the MSPress databases online. This process takes a few seconds. Within 15 seconds, the Globe is spinning on Alice and Alice’s SQL server is offering service to the MSPress database. The SQL Server trace window on Alice is busy. Meanwhile Betty is going through automatic reboot, restarting all the services (Cluster, SQL, IIS, and others) The Cluster Administrator shows this new state, and the SQL Server Enterprise Manager shows that Alice has the two MSPress databases and that Betty is down. As before, if the client submits a request during the takeover window, the client receives an error. But if the client just resubmits, the second request will be serviced by Alice. Copyright (c) 1996, 1997 Microsoft Corp.

24 5: Alice Delivering Service
June 24, 1997 SQL Activity No SQL Activity SQL Local Disks Local Disks ODBC SCSI Disk Cabinet Shared Disks IIS Windows NT Server Cluster IP The Wolfpack cluster code told Alice to take ownership of the MSPress databases. Alice recovers them and activates them in the system tables. Now Alice is ready to offer service to the MSPress application. This whole process takes 10 or 20 seconds. Meanwhile, Betty is recovering. HTTP Copyright (c) 1996, 1997 Microsoft Corp.

25 6: Reboot Betty, now can takeover
June 24, 1997 6: Reboot Betty, now can takeover SQL Activity No SQL Activity IIS SQL ODBC SQL Local Disks Local Disks ODBC SCSI Disk Cabinet Shared Disks IIS Windows NT Server Cluster IP Betty is going through automatic reboot after the power failure. Betty is restarting all her services, including the cluster service. She is restarting IIS and SQL, but the failover SQL databases are off-line. Betty’s SQL is recovering her local databases, fixing up the local catalogs to reflect the fact that the failover databases are off-line. This activity appears in the SQL trace window. Eventually Betty is recovered and rejoins the cluster. This recovery takes about five minutes. Meanwhile, Alice is offering service to all the applications and the globe keeps spinning. Once Betty is recovered, we are back where we started at step 1. Now we can give the demo again! HTTP Copyright (c) 1996, 1997 Microsoft Corp.

26 Outline Why FT and Why Clusters Cluster Abstractions
June 24, 1997 Outline Why FT and Why Clusters Cluster Abstractions Cluster Architecture Cluster Implementation Application Support Q&A Copyright (c) 1996, 1997 Microsoft Corp.

27 Cluster and NT Abstractions
June 24, 1997 Cluster and NT Abstractions Resource Cluster Group Cluster Abstractions NT Abstractions The NT Clustering abstractions are extensions of underlying NT Server abstractions, so it makes sense to describe three of the core abstractions underlying NT. Service Domain Node Copyright (c) 1996, 1997 Microsoft Corp.

28 Basic NT Abstractions Service Domain Node
June 24, 1997 Basic NT Abstractions Service Domain Node Service: program or device managed by a node e.g., file service, print service, database server can depend on other services (startup ordering) can be started, stopped, paused, failed Node: a single (tightly-coupled) NT system hosts services; belongs to a domain services on node always remain co-located unit of service co-location; involved in naming services Domain: a collection of nodes cooperation for authentication, administration, naming A node (or sometimes, a system) is a single computer, either a uniprocessor or a shared-memory multiprocessor, running a single instance of NT. A node hosts several services and in turn belongs to a domain, these being our two other abstractions. A domain is a collection of cooperating NT nodes. Outside of the cluster context, this cooperation is loosely coupled -- it covers authentication and some naming and administration issues, but does not by itself, for example, handle fault tolerance issues. A service is a program or device started by NT. For example, both TCP/IP and […] are services. Services can be started and stopped, and can depend on other services. A service may be a server application (such as a database server), a system service (file and networking services), or a device (disks and network adapters). Unifying servers and devices into a single framework is an important architectural feature of NT, and we’ll see the same unification in the NT Cluster abstraction of a resource. Copyright (c) 1996, 1997 Microsoft Corp.

29 Cluster Abstractions Resource Cluster
June 24, 1997 Cluster Abstractions Resource Cluster Resource Group Resource: program or device managed by a cluster e.g., file service, print service, database server can depend on other resources (startup ordering) can be online, offline, paused, failed Resource Group: a collection of related resources hosts resources; belongs to a cluster unit of co-location; involved in naming resources Cluster: a collection of nodes, resources, and groups cooperation for authentication, administration, naming Each of the cluster abstractions has a parallel NT abstraction. A resource is the cluster unit of service; like an NT service, it provides some basic service to clients, but unlike an NT service it can be moved among different nodes as necessary to keep the resource available. A (resource) group is the cluster unit of co-location: all resources in the same resource group are guaranteed to reside on the same node, so groups both allow and limit migration. A cluster is a collection of nodes, resources, and groups. Some of the parallels between a resource group and a node will not be evident until we talk about application support and virtual servers later on. Copyright (c) 1996, 1997 Microsoft Corp.

30 Resources Resource Cluster Group Resources have...
June 24, 1997 Resources Resource Cluster Group Resources have... Type: what it does (file, DB, print, web…) An operational state (online/offline/failed) Current and possible nodes Containing Resource Group Dependencies on other resources Restart parameters (in case of resource failure) Copyright (c) 1996, 1997 Microsoft Corp.

31 Resource Types Built-in types Added by others Generic Application
June 24, 1997 Resource Types Built-in types Generic Application Generic Service Internet Information Server (IIS) Virtual Root Network Name TCP/IP Address Physical Disk FT Disk (Software RAID) Print Spooler File Share Added by others Microsoft SQL Server, Message Queues, Exchange Mail Server, Oracle, SAP R/3 Your application? (use developer kit wizard). Copyright (c) 1996, 1997 Microsoft Corp.

32 Physical Disk

33 TCP/IP Address

34 Network Name

35 File Share

36 IIS (WWW/FTP) Server

37 Print Spooler

38 Resource States Resources states: Resource failure may cause:
June 24, 1997 Resource States Resource states: Offline: exists, not offering service Online: offering service Failed: not able to offer service Resource failure may cause: local restart other resources to go offline resource group to move (all subject to group and resource parameters) Resource failure detected by: polling failure node failure (State diagram: Offline and Online, with Online Pending and Offline Pending in between and a Failed state; transitions are labeled “Go Online!”, “I’m here!”, “I’m Online!”, “Go Off-line!”, “I’m Off-line!”.) Copyright (c) 1996, 1997 Microsoft Corp.

39 Resource Dependencies
June 24, 1997 Resource Dependencies Similar to NT Service Dependencies Orderly startup & shutdown A resource is brought online after any resources it depends on are online. A resource is taken offline before any resources it depends on are taken offline. Interdependent resources form dependency trees, move among nodes together, and fail over together as per resource group. (Example dependency tree resources: IIS Virtual Root, Network Name, IP Address, File Share, Resource DLL.) Copyright (c) 1996, 1997 Microsoft Corp.
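Dependency-ordered startup is essentially a post-order walk of the dependency tree: bring dependencies online first, then the resource itself (shutdown is the reverse). A small illustrative sketch in C; the particular edges used below (IIS Virtual Root depending on Network Name and File Share, Network Name depending on IP Address) are an assumed example, not the exact figure from the slide:

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative dependency-ordered startup: a resource is brought online
 * only after everything it depends on is online, as the slide states.
 * The resource type and bring_online() are stand-ins, not MSCS APIs. */
#define MAX_DEPS 4

typedef struct resource {
    const char      *name;
    struct resource *deps[MAX_DEPS];   /* resources this one depends on */
    int              ndeps;
    bool             online;
} resource;

static void bring_online(resource *r)
{
    if (r->online)
        return;
    for (int i = 0; i < r->ndeps; i++)     /* dependencies first        */
        bring_online(r->deps[i]);
    printf("online: %s\n", r->name);       /* then the resource itself  */
    r->online = true;
}

int main(void)
{
    resource ip    = { "IP Address",       {0},            0, false };
    resource name  = { "Network Name",     {&ip},          1, false };
    resource share = { "File Share",       {0},            0, false };
    resource vroot = { "IIS Virtual Root", {&name, &share}, 2, false };

    bring_online(&vroot);   /* prints IP Address, Network Name, File Share,
                               IIS Virtual Root -- dependency order       */
    return 0;
}
```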

40 Dependencies Tab

41 NT Registry Stores all configuration information
Software Hardware Hierarchical (name, value) map Has an open, documented interface Is secure Is visible across the net (RPC interface) Typical entry: \Software\Microsoft\MSSQLServer\MSSQLServer\ DefaultLogin = “GUEST” DefaultDomain = “REDMOND”
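Reading the example entry with the documented Win32 registry API would look roughly like this. A sketch with error handling trimmed; the key and value are the slide's example and will exist only on a machine configured that way:

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* The slide's example key and value under HKEY_LOCAL_MACHINE. */
    const char *subkey = "Software\\Microsoft\\MSSQLServer\\MSSQLServer";
    HKEY  key;
    char  login[64];
    DWORD type, size = sizeof(login);

    if (RegOpenKeyExA(HKEY_LOCAL_MACHINE, subkey, 0, KEY_READ, &key) != ERROR_SUCCESS) {
        fprintf(stderr, "key not found\n");
        return 1;
    }
    if (RegQueryValueExA(key, "DefaultLogin", NULL, &type,
                         (LPBYTE)login, &size) == ERROR_SUCCESS && type == REG_SZ)
        printf("DefaultLogin = %s\n", login);

    RegCloseKey(key);
    return 0;
}
```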

42 Cluster Registry Separate from local NT Registry
Replicated at each node Algorithms explained later Maintains configuration information: Cluster members Cluster resources Resource and group parameters (e.g. restart) Stable storage Refreshed from “master” copy when node joins cluster

43 Other Resource Properties
June 24, 1997 Other Resource Properties Name Restart policy (restart N times, failover…) Startup parameters Private configuration info (resource type specific) Per-node as well, if necessary Poll Intervals (LooksAlive, IsAlive, Timeout) These properties are all kept in Cluster Registry Copyright (c) 1996, 1997 Microsoft Corp.

44 General Resource Tab

45 Advanced Resource Tab

46 Resource Groups Resource Cluster Group Payroll Group
June 24, 1997 Resource Groups Resource Cluster Group Every resource belongs to a resource group. Resource groups move (failover) as a unit Dependencies NEVER cross groups. (Dependency trees contained within groups.) Group may contain forest of dependency trees (Example: a “Payroll Group” containing a Web Server, SQL Server, IP Address, and Drives E: and F:.) Copyright (c) 1996, 1997 Microsoft Corp.

47 Moving a Resource Group

48 Group Properties CurrentState: Online, Partially Online, Offline
June 24, 1997 Group Properties CurrentState: Online, Partially Online, Offline Members: resources that belong to the group; members determine which nodes can host the group. Preferred Owners: ordered list of host nodes FailoverThreshold: how many faults cause failover FailoverPeriod: time window for the failover threshold FailbackWindowStart: when can failback begin? FailbackWindowEnd: when must failback end? Everything (except CurrentState) is stored in the registry Copyright (c) 1996, 1997 Microsoft Corp.
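Seen from a program, a few of these properties are just a record kept in the cluster registry plus a time-window check for failback. A purely illustrative C sketch (the field names mirror the slide, but the layout is invented, not MSCS's format):

```c
#include <stdio.h>

typedef enum { GROUP_ONLINE, GROUP_PARTIALLY_ONLINE, GROUP_OFFLINE } group_state;

typedef struct {
    group_state current_state;           /* the one property not persisted   */
    int         failover_threshold;      /* faults that cause failover       */
    int         failover_period_s;       /* window for counting those faults */
    int         failback_window_start_h; /* hour of day, e.g. 22 (10 pm)     */
    int         failback_window_end_h;   /* hour of day, e.g. 6 (6 am)       */
} group_props;

/* Failback to the preferred owner is allowed only inside the window;
 * the window may wrap past midnight (e.g. 22:00 to 06:00). */
static int failback_allowed(const group_props *g, int hour)
{
    if (g->failback_window_start_h <= g->failback_window_end_h)
        return hour >= g->failback_window_start_h &&
               hour <  g->failback_window_end_h;
    return hour >= g->failback_window_start_h ||
           hour <  g->failback_window_end_h;
}

int main(void)
{
    group_props payroll = { GROUP_ONLINE, 3, 900, 22, 6 };
    printf("failback at 23:00 allowed? %d\n", failback_allowed(&payroll, 23)); /* 1 */
    printf("failback at 14:00 allowed? %d\n", failback_allowed(&payroll, 14)); /* 0 */
    return 0;
}
```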

49 Failover and Failback Failover parameters Failback to preferred node
June 24, 1997 Failover and Failback Failover parameters: timeouts on LooksAlive, IsAlive; # of local restarts in the failure window (after this, the resource is taken offline). Failback to preferred node (during the failback window). Do resource failures affect the group? (Figure: groups fail over from node \\Alice to node \\Betty and fail back, coordinated by the Cluster Service on each node; the group carries its IP address and name with it.) Copyright (c) 1996, 1997 Microsoft Corp.

50 Cluster Concepts Clusters
June 24, 1997 Cluster Concepts (Figure: a cluster contains resource groups, and each resource group contains resources.) Copyright (c) 1996, 1997 Microsoft Corp.

51 Cluster Properties Defined Members: nodes that can join the cluster
June 24, 1997 Cluster Properties Defined Members: nodes that can join the cluster Active Members: nodes currently joined to cluster Resource Groups: groups in a cluster Quorum Resource: Stores copy of cluster registry. Used to form quorum. Network: Which network used for communication All properties kept in Cluster Registry Copyright (c) 1996, 1997 Microsoft Corp.

52 Cluster API Functions (operations on nodes & groups)
June 24, 1997 Cluster API Functions (operations on nodes & groups) Find and communicate with Cluster Query/Set Cluster properties Enumerate Cluster objects Nodes Groups Resources and Resource Types Cluster Event Notifications Node state and property changes Group state and property changes Resource state and property changes Copyright (c) 1996, 1997 Microsoft Corp.

53 Cluster Management

54 Demo Server startup and shutdown Installing applications
June 24, 1997 Demo Server startup and shutdown Installing applications Changing status Failing over Transferring ownership of groups or resources Deleting Groups and Resources Copyright (c) 1996, 1997 Microsoft Corp.

55 Outline Why FT and Why Clusters Cluster Abstractions
June 24, 1997 Outline Why FT and Why Clusters Cluster Abstractions Cluster Architecture Cluster Implementation Application Support Q&A Copyright (c) 1996, 1997 Microsoft Corp.

56 Architecture Top tier provides cluster abstractions
June 24, 1997 Architecture Top tier provides cluster abstractions Middle tier provides distributed operations Bottom tier is NT and drivers Failover Manager Resource Monitor Cluster Registry Global Update Quorum Membership The architecture of NT Clustering can be divided into three levels or tiers. The bottom tier is NT Server itself, along with cluster-specific network and disk drivers. The middle tier provides basic distributed system algorithms: membership (what nodes can this node communicate with), quorum (what operations can this node participate in), and global atomic update (by which one node coordinates an update with the others). The top tier -- the Failover Manager, Resource Monitor, and Cluster Registry -- provides the cluster abstractions. Windows NT Server Cluster Disk Driver Cluster Net Drivers Copyright (c) 1996, 1997 Microsoft Corp.

57 Membership and Regroup
June 24, 1997 Membership and Regroup Membership: Used for orderly addition and removal from { active nodes } Regroup: Used for failure detection (via heartbeat messages) Forceful eviction from { active nodes } Failover Manager Resource Monitor Cluster Registry Global Update Membership Regroup Windows NT Server Cluster Disk Driver Cluster Net Drivers Copyright (c) 1996, 1997 Microsoft Corp.

58 Membership Defined cluster = all nodes Active cluster:
June 24, 1997 Membership Defined cluster = all nodes Active cluster: Subset of defined cluster Includes Quorum Resource Stable (no regroup in progress) Failover Manager Resource Monitor Cluster Registry Global Update Membership Regroup Windows NT Server Cluster Disk Driver Cluster Net Drivers Copyright (c) 1996, 1997 Microsoft Corp.

59 Quorum Resource Usually (but not necessarily) a SCSI disk
June 24, 1997 Quorum Resource Usually (but not necessarily) a SCSI disk Requirements: Arbitrates for a resource by supporting the challenge/defense protocol Capable of storing cluster registry and logs Configuration Change Logs Tracks changes to configuration database when any defined member missing (not active) Prevents configuration partitions in time Copyright (c) 1996, 1997 Microsoft Corp.

60 Challenge/Defense Protocol
SCSI-2 has reserve/release verbs Semaphore on disk controller Owner gets lease on semaphore Renews lease once every 3 seconds To preempt ownership: Challenger clears semaphore (SCSI bus reset) Waits 10 seconds = (3 seconds for renewal + 2 seconds bus settle time) x 2, giving the owner two chances to renew If the semaphore is still clear, the former owner loses its lease Challenger issues reserve to acquire the semaphore
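The timing above can be illustrated with a small simulation. A boolean stands in for the SCSI-2 reservation on the disk controller, so this is a sketch of the protocol's logic, not a driver:

```c
#include <stdbool.h>
#include <stdio.h>

/* Simulation of the quorum-disk challenge/defense timing described above.
 * reservation_held stands in for the SCSI-2 reservation ("semaphore") on
 * the disk controller; the real protocol uses reserve/release and bus reset. */

#define RENEW_INTERVAL_S 3                          /* owner renews every 3 s */
#define BUS_SETTLE_S     2                          /* bus settle time        */
#define CHALLENGE_WAIT_S (2 * (RENEW_INTERVAL_S + BUS_SETTLE_S))   /* = 10 s  */

static bool reservation_held;        /* the "semaphore" on the controller     */
static bool defender_alive = true;   /* flip to false to simulate a dead owner */

static void defender_renew_if_alive(void)
{
    if (defender_alive)
        reservation_held = true;     /* owner re-issues its reserve           */
}

/* Challenger: clear the semaphore, wait ~10 s (two renewal chances plus
 * settle time), and take ownership only if the owner failed to renew. */
static bool challenge(void)
{
    reservation_held = false;                        /* SCSI bus reset        */
    for (int t = 0; t < CHALLENGE_WAIT_S; t += RENEW_INTERVAL_S)
        defender_renew_if_alive();                   /* time passes...        */
    if (reservation_held)
        return false;                                /* successful defense    */
    reservation_held = true;                         /* challenger reserves   */
    return true;
}

int main(void)
{
    printf("defender alive: challenge %s\n", challenge() ? "won" : "lost");
    defender_alive = false;
    printf("defender dead:  challenge %s\n", challenge() ? "won" : "lost");
    return 0;
}
```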

61 Challenge/Defense Protocol: Successful Defense
(Timeline figure: the defender node keeps renewing its reservation; the challenger issues a bus reset, waits, detects that the reservation has been re-established, and backs off -- a successful defense.)

62 Challenge/Defense Protocol: Successful Challenge
(Timeline figure: the challenger issues a bus reset and waits; the failed defender never renews, so no reservation is detected and the challenger issues Reserve to take ownership -- a successful challenge.)

63 Regroup Invariant: All members agree on { members }
June 24, 1997 Regroup Invariant: all members agree on { members }. Regroup re-computes { members }. Each node sends a heartbeat message to a peer (default is one per second). Regroup starts if two heartbeat messages are lost (suspicion that the sender is dead), giving failure detection in bounded time. Uses a 5-round protocol to agree. Checks communication among nodes. A suspected missing node may survive. Upper levels (global update, etc.) are informed of the regroup event. Failover Manager Resource Monitor Cluster Registry Global Update Membership Regroup Windows NT Server Cluster Disk Driver Cluster Net Drivers Copyright (c) 1996, 1997 Microsoft Corp.
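A sketch of the bounded-time failure detection this describes: record a peer's last heartbeat and start a regroup once two heartbeat intervals have passed without one (illustrative types and names, not the cluster service's):

```c
#include <stdbool.h>
#include <time.h>

#define HEARTBEAT_INTERVAL_S  1   /* default: one heartbeat per second     */
#define MISSED_BEFORE_REGROUP 2   /* two lost heartbeats raise suspicion   */

typedef struct {
    const char *name;
    time_t      last_heartbeat;   /* updated whenever a heartbeat arrives  */
} peer;

static void on_heartbeat(peer *p, time_t now)
{
    p->last_heartbeat = now;
}

/* Called periodically; returns true if a regroup should be initiated. */
static bool peer_suspected_dead(const peer *p, time_t now)
{
    return (now - p->last_heartbeat) >=
           MISSED_BEFORE_REGROUP * HEARTBEAT_INTERVAL_S;
}

int main(void)
{
    peer betty = { "Betty", time(NULL) };
    on_heartbeat(&betty, betty.last_heartbeat);
    time_t later = betty.last_heartbeat + 3;          /* 3 s of silence     */
    return peer_suspected_dead(&betty, later) ? 0 : 1; /* regroup would start */
}
```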

64 Membership State Machine
(Membership state machine: Start Cluster puts the node into Initialize, then Member Search. If an online member is found, the node enters Joining; Join Succeeds and Synchronize Succeeds lead to Online. If the member search fails, the node enters Quorum Disk Search; acquiring (reserving) the quorum disk leads to Forming and then Online, while a failed search or reserve leads to Sleeping. From Online, a lost heartbeat triggers Regroup: with a non-minority and the quorum the node stays Online; with a minority or no quorum it drops out.)

65 June 24, 1997 Joining a Cluster When a node starts up, it mounts and configures only local, non-cluster devices Starts Cluster Service which looks in local (stale) registry for members Asks each member in turn to sponsor new node’s membership. (Stop when sponsor found.) Sponsor (any active member) Sponsor authenticates applicant Broadcasts applicant to cluster members Sponsor sends updated registry to applicant Applicant becomes a cluster member Copyright (c) 1996, 1997 Microsoft Corp.

66 Forming a Cluster (when Joining fails)
June 24, 1997 Forming a Cluster (when Joining fails) Use registry to find quorum resource Attach to (arbitrate for) quorum resource Update cluster registry from quorum resource e.g. if we were down when it was in use Form new one-node cluster Bring other cluster resources online Let others join your cluster Copyright (c) 1996, 1997 Microsoft Corp.

67 Leaving A Cluster (Gracefully)
Pause: Move all groups off this member. Change to paused state (remains a cluster member) Offline: Sends a ClusterExit message to all cluster members Prevents regroup Prevents stalls during departure transitions Closes cluster connections (now not an active cluster member) Cluster service stops on the node Evict: remove the node from the defined member list

68 Leaving a Cluster (Node Failure)
June 24, 1997 Leaving a Cluster (Node Failure) Node (or communication) failure triggers Regroup If after regroup: Minority group OR no quorum device: group does NOT survive Non-minority group AND quorum device: group DOES survive Non-Minority rule: Number of new members >= 1/2 old active cluster Prevents minority from seizing quorum device at the expense of a larger potentially surviving cluster Quorum guarantees correctness Prevents “split-brain” e.g. with newly forming cluster containing a single node Copyright (c) 1996, 1997 Microsoft Corp.
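The survival rule above reduces to a two-part predicate; a restatement in C, purely to make the arithmetic concrete:

```c
#include <stdbool.h>
#include <stdio.h>

/* Survival rule from the slide: a partition survives the regroup only if it
 * is a non-minority of the old active cluster AND holds the quorum device. */
static bool partition_survives(int new_members, int old_active_members,
                               bool owns_quorum_device)
{
    bool non_minority = 2 * new_members >= old_active_members;
    return non_minority && owns_quorum_device;
}

int main(void)
{
    /* Two-node cluster, one node fails: the survivor (1 of 2) is a
     * non-minority, and grabbing the quorum disk lets it continue. */
    printf("%d\n", partition_survives(1, 2, true));   /* 1: survives        */
    printf("%d\n", partition_survives(1, 2, false));  /* 0: no quorum disk  */
    return 0;
}
```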

69 Global Update Propagates updates to all nodes in cluster
June 24, 1997 Global Update Propagates updates to all nodes in the cluster Used to maintain the replicated cluster registry Updates are atomic and totally ordered Tolerates all benign failures. Depends on membership: all active members are up and all can communicate. R. Carr, Tandem Systems Review, V , sketches the regroup and global update protocol. Failover Manager Resource Monitor Cluster Registry Global Update Membership Regroup Windows NT Server Cluster Disk Driver Cluster Net Drivers Copyright (c) 1996, 1997 Microsoft Corp.

70 Global Update Algorithm
June 24, 1997 Global Update Algorithm Cluster has a locker node that regulates updates: the oldest active node in the cluster. Send the update to the locker node, then update the other (active) nodes in seniority order (i.e. locker first); this includes the updating node. Failure of all updated nodes: the update never happened; updated nodes will roll back on recovery. Survival of any updated node: the new locker is the oldest, and so has the update if any node does; the new locker restarts the update. (Figure: the sender sends “X=100!” to the locker node L, receives an ack, then the update propagates to the other nodes S in seniority order.) Copyright (c) 1996, 1997 Microsoft Corp.
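A sketch of the GLUP ordering described above: the update goes to the locker (the oldest active node) first, then to the remaining active members in seniority order. The member table and send_update stub are illustrative; the real protocol runs over the cluster's RPC:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative GLUP ordering: the oldest (lowest seniority number) active
 * node is the locker and is updated first; then the rest in seniority
 * order, including the sender itself. send_update() stands in for the RPC. */
typedef struct {
    int  node_id;
    int  seniority;     /* lower = older cluster member */
    bool active;
} member;

static bool send_update(int node_id, const char *update)
{
    printf("apply \"%s\" on node %d\n", update, node_id);
    return true;        /* pretend the RPC succeeded */
}

static int by_seniority(const void *a, const void *b)
{
    return ((const member *)a)->seniority - ((const member *)b)->seniority;
}

/* If the sender dies partway through, either some updated node survives
 * (the new locker, being oldest, has the update and restarts it) or none
 * does (the update "never happened") -- the two cases on the slide. */
static bool global_update(member *m, size_t n, const char *update)
{
    qsort(m, n, sizeof *m, by_seniority);          /* locker comes first */
    for (size_t i = 0; i < n; i++)
        if (m[i].active && !send_update(m[i].node_id, update))
            return false;
    return true;
}

int main(void)
{
    member nodes[] = { { 2, 5, true }, { 1, 1, true }, { 3, 9, false } };
    return global_update(nodes, 3, "X=100") ? 0 : 1;
}
```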

71 Cluster Registry Separate from local NT Registry
June 24, 1997 Cluster Registry Separate from local NT Registry Maintains cluster configuration members, resources, restart parameters, etc. Stable storage Replicated at each member Global Update protocol NT Registry keeps local copy Failover Manager Resource Monitor Cluster Registry Global Update Membership Regroup Windows NT Server Cluster Disk Driver Cluster Net Drivers Copyright (c) 1996, 1997 Microsoft Corp.

72 Cluster Registry Bootstrapping
Membership uses Cluster Registry for list of nodes …Circular dependency Solution: Membership uses stale local cluster registry Refresh after joining or forming cluster Master is either quorum device, or active members Failover Manager Resource Monitor Cluster Registry Global Update Membership Regroup Windows NT Server Cluster Disk Driver Cluster Net Drivers

73 Resource Monitor Polls resources: Detects failures
June 24, 1997 Resource Monitor Polls resources: IsAlive and LooksAlive Detects failures polling failure failure event from resource Higher levels tell it Online, Offline Restart Failover Manager Resource Monitor Cluster Registry Global Update Membership Regroup Windows NT Server Cluster Disk Driver Cluster Net Drivers Copyright (c) 1996, 1997 Microsoft Corp.

74 Failover Manager Assigns groups to nodes based on Failover parameters
June 24, 1997 Failover Manager Failover Manager Assigns groups to nodes based on Failover parameters Possible nodes for each resource in group Preferred nodes for resource group Resource Monitor Cluster Registry Global Update Membership Regroup Windows NT Server Cluster Disk Driver Cluster Net Drivers Copyright (c) 1996, 1997 Microsoft Corp.

75 Failover (Resource Goes Offline)
June 24, 1997 Failover (Resource Goes Offline) The Resource Manager detects a resource error and attempts to restart the resource. If the resource retry limit has not been exceeded, it keeps restarting locally; once the limit is exceeded, it notifies the Failover Manager. The Failover Manager checks the Failover Window and Failover Threshold: if the failover conditions are not within those constraints, the group is left in a partially Online state (waiting for the Failback Window). Otherwise it asks whether another owner can be found (arbitration). If not, the group is left in a partially Online state; if so, the resource (and its dependents) are switched Offline and the Failover Manager on the new system is notified to bring the resource Online. Copyright (c) 1996, 1997 Microsoft Corp.

76 Pushing a Group (Resource Failure)
June 24, 1997 Pushing a Group (Resource Failure) The Resource Monitor notifies the Resource Manager of the resource failure. The Resource Manager enumerates all objects in the dependency tree of the failed resource and takes each dependent resource Offline. If no resource in the tree has “Affect the Group” set to True, the group is left in a partially Online state. Otherwise, the Resource Manager notifies the Failover Manager that the dependency tree is Offline and needs to fail over; the Failover Manager performs arbitration to locate a new owner for the group, and the Failover Manager on the new owner node brings the resources Online. Copyright (c) 1996, 1997 Microsoft Corp.

77 Pulling a Group (Node Failure)
June 24, 1997 Pulling a Group (Node Failure) The Cluster Service notifies the Failover Manager of the node failure. The Failover Manager determines which groups were owned by the failed node; the Resource Manager notifies the Failover Manager that the node is Offline and that the groups it owned need to fail over. The Failover Manager performs arbitration to locate new owners for the groups, and the Failover Manager on the new owner(s) brings the resources Online in dependency order. Copyright (c) 1996, 1997 Microsoft Corp.

78 Failback to Preferred Owner Node
June 24, 1997 A group may have a Preferred Owner. Failback occurs when the Preferred Owner comes back online, and only during the Failback Window (a time slot, e.g. at night). The flow: the Preferred Owner comes back Online; the Failover Manager performs arbitration to locate the Preferred Owner of the group; if the time is within the Failback Window, the Resource Manager takes each resource on the current owner Offline and notifies the Failover Manager that the group is Offline and needs to fail over to the Preferred Owner; the Failover Manager on the Preferred Owner then brings the resources Online. Copyright (c) 1996, 1997 Microsoft Corp.

79 Outline Why FT and Why Clusters Cluster Abstractions
June 24, 1997 Outline Why FT and Why Clusters Cluster Abstractions Cluster Architecture Cluster Implementation Application Support Q&A Copyright (c) 1996, 1997 Microsoft Corp.

80 Process Structure Cluster Service Resource Monitor Resources A Node
June 24, 1997 Process Structure (Figure: within a node, the Cluster Service process contains the Failover Manager, Cluster Registry, Global Update, Quorum, and Membership components. Separate Resource Monitor processes load Resource DLLs, which make private calls to the resources -- services and applications -- that they manage.) Copyright (c) 1996, 1997 Microsoft Corp.

81 Resource Control Commands A Node And resource events Cluster Service
June 24, 1997 Resource Control Commands CreateResource() OnlineResource() OfflineResource() TerminateResource() CloseResource() ShutdownProcess() And resource events (Figure: within a node, the Cluster Service sends these commands to the Resource Monitor, which makes private calls into the Resource DLL, which in turn makes private calls to the resource; resource events flow back the same way.) Copyright (c) 1996, 1997 Microsoft Corp.

82 Resource DLLs Calls to Resource DLL Resource Open: get handle
June 24, 1997 (State diagram as before: Offline, Offline Pending, Online Pending, Online, Failed.) Calls to the Resource DLL: Open: get handle Online: start offering service Offline: stop offering service (as a standby, or pair-is offline) LooksAlive: quick check IsAlive: thorough check Terminate: forceful Offline Close: release handle (Figure: the Resource Monitor makes standard calls into the Resource DLL, which makes private calls to the resource.) Copyright (c) 1996, 1997 Microsoft Corp.
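A schematic skeleton of such a DLL's entry points, matching the call list above. The real MSCS Resource API defines its own handle types and richer signatures, so the simplified forms below are stand-ins meant only to show what each call is responsible for:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Simplified resource-DLL skeleton; not the actual Resource API signatures. */
typedef struct {
    bool online;
    /* ... handles and configuration read from the cluster registry ... */
} my_resource;

void *MyOpen(const char *resource_name)     /* Open: get handle             */
{
    (void)resource_name;
    return calloc(1, sizeof(my_resource));
}

bool MyOnline(void *h)                      /* Online: start offering service */
{
    ((my_resource *)h)->online = true;      /* e.g. start the process       */
    return true;
}

void MyOffline(void *h)                     /* Offline: stop service cleanly */
{
    ((my_resource *)h)->online = false;
}

bool MyLooksAlive(void *h)                  /* quick, cheap check           */
{
    return ((my_resource *)h)->online;
}

bool MyIsAlive(void *h)                     /* thorough check, e.g. a probe */
{
    return ((my_resource *)h)->online;
}

void MyTerminate(void *h)                   /* forceful Offline             */
{
    ((my_resource *)h)->online = false;
}

void MyClose(void *h)                       /* Close: release handle        */
{
    free(h);
}
```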

83 Cluster Communications
June 24, 1997 Cluster Communications Most communication via DCOM / RPC UDP used for membership heartbeat messages Standard (e.g. Ethernet) interconnects (Figure: management apps use DCOM to talk to the Cluster Service on any node; the Cluster Services use DCOM / RPC for administration and UDP heartbeats between themselves, and DCOM / RPC to talk to their local Resource Monitors.) Copyright (c) 1996, 1997 Microsoft Corp.

84 Outline Why FT and Why Clusters Cluster Abstractions
June 24, 1997 Outline Why FT and Why Clusters Cluster Abstractions Cluster Architecture Cluster Implementation Application Support Q&A Copyright (c) 1996, 1997 Microsoft Corp.

85 Application Support Virtual Servers Generic Resource DLLs
June 24, 1997 Application Support Virtual Servers Generic Resource DLLs Resource DLL VC++ Wizard Cluster API Copyright (c) 1996, 1997 Microsoft Corp.

86 Virtual Servers Problem: Virtual Server
June 24, 1997 Virtual Servers Problem: client and server applications do not want the node name to change when a server app moves to another node. A Virtual Server simulates an NT node: Resource Group (name, disks, databases, ...) NetName and IP address (node \\a keeps its name and IP address as it moves) Virtual Registry (the registry “moves”, i.e. is replicated) Virtual Service Control Virtual RPC service Challenges: limit the app to the virtual server’s devices and services; client reconnect on failover (easy if connectionless -- e.g. web clients). Copyright (c) 1996, 1997 Microsoft Corp.

87 Virtual Servers (before failover)
June 24, 1997 Virtual Servers (before failover) Nodes \\Y and \\Z support virtual servers \\A and \\B Things that need to fail over transparently: Client connection Server dependencies Service names Binding to local resources Binding to local servers (Figure: node \\Y hosts virtual server \\A (“SAP on A”) and node \\Z hosts virtual server \\B (“SAP on B”); each runs SAP and SQL, with disks S:\ and T:\.) Copyright (c) 1996, 1997 Microsoft Corp.

88 Virtual Servers (just after failover)
June 24, 1997 Virtual Servers (just after failover) \\Y’s resources and groups (i.e. Virtual Server \\A) have moved to \\Z. \\A’s resources bind to each other and to local resources (e.g., local file system): registry, physical resources, security domain, time. Transactions are used to make DB state consistent. To “work”, local resources on \\Y and \\Z have to be similar; e.g. time must remain monotonic after failover. (Figure: node \\Z now hosts both “SAP on A” and “SAP on B”, with disks S:\ and T:\.) Copyright (c) 1996, 1997 Microsoft Corp.

89 Address Failover and Client Reconnection
June 24, 1997 Name and address rebind to the new node (details later). Clients reconnect: the failure is not transparent -- users must log on again and client context is lost (which encourages connectionless designs); applications could maintain context themselves. Copyright (c) 1996, 1997 Microsoft Corp.

90 Mapping Local References to Group-Relative References
June 24, 1997 Send client requests to the correct server: \\A\SAP refers to \\.\SQL, and \\B\SAP refers to \\.\SQL. Must remap references: \\A\SAP to \\.\SQL$A, \\B\SAP to \\.\SQL$B. This also handles namespace collision. Done by modifying server apps, or via DLLs that transparently rename. Copyright (c) 1996, 1997 Microsoft Corp.
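The remapping is essentially string rewriting keyed by the virtual server's name. A toy sketch of the idea (the function and its use are invented for illustration, not how the MSCS rename DLLs actually work):

```c
#include <stdio.h>
#include <string.h>

/* Toy illustration of group-relative renaming: a service inside virtual
 * server A that asks for the local name \\.\SQL is redirected to the
 * per-group instance \\.\SQL$A.  A sketch of the idea only. */
static void remap_local_name(const char *local_name,
                             const char *virtual_server,
                             char *out, size_t outsz)
{
    snprintf(out, outsz, "%s$%s", local_name, virtual_server);
}

int main(void)
{
    char mapped[64];
    remap_local_name("\\\\.\\SQL", "A", mapped, sizeof mapped);
    printf("\\\\A\\SAP refers to %s\n", mapped);   /* \\.\SQL$A */
    remap_local_name("\\\\.\\SQL", "B", mapped, sizeof mapped);
    printf("\\\\B\\SAP refers to %s\n", mapped);   /* \\.\SQL$B */
    return 0;
}
```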

91 Naming and Binding and Failover
June 24, 1997 Naming and Binding and Failover Services rely on the NT node name and/or IP address to advertise shares, printers, and services. Applications register names to advertise services. Example: \\Alice\SQL (i.e. <node>\<service>). Example: an <IP address>:80 endpoint for a web server. Binding: clients bind to an address (e.g. name -> IP address). Thus the node name and IP address must fail over along with the services (to preserve client bindings). Copyright (c) 1996, 1997 Microsoft Corp.

92 Client to Cluster Communications IP address mobility based on MAC rebinding
June 24, 1997 IP rebinds to the failover node’s MAC address, transparently to client or server: low-level ARP (address resolution protocol) rebinds the IP address to the new MAC address. Cluster clients must use IP (TCP, UDP, NBT, ...) and must reconnect or retry after a failure. Cluster servers: all cluster nodes must be on the same LAN segment. (Figure: clients reach the cluster across the WAN through a router onto the local network; the router’s ARP table maps Alice’s and Virtual Alice’s IP addresses to AliceMAC, and Betty’s and Virtual Betty’s to BettyMAC.) Copyright (c) 1996, 1997 Microsoft Corp.

93 Time Time must increase monotonically
June 24, 1997 Time Time must increase monotonically Otherwise applications get confused e.g. make/nmake/build Time is maintained within failover resolution Not hard, since failover on order of seconds Time is a resource, so one node owns time resource Other nodes periodically correct drift from owner’s time Copyright (c) 1996, 1997 Microsoft Corp.

94 Application Local NT Registry Checkpointing
June 24, 1997 Application Local NT Registry Checkpointing Resources can request that local NT registry sub-trees be replicated Changes written out to quorum device Uses registry change notification interface Changes read and applied on fail-over (Figure: while \\A runs on \\X, each registry update is checkpointed to the quorum device; after failover, \\A on \\B reads the checkpoint and applies it to its local registry.) Copyright (c) 1996, 1997 Microsoft Corp.

95 Registry Replication

96 Application Support Virtual Servers Generic Resource DLLs
June 24, 1997 Application Support Virtual Servers Generic Resource DLLs Resource DLL VC++ Wizard Cluster API Copyright (c) 1996, 1997 Microsoft Corp.

97 Generic Resource DLLs Generic Application DLL Generic Service DLL
June 24, 1997 Generic Resource DLLs Generic Application DLL Simplest: just starts and stops the application, and makes sure the process is alive Generic Service DLL Translates DLL calls into equivalent NT Server calls Online => Service Start Offline => Service Stop Looks/IsAlive => Service Status (Figure: the Resource Monitor makes standard calls into the DLL, which makes private calls to the service.) Copyright (c) 1996, 1997 Microsoft Corp.
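The Generic Service DLL's translation can be pictured with the documented Win32 service control API. A sketch with error handling trimmed; the wrapper names are illustrative, not MSCS entry points:

```c
#include <windows.h>
#include <stdbool.h>

static SC_HANDLE open_service(const char *name, DWORD access)
{
    SC_HANDLE scm = OpenSCManagerA(NULL, NULL, SC_MANAGER_CONNECT);
    if (!scm) return NULL;
    SC_HANDLE svc = OpenServiceA(scm, name, access);
    CloseServiceHandle(scm);
    return svc;
}

bool svc_online(const char *name)            /* Online  => Service Start   */
{
    SC_HANDLE svc = open_service(name, SERVICE_START);
    bool ok = svc && StartServiceA(svc, 0, NULL);
    if (svc) CloseServiceHandle(svc);
    return ok;
}

bool svc_offline(const char *name)           /* Offline => Service Stop    */
{
    SERVICE_STATUS st;
    SC_HANDLE svc = open_service(name, SERVICE_STOP);
    bool ok = svc && ControlService(svc, SERVICE_CONTROL_STOP, &st);
    if (svc) CloseServiceHandle(svc);
    return ok;
}

bool svc_is_alive(const char *name)          /* Looks/IsAlive => Status    */
{
    SERVICE_STATUS st;
    SC_HANDLE svc = open_service(name, SERVICE_QUERY_STATUS);
    bool ok = svc && QueryServiceStatus(svc, &st) &&
              st.dwCurrentState == SERVICE_RUNNING;
    if (svc) CloseServiceHandle(svc);
    return ok;
}
```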

98 Generic Application

99 Generic Service

100 Application Support Virtual Servers Generic Resource DLLs
June 24, 1997 Application Support Virtual Servers Generic Resource DLLs Resource DLL VC++ Wizard Cluster API Copyright (c) 1996, 1997 Microsoft Corp.

101 Resource DLL VC++ Wizard
June 24, 1997 Resource DLL VC++ Wizard Asks for resource type name Asks for optional service to control Asks for other parameters (and associated types) Generates DLL source code Source can be modified as necessary E.g. additional checks for Looks/IsAlive Copyright (c) 1996, 1997 Microsoft Corp.

102 Creating a New Workspace

103 Specifying Resource Type Name

104 Specifying Resource Parameters

105 Automatic Code Generation

106 Customizing The Code

107 Application Support Virtual Servers Generic Resource DLLs
June 24, 1997 Application Support Virtual Servers Generic Resource DLLs Resource DLL VC++ Wizard Cluster API Copyright (c) 1996, 1997 Microsoft Corp.

108 Cluster API Allows resources to: Specs & API became public Sept 1996
June 24, 1997 Cluster API Allows resources to: Examine dependencies Manage per-resource data Change parameters (e.g. failover) Listen for cluster events etc. Specs & API became public Sept 1996 On all MSDN Level 3 On web site: Copyright (c) 1996, 1997 Microsoft Corp.

109 Cluster API Documentation
June 24, 1997 Copyright (c) 1996, 1997 Microsoft Corp.

110 Outline Why FT and Why Clusters Cluster Abstractions
June 24, 1997 Outline Why FT and Why Clusters Cluster Abstractions Cluster Architecture Cluster Implementation Application Support Q&A Copyright (c) 1996, 1997 Microsoft Corp.

111 Research Topics? Even easier to manage Transparent failover
June 24, 1997 Research Topics? Even easier to manage Transparent failover Instant failover Geographic distribution (disaster tolerance) Server pools (load-balanced pool of processes) Process pairs (active/backup process) 10,000 nodes? Better algorithms Is shared memory or shared disk among nodes a truly bad idea? Copyright (c) 1996, 1997 Microsoft Corp.

112 References Microsoft NT site: http://www.microsoft.com/ntserver/
BARC site (e.g. these slides): Inside Windows NT, H. Custer, Microsoft Press, ISBN: . Tandem Global Update Protocol, R. Carr, Tandem Systems Review, V , sketches the regroup and global update protocol. VAXclusters: A Closely Coupled Distributed System, Kronenberg, N., Levy, H., Strecker, W., ACM TOCS, V -- a (the) shared-disk cluster. In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Gregory F. Pfister, Prentice Hall, 1995, ISBN: -- argues for shared-nothing. Transaction Processing: Concepts and Techniques, Gray, J., Reuter, A., Morgan Kaufmann, ISBN , survey of outages, transaction techniques.


Download ppt "FT NT: A Tutorial on Microsoft Cluster Server™ (formerly “Wolfpack”)"

Similar presentations


Ads by Google