2 ©1996, 1997 Microsoft Corp. 1 FT NT: A Tutorial on Microsoft Cluster Server (formerly Wolfpack)
Joe Barrera, Jim Gray
Microsoft Research
{joebar, gray}@microsoft.com
http://research.microsoft.com/barc

3 Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A

4 DEPENDABILITY: The 3 ITIES
- RELIABILITY / INTEGRITY: Does the right thing. (also large MTTF)
- AVAILABILITY: Does it now. (also small MTTR)
  Availability = MTTF / (MTTF + MTTR)
- System Availability: if 90% of terminals are up and 99% of the DB is up, then about 89% of transactions are serviced on time (0.90 x 0.99 = 0.891).
- Holistic vs. Reductionist view
[Figure labels: Security, Integrity, Reliability, Availability]
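
The availability arithmetic on this slide can be checked with a small sketch. The function names are illustrative and not from the talk; the serial formula assumes the terminals and the database fail independently.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def serial_availability(*parts: float) -> float:
    """Availability of a chain in which every part must be up (independence assumed)."""
    result = 1.0
    for part in parts:
        result *= part
    return result


# Slide example: 90% of terminals up and 99% of the DB up.
end_to_end = serial_availability(0.90, 0.99)
print(f"{end_to_end:.3f}")
```

The product is 0.891, which is the "89% of transactions serviced on time" figure quoted above.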

5 Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe).
1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES.
MTTF by outage source:
  Vendor (hardware and software)   5 Months
  Application software             9 Months
  Communications lines             1.5 Years
  Operations                       2 Years
  Environment                      2 Years
  All sources combined             10 Weeks
To Get 10 Year MTTF, Must Attack All These Areas.
[Pie chart segments: Vendor, Environment, Operations, Application Software, Tele Comm lines; shares shown: 42%, 12%, 25%, 9.3%, 11.2%]
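
A rough consistency check on the ~10-week overall figure: if the five outage sources fail independently at constant rates (an assumption made here, not a claim from the survey), their failure rates add, and the combined MTTF is the reciprocal of the summed rate.

```python
# Per-source MTTF in months, taken from the slide (1.5 years = 18 months, 2 years = 24 months).
mttf_months = {
    "vendor": 5.0,
    "application software": 9.0,
    "communications lines": 18.0,
    "operations": 24.0,
    "environment": 24.0,
}

# Assumption: independent sources with constant failure rates, so rates add.
total_rate_per_month = sum(1.0 / m for m in mttf_months.values())
combined_mttf_months = 1.0 / total_rate_per_month
combined_mttf_weeks = combined_mttf_months * 52.0 / 12.0

print(f"combined MTTF: {combined_mttf_months:.2f} months, {combined_mttf_weeks:.1f} weeks")
```

Under these assumptions the result is a little over two months, in line with the roughly 10 weeks reported, and it shows why every source has to improve before the combined MTTF can approach 10 years.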

6 Case Studies - Tandem Trends
- MTTF improved
- Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%)
- NOTE: Systematic under-reporting of Environment, Operations errors, Application Software

7 Summary of FT Studies
- Current Situation: ~4-year MTTF => Fault Tolerance Works.
- Hardware is GREAT (maintenance and MTTF).
- Software masks most hardware faults.
- Many hidden software outages in operations:
  - New Software.
  - Utilities.
- Must make all software ONLINE.
- Software seems to define a 30-year MTTF ceiling.
- Reasonable Goal: 100-year MTTF. Class 4 today => class 6 tomorrow.

8 Fault Tolerance vs Disaster Tolerance
- Fault Tolerance: masks local faults
  - RAID disks
  - Uninterruptible Power Supplies
  - Cluster Failover
- Disaster Tolerance: masks site failures
  - Protects against fire, flood, sabotage, ...
  - Redundant system and service at remote site.

9 The Microsoft Vision: Plug & Play Dependability
- Transactions for reliability
- Clusters: for availability
- Security
- All built into the OS
[Figure labels: Integrity / Security, Integrity, Reliability, Availability]

10 Cluster Goals
- Manageability
  - Manage nodes as a single system
  - Perform server maintenance without affecting users
  - Mask faults, so repair is non-disruptive
- Availability
  - Restart failed applications & servers
  - Un-availability ~ MTTR / MTBF, so quick repair.
  - Detect/warn administrators of failures
- Scalability
  - Add nodes for incremental processing, storage, bandwidth

11 Fault Model
- Failures are independent, so single fault tolerance is a big win
- Hardware fails fast (blue-screen)
- Software fails fast (or goes to sleep)
- Software often repaired by reboot: Heisenbugs
- Operations tasks: major source of outage
  - Utility operations
  - Software upgrades

12 Cluster: Servers Combined to Improve Availability & Scalability
- Cluster: A group of independent systems working together as a single system. Clients see scalable & FT services (single system image).
- Node: A server in a cluster. May be an SMP server.
- Interconnect: Communications link used for intra-cluster status info such as heartbeats. Can be Ethernet.
[Figure: Client PCs and Printers connected to Server A and Server B; the servers share Disk array A and Disk array B and are linked by the Interconnect]

13 Microsoft Cluster Server
- 2-node availability Summer 97 (20,000 Beta Testers now)
- Commoditize fault-tolerance (high availability)
  - Commodity hardware (no special hardware)
  - Easy to set up and manage
  - Lots of applications work out of the box.
- 16-node scalability later (next year?)

14 Failover Example
[Figure: a Browser connected to Server 1 and Server 2; each server can host the Web site and the Database; the Web site files and Database files are on shared storage]

15 MS Press Failover Demo!
- Client/Server
- Software failure
- Admin shutdown
- Server failure
- Resource States: Pending, Partial, Failed, Offline

16 Demo Configuration
Windows NT Server Cluster: shared disks in a SCSI disk cabinet, plus local disks on each server.
- Server "Alice": SMP Pentium® Pro processors; Windows NT Server with Wolfpack; Microsoft Internet Information Server; Microsoft SQL Server
- Server "Betty": SMP Pentium® Pro processors; Windows NT Server with Wolfpack; Microsoft Internet Information Server; Microsoft SQL Server
- Interconnect: standard Ethernet
- Client: Windows NT Workstation; Internet Explorer; MS Press OLTP app
- Administrator: Windows NT Workstation; Cluster Admin; SQL Enterprise Mgr

17 Demo Administration
[Figure: Windows NT Server Cluster with shared disks in a SCSI disk cabinet and local disks; a Client; Server Alice runs SQL Trace and Globe; Server Betty runs SQL Trace]
Cluster Admin Console:
- Windows GUI
- Shows cluster resource status
- Replicates status to all servers
- Define apps & related resources
- Define resource dependencies
- Orchestrates recovery order
SQL Enterprise Mgr:
- Windows GUI
- Shows server status
- Manages many servers
- Start, stop, manage DBs

18 Generic Stateless Application: Rotating Globe
- Mplay32 is a generic app.
- Registered with MSCS
- MSCS restarts it on failure
- Move/restart ~ 2 seconds
- Fail-over if 4 failures (= process exits) in 3 minutes (settable default)
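
The "4 failures in 3 minutes" rule reads like a sliding-window counter. The sketch below is one way to express it; the class and method names are hypothetical and are not the MSCS API, and whether MSCS counts failures exactly this way is an assumption.

```python
from collections import deque


class RestartPolicy:
    """Restart locally until `threshold` failures fall within `period_s` seconds, then fail over."""

    def __init__(self, threshold: int = 4, period_s: float = 180.0):
        self.threshold = threshold
        self.period_s = period_s
        self._failures = deque()  # timestamps of recent failures

    def on_failure(self, now_s: float) -> str:
        """Record a failure at time now_s and return 'restart' or 'failover'."""
        self._failures.append(now_s)
        # Forget failures that have aged out of the window.
        while now_s - self._failures[0] > self.period_s:
            self._failures.popleft()
        if len(self._failures) >= self.threshold:
            self._failures.clear()
            return "failover"
        return "restart"


policy = RestartPolicy()
decisions = [policy.on_failure(t) for t in (0.0, 10.0, 20.0, 30.0)]
print(decisions)
```

Four process exits within 30 seconds yield three local restarts followed by a failover; failures spaced 100 seconds apart never reach the threshold.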

19 Demo: Moving or Failing Over an Application
[Figure: Windows NT Server Cluster with shared disks in a SCSI disk cabinet and local disks; the AVI application X moves to the other server when Alice fails or the operator requests a move]

20 Generic Stateful Application: NotePad
- Notepad saves state on shared disk
- Failure before save => lost changes
- Failover or move (disk & state move)

21 Demo Step 1: Alice Delivering Service
[Figure: cluster diagram with shared SCSI disk cabinet and local disks; IIS, SQL, ODBC, and IP resources; HTTP traffic; SQL activity indicators]

22 Demo Step 2: Request Move to Betty
[Figure: same cluster diagram during the move]

23 Demo Step 3: Betty Delivering Service
[Figure: same cluster diagram with Betty delivering service]

24 Demo Step 4: Power Fail Betty, Alice Takeover
[Figure: same cluster diagram during takeover]

25 Demo Step 5: Alice Delivering Service
[Figure: same cluster diagram with Alice delivering service]

26 Demo Step 6: Reboot Betty, now can takeover
[Figure: same cluster diagram with both servers available]

27 Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A

28 Cluster and NT Abstractions
- Cluster Abstractions: Cluster, Group, Resource
- NT Abstractions: Domain, Node, Service

29 Basic NT Abstractions
- Service: program or device managed by a node
  - e.g., file service, print service, database server
  - can depend on other services (startup ordering)
  - can be started, stopped, paused, failed
- Node: a single (tightly-coupled) NT system
  - hosts services; belongs to a domain
  - services on a node always remain co-located
  - unit of service co-location; involved in naming services
- Domain: a collection of nodes
  - cooperation for authentication, administration, naming

30 Cluster Abstractions
- Resource: program or device managed by a cluster
  - e.g., file service, print service, database server
  - can depend on other resources (startup ordering)
  - can be online, offline, paused, failed
- Resource Group: a collection of related resources
  - hosts resources; belongs to a cluster
  - unit of co-location; involved in naming resources
- Cluster: a collection of nodes, resources, and groups
  - cooperation for authentication, administration, naming

31 Resources
Resources have...
- Type: what it does (file, DB, print, web…)
- An operational state (online/offline/failed)
- Current and possible nodes
- Containing Resource Group
- Dependencies on other resources
- Restart parameters (in case of resource failure)

32 Resource Types
- Built-in types
  - Generic Application
  - Generic Service
  - Internet Information Server (IIS) Virtual Root
  - Network Name
  - TCP/IP Address
  - Physical Disk
  - FT Disk (Software RAID)
  - Print Spooler
  - File Share
- Added by others
  - Microsoft SQL Server, Message Queues, Exchange Mail Server, Oracle, SAP R/3
  - Your application? (use developer kit wizard).

33 Physical Disk

34 TCP/IP Address

35 Network Name

36 File Share

37 IIS (WWW/FTP) Server

38 Print Spooler

39 Resource States
- Resource states:
  - Offline: exists, not offering service
  - Online: offering service
  - Failed: not able to offer service
- Resource failure may cause:
  - local restart
  - other resources to go offline
  - resource group to move
  - (all subject to group and resource parameters)
- Resource failure detected by:
  - Polling failure
  - Node failure
[State diagram: states Offline, Online Pending, Online, Offline Pending, Failed; messages "Go Online!", "I'm Online!", "Go Off-line!", "I'm Off-line!", "I'm here!"]

40 Resource Dependencies
- Similar to NT Service Dependencies
- Orderly startup & shutdown
  - A resource is brought online after any resources it depends on are online.
  - A resource is taken offline before any resources it depends on.
- Interdependent resources
  - Form dependency trees
  - move among nodes together
  - failover together
  - as per resource group
[Figure: dependency tree with Network Name, IP Address, Resource DLL, IIS Virtual Root, File Share]
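
The online/offline ordering rule is a topological sort of the dependency graph. The sketch below uses Python's standard `graphlib` (3.9+); the specific edges are hypothetical, chosen to resemble the figure's resource names, and are not read from the slide.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency tree: each resource maps to the resources it depends on.
depends_on = {
    "IP Address": set(),
    "Network Name": {"IP Address"},
    "File Share": {"Network Name"},
    "IIS Virtual Root": {"Network Name"},
}

# Bring a resource online only after everything it depends on is online.
online_order = list(TopologicalSorter(depends_on).static_order())

# Take resources offline in the reverse order, dependents first.
offline_order = list(reversed(online_order))

print("online: ", online_order)
print("offline:", offline_order)
```

Because a whole tree shares one ordering, it is natural for the tree to move and fail over as a unit, which is what the resource group provides.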

41 Dependencies Tab

42 NT Registry
- Stores all configuration information
  - Software
  - Hardware
- Hierarchical (name, value) map
- Has an open, documented interface
- Is secure
- Is visible across the net (RPC interface)
- Typical Entry:
  \Software\Microsoft\MSSQLServer\MSSQLServer\
    DefaultLogin = GUEST
    DefaultDomain = REDMOND

43 Cluster Registry
- Separate from local NT Registry
- Replicated at each node
  - Algorithms explained later
- Maintains configuration information:
  - Cluster members
  - Cluster resources
  - Resource and group parameters (e.g. restart)
- Stable storage
- Refreshed from master copy when node joins cluster

44 Other Resource Properties
- Name
- Restart policy (restart N times, failover…)
- Startup parameters
- Private configuration info (resource type specific)
  - Per-node as well, if necessary
- Poll Intervals (LooksAlive, IsAlive, Timeout)
- These properties are all kept in Cluster Registry

45 General Resource Tab

46 Advanced Resource Tab

47 Resource Groups
- Every resource belongs to a resource group.
- Resource groups move (failover) as a unit
- Dependencies NEVER cross groups. (Dependency trees contained within groups.)
- Group may contain forest of dependency trees
[Figure: Payroll Group containing Web Server, SQL Server, IP Address, Drive E:, Drive F:]

48 Moving a Resource Group

49 Group Properties
- CurrentState: Online, Partially Online, Offline
- Members: resources that belong to group
  - members determine which nodes can host group.
- Preferred Owners: ordered list of host nodes
- FailoverThreshold: How many faults cause failover
- FailoverPeriod: Time window for failover threshold
- FailbackWindowStart: When can failback start?
- FailbackWindowEnd: When must failback end?
- Everything (except CurrentState) is stored in registry

50 Failover and Failback
- Failover parameters
  - timeout on LooksAlive, IsAlive
  - # local restarts in failure window; after this, offline.
- Failback to preferred node
  - (during failback window)
- Do resource failures affect group?
[Figure: a group (Cluster Service, name, IPaddr) fails over between Node \\Alice and Node \\Betty and later fails back]

51 Cluster Concepts
[Figure: a Cluster contains Groups; each Group contains Resources]

52 Cluster Properties
- Defined Members: nodes that can join the cluster
- Active Members: nodes currently joined to cluster
- Resource Groups: groups in a cluster
- Quorum Resource:
  - Stores copy of cluster registry.
  - Used to form quorum.
- Network: Which network is used for communication
- All properties kept in Cluster Registry

53 Cluster API Functions (operations on nodes & groups)
- Find and communicate with Cluster
- Query/Set Cluster properties
- Enumerate Cluster objects
  - Nodes
  - Groups
  - Resources and Resource Types
- Cluster Event Notifications
  - Node state and property changes
  - Group state and property changes
  - Resource state and property changes

54 Cluster Management

55 Demo
- Server startup and shutdown
- Installing applications
- Changing status
- Failing over
- Transferring ownership of groups or resources
- Deleting Groups and Resources

56 Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A

57 Architecture
- Top tier provides cluster abstractions
- Middle tier provides distributed operations
- Bottom tier is NT and drivers
[Figure: component stack: Failover Manager, Cluster Registry, Resource Monitor, Global Update, Membership, Quorum, Cluster Disk Driver, Cluster Net Drivers, Windows NT Server]

58 Membership and Regroup
- Membership:
  - Used for orderly addition and removal from { active nodes }
- Regroup:
  - Used for failure detection (via heartbeat messages)
  - Forceful eviction from { active nodes }
[Figure: architecture component stack with Membership and Regroup highlighted]

59 Membership
- Defined cluster = all nodes
- Active cluster:
  - Subset of defined cluster
  - Includes Quorum Resource
  - Stable (no regroup in progress)
[Figure: architecture component stack]

60 Quorum Resource
- Usually (but not necessarily) a SCSI disk
- Requirements:
  - Arbitrates for a resource by supporting the challenge/defense protocol
  - Capable of storing cluster registry and logs
- Configuration Change Logs
  - Tracks changes to configuration database when any defined member is missing (not active)
  - Prevents configuration partitions in time

61 Challenge/Defense Protocol
- SCSI-2 has reserve/release verbs
  - Semaphore on disk controller
- Owner gets lease on semaphore
  - Renews lease once every 3 seconds
- To preempt ownership:
  - Challenger clears semaphore (SCSI bus reset)
  - Waits 10 seconds
    - 3 seconds for renewal + 2 seconds bus settle time
    - x 2 to give owner two chances to renew
  - If still clear, then former owner loses lease
  - Challenger issues reserve to acquire semaphore
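
The timing argument can be sketched as a small simulation. This is a toy model over timestamps, not SCSI code: the function name and the idea of representing the owner's reserves as a list of times are assumptions made for illustration.

```python
RENEW_S = 3.0        # owner renews its reservation every 3 seconds (from the slide)
BUS_SETTLE_S = 2.0   # bus settle time after a reset (from the slide)
WAIT_S = 2 * (RENEW_S + BUS_SETTLE_S)  # two chances to renew = 10 seconds


def challenge(owner_reserve_times, reset_at, wait_s=WAIT_S):
    """Decide who holds the quorum disk after a challenge.

    owner_reserve_times: times at which the current owner issues reserve.
    reset_at: time at which the challenger clears the reservation (bus reset).
    The owner defends if any of its reserves lands inside the challenger's wait.
    """
    defended = any(reset_at < t <= reset_at + wait_s for t in owner_reserve_times)
    return "defender" if defended else "challenger"


live_owner = [RENEW_S * k for k in range(10)]  # keeps renewing: 0, 3, 6, ...
failed_owner = [0.0, 3.0]                      # stopped renewing after t = 3

print(challenge(live_owner, reset_at=4.0))
print(challenge(failed_owner, reset_at=4.0))
```

A live owner re-reserves during the 10-second wait and keeps the disk; an owner that has stopped renewing leaves the reservation clear, so the challenger's reserve succeeds.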

62 Challenge/Defense Protocol: Successful Defense
[Figure: timeline with ticks 0 to 16. Labels: Defender Node, Challenger Node, Reserve, Bus Reset, Reserve, Reservation detected]

63 Challenge/Defense Protocol: Successful Challenge
[Figure: timeline with ticks 0 to 16. Labels: Defender Node, Challenger Node, Reserve, Bus Reset, No reservation detected, Reserve]

64 Regroup
- Invariant: All members agree on { members }
- Regroup re-computes { members }
- Each node sends heartbeat message to a peer (default is one per second)
- Regroup if two lost heartbeat messages
  - suspicion that sender is dead
  - failure detection in bounded time
- Uses a 5-round protocol to agree.
  - Checks communication among nodes.
  - Suspected missing node may survive.
- Upper levels (global update, etc.) informed of regroup event.
[Figure: architecture component stack]
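
The heartbeat rule gives bounded-time failure detection. A minimal sketch, assuming each node records when it last heard from each peer; the function and parameter names are illustrative and the 5-round agreement protocol itself is not modeled.

```python
def suspects(last_heard_s, now_s, interval_s=1.0, missed_limit=2):
    """Return peers whose silence exceeds `missed_limit` heartbeat intervals.

    last_heard_s maps peer name -> time of its most recent heartbeat.
    A non-empty result would trigger a regroup; it is only a suspicion,
    and the suspected node may still survive the regroup.
    """
    return sorted(
        peer
        for peer, heard in last_heard_s.items()
        if now_s - heard > missed_limit * interval_s
    )


print(suspects({"alice": 9.5, "betty": 7.0}, now_s=10.0))
```

With one heartbeat per second and a limit of two missed messages, a dead peer is suspected after a little over two seconds.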

65 Membership State Machine
[State diagram]
- States: Initialize, Sleeping, Member Search, Quorum Disk Search, Joining, Forming, Online, Regroup
- Transition labels: Start Cluster; Found Online Member; Search Fails; Acquire (reserve) Quorum Disk; Search or Reserve Fails; Join Succeeds; Synchronize Succeeds; Lost Heartbeat; Non-Minority and Quorum; Minority or no Quorum

66 Joining a Cluster
- When a node starts up, it mounts and configures only local, non-cluster devices
- Starts Cluster Service, which
  - looks in local (stale) registry for members
  - asks each member in turn to sponsor the new node's membership. (Stop when sponsor found.)
- Sponsor (any active member)
  - Sponsor authenticates applicant
  - Broadcasts applicant to cluster members
  - Sponsor sends updated registry to applicant
  - Applicant becomes a cluster member

67 Forming a Cluster (when Joining fails)
- Use registry to find quorum resource
- Attach to (arbitrate for) quorum resource
- Update cluster registry from quorum resource
  - e.g. if we were down when it was in use
- Form new one-node cluster
- Bring other cluster resources online
- Let others join your cluster

68 Leaving a Cluster (Gracefully)
- Pause:
  - Move all groups off this member.
  - Change to paused state (remains a cluster member)
- Offline:
  - Move all groups off this member.
  - Sends ClusterExit message to all cluster members
    - Prevents regroup
    - Prevents stalls during departure transitions
  - Close Cluster connections (now not an active cluster member)
  - Cluster service stops on node
- Evict: remove node from defined member list

69 Leaving a Cluster (Node Failure)
- Node (or communication) failure triggers Regroup
- If after regroup:
  - Minority group OR no quorum device: group does NOT survive
  - Non-minority group AND quorum device: group DOES survive
- Non-Minority rule:
  - Number of new members >= 1/2 old active cluster
  - Prevents minority from seizing quorum device at the expense of a larger potentially surviving cluster
- Quorum guarantees correctness
  - Prevents split-brain, e.g. with newly forming cluster containing a single node
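
The survival rule after a regroup combines the non-minority test with ownership of the quorum device. A minimal sketch with a hypothetical function name:

```python
def group_survives(new_members: int, old_active: int, owns_quorum: bool) -> bool:
    """A regrouped partition survives only if it is a non-minority
    (at least half of the old active cluster) AND holds the quorum device."""
    return owns_quorum and 2 * new_members >= old_active


# Two-node cluster that splits: both halves are non-minorities,
# so the quorum device decides which single node carries on.
print(group_survives(1, 2, owns_quorum=True))
print(group_survives(1, 2, owns_quorum=False))
```

In a three-node cluster, a lone node fails the non-minority test even if it could reach the quorum device, which leaves the disk for the larger partition.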

70 Global Update
- Propagates updates to all nodes in cluster
- Used to maintain replicated cluster registry
- Updates are atomic and totally ordered
- Tolerates all benign failures.
- Depends on membership
  - all are up
  - all can communicate
- R. Carr, Tandem Systems Review, V1.2, 1985, sketches regroup and global update protocol.
[Figure: architecture component stack]

71 Global Update Algorithm
- Cluster has locker node that regulates updates.
  - Oldest active node in cluster
- Send Update to locker node
- Update other (active) nodes
  - in seniority order (e.g. locker first)
  - this includes the updating node
- Failure of all updated nodes:
  - Update never happened
  - Updated nodes will roll back on recovery
- Survival of any updated nodes:
  - New locker is oldest and so has update if any do.
  - New locker restarts update
[Figure: sender S sends "X=100!" to locker L; L acknowledges to S]
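
The recovery argument depends on applying updates in seniority order. The toy model below illustrates that property only; it is not the MSCS implementation, and the `Node` class, `stop_after` parameter, and `recover` function are hypothetical names introduced for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    alive: bool = True
    registry: dict = field(default_factory=dict)


def global_update(nodes, key, value, stop_after=None):
    """Apply key=value to active nodes in seniority order (nodes[0] is the oldest, the locker).

    stop_after simulates the sender crashing after that many nodes were updated.
    Returns True if every active node received the update.
    """
    active = [n for n in nodes if n.alive]
    for count, node in enumerate(active):
        if stop_after is not None and count == stop_after:
            return False  # update interrupted part-way through
        node.registry[key] = value
    return True


def recover(nodes, key):
    """The oldest survivor becomes locker and re-drives the update if it has it."""
    active = [n for n in nodes if n.alive]
    if not active:
        return
    locker = active[0]
    if key in locker.registry:
        for node in active[1:]:
            node.registry[key] = locker.registry[key]


# Interrupted after the locker was updated; the locker survives and finishes the job.
cluster = [Node("a"), Node("b"), Node("c")]
global_update(cluster, "X", 100, stop_after=1)
recover(cluster, "X")
print([n.registry.get("X") for n in cluster])
```

If node "a" had also died before recovery, the new locker "b" would not hold the update, so for the survivors it never happened; that is the first failure case on the slide. The sketch uses key presence to mean "has the update", where a real protocol would compare sequence numbers.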

72 Cluster Registry
- Separate from local NT Registry
- Maintains cluster configuration
  - members, resources, restart parameters, etc.
- Stable storage
- Replicated at each member
  - Global Update protocol
  - NT Registry keeps local copy
[Figure: architecture component stack]

73 Cluster Registry Bootstrapping
- Membership uses Cluster Registry for list of nodes
  - …Circular dependency
- Solution:
  - Membership uses stale local cluster registry
  - Refresh after joining or forming cluster
  - Master is either quorum device or active members
[Figure: architecture component stack]

74 Resource Monitor
- Polls resources:
  - IsAlive and LooksAlive
- Detects failures
  - polling failure
  - failure event from resource
- Higher levels tell it
  - Online, Offline
  - Restart
[Figure: architecture component stack]

75 Failover Manager
- Assigns groups to nodes based on
  - Failover parameters
  - Possible nodes for each resource in group
  - Preferred nodes for resource group
[Figure: architecture component stack]
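
One plausible reading of these placement inputs: a node can host a group only if it is active and is a possible node for every resource in the group, and preferred owners are tried first. The function below is a hypothetical sketch of that rule, not the Failover Manager's actual arbitration.

```python
def choose_owner(preferred, possible_per_resource, active):
    """Pick a host node for a group, or None if no active node qualifies.

    preferred: ordered list of preferred owner nodes for the group.
    possible_per_resource: resource name -> nodes able to host that resource.
    active: nodes currently in the active cluster.
    """
    candidates = set(active)
    for possible in possible_per_resource.values():
        candidates &= set(possible)  # the group moves as a unit
    for node in preferred:
        if node in candidates:
            return node
    # No preferred owner qualifies: fall back to any candidate (sorted for determinism).
    return sorted(candidates)[0] if candidates else None


possible = {"Drive E:": ["alice", "betty"], "SQL Server": ["alice", "betty"]}
print(choose_owner(["alice", "betty"], possible, active=["betty"]))
```

With Alice down, the group lands on Betty; when both are active, the ordered preferred-owner list puts it on Alice.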

76 Failover (Resource Goes Offline)
[Flowchart]
1. Resource Manager detects resource error.
2. Attempt to restart resource.
3. Has the Resource Retry limit been exceeded? (Yes / No)
4. Switch resource (and Dependants) Offline.
5. Notify Failover Manager.
6. Failover Manager checks: Failover Window and Failover Threshold.
7. Are Failover conditions within Constraints? (Yes / No)
8. Can another owner be found? (Arbitration) (Yes / No)
9. Notify Failover Manager on the new system to bring resource Online.
Other outcomes shown: Leave Group in partially Online state; Wait for Failback Window.

77 Pushing a Group (Resource Failure)
[Flowchart]
1. Resource Monitor notifies Resource Manager of resource failure.
2. Resource Manager enumerates all objects in the Dependency Tree of the failed resource.
3. Resource Manager takes each depending resource Offline.
4. Does any resource have "Affect the Group" = True?
   - No: Leave Group in partially Online state.
   - Yes: continue.
5. Resource Manager notifies Failover Manager that the Dependency Tree is Offline and needs to fail over.
6. Failover Manager performs Arbitration to locate a new owner for the group.
7. Failover Manager on the new owner node brings the resources Online.

78 Pulling a Group (Node Failure)
  [Flowchart; steps in order]
  1. Cluster Service notifies Failover Manager of node failure.
  2. Failover Manager determines which groups were owned by the failed node.
  3. Resource Manager notifies Failover Manager that the node is Offline and the groups it owned need to fail over.
  4. Failover Manager performs Arbitration to locate a new owner for the groups.
  5. Failover Manager on the new owner(s) brings the resources Online in dependency order.
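
"Online in dependency order" amounts to a topological sort of the group's resources. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) for that step. The round-robin reassignment in `pull_groups` is only a placeholder for Arbitration, whose real rules are not described on the slide; both function names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def online_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order resources so each comes Online after what it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def pull_groups(failed_node: str, owner_of: Dict[str, str],
                survivors: List[str]) -> Dict[str, str]:
    """Reassign every group the failed node owned.

    Assumption: surviving nodes take the orphaned groups round-robin.
    """
    orphaned = sorted(g for g, node in owner_of.items() if node == failed_node)
    return {g: survivors[i % len(survivors)] for i, g in enumerate(orphaned)}
```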

79 Failback to Preferred Owner Node
  - Group may have a Preferred Owner
  - Preferred Owner comes back online
  - Will only occur during the Failback Window (time slot, e.g. at night)
  [Flowchart; steps in order]
  1. Preferred owner comes back Online.
  2. Is the time within the Failback Window?
  3. Resource Manager takes each resource on the current owner Offline.
  4. Resource Manager notifies Failover Manager that the Group is Offline and needs to fail over to the Preferred Owner.
  5. Failover Manager performs Arbitration to locate the Preferred Owner of the group.
  6. Failover Manager on the Preferred Owner brings the resources Online.
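
The window test in step 2 has one subtlety worth spelling out: a night-time slot usually wraps past midnight. A minimal sketch, assuming the window is given as a start and end time of day (the function name and the half-open interval are assumptions):

```python
from datetime import time


def in_failback_window(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end), allowing the window to
    wrap past midnight (e.g. 22:00 to 04:00)."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end
```

With a 22:00 to 04:00 window, 23:30 and 02:00 qualify and 12:00 does not.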

80 Outline
  - Why FT and Why Clusters
  - Cluster Abstractions
  - Cluster Architecture
  - Cluster Implementation
  - Application Support
  - Q&A

81 Cluster Service Process Structure
  - Cluster Service
    - Failover Manager
    - Cluster Registry
    - Global Update
    - Quorum
    - Membership
  - Resource Monitor
  - Resource DLLs
  - Resources
    - Services
    - Applications
  [Diagram: A Node; Resource Monitor; DLL; Resource; Private calls]

82 Resource Control
  - Commands
    - CreateResource()
    - OnlineResource()
    - OfflineResource()
    - TerminateResource()
    - CloseResource()
    - ShutdownProcess()
  - And resource events
  [Diagram: A Node; Cluster Service; Resource Monitor; DLL; Resource; Private calls]

83 Resource DLLs
  - Calls to Resource DLL
    - Open: get handle
    - Online: start offering service
    - Offline: stop offering service (as a standby, or pair is offline)
    - LooksAlive: quick check
    - IsAlive: thorough check
    - Terminate: forceful Offline
    - Close: release handle
  [State diagram: Online Pending; Online; Failed; Offline Pending. Messages: "Go Online!", "I'm Online!", "Go Off-line!", "I'm Off-line!", "I'm here!"]
  [Diagram: Resource Monitor; DLL; Resource; Private calls; Std calls]
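
The state diagram on this slide can be approximated as a small transition table. The states Online Pending, Online, Failed, and Offline Pending come from the slide; the plain Offline state and the exact edges are assumptions, because the arrows did not survive extraction. The event names are hypothetical labels for the slide's messages.

```python
from enum import Enum


class State(Enum):
    OFFLINE = "Offline"                  # assumed resting state
    ONLINE_PENDING = "Online Pending"
    ONLINE = "Online"
    OFFLINE_PENDING = "Offline Pending"
    FAILED = "Failed"


# (current state, event) -> next state; the edges are assumptions.
TRANSITIONS = {
    (State.OFFLINE, "go_online"): State.ONLINE_PENDING,          # "Go Online!"
    (State.ONLINE_PENDING, "im_online"): State.ONLINE,           # "I'm Online!"
    (State.ONLINE, "go_offline"): State.OFFLINE_PENDING,         # "Go Off-line!"
    (State.OFFLINE_PENDING, "im_offline"): State.OFFLINE,        # "I'm Off-line!"
    (State.FAILED, "go_online"): State.ONLINE_PENDING,           # restart
}


def step(state: State, event: str) -> State:
    """Advance the sketch state machine by one event."""
    if event == "failure":
        return State.FAILED
    if event == "terminate":   # Terminate: forceful Offline
        return State.OFFLINE
    # Events that do not apply in the current state are ignored.
    return TRANSITIONS.get((state, event), state)
```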

84 Cluster Communications
  - Most communication via DCOM / RPC
  - UDP used for membership heartbeat messages
  - Standard (e.g. Ethernet) interconnects
  [Diagram: Management apps; Cluster Service; Resource Monitors; DCOM / RPC: admin; UDP: Heartbeat]

85 Outline
  - Why FT and Why Clusters
  - Cluster Abstractions
  - Cluster Architecture
  - Cluster Implementation
  - Application Support
  - Q&A

86 Application Support
  - Virtual Servers
  - Generic Resource DLLs
  - Resource DLL VC++ Wizard
  - Cluster API

87 Virtual Servers
  - Problem:
    - Client and server applications do not want the node name to change when the server app moves to another node.
  - A Virtual Server simulates an NT node
    - Resource Group (name, disks, databases, ...)
    - NetName and IP address (node \\a keeps its name and IP address as it moves)
    - Virtual Registry (registry moves (is replicated))
    - Virtual Service Control
    - Virtual RPC service
  - Challenges:
    - Limit app to the virtual server's devices and services.
    - Client reconnect on failover (easy if connectionless, e.g. web clients)
  [Diagram: Virtual Server \\a: 1.2.3.4]

88 Virtual Servers (before failover)
  - Nodes \\Y and \\Z support virtual servers \\A and \\B
  - Things that need to fail over transparently
    - Client connection
    - Server dependencies
    - Service names
    - Binding to local resources
    - Binding to local servers
  [Diagram: SAP on A; SAP on B; \\A; \\B; SAP; SQL; T:\; S:\; \\Y; \\Z]

89 Virtual Servers (just after failover)
  - \\Y resources and groups (i.e. Virtual Server \\A) moved to \\Z
  - A resources bind to each other and to local resources (e.g., local file system)
    - Registry
    - Physical resource
    - Security domain
    - Time
  - Transactions used to make DB state consistent.
  - To work, local resources on \\Y and \\Z have to be similar
    - E.g. time must remain monotonic after failover
  [Diagram: SAP on A; SAP on B; \\A; \\B; SAP; SQL; S:\; T:\; \\Y; \\Z]

90 Address Failover and Client Reconnection
  - Name and Address rebind to new node
    - Details later
  - Clients reconnect
    - Failure not transparent
    - Must log on again
    - Client context lost (encourages connectionless)
    - Applications could maintain context
  [Diagram: SAP on A; SAP on B; \\A; \\B; SAP; SQL; S:\; T:\; \\Y; \\Z]

91 Mapping Local References to Group-Relative References
  - Send client requests to correct server
    - \\A\SAP refers to \\.\SQL
    - \\B\SAP refers to \\.\SQL
  - Must remap references:
    - \\A\SAP to \\.\SQL$A
    - \\B\SAP to \\.\SQL$B
    - Also handles namespace collision
  - Done via
    - modifying server apps, or
    - DLLs to transparently rename
  [Diagram: SAP on A; SAP on B; \\A; \\B; SAP; SQL; S:\; T:\; \\Y; \\Z]
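
The remapping rule on this slide is a simple string rewrite: a local reference such as \\.\SQL gets the virtual server's name appended after a `$`. A sketch under that reading follows; the function name `group_relative` is hypothetical, and how the real renaming DLLs intercept references is not shown here.

```python
LOCAL_PREFIX = "\\\\.\\"  # the four characters \\.\


def group_relative(virtual_server: str, reference: str) -> str:
    """Rewrite a local reference so it is unique per virtual server.

    Non-local references are returned unchanged.
    """
    if not reference.startswith(LOCAL_PREFIX):
        return reference
    name = virtual_server.lstrip("\\")  # "\\A" -> "A"
    return f"{reference}${name}"
```

Because the suffix differs per virtual server, \\A and \\B can land on the same node after failover without their SQL references colliding.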

92 Naming and Binding and Failover
  - Services rely on the NT node name and/or IP address to advertise Shares, Printers, and Services.
    - Applications register names to advertise services
    - Example: \\Alice\SQL (i.e. )
    - Example: 128.2.2.2:80 (=http://www.foo.com/)
  - Binding
    - Clients bind to an address (e.g. name->IP address)
  - Thus the node name and IP address must fail over along with the services (preserve client bindings)

93 Client to Cluster Communications
  IP address mobility based on MAC rebinding
  - Cluster Clients
    - Must use IP (TCP, UDP, NBT, ...)
    - Must reconnect or retry after failure
  - Cluster Servers
    - All cluster nodes must be on the same LAN segment
  - IP rebinds to failover MAC address
    - Transparent to client or server
    - Low-level ARP (Address Resolution Protocol) rebinds the IP address to the new MAC address
  Router: 200.110.120.4 -> AliceMAC; 200.110.120.5 -> AliceMAC; 200.110.120.6 -> BettyMAC; 200.110.120.7 -> BettyMAC
  [Diagram: Client; WAN; Local Network; Alice 200.110.120.4; Virtual Alice 200.110.120.5; Betty 200.110.120.6; Virtual Betty 200.110.120.7; also labeled Alice 200.110.12.4; Virtual Alice 200.110.12.5; Betty 200.110.12.6; Virtual Betty 200.110.12.7]
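
The router table on this slide makes the rebinding easy to model: the virtual server keeps its IP address, and only the MAC address behind it changes. The sketch below treats the table as a dictionary and models Virtual Alice failing over to Betty's adapter. The `rebind` helper is hypothetical; real rebinding happens through ARP on the LAN segment, not through a function call.

```python
from typing import Dict

# Router's IP -> MAC view before failover, as listed on the slide.
ARP_BEFORE: Dict[str, str] = {
    "200.110.120.4": "AliceMAC",   # Alice
    "200.110.120.5": "AliceMAC",   # Virtual Alice
    "200.110.120.6": "BettyMAC",   # Betty
    "200.110.120.7": "BettyMAC",   # Virtual Betty
}


def rebind(arp: Dict[str, str], virtual_ip: str, new_mac: str) -> Dict[str, str]:
    """Return a copy of the table with one IP bound to a new MAC."""
    updated = dict(arp)
    updated[virtual_ip] = new_mac
    return updated


# Virtual Alice moves to Betty: same IP address, different MAC address.
ARP_AFTER = rebind(ARP_BEFORE, "200.110.120.5", "BettyMAC")
```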

94 Time
  - Time must increase monotonically
    - Otherwise applications get confused
    - e.g. make/nmake/build
  - Time is maintained within failover resolution
    - Not hard, since failover is on the order of seconds
  - Time is a resource, so one node owns the time resource
  - Other nodes periodically correct drift from the owner's time
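
A sketch of how a non-owner node might correct drift without ever letting time run backward. It assumes the node periodically learns the owner's time and keeps an offset, and that it clamps readings to the last value it returned. The class and method names are hypothetical, and the clamping strategy is one possible choice rather than the documented mechanism.

```python
class MonotonicClusterClock:
    """Hypothetical per-node view of cluster time."""

    def __init__(self, start: float = 0.0) -> None:
        self._last = start      # last value handed out
        self._offset = 0.0      # owner time minus local time

    def sync(self, local_now: float, owner_now: float) -> None:
        """Record the drift observed against the time owner."""
        self._offset = owner_now - local_now

    def now(self, local_now: float) -> float:
        """Corrected time that never decreases between calls."""
        candidate = local_now + self._offset
        self._last = max(self._last, candidate)
        return self._last
```

If a correction would move the clock backward, readings hold at the previous value until the corrected time catches up.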

95 Application Local NT Registry Checkpointing
  - Resources can request that local NT registry sub-trees be replicated
  - Changes written out to quorum device
  - Uses registry change notification interface
  - Changes read and applied on fail-over
  [Diagram: \\A on \\X registry; Quorum Device registry; \\A on \\B registry; Each update; After Failover]
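
The checkpointing flow reads like a write-ahead log: each change is appended to the quorum device, then replayed on the node that takes over. A minimal sketch under that reading follows, with the quorum device modeled as an in-memory list and the registry as a dictionary. All names here are hypothetical, and `None` is an assumed convention for a deleted value.

```python
from typing import Dict, List, Optional, Tuple

# (registry key path, new value or None for a deletion)
Change = Tuple[str, Optional[str]]


def record_change(quorum_log: List[Change], key: str,
                  value: Optional[str]) -> None:
    """Called on each change notification: append to the quorum log."""
    quorum_log.append((key, value))


def apply_on_failover(quorum_log: List[Change],
                      registry: Dict[str, str]) -> Dict[str, str]:
    """Replay logged changes, in order, onto the new node's registry."""
    result = dict(registry)
    for key, value in quorum_log:
        if value is None:
            result.pop(key, None)
        else:
            result[key] = value
    return result
```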

96 Registry Replication

97 Application Support
  - Virtual Servers
  - Generic Resource DLLs
  - Resource DLL VC++ Wizard
  - Cluster API

98 Generic Resource DLLs
  - Generic Application DLL
    - Simplest: just starts, stops application, and makes sure process is alive
  - Generic Service DLL
    - Translates DLL calls into equivalent NT Server calls
      - Online => Service Start
      - Offline => Service Stop
      - Looks/IsAlive => Service Status
  [Diagram: Resource Monitor; DLL; Resource; Private calls; Std calls]
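
The Generic Service DLL is essentially an adapter. The sketch below mirrors the three mappings on the slide (Online to Start, Offline to Stop, Looks/IsAlive to Status) against a stand-in service object. `SketchService` and `GenericServiceResource` are hypothetical and do not call the NT service control interface.

```python
class SketchService:
    """Stand-in for an NT service with start/stop/status operations."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def status(self) -> str:
        return "running" if self.running else "stopped"


class GenericServiceResource:
    """Adapter translating resource calls into service calls."""

    def __init__(self, service: SketchService) -> None:
        self.service = service

    def online(self) -> None:       # Online => Service Start
        self.service.start()

    def offline(self) -> None:      # Offline => Service Stop
        self.service.stop()

    def looks_alive(self) -> bool:  # LooksAlive => Service Status
        return self.service.status() == "running"

    def is_alive(self) -> bool:     # IsAlive => Service Status
        return self.looks_alive()
```

A custom resource DLL would differ mainly in `is_alive`, where an application-specific check can replace the plain status query.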

99 Generic Application

100 Generic Service

101 Application Support
  - Virtual Servers
  - Generic Resource DLLs
  - Resource DLL VC++ Wizard
  - Cluster API

102 Resource DLL VC++ Wizard
  - Asks for resource type name
  - Asks for optional service to control
  - Asks for other parameters (and associated types)
  - Generates DLL source code
  - Source can be modified as necessary
    - E.g. additional checks for Looks/IsAlive

103 Creating a New Workspace

104 Specifying Resource Type Name

105 Specifying Resource Parameters

106 Automatic Code Generation

107 Customizing The Code

108 Application Support
  - Virtual Servers
  - Generic Resource DLLs
  - Resource DLL VC++ Wizard
  - Cluster API

109 Cluster API
  - Allows resources to:
    - Examine dependencies
    - Manage per-resource data
    - Change parameters (e.g. failover)
    - Listen for cluster events
    - etc.
  - Specs & API became public Sept 1996
    - On all MSDN Level 3
    - On web site: http://www.microsoft.com/clustering.htm

110 Cluster API Documentation

111 Outline
  - Why FT and Why Clusters
  - Cluster Abstractions
  - Cluster Architecture
  - Cluster Implementation
  - Application Support
  - Q&A

112 Research Topics?
  - Even easier to manage
  - Transparent failover
  - Instant failover
  - Geographic distribution (disaster tolerance)
  - Server pools (load-balanced pool of processes)
  - Process pair (active/backup process)
  - 10,000 nodes?
  - Better algorithms
  - Shared memory or shared disk among nodes
    - a truly bad idea?

113 References
  - Microsoft NT site: http://www.microsoft.com/ntserver/
  - BARC site (e.g. these slides): http://research.microsoft.com/~joebar/wolfpack
  - Inside Windows NT, H. Custer, Microsoft Pr, ISBN: 155615481.
  - Tandem Global Update Protocol, R. Carr, Tandem Systems Review, V1.2, 1985. Sketches regroup and global update protocol.
  - VAXclusters: a Closely Coupled Distributed System, Kronenberg, N., Levey, H., Strecker, W., ACM TOCS, V4.2, 1986. A (the) shared disk cluster.
  - In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Gregory F. Pfister, Prentice Hall, 1995, ISBN: 0134376250. Argues for shared nothing.
  - Transaction Processing Concepts and Techniques, Gray, J., Reuter, A., Morgan Kaufmann, 1994, ISBN 1558601902. Survey of outages, transaction techniques.

