2 ©1996, 1997 Microsoft Corp. 1 FT NT: A Tutorial on Microsoft Cluster Server ™ (formerly “Wolfpack”) Joe Barrera Jim Gray Microsoft Research {joebar, gray} @ microsoft.com http://research.microsoft.com/barc

3 ©1996, 1997 Microsoft Corp. 2 Outline  Why FT and Why Clusters  Cluster Abstractions  Cluster Architecture  Cluster Implementation  Application Support  Q&A

4 ©1996, 1997 Microsoft Corp. 3 DEPENDABILITY: The 3 ITIES  RELIABILITY / INTEGRITY: does the right thing (also large MTTF)  AVAILABILITY: does it now (also small MTTR). Availability = MTTF / (MTTF + MTTR). If 90% of terminals are up & 99% of the DB is up, then ~89% of transactions are serviced on time (0.90 × 0.99 ≈ 0.89).  Holistic vs. Reductionist view: Security, Integrity, Reliability, Availability
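The arithmetic above can be checked with a short sketch (Python here, purely illustrative; the figures are the ones on the slides):

```python
# Availability sketch for the formula above: A = MTTF / (MTTF + MTTR),
# and independent components' availabilities multiply.
def availability(mttf, mttr):
    """Steady-state availability: fraction of time the system is up."""
    return mttf / (mttf + mttr)

# Japan survey figures: MTTF ~ 10 weeks, average outage ~ 90 minutes.
week_minutes = 7 * 24 * 60
a_single = availability(10 * week_minutes, 90)

# 90% of terminals up and 99% of the DB up => ~89% of transactions
# serviced on time, since independent availabilities multiply.
a_end_to_end = 0.90 * 0.99
```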

5 ©1996, 1997 Microsoft Corp. 4 Case Study - Japan. "Survey on Computer Security", Japan Info Dev Corp., March 1986 (trans: Eiichi Watanabe). 1,383 institutions reported (6/84 - 7/85): 7,517 outages, MTTF ~ 10 weeks, average duration ~ 90 MINUTES. By cause: Vendor (hardware and software) MTTF 5 months (42% of outages); Application software 9 months (25%); Communications lines 1.5 years (12%); Operations 2 years (9.3%); Environment 2 years (11.2%). To get a 10-year MTTF, must attack all these areas.

6 ©1996, 1997 Microsoft Corp. 5 Case Studies - Tandem Trends: MTTF improved. Shift from Hardware & Maintenance (from 50% down to 10% of outages) to Software (62%) & Operations (15%). NOTE: systematic under-reporting of Environment and Operations errors.

7 ©1996, 1997 Microsoft Corp. 6 Summary of FT Studies  Current Situation: ~4-year MTTF => Fault Tolerance Works.  Hardware is GREAT (maintenance and MTTF).  Software masks most hardware faults.  Many hidden software outages in operations:  New Software.  Utilities.  Must make all software ONLINE.  Software seems to define a 30-year MTTF ceiling.  Reasonable Goal: 100-year MTTF. class 4 today => class 6 tomorrow.

8 ©1996, 1997 Microsoft Corp. 7 Fault Tolerance vs Disaster Tolerance  Fault-Tolerance: mask local faults  RAID disks  Uninterruptible Power Supplies  Cluster Failover  Disaster Tolerance: masks site failures  Protects against fire, flood, sabotage,..  Redundant system and service at remote site.

9 ©1996, 1997 Microsoft Corp. 8 The Microsoft “Vision”: Plug & Play Dependability Integrity / Security Integrity Reliability Availability  Transactions for reliability  Clusters: for availability  Security  All built into the OS

10 ©1996, 1997 Microsoft Corp. 9 Cluster Goals  Manageability  Manage nodes as a single system  Perform server maintenance without affecting users  Mask faults, so repair is non-disruptive  Availability  Restart failed applications & servers  un-availability ~ MTTR / MTBF, so quick repair helps  Detect/warn administrators of failures  Scalability  Add nodes for incremental processing, storage, and bandwidth

11 ©1996, 1997 Microsoft Corp. 10 Fault Model  Failures are independent, so single-fault tolerance is a big win  Hardware fails fast (blue-screen)  Software fails fast (or goes to sleep)  Software is often repaired by reboot:  Heisenbugs  Operations tasks: a major source of outage  Utility operations  Software upgrades

12 ©1996, 1997 Microsoft Corp. 11 Cluster: Servers Combined to Improve Availability & Scalability  Cluster: A group of independent systems working together as a single system. Clients see scalable & FT services (single system image).  Node: A server in a cluster. May be an SMP server.  Interconnect: Communications link used for intra-cluster status info such as “heartbeats”. Can be Ethernet. (Diagram: client PCs and printers attached to servers A and B, each with a disk array, joined by the interconnect.)

13 ©1996, 1997 Microsoft Corp. 12 Microsoft Cluster Server ™  2-node availability Summer 97 (20,000 Beta Testers now)  Commoditize fault-tolerance (high availability)  Commodity hardware (no special hardware)  Easy to set up and manage  Lots of applications work out of the box.  16-node scalability later (next year?)

14 ©1996, 1997 Microsoft Corp. 13 Failover Example (diagram): a browser talks to a Web site and Database on Server 1, with the Web site and database files on shared disks; after failover, Server 2 hosts the Web site and Database.

15 ©1996, 1997 Microsoft Corp. 14 MS Press Failover Demo!  Failure cases: Client/Server  Software failure  Admin shutdown  Server failure  Resource States: Pending, Partial, Failed, Offline

16 ©1996, 1997 Microsoft Corp. Demo Configuration (a Windows NT Server cluster sharing a SCSI disk cabinet)  Server “Alice”: SMP Pentium® Pro processors, local disks, Windows NT Server with Wolfpack, Microsoft Internet Information Server, Microsoft SQL Server  Server “Betty”: same configuration  Shared disks in the SCSI disk cabinet  Interconnect: standard Ethernet  Client: Windows NT Workstation, Internet Explorer, MS Press OLTP app  Administrator: Windows NT Workstation, Cluster Admin, SQL Enterprise Mgr

17 ©1996, 1997 Microsoft Corp. Demo Administration (Alice runs SQL Trace and the Globe; Betty runs SQL Trace)  Cluster Admin Console:  Windows GUI  Shows cluster resource status  Replicates status to all servers  Define apps & related resources  Define resource dependencies  Orchestrates recovery order  SQL Enterprise Mgr:  Windows GUI  Shows server status  Manages many servers  Start, stop, manage DBs

18 ©1996, 1997 Microsoft Corp. 17 Generic Stateless Application: Rotating Globe  Mplay32 is a generic app  Registered with MSCS  MSCS restarts it on failure  Move/restart ~ 2 seconds  Fails over after 4 failures (= process exits) in 3 minutes (a settable default)

19 ©1996, 1997 Microsoft Corp. Demo: Moving or Failing Over an Application (diagram): the AVI application X runs on Alice; when Alice fails or the operator requests a move, it restarts on Betty.

20 ©1996, 1997 Microsoft Corp. 19 Generic Stateful Application NotePad  Notepad saves state on shared disk  Failure before save => lost changes  Failover or move (disk & state move)

21 ©1996, 1997 Microsoft Corp. Demo Step 1: Alice Delivering Service (diagram): clients reach IIS via HTTP and SQL via ODBC at the cluster IP address on Alice; Betty’s IIS and SQL are idle (no SQL activity).

22 ©1996, 1997 Microsoft Corp. Demo Step 2: Request Move to Betty (diagram): the IP address, IIS, and SQL move from Alice to Betty; SQL activity shifts to Betty.

23 ©1996, 1997 Microsoft Corp. Demo Step 3: Betty Delivering Service (diagram): Betty now serves IIS and SQL at the cluster IP address; Alice is idle (no SQL activity).

24 ©1996, 1997 Microsoft Corp. Demo Step 4: Power Fail Betty, Alice Takeover (diagram): Betty loses power; Alice takes over the IP address, IIS, and SQL.

25 ©1996, 1997 Microsoft Corp. Demo Step 5: Alice Delivering Service (diagram): clients again reach IIS via HTTP and SQL via ODBC at the cluster IP address on Alice.

26 ©1996, 1997 Microsoft Corp. Demo Step 6: Reboot Betty (diagram): Betty rejoins the cluster and can now take over again if Alice fails.

27 ©1996, 1997 Microsoft Corp. 26 Outline  Why FT and Why Clusters  Cluster Abstractions  Cluster Architecture  Cluster Implementation  Application Support  Q&A

28 ©1996, 1997 Microsoft Corp. 27 Cluster and NT Abstractions  Cluster abstractions: Cluster, Group, Resource  NT abstractions: Domain, Node, Service

29 ©1996, 1997 Microsoft Corp. 28 Basic NT Abstractions (Domain, Node, Service)  Service: program or device managed by a node  e.g., file service, print service, database server  can depend on other services (startup ordering)  can be started, stopped, paused, failed  Node: a single (tightly-coupled) NT system  hosts services; belongs to a domain  services on a node always remain co-located  unit of service co-location; involved in naming services  Domain: a collection of nodes  cooperation for authentication, administration, naming

30 ©1996, 1997 Microsoft Corp. 29 Cluster Abstractions (Cluster, Group, Resource)  Resource: program or device managed by a cluster  e.g., file service, print service, database server  can depend on other resources (startup ordering)  can be online, offline, paused, failed  Resource Group: a collection of related resources  hosts resources; belongs to a cluster  unit of co-location; involved in naming resources  Cluster: a collection of nodes, resources, and groups  cooperation for authentication, administration, naming

31 ©1996, 1997 Microsoft Corp. 30 Resources. Resources have...  Type: what it does (file, DB, print, web…)  An operational state (online/offline/failed)  Current and possible nodes  Containing Resource Group  Dependencies on other resources  Restart parameters (in case of resource failure)

32 ©1996, 1997 Microsoft Corp. 31 Resource Types  Built-in types  Generic Application  Generic Service  Internet Information Server (IIS) Virtual Root  Network Name  TCP/IP Address  Physical Disk  FT Disk (Software RAID)  Print Spooler  File Share  Added by others  Microsoft SQL Server,  Message Queues,  Exchange Mail Server,  Oracle,  SAP R/3  Your application? (use developer kit wizard).

33 ©1996, 1997 Microsoft Corp. 32 Physical Disk

34 ©1996, 1997 Microsoft Corp. 33 TCP/IP Address

35 ©1996, 1997 Microsoft Corp. 34 Network Name

36 ©1996, 1997 Microsoft Corp. 35 File Share

37 ©1996, 1997 Microsoft Corp. 36 IIS (WWW/FTP) Server

38 ©1996, 1997 Microsoft Corp. 37 Print Spooler

39 ©1996, 1997 Microsoft Corp. 38 Resource States  Resource states:  Offline: exists, not offering service  Online: offering service  Failed: not able to offer service  (transitions pass through Online Pending and Offline Pending)  Resource failure may cause:  local restart  other resources to go offline  resource group to move  (all subject to group and resource parameters)  Resource failure detected by:  Polling failure  Node failure

40 ©1996, 1997 Microsoft Corp. 39 Resource Dependencies  Similar to NT Service Dependencies  Orderly startup & shutdown:  A resource is brought online after any resources it depends on are online  A resource is taken offline before any resources it depends on  Interdependent resources  form dependency trees  move among nodes together  fail over together  as per resource group. (Example tree: IIS Virtual Root and File Share depend on Network Name, which depends on IP Address.)
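The online-before/offline-after ordering above is just a dependency-graph walk. A minimal sketch (Python; resource names mirror the example tree on the slide and are illustrative only):

```python
# Bring resources online only after their dependencies: a postorder walk
# of the dependency graph yields a valid online order.
deps = {
    "IIS Virtual Root": ["Network Name"],
    "File Share": ["Network Name"],
    "Network Name": ["IP Address"],
    "IP Address": [],
}

def online_order(deps):
    order, seen = [], set()
    def visit(r):
        if r in seen:
            return
        seen.add(r)
        for d in deps.get(r, []):   # dependencies come online first
            visit(d)
        order.append(r)
    for r in deps:
        visit(r)
    return order
```

Taking resources offline uses the same order reversed: dependents go offline before the resources they depend on.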

41 ©1996, 1997 Microsoft Corp. 40 Dependencies Tab

42 ©1996, 1997 Microsoft Corp. 41 NT Registry  Stores all configuration information  Software  Hardware  Hierarchical (name, value) map  Has an open, documented interface  Is secure  Is visible across the net (RPC interface)  Typical entry: \Software\Microsoft\MSSQLServer\MSSQLServer\ DefaultLogin = “GUEST” DefaultDomain = “REDMOND”

43 ©1996, 1997 Microsoft Corp. 42 Cluster Registry  Separate from local NT Registry  Replicated at each node  Algorithms explained later  Maintains configuration information:  Cluster members  Cluster resources  Resource and group parameters (e.g. restart)  Stable storage  Refreshed from “master” copy when node joins cluster

44 ©1996, 1997 Microsoft Corp. 43 Other Resource Properties  Name  Restart policy (restart N times, failover…)  Startup parameters  Private configuration info (resource type specific)  Per-node as well, if necessary  Poll Intervals (LooksAlive, IsAlive, Timeout )  These properties are all kept in Cluster Registry

45 ©1996, 1997 Microsoft Corp. 44 General Resource Tab

46 ©1996, 1997 Microsoft Corp. 45 Advanced Resource Tab

47 ©1996, 1997 Microsoft Corp. 46 Resource Groups  Every resource belongs to a resource group  Resource groups move (failover) as a unit  Dependencies NEVER cross groups (dependency trees are contained within groups)  A group may contain a forest of dependency trees. (Example “Payroll Group”: IP Address, SQL Server on Drive E:, Web Server on Drive F:.)
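The "dependencies never cross groups" invariant is easy to check mechanically. A sketch (Python; the group and resource names are made up for illustration):

```python
# Verify the invariant that every dependency stays inside one resource
# group. Group contents and dependency edges here are illustrative.
groups = {
    "Payroll Group": {"IP Address", "SQL Server", "Drive E:", "Web Server", "Drive F:"},
    "Print Group": {"Print Spooler", "Spool Disk"},
}
deps = {
    "SQL Server": ["Drive E:", "IP Address"],
    "Web Server": ["Drive F:", "IP Address"],
    "Print Spooler": ["Spool Disk"],
}

def group_of(resource):
    return next(g for g, members in groups.items() if resource in members)

def dependencies_ok(deps):
    # Every edge (r -> d) must have both endpoints in the same group.
    return all(group_of(r) == group_of(d) for r, ds in deps.items() for d in ds)
```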

48 ©1996, 1997 Microsoft Corp. 47 Moving a Resource Group

49 ©1996, 1997 Microsoft Corp. 48 Group Properties  CurrentState: Online, Partially Online, Offline  Members: resources that belong to the group  members determine which nodes can host the group  Preferred Owners: ordered list of host nodes  FailoverThreshold: how many faults cause failover  FailoverPeriod: time window for the failover threshold  FailbackWindowStart / FailbackWindowEnd: when can failback happen?  Everything (except CurrentState) is stored in the registry

50 ©1996, 1997 Microsoft Corp. 49 Failover and Failback  Failover parameters  timeout on LooksAlive, IsAlive  # of local restarts in the failure window; after this, offline  Failback to preferred node  (during failback window)  Do resource failures affect the group? (Diagram: the Cluster Service group, with its name and IP address, fails over from node \\Alice to node \\Betty and later fails back.)

51 ©1996, 1997 Microsoft Corp. 50 Cluster Concepts (diagram): a cluster contains resource groups; each group contains resources.

52 ©1996, 1997 Microsoft Corp. 51 Cluster Properties  Defined Members: nodes that can join the cluster  Active Members: nodes currently joined to cluster  Resource Groups : groups in a cluster  Quorum Resource :  Stores copy of cluster registry.  Used to form quorum.  Network : Which network used for communication  All properties kept in Cluster Registry

53 ©1996, 1997 Microsoft Corp. 52 Cluster API Functions (operations on nodes & groups)  Find and communicate with Cluster  Query/Set Cluster properties  Enumerate Cluster objects  Nodes  Groups  Resources and Resource Types  Cluster Event Notifications  Node state and property changes  Group state and property changes  Resource state and property changes

54 ©1996, 1997 Microsoft Corp. 53 Cluster Management

55 ©1996, 1997 Microsoft Corp. 54 Demo  Server startup and shutdown  Installing applications  Changing status  Failing over  Transferring ownership of groups or resources  Deleting Groups and Resources

56 ©1996, 1997 Microsoft Corp. 55 Outline  Why FT and Why Clusters  Cluster Abstractions  Cluster Architecture  Cluster Implementation  Application Support  Q&A

57 ©1996, 1997 Microsoft Corp. 56 Architecture  Top tier provides cluster abstractions  Middle tier provides distributed operations  Bottom tier is NT and drivers. (Component stack diagram: Failover Manager, Cluster Registry, and Resource Monitor over Membership, Global Update, and Quorum, over Windows NT Server with the Cluster Disk and Cluster Net Drivers.)

58 ©1996, 1997 Microsoft Corp. 57 Membership and Regroup  Membership:  Used for orderly addition and removal from { active nodes }  Regroup:  Used for failure detection (via heartbeat messages)  Forceful eviction from { active nodes }

59 ©1996, 1997 Microsoft Corp. 58 Membership  Defined cluster = all nodes  Active cluster:  Subset of the defined cluster  Includes the Quorum Resource  Stable (no regroup in progress)

60 ©1996, 1997 Microsoft Corp. 59 Quorum Resource  Usually (but not necessarily) a SCSI disk  Requirements:  Arbitrates for a resource by supporting the challenge/defense protocol  Capable of storing cluster registry and logs  Configuration Change Logs  Tracks changes to configuration database when any defined member missing (not active)  Prevents configuration partitions in time

61 ©1996, 1997 Microsoft Corp. 60 Challenge/Defense Protocol  SCSI-2 has reserve/release verbs  Semaphore on the disk controller  Owner gets a lease on the semaphore  Renews the lease once every 3 seconds  To preempt ownership:  Challenger clears the semaphore (SCSI bus reset)  Waits 10 seconds: (3 seconds for renewal + 2 seconds bus settle time) × 2, to give the owner two chances to renew  If still clear, the former owner loses the lease  Challenger issues reserve to acquire the semaphore

62 ©1996, 1997 Microsoft Corp. 61 Challenge/Defense Protocol: Successful Defense (timeline diagram): the challenger issues a bus reset; the defender’s periodic reserve is detected during the challenger’s wait, so the challenge fails.

63 ©1996, 1997 Microsoft Corp. 62 Challenge/Defense Protocol: Successful Challenge (timeline diagram): after the bus reset, no reservation is detected during the wait, so the challenger issues reserve and takes ownership.
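The two timelines can be reproduced with a tiny discrete-time simulation of the protocol as described two slides back (Python; this models the timing only, not real SCSI traffic):

```python
# Discrete-time sketch of the SCSI reserve/release challenge/defense
# protocol. Times are in whole seconds.
RENEW_PERIOD = 3       # defender renews its reservation every 3 s
CHALLENGE_WAIT = 10    # (3 s renewal + 2 s bus settle) x 2 chances

def challenge(defender_alive):
    reserved = False                  # challenger clears via SCSI bus reset
    for t in range(1, CHALLENGE_WAIT + 1):
        if defender_alive and t % RENEW_PERIOD == 0:
            reserved = True           # live defender renews its reservation
    if reserved:
        return "defense"              # reservation detected: back off
    return "challenge"                # still clear: challenger reserves
```

A live defender renews at least twice within the 10-second wait, so the challenge fails; a dead one never renews, so the challenger acquires the semaphore.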

64 ©1996, 1997 Microsoft Corp. 63 Regroup  Invariant: all members agree on { members }  Regroup re-computes { members }  Each node sends a heartbeat message to a peer (default is one per second)  Regroup if two heartbeat messages are lost  suspicion that the sender is dead  failure detection in bounded time  Uses a 5-round protocol to agree  Checks communication among nodes  A suspected missing node may survive  Upper levels (global update, etc.) are informed of the regroup event
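The "two lost heartbeats" rule gives bounded-time detection. A minimal sketch of the trigger condition (Python; parameter values are the slide's defaults):

```python
# Heartbeat failure detection as described above: a peer is suspected dead
# (and a regroup is triggered) after two consecutive missed heartbeats,
# so detection time is bounded at about two heartbeat periods.
HEARTBEAT_PERIOD = 1.0     # seconds between heartbeats (default)
MISSED_LIMIT = 2           # lost heartbeats before triggering regroup

def should_regroup(last_heard, now):
    """True if the peer has missed MISSED_LIMIT heartbeat slots."""
    return (now - last_heard) >= MISSED_LIMIT * HEARTBEAT_PERIOD
```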

65 ©1996, 1997 Microsoft Corp. 64 Membership State Machine (diagram). States: Initialize, Member Search, Joining, Forming, Online, Regroup, Quorum Disk Search, Sleeping. Transition events include: Start Cluster, Found Online Member, Join Succeeds, Synchronize Succeeds, Search Fails, Acquire (reserve) Quorum Disk, Search or Reserve Fails, Lost Heartbeat, Minority or no Quorum, Non-Minority and Quorum.

66 ©1996, 1997 Microsoft Corp. 65 Joining a Cluster  When a node starts up, it mounts and configures only local, non-cluster devices  It starts the Cluster Service, which  looks in the local (stale) registry for members  asks each member in turn to sponsor the new node’s membership (stop when a sponsor is found)  Sponsor (any active member)  authenticates the applicant  broadcasts the applicant to cluster members  sends the updated registry to the applicant  Applicant becomes a cluster member

67 ©1996, 1997 Microsoft Corp. 66 Forming a Cluster (when Joining fails)  Use the registry to find the quorum resource  Attach to (arbitrate for) the quorum resource  Update the cluster registry from the quorum resource  e.g. if we were down when it was in use  Form a new one-node cluster  Bring other cluster resources online  Let others join your cluster

68 ©1996, 1997 Microsoft Corp. 67 Leaving A Cluster (Gracefully)  Pause:  Move all groups off this member  Change to the paused state (remains a cluster member)  Offline:  Move all groups off this member  Send a ClusterExit message to all cluster members  prevents regroup  prevents stalls during departure transitions  Close cluster connections (now not an active cluster member)  Cluster service stops on the node  Evict: remove the node from the defined member list

69 ©1996, 1997 Microsoft Corp. 68 Leaving a Cluster (Node Failure)  Node (or communication) failure triggers Regroup  If after regroup:  Minority group OR no quorum device: the group does NOT survive  Non-minority group AND quorum device: the group DOES survive  Non-Minority rule:  Number of new members >= 1/2 of the old active cluster  Prevents a minority from seizing the quorum device at the expense of a larger potentially surviving cluster  Quorum guarantees correctness  Prevents “split-brain”, e.g. with a newly forming cluster containing a single node
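The survival rule above combines both conditions. A one-function sketch (Python):

```python
# Survival rule after a regroup, as stated on the slide: a partition
# survives only if it is a non-minority of the old active cluster AND it
# holds the quorum device.
def survives(new_members, old_active_members, has_quorum_device):
    non_minority = new_members >= old_active_members / 2
    return non_minority and has_quorum_device
```

For example, in a 2-node cluster a single surviving node is exactly half of the old membership (non-minority), so holding the quorum disk decides which side lives.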

70 ©1996, 1997 Microsoft Corp. 69 Global Update  Propagates updates to all nodes in the cluster  Used to maintain the replicated cluster registry  Updates are atomic and totally ordered  Tolerates all benign failures  Depends on membership:  all members are up  all can communicate  R. Carr, Tandem Systems Review, V1.2, 1985, sketches the regroup and global update protocols.

71 ©1996, 1997 Microsoft Corp. 70 Global Update Algorithm  The cluster has a locker node that regulates updates  The oldest active node in the cluster  Send the update to the locker node  Update the other (active) nodes  in seniority order (e.g. locker first)  this includes the updating node  Failure of all updated nodes:  the update never happened  updated nodes will roll back on recovery  Survival of any updated node:  the new locker is the oldest, and so has the update if any node does  the new locker restarts the update
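The happy path of the locker-based propagation can be sketched as follows (Python; node names and the dict-as-registry are illustrative, and failure handling/rollback is omitted):

```python
# Locker-based global update sketch: the update goes to the locker (the
# oldest active node) first, then to every other active node in seniority
# order. If any updated node survives a failure, the oldest survivor (the
# new locker) has the update and can restart it.
def global_update(nodes, registries, key, value):
    """nodes: active node names in seniority order, oldest first."""
    locker = nodes[0]
    registries[locker][key] = value      # step 1: locker applies the update
    for n in nodes[1:]:                  # step 2: remaining nodes, in
        registries[n][key] = value       #         seniority order
```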

72 ©1996, 1997 Microsoft Corp. 71 Cluster Registry  Separate from the local NT Registry  Maintains cluster configuration  members, resources, restart parameters, etc.  Stable storage  Replicated at each member  Global Update protocol  NT Registry keeps a local copy

73 ©1996, 1997 Microsoft Corp. 72 Cluster Registry Bootstrapping  Membership uses the Cluster Registry for the list of nodes  …a circular dependency  Solution:  Membership uses the stale local cluster registry  Refresh after joining or forming the cluster  The master is either the quorum device or the active members

74 ©1996, 1997 Microsoft Corp. 73 Resource Monitor  Polls resources:  IsAlive and LooksAlive  Detects failures  polling failure  failure event from the resource  Higher levels tell it:  Online, Offline  Restart

75 ©1996, 1997 Microsoft Corp. 74 Failover Manager  Assigns groups to nodes based on  failover parameters  possible nodes for each resource in the group  preferred nodes for the resource group

76 ©1996, 1997 Microsoft Corp. 75 Failover (Resource Goes Offline)  The Resource Manager detects a resource error and attempts to restart the resource.  If the resource retry limit is exceeded, the resource (and its dependents) is switched Offline and the Failover Manager is notified.  The Failover Manager checks the Failover Window and Failover Threshold; if the failover conditions are within constraints, it arbitrates to find another owner.  If another owner is found, the Failover Manager on the new system brings the resource Online.  Otherwise, the group is left in a partially Online state and waits for the Failback Window.
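The restart-versus-failover decision above is a threshold within a sliding time window. A sketch (Python; the class and its parameters are illustrative, sized to the globe demo's "4 failures in 3 minutes" default):

```python
# Restart-vs-failover policy from the flowchart: restart locally until the
# retry limit is exceeded within the failure window, then fail over.
from collections import deque

class RestartPolicy:
    def __init__(self, retry_limit, window_seconds):
        self.retry_limit = retry_limit
        self.window = window_seconds
        self.failures = deque()

    def on_failure(self, now):
        """Return 'restart' or 'failover' for a failure at time `now`."""
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()      # forget failures outside the window
        if len(self.failures) > self.retry_limit:
            return "failover"
        return "restart"
```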

77 ©1996, 1997 Microsoft Corp. 76 Pushing a Group (Resource Failure)  The Resource Monitor notifies the Resource Manager of a resource failure.  If no resource has “Affect the Group” set True, the group is left in a partially Online state.  Otherwise, the Resource Manager enumerates all objects in the dependency tree of the failed resource and takes each dependent resource Offline.  The Resource Manager notifies the Failover Manager that the dependency tree is Offline and needs to fail over.  The Failover Manager arbitrates to locate a new owner for the group.  The Failover Manager on the new owner node brings the resources Online.

78 ©1996, 1997 Microsoft Corp. 77 Pulling a Group (Node Failure)  The Cluster Service notifies the Failover Manager of a node failure.  The Failover Manager determines which groups were owned by the failed node.  The Resource Manager notifies the Failover Manager that the node is Offline and the groups it owned need to fail over.  The Failover Manager arbitrates to locate new owners for the groups.  The Failover Manager on the new owner(s) brings the resources Online in dependency order.

79 ©1996, 1997 Microsoft Corp. 78 Failback to Preferred Owner Node  A group may have a Preferred Owner; failback occurs when the preferred owner comes back online, and only during the Failback Window (a time slot, e.g. at night).  The Failover Manager arbitrates to locate the Preferred Owner of the group.  If the time is within the Failback Window, the Resource Manager takes each resource on the current owner Offline.  The Resource Manager notifies the Failover Manager that the group is Offline and needs to fail over to the Preferred Owner.  The Failover Manager on the Preferred Owner brings the resources Online.

80 ©1996, 1997 Microsoft Corp. 79 Outline  Why FT and Why Clusters  Cluster Abstractions  Cluster Architecture  Cluster Implementation  Application Support  Q&A

81 ©1996, 1997 Microsoft Corp. 80 Cluster Service Process Structure (per node)  Cluster Service:  Failover Manager  Cluster Registry  Global Update  Quorum  Membership  Resource Monitors (separate processes, reached via private calls), each hosting Resource DLLs  Resources: services and applications

82 ©1996, 1997 Microsoft Corp. 81 Resource Control  The Cluster Service drives each Resource Monitor via private calls; the Resource Monitor calls into the Resource DLL, which manages the resource.  Commands:  CreateResource()  OnlineResource()  OfflineResource()  TerminateResource()  CloseResource()  ShutdownProcess()  And resource events

83 ©1996, 1997 Microsoft Corp. 82 Resource DLLs  Calls to a Resource DLL:  Open: get handle  Online: start offering service  Offline: stop offering service (as a standby, or pair is offline)  LooksAlive: quick check  IsAlive: thorough check  Terminate: forceful Offline  Close: release handle
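The entry points listed above can be modeled as a class to show the lifecycle. A real MSCS resource DLL is C code exporting these entry points; this Python version is a behavioral sketch only, "managing" an in-process flag instead of a real service:

```python
# Sketch of the Resource DLL entry points from the slide, modeled as a
# class. State transitions mirror the resource state machine:
# offline -> online -> offline (or failed, on Terminate).
class GenericResource:
    def __init__(self):
        self.state = "offline"

    def open(self):            # Open: get a handle to the resource
        return self

    def online(self):          # Online: start offering service
        self.state = "online"

    def offline(self):         # Offline: stop offering service
        self.state = "offline"

    def looks_alive(self):     # LooksAlive: quick, cheap health check
        return self.state == "online"

    def is_alive(self):        # IsAlive: thorough health check
        return self.state == "online"

    def terminate(self):       # Terminate: forceful Offline
        self.state = "failed"

    def close(self):           # Close: release the handle
        pass
```

The Resource Monitor polls `looks_alive` frequently and `is_alive` less often, which is why the slide distinguishes a quick check from a thorough one.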

84 ©1996, 1997 Microsoft Corp. 83 Cluster Communications  Most communication is via DCOM/RPC (Cluster Service to Resource Monitors, and management apps to the Cluster Service)  UDP is used for membership heartbeat messages  Standard (e.g. Ethernet) interconnects

85 ©1996, 1997 Microsoft Corp. 84 Outline  Why FT and Why Clusters  Cluster Abstractions  Cluster Architecture  Cluster Implementation  Application Support  Q&A

86 ©1996, 1997 Microsoft Corp. 85 Application Support  Virtual Servers  Generic Resource DLLs  Resource DLL VC++ Wizard  Cluster API

87 ©1996, 1997 Microsoft Corp. 86 Virtual Servers  Problem:  Client and server applications do not want the node name to change when the server app moves to another node.  A Virtual Server simulates an NT node:  Resource group (name, disks, databases,…)  NetName and IP address (node \\a keeps its name and IP address as it moves)  Virtual registry (the registry “moves”, i.e. is replicated)  Virtual service control  Virtual RPC service  Challenges:  Limit the app to the virtual server’s devices and services  Client reconnect on failover (easy if connectionless, e.g. web clients)

88 ©1996, 1997 Microsoft Corp. 87 Virtual Servers (before failover)  Nodes \\Y and \\Z support virtual servers \\A and \\B (diagram: “SAP on A” with SQL and disk S:\ on \\Y; “SAP on B” with SQL and disk T:\ on \\Z)  Things that need to fail over transparently:  Client connection  Server dependencies  Service names  Binding to local resources  Binding to local servers

89 ©1996, 1997 Microsoft Corp. 88 Virtual Servers (just after failover)  \\Y’s resources and groups (i.e. Virtual Server \\A) moved to \\Z  \\A’s resources bind to each other and to local resources (e.g., the local file system)  Registry  Physical resources  Security domain  Time  Transactions are used to make DB state consistent.  To “work”, local resources on \\Y and \\Z have to be similar  e.g. time must remain monotonic after failover

90 ©1996, 1997 Microsoft Corp. 89 Address Failover and Client Reconnection  Name and address rebind to the new node  details later  Clients reconnect  Failure is not transparent  must log on again  client context is lost (encourages connectionless design)  Applications could maintain context

91 ©1996, 1997 Microsoft Corp. 90 Mapping Local References to Group-Relative References  Send client requests to the correct server:  \\A\SAP refers to \\.\SQL  \\B\SAP refers to \\.\SQL  Must remap references:  \\A\SAP to \\.\SQL$A  \\B\SAP to \\.\SQL$B  Also handles namespace collision  Done via  modifying server apps, or  DLLs to transparently rename
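The renaming scheme above is just suffixing the local name with the virtual server's name, so two virtual servers hosted on one node cannot collide. A one-line sketch (Python; the helper name is made up):

```python
# Group-relative renaming as described on the slide: a local service name
# like \\.\SQL becomes \\.\SQL$A on virtual server A.
def remap(local_name, virtual_server):
    return f"{local_name}${virtual_server}"
```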

92 ©1996, 1997 Microsoft Corp. 91 Naming and Binding and Failover  Services rely on the NT node name and/or IP address to advertise shares, printers, and services.  Applications register names to advertise services  Example: \\Alice\SQL  Example: 128.2.2.2:80 (= http://www.foo.com/)  Binding  Clients bind to an address (e.g. name -> IP address)  Thus the node name and IP address must fail over along with the services (preserving client bindings)

93 ©1996, 1997 Microsoft Corp. 92 Client to Cluster Communications  IP address mobility is based on MAC rebinding (diagram: the router maps 200.110.120.4 -> AliceMAC, 200.110.120.5 (Virtual Alice) -> AliceMAC, 200.110.120.6 -> BettyMAC, 200.110.120.7 (Virtual Betty) -> BettyMAC)  Cluster clients  must use IP (TCP, UDP, NBT,...)  must reconnect or retry after failure  Cluster servers  all cluster nodes must be on the same LAN segment  the IP address rebinds to the failover node’s MAC address  transparent to client and server  low-level ARP (address resolution protocol) rebinds the IP address to the new MAC address

94 ©1996, 1997 Microsoft Corp. 93 Time  Time must increase monotonically  otherwise applications get confused  e.g. make/nmake/build  Time is maintained within the failover resolution  not hard, since failover is on the order of seconds  Time is a resource, so one node owns the time resource  other nodes periodically correct drift from the owner’s time
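One simple way to honor the monotonicity requirement while syncing to the time owner is to clamp: adjust toward the owner's clock, but never step backwards. A sketch (Python; the class is illustrative, not the actual MSCS mechanism):

```python
# Keep cluster time monotonic while correcting drift toward the time
# owner: never move backwards, just hold until the owner's clock catches up.
class ClusterClock:
    def __init__(self, now):
        self.last = now

    def read(self, local_now, owner_now=None):
        """Return a monotonically increasing time, nudged to the owner's."""
        t = local_now if owner_now is None else owner_now
        self.last = max(self.last, t)   # clamp: never step backwards
        return self.last
```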

95 ©1996, 1997 Microsoft Corp. 94 Application Local NT Registry Checkpointing  Resources can request that local NT registry sub-trees be replicated  Changes are written out to the quorum device on each update  Uses the registry change-notification interface  Changes are read and applied on fail-over

96 ©1996, 1997 Microsoft Corp. 95 Registry Replication

97 ©1996, 1997 Microsoft Corp. 96 Application Support  Virtual Servers  Generic Resource DLLs  Resource DLL VC++ Wizard  Cluster API

98 ©1996, 1997 Microsoft Corp. 97 Generic Resource DLLs  Generic Application DLL  Simplest: just starts and stops the application, and makes sure the process is alive  Generic Service DLL  Translates DLL calls into the equivalent NT service calls:  Online => Service Start  Offline => Service Stop  Looks/IsAlive => Service Status

99 ©1996, 1997 Microsoft Corp. 98 Generic Application

100 ©1996, 1997 Microsoft Corp. 99 Generic Service

101 ©1996, 1997 Microsoft Corp. 100 Application Support  Virtual Servers  Generic Resource DLLs  Resource DLL VC++ Wizard  Cluster API

102 ©1996, 1997 Microsoft Corp. 101 Resource DLL VC++ Wizard  Asks for resource type name  Asks for optional service to control  Asks for other parameters (and associated types)  Generates DLL source code  Source can be modified as necessary  E.g. additional checks for Looks/IsAlive

103 ©1996, 1997 Microsoft Corp. 102 Creating a New Workspace

104 ©1996, 1997 Microsoft Corp. 103 Specifying Resource Type Name

105 ©1996, 1997 Microsoft Corp. 104 Specifying Resource Parameters

106 ©1996, 1997 Microsoft Corp. 105 Automatic Code Generation

107 ©1996, 1997 Microsoft Corp. 106 Customizing The Code

108 ©1996, 1997 Microsoft Corp. 107 Application Support  Virtual Servers  Generic Resource DLLs  Resource DLL VC++ Wizard  Cluster API

109 ©1996, 1997 Microsoft Corp. 108 Cluster API  Allows resources to:  Examine dependencies  Manage per-resource data  Change parameters (e.g. failover)  Listen for cluster events  etc.  Specs & API became public Sept 1996  On all MSDN Level 3  On web site:  http://www.microsoft.com/clustering.htm

110 ©1996, 1997 Microsoft Corp. 109 Cluster API Documentation

111 ©1996, 1997 Microsoft Corp. 110 Outline  Why FT and Why Clusters  Cluster Abstractions  Cluster Architecture  Cluster Implementation  Application Support  Q&A

112 ©1996, 1997 Microsoft Corp. 111 Research Topics?  Even easier to manage  Transparent failover  Instant failover  Geographic distribution (disaster tolerance)  Server pools (load-balanced pool of processes)  Process pair (active/backup process)  10,000 nodes?  Better algorithms  Shared memory or shared disk among nodes  a truly bad idea?

113 ©1996, 1997 Microsoft Corp. 112 References
Microsoft NT site: http://www.microsoft.com/ntserver/
BARC site: http://research.microsoft.com/BARC
These slides: http://research.microsoft.com/~joebar/ftcs-27/ftcs20.ppt
Inside Windows NT, H. Custer, Microsoft Press, ISBN 155615481.
"Tandem Global Update Protocol", R. Carr, Tandem Systems Review, V1.2, 1985; sketches the regroup and global update protocols.
"VAXclusters: a Closely Coupled Distributed System", Kronenberg, N., Levy, H., Strecker, W., ACM TOCS, V4.2, 1986; a (the) shared-disk cluster.
In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Gregory F. Pfister, Prentice Hall, 1995, ISBN 0134376250; argues for shared-nothing.
Transaction Processing: Concepts and Techniques, Gray, J., Reuter, A., Morgan Kaufmann, 1994, ISBN 1558601902; survey of outages and transaction techniques.

