Published by Sophia Doris Wheeler. Modified over 8 years ago.
2
©1996, 1997 Microsoft Corp. 1 FT NT: A Tutorial on Microsoft Cluster Server™ (formerly “Wolfpack”)
  Joe Barrera, Jim Gray
  Microsoft Research
  {joebar, gray}@microsoft.com
  http://research.microsoft.com/barc
3
©1996, 1997 Microsoft Corp. 2 Outline Why FT and Why Clusters Cluster Abstractions Cluster Architecture Cluster Implementation Application Support Q&A
4
©1996, 1997 Microsoft Corp. 3 DEPENDABILITY: The 3 ITIES
  RELIABILITY / INTEGRITY: Does the right thing. (also large MTTF)
  AVAILABILITY: Does it now. (also small MTTR / (MTTF + MTTR))
  System Availability: If 90% of terminals up & 99% of DB up? (=> 89% of transactions are serviced on time).
  Holistic vs. Reductionist view
  [Figure: Security, Integrity, Reliability, Availability]
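The availability arithmetic on this slide can be checked with a short Python sketch. It assumes the usual steady-state formula, availability = MTTF / (MTTF + MTTR), and that terminal and database outages are independent, so their availabilities multiply. The function names are illustrative and not part of the tutorial.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state fraction of time a component is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series_availability(*parts: float) -> float:
    """Availability of a chain in which every part must be up.

    Assumes the parts fail independently of each other.
    """
    result = 1.0
    for a in parts:
        result *= a
    return result


# Slide example: 90% of terminals up and 99% of the database up.
system = series_availability(0.90, 0.99)
print(f"{system:.3f}")  # 0.891, i.e. about 89% of transactions serviced on time
```

The same `availability` helper shows why a small MTTR matters: a component with a 99-hour MTTF and a 1-hour MTTR is up 99% of the time.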
5
©1996, 1997 Microsoft Corp. 4 Case Study - Japan
  "Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe).
  1,383 institutions reported (6/84 - 7/85)
  7,517 outages, MTTF ~ 10 weeks, avg duration ~ 90 MINUTES

  Cause                            MTTF
  Vendor (hardware and software)   5 months
  Application software             9 months
  Communications lines             1.5 years
  Operations                       2 years
  Environment                      2 years
  Overall                          10 weeks

  [Pie chart of outage share by cause. Values: 42%, 12%, 25%, 9.3%, 11.2%. Labels: Vendor, Environment, Operations, Application Software, Telecomm lines]
  To Get 10 Year MTTF, Must Attack All These Areas
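As a rough cross-check of the per-cause figures against the overall MTTF, the sketch below treats the five causes as independent failure sources, so their failure rates (1/MTTF) add. That independence assumption and the month-to-week conversion are mine, not the survey's.

```python
MONTHS_PER_YEAR = 12
WEEKS_PER_MONTH = 52 / 12

# Per-cause MTTF from the slide, converted to months.
mttf_months = {
    "vendor": 5,
    "application software": 9,
    "communications lines": 1.5 * MONTHS_PER_YEAR,
    "operations": 2 * MONTHS_PER_YEAR,
    "environment": 2 * MONTHS_PER_YEAR,
}

# Independent failure sources: the combined failure rate is the sum of the rates.
total_rate_per_month = sum(1 / m for m in mttf_months.values())
overall_mttf_weeks = (1 / total_rate_per_month) * WEEKS_PER_MONTH

print(f"overall MTTF ~ {overall_mttf_weeks:.1f} weeks")
```

The result lands a little under 10 weeks, which is in line with the "MTTF ~ 10 weeks" reported on the slide.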
6
©1996, 1997 Microsoft Corp. 5 Case Studies - Tandem Trends
  MTTF improved
  Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%)
  NOTE: Systematic under-reporting of Environment, Operations errors, Application Software
7
©1996, 1997 Microsoft Corp. 6 Summary of FT Studies Current Situation: ~4-year MTTF => Fault Tolerance Works. Hardware is GREAT (maintenance and MTTF). Software masks most hardware faults. Many hidden software outages in operations: New Software. Utilities. Must make all software ONLINE. Software seems to define a 30-year MTTF ceiling. Reasonable Goal: 100-year MTTF. class 4 today => class 6 tomorrow.
8
©1996, 1997 Microsoft Corp. 7 Fault Tolerance vs Disaster Tolerance Fault-Tolerance: mask local faults RAID disks Uninterruptible Power Supplies Cluster Failover Disaster Tolerance: masks site failures Protects against fire, flood, sabotage,.. Redundant system and service at remote site.
9
©1996, 1997 Microsoft Corp. 8 The Microsoft “Vision”: Plug & Play Dependability Integrity / Security Integrity Reliability Availability Transactions for reliability Clusters: for availability Security All built into the OS
10
©1996, 1997 Microsoft Corp. 9 Cluster Goals
  Manageability
    Manage nodes as a single system
    Perform server maintenance without affecting users
    Mask faults, so repair is non-disruptive
  Availability
    Restart failed applications & servers
    un-availability ~ MTTR / MTBF, so quick repair.
    Detect/warn administrators of failures
  Scalability
    Add nodes for incremental processing, storage, bandwidth
11
©1996, 1997 Microsoft Corp. 10 Fault Model Failures are independent So, single fault tolerance is a big win Hardware fails fast (blue-screen) Software fails-fast (or goes to sleep) Software often repaired by reboot: Heisenbugs Operations tasks: major source of outage Utility operations Software upgrades
12
©1996, 1997 Microsoft Corp. 11 Cluster: Servers Combined to Improve Availability & Scalability Cluster: A group of independent systems working together as a single system. Clients see scalable & FT services (single system image). Node : A server in a cluster. May be an SMP server. Interconnect : Communications link used for intra- cluster status info such as “heartbeats”. Can be Ethernet. Client PCs Printers Server A Disk array A Disk array B Server B Interconnect
13
©1996, 1997 Microsoft Corp. 12 Microsoft Cluster Server ™ 2-node availability Summer 97 (20,000 Beta Testers now) Commoditize fault-tolerance (high availability) Commodity hardware (no special hardware) Easy to set up and manage Lots of applications work out of the box. 16-node scalability later (next year?)
14
©1996, 1997 Microsoft Corp. 13 Web site Database Web site files Database files Server 1 Server 2 Browser Failover Example Web site Database Server 1 Server 2
15
©1996, 1997 Microsoft Corp. 14 Client/Server Software failure Admin shutdown Server failure MS Press Failover Demo ! Resource States - Pending - Partial - Failed - Offline
16
©1996, 1997 Microsoft Corp. Windows NT Server Cluster SCSI Disk Cabinet Shared Disks Local Disks Demo Configuration Server “Alice” SMP Pentium ® Pro Processors Windows NT Server with Wolfpack Microsoft Internet Information Server Microsoft SQL Server Server “Betty” SMP Pentium ® Pro Processors Windows NT Server with Wolfpack Microsoft Internet Information Server Microsoft SQL Server Interconnect standard Ethernet Client Windows NT Workstation Internet Explorer MS Press OLTP app Administrator Windows NT Workstation Cluster Admin SQL Enterprise Mgr Local Disks
17
©1996, 1997 Microsoft Corp. Windows NT Server Cluster SCSI Disk Cabinet Shared Disks Local Disks Demo Administration Client Server “Alice” Runs SQL Trace Runs Globe Server “Betty” Run SQL Trace Local Disks Cluster Admin Console Windows GUI Shows cluster resource status Replicates status to all servers Define apps & related resources Define resource dependencies Orchestrates recovery order SQL Enterprise Mgr Windows GUI Shows server status Manages many servers Start, stop manage DBs
18
©1996, 1997 Microsoft Corp. 17 Generic Stateless Application Rotating Globe Mplay32 is generic app. Registered with MSCS MSCS restarts it on failure Move/restart ~ 2 seconds Fail-over if 4 failures (= process exits) in 3 minutes settable default
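The "4 failures in 3 minutes" rule can be sketched as a sliding-window counter. This is an illustration only: the class name, the return values, and the choice that the fourth failure inside the window is the one that triggers failover are assumptions, not documented MSCS behavior.

```python
from collections import deque


class RestartPolicy:
    """Restart locally until `threshold` failures fall inside `period_s` seconds."""

    def __init__(self, threshold: int = 4, period_s: float = 180.0):
        self.threshold = threshold
        self.period_s = period_s
        self._failures = deque()  # timestamps of recent failures, oldest first

    def on_failure(self, now_s: float) -> str:
        self._failures.append(now_s)
        # Forget failures that have dropped out of the time window.
        while self._failures and now_s - self._failures[0] > self.period_s:
            self._failures.popleft()
        return "failover" if len(self._failures) >= self.threshold else "restart"


policy = RestartPolicy()
actions = [policy.on_failure(t) for t in (0, 50, 100, 150)]
print(actions)  # ['restart', 'restart', 'restart', 'failover']
```

Failures spaced further apart than the window never accumulate, so the application keeps being restarted in place.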
19
©1996, 1997 Microsoft Corp. Windows NT Server Cluster SCSI Disk Cabinet Shared Disks Local Disks Demo Moving or Failing Over An Application Local Disks AVI Application X Alice Fails or Operator Requests move AVI Application X
20
©1996, 1997 Microsoft Corp. 19 Generic Stateful Application NotePad Notepad saves state on shared disk Failure before save => lost changes Failover or move (disk & state move)
21
©1996, 1997 Microsoft Corp. Windows NT Server Cluster SCSI Disk Cabinet Shared Disks Local Disks Demo Step 1: Alice Delivering Service Local Disks No SQL Activity SQL Activity IIS SQL HTTP ODBC IP IIS SQL ODBC
22
©1996, 1997 Microsoft Corp. Windows NT Server Cluster SCSI Disk Cabinet Shared Disks Local Disks 2: Request Move to Betty Local Disks HTTP IIS SQL ODBC IP IIS SQL ODBC No SQL Activity IP SQL Activity
23
©1996, 1997 Microsoft Corp. Windows NT Server Cluster SCSI Disk Cabinet Shared Disks Local Disks 3: Betty Delivering Service Local Disks IIS SQL ODBC IIS SQL ODBC No SQL Activity IP. SQL Activity
24
©1996, 1997 Microsoft Corp. Windows NT Server Cluster SCSI Disk Cabinet Shared Disks Local Disks 4: Power Fail Betty, Alice Takeover Local Disks IIS SQL ODBC No SQL Activity IP SQL Activity IIS SQL ODBC IP
25
©1996, 1997 Microsoft Corp. Windows NT Server Cluster SCSI Disk Cabinet Shared Disks Local Disks 5: Alice Delivering Service Local Disks No SQL Activity SQL Activity IIS SQL HTTP ODBC IP
26
©1996, 1997 Microsoft Corp. Windows NT Server Cluster SCSI Disk Cabinet Shared Disks Local Disks 6: Reboot Betty, now can takeover Local Disks No SQL Activity SQL Activity IIS SQL HTTP ODBC IP IIS SQL ODBC
27
©1996, 1997 Microsoft Corp. 26 Outline Why FT and Why Clusters Cluster Abstractions Cluster Architecture Cluster Implementation Application Support Q&A
28
©1996, 1997 Microsoft Corp. 27 Cluster and NT Abstractions ClusterGroup Resource DomainNode Service Cluster Abstractions NT Abstractions
29
©1996, 1997 Microsoft Corp. 28 Basic NT Abstractions DomainNode Service Service: program or device managed by a node e.g., file service, print service, database server can depend on other services (startup ordering) can be started, stopped, paused, failed Node: a single (tightly-coupled) NT system hosts services; belongs to a domain services on node always remain co-located unit of service co-location; involved in naming services Domain: a collection of nodes cooperation for authentication, administration, naming
30
©1996, 1997 Microsoft Corp. 29 Cluster Abstractions ClusterResourceGroup Resource Resource: program or device managed by a cluster e.g., file service, print service, database server can depend on other resources (startup ordering) can be online, offline, paused, failed Resource Group: a collection of related resources hosts resources; belongs to a cluster unit of co-location; involved in naming resources Cluster: a collection of nodes, resources, and groups cooperation for authentication, administration, naming
31
©1996, 1997 Microsoft Corp. 30 Resources Resources have... Type: what it does (file, DB, print, web…) An operational state (online/offline/failed) Current and possible nodes Containing Resource Group Dependencies on other resources Restart parameters (in case of resource failure) ClusterGroup Resource
32
©1996, 1997 Microsoft Corp. 31 Resource Types Built-in types Generic Application Generic Service Internet Information Server (IIS) Virtual Root Network Name TCP/IP Address Physical Disk FT Disk (Software RAID) Print Spooler File Share Added by others Microsoft SQL Server, Message Queues, Exchange Mail Server, Oracle, SAP R/3 Your application? (use developer kit wizard).
33
©1996, 1997 Microsoft Corp. 32 Physical Disk
34
©1996, 1997 Microsoft Corp. 33 TCP/IP Address
35
©1996, 1997 Microsoft Corp. 34 Network Name
36
©1996, 1997 Microsoft Corp. 35 File Share
37
©1996, 1997 Microsoft Corp. 36 IIS (WWW/FTP) Server
38
©1996, 1997 Microsoft Corp. 37 Print Spooler
39
©1996, 1997 Microsoft Corp. 38 Resource States Resources states: Offline: exists, not offering service Online: offering service Failed: not able to offer service Resource failure may cause: local restart other resources to go offline resource group to move (all subject to group and resource parameters) Resource failure detected by: Polling failure Node failure Online Pending Online Failed Offline Pending Go Online! I’m Online! I’m Off-line! Go Off-line! I’m here!
40
©1996, 1997 Microsoft Corp. 39 Resource Dependencies Similar to NT Service Dependencies Orderly startup & shutdown A resource is brought online after any resources it depends on are online. A Resource is taken offline before any resources it depends on Interdependent resources Form dependency trees move among nodes together failover together as per resource group Network Name IP Address Resource DLL IIS Virtual Root File Share
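The ordering rule (bring a resource online after its dependencies, take it offline before them) is a topological sort. The sketch below uses Python's standard `graphlib`; the dependency edges are a made-up example, since the slide's figure does not survive in this transcript.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency tree: resource -> resources it depends on.
depends_on = {
    "IP Address": set(),
    "Network Name": {"IP Address"},
    "Physical Disk": set(),
    "File Share": {"Physical Disk", "Network Name"},
    "IIS Virtual Root": {"File Share"},
}

# static_order() yields each resource only after all of its dependencies.
online_order = list(TopologicalSorter(depends_on).static_order())

# Shutdown walks the same order backwards: dependents go offline first.
offline_order = list(reversed(online_order))

print("online: ", online_order)
print("offline:", offline_order)
```

Because dependencies never cross groups, each resource group can be sorted on its own.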
41
©1996, 1997 Microsoft Corp. 40 Dependencies Tab
42
©1996, 1997 Microsoft Corp. 41 NT Registry
  Stores all configuration information
    Software
    Hardware
  Hierarchical (name, value) map
  Has an open, documented interface
  Is secure
  Is visible across the net (RPC interface)
  Typical Entry:
    \Software\Microsoft\MSSQLServer\MSSQLServer\
      DefaultLogin = “GUEST”
      DefaultDomain = “REDMOND”
43
©1996, 1997 Microsoft Corp. 42 Cluster Registry Separate from local NT Registry Replicated at each node Algorithms explained later Maintains configuration information: Cluster members Cluster resources Resource and group parameters (e.g. restart) Stable storage Refreshed from “master” copy when node joins cluster
44
©1996, 1997 Microsoft Corp. 43 Other Resource Properties Name Restart policy (restart N times, failover…) Startup parameters Private configuration info (resource type specific) Per-node as well, if necessary Poll Intervals (LooksAlive, IsAlive, Timeout ) These properties are all kept in Cluster Registry
45
©1996, 1997 Microsoft Corp. 44 General Resource Tab
46
©1996, 1997 Microsoft Corp. 45 Advanced Resource Tab
47
©1996, 1997 Microsoft Corp. 46 Resource Groups Every resource belongs to a resource group. Resource groups move (failover) as a unit Dependencies NEVER cross groups. (Dependency trees contained within groups.) Group may contain forest of dependency trees ClusterGroup Resource Drive E:IP Address SQL Server Web Server Drive F: Payroll Group
48
©1996, 1997 Microsoft Corp. 47 Moving a Resource Group
49
©1996, 1997 Microsoft Corp. 48 Group Properties
  CurrentState: Online, Partially Online, Offline
  Members: resources that belong to group
    members determine which nodes can host group.
  Preferred Owners: ordered list of host nodes
  FailoverThreshold: How many faults cause failover
  FailoverPeriod: Time window for failover threshold
  FailbackWindowStart: When can failback happen?
  FailbackWindowEnd: When can failback happen?
  Everything (except CurrentState) is stored in registry
50
©1996, 1997 Microsoft Corp. 49 Failover and Failback Failover parameters timeout on LooksAlive, IsAlive # local restarts in failure window after this, offline. Failback to preferred node (during failback window) Do resource failures affect group? Cluster Service name IPaddr Cluster Service Node \\Betty Node \\Alice Failover Failback
51
©1996, 1997 Microsoft Corp. 50 Cluster Concepts Clusters ClusterGroup Resource Group Group Group Resource Resource Resource
52
©1996, 1997 Microsoft Corp. 51 Cluster Properties Defined Members: nodes that can join the cluster Active Members: nodes currently joined to cluster Resource Groups : groups in a cluster Quorum Resource : Stores copy of cluster registry. Used to form quorum. Network : Which network used for communication All properties kept in Cluster Registry
53
©1996, 1997 Microsoft Corp. 52 Cluster API Functions (operations on nodes & groups) Find and communicate with Cluster Query/Set Cluster properties Enumerate Cluster objects Nodes Groups Resources and Resource Types Cluster Event Notifications Node state and property changes Group state and property changes Resource state and property changes
54
©1996, 1997 Microsoft Corp. 53 Cluster Management
55
©1996, 1997 Microsoft Corp. 54 Demo Server startup and shutdown Installing applications Changing status Failing over Transferring ownership of groups or resources Deleting Groups and Resources
56
©1996, 1997 Microsoft Corp. 55 Outline Why FT and Why Clusters Cluster Abstractions Cluster Architecture Cluster Implementation Application Support Q&A
57
©1996, 1997 Microsoft Corp. 56 Architecture Top tier provides cluster abstractions Middle tier provides distributed operations Bottom tier is NT and drivers Windows NT Server Membership Global Update Failover Manager Cluster Registry Resource Monitor Cluster Disk Driver Cluster Net Drivers Quorum
58
©1996, 1997 Microsoft Corp. 57 Membership and Regroup
  Membership:
    Used for orderly addition and removal from { active nodes }
  Regroup:
    Used for failure detection (via heartbeat messages)
    Forceful eviction from { active nodes }
  [Architecture diagram: Failover Manager, Cluster Registry, Resource Monitor; Membership, Regroup, Global Update; Cluster Disk Driver, Cluster Net Drivers; Windows NT Server]
59
©1996, 1997 Microsoft Corp. 58 Membership Defined cluster = all nodes Active cluster: Subset of defined cluster Includes Quorum Resource Stable (no regroup in progress) Windows NT Server Regroup Global Update Failover Manager Cluster Registry Resource Monitor Cluster Disk Driver Cluster Net Drivers Membership
60
©1996, 1997 Microsoft Corp. 59 Quorum Resource Usually (but not necessarily) a SCSI disk Requirements: Arbitrates for a resource by supporting the challenge/defense protocol Capable of storing cluster registry and logs Configuration Change Logs Tracks changes to configuration database when any defined member missing (not active) Prevents configuration partitions in time
61
©1996, 1997 Microsoft Corp. 60 Challenge/Defense Protocol
  SCSI-2 has reserve/release verbs
    Semaphore on disk controller
  Owner gets lease on semaphore
    Renews lease once every 3 seconds
  To preempt ownership:
    Challenger clears semaphore (SCSI bus reset)
    Waits 10 seconds
      (3 seconds for renewal + 2 seconds bus settle time) x 2, to give owner two chances to renew
    If still clear, then former owner loses lease
    Challenger issues reserve to acquire semaphore
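The timing of the challenge can be sketched as follows. This is a toy model, not SCSI code: the owner's renewals are given as a list of times after the bus reset, and the constant and function names are invented for the illustration.

```python
RENEW_INTERVAL_S = 3   # owner renews its reservation every 3 seconds
BUS_SETTLE_S = 2       # settle time after a SCSI bus reset
CHALLENGE_WAIT_S = 2 * (RENEW_INTERVAL_S + BUS_SETTLE_S)  # 10 seconds


def challenge(owner_renewal_times_s):
    """Decide the winner of one challenge.

    owner_renewal_times_s: times, in seconds after the bus reset, at which the
    current owner manages to re-issue its reserve. An empty list models a dead owner.
    """
    defended = any(0 <= t <= CHALLENGE_WAIT_S for t in owner_renewal_times_s)
    return "defender" if defended else "challenger"


print(challenge([3, 6, 9]))  # defender: the reservation reappears within the wait
print(challenge([]))         # challenger: nothing renewed, so it may reserve the disk
```

Doubling the 5-second renewal-plus-settle interval is what gives a live owner two chances to defend before it loses the lease.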
62
©1996, 1997 Microsoft Corp. 61 Challenge/Defense Protocol: Successful Defense
  [Timeline figure, seconds 0 to 16. Nodes: Defender Node, Challenger Node. Events: Reserve, Bus Reset, Reserve, Reservation detected]
63
©1996, 1997 Microsoft Corp. 62 Challenge/Defense Protocol: Successful Challenge
  [Timeline figure, seconds 0 to 16. Nodes: Defender Node, Challenger Node. Events: Reserve, Bus Reset, No reservation detected, Reserve]
64
©1996, 1997 Microsoft Corp. 63 Regroup
  Invariant: All members agree on { members }
  Regroup re-computes { members }
  Each node sends heartbeat message to a peer (default is one per second)
  Regroup if two lost heartbeat messages
    suspicion that sender is dead
    failure detection in bounded time
  Uses a 5-round protocol to agree.
    Checks communication among nodes.
    Suspected missing node may survive.
  Upper levels (global update, etc.) informed of regroup event.
  [Architecture diagram: Failover Manager, Cluster Registry, Resource Monitor; Membership, Regroup, Global Update; Cluster Disk Driver, Cluster Net Drivers; Windows NT Server]
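The heartbeat trigger can be sketched as a simple timeout check. The slide gives the one-second default and the two-lost-heartbeats rule; counting "lost" as whole silent intervals, and the names below, are assumptions made for this sketch.

```python
HEARTBEAT_INTERVAL_S = 1.0  # slide default: one heartbeat per second
MISSED_LIMIT = 2            # regroup after two lost heartbeats


def should_regroup(last_heard_s: float, now_s: float) -> bool:
    """True once at least MISSED_LIMIT full heartbeat intervals passed in silence."""
    missed = int((now_s - last_heard_s) // HEARTBEAT_INTERVAL_S)
    return missed >= MISSED_LIMIT


print(should_regroup(last_heard_s=10.0, now_s=11.5))  # False: one interval missed
print(should_regroup(last_heard_s=10.0, now_s=12.5))  # True: two intervals missed
```

Because the check depends only on elapsed time, a dead sender is suspected within a bounded delay, which is the property the slide calls out.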
65
©1996, 1997 Microsoft Corp. 64 Membership State Machine Initialize Joining Member Search Sleeping Quorum Disk Search Regroup Forming Online Start Cluster Found Online Member Acquire (reserve) Quorum Disk Join Succeeds Synchronize Succeeds Search or Reserve Fails Search Fails Minority or no Quorum Non-Minority and Quorum Lost Heartbeat
66
©1996, 1997 Microsoft Corp. 65 Joining a Cluster
  When a node starts up, it mounts and configures only local, non-cluster devices
  Starts Cluster Service which looks in local (stale) registry for members
  Asks each member in turn to sponsor new node’s membership. (Stop when sponsor found.)
  Sponsor (any active member)
    Sponsor authenticates applicant
    Broadcasts applicant to cluster members
    Sponsor sends updated registry to applicant
  Applicant becomes a cluster member
67
©1996, 1997 Microsoft Corp. 66 Forming a Cluster (when Joining fails)
  Use registry to find quorum resource
  Attach to (arbitrate for) quorum resource
  Update cluster registry from quorum resource
    e.g. if we were down when it was in use
  Form new one-node cluster
  Bring other cluster resources online
  Let others join your cluster
68
©1996, 1997 Microsoft Corp. 67 Leaving A Cluster (Gracefully)
  Pause:
    Move all groups off this member.
    Change to paused state (remains a cluster member)
  Offline:
    Move all groups off this member.
    Sends ClusterExit message to all cluster members
      Prevents regroup
      Prevents stalls during departure transitions
    Close Cluster connections (now not an active cluster member)
    Cluster service stops on node
  Evict: remove node from defined member list
69
©1996, 1997 Microsoft Corp. 68 Leaving a Cluster (Node Failure)
  Node (or communication) failure triggers Regroup
  If after regroup:
    Minority group OR no quorum device: group does NOT survive
    Non-minority group AND quorum device: group DOES survive
  Non-Minority rule:
    Number of new members >= 1/2 old active cluster
    Prevents minority from seizing quorum device at the expense of a larger potentially surviving cluster
  Quorum guarantees correctness
    Prevents “split-brain”, e.g. with newly forming cluster containing a single node
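The survival test after a regroup combines the two conditions on this slide. A minimal sketch, with a hypothetical function name and integer arithmetic to avoid rounding:

```python
def group_survives(new_members: int, old_active: int, holds_quorum: bool) -> bool:
    """Survive only if non-minority (new >= half of old) AND owning the quorum device."""
    non_minority = 2 * new_members >= old_active
    return non_minority and holds_quorum


# Two-node cluster split by a communication failure: each side has 1 of 2 members,
# so both pass the non-minority rule and the quorum device breaks the tie.
print(group_survives(1, 2, holds_quorum=True))   # True
print(group_survives(1, 2, holds_quorum=False))  # False

# One node cut off from a four-node cluster is a minority even if it grabbed the disk.
print(group_survives(1, 4, holds_quorum=True))   # False
```

Since only one partition can hold the quorum device at a time, at most one group survives, which is how split-brain is avoided.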
70
©1996, 1997 Microsoft Corp. 69 Global Update Propagates updates to all nodes in cluster Used to maintain replicated cluster registry Updates are atomic and totally ordered Tolerates all benign failures. Depends on membership all are up all can communicate R. Carr, Tandem Systems Review. V1.2 1985, sketches regroup and global update protocol. Windows NT Server Regroup Global Update Failover Manager Cluster Registry Resource Monitor Cluster Disk Driver Cluster Net Drivers Membership
71
©1996, 1997 Microsoft Corp. 70 Global Update Algorithm
  Cluster has locker node that regulates updates.
    Oldest active node in cluster
  Send Update to locker node
  Update other (active) nodes
    in seniority order (e.g. locker first)
    this includes the updating node
  Failure of all updated nodes:
    Update never happened
    Updated nodes will roll back on recovery
  Survival of any updated nodes:
    New locker is oldest and so has update if any do.
    New locker restarts update
  [Figure: nodes S and L exchanging “X=100!” and “ack”]
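A heavily simplified simulation of the update order and the recovery rule is sketched below. It models only one failure case (the sender dying part-way through), keeps the sender outside the node list, and uses invented names throughout; it is not the Tandem or MSCS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    alive: bool = True
    registry: dict = field(default_factory=dict)


def global_update(nodes, key, value, stop_after=None):
    """Apply key=value to active nodes in seniority order (nodes[0] is the oldest).

    stop_after simulates the sender dying after that many nodes were updated.
    Returns True if every active node ends up with the update.
    """
    active = [n for n in nodes if n.alive]
    for count, node in enumerate(active):
        if stop_after is not None and count == stop_after:
            break  # sender failed mid-update
        node.registry[key] = value
    else:
        return True  # loop finished without interruption

    # Recovery: the oldest active node is the locker. It was updated first, so
    # if any node has the update the locker has it and can restart the update.
    locker = active[0]
    if locker.registry.get(key) == value:
        for node in active:
            node.registry[key] = value
        return True
    return False  # nobody was updated: the update never happened


cluster = [Node("alice"), Node("betty"), Node("carol")]
ok = global_update(cluster, "X", 100, stop_after=1)
print(ok, [n.registry for n in cluster])
```

Updating in seniority order is the design point being illustrated: it guarantees that the locker is never behind any other node, so recovery needs to consult only the locker.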
72
©1996, 1997 Microsoft Corp. 71 Cluster Registry Separate from local NT Registry Maintains cluster configuration members, resources, restart parameters, etc. Stable storage Replicated at each member Global Update protocol NT Registry keeps local copy Windows NT Server Regroup Global Update Failover Manager Cluster Registry Resource Monitor Cluster Disk Driver Cluster Net Drivers Membership
73
©1996, 1997 Microsoft Corp. 72 Cluster Registry Bootstrapping Membership uses Cluster Registry for list of nodes …Circular dependency Solution: Membership uses stale local cluster registry Refresh after joining or forming cluster Master is either quorum device, or active members Windows NT Server Membership Global Update Failover Manager Cluster Registry Resource Monitor Cluster Disk Driver Cluster Net Drivers Regroup
74
©1996, 1997 Microsoft Corp. 73 Resource Monitor Polls resources: IsAlive and LooksAlive Detects failures polling failure failure event from resource Higher levels tell it Online, Offline Restart Windows NT Server Regroup Global Update Failover Manager Cluster Registry Resource Monitor Cluster Disk Driver Cluster Net Drivers Membership
75
©1996, 1997 Microsoft Corp. 74 Failover Manager Assigns groups to nodes based on Failover parameters Possible nodes for each resource in group Preferred nodes for resource group Windows NT Server Regroup Global Update Failover Manager Cluster Registry Resource Monitor Cluster Disk Driver Cluster Net Drivers Membership
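One way to read the three inputs on this slide is as a filter followed by a preference order. The sketch below is an assumption about how they could combine, with hypothetical names; the slides do not spell out the actual placement algorithm.

```python
def choose_owner(preferred_owners, possible_nodes_per_resource, active_nodes):
    """Pick the first preferred node that is active and can host every resource."""
    candidates = set(active_nodes)
    for possible in possible_nodes_per_resource.values():
        candidates &= set(possible)  # the group moves as a unit
    for node in preferred_owners:
        if node in candidates:
            return node
    # No preferred node is eligible: fall back to any eligible node.
    # min() is used only to make the choice deterministic.
    return min(candidates) if candidates else None


possible = {"Drive E:": ["alice", "betty"], "SQL Server": ["alice", "betty"]}
print(choose_owner(["alice", "betty"], possible, active_nodes=["betty"]))  # betty
```

Intersecting the per-resource node lists reflects the earlier statement that a group's members determine which nodes can host the group.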
76
©1996, 1997 Microsoft Corp. 75 Failover (Resource Goes Offline) Resource Manager Detects resource error. Attempt to restart resource. Has the Resource Retry limit been exceeded? Yes No Switch resource (and Dependants) Offline. Notify Failover Manager. Are Failover conditions within Constraints? Yes No Yes No Notify Failover Manager on the new system to bring resource Online. Leave Group in partially Online state. Wait for Failback Window Can another owner be found? (Arbitration) Failover Manager checks: Failover Window and Failover Threshold
77
©1996, 1997 Microsoft Corp. 76 Pushing a Group (Resource Failure)
  Resource Monitor notifies Resource Manager of resource failure.
  Resource Manager enumerates all objects in the Dependency Tree of the failed resource.
  Resource Manager takes each depending resource Offline.
  Any resource has “Affect the Group” True?
    No: Leave Group in partially Online state.
    Yes:
      Resource Manager notifies Failover Manager that the Dependency Tree is Offline and needs to fail over.
      Failover Manager performs Arbitration to locate a new owner for the group.
      Failover Manager on the new owner node brings the resources Online.
78
©1996, 1997 Microsoft Corp. 77 Pulling a Group (Node Failure)
  Cluster Service notifies Failover Manager of node failure.
  Failover Manager determines which groups were owned by the failed node.
  Resource Manager notifies Failover Manager that the node is Offline and the groups it owned need to fail over.
  Failover Manager performs Arbitration to locate a new owner for the groups.
  Failover Manager on the new owner(s) brings the resources Online in dependency order.
79
©1996, 1997 Microsoft Corp. 78 Failback to Preferred Owner Node
  Group may have a Preferred Owner
  Preferred Owner comes back online
  Will only occur during the Failback Window (time slot, e.g. at night)
  Steps:
    Preferred owner comes back Online.
    Is the time within the Failback Window?
    Resource Manager takes each resource on the current owner Offline.
    Resource Manager notifies Failover Manager that the Group is Offline and needs to fail over to the Preferred Owner.
    Failover Manager performs Arbitration to locate the Preferred Owner of the group.
    Failover Manager on the Preferred Owner brings the resources Online.
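The failback-window test is easy to get wrong when the window spans midnight, as a night-time slot usually does. A small sketch, assuming the window is given as start and end hours on a 24-hour clock (that representation is an assumption of this example):

```python
def in_failback_window(hour: int, start_hour: int, end_hour: int) -> bool:
    """True if hour (0-23) lies in [start_hour, end_hour), wrapping past midnight."""
    if start_hour <= end_hour:
        return start_hour <= hour < end_hour
    # Window wraps midnight, e.g. 22:00 to 04:00.
    return hour >= start_hour or hour < end_hour


print(in_failback_window(23, start_hour=22, end_hour=4))  # True
print(in_failback_window(12, start_hour=22, end_hour=4))  # False
```

In this sketch equal start and end hours give an empty window, so failback never happens; a real configuration might treat that case differently.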
80
©1996, 1997 Microsoft Corp. 79 Outline Why FT and Why Clusters Cluster Abstractions Cluster Architecture Cluster Implementation Application Support Q&A
81
©1996, 1997 Microsoft Corp. 80 Cluster Service Process Structure Cluster Service Failover Manager Cluster Registry Global Update Quorum Membership Resource Monitor Resource Monitor Resource DLLs Resources Services Applications A Node Resource Monitor Resource Monitor DLL Resource Private calls Private calls
82
©1996, 1997 Microsoft Corp. 81 Resource Control Resource Monitor DLL Resource Private calls Commands CreateResource() OnlineResource() OfflineResource() TerminateResource() CloseResource() ShutdownProcess() And resource events Resource Monitor Private calls Cluster Service A Node
83
©1996, 1997 Microsoft Corp. 82 Resource DLLs
  Calls to Resource DLL
    Open: get handle
    Online: start offering service
    Offline: stop offering service
      as a standby or pair-is offline
    LooksAlive: Quick check
    IsAlive: Thorough check
    Terminate: Forceful Offline
    Close: release handle
  [State diagram: Offline, Online Pending, Online, Offline Pending, Failed; messages: Go Online!, I’m Online!, Go Off-line!, I’m Off-line!, I’m here!]
  [Figure: Resource Monitor, DLL, Resource; private calls, std calls]
84
©1996, 1997 Microsoft Corp. 83 Cluster Service Resource Monitors Resource Monitors DCOM / RPC Cluster Communications Management apps Cluster Service Resource Monitors Resource Monitors DCOM / RPC DCOM DCOM / RPC: admin UDP: Heartbeat Most communication via DCOM /RPC UDP used for membership heartbeat messages Standard (e.g. Ethernet) interconnects
85
©1996, 1997 Microsoft Corp. 84 Outline Why FT and Why Clusters Cluster Abstractions Cluster Architecture Cluster Implementation Application Support Q&A
86
©1996, 1997 Microsoft Corp. 85 Application Support Virtual Servers Generic Resource DLLs Resource DLL VC++ Wizard Cluster API
87
©1996, 1997 Microsoft Corp. 86 Virtual Servers
  Problem: Client and Server Applications do not want node name to change when server app moves to another node.
  A Virtual Server simulates an NT Node
    Resource Group (name, disks, databases, …)
    NetName and IP address (node: \\a keeps name and IP address as it moves)
    Virtual Registry (registry “moves” (is replicated))
    Virtual Service Control
    Virtual RPC service
  Challenges:
    Limit app to virtual server’s devices and services.
    Client reconnect on failover (easy if connectionless -- eg web-clients)
  [Figure: Virtual Server \\a: 1.2.3.4, shown before and after the move]
88
©1996, 1997 Microsoft Corp. 87 Virtual Servers (before failover) Nodes \\Y and \\Z support virtual servers \\A and \\B Things that need to fail over transparently Client connection Server dependencies Service names Binding to local resources Binding to local servers SAP “SAP on A”“SAP on B” \\A \\B SAP SQL T:\S:\ \\Y \\Z
89
©1996, 1997 Microsoft Corp. 88 Virtual Servers (just after failover)
  \\Y resources and groups (i.e. Virtual Server \\A) moved to \\Z
  \\A resources bind to each other and to local resources (e.g., local file system)
    Registry
    Physical resource
    Security domain
    Time
  Transactions used to make DB state consistent.
  To “work”, local resources on \\Y and \\Z have to be similar
    E.g. time must remain monotonic after failover
  [Figure: nodes \\Y and \\Z; virtual servers \\A (“SAP on A”) and \\B (“SAP on B”), each with SAP and SQL; drives S:\ and T:\]
90
©1996, 1997 Microsoft Corp. 89 Address Failover and Client Reconnection Name and Address rebind to new node Details later Clients reconnect Failure not transparent Must log on again Client context lost (encourages connectionless) Applications could maintain context SAP SQL S:\ SAP SQL T:\ “SAP on A”“SAP on B” \\A\\B \\Y\\Z
91
©1996, 1997 Microsoft Corp. 90 Mapping Local References to Group-Relative References Send client requests to correct server \\A\SAP refers to \\.\SQL \\B\SAP refers to \\.\SQL Must remap references: \\A\SAP to \\.\SQL$A \\B\SAP to \\.\SQL$B Also handles namespace collision Done via modifying server apps, or DLLs to transparently rename SAP SQL S:\ SAP SQL T:\ “SAP on A”“SAP on B” \\A\\B \\Y\\Z
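The renaming on this slide amounts to tagging a node-local name with the virtual server it belongs to. The helper below is a hypothetical string-level illustration of that mapping, not the mechanism MSCS or the application DLLs actually use.

```python
def group_relative(reference: str, virtual_server: str) -> str:
    r"""Rewrite a node-local name such as \\.\SQL to \\.\SQL$A for virtual server \\A."""
    local_prefix = "\\\\.\\"  # the three characters-plus-dot prefix \\.\
    if not reference.startswith(local_prefix):
        return reference  # not a node-local reference; leave it untouched
    suffix = virtual_server.lstrip("\\")  # \\A -> A
    return f"{reference}${suffix}"


print(group_relative(r"\\.\SQL", r"\\A"))  # \\.\SQL$A
print(group_relative(r"\\.\SQL", r"\\B"))  # \\.\SQL$B
```

With both virtual servers hosted on one node after a failover, the two SQL instances now have distinct names, which is the namespace-collision point the slide makes.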
92
©1996, 1997 Microsoft Corp. 91 Naming and Binding and Failover
  Services rely on the NT node name and/or IP address to advertise Shares, Printers, and Services.
    Applications register names to advertise services
    Example: \\Alice\SQL (i.e. )
    Example: 128.2.2.2:80 (=http://www.foo.com/)
  Binding
    Clients bind to an address (e.g. name->IP address)
  Thus the node name and IP address must failover along with the services (preserve client bindings)
93
©1996, 1997 Microsoft Corp. 92 Client to Cluster Communications
  IP address mobility based on MAC rebinding
  Cluster Clients
    Must use IP (TCP, UDP, NBT, ...)
    Must Reconnect or Retry after failure
  Cluster Servers
    All cluster nodes must be on same LAN segment
    IP rebinds to failover MAC addr
    Transparent to client or server
    Low-level ARP (address resolution protocol) rebinds IP addr to new MAC addr.
  Router:
    200.110.120.4 -> AliceMAC
    200.110.120.5 -> AliceMAC
    200.110.120.6 -> BettyMAC
    200.110.120.7 -> BettyMAC
  [Figure: Client, WAN, Router, Local Network. Node labels: Alice 200.110.120.4, Virtual Alice 200.110.120.5, Betty 200.110.120.6, Virtual Betty 200.110.120.7; a second set of labels reads 200.110.12.4 through 200.110.12.7]
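The router table on this slide can be modeled as a plain mapping from IP address to MAC address; failing over a virtual server then means rebinding one entry. The function below is a toy model of that rebinding, with invented names, and says nothing about how the ARP update itself is sent.

```python
# Router's view before failover, taken from the slide.
arp_table = {
    "200.110.120.4": "AliceMAC",  # Alice
    "200.110.120.5": "AliceMAC",  # Virtual Alice
    "200.110.120.6": "BettyMAC",  # Betty
    "200.110.120.7": "BettyMAC",  # Virtual Betty
}


def fail_over(table, virtual_ip, new_mac):
    """Return a new table in which virtual_ip is bound to the surviving node's MAC."""
    updated = dict(table)  # leave the original mapping untouched
    updated[virtual_ip] = new_mac
    return updated


# Alice fails: Virtual Alice's address moves to Betty's network adapter.
after = fail_over(arp_table, "200.110.120.5", "BettyMAC")
print(after["200.110.120.5"])  # BettyMAC
```

Clients keep using the same IP address throughout, which is why they only need to reconnect or retry rather than look up a new server.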
94
©1996, 1997 Microsoft Corp. 93 Time Time must increase monotonically Otherwise applications get confused e.g. make/nmake/build Time is maintained within failover resolution Not hard, since failover on order of seconds Time is a resource, so one node owns time resource Other nodes periodically correct drift from owner’s time
95
©1996, 1997 Microsoft Corp. 94 Application Local NT Registry Checkpointing
  Resources can request that local NT registry sub-trees be replicated
  Changes written out to quorum device
    Uses registry change notification interface
  Changes read and applied on fail-over
  [Figure: \\A on \\X registry -> (each update) -> Quorum Device registry -> (after failover) -> \\A on \\B registry]
96
©1996, 1997 Microsoft Corp. 95 Registry Replication
97
©1996, 1997 Microsoft Corp. 96 Application Support Virtual Servers Generic Resource DLLs Resource DLL VC++ Wizard Cluster API
98
©1996, 1997 Microsoft Corp. 97 Generic Resource DLLs
  Generic Application DLL
    Simplest: just starts, stops application, and makes sure process is alive
  Generic Service DLL
    Translates DLL calls into equivalent NT Server calls
      Online => Service Start
      Offline => Service Stop
      Looks/IsAlive => Service Status
  [Figure: Resource Monitor, DLL, Resource; private calls, std calls]
99
©1996, 1997 Microsoft Corp. 98 Generic Application
100
©1996, 1997 Microsoft Corp. 99 Generic Service
101
©1996, 1997 Microsoft Corp. 100 Application Support Virtual Servers Generic Resource DLLs Resource DLL VC++ Wizard Cluster API
102
©1996, 1997 Microsoft Corp. 101 Resource DLL VC++ Wizard Asks for resource type name Asks for optional service to control Asks for other parameters (and associated types) Generates DLL source code Source can be modified as necessary E.g. additional checks for Looks/IsAlive
103
©1996, 1997 Microsoft Corp. 102 Creating a New Workspace
104
©1996, 1997 Microsoft Corp. 103 Specifying Resource Type Name
105
©1996, 1997 Microsoft Corp. 104 Specifying Resource Parameters
106
©1996, 1997 Microsoft Corp. 105 Automatic Code Generation
107
©1996, 1997 Microsoft Corp. 106 Customizing The Code
108
©1996, 1997 Microsoft Corp. 107 Application Support Virtual Servers Generic Resource DLLs Resource DLL VC++ Wizard Cluster API
109
©1996, 1997 Microsoft Corp. 108 Cluster API Allows resources to: Examine dependencies Manage per-resource data Change parameters (e.g. failover) Listen for cluster events etc. Specs & API became public Sept 1996 On all MSDN Level 3 On web site: http://www.microsoft.com/clustering.htm
110
©1996, 1997 Microsoft Corp. 109 Cluster API Documentation
111
©1996, 1997 Microsoft Corp. 110 Outline Why FT and Why Clusters Cluster Abstractions Cluster Architecture Cluster Implementation Application Support Q&A
112
©1996, 1997 Microsoft Corp. 111 Research Topics? Even easier to manage Transparent failover Instant failover Geographic distribution (disaster tolerance) Server pools (load-balanced pool of processes) Process pair (active/backup process) 10,000 nodes? Better algorithms Shared memory or shared disk among nodes a truly bad idea?
113
©1996, 1997 Microsoft Corp. 112 References
  Microsoft NT site: http://www.microsoft.com/ntserver/
  BARC site: http://research.microsoft.com/BARC
  These slides: http://research.microsoft.com/~joebar/ftcs-27/ftcs20.ppt
  Inside Windows NT, H. Custer, Microsoft Pr, ISBN: 155615481
  Tandem Global Update Protocol, R. Carr, Tandem Systems Review. V1.2 1985, sketches regroup and global update protocol.
  VAXclusters: a Closely Coupled Distributed System, Kronenberg, N., Levey, H., Strecker, W., ACM TOCS, V 4.2 1986. A (the) shared disk cluster.
  In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Gregory F. Pfister, Prentice Hall, 1995, ISBN: 0134376250. Argues for shared nothing.
  Transaction Processing Concepts and Techniques, Gray, J., Reuter A., Morgan Kaufmann, 1994. ISBN 1558601902, survey of outages, transaction techniques.