1 Scalable Clusters Jed Liu 11 April 2002

2 Overview Microsoft Cluster Service Built on Windows NT Provides high availability services Presents itself to clients as a single system Frangipani A scalable distributed file system

3 Microsoft Cluster Service Design goals: Cluster composed of COTS components Scalability – able to add components without interrupting services Transparency – clients see cluster as a single machine Reliability – when a node fails, can restart services on a different node

4 Cluster Abstractions Nodes Resources e.g., logical disk volumes, NetBIOS names, SMB shares, mail service, SQL service Quorum resource Implements persistent storage for cluster configuration database and change log Resource dependencies Tracks dependencies between resources

5 Cluster Abstractions (cont’d) Resource groups The unit of migration: resources in the same group are hosted on the same node Cluster database Configuration data for starting the cluster is kept in a database, accessed through the Windows registry. Database is replicated at each node in the cluster.

7 Node Failure Active members broadcast periodic heartbeat messages Failure suspicion occurs when a node misses two successive heartbeat messages from some other node Regroup algorithm gets initiated to determine new membership information Resources that were online at a failed member are brought online at active nodes
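
A minimal sketch of the failure-suspicion rule above, assuming a fixed heartbeat period and the two-missed-heartbeats threshold; the FailureDetector class and its methods are illustrative names, not MSCS APIs.

    import time

    HEARTBEAT_PERIOD = 1.0    # seconds between heartbeats (assumed value)
    MISSED_LIMIT = 2          # suspect a node after two missed heartbeats

    class FailureDetector:
        """Tracks the last heartbeat seen from each active member."""

        def __init__(self, members):
            now = time.monotonic()
            self.last_seen = {node: now for node in members}

        def on_heartbeat(self, node):
            # Record the arrival time of a heartbeat from `node`.
            self.last_seen[node] = time.monotonic()

        def suspects(self):
            # A node is suspected once MISSED_LIMIT heartbeat periods have
            # elapsed without a message; this would trigger the regroup algorithm.
            deadline = MISSED_LIMIT * HEARTBEAT_PERIOD
            now = time.monotonic()
            return [n for n, t in self.last_seen.items() if now - t > deadline]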

8 Member Regroup Algorithm Lockstep algorithm Activate. Each node waits for a clock tick, then starts sending and collecting status messages Closing. Determine whether partitions exist and determines whether current node is in a partition that should survive Pruning. Prune the surviving group so that all nodes are fully-connected

9 Regroup Algorithm (cont’d) Cleanup. Surviving nodes update their local membership information as appropriate Stabilized. Done
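
As an illustration of the pruning stage, the brute-force sketch below picks the largest fully connected subset of the nodes that responded during the closing stage; the exhaustive search and the function name are purely illustrative, not how MSCS implements it.

    from itertools import combinations

    def surviving_group(nodes, connected):
        """Illustrative pruning step: given the set of responding nodes and a
        symmetric connected(a, b) predicate, return the largest subset in
        which every pair of nodes can reach each other (fully connected)."""
        for size in range(len(nodes), 0, -1):
            for group in combinations(sorted(nodes), size):
                if all(connected(a, b) for a, b in combinations(group, 2)):
                    return set(group)
        return set()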

10 Joining a Cluster Sponsor authenticates the joining node Denies access if applicant isn’t authorized to join Sponsor sends version info of config database Also sends updates as needed, if changes were made while applicant was offline Sponsor atomically broadcasts information about applicant to all other members Active members update local membership information
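
A sketch of the sponsor's side of this join protocol; the dict-shaped applicant, the config-database update log, and the broadcast/authenticate callables are all hypothetical stand-ins for MSCS internals.

    def sponsor_handle_join(cluster_db, members, broadcast, authenticate, applicant):
        """Sketch of the sponsor side of the join protocol; every argument is a
        hypothetical stand-in (a dict-like config database, the member set, an
        atomic-broadcast callable, an authentication callable)."""
        # 1. Authenticate the applicant; refuse unauthorized nodes.
        if not authenticate(applicant["node_id"], applicant["credentials"]):
            raise PermissionError("applicant is not authorized to join")

        # 2. Send the config database version and any updates made while the
        #    applicant was offline.
        missed = [u for u in cluster_db["log"] if u["seq"] > applicant["db_version"]]
        reply = {"db_version": cluster_db["version"], "updates": missed}

        # 3. Atomically broadcast the applicant to all other members so they can
        #    update their local membership information.
        broadcast({"type": "member-join", "node": applicant["node_id"]})
        members.add(applicant["node_id"])
        return reply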

11 Forming a Cluster Use local registry to find address of quorum resource Acquire ownership of quorum resource Arbitration protocol ensures that at most one node owns quorum resource Synchronize local cluster database with master copy
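
The sketch below mimics the form-cluster sequence; a process-local lock stands in for the real disk-based arbitration protocol on the quorum resource, and the registry key and helper names are assumptions.

    import threading

    class QuorumResource:
        """Toy stand-in for the quorum resource: at most one node may own it.
        The in-process lock here only models the arbitration guarantee."""

        def __init__(self):
            self._owner = None
            self._lock = threading.Lock()

        def try_acquire(self, node_id):
            with self._lock:
                if self._owner is None:
                    self._owner = node_id
                    return True
                return False

    def form_cluster(node_id, quorum, local_registry, sync_database):
        # 1. Find the quorum resource address in the local registry (assumed key).
        address = local_registry["quorum_resource_address"]
        # 2. Arbitrate: only the node that wins ownership may form the cluster.
        if not quorum.try_acquire(node_id):
            return False
        # 3. Synchronize the local cluster database with the master copy kept
        #    on the quorum resource.
        sync_database(address)
        return True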

12 Leaving a Cluster Member sends an exit message to all other cluster members and shuts down immediately Active members gossip about exiting member and update their cluster databases

13 Node States Inactive nodes are offline Active members are either online or paused All active nodes participate in cluster database updates, vote in the quorum algorithm, maintain heartbeats Only online nodes can take ownership of resource groups

14 Resource Management Achieved by invoking calls through a resource control library (implemented as a DLL) Through this library, MSCS can monitor the state of each resource

15 Resource Migration Reasons for migration: Node failure Resource failure Resource group prefers to execute at a different node Operator-requested migration In the first case, resource group is pulled to new node In all other cases, resource group is pushed

16 Pushing a Resource Group All resources in the group are brought offline at the old node Old host node chooses a new host Local copy of MSCS at new host brings up the resource group

17 Pulling a Resource Group Active nodes capable of hosting the group determine amongst themselves the new host for the group New host chosen based on attributes that are stored in the cluster database Since database is replicated at all nodes, decision can be made without any communication! New host brings online the resource group
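
Because every node evaluates the same selection rule over an identical replica of the cluster database, the choice is deterministic and needs no messages; the preference-list field and the lowest-node-id tie-break in this sketch are assumptions.

    def choose_new_host(group, active_nodes, cluster_db):
        """Deterministic host selection for a pulled resource group.  Every node
        runs this same function over its (identical) replica of the cluster
        database, so all survivors reach the same answer without communicating."""
        prefs = cluster_db["groups"][group]["preferred_hosts"]
        candidates = [n for n in prefs if n in active_nodes]
        if not candidates:
            # Fall back to any active node, lowest node id first (assumed rule).
            candidates = sorted(active_nodes)
        return candidates[0]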

18 Client Access to Resources Normally, clients access SMB resources using names of the form \\node\service This presents a problem – as resources migrate between nodes, the resource name will change With MSCS, whenever a resource migrates, the resource’s network name also migrates as part of the resource group Clients see only services and their network names – cluster becomes a single virtual node

19 Membership Manager Maintains consensus among active nodes about who is active and who is defined A join mechanism admits new members into the cluster A regroup mechanism determines current membership on start up or suspected failure

20 Global Update Manager Used to implement atomic broadcast A single node in the cluster is always designated as the locker Locker node takes over atomic broadcast in case original sender fails in mid-broadcast
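
A rough sketch of the locker-based broadcast flow: the locker receives the update first, so if the original sender dies mid-broadcast the locker can replay it to the remaining members; the message shapes, the fixed delivery order, and the final completion message are assumptions, not the exact protocol.

    def global_update(sender, locker, members, update, send):
        """Sketch of the atomic-broadcast flow: deliver to the locker first,
        then to the remaining members in a fixed order.  `send` is a
        hypothetical reliable point-to-point primitive."""
        send(locker, update)                 # locker always receives first
        for node in sorted(members):
            if node not in (sender, locker):
                send(node, update)
        # Tell the locker the broadcast finished (stands in for releasing
        # the global lock); if the sender never gets here, the locker
        # already holds the update and can finish the broadcast itself.
        send(locker, {"type": "update-complete", "id": update["id"]})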

21 Frangipani Design goals: Provide users with coherent, shared access to files Arbitrarily scalable to provide more storage, higher performance Highly available in spite of component failures Minimal human administration Full and consistent backups can be made of the entire file system without bringing it down Complexity of administration stays constant despite the addition of components

22 Server Layering (layer diagram) User programs run on top of Frangipani file server modules; the Frangipani servers share a Petal distributed virtual disk and coordinate through a distributed lock service; Petal in turn sits on the physical disks

23 Assumptions Frangipani servers trust: One another Petal servers Lock service Meant to run in a cluster of machines that are under a common administration and can communicate securely

24 System Structure Frangipani implemented as a file system option in the OS kernel All file servers read and write the same file system data structures on the shared Petal disk Each file server keeps a redo log in Petal so that when it fails, another server can access the log and recover

25 (system structure diagram) Each client machine runs user programs over the file system switch, the Frangipani file server module, and the Petal device driver; these machines talk over the network to the Petal servers and lock servers, which together provide the Petal virtual disk

26 Security Considerations Any Frangipani machine can access and modify any block of the Petal virtual disk Must run only on machines with trusted OSes Petal servers and lock servers should also run on trusted OSes All three types of components should authenticate one another Network security also important: eavesdropping should be prevented

27 Disk Layout 2^64 bytes of addressable disk space, partitioned into regions: Shared configuration parameters Logs – each server owns a part of this region to hold its private log Allocation bitmaps – each server owns parts of this region for its exclusive use Inodes, small data blocks, large data blocks
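
The carve-up of the 2^64-byte address space can be expressed as a table of regions and offsets; the sizes below are placeholders for illustration only, not the values chosen in the Frangipani paper.

    TiB = 2 ** 40

    # Illustrative split of the 2**64-byte Petal address space into the regions
    # named above; the sizes and ordering here are placeholders, not the actual
    # values in the Frangipani paper.
    LAYOUT = [
        ("config params",      1 * TiB),   # shared configuration parameters
        ("logs",               1 * TiB),   # one private slice per Frangipani server
        ("allocation bitmaps", 3 * TiB),   # per-server slices for exclusive use
        ("inodes",             1 * TiB),
        ("small data blocks",  2 ** 47),
        ("large data blocks",  2 ** 63),   # bulk of the address space
    ]

    def region_offsets(layout):
        """Return (name, start, length) for each region, laid out back to back."""
        offset, regions = 0, []
        for name, size in layout:
            regions.append((name, offset, size))
            offset += size
        return regions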

28 Logging and Recovery Only log changes to metadata – user data is not logged Use write-ahead redo logging Log implemented as a circular buffer When log fills, reclaim oldest ¼ of buffer Need to be able to find end of log Add monotonically increasing sequence numbers to each block of the log
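
Finding the end of the circular log then amounts to scanning for the point where the per-block sequence numbers stop increasing; a toy sketch, with the log modeled as a list of (sequence, payload) pairs.

    def find_log_end(blocks):
        """Given the log's blocks in on-disk (circular) order, each tagged with
        a monotonically increasing sequence number, the end of the log is just
        before the point where the sequence numbers stop increasing."""
        for i in range(1, len(blocks)):
            if blocks[i][0] < blocks[i - 1][0]:
                return i          # blocks[i] is the oldest block; log ends at i-1
        return len(blocks)        # no wrap yet: log ends at the last block

    # Example: a 6-block circular buffer where the writer has wrapped around.
    log = [(9, "f"), (10, "g"), (5, "c"), (6, "d"), (7, "e"), (8, "x")]
    assert find_log_end(log) == 2   # newest block is seq 10, oldest is seq 5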

29 Concurrency Considerations Need to ensure logging and recovery work in the presence of multiple logs Updates requested to same data by different servers are serialized Recovery applies a change only if it was logged under an active lock at the time of failure To ensure this, never replay an update that has already been completed: keep a version number on each metadata block
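
A sketch of the replay guard: a logged metadata update is applied only if the on-disk block's version number is still older than the version recorded with the update; the log-record layout here is hypothetical.

    def replay_record(record, read_block, write_block):
        """Recovery-time replay guard: apply a logged metadata update only if
        it has not already reached the disk.  `record` is a hypothetical log
        record carrying the block number, the new contents, and the version
        number the block will have after the update."""
        block = read_block(record["block_no"])
        if block["version"] >= record["version"]:
            return False                    # update already completed; skip it
        new_block = dict(record["contents"], version=record["version"])
        write_block(record["block_no"], new_block)
        return True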

30 Concurrency Considerations (cont’d) Ensure that only one recovery daemon is replaying the log of a given server Do this through an exclusive lock on the log

31 Cache Coherence When lock service detects conflicting lock requests, current lock holder is asked to release or downgrade lock Lock service uses read locks and write locks When a read lock is released, corresponding cache entry must be invalidated When a write lock is downgraded, dirty data must be written to disk Releasing a write lock = downgrade to read lock, then release
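
The callback below sketches that behavior: a revoked write lock first forces dirty data back to Petal and downgrades to a read lock, and a full release additionally invalidates the cached entry; the cache layout and the write_back callable are assumed, not Frangipani's actual interfaces.

    def on_lock_revoke(cache, segment, demand, write_back):
        """Sketch of the cache-coherence callbacks.  `cache` maps a lockable
        segment to {"mode": "read"|"write", "dirty": bool, "data": ...};
        `demand` is what the lock service asks for: "downgrade" or "release"."""
        entry = cache.get(segment)
        if entry is None:
            return
        if entry["mode"] == "write":
            if entry["dirty"]:
                write_back(segment, entry["data"])   # flush dirty data to Petal
                entry["dirty"] = False
            entry["mode"] = "read"       # a write lock is downgraded first
        if demand == "release":
            del cache[segment]           # dropping a read lock invalidates the entry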

32 Synchronization Division of on-disk data structures into lockable segments is designed to avoid lock contention Each log is lockable Bitmap space divided into lockable units Unallocated inode or data block is protected by lock on corresponding piece of the bitmap space A single lock protects the inode and any file data that it points to

33 Locking Service Locks are sticky – they’re retained until someone else needs them Client failure dealt with by using leases Network failures can prevent a Frangipani server from renewing its lease Server discards all locks and all cached data If there was dirty data in the cache, Frangipani throws errors until file system is unmounted

34 Locking Service Hole If a Frangipani server’s lease expires due to temporary network outage, it might still try to access Petal Problem basically caused by lack of clock synchronization Can be fixed without synchronized clocks by including a lease identifier with every Petal request
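
A toy sketch of that fix: every Petal request carries the lease identifier it was issued under, and Petal refuses requests whose lease has lapsed; the 30-second term and the class shape are assumptions for illustration.

    import time

    LEASE_TERM = 30.0   # seconds; assumed lease duration

    class PetalStub:
        """Toy Petal front end that checks the lease identifier attached to
        every request; expired or unknown leases are rejected, so a server
        whose lease lapsed during a network outage cannot touch the disk."""

        def __init__(self):
            self.leases = {}            # lease_id -> expiry time

        def grant_lease(self, lease_id):
            self.leases[lease_id] = time.monotonic() + LEASE_TERM

        def write(self, lease_id, block_no, data):
            expiry = self.leases.get(lease_id)
            if expiry is None or time.monotonic() > expiry:
                raise PermissionError("lease expired: request refused")
            # ... perform the actual block write here ...
            return True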

35 Adding and Removing Servers Adding a server is easy! Just point it to a Petal virtual disk and a lock service, and it automagically gets integrated Removing a server is even easier! Just take a sledgehammer to it Alternatively, if you want to be nicer, you can flush dirty data before using the sledgehammer

36 Backups Just use the snapshot features that are built into Petal to do backups Resulting snapshot is crash-consistent: reflects state reachable if all Frangipani servers were to crash This is good enough – if you restore the backup, recovery mechanism can handle the rest

37 Summary Microsoft Cluster Service Aims to provide reliable services running on a cluster Presents itself as a virtual node to its clients Frangipani Aims to provide a reliable distributed file system Uses metadata logging to recover from crashes Clients see it as a regular shared disk Adding and removing nodes is really easy

