1
Veritas Cluster Server
Delivering Data Center Availability
2
Agenda
Functional overview of Cluster Server
Architectural overview of Cluster Server
Future directions
3
Agenda
Functional overview of Cluster Server
Architectural overview of Cluster Server
Future directions
4
Types of Clusters
“Cluster” is a broadly used term:
High Availability (HA) clusters
Parallel processing clusters
Load balancing clusters
High Performance Computing (HPC) clusters
VCS is primarily an HA cluster, with support for some key parallel processing applications like Oracle RAC
5
What is Availability?
Availability = availability of the application
Reduce planned and unplanned downtime of the application
Needed for server-type applications (databases, app/web servers, ...)
Planned downtime: caused by hardware/OS/application maintenance
Unplanned downtime: caused by
  Logical failures: software bugs, operator errors, viruses
  Component failures: CPU, NIC, HBA, disk, software crashes
  Site/infrastructure failures: power outage, natural and other disasters
6
Before You Cluster
Clustering is not a replacement for backup or storage virtualization.
Avoid all single points of failure and build redundancy:
  Multiple HBAs, switches, mirrors and paths to storage
  Multiple NICs and network paths from clients to the clustered application
  Multiple machines, power supplies, cooling systems, data centers
Redundancy alone is not sufficient. Clustering software is needed to monitor and orchestrate all the components, and to specify policies so that the cluster acts deterministically in case of failures and other events.
It is not realistic to have complete redundancy for all applications. Clustering software allows you to make different kinds of tradeoffs for different applications.
Clustering software should be able to make any application highly available without making changes to the application.
7
VCS Highlights
Developed completely “in-house”
Shipping since 1998 – replaced a previous-generation 2-node product
Current shipping version 4.x; 5.0 release in Q3 2006
Released from a common code base and available on the following platforms: Solaris, Windows, AIX, Linux, HP-UX, VMware, Xen
Support for up to 32 nodes in a cluster
Single HA solution for local, campus-wide, metro area and wide area
Java GUI, web GUI, CLI and API interfaces
Enablers: heterogeneous storage & SAN support
Shipping standalone and as part of application-focused solutions
8
Clust-omers & Competition
VCS is the #1 cross-platform HA product on the market (IDC), with about 10% yearly growth
Customers include many of the Global 500: Fidelity, Morgan Stanley, eBay, Verizon, ...
Competition: Sun Cluster, Microsoft Cluster Server, IBM HACMP, HP ServiceGuard, Oracle and other Linux clusters
Relative strengths compared to the competition:
  Comprehensive, consistent solution with support for all major hardware, OSes, storage and applications
  Integrated stack with support for failover and parallel applications; high availability & disaster recovery
  Feature richness
  Product deployed to make many mission-critical apps highly available
9
Enablers: Storage Area Networks
Before: shared SCSI (direct attached) model – shared SCSI is very limited in connectivity and accessibility
After: SAN model – the SAN model brings IP-network-like connectivity and accessibility to storage
Data on shared disk – and even binaries – can be “imported” on any given host on the SAN
Now applications can move around freely within the cluster for Life, Liberty and the Pursuit of Availability!
10
Enablers: Integrated HA Solutions
Foundation stack:
  Veritas Volume Manager – storage virtualization; broad device & array support; multi-pathing and storage path failure handling
  Veritas File System – extent-based, journaled file system; online operations; storage checkpoints
HA products & solutions:
  VCS
  Storage Foundation HA (Storage Foundation + VCS)
  Storage Foundation for Applications (Storage Foundation HA tuned for apps like Oracle, Oracle RAC, DB2, Sybase)
Tested and released together as a “train”
11
Application Requirements for Clustering
Robustness: if an application keeps crashing all the time, VCS will restart it within the cluster and increase availability, but not the reliability of the application
Recoverability:
  The application should be able to recover on its own upon restart
  The application should be able to restart without rebooting the machine
Location independence:
  Clients of the application should be able to reconnect after the application has failed over – using a virtual IP address that failed over with the app
  The application should be able to restart on the same node or a different node – no “hostname” dependence (this can be worked around, but not cleanly)
  The application should allow multiple instances on one machine (needed for the server consolidation scenario)
12
Cluster & Connections
[Diagram: clients of the applications connect over the public network to up to 32 homogeneous machines, each running its applications on top of VCS; the machines are linked by redundant private networks and attached through a storage area network to shared storage]
13
Resources, Types & Agents
An application is made up of related resources:
  Storage resources: disks, volumes, file systems
  Network resources: NIC, IP address
  The application processes themselves
A resource type is the definition of a resource – like a class definition
Resources are always of a given type – like object instances
Each resource type needs a corresponding agent, which is responsible for onlining, offlining and monitoring resources of that type on a given machine in the cluster
Agents run on a clustered machine only if an application with that type of resource is configured to run on the machine (a quick CLI illustration follows)
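As a concrete illustration, the VCS CLI can be used to inspect types, resources and their agents. A minimal sketch, assuming a running cluster; the resource and system names are hypothetical:

    # List the resource types known to the cluster (each type is backed by an agent)
    hatype -list

    # Show the attribute definitions for one type
    hatype -display Mount

    # List configured resources, then query the state of one on a given machine
    hares -list
    hares -state Mount-oradata -sys machine1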
14
Service Groups
An application is represented as a service group, or a collection of related service groups
A service group is the basic unit of planned migration and unplanned failover of an application within the cluster
A service group consists of one or more resources with their dependencies; for example, the application requires the database and the IP address, and a volume requires its disks
A failover service group should be online on only one node at a time
Once defined, service groups can be onlined, offlined, migrated and failed over with a declarative command like “online the Oracle1 service group on machine1” – without worrying about procedural details (see the sketch below)
[Diagram: dependency tree – APPLICATION requires Database and IP Address; Database requires File System, which requires Volumes, which require a Physical Disk; IP Address requires a Network Card]
Can the same resource be part of more than one service group? No. You can proxy the resource, or put the shared resource in a separate service group and use a group dependency
Group dependencies are more loosely defined than resource dependencies; various types: soft/firm/hard, local/global
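The declarative command quoted above maps directly onto the hagrp CLI. A sketch, with the group and node names as placeholders:

    # Bring the Oracle1 service group online on machine1; VCS onlines the
    # resources in dependency order (disk -> volume -> file system -> IP -> database)
    hagrp -online Oracle1 -sys machine1

    # Take the group offline again
    hagrp -offline Oracle1 -sys machine1

    # Show the group dependencies configured for Oracle1 (soft/firm/hard, local/global)
    hagrp -dep Oracle1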
15
Service Group Failover
[Diagram: a service group – application, database, IP address, file system, volumes, network card, physical disk – online on one node fails over to another node in the cluster, and clients reconnect to the new node; the sketch below shows the corresponding commands]
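For planned migration, and for watching an unplanned failover complete, the hagrp and hastatus commands apply. A sketch with placeholder names:

    # Planned migration: switch the service group to another node in the cluster
    hagrp -switch Oracle1 -to machine2

    # Cluster-wide summary of systems and service group states
    hastatus -sum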
16
Flexible Clustering Architectures
Local clustering (SAN, LAN): one cluster, VM/VCS; shared storage; SAN or direct attached
Metropolitan HA & disaster recovery (SAN, MAN/LAN):
  Campus clustering: one cluster, VM/VCS; remote mirror; SAN attached, Fibre
  Replicated data cluster: one cluster, Replication/VCS; replica; IP, DWDM, ESCON
Wide area disaster recovery (WAN): 2 or more clusters, Replication/VCS; replica; IP, DWDM, ESCON
17
Local HA Clustering
Environment: one cluster located at a single site (LAN); redundant servers, networks and storage for applications and databases
Advantages: simpler to implement
Disadvantages: the data center or site can be a single point of failure
18
Metropolitan HA & Disaster Recovery Clustering
Environment: one cluster (MAN) – servers and storage located in multiple sites are part of the same cluster
Advantages: protection against failures and localized disasters; no replication needed – quick recovery
Disadvantages: SAN infrastructure is limited by distance; requires careful planning for allocation of storage across the two sites
19
Replicated Data Cluster
Environment: the cluster stretches over multiple buildings, data centers or sites; local storage is replicated between cluster nodes at each location; a limited solution – not used by many customers
Advantages: replication (over IP) rather than SAN – leverages existing IP connections; protection against disasters local to a building, data center or site
Disadvantages: synchronization after a replication switch can be more complicated and error-prone
20
Wide Area Disaster Recovery
Environment: customers needing local failover for HA, as well as remote site takeover for disaster recovery over long distances
Advantages:
  Can support any distance using IP
  Support for array, host and DB replication
  Local failover before remote takeover
  Single-button site migration
Disadvantages:
  Cost of a remote hot site
  Synchronization after a replication switch can be more complicated and error-prone
VCS Fire Drill: tools to help customers periodically test “disaster readiness”
A VCS Fire Drill is a combination of technologies that allows customers who set up VCS in a wide-area failover configuration to regularly test the efficacy of their failover environment without affecting their production application
A VCS Fire Drill includes the use of underlying storage primitives, such as snapshots, to obtain a temporary point-in-time copy of the data, and VCS primitives to automate its use and the management of a facsimile of the application
21
Clustering & Disaster Recovery
A replication solution alone is not sufficient for disaster recovery. You need proper planning – including people, processes and software – to help make the right decisions and do the right things in the right order:
  If a site is down, who/how detects it? (VCS)
  Who needs to be notified? (VCS -> administrators)
  Who decides whether it is a real disaster? (People)
  When you fail over all or part of a data center or site from one place to another, which applications should stay alive and which can stay down? (Admin -> VCS)
  Do you migrate an application even if the corresponding replicated data is not up to date? (Admin -> VCS)
  How do you make sure all applications come up in the proper order upon site migration? (VCS)
  How do I periodically test disaster preparedness without taking my production application down? (VCS Fire Drill)
22
Local & Global Failover
[Diagram: local failover of a service group (App, DB, IP, NIC, File, Volumes) within each cluster, with replication between the sites and site migration from the local cluster to the remote cluster]
23
Available Agents
Bundled agents: NIC, IP, Volume, Mount, Process, Application, Service, Registry Replication, NBU, BE, ...
Enterprise agents: Oracle, DB2, SQL Server, Exchange, SAP, BEA, Siebel, Sybase, ...
Replication agents: EMC SRDF, Hitachi TrueCopy, NetApp replication, Oracle and DB2 replication, ...
Consulting agents: 50+ agents written by the consulting group for various applications like ClearCase, TIBCO and so on
Custom agents: written by customers for custom apps
24
Virtualization Support
Current support for server- and hardware-based virtualization:
  Virtual machine or partition as a machine: IBM micro-partitions; hardware-based partitions (HP Superdome partitions, Sun partitions)
  Virtual machine or partition as a resource in a service group: VMware, Microsoft Virtual Server, Solaris Zones
Easier disaster recovery: replicate the virtual machine image along with the data – VCS provides failover & DR management
25
Support for Parallel Applications
Cluster Volume Manager (CVM): simultaneously import and read/write to volumes across cluster nodes; the application has to control mutual exclusion of reads/writes to the same blocks
Cluster File System (CFS): simultaneously mount and read/write to file systems across cluster nodes; unlike NFS, reads/writes go directly from/to disk (except for metadata); global lock management (GLM)
Oracle RAC: we provide an integrated package with VCS + CVM + CFS; support for Oracle Disk Manager (ODM) – data file management tuned for Oracle
VCS supports both non-clustered and clustered versions of these apps; parallel CVM, CFS and RAC are supported on UNIX platforms only
CVM uses a uniform shared-storage model (all nodes must see all storage, or fail); any change to a volume, such as mirroring, is reflected on all nodes (DB2 EEE MPP support is different – shared nothing)
ODM: bypasses file system buffering and locking, shares file descriptors across multiple processes, and bypasses recovery for data files
Clients of RAC do not need to know that the DB is running in parallel mode; clients of CFS keep using the same FS semantics (see the sketch below)
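As an illustration of the CFS point, on UNIX platforms the same VxFS file system can be mounted locally (failover style) or cluster-wide. A sketch, assuming a CVM disk group and volume already exist; the device paths are reused from the example configuration later in this deck:

    # Local (failover) mount: the file system is in use by one node at a time
    mount -F vxfs /dev/vx/dsk/oradatadg/oradatavol /oradata

    # Cluster (shared) mount: the same file system is mounted on all nodes;
    # reads/writes go directly to disk, with metadata coordinated by GLM
    mount -F vxfs -o cluster /dev/vx/dsk/oradatadg/oradatavol /oradata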
26
Cluster Management
All cluster management is available using the CLI, Java GUI, web GUI & API – consistent cluster management across all platforms
Java GUI: fat-client-based interactive UI; will be deprecated in the future
Web GUI: HTML-based browser clients for accessibility from anywhere
Command Central Availability: management of multiple clusters; aggregation of information, and availability/uptime reporting based on business units and locations
Multi Cluster Management: superset of the web GUI and CCA, with interactivity using Macromedia Flash
27
Agenda
Functional overview of Cluster Server
Architectural overview of Cluster Server
Future directions
28
VCS Architecture
Policy (the VCS engine) is executed on all nodes in replicated state machine (RSM) mode
[Diagram: up to 32 homogeneous machines, each running LLT, GAB, the VCS engine and agents, connected by private networks and shared storage; the Java GUI, browser clients, the web GUI server and the CLI connect over the public network; VM, FS and applications not shown for clarity]
29
LLT (Low Latency Transport)
Proprietary protocol on top of Ethernet – non-routable
Provides the heartbeat and communication mechanism for clustered nodes over the private network
Guaranteed delivery of packets – like TCP
Supports up to 8 network links
Performs implicit link-level heartbeats across all clustered nodes
Transparent handling of link failures; can multiplex across links and scale (see the lltstat sketch below)
Also used by CVM (Cluster Volume Manager) and CFS (Cluster File System) for communication and messaging
LLT can also run over UDP (not widely used)
In “wide area” or global cluster mode, LLT is not used: the VCS engines on the two clusters talk to each other independently to keep state in sync. There is a notion of “service group ownership” that can be transferred when all clusters are up, or overridden by the administrator in case of a disaster.
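To see LLT at work on a node, the lltstat utility reports peer and per-link status. A sketch, assuming LLT is configured and running:

    # Node status for all LLT peers as seen from this node
    lltstat -n

    # Verbose, per-link view: shows each private link's state for every peer,
    # so a failed heartbeat link is visible immediately
    lltstat -nvv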
30
GAB (Group Membership and Atomic Broadcast Protocol)
Consistent membership across all nodes
Guarantees ordering of broadcasts from the same node
Guarantees consistent ordering of broadcasts from multiple nodes
Depends on LLT for reliable delivery of messages
Clients register on designated ports (see the gabconfig sketch below)
Reconfiguration happens when a node joins or leaves, or when a client registers with GAB on a given port or leaves
Also used by CVM and CFS for membership – provides consistent membership and cluster broadcasts for all clustered Veritas products
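GAB port memberships can be inspected with gabconfig. A sketch; the output shown is illustrative, not captured from a real cluster:

    # Show GAB port memberships on this node
    gabconfig -a

    # Illustrative output: port a is GAB's own membership, port h is the VCS
    # engine; CVM, CFS and fencing register on their own designated ports
    # GAB Port Memberships
    # Port a gen a36e0003 membership 0123
    # Port h gen fd570002 membership 0123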
31
Split Brain
What happens when one or more nodes lose all means of communication with the rest of the cluster? The cluster no longer operates as a single “brain”: sub-clusters can decide to online the same application on different nodes at the same time. The result? Data corruption. BAD.
Causes: a loaded machine (the common cause); NIC or network switch failures on all links
Avoidance: support for multiple heartbeat links; VCS components run at very high priority
Split brain cannot be completely prevented – failures do occur
Detection is not easy: the symptoms are the same as node(s) dying
The solution? I/O fencing
32
I/O Fencing
1) Upon split brain, the fencing components on each sub-cluster race for the coordinator disks
2) The winning sub-cluster fences out the losing sub-cluster
3) Any write attempts from the losing sub-cluster are rejected; the losing sub-cluster panics, depending on configuration
Why 3 coordinator disks? An odd number guarantees that exactly one sub-cluster can win a majority of the disks, so there is always a single winner (a fencing status command follows)
[Diagram: coordinator disks and data disks (VxVM- or CVM-controlled), with a write from the losing sub-cluster being rejected]
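Fencing state can be checked from the command line. A sketch, assuming disk-based fencing is configured:

    # Display the I/O fencing mode and the current cluster membership as the
    # fencing driver sees it (including registrations on the coordinator disks)
    vxfenadm -d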
33
Recovery Order When a Node Leaves the Cluster
Fencing first: nodes no longer in the cluster can’t write to the data disks
CVM next: performs volume metadata & other recovery
CFS next: GLM recovery; a new master is chosen for file systems mastered on the dead node
Oracle RAC next: now RAC can do its own recovery (locks, open transactions)
VCS engine last: the VCS engine can now decide how to react to the node failure
34
Consistent Communication Stack Across Products
[Diagram: on each server, the stack from top to bottom is RAC (data file management, cache fusion, lock management), CFS (file system metadata, GLM), CVM (volume metadata management), the VCS engine (cluster state) and fencing, all sharing GAB for cluster membership/state and LLT for consistent messaging & communication over the hardware “pipe” between the servers’ NICs]
35
VCS Engine: Cluster Configuration & State Management
Local, on-disk repository of the definitions of the nodes in the cluster, resource types, resources, agents, service groups and dependencies
Grammar-based configuration file
The first node to come up in the cluster builds the in-memory configuration; the in-memory configuration also includes transient information, like the state of each resource
Subsequent nodes get an in-memory snapshot from one of the existing nodes
Replicated state machine: any change in the cluster is broadcast to all nodes first, then the action is taken
All nodes in the cluster have the working configuration and state at all times – any node dying does not affect the functioning of the cluster software itself
Any event that changes the configuration is written atomically to the local configuration on each node (see the sketch below)
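The atomic, cluster-wide configuration write is visible in the standard change workflow on the CLI. A sketch, reusing names from the example configuration later in this deck; the added resource name is hypothetical:

    # Open the in-memory configuration for writing (cluster-wide)
    haconf -makerw

    # Make a change, e.g. add a Mount resource to the OraSG1 group
    # (Mount-orabin2 is a hypothetical resource name)
    hares -add Mount-orabin2 Mount OraSG1

    # Dump the in-memory configuration to main.cf on every node and
    # make it read-only again
    haconf -dump -makero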
36
A Very Short Glimpse into the Life of the VCS Engine
[Diagram: two systems, S1 and S2, each running the VCS engine (cluster access management, policy management, config & state management, inter-node communication) and agents for types T1 and T2, connected by the cluster interconnect (LLT/GAB). For “CLI: online group G1 on S1”: the engine on S1 broadcasts “Online R1 (S1)”; both engines receive the broadcast and update their config & state repositories; the T1 agent on S1 onlines R1; when the agent reports “R1 is online”, S1 broadcasts “R1 is online (S1)” and all nodes record resource R1 (type T1) of group G1 as online on S1. The on-disk config on each node holds group G1, resource R1 and systems S1, S2; a joining node receives an initial snapshot]
37
VCS Engine: Policy Management
Declarative specification and enforcement of:
  What applications can run where: active/passive, active/active, N+1, N-to-1, server consolidation configurations
  Failover & parallel service groups: a failover service group is online on only one machine at a given time
  What to do in case of various failures: resource, system, cluster, site
Registers with GAB for notification of node join/leave, and for inter-node communication
Works with agents for resource state management
Highly flexible: nodes, service groups, resources, types and attributes can be added/modified/deleted on the fly (see the sketch below)
Policy can be extended using scriptable triggers
Cluster Policy Simulator for modeling and running “what-if” scenarios
Speaker notes: discuss how we handle failures of node, app, NIC, HBA etc.; discuss critical resources and tree management; talk about SG dependencies; GCO: first local, then remote; link for simulator: (
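The where/what policies live in group and resource attributes. A sketch using names from the example configuration later in this deck; the cluster configuration must first be writable (haconf -makerw):

    # Where the group may run, with startup priorities (lower number = preferred)
    hagrp -modify OraSG1 SystemList thor152 0 thor157 1

    # Where the group comes up automatically at cluster start
    hagrp -modify OraSG1 AutoStartList thor152

    # Mark the database resource critical: if it faults, the whole group fails over
    hares -modify Ora-SEA Critical 1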
38
VCS Engine: Cluster Access Management
All cluster functionality is available through the Java GUI, web GUI, CLI and API
Secure authentication & communication (VxSS-based)
Registration API for state changes, or any notification in general
Role-based access control at the cluster or service-group level
39
Agents: Agent Framework
The agent framework provides common functionality needed by all agents:
  Communication and heartbeat with the VCS engine
  Handling of multiple resources of the same resource type – monitor intervals and other timers
  An API for entry points
Agent entry points: online, offline, monitor, clean – each just does its intended work periodically or upon engine request, and leaves the rest to the agent framework
“In-process” entry points (C/C++), or script/executable entry points
40
How Do I Write My Own Agent?
Reuse the generic Application or Process agent, or use the script agent for scripting entry points
Write the type definition, including the specification for the config file
Write the 4 entry points (online, offline, monitor, clean) with clearly defined actions; a minimal monitor sketch follows
The agent should be able to identify one and only one instance of the application
Monitor should be very efficient
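A minimal sketch of a script-based monitor entry point for a hypothetical custom agent; the pid-file path is an assumption, and the exit-code convention (110 = online, 100 = offline) should be verified against the VCS agent developer's guide:

    #!/bin/sh
    # monitor entry point for a hypothetical "myapp" resource type.
    # Identifies one and only one instance via its pid file, and stays
    # cheap because it runs on every monitor interval.
    PIDFILE=/var/run/myapp.pid    # hypothetical path

    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        exit 110    # resource is online
    else
        exit 100    # resource is offline
    fi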
41
Example Type Definitions & Configuration

types.cf:

    type Mount (
        static str ArgList[] = { MountPoint, BlockDevice, FSType, MountOpt, FsckOpt, SnapUmount }
        NameRule = resource.MountPoint
        str MountPoint
        str BlockDevice
        str FSType
        str MountOpt
        str FsckOpt
        int SnapUmount
    )

OraTypes.cf:

    type Oracle (
        static str ContainerType = Zone
        static str SupportedActions[] = { VRTS_GetInstanceName, VRTS_GetRunningServices, DBRestrict, DBUndoRestrict, DBResume, DBSuspend, DBTbspBackup }
        static str ArgList[] = { Sid, Owner, Home, Pfile, StartUpOpt, ShutDownOpt, EnvFile, AutoEndBkup, DetailMonitor, User, Pword, Table, MonScript, AgentDebug, Encoding }
        str Sid
        str Owner
        str Home
        str Pfile
        str StartUpOpt = STARTUP
        str ShutDownOpt = IMMEDIATE
        str EnvFile
        boolean AutoEndBkup = 1
        int DetailMonitor
        str MonScript = "./bin/Oracle/SqlTest.pl"
        str User
        str Pword
        str Table
        boolean AgentDebug = 0
        str Encoding
        str ContainerName
    )

    type Netlsnr (
        static str SupportedActions[] = { VRTS_GetInstanceName, VRTS_GetRunningServices }
        static str ArgList[] = { Owner, Home, TnsAdmin, Listener, EnvFile, MonScript, LsnrPwd, AgentDebug, Encoding }
        str TnsAdmin
        str Listener
        str MonScript = "./bin/Netlsnr/LsnrTest.pl"
        str LsnrPwd
    )

main.cf:

    include "types.cf"
    include "OracleTypes.cf"

    cluster vcsaav152 (
        UserNames = { admin = gJKcJEjGKfKKiSKeJH }
        Administrators = { admin }
        CredRenewFrequency = 0
        CounterInterval = 5
        )

    system thor152 (
        )

    system thor157 (
        )

    group OraSG1 (
        SystemList = { thor152 = 0, thor157 = 1 }
        )

        DiskGroup DG-orabindg (
            DiskGroup = orabindg
            )

        DiskGroup DG-oradatadg (
            DiskGroup = oradatadg
            )

        IP IP-OraSG1 (
            Device = hme0
            Address = " "
            NetMask = " "
            )

        Mount Mount-orabin (
            MountPoint = "/oracle"
            BlockDevice = "/dev/vx/dsk/orabindg/orabinvol"
            FSType = vxfs
            MountOpt = "rw,suid,delaylog,largefiles,ioerror=mwdisable"
            FsckOpt = "-n"
            )

        Mount Mount-oradata (
            MountPoint = "/oradata"
            BlockDevice = "/dev/vx/dsk/oradatadg/oradatavol"
            )

        NIC NIC-OraSG1 (
            )

        Netlsnr Netlsnr-LISTENER (
            Owner = oracle
            Home = "/oracle/92064"
            TnsAdmin = "/oracle/92064/network/admin"
            Listener = LISTENER_SEA
            )

        Oracle Ora-SEA (
            Sid = SEA
            Pfile = "/oracle/92064/dbs/initSEA.ora"
            MonScript = "/opt/VRTSvcs/bin/Oracle/SqlTest.pl"
            User = scott
            Pword = JVKvXVsXMvR
            Table = tab
            DetailMonitor = 1
            )

        Volume Volume-orabindg-orabinvol (
            Volume = orabinvol
            )

        Volume Volume-oradatadg-oradatavol (
            Volume = oradatavol
            )

        IP-OraSG1 requires NIC-OraSG1
        Mount-orabin requires Volume-orabindg-orabinvol
        Mount-oradata requires Volume-oradatadg-oradatavol
        Netlsnr-LISTENER requires Ora-SEA
        Netlsnr-LISTENER requires IP-OraSG1
        Ora-SEA requires Mount-orabin
        Ora-SEA requires Mount-oradata
        Volume-orabindg-orabinvol requires DG-orabindg
        Volume-oradatadg-oradatavol requires DG-oradatadg
42
Agenda
Functional overview of Cluster Server
Architectural overview of Cluster Server
Future directions
43
VCS Futures
Support for faster interconnects
More application-focused editions
Interoperability with Application Automation products
More news to be announced in March
44
Faster Interconnects
Potentially excellent performance improvements for latency-sensitive parallel applications like CFS and RAC
Types:
  10 Gigabit Ethernet: a fat pipe, but not much improvement in latency
  TCP Offload Engines: not much performance improvement
  InfiniBand (currently investigating):
    Can bypass data-path copies using RDMA
    Presents a separate queue pair (communication pipe) for each set of communicating processes across the cluster – reduces locking and sequencing overhead
    APIs are still evolving, and for Linux only as of now
InfiniBand buys a lot of raw performance improvement; how much of that translates into actual application performance improvement remains to be seen.
45
Application-Focused Editions
Today: Oracle, DB2 and Sybase
Investigating: SAP edition, MySQL, virtualization-focused solutions
46
Thank You!