IBM Tivoli System Automation for Multiplatforms v3.2

Integrating TSAMP v3.2 with DB2 v10.1 HA Shared Disk
Author: Gareth Holl

In this training module for Tivoli System Automation for Multiplatforms version 3.2, you learn how to exploit the event notification feature provided by the RSCT cluster infrastructure.

Objectives

When you have completed this module, you will be able to perform these tasks:
- Explain the general operation of the TSAMP product
- Identify each of the products and components that make up the total solution and note the integration points
- Examine what the 'db2haicu' utility does from the perspective of TSAMP and the automation policy
- Learn how to control and service the combination of TSAMP and DB2

After completing this module, you will be able to perform these tasks:
- Create a new condition
- Create a new response
- Associate a condition with a response
- Activate your new condition/response combination

Agenda
- Introduction and Overview
- System Automation Components Overview
- Mapping DB2 Components to TSAMP Resources
- Integrating TSAMP with DB2 using db2haicu
- Controlling the Operational State of the DB2 Resources
- Disabling Automation (regain manual control of DB2)
- Serviceability

June 19th, 2014

Introduction

TSAMP monitors an application's resources and automates the starting, stopping, and failover of those resources. It attempts to maintain a desired operational state.

DB2 provides a set of scripts that TSAMP uses to start, stop, and monitor each of the DB2 resources. These scripts are the primary link between the two products.

DB2 also provides a utility called 'db2haicu' that performs the initial setup: it defines the domain and the automation policy within TSAMP.
- The automation policy is the set of definitions of all resources, resource groups, and the relationships between them.
- The resource definitions contain attributes that specify which DB2 start, stop, and monitor script (the automation scripts) to use for a particular resource.

Software Summary

Each of the following software products/components needs to be installed on both systems (primary and standby servers):
- DB2 v ( was latest available at the time this deck was written)
- TSAMP v3.2.2 (Fixpack 7 ( ) or later recommended)
- RSCT v (installed as part of a TSAMP installation)

Installation of DB2 v10.1 includes the DB2 automation policy scripts in /opt/IBM/db2/V10.1/ha/tsa/:
- db2V10_monitor.ksh, db2V10_start.ksh, db2V10_stop.ksh
- mountV10_monitor.ksh, mountV10_start.ksh, mountV10_stop.ksh

These scripts can be upgraded when a DB2 fixpack is installed.

Software Summary (continued …)
- TSAMP/RSCT doesn't need to be installed on a third node to maintain quorum. A Network TieBreaker (db2haicu calls this a quorum device) can be used instead.
- The license file for TSAMP is included with the DB2 Activation Zip file, available via your Passport Advantage account.
- If a base level of TSAMP is not installed, the license file needs to be installed manually.
- TSAMP can be silently installed by the DB2 installer, but if a base level of DB2 is not installed, the TSAMP license again needs to be installed manually.
- See the TSAMP formal documentation for platform compatibility and dependencies:
    For TSAMP v Release Note: HALRN329.pdf
    For TSAMP v4.1 Installation and Configuration Guide: ALICG41.pdf

Example of a DB2 HA Shared Disk environment
[Diagram: Client applications send DB2 transactions over the public network to a virtual IP address (eth0:0). The primary and standby servers each run TSAMP, RSCT, and a DB2 instance. Each server has eth0 on the public network and eth1 on the private network, and the cluster (RSCT) heartbeat runs over both networks. The DB2 database resides on the shared disk.]


System Automation – Components
Reliable Scalable Cluster Technology (RSCT), the "Cluster" software, is made up of some core daemons and some Resource Managers (two shown in orange on the next 2 slides), the most important being the following two:

Resource Manager    Function and classes owned
IBM.ConfigRM        Configuration tasks across the nodes in the domain, including quorum and TieBreaker functionality
IBM.StorageRM       Maps IBM.AgFileSystem resources to IBM.LogicalVolume to IBM.VolumeGroup to IBM.Disk, and manages the mount/umount of the file systems and the varyon/varyoff of the Volume Groups

Tivoli System Automation for Multiplatforms (TSAMP), the "Automation" software, is made up of 3 Resource Managers (shown in blue on the next 2 slides), as follows:

Resource Manager    Function and classes owned
IBM.RecoveryRM      Automation engine. Owns IBM.ResourceGroup, IBM.Equivalency, IBM.ManagedResource, IBM.ManagedRelationship
IBM.GblResRM        Starts, stops, and monitors resources of the classes IBM.Application and IBM.ServiceIP
IBM.TestRM          Controls resources of class IBM.Test

RSCT - the "cluster software"
- HATS (High Availability Topology Services): provides a scalable heartbeat for adapter (network) and node failure detection
- HAGS (High Availability Group Services): distributed node and process coordination, messaging, and synchronization service
- RMC (Resource Monitoring and Control): the backbone of RSCT
    - uses the Resource Managers to map RMC's resource and resource class abstraction to actual calls and commands that control the end resources
    - provides global access for configuring, monitoring, and controlling subsystems and resources throughout the cluster (also known as a peer domain for "HA" environments)
    - handles authorization, granting or denying access to resources based on criteria in ACL files. It does not handle authentication, which is determining the identity of a peer or subcomponent.
- Configuration Resource Manager (IBM.ConfigRM): used in cluster definition (to create and administer a peer domain) and also used for quorum support

Everything is a "Resource" in the TSAMP and RSCT world. There are different kinds of resources, which is where the concept of a resource "class" comes in. There are different Resource Managers, each responsible for managing or controlling resources that belong to a particular set of resource classes. The following diagram shows the mapping of three key Resource Managers to some of the Resource Classes they manage and then to some example Resources:

Tivoli System Automation - the "automation software"

Recovery Resource Manager (IBM.RecoveryRM)
This is the decision engine for IBM Tivoli System Automation. It consists of:
- Resource Manager for resource groups, equivalencies, managed resources, and managed relationships
- Engine part: logic deck and binder
    - the logic deck is responsible for sending requests (start, stop) to resources to satisfy the policy requirements
    - the binder is used to bind a resource to a node (select a constituent of a floating resource)

Global Resource Manager (IBM.GblResRM)
Supports two resource classes:
- IBM.Application
    - defines the behavior for general application resources
    - can be used to start, stop, and monitor processes
    - a generic, very flexible class that can be used to monitor and control various kinds of resources
    - most of the applications that you automate will use this class
- IBM.ServiceIP
    - defines the behavior of Internet Protocol (IP) address resources
    - allows IP addresses to be assigned to a network interface adapter
    - allows IP addresses to "float" between nodes

Consider the servers that make up the cluster ... they are also resources. They are resources of the class IBM.PeerNode. The domain itself is a resource, of class IBM.PeerDomain. The network interfaces are resources, of class IBM.NetworkInterface. The following diagram introduces another Resource Manager (part of RSCT) called IBM.ConfigRM and it shows two Resource Classes it manages and the Resources modelled by those classes:

Quorum and TieBreaker

One of the questions from the 'db2haicu' utility deals with a cluster/automation concept called Quorum.

Quorum
- The number of nodes in a cluster that are required to control the resources, modify the cluster definition, or perform certain cluster operations.
- The main goals of quorum operations:
    - identify who has the majority when a cluster is broken up into sub-clusters
    - keep data consistent, especially when shared file systems are being used
    - protect critical resources and maintain HA control
- Two types: configuration vs. operational quorum
- Note: "configuration" quorum requires a majority of nodes (more than half the number of nodes) to be Online for configuration changes to be carried out.

TieBreaker
- a TieBreaker situation occurs when a cluster with an even number of nodes is split into sub-clusters with equal numbers of nodes
- a TieBreaker is needed to determine which sub-cluster will have operational quorum in a tie situation

db2haicu: Quorum and TieBreaker (continued....)
Network TieBreaker
- The goal is for each system to figure out (via the RSCT infrastructure) which one is operational and should therefore take control (if not already the active node).
- Use a pingable system independent of node1 and node2, for example node3 in our example, although it would be just as easy and viable to use the gateway router for node1 and node2.
- Without an active TieBreaker, automated failover/takeover will NEVER occur.

Create a /usr/sbin/cluster/netmon.cf file on each node. Add the IP addresses (one per line) of 3-5 devices external to the domain that are pingable from each node. This is important in a 2-node cluster because it allows the cluster software (RSCT) to quickly identify the source of a heartbeat problem between the nodes.

The following 7 slides demonstrate how a TieBreaker works.

Base Tie-Breaker Functionality
[Diagram sequence: node1 and node2 are each connected through their eth interfaces to a gateway router. The slides step through the following scenarios.]

Network Tiebreaker Scenarios:
1. Node Failure Scenario: system node1 fails
    1a. System node2 gets quorum using the network tiebreaker
2. Network Adapter Failure Scenario: a network problem affects node1
    2a. Again node2 gets quorum using the network tiebreaker
    2b. System node1 is forced to reboot

Network Tiebreak Assumption:
If node1 can communicate (ping) with the gateway and node2 can communicate (ping) with the gateway, THEN node1 must be able to communicate (heartbeat) with node2.
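The quorum rules described on the preceding slides can be sketched in a few lines. This is a simplified model for illustration only, not how RSCT implements quorum: the function name `has_operational_quorum` and the `tiebreaker_reachable` callback (standing in for a ping of the network tiebreaker address) are hypothetical.

```python
from typing import Callable


def has_operational_quorum(
    online_nodes: int,
    total_nodes: int,
    tiebreaker_reachable: Callable[[], bool],
) -> bool:
    """Simplified quorum decision for one sub-cluster.

    online_nodes: nodes this sub-cluster can still heartbeat with (including itself).
    total_nodes: nodes defined in the domain.
    tiebreaker_reachable: hypothetical callback, e.g. "can I ping the tiebreaker?".
    """
    if 2 * online_nodes > total_nodes:
        # More than half of the nodes: a clear majority.
        return True
    if 2 * online_nodes == total_nodes:
        # Exactly half: a tie, so the tiebreaker decides.
        return tiebreaker_reachable()
    # Fewer than half: no quorum.
    return False


# Two-node domain, node1 has failed: node2 sees 1 of 2 nodes and can ping the gateway.
print(has_operational_quorum(1, 2, lambda: True))
```

In the two-node case every failure produces a tie, which is why the deck stresses that automated failover never occurs without an active TieBreaker.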

Mapping DB2 HA components to TSAMP resources
[Diagram: Node1 and Node2 share one Resource Group, db2_db2inst1_0-rg, which contains three floating resources: db2_db2inst1_0-rs (the DB2 instance), db2mnt-home_db2inst1-rs (the shared file system), and db2ip_10_20_30_42-rs (the virtual IP). dependsOn relationships link resources in the group, and the eth0 interfaces of both nodes on the public network form the Equivalency db2_public_network_0. Each node also has an eth1 interface.]

Mapping DB2 Components to TSAMP Resources
- The DB2 instance called "db2inst1" maps to a TSAMP managed resource called "db2_db2inst1_0-rs".
- The DB2 home directory on the shared disk maps to a TSAMP managed resource called "db2mnt-home_db2inst1-rs".
- The virtual IP address (optional) maps to a TSAMP managed resource called "db2ip_XX_XX_XX_XX-rs", where XX.XX.XX.XX is the virtual IP address.
- A public network can be defined, and this maps to a TSAMP resource (Equivalency) called "db2_public_network_0".
- A private network can be defined, and this maps to a TSAMP resource (Equivalency) called "db2_private_network_0".
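The naming pattern in the list above can be expressed as a small sketch. The helper names are hypothetical, and the rules are inferred only from the example names in this deck (for instance, that the trailing 0 is the partition number and that slashes in a mount point become underscores). db2haicu generates the real names, so treat this as a reading aid and not as a specification.

```python
def instance_resource(instance: str, partition: int = 0) -> str:
    """db2inst1, partition 0 -> db2_db2inst1_0-rs (assumed pattern)."""
    return f"db2_{instance}_{partition}-rs"


def instance_resource_group(instance: str, partition: int = 0) -> str:
    """db2inst1, partition 0 -> db2_db2inst1_0-rg (assumed pattern)."""
    return f"db2_{instance}_{partition}-rg"


def mount_resource(mount_point: str) -> str:
    """/home/db2inst1 -> db2mnt-home_db2inst1-rs (assumed pattern)."""
    return "db2mnt-" + mount_point.strip("/").replace("/", "_") + "-rs"


def service_ip_resource(ip_address: str) -> str:
    """10.20.30.42 -> db2ip_10_20_30_42-rs."""
    return "db2ip_" + ip_address.replace(".", "_") + "-rs"


print(instance_resource("db2inst1"))
print(mount_resource("/home/db2inst1"))
print(service_ip_resource("10.20.30.42"))
```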

IBM.Application class – Example is of a DB2 Instance Resource
    # lsrsrc -s "Name = 'db2_db2inst1_0-rs'" -Ab IBM.Application

    Name = "db2_db2inst1_0-rs"
    StartCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_start.ksh db2inst1 0"
    StopCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_stop.ksh db2inst1 0"
    MonitorCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_monitor.ksh db2inst1 0"
    MonitorCommandPeriod = 10
    MonitorCommandTimeout = 120
    StartCommandTimeout = 330
    StopCommandTimeout = 140
    UserName = "root"
    RunCommandsSync = 1
    ProtectionMode = 1
    ActivePeerDomain = ha_domain
    NodeNameList = {"node1","node2"}
    OpState = 1
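When checking several resources, it can help to load this attribute listing into a dictionary. The following is a minimal sketch that assumes the output has already been captured as text with one "Attribute = value" pair per line, as laid out on this slide. The function name `parse_attributes` is hypothetical, and the parser does not call lsrsrc itself.

```python
def parse_attributes(text: str) -> dict:
    """Turn 'Attribute = value' lines into a dict of strings."""
    attributes = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and headings
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and the double quotes around string values.
        attributes[key.strip()] = value.strip().strip('"')
    return attributes


SAMPLE = '''
Name = "db2_db2inst1_0-rs"
MonitorCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_monitor.ksh db2inst1 0"
MonitorCommandPeriod = 10
MonitorCommandTimeout = 120
OpState = 1
'''

attrs = parse_attributes(SAMPLE)
print(attrs["Name"], attrs["MonitorCommandPeriod"])
```

Values are kept as strings. A caller that needs the timeouts as numbers would convert them explicitly.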

IBM.Application class – Example is of a DB2 mount point
    # lsrsrc -s "Name = 'db2mnt-home_db2inst1-rs'" -Ab IBM.Application

    Name = "db2mnt-home_db2inst1-rs"
    StartCommand = "/usr/sbin/rsct/sapolicies/db2/mountV10_start.ksh /home/db2inst1"
    StopCommand = "/usr/sbin/rsct/sapolicies/db2/mountV10_stop.ksh /home/db2inst1"
    MonitorCommand = "/usr/sbin/rsct/sapolicies/db2/mountV10_monitor.ksh /home/db2inst1"
    MonitorCommandPeriod = 10
    MonitorCommandTimeout = 120
    StartCommandTimeout = 330
    StopCommandTimeout = 600
    UserName = "root"
    RunCommandsSync = 1
    ProtectionMode = 1
    ActivePeerDomain = ha_domain
    NodeNameList = {"node1","node2"}
    OpState = 1

IBM.ServiceIP class – Virtual IP addresses
    # lsrsrc -Ab IBM.ServiceIP

    Name = "db2ip_10_20_30_42-rs"
    IPAddress = " "
    NetMask = " "
    ProtectionMode = 1
    ActivePeerDomain = "ha_domain"
    NodeNameList = {"node1","node2"}
    OpState = 1

Using ‘db2haicu’ to create Domain and Automation Policy
Step 1. Run the following command as root on each node to configure the RSCT ACLs (security) and allow cluster communication between the servers:

    preprpnode node1 node2     (on node1)
    preprpnode node1 node2     (on node2)

Step 2. Log on to the standby server as the instance owner and issue:

    db2haicu

The db2haicu tool determines the current instance and applies all cluster configuration steps based on it. It also activates all databases for the instance as it attempts to gather information.

Next, db2haicu determines whether a domain has already been created by searching for an "Online" domain. If it doesn't find one, you will see the following :

db2haicu: Create a new domain
1) Type '1' and press Enter at the following initial prompt:

    Create a domain and continue? [1]
    1. Yes
    2. No
    1

2) Enter a unique name for the domain you want to create (we use ha_domain) and the number of nodes contained in the domain (2 in our case).

    Create a unique name for the new domain:
    HA_domain
    Nodes must now be added to the new domain.
    How many cluster nodes will the domain HA_domain contain?
    2

3) Follow the prompts to enter the names of the two cluster nodes:

    Enter the host name of a machine to add to the domain:
    node01
    node02
    db2haicu can now create a new domain containing the two machines that you specified. If you choose not to create a domain now, db2haicu will exit.
    Create the domain now? [1]
    Creating domain HA_domain in the cluster ...
    Creating domain HA_domain in the cluster was successful.

db2haicu: List the new domain
At this point there would be a domain called "ha_domain" in an online state:

    lsrpdomain
    Name       OpState  RSCTActiveVersion  MixedVersions  TSPort
    ha_domain  Online                      No

You can also list the states of the individual nodes and see output similar to the following, from either server:

    lsrpnode
    Name   OpState  RSCTVersion
    node1  Online
    node2  Online

db2haicu: Quorum and TieBreaker
The next db2haicu question deals with the creation of a Network TieBreaker:

At this point you could list the TieBreaker resources and see the new network TieBreaker:

    lsrsrc -Ab IBM.TieBreaker

The following command should show that your new network TieBreaker is currently active:

    lsrsrc -c IBM.PeerNode OpQuorumTieBreaker

db2haicu: Network Equivalencies
Special TSAMP groups called Equivalencies are created containing the network interfaces found on each of the servers in the cluster. This allows TSAMP to be notified of NIC failures by the RSCT subsystem (which harvested the NICs) and react accordingly.

We use db2haicu to create an equivalency called db2_public_network_0 and populate it with the en0 NIC from the server called "node1" :

db2haicu: Network Equivalencies (continued …)
Next we add the en0 NIC from the other server, "node2", to the same equivalency (db2_public_network_0) :

db2haicu: Private Network
In the previous slide, notice the option to say "Yes" or "No" when adding NICs to a network. When asked if a non-public NIC should be added to a private network, you could just say "No" because it's not used for the DB2 resources. The NICs in the private network can still be used by RSCT for a second cluster heartbeat ring. This is the default.

db2haicu: Listing the Equivalency
The network equivalency(ies) would be created at this point and can be listed as follows:

    lsequ -Ab
    Displaying Equivalency information:
    All Attributes

    Equivalency 1:
        Name                                = db2_public_network_0
        MemberClass                         = IBM.NetworkInterface
        Resource:Node[Membership]           = {en0:node1,en0:node2}
        SelectString                        = ""
        ActivePeerDomain                    = ha_domain
        Resource:Node[ValidSelectResources] = {en0:node1,en0:node2}

db2haicu: Selecting which Cluster Manager to use
Select "TSA" (which is a synonym for TSA MP). This results in the CLUSTER_MGR variable within the dbm cfg being set to "TSA":

    ...
    The cluster manager name configuration parameter (high availability configuration parameter) is not set. For more information, see the topic 'cluster_mgr - Cluster manager name configuration parameter' in the DB2 Information Center. Do you want to set the high availability configuration parameter?
    The following are valid settings for the high availability configuration parameter:
    1. TSA
    2. Vendor
    Enter a value for the high availability configuration parameter: [1]
    1

db2haicu: Select the appropriate Failover Policy
The failover policy determines the machines on which the cluster manager will restart the database manager if the database manager goes offline unexpectedly:

    ...
    The following are the available failover policies:
    1. Local Restart -- during failover, the database manager will restart in place on the local machine
    2. Round Robin -- during failover, the database manager will restart on any machine in the cluster domain
    3. Active/Passive -- during failover, the database manager will restart on a specific machine
    4. M+N -- during failover, the database partitions on one machine will failover to any other machine in the cluster domain (used with DPF instances)
    5. Custom -- during failover, the database manager will restart on a machine from a user-specified list
    Enter your selection:
    3

For a simple two-node single partition setup (such as this one), it is generally best to select option 3.

db2haicu: Non-critical mount points
Enter mount points that you do NOT want TSAMP to manage:

    ...
    You can identify mount points that are non-critical for failover. For more information, see the topic 'Identifying mount points that are non-critical for failover' in the DB2 Information Center. Are there any mount points that you want to designate as non-critical? [2]
    1. Yes
    2. No
    1
    Enter the full path of the mount to be made non-critical:
    /tmp

Add to the non-critical list any mount point that you are sure you never want to fail over. This list should include any mount points specified in /etc/fstab that are local mount points and will never be failed over.

db2haicu: More on mount points (file system resources)
During the initial execution of "db2haicu", you don't actually specify the file systems that you want TSAMP to manage. The DB2 engine dynamically adds (and removes) file system related resources to the automation policy based on certain DB2 operations:
- Creating a new database will at least add the file system hosting the DB2 home directory
- Dropping a database may result in file system resources being removed from the policy
- Table space creation/removal and adding/removing containers for existing table spaces will likely result in the addition or removal of file system resources within the automation policy
- Changing the active log path
- Changing the DB2 diagnostic log path
- Changing the audit log path
- A database restore can result in the automatic addition of file system resources that don't already exist

You can re-run "db2haicu" afterwards to explicitly add file systems (db2haicu enters a "maintenance" mode when executed after a domain has already been created).

db2haicu: Requirements for mount points (file system resources)
- All file systems in a Volume Group that contains at least one TSAMP managed file system must be managed by TSAMP
    - If there is a file system in the VG that is not in the automation policy, the DB2 "mountV10_stop.ksh" script will not be able to varyoff the VG, and thus a failover (controlled or automated) will never be possible
    - The file system resources will likely be set to "Stuck online" on the original active node
- You can have multiple file systems from multiple Volume Groups in the automation policy
- All file systems need to be defined in /etc/fstab (Linux) or /etc/filesystems (AIX) and be set to noauto (not to auto-mount at boot time)
    - If you have a managed file system that is not defined in the fstab/filesystems file, the DB2 "mountV10_monitor.ksh" script will deliberately set the state of the resource to "Failed offline" (and running "resetrsrc" will not help you!)
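The /etc/fstab requirement above lends itself to a quick pre-check on Linux. The sketch below is an illustration only: `fstab_problems` is a hypothetical helper, the device names in the sample are invented, and it checks only that each managed mount point is defined and carries the noauto option. It does not cover the AIX /etc/filesystems format or the Volume Group rule.

```python
def fstab_problems(fstab_text: str, managed_mounts: list) -> dict:
    """Return {mount_point: problem} for managed mounts that break the fstab rule."""
    options_by_mount = {}
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        # fstab fields: device, mount point, type, options, dump, pass
        options_by_mount[fields[1]] = fields[3].split(",")

    problems = {}
    for mount in managed_mounts:
        if mount not in options_by_mount:
            problems[mount] = "not defined in fstab"
        elif "noauto" not in options_by_mount[mount]:
            problems[mount] = "missing noauto option"
    return problems


# Hypothetical fstab content; device names are made up for the example.
SAMPLE_FSTAB = """
/dev/vgdb2/lvhome  /home/db2inst1  ext3  noauto    0 0
/dev/vgdb2/lvdata  /db2data        ext3  defaults  0 0
"""

print(fstab_problems(SAMPLE_FSTAB, ["/home/db2inst1", "/db2data", "/db2logs"]))
```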

db2haicu: Entering Hostnames for the Active/Passive Pair
    Active/Passive failover policy was chosen. You need to specify the host names of an active/passive pair :
    Enter the host name for the active cluster node:
    node01
    Enter the host name for the passive cluster node:
    node02

At this point the db2haicu tool adds the DB2 partition instance to the automation policy (resource model).

    db2 get dbm cfg | grep -i cluster
    Cluster manager = TSA

    lsrg
    Resource Group names:
    db2_db2inst1_0-rg

Once the dbm is configured with "Cluster manager" set to TSA, the DB2 engine expects to have a domain Online. You will have issues stopping/starting DB2 if no domain is Online.

Run 'db2haicu -disable' if you want to break the connection between DB2 and TSAMP. This is the only way to unset "Cluster manager" for DB2 v10.x.

db2haicu: DB2 Instance Resource
    lsrg -g db2_db2inst1_0-rg
    Displaying Member Resource information:
    For Resource Group "db2_db2inst1_0-rg".

    Resource Group 1:
        Name                 = db2_db2inst1_0-rg
        MemberLocation       = Collocated
        Priority             = 0
        AllowedNode          = db2_db2inst1_0-rg_group-equ
        NominalState         = Online
        ActivePeerDomain     = ha_domain
        OpState              = Online
        TopGroup             = db2_db2inst1_node2_0-rg
        TopGroupNominalState = Online

Note the AllowedNode attribute, which points to a PeerNode Equivalency that dictates which servers this resource is allowed to run on. See the next slide for output that shows the details of this PeerNode Equivalency.

db2haicu: Dependencies for the DB2 Instances
The following relationships would exist in the automation policy:

    lsrel -Ab
    Displaying Managed Relationship Information:
    All Attributes

    Managed Relationship 1:
        Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_0-rs
        Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
        Relationship                = DependsOn
        Conditional                 = NoCondition
        Name                        = db2_db2inst1_0-rs_DependsOn_db2_public_network_0-rel
        ActivePeerDomain            = ha_domain

    Managed Relationship 2:
        Class:Resource:Node[Target] = {IBM.Application:db2mnt-home_db2inst1-rs}
        Name                        = db2_db2inst1_0-rs_DependsOn_db2mnt-home_db2inst1-rs-rel

This shows that the DB2 instance is dependent on the operational state of the NICs in the public network. If the NICs are not Online, TSAMP will not be able to start the DB2 instance.

The DB2 instance is also dependent on the file system that contains the DB2 home directory.

db2haicu: Adding a Virtual IP address
After the database partition has been added to the cluster, db2haicu prompts you to create a virtual IP address (it needs to be compatible with your "public" network):

    Do you want to configure a virtual IP address for the DB2 partition: 0? [2]
    1. Yes
    2. No
    1
    Enter the virtual IP address:
    Enter the subnet mask for the virtual IP address : [ ]
    Select the network for the virtual IP :
    1. db2_public_network_0
    Enter selection:
    Adding virtual IP address to the domain ...
    Adding virtual IP address to the domain was successful.

You must make sure that your IP address and subnet mask values are well formed and correspond to the subnet mask of the network you chose. All invalid inputs will be rejected. In such a case, examine the IP addresses and netmasks of the NIC components of the network (using the 'ifconfig' command) and verify that the IP address and netmask specified are compatible with each of the NICs in the network.

Make sure that the IP address that you want to add is NOT already present on the network.
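The compatibility check described above can be rehearsed offline with Python's standard ipaddress module before answering the db2haicu prompts. In this sketch the function name `vip_matches_nic` is hypothetical, "compatible" is interpreted as "the virtual IP and the NIC address fall in the same subnet", and all addresses other than 10.20.30.42 (taken from the resource name used in this deck) are made-up examples.

```python
import ipaddress


def vip_matches_nic(vip: str, vip_netmask: str, nic_ip: str, nic_netmask: str) -> bool:
    """True when the virtual IP and the NIC address resolve to the same subnet."""
    # strict=False lets a host address plus netmask be reduced to its network.
    vip_network = ipaddress.ip_network(f"{vip}/{vip_netmask}", strict=False)
    nic_network = ipaddress.ip_network(f"{nic_ip}/{nic_netmask}", strict=False)
    return vip_network == nic_network


# Hypothetical NIC addresses for the public interfaces on node1 and node2.
for nic_ip in ("10.20.30.40", "10.20.30.41"):
    print(nic_ip, vip_matches_nic("10.20.30.42", "255.255.255.0", nic_ip, "255.255.255.0"))
```

A malformed address or netmask raises ValueError, which mirrors the point that the values must be well formed.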

db2haicu: The Complete Automation Policy …
Let's look at what resources are listed in the 'lssam' output after completing the execution of 'db2haicu':

    lssam
    Online IBM.ResourceGroup:db2_db2inst1_0-rg Nominal=Online
            |- Online IBM.Application:db2_db2inst1_0-rs
                    |- Online IBM.Application:db2_db2inst1_0-rs:node1
                    '- Offline IBM.Application:db2_db2inst1_0-rs:node2
            |- Online IBM.Application:db2mnt-home_db2inst1-rs
                    |- Online IBM.Application:db2mnt-home_db2inst1-rs:node1
                    '- Offline IBM.Application:db2mnt-home_db2inst1-rs:node2
            '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs
                    |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                    '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2

You can also issue the command 'db2pd -ha' from the instance owner ID to examine the state of the resources.

The resource group to instance mapping is stored in a DB2 owned binary configuration file called db2ha.sys (under the $INSTANCEDIR/sqllib/cfg directory). We can dump the values using the db2hareg utility. Note that TSAMP does not use this file in any way.
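To answer "which node is each resource running on?" from a saved copy of the lssam output shown on this slide, a short parser is enough. This is a sketch under the assumption that the text follows the "<state> <class>:<resource>[:<node>]" layout seen here. The function name `online_nodes` is hypothetical and it does not run lssam itself.

```python
def online_nodes(lssam_text: str) -> dict:
    """Map each resource name to the node whose constituent is Online."""
    placement = {}
    for line in lssam_text.splitlines():
        # Remove the tree-drawing prefixes so only state and resource tokens remain.
        tokens = line.replace("|-", " ").replace("'-", " ").split()
        for index, token in enumerate(tokens):
            if token.startswith("IBM."):
                state = " ".join(tokens[:index])  # e.g. "Online", "Pending Online"
                parts = token.split(":")
                # Three parts means a per-node constituent: class, resource, node.
                if len(parts) == 3 and state == "Online":
                    placement[parts[1]] = parts[2]
                break
    return placement


SAMPLE = """
Online IBM.ResourceGroup:db2_db2inst1_0-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_0-rs
                |- Online IBM.Application:db2_db2inst1_0-rs:node1
                '- Offline IBM.Application:db2_db2inst1_0-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs
                |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
"""

print(online_nodes(SAMPLE))
```

Because all members of the group are collocated, every resource should report the same node when the group is healthy.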

Starting and Stopping all DB2 resources in the Group
Change the "Nominal" (desired) state of the resource group to "Online" using the following TSAMP command:

    # chrg -o online db2_db2inst1_0-rg

Changing the desired state of the resource group instructs TSAMP to start/stop the resources using the scripts defined in the "StartCommand" and "StopCommand" attributes of a resource of class "IBM.Application", as is the case for a DB2 instance resource.

The DB2 start script used to start the instance will also activate all the databases.

Check that all resource members reach an online state using the "lssam -top" command.

To take all resources offline, change the group's Nominal state to offline:

    # chrg -o offline db2_db2inst1_0-rg

Note that you will not be able to start the DB2 instance using db2start, or manually mount any file system, if the Nominal state of the group is Offline. The exception is if you place TSAMP into Manual mode first:

    # samctrl -M T

Use 'samctrl -M F' to place TSAMP back into Auto mode.

Stopping the DB2 Instance Only
If the database is still active, the force option is required for the db2stop command:

    db2stop force

DB2 will also request that TSAMP lock the resource group to prevent TSAMP from trying to restart the instance.

    Pending Online IBM.ResourceGroup:db2_db2inst1_0-rg Request=Lock Nominal=Online
            |- Offline IBM.Application:db2_db2inst1_0-rs Control=SuspendedPropagated
                    |- Offline IBM.Application:db2_db2inst1_0-rs:node1
                    '- Offline IBM.Application:db2_db2inst1_0-rs:node2
            |- Online IBM.Application:db2mnt-home_db2inst1-rs Control=SuspendedPropagated
                    |- Online IBM.Application:db2mnt-home_db2inst1-rs:node1
                    '- Offline IBM.Application:db2mnt-home_db2inst1-rs:node2
            '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs Control=SuspendedPropagated
                    |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
                    '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2

To restart the instance:

    db2start

The DB2 engine will also request that TSAMP unlock the resource group, at which point TSAMP can resume automating its members.

Taking the Node(s) Offline
Take an individual node in the cluster offline before shutting down or rebooting that server:

    # stoprpnode <node_name>

The resources on that node would need to be in an offline state first.

You can also take the entire domain offline:

    # stoprpdomain <domain_name>

The resources on both nodes would need to be in an offline state first.

The domain can be started using the following command:

    # startrpdomain <domain_name>

An individual node can be restarted by issuing the following command from a node that is already online:

    # startrpnode <node_name>

Performing a Manual (Controlled) Failover
A controlled failover is performed by TSAMP by issuing a "move" request:

    # rgreq -o move db2_db2inst1_0-rg

There are 3 stages to a move request:
1. All resources in the group are brought offline on the current active node
2. TSAMP finds an alternate location to start all the resources (a trivial choice here, since there is only one other node)
3. All resources in the group are brought online on the other node

Use 'lssam -top' to monitor the move activity:
- You will see "Pending offline" states
- You will see "Pending online" states
- All resources must make it online, else all resources will be brought offline

You should also use native DB2 commands to check the state of the instance. Use the 'mount' and 'df' commands to check the state of the file systems. You can use the 'ifconfig -a' and 'ping' commands to check that the virtual IP address is assigned.

What does “Failed offline” mean ?
- If an online/start attempt fails, TSAMP will set the resource to "Failed offline"
- This is *not* a TSAMP problem
- Diagnose from the perspective of the underlying resource that would not come online
- If you believe the source of the start problem has been resolved, you would then reset the "Failed offline" state to allow TSAMP to try starting the resource again, for example:

    # resetrsrc -s "Name = 'db2_db2inst1_0-rs' & NodeNameList={'node1'}" IBM.Application

This will cause the db2V10_stop.ksh script to be executed on node1 and, if successful (return code 0), the Operational State will change to "Offline". If the resource is not already online on node2, TSAMP would then try to start the resource on node1.

Disable/Re-enable DB2/TSAMP Integration (using db2haicu)
To prevent TSAMP from taking any action on the DB2 instance, disable HA:

    db2haicu -disable

The database manager's configuration will be updated so that "Cluster manager" is unset. With "Cluster manager" unset, you would be able to take the entire domain offline without affecting the manual operation of the DB2 instances.

As part of the -disable process, DB2 will request that TSAMP lock the Resource Group to prevent TSAMP from taking any action against DB2 resources:

To re-enable, run 'db2haicu', as instance owner, on each server, select "1" (Yes) when asked if you want to enable high availability, and then choose "TSA".

Alternative for preventing TSAMP starting and stopping DB2 Resources
The quickest way of preventing TSAMP from stopping/starting the resources is to change TSAMP to manual mode (Automation = Manual):

    # samctrl -M T

The only action TSAMP will continue to perform is monitoring the resources, by continuing to execute the monitoring scripts associated with each resource.

Check the current automation mode with the following command:

    # lssamctrl

To re-enable automation mode (Automation = Auto):

    # samctrl -M F

Although changing the Nominal (desired) state of a resource group to "offline" will trigger TSAMP to stop its resources, this does not mean automation is stopped. TSAMP will attempt to maintain the offline state, so if any resource is manually started, TSAMP will stop it again.

Serviceability - CLI commands
Use the TSAMP command “lssam” as previously demonstrated:
    # lssam -top
    # lssam -g <resource_group>
An alternative is the following TSAMP command:
    # lsrg -m

Serviceability – logs
Three main areas of logging:
1. Logging from the DB2 automation scripts (i.e. start/stop/monitor scripts)
    “logger” statements in the policy scripts are written to syslog (e.g. /var/log/messages on Linux systems)
2. Logging of TSAMP / RSCT core processes (e.g. quorum, monitor command timeouts)
    Written to syslog (Linux/AIX/Solaris) and errpt (AIX)
    Daemon log file directory: /var/ct/<DOMAIN>/log/mc/IBM.<DAEMON>RM where <DAEMON> = Recovery, GblRes, …
    These are circular logs and cannot be opened directly with an editor. Format them first:
        rpttr -o dtic <log file dir>/trace_summary > my_trace.out
3. DB2’s log file, “db2diag.log”, with DIAGLEVEL 3 or higher
Use TSAMP Level 2 Support’s ‘getsadata’ script to collect data.

Serviceability – syslog messages from DB2 automation scripts
The following syslog message indicates the DB2 instance is Online (return code = 1):
    <timestamp> node1 user:info db2V10_monitor.ksh[524352]: Returning 1 (db2inst1, 0)
The following syslog message indicates the DB2 instance is Offline (return code = 2):
    <timestamp> node1 user:info db2V10_monitor.ksh[524352]: Returning 2 (db2inst1, 0)
The DB2 instance monitors repeat approximately every 10 seconds on each server if you’re using a default automation policy.
The following syslog message indicates a mount resource is Online:
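Because the monitor script logs roughly every 10 seconds, it can help to reduce those lines to a plain state sequence. The following is a minimal sketch that assumes the “Returning N” message format shown above. The function name monitor_state is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on stdin and print the state reported
# by each db2V10_monitor.ksh "Returning N" message (1 = Online, 2 = Offline).
monitor_state() {
    # Keep only the numeric return code from matching lines.
    sed -n 's/.*db2V10_monitor\.ksh\[[0-9]*\]: Returning \([0-9][0-9]*\) .*/\1/p' |
    while read -r rc; do
        case $rc in
            1) echo Online ;;
            2) echo Offline ;;
            *) echo "rc=$rc" ;;   # any other code is printed unchanged
        esac
    done
}

# Example with the two sample messages from above; prints Online, then Offline.
printf '%s\n' \
    '<timestamp> node1 user:info db2V10_monitor.ksh[524352]: Returning 1 (db2inst1, 0)' \
    '<timestamp> node1 user:info db2V10_monitor.ksh[524352]: Returning 2 (db2inst1, 0)' |
    monitor_state
```

On a Linux node, the input would typically come from the syslog file, for example: grep db2V10_monitor.ksh /var/log/messages | monitor_state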

Serviceability – syslog messages from DB2 automation scripts
The following syslog messages occur when TSAMP starts a DB2 instance:
    <timestamp> node1 user:notice db2V10_start.ksh[856142]: Entered db2V10_start.ksh, db2inst1, 0
    <timestamp> node1 user:debug db2V10_start.ksh[856146]: Able to cd to /home/db2inst1/sqllib : db2V10_start.ksh, db2inst1, 0
    <timestamp> node1 user:debug db2V10_start.ksh[262214]: 1 partitions total: db2V10_start.ksh, db2inst1, 0
    <timestamp> node1 user:notice db2V10_start.ksh[393252]: Returning 0 from db2V10_start.ksh ( db2inst1, 0)
If db2start had already been used to start the instance, the message below is seen instead of the “1 partitions total” message shown above:
    <timestamp> node1 user:info db2V10_start.ksh[856150]: db2V10_start.ksh is already up...
The following syslog messages occur when TSAMP starts a mount resource:

Serviceability – syslog messages from DB2 automation scripts
The following syslog messages occur when TSAMP stops a DB2 instance. This includes resetting a Failed Offline state for a DB2 instance resource:
    <timestamp> node1 user:notice db2V10_stop.ksh[856142]: Entered db2V10_stop.ksh, db2inst1, 0
    <timestamp> node1 user:notice db2V10_stop.ksh[393252]: Returning 0 from db2V10_stop.ksh ( db2inst1, 0)
The following syslog messages occur when TSAMP stops a mount resource:

Serviceability – syslog messages from DB2 automation scripts
A manual/controlled failover (move request) results in mountV10_stop.ksh and db2V10_stop.ksh messages on the currently active node, and mountV10_start.ksh and db2V10_start.ksh messages on the other node.

Serviceability – syslog messages from TSAMP/RSCT
The following set of messages indicates a cluster communication problem (domain split):
First, the state of the domain changes to PENDING_QUORUM on each node:
    CONFIGRM_PENDINGQUORUM_ER The operational quorum state of the active peer domain has changed to PENDING_QUORUM.
The Automation Engine (RecoveryRM) on each node reports that the other node has left the domain:
    RECOVERYRM_INFO_4_ST A member has left. Node number = 1
The network TieBreaker is tested, and rc=0 indicates a successful poll of the network TieBreaker:
    samtb_net[ ]: op=reserve ip= rc=0 log=1 count=2
If the TieBreaker poll is successful, the node regains QUORUM:
    CONFIGRM_HASQUORUM_ST The operational quorum state of the active peer domain has changed to HAS_QUORUM.
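To view this sequence in order on a busy system, the relevant messages can be filtered out of syslog. The following is a minimal sketch that matches only the message identifiers quoted above. The function name quorum_timeline is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep only the syslog lines that belong to the
# domain-split sequence (quorum state changes, member-left notices,
# and network TieBreaker operations).
quorum_timeline() {
    grep -E 'CONFIGRM_PENDINGQUORUM_ER|RECOVERYRM_INFO_4_ST|samtb_net\[|CONFIGRM_HASQUORUM_ST'
}
```

On Linux, this could be used as: quorum_timeline < /var/log/messages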

Serviceability – syslog messages from TSAMP/RSCT
The following messages are expected when TSAMP is assigning and removing ServiceIP (Virtual IP address) resources:
    <timestamp> <node_name> daemon:notice GblResRM[ ]: … :::GBLRESRM_IPONLINE IBM.ServiceIP assigned address on device. IBM.ServiceIP en0
    <timestamp> <node_name> daemon:notice GblResRM[618532]: … :::GBLRESRM_IPOFFLINE IBM.ServiceIP removed address. IBM.ServiceIP
Releasing the TieBreaker and removing the TieBreaker reserve block when the node has rejoined the domain:
    <timestamp> <node_name> daemon:info samtb_net[790758]: op=release ip= rc=0 log=1 count=2
    <timestamp> <node_name> daemon:info samtb_net[925932]: remove reserve block /var/ct/samtb_net_blockreserve_
A MonitorCommand for a resource of class IBM.Application reached its defined timeout:
    <timestamp> <node_name> GblResRM[24275]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,Application.C,1.2.1,2434 :::GBLRESRM_MONITOR_TIMEOUT IBM.Application monitor command timed out. Resource name <resource_name>
Similar TIMEOUT messages exist for StartCommand and StopCommand scripts.

Serviceability – Example of the TSAMP trace summary files
First format the trace_summary file(s):
    rpttr -o dtic <log file dir>/trace_summary > my_trace_summary.txt
IBM.RecoveryRM traces (on the “master” node only) show all ‘online/offline order’ statements, Binder messages, and exceptions:
    16:10: T(229390) _RCD Offline Request against db2_db2inst1_0-rs on node node2
    16:10: T(229390) _RCD Offline request injected: db2_db2inst1_0-rg /ResGroup/IBM.ResourceGroup
    16:10: T(229390) _RCD Online request injected: db2_db2inst1_0-rg /ResGroup/IBM.ResourceGroup
    16:10: T(229390) _RCD RIBME-Hist for <NULL>: BINDER: Bind db2_db2inst1_0-rg /ResGroup/IBM.ResourceGroup
IBM.GblResRM traces (from each individual node) show all start/stop command executions and service IP online/offline events:
    13:56: T(16386) _GBD Monitor reports: Network device "en0:0" (IP address ) flagged UP. Bringing resource "db2ip_10_20_30_42-rs" (handle 0x6029 0xffff 0x6887df04 0x3589a7c9 0x1005fb59 0xcddd3e10) online.
    13:57: T(163851) _GBD Resource "db2ip_10_20_30_42-rs" (handle 0x6029 0xffff 0x6887df04 0x3589a7c9 0x1005fb59 0xcddd3e10): IP address has been successfully taken offline on network interface "en0:0"

Serviceability
Operational State (OpState) values:
UNKNOWN (0)
    Generally a problematic state; it really shouldn’t be deliberately used in the automation scripts.
ONLINE (1)
OFFLINE (2)
    Offline, and can be started on this node if needed.
FAILED_OFFLINE (3)
    Offline, and this node is not a possible node on which to start the resource.
    If MonitorCommand returns FAILED_OFFLINE, availability can change as soon as MonitorCommand returns something different, such as Offline (return code 2).
    If the status is set to FAILED_OFFLINE because StartCommand did not succeed within RetryCount, manual intervention is needed to fix the underlying resource and reset the resource (resetrsrc).
STUCK_ONLINE (4)
    Manual intervention is needed to stop the underlying resource.
PENDING_ONLINE (5)
    No action is taken in this state; the resource should eventually become online, or the start attempt will time out.
PENDING_OFFLINE (6)
    No action is taken in this state; the resource should eventually become offline, or the stop attempt will time out.
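When reading numeric OpState values in command output or traces, a small lookup helps avoid mistakes. The following is a minimal sketch of the mapping listed above. The function name opstate_name is hypothetical and is not a TSAMP command.

```shell
#!/bin/sh
# Hypothetical helper: translate a numeric OpState value into its name,
# using the values listed above.
opstate_name() {
    case $1 in
        0) echo UNKNOWN ;;
        1) echo ONLINE ;;
        2) echo OFFLINE ;;
        3) echo FAILED_OFFLINE ;;
        4) echo STUCK_ONLINE ;;
        5) echo PENDING_ONLINE ;;
        6) echo PENDING_OFFLINE ;;
        *) echo "unrecognized OpState: $1" >&2; return 1 ;;
    esac
}

opstate_name 3   # prints FAILED_OFFLINE
```

The numeric value would typically come from a resource query. This sketch assumes it is supplied by hand.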

Serviceability
Check syslog and trace_summary to see if TSAMP is issuing start/stop orders/commands:
    If yes, the problem is most likely in the DB2 automation scripts or core DB2 components.
    If no, the problem is most likely in the cluster/automation software, requiring TSAMP Level 2 involvement.
If Operational State = UNKNOWN (OpState=0):
    Check syslog and trace_summary for GBLRESRM_MONITOR_TIMEOUT.
    Fix: increase the MonitorCommandTimeout value:
        chrsrc -s "Name = '<resource_name>'" IBM.Application MonitorCommandTimeout=<new value>
        lsrsrc -s "Name = '<resource_name>'" IBM.Application Name MonitorCommandTimeout
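To find which resources are candidates for a larger MonitorCommandTimeout, the timeout messages in syslog can be reduced to a list of resource names. The following is a minimal sketch that assumes the message ends with “Resource name <resource_name>”, as in the sample message earlier in this deck. The function name timeout_resources is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on stdin and list each resource that
# logged a GBLRESRM_MONITOR_TIMEOUT message, once per resource.
timeout_resources() {
    # Assumes the resource name is the token after "Resource name ".
    sed -n 's/.*GBLRESRM_MONITOR_TIMEOUT.*Resource name \([^ ]*\).*/\1/p' | sort -u
}
```

Each name printed can then be substituted for <resource_name> in the chrsrc and lsrsrc commands above.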
