IBM Tivoli System Automation for Multiplatforms v3.2

Integrating TSAMP v3.2 with DB2 HADR v10.1
Author: Gareth Holl
Date: May 27th, 2015

In this training module for Tivoli System Automation for Multiplatforms version 3.2, you learn how to exploit the event notification feature provided by the RSCT cluster infrastructure.

Objectives
When you have completed this module, you will be able to perform these tasks:
- Explain the general operation of the TSAMP product
- Identify each of the products and components that make up the total solution and note the integration points
- Examine what the 'db2haicu' utility does from the perspective of TSAMP and the automation policy
- Learn how to control and service the combination of TSAMP and DB2 HADR
After completing this module, you will be able to perform these tasks:
- Create a new condition
- Create a new response
- Associate a condition with a response
- Activate your new condition/response combination

Agenda
- Introduction and Overview
- System Automation Components Overview
- Mapping DB2 Components to TSAMP Resources
- Integrating TSAMP with DB2 HADR using db2haicu
- Controlling the Operational State of the DB2 Resources
- Disabling Automation (regain manual control of DB2)
- Serviceability
May 27, 2015

Introduction
- DB2 provides a High Availability Disaster Recovery (HADR) feature that keeps a primary and standby database synchronized and allows an administrator to switch control to a standby DB2 server.
- DB2 provides a set of scripts that TSAMP uses to start, stop, and monitor each of the DB2 resources. These scripts are the primary link between the two products.
- DB2 provides a utility called 'db2haicu' that performs the initial setup: it defines the domain and the automation policy within TSAMP.
  - The automation policy is the set of definitions of all resources, resource groups, and the relationships between them.
  - The resource definitions contain attributes that specify which DB2 start, stop, and monitor script (the automation scripts) to use for a particular resource.
- TSAMP can monitor an application's resources and automate the starting, stopping, and failover of those resources. It attempts to maintain a desired operational state.

Why the need for TSAMP?
- HADR does not perform active monitoring of the topology.
- HADR will not detect a node outage or NIC failure.
- HADR cannot take automated actions in the event of a failed primary instance, node outage, or NIC failure.
- Instead, a DB administrator must monitor the HADR pair manually and issue the appropriate takeover commands when the primary database is interrupted.
This is where TSAMP's automation capabilities come into play:
- TSAMP can perform restart actions if an instance unexpectedly exits.
- TSAMP can perform a HADR takeover automatically when certain problems are detected on the primary server.

Software Summary
Each of the following software products/components needs to be installed on both systems (primary and standby servers):
- DB2 v ( was latest available at the time this deck was written)
- TSAMP v3.2.2 (Fixpack 8 ( ) or later recommended)
- RSCT v (installed as part of a TSAMP installation)
Installation of DB2 v10.1 includes the DB2 automation policy scripts in /opt/IBM/db2/V10.1/ha/tsa/:
- db2V10_monitor.ksh, db2V10_start.ksh, db2V10_stop.ksh
- hadrV10_monitor.ksh, hadrV10_start.ksh, hadrV10_stop.ksh
- lockreqprocessed
These scripts can be upgraded when a DB2 fixpack is installed.

Software Summary (continued …)
- TSAMP/RSCT doesn't need to be installed on a 3rd node to maintain quorum. A Network TieBreaker can be used instead (db2haicu calls this a quorum device).
- The license file for TSAMP is included with the DB2 Activation Zip file, available via your Passport Advantage account. If a base level of TSAMP is not installed, then the license file will need to be manually installed.
- TSAMP can be silently installed by the DB2 installer, but if a base level of DB2 is not installed, then again the TSAMP license will need to be manually installed.
- See the TSAMP formal documentation for platform compatibility and dependencies:
  - For TSAMP v Release Note: .doc_3.2.2/pdfs/HALRN330.pdf
  - For TSAMP v4.1 Knowledge Center: mp.doc_ /welcome_samp.html?lang=en

Example 1 of a DB2 HADR environment: HADR replication via the public network
[Figure: Primary Server and Standby Server, each running TSAMP, RSCT, a DB2 instance, and a DB2 HADR database, with one NIC (eth0) per server on the public network. Client application transactions reach DB2 through a virtual IP (eth0:0). The cluster (RSCT) heartbeat and the HADR replication traffic both use the public network.]

Example 2 of a DB2 HADR environment: HADR replication via a private network
[Figure: Primary Server and Standby Server, each running TSAMP, RSCT, a DB2 instance, and a DB2 HADR database. Client application transactions reach DB2 through a virtual IP (eth0:0) on the public network (eth0). A second NIC per server (eth1) connects through a switch to a private network that carries the HADR replication traffic. A cluster (RSCT) heartbeat is shown on both networks.]


System Automation – Components
Reliable Scalable Cluster Technology (RSCT), the "Cluster" software, is made up of some core daemons and some Resource Managers (two shown in orange on the next 2 slides), the most important being the following two:

Resource Manager    Function and classes owned
IBM.ConfigRM        Configuration tasks across the nodes in the domain, including quorum and TieBreaker functionality
IBM.StorageRM       Maps IBM.AgFileSystem resources to IBM.LogicalVolume to IBM.VolumeGroup to IBM.Disk, and manages the mount/umount of the filesystems and the varyon/varyoff of the Volume Groups

Tivoli System Automation for Multiplatforms (TSAMP), the "Automation" software, is made up of 3 Resource Managers (shown in blue on the next 2 slides), as follows:

Resource Manager    Function and classes owned
IBM.RecoveryRM      Automation engine. Owns IBM.ResourceGroup, IBM.Equivalency, IBM.ManagedResource, IBM.ManagedRelationship
IBM.GblResRM        Starts, stops, and monitors resources of the class IBM.Application and IBM.ServiceIP
IBM.TestRM          Controls resources of class IBM.Test

RSCT – the "cluster software"
- HATS (High Availability Topology Services): provides a scalable heartbeat for adapter (network) and node failure detection.
- HAGS (High Availability Group Services): a distributed node and process coordination, messaging, and synchronization service.
- RMC (Resource Monitoring and Control): the backbone of RSCT.
  - It uses the Resource Managers to map RMC's resource and resource class abstraction to actual calls and commands that control the end resources.
  - It provides global access for configuring, monitoring, and controlling subsystems and resources throughout the cluster (also known as a peer domain for "HA" environments).
  - It handles authorization, granting or denying access to resources based on ACL files. It does not handle authentication, which is determining the identity of a peer or subcomponent.
- Configuration Resource Manager (IBM.ConfigRM): used in cluster definition (to create and administer a peer domain) and also for quorum support.

Everything is a "Resource" in the TSAMP and RSCT world
There are different kinds of resources, and that is where we introduce the concept of a resource "class". There are different Resource Managers, each responsible for managing or controlling resources that belong to a particular set of resource classes. The following diagram shows the mapping of three key Resource Managers to some Resource Classes they manage and then to some example Resources:

Tivoli System Automation – the "automation software"
- Recovery Resource Manager (IBM.RecoveryRM): the decision engine for IBM Tivoli System Automation. It consists of:
  - A Resource Manager for resource groups, equivalencies, managed resources, and managed relationships.
  - An engine part made up of the logic deck and the binder. The logic deck is responsible for sending requests (start, stop) to resources to satisfy the policy requirements. The binder is used to bind a resource to a node (select a constituent of a floating resource).
- Global Resource Manager (IBM.GblResRM): supports two resource classes:
  - IBM.Application defines the behavior of general application resources. It can be used to start, stop, and monitor processes. It is a generic, very flexible class that can monitor and control various kinds of resources; most of the applications that you automate will use this class.
  - IBM.ServiceIP defines the behavior of Internet Protocol (IP) address resources. It allows IP addresses to be assigned to a network interface adapter and to "float" between nodes.

Consider the servers that make up the cluster: they are also resources, of the class IBM.PeerNode. The domain itself is a resource, of class IBM.PeerDomain. The network interfaces are resources, of class IBM.NetworkInterface. The following diagram introduces another Resource Manager (part of RSCT) called IBM.ConfigRM, and it shows two Resource Classes it manages and the Resources modelled by those classes:

Quorum and TieBreaker
One of the questions from the 'db2haicu' utility deals with a cluster/automation concept called Quorum.
Quorum
- The number of nodes in a cluster that are required to control the resources, modify the cluster definition, or perform certain cluster operations.
- The main goals of quorum operations:
  - identify who has the majority when a cluster is broken up into sub-clusters
  - keep data consistent, especially when shared file systems are being used
  - protect critical resources and maintain HA control
- Two types: configuration quorum and operational quorum.
- Note: "configuration" quorum requires a majority of nodes (more than half the number of nodes) to be Online for configuration changes to be carried out.
TieBreaker
- A TieBreaker situation occurs when a cluster with an even number of nodes is split into sub-clusters with equal numbers of nodes.
- The TieBreaker determines which sub-cluster will have operational quorum in a tie situation.

db2haicu: Quorum and TieBreaker (continued …)
Network TieBreaker
- The goal is for each system to figure out (via the RSCT infrastructure) which one is operational and should therefore take control (if not already the active node).
- Use a pingable system independent of node1 and node2, for example node3 in our example. It would be just as easy and viable to use the gateway router for node1 and node2.
- Without an active TieBreaker, automated failover/takeover will NEVER occur in the event of a cluster split (node outage or network problem).
- Create a /usr/sbin/cluster/netmon.cf file on each node. Add the IP addresses (one per line) of 3-5 devices external to the domain that are pingable from each node. This is important for a 2-node cluster because it allows the cluster software (RSCT) to quickly identify the source of a heartbeat problem between the nodes.
The following 7 slides demonstrate how a TieBreaker works.
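The netmon.cf step can be sketched as a small shell function. This is only an illustration: the function name is made up for this example, the three addresses are documentation placeholders rather than values from this deck, and the demo writes to a scratch file instead of /usr/sbin/cluster/netmon.cf (which requires root on a real node).

```shell
#!/bin/sh
# Write a netmon.cf-style file: one pingable external IP address per line.
# Usage: write_netmon_cf <target-file> <ip> [<ip> ...]
write_netmon_cf() {
    target=$1
    shift
    # The slide suggests 3-5 external devices; this sketch enforces that range.
    if [ "$#" -lt 3 ] || [ "$#" -gt 5 ]; then
        echo "expected 3 to 5 IP addresses, got $#" >&2
        return 1
    fi
    : > "$target" || return 1          # create or truncate the file
    for ip in "$@"; do
        printf '%s\n' "$ip" >> "$target"
    done
}

# Demo against a scratch file; the addresses are hypothetical placeholders.
# On a cluster node the target would be /usr/sbin/cluster/netmon.cf.
demo_file=${TMPDIR:-/tmp}/netmon.cf.example
write_netmon_cf "$demo_file" 192.0.2.1 192.0.2.2 192.0.2.3
```

The same file content is needed on each node, so the function would be run once per node with addresses that are pingable from that node.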

Base Tie-Breaker Functionality
[Figure sequence: node1 and node2, each connected through an eth adapter to a Gateway Router.]
Node Failure Scenario:
1. System node1 fails.
2. System node2 gets quorum using the network tiebreaker.
Network Adapter Failure Scenario:
1. Network problem affecting node1.
2. Again node2 gets quorum using the network tiebreaker.

Network Tiebreak Assumption
Network Tiebreaker Scenarios:
1. System node1 fails.
   1a. System node2 gets quorum using the network tiebreaker.
2. Network problem affecting node1.
   2a. Again node2 gets quorum using the network tiebreaker.
   2b. System node1 is forced to reboot.
Network Tiebreak Assumption: If node1 can communicate (ping) with the gateway and node2 can communicate (ping) with the gateway, THEN node1 must be able to communicate (heartbeat) with node2.
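The quorum decision in these scenarios can be modelled in a few lines of shell. This is a simplified sketch of the rule as the slides describe it (majority wins; an exact tie is decided by the tiebreaker), not RSCT's actual implementation, and the function name and arguments are invented for the example.

```shell
#!/bin/sh
# Decide whether a sub-cluster holds operational quorum.
# Usage: has_operational_quorum <nodes_defined> <nodes_in_subcluster> <tiebreaker_reachable: yes|no>
has_operational_quorum() {
    total=$1
    visible=$2
    tiebreaker=$3
    if [ $((visible * 2)) -gt "$total" ]; then
        echo yes    # strict majority of the defined nodes
    elif [ $((visible * 2)) -eq "$total" ] && [ "$tiebreaker" = yes ]; then
        echo yes    # exact tie, won by reaching the tiebreaker
    else
        echo no     # minority, or a tie without the tiebreaker
    fi
}
```

For the two-node cluster in the slides, `has_operational_quorum 2 1 yes` models node2 after node1 has failed or lost its network, while `has_operational_quorum 2 1 no` models the isolated node1 that cannot ping the gateway.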

Mapping DB2 HADR components to TSAMP resources
[Figure: resources across Node1 and Node2.
- Resource Group db2_db2inst1_db2inst1_HADRDB-rg contains the floating resource db2_db2inst1_db2inst1_HADRDB-rs (HADR) and the floating resource db2ip_10_20_30_42-rs (Virtual IP).
- Resource Group db2_db2inst1_node1-rg contains db2_db2inst1_node1-rs; Resource Group db2_db2inst1_node2-rg contains db2_db2inst1_node2-rs.
- Equivalency db2_public_network_0 contains eth0 on each node (public network).
- Equivalency db2_private_network_0 contains eth1 on each node (private network) and is marked Optional.
- Three dependsOn relationships are drawn between the resources and the network equivalencies.]

Mapping DB2 Components to TSAMP Resources
- The DB2 instance called "db2inst1" on the server called "node1" maps to a TSAMP managed resource called "db2_db2inst1_node1_0-rs".
- The DB2 instance called "db2inst1" on the server called "node2" maps to a TSAMP managed resource called "db2_db2inst1_node2_0-rs".
- The DB2 HADR database called "HADRDB", whose primary and standby instances are both named "db2inst1", maps to a TSAMP managed resource called "db2_db2inst1_db2inst1_HADRDB-rs".
- The virtual IP address (optional) maps to a TSAMP managed resource called "db2ip_XX_XX_XX_XX-rs", where XX.XX.XX.XX is the virtual IP address.
- A public network can be defined, and this maps to a TSAMP resource (Equivalency) called "db2_public_network_0".
Note: There is no need to define a private network (Equivalency). TSAMP does not manage anything related to the private network and there are no dependencies on it, so just say "no" to db2haicu. You can still have an actual private network for HADR replication; it's totally independent of TSAMP.
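The naming patterns on this slide can be captured in three small helper functions. The function names are invented for this sketch, and reading the trailing "_0" as a partition number that defaults to 0 is an assumption based on the "db2inst1 0" arguments passed to the automation scripts elsewhere in this deck.

```shell
#!/bin/sh
# Build TSAMP resource names following the patterns shown on the slide.

# Usage: instance_resource_name <instance> <node> [<partition, assumed default 0>]
instance_resource_name() {
    printf 'db2_%s_%s_%s-rs\n' "$1" "$2" "${3:-0}"
}

# Usage: hadr_resource_name <primary_instance> <standby_instance> <database>
hadr_resource_name() {
    printf 'db2_%s_%s_%s-rs\n' "$1" "$2" "$3"
}

# Usage: virtual_ip_resource_name <dotted IPv4 address>
virtual_ip_resource_name() {
    # Dots in the address become underscores in the resource name.
    printf 'db2ip_%s-rs\n' "$(printf '%s' "$1" | tr . _)"
}
```

Helpers like these are handy when scripting lsrsrc or lsrg queries, because the resource name can be derived from the instance, node, and database names instead of being typed by hand.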

IBM.Application class – Example is of a DB2 Instance Resource
# lsrsrc -s "Name = 'db2_db2inst1_node1_0-rs'" -Ab IBM.Application
    Name = "db2_db2inst1_node1_0-rs"
    StartCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_start.ksh db2inst1 0"
    StopCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_stop.ksh db2inst1 0"
    MonitorCommand = "/usr/sbin/rsct/sapolicies/db2/db2V10_monitor.ksh db2inst1 0"
    MonitorCommandPeriod = 10
    MonitorCommandTimeout = 120
    StartCommandTimeout = 330
    StopCommandTimeout = 140
    UserName = "root"
    RunCommandsSync = 1
    ProtectionMode = 1
    ActivePeerDomain = hadr_domain
    NodeNameList = {"node1"}
    OpState = 1

IBM.Application class – Example is of a DB2 HADR Resource
# lsrsrc -s "Name = 'db2_db2inst1_db2inst1_HADRDB-rs'" -Ab IBM.Application
    Name = "db2_db2inst1_db2inst1_HADRDB-rs"
    StartCommand = "/usr/sbin/rsct/sapolicies/db2/hadrV10_start.ksh db2inst1 db2inst1 HADRDB"
    StopCommand = "/usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh db2inst1 db2inst1 HADRDB"
    MonitorCommand = "/usr/sbin/rsct/sapolicies/db2/hadrV10_monitor.ksh db2inst1 db2inst1 HADRDB"
    MonitorCommandPeriod = 21
    MonitorCommandTimeout = 29
    StartCommandTimeout = 330
    StopCommandTimeout = 140
    UserName = "root"
    RunCommandsSync = 1
    ProtectionMode = 1
    ActivePeerDomain = hadr_domain
    NodeNameList = {"node1","node2"}
    OpState = 1

IBM.ServiceIP class – Virtual IP addresses
# lsrsrc -Ab IBM.ServiceIP
    Name = "db2ip_10_20_30_42-rs"
    IPAddress = " "
    NetMask = " "
    ProtectionMode = 1
    ActivePeerDomain = "hadr_dom"
    NodeNameList = {"node1","node2"}
    OpState = 1

Using 'db2haicu' to Automate HADR Failover
Step 1. Run the following command as root on each node to configure the RSCT ACLs (security) and allow cluster communication between the servers:
    preprpnode node1 node2
Step 2. Log on to the standby server as the instance owner and issue:
    db2haicu
The db2haicu tool determines the current instance and applies all cluster configuration steps based on it. It also activates all databases for the instance as it attempts to gather information. Next, db2haicu determines whether a domain has already been created by searching for an "Online" domain. If it doesn't find one, you will see the following:

db2haicu: Create a new domain
Create a new domain with two nodes:

db2haicu: List the new domain
At this point there would be a domain called "hadr_domain" in an online state:
    lsrpdomain
    Name         OpState  RSCTActiveVersion  MixedVersions  TSPort
    hadr_domain  Online                      No
You can also list the states of the individual nodes and see output similar to the following, from either server:
    lsrpnode
    Name   OpState  RSCTVersion
    node1  Online
    node2  Online
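When this check is scripted, the OpState column can be tested instead of read by eye. The function below is a sketch with an invented name; it only assumes that lsrpdomain prints the domain name in the first column and the state in the second, as in the listing on this slide.

```shell
#!/bin/sh
# Read `lsrpdomain` output on stdin; succeed (exit 0) only if the named
# domain appears with "Online" in the OpState column.
# Usage: lsrpdomain | domain_is_online <domain_name>
domain_is_online() {
    awk -v name="$1" '
        $1 == name && $2 == "Online" { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}
```

A typical use would be `lsrpdomain | domain_is_online hadr_domain && echo "domain is up"`, run on either server.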

db2haicu: Quorum and TieBreaker
The next db2haicu question deals with the creation of a Network TieBreaker:
At this point you could list the TieBreaker resources and see the new network TieBreaker:
    lsrsrc -Ab IBM.TieBreaker
The following command should show that your new network TieBreaker is currently active:
    lsrsrc -c IBM.PeerNode OpQuorumTieBreaker

db2haicu: Network Equivalencies
Special TSAMP groups called Equivalencies are created containing the network interfaces found on each of the servers in the cluster. This allows TSAMP to be notified of NIC failures by the RSCT subsystem (which harvested the NICs) and react accordingly. We use db2haicu to create an equivalency called db2_public_network_0 and populate it with the en0 NIC from the server called "node1":

db2haicu: Network Equivalencies (continued …)
Next we add the en0 NIC from the other server, "node2", to the same equivalency (db2_public_network_0):

db2haicu: Private Network
In the previous slide, notice the option to say "Yes" or "No" when adding NICs to a network. When asked whether a non-public NIC should be added to a private network, I recommend you choose "No". Don't create a private network equivalency via db2haicu, even if your DB2 HADR environment does use a private network for HADR replication data.

db2haicu: Private Network (continued …)
If your DB2 environment uses LDAP for authentication and you have multiple NICs per server (e.g. a private network), then disable the RSCT cluster heartbeat for all NICs not in the public network:
1. Identify the Communication Group that contains the non-public NICs:
    lsrsrc -Ab IBM.NetworkInterface Name IPAddress CommGroup HeartbeatActive NodeNameList
2. Change HeartbeatActive to 0 to disable heartbeating for a CommGroup:
    chrsrc -s "CommGroup=='CG2'" IBM.NetworkInterface HeartbeatActive=0
See the following technote for more details on cluster heartbeat settings and Communication Groups:

db2haicu: Additional NICs per server
- For non-LDAP setups, it's recommended to have at least a 2nd pair of NICs for cluster heartbeating, so as to reduce the likelihood of a forced reboot if there is a problem in the public network (or with the public NICs).
- For DB2 v9.7 environments, additional "dependsOn" relationships need to be manually added to the automation policy, from each HADR database resource to the public network equivalency.
- If db2haicu from v , v , or v is used to create the automation policy, all the necessary dependsOn relationships will be missing due to a bug in the db2haicu utility (fixed as of ). They will need to be manually created.
- Refer to the following knowledge item to obtain a script that can be used to create any missing relationship in either a DB2 v9.7 or v10 environment: checker-for-db2-hadr-and-ha-sh/
- The dependsOn relationships between the HADR database resources and the Public Network Equivalency are recommended even if there is only one NIC per server, for both DB2 v9.7 and v10 environments.

db2haicu: Listing the Equivalency
The network equivalency(ies) would be created at this point and can be listed as follows:
    lsequ -Ab
    Displaying Equivalency information:
    All Attributes
    Equivalency 1:
        Name = db2_public_network_0
        MemberClass = IBM.NetworkInterface
        Resource:Node[Membership] = {en0:node1,en0:node2}
        SelectString = ""
        ActivePeerDomain = hadr_domain
        Resource:Node[ValidSelectResources] = {en0:node1,en0:node2}

db2haicu: Adding the database partition to the Automation Policy
The final part of running db2haicu on the standby server is setting the CLUSTER_MGR variable to "TSA" and then adding resources that represent the DB2 instance on the server where you're running db2haicu:

Note in the previous screenshot that you won't be able to validate and automate the HADR database via db2haicu from the standby server. This is why the next part involves running db2haicu a 2nd time, but from the current HADR primary server. At this point we can view a few more changes to the automation policy and the database manager's configuration:
    db2 get dbm cfg | grep -i cluster
    Cluster manager = TSA
    lsrg
    Resource Group names:
    db2_db2inst1_node2_0-rg

db2haicu: DB2 Standby Instance Resources
    lsrg -g db2_db2inst1_node2_0-rg
    Displaying Member Resource information:
    For Resource Group "db2_db2inst1_node2_0-rg".
    Resource Group 1:
        Name = db2_db2inst1_node2_0-rg
        MemberLocation = Collocated
        Priority = 0
        AllowedNode = db2_db2inst1_node2_0-rg_group-equ
        NominalState = Online
        ActivePeerDomain = hadr_domain
        OpState = Online
        TopGroup = db2_db2inst1_node2_0-rg
        TopGroupNominalState = Online
Note the AllowedNode attribute, which points to a PeerNode Equivalency that dictates which server this resource is allowed to run on. See the next slide for output that shows the details of this PeerNode Equivalency.

db2haicu: DB2 Standby Instance Dependencies
    lsequ -Ab
    Equivalency 1:
        Name = db2_db2inst1_node2_0-rg_group-equ
        MemberClass = IBM.PeerNode
        Resource:Node[Membership] = {node2:node2.tivlab.raleigh.ibm.com}
        SelectString = ""
        SelectFromPolicy = ANY
        MinimumNecessary = 1
        ActivePeerDomain = hadr_domain
        Resource:Node[ValidSelectResources] = {node2:node2.tivlab.raleigh.ibm.com}
This restricts the DB2 instance resource on the previous slide so that TSAMP can bring it Online only on node2. This is expected, given that it's the resource that represents the standby database partition.

db2haicu: DB2 Standby Instance Dependencies (continued …)
At this point there would be one or two relationships defined in the automation policy, depending on how many network equivalencies you created:
    lsrel -Ab
    Displaying Managed Relationship Information:
    All Attributes
    Managed Relationship 1:
        Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_node2_0-rs
        Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
        Relationship = DependsOn
        Conditional = NoCondition
        Name = db2_db2inst1_node2_0-rs_DependsOn_db2_public_network_0-rel
        ActivePeerDomain = hadr_domain
This shows us that the DB2 instance is dependent on the operational state of the NICs in the public network. If the NIC is Online, then TSAMP will be able to start the associated DB2 instance.

db2haicu: The Automation Policy so far …
Let's look at what resources and groups are listed in the 'lssam' output after completing the execution of 'db2haicu' on the standby server:
    lssam
Here we see that the DB2 instance for server "node2" is defined and within its own resource group. There is a PeerNode equivalency which dictates which server the above instance is allowed to run on. Finally, there is a Network Equivalency which contains the NICs for the public network; the DB2 instance has a dependency relationship on this equivalency.

Using 'db2haicu' to Automate HADR Failover (continued …)
Step 3. Log on to the primary server as the instance owner and issue:
    db2haicu
The db2haicu tool determines the current instance and applies all cluster configuration steps based on it. It also activates all databases for the instance as it attempts to gather information. Next, db2haicu determines whether a domain has already been created by searching for an "Online" domain. Since we've already run db2haicu on the standby server, an Online domain should already exist. You will then be asked to set the cluster manager.

db2haicu sets the CLUSTER_MGR variable to "TSA" within the local database manager's configuration:
    db2 get dbm cfg | grep -i cluster
    Cluster manager = TSA
Please note that once the dbm is configured with "Cluster manager" set to TSA, the DB2 engine expects to have a domain Online. You will have issues stopping and starting the DB2 instance if no domain is Online. Run 'db2haicu -disable' on each DB2 server if you want to break the connection between DB2 and TSAMP. This is the only way to unset "Cluster manager" for DB2 v10.x.
Then db2haicu adds resources that represent the DB2 instance (the primary DB2 instance) on the server where you're currently running db2haicu:

db2haicu: DB2 Primary Instance Resources
At this point we can view a few more changes to the automation policy:
    lsrg
    Resource Group names:
    db2_db2inst1_node1_0-rg
    db2_db2inst1_node2_0-rg
    lsrg -g db2_db2inst1_node1_0-rg
    Displaying Member Resource information:
    For Resource Group "db2_db2inst1_node1_0-rg".
    Resource Group 1:
        Name = db2_db2inst1_node1_0-rg
        MemberLocation = Collocated
        Priority = 0
        AllowedNode = db2_db2inst1_node1_0-rg_group-equ
        NominalState = Online
        ActivePeerDomain = hadr_domain
        OpState = Online
        TopGroup = db2_db2inst1_node1_0-rg
        TopGroupNominalState = Online
Note the AllowedNode attribute, which points to a PeerNode Equivalency that dictates which server this resource is allowed to run on. It is similar to the other DB2 instance resource group but with a different server name.

db2haicu: Dependencies for the DB2 Instances
Now there would be additional relationships defined in the automation policy:
    lsrel -Ab
    Displaying Managed Relationship Information:
    All Attributes
    Managed Relationship 1:
        Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_node2_0-rs
        Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
        Relationship = DependsOn
        Conditional = NoCondition
        Name = db2_db2inst1_node2_0-rs_DependsOn_db2_public_network_0-rel
        ActivePeerDomain = hadr_domain
    Managed Relationship 2:
        Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_node1_0-rs
        Name = db2_db2inst1_node1_0-rs_DependsOn_db2_public_network_0-rel
This shows us that the DB2 instances are both dependent on the operational state of the NICs in the public network. If the NICs are Online, then TSAMP will be able to start the associated DB2 instances. It also means that if either NIC goes offline for any reason, the local DB2 instance will be stopped by TSAMP.

db2haicu: The Automation Policy so far …
Let's take another look at what resources and groups are listed in the 'lssam' output after 'db2haicu' has added both standby and primary database partitions:
    lssam
There's now a resource and group for the DB2 instance on server "node1". There's also another PeerNode equivalency, which forces the "db2_db2inst1_node1_0-rg" partition to run on "node1" only.

db2haicu: Adding the HADR database to the Automation Policy
Validating and automating HADR failover can only be done from the current primary server and only after successfully running db2haicu on the standby server. You may also want to add a virtual IP address for this HADR database.

db2haicu: HADR Database Resources
# lsrg -g db2_db2inst1_db2inst1_HADRDB-rg
    Displaying Resource Group information:
    For Resource Group "db2_db2inst1_db2inst1_HADRDB-rg".
    Resource Group 1:
        Name = db2_db2inst1_db2inst1_HADRDB-rg
        MemberLocation = Collocated
        Priority = 0
        AllowedNode = db2_db2inst1_db2inst1_HADRDB-rg_group-equ
        NominalState = Online
        ActivePeerDomain = hadr_domain
        OpState = Online
        TopGroup = db2_db2inst1_db2inst1_HADRDB-rg
        TopGroupNominalState = Online
Note the AllowedNode attribute. It points to a PeerNode Equivalency that contains the servers "node1" and "node2" and so dictates which servers the HADR database can reside on. This is just like the setup for the two DB2 instance resource groups, which also use the AllowedNode attribute with other PeerNode Equivalencies, though in this case the HADR resource is a floating resource with two servers as its choices.

db2haicu: HADR Database Resources
# lsrg -m -g db2_db2inst1_db2inst1_HADRDB-rg
    Displaying Member Resource information:
    For Resource Group "db2_db2inst1_db2inst1_HADRDB-rg".
    Member Resource 1:
        Class:Resource:Node[ManagedResource] = IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
        Mandatory = True
        MemberOf = db2_db2inst1_db2inst1_HADRDB-rg
        SelectFromPolicy = ORDERED
        ActivePeerDomain = hadr_domain
        OpState = Online
    Member Resource 2:
        Class:Resource:Node[ManagedResource] = IBM.ServiceIP:db2ip_9_42_153_137-rs

db2haicu: Dependency for the HADR DB Resource
Now there would be an additional relationship defined in the automation policy:
    lsrel -Ab
    Displaying Managed Relationship Information:
    All Attributes
    [...]
    Managed Relationship 3:
        Class:Resource:Node[Source] = IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
        Class:Resource:Node[Target] = {IBM.Equivalency:db2_public_network_0}
        Relationship = DependsOn
        Conditional = NoCondition
        Name = db2_db2inst1_db2inst1_HADRDB-rs_DependsOn_db2_public_network_0-rel
        ActivePeerDomain = hadr_domain
This shows us that the HADRDB resource is dependent on the operational state of the NICs in the public network. If the NICs are Online, then TSAMP will be able to bring the associated HADR db resource online. It also means that if either NIC goes offline for any reason, the constituent of the HADR db resource local to the offline NIC will be taken offline by TSAMP (if it is currently online), which could trigger a failover.

db2haicu: The Complete Automation Policy …
Let's look at what resources and groups are listed in the 'lssam' output after completing the execution of 'db2haicu' on both servers:
    lssam
The resources and group for the HADR database and virtual IP address have been added, as has a new PeerNode Equivalency containing servers "node1" and "node2".

Explaining "Nominal" state for a Resource Group
Now that the DB2 instances are managed by TSAMP, the "Nominal" (desired) state of each instance's Resource Group needs to be set to "Online" to be able to use the db2start and db2stop commands. The following is example syntax:
    # chrg -o online <Resource_Group>
Changing the Nominal state of a resource group instructs TSAMP to start or stop the member resources using the scripts defined in the "StartCommand" and "StopCommand" attributes of a resource of class "IBM.Application", as is the case for a DB2 instance resource.
To change the desired state of multiple resource groups that have similarities in their names, for example to start in parallel all DB2 resource groups where the instance name on each server starts with "db2inst", use the following syntax:
    # chrg -o online -s "Name like 'db2_db2inst_%'"
Another example, to take the HADR resource group offline (which will also remove any currently assigned IP alias (Virtual IP)):
    # chrg -o offline db2_db2inst1_db2inst1_HADRDB-rg
Note: After taking just the HADR resource group offline, the HADR pair will remain in a peer connected state even though it is shown as Offline on both servers when viewed using 'lssam'!
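The pattern-based form of chrg is easy to get wrong because of the nested quotes, so a small wrapper can help. The following is a sketch only: the function name and the DRY_RUN variable are inventions for this example, and by default it prints the chrg command rather than running it.

```shell
#!/bin/sh
# Build (and optionally run) a chrg request for every resource group whose
# name matches a pattern. DRY_RUN defaults to 1, so the command is only printed.
# Usage: set_nominal_state <online|offline> <name pattern>
set_nominal_state() {
    state=$1
    pattern=$2      # e.g. db2_db2inst1_%
    case $state in
        online|offline) ;;
        *) echo "state must be online or offline" >&2; return 1 ;;
    esac
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Show exactly what would be executed, including the nested quotes.
        printf "chrg -o %s -s \"Name like '%s'\"\n" "$state" "$pattern"
    else
        chrg -o "$state" -s "Name like '$pattern'"
    fi
}
```

Running `set_nominal_state online 'db2_db2inst1_%'` prints the command for review; setting DRY_RUN=0 as root on a cluster node would execute it.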

Domain Offline & Nominal States set to Offline ...
1. Start the domain:
    # startrpdomain hadr_domain
2. Start the primary and standby instances simultaneously (this assumes both instances are named "db2inst1"):
    # chrg -o online -s "Name like 'db2_db2inst1_%'"
3. Check that both instances reach Online states. Do not proceed until both DB2 instances have come online. Confirm using "lssam -top". Also run "db2_ps" as the DB2 instance owner on each node.
4. The DB2 start scripts used to start the instances also activate the databases, resulting in the HADR pair establishing a peer connected state. Confirm that the HADR pair has reached peer state by running the following on each DB2 node:
    # db2pd -hadr -db hadrdb
5. If the HADR state is not active, then manually bring the HADR pair into peer state as follows:
    a. On the designated standby node:
        # db2 start hadr on db hadrdb as standby
    b. On the designated primary node:
        # db2 start hadr on db hadrdb as primary
Repeat for all HADR databases. Again check the state of the HADR pair before proceeding.

Domain Offline & Nominal States set to Offline (continued..)
As instance owner, ensure that the HADR pair is in “Peer” state (on both nodes) as follows:

    # db2pd -hadr -db hadrdb

You should see output similar to the following (abbreviated) on the primary server:

    Database Partition 0 -- Database HADRDB -- Active -- Up 0 days 00:13:36
    Role State SyncMode HeartBeatsMissed LogGapRunAvg (bytes)
    Primary Peer Sync
    ConnectStatus ConnectTime Timeout
    Connected Tue Jul 8 17:47: ( ) 120

You should see output similar to the following (abbreviated) on the standby server:

    Database Partition 0 -- Database HADRDB -- Active -- Up 0 days 00:12:51
    Standby Peer Sync
    Connected Tue Jul 8 17:48: ( ) 120

Finally, change the HADR resource group to online:

    # chrg -o online db2_db2inst1_db2inst1_HADRDB-rg

This last step causes the virtual IP address (if the policy includes one) to be assigned.

Taking Standby Instance Offline
Because the database is active, the force option is required for the db2stop command:

    db2stop force

DB2 will also request that TSAMP lock the instance group, to prevent TSAMP from trying to restart the instance. The HADR group will also get locked; this ALWAYS happens when the HADR pair is no longer in a Peer state.

    Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Request=Lock Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs Control=SuspendedPropagated
            |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
            '- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs Control=SuspendedPropagated
            |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
            '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
    Pending online IBM.ResourceGroup:db2_db2inst1_node2_0-rg Request=Lock Nominal=Online
        '- Offline IBM.Application:db2_db2inst1_node2_0-rs Control=SuspendedPropagated
            '- Offline IBM.Application:db2_db2inst1_node2_0-rs:node2

To restart the instance:

    db2start

This results in TSAMP executing the ‘db2V10_start.ksh’ script, which is also responsible for activating the HADR database. HADR re-integration then takes place and peer state results.
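
When checking many resource groups, the locked ones can be picked out of saved lssam output by looking for the Request=Lock marker shown above. The following Python sketch assumes the lssam layout shown in this module; the function name locked_groups and the shortened sample text are illustrative only.

```python
import re

# Shortened sample in the layout shown on this slide (illustrative only).
LSSAM_SAMPLE = """\
Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Request=Lock Nominal=Online
    |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs Control=SuspendedPropagated
Pending online IBM.ResourceGroup:db2_db2inst1_node2_0-rg Request=Lock Nominal=Online
    '- Offline IBM.Application:db2_db2inst1_node2_0-rs Control=SuspendedPropagated
"""

# A top-level group line: "<state> IBM.ResourceGroup:<name> <attributes...>"
GROUP_LINE = re.compile(
    r"^(?P<state>[A-Za-z ]+?)\s+IBM\.ResourceGroup:(?P<name>\S+)(?P<attrs>.*)$"
)


def locked_groups(lssam_text):
    """Map each locked resource group name to its displayed operational state."""
    locked = {}
    for line in lssam_text.splitlines():
        match = GROUP_LINE.match(line.strip())
        if match and "Request=Lock" in match.group("attrs").split():
            locked[match.group("name")] = match.group("state")
    return locked


for name, state in locked_groups(LSSAM_SAMPLE).items():
    print(name, "is locked, state:", state)
```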

Taking Primary Instance Offline
Because the database is active, the force option is required for the db2stop command:

    db2stop force

DB2 will also request that TSAMP lock the instance group, to prevent TSAMP from trying to restart the instance. The HADR group will also get locked, since the HADR pair is no longer in Peer state.

    Pending online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Request=Lock Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs Control=SuspendedPropagated
            |- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
            '- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs Control=SuspendedPropagated
            |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
            '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
    Pending online IBM.ResourceGroup:db2_db2inst1_node1_0-rg Request=Lock Nominal=Online
        '- Offline IBM.Application:db2_db2inst1_node1_0-rs Control=SuspendedPropagated
            '- Offline IBM.Application:db2_db2inst1_node1_0-rs:node1

To restart the primary instance:

    db2start

This results in TSAMP executing the ‘db2V10_start.ksh’ script, which should activate the database. The ‘hadrV10_start.ksh’ script is then executed and peer state should be re-established.

Performing a Manual Takeover (Controlled Failover)
In versions of DB2 prior to v9.5, an operator performed a controlled failover by moving the HADR resource group using the TSAMP command ‘rgreq -o move <HADR_group>’. Because the DB2 instances are cluster aware in v9.5+, you can use the native DB2 takeover command (issued as instance owner on the current standby server):

    db2 takeover hadr on database HADRDB

The HADR resource group will be locked and unlocked several times, and a move request will appear at some point. Assuming the takeover is successful, ‘lssam’ will show the online/offline states swapped for the HADR resource and ServiceIP:

    Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
            |- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
            '- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs
            |- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
            '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node2

Use the following command to check that the HADR role has swapped between the nodes and that the HADR pair has reached peer state again:

    db2pd -hadr -db <hadr_db_name>

Using a “move” request to perform a controlled failover
If you attempt to move the HADR resource group using the following command:

    # rgreq -o move db2_db2inst1_db2inst1_HADRDB-rg

… a takeover “by force peer window only” will result. This isn't necessarily a bad thing, but there is a small element of risk because there is no two-way handshake as there is when a regular (not forced) takeover is performed. Some clients prefer the “move” request because it results in a faster failover for them, and there is less TSAMP activity since there is no need to lock and unlock the HADR groups multiple times.

Resetting a resource set to “Failed offline”
If a resource fails to start, it will be set to “Failed offline”. This is not an indication of a problem with TSAMP; it's a problem with the underlying DB2 component, so it must be diagnosed from the perspective of the DB2 product. Here's an example showing that the HADR database resource failed to be started as primary on “node2”:

    Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
            |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
            '- Failed offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs
            |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
            '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2

To reset the Failed Offline state, use the following TSAMP command:

    resetrsrc -s "Name = 'db2_db2inst1_db2inst1_HADRDB-rs' & NodeNameList={'node2'}" IBM.Application

This causes the hadrV10_stop.ksh script to be executed on node2 and, if successful (return code 0), the Operational State will change to “Offline”.
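
The selection string for resetrsrc is easy to mistype, so it can be generated from saved lssam output. The following Python sketch assumes the lssam layout shown above and only prints the command text for review; it does not execute anything, and the function name reset_commands is hypothetical.

```python
import re

# A node-level line: "Failed offline <class>:<resource name>:<node>"
FAILED_CONSTITUENT = re.compile(
    r"Failed offline\s+(?P<cls>IBM\.\w+):(?P<name>[^:\s]+):(?P<node>\S+)"
)


def reset_commands(lssam_text):
    """Build one resetrsrc command string per 'Failed offline' node-level resource."""
    commands = []
    for match in FAILED_CONSTITUENT.finditer(lssam_text):
        select = "Name = '{}' & NodeNameList={{'{}'}}".format(
            match.group("name"), match.group("node")
        )
        commands.append('resetrsrc -s "{}" {}'.format(select, match.group("cls")))
    return commands


sample = """\
|- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs
    |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
    '- Failed offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
"""

for command in reset_commands(sample):
    print(command)
```

Diagnose the underlying DB2 problem first; resetting the state without fixing the cause is likely to end in “Failed offline” again.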

A couple of things to note ...
Usually you need to stop resources prior to stopping the domain. However, you can force stop a domain, which leaves all DB2 resources running wherever they were last running, including the virtual IP alias:

    # chrsrc -c IBM.PeerNode CritRsrcProtMethod=5
    # stoprpdomain -f <domain_name>

Bring the domain back online (start it first with startrpdomain, as shown earlier) and re-enable critical resource protection as follows:

    # chrsrc -c IBM.PeerNode CritRsrcProtMethod=3

Before rebooting a server for whatever reason, you should “Offline” the node:

    # stoprpnode <node_name>

Even if the HADR resource group’s Nominal state is set to Offline, starting both instances should result in the HADR pair reaching a peer connected state, since the start script for the instances also activates the databases. However, while the HADR resource group is set to Offline, the Virtual IP address is truly offline (not assigned to any NIC), so there is no client access to the database AND no automated failover actions are possible.

Failure Scenarios

The various failover scenarios supported by this solution are detailed in section 6 of a whitepaper called “Automated Cluster Controlled HADR (High Availability Disaster Recovery) Configuration Setup using the IBM DB2 High Availability Instance Configuration Utility (db2haicu)”. This whitepaper can be downloaded via the following URL:

The following scenarios result in automated actions, including failovers/takeovers:

    Standby Instance Failure
    Primary Instance Failure
    Standby NIC Failures (public network)
    Primary NIC Failures (public network)
    Standby Node Failure
    Primary Node Failure

Disable/Re-enable HA/Automated Failover (using db2haicu)
To prevent TSAMP from taking any action on DB2 resources, disable HA:

    db2haicu -disable

The local database manager’s configuration will be updated so that “Cluster manager” is unset. ‘db2haicu -disable’ also needs to be executed on the other server so that its instance configuration is also updated. With “Cluster manager” unset, you can take the entire domain offline without affecting the manual operation of the DB2 instances.

Disable/Re-enable HA/Automated Failover (Continued …)
As part of the -disable process, DB2 will request that TSAMP lock all Resource Groups, to prevent TSAMP from taking any action against DB2 resources:

    Online IBM.ResourceGroup:db2_db2inst1_db2inst1_HADRDB-rg Request=Lock Nominal=Online
        |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs Control=SuspendedPropagated
            |- Online IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node1
            '- Offline IBM.Application:db2_db2inst1_db2inst1_HADRDB-rs:node2
        '- Online IBM.ServiceIP:db2ip_10_20_30_42-rs Control=SuspendedPropagated
            |- Online IBM.ServiceIP:db2ip_10_20_30_42-rs:node1
            '- Offline IBM.ServiceIP:db2ip_10_20_30_42-rs:node2
    Online IBM.ResourceGroup:db2_db2inst1_node1_0-rg Request=Lock Nominal=Online
        '- Online IBM.Application:db2_db2inst1_node1_0-rs Control=SuspendedPropagated
            '- Online IBM.Application:db2_db2inst1_node1_0-rs:node1
    Online IBM.ResourceGroup:db2_db2inst1_node2_0-rg Request=Lock Nominal=Online
        '- Online IBM.Application:db2_db2inst1_node2_0-rs Control=SuspendedPropagated
            '- Online IBM.Application:db2_db2inst1_node2_0-rs:node2

To re-enable, run ‘db2haicu’ as instance owner on each server, select “1” (Yes) when asked if you want to enable high availability, and then choose “TSA”.

Alternative for preventing TSAMP starting and stopping DB2 Resources
The quickest way of preventing TSAMP from stopping/starting the resources is to change TSAMP to manual mode (Automation = Manual):

    # samctrl -M T

The only thing TSAMP continues to do is monitor the resources by executing the monitoring scripts associated with each resource. Check the current automation mode with the following command:

    # lssamctrl

To re-enable automation mode (Automation = Auto):

    # samctrl -M F

Although changing the Nominal (desired) state of a resource group to “offline” will trigger TSAMP to stop its resources, this does not mean automation is stopped. TSAMP will attempt to maintain the offline state, so if any resource is manually started, TSAMP will stop it again.

Note that you will not be able to perform a takeover while TSAMP is in Manual mode.
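
Because a takeover is not possible in Manual mode, a script can check the mode before attempting one. The following Python sketch reads the “Automation = …” line from saved lssamctrl output. The two-line sample layout is an assumption based on the “Automation = Manual” / “Automation = Auto” wording above, and the function names are hypothetical.

```python
import re


def automation_mode(lssamctrl_text):
    """Return the value of the 'Automation = ...' line, or None if it is absent."""
    match = re.search(
        r"^\s*Automation\s*=\s*(\w+)\s*$", lssamctrl_text, re.MULTILINE
    )
    return match.group(1) if match else None


def takeover_possible(lssamctrl_text):
    """A takeover is only expected to work while TSAMP is in Auto mode."""
    return automation_mode(lssamctrl_text) == "Auto"


# Assumed, abbreviated layout of the lssamctrl output (illustrative only).
sample = "SAMControl:\n        Automation = Manual\n"

print(automation_mode(sample))
print(takeover_possible(sample))
```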

Serviceability - CLI commands
Use the TSAMP command “lssam” as previously demonstrated:

    # lssam -top
    # lssam -g <resource_group>

An alternative is the following TSAMP command:

    # lsrg -m

Serviceability – logs

Three main areas of logging:

    Logging from the DB2 automation scripts (i.e. start/stop/monitor scripts): “logger” statements in the policy scripts are written to syslog (e.g. /var/log/messages on Linux systems).

    Logging of TSAMP / RSCT core processes (e.g. quorum, monitor command timeouts): written to syslog (Linux/AIX/Solaris) and errpt (AIX).
        Daemon log file directory: /var/ct/<DOMAIN>/log/mc/IBM.<DAEMON>RM, where <DAEMON> = Recovery, GblRes, …
        These are circular logs and cannot be opened directly with an editor! Format them first:
        rpttr -o dtic <log file dir>/trace_summary > my_trace.out

    DB2’s log file, “db2diag.log”, with DIAGLEVEL 3 or higher.

Use TSAMP Level 2 Support’s ‘getsadata’ script to collect data:

Serviceability – syslog messages from DB2 automation scripts
The following syslog message indicates the DB2 instance is Online (return code = 1):

    <timestamp> node1 user:info db2V10_monitor.ksh[524352]: Returning 1 (db2inst1, 0)

The following syslog message indicates the DB2 instance is Offline (return code = 2):

    <timestamp> node1 user:info db2V10_monitor.ksh[524352]: Returning 2 (db2inst1, 0)

The DB2 instance monitors repeat approximately every 10 seconds on each server if you’re using a default automation policy.

The following syslog message indicates that the HADR resource is considered Online (return code = 1) and has the Primary role:

    <timestamp> node1 user:info hadrV10_monitor.ksh[6963]: Returning 1 : db2inst1 db2inst1 HADRDB

This message is seen only on the node that is currently the primary node, and repeats approximately every 21 seconds.

The following syslog message indicates that the HADR resource is considered Offline (return code = 2) and is most likely in a Standby state (a normal/OK state):

    <timestamp> node2 user:info hadrV10_monitor.ksh[69632]: Returning 2 : db2inst1 db2inst1 HADRDB
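
When scanning a long syslog, the monitor messages above can be classified mechanically. The following Python sketch matches the two monitor script names and maps return codes 1 and 2 to Online and Offline as described on this slide; the function name classify_monitor_line is hypothetical, and any other return code is reported as-is rather than interpreted.

```python
import re

# Matches e.g. "db2V10_monitor.ksh[524352]: Returning 1" or
# "hadrV10_monitor.ksh[6963]: Returning 2".
MONITOR_LINE = re.compile(
    r"(?P<script>(?:db2|hadr)V10_monitor\.ksh)\[\d+\]: Returning (?P<rc>\d+)"
)

RC_MEANING = {1: "Online", 2: "Offline"}


def classify_monitor_line(line):
    """Return (resource kind, state) for a monitor message, or None otherwise."""
    match = MONITOR_LINE.search(line)
    if not match:
        return None
    rc = int(match.group("rc"))
    kind = "instance" if match.group("script").startswith("db2") else "HADR"
    return kind, RC_MEANING.get(rc, "other (rc=%d)" % rc)


lines = [
    "<timestamp> node1 user:info db2V10_monitor.ksh[524352]: Returning 1 (db2inst1, 0)",
    "<timestamp> node2 user:info hadrV10_monitor.ksh[69632]: Returning 2 : db2inst1 db2inst1 HADRDB",
]
for line in lines:
    print(classify_monitor_line(line))
```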

Serviceability – syslog messages from DB2 automation scripts (continued)

The following syslog messages occur when TSAMP starts a DB2 instance:

    <timestamp> node1 user:notice db2V10_start.ksh[856142]: Entered db2V10_start.ksh, db2inst1, 0
    <timestamp> node1 user:debug db2V10_start.ksh[856146]: Able to cd to /home/db2inst1/sqllib : db2V10_start.ksh, db2inst1, 0
    <timestamp> node1 user:debug db2V10_start.ksh[262214]: 1 partitions total: db2V10_start.ksh, db2inst1, 0
    <timestamp> node1 user:notice db2V10_start.ksh[393252]: Returning 0 from db2V10_start.ksh ( db2inst1, 0)

If db2start was used to start the instance, the message below would be seen instead of the “1 partitions total” message shown above:

    <timestamp> node1 user:info db2V10_start.ksh[856150]: db2V10_start.ksh is already up...

The following syslog messages are typical of the HADR resource group being brought online:

    <timestamp> node1 user:info hadrV10_monitor.ksh[6963]: Returning 1 : db2inst1 db2inst1 HADRDB
    <timestamp> node1 user:debug root[524540]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
    <timestamp> node1 user:notice hadrV10_start.ksh[422078]: Entering : db2inst1 db2inst1 HADRDB
    <timestamp> node1 user:debug hadrV10_start.ksh[422086]: su - db2inst1 -c db2gcf -t u -i db2inst1 -i db2inst1 -h HADRDB -L
    <timestamp> node1 user:debug root[524290]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0
    <timestamp> node1 user:notice hadrV10_start.ksh[422090]: Returning 0 : db2inst1 db2inst1 HADRDB

Note: the ‘hadrV10_start.ksh’ script doesn’t actually bring the HADR database pair into a Peer state. It's likely to already be in a Peer state beforehand, because the databases are activated as part of starting the DB2 instances.

Serviceability – syslog messages from DB2 automation scripts (continued)

The following syslog messages occur when TSAMP stops a DB2 instance. This includes resetting a Failed Offline state for a DB2 instance resource:

    <timestamp> node1 user:notice db2V10_stop.ksh[856142]: Entered db2V10_stop.ksh, db2inst1, 0
    <timestamp> node1 user:notice db2V10_stop.ksh[393252]: Returning 0 from db2V10_stop.ksh ( db2inst1, 0)

The following syslog messages are typical of the HADR resource being stopped on one node so that a manual takeover can occur to the other node. It's also what you would see when resetting a Failed offline state for a HADR resource:

    <timestamp> node1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602322]: Entering : db2inst1 db2inst1 HADRDB
    <timestamp> node1 user:debug /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602330]: su - db2inst1 -c db2gcf -t d -i db2inst1 -i db2inst1 -h HADRDB -L
    <timestamp> node1 user:notice /usr/sbin/rsct/sapolicies/db2/hadrV10_stop.ksh[602334]: Returning 0 : db2inst1 db2inst1 HADRDB

Note: the ‘hadrV10_stop.ksh’ script doesn’t actually stop the HADR functionality within DB2. It doesn’t affect Peer state.

Serviceability – syslog messages from DB2 automation scripts (continued)

The following syslog messages show the HADR resource group lock/unlock state:

    <timestamp> node1 user:debug root[327754]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock
    <timestamp> node1 user:debug root[327780]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock: 1

and

    <timestamp> node1 user:debug root[856206]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
    <timestamp> node1 user:debug root[856212]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0

The HADR resource group is locked whenever Peer state is lost. The DB2 software uses a TSAMP API to request the lock. The “lockreqprocessed” script is used to check the lock and unlock states. When the HADR pair is back in Peer state, the HADR resource group is unlocked, again at the request of DB2.

The DB2 Instance resource groups also get locked if the db2stop command is used to stop an instance, and are unlocked when db2start is used to start it again.
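
To see which lock or unlock request was issued most recently for each resource group, the “Entering lockreqprocessed” messages can be replayed in order. The following Python sketch assumes the message layout shown above; the function name last_lock_request is hypothetical, and the sketch deliberately does not interpret the numeric values on the “Exiting” lines.

```python
import re

# Matches "Entering lockreqprocessed <resource group> lock|unlock"
ENTERING = re.compile(
    r"Entering lockreqprocessed (?P<group>\S+) (?P<op>lock|unlock)\b"
)


def last_lock_request(syslog_lines):
    """Map each resource group to the most recent request seen ('lock' or 'unlock')."""
    latest = {}
    for line in syslog_lines:
        match = ENTERING.search(line)
        if match:
            latest[match.group("group")] = match.group("op")
    return latest


log = [
    "<timestamp> node1 user:debug root[327754]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock",
    "<timestamp> node1 user:debug root[327780]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock: 1",
    "<timestamp> node1 user:debug root[856206]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock",
]
print(last_lock_request(log))
```

A group whose latest request is still “lock” long after a stop or takeover is a hint that Peer state has not been re-established.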

Serviceability – syslog messages from DB2 automation scripts (continued)

A manual (no force option) “takeover” (db2 takeover hadr on db HADRDB) would result in the following messages on the original primary server:

    <timestamp> node1 user:notice hadrV10_stop.ksh[405566]: Entering : db2inst1 db2inst1 HADRDB
    <timestamp> node1 user:debug hadrV10_stop.ksh[405574]: su - db2inst1 -c db2gcf -t d -i db2inst1 -i db2inst1 -h HADRDB -L
    <timestamp> node1 user:notice hadrV10_stop.ksh[405578]: Returning 0 : db2inst1 db2inst1 HADRDB

Assuming the above hadrV10_stop.ksh script completes with a 0 return code, a sequence of messages similar to the following would be seen on the original standby server:

    <timestamp> node2 user:debug root[487538]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock
    <timestamp> node2 user:debug root[487564]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg lock: 1
    <timestamp> node2 user:debug root[487566]: Entering lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock
    <timestamp> node2 user:debug root[487572]: Exiting lockreqprocessed db2_db2inst1_db2inst1_HADRDB-rg unlock: 0
    <timestamp> node2 user:notice hadrV10_start.ksh[548876]: Entering : db2inst1 db2inst1 HADRDB
    <timestamp> node2 user:debug hadrV10_start.ksh[548884]: su - db2inst1 -c db2gcf -t u -i db2inst1 -i db2inst1 -h HADRDB -L
    <timestamp> node2 user:notice hadrV10_start.ksh[548888]: Returning 0 : db2inst1 db2inst1 HADRDB
    <timestamp> node2 user:debug hadrV10_monitor.ksh[696436]: Returning 1 : db2inst1 db2inst1 HADRDB

Note the return code of 0 from “hadrV10_start.ksh”, meaning a successful takeover. Any other return code would be considered unsuccessful and would need to be diagnosed from a DB2 perspective.

Serviceability – syslog messages from TSAMP/RSCT
The following set of messages would indicate a cluster communication problem (domain split).

First, the state of the domain changes to PENDING_QUORUM on each node:

    CONFIGRM_PENDINGQUORUM_ER The operational quorum state of the active peer domain has changed to PENDING_QUORUM.

The Automation Engine (RecoveryRM) on each node reports that the other node has left the domain:

    RECOVERYRM_INFO_4_ST A member has left. Node number = 1

The network TieBreaker is tested, and rc=0 indicates a successful poll of the network TieBreaker:

    samtb_net[ ]: op=reserve ip= rc=0 log=1 count=2

If the TieBreaker poll is successful, the node regains QUORUM:

    CONFIGRM_HASQUORUM_ST The operational quorum state of the active peer domain has changed to HAS_QUORUM.
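
The domain-split sequence described above can be recognised by scanning syslog for the four message types in order. The following Python sketch reduces a list of syslog lines to a short timeline. The event labels and the function name quorum_timeline are hypothetical, and treating any rc other than 0 as an unsuccessful poll is an assumption, since only rc=0 is documented here.

```python
import re


def quorum_timeline(syslog_lines):
    """Reduce syslog lines to the ordered quorum-related events they contain."""
    events = []
    for line in syslog_lines:
        if "CONFIGRM_PENDINGQUORUM_ER" in line:
            events.append("PENDING_QUORUM")
        elif "RECOVERYRM_INFO_4_ST" in line:
            events.append("MEMBER_LEFT")
        elif "samtb_net" in line and "op=reserve" in line:
            rc = re.search(r"\brc=(\d+)", line)
            # Assumption: only rc=0 counts as a successful tiebreaker poll.
            ok = rc is not None and rc.group(1) == "0"
            events.append("TIEBREAKER_OK" if ok else "TIEBREAKER_NOT_OK")
        elif "CONFIGRM_HASQUORUM_ST" in line:
            events.append("HAS_QUORUM")
    return events


log = [
    "CONFIGRM_PENDINGQUORUM_ER The operational quorum state of the active peer domain has changed to PENDING_QUORUM.",
    "RECOVERYRM_INFO_4_ST A member has left. Node number = 1",
    "samtb_net[ ]: op=reserve ip= rc=0 log=1 count=2",
    "CONFIGRM_HASQUORUM_ST The operational quorum state of the active peer domain has changed to HAS_QUORUM.",
]
print(quorum_timeline(log))
```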

Serviceability – syslog messages from TSAMP/RSCT
The following messages are expected when TSAMP is assigning and removing ServiceIP (Virtual IP address) resources:

    <timestamp> <node_name> daemon:notice GblResRM[ ]: … :::GBLRESRM_IPONLINE IBM.ServiceIP assigned address on device. IBM.ServiceIP en0
    <timestamp> <node_name> daemon:notice GblResRM[618532]: … :::GBLRESRM_IPOFFLINE IBM.ServiceIP removed address. IBM.ServiceIP

Release of the TieBreaker and removal of the TieBreaker reserve block when a node has rejoined the domain:

    <timestamp> <node_name> daemon:info samtb_net[790758]: op=release ip= rc=0 log=1 count=2
    <timestamp> <node_name> daemon:info samtb_net[925932]: remove reserve block /var/ct/samtb_net_blockreserve_

A MonitorCommand for a resource of class IBM.Application reached its defined timeout:

    <timestamp> <node_name> GblResRM[24275]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,Application.C,1.2.1,2434 :::GBLRESRM_MONITOR_TIMEOUT IBM.Application monitor command timed out. Resource name <resource_name>

Similar TIMEOUT messages exist for StartCommand and StopCommand scripts.
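
To find out which resources are hitting monitor timeouts, and how often, the GBLRESRM_MONITOR_TIMEOUT messages can be counted per resource name. The following Python sketch assumes the message ends with “Resource name <resource_name>” as shown above; the function name monitor_timeouts is hypothetical, and the resource name in the sample is borrowed from the lssam examples purely for illustration.

```python
import re
from collections import Counter

# Captures the resource name that follows the timeout message text.
TIMEOUT = re.compile(r"GBLRESRM_MONITOR_TIMEOUT\b.*?Resource name (\S+)")


def monitor_timeouts(syslog_lines):
    """Count monitor command timeouts per resource name."""
    hits = Counter()
    for line in syslog_lines:
        match = TIMEOUT.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


timeout_line = (
    ":::GBLRESRM_MONITOR_TIMEOUT IBM.Application monitor command timed out. "
    "Resource name db2_db2inst1_node1_0-rs"
)
log = [timeout_line, "some unrelated message", timeout_line]

for resource, count in monitor_timeouts(log).items():
    print(resource, count)
```

A resource that appears repeatedly is a candidate for a larger MonitorCommandTimeout value.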

Serviceability – Example of the TSAMP trace summary files
First format the trace_summary file(s):

    rpttr -o dtic <log file dir>/trace_summary > my_trace_summary.txt

IBM.RecoveryRM traces (on the “master” node only) show all ‘online/offline order’ statements, Binder messages and exceptions:

    16:10: T(229390) _RCD Offline Request against db2_db2inst1_db2inst1_HADRDB-rs on node node2
    16:10: T(229390) _RCD Offline request injected: db2_db2inst1_db2inst1_HADRDB-rg /ResGroup/IBM.ResourceGroup
    16:10: T(229390) _RCD Online request injected: db2_db2inst1_db2inst1_HADRDB-rg /ResGroup/IBM.ResourceGroup
    16:10: T(229390) _RCD RIBME-Hist for <NULL>: BINDER: Bind db2_db2inst1_db2inst1_HADRDB-rg /ResGroup/IBM.ResourceGroup

IBM.GblResRM traces (from each individual node) show all start/stop command executions and service IP online/offline events:

    13:56: T(16386) _GBD Monitor reports: Network device "en0:0" (IP address ) flagged UP. Bringing resource "db2ip_10_20_30_42-rs" (handle 0x6029 0xffff 0x6887df04 0x3589a7c9 0x1005fb59 0xcddd3e10) online.
    13:57: T(163851) _GBD Resource "db2ip_10_20_30_42-rs" (handle 0x6029 0xffff 0x6887df04 0x3589a7c9 0x1005fb59 0xcddd3e10): IP address has been successfully taken offline on network interface "en0:0"

Serviceability – Operational State (OpState) values

UNKNOWN (0)
    Generally a problematic state; it really shouldn’t be deliberately used in the automation scripts.

ONLINE (1)

OFFLINE (2)
    Offline, and can be started on this node if needed.

FAILED_OFFLINE (3)
    Offline, and this node is not a candidate for starting the resource.
    If MonitorCommand returns FAILED_OFFLINE, availability can change as soon as MonitorCommand returns something different, such as Offline (return code 2).
    If the status is set to FAILED_OFFLINE because StartCommand did not succeed within RetryCount, manual intervention is needed to fix the underlying resource and reset (resetrsrc) the resource.

STUCK_ONLINE (4)
    Manual intervention is needed to stop the underlying resource.

PENDING_ONLINE (5)
    No action is taken in this state; the resource should eventually become online, or the start attempt will time out.

PENDING_OFFLINE (6)
    No action is taken in this state; the resource should eventually become offline, or the stop attempt will time out.
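
The numeric codes above can be kept in one place when writing tooling around monitor return codes. The following Python sketch encodes the values listed on this slide as an enumeration. The helper needs_manual_intervention is hypothetical and only encodes the two cases called out above: STUCK_ONLINE, and FAILED_OFFLINE when it was set by a failed StartCommand rather than by MonitorCommand.

```python
from enum import IntEnum


class OpState(IntEnum):
    """Operational state codes as listed on this slide."""
    UNKNOWN = 0
    ONLINE = 1
    OFFLINE = 2
    FAILED_OFFLINE = 3
    STUCK_ONLINE = 4
    PENDING_ONLINE = 5
    PENDING_OFFLINE = 6


def needs_manual_intervention(state, set_by_failed_start=False):
    """True when an operator has to act before automation can continue."""
    if state == OpState.STUCK_ONLINE:
        return True
    # FAILED_OFFLINE reported by MonitorCommand can clear on its own;
    # FAILED_OFFLINE set after StartCommand retries needs resetrsrc.
    return state == OpState.FAILED_OFFLINE and set_by_failed_start


print(OpState(3).name)
print(needs_manual_intervention(OpState.FAILED_OFFLINE, set_by_failed_start=True))
```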

Serviceability (continued)

Check syslog and trace_summary to see if TSAMP is issuing start/stop orders or commands:

    If yes, the problem is most likely in the DB2 automation scripts or core DB2 components.
    If no, the problem is most likely in the cluster/automation software, requiring TSAMP Level 2 involvement.

If Operational State = UNKNOWN (OpState=0):

    Check syslog and trace_summary for GBLRESRM_MONITOR_TIMEOUT.
    Fix: Increase the MonitorCommandTimeout value:

    chrsrc -s "Name = '<resource_name>'" IBM.Application MonitorCommandTimeout=<new value>
    lsrsrc -s "Name = '<resource_name>'" IBM.Application Name MonitorCommandTimeout

Questions/Comments?
