CSM support for Blue Gene/P CSM 1.7.0 Line item 0XR Skills Transfer Materials by Marty Fullam


1 CSM support for Blue Gene/P CSM 1.7.0 Line item 0XR Skills Transfer Materials by Marty Fullam

2 What's a Blue Gene?
- It's IBM's flagship supercomputer offering. Its official name is the "IBM System Blue Gene Solution".
- The Service Node is the administration focal point of a Blue Gene. Among other things, it maintains a DB2 database of configuration, RAS, and environmental data. The Blue Gene system administrator hangs out here.
- The Front End Nodes are used for compiling and submitting Blue Gene jobs. End users hang out here.
- The File Servers serve files to the other systems.
- The I/O and Compute Nodes (aka the Blue Gene core) run the user jobs.
- All systems are POWER systems. The Service Node, Front End Nodes, and File Servers run SLES. The I/O and Compute Nodes run custom operating systems.
[Diagram: a Blue Gene/L or Blue Gene/P system showing the Front End Nodes, Service Node (with DB2), File Servers, two 1 Gigabit Ethernet links, I/O Nodes (1024), and Compute Nodes (65,536).]

3 CSM / Blue Gene Topology 1: Blue Gene with CSM to monitor the Blue Gene DB
- Just install a CSM management server on your Blue Gene Service Node, and then add the CSM Blue Gene support. Define no CSM nodes.
- Notice that there is no CSM cluster per se, just a management server. And even though the I/O and Compute Nodes of the Blue Gene core are not managed by CSM, you will still be able to monitor them.
- This topology is new this release! We call it Stand-alone CSM Blue Gene monitoring support.
[Diagram: Blue Gene core (I/O and Compute Nodes), Blue Gene Service Node acting as the CSM management server, Blue Gene Front End Nodes, Blue Gene File Servers.]

4 CSM / Blue Gene Topology 2: Blue Gene with CSM to monitor the Blue Gene DB, and to manage the Service Node, Front End Nodes, and File Servers
- Just add CSM to the Blue Gene systems you have. Pick any system to be the management server (though it's probably most typical to use the Service Node).
- Define your Service Node, Front End Nodes, and File Servers as CSM nodes.
- Notice that the I/O and Compute Nodes of the Blue Gene core are not managed by CSM (mainly because they are not general-purpose Linux systems, and don't need to be burdened with CSM and RSCT software). However, as you will see, you will still be able to monitor them.
- We call this topology Full CSM plus Blue Gene monitoring support.
[Diagram: a CSM Cluster containing the Blue Gene Service Node (CSM management server and CSM managed node), the Blue Gene Front End Nodes (CSM managed nodes), and the Blue Gene File Servers (CSM managed nodes), alongside the Blue Gene core (I/O and Compute Nodes).]

5 CSM / Blue Gene Topology 3: Blue Gene as part of a larger CSM cluster
- Here, the management server is a system outside of the Blue Gene solution.
- And while the Blue Gene Service Node, Front End Nodes, and File Servers are configured as managed nodes in the CSM cluster, they are not the only managed nodes; there can be lots of others completely unrelated to the Blue Gene.
- This topology, like Topology 2, is Full CSM plus Blue Gene monitoring support.
[Diagram: a CSM Cluster containing the CSM management server, several other CSM managed nodes, and the IBM eServer Blue Gene Solution, which consists of the Blue Gene Service Node (CSM managed node), Blue Gene Front End Nodes (CSM managed nodes), Blue Gene File Servers (CSM managed nodes), and the Blue Gene core (I/O and Compute Nodes).]

6 CSM support for Blue Gene
- If you use Full CSM plus Blue Gene monitoring support (Topology 2 or 3 in previous charts), use existing CSM and RSCT function to help manage the Blue Gene Service Node, Front End Nodes, and File Servers. After all, they're just SLES POWER systems. Nothing new here. Use any or all CSM function.
- For Stand-alone CSM Blue Gene monitoring support (Topology 1) or Full CSM plus Blue Gene monitoring support (Topology 2 or 3), also use the optional rpm, csm.bluegene, which gives the system administrator the ability to monitor, effectively, the Blue Gene core using standard CSM monitoring capabilities (ERRM conditions and responses). Actually, what we provide is the ability to monitor the Blue Gene DB2 database where the Service Node is continually writing RAS, configuration, and environmental data about the Blue Gene core.

7 csm.bluegene package
- An optional part of CSM (only customers with a Blue Gene would care about it!)
- Used on the CSM management server (AIX, or Linux i386 or ppc64), and in the Full CSM plus Blue Gene monitoring support case on the Blue Gene Service Node / CSM managed node (SLES ppc64) too, so it is present in the csm-aix-1.7.x.x and csm-linux-1.7.x.x tarballs and on the CDs.
- If you want to use it, manual install is required on the management server (installp or geninstall on AIX, rpm -i on Linux), followed by these additional setup steps...
- Stand-alone CSM Blue Gene monitoring support case:
  - Run bgsetupmon on the management server.
- Full CSM plus Blue Gene monitoring support case:
  - On a non-SLES ppc64 management server, use the copycsmpkgs -n service_node command to copy the CSM SLES ppc64 packages from the CSM for Linux ppc64 CD (or from the expanded tarball) to the /csminstall directory. (This is not necessary on a SLES ppc64 management server because installms will have already copied the packages to the /csminstall directory.)
  - The Blue Gene Service Node must be configured as a CSM managed node (whether or not it is also the CSM management server), and it must have the autoupdate package installed.
  - The IBM.ManagedNode "Properties" attribute of the Blue Gene Service Node must include "BlueGeneNodeType|:|ServiceNode".
  - Then you must run bgsetupms on the management server, followed by installnode -n service_node or updatenode -n service_node. (During this install or update, SMS installs csm.bluegene on the Service Node.)
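
The ordering of the setup steps above can be summarized in a small Python sketch. It only assembles the command strings named on the slide and does not run anything. The function name, its parameters, and the rule "use updatenode when the Service Node is already installed" are assumptions made for illustration; the slide itself simply says "installnode or updatenode".

```python
def setup_commands(case, service_node="service_node",
                   ms_is_sles_ppc64=True, node_already_installed=False):
    """Return the ordered commands to run on the management server.

    case is "standalone" or "full" (hypothetical labels for the two
    support cases described on the slide).
    """
    if case == "standalone":
        # Stand-alone CSM Blue Gene monitoring support: one command.
        return ["bgsetupmon"]
    if case == "full":
        steps = []
        if not ms_is_sles_ppc64:
            # Only needed when installms has not already copied the
            # SLES ppc64 packages to /csminstall.
            steps.append(f"copycsmpkgs -n {service_node}")
        steps.append("bgsetupms")
        # Assumption: updatenode for a node that is already installed,
        # installnode otherwise.
        verb = "updatenode" if node_already_installed else "installnode"
        steps.append(f"{verb} -n {service_node}")
        return steps
    raise ValueError(f"unknown case: {case!r}")


for cmd in setup_commands("full", "sn01", ms_is_sles_ppc64=False):
    print(cmd)
```

The prerequisites that are not commands (the Service Node defined as a managed node with autoupdate installed, and the "BlueGeneNodeType|:|ServiceNode" property) still have to be in place before the Full-case commands are run.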

8 What's in the csm.bluegene package? (1 of 2)
It contains management server-specific files and Service Node-specific files (even though all files get installed on both types of systems).
Management server files:
- /opt/csm/bin/bgsetupmon - the end-user command used to set up Stand-alone CSM Blue Gene monitoring support on the management server.
- /opt/csm/bin/bgsetupms - the end-user command used to set up Full CSM plus Blue Gene monitoring support on the management server.
- /opt/csm/install/resources/bluegene.ms/IBM.Nodegroup/BlueGeneServiceNodes.pm et al - a set of predefined nodegroups created when bgsetupms calls mkresources.
- /opt/csm/install/resources/bluegene.ms/IBM.Condition/*l - a set of predefined ERRM conditions created when bgsetupmon or bgsetupms calls mkresources.
- /opt/csm/csmbin/bgsetupsn - a post-install customization script that sets up Blue Gene support on the Service Node (in the Full CSM plus Blue Gene monitoring support case only). It runs on the Service Node (via a mount of the management server's /csminstall directory) when installnode -n service_node or updatenode -n service_node is run. It gets called by csmfirstboot or updatenode.client, respectively. (Note: /opt/csm/csmbin is the installed location, but bgsetupms copies the script to /csminstall/csm/scripts and then creates a couple of symbolic links named 500CSM_bgsetupsn.BlueGeneServiceNodes in /csminstall/csm/scripts/update and /csminstall/csm/scripts/installpostreboot; it is one of these symbolic links that is used.)

9 What's in the csm.bluegene package? (2 of 2)
Service Node files:
- /opt/csm/bin/bgmksensor - the end-user command used to create Blue Gene-specific sensors to monitor the Service Node's DB2 database for events of interest. (Keep in mind, though, that we do ship a number of predefined Blue Gene sensors, and they may be sufficient for all the monitoring the user cares to do. So this command is not necessarily used.)
- /opt/csm/install/resources/bluegene.sn/IBM.Sensor/* - a set of predefined sensors (created when bgsetupmon or bgsetupsn calls mkresources).
- /opt/csm/csmbin/bgmanage_trigger - an internally used command called by Blue Gene sensors to create or drop DB2 triggers and sequences as necessary.
- /opt/csm/csmbin/bgrun_dbcmds - an internally used command called by bgmksensor and bgmanage_trigger to run db2 commands.
- /opt/csm/lib/bgrefresh_sensor.so - an internally used shared library called by the DB2 stored procedure that bgrun_dbcmds creates. It uses RMC's runact-api to call Blue Gene sensors' SetValues() routine.
- /opt/csm/pm/BlueGeneUtils.pm - a set of utilities used by the various scripts.

10 bgsetupmon and bgsetupms
- /opt/csm/bin/bgsetupmon is run on the CSM management server in the Stand-alone CSM Blue Gene monitoring support case. It must be run as part of the procedure to install CSM’s Blue Gene support. It must also be run when updating CSM to a new level. It has no significant flags or options.
- /opt/csm/bin/bgsetupms is run on the CSM management server in the Full CSM plus Blue Gene monitoring support case. It must be run as part of the procedure to install CSM’s Blue Gene support. It must also be run when updating CSM to a new level. It has no significant flags or options.

11 bgmksensor
- /opt/csm/bin/bgmksensor is run on the Blue Gene Service Node, if used at all. It is used to create custom IBM.Sensor resources used in monitoring the Blue Gene database. It is a higher level command than SensorRM’s mksensor command. A comparison of their usage statements highlights how different the commands are:

    mksensor [-n host] [-i seconds] [-c n] [-e 0 | 1 | 2] [-u user-ID] [-h] [-v | -V] sensor_name ["]sensor_command["]

    bgmksensor -t table -o {d | i | u} [-w column[,...]] [-x "event_expression"] [-p column[,...]] [-T table] [-O {d | i | u}] [-W column[,...]] [-X "rearm_expression"] [-P column[,...]] [-h] [-v | -V] sensor_name

- Think of bgmksensor as a wrapper to mksensor; both define an IBM.Sensor resource, but bgmksensor does so much more. In fact, bgmksensor hard-codes most sensor options and is more concerned with providing options related to the Blue Gene DB2 tables, operations, columns, and values that you want to monitor.
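
To make the bgmksensor usage statement a little more concrete, here is a minimal Python sketch that assembles an argument list from the -t, -o, -x, -T, -O, and -X options shown above. It builds the list only and does not invoke the command. The helper name, the sensor name MyNodeErr, and the expression text are hypothetical; the slides do not show the exact syntax that -x and -X accept, so the sample expression merely echoes the trigger description used in the monitoring flow charts.

```python
VALID_OPERATIONS = ("d", "i", "u")  # the {d | i | u} choices in the usage statement


def bgmksensor_argv(sensor_name, table, operation, event_expr=None,
                    rearm_table=None, rearm_operation=None, rearm_expr=None):
    """Build an argument list for bgmksensor (illustrative helper only)."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"operation must be one of {VALID_OPERATIONS}")
    argv = ["bgmksensor", "-t", table, "-o", operation]
    if event_expr is not None:
        argv += ["-x", event_expr]
    if rearm_table is not None:
        argv += ["-T", rearm_table]
    if rearm_operation is not None:
        if rearm_operation not in VALID_OPERATIONS:
            raise ValueError(f"rearm operation must be one of {VALID_OPERATIONS}")
        argv += ["-O", rearm_operation]
    if rearm_expr is not None:
        argv += ["-X", rearm_expr]
    argv.append(sensor_name)  # the sensor name comes last, as in the usage statement
    return argv


# Hypothetical sensor watching for rows with STATUS = 'M' in TBGLNode.
print(bgmksensor_argv("MyNodeErr", "TBGLNode", "i", event_expr="STATUS = 'M'"))
```

Passing the result as a list (for example to subprocess.run) would avoid shell quoting problems with the expression, which is why the sketch returns a list rather than a single string.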

12 Monitoring Overview
- The Blue Gene Service Node routinely writes to its DB2 database all types of RAS, configuration, and environmental data related to the Blue Gene core (the I/O and Compute Nodes, the midplanes, the various interconnects, power supplies, fans, etc.). And this happens whether or not CSM is in the picture.
- The CSM support for Blue Gene gives you a way to ‘watch’ the database for inserts, updates, and deletes that you deem important, and generate RMC events for them.
- The resulting RMC events will drive the ERRM responses you specify.
- The charts that follow show a monitoring flow example. Step through them to see what’s involved, and what happens when...

13 Monitoring Flow Example (1 of 7)
[Diagram: Blue Gene core, Blue Gene software, DB2 database, Service Node, Management Server.]

14 Monitoring Flow Example (2 of 7)
Recording of Blue Gene core events in DB2 occurs continually, and occurs whether or not CSM is present.
[Diagram: (1) a node error in the Blue Gene core reaches the Blue Gene software, which (2) inserts a row into the DB2 database. Also shown: Service Node, Management Server.]

15 Monitoring Flow Example (3 of 7)
When CSM and csm.bluegene are installed, there are various predefined Sensors, Conditions, Responses, and commands available.
[Diagram: Blue Gene core, Blue Gene software, DB2 database, Service Node, Management Server, plus the BGNodeErr Sensor, the BGNodeErr Condition (upon BGNodeErr Sensor change, is SD.Uint32 > 0?), the “root anytime” Response, bgmanage_trigger, and the bgrefresh_sensor.so shared library.]

16 Monitoring Flow Example (4 of 7)
When you start monitoring a Blue Gene-related Condition with startcondresp, a number of things are created in DB2 (via bgmanage_trigger, the Command specified in the Sensor): a Trigger or two, a Sequence, and a couple of Procedures.
[Diagram: the components of the previous chart, plus the newly created DB2 objects: the BGNodeErrCSMe DB2 Trigger (upon new row in TBGLNode, is STATUS = ‘M’?), the BGNodeErr_CSM DB2 Sequence, the BGP_COMMON DB2 Procedure, and the BGP_COMMON_EXT DB2 Procedure.]

17 Monitoring Flow Example (5 of 7)
When the Blue Gene software updates a table in the database, DB2 evaluates the Triggers associated with that table.
[Diagram: (1) node error, (2) insert into the DB2 database, (3) DB2 evaluates the BGNodeErrCSMe DB2 Trigger. Also shown: BGNodeErr Sensor, BGNodeErr Condition, BGNodeErr_CSM DB2 Sequence, BGP_COMMON and BGP_COMMON_EXT DB2 Procedures, bgrefresh_sensor.so shared library, “root anytime” Response.]

18 Monitoring Flow Example (6 of 7)
In this case, the BGNodeErrCSMe Trigger (new row in TBGLNode and STATUS = ‘M’) evaluates ‘true’ and it does its thing: get the next sequence number and call BGP_COMMON. And eventually SetValues() is called to write the new sequence number into the BGNodeErr Sensor’s SD.Uint32.
[Diagram: (4) the BGNodeErrCSMe DB2 Trigger gets the next value from the BGNodeErr_CSM DB2 Sequence; (5), (6), (7) a chain of calls through the BGP_COMMON DB2 Procedure, the BGP_COMMON_EXT DB2 Procedure, and the bgrefresh_sensor.so shared library; (8) SetValues() on the BGNodeErr Sensor. Also shown: BGNodeErr Condition, “root anytime” Response.]

19 Monitoring Flow Example (7 of 7)
At this point, it’s business as usual for RMC. Since BGNodeErr Sensor’s SD.Uint32 > 0, BGNodeErr Condition is satisfied and the Response occurs.
[Diagram: (9) the BGNodeErr Condition is evaluated against the BGNodeErr Sensor; (10) the “root anytime” Response is run. Also shown: the DB2 Trigger, Sequence, and Procedures and the bgrefresh_sensor.so shared library from the previous charts.]
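
The ten numbered steps of the flow example can be mimicked with a toy Python model. This is a sketch only: it replaces DB2, the stored procedures, and RMC with plain Python objects, and the class names and the LOCATION column are hypothetical. What it keeps from the charts is the order of events: a row insert, trigger evaluation, a new sequence number, SetValues() on the sensor, and a Condition that tests SD.Uint32 > 0 before running the Response.

```python
from itertools import count


class Sensor:
    """Stand-in for an IBM.Sensor resource; uint32 plays the role of SD.Uint32."""

    def __init__(self, name):
        self.name = name
        self.uint32 = 0

    def set_values(self, value):
        # Steps 5-8: the call chain in the charts ends in SetValues().
        self.uint32 = value


class MonitoredTable:
    """Stand-in for a DB2 table with one event trigger attached."""

    def __init__(self, sensor, trigger_predicate, response):
        self.sensor = sensor
        self.trigger_predicate = trigger_predicate
        self.response = response
        self.sequence = count(1)  # stand-in for the DB2 Sequence

    def insert(self, row):
        # Step 3: the trigger is evaluated for the new row.
        if not self.trigger_predicate(row):
            return
        # Step 4: get the next sequence number and push it to the sensor.
        self.sensor.set_values(next(self.sequence))
        # Step 9: the Condition checks SD.Uint32 > 0.
        if self.sensor.uint32 > 0:
            # Step 10: run the Response.
            self.response(self.sensor.name, row)


fired = []
table = MonitoredTable(
    Sensor("BGNodeErr"),
    trigger_predicate=lambda row: row.get("STATUS") == "M",
    response=lambda name, row: fired.append((name, row["LOCATION"])),
)
table.insert({"LOCATION": "node-a", "STATUS": "M"})   # matches the trigger
table.insert({"LOCATION": "node-b", "STATUS": "OK"})  # ignored by the trigger
print(fired)
```

Because every forwarded event carries a fresh sequence number, the sensor value changes on each event, which is what lets the Condition be re-evaluated each time.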

20 Monitoring Details (1 of 3)
- When you use bgmksensor to define a Blue Gene-related sensor, we temporarily create in the Blue Gene DB2 database the constructs required for monitoring. (By ‘constructs’ we mean the DB2 Triggers, Sequences, and Stored Procedures.) We do this to expose any errors. If we waited until you actually tried to use the defined sensor in a real monitoring situation, it would be harder to expose the errors. If DB2 gags on any of the constructs, bgmksensor reports the error(s) and creates no sensor. Whether successful or not, it deletes all DB2 constructs it created (they were temps, remember?).
- When you use startcondresp to start monitoring a Blue Gene-related condition, the Command stored in the associated sensor gets run. The Command is /opt/csm/csmbin/bgmanage_trigger, and it creates the same Blue Gene DB2 database constructs that bgmksensor had created, but this time they’re not temporary. They stay defined until monitoring is stopped with stopcondresp.
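
The "create temporarily, report errors, always clean up" behavior described for bgmksensor is a familiar validate-then-discard pattern. A minimal Python sketch of that pattern follows; FakeDB and define_sensor are hypothetical stand-ins and are not how bgmksensor is actually implemented.

```python
class FakeDB:
    """Toy database that rejects any construct name listed in `reject`."""

    def __init__(self, reject=()):
        self.reject = set(reject)
        self.constructs = set()

    def create(self, name):
        if name in self.reject:
            raise RuntimeError(f"database rejected construct {name}")
        self.constructs.add(name)

    def drop(self, name):
        self.constructs.discard(name)


def define_sensor(db, sensor_name, construct_names, sensors):
    """Trial-create the constructs; register the sensor only if all succeed.

    Returns (ok, error_message). The trial constructs are dropped in
    every case, mirroring the temporary nature described on the slide.
    """
    created = []
    try:
        for name in construct_names:
            db.create(name)
            created.append(name)
    except RuntimeError as err:
        return False, str(err)  # error reported, no sensor created
    else:
        sensors.add(sensor_name)
        return True, ""
    finally:
        for name in created:
            db.drop(name)  # runs on both the success and the failure path


sensors = set()
db = FakeDB(reject={"BadSensorCSMe"})
print(define_sensor(db, "GoodSensor", ["GoodSensorCSMe", "GoodSensor_CSM"], sensors))
print(define_sensor(db, "BadSensor", ["BadSensor_CSM", "BadSensorCSMe"], sensors))
```

In both calls the toy database ends up with no leftover constructs; only the successful call adds a sensor.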

21 Monitoring Details (2 of 3)
For a given Blue Gene-related sensor, we create the following DB2 constructs:
- A Trigger to watch for bgmksensor’s -x event expression. If the sensor name is BGFanTempHi, the event Trigger name is BGFanTempHiCSMe.
- A Trigger to watch for bgmksensor’s -X rearm expression, if specified. If the sensor name is BGFanTempHi, the rearm Trigger name is BGFanTempHiCSMr.
- A Sequence to give us a unique new number for each event or rearm forwarded from a Trigger to BGP_COMMON. If the sensor name is BGFanTempHi, the Sequence name is BGFanTempHi_CSM.
- Stored Procedures named BGP_COMMON and BGP_COMMON_EXT, if we don’t already have them. (Unlike the Triggers and Sequences, these are not created on a per sensor basis; there are just the two, and they serve all sensors created.)
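
The naming rules above are mechanical, so they can be captured in a few lines of Python. The suffixes CSMe, CSMr, and _CSM come from the slide; the function name and the dictionary keys are illustrative only.

```python
def construct_names(sensor_name, has_rearm=False):
    """Derive the per-sensor DB2 construct names from a sensor name."""
    names = {
        "event_trigger": f"{sensor_name}CSMe",
        "sequence": f"{sensor_name}_CSM",
    }
    if has_rearm:
        # A rearm trigger exists only when a rearm expression was specified.
        names["rearm_trigger"] = f"{sensor_name}CSMr"
    return names


print(construct_names("BGFanTempHi", has_rearm=True))
```

The two stored procedures are deliberately left out of the result because they are shared by all sensors rather than created per sensor.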

22 Monitoring Details (3 of 3)
- To provide support for Blue Gene rearm monitoring, we needed (and got) a new feature in RSCT. Normally, for a Condition that has a rearm expression, RMC ‘toggles’ between evaluating the Condition’s event expression and its rearm expression. And when a Condition corresponds to a single resource, this makes perfect sense. However, in the world of Blue Gene monitoring, a Condition corresponds to a set of resources. So we can’t have RMC toggling; we must do the toggling down at the DB2 Trigger level because it is there that we’re able to distinguish one eventing or rearming resource from another. The bottom line is that when we’re monitoring a Blue Gene DB2 table for events and rearms, the Condition used must be the non-toggling type.
- Because we assume the event/rearm toggling responsibilities, we introduced a DB2 table named TCSMEvents to keep track of when it’s proper to forward an event up to RMC, and when to forward a rearm. So be aware that we create this table, and that the DB2 Triggers we create manipulate its contents. TCSMEvents has two columns: sensor and origin. The latter uniquely identifies the event/rearm origin. If a sensor is in TCSMEvents, an event was forwarded last; otherwise a rearm was forwarded last, or no event or rearm was observed yet. TCSMEvents is created when the first Blue Gene monitoring is started. It is dropped when the csm.bluegene rpm is removed from the Service Node.
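
One way to picture the per-origin toggling that TCSMEvents enables is the following Python sketch. It is an assumption about the logic, not the actual trigger code: the slide only states that the table has sensor and origin columns and that presence in the table means an event was forwarded last. Here a set of (sensor, origin) pairs stands in for the table, and the origin strings are hypothetical.

```python
class EventToggle:
    """Toy model of TCSMEvents: a set of (sensor, origin) rows."""

    def __init__(self):
        self.rows = set()

    def on_event(self, sensor, origin):
        """Return True if the event should be forwarded up to RMC."""
        key = (sensor, origin)
        if key in self.rows:
            return False      # an event was forwarded last; wait for a rearm
        self.rows.add(key)
        return True

    def on_rearm(self, sensor, origin):
        """Return True if the rearm should be forwarded up to RMC."""
        key = (sensor, origin)
        if key not in self.rows:
            return False      # no outstanding event for this origin
        self.rows.remove(key)
        return True


toggle = EventToggle()
print(toggle.on_event("BGFanTempHi", "fan-1"))  # forwarded
print(toggle.on_event("BGFanTempHi", "fan-1"))  # suppressed until a rearm
print(toggle.on_event("BGFanTempHi", "fan-2"))  # different origin, forwarded
print(toggle.on_rearm("BGFanTempHi", "fan-1"))  # forwarded, clears the row
```

Keeping the state per origin is what a single toggling Condition in RMC could not do, since it would flip for the whole set of resources at once.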

23 Debugging
1. bgsetupmon, bgsetupms and bgmksensor have a -v flag for verbose output.
2. bgrefresh_sensor.so will write some debug info to a file named /var/log/csm/csm_bg.log on the Service Node if you do the following prior to starting monitoring:
   - Create /var/log/csm/csm_bg.log with 666 permissions.
   - Temporarily modify /opt/csm/pm/BlueGeneUtils.pm in CreateTrigger where you see:
       CALL $SP_name('$trigname', CAST(seq_value AS CHAR(10)), $out_col_stuff2, '$debug');
     Change it to:
       CALL $SP_name('$trigname', CAST(seq_value AS CHAR(10)), $out_col_stuff2, '1');
3. Some helpful DB2 commands that the bglsysdb userid can run on the Service Node:
       db2 { stop | start } database manager
       db2 connect to bgdb0
       db2 "select trigname from syscat.triggers where trigname like '%CSM%'"
       db2 "drop trigger bglsysdb.xxxCSMe"
       db2 "drop trigger bglsysdb.xxxCSMr"
       db2 "select seqname from syscat.sequences where seqname like '%CSM%'"
       db2 "drop sequence bglsysdb.xxx_CSM"
       db2 "select procname from syscat.procedures where procname like '%BGP%'"
       db2 "drop procedure bglsysdb.common_bgp"
       db2 "drop procedure bglsysdb.common_bgp_ext"
       db2 disconnect bgdb0

24 What’s changed, what’s new this release?
1. We’ve added Blue Gene/P support!
2. We’ve added Stand-alone CSM Blue Gene monitoring support (and bgsetupmon to set this up).
3. Note that in the case of Stand-alone CSM Blue Gene monitoring support, our predefined Blue Gene-related ERRM Conditions are created on the CSM management server / Service Node, and their Management Scope is set to ‘l’ (for ‘local’). Customers who create their own Blue Gene-related ERRM Conditions in the Stand-alone case must do the same!

25 References
- CSM Blue Gene/P Support - Component Design: /project/design/doc/clusters_6B/csm/bluegene/CSM-BGP-CompDes.pdf
- CSM Planning and Installation Guide: See section “CSM support for Blue Gene”
- CSM Administration Guide: See section “Using CSM with the IBM System Blue Gene Solution”
- CSM Command and Technical Reference: bgsetupmon, bgsetupms and bgmksensor

