
1 Implementing CEC Concurrent Maintenance
Ron Barker, IBM Power Systems Advanced Technical Support
rfbarker@us.ibm.com
© 2009 IBM Corporation

2 Overview
- CEC Concurrent Maintenance (CCM) offers new capabilities in Reliability, Availability and Serviceability (RAS)
- Concurrent add and upgrade functions enable expansion of the processor, memory, and I/O hub subsystems without a system outage
- If prerequisites have been met, repairs can be made to system processors, memory, I/O hubs, and other CEC hardware without a system outage
- Accomplishing CCM requires careful advance planning and meeting all prerequisites
- If desired, customers can continue to schedule maintenance during planned outages

3 Terminology
- Concurrent Maintenance: an add, upgrade, or repair made while the server is running. Some system elements may be unavailable during maintenance, but a re-IPL is NOT required to reintegrate all resources.
- Concurrent Add/Upgrade: adds new hardware components, or exchanges existing ones, while the system is running
- Node: a physical group of processors, memory, and I/O hubs in the system (a 595 processor book, or a 570 CEC drawer or module)
- Node evacuation: frees up processor and memory resources from the target node and replaces them with CPU and memory from other nodes, if available; de-allocates I/O resources so the node can be electrically isolated from the system for concurrent maintenance

4 Terminology
- Concurrent Cold Repair: repairs to components that were electrically isolated from the running system (de-allocated or “garded”) before the current repair action was started, such as repairs following a system shutdown and reboot after a hardware failure. Reintegration following the repair does NOT require a reboot.
- Concurrent Hot Repair: repairs to components that will be electrically isolated from the running system during the repair action. Reintegration following the repair does NOT require a reboot.
- Non-Concurrent Repair: repairs requiring the system to be powered off
- GX Adapter: an I/O hub that connects I/O expansion units to the processors and memory in the system (e.g., RIO-2 and 12X adapters)

5 Planning and Prerequisites
CCM has both hardware and firmware prerequisites:
- Power Systems 595 and 570 only
- Hardware Management Console V7R3.4.0 MH01163_0401 (SP 1) or later
- System firmware EH340_061 and EM340_061 or later; this update has deferred content requiring a re-IPL to activate the enhancements
- Power 570 concurrent node add requires that the system cable be connected in advance (it cannot be added concurrently)
- Adding new GX adapters concurrently requires that sufficient system memory be reserved in advance; the defaults that may need to be increased are:
  - Power Systems 595: 1 additional per node, 2 maximum, if slots are available
  - Power Systems 570: 1 additional maximum, if an empty slot is available

6 Planning and Prerequisites
System configurations should allow for free processors, unused system memory, and redundant I/O paths:
- When a processor node is powered off, all of its resources must be shifted to another node
- Unlicensed Capacity on Demand processors and memory will be used by the system during node evacuation
- System CPU and memory usage can be reduced through dynamic reallocation of running partitions, or by shutting down partitions that are not needed
- Insufficient processor and memory capacity, or a lack of redundant I/O paths, may force the shutdown of some or all logical partitions on the system

7 Planning and Prerequisites
Preparation for concurrent maintenance begins when the system is ordered and configured:
1. Customers decide how much system to buy and how to configure it to take advantage of CCM capability
2. Customers decide whether to use concurrent maintenance techniques or to schedule planned outages for upgrades and repairs

8 Planning Guides for CEC Concurrent Maintenance
Follow these guidelines:
- The system should have enough unused CPU and memory to allow a node to be taken off-line for repair
- All critical I/O resources should be configured with multi-path I/O solutions that allow failover across redundant I/O paths
- Redundant physical or virtual I/O paths must be configured through different nodes and GX adapters
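The capacity guideline above can be sketched as a simple feasibility check: the processors and memory in use on the node being evacuated must fit into the free capacity of the remaining nodes. This is an illustrative model only; the node names, figures, and the `can_evacuate` helper are hypothetical, and the authoritative evaluation is performed by the Display Service Effect utility.

```python
def can_evacuate(nodes, target):
    """Return True if the CPUs and memory (GB) in use on `target` fit
    into the free capacity of the remaining nodes (illustrative model,
    not the actual hypervisor algorithm)."""
    t = nodes[target]
    others = [n for name, n in nodes.items() if name != target]
    free_cpu = sum(n["cpu_total"] - n["cpu_used"] for n in others)
    free_mem = sum(n["mem_total"] - n["mem_used"] for n in others)
    return free_cpu >= t["cpu_used"] and free_mem >= t["mem_used"]

# Hypothetical 4-node system: evacuating node3 needs 6 CPUs and
# 48 GB of combined headroom on the other three nodes.
nodes = {
    "node0": {"cpu_total": 8, "cpu_used": 4, "mem_total": 64, "mem_used": 32},
    "node1": {"cpu_total": 8, "cpu_used": 4, "mem_total": 64, "mem_used": 32},
    "node2": {"cpu_total": 8, "cpu_used": 6, "mem_total": 64, "mem_used": 48},
    "node3": {"cpu_total": 8, "cpu_used": 6, "mem_total": 64, "mem_used": 48},
}
print(can_evacuate(nodes, "node3"))  # → True: 10 free CPUs, 80 GB free
```

If the check fails, the options are the ones the slides list: reduce usage through dynamic reallocation, shut down non-essential partitions, or fall back to a planned outage.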

9 Note the Exposure on Node 3

10 Partition Considerations
- CCM is concurrent from the system point of view, but it may not be completely transparent to logical partitions: a temporary reduction in CPU, memory, and I/O capability could affect performance
- To take full advantage of a concurrent node add or memory add/upgrade, partition profiles should specify higher maximum processor and memory values than exist before the upgrade; the new resources can then be added dynamically after the add or upgrade
- Note: higher partition maximum memory values will increase the system memory set aside for Partition Page Tables
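The page-table note above can be quantified with a rough rule of thumb: the hypervisor reserves a Partition Page Table sized in proportion to the partition's maximum (not current) memory, commonly estimated at about 1/64 of the maximum, rounded up to a power of two. Treat the ratio and the `hpt_reserve_mb` helper as assumptions for illustration, not an official sizing formula; consult IBM planning documentation for exact values.

```python
def hpt_reserve_mb(max_mem_mb, ratio=64):
    """Estimate the memory set aside for one partition's page table.

    Assumption (not an official formula): the table is sized at roughly
    1/`ratio` of the partition's *maximum* memory, rounded up to a
    power of two megabytes.
    """
    raw = max(max_mem_mb // ratio, 1)
    size = 1
    while size < raw:
        size *= 2
    return size

# Under this model, a profile maximum of 32 GB reserves 512 MB,
# while raising the maximum to 256 GB reserves 4 GB:
print(hpt_reserve_mb(32 * 1024))   # → 512
print(hpt_reserve_mb(256 * 1024))  # → 4096
```

The point of the estimate: raising maximum memory in every profile "just in case" has a real cost in reserved system memory, even for partitions that never use the headroom.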

11 Partition Considerations
I/O resource planning:
- To maintain access to data, multi-path I/O solutions must be used (e.g., MPIO, SDDPCM, PowerPath, HDLM)
- Redundant I/O adapters must be located in different I/O expansion units attached to different GX adapters in different nodes
- This applies to both directly attached I/O and virtual I/O provided by dual VIO Servers housed in different nodes

12 Partition Considerations
- Check the system settings for the server: if shutting down all partitions becomes necessary, make sure the system does not power itself off during the repair action, which would prolong the repair
- Leave this box unchecked

13 IBM i Planning Considerations
To allow a hot node repair or memory upgrade to take place with IBM i partitions running, the following PTFs are also required:
- V5R4: MF45678
- V6R1: MF45581
If the PTFs are not activated, the IBM i partitions must be powered off before the CCM operation can proceed.

14 Rules for Concurrent Maintenance Operations
Guidelines for CCM operations:
- Only one operation at a time, from only one HMC; a second CCM operation cannot be started until the first has completed successfully
- All CCM operations except a 570 GX adapter add must be performed by IBM service personnel
- On both the 595 and 570, you must have at least two nodes for hot node repair or hot memory add/upgrade
- You cannot evacuate a 570 node that has an active system clock
- Enable service processor redundancy on a 570 before starting a hot node add, except on a single-node server
- Both service processors on a 595 must be functioning
- The Display Service Effect utility must be run by the system administrator before a hot repair or hot memory add/upgrade
- Ensure that the system is not in energy savings mode before a concurrent node add, memory upgrade, or concurrent node repair

15 Guidelines for All Concurrent Maintenance Operations
- With proper planning and configuration, enterprise-class Power servers are designed for concurrent add/upgrade or repair
- However, changing the hardware configuration or operational state of electronic equipment may cause unforeseen impacts to system status or running applications
- Some highly recommended precautions to consider:
  - Schedule concurrent upgrades or repairs during off-peak operational hours
  - Move business-critical applications to another server using the Live Partition Mobility feature, or stop them
  - Back up critical application and system state information
  - Checkpoint databases

16 Guidelines for All Concurrent Maintenance Operations
Features and capabilities that do not support CCM:
- Systems clustered using RIO-SAN technology (this technology is used only by IBM i customers clustering with switchable towers and virtual OptiConnect technologies)
- Systems clustered using InfiniBand technology (this capability is typically used by High Performance Computing clients with an InfiniBand switch)
- I/O Processors (IOPs) used by IBM i partitions (any i partition with IOPs assigned must have the IOPs powered off, or the partition must be powered off)
- 16 GB memory pages, also known as huge pages, which do not support memory relocation (partitions with 16 GB pages must be powered off to allow CCM)

17 Guidelines for Concurrent Add/Upgrade
For adding or upgrading:
- All serviceable hardware events must be repaired and closed before starting an upgrade
- Firmware enforces the node and GX adapter plugging order; only the next node position or GX adapter slot, based on the plugging rules, will be available
- For a 570 node add, make sure the system cable is in place before starting
- If the concurrent add includes a node plus a GX adapter, install the adapter in the node first, then add the entire unit; that way, the 128 MB of memory required by the adapter comes from the new node when it is powered on

18 Guidelines for Concurrent Add/Upgrade
For adding or upgrading:
- For multiple upgrades that include new I/O expansion drawers as well as node or GX adapter adds, the concurrent node or GX adapter add must be completed first
- The I/O drawer can then be added later as a separate concurrent I/O drawer add (sequential operations)

19 Guidelines for Concurrent Repair
Repair with the same FRU type:
- The node repair procedure does not allow any additional action beyond the repair
- The same FRU type must be used to replace a failing FRU, and no additional hardware can be added or removed during the procedure
  - For example, if a 4 GB DIMM fails, it must be replaced with a 4 GB DIMM, not a 2 GB or 8 GB DIMM
  - A RIO GX adapter must be replaced with a RIO GX adapter, not an InfiniBand GX adapter

20 Customer Responsibilities
- The customer is responsible for deciding whether to do a concurrent upgrade or repair or to schedule a maintenance window
- The customer must determine whether all prerequisites have been met and whether the configuration will support a node evacuation, if one is necessary
- In the case of an upgrade, the World-wide Customized Install Instructions (WCII) for the order ship assuming a non-concurrent installation; the WCII explain how to obtain instructions for a concurrent upgrade
- All repairs are the responsibility of IBM service personnel
- Customers are responsible for adding new 570 GX adapters

21 Display Service Effect Utility
- The Display Service Effect (DSE) utility must be run by the customer before a concurrent hot node repair or memory add/upgrade
- The utility shows memory, CPU, and I/O issues that must be addressed before a node evacuation
- The utility runs automatically at the start of a hot repair or upgrade, but it can be run manually ahead of time to determine whether the repair or upgrade can be done concurrently
- Ideally, the systems administrator should run this utility before the arrival of the IBM service representative
- The DSE utility is not required when no node evacuation is needed, as during a hot GX adapter add or a hot node add

22 Starting the Display Service Effect Utility
Note: the Power On/Off Unit dialog box is used only to access the Display Service Effect utility

23 Select Display Service Effect
Note: the Power On/Off Unit dialog box is used only to access the Display Service Effect utility

24 Select Yes – Confirm Advanced Power Control Command
This is a misleading message: it does NOT mean you are about to power off your system!

25 Display Service Effect Summary Page
Look at the details by clicking the tabs

26 Tips on How to View Data
- When working with the informational and error messages shown on the Node Evacuation Summary Status panel, start with the Platform and Partition messages (the first and last tabs)
- The impacts to the platform and partitions indicated in these messages may require shutting down partitions, for example because a partition is using I/O resources in the target node
- Shutting down a partition frees up its memory and processor resources
- If a partition must be shut down, use the Recheck button to re-evaluate the memory and processor resources

27 Platform – Informational Messages
Check both Errors and Informational Messages

28 Memory Impacts

29 Processor Impacts

30 Partition Impacts – I/O Related Conflicts

31 “White Glove” Tracking Program
- During the next several months, IBM will track concurrent maintenance operations
- In the US, potential concurrent CEC add MES orders and repairs will be proactively tracked during the feedback period
- In the NE and SW IOTs and CEEMA GMT (EMEA), the "Install PMH" or "Repair PMH" process will be used to request feedback from SSRs
- In all geographies, SSRs who perform CCM upgrades, adds, and repairs are asked to complete the feedback form at: http://w3.rchland.ibm.com/~cuii/CCM/CCMfeedback_WCII.html

32 Summary
- CCM gives customers new options for maintaining availability
- Careful advance planning is required to make it work
- Prerequisites include creating CPU and memory reserves to allow CCM, as well as configuring redundant I/O paths or preparing for the loss of I/O routes during concurrent maintenance
- Customers must run the Display Service Effect utility to determine whether a concurrent repair or memory add/upgrade can be initiated
- If concurrent repairs are not possible, a regular maintenance window must be scheduled

33 Required Reading
- Technical white paper, “IBM Power 595 and 570 Servers CEC Concurrent Maintenance Technical Overview”, available at: ftp://ftp.software.ibm.com/common/ssi/sa/wh/n/pow03023usen/POW03023USEN.PDF
- CEC Concurrent Maintenance article in the IBM Systems Hardware Information Center, available at: http://publib.boulder.ibm.com/infocenter/systems/scope/hw/index.jsp?topic=/ared3/ared3kickoff.htm

