Mining Gold from the RMF Data Mountain


1 Mining Gold from the RMF Data Mountain
CMG ‘007 – San Diego, CA
Ivan L. Gelb, Gelb Information Systems Corp.

Presentation Abstract: This presentation covers the essential RMF reports for performance management and capacity planning activities. For maximum effectiveness on the job, attendees will learn (a) important considerations for the parameters that affect data collection, (b) the minimum set of reports required to support IT system management (ITSM) activities, (c) the recommended fields on the key reports that are best for "health" check activities, and (d) how to avoid some potential pitfalls during data collection and analysis. Samples of the most important and useful reports will be presented. The emphasis will be on techniques that help us "mine" the wealth of information collected by RMF.

Attendees are encouraged to bring their own RMF-based reports in text file format to support specific questions, or to email the text file a few days before the scheduled session with notes about any important information about the situation. Any files emailed in advance will only be used in the session if permission is granted by the provider.

2 TRADEMARKS
The following are trade or service marks of the IBM Corporation: CICS, CICS TS, CICSPlex, DB2, IBM, MVS, OS/390, z/OS, Sysplex, Parallel Sysplex. Any omissions are purely unintended.
© 2007 Gelb Information Systems Corp. No part of this material can be reproduced or altered by any means without prior written permission from the author and with proper attribution displayed.

3 MOTHER OF ALL DISCLAIMERS (MOAD)
All of the information in this document is tried and true. However, this fact alone cannot guarantee that you can get the same results at your place and with your skills. In fact, some of this advice can be hurtful if it is misused or misunderstood. As with all kinds of analysis, anything you may hear or read can be understood and misunderstood in many ways that may seem contradictory to you. Gelb Information Systems Corporation, Ivan Gelb, and anyone found anywhere assume no responsibility for this information's accuracy, completeness, or suitability for any purpose. Anyone attempting to adapt these techniques to their own environment does so completely at their own risk. ☺ If anyone finds a more ridiculous and entertaining disclaimer, please bring it to my attention by email. Thanks, Ivan Gelb

4 Agenda
Your Questions…Now
SMF & RMF Introduction
RMF Reports Overview
CPU Reports
LPAR Reports
5 More Reports
Drawing for attendee prizes
PLUS: Rewards for most questions?
Note: the ✓ symbol flags recommendations.
We will focus on the current system versions: z/OS 1.7 – 1.9. The reports in this presentation were selected because they are required in at least 85% of the situations involving performance, capacity, or scalability issues. These reports show the activity of logical and physical CPUs, storage, and I/O.

5 SMF & RMF Introduction
SMF & RMF Data Collection
RMF Record Types
RMF Reports Overview
RMF Report Types
Monitor I Reports
Monitor II Reports
Monitor III Reports
RMF records collect system-wide statistics. Collection can, however, be customized down to the level of a single transaction, job, or system task. RMF provides real-time reports for ad hoc analysis, and collected records for "post mortem" analysis.

6 SMF & RMF Data Collection - 1
ERBRMF00 or ERBRMF02 member for Monitor I options. Examples:
  CYCLE(250)          /* Sample every 250 msec */
  SYNCH(SMF)          /* Use SMFPRMxx time values */
SMFPRMxx member for SMF options. Examples:
  INTVAL(mm)          /* Recording interval (30) */
  SYNCVAL(mm)         /* Recording synchronization (00) */
  INTERVAL(hhmmss)    /* NOINTERVAL is the default for SMF type 30 records */
  INTERVAL(SMF,SYNC)  /* Type 30 records synchronized based on SYNCVAL */

7 SMF & RMF Data Collection - 2
➢ Processor overhead for record collection increases as the CYCLE value is decreased.
➢ A shorter INTVAL produces more SMF and RMF records and higher collection-related overhead.
➢ Recommended service definition coefficients:
    MSO = 0
    CPU = 1.0
    SRB = 1.0
    IOC = 1.0, or less by orders of 10 (0.1 or 0.01; IBM recommends 0.5)
➢ Note the potential impact on chargeback algorithms if they use service units in their calculations.
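The trade-off between CYCLE, INTVAL, and collection volume can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and it assumes only that Monitor I takes one sample per CYCLE and writes one set of interval records per INTVAL.

```python
def samples_per_interval(cycle_ms: int, intval_min: int) -> int:
    """Number of Monitor I samples taken in one recording interval."""
    return (intval_min * 60 * 1000) // cycle_ms


def intervals_per_day(intval_min: int) -> int:
    """Number of recording intervals (sets of interval records) per day."""
    return (24 * 60) // intval_min


if __name__ == "__main__":
    # (CYCLE in msec, INTVAL in minutes); the first pair matches the slide examples.
    for cycle_ms, intval_min in [(250, 30), (250, 15), (1000, 15)]:
        print(
            f"CYCLE({cycle_ms}) INTVAL({intval_min}): "
            f"{samples_per_interval(cycle_ms, intval_min)} samples per interval, "
            f"{intervals_per_day(intval_min)} intervals per day"
        )
```

Halving INTVAL doubles the number of interval records written per day while halving the number of samples behind each one, which is worth keeping in mind when judging how far to trust sampled values.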

8 RMF Record Types Summary
70-1 Processor
70-2 Crypto processor
71 Paging activity
72-1 Workload PGNs (compatibility mode)
72-3 Workload service classes (goal mode)
73 Channel path activity
74-1 Device activity
74-2 XCF activity
74-5 Cache activity
74-7 FICON director activity
75 Page data set activity
77 Enqueue activity
78-2 Virtual storage activity

9 RMF Report Types
Monitor I – 20+ real-time reports and long-term data collection
Monitor II – 20+ activity snapshot reports
Monitor III – 50+ interactive performance analysis reports and long-term data collection
Other RMF data-based reporting tools (downloads):
➢ Spreadsheet Reporter
➢ RMF PA (Performance Analyzer)
Monitor I can produce these reports at the end of each collection interval or on demand, or they can be produced by the Postprocessor component at a later time. Monitor II can produce snapshot reports on demand or at definable intervals. Monitor III can produce sysplex-wide or single-system reports of the delays experienced by a job, group of jobs, service class, TSO, OMVS, enclaves, etc. The two free tools can ease analysis based on RMF metrics without having to buy anything else. The Spreadsheet Reporter ports RMF metrics to Excel or Lotus for analysis and graphics preparation. The RMF Performance Analyzer, another free product, produces a full set of RMF reports and charts.

10 Monitor I Reports
CACHE – Cache Subsystem
CF – Coupling Facility Activity
CHAN – Channel Path Activity
CPU – CPU Activity
CRYPTO – Crypto Hardware Activity
DEVICE – Device Activity
DOMINO – Lotus Domino Server
ENQ – Enqueue Activity
FCD – FICON Director Activity
HFS – Hierarchical File System
HTTP – HTTP Server
IOQ – I/O Queuing Activity
OMVS – OMVS Kernel Activity
PAGESP – Page/Swap Data Set Activity
PAGING – Paging Activity
SDEVICE – Shared Device Activity
TRACE – Trace Activity
VSTOR – Virtual Storage Activity
WKLD – Workload Activity (compat mode)
WLMGL – Workload Activity (goal mode)
XCF – Cross-system Coupling Activity
RMF Monitor I reports are available for each type of z/OS resource and system activity that supports applications. Monitor I can produce these reports at the end of each collection interval, or they can be produced by the Postprocessor component at a later time.

11 Monitor II Reports
ARD / ARDJ – Address Space Resource Data
ASD / ASDJ – Address Space Data
ASRM / ASRMJ – Address Space SRM Data
CHANNEL – Channel Path Activity
DDMN – Domain Activity
DEV / DEVV – Device Activity
HFS – Hierarchical File System
ILOCK – IRLM Long Lock Detect
IOQUEUE – I/O Queuing Activity
LLI – Library List
PGSP – Page/Swap Data Set Activity
SDS – Sysplex Data Server
SENQ – System ENQ Contention
SENQR – System ENQ Reserve
SPAG – Paging Activity
SRCS – Central Storage / Processor / SRM
TRX – Transaction Activity
RMF Monitor II reports provide insights into the applications from an address space point of view. All resources consumed by an application can be reported. Monitor II can produce snapshot reports on demand or at definable intervals.

12 Monitor III Reports
Monitor III can produce sysplex-wide or single-system reports of the delays experienced by a job, group of jobs, service class, TSO, OMVS, enclaves, etc. We will present just 7 of the more than 50 available reports:
Delay Report
Processor Delays
CF Overview
CF Systems
Device Delays
VSAM LRU Overview
VSAM RLS Activity by Storage Class and by Data Set
RMF Monitor III provides an interactive tool for "fire fights" when someone is complaining. However, it can also function as a fire preventer if the effort is made to set it up with meaningful threshold values that show when a system may be heading into a problem situation, rather than waiting until it gets there.

13 Who, What, How Much, & Analysis
RMF Delay Report
CPU Activity Report & Analysis
LPAR Activity Report & Analysis
CF Activity Report & Analysis
Workload Activity Report & Analysis
I/O Device Activity Report & Analysis
File I/O Activity Report & Analysis
Seven reports will be examined next.

14 M3- Which Resources Cause Delays
This is the version 1.8 summary Delay Report. Please note that this report is produced from statistical samples, so it is subject to distortions caused by not being based on enough samples. The report first provides the task state analysis broken down into USAGE, DELAY, IDLE, and UNKNOWN. UNKNOWN often dominates the distribution. Next is a summary of the % of time a task was found delayed for processor, devices, storage, subsystems, operator, and enqueue. The last column lists the primary reason for delays.

15 Which System Resources Cause Delays… NOTES
Use to quickly establish which system resources are delaying the work.
"% Delayed for" indicators are:
PRC = in and ready, but work not being dispatched on a CPU
DEV = delayed for disk or tape
STR = delayed for storage reasons such as COMM, LOCAL, SWAP, XMEM, or found on the OUT & READY queue
SUB = delayed by JES, HSM, XCF
OPR = delayed by an operator message, a mount request, or a quiesce command by the operator
ENQ = delayed waiting for any enqueued resource
Address space type column – CX:
A = ASCH
B = Batch
E = Enclave
O = OMVS
S = Started Task
T = TSO
? = invalid/missing data
O as the second character indicates an OMVS process for this task.
The Cr column indicates the CPU critical or Storage critical attribute for this address space.

16 CPU, LPAR & CF Activity Reports
CPU Activity Reports
Processor Delays
What is Your LPAR's Guaranteed Capacity?
LPAR Partition Data Report
Coupling Facility (CF) Activity Report

17 PP- CPU Activity Report - Part 1

18 PP- CPU Activity Report - Part 2

19 CPU Activity Report…. NOTES
✓ Provides 100% accurate CPU utilization figures for all LPARs and for each LPAR individually. Use it in conjunction with workload activity measurements to establish CPU utilization capture ratios.
Observe and consider:
ONLINE TIME – less than 100% indicates a CPU being varied online or offline. IRD or a manual process may cause this.
LPAR BUSY % – the % of each allocated CPU that this LPAR utilized. Less than 100% indicates possible capacity issues.
MVS BUSY % – the LPAR's % CPU utilization. 100% should cause performance and capacity concerns if (a) anyone complains, and (b) critical workloads + SYSTEM make up 90-95%+ of the utilization.
QUEUE LENGTHS (%) – indicates how many others you may have to wait behind for CPU access.
IN READY – address spaces ready to run but with no CPU available.
OUT READY – even worse than IN READY if the OUTs are workloads you care about. See the workload activity reports to determine the victims.
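A capture ratio compares the CPU time attributed to workloads with the total CPU time the system reports as busy. The sketch below is one simple way to compute it and is not taken from the slides: the function name, the input layout, and the sample numbers are all hypothetical, and which busy percentage to use as the denominator is left to the analyst.

```python
def capture_ratio(workload_cpu_seconds, busy_pct, online_cpus, interval_seconds):
    """Captured CPU seconds divided by total busy CPU seconds for one interval.

    workload_cpu_seconds: CPU seconds per workload from the workload activity report.
    busy_pct:             CPU busy percentage from the CPU Activity report.
    online_cpus:          number of CPUs online for the whole interval.
    interval_seconds:     length of the RMF interval in seconds.
    """
    total_busy_seconds = busy_pct / 100.0 * online_cpus * interval_seconds
    if total_busy_seconds <= 0:
        raise ValueError("no measured CPU busy time in this interval")
    return sum(workload_cpu_seconds.values()) / total_busy_seconds


if __name__ == "__main__":
    # Hypothetical 15-minute interval on a 4-way LPAR that was 70% busy.
    captured = {"CICS": 1200.0, "BATCH": 900.0, "TSO": 150.0}
    ratio = capture_ratio(captured, busy_pct=70.0, online_cpus=4, interval_seconds=900)
    print(f"capture ratio = {ratio:.3f}")
```

A ratio well below 1.0 means a sizable part of the measured CPU time is not attributed to any workload, so workload-level CPU figures need scaling up before they are used for capacity planning.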

20 Monitor III (M3) Processor Delays - 1

21 Monitor III Processor Delays - 1... NOTES
The Processor Delays report identifies who is delayed and by ABOUT how much.
DLY % = (# of Delay Samples / # of Samples) * 100 is the % of time the task is delayed from getting CPU time.
USG % = (# of Using Samples / # of Samples) * 100 is the % of time the task is receiving CPU service.
Holding Job(s) – up to three tasks that contributed most to the delay.
Note that delays are collected via statistical sampling! The reduced-preemption approach of MVS is the cause of the always-present CPU delay.
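The two percentages above are plain sample ratios, which a short sketch makes explicit. The function name and the sample counts are hypothetical; only the formulas come from the slide.

```python
def sample_pct(matching_samples: int, total_samples: int) -> float:
    """Percentage of samples in which the task was found in a given state."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    return matching_samples / total_samples * 100.0


if __name__ == "__main__":
    # Hypothetical job observed over 100 samples in one reporting range.
    total = 100
    delay_samples = 25   # samples where the job waited for a CPU
    using_samples = 40   # samples where the job was using a CPU
    print(f"DLY % = {sample_pct(delay_samples, total):.1f}")
    print(f"USG % = {sample_pct(using_samples, total):.1f}")
```

With only 100 samples, a single sample moves either value by a full percentage point, which is why the report tells you ABOUT how much a task is delayed rather than exactly how much.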

22 ✓ What is Your LPAR's Guaranteed Capacity?
An LPAR's share determines its physical CP capacity.
LPAR weights and the number of logical CPUs determine the share:
Share = LCPU / Tot-PCPU * LPAR weight / ∑ LPAR weights
Example: two LPARs, PRODA with weight 700 and PRODB with weight 300, each with access to all 10 physical CPs:
PRODA share = 10/10 * 700/1000 = 0.70, or 7.0 of the 10 CPs
PRODB share = 10/10 * 300/1000 = 0.30, or 3.0 of the 10 CPs
LPAR weights are ONLY enforced if Physical CP BUSY = 100% or if the LPAR is capped by PR/SM.
If PRODA only utilizes 2.0 CPs most of the time, PRODB could get the other 8.0 CPs if it needs them! When PRODA gets busy using its maximum share, PRODB will be held to its own 3.0 CPs!
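The share formula on this slide can be worked through for any set of weights. The sketch below follows the slide's formula as written and then converts the share into physical CPs; the function name and data layout are hypothetical.

```python
def guaranteed_cps(weights, logical_cpus, physical_cps):
    """Guaranteed capacity per LPAR, in physical CPs.

    Share = LCPU / Tot-PCPU * weight / sum of weights (as on the slide),
    then multiplied by the number of physical CPs to express it in CPs.
    """
    total_weight = sum(weights.values())
    result = {}
    for name, weight in weights.items():
        share = logical_cpus[name] / physical_cps * weight / total_weight
        result[name] = share * physical_cps
    return result


if __name__ == "__main__":
    # The slide's example: two LPARs, each with 10 logical CPs on 10 physical CPs.
    capacity = guaranteed_cps(
        weights={"PRODA": 700, "PRODB": 300},
        logical_cpus={"PRODA": 10, "PRODB": 10},
        physical_cps=10,
    )
    for name, cps in capacity.items():
        print(f"{name}: {cps:.1f} CPs guaranteed")
```

These figures are what each LPAR is guaranteed under contention, not a ceiling: as the slide notes, an uncapped LPAR can use more when the others leave capacity idle.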

23 PP- LPAR Partition Data Report
To minimize LPAR overhead, try to keep the ratio of logical to physical CPUs no greater than 2:1. This ratio is calculated by adding up the logical CPUs defined in all LPARs and dividing that total by the number of available physical CPUs.
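The logical-to-physical ratio is easy to check from the Partition Data Report. This is a minimal sketch with hypothetical LPAR names and counts; the 2:1 limit is the rule of thumb stated on the slide.

```python
def logical_to_physical_ratio(logical_cpus_by_lpar, physical_cpus):
    """Sum of logical CPUs defined across all LPARs per available physical CPU."""
    if physical_cpus <= 0:
        raise ValueError("physical_cpus must be positive")
    return sum(logical_cpus_by_lpar.values()) / physical_cpus


if __name__ == "__main__":
    # Hypothetical configuration: three LPARs sharing 10 physical CPUs.
    lpars = {"PRODA": 8, "PRODB": 6, "TEST": 4}
    ratio = logical_to_physical_ratio(lpars, physical_cpus=10)
    verdict = "within" if ratio <= 2.0 else "above"
    print(f"logical:physical = {ratio:.2f}:1 ({verdict} the 2:1 guideline)")
```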

24 LPAR Partition Data Report… NOTES
The Partition Data Report is from the RMF Postprocessor. It is the most useful single place to see defined and actual LPAR capacity reporting.
WGT – the LPAR's weight / total defined weight is the % SHARE at which this LPAR will be dispatched by PR/SM if it needs CPU service
MSU DEF and ACT – defined and actual LPAR MSUs
CAPPING DEF – the partition's capping option
CAPPING WLM% – % of time WLM capped this LPAR
LPAR MGT – LPAR management overhead
Type = AAP for zAAP processors
Type = IIP for zIIP processors

25 CF Activity Reports and Analysis
Data collection is controlled by the ERBRMFxx option CFDETAIL or NOCFDETAIL.
CFDETAIL collects a lot of SMF data!
To reduce system overhead, data collection is done on only one member of the sysplex, as decided automatically by the RMF Sysplex Data Server.

26 M3- CF Activity - 1
If PROCESSOR UTIL% is high (95%+???):
Under PR/SM, dedicate CPs or add CPs to the partition
Rebalance by moving structures to a less utilized CF, if available
Buy more or faster CFs

27 M3- CF Activity - 2
AVG SERV is in microseconds! Do compare Async. Serv. to Disk Serv.!
➢ CHNG% – percent of requests changed from sync to async
➢ DEL% – percent of requests delayed by subchannel contention or dump serialization

28 Workload Activity Reports & Analysis
RMF Workload Measurements
RMF Workload Activity – 1 CICS Service
RMF Workload Activity – 2 TSO Service
RMF Workload Activity Report Analysis

29 RMF Workload Measurements
Basically, the BTE number is the TOR's point of view of the response time, while EXE is everything else. We could have drawn another box for an FOR, which would be a subset of EXE. Transactions that span multiple regions will have multiple EXE lines. DB2 activity is issued from the AORs to DB2. Source: Chris Baker, IBM

30 PP- RMF Workload Activity - 1 CICS
REPORT BY: POLICY=HPTSPOL1 WORKLOAD=PRODWKLD SERVICE CLASS=CICSHR RESOURCE GROUP=*NONE PERIOD=1 IMPORTANCE=HIGH
[Report sample; numeric values not captured in this transcript. Columns: TRANSACTIONS (AVG, MPL, ENDED, END/SEC, #SWAPS, EXECUTD), TRANSACTION TIME HHH.MM.SS.TTT (ACTUAL, QUEUED, EXECUTION, STANDARD DEVIATION), and RESPONSE TIME BREAKDOWN IN PERCENTAGE by SUB TYPE and P, with TOTAL, ACTIVE, READY, IDLE, WAITING FOR (LOCK, I/O, CONV, DIST, LOCAL, SYSPL, REMOT, TIMER, PROD, MISC), and STATE SWITCHED TIME (%) (LOCAL, SYSPL, REMOT). Rows: CICS BTE and CICS EXE.]
This is a sample RMF Postprocessor (ERBRMFPP) output with option SYSRPTS(WLMGL(SCPER)). PP = RMF Postprocessor report.
This report shows two connected MRO regions, so we have one each of the BTE and EXE reported lines. The response time breakdown into percentages is a great source for analyzing the components of SLAs (service level agreements). Just remember that this is data sampled at variable rates, either 4 per second with transaction-level management or 1 per 2.5 seconds with region-level management, so the breakdown may be misleading if the sample is not statistically significant.
➔ How is RMF finding this out? These numbers are vulnerable to all kinds of things because they come from CICS Performance Block (PB) sampling at the rates given above. If this is not enough, the CICS Measurement Facility (CMF) gives you the response time and its elements as measured by CICS.
The PROD column on this report is the % of time CICS "thinks" this transaction is waiting for DB2, IMS, MQSeries, or some other subsystem's activity to complete. Watch out for the MISC field being the largest component within RESPONSE TIME BREAKDOWN. This field accumulates the transaction time that CICS cannot identify.
Source: Chris Baker, IBM

31 PP- RMF Workload Activity - 2 TSO Part 1
The WORKLOAD ACTIVITY report is only useful for performance and capacity management if the service class contains homogeneous work. If the work in a service class or service class period is non-homogeneous, then report service classes should be created for the business-critical work (typically at least all IMPORTANCE=1 work).

32 PP- RMF Workload Activity – 2 TSO Part 2

33 RMF Workload Activity – 2…. NOTES
CPU and STORAGE – service class attributes
TRANSACTIONS – number of transactions and related statistics
TRANS. TIME – various transaction time measures
DASD I/O – rate and response time components
SERVICE RATES
➢ PAGE-IN RATES – monitor them carefully
➢ MSO coefficient should be 0! Other values will produce unstable performance on zSeries processors!
➢ APPL% can be greater than 100%. If a single CICS region is in the report, it can track CICS TCB saturation risk.
APPL% includes AAPCP and IIPCP time! AAPCP and IIPCP are the time that zAAP- and zIIP-eligible work spent on standard CPs. The PROJECTCPU option in SYS1.PARMLIB member IEAOPTxx is needed for AAPCP and IIPCP.
APPL% calculation terms: IIT = I/O interrupt time, HST = Hiperspace service time, RCT = region control task time
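The APPL% formula itself did not survive in this transcript; only the IIT, HST, and RCT abbreviations did. The sketch below therefore rests on an assumption: that APPL% is the sum of the service times (TCB, SRB, RCT, IIT, HST), less any time that ran on zAAPs or zIIPs, divided by the interval length. Treat the formula, the function name, and the numbers as illustrative and check them against the RMF Report Analysis manual before relying on them.

```python
def appl_pct(tcb, srb, rct, iit, hst, interval_seconds, aap=0.0, iip=0.0):
    """Approximate APPL% for standard CPs over one interval (assumed formula).

    All times are in seconds. aap and iip are the seconds that actually ran
    on zAAP and zIIP processors and are subtracted from the total.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    cp_seconds = tcb + srb + rct + iit + hst - aap - iip
    return cp_seconds / interval_seconds * 100.0


if __name__ == "__main__":
    # Hypothetical service class in a 10-minute (600 second) interval.
    value = appl_pct(tcb=500.0, srb=100.0, rct=10.0, iit=5.0, hst=5.0,
                     interval_seconds=600.0)
    print(f"APPL% = {value:.1f}")  # above 100 means more than one CP's worth of work
```

Because the numerator is CPU seconds summed across all processors, a service class running on several CPs can legitimately exceed 100%, while a single CICS region approaching 100% is nearing the limit of what one TCB can consume.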

34 RMF Workload Activity Report Analysis
The response time distribution report is the best, and usually the lowest-overhead, source for designing response time goals. The workload activity response time distribution report can be produced for a variety of report classes in support of service policy development activities.
➢ Quick and low-overhead source of service and utilization data.
➢ Watch out for "funny" samples in STATE SAMPLE BREAKDOWN (%) – WAITING FOR. Each state sample category's value, except OTHR, is based on the last 14 non-zero values.

35 I/O Activity Reports & Analysis
Device Activity Components
I/O Device Activity – 1
I/O Device Activity – 2
Device Delays
Device Activity Tuning
VSAM File I/O Activity

36 ✓ Device Activity Components
CONN = data transfer time
DISC = time disconnected from the channel, consisting of SEEK and SET SECTOR, latency (waiting for the record to be under the head), and RPS (obsolete with ESS – Sharks)
PEND = I/O delays in the access path. May include delays caused by the channel, control unit, or director port. Often caused by shared DASD!
IOSQ = wait for another task on the same system to finish using this device
What I/O response time is too high? WARNING: this is a trick question. Analyze the response time components to decide what to do.
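Since device response time is the sum of the four components described above, a first step in any analysis is to see which one dominates. The sketch below illustrates that step; the function name and the sample values are hypothetical.

```python
def breakdown(components_ms):
    """Return total response time and each component's share of it.

    components_ms maps IOSQ, PEND, DISC, and CONN to average milliseconds.
    """
    total = sum(components_ms.values())
    if total <= 0:
        raise ValueError("no response time recorded")
    shares = {name: ms / total * 100.0 for name, ms in components_ms.items()}
    dominant = max(components_ms, key=components_ms.get)
    return total, shares, dominant


if __name__ == "__main__":
    # Hypothetical volume: average times in milliseconds.
    volume = {"IOSQ": 0.2, "PEND": 0.3, "DISC": 4.0, "CONN": 1.5}
    total, shares, dominant = breakdown(volume)
    print(f"response time = {total:.1f} ms, dominated by {dominant}")
    for name, pct in shares.items():
        print(f"  {name}: {pct:.0f}%")
```

The same total can call for very different actions: a volume dominated by CONN is mostly doing useful data transfer, while one dominated by IOSQ, PEND, or DISC is mostly waiting.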

37 I/O Device Activity (RMF PP Report)
The DASD Activity report tells us all we need to know about a single volume. Please note that DEV ACTIVITY RATE is in I/Os per second. AVG RESP TIME and all timing fields are given in milliseconds (ms).
Possibly unproductive activity to watch out for:
IOSQ TIME – IOS queue time; mostly eliminated via use of WLM-managed dynamic or static Parallel Access Volumes (PAV)
DPB DLY – director port delay
DB DLY – delay due to device busy
PEND TIME – total pending time the I/O was delayed
DISC TIME – total disconnect time
AVG CONN TIME – the time required for data transfer; large blocks and certain utilities can drive this to be almost all of the %DEV UTIL
%DEV CONN – % of time the device was connected
%DEV UTIL – % of time the device was busy. If %DEV CONN is very close to %DEV UTIL, little can be done to tune this device other than reducing the size or number of I/Os.
%DEV RESV – % of time the device was reserved. If this is greater than 10 – 20% of %DEV UTIL, its cause should be determined and eliminated if possible.
AVG NUMBER ALLOC – reveals how many files were open on the volume. Did you expect to be alone?
%MT PENDING – mount pending time should be zero.
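The three percentage-based checks in these notes can be applied mechanically to each volume. This sketch is only an illustration: the flag names are invented, "very close" is arbitrarily taken as %DEV CONN being at least 90% of %DEV UTIL, and the reserve check uses the lower end (10%) of the range given in the notes.

```python
def device_flags(dev_util_pct, dev_conn_pct, dev_resv_pct, mt_pending_pct,
                 conn_close_fraction=0.9, resv_fraction=0.10):
    """Return a list of findings for one volume from its percentage fields."""
    flags = []
    if dev_util_pct > 0 and dev_conn_pct >= conn_close_fraction * dev_util_pct:
        # Almost all busy time is data transfer: reduce I/O size or count.
        flags.append("CONN_BOUND")
    if dev_util_pct > 0 and dev_resv_pct > resv_fraction * dev_util_pct:
        # Reserve time is a notable share of busy time: find the cause.
        flags.append("HIGH_RESERVE")
    if mt_pending_pct > 0:
        # Mount pending time should be zero.
        flags.append("MOUNT_PENDING")
    return flags


if __name__ == "__main__":
    # Hypothetical volumes: (%DEV UTIL, %DEV CONN, %DEV RESV, %MT PENDING)
    for volser, fields in {"PROD01": (40.0, 38.0, 1.0, 0.0),
                           "SHR002": (30.0, 10.0, 6.0, 0.5)}.items():
        print(volser, device_flags(*fields))
```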

38 M3- Device Delays
The Device Delays report shows which devices delay a particular workload, and what the chief contributors to these delays are.
DLY % – delay this job experienced
USG % – using %
CON % – connect %
MAIN DELAY VOLUME(S) – % delay contributed by the top 4 volumes
The values for the "C" column can be: B for batch, S for system tasks, T for TSO

39 ✓ Device Activity Tuning - 1
I/O priority ON (check for APAR OW47667)
CONN = data transfer time
DISC, IOSQ, PEND are I/O delays
Enable Parallel Access Volumes (PAV) to reduce or eliminate IOSQ
Manage static PAVs to minimize IOSQ
Manage the number of dynamic PAVs via policy to minimize IOSQ
ESS (Shark) multiple allegiance support reduces contention reported as PEND time
Track cache performance and manage it as needed

40 ✓ Device Activity Tuning - 2
DISC > 2-5 msec with cache may indicate problem(s):
Not enough Non-Volatile Storage (NVS), or NVS gets filled.
Poor cache hit ratio on IBM ESS.
High physical disk utilization. May need to move data to balance the activity between available resources.
Very high disk-to-cache transfer activity rate.
DISC > 13 msec may indicate RPS misses due to path contention. This should not occur on IBM ESS.
If %DEV UTIL > 35%, work to reduce the activity rate on the device:
Balance activity better across available resources
Isolate, or do not cache, files and volumes that are BAD cache candidates
Tune based on analysis of caching activity
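The thresholds on this slide lend themselves to a simple screening pass over the device report. The following sketch is an illustration only: the finding names are invented, and for cached devices it uses the upper end (5 msec) of the slide's 2-5 msec range as the default limit.

```python
def disc_findings(disc_ms, dev_util_pct, cached=True, cached_disc_limit_ms=5.0):
    """Screen one volume against the DISC and %DEV UTIL rules of thumb."""
    findings = []
    if cached and disc_ms > cached_disc_limit_ms:
        # NVS shortage, poor cache hit ratio, or a busy physical disk.
        findings.append("CACHE_OR_NVS_SUSPECT")
    if disc_ms > 13.0:
        # Possible RPS misses due to path contention.
        findings.append("RPS_MISS_SUSPECT")
    if dev_util_pct > 35.0:
        # Work to reduce the activity rate on this device.
        findings.append("REDUCE_ACTIVITY_RATE")
    return findings


if __name__ == "__main__":
    # Hypothetical volumes: (DISC msec, %DEV UTIL, cached?)
    samples = {"VOL001": (6.0, 20.0, True),
               "VOL002": (15.0, 40.0, False),
               "VOL003": (1.0, 10.0, True)}
    for volser, (disc, util, cached) in samples.items():
        print(volser, disc_findings(disc, util, cached=cached))
```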

41 M3- File I/O Tuning – VSAM LRU - 1
The buffer goal limit defaults to 100 MB and can be 1.5 GB at most; see IGDSMSxx in your PARMLIB for details.
➢ "Accel %" – when LRU aging algorithms were accelerated
➢ "Reclaim %" – when aging algorithms were bypassed to reclaim buffers
➢ "Read BMF%" – data found in local buffers
➢ "Read CF%" – data found in the Coupling Facility (CF) cache
➢ "Read DASD%" – data read from DASD
Monitor the average CPU time used by BMF LRU.

42 M3- File I/O Tuning – VSAM RLS - 1
VSAM RLS activity by data set. Also available by Storage Class.

43 File I/O Tuning – VSAM RLS…NOTES
"LRU Status" – status of local buffers under Buffer Manager Facility (BMF) control:
➢ GOOD = BMF at or below goal
➢ ACCELERATED = buffer aging algorithms accelerated because BMF is over goal
➢ RECLAIMED = buffer aging algorithms bypassed because BMF is over goal
"BMF Valid %" – percent of BMF reads that were valid
NOTE: BMF read hits are the sum of valid and invalid hits. Buffers can be invalid because (A) the data was altered, or (B) the CF lost track of the buffer status.
BMF READ HIT% = BMF READ% / BMF VALID% * 100
BMF INVALID READ HIT% = BMF READ HIT% - BMF READ%
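The two BMF formulas above can be worked through with sample figures. This is a minimal sketch: the function names and the input values are hypothetical, and the arithmetic simply follows the formulas in the notes.

```python
def bmf_read_hit_pct(bmf_read_pct, bmf_valid_pct):
    """BMF READ HIT% = BMF READ% / BMF VALID% * 100."""
    if bmf_valid_pct <= 0:
        raise ValueError("bmf_valid_pct must be positive")
    return bmf_read_pct / bmf_valid_pct * 100.0


def bmf_invalid_read_hit_pct(bmf_read_pct, bmf_valid_pct):
    """BMF INVALID READ HIT% = BMF READ HIT% - BMF READ%."""
    return bmf_read_hit_pct(bmf_read_pct, bmf_valid_pct) - bmf_read_pct


if __name__ == "__main__":
    # Hypothetical data set: 80% of reads satisfied from valid local buffers,
    # and 95% of buffer hits were valid.
    read_pct, valid_pct = 80.0, 95.0
    print(f"BMF READ HIT%         = {bmf_read_hit_pct(read_pct, valid_pct):.2f}")
    print(f"BMF INVALID READ HIT% = {bmf_invalid_read_hit_pct(read_pct, valid_pct):.2f}")
```

A growing invalid hit percentage means buffers are being found locally but cannot be used, which points back to the two causes named in the notes: altered data or lost buffer status in the CF.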

44 Summary
We examined just 7 main types of reports out of the 90+ available from RMF in real time or via the Postprocessor. They are:
RMF Delay Report
CPU Activity Report
LPAR Activity Report
CF Activity Report
Workload Activity Report
I/O Device Activity Report
VSAM File I/O Activity Report
With practice, you should be able to find the "gold" and solve performance "mysteries" by looking at just 1 – 3 RMF reports.

45 Need/Want to Know More…
Start at the documentation:
SC RMF Report Analysis
SC RMF Performance Management Guide
SA z/OS MVS Planning: Workload Management
RMF Newsletters
IBM and SHARE presentations
Computer Measurement Group
Large Systems Performance Reference

