Presentation on theme: "Mining Gold from the RMF Data Mountain"— Presentation transcript:
1 Mining Gold from the RMF Data Mountain
CMG ‘007 – San Diego, CA
Ivan L. Gelb
Gelb Information Systems Corp.
Phone:
Presentation Abstract: This presentation is about the essential RMF reports for performance management and capacity planning activities. For maximum effectiveness on the job, attendees will learn (a) important considerations for parameters affecting the data collection, (b) the minimum set of reports required to support IT system management (ITSM) activities, (c) the recommended fields on the key reports which are best for “health” check activities, and (d) how to avoid some potential pitfalls during the data collection and analysis. Samples of the most important and useful reports will be presented.
The emphasis will be on techniques that help us "mine" the wealth of information collected by RMF. Attendees are encouraged to bring their own RMF data based reports in text file format for support of specific questions, or to email the text file a few days before the scheduled session, with notes about any important information about the situation. Any files emailed in advance will only be used in the session if permission is granted by the provider.
3 MOTHER OF ALL DISCLAIMERS (MOAD)
All of the information in this document is tried and true. However, this fact alone cannot guarantee that you can get the same results at your place and with your skills. In fact, some of this advice can be hurtful if it is misused or misunderstood. As with all kinds of analysis, anything you may hear or read can be understood and misunderstood in many ways that may seem contradictory to you. Gelb Information Systems Corporation, Ivan Gelb and anyone found anywhere assume no responsibility for this information’s accuracy, completeness or suitability for any purpose. Anyone attempting to adapt these techniques to their own environment does so completely at their own risk. If anyone finds a more ridiculous and entertaining disclaimer, please bring it to my attention by email.
Thanks, Ivan Gelb
4 Agenda
Your Questions…Now
SMF & RMF Introduction
RMF Reports Overview
CPU Reports
LPAR Reports
5 More Reports
Drawing for attendee prizes
Note: symbol flags recommendations
PLUS: Rewards for most questions?
We will focus on the current system versions: z/OS 1.7 – 1.9.
The presented reports were selected for this presentation because they are required in at least 85% of the situations involving performance, capacity, or scalability issues.
These reports show the activity of logical and physical CPUs, storage, and I/O.
5 SMF & RMF Introduction
SMF & RMF Data Collection
RMF Record Types
RMF Reports Overview
RMF Report Types
Monitor I Reports
Monitor II Reports
Monitor III Reports
RMF records collect system-wide statistics. They can, however, be customized down to the level of a single transaction, job, or system task.
RMF provides real-time reports for ad hoc analysis, and collected records for “post mortem” analysis.
6 SMF & RMF Data Collection - 1
ERBRMF00 or 02 member for Monitor I options. Examples:
CYCLE(250) /* Sample every 250 msec. */
SYNC(SMF) /* Use SMFPRMxx time values */
SMFPRMxx member for SMF options. Examples:
INTVAL(mm) /* recording interval (30) */
SYNCVAL(mm) /* recording synchronization (00) */
INTERVAL(hhmmss) /* NOINTERVAL is default for SMF 30s */
INTERVAL(SMF,SYNC) /* type 30s synchronized based on SYNCVAL */
7 SMF & RMF Data Collection - 2
Processor overhead for record collection increases as the CYCLE value is decreased.
A shorter INTVAL produces more SMF and RMF records and higher collection-related overhead.
Recommended service definition coefficients:
MSO = 0
CPU = 1.0
SRB = 1.0
IOC = 1.0 or less by orders of 10 (0.1 or 0.01; IBM recommends 0.5)
Note the potential impact on chargeback algorithms if they use service units in their calculations.
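The trade-off between CYCLE, INTVAL and collection overhead can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes one sample per CYCLE and one set of records per INTVAL, which is a simplification of what RMF actually writes.

```python
def samples_per_interval(cycle_ms: int, intval_minutes: int) -> int:
    """Samples taken in one recording interval (assumes one sample per CYCLE)."""
    if cycle_ms <= 0 or intval_minutes <= 0:
        raise ValueError("cycle_ms and intval_minutes must be positive")
    return (intval_minutes * 60 * 1000) // cycle_ms


def intervals_per_day(intval_minutes: int) -> int:
    """Recording intervals per day (assumes one set of records per INTVAL)."""
    if intval_minutes <= 0:
        raise ValueError("intval_minutes must be positive")
    return (24 * 60) // intval_minutes


if __name__ == "__main__":
    # Compare a few hypothetical CYCLE / INTVAL combinations.
    for cycle_ms, intval in [(1000, 30), (250, 30), (250, 15)]:
        print(
            f"CYCLE({cycle_ms}) INTVAL({intval}): "
            f"{samples_per_interval(cycle_ms, intval)} samples per interval, "
            f"{intervals_per_day(intval)} intervals per day"
        )
```

Going from CYCLE(1000) to CYCLE(250) quadruples the sampling work per interval, and halving INTVAL doubles the number of intervals recorded per day.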
9 RMF Report Types
Monitor I – 20+ real-time reports and long-term data collection
Monitor II – 20+ activity snapshot reports
Monitor III – 50+ interactive performance analysis reports and long-term data collection
Other RMF data based reporting tools (downloads): Spreadsheet Reporter, RMF PA (Performance Analyzer)
Monitor I can produce these reports at the end of each collection interval, on demand, or they can be produced by the Postprocessor component at a later time.
Monitor II can produce snapshot reports on demand or at definable intervals.
Monitor III can produce reports, Sysplex-wide or for a single system, of the delays experienced by a job, group of jobs, service class, TSO, OMVS, enclaves, etc.
The two free tools can ease analysis based on RMF metrics without having to buy anything else.
The Spreadsheet Reporter ports RMF metrics to Excel or Lotus for analysis and graphics preparation.
The RMF Performance Analyzer, another free product, produces a full set of RMF reports and charts.
10 Monitor I Reports
CACHE – Cache Subsystem
CF – Coupling Facility Activity
CHAN – Channel Path Activity
CPU – CPU Activity
CRYPTO – Crypto Hardware Activity
DEVICE – Device Activity
DOMINO – Lotus Domino Server
ENQ – Enqueue Activity
FCD – FICON Director Activity
HFS – Hierarchical File System
HTTP – HTTP Server
IOQ – IO Queuing Activity
OMVS – OMVS Kernel Activity
PAGESP – Page/Swap Data Set Activity
PAGING – Paging Activity
SDEVICE – Shared Device Activity
TRACE – Trace Activity
VSTOR – Virtual Storage Activity
WKLD – Workload Activity (compat mode)
WLMGL – Workload Activity (goal mode)
XCF – Cross-system Coupling Activity
RMF Monitor I reports are available for each type of z/OS resource and system activity that supports applications.
Monitor I can produce these reports at the end of each collection interval, or they can be produced by the Postprocessor component at a later time.
11 Monitor II Reports
ARD / ARDJ – Address Space Resource Data
ASD / ASDJ – Address Space Data
ASRM / ASRMJ – Address Space SRM Data
CHANNEL – Channel Path Activity
DDMN – Domain Activity
DEV / DEVV – Device Activity
HFS – Hierarchical File System
ILOCK – IRLM Long Lock Detect
IOQUEUE – IO Queuing Activity
LLI – Library List
PGSP – Page/Swap Data Set Activity
SDS – Sysplex Data Server
SENQ – Systems ENQ Contention
SENQR – System ENQ Reserve
SPAG – Paging Activity
SRCS – Central Storage / Processor / SRM
TRX – Transaction Activity
RMF Monitor II reports provide insights into the applications from an address space point of view. All resources consumed by an application can be reported.
Monitor II can produce snapshot reports on demand or at definable intervals.
12 Monitor III Reports
Monitor III can produce reports, Sysplex-wide or for a single system, of the delays experienced by a job, group of jobs, service class, TSO, OMVS, enclaves, etc. We will present just 7 of more than 50 available reports:
Delay Report
Processor Delays
CF Overview
CF Systems
Device Delays
VSAM LRU Overview
VSAM RLS Activity by Storage Class and by Data Set
RMF Monitor III provides an interactive tool for “fire fights” when someone is complaining. However, it can function as a fire preventer if the effort is made to set it up with meaningful threshold values that gauge when a system may be heading into a problem situation, rather than waiting until it gets there.
13 Who, What, How Much, & Analysis
RMF Delay Report
CPU Activity Report & Analysis
LPAR Activity Report & Analysis
CF Activity Report and Analysis
Workload Activity Report & Analysis
I/O Device Activity Report & Analysis
File I/O Activity Report & Analysis
Seven reports will be examined next.
14 M3- Which Resources Cause Delays
This is the version 1.8 summary Delay Report. Please note that this report is produced from statistical samples, so it is subject to distortions caused by not being based on enough samples.
The report first provides the task state analysis, broken down into USAGE, DELAY, IDLE, and UNKNOWN. UNKNOWN often dominates the distribution.
Next is a summary of the % of time a task was found delayed for processor, devices, storage, subsystems, operator, and enqueue. The last column lists the primary reason for delays.
15 Which System Resources Cause Delays… NOTES
Use to quickly establish which system resources are delaying the work. “% Delayed for” indicators are:
PRC = in/ready but work not being dispatched on CPU
DEV = delayed for disk or tape
STR = delayed for storage, such as COMM, LOCAL, SWAP, XMEM, or found on the OUT & READY queue
SUB = delayed by JES, HSM, XCF
OPR = delayed by operator message, mount request, or quiesce command by operator
ENQ = delayed waiting for any enqueued resource
Address space type column (CX):
A = ASCH
B = Batch
E = Enclave
O = OMVS
S = Started Task
T = TSO
? = invalid/missing data
O as the second character indicates an OMVS process for this task
Cr column indicates the CPU critical or Storage critical attribute for this address space
16 CPU, LPAR & CF Activity Reports
CPU Activity Reports
Processor Delays
What is Your LPAR’s Guaranteed Capacity?
LPAR Partition Data Report
Coupling Facility Activity (CF) Report
17 PP- CPU Activity Report - Part 1
Provides 100% accurate CPU utilization figures for all LPAR-s and each LPAR individually. Use it in conjunction with workload activity measurements to establish CPU utilization capture ratios.
Observe and consider:
ONLINE TIME – less than 100% indicates a CPU being varied on- or offline. IRD or a manual process may cause this.
LPAR BUSY % – what % of each allocated CPU this LPAR utilized. Less than 100% indicates possible capacity issues.
MVS BUSY % – LPAR’s % CPU utilization. 100% should cause performance and capacity concerns if (a) anyone complains, and (b) critical workloads + SYSTEM make up 90-95%+ of the utilization.
QUEUE LENGTHS (%) – indicates how many others you may have to wait behind for CPU access.
IN READY – address spaces ready to run but CPU not available.
OUT READY – even worse than IN READY if the OUT-s are workloads you care about. See workload activity reports to determine the victims.
18 PP- CPU Activity Report - Part 2
19 CPU Activity Report…. NOTES
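The capture ratio mentioned in the CPU Activity Report notes can be sketched as the CPU time attributed to workloads divided by the CPU time the report says was consumed. The function and parameter names below are hypothetical, and the choice of busy percentage (LPAR BUSY or MVS BUSY) and the source of the per-workload CPU seconds are assumptions that each installation has to settle for itself.

```python
from typing import Mapping


def capture_ratio(
    workload_cpu_seconds: Mapping[str, float],
    busy_pct: float,
    online_cpus: int,
    interval_seconds: float,
) -> float:
    """Share of measured CPU time that is attributed to reported workloads.

    busy_pct is the average busy % across the online CPUs for the interval
    (an assumption: taken from the CPU Activity report).
    """
    measured = busy_pct / 100.0 * online_cpus * interval_seconds
    if measured <= 0:
        raise ValueError("measured CPU time must be positive")
    captured = sum(workload_cpu_seconds.values())
    return captured / measured


if __name__ == "__main__":
    # Hypothetical 30-minute interval on 4 CPUs that were 50% busy.
    workloads = {"CICS": 1800.0, "BATCH": 1000.0, "SYSTEM": 440.0}
    ratio = capture_ratio(workloads, busy_pct=50.0, online_cpus=4, interval_seconds=1800.0)
    print(f"capture ratio: {ratio:.2f}")
```

A ratio well below 1.0 means a noticeable part of the consumed CPU time is not attributed to any workload in the figures being used.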
20 Monitor III (M3) Processor Delays - 1
Processor delays report identifies who is delayed and by ABOUT how much.
DLY % = (# of Delay Samples / # of Samples) * 100 is the % of time the task is delayed from getting CPU time
USG % = (# of Using Samples / # of Samples) * 100 is the % of time the task is receiving CPU service
Holding Job(s) – up to three tasks that contributed most to the delay
21 Monitor III Processor Delays - 1... NOTES
Note that delays are collected via statistical sampling!
The MVS reduced preemption approach is the cause of the always-present CPU delay.
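Because DLY % and USG % are simple ratios of sample counts, a few lines of code show why the report only says ABOUT how much. This is a sketch with hypothetical names. The standard error is not something RMF reports; it is a textbook binomial estimate that assumes independent samples, added here only to illustrate the sampling caveat.

```python
import math


def sample_pct(matching_samples: int, total_samples: int) -> float:
    """DLY % or USG %: matching samples as a percentage of all samples."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    return matching_samples / total_samples * 100.0


def pct_standard_error(pct: float, total_samples: int) -> float:
    """Rough binomial standard error of a sampled percentage (assumption:
    samples are independent, which real workloads only approximate)."""
    p = pct / 100.0
    return math.sqrt(p * (1.0 - p) / total_samples) * 100.0


if __name__ == "__main__":
    # Hypothetical task: 25 delay samples and 60 using samples out of 100.
    for label, count in [("DLY %", 25), ("USG %", 60)]:
        pct = sample_pct(count, 100)
        err = pct_standard_error(pct, 100)
        print(f"{label} = {pct:.1f} (+/- about {err:.1f})")
```

With only 100 samples the uncertainty is several percentage points, so small differences between jobs should not be over-interpreted.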
22 What is Your LPAR’s Guaranteed Capacity?
An LPAR’s share determines its physical CP capacity.
LPAR weights and the number of logical CPUs determine the share:
Share = LCPU / Tot-PCPU * LPAR weight / ∑ LPAR weights
Guaranteed capacity in CPs = Share * Tot-PCPU
Example: two LPARs, PRODA with weight 700 and PRODB with weight 300, each with access to all 10 physical CPs:
PRODA Share = 10/10 * 700/1000 = 0.70, so PRODA Capacity = 0.70 * 10 = 7.0 CPs
PRODB Share = 10/10 * 300/1000 = 0.30, so PRODB Capacity = 0.30 * 10 = 3.0 CPs
LPAR weights are ONLY enforced if Physical CP BUSY = 100% or if the LPAR is capped by PR/SM.
If PRODA only utilizes 2.0 CPs most of the time, PRODB could get the other 8.0 CPs if it needs them! When PRODA gets busy and uses its maximum share, PRODB will be cut back to its own 3.0 CPs.
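The share arithmetic can be checked with a short script. This is a sketch that follows the slide's formula as written; the function names are hypothetical, and expressing the result in CPs by multiplying the share by the number of physical CPs is the step that takes the example to 7.0 and 3.0.

```python
from typing import Sequence


def lpar_share(
    logical_cpus: int,
    total_physical_cpus: int,
    weight: float,
    all_weights: Sequence[float],
) -> float:
    """Share = LCPU / Tot-PCPU * LPAR weight / sum of LPAR weights."""
    total_weight = sum(all_weights)
    if total_physical_cpus <= 0 or total_weight <= 0:
        raise ValueError("physical CPU count and total weight must be positive")
    return logical_cpus / total_physical_cpus * weight / total_weight


def guaranteed_cps(
    logical_cpus: int,
    total_physical_cpus: int,
    weight: float,
    all_weights: Sequence[float],
) -> float:
    """Guaranteed capacity expressed in physical CPs."""
    share = lpar_share(logical_cpus, total_physical_cpus, weight, all_weights)
    return share * total_physical_cpus


if __name__ == "__main__":
    weights = {"PRODA": 700, "PRODB": 300}
    for name, weight in weights.items():
        cps = guaranteed_cps(10, 10, weight, list(weights.values()))
        print(f"{name}: {cps:.1f} CPs guaranteed")
```

Remember that these figures only bind when the physical CPs are 100% busy or the LPAR is capped; otherwise an LPAR can use more than its guaranteed share.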
23 PP- LPAR Partition Data Report
To minimize LPAR overhead, try to define a ratio no greater than 2 logical CPUs per physical CPU. This ratio is calculated by adding the logical CPUs defined in all LPARs and dividing this total by the number of available physical CPUs.
24 LPAR Partition Data Report… NOTES
The Partition Data Report is from the RMF post processor. It is the most useful single place to see both defined and actual LPAR capacity.
WGT – LPAR’s weight / total defined weight is the % SHARE for which this LPAR will be dispatched by PR/SM if it needs CPU service
MSU DEF and ACT – defined and actual LPAR MSUs
CAPPING DEF – partition’s capping option
CAPPING WLM% – % of time WLM capped this LPAR
LPAR MGT – LPAR management overhead
Type = AAP for zAAP processors
Type = IIP for zIIP processors
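The logical-to-physical CPU ratio guideline for the Partition Data Report is easy to check once the logical CPU counts are collected. The sketch below uses hypothetical LPAR names and counts; the limit of 2 is the guideline from the slide, not a hard rule.

```python
from typing import Mapping


def logical_to_physical_ratio(
    logical_cpus_by_lpar: Mapping[str, int], physical_cpus: int
) -> float:
    """Sum of logical CPUs defined in all LPARs divided by physical CPUs."""
    if physical_cpus <= 0:
        raise ValueError("physical_cpus must be positive")
    return sum(logical_cpus_by_lpar.values()) / physical_cpus


if __name__ == "__main__":
    # Hypothetical configuration: three LPARs sharing 10 physical CPs.
    lpars = {"PRODA": 10, "PRODB": 10, "TEST": 4}
    ratio = logical_to_physical_ratio(lpars, physical_cpus=10)
    verdict = "above" if ratio > 2.0 else "within"
    print(f"logical:physical = {ratio:.1f}, {verdict} the 2:1 guideline")
```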
25 CF Activity Reports and Analysis
Data collection is controlled by the ERBRMFxx option CFDETAIL or NOCFDETAIL.
CFDETAIL collects a lot of SMF data!
To reduce system overhead, data collection is done on only one member of the Sysplex, as decided automatically by the RMF Sysplex Data Server.
26 M3- CF Activity - 1
If PROCESSOR UTIL% is high (95%+???):
Under PR/SM, dedicate CPs or add CPs to the partition
Rebalance by moving structures to a lower utilized CF, if available
Buy more or faster CFs
27 M3- CF Activity - 2
AVG SERV is in microseconds! Do compare Async. Serv. to Disk Serv.!
CHNG% – percent of requests changed from sync to async
DEL% – percent of requests delayed by subchannel contention or dump serialization
29 RMF Workload Measurements
Basically, the BTE number is the TOR’s point of view of the response time, while EXE is the rest.
We could have drawn another box for an FOR, which would be a subset of EXE.
Transactions that run in multiple regions will have multiple EXE lines.
DB2 activity is issued from AOR-s to DB2.
Source: Chris Baker, IBM
30 PP- RMF Workload Activity - 1 CICS
REPORT BY: POLICY=HPTSPOL1 WORKLOAD=PRODWKLD SERVICE CLASS=CICSHR RESOURCE GROUP=*NONE PERIOD=1 IMPORTANCE=HIGH
(Report sample, column headings only: TRANSACTIONS with AVG, MPL, ENDED, END/SEC, #SWAPS, EXECUTD; TRANSACTION TIME HHH.MM.SS.TTT with ACTUAL, QUEUED, EXECUTION, STANDARD DEVIATION; RESPONSE TIME BREAKDOWN IN PERCENTAGE by SUB TYPE and P with TOTAL, ACTIVE, READY, IDLE, WAITING FOR LOCK, I/O, CONV, DIST, LOCAL, SYSPL, REMOT, TIMER, PROD, MISC; STATE SWITCHED TIME (%) with LOCAL, SYSPL, REMOT; rows CICS BTE and CICS EXE.)
This is a sample RMF post processor (ERBRMFPP) output with option SYSRPTS(WLMGL(SCPER)). PP = RMF Post processor Report.
This report shows two connected MRO regions, so we have one each of the BTE and EXE reported lines.
The response time breakdown into percentages is a great source for analysis of SLA (service level agreement) components. Just remember that this data is sampled at variable rates, 4 per second under transaction level management and 1 per 2.5 seconds under region level management, so the breakdown may be misleading if the sample is not statistically significant. How is RMF finding this out? These numbers are vulnerable to all kinds of things because they come from CICS Performance Block (PB) sampling at the rates given above.
If this is not enough, the CICS Monitoring Facility (CMF) gives you the response time number and its elements as measured by CICS.
The PROD column on this report is the % of time CICS “thinks” this transaction is waiting for DB2, IMS, MQSeries, or some other subsystem’s activity to complete.
Watch out for the MISC field being the largest component within RESPONSE TIME BREAKDOWN. This field accumulates the transaction’s time that CICS cannot identify.
Source: Chris Baker, IBM
31 PP- RMF Workload Activity - 2 TSO Part 1
The WORKLOAD ACTIVITY report is only useful for performance and capacity management if the service class contains homogeneous work. If the work in a service class or service class period is non-homogeneous, then report service classes should be created for the business critical work (typically at least all IMPORTANCE=1 work).
33 RMF Workload Activity – 2…. NOTES
CPU and STORAGE Service class attributes
TRANSACTIONS – number of transactions and related statistics
TRANS. TIME – various transaction time measures
DASD I/O – rate and response time components
SERVICE RATES
PAGE-IN RATES – monitor them carefully
MSO coefficient should be 0! Other values will produce unstable performance on zSeries processors!
APPL% can be greater than 100%. If a single CICS region is in the report, it can track CICS TCB saturation risk.
APPL% includes AAPCP and IIPCP time! AAPCP and IIPCP are the time that zAAP and zIIP eligible work spent on standard CPs.
The PROJECTCPU option in SYS1.PARMLIB member IEAOPTxx is needed for AAPCP and IIPCP.
APPL% calculation: IIT = I/O interrupt, HST = Hiperspace, RCT = Region Control
34 RMF Workload Activity Report Analysis
The response time distribution report is the best, and usually the lowest-overhead, source for the design of response time goals.
The workload activity response time distribution report can be produced for a variety of report classes in support of service policy development activities.
It is a quick and low overhead source of service and utilization data.
Watch out for “funny” samples in STATE SAMPLE BREAKDOWN (%) – WAITING FOR. Each state sample category’s value, except OTHR, is based on the last 14 non-zero values.
36 Device Activity Components
CONN = data transfer time
DISC = time disconnected from the channel, which consists of SEEK and SET SECTOR, latency (wait for the record to be under the head), and RPS (obsolete with ESS – Sharks)
PEND = I/O delays in the access path. May include delays caused by channel, control unit, or director port. Often caused by shared DASD!
IOSQ = wait for another task on the same system to finish using this device
What I/O response time is too high? WARNING: this is a trick question.
Analyze the response time components to decide what to do.
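Since the answer to "what response time is too high" depends on which component dominates, a small helper that breaks a device's response time into its parts can be a useful first step. This is a sketch with hypothetical names and sample values, assuming the average response time is the sum of the IOSQ, PEND, DISC and CONN times in milliseconds.

```python
from typing import Dict, Tuple

COMPONENTS = ("IOSQ", "PEND", "DISC", "CONN")


def response_time_ms(components_ms: Dict[str, float]) -> float:
    """Total response time, assumed to be IOSQ + PEND + DISC + CONN."""
    return sum(components_ms[name] for name in COMPONENTS)


def dominant_component(components_ms: Dict[str, float]) -> Tuple[str, float]:
    """Return the largest component and its % share of the response time."""
    total = response_time_ms(components_ms)
    if total <= 0:
        raise ValueError("response time must be positive")
    name = max(COMPONENTS, key=lambda n: components_ms[n])
    return name, components_ms[name] / total * 100.0


if __name__ == "__main__":
    # Hypothetical volume, all times in milliseconds.
    volume = {"IOSQ": 0.2, "PEND": 0.3, "DISC": 2.5, "CONN": 1.0}
    name, share = dominant_component(volume)
    print(f"response time {response_time_ms(volume):.1f} ms, "
          f"dominated by {name} ({share:.1f}%)")
```

A volume dominated by CONN is mostly doing useful data transfer, while one dominated by IOSQ, PEND or DISC is mostly waiting, which points to different tuning actions.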
37 I/O Device Activity (RMF PP Report)
The DASD Activity report tells us all we need to know about a single volume. Please note that DEV ACTIVITY RATE is in I/Os per second. AVG RESP TIME and all timing fields are given in milliseconds (ms). Possibly unproductive activity to watch out for:
IOSQ TIME – IOS queue time; mostly eliminated via use of WLM managed dynamic or static Parallel Access Volumes (PAV)
DPB DLY – director port delay
DB DLY – delay due to device busy
PEND TIME – total pending time the I/O was delayed
DISC TIME – total disconnect time
AVG CONN TIME – the time required for data transfer; large blocks and certain utilities can drive this to be almost all of the %DEV UTIL.
%DEV CONN – % of time the device was connected
%DEV UTIL – % of time the device was busy. If %DEV CONN is very close to %DEV UTIL, it indicates that little can be done to tune this device other than reduce the size or number of I/Os.
%DEV RESV – % of time the device was reserved. If this is greater than 10 – 20% of %DEV UTIL, its cause should be determined and eliminated if possible.
AVG NUMBER ALLOC – reveals how many files were open on the volume. Did you expect to be alone?
%MT PENDING – mount pending time should be zero.
38 M3- Device Delays
The device I/O activity delays report shows which devices delay a particular workload, and what the chief contributors to these delays are.
DLY % – delay this job experienced
USG % – using %
CON % – connect %
MAIN DELAY VOLUME(S) – % delay contributed by the top 4 volumes
The values for the “C” column can be:
B for batch
S for system tasks
T for TSO
39 Device Activity Tuning - 1
I/O priority ON (check for APAR OW47667)
CONN = data transfer time
DISC, IOSQ, PEND are I/O delays
Enable Parallel Access Volumes (PAV) to reduce / eliminate IOSQ
Manage static PAVs to minimize IOSQ
Manage the number of dynamic PAVs via policy to minimize IOSQ
ESS (Shark) multiple allegiance support reduces contention reported as PEND time.
Track cache performance and manage it as needed
40 Device Activity Tuning - 2
DISC > 2-5 msec with cache may indicate problem(s):
Not enough Non-Volatile Storage (NVS), or NVS gets filled.
Poor cache hit ratio on IBM ESS.
High physical disk utilization. May need to move data to balance the activity between available resources.
Very high disk to cache transfer activity rate.
DISC > 13 msec may indicate RPS misses due to path contention. This should not occur on IBM ESS.
If %DEV UTIL > 35%, work to reduce the activity rate on the device:
Balance activity better across available resources
Isolate or do not cache files and volumes that are BAD cache candidates
Tune based on analysis of caching activity
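The rules of thumb on this slide can be turned into a simple screening pass over device data. The sketch below uses hypothetical function and parameter names; the 5 ms limit for cached volumes is an assumption picked from the top of the slide's 2-5 msec range and should be adjusted to your own hardware.

```python
from typing import List


def device_flags(
    disc_ms: float,
    dev_util_pct: float,
    cached: bool = True,
    disc_cache_limit_ms: float = 5.0,  # assumption: upper end of "2-5 msec"
) -> List[str]:
    """Return the slide's warning conditions that apply to one device."""
    flags = []
    if cached and disc_ms > disc_cache_limit_ms:
        flags.append("DISC high with cache: check NVS, cache hit ratio, "
                     "physical disk utilization, disk-to-cache transfer rate")
    if disc_ms > 13.0:
        flags.append("DISC > 13 ms: possible RPS misses due to path contention")
    if dev_util_pct > 35.0:
        flags.append("%DEV UTIL > 35%: reduce or rebalance activity on device")
    return flags


if __name__ == "__main__":
    # Hypothetical devices: (name, DISC time in ms, %DEV UTIL).
    for name, disc, util in [("VOL001", 1.0, 10.0), ("VOL002", 15.0, 40.0)]:
        problems = device_flags(disc, util)
        print(name, "OK" if not problems else "; ".join(problems))
```

A flag is only a prompt to look at the device and cache reports; it does not by itself say which of the listed causes applies.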
41 M3- File I/O Tuning – VSAM LRU - 1
Buffer goal limit defaults to 100 MB; can be 1.5 GB max; see IGDSMxx in your PARMLIB for details
“Accel %” – when LRU aging algorithms were accelerated
“Reclaim %” – when aging algorithms were bypassed to reclaim buffers
“Read BMF%” – data found in local buffers
“Read CF%” – data found in Coupling Facility (CF) cache
“Read DASD%” – data read from DASD
Monitor the average CPU time used by BMF LRU
42 M3- File I/O Tuning – VSAM RLS - 1
VSAM RLS activity by data set. Also available by Storage Class.
43 File I/O Tuning – VSAM RLS…NOTES
“LRU Status” – status of local buffers under Buffer Manager Facility (BMF) control:
GOOD = BMF at or below goal
ACCELERATED = buffer aging algorithms accelerated because BMF is over goal
RECLAIMED = buffer aging bypassed because BMF is over goal
“BMF Valid %” – percent of BMF reads that were valid
NOTE: BMF read hits is the sum of valid and invalid hits. Buffers can be invalid because (A) data was altered, or (B) the CF lost track of the buffer status.
BMF READ HIT% = BMF READ% / BMF VALID% * 100
BMF INVALID READ HIT% = BMF READ HIT% - BMF READ%
Please see the notes provided on this slide.
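The two BMF formulas in these notes can be applied directly to the values read from the report. The sketch below uses hypothetical function names and sample percentages, and simply follows the formulas as stated.

```python
def bmf_read_hit_pct(bmf_read_pct: float, bmf_valid_pct: float) -> float:
    """BMF READ HIT% = BMF READ% / BMF VALID% * 100."""
    if bmf_valid_pct <= 0:
        raise ValueError("bmf_valid_pct must be positive")
    return bmf_read_pct / bmf_valid_pct * 100.0


def bmf_invalid_read_hit_pct(bmf_read_pct: float, bmf_valid_pct: float) -> float:
    """BMF INVALID READ HIT% = BMF READ HIT% - BMF READ%."""
    return bmf_read_hit_pct(bmf_read_pct, bmf_valid_pct) - bmf_read_pct


if __name__ == "__main__":
    # Hypothetical report values: Read BMF% = 76, BMF Valid % = 95.
    read_pct, valid_pct = 76.0, 95.0
    hit = bmf_read_hit_pct(read_pct, valid_pct)
    invalid = bmf_invalid_read_hit_pct(read_pct, valid_pct)
    print(f"BMF READ HIT% = {hit:.1f}, BMF INVALID READ HIT% = {invalid:.1f}")
```

A large invalid read hit percentage means many local buffer hits could not be used because the data had been altered or the CF had lost track of the buffer status.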
44 Summary
We examined just 7 main types of reports out of the 90+ available from RMF in real time or via the post-processor. They are:
RMF Delay Report
CPU Activity Report
LPAR Activity Report
CF Activity Report
Workload Activity Report
I/O Device Activity Report
VSAM File I/O Activity Report
With practice, you should be able to find the “gold” and solve performance “mysteries” by looking at just 1 – 3 RMF reports.
Just a summary of the session as stated on the slide.
45 Need/Want to Know More…and Start at
Documentation:
SC RMF Report Analysis
SC RMF Performance Management Guide
SA z/OS MVS Planning: Workload Management
RMF Newsletters
IBM and SHARE presentations
Computer Measurement Group
Large Systems Performance Reference: