September 2004, Page 1
Mainframe Global and Workload Levels Statistical Exception Detection System, Based on MASF
Igor Trubin, Ph.D. and Linwood Merritt
Capital One Services, Inc.
igor.trubin@capitalone.com

September 2004, Page 2
Introduction: Environment
Capital One
– 6th largest card issuer in the United States
– Added to the S&P 500 in 1998
– Fortune 500 company starting in 2000
– Managed loans at $71.8 billion
– Accounts at 46.7 million
– CIO 100 Award “Master of the Customer Connection”
– Information Week “Innovation 100” Award Winner
– ComputerWorld “Top 100 places to work in IT”

September 2004, Page 3
Statistical Analysis of Mainframe Performance Data
SEDS is a Statistical Exception Detection System based on the Multivariate Adaptive Statistical Filtering (MASF) technique. SEDS automatically scans large volumes of performance data and identifies measurements that differ significantly from their expected values.
MASF is an extension of Statistical Process Control (or Quality Control), which was developed by Walter Shewhart of Bell Telephone Laboratories in the 1920s. The MASF procedure was designed by BGS Systems, Inc. and presented at CMG in 1995. SEDS was developed by this author and presented as the best paper at CMG 2002.

September 2004, Page 4
Review of the Existing Tools
– SAS/QC (Quality Control)
– JMP from SAS
– BEZsystems for Oracle and Teradata
– Concord eHealth: DFN (Deviation From Normal)
– The Patrol Perform and Predict tool from BMC Software
The common output is control charts for monitoring variation in a process under statistical control.

September 2004, Page 5
SEDS Structure
– Exception detectors for the most important metrics;
– SEDS database with a history of exceptions;
– Statistical process control daily profile chart generator;
– Exception server name list generator;
– Leader/Outsider servers/workload detector and detector of defective (runaway) processes; and
– Leaders/Outsiders bar chart generator.

September 2004, Page 6
CPU Utilization Control Chart for Web Report
The full “7 days x 24 hours” adaptive filtering policy is applied: for each hour of each weekday, the average, upper, and lower statistical limits of a particular metric are calculated from the past six months of data.
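The "7 days x 24 hours" policy can be illustrated with a minimal Python sketch. It is not the authors' implementation (the slides indicate the data is processed in SAS). The function names `masf_limits` and `find_exceptions`, the use of the population standard deviation, and the multiplier `k` (default 3, in the Shewhart tradition) are assumptions made for illustration; the slides do not state which multiplier SEDS uses.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def masf_limits(samples, k=3.0):
    """Group hourly samples by (weekday, hour) and compute limits per group.

    samples: iterable of (datetime, value) pairs, e.g. six months of history.
    k: assumed multiplier of the standard deviation (hypothetical default).
    Returns {(weekday, hour): (lower, average, upper)}.
    """
    groups = defaultdict(list)
    for ts, value in samples:
        groups[(ts.weekday(), ts.hour)].append(value)

    limits = {}
    for key, values in groups.items():
        avg = mean(values)
        sd = pstdev(values)  # population standard deviation of the group
        limits[key] = (avg - k * sd, avg, avg + k * sd)
    return limits


def find_exceptions(day_samples, limits):
    """Return (timestamp, value, side) for samples outside their limits.

    side is "upper" or "lower". Samples whose (weekday, hour) group has
    no history are skipped.
    """
    found = []
    for ts, value in day_samples:
        key = (ts.weekday(), ts.hour)
        if key not in limits:
            continue
        lower, _, upper = limits[key]
        if value > upper:
            found.append((ts, value, "upper"))
        elif value < lower:
            found.append((ts, value, "lower"))
    return found
```

Keeping a separate set of limits for each of the 168 weekday/hour cells is what makes the filter adaptive: a load that is normal at 10:00 on a Monday can still be flagged at 03:00 on a Sunday.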

September 2004, Page 7
SEDS against Unisys and Tandem Platforms Performance Data
SEDS works with hourly or daily performance data. The schemas of the “day” tables in ITRM for the Unisys and Tandem platforms are shown here. Good candidates for use in SEDS are marked in red on the slide.

Variable   Data Type  Description
CPUTOT     PERCENT1   CPU Utilization
DATETIME              Datetime
DURATION   TIME       Duration of interval
HOUR       GAUGE      Hour the event occurred
LSTPDATE   DATETIME   Last process date
MACHINE    STRING     Machine Name
MAVAILK    FLOAT      Memory Available Kword
MEMNUSE    PERCENT1   Memory in Use
MONTH      FORMULA
MOVRLY     PERCENT1   Memory Overlayable
READYQ     FLOAT      CPU Ready Queue

Variable   Data Type  Description
CPUNUM     STRING     CPU Number
CPUQUE     INT        Run-queue
CPUTOT     INT        CPU Utilization
DATETIME              Datetime of sample/event
DISK       STRING     Disk Name
DISKIO     RATE       Disk I/Os Per Second
DURATION   TIME       Duration of interval
DUTIL      PERCENT1   Disk Utilization
EXTENT     INT        Maximum Free Extent (Megabytes)
HOUR       GAUGE      Hour Summary Variable
LSTPDATE   DATETIME   Last process date
MACHINE    STRING     Machine Name
MEMUTIL    INT        Memory Utilization
MONTH      FORMULA
SPFREE     INT        Free Space (Megabytes)
SPUSED     INT        Space Used (Megabytes)
SWAPS      RATE       Memory Page (4K) Swaps per Second

September 2004, Page 8
Examples of Captured Exceptions for Unisys and Tandem
The Unisys server had unusually low utilization, which might indicate disk or database performance problems.
The Tandem server, in contrast, had two unusual spikes of CPU utilization that crossed the upper limit.

September 2004, Page 9
Global Performance Data for MVS Platform
A set of nightly batch jobs
- dumps remaining active accounting data,
- consolidates the data,
- processes the data in SAS, and
- updates the ITRM PDB.
The schemas of the “day” tables in ITRM for MVS platforms are shown in the table. Good candidates for use in SEDS are marked in red on the slide.

Variable   Data Type  Description
CLUSTER    STRING     Cluster Name (Simplex name)
CPUMIPS    INT        CPU Cycles Used (MIPS)
CPUMIPX    INT        CPU Processor Maximum MIPS
CPUTOMX    FORMULA    CPU Utilization for Interval Max
CPUTOT     FORMULA    CPU Utilization for Interval Average
DATETIME              Datetime of sample/event
DISKIO     FLOAT      Disk I/O (EXCPs)
DURATION   TIME       Duration of interval
HOUR       INT        Hour Summary Variable
LPAR       STRING     Logical Partition
LSTPDATE   DATETIME   Last process date
MACHINE    STRING     Server Name (Footprint)
MEMNUSE    PERCENT1   Total Memory Utilization
MONTH      FORMULA
READYQ     FLOAT      CPU Queue length
SHIFT      STRING     Shift of work week

September 2004, Page 10
Examples of Captured Exceptions for One of the Logical Partitions (LPAR)
Because this chart shows the utilization of one LPAR within a shared system rather than the utilization of the entire system, 100% is not a true threshold. SEDS instead provides a statistical threshold, which is more accurate and dynamic.

September 2004, Page 11
BMC Visualizer MASF vs. SEDS
BMC Visualizer can be used to find other exceptions based on other filtering policies. For that, the BMC collector must be installed on the server, and BMC Visualizer must be used manually to capture any MASF exceptions.
BMC Visualizer example: the System Hierarchy (spectrum) and control charts.
SEDS is preferable as an automated MASF chart generator. In addition, SEDS can automatically notify a performance analyst when a statistical exception occurs.

September 2004, Page 12
Application Level SEDS for MVS Platform
One problem is that LPAR-level data alone cannot show which particular workloads are responsible for an exception. However, the data collection process provides application-level data across all LPARs.
In a stacked workload data chart, it is difficult to find the application that is responsible for spikes in overall CPU usage. SEDS shows that Appl1 was responsible for the global maxima in the overall MIPS chart.

September 2004, Page 13
Other Reasons to Generate a Workload Control Chart
1. To capture unusual behavior of a relatively small application that was not big enough to create a global exception.
2. To prove the stable behavior of any essential or critical application.

September 2004, Page 14
Service Class/Period Type of Metrics under SEDS
- RESP: hourly SUM of the average response per transaction (it shows values consistently larger than average)
- TRANS: hourly SUM of ended transaction count
- CPUsec: hourly SUM of elapsed tasks duration (not always reported correctly for long-running servers)
ElapsedSec = (number of tasks) * 3600 seconds.

September 2004, Page 15
Performance Status Automatic Recognition, WEB Report and E-mail Notification
– Green in the WEB table indicates no exceptions.
– Magenta indicates that exceptions occurred only below the lower limit.
– Yellow means an exception occurred on a particular server or LPAR.
(NUP - NLOW), shown under the link to a MASF chart, is the severity or type of the exceptions, where NUP is the number of upper limit exceptions and NLOW is the number of lower limit exceptions during the previous day.
The table also shows the number of applications or Service Classes with exceptions.
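The color rules on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the SEDS code: the function names are hypothetical, and "yellow" is interpreted here as "at least one upper limit exception", since magenta is reserved for days with lower limit exceptions only.

```python
def count_exceptions(values, lowers, uppers):
    """Count upper and lower limit exceptions for one day of samples.

    Returns (nup, nlow), matching the NUP and NLOW counters on the slide.
    """
    nup = sum(1 for v, up in zip(values, uppers) if v > up)
    nlow = sum(1 for v, lo in zip(values, lowers) if v < lo)
    return nup, nlow


def performance_status(nup, nlow):
    """Map exception counts to a WEB table color (assumed interpretation)."""
    if nup < 0 or nlow < 0:
        raise ValueError("exception counts cannot be negative")
    if nup == 0 and nlow == 0:
        return "green"      # no exceptions
    if nup == 0:
        return "magenta"    # only the lower limit was crossed
    return "yellow"         # at least one upper limit exception
```

A report generator could call `count_exceptions` once per server or LPAR for the previous day and use the resulting color for the table cell that links to the control chart.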

September 2004, Page 16
Links to the Workload Control Charts

September 2004, Page 17
Exception Database and “Extra Volume” Metric
The SEDS database keeps a history of exceptions and has the structure shown on the slide.
ExtraVolume is a numeric estimate of the exception magnitude; for CPU utilization it is called ExtraTime. It is calculated as the area between the limit curve and the actual data curve over the periods when exceptions occurred. For CPU metrics, its physical meaning is the CPU time (or MIPS) that the server used beyond the statistical limit.
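The ExtraVolume definition on this slide (the area between the limit curve and the actual data curve during exception periods) could be approximated for hourly data as shown below. This is a sketch under assumptions: the function name `extra_volume` is hypothetical, the area is approximated by one rectangle per interval, and intervals below the lower limit are counted as negative so that the result can be negative, as the Leaders/Outsiders slide implies.

```python
def extra_volume(actual, lower, upper, interval_hours=1.0):
    """Approximate the signed area between the data and the crossed limit.

    actual, lower, upper: equal-length sequences with one value per interval.
    Intervals inside the limits contribute nothing. For CPU utilization in
    percent, the result is in percent-hours.
    """
    total = 0.0
    for value, lo, up in zip(actual, lower, upper):
        if value > up:
            total += (value - up) * interval_hours   # above the upper limit
        elif value < lo:
            total += (value - lo) * interval_hours   # below the lower limit
    return total
```

For example, with limits of 20 and 80 in every hour, the hourly values 50, 90, 95, and 10 give 10 + 15 - 10 = 15 percent-hours.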

September 2004, Page 18
TOP LPAR Leaders/Outsiders Charts
– The system automatically calculates ExtraTime for the last day and records it in the SEDS database.
– This data is used to publish Leaders/Outsiders bar charts for the last day, last week, and last month.
If a server showed a positive ExtraVolume for the previous day, more capacity was used on the server than in the past. If the server showed a negative ExtraVolume, less capacity was used than usual (not necessarily a good thing).
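A ranking of this kind can be sketched in Python. The slides do not define the terms precisely, so this sketch assumes that "leaders" are the servers with the largest positive ExtraVolume and "outsiders" are those with the most negative ExtraVolume; the function name, the `top_n` parameter, and the LPAR names in the usage note are hypothetical.

```python
def leaders_outsiders(extra_by_server, top_n=5):
    """Rank servers by ExtraVolume for one reporting period.

    extra_by_server: {server_name: ExtraVolume summed over the period}.
    Returns (leaders, outsiders) as lists of (server, value) pairs:
    leaders sorted from the largest positive value down, outsiders
    sorted from the most negative value up. Servers with zero
    ExtraVolume appear in neither list.
    """
    ranked = sorted(extra_by_server.items(), key=lambda kv: kv[1], reverse=True)
    leaders = [(s, v) for s, v in ranked if v > 0][:top_n]
    outsiders = [(s, v) for s, v in reversed(ranked) if v < 0][:top_n]
    return leaders, outsiders
```

Calling it with `{"LPAR1": 12.0, "LPAR2": -3.5, "LPAR3": 0.0}` would place LPAR1 among the leaders and LPAR2 among the outsiders, and either list can then feed a bar chart for the day, week, or month.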

September 2004, Page 19
SUMMARY
Statistical techniques can be used to automatically detect and report exceptions in resource utilization and service levels. The authors’ site previously used MASF techniques to track global and application-level CPU, disk, and memory exceptions for a large number of UNIX and WINTEL servers. The workload-level analysis enabled the authors’ site to expand the scope of this process to encompass large mainframe-class servers.
Although the analysis of global exceptions at the LPAR level has limited value for a system that shares workloads across logical systems, a workload-oriented system allows quick detection of exceptions and immediate drill-down for the Capacity Planner and Performance Analyst.
The authors recommend that readers evaluate and understand any built-in statistical processes within their product set and consider developing ways to notify the appropriate analysts when exceptions occur.

September 2004, Page 20
References
Trubin, Igor, Ph.D. and Mclaughlin, Kevin, "Exception Detection System, Based on the Statistical Process Control Concept," Proceedings of the Computer Measurement Group, 2001.
Trubin, Igor, Ph.D., "Global and Application level Exception Detection System, Based on the MASF Technique," Proceedings of the Computer Measurement Group, 2002.
Thanks!
Igor Trubin
IT Capacity Planning, Capital One Services, Inc.
igor.trubin@capitalone.com
