Presentation is loading. Please wait.

Presentation is loading. Please wait.

Resetting Your Pain Threshold Andy Ward – Senior Principal Consultant December 2015.

Similar presentations


Presentation on theme: "Resetting Your Pain Threshold Andy Ward – Senior Principal Consultant December 2015."— Presentation transcript:

1 Resetting Your Pain Threshold Andy Ward – Senior Principal Consultant December 2015

2 2 © 2015 CA. ALL RIGHTS RESERVED.  Monitoring Overhead  WLM and Its Role  System Address Spaces – keeping CPU down  DDF  Log Activity  Locking and Latching  zIIPs  IO Suspension  Buffer Pools  What we’re not going to cover – How to report on all this information in individual monitors

3 3 © 2015 CA. ALL RIGHTS RESERVED. Conventions Used In This Presentation  There are a number of vendors of DB2 performance monitors – CA, IBM, BMC, ASG etc. – To keep things simple whenever IFCID metrics are specified I have used the IBM IFCID field name  These can be found in hlq.DSNIVPD(DSNWMSGS)  If vendors use other field-names (as CA do) then it is likely they will provide the IBM equivalent (if available) too (as CA do)  To avoid confusion, especially in calculations, I have listed the IFCID field name in underlined italic text in all places  A number of metrics in this presentation are relevant to monitoring over a set time interval – to be decided by yourself – In some cases your statistics interval or perhaps an aggregation of these  I’ve done my very best to ensure calculations are correct

4 4 © 2015 CA. ALL RIGHTS RESERVED. The Missing Bullet Holes  Abraham Wald – A statistician in World War 2 – Tasked with finding out where best to put the armour on planes  These are the numbers he had to work with for returning planes  His conclusion? – Put the armour on the engines – Sometimes it’s where you DON’T look rather than where you do look Sometimes the answer is where you are not looking… Section of PlaneBullet Holes Per Square Foot Engine1.11 Fuselage1.73 Fuel System1.55 Rest of Plane1.8

5 Monitoring Overhead

6 6 © 2015 CA. ALL RIGHTS RESERVED. Monitoring Overhead  All DB2 performance monitors generate overhead  This can be realised in – The monitors STCs – DB2 STCs – Your applications  Small changes to DB2 monitor parameters can have big effects – Certain IFCIDs result in big overheads, there is a reason you can’t see everything out of the box.  Ensure temporary monitoring requirements remain temporary – And qualify them as much as possible

7 7 © 2015 CA. ALL RIGHTS RESERVED. Monitoring Your Monitoring The damage the wrong IFCID can do…

8 8 © 2015 CA. ALL RIGHTS RESERVED. Using Your Monitor  If your monitor allows, consider setting up your own DB2 Health report – Online (preferably) and/or batch (good for regular execution to view trends)  Consider outputting to a comma separated file (if available) for easy export into Excel or offload data to DB2 tables  Incorporate the metrics discussed in this PPT – There is also a very good Redbook ‘Subsystem and Transaction Monitoring and Tuning with DB2 11 for z/OS’  Despite the name - it’s not only for DB2 11 shops  When issues occur this information may well point you to the area of concern quickly

9 Workload Manager

10 10 © 2015 CA. ALL RIGHTS RESERVED. Workload Manager  Don’t need to be experts but may need enough knowledge to make a case  Sometimes we need to monitor outside DB2 to ensure things are running well inside – Look for WLM resource group capping delays (your z/OS monitor should show you these)  Capping of workloads entering DB2 should be avoided – Potential for longer holding of locks if class is suspended  Understand Service Classification Goals – Average response time goal – Percentile response time goal – Velocity

11 11 © 2015 CA. ALL RIGHTS RESERVED. Workload Manager…cont'd  The ‘Performance Index’ – A PI of 1 means the goal is being exactly achieved – <1 performing better than goal requirements – >1 performing worse than goal requirements – Good measure but relies on service classification goals being correctly set – In a Sysplex there is a local PI to indicate if a particular LPAR is suffering – DISPLAY THREAD(*) TYPE(ACTIVE) DETAIL (screenshot follows)  DB2 10 and greater contains WLM details  PI is multiplied by 100 in this output  Periods – Each Service Class can have multiple periods  Work moves down the periods once it has consumed n service units – WLM importance level is likely to drop through these  1 to 5 (discretionary workload is 6)

12 12 © 2015 CA. ALL RIGHTS RESERVED. DISPLAY THREAD(*) TYPE(ACTIVE) DETAIL DSNV401I !S10A DISPLAY THREAD REPORT FOLLOWS - DSNV402I !S10A ACTIVE THREADS - NAME ST A REQ ID AUTHID PLAN ASID TOKEN DB2CALL T 11 PTX#S170 ACFSTCID PSAA1700 00D7 154 V437-WORKSTATION=DB2CALL, USERID=ACFSTCID, APPLICATION NAME=PTX#S170 V482-WLM-INFO=CLSONL:1:5:95 DB2CALL T * 7 WARAN01 WARAN01 DB2C1800 002B 155 V437-WORKSTATION=DB2CALL, USERID=WARAN01, APPLICATION NAME=WARAN01 V441-ACCOUNTING=1 V482-WLM-INFO=*:1077952576:*:1077952576 Service Class:Period:Importance:Performance Index

13 13 © 2015 CA. ALL RIGHTS RESERVED. WLM Best Practices  Ensure DB2 Address spaces are always running at a higher importance than the work entering them  Should only run 25-35 concurrent service class periods on an LPAR  Understand your threads that are regularly exceeding their PI – DISPLAY THREAD(*) TYPE(ACTIVE) DETAIL – Consider moving them to another class or perhaps altering the existing class  Ensure there is an entry for SUBSYS=DDF in your WLM policy linked to, at the least, a catch-all Service Class  Assess DB2 Service Goals at least annually – If a PI is regulary seen below 1 then perhaps response time goals need to be tightened

14 System Address Spaces

15 15 © 2015 CA. ALL RIGHTS RESERVED. System Address Space CPU Time  Sometimes zIIP processing is ignored because it is not perceived to have a cost – Important to take into account across releases – CPU may have been moved here – But it is still processing – Especially important if you are comparing across releases  Still should be accounted for tuning purposes too  Understanding CPU time – Class 1 (in-application)  From getting a thread to end of execution – Class 2 (in-DB2)  CPU related to SQL execution – Class 3  Wait times (around 20 buckets)  The dreaded ‘not-accounted’

16 16 © 2015 CA. ALL RIGHTS RESERVED. System Address Space CPU Time…cont’d  Some rules of thumb – TCB times should be low compared to MSTR and DBM1 SRB times – IRLM SRB time should be relatively low compared to MSTR and DBM1 SRB time  MSTR – Reduce TCB  Bigger Logs  Not always achievable under certain processing but… – Try and keep 6 hours of log data across your active logs – Reduce SRB  Investigate REALSTORAGE_MANAGEMENT zPARM – AUTO by default – Consider ON for constrained development environments

17 17 © 2015 CA. ALL RIGHTS RESERVED. System Address Space CPU Time…cont’d  DBM1 – Reduce TCB  Monitor dataset open and close processing  Should be less than 0.1 – 1 opens per second during normal processing  IFCID 0002 QBSTDSO shows cumulative dataset opens since start of stats interval  IRLM – Reduce SRB  P-lock negotiation for pagesets constantly moving in and out of GBP dependency can add to SRB time – Look at pseudo close controlling zPARMS PCLOSEN & PCLOSET (consider increasing these if GBP dependency thrashing is recognised)  Especially if you are seeing synchronous IO increase – Also look at CLOSE settings  If the object Is known to be GBP dependent CLOSE NO may be best

18 EDM Pool

19 19 © 2015 CA. ALL RIGHTS RESERVED. EDM Pool  Aim for very high CT and PT hit rate ratios 95-99% – Cursor Table  ((No. of CT Requests – No. of CT misses) / No. of CT Requests) * 100  IFCID 0002 – ((QISECTG-QISECTL)/QISECTG) * 100 – Package Table  ((No. of PT Requests – No. of PT misses) / No. of PT Requests) * 100  IFCID 0002  ((QISEKTG-QISEKTL)/QISEKTL) * 100  DBD hit rate ratio should be 95-99% – ((No. of DBD Requests – No. of DBD misses) / No. of DBD Requests) * 100 – IFCID 0002  ((QISEDBDG-QISEDBDL)/QISEDBDG) *100

20 20 © 2015 CA. ALL RIGHTS RESERVED. EDM Pool…cont’d  Dynamic cache statement hit ratio should be high – Global Statement Cache  Held in EDM Pool – active when zPARM CACHEDYN=YES – A copy of the PREPAREd statement is held here  Hit ratio calculation – IFCID 0002 – (QXSTNFND + QISEDSI) / QXSTNFND – Local Statement Cache  Held in thread storage – Active when KEEPDYNAMIC bind option used – zPARM MAXKEEPD limits the number of statements stored  Hit ratio calculation – IFCID 0002 – (QXSTNPRP + QXSTIPRP) / QXSTNPRP – Allows for full PREPARE avoidance

21 21 © 2015 CA. ALL RIGHTS RESERVED. EDM Pool…cont’d  Concentrate Statements With Literals (CSWL) – Introduced in DB2 10 (NFM) – In this day and age it is difficult to have control over all SQL (especially dynamic) – DB2 can now re-use dynamic statements, even without parameter markers, if the only difference is in the literal values – To make use of this feature code  Specify CONCENTRATE STATEMENTS WITH LITERALS on the ATTRIBUTES clause of the PREPARE statement

22 DDF

23 23 © 2015 CA. ALL RIGHTS RESERVED. DDF  PKGREL_COMMIT = YES (DB2 11) – A PACKAGE bound RELEASE(DEALLOCATE) will be implicitly released at COMMIT/ROLLBACK if BIND/REBIND or OLS is queued that effects it  MAXDBAT – -DISPLAY DDF DETAIL  Look at QUEDBAT – Cumulative counter of queued DBATs since DB2 was recycled  Look at CONQUED – Currently queued DBATs (also shown in IFCID 0001 QDSTQDBT)  If either of these are high consider increasing zParm MAXDBAT – Carefully consider the storage implications of doing this though  WLMHEALTH – Shows how WLM is treating the address space

24 24 © 2015 CA. ALL RIGHTS RESERVED. -DISPLAY DDF DETAIL -D1B1 DIS DDF DET DSNL080I -D1B1 DSNLTDDF DISPLAY DDF REPORT FOLLOWS: DSNL081I STATUS=STARTD DSNL082I LOCATION LUNAME GENERICLU DSNL083I DB1B USIBMSC.SCPD1B1 -NONE DSNL084I TCPPORT=39400 SECPORT=39401 RESPORT=39402 IPNAME=-NONE DSNL085I IPADDR=::9.12.6.70 DSNL086I SQL DOMAIN=wtsc63.itso.ibm.com DSNL086I RESYNC DOMAIN=wtsc63.itso.ibm.com DSNL089I MEMBER IPADDR=::9.12.6.70 DSNL090I DT=I CONDBAT= 10000 MDBAT= 200 DSNL091I MCONQN = 200 MCONQW = 120 DSNL092I ADBAT = 0 QUEDBAT = 0 INADBAT = 0 CONQUED = 0 DSNL093I DSCDBAT= 0 INACONN = 0 DSNL094I WLMHEALTH=100 CLSDCONQN= …

25 Log Activity

26 26 © 2015 CA. ALL RIGHTS RESERVED. Log Activity  In DB2 11 Log Buffers use 1MB page frames if they are available – Consider using these (LFAREA setting) – All buffers reside in HCSA  Log Buffers can be a bottleneck - writing – Remember if they are unavailable all INSERT, UPDATE and DELETE processing is suspended – In DB210 and above OUTBUFF specification is going to be allocated and fixed at DB2 startup – correct sizing is important – Calculate Log data volume at peak times  IFCID 0001 – QJSTBFFL (Log CI’s created) * 4K/monitoring_interval or stats_interval – If QJSTWTB (Log buffers unavailable) is > 0 at non-peak times increase OUTBUFF

27 27 © 2015 CA. ALL RIGHTS RESERVED. Log Activity  Log writes and log prefetch are zIIP eligible in DB2 11 – zIIP queuing may potentially cause issues  A little more on this later  I’m sure many of you have already implemented these, but to reduce log activity ensure – Compression is in use, especially for large row lengths  And compression dictionaries are current – RRF is implemented for objects that contain VARCHARs  Especially those where VARCHARs are not at the end of the table definition

28 Locking and Latching

29 29 © 2015 CA. ALL RIGHTS RESERVED. Locking and Latching  Making the most of lock avoidance – CLSN, PUNC bits, Datasharing page latches – Are unlock requests/commits high for an interval (>5)?  If so lock avoidance may not be working well  Lack of lock avoidance can also result in IRLM latch contention  Lots of pointer records? - Reorg – DB211 - PCTFREE_UPD (-1 learns the best value)  Pseudo deleted index rows – Count in SYSIBM.SYSINDEXSPACESTATS – REORGPSUEDODELETES  Non-data sharing this should be no more than 10% of total entries  Data sharing no more than 5% of total entries – A S-Lock will be required if CLSN processing fails  To clean-up - Reorg or if using DB211 an automated process is available

30 30 © 2015 CA. ALL RIGHTS RESERVED. Locking and Latching…cont'd  Long URs can result in lock avoidance not being utilised – Consider using the following ZPARMs  URCHKTH – number of checkpoint cycles before COMMIT – DB2 10> default is 5  URLGWTH – number of log writes between COMMITs – DB2 10> default is 10K  IRLM latch contention is considered high if IRLM suspensions are approx. 5% of the total IRLM lock requests – IRLM latch contention rate  IFCID 0002  QTXASLAT / (QTXALOCK + QTXAUNLK + QTXAQRY + QTXACHNG) * 100

31 31 © 2015 CA. ALL RIGHTS RESERVED. Data Sharing Contentions  Global contention rate (%) over time period – Next slide has calculation  False contention (IFCID 0002 – QTGSFLMG) – This is a per member rate  Assuming OA12164 and PK85159 are on (old APARs) – Should be less than 3% of total XES IRLM requests  Calculation shown on the next slide  XES contention (IFCID 0002 - QTGSSGLO) – XES only has 2 locking modes S and X – IRLM supports many more – the lock may be compatible – V8 introduced ‘locking protocol 2’ to improve XES/IRLM lock mapping  Real Contention (IFCID 0002 – QTGSIGLO)

32 32 © 2015 CA. ALL RIGHTS RESERVED. Calculations  Global contention rate – (QTGSIGLO + QTGSSGLO + QTGSFLMG) / Total XES IRLM requests  Total XES IRLM requests – (QTGSLSLM + QTGSCSLM + QTGSUSLM) + (QTGSKIDS + QTGSFLSE) + (QTGSIGLO + QTGSSGLO + QTGSFLMG)

33 33 © 2015 CA. ALL RIGHTS RESERVED. Tuning Data Sharing Locks  TRACKMOD NO – Limit spacemap updates – Downside – incremental copies will run longer due to having to scan the space to find changed pages  Reduce index leaf page splits – IFCID 359  INSERT intensive workload? – Consider MEMBER CLUSTER  INSERTS will be added where there is space rather than in CLUSTERing order – Likely to result in reducing P-locks – Although an increase in spacemap pages  Each spacemap only covers 199 pages when using MEMBER CLUSTER  MAXROWS 1 LOCKSIZE PAGE if locking is intensive – Consider for small objects to mimic row level locking

34 zIIPs

35 35 © 2015 CA. ALL RIGHTS RESERVED. zIIP  As already stated, including zIIP processing is important when assessing overall CPU resource consumption  SYS1.PARMLIB(IEAOPTxx) parameters – IIPHONORPRIORITY (YES/NO)  This setting specifies whether GCP can assist when zIIPs are flooded – YES – workload can be passed back to GCP (preferred setting) – NO – workload will queue until zIIP capacity is available  DB2 11 system workload will not be offloaded (i.e. prefetch) – ZIIPAWMT  The amount of time workload will queue before being returned to the GCP – Minimum setting 1600 (1.6 milliseconds) – Default  HIPERDISPATCH = YES – 3200 (3.2 milliseconds)

36 36 © 2015 CA. ALL RIGHTS RESERVED. zIIP  WLM Activity Report (RMF) – APPL%IIPCP shows how much work was offloaded to zIIP and returned to the GCP  Rule of thumb to minimise dispatching delays – Total zIIP utilization thresholds  1 zIIP < 30%, 2 zIIPs < 60%, 3 zIIPS < 70% – Don’t drive your zIIPs in the same way you drive GCP  In most shops the number of GCP engines far outweigh the number of zIIP engines  Queuing theory dictates that at relatively low utilization values response time increases (see graph on next slide)

37 37 © 2015 CA. ALL RIGHTS RESERVED. Markov’s Equation

38 IO Suspensions

39 39 © 2015 CA. ALL RIGHTS RESERVED. IO Suspensions  There is a need to monitor IO (read and write) elapsed time – Looking for suspensions – Generally there is more read IO – a good place to start  General high processor contention may manifest itself as an IO problem – Once IO completes the application is not immediately dispatched  What is a good response time? – As always it depends! – A baseline for your environment is required  Busier IO subsystems are of course going to have higher response times – Data residing on SSDs should certainly see a better response time than traditional DASD

40 40 © 2015 CA. ALL RIGHTS RESERVED. IO Suspensions…cont’d  Synchronous IO – Most monitors will show an average for synchronous IO  Perhaps broken down by read and write – Understand the issue, is it a large number of small suspensions or a small number of large suspensions? - Use IFCID 0003 fields  QWACARNE – shows the number of synchronous IO wait events  QWACAWTI – shows the accumulated wait time – Number of suspensions high  Poor access path, poor page residency (covered in next section), disorganised objects, index splits (IFCID 359) – Time per suspension large  Disk inefficiency – Over utilization, ensure disk features are being used such as PAV on ESS – Log synchronous read IO waits on a separate counter to assist with problem determination – IFCID 0003 QWACARLG  Lots of ROLLBACKS?

41 Buffer Pools

42 42 © 2015 CA. ALL RIGHTS RESERVED. Buffer Pools - Local  Look at your hit ratios  Data Management threshold reached – This should be kept at 0 – Threads processing multiple rows in same page will result in 1 GETPAGE per row if this is hit  Page residency time – System Residency Time – look for a goal of > 60 seconds  Calculation on next slide  Reducing IO – Add 50MB to a BP in increments, once IO reduction drops off you've likely hit the top of the curve  Check for pre-fetched pages getting stolen before they are used (IFCID 0002 QBSTSIO)

43 43 © 2015 CA. ALL RIGHTS RESERVED. Page Residency Time Calculations  All IFCID fields from IFCID 0002  Total Pages Read – QBSTRIO + QBSTSIO + QBSTSPP + QBSTLPP + QBSTDPP  System Residency Time (seconds) – VPSIZE / Total Pages Read per second  Random Page Residency Time (seconds) – MAX (System Residency Time, (buffer pool size * (1-VPSEQT/100) / sync pages (QBSTSIO) read per second)  Sequential Page Residency Time (seconds) – MIN (System Residency Time, (buffer pool size * (VPSEQT/100) / sync pages (QBSTSIO) read per second)

44 44 © 2015 CA. ALL RIGHTS RESERVED. Buffer Pool – Page Fixing  The pool must be TOTALLY backed by real storage – IFCID 0002 QBSTRPI + QBSTWPI to see if paging is becoming an issue for page fixed pools – If page fixing still a requirement keep critical objects in pool, remove others and downsize  Base which pools to fix on IO intensity if you are storage constrained – IO intensity - (BP pages read + BP pages written) / number of buffers – Also ensure these pools benefit from large frames if available  Keep an eye on Large Frame high water marks – D VIRTSTOR,LFAREA (APAR 0A31116)  AUTOSIZE – New DB211 parms to assist here VPSIZEMIN/VPSIZEMAX

45 45 © 2015 CA. ALL RIGHTS RESERVED. Group Buffer Pools  Coupling Facility settings – ALLOWAUTOALT(YES)  Not designed for sudden floods but gradual workload increases  No rebuild of the structure required – Set MINSIZE equal to INITSIZE (assuming your initial sizing is good ) – FULLTHRESHOLD – 80-90% – SIZE – 1.3 to 2 times INITSIZE  Synchronous Read Cross Invalidation ratio should be < 10% – QBGLXR / (QBGLXR + QBGLXD)  Failures due to lack of storage (IFCID 0002 QBGLWF) – Ideally these should be 0 but certainly aim for <1%  Group Buffer Pool Write Around (IFCID 0002 QBGLWA) – If this is > 0 it is indicative of the GBP becoming flooded

46 In Summary

47 47 © 2015 CA. ALL RIGHTS RESERVED.  We’ve covered a lot of ground  Hopefully you have found a few things to take back to your environments in this PPT – All the fields/calculations have been included for you to use at your shops – Disclaimer – I have done my best to ensure the calculations are correct  Remember many of the metrics (individual and calculated) should be viewed over an interval or time period – What that is is down to you – perhaps a peak hour?  Make use of your monitors to become more proactive – But keep an eye on monitoring overhead, there is a happy medium between collecting too much and too little information  Remember the missing bullet holes!

48 Senior Principal Consultant andrew.ward@ca.com Andy Ward


Download ppt "Resetting Your Pain Threshold Andy Ward – Senior Principal Consultant December 2015."

Similar presentations


Ads by Google