Presentation on theme: "Wait-Time Based Oracle Performance Management"— Presentation transcript:
1 Wait-Time Based Oracle Performance Management. Prepared for NCOUG. Presented by Matt Larson, CTO, Confio Software.
2 Who am I? Founder and CTO of a database performance software company. Former DBA consultant specializing in Oracle performance tuning. Co-author of three Oracle books (Oracle Development Unleashed; Oracle Unleashed, 2nd Edition; Oracle8 Server Unleashed). Co-author of two other database-related books.
3 Agenda: Foundation; Case Study One: PL/SQL Issue; Case Study Two: Full Table Scans; Case Study Three: Inefficient Indexes; Case Study Four: Locking Problems; Q&A.
4 Working the Wrong Problems. After you spend an agonizing week tuning Oracle buffers to minimize I/O operations, management typically rewards you with: A. An all-expense-paid vacation. B. A free lunch. C. A stale donut. D. Reward? Nobody even noticed!
5 Tuning Success (or lack thereof). Your role in the rollout of a new customer-facing application results in: A. Keys to drive the CEO’s Porsche. B. Keys to use the executive restroom. C. A mop to use in the executive restroom. D. Your office has been moved to the restroom.
6 Conventional Tools Measure System Health… Assumption: if I make the database healthy, users benefit. Symptoms: the DBA finds a “big” problem and fixes it, but users report no impact; there is lots of data to review and many things to fix, with no clear sense of which to do first; an unclear view of performance leads to finger-pointing (“It’s your code!” versus “It’s your database!”) between IT staff and the developer or vendor. Key message: other tools show lots of statistics, but they are not relevant to the end-user experience.
7 …RMM Focuses on User Wait-Time. Identify each bottleneck affecting the user. Rank bottlenecks by user impact. Implement proven suggestions. Set correct expectations on the impact of a fix. Show proof that the fix helped users. (Figure label: End User.) Key message: Confio addresses three requirements that are needed to find the real performance problems; if your system does not meet these, you cannot find the real problem.
8 RMM: Confio’s Underlying Methodology. Resource Mapping Methodology: industry best practice for optimizing performance tuning for maximum business impact. Three key principles of RMM: 1. SQL View: all statistics at the SQL statement level. 2. Time View: measure time, not the number of times a resource is utilized. 3. Full View: separately measure every resource to isolate the source of problems.
9 Illustrating example: SQL View Principle. Example: a ‘CEO’ measuring ‘employee’ output. Averaging over the entire company gives no useful data; each job must be measured separately. The DBA must manage the database similarly: measure and identify bottlenecks for each SQL statement independently.
10 Illustrating example: Time View Principle. Example: a ‘CEO’ counting ‘tasks’ vs. ‘time to complete’. Counting system statistics is not meaningful; you must measure time to complete. System stats (buffer size, hit ratios, I/O counts) do not identify where database customers are waiting. Identify and optimize wait time for each SQL statement as the best indicator of performance.
11 Illustrating example: Full View Principle. Example: a ‘CEO’ measuring results with a blind spot hiding key processes. Without direct visibility, valuable information is lost; you must have visibility into every process step. Distinctly identify and measure each Oracle resource for each distinct SQL statement.
12 RMM-compliant Performance Tool Types. Two primary types of tools. Session-specific tools: tools that focus on one session at a time, often by tracing the process. Examples: tkprof (Oracle), OraSRP Profiler (open source). Continuous DB-wide monitoring tools: tools that focus on all sessions by sampling Oracle. Example: Confio Ignite. Both types of tool have a place in the organization.
13 Tracing. Tracing with wait events complies with RMM. It should be used cautiously in non-batch environments because session statistics can be skewed. Example: 80 out of 100 sessions have no locking contention issues, while 20 out of 100 have spent 99% of their time waiting for locked rows. If you trace one of the “80” sessions, it appears as if you have no locking issues (and you spend time trying to tune other items that may not be important). If you trace one of the “20” sessions, it appears as if you could fix the locking problems and reduce your wait time by 99%.
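The skew in the 80/20 example can be worked through numerically. The sketch below is only an illustration: it assumes (this is not stated on the slide) that every session is active for the same one hour, so the database-wide share of lock wait is a simple average over sessions.

```python
# Hypothetical workload based on the slide: 100 sessions.
# Assumption (not in the slide): each session is active for one hour.
SECONDS_PER_SESSION = 3600

# 80 sessions never wait on locks; 20 spend 99% of their time on lock waits.
lock_wait_fraction = [0.0] * 80 + [0.99] * 20

lock_wait = sum(f * SECONDS_PER_SESSION for f in lock_wait_fraction)
total_time = SECONDS_PER_SESSION * len(lock_wait_fraction)

# Share of all session time spent waiting on locks across the whole database.
db_wide_share = lock_wait / total_time

print(f"Tracing one of the 80: {lock_wait_fraction[0]:.0%} lock wait")
print(f"Tracing one of the 20: {lock_wait_fraction[-1]:.0%} lock wait")
print(f"Database-wide:         {db_wide_share:.1%} lock wait")
```

Under these assumptions a single trace reports either 0% or 99%, while the database-wide figure is roughly one fifth of total session time. Neither single-session view matches it.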
14 Tracing (cont). Very precise statistics; may be the only way to get certain statistics. Bind variable information is available. Different types of tracing are available, providing detailed analysis even deeper than wait events. Ideal if a known problem is going to occur in the future (and in a known session). Difficult to see trends over time. Primary audience is the technical user.
15 Continuous DB Wide Monitoring Tools. 24/7 sampling provides a real-time and historical perspective. Allows the DBA to go back in time and retrieve information even if the problem was not expected. Does not provide the level of detail that tracing does. Most of these tools have trend reports that allow communication with others outside of the group: What is starting to perform poorly? What progress have we made while tuning?
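A minimal sketch of how a sampling monitor could turn samples into a ranking by wait time, in the spirit of the three RMM principles (per SQL, measured in time, per resource). The sample data, the one-second interval, and the function name `rank_by_wait_time` are all hypothetical; this is not a description of how Confio Ignite or any other product works.

```python
from collections import Counter

# Hypothetical samples: one tuple for each session found waiting at each
# sample time, as a once-per-second poller might record them.
# Fields: (second, session_id, sql_name, wait_event)
samples = [
    (0, 11, "LoginLookup", "db file scattered read"),
    (0, 12, "LoginLookup", "db file scattered read"),
    (0, 13, "UpdateInventory", "buffer busy waits"),
    (1, 11, "LoginLookup", "db file scattered read"),
    (1, 12, "LoginLookup", "db file scattered read"),
    (2, 11, "LoginLookup", "db file scattered read"),
]
SAMPLE_INTERVAL_SECONDS = 1  # assumed polling interval


def rank_by_wait_time(samples, interval=SAMPLE_INTERVAL_SECONDS):
    """Return [((sql, event), estimated_wait_seconds), ...], worst first."""
    counts = Counter((sql, event) for _, _, sql, event in samples)
    # Each sample stands for roughly one interval of waiting.
    return [(key, n * interval) for key, n in counts.most_common()]


for (sql, event), seconds in rank_by_wait_time(samples):
    print(f"{sql:16} {event:24} {seconds:4d} s")
```

Because every waiting session contributes its own time, the total can exceed wall-clock time: here three seconds of sampling yield six seconds of accumulated wait. The same effect is how many hours of wait can accumulate within a single hour.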
17 Problem Observed. Critical situation: application performance unsatisfactory. Response time between 240 and 900 seconds. Most times users shut down the application. Very high network traffic (3x to 4x normal), indicating time-outs and user refreshes. “CritSit” declared: major effort to resolve the problem.
18 Wait Events During Problem: library cache lock; library cache pin. Remember the “Willie Sutton” Rule.
20 What does RMM tell us? Which SQL: CERN_PROFILE Truncate. Which resource: library cache pin, library cache lock. How much time: up to 16 hours of wait time per hour.
21 Results. Found an invalid trigger. An insert statement was trying to fire the trigger, and the truncate was locked behind it. Response time improved from 60,000 seconds (worst case) to 0 seconds. Configured an alert to notify the DBA when the problem starts next time, so the problem should not again go on for 22 hours without anyone knowing.
23 Problem Observed. Problem: login was taking 4 minutes for each user every day when they started their day. High wait accumulation from 6:30 – 8:30 am. 600 users x 4 minutes = 40 hours of lost productivity every day. Applied the RMM approach to problem identification: identify wait time, offending SQL, offending resource.
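The lost-productivity figure follows directly from the two numbers on the slide; a short check (variable names are illustrative only):

```python
users = 600
login_minutes_per_user = 4

lost_minutes_per_day = users * login_minutes_per_user
lost_hours_per_day = lost_minutes_per_day / 60

print(f"{lost_minutes_per_day} minutes = {lost_hours_per_day:.0f} hours per day")
```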
26 What does RMM tell us? Which SQL: LoginLookup UpdateInventory. Which resource: Scattered Read, Buffer Busy Waits. How much time: Hour Every Day.
27 Hypotheses: Oracle Interpretations. Two alternative paths for optimization. Eliminate the full table scan: there isn’t a need to read the whole table, so we need to find the right shortcut. Improve response time: we need to read most or all of the table anyway, so let’s just figure out how to do it faster. Key questions: Is the full table scan necessary? What causes a full table scan for this SQL statement?
28 I. Unnecessary Full Table Scan? Solutions: add or modify index(es) on the table; update table and/or index statistics if the proper index is not being used; add a hint to use an existing index; optimize the application.
29 Full Table Scan is Needed. Two alternative paths for optimization. I. Eliminate the full table scan: there isn’t a need to read the whole table, so we need to find the right shortcut. II. Improve response time: we need to read most or all of the table anyway, so let’s just figure out how to do it faster.
30 II. Improve Response Time for Db File Scattered Reads. Solutions: use parallel reads; set database parameters; improve I/O speed; optimize the application; larger database caches (64-bit).
31 1. Use Parallel Reads = Faster FTS. Can be set at the table level (use with caution): alter table customer parallel 4; Normally used by hinting in the SQL statement: select /*+ FULL(customer) PARALLEL(customer, 4) */ customer_name from customer; A delicate tradeoff: you sacrifice the performance of others for the running query. Not necessarily efficient, just faster: parallel reads may actually do twice the work of a sequential query but have four workers, thus finishing in half the time while using 8x resource.
32 2. Set database parameters. DB_FILE_MULTIBLOCK_READ_COUNT specifies the maximum number of blocks read in one I/O operation during a sequential scan. Impacts the optimizer. Reduces the number of I/Os required. For OLTP, typically between 4 and 16. The optimizer will be more likely to choose a full table scan if it is set too high. Ensure that the database read requests are synced up with the O/S. This gets tricky if different block sizes are used in different tablespaces.
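A rough illustration of why a larger DB_FILE_MULTIBLOCK_READ_COUNT reduces the number of I/Os for a full scan, and why the resulting request size matters when matching the O/S. The table size and the 8 KB block size are hypothetical, and the call count is a best case: real multiblock reads are often shorter than the maximum (for example at extent boundaries or when some blocks are already cached).

```python
import math


def min_read_calls(table_blocks, multiblock_read_count):
    """Best-case number of read calls if every read fetches the maximum."""
    return math.ceil(table_blocks / multiblock_read_count)


def max_read_size_bytes(multiblock_read_count, block_size_bytes):
    """Largest single read request the database would issue."""
    return multiblock_read_count * block_size_bytes


TABLE_BLOCKS = 10_000     # hypothetical table size in blocks
BLOCK_SIZE_BYTES = 8192   # hypothetical 8 KB database block size

for mbrc in (4, 8, 16):
    calls = min_read_calls(TABLE_BLOCKS, mbrc)
    size_kb = max_read_size_bytes(mbrc, BLOCK_SIZE_BYTES) // 1024
    print(f"MBRC={mbrc:2d}: at least {calls} read calls, up to {size_kb} KB each")
```

The request size grows with the block size, which is one reason mixed block sizes across tablespaces complicate the choice of a single value.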
33 3. Improve I/O speed. Get your SA involved. Investigate the I/O sub-system with iostat, vmstat, sar, … for potential problems. Monitor during high activity. Investigate contention at the disk/controller level. Learn which disks share common resources. Use more disks to spread I/O and reduce hot spots. Investigate caching on the disk sub-system and current memory usage.
34 4. Optimizing the Application. Review the application: do you have access to the code for changes? Understand the code around the problem SQL. Techniques to optimize a statement: Reduce the number of calls for a SQL statement: cache the data in the application; create a summary table (perhaps via a materialized view); eliminate the need for the data. Retrieve less data with each statement: add fields to the WHERE clause. Combine SQLs for fewer calls: combine several SQLs with different bind variables into one large statement that retrieves all the data in one shot.
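The “combine SQLs for fewer calls” technique can be sketched as follows. This uses Python’s built-in sqlite3 module purely as a stand-in so the example runs anywhere; it is not Oracle, and the `customer` table and its rows are hypothetical. The point is only the shape of the change: one statement with several bind variables instead of one call per value.

```python
import sqlite3

# In-memory stand-in database with a hypothetical customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(147, "Acme"), (148, "Globex"), (149, "Initech")],
)

wanted_ids = [147, 149]

# One call per bind value: N separate executions of the same statement.
one_by_one = [
    conn.execute("SELECT name FROM customer WHERE id = ?", (cid,)).fetchone()[0]
    for cid in wanted_ids
]

# One combined statement: a single execution that returns all rows at once.
placeholders = ", ".join("?" for _ in wanted_ids)
combined = [
    row[0]
    for row in conn.execute(
        f"SELECT name FROM customer WHERE id IN ({placeholders}) ORDER BY id",
        wanted_ids,
    )
]

print(one_by_one, combined)
```

Both approaches return the same rows; the combined form trades N executions for one, which matters most when each call carries a network round trip.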
35 5. Larger Database Caches (64-bit). A larger cache means fewer disk reads. May need a large increase to have a significant impact. (Chart axes: performance gain vs. % of database in memory.)
36 Results. Added indexes to underlying tables. Added materialized view. Full table scan fixed.
38 Problem Observed. Data warehouse loads were taking too long. Noticed high wait times on the db file sequential read wait event. DBAs were confused: why are data loads “reading” data? Applied the RMM approach to problem identification: identify wait time, offending SQL, offending resource.
39 Investigation. SQL sequential read time. Sequential read time by object for SQL.
40 What does RMM tell us? Which SQL: 3 insert statements. Which resource: DB File Sequential Read. How much time: 5 hour % of wait time.
41 Investigating db file sequential reads. Often considered a “good” read. DB file sequential reads normally occur during index lookups. Often a single-block read, although it may retrieve more than one block. Sequential reads may also be seen for reads from datafile headers, when rebuilding the control file, and when dumping datafile headers.
42 Hypotheses: Oracle Interpretations of Sequential Reads. Causes of excessive wait times: reading too many index leaf blocks; not finding the block in the buffer cache, which forces a disk read; slow disk reads; contention for certain blocks; high read time on INSERT statements.
43 I. Reading too many index and table blocks (cont). Rebuild fragmented indexes: alter index rebuild [online]; Compress indexes: alter index rebuild compress; (uses more CPU). Multi-column indexes: avoid the table lookup, but will create a larger index. Pre-sort table data.
44 II. Not finding block in buffer cache forces disk read. Db file sequential reads occur because the block is not in the buffer cache. How do we make sure more blocks are already in the cache? Solutions: increase the size of the buffer cache(s); put the object in a cache where it is less likely to get flushed out.
45 III. Slow disk reads. With databases, it often comes down to this: the disk just needs to be faster. Put certain objects on the fastest disk. O/S file caching using special software that makes normal files perform like raw files. Increase storage system caching, such as an EMC cache.
46 Results. Inserts were updating indexes that had low-cardinality leading columns. Reordered the columns in the index and got a 50% performance improvement. The log file sync wait event was then the largest wait event: data was being committed too often. Tuning is an iterative process.
48 Problem Observed. Problem: high wait on CPPFPROD. Accumulated wait of 9.5 hours (34,000 sec) during the a.m. hour. End users were complaining loudly. Applied the RMM approach to problem identification: identify wait time, offending SQL, offending resource.
49 Investigation: Drill down to Top SQL & Identify likely source of Problem
50 enqueue Causes. TX enqueue: locks held for the life of a transaction until a COMMIT or ROLLBACK. TM enqueue: locks being held when foreign key constraints are not indexed properly. ST enqueue: locks held during dynamic space allocations. HW enqueue: serialization for the allocation of space beyond the high water mark.
51 enqueue TX. Generally due to application or table setup issues. Acquired when a transaction initiates an UPDATE. The row is locked by the session. Others may select from the row (read consistency). Others wanting to UPDATE the same row must wait. The lock is held until a COMMIT or ROLLBACK is issued.
52 enqueue TX. Waits caused by “normal” active transactions. Just issue a COMMIT or ROLLBACK. Determine what the true unit of work is.
53 enqueue TX. Waits due to insufficient 'ITL' slots in a block. The ITL (interested transaction list) is an area at the top of each data block where Oracle keeps track of which rows are locked by which transaction. Every transaction wanting to change a block requires a slot in the ITL list of the block. The number of ITL slots is controlled by INITRANS (initial number of slots at block creation) and MAXTRANS (total allowable slots over time). The ITL list will expand up to MAXTRANS only if space is available.
54 enqueue TX. Waits due to rows being covered by the same BITMAP index fragment. Bitmap indexes allow one index entry to cover many rows within a table. If two sessions try to insert or update the same key value, the second session has to wait.
55 enqueue ST. Caused by space management operations. Can happen with small extent sizes and with allocation of temporary segments for sorting. May get an ORA error indicating a timeout. Unnecessary sorting: disk sorting requires space management and thus contention on the ST enqueue. Eliminate as much disk sorting as possible. Evaluate the SORT_AREA_SIZE or PGA_AGGREGATE_TARGET parameters.
56 enqueue ST. General TEMP tablespace advice. SMON cleanup of temporary space: set PCTINCREASE equal to zero to stop cleanup/coalesce; set temporary tablespaces as TEMPORARY. SMON in a parallel environment: SMON cleanup operations are magnified; hanging or slow system; side effect of many processes on the ST enqueue.
57 enqueue HW. Acquired to move the HW (high water) mark. A high volume of inserts across concurrent sessions will cause waits on this enqueue. Recreate or modify the object with larger extents. Pre-allocate extents: ALTER TABLE … ALLOCATE EXTENT. V$LOCK.ID1 is the tablespace number. V$LOCK.ID2 is the relative dba of the segment header of the object for which space is being allocated.
58 What is blocking session waiting on? Idle session; DB file scattered reads; another session.
59 Idle Session Scenario. Sally: updates customer 147, then goes to lunch. Jim: locked trying to update customer 147. Jim will needlessly wait a long time. The DBA can kill Sally’s session IF they can tell that the session is idle.
60 Missing Index Scenario. Sally: updates customer 147, then selects from the order table with a missing index. Jim: locked trying to update customer 147. The DBA can tell that Jim is really waiting because of a missing index on the order table, even though Jim isn’t using the order table.
61 Idle Session Scenario. Sally: updates customer 147, then selects from the order table with a missing index. Jim: updated warehouse 22, then locked trying to update customer 147. Bob: locked trying to update warehouse 22. A chain of locks occurs even though neither locked user is accessing the table with the missing index.
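The three scenarios share one question: what is the session at the head of the chain actually doing? A minimal sketch of that lookup follows, using a hypothetical snapshot of “who is blocked by whom”. The dictionaries and the function name `root_blocker` are illustrative; in a real system this information would come from the database’s own session and lock views.

```python
# Hypothetical snapshot modeled on the Sally / Jim / Bob scenario.
blocked_by = {"Bob": "Jim", "Jim": "Sally"}
current_activity = {
    "Sally": "db file scattered read (order table, missing index)",
    "Jim": "enqueue TX (customer 147)",
    "Bob": "enqueue TX (warehouse 22)",
}


def root_blocker(session, blocked_by):
    """Follow the chain of blockers until reaching a session nobody blocks."""
    seen = {session}
    while session in blocked_by:
        session = blocked_by[session]
        if session in seen:  # a cycle would be a deadlock, not a chain
            raise ValueError("cycle in lock chain")
        seen.add(session)
    return session


for waiter in ("Jim", "Bob"):
    root = root_blocker(waiter, blocked_by)
    print(f"{waiter} is ultimately waiting on {root}: {current_activity[root]}")
```

In this snapshot both Jim and Bob resolve to Sally, whose own wait (a scan caused by the missing index) is the thing worth fixing.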
62 Wait Events for Development. Tuning SQL for optimal performance. Debug/test/integrate/pilot process. Understand impact on existing database. Understand Oracle impact on application performance. View into production for better development prioritization and feedback. Reduce finger-pointing.
63 Conclusion. Conventional tuning focuses on “system health” and leads to finger-pointing and confusion. Wait event tuning implemented according to RMM is the new way to tune. Two RMM-compliant tool types: tracing tools; continuous DB-wide monitoring tools. Questions & Answers.
64 Who is Confio? Oracle product is “Ignite for Oracle”, fast install, free trial at Organizations who trust Confio to monitor their most critical applications include:
65 Thank you for coming. Matt Larson, Founder/Chief Technology Officer. Contact information: ext. 110. Company website.