Presentation is loading. Please wait.

Presentation is loading. Please wait.

Wait-Time Based Oracle Performance Management Prepared for NCOUG Presented by Matt Larson CTO, Confio Software.

Similar presentations


Presentation on theme: "Wait-Time Based Oracle Performance Management Prepared for NCOUG Presented by Matt Larson CTO, Confio Software."— Presentation transcript:

1 Wait-Time Based Oracle Performance Management Prepared for NCOUG Presented by Matt Larson CTO, Confio Software

2 2 Who am I?  Founder and CTO of database performance software company  Former DBA consultant specializing in Oracle performance tuning  Co-author of three Oracle books (Oracle Development Unleashed, Oracle Unleashed 2 nd Edition, Oracle8 Server Unleashed)  Co-author of two other database related books

3 3 Agenda  Foundation  Case Study One: PL/SQL Issue  Case Study Two: Full Table Scans  Case Study Three: Inefficient Indexes  Case Study Four: Locking Problems  Q&A

4 4 Working the Wrong Problems  After spending an agonizing week tuning Oracle buffers to minimize I/O operations, management typically rewards you with: A. An all expense paid vacation B. A free lunch C. A stale donut D. Reward? Nobody even noticed!

5 5 Tuning Success (or lack thereof)  Your role in the rollout of a new customer facing application results in: A. Keys to drive the CEO’s Porsche B. Keys to use the executive restroom C. A mop to use in the executive restroom D. Your office has been moved to the restroom

6 6 Conventional Tools Measure System Health…  Assumption: If I make the database healthy, users benefit  Symptoms DBA finds “big” problem and fixes it, users report no impact Lots of data to review and things to fix, not sure which to do first Unclear view of performance leads to Finger-pointing Developer or vendor It’s your Code! It’s your Database! IT staff

7 7 …RMM Focuses on User Wait-Time 1. Identify each bottleneck affecting the user 2. Rank bottlenecks by user impact 3. Implement proven suggestions 4. Set correct expectations on impact of fix 5. Show proof the fix helped users End User

8 8 RMM: Confio’s Underlying Methodology  Resource Mapping Methodology: Industry best-practice optimizing performance tuning for maximum business impact  Three Key Principles of RMM 1. SQL View: All statistics at SQL statement level 2. Time View: Measure Time, not number of times a resource is utilized 3. Full View: Separately measure every resource to isolate source of problems

9 9 Illustrating example: SQL View Principle  Example: ‘CEO’ measuring ‘employee’ output  Averaging over entire company gives no useful data  Must measure each job separately  DBA must manage database similarly  Measure and identify bottlenecks for each SQL independently

10 10 Illustrating example: Time View Principle  Example: ‘CEO’ counting ‘tasks’ vs. ‘time to complete’  Counting system statistics not meaningful  Must measure Time to complete  System stats (buffer size, hit ratios, I/O counts) do not identify where database customers are waiting  Identify and optimize Wait Time for each SQL as best indicator of performance

11 11 Illustrating example: Full View Principle  Example: ‘CEO’ measuring results with blind spot hiding key processes  Without direct visibility, valuable info is lost  Must have visibility to every process step  Distinctly identify and measure each Oracle resource for each distinct SQL

12 12 RMM-compliant Performance Tool Types Two Primary Types of Tools  Session Specific Tools Tools that focus on one session at a time often by tracing the process Examples: tkprof (Oracle), OraSRP Profiler (open source)  Continuous DB Wide Monitoring Tools Tools that focus on all sessions by sampling Oracle Example: Confio Ignite  Both tools have a place in the organization

13 13 Tracing  Tracing with wait events complies with RMM  Should be used cautiously in non-batch environments due to session statistics skew 80 out of 100 sessions have no locking contention issues 20 out of 100 have spent 99% of time waiting for locked rows If you trace one of the “80” sessions, it appears as if you have no locking issues (and spend time trying to tune other items that may not be important) If you trace one of the “20” sessions, it appears as if you could fix the locking problems and reduce your wait time by 99%

14 14 Tracing (cont)  Very precise statistics, may be only way to get certain statistics  Bind variable information is available  Different types of tracing available providing detail analysis even deeper than wait events  Ideal if a known problem is going to occur in the future (and known session)  Difficult to see trends over time  Primary audience is technical user

15 15 Continuous DB Wide Monitoring Tools  24/7 sampling provides real-time and historical perspective  Allows DBA to go back in time and retrieve information even if problem was not expected  Not the level of detail provided by tracing  Most of these tools have trend reports that allow communication with others outside of the group What is starting to perform poorly? What progress have we made while tuning?

16 Case Study One PL/SQL Issue

17 17 Problem Observed  Critical situation: application performance unsatisfactory Response time between 240 and 900 seconds Most times users shutdown application Very high network traffic (3x—4x normal), indicating time-outs and user refreshes “CritSit” declared: major effort to resolve problem

18 18 Wait Events During Problem library cache lock library cache pin

19 19 Investigation

20 20 What does RMM tell us?  Which SQL: CERN_PROFILE Truncate  Which Resource: library cache pin library cache lock  How much time: up to 16 Hours of wait time per hour

21 21 Results  Found an invalid trigger Insert statement was trying to fire trigger Truncate was locked behind it  Response time improvement from 60,000 seconds (worst case) to 0 seconds  Configured alert to notify DBA when the problem starts next time  Problem should not occur for 22 hours without anyone knowing

22 Case Study Two DB File Scattered Reads

23 23 Problem Observed  Problem: Login taking 4 minutes for each user everyday they started their day High wait accumulation from 6:30 – 8:30 am 600 Users X 4 Minutes = 40 Hours Every Day 40 Hours lost productivity every day  Applied RMM approach to problem identification Identify Wait Time, offending SQL, offending Resource

24 24 Wait Events During Problem

25 25 Investigation

26 26 What does RMM tell us?  Which SQL: LoginLookup UpdateInventory  Which Resource: Scattered Read Buffer Busy Waits  How much time: 40+ Hour Every Day

27 27 Hypotheses: Oracle Interpretations Two Alternative paths for optimization: I.Eliminate Full Table Scan There isn’t a need to read the whole table, so we need to find the right shortcut II.Improve response time We need to read most or all of the table anyway, so let’s just figure out how to do it faster  Key Questions: 1.Is full table scan necessary? 2.What causes a full table scan for this SQL Statement?

28 28 I. Unnecessary Full Table Scan? Solutions: 1. Add / Modify index(es) on the table 2. Update table and/or index statistics if proper index not being used 3. Add hint to use existing index 4. Optimize the application

29 29 Full Table Scan is Needed Two alternative paths for optimization: I. Eliminate Full Table Scan There isn’t a need to read the whole table, so we need to find the right shortcut II. Improve response time We need to read most or all of the table anyway, so let’s just figure out how to do it faster

30 30 Solutions: 1. Use Parallel Reads 2. Set Database Parameters 3. Improve I/O Speed 4. Optimize the application 5. Larger Database Caches (64-bit) II. Improve Response Time for Db File Scattered Reads

31 31 1. Use Parallel Reads = Faster FTS  Parallel Reads Can be set at the table level (use with caution) Alter table customer parallel degree 4; Normally used by hinting in the SQL Statement select /*+ FULL(customer) PARALLEL(customer, 4) */ customer_name from customer;  A delicate tradeoff sacrifice the performance of others for the running query.  Not necessarily efficient, just faster Parallel Reads may actually do twice the work of a sequential query but have four workers, thus finishing in half the time while using 8x resource

32 32 2. Set database parameters  DB_FILE_MULTIBLOCK_READ_COUNT specifies the maximum number of blocks read in one I/O operation during a sequential scan Impacts the optimizer Reduces number of I/Os required For OLTP, typically between 4 to 16 Optimizer will more likely to FTS if set too high  Ensure that the database read requests are synced up with the O/S.  This gets tricky if different block sizes are used in different tablespaces

33 33 3. Improve I/O speed  Get your SA involved  Investigate I/O sub-system Iostat, vmstat, sar, … for potential problems Monitor during high activity  Investigate contention at the disk/controller level. Learn which disks share common resources Use more disks to spread I/O and reduce hot spots  Investigate caching on disk sub-system and current memory usage

34 34 4. Optimizing the Application  Review application – do you have access to code for changes?  Understand the code around the problem SQL  Techniques to Optimize a statement: Reduce the number of calls for a SQL –Caching the data in the application –Creating a summary table (perhaps via a materialized view) –Eliminating the need for the data Retrieve Less Data with each statement –Add fields to the WHERE clause Combine SQLs for fewer calls –Combine several SQLs with different bind variables into one large statement that retrieves all the data in one shot

35 35 5. Larger Database Caches (64-bit)  Larger cache means fewer disk reads  May need large increase to have significant impact Performance Gain % of database in memory

36 36 Results  Added indexes to underlying tables  Added Materialized View Full Table Scan Fixed

37 Case Study Three DB File Sequential Reads

38 38 Problem Observed  Data Warehouse loads were taking too long  Noticed high wait times on db file sequential read wait event  DBAs were confused – why are data loads “reading” data  Applied RMM approach to problem identification Identify Wait Time, offending SQL, offending Resource

39 39 Investigation SQL Sequential read time Sequential read time by object for SQL

40 40 What does RMM tell us?  Which SQL: 3 Insert Statements  Which Resource: DB File Sequential Read  How much time: 5 hour+ 90% of wait time

41 41 Investigating db file sequential reads  Often considered a “good” read  DB file sequential reads normally occur during index lookups  Often a single-block read although it may retrieve more than one block.  Sequential Read may also be seen for reads from: datafile headers rebuilding the control file dumping datafile headers

42 42 Hypotheses: Oracle Interpretations of Sequential Reads Causes of excessive wait times: I. Reading too many index leaf blocks II. Not finding block in buffer cache forces disk read III. Slow disk reads IV. Contention for certain blocks V. High Read time on INSERT statements

43 43 I. Reading too many index and table blocks (cont) 1. Rebuild Fragmented Indexes alter index rebuild [online]; 2. Compress Indexes alter index rebuild compress; Uses more CPU 3. Multi-column indexes Avoid the table lookup Will create a larger index 4. Pre-sort Table data

44 44 II. Not finding block in buffer cache forces disk read  Db File sequential reads occur because the block is not in the buffer cache.  How do we make sure more blocks are already in the cache?  Solutions 1.Increase the size of the buffer cache(s) 2.Put the object in a cache where it is less likely to get flushed out

45 45 III. Slow disk reads  With databases, it often comes down to this – the disk just needs to be faster  Put certain objects on the fastest disk  O/S file caching using special software that makes normal files perform like raw files  Increase Storage System Caching – such as an EMC cache

46 46 Results  Inserts were updating indexes that had low cardinality leading columns  Reordered columns in the index and got a 50% performance improvement  Log file sync wait event was then the largest wait event  Data was being committed too often  Tuning is an iterative process

47 Case Study Four Enqueue

48 48 Problem Observed  Problem: High Wait on CPPFPROD Accumulated wait 9.5 hours (34,000 sec) during am hour End users were complaining loudly  Applied RMM approach to problem identification: Identify Wait Time, offending SQL, offending Resource

49 49 Investigation: Drill down to Top SQL & Identify likely source of Problem

50 50 enqueue TX enqueue  Locks held for the life of a transaction until a COMMIT or ROLLBACK. TM enqueue  Locks being held when foreign key constraints are not indexed properly. ST enqueue  Locks held during dynamic space allocations. HW enqueue  Serialization for the allocation of space beyond the high water mark Causes

51 51  Generally due to application or table setup issues  Is acquired when a transaction initiates an UPDATE Row is locked by the session Others may select from the row (read consistency) Others wanting to UPDATE same row must wait  Lock is held until a COMMIT or ROLLBACK is issued TX enqueue

52 52 Waits caused by “normal” active transactions  Just issue a COMMIT or ROLLBACK  Determine what the true unit of work is TX enqueue

53 53 Waits due to Insufficient 'ITL' slots in a Block  The ITL (interested transaction list) is an area at the top of each data block where Oracle keeps track of which rows are locked by which transaction  Every transaction wanting to change a block requires a slot in the ITL list of the block  The number of ITL slots is controlled by INITRANS, initial number of slots at block creation MAXTRANS, total allowable slots over time ITL list will expand to allow MAXTRANS only if space is available TX enqueue

54 54 Waits due to rows being covered by the same BITMAP index fragment  Bitmap indexes allow one index entry to cover many rows within a table  If two sessions try to insert or update the same key value the second session has to wait TX enqueue

55 55  Caused by space management operations Can happen with small extent sizes, allocation of temporary segments for sorting May get an ORA indicating a timeout  Unnecessary Sorting Disk sorting requires space management and thus contention on the ST enqueue Eliminate as much disk sorting as possible Evaluate SORT_AREA_SIZE or PGA_AGGREGATE_TARGET parameters ST enqueue

56 56  General TEMP tablespace advice SMON cleanup of temporary space Set PCTINCREASE equal to zero to stop cleanup/coalesce Set temporary tablespaces as TEMPORARY  SMON in parallel environment SMON Cleanup operations are magnified  Hanging or slow system Side effect of many processes on the ST enqueue ST enqueue

57 57  Acquired to move the HW mark  High volume of inserts across concurrent session will cause a wait on this contention  Recreate / modify the object with larger extents  Pre-allocate extents ALTER TABLE … ALLOCATE EXTENT  V$LOCK.ID1 is the tablespace number.  V$LOCK.ID2 is the relative dba of segment header of the object for which space is being allocated. HW enqueue

58 58 What is blocking session waiting on?  Idle Session  DB File Scattered Reads  Another session

59 59 Idle Session Scenario JimSally  Update customer 147  Goes to Lunch Locked trying to update customer 147 Jim will needlessly wait a long time. DBA can kill Sally’s session IF they can tell that the session is idle.

60 60 Missing Index Scenario JimSally  Update customer 147  Selects from order table with missing index Locked trying to update customer 147 DBA can tell that Jim is really waiting because of a missing index on the order table – even though Jim isn’t using the order table.

61 61 Idle Session Scenario JimSally  Update customer 147  Selects from order table with missing index Updated warehouse 22 Locked trying to update customer 147 A chain of locks occurs even though both locked users aren’t accessing the table with missing indexes Bob Locked trying to update warehouse 22

62 62 Wait Events for Development  Tuning SQL for optimal performance  Debug/test/integrate/pilot process  Understand impact on existing database  Understand Oracle impact on application performance  View into production for better development prioritization and feedback  Reduce finger-pointing

63 63 Conclusion  Conventional Tuning focus on “system health” and lead to finger-pointing and confusion  Wait event tuning implemented according to RMM is the new way to tune  Two RMM-compliant tools types Tracing tools Continuous DB-wide monitoring tools  Questions & Answers

64 64 Who is Confio?  Oracle product is “Ignite for Oracle”, fast install, free trial at  Organizations who trust Confio to monitor their most critical applications include:

65 65 Thank you for coming Matt Larson Founder/Chief Technology Officer Contact Information ext. 110 Company website


Download ppt "Wait-Time Based Oracle Performance Management Prepared for NCOUG Presented by Matt Larson CTO, Confio Software."

Similar presentations


Ads by Google