Presentation on theme: "DATABASE MYSTERIES: BATCH JOB DIAGNOSTICS Chris Lawson May 4, 2007 "It is the brain, the little gray cells on which one must rely. -- Hercule Poirot."— Presentation transcript:
DATABASE MYSTERIES: BATCH JOB DIAGNOSTICS Chris Lawson May 4, 2007 "It is the brain, the little gray cells on which one must rely. -- Hercule Poirot
Our Agenda for Today A fun trivia quiz Overview: different types of diagnostics Preparing for Oracle Statspack Running Statspack reports Limitations of Statspack Some helpful Sql scripts Use the little gray cells Hastings And now, the Trivia Quiz... Case study: Diagnosing eco_out batch job. Awarding of prize for correct trivia quiz answer
TRIVIA QUIZ Oracles Statspack does not show … Snapshot start/end times High disk i/o Sql High # of execution Sql High # of parse Sql Init.ora parameters ITL locking events Top 5 wait events Database privilege problems Database memory statistics Vote now--prize at end of presentation! I say, Poirot!
Here are 2 Kinds of Diagnostics Simple run-time summaries, provided by application, show runtime patterns. Graphs show server load or disk i/o over time. Oracle-specific utilities, such as Statspack, that identify problem areas. Custom scripts that identify long-running transactions & poorly tuned Sql. Diagnostics that identify problems: Diagnostics that help resolve problems:
Run-Time Summaries Most batch job/data warehouse jobs provide runtime stats. Some even show the runtimes for each step. BillPay lists batch job stats in table Wfjob_Ctl: SQL> desc Wfjob_Ctl Name Null? Type BATCH_DATE NOT NULL VARCHAR2(8) JOB_NAME NOT NULL VARCHAR2(40) JOB_RUN_NUMBER NOT NULL NUMBER JOB_STATUS VARCHAR2(2) JOB_RETURN_CODE NUMBER JOB_START_TIME DATE JOB_END_TIME DATE HOSTNAME VARCHAR2(20)
Sample Diagnostic: Find All Slow Batch Jobs Select Job_name, To_char(job_start_time,'mon-dd Hh24:mi') STTME, To_char(job_end_time,'mon-dd Hh24:mi') ENDTME, Round((job_end_time-job_start_time)*24*60) MINS From Wfjob_ctl Where Job_start_time > Sysdate -.5 And (Job_end_time-job_start_time)*24*60 > 20 Order By 2 Show jobs in past ½ day running > 20 mins:
Find Slow Batch Jobs JOB_NAME STTME ENDTME MINS BP.westbp.gen_dailyrun.C dec-18 04:34 dec-18 06:08 94 BP.westbp.sam_gen_tkconf.C dec-18 06:00 dec-18 06:36 36 BP.westbp.prc_instalerts.C dec-18 06:01 dec-18 06:33 32 BP.westbp.cfin_prc_svcrs2.C dec-18 08:11 dec-18 08:31 20 List jobs in past ½ day running longer than 20 mins:
Sample Report: Find ecoout Runtimes select job_name,to_char(job_start_time, 'DAY') wekday, TO_CHAR(job_start_time,'MON-DD hh24:mi') STTME, TO_CHAR(job_end_time,'MON-DD hh24:mi') ENDTME, ROUND((job_end_time-job_start_time)*24*60) MINS from wfjob_ctl WHERE JOB_START_TIME > SYSDATE - 3 and job_name like '%ecoout_gen_pmt%' order by job_start_time JOB_NAME WEKDAY STTME ENDTME MINS BP.westbp.ecoout_gen_pmt SATURDAY DEC-16 08:01 DEC-16 08:48 48 Runtimes for ecoout job for last 3 days:
Find alert Runtimes select job_name, TO_CHAR(job_start_time,'MON-DD hh24:mi') STTME, TO_CHAR(job_end_time,'MON-DD hh24:mi') ENDTME, ROUND((job_end_time-job_start_time)*24*60) MINS from wfjob_ctl WHERE JOB_START_TIME > SYSDATE - 2 and job_name like '%alrt%' order by job_start_time JOB_NAME STTME ENDTME MINS BP.westbp.alrt_prc_daily.C DEC-16 10:43 DEC-16 11:39 56 BP.westbp.alrt_cln_pmt.C DEC-16 11:39 DEC-16 11:44 5 BP.westbp.alrt_cln_daily.C DEC-16 11:39 DEC-16 13:08 89 BP.westbp.alrt2_gen_pmtmd.C DEC-18 03:31 DEC-18 03:46 16 BP.westbp.alrt2_gen_ebnr.C DEC-18 03:31 DEC-18 03:38 7 Runtimes for all alerts job for last 2 days:
Another Type of Diagnostic: Server Load Charts This is a good way to spot trends and can help in capacity planning. Weve used these graphs to launch investigation after odd spikes. Server load graphs help isolate the time period having trouble. This method is typically a visual inspection.
Problem Identification: Server Load Graphs
Spikes Help Identify Problem
Bill-Pay Late January Problem Root cause: CSA users began running up to 8 bad SQL simultaneously
Introduction to Oracle Statspack Statspack is an Oracle utility that lists performance statistics for a given database. This is called a Statspack Report. You define the time period of interest by specifying the start-time and end-time. The stats come from taking regular snapshots of the database--typically once per hour (changeable.)
Oracle Statspack: Snaps The snaps taken are listed in Oracle table Stats$Snapshot Heres how to find the snaps taken in last day: Select Snap_id, To_char(snap_time,'dd-mon-yy-hh24:mi') Snaptime From Stats$snapshot Where Snap_time > Sysdate 1 Order By 1 SNAP_ID SNAPTIME dec-06-02: dec-06-03: dec-06-04:49 * * *
Starting a Statspack Report Specify the Begin and End Snapshot Ids ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Enter value for begin_snap: 1001 Enter value for end_snap: 1002 Specify the Report Name ~~~~~~~~~~~~~~~~~~~~~~~ The default report file name is sp_1001_1002. To use this name, press to continue, otherwise enter an alternative. Enter value for report_name: TESTREPORT
Statspack Key Sections Top 5 resource-consuming events All events causing waits Top SQL ordered by logical i/o Top SQL ordered by physical i/o Top SQL ordered by # of executions Aggregate db statistics Most-read objects Most-locked objects
Statspack: Top 5 Events Top 5 Timed Events ~~~~~~~~~~~~~~~~~~ % Total Event Waits Time (s) Ela Time db file sequential read 6,672,182 25, CPU time 11, db file scattered read 2,611,855 2, log file sync 63, ARCH wait on SENDREQ Single-block reads Multi-block reads Write of transaction after Commit
Statspack: List of Top Sql If you know the time period, Statspack will usually zero-in on the problem. Shows resource hogs that consume lots of disk, logical reads, physical reads, or huge number of executions. Nearly all bad Sql will be detectedbut a few can slip-through undetected! E.g., what if user is blocked for a long time? That consumes neither disk or much logical i/o, so its missed. More exceptions later.
Statspack: Top Sql Example Top resource consumerlogical reads CPU Elapsed Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value ,093, ,303 1, select cur1.pae_id, cur1.mpa_id, cur1.mpa_nickname, nvl(cur1.mpasta_id, -1) mpa_status, cur3.pmt_start_date last_pay_date, cur3.pmt_amt pmt_amt_complete, cur2.pmt_start_date schedule_date, * * * CPU per Elap per Executions Rows Processed Rows per Exec Exec (s) Exec (s) Hash Value , , Module: p_direct_check_inbound_pkg INSERT INTO CBPAYMENT_INSTRUCTION_AUDIT (PMTINSTR_ID, PMT_ID, IN STR_ID, MEM_ID, PAE_ID, MPA_ID, INSTRSTA_ID, PMTINSTRBAT_ID, WF_ * * * Top resource consumer# of executions
Another type of Report: sprepsql This report shows you actual execution plan for a particular Sql run previously. You first identify the Sql by its hash value, shown in the first statspack report: Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value ,988,572 1,530 15, UPDATE CBNOTICE_MEMBER SET NTCSTA_ID = 3, NTCMEM_MODIFIER_ID = :1, NTCMEM_MODIFIED_DATE = :2 WHERE NTC_ID = :3 AND MEM_ID = :4 Note this value
sprepsql: Sample Output for an Update Statement | Operation | PHV/Object Name | Rows | Bytes| Cost | |UPDATE STATEMENT | | | | | |UPDATE | | | | | | HASH JOIN SEMI | | 925K| 24M| | | TABLE ACCESS FULL |MEMBER_ACTIVATION | 954K| 18M| 732 | | TABLE ACCESS FULL |CBPAYMENT | 92M| 705M| | Statement Total Per Execute Total Buffer Gets: 20,474,697 20,474, Disk Reads: 19,895,404 19,895, Rows processed: 4,401 4,401.0 CPU Time(s/ms): 3,808 3,807,950.0 Elapsed Time(s/ms): 6,369 6,369,227.7 UPDATE MEMBER_ACTIVATION MA SET PMT_ACTIVATION_INDICATOR = :B2, PMT_ACTIVATION_DATE = SYSDATE, MODIFIED_BY = :B1, MODIFIED_DATE = SYSDATE WHERE MA.PMT_ACTIVATION_INDICATOR IS NULL AND EXISTS (SELECT NULL FROM * * *
Statspack Good Practices: Tips & Traps Old Snapshots are occasionally purged, so you cant go back forever! Certain problems can fly under the radar. Example: Statspack lists resource-intensive Sql. But what if the problem is due to the cumulative effect of millions of similar (not identical) Sql? Statspack will miss this because each Sql is below the reporting threshold.
Statspack Good Practices: Tips & Traps Heres one that most of us missed: Oracles statistics are reset if table stats are gathered, so Statspack will rate that Sql as perfect! SECS EXEC BUFFER_GETS DISK_READS SQL_TEXT select count(*) from chris_waivers SECS EXEC BUFFER_GETS DISK_READS SQL_TEXT select count(*) from chris_waivers Ha! I knew that! After Analyze of Chris Waivers
Statspack Good Practices: Summary Statspack is an excellent diagnostic aid in identifying possible bottlenecksnot solving them. Statspack usually uncovers the resource drivers for a specified period of time. Remember, however, it will not find everything--a few things can slip by. Like any tool, it cant fix the design for you.
My Favorite Diagnostic Scripts 1.Show all Sql running on database. 2.Show the object a user is waiting on. 3.List cumulative wait events for a user. 4.List resource-intensive Sql run recently. 5.Show average speed of disk i/o. I use custom scripts far more often than Statspack. Here are some favorites:
Script 1: Show all Active Sql This gives you an idea of some of the critical Sql for an application. If you see the same Sql over and over, it either is inefficient, or run extremely often. On most OLTP systems, only a handful of sessions are usually active at one point because everyone else has already got their answer.
Select DISTINCT Sid, username, substr(sql_text,1,200) stext From V$Session, V$Sql Where status = 'ACTIVE' And username is not null And v$session.sql_hash_value = hash_value and v$session.sql_address = v$sql.address and sql_text not like '%sql_text%' Example: Show all Active Sql SID USERNAME STEXT BMAPP SELECT PM_ACCT_NO FROM CBPAYMENT_METHOD WHERE MEM_ID = :1 FOR UPDATE NOWAIT This eliminates this query itself This eliminates Oracle background processes
This is especially helpful in isolating bottlenecks for a particular Sql. This shows you where your Sql is stuckyou may be surprised! On FFIEC, this revealed that most queries were always reading one particular table (or its indexes.) Script 2: Show Object User is Waiting on
SeleCT DISTINCT username,object_name, sql_text From V$Session, V$Sql, dba_objects o Where v$session.status = 'ACTIVE' And username is not null and o.object_id = row_wait_obj# And v$session.sql_hash_value = hash_value and v$session.sql_address = v$sql.address and username <> 'SYS' Actual Example: Show Object User is Waiting on USERNAME OBJECT_NAME SQL_TEXT ADMAPP PMT_PK WITH bills AS (SELECT /*+ Materialize */ bls.bls_external_billid bls_external_billid, blr.blr_bspbiller_id blr_bspbiller_id, bls.mem_id mem_id, bls.mpa_id mpa_id, bls.bls_id bls_id, TO_CHAR (bls.bls_first_seen, 'yyyyMMddHHmiss') bls_first_seen, bls.bls_notify_sent_to_bsp bls_notify_sent_to_bsp FROM cbbill_s Scanning the primary key index
Script 3: Session Cumulative Wait Events This summarizes the bottlenecks for the totality of a particular session. Especially useful for a long-running batch job. This shows how much time is due to each type of bottleneck. This helps you avoid solving problems that arent really the bottleneck.
Select Sid, Event, Total_waits,time_waited/100 Timewaited, Average_wait/100 Avgwait, Round(100*total_waits/Time_waited) RATE From V$session_event Where Sid = 339 And Time_waited > And Event Not Like '%Net% Order By Timewaited Actual Example: Session Cumulative Wait Events SID EVENT TOTAL_WAITS TIMEWAITED AVGWAIT RATE db file scattered read log file sync enqueue Blocked by another user wait times are in centiseconds
Script 4: Resource-Intensive Sql Pretty similar to the Statspack Sql report. This often highlights serious problems. Quickly shows you resource hogs for a databasebig CPU or disk users. You set a thresholdsuch as all SQL consuming over 1 million disk reads, or SQL that has run over 10 hours. This script exemplifies why we dont need to guess at the cause of performance problems. Although finding root cause is usually easy, applying the fix may not be.
Select Round(elapsed_time/ /1) Secs, Rows_processed Rowct, Executions, Buffer_gets, Disk_reads, Sql_text From V$sql Where Elapsed_time > And Executions > 0 And Abs(buffer_gets)/(Rows_processed+.01) > 100 Order By Elapsed_time Actual Example: Resource-Intensive Sql SECS ROWCT EXECUTIONS BUFFER_GETS DISK_READS select vcrypttrac0_."USER_NODE_LOG_ID" as USER1_, vcrypttrac0_."REQUEST_ID" as REQUEST2_25_, vcrypttrac0_."CLIENT_DEVICE_ID" as CLIENT3_25_, * * * Surprise! Sql doesnt do anything! (actual production case)
Script 5: Show Disk i/o Rate On many systems, disk i/o is a significant performance driver. You want to eliminate the disk i/o first, but its still helpful to know roughly how fast you can perform a typical read. Numbers typically range from single- block reads per second Note: Oracle uses the non-intuitive term, sequential read for a single read. Multi-block reads are called scattered reads.
Select EVENT, TOTAL_WAITS, TIME_WAITED, Round(100*total_waits/Time_waited) Rate From V$system_event Where Time_waited > 1000 And Event Like '%Db File Sequential Read%' Example: Show Disk I/O Rate for 2 Different Systems EVENT TOTAL_WAITS TIME_WAITED RATE db file sequential read EVENT TOTAL_WAITS TIME_WAITED RATE db file sequential read BILLPAY FFIEC Why is disk so slow?
Case Study: Diagnosing eco Batch Called ecoout_gen_pmt, it is critical Billpay batch, starting about 03:30 daily. Tight SLAEspecially for big payment days. Job typically runs several hours. Business is concerned that SLA is at risk, especially for Tuesday runs.
Ecoout: Identify Root Cause The resource-intensive script finds this Sql running during the time period: UPDATE CBPAYMENT PMT_UPDATE SET (COLS) = (SELECT stuff FROM CBMEMBER M, CBMEMBER_PAYEE_ACCOUNT MPA, CBELEMENT E, ETC. Script shows that Sql ran for 3 hours, consuming 25 million logical reads to process 250k rows This suggests problem with the way Oracle executed this transaction.
Eco-out Root Cause Running Explain plan for Sql shows Oracle combines 2 indexes: Bitmap conversion from ROWID. Execution plan shows these indexes Used: Pmtinstr_instrsta_id_idx Pmtinstr_instr_idx ?? Non-Selective! So Oracle combines two indexes, one of which is a terrible choice. The optimizer should simply do the obvious: just use the one good index. BTW: This has happened more than one.
Eco-out Root Cause This optimizer tactic is a notorious weakness in Oracle 9i--some shops turn-off this feature. In some Billpay batch jobs, we issue an Alter Session command to disallow this feature. Our fix: Sql hint to specify the desired index: /*+index(I PMTINSTR_INSTRSTA_ID_IDX) */ So what happened after Sql hint applied?
Eco-Payment New Runtimes
CHALLENGE QUESTION Oracles Statspack does not show … Snapshot start/end times High disk i/o Sql High # of execution Sql High # of parse Sql Init.ora parameters Database privilege problems Database memory statistics ITL locking events Top wait events
PRIVILEGE PROBLEMS QUIZ ANSWER This is NOT part of Statspack.
Case Closed : Good Diagnostic Aids Consider logs that show job runtime Statspack: –spreport shows resource usage –sprepsql shows execution plan –In Oracle 10g, use AWR Custom scripts that show resource usage such as high CPU or disk i/o Dont forget server CPU load graphs.
Case Closed: Tips & Traps There isnt just one way to always identify performance bottlenecks. Statspack cannot detect everything; some things fly under the radar. No one has a script to cover every possible problem. So, develop a toolkit of approachescustom Sql scripts, graphs, tools, etc.
Questions? The world is full of obvious things which nobody by any chance ever observes. Sherlock Holmes, The Hound of the Baskervilles.