Presentation is loading. Please wait.

Presentation is loading. Please wait.

Root Cause and Other DBA Urban Legends Brian Hitchcock OCP 10g DBA Sun Microsystems SunFed.

Similar presentations


Presentation on theme: "Root Cause and Other DBA Urban Legends Brian Hitchcock OCP 10g DBA Sun Microsystems SunFed."— Presentation transcript:

1 Root Cause and Other DBA Urban Legends Brian Hitchcock OCP 10g DBA Sun Microsystems brian.hitchcock@sun.com brhora@aol.com www.brianhitchcock.net SunFed DBA Brian Hitchcock October 19, 2006Page 1

2 SunFed DBA Brian Hitchcock October 19, 2006Page 2 For DBA Issues  I'm told that I must find root cause, – Can't resolve the issue without root cause  Must have testing environment that is – Similar enough to production to recreate the issue – And the fix  Need extensive expertise to solve performance issues – 10046 trace – Wait events – Extents size/number – Physical spindles – Rebuilding indexes – Undocumented init parameters

3 SunFed DBA Brian Hitchcock October 19, 2006Page 3 What is My Experience  Centralized DBA support team  2000+ databases  Database users open cases with DBA team  Cases randomly assigned to DBAs – Not assigned based on experience or expertise – My cases are probably typical of all the cases

4 SunFed DBA Brian Hitchcock October 19, 2006Page 4 My Experience  Finding root cause costs money  Building and maintaining test system(s) costs money  Test system – Needs to be able to recreate production issue – Database, network, apps and web servers, load balancers – Users around the world  Root cause and test system(s) are only worthwhile – If having them is less expensive than not having them.  Perhaps these are Urban Legends? – Root cause – Test system(s) – Specific expertise required

5 SunFed DBA Brian Hitchcock October 19, 2006Page 5 My Experiences  Review major cases I worked on over the last 2 years – Cases are presented in chronological order, oldest first  For each case – What worked, i.e. what was the real-world solution? – Was root cause identified? – Was a test system available to verify root cause  and the solution? – What extensive expertise was required?

6 SunFed DBA Brian Hitchcock October 19, 2006Page 6 Keeping Score  For each case, look for 3 things – Root cause – Test system(s) – Specific expertise  Listed earlier (10046, wait events, extents, etc.) – Were any of the following useful in resolving the issue?  Open Mind (anything can and will happen)  Communication (it's difficult)  Simple solutions ­Any active db users? ­Reboot one or more components?

7 SunFed DBA Brian Hitchcock October 19, 2006Page 7 Case 1 – CRM Local Language  Moved to UTF8, add local language columns  SQL ran slower  10046 trace, explain plan  Found the single step slowing execution  Existing index wasn’t updated to have new local language column  Recreated index – Performance returned to normal

8 SunFed DBA Brian Hitchcock October 19, 2006Page 8 Case 1 – Three Things  Root cause – Existing indexes not optimal for new schema – Corrected indexes fixed the problem  Test system identified the issue  Specific Expertise – 10046 – Explain plan – Indexes

9 SunFed DBA Brian Hitchcock October 19, 2006Page 9 Case 2 – Storefront Slow  App support team reports database slow  Check database, 1-3 active users at most  Active users gone in 1-2 seconds  App users report 30 second response time  Have App support person use app – Watch database  See single active user connect, complete, disconnect – 2 seconds maximum  Remainder of response time – Web servers, load balancers etc.  Application server was locked up – reboot fixed it

10 SunFed DBA Brian Hitchcock October 19, 2006Page 10 Case 2 – Three Things  Root cause – Not identified  Test Environment – Exists, but not even close to production – Complex production application environment – Accurate test environment  Expensive to build and maintain – Hard to find qualified testers – No one completely understands how production was built  Impact of layoffs, outsourcing  Hard to recreate what you don’t understand  No specific expertise required

11 SunFed DBA Brian Hitchcock October 19, 2006Page 11 Case 3 – Contracts Slow  App users report message – ‘Problem contacting the database: No available resource…’  Sar shows CPU 90% idle  Statspack snapshot for last hour shows top wait events – 70% CPU time – 20% db file read  App server rebooted, performance returns to normal

12 SunFed DBA Brian Hitchcock October 19, 2006Page 12 Case 3 – Three Things  Root cause unknown – How to determine what caused the app server to hang?  Test system does not have all the components of prod – Don’t know why production locked up – Tough to reproduce in test system  Simple solutions – no specific expertise required – Reboot app server – Low-risk, cheap, quick – No special experience required – If it doesn’t work, time to put more resources into the issue

13 SunFed DBA Brian Hitchcock October 19, 2006Page 13 Case 4 – Storefront  Users report app hanging  Few or no active users in database  Active users taking much longer than normal  Isolate single SQL statement that is running slow  Explain plans show good/bad execution plans – Changes from good to bad at random times  Various attempts to isolate give conflicting results – We assume that the problem is stable, it isn’t  10053 trace, watch optimizer choose execution plan  Optimizer changes from good to bad plan – one column of one table has 2 versus 3 distinct values

14 SunFed DBA Brian Hitchcock October 19, 2006Page 14 Case 4 – Three Things  Root cause – not sure…  App developers gone (outsourced) – No one knows how code really works – Why are these values appearing and disappearing? – App is inserting/deleting the rows with the 3 rd distinct value – This change causes optimizer to choose bad plan – Bug or feature?  Fix? – Create rows so column always has 3 distinct values

15 SunFed DBA Brian Hitchcock October 19, 2006Page 15 Case 4 – Three Things (cont’d)  No test system – Test system has small percentage of production data – App system too complex to reproduce  Multiple strings, load balancers, app servers  Hardware, support, upgrades, patches  Difficult to get testing resources – people are expensive – Changes are tested for functionality – Changes aren’t tested for performance  Until released in production  Specific expertise – Explain plan – 10053 trace

16 SunFed DBA Brian Hitchcock October 19, 2006Page 16 Case 5 – Customer Demo  Users report slow performance  Database shows 4 inactive sessions  Inactive sessions holding locks  Kill these 4 sessions – Performance is fine

17 SunFed DBA Brian Hitchcock October 19, 2006Page 17 Case 5 – Three Things  Root cause – unknown – Why some sessions inactive and holding locks? – App developers gone – Easier to kill sessions once in a while – Real root cause would be expensive to find  Test system – Don’t have test system – app developed on the cheap – Don’t know how app works  No specific expertise required – Kill database sessions

18 SunFed DBA Brian Hitchcock October 19, 2006Page 18 Case 6 – Executive Dashboard  User reports database is sorting dates incorrectly – Phone on mute – laugh  Users have really good drugs today!  Connect to db – Verify that indeed dates are being sorted backwards  From NLS experience, look at actual bytes of dates – There are extra bytes that I can’t explain – Go back to the docs for DATE type – Expand DATE format to include all the possible fields – Suddenly it all makes sense!

19 SunFed DBA Brian Hitchcock October 19, 2006Page 19 Case 6 – What’s the Story?  The dates are being correctly sorted – because – They are from BC!  OK, now what? – This is sales data from Fred Flintstone? – Is Barney setting up his own IT department?  Check the basics – App code uses 10g JDBC  User says this was a requirement – But the database is 9i – it just gets better and better  User finds Metalink note – 10g JDBC issues with 9i database – Doesn’t describe our issue, but it’s a start…

20 SunFed DBA Brian Hitchcock October 19, 2006Page 20 Case 6 – What to Do?  Setup a 10g database and test the app code against it – No test system, in fact, no other system of any kind – Critical executive reporting system, can’t be down for long – Very limited disk space  Why not upgrade existing 9i database? – Upgrades can cause problems – If this isn’t a 10g to 9i issue, why risk the db upgrade?  Install 10g db, full export 9i db, import into 10g db  User tests app code against 10g database – No more dates from BC!  Remove 9i database, expand 10g database to match  User happy – Executive’s reports don’t show sales to the Flintstones

21 SunFed DBA Brian Hitchcock October 19, 2006Page 21 Case 6 – Three Things  Root cause – Not clear  Tried 10g database and issue went away  Did we really identify the root cause?  Was it a feature of 10g JDBC? A bug?  Test system – None  Expertise required – Minimal configuration control would have prevented this – Hardest part was accepting what Oracle was telling us  Dates from BC – Not at all the way things are supposed to be

22 SunFed DBA Brian Hitchcock October 19, 2006Page 22 Case 7 – Customer Demo  Users calls, application is slow  Blocking processes, restart database twice in one day – App still slow  Ask user to connect and start using app – I watch database for active users – Over 20 seconds before new db user appears – Db user is done and gone in less than a second – User reports 30 seconds before results appear in browser – I tried ping between the db server and the app server  200ms with packet loss  Ask network group to investigate ­Network switch in data center has failed

23 SunFed DBA Brian Hitchcock October 19, 2006Page 23 Case 7 – Three Things  Root cause – Network switch – no question – This was not a database problem  But we could have wasted a lot of time with SQL tracing etc.  Need to confirm that the database is the problem – Then work the issue as a db tuning issue  No test system available – How would you reproduce the data center network?  Expertise required – Look for active users in database – ping between db and app server

24 SunFed DBA Brian Hitchcock October 19, 2006Page 24 Case 8 – Data Warehouse  User reports application slow – User can’t truncate a table – command hangs – Loading data into warehouse is also hanging  Watch database while user starts load process  During load – User is using third-party app to do the load – No active users performing inserts  When no load is happening – Truncate table runs quickly  User contacts app vendor – Vendor “changes some parameters for the load process” – Data load runs normally, truncate table runs normally

25 SunFed DBA Brian Hitchcock October 19, 2006Page 25 Case 8 – Three Things  Root cause – Looking at the db showed no active inserts during data load – Needed to verify that db wasn’t the issue  Force vendor to perform  Test system – This was a dev database so there was a ‘test’ system – Data load issues identified, resolved in dev environment  Expertise required – Quickly check for database problems – Communicate with user to understand what is happening – Politics – dealing with vendor  Vendor really, really wants it to be a database problem

26 SunFed DBA Brian Hitchcock October 19, 2006Page 26 Case 9 – Software Registry  User trying to truncate table – Part of a data feed process  Other users in database – Performing deletes on same table  Blocking truncate  Kill delete processes – Truncate still blocked  Watch database processes – Delete process starts every hour on the hour  Cron job starts delete hourly – Part of data feed process – Why don’t the app owners know this?

27 SunFed DBA Brian Hitchcock October 19, 2006Page 27 Case 9 – Three Things  Root cause – Unknown – Why was cron in place that user didn’t know about?  Test system – exists – But doesn’t have the production data feeds  Expertise required – None – basic DBA skills – Not a database performance issue – Basic app configuration was the issue

28 SunFed DBA Brian Hitchcock October 19, 2006Page 28 Case 10 – CRM Reporting  User reports app is too slow  Specific selects are taking 10-15 seconds  STATSPACK snapshots are being taken  Check snapshots over same time for last week  Performance of the top SQL hasn’t changed  User agrees that performance hasn’t changed  Resources not available to make performance better  When is a performance problem not a problem? – When you don’t want to pay to fix it

29 SunFed DBA Brian Hitchcock October 19, 2006Page 29 Case 10 – Three Things  Root cause – User perception  Test system – None  Expertise required – Document that performance hasn’t changed – Offer to work the issue if user will get funding

30 SunFed DBA Brian Hitchcock October 19, 2006Page 30 Case 11 – Configurator  Users getting error – Can’t allocate memory in shared pool  What is causing this?  Spend some time looking at the database – Looks normal  Reboot database to clear shared pool  Watch database after restart – Shared pool doesn’t fill up

31 SunFed DBA Brian Hitchcock October 19, 2006Page 31 Case 11 – Three Things  Root cause – Don’t know why shared pool fills up  Test system – How to recreate problem in test system?  Expertise needed – Basic DBA skills – Flush shared pool  Problem didn’t reoccur – One-time or infrequent set of circumstances

32 SunFed DBA Brian Hitchcock October 19, 2006Page 32 Case 12 – Pricing Database  User reports slow performance  Check database server – CPU and iowait very high – Watch SQL executing – Explain plan for worst SQL shows full table scans – No indexes on tables being scanned  Test server – Same tables do have indexes  While recreating indexes – All the needed indexes reappear  Cron job for pricing data runs every two weeks – Rebuilds indexes and loads data

33 SunFed DBA Brian Hitchcock October 19, 2006Page 33 Case 12 – Three Things  Root cause – What caused indexes to be dropped?  Unknown  Test system – Used to verify that indexes were missing – Can’t reproduce whatever dropped the indexes  Expertise required – Basic DBA skills

34 SunFed DBA Brian Hitchcock October 19, 2006Page 34 Case 13 – Contracts Database  User reports error when selecting from table – Invalid row id  Some queries on same table don’t report error  Looking at table and row ids – Some queries use index  No errors – Other queries use table  Specific rows have invalid row ids  Rebuild table – Create table as select * from

35 SunFed DBA Brian Hitchcock October 19, 2006Page 35 Case 13 – Three Things  Root cause – Why did row ids become invalid?  Unknown  Test system – Exists, can’t reproduce cause of invalid row ids  Expertise required – Basic DBA skills

36 SunFed DBA Brian Hitchcock October 19, 2006Page 36 Case 14 – Software Download  User reports application slow in Dev environment – SQL so slow, application server times out – Sar shows CPU near 0% idle  User executes problem SQL – Watch database – SQL completes, very slowly, database is working – Explain plan shows full table scans  Look at tables involved – Production – last analyzed NULL, indexes, runs fast – Test – recently analyzed, indexes, runs slow  Drop stats, rebuild indexes, reanalyze – Performance returns to normal

37 SunFed DBA Brian Hitchcock October 19, 2006Page 37 Case 14 – Three Things  Root cause – What happened to indexes, statistics in production? – Unknown  Test system – Exists – Doesn’t match production – which is correct? – Can’t reproduce dropped indexes and/or statistics  Expertise required – Basic DBA skills

38 SunFed DBA Brian Hitchcock October 19, 2006Page 38 Case 15 – BRIO  User reports SQL failing with error – Can’t allocate memory in shared pool  Watch database – Same SQL runs most of the time – Error occurs once in a while  Solutions – Reduce sort area size  Free up more memory for shared pool – Increase physical memory assigned to database  Error occurred so infrequently – Users decided not to change system

39 SunFed DBA Brian Hitchcock October 19, 2006Page 39 Case 15 – Three Things  Root cause – Unknown – Just too many users for a brief time?  Test system – Yes, but how to reproduce this issue (user load)?  Expertise required – Basic DBA skills

40 SunFed DBA Brian Hitchcock October 19, 2006Page 40 Case 16 – Revenue App  Automated alert – Report log switching hung – Database stopped for transactions  Examine database – Can’t find anything wrong  Restart database – Log switch works – No further alerts – No user problems

41 SunFed DBA Brian Hitchcock October 19, 2006Page 41 Case 16 – Three Things  Root Cause – Unknown  Test system – Exists, but no help  Expertise required – Basic DBA skills

42 SunFed DBA Brian Hitchcock October 19, 2006Page 42 Scoreboard

43 SunFed DBA Brian Hitchcock October 19, 2006Page 43 Observations  How many times did we – Identify root cause? – Have a complete test system? – Need specific expertise?  Un-scientific results – Root cause – 2 out of 16 = 13% – Test system – 3 out of 16 = 19% – Specific expertise – indexes 1/16 = 6%  Simple solutions worked 94% of the time

44 SunFed DBA Brian Hitchcock October 19, 2006Page 44 Conclusions  Brian’s off his meds – Ignore him – this isn’t typical  Urban legends? – You don’t have to find root cause – You don’t have to have a complete test environment – You don’t need extensive expertise in specific DBA areas  You do need – DBA experience – Open mind – strange stuff happens all the time – Communication skills – you don’t know what’s happening  Looking at training resources – Why become more expert at things that are rarely needed? – Why not become familiar with things you know little about?

45 SunFed DBA Brian Hitchcock October 19, 2006Page 45 Opinion  Specific expertise can be obtained when needed  Based on the results shown – 94% of the time, don’t need expertise to be productive – 15 of 16 cases solved with general DBA skills  Some expertise can be automated or outsourced – Tracing, wait events  Focus on skills that can’t be put into a GUI – Communication – Ability to solve problems, not just database issues


Download ppt "Root Cause and Other DBA Urban Legends Brian Hitchcock OCP 10g DBA Sun Microsystems SunFed."

Similar presentations


Ads by Google