Common Database Problems Common Database Solutions Mike Furgal PROGRESS Bravepoint – Database Services
2 Introduction Mike Furgal Progress Employee since 1989 Developer of the OpenEdge database Joined Bravepoint in 2012 Heads up Database Services Including Managed Database Services Bravepoint Largest Progress/OpenEdge consulting firm Founded in 1987 Purchased by Progress in April 2014 Specializes in all things OpenEdge Database Services Programming QAD Pro2SQL Real-time Replication to SQL Target
3 A series of case studies of issues that the PROGRESS BravePoint Managed Database Services Team has encountered over the years.
4 The case of the Missing Files
5 A large distribution center had a power failure. When the power came back on the machine booted but the database did not start : (43) ** Cannot find or open file /agility/prod/prod_db/platte_11.d5, errno = 2. : (451) prostrct list session begin for root on /dev/pts/0. : (12475) Unable to get file status for extent /agility/prod/prod_db/platte_11.d5 : (334) prostrct list session end.
6 Specifics Database was 80 GB Last good backup was 1 week old Not running After Imaging Platform was Linux
7 WHAT WOULD YOU DO?
8 Approach Made a copy of the existing database in case we made a mistake Used PROSTRCT LIST to determine which files were missing We were lucky that the missing file was part of a storage area that only held indexes Tools Available PROSTRCT UNLOCK PROSTRCT BUILDDB
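The missing-file check can also be scripted for routine monitoring. The following is a minimal sketch in Python, assuming you already have a list of the extent paths the database expects (for example, copied from the structure file). The function name and the demo file names are hypothetical, and this does not replace PROSTRCT LIST.

```python
import os
import tempfile

def find_missing_extents(extent_paths):
    """Return the expected extent paths that are not present on disk."""
    return [p for p in extent_paths if not os.path.isfile(p)]

if __name__ == "__main__":
    # Demo with throwaway files; the names are made up for illustration.
    with tempfile.TemporaryDirectory() as d:
        present = os.path.join(d, "demo_11.d4")
        absent = os.path.join(d, "demo_11.d5")
        open(present, "w").close()
        print(find_missing_extents([present, absent]))
```

Running a check like this after every reboot would surface a missing extent before anyone tries to start the database.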
9 Solution Restored the missing extent from the week old backup and ran PROSTRCT UNLOCK Rebuilt the indexes # proutil db -C idxbuild all BUT... Index rebuild failed due to finding bad blocks in the storage area where the records were stored
10 NOW WHAT?
11 Back to the Beginning Copied the backed up database to start over. Since Index Rebuild failed, we needed to start over Good thing we had copied all the files in the first place Add the missing extent Truncate the BI and do a DBRPR scan Fix bad blocks Fix bad records
12 Dump and Load After all the corruption was removed it was time to dump and load Need to do an ASCII Dump to dump around some bad records
13 Lessons Learned This Database was important to this customer, hence they wanted it back when it got corrupted. They need to treat the Database better Daily Backups After Imaging A good DR plan saves a lot of heartache
14 Next Steps Implement a good Disaster Recovery plan which includes Frequent backups After Imaging implemented Test the Disaster Recovery Plan Annually Disaster Recovery Plan needs to be on Paper Can’t be just on the computer Need a backup plan in case the DR plan fails
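"Frequent backups" is easy to state and easy to let slip, so it helps to check backup age automatically. A minimal sketch in Python, assuming the time of the last verified backup is known; the function name and the one-day default are illustrative assumptions, not a Progress recommendation.

```python
from datetime import datetime, timedelta

def backup_is_stale(last_good_backup, now, max_age=timedelta(days=1)):
    """Return True when the newest verified backup is older than max_age."""
    return now - last_good_backup > max_age

if __name__ == "__main__":
    # A week-old backup, as in this case study, fails a daily-backup policy.
    print(backup_is_stale(datetime(2015, 5, 1), datetime(2015, 5, 8)))
```

A check like this only helps if someone is alerted when it returns True.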
15 The case of the Micro Manager
16 A brand name US bank had SAN corruption. This prevented Crash Recovery from completing. They had a Hot Standby machine and database using OE Replication.
17 Specifics Had a local backup and local AI files, but the backup would not restore Previous backup was not available Replica Database was up to date Platform was Windows Database size 200 GB OpenEdge 10.1C04
18 What’s the Problem Customer refused to fail-over They never tested running on the fail-over machine. Had little confidence that the application would run in the fail-over environment. Customer worried about the time it takes to fail-back once failed over.
19 Making Matters Worse Copying the DR database to the production machine is measured in days Options presented to Management included FORCED ACCESS to the Database
21 What Next Forced into the database – This skips Crash Recovery Index Rebuild DOES NOT fix the database Dump and Load DOES NOT fix the database
22 Lesson Learned Have confidence in your Disaster Recovery Plan There is no sense in having one if you are never going to use it Be Careful of the “QUICK FIX” Non-technical people will ALWAYS choose the fastest approach to the solution without understanding the consequences
23 Next Steps Worked with the customer to do a fail-over test. Made the fail-over testing an annual event
24 School’s Out For Summer
25 A large school district needs to get their report cards out to 30,000+ students. They discovered they had corruption in the database because backups stopped working for about a week.
26 Specifics 10.2B05 Windows 64bit OpenEdge Last good backup is 1 week old All report card data for 30,000+ students entered since that last good backup After Imaging is turned on, but AI file retention was less than 1 week Database is about 300 GB They have the 1 week old backup restored to a different location
27 WHAT WOULD YOU DO?
28 Approach We had 2 plans Plan A – Get the corruption out of the live database Use any and all tools to remove the corruption Plan B – Revert back to the week old database See if we can take all the report card data from the live database and import it into the week old database.
29 Plan A The database.lg file showed the extents where the corruption was located. Each storage area was a single variable length extent Corruption was in an 80 GB extent (Ugh!) Used DBRPR to scan and fix bad blocks This took hours to run on this large extent In the end this failed
30 Plan B Worked with the vendor to find all the tables that made up the report card processing This was about 12 tables Dumped these tables from the live database There was no corruption in these tables Had to figure out how to get the table data into the week old database
31 HMMMMM…..
32 Plan B Dumped the schema for the 12 tables Went into the dictionary and renamed the tables Added _old to the end of the table name Loaded the schema for the 12 tables Loaded the data for the 12 tables This is a very useful trick Didn’t need to recompile – the application worked
33 Plan A (revisited) Dumped and Loaded the plan A database There were 5 tables where the dump and load failed. Did a 4GL dump FOR EACH … BY field. EXPORT… FOR EACH … BY field DESCENDING. EXPORT … Didn’t trust the data, so we used the same table rename technique to get these tables from the week old backup.
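The two-pass 4GL dump reads the table in ascending order until it hits a bad record, then in descending order until it hits one again. The following toy model in Python illustrates the reasoning only; it is not OpenEdge code. The list stands in for a table read through an index, `None` marks a corrupt record, and all names are hypothetical.

```python
class BadRecord(Exception):
    """Stands in for the error raised when a corrupt record is read."""

def read_record(table, i):
    rec = table[i]
    if rec is None:  # None marks a corrupt record in this toy model
        raise BadRecord(i)
    return rec

def dump_around_bad_records(table):
    """Read forward until the first bad record, then backward until the
    next one. Returns (recovered records in order, skipped positions)."""
    n = len(table)
    front, i = [], 0
    try:
        while i < n:
            front.append(read_record(table, i))
            i += 1
    except BadRecord:
        pass
    stop = i  # position where the ascending pass stopped
    back, j = [], n - 1
    try:
        while j > stop:
            back.append(read_record(table, j))
            j -= 1
    except BadRecord:
        pass
    back.reverse()
    return front + back, list(range(stop, j + 1))

if __name__ == "__main__":
    print(dump_around_bad_records(["a", "b", None, "d", None, "f"]))
```

In the demo, the good record sitting between two bad ones is skipped as well. That is one reason not to trust data recovered this way without checking it against another source.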
35 But Wait – There’s More A week later they found they also had corruption in a different database That was solved by restore and roll forward Needed to upgrade to 10.2B08 for Roll Forward to work properly Windows 64bit 10.2B06 has a roll forward bug that prevented it from working.
36 Next Steps Implement a DR solution OpenEdge Replication Rolling Forward AI Restore the backup and roll forward on the same machine This verifies the backup is functional DB block corruption does not get replicated from roll forward
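The school district had After Imaging enabled but could not roll forward, because the retained AI files did not reach back to the last good backup. A minimal sketch of that check in Python, assuming the retained AI archives form an unbroken sequence and that each archive's start time is known; the function name is hypothetical.

```python
from datetime import datetime

def can_roll_forward(last_good_backup, ai_archive_starts):
    """Roll forward is only possible when the oldest retained AI archive
    starts at or before the backup it will be applied to."""
    if not ai_archive_starts:
        return False
    return min(ai_archive_starts) <= last_good_backup

if __name__ == "__main__":
    backup = datetime(2015, 5, 1)
    retained = [datetime(2015, 5, 4), datetime(2015, 5, 5)]
    print(can_roll_forward(backup, retained))
```

AI retention should therefore be set longer than the oldest backup you might need to restore.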
37 A case of the spins
38 A large medical center patched their software over the weekend. On Monday the performance of the system was unacceptable. The vendor says the patch was minor and could not be the cause of the issue. The customer says nothing else changed.
39 Specifics OpenEdge bit Windows bit Database is 321 GB Number of users is 3,000
40 Some Metrics – Month View
Daily totals. Columns on the slide: Date, CPs, Users, DB Requests, DB Reads, Ratio, Rec Rds, Rec Cr, Rec Up, Rec Del, DB Writes, BI Writes, AI Writes, Latch TO. Columns from Rec Cr onward:
Date            Rec Cr     Rec Up     Rec Del     DB Writes  BI Writes  AI Writes  Latch TO
05/16/15 (Sat)  425,542    1,922,412  374,270     175,980    172,163    116,490    14,798
05/15/15 (Fri)  1,520,296  3,227,568  1,626,692   1,886,768  449,449    295,333    115,003
05/14/15 (Thu)  1,639,564  3,475,092  871,808     2,023,720  454,097    298,088    73,017
05/13/15 (Wed)  2,250,168  3,423,760  1,099,930   2,378,306  525,070    374,006    885,294
05/12/15 (Tue)  1,797,520  3,397,238  1,043,849   2,068,487  496,165    328,450    943,510
05/11/15 (Mon)  Restart
05/10/15 (Sun)  307,377    1,804,694  368,589     100,764    150,257    102,579    244,087
05/09/15 (Sat)  483,479    1,423,644  516,069     171,617    165,926    115,092    186,064
05/08/15 (Fri)  Restart
05/07/15 (Thu)  1,705,661  3,503,669  924,082     2,153,671  455,228    306,003    128,058
05/06/15 (Wed)  2,247,914  3,525,546  1,225,826   2,453,942  557,639    374,309    129,160
05/05/15 (Tue)  2,000,392  3,366,955  1,126,177   2,324,533  488,949    329,102    171,067
05/04/15 (Mon)  2,311,179  3,733,307  1,363,409   2,678,057  712,951    528,518    212,059
05/03/15 (Sun)  331,006    1,783,868  1,243,395   270,902    222,981    156,128    23,917
05/02/15 (Sat)  444,568    1,853,376  24,483,171  3,078,655  1,889,151  1,496,027  132,360
05/01/15 (Fri)  2,450,987  3,063,577  1,458,166   2,046,184  590,195    411,705    3,268,211
Rec Rds values highlighted on the slide: …,879,551,662 (05/13/15), 1,901,654,803 (05/12/15), 1,593,242,356 (05/05/15), 1,547,975,267 (05/04/15)
41 Bad Day – 15 minute samples
Sample    CPs  Users  DB Requests  Rec Cr  Rec Up  Rec Del  DB Writes  BI Writes  AI Writes  Latch TO
10:05:01  2    3,033  106,609,219  9,664   31,257  2,338    19,189     4,858      4,499      88,…
…:20:01   2    3,052  102,685,946  10,923  33,024  2,407    21,026     5,858      4,660      99,…
…:35:02   2    3,082  97,655,645   13,814  38,947  3,318    27,611     7,075      5,046      107,…
…:50:01   3    3,089  81,674,392   22,289  36,321  5,409    25,428     7,604      5,516      100,…
…:05:01   7    3,086  214,447,121  40,122  58,595  29,276   59,987     13,919     8,509      30,…
…:20:01   5    3,039  155,915,767  25,758  57,494  14,197   48,054     8,778      4,993      4,…
…:35:01   4    3,040  156,151,501  27,304  60,045  7,791    48,323     8,285      4,571      3,…
…:50:01   5    2,888  146,245,414  33,605  60,463  11,379   52,606     8,566      5,226      2,711
Good Day – 15 minute samples
Sample    CPs  Users  DB Requests  Rec Cr  Rec Up  Rec Del  DB Writes  BI Writes  AI Writes  Latch TO
10:05:01  4    2,848  153,746,951  30,489  60,805  7,666    55,818     7,757      4,774      5,…
…:20:01   5    2,812  145,441,871  26,490  59,676  13,279   53,195     6,642      4,962      3,…
…:35:01   4    2,877  151,783,516  30,446  61,754  11,262   54,906     7,899      5,192      6,…
…:50:01   7    2,894  143,780,080  46,234  63,980  19,392   66,429     11,820     6,774      7,…
…:05:02   4    2,912  158,495,087  34,806  63,070  12,040   59,041     9,215      5,284      10,…
…:20:01   6    2,897  155,845,110  34,727  60,416  12,770   59,998     8,929      5,497      7,…
…:35:01   8    2,938  150,662,822  70,239  83,419  20,193   82,777     12,552     8,542      6,…
…:50:01   4    2,914  138,147,804  31,236  59,286  12,957   57,756     7,696      4,971      2,731
43 Some Metrics The same month view as slide 40, read for the Latch TO column: 943,510 on 05/12/15 (Tue) and 885,294 on 05/13/15 (Wed), compared with 212,059, 171,067, 129,160 and 128,058 on 05/04/15 through 05/07/15, and 3,268,211 on 05/01/15 (Fri).
44 Latch Timeouts increased. CRUD Operations Decreased. Why? Nothing had changed
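One way to make "Latch Timeouts increased" concrete is to compare each day against a baseline drawn from earlier days. The sketch below, in Python, uses the daily Latch TO totals from the month view. The function name and the factor of 3 are assumptions chosen for illustration, not an established threshold.

```python
from statistics import median

def flag_outliers(baseline, samples, factor=3.0):
    """Return the samples whose value exceeds factor times the baseline median."""
    limit = factor * median(baseline)
    return {day: value for day, value in samples.items() if value > limit}

if __name__ == "__main__":
    # Weekday Latch TO totals for 05/04/15 through 05/07/15 (previous week).
    baseline = [212_059, 171_067, 129_160, 128_058]
    this_week = {"05/12": 943_510, "05/13": 885_294,
                 "05/14": 73_017, "05/15": 115_003}
    print(flag_outliers(baseline, this_week))
```

With these figures only 05/12 and 05/13 are flagged, which matches the days users reported the slowdown.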
45 Further investigation revealed that the -spin setting was changed from 96,000 to 20,000. This change was a move to best practices where industry experts have been saying to not have -spin higher than 20,000
46 The change was made months back to the conmgr.properties file and was long forgotten. When the patch was applied, the database was bounced and the change finally took effect. While no one remembers a configuration change, the change was there. Setting -spin back up to 96,000 got them the performance back
48 But WAIT! There’s more
49 A different customer added a few CPUs to their environment. When the users log in, the CPUs peg to 100% utilized Performance suffers WebSpeed launches additional Agents because all agents are busy Specifics Customer database is > 1 TB 430 WebSpeed agents AIX 10.1C 64bit
50 5 minute samples. Columns: Sample, CPs, Users, DB Requests, DB Reads, Ratio, Rec Rds, Rec Cr, Rec Up, Rec Del, DB Writes, BI Writes, AI Writes, Latch TO. Latch TO per sample, in order: 16,…; 15,…; 21,…; …; 20,…; 48,…; 58,…; 93,…; …; 72,…; 64,…; 38,…; 28,853
51 Unlike the previous example, we had no historical performance metrics to compare to when things were good. Could only rely on instincts and experience.
52 A Different View In a 5 minute sample, the highest latch timeout should be no more than 3,000
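The rule of thumb above is simple to automate. A minimal sketch in Python, assuming latch timeout counts are collected per sample; scaling the limit linearly for other sample lengths is an assumption added here, and the function names are hypothetical.

```python
LIMIT_PER_5_MIN = 3_000  # rule of thumb from the slide above

def limit_for(interval_minutes):
    """Scale the 5-minute limit linearly to another sample length (assumption)."""
    return LIMIT_PER_5_MIN * interval_minutes / 5

def over_limit(latch_timeouts, interval_minutes=5):
    """Return the sample values that exceed the limit for this sample length."""
    limit = limit_for(interval_minutes)
    return [v for v in latch_timeouts if v > limit]

if __name__ == "__main__":
    # 28,853 is the final 5-minute Latch TO value in this customer's samples.
    print(over_limit([2_500, 28_853]))
```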
53 Changed -spin from 60,000 to 20,000 and the problem went away
54 Lesson Learned There is no one setting that will work for every situation Changing -spin from 20,000 to 96,000 helped one customer Changing -spin from 60,000 to 20,000 helped another one Having historical data is key Don’t assume nothing has changed just because they said so Configuration changes usually only take effect at next startup
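A forgotten configuration change can be caught before the next restart by comparing the values the database started with against what is currently configured. A minimal sketch in Python, assuming both sets of settings have already been read into dictionaries; the key names are hypothetical and do not reflect the actual conmgr.properties syntax.

```python
def pending_changes(running, configured):
    """Return {setting: (running value, configured value)} for every setting
    that will change at the next startup."""
    keys = running.keys() | configured.keys()
    return {
        k: (running.get(k), configured.get(k))
        for k in sorted(keys)
        if running.get(k) != configured.get(k)
    }

if __name__ == "__main__":
    running = {"spin": 96000, "buffers": 500000}      # values at last startup
    configured = {"spin": 20000, "buffers": 500000}   # values now on disk
    print(pending_changes(running, configured))
```

Reviewing this report before any planned bounce shows which changes are about to go live.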
55 Summary These are examples of some real world Database Problems Don’t assume things can’t go wrong Having a plan is not enough Testing the plan and having confidence is required If all else fails, seek professional help
56 Gus B Mike F Dan F Chris R Roadies: Paul Coveney, Darren Rhoads, Tom Cattigan, Joe Rozenberg Jeff Keller, Marek Bujnarowski, Ajit Deodhar Groupies: Dave Eddy, Humphrey Koraag, Diego Canziani, Kim Davies