Common Database Problems Common Database Solutions Mike Furgal PROGRESS Bravepoint – Database Services.

2 Introduction. Mike Furgal: Progress employee since 1989; developer of the OpenEdge database; joined Bravepoint in 2012; heads up Database Services, including Managed Database Services. Bravepoint: largest Progress/OpenEdge consulting firm; founded in 1987; purchased by Progress in April 2014; specializes in all things OpenEdge (Database Services, Programming, QAD); Pro2SQL, real-time replication to a SQL target.

3 A series of case studies of issues that the PROGRESS BravePoint Managed Database Services Team has encountered over the years.

4 The case of the Missing Files

5 A large distribution center had a power failure. When the power came back on, the machine booted but the database did not start:
(43) ** Cannot find or open file /agility/prod/prod_db/platte_11.d5, errno = 2.
(451) prostrct list session begin for root on /dev/pts/0.
(12475) Unable to get file status for extent /agility/prod/prod_db/platte_11.d5
(334) prostrct list session end.
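A minimal Python sketch of how the missing extent could be picked out of log text like the messages on slide 5. The regular expression is an assumption based only on the wording of message (43) as quoted there, and `missing_extents` is a hypothetical helper, not an OpenEdge utility.

```python
import re

# Assumed shape of message 43, taken from the slide text:
# "(43) ** Cannot find or open file <path>, errno = <n>."
MISSING_FILE = re.compile(
    r"\(43\)\s+\*\* Cannot find or open file (\S+), errno = (\d+)"
)

def missing_extents(log_text):
    """Return (path, errno) pairs for each 'cannot find or open file' message."""
    return [(path, int(errno)) for path, errno in MISSING_FILE.findall(log_text)]

sample = """\
(43) ** Cannot find or open file /agility/prod/prod_db/platte_11.d5, errno = 2.
(451) prostrct list session begin for root on /dev/pts/0.
(12475) Unable to get file status for extent /agility/prod/prod_db/platte_11.d5
(334) prostrct list session end.
"""

for path, errno in missing_extents(sample):
    print(f"missing extent: {path} (errno {errno})")
```

On Linux, errno 2 is "No such file or directory", which matches the picture of an extent that disappeared from disk rather than one that is unreadable.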

6 Specifics: database was 80 GB; last good backup was 1 week old; not running After Imaging; platform was Linux.

7 WHAT WOULD YOU DO?

8 Approach: made a copy of the existing database in case we made a mistake; used PROSTRCT LIST to determine which files were missing (we were lucky that the missing file was part of a storage area that only held indexes). Tools available: PROSTRCT UNLOCK, PROSTRCT BUILDDB.

9 Solution: restored the missing extent from the week-old backup and ran PROSTRCT UNLOCK; rebuilt the indexes (# proutil db -C idxbuild all). BUT... the index rebuild failed because it found bad blocks in the storage area where the records were stored.

10 NOW WHAT?

11 Back to the Beginning: since the index rebuild failed, we went back to the copy of the database we had saved and started over (good thing we had copied all the files in the first place). Added the missing extent. Truncated the BI and did a DBRPR scan to fix bad blocks and bad records.

12 Dump and Load: after all the corruption was removed, it was time to dump and load. We needed an ASCII dump to dump around some bad records.

13 Lessons Learned: this database was important to this customer, hence they wanted it back when it got corrupted. They need to treat the database better: daily backups and After Imaging. A good DR plan saves a lot of heartache.

14 Next Steps: implement a good Disaster Recovery plan that includes frequent backups and After Imaging. Test the Disaster Recovery plan annually. The Disaster Recovery plan needs to be on paper; it can’t be just on the computer. Have a backup plan in case the DR plan fails.

15 The case of the Micro Manager

16 A brand name US bank had SAN corruption. This prevented Crash Recovery from completing. They had a Hot Standby machine and database using OE Replication.

17 Specifics: had a local backup and local AI files, but the backup would not restore; previous backup was not available; replica database was up to date; platform was Windows; database size 200 GB; OpenEdge 10.1C04.

18 What’s the Problem? The customer refused to fail over. They had never tested running on the fail-over machine, had little confidence that the application would run in the fail-over environment, and worried about the time it would take to fail back once failed over.

19 Making Matters Worse: copying the DR database to the production machine would take days. The options presented to management included FORCED ACCESS to the database.

21 What Next? Forced into the database, which skips Crash Recovery. Index Rebuild DOES NOT fix the database. Dump and Load DOES NOT fix the database.

22 Lesson Learned: have confidence in your Disaster Recovery plan; there is no sense in having one if you are never going to use it. Be careful of the “QUICK FIX”: non-technical people will ALWAYS choose the fastest approach to the solution without understanding the consequences.

23 Next Steps: worked with the customer to do a fail-over test; made fail-over testing an annual event.

24 School’s Out For Summer

25 A large school district needed to get its report cards out to 30,000+ students. They discovered they had corruption in the database because backups had stopped working for about a week.

26 Specifics: OpenEdge 10.2B05 on Windows 64-bit; last good backup is 1 week old; all report card data for 30,000+ students was entered since that last good backup; After Imaging is turned on, but AI file retention was less than 1 week; database is about 300 GB; they have the 1-week-old backup restored to a different location.

27 WHAT WOULD YOU DO?

28 Approach: we had 2 plans. Plan A: get the corruption out of the live database, using any and all tools to remove it. Plan B: revert to the week-old database and see if we can take all the report card data from the live database and import it into the week-old database.

29 Plan A: the database.lg file showed the extents where the corruption was located. Each storage area was a single variable-length extent, and the corruption was in an 80 GB extent (Ugh!). Used DBRPR to scan and fix bad blocks; this took hours to run on such a large extent, and in the end it failed.

30 Plan B: worked with the vendor to find all the tables that made up the report card processing; this was about 12 tables. Dumped these tables from the live database; there was no corruption in them. Had to figure out how to get the table data into the week-old database.

31 HMMMMM…..

32 Plan B: dumped the schema for the 12 tables. In the week-old database, went into the dictionary and renamed the tables, adding _old to the end of each table name. Loaded the schema for the 12 tables, then loaded the data for the 12 tables. This is a very useful trick: we didn’t need to recompile, and the application worked.

33 Plan A (revisited): dumped and loaded the Plan A database. There were 5 tables where the dump and load failed. For those, did a 4GL dump in both directions: FOR EACH … BY field: EXPORT …, then FOR EACH … BY field DESCENDING: EXPORT …. We didn’t trust the data, so we used the same table rename technique to get these tables from the week-old backup.

35 But Wait, There’s More: a week later they found they also had corruption in a different database. That was solved by restore and roll forward. We needed to upgrade to 10.2B08 for roll forward to work properly: Windows 64-bit 10.2B06 has a roll forward bug that prevented it from working.

36 Next Steps: implement a DR solution (OpenEdge Replication, rolling forward AI). Restore the backup and roll forward on the same machine; this verifies the backup is functional, and DB block corruption does not get replicated by roll forward.
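The school district's situation (a week-old backup, AI files kept for less than a week) can be stated as a simple coverage check. The Python sketch below is illustrative only: `roll_forward_possible` is a hypothetical helper, the dates and the 5-day retention are made up, and comparing dates is a simplification assumed here to keep the example short.

```python
from datetime import datetime, timedelta

def roll_forward_possible(last_good_backup, ai_retention, now):
    """True when retained AI files reach back at least as far as the backup.

    Simplifying assumption: AI files are kept for a fixed window ending now,
    with no gaps inside that window.
    """
    oldest_retained_ai = now - ai_retention
    return oldest_retained_ai <= last_good_backup

now = datetime(2015, 6, 8, 8, 0)            # hypothetical "today"
week_old_backup = now - timedelta(days=7)   # last good backup, 1 week old

# Retention shorter than the backup age: the AI chain has a hole.
print(roll_forward_possible(week_old_backup, timedelta(days=5), now))
# Retention longer than the backup age: restore plus roll forward can work.
print(roll_forward_possible(week_old_backup, timedelta(days=14), now))
```

The design point is that AI retention should be sized against the age of the last backup known to restore, not against the backup schedule.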

37 A case of the spins

38 A large medical center patched their software over the weekend. On Monday the performance of the system was unacceptable. The vendor says the patch was minor and could not be the cause of the issue. The customer says nothing else changed.

39 Specifics: OpenEdge 11.2.0 32-bit; Windows 2012 64-bit; database is 321 GB; number of users is 3,000.

40 Some Metrics: Month View
Date | CPs | Users | DB Requests | DB Reads | Ratio | Rec Rds | Rec Cr | Rec Up | Rec Del | DB Writes | BI Writes | AI Writes | Latch TO
05/16/15 (Sat) | 86 | 580 | 7,948,358,747 | 149,508,775 | 53 | 1,072,918,966 | 425,542 | 1,922,412 | 374,270 | 175,980 | 172,163 | 116,490 | 14,798
05/15/15 (Fri) | 261 | 2,792 | 12,987,557,626 | 114,936,649 | 113 | 2,465,462,910 | 1,520,296 | 3,227,568 | 1,626,692 | 1,886,768 | 449,449 | 295,333 | 115,003
05/14/15 (Thu) | 263 | 3,011 | 11,000,344,090 | 56,940,273 | 193 | 2,090,165,002 | 1,639,564 | 3,475,092 | 871,808 | 2,023,720 | 454,097 | 298,088 | 73,017
05/13/15 (Wed) | 323 | 3,126 | 10,371,051,213 | 55,142,493 | 188 | 1,879,551,662 | 2,250,168 | 3,423,760 | 1,099,930 | 2,378,306 | 525,070 | 374,006 | 885,294
05/12/15 (Tue) | 279 | 3,089 | 10,567,333,668 | 140,530,655 | 75 | 1,901,654,803 | 1,797,520 | 3,397,238 | 1,043,849 | 2,068,487 | 496,165 | 328,450 | 943,510
05/11/15 (Mon) | Restart
05/10/15 (Sun) | 73 | 385 | 10,806,473,996 | 206,617,341 | 52 | 2,307,235,660 | 307,377 | 1,804,694 | 368,589 | 100,764 | 150,257 | 102,579 | 244,087
05/09/15 (Sat) | 88 | 504 | 5,704,394,389 | 82,411,186 | 69 | 693,023,191 | 483,479 | 1,423,644 | 516,069 | 171,617 | 165,926 | 115,092 | 186,064
05/08/15 (Fri) | Restart
05/07/15 (Thu) | 271 | 2,940 | 10,046,740,997 | 145,481,723 | 69 | 1,596,756,358 | 1,705,661 | 3,503,669 | 924,082 | 2,153,671 | 455,228 | 306,003 | 128,058
05/06/15 (Wed) | 338 | 2,989 | 9,830,327,570 | 153,056,406 | 64 | 1,442,212,561 | 2,247,914 | 3,525,546 | 1,225,826 | 2,453,942 | 557,639 | 374,309 | 129,160
05/05/15 (Tue) | 293 | 2,967 | 10,392,149,949 | 154,806,221 | 67 | 1,593,242,356 | 2,000,392 | 3,366,955 | 1,126,177 | 2,324,533 | 488,949 | 329,102 | 171,067
05/04/15 (Mon) | 488 | 2,971 | 10,483,718,093 | 162,479,487 | 65 | 1,547,975,267 | 2,311,179 | 3,733,307 | 1,363,409 | 2,678,057 | 712,951 | 528,518 | 212,059
05/03/15 (Sun) | 125 | 484 | 11,161,696,099 | 217,504,812 | 51 | 1,884,717,953 | 331,006 | 1,783,868 | 1,243,395 | 270,902 | 222,981 | 156,128 | 23,917
05/02/15 (Sat) | 1,433 | 540 | 8,114,391,833 | 164,345,414 | 49 | 735,325,461 | 444,568 | 1,853,376 | 24,483,171 | 3,078,655 | 1,889,151 | 1,496,027 | 132,360
05/01/15 (Fri) | 374 | 2,735 | 11,611,724,202 | 126,877,164 | 92 | 1,815,943,455 | 2,450,987 | 3,063,577 | 1,458,166 | 2,046,184 | 590,195 | 411,705 | 3,268,221
Highlighted on the slide: CPs (323, 279, 293, 488) and Rec Rds (1,879,551,662; 1,901,654,803; 1,593,242,356; 1,547,975,267) for 05/13, 05/12, 05/05, and 05/04.

41 Bad Day, 15-minute samples
Sample | Time | CPs | Users | DB Requests | DB Reads | Ratio | Rec Rds | Rec Cr | Rec Up | Rec Del | DB Writes | BI Writes | AI Writes | Latch TO
41 | 10:05:01 | 2 | 3,033 | 106,609,219 | 36,595,922 | 3 | 11,840,580 | 9,664 | 31,257 | 2,338 | 19,189 | 4,858 | 4,499 | 88,929
42 | 10:20:01 | 2 | 3,052 | 102,685,946 | 27,744,828 | 4 | 12,858,412 | 10,923 | 33,024 | 2,407 | 21,026 | 5,858 | 4,660 | 99,365
43 | 10:35:02 | 2 | 3,082 | 97,655,645 | 1,303,656 | 75 | 17,250,593 | 13,814 | 38,947 | 3,318 | 27,611 | 7,075 | 5,046 | 107,799
44 | 10:50:01 | 3 | 3,089 | 81,674,392 | 1,293,722 | 63 | 14,030,509 | 22,289 | 36,321 | 5,409 | 25,428 | 7,604 | 5,516 | 100,920
45 | 11:05:01 | 7 | 3,086 | 214,447,121 | 1,716,183 | 125 | 61,973,396 | 40,122 | 58,595 | 29,276 | 59,987 | 13,919 | 8,509 | 30,332
46 | 11:20:01 | 5 | 3,039 | 155,915,767 | 1,492,150 | 104 | 34,202,285 | 25,758 | 57,494 | 14,197 | 48,054 | 8,778 | 4,993 | 4,498
47 | 11:35:01 | 4 | 3,040 | 156,151,501 | 1,434,288 | 109 | 34,103,824 | 27,304 | 60,045 | 7,791 | 48,323 | 8,285 | 4,571 | 3,952
48 | 11:50:01 | 5 | 2,888 | 146,245,414 | 1,666,633 | 88 | 33,019,801 | 33,605 | 60,463 | 11,379 | 52,606 | 8,566 | 5,226 | 2,711
Good Day, 15-minute samples
Sample | Time | CPs | Users | DB Requests | DB Reads | Ratio | Rec Rds | Rec Cr | Rec Up | Rec Del | DB Writes | BI Writes | AI Writes | Latch TO
41 | 10:05:01 | 4 | 2,848 | 153,746,951 | 2,343,049 | 66 | 28,340,783 | 30,489 | 60,805 | 7,666 | 55,818 | 7,757 | 4,774 | 5,928
42 | 10:20:01 | 5 | 2,812 | 145,441,871 | 1,755,123 | 83 | 27,387,070 | 26,490 | 59,676 | 13,279 | 53,195 | 6,642 | 4,962 | 3,637
43 | 10:35:01 | 4 | 2,877 | 151,783,516 | 1,876,013 | 81 | 27,653,297 | 30,446 | 61,754 | 11,262 | 54,906 | 7,899 | 5,192 | 6,777
44 | 10:50:01 | 7 | 2,894 | 143,780,080 | 1,877,651 | 77 | 29,215,543 | 46,234 | 63,980 | 19,392 | 66,429 | 11,820 | 6,774 | 7,185
45 | 11:05:02 | 4 | 2,912 | 158,495,087 | 1,808,835 | 88 | 30,191,428 | 34,806 | 63,070 | 12,040 | 59,041 | 9,215 | 5,284 | 10,165
46 | 11:20:01 | 6 | 2,897 | 155,845,110 | 2,259,787 | 69 | 27,841,346 | 34,727 | 60,416 | 12,770 | 59,998 | 8,929 | 5,497 | 7,237
47 | 11:35:01 | 8 | 2,938 | 150,662,822 | 2,195,738 | 69 | 25,976,744 | 70,239 | 83,419 | 20,193 | 82,777 | 12,552 | 8,542 | 6,035
48 | 11:50:01 | 4 | 2,914 | 138,147,804 | 1,774,289 | 78 | 23,570,981 | 31,236 | 59,286 | 12,957 | 57,756 | 7,696 | 4,971 | 2,731

42 Bad Day and Good Day 15-minute samples (same data as slide 41), with the Latch TO column called out. Bad Day Latch TO, samples 41 to 48: 88,929; 99,365; 107,799; 100,920; 30,332; 4,498; 3,952; 2,711. Good Day Latch TO, samples 41 to 48: 5,928; 3,637; 6,777; 7,185; 10,165; 7,237; 6,035; 2,731.

43 Some Metrics (same month view as slide 40), with Latch TO values highlighted: 05/13/15 885,294; 05/12/15 943,510; 05/07/15 128,058; 05/06/15 129,160; 05/01/15 3,268,221 (the slide callout reads 3,268,211).

44 Latch Timeouts increased. CRUD Operations Decreased. Why? Nothing had changed
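The comparison behind this observation can be sketched in a few lines of Python. The figures are samples 41 to 44 of the bad-day and good-day 15-minute tables as read from the flattened slide text, so treat them as a best-effort transcription; `summarize` is a hypothetical helper, and "record writes" here means creates plus updates plus deletes only.

```python
# (rec_created, rec_updated, rec_deleted, latch_timeouts) per 15-minute sample,
# samples 41 to 44, transcribed from the slide tables.
BAD_DAY = [
    (9_664, 31_257, 2_338, 88_929),
    (10_923, 33_024, 2_407, 99_365),
    (13_814, 38_947, 3_318, 107_799),
    (22_289, 36_321, 5_409, 100_920),
]
GOOD_DAY = [
    (30_489, 60_805, 7_666, 5_928),
    (26_490, 59_676, 13_279, 3_637),
    (30_446, 61_754, 11_262, 6_777),
    (46_234, 63_980, 19_392, 7_185),
]

def summarize(samples):
    """Return (total record writes, total latch timeouts) for the samples."""
    writes = sum(cr + up + de for cr, up, de, _ in samples)
    latch = sum(lt for _, _, _, lt in samples)
    return writes, latch

bad_writes, bad_latch = summarize(BAD_DAY)
good_writes, good_latch = summarize(GOOD_DAY)
print(f"record writes:  bad={bad_writes:,}  good={good_writes:,}")
print(f"latch timeouts: bad={bad_latch:,}  good={good_latch:,}")
```

Over that hour the bad day shows roughly 17 times the latch timeouts (397,013 versus 23,527) while completing about half the record writes (209,711 versus 431,473), which is the pattern the slide describes.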

45 Further investigation revealed that the -spin setting had been changed from 96,000 to 20,000. The change was a move toward best practice: industry experts have been saying not to set -spin higher than 20,000.

46 The change was made months back in the conmgr.properties file and was long forgotten. When the patch was applied, the database was bounced and the change finally took effect. While no one remembered a configuration change, the change was there. Setting -spin back up to 96,000 got them the performance back.
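One cheap guard against this kind of forgotten change is to compare when a configuration file was last modified with when the database was last started. The Python sketch below assumes both timestamps are available; the function names and the example dates are hypothetical, and nothing here reads or interprets the contents of conmgr.properties.

```python
import os
from datetime import datetime

def config_changed_since_start(config_modified, db_started):
    """True when the config file was edited after the database last started,
    so the running database may not reflect what the file now says."""
    return config_modified > db_started

def file_modified_at(path):
    """Last-modified time of a file as a naive local datetime."""
    return datetime.fromtimestamp(os.path.getmtime(path))

# Hypothetical timeline: database started in January, file edited in February.
started = datetime(2015, 1, 4, 2, 0)
edited = datetime(2015, 2, 10, 14, 30)
print(config_changed_since_start(edited, started))
```

A `True` result means a restart will change behaviour, which is worth knowing before a restart is blamed on a patch applied the same weekend.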

47 Bad Day and Good Day 15-minute samples (same data as slide 41).

48 But WAIT! There’s more

49 A different customer added a few CPUs to their environment. When the users log in, the CPUs peg at 100% utilization. Performance suffers. WebSpeed launches additional agents because all agents are busy. Specifics: customer database is > 1 TB; 430 WebSpeed agents; AIX; OpenEdge 10.1C 64-bit.

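50 5-minute samples
Sample | Time | CPs | Users | DB Requests | DB Reads | Ratio | Rec Rds | Rec Cr | Rec Up | Rec Del | DB Writes | BI Writes | AI Writes | Latch TO
171 | 14:10:00 | 0 | 433 | 226,177,323 | 38,138 | 5,930 | 92,799,768 | 30,190 | 6,316 | 0 | 4,999 | 3,311 | 1,859 | 16,058
172 | 14:15:00 | 1 | 434 | 230,752,781 | 45,463 | 5,076 | 95,312,150 | 15,075 | 5,079 | 0 | 5,253 | 1,218 | 1,022 | 15,475
173 | 14:20:00 | 1 | 432 | 227,272,169 | 47,269 | 4,808 | 90,336,327 | 21,173 | 5,439 | 81 | 4,904 | 2,366 | 1,409 | 21,450
174 | 14:25:00 | 0 | 433 | 174,847,554 | 50,871 | 3,437 | 68,054,671 | 14,028 | 5,468 | 0 | 5,425 | 1,797 | 976 | 8,608
175 | 14:30:00 | 0 | 433 | 201,167,309 | 53,661 | 3,749 | 72,196,316 | 15,100 | 5,109 | 0 | 6,304 | 1,808 | 1,032 | 20,395
176 | 14:35:00 | 1 | 433 | 272,198,783 | 69,170 | 3,935 | 104,501,989 | 25,086 | 7,389 | 4 | 5,431 | 2,913 | 1,597 | 48,935
177 | 14:40:00 | 1 | 434 | 261,870,509 | 100,191 | 2,614 | 97,340,387 | 23,871 | 5,737 | 42 | 7,930 | 1,784 | 1,504 | 58,078
178 | 14:45:01 | 0 | 434 | 264,460,116 | 391,712 | 675 | 103,827,717 | 19,668 | 5,931 | 46 | 8,671 | 1,694 | 1,284 | 93,499
179 | 14:50:00 | 1 | 434 | 249,536,969 | 779,568 | 320 | 88,726,157 | 23,008 | 5,444 | 0 | 7,872 | 2,543 | 1,453 | 123,551
180 | 14:55:00 | 1 | 435 | 279,690,881 | 155,333 | 1,801 | 108,846,566 | 22,640 | 7,233 | 24 | 6,050 | 2,668 | 1,470 | 72,701
181 | 15:00:00 | 0 | 433 | 268,670,791 | 539,202 | 498 | 104,316,852 | 24,230 | 7,554 | 0 | 6,147 | 2,677 | 1,557 | 64,974
182 | 15:05:00 | 1 | 435 | 259,585,414 | 161,012 | 1,612 | 107,194,651 | 21,828 | 6,552 | 63 | 7,095 | 1,739 | 1,375 | 38,665
183 | 15:10:00 | 1 | 435 | 245,056,072 | 316,736 | 774 | 101,424,285 | 25,343 | 5,862 | 0 | 5,973 | 2,401 | 1,522 | 28,853
Note: in samples 173, 177, 178, 180, and 182 the source runs Rec Del and DB Writes together; the split shown is inferred from the DB Writes magnitude in the other rows.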

51 Unlike the previous example, we had no historical performance metrics from when things were good to compare against. We could only rely on instincts and experience.

52 A Different View: in a 5-minute sample, the highest latch timeout count should be no more than 3,000.
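Applied to the 5-minute samples on slide 50, that rule of thumb can be checked mechanically. The Python sketch below uses the Latch TO column from samples 171 to 183; `over_limit` is a hypothetical helper and the 3,000 limit is the only threshold taken from the slides.

```python
LATCH_TIMEOUT_LIMIT = 3_000  # per 5-minute sample, the rule of thumb above

# Latch TO per 5-minute sample, keyed by sample number (slide 50).
LATCH_TIMEOUTS = {
    171: 16_058, 172: 15_475, 173: 21_450, 174: 8_608, 175: 20_395,
    176: 48_935, 177: 58_078, 178: 93_499, 179: 123_551, 180: 72_701,
    181: 64_974, 182: 38_665, 183: 28_853,
}

def over_limit(samples, limit=LATCH_TIMEOUT_LIMIT):
    """Return {sample: (count, count / limit)} for samples above the limit."""
    return {s: (n, n / limit) for s, n in samples.items() if n > limit}

flagged = over_limit(LATCH_TIMEOUTS)
worst = max(flagged, key=lambda s: flagged[s][0])
print(f"{len(flagged)} of {len(LATCH_TIMEOUTS)} samples exceed {LATCH_TIMEOUT_LIMIT:,}")
print(f"worst sample: {worst} with {flagged[worst][0]:,} latch timeouts")
```

Every one of the 13 samples is over the limit: the quietest (sample 174) is nearly three times the threshold, and the peak (sample 179) is about 41 times it.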

53 Changed -spin from 60,000 to 20,000 and the problem went away.

54 Lesson Learned: there is no one setting that will work for every situation. Changing -spin from 20,000 to 96,000 helped one customer; changing -spin from 60,000 to 20,000 helped another. Having historical data is key. Don’t assume nothing has changed just because they said so: configuration changes usually only take effect at the next startup.

55 Summary: these are examples of some real-world database problems. Don’t assume things can’t go wrong. Having a plan is not enough: testing the plan and having confidence in it is required. If all else fails, seek professional help.

56 Gus B Mike F Dan F Chris R Roadies: Paul Coveney, Darren Rhoads, Tom Cattigan, Joe Rozenberg Jeff Keller, Marek Bujnarowski, Ajit Deodhar Groupies: Dave Eddy, Humphrey Koraag, Diego Canziani, Kim Davies
