1 Database Corruption: Be prepared, not scared.
Richard Banville, Fellow, OpenEdge Development, Progress Software
2 Dealing With Database Corruption
Preparation
  Prepare for the worst, hope for the best
Prevention
  Stopping corruption before it happens
  Avoiding foolish behavior
Detection
  Identifying you have a problem
  Pinpointing the cause
Reaction
  Resolving corruption with least impact
3 Types Of Corruption
Corruption can be small or widespread
User based corruption
Internal system based corruption
Physical
  Block level corruption
  Hardware: bad disk, memory, etc.
Logical
  Missing data
  Relational issues
  Data access
  Index issues
4 Be Prepared
Modern release (all facets of deployment)
Backups: perform regularly
  Back up database AND application
  Perform large backups with split mirrors
  Run online backup with -Bp
  TEST your backups with restore & access or hot standby
  prorest -vp: validates that data was written successfully (not that the proper data was written)
  prorest -vf: compares against original, but who wants to be down that long?
  Use offsite storage
Run with AI enabled
  Put AI files on separate disk/separate controller
  AI management tool makes AI management easy
prorest <db> -vp
prorest <db> -vf
We are adding an AI archive utility to help with the maintenance of running with AI.
5 Be Really Prepared
Keep hot standby
  Continually roll forward AI files
  OpenEdge Replication
Have a comprehensive recovery strategy
  Audit changes
  Plan for natural disasters
  Plan for not so natural disasters
  Document and test your recovery strategy
  Educate at all levels of organization
Implement redundancy
  Failover clusters
  Have a duplicate remote site
6 Database Consistency Checking
Seen these messages before?
  Index name in customer for recid could not be deleted.
  Wrong key in index 10 for record
  Invalid size of an index entry.
7 Database Consistency Checking
Or how about these…
  Invalid RM block for area 10
  rmdoins: pbk->free went negative dbkey 4096
  bkwrite: bktbl dbk 4096 not equal to bkbuf dbk
  bkaddr called with negative blkaddr: -1234
8 Database Consistency Checking
Stop shared memory problems before they happen
Memory overwrite protection: -MemCheck
  Ensure block changes are written to the proper shm location
[Diagram: Buffer 1, Buffer 2. Insert new key entry. Oops! Miscalculation results in memory stomp of next block header.]
2 new types of consistency checking
Object level for single object: a hierarchy of checking
9 Database Consistency Checking
Stop database corruption from becoming persistent
Physical block consistency checking
  -DbCheck
    Validates record and index blocks after each update operation
  -AreaCheck "area name"
  -IndexCheck "index name"
  -TableCheck "table name"
Typically the result of a bug
Available for OLTP and roll forward
2 new types of consistency checking
Object level for single object: a hierarchy of checking
10 Enabling Database Consistency Checking
Database startup parameter (-MemCheck, -DbCheck)
Managed via promon: R&D, 4. Admin Functions, 8. Block level consistency check
  Current consistency check status:
    1. -MemCheck: enabled
    2. -DbCheck: enabled
    3. -AreaCheck: disabled
    4. -IndexCheck: disabled
    5. -TableCheck: disabled
  Enter the option to enable/disable a consistency check:
Explain a scenario where someone would want to run this during roll forward.
11 Database Consistency Checking Performance Impact
Memory checking: unnoticeable impact (< 1%)
Block level checking: still reasonable (~5%)
On error, get .lg file to Progress Technical Support
  Current consistency check status:
    1. -MemCheck: enabled
    2. -DbCheck: enabled
    3. -AreaCheck: disabled
    4. -IndexCheck: disabled
    5. -TableCheck: disabled
  Enter the option to enable/disable a consistency check:
Explain a scenario where someone would want to run this during roll forward.
12 Identifying Problem Types and Reacting
There are many ways for data to get corrupted
Identifying corruption type
  Key word association can help direct recovery effort
  Understanding process can also help
Quickest way to recovery
  Knowing the tools & which to use is key
  Practice recovery efforts before needed
Let's examine a few
13 Index Issues
Index messaging (ix, cx, ky)
Key words: Index, Root block, Key entry, B-tree, Cursor
Index <i> in <t> for recid <r> could not be deleted. (1422)
  Logical corruption: missing entries or record not found
Index <i>, block <b>, element no. 1: bad compression size. (4423)
  Physical corruption: storage format of index is incorrect
How to proceed
14 Index Validation Tools
proutil <db> -C idxcheck
Idxcheck online validation levels
  Physical/block corruption
    Physical consistency
  Logical/key entry corruption
    Keys to records
    Records to keys
    Validate key order
Lock table option
New index rebuild may be faster!
15 Index Validation & Repair Tools
proutil <db> -C idxfix
Index Fix Utility
  1. Scan records for missing index entries.
  2. Scan indexes for invalid index entries.
  3. Both 1 and 2 above.
  4. Cross-reference check of multiple indexes for a table.
  5. Build indexes from existing indexes.
  6. Delete one record and it's index entries.
  7. Quit.
Select one of the following:
  All (a/A) - Fix all the indexes
  Some (s/S) - Fix only some of the indexes
  By Area (r/R) - Fix indexes in selected areas
  By Schema (c/C) - Fix indexes by schema owners
  By Table (t/T) - Fix indexes in selected tables
  By Activation (v/V) - Fix selected active or inactive indexes
Fix indexes on Scan.
Is this correct? (y/n)
16 Index Validation & Repair Tools
proutil <db> -C idxfix
Index Fix Utility
  1. Scan records for missing index entries.
  2. Scan indexes for invalid index entries.
  3. Both 1 and 2 above.
  4. Cross-reference check of multiple indexes for a table.
  5. Build indexes from existing indexes.
  6. Delete one record and it's index entries.
  7. Quit.
Online operation
Transactions are relatively small
Does not fix physical block corruption
One concurrent idxfix process per table
17 Using Index Fix: Record but no index entry
proutil <db> -C idxfix
OLTP (.lg and screen):
  Index name in customer for recid could not be deleted.
1. Scan records for missing index entries:
  Index 12 (customer, name): couldn't find key <RICHB> recid
Option #1: Add key entry to index
  1. Scan records for missing index entries.
  Fix indexes on Scan. Yes
NOTE: "2. Scan indexes for invalid index entries." would NOT report an error!
[Diagram: records 16689 (aaaa), 16690 (bbbb), 16691 (richb), each with Field2, Field3, Field4; indexes 10, 11, 12]
18 Using Index Fix: Record but no index entry
proutil <db> -C idxfix
OLTP (.lg and screen):
  Index name in customer for recid could not be deleted.
1. Scan records for missing index entries:
  Index 12 (customer, name): couldn't find key <RICHB> recid
Option #2: Delete record and its key entry in table's other indexes
  6. Delete one record and it's index entries.
  Type the recid to delete: 16691
  Type the area (number) for the recid(s): 8
  Look in the .st file to match area number and area name.
Find first cust where recid(cust) = display cust
[Diagram: records 16689 (aaaa), 16690 (bbbb), 16691 (richb), each with Field2, Field3, Field4; indexes 10, 11, 12]
19 Using Index Fix: Index entry but no record
proutil <db> -C idxfix
Often no runtime error reported.
2. Scan indexes for invalid index entries:
  Index 12 (customer, name): found invalid key <RICHB> recid
Only option: remove invalid key entry
  2. Scan indexes for invalid index entries
  Fix indexes on Scan. Yes
NOTE: "1. Scan records for missing index entries." would NOT report an error!
[Diagram: indexes 10, 11, 12]
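The two notes on the idxfix slides (each scan option is blind to the other kind of mismatch) can be summarized in a small lookup. This is only an illustrative Python sketch: the symptom labels and function names are hypothetical, and only the menu option numbers come from the idxfix menu shown in the slides.

```python
# Which idxfix scan option reports which kind of logical index mismatch.
# The symptom labels are hypothetical names chosen for this sketch;
# the option numbers are the ones shown in the idxfix menu.
SCAN_OPTION_FOR_SYMPTOM = {
    "record_without_index_entry": 1,  # 1. Scan records for missing index entries
    "index_entry_without_record": 2,  # 2. Scan indexes for invalid index entries
}

BOTH_SCANS = 3  # 3. Both 1 and 2 above


def idxfix_scan_option(symptom):
    """Return the menu option that detects the symptom, or 3 if it is unknown."""
    return SCAN_OPTION_FOR_SYMPTOM.get(symptom, BOTH_SCANS)


def detects(option, symptom):
    """True if running the given scan option would report the symptom."""
    return option in (SCAN_OPTION_FOR_SYMPTOM[symptom], BOTH_SCANS)
```

For example, `detects(2, "record_without_index_entry")` is false, which is the situation the NOTE on the slide warns about; when the symptom is unclear, option 3 covers both directions.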
20 Fixing Index Corruption (continued)
Missing key entries or record not found (logical corruption)
  Index fix
  Action based on record removal or index entry insert/delete
Index <i>, block <b>, element no. 1: bad compression size
  Physical b-tree corruption
  Must rebuild index to recover
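The "key word association" idea for index messages could be sketched as a lookup on the message number at the end of a .lg line. This Python sketch is an assumption-laden illustration: the table layout, the function name, and the sample log line in the usage note are hypothetical; the message numbers (1422, 4423) and the suggested actions are the ones given in the slides.

```python
import re

# Message number -> (corruption type, suggested action), as stated in the slides.
# The dictionary layout and names are assumptions of this sketch.
INDEX_TRIAGE = {
    1422: ("logical index corruption", "index fix: proutil <db> -C idxfix"),
    4423: ("physical index corruption", "rebuild the index"),
}

# Progress messages end with their number in parentheses, e.g. "... (1422)".
_MESSAGE_NUMBER = re.compile(r"\((\d+)\)\s*$")


def triage_index_message(log_line):
    """Return (corruption type, suggested action), or None if not recognized."""
    match = _MESSAGE_NUMBER.search(log_line)
    if match is None:
        return None
    return INDEX_TRIAGE.get(int(match.group(1)))
```

Calling `triage_index_message` on a line such as "Index name in customer for recid 16691 could not be deleted. (1422)" (a made-up sample) returns the logical-corruption entry, pointing at idxfix rather than a full rebuild.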
21 Index Repair Tools
proutil <db> -C idxbuild
Offline utility
Performance improvements since 10.2b06
Will repair:
  Index block corruption (physical)
  Orphan index blocks
  Adds missing index entries
Assumes record data is correct
Flexible options (db, area, table, index)
Truncates existing BI file
Does not record idxbuild changes into BI file
22 Index Rebuild Performance
Parameter suggestions
  -TB               sort block size: 64
  -datascanthreads  # threads for data scan phase: 1.5 * # CPUs
  -TMB              merge block size (default -TB): 64
  -TF               merge pool fraction of system memory: 80%
  -mergethreads     # threads per concurrent merge: 1.5 * # CPUs
  -threadnum        # concurrent sort groups merging: 2 to 4
  -TM               # merge buffers to merge each pass: 32
  -rusage           report system usage statistics
  -silent           a bit quieter than before
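As a rough illustration of how the suggestions above scale with the machine, the following Python sketch derives the parameter list from a CPU count. The function name and the choice to round 1.5 * CPUs down are assumptions of this sketch; the parameter names and values are taken as listed on the slide and should be checked against the documentation for the release in use.

```python
def idxbuild_parameters(cpu_count, threadnum=2):
    """Build the suggested idxbuild tuning parameters for a given CPU count.

    Values follow the slide: 1.5 * CPUs for the thread counts (rounded
    down here, which is an assumption), 2 to 4 for -threadnum.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    if not 2 <= threadnum <= 4:
        raise ValueError("the slide suggests 2 to 4 for -threadnum")

    threads = max(1, int(1.5 * cpu_count))
    return [
        "-TB", "64",                        # sort block size
        "-TMB", "64",                       # merge block size
        "-TF", "80",                        # merge pool fraction of system memory (%)
        "-TM", "32",                        # merge buffers per pass
        "-datascanthreads", str(threads),   # threads for data scan phase
        "-mergethreads", str(threads),      # threads per concurrent merge
        "-threadnum", str(threadnum),       # concurrent sort groups merging
        "-rusage",                          # report system usage statistics
        "-silent",                          # quieter output
    ]
```

On an 8-CPU machine, `" ".join(idxbuild_parameters(8))` gives a string that could be appended to `proutil <db> -C idxbuild`, with both thread counts set to 12.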
23 Index Build/Repair Tools
proutil <db> -C idxactivate <i1> useindex <i2>
Builds and activates index
Online
  One concurrent idxactivate process per table
  Requires client schema re-cache
  Transaction size based on "recs" parameter
  Deactivate requires exclusive access
Repair logical and physical index corruption
  Assumes valid record data
*** Static queries require recompile to consider new index
24 Record Issues
Record messaging (rm, bf, rec)
Key words: Record, recid, rowid, field
bffld: nxtfld: scan past last field. (16)
  Looking for field #5 but only 4 fields exist
Record continuation not found, fragment recid <r> area <a>. (10831)
  Pointer to next record fragment is invalid
How to proceed
[Diagram: Record Fragment 1 with Field1, Field2, Field3, Field4]
25 Checking For Inconsistencies
Online
proutil <db> -C dbanalys | tabanalys
  Reads record for statistics purposes
dbtool <db>
  Physical validation
    5. Read or Validate Database Block(s)
    Validation levels
      0: Block header info only
      1: Record header & record size
      2: Record overlap checking
  Logical validation w/schema
    3. Record Validation
    4. Record Version Validation
-MemCheck and -DbCheck are meant to ensure that the database remains consistent at runtime. Dbtool can be run to ensure that the database doesn't already contain physical inconsistencies.
26 Record Repair Tools
bffld: nxtfld: scan past last field. (16)
dbtool <db>
  Online and multi-threaded
  6. Record Fixup
    Adds missing fields
    Removes invalid "end-rec" indicator
proutil <db> -C idxfix
  6. Delete one record and it's index entries
27 Record Repair Tools
Record continuation not found, fragment recid <r> area <a>.
proutil <db> -C dbrpr
  3. Remove Bad Record Fragment
  14. Display Record Contents
  Exclusive access
  Truncate bi file
[Diagram: Record Fragment 1]
28 More Record Repair Tools
Record continuation not found, fragment recid <r> area <a>
[Diagram: Record Fragment 1]
Warning: The use of dbrpr to fix problems in the database should be done with the assistance of Progress Technical Support.
29 Dbrpr Record Fix-up Example – Last resort
Before you do anything: Validate current backup
proutil <db> -C truncate bi
proutil <db> -C dbrpr
Options:
  1. Database Scan Menu
  2. Test One or More Indexes
  3. Remove Bad Record Fragment
  4. Dump Block
  5. Load Block
  6. Copy Bytes Between Files
  7. Load RM Dump File
  8. Reformat Block to a Free Block
  9. Change Current Working Area
  10. Display the Free Chain
  11. Display the RM Chain
  12. Display the Index Delete Chain
  13. Display Block Contents
  14. Display Record Contents
  15. Display Cluster Chain
  16. Scan/Fix block checksum
30 Dbrpr Record Fix-up Example – Last resort
Record continuation not found, fragment recid area 8 3.
Before you do anything: Validate current backup
proutil <db> -C truncate bi
proutil <db> -C dbrpr
Validate bad record info
  1. Database Scan Menu
    1. Report Bad Blocks
    3. Fix Bad Blocks
    4. Report Bad Records
    5. Delete Bad Records
    6. Dump Records to RM File
    7. Rebuild Free Chain
    8. Rebuild RM Chain
    9. Rebuild Index Delete Chain
    10. Change Current Working Area
    11. Fix Cluster Chains in Type II Area
31 Dbrpr Record Fix-up Example – Last resort
Record continuation not found, fragment recid area 8 3.
proutil <db> -C truncate bi
proutil <db> -C dbrpr
Get a view of what you are going to delete:
  9. Change Current Working Area
  13. Display Block Contents
    1. Dump Data Block Details
    6. Start Dbkey
Delete partial record
  3. Remove Bad Record Fragment
Re-validate (see previous screen)
[Block dump table with columns Offset, Len, Hex, Ascii; extracted values: 1910x64d2150xrichb30""3520x6d61MA30x626262BBB]
32 Other Record Oriented Repair Tools
proutil <db> -C dump <table> . -index <i>
Binary dump
  Online & multi-threaded
  Binary record format
  May not fix individual record corruption
  May fail when encountering physical corruption
  Use selective binary dump to dump in ranges
-index defaults to primary index
  Use different index if primary cannot be used
  Use -index 0 if no valid index exists (Type II storage area)
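The index choice for a binary dump described above can be written down as a small decision function. This is a Python sketch for illustration only: the function name, its arguments, and the sample database, table, and index number in the usage note are hypothetical; the command shape and the `-index` rules come from the slide.

```python
def binary_dump_command(db, table, primary_usable=True, alternate_index=None):
    """Build a proutil binary dump command, choosing the -index argument.

    alternate_index is the number of another usable index, or None when
    no valid index exists (the slide ties -index 0 to Type II storage areas).
    """
    command = ["proutil", db, "-C", "dump", table, "."]
    if primary_usable:
        return command                      # -index defaults to the primary index
    if alternate_index is not None:
        return command + ["-index", str(alternate_index)]
    return command + ["-index", "0"]        # no valid index exists
```

For instance, `binary_dump_command("mydb", "item", primary_usable=False, alternate_index=12)` ends in `-index 12`, while omitting `alternate_index` falls back to `-index 0`.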
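33 Other Record Repair Mechanisms
Dump records in "PUB" schema by rowid
Manual ASCII dump and load "repair"
Reload w/bulk load or ABL import
Specify index to use or TABLE-SCAN

DEFINE VARIABLE ix AS INTEGER NO-UNDO.

/* Look up the schema record so the template record can be skipped. */
FIND _file "item".
OUTPUT TO item.d.

/* Make sure the upper bound is large enough! */
DO ix = 1 TO 10000:
    /* Probe each recid directly; NO-ERROR skips recids that hold no item record. */
    FIND item WHERE RECID(item) = ix NO-ERROR.
    IF AVAILABLE item AND ix <> INTEGER(_file._template) THEN
        EXPORT item.
END.

OUTPUT CLOSE.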
34 Block Issues
Block and shared memory buffer messages (bkio, bk, bm)
Key words: Block, Area, Dbkey, Buffer, Extent
Wrong dbkey in block. Found <x>, should be <y> in area <z>. (1124)
  Read, write, modify, release
  Most often O/S file system issue
  Reboot often fixes this error, but why?
bkioWrite: Unknown O/S error during write, errno 2, fd <x>, len <y>, offset <z>, filename <s> database <t>. (14676)
Attempt to read block <n> which does not exist in area <a>. (201)
  Often index rebuild will fix this error (rebuild on area level)
35 Block Repair Tools
Checksum validation of dbkey <d> block type 4 in area <a> does not match data. Expected: <e> received <r>. (14410)
proutil <db> -C dbrpr
  16. Scan/Fix block checksum (Type II Area)
    1. Report Bad Checksum
    2. Fix Bad Checksum
Ignore for free blocks (block type 4)
Validate database by other means prior to "fixing"
True corruption will require a database rebuild
  dump and load
  restore/roll forward
Block types: Master block: 1, Record block: 3, Free block: 4, Index block: 5
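The "ignore for free blocks" rule can be applied mechanically when scanning a .lg file for checksum messages. The Python sketch below is a hedged illustration: the function name, the returned dictionary, and the sample line in the usage note are hypothetical; the message wording and the block type numbers are the ones listed on the slide.

```python
import re

# Block type numbers as listed on the slide.
BLOCK_TYPES = {1: "master", 3: "record", 4: "free", 5: "index"}

# Matches the leading part of message 14410 with numeric placeholders filled in.
_CHECKSUM_MESSAGE = re.compile(
    r"Checksum validation of dbkey (\d+) block type (\d+) in area (\d+) "
    r"does not match data"
)


def classify_checksum_message(log_line):
    """Parse a checksum mismatch message; return None if the line does not match."""
    match = _CHECKSUM_MESSAGE.search(log_line)
    if match is None:
        return None
    dbkey, block_type, area = (int(group) for group in match.groups())
    return {
        "dbkey": dbkey,
        "area": area,
        "block_type": BLOCK_TYPES.get(block_type, "unknown"),
        # Per the slide, a mismatch on a free block (type 4) can be ignored.
        "ignorable": block_type == 4,
    }
```

A made-up line such as "Checksum validation of dbkey 4096 block type 4 in area 8 does not match data. Expected: 1 received 2. (14410)" is classified as a free block and flagged ignorable; any other block type should be validated by other means before "fixing".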
36 Block Chain Repair Tools
RM chain count inconsistency.
  20 Blocks indicated on record free chain (actually 5)
RM block found not on RM chain, but flagged RM chain.
  RM block free chain link error
<type> Block <number> with invalid chain type <number> on RM chain
  Free block marked on free chain but linked into RM chain
[Diagram: RM, RM, RM, FREE]
37 Block Chain Repair Tools
RM chain count inconsistency.
RM block found not on RM chain, but flagged RM chain.
<name> Block <number> with invalid chain type <number> on RM chain
proutil <db> -C dbrpr
  1. Database Scan Menu
    7. Rebuild Free Chain
    8. Rebuild RM Chain
    9. Rebuild Index Delete Chain
    11. Fix Cluster Chains in Type II Area
Rebuild free chains/rm chains from dbrpr
Seek help from support
38 Recovery Manager Issues
Recovery messages
Key words: Recovery (rl), Transaction (tm), Before image (bi, b<n>), After image (ai, a<n>), Undo, Redo, Retry
** The after-image file expected Tue Feb 26 16:47: (832)
** Those dates don't match, so you have the wrong copy of one of them. (833)
Undo failed to reproduce the record in area <a> with rowid <r> and return code -1. (10566)
Invalid block <x> for file <y>.a3, max is 1024 (2329)
How to proceed
  Restore / roll forward
  Switch to hot standby
NOTE: tm may be a soft error
39 Recovering From Recovery Failures
I've got no backup & crash recovery won't work?
proutil <db> -C truncate bi -G 120
  Looks further back in BI.
  Should no longer be needed but it's worth a try!
**** As a very last resort, force truncate
proutil <db> -C truncate bi -F
  What are the side effects of skipping crash recovery?
  -F: How bad could it be?
    Dump and re-load into new database
    Reconcile data contents and relationships after load
    Backup & enable AI
    Maintain hot standby
40 Structural Repair
Those dates don't match, so you have the wrong copy of one of them.
  Usually the result of an OS copy or move
  Make sure all the right pieces are in place & .st file identifies them correctly
prostrct repair <db> <x>.st
  Does NOT repair corrupt database
  Updates path names to those specified in .st file
prostrct unlock <db> <x>.st
  Use "sparingly"
  Patches date mismatch & creates dummy extents
  Use to recover whatever data remains when no backup exists
41 Structural Repair
rm x.db - Ooops!
prostrct builddb <db> <db>.st
  Rebuild database "control area" (.db file) from .st file
  Changes to control area are not logged
  Cancelling a txn that changes control area may require builddb
  May force re-base for OpenEdge Replication
prostrct list <db>
  Always have an up to date .st file
42 Summary
The many faces of corruption
Corruption shows itself in many different ways
  Hard and soft corruption
  Memory and disk. Record, index, block and db structure
Some repair tools are a loaded gun
  In the wrong hands they can produce havoc
Preparation is your best way to recovery
  Standard disaster recovery preparations
  Knowing options before problems occur
44 www.progress.com/exchange-pug October 6–9, 2013 • Boston #PRGS13
Special low rate of $495 for PUG Challenge attendees with the code PUGAM.
And visit the Progress booth to learn more about the Progress App Dev Challenge!
46 Recovering From Recovery Failures
Time to restore/roll forward
Or switch to hot standby
What if roll forward fails?
  Use roll forward verify
  Roll forward to point in time or transaction #
[Diagram: myDb, ai, ftp (X), Hot Standby, Roll forward]
SYSTEM ERROR: Attempt to read block which does not exist in area 8, database x.
** Save file named core for analysis by Progress Software Corporation.
47 AI Validation Before Application
rfutil <db> -C aiverify <type>
Partial: AI block and note header validation
  Increases reliability of archived AI files
Full: partial + note data validation
  Identifies point in time recovery
Running
  At AI switch or on AI archival
  Just before roll forward of extent
  Preferably on hot standby
  "Any" <db> will do
Aiverify partial released in 10.1b02. Full released in 10.1c.
Run if trying to recover as much as possible from a damaged ai file.
More than ai scan verbose.
Finds ai block corruption caused by such things as ftp problems etc.
48 Roll Forward Verification
rfutil <db> -C aiverify <type>
[Diagram: myDb, ai, ftp (X), Hot Standby]
rlNoteVerify: Note dbkey is negative (14099)
Trid: 358 code = RL_CXINS version = 2 (12528)
Hot standby:
Recovery scenario:
  Re-send broken AI file
  Validate/fix production db
  Re-base hot standby
  Roll forward to transaction