ROBERT K. DUGGAN, CPA, CIA, CISA. Presentation transcript:
ITCP/DRP often doesn’t work. We discover it doesn’t work when we really need it to work. We pay a fortune to maintain it (Tiers 4-6: $400K to $2M and up!). DR test recoveries are fun!
IBM sets Tiers 1-6 for CICS operating on z/OS, based on configuration:
- Tiers 1-3: recovery time from 1 week down to more than 24 hours
- Tiers 4-6: recovery time from under 24 hours (large manufacturers/distributors with continuous processing needs and low downtime tolerance for the business) down to instantaneous (Tier 6: banks, zero downtime tolerance)
(See IBM.com for more information.)
Today’s example is a Tier 4 scenario for medium to large organizations with a 24-hour RTO requirement for critical applications. (If you have a mainframe, you most likely need Tier 3 or up.)
< 24-hour recovery of critical platforms and applications: key success factors and evaluation steps are similar across the tiers.
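As a rough illustration of the tier bands above, the following sketch maps a required recovery time to the band named on the slide. The function name `tier_band` and the exact boundaries are assumptions made for illustration. In particular, a 24-hour requirement is placed in Tiers 4-6 because the slide's Tier 4 example uses a 24-hour RTO. IBM's own documentation remains the authority on the tier definitions.

```python
def tier_band(rto_hours: float) -> str:
    """Map a required recovery time (in hours) to the tier band from the slide.

    Assumed boundaries, read from the slide text:
      - 24 hours or less            -> Tiers 4-6
      - more than 24 hours, <= 1 wk -> Tiers 1-3
    """
    if rto_hours < 0:
        raise ValueError("rto_hours must be non-negative")
    if rto_hours <= 24:
        return "Tiers 4-6"
    if rto_hours <= 7 * 24:
        return "Tiers 1-3"
    return "outside the tiers described"


print(tier_band(24))  # the 24-hour scenario used in today's example
print(tier_band(72))  # a three-day recovery requirement
```

The band a given organization actually needs comes out of its business impact analysis and risk assessment, not from a lookup like this one.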
- Determined by Business Impact Analysis and Risk Assessment
- RTO / RPO (recovery time objective / recovery point objective)
- Recovery of critical platforms and applications: regardless of tier or platform, key success factors and evaluation steps are similar for all tiers. What changes is the configuration and the RTO.
- Walkthrough (“Tabletop”): scenario with roles and responsibilities
- Functional Exercise: verify the effectiveness of the backup by platform
- Off-Site Test Restore: verify the effectiveness of the IT DR plan offsite at the test center
Two different things, but: ITDR and BCP are severely impaired without each other.
- Should occur well before the offsite test
- Include the vendor team
- Follow-up process with platform owners/DR team and vendor team to resolve issues noted prior to the actual test restore
- Audit interviews the platform support teams, IT Director, and DR Manager assigned as part of planning, to understand the objectives and where the process is on an evolutionary scale
- Call tree notification system dysfunctional / not at vendor; call trees incomplete or not defined
- Persons who can declare a disaster not defined or poorly separated (or the wrong people); vendor cannot take action under contractual terms
- Support teams not defined / no backups for key members
- Approval process for changes to DR documents
- DR documents not current and at vendor / on secure website
- Vendor in same geographic area
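The call tree and backup issues above lend themselves to a simple completeness check. The sketch below is illustrative only: the `CallTreeEntry` fields, the `call_tree_gaps` helper, and the sample names are all hypothetical, not part of any real notification system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CallTreeEntry:
    """One role in a call tree (hypothetical structure for illustration)."""
    role: str
    primary: Optional[str] = None
    backup: Optional[str] = None
    phone: Optional[str] = None


def call_tree_gaps(entries: List[CallTreeEntry]) -> List[str]:
    """Return a human-readable list of gaps found in the call tree."""
    gaps = []
    for entry in entries:
        if not entry.primary:
            gaps.append(f"{entry.role}: no primary contact")
        if not entry.backup:
            gaps.append(f"{entry.role}: no backup defined")
        if not entry.phone:
            gaps.append(f"{entry.role}: no phone number")
    return gaps


# Made-up sample data, for illustration only.
tree = [
    CallTreeEntry("DR Manager", primary="A. Smith", backup="B. Jones", phone="555-0100"),
    CallTreeEntry("Mainframe platform owner", primary="C. Lee", phone="555-0101"),
]
for gap in call_tree_gaps(tree):
    print(gap)
```

A check like this only shows that entries are filled in. Whether the numbers work and the right people can declare still has to be confirmed in the walkthrough.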
- Step-by-step instructions for platform owner / vendor operators are not crystal clear
- No clear assignment of responsibilities or documented procedures for key platform owners
- No clear assignment of responsibility for vendor personnel, or no appropriate training on platforms
- Backups for key personnel not defined
- Business impact analysis and risk assessment not current / tier of recovery is insufficient. Example: distributor switches from call center to web application / proprietary remote order entry system
Vendor personnel or backup recovery personnel cannot restore the system:
- Port mapping / system documentation not complete / up to date
- Insufficient remote software / hardware support level
- Vendor hardware is insufficient
- Insufficient procedures / lack of clean updated scripts
- Poorly trained recovery personnel
- Backup not really effective: verify successful recovery of each platform using a checklist, and document the verification method (system and volume information in header screens). PS: Don’t ask for screenshots in the middle of a DR test. Just capture platform, LPAR, times, and volume information; observe/confirm effective validation.
- Application recovery not verified during the 24-hour test / inaccurate RTO
- Inaccurate system documentation leads to failure to meet RTO
- Port mapping is inaccurate / not maintained properly by hardware support
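One way to capture the verification items named above (platform, LPAR, times, and volume information) without interrupting the test is a simple record per platform. The following is a minimal sketch; the `RecoveryVerification` fields and the `missing_items` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class RecoveryVerification:
    """Observer's notes for one platform restore (illustrative fields only)."""
    platform: Optional[str] = None
    lpar: Optional[str] = None
    restore_start: Optional[str] = None  # time noted by the observer
    restore_end: Optional[str] = None
    volume_info: Optional[str] = None    # as shown on the header screen


def missing_items(record: RecoveryVerification) -> List[str]:
    """List the checklist fields that have not been filled in yet."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Made-up example: a record that is only partly filled in.
note = RecoveryVerification(platform="z/OS", lpar="LPAR1")
print(missing_items(note))
```

An empty result means every item on this short checklist was noted. It does not by itself confirm that the restore was valid.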
- Restore personnel cannot follow scripts without assistance from the company platform team
- Test results not verified by DR Test Manager/DR Manager, or test leader is not independent / does not rotate by test
- Teams do not complete the verification checklist or keep testing notes. It is an evolving process that needs to build.
- Teams do not update DR instructions following the test restore for lessons learned. It is an expensive process; there should be a post-restore review with a follow-up task list.
- Teams do not accurately capture RT/RP or evaluate against true RTO/RPO by platform and application
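The last point, comparing the recovery time and recovery point actually achieved (RT/RP) against the RTO/RPO for each platform and application, can be sketched as below. The `RecoveryResult` structure, the `evaluate` function, and all figures are assumptions made up for illustration; real values would come from the test notes and the business impact analysis.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RecoveryResult:
    """Objectives and measured results for one platform or application."""
    name: str
    rto_hours: float        # recovery time objective
    rpo_hours: float        # recovery point objective
    actual_rt_hours: float  # recovery time measured in the test
    actual_rp_hours: float  # age of the data actually recovered


def evaluate(results: List[RecoveryResult]) -> Dict[str, Dict[str, bool]]:
    """Report, per item, whether the measured RT/RP met the RTO/RPO."""
    return {
        r.name: {
            "rto_met": r.actual_rt_hours <= r.rto_hours,
            "rpo_met": r.actual_rp_hours <= r.rpo_hours,
        }
        for r in results
    }


# Made-up figures for illustration only.
test_results = [
    RecoveryResult("Mainframe LPAR", rto_hours=24, rpo_hours=24,
                   actual_rt_hours=18, actual_rp_hours=20),
    RecoveryResult("Order entry application", rto_hours=24, rpo_hours=4,
                   actual_rt_hours=30, actual_rp_hours=3),
]
for name, outcome in evaluate(test_results).items():
    print(name, outcome)
```

Tracking platforms and applications as separate rows matters here: a platform can come up inside the RTO while the application on it does not.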