Presentation on theme: "Information Technology & Computing Services"— Presentation transcript:
1 Information Technology & Computing Services
East Carolina University
Planning for “What if” Events
Carol Davis, IT DRP Coordinator
Jonathan Rose, Systems Programmer
2 Agenda
- ITCS Disaster Recovery Planning Goals
- ITCS DRP Overview
- Activation of the Plan
- Review of Team Responsibilities
- ITCS and Departmental Testing
- Recovering a Mission Critical System
- ITDRP Centralized Sharepoint
- Campus Disaster Planning
- Other Discussion
3 Interesting Facts…
- Nearly 60 percent of organizations don’t train employees about their roles and responsibilities in the event of a disaster.
- More than 80 percent of organizations have locally-managed life safety plans in place, but only 20 percent of those respondents have evacuation and relocation plans.
- Although 65 percent of respondents said business recovery plans are important, only 37 percent of organizations test their business recovery plans each year. Another 29 percent merely recognize the need for such plans.
- More than 60 percent of organizations have plans for recovering key IT assets such as mainframes and networks. Yet, more than 20 percent of respondents said these plans are focused solely on getting machines working again after a disaster.
- Only one-third of respondents said their organizations test telecommunications recovery plans annually.
Source: McCollum, ITAUDIT
4 Primary Goals of DRP
- Details the correct course of action to follow in the event of a disaster
- Planning helps to minimize confusion, errors, and expense
- Quick and complete recovery of critically outlined services
- Involves departments in business continuity
5 Secondary Goals of DRP
- Reduce risks of loss of services
- Provide ongoing protection of university assets
- Learn departmental critical needs for recovery efforts
- Ensure the continued viability of this Plan
- Provide DR training in an annual disaster recovery retreat for staff to understand their recovery roles
6 Policy Statement
- Identifying & protecting assets within their control
- Ensuring employees understand their obligation to protect identified assets
- Implementing security practices and procedures consistent with generally accepted practices
- Assigning responsibilities for establishing, maintaining, and testing a Disaster Recovery Plan
7 What is COBIT?
- COBIT stands for Control Objectives for Information and Related Technology
- Issued by the IT Governance Institute and accepted internationally as good practice for control over information and IT-related risks
- COBIT is a way to bridge the communication gap between IT functions, the business, and auditors by providing a common approach, understandable by all
- Control includes policies, organizational structures, practices, and procedures
- Control objectives are statements of the desired result or purpose to be achieved by implementing specific control procedures
8 COBIT Framework
- There are 34 high-level control objectives & 318 detailed control objectives
- The four groups are planning & organization, acquisition & implementation, delivery & support, and monitoring
- Addressing the high-level control objectives can ensure that an adequate control system is provided for the IT environment
10 The Plan Components
- Readiness Team: Responsible for constructing and maintaining the Disaster Recovery Plan, for managing the DR activities, and for the continued viability of the Plan
- Major Services and Key Considerations: Descriptions of the critical applications, identification of users, and key considerations such as equipment configurations, user work schedules, and processing priorities
11 DRP Components (continued)
- General Procedures for Potential Interruptions: Likely causes of service interruptions and instructions for handling them (e.g., fire, power outage, and telecommunications failure)
- Policies for Reducing Risks: Policies for reducing the risk of:
  - Disasters that may occur
  - Excessive damage when they do occur
  - Failing to recover from a disaster
12 DRP Components (continued)
- Contingency Site Description: The facilities provided and all requirements associated with the use of the site
- Recovery Procedures for a Major Disaster: Instructions and procedures to be followed in the event of a major disaster (e.g., activating the emergency procedures, establishing operations at the contingency site, and restoring the university to normal operations)
13 DRP Components (continued)
- Testing and Maintenance of the Plan: Policies and procedures for ensuring the Plan remains viable as the business environment evolves
- Disaster Recovery Scenarios: Examples that illustrate differences in recovery steps and elapsed times for emergencies of minor, moderate, and major severity
14 Major Services - Critical Applications
- Electronic Mail
- Healthcare Applications
- Financial Applications
- Student Records/Registration
- Academic Applications
- Public Web Services
- Phone Services
- Banner transition items
- Infrastructure systems
15 Major Services - Priorities
1. Healthcare Applications
2. Financial Accounting
3. Purchase Order
4. Student Records*
5. Fixed Asset
6. All Others
* May have a higher priority during registration
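The priority list above, including the registration exception, can be sketched as a simple sort. This is only an illustration: the function name `recovery_order` and the `registration_rank` value are hypothetical, since the slide says Student Records "may" move up during registration but does not say by how much.

```python
# Priority order taken from the slide; a lower number means "recover first".
BASE_PRIORITY = {
    "Healthcare Applications": 1,
    "Financial Accounting": 2,
    "Purchase Order": 3,
    "Student Records": 4,
    "Fixed Asset": 5,
}
DEFAULT_PRIORITY = 6  # "All Others"


def recovery_order(services, registration_period=False, registration_rank=1.5):
    """Return the given services sorted by recovery priority.

    registration_rank is an assumed value (not from the slide) that places
    Student Records just after Healthcare Applications during registration.
    """
    def rank(name):
        if registration_period and name == "Student Records":
            return registration_rank
        return BASE_PRIORITY.get(name, DEFAULT_PRIORITY)

    # sorted() is stable, so services with equal rank keep their input order.
    return sorted(services, key=rank)


if __name__ == "__main__":
    affected = ["Electronic Mail", "Student Records", "Purchase Order"]
    print(recovery_order(affected))
    print(recovery_order(affected, registration_period=True))
```

Keeping the exception as a parameter rather than a second table makes it easy to adjust once the actual registration-period ranking is agreed on.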
16 Systems Testing Schedule
- The Administrative Applications Testing Schedule was developed last year
- This helps proactively plan by utilizing a testing rotation schedule
- New applications must be added as needed
- SCT Banner is requiring changes to this schedule
17 General Procedures for Potential Interruptions
Fire (Prevention, Detection, Extinguishing, Evacuation)
- Call the fire department immediately (911) and utilize a pull station. If the fire is small, use a fire extinguisher.
- Fire extinguishers are located in the Operations Computer Room adjacent to each computer room exit, and throughout the computer room and building, as per the fire inspector’s recommendations.
- If employees need to evacuate the building and no alarm has sounded, utilize a pull station.
- If there is time, computer operations should power down the system(s) before cutting power. Trip the Emergency Power Off (EPO) or, if this fails, shut off the main breakers in the mechanical room.
18 General Procedures for Potential Interruptions
- Electrical power outages
- Network or telecommunications failure
- Flooding
- Hardware failure
- Software failure
- Major disasters
19 Emergency Procedure Goals
- Protect the lives and health of employees
- Protect essential documents, records, and data
- Minimize damage to data processing equipment and other property
20 Policies for Reducing Risk
- Protection of computer data
- Backup of data, hardware, supplies, and documentation
- Security of Data Center Operation
- Offsite storage of tapes and materials
- Insurance on equipment
- Be prepared as much as possible!
21 Contingency Site Description
- SunGard primary and secondary hotsite locations with account manager information
- The service arrangement, with machine configuration and facilities, is located in the SunGard Schedule A
- Travel/hotel accommodations for staff are made by the Administrative Staff
- SunGard emergency numbers
22 ITCS Disaster Recovery Readiness Team Responsibilities
23 DRP Readiness Team
[Organization chart: Emergency Coordinator (Carol Davis), Alternate Emergency Coordinator, Offsite Coordinators, Action Team Leaders]
24 Readiness Team Roles
The “Disaster Management Team”
- Purpose is to establish and direct plans of action
- Maintain readiness for emergencies
- Manage DR activities following a disaster
- Administration of the Plan
- Emergency Control Center
- Offsite operations
25 Emergency Coordinators
- Develop and coordinate the Readiness Team
- Activate and direct all activities during a disaster
- Review and update the DRP annually
- Evaluate readiness of action teams
- Maintain the Emergency Control Center
- Liaise with local fire and police agencies and other involved parties
- Assist with campus disaster recovery needs
26 Offsite Coordinators
- Review the Plan and ensure adequacy of testing and contingency site procedures
- Conduct periodic tests of the contingency site
- Communicate status of contingency operations via the Emergency Control Center
- Back up Emergency Coordinators as needed
27 Action Team Leaders
- Review the DR Plan with respect to recovery procedures, team responsibilities, changes in personnel, and availability of resources
- Recommend changes or improvements to the Plan
- Assist in annual training and train others on the team in disaster recovery efforts
28 ITCS Disaster Recovery Action Team Responsibilities
29 Action Teams
[Organization chart: Emergency Coordinator, Alternate Emergency Coordinator, Offsite Coordinators, and Action Team Leaders, over the Operations, Applications, Database, Network/Telecom, Facilities, Administrative, Systems, and Infrastructure teams]
30 Emergency Action Teams
Teams (each with a Team Leader): Applications, Database, Infrastructure Wiring, Telecomm, Facilities, Operations, SysMain, Systech, Network, Administrative
- Individual teams and team leaders are responsible for ordering and tracking needed hardware.
- All ITCS employees are considered critical staff and may be asked to participate in one of the defined roles.
31 Action Team Responsibilities
- Operations Team ensures the resumption of computer services following a disaster by restoring and continuing scheduled processing at the contingency site until operations can resume at the original or replacement data center.
- SysMain/SysTech Team restores or replaces needed systems in the event of a disaster.
32 Action Team Responsibilities
- Network/Telecom Team restores or replaces the data or telecommunication systems.
- Administrative Team is responsible for arranging transportation, housing, expense advances, shipping, etc., and performing clerical and other functions.
- Applications Team ensures proper functioning of the applications at the contingency site and coordinates with users about how their applications should be operated during the contingency period.
33 Action Team Responsibilities
- Database Team is responsible for recovery of any and all database activities and works with the other teams as needed on recovery efforts.
- Infrastructure Wiring Team restores or replaces needed wiring in the event of a disaster.
- Facilities Team restores or replaces the Data Center and other data processing facilities following a disaster.
35 Readiness Team Notifications
- Public Safety may contact the Emergency Coordinator
- Readiness Team Leaders will assist in notifications to assemble the team at the Data Center or Emergency Control Center
- Quick reaction of the readiness team is crucial
- The situation will be assessed to determine the needed course of action
36 Readiness Team Notifications
- Ensure the Emergency Coordinator or Alternate Emergency Coordinator has been contacted, if this has not already been done.
- If the situation is judged to be a major disaster:
  - Activate the Emergency Control Center
  - Notify top management
  - Notify Readiness and Action Teams
  - Notify the offsite storage site
  - Notify the offsite contingency site
37 Emergency Control Center
- Provide centralized and coordinated control of communications during emergencies
- Primary site: should be designated
- Secondary site: should be designated
- Activated by Emergency Coordinator or Alternate Emergency Coordinator
- Emergency Coordinators and Team Leaders to coordinate their actions with the Emergency Control Center
38 SunGard Alert Notification
- Call SunGard NUMBER
- Inform the operator whether you are calling in an alert notification or a disaster declaration.
- Please provide the following information:
  - Your company’s full name
  - Your name and password (if applicable)
  - The address of the site affected
  - Primary and secondary phone numbers where you can be reached
  - The nature of the alert or disaster
  - The type of systems/servers that you are declaring or placing on alert
  - The SunGard facility your company utilizes for testing
- A Crisis Management team member will access your Disaster Declaration Authorization (DDA) form to ensure you are authorized to provide an alert notification
40 DRP Testing & Maintenance
- The ITCS DR Plan is to be tested annually
- The Plan is to be revised at least once every two years, or as needed with technology updates
- A hard copy and electronic copies are distributed to the readiness teams
- MS Sharepoint is used to maintain the IT DR Plan under the Master, Planning, and Testing sites; updates are accessible depending on access privileges
41 2005 Hotsite Testing
- Recover the system & applications from backups to vendor-supplied hardware at the “hot site” in Chicago
- Allow system and departmental testers in Greenville to remotely test the applications running in Chicago
- Complete testing recovery templates
- Review the IT Disaster Recovery Plan for updates and suggestions
42 ITCS Disaster Recovery: Recovering a “Mission Critical System”
43 What is a “Mission Critical System”?
- A system so critical to the functioning of an organization that its destruction or loss would cause an extreme interruption to the business, have significant financial implications, and/or threaten the health or safety of a person
- Real World Definition: Within moments of the system going down, someone is calling your boss and your boss is calling you.
44 An Integrated Environment
- “System” as it relates to recovery planning should include all business assets necessary to deliver the service
- USERS, APPLICATIONS, SYSTEM, NETWORK… all need POWER
- All of these pieces have to work. It is an all-or-nothing situation
- The entire organization has a vested interest in all areas being ready to respond
- It’s like being on a ship 30 miles out to sea. Does it really matter what side of the boat the hole is in?
45 “What If” Planning
Data Center Destruction Scenario
It’s the weekend and you are at home enjoying a pizza and watching the NCAA tournament. Your boss calls and leaves a voice mail on your answering machine indicating that a tornado has struck your data center. The facility has suffered significant damage and your site’s critical systems have been damaged. He needs you to prepare for travel to the “hot site” and recover the systems.
46 Quiz: What Do You Do?
Multiple Choice: (Select all that apply)
- Pretend that you didn’t get the message. Finish your pizza and enjoy the game
- Fall out, dream you’re on The Apprentice, in the board room with “Donald”. You’re Fired
- Confidently contact your boss to begin executing your thoroughly tested disaster recovery plans
47 3 Keys to a Successful Recovery
- Backups: Without good backups you are rebuilding your system, not recovering it
- Available Hardware: Can’t restore to what you don’t have
- Procedures & Training: Document & test your procedures
48 Backups (Data Protection)
- Build in as much data redundancy as possible (RAID, shadowing, etc.)
- Frequent backups: the more the better
- Randomly test restoring your data
- Track the age of tapes used for backups
- Adequate number of tapes in rotation
- Offsite storage of recent backups
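Tracking tape age, as recommended above, can be as simple as recording when each tape first entered rotation. The following is a minimal sketch under stated assumptions: the function name, the tape labels, and the 365-day threshold are all hypothetical, and the slides do not say what retirement age ITCS actually uses.

```python
from datetime import date


def tapes_to_retire(first_used, today, max_age_days=365):
    """Return labels of tapes older than max_age_days, oldest first.

    first_used maps a tape label to the date the tape entered rotation.
    max_age_days is an assumed threshold, not a value from the presentation.
    """
    aged = [
        (day, label)
        for label, day in first_used.items()
        if (today - day).days > max_age_days
    ]
    # Sorting the (date, label) pairs puts the oldest tape first.
    return [label for day, label in sorted(aged)]


if __name__ == "__main__":
    inventory = {  # hypothetical tape labels and dates
        "T001": date(2003, 1, 15),
        "T002": date(2004, 11, 1),
        "T003": date(2002, 6, 30),
    }
    print(tapes_to_retire(inventory, today=date(2005, 3, 1)))
```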
49 Available Hardware
- Identify & avoid single points of failure
- Build in as much redundancy as possible (CPU, memory, power, NICs, disks, …)
- Ensure secondary offsite hardware
  - Option 1: Identical offsite system
  - Option 2: Offsite cluster member
  - Option 3: Contract with recovery company
50 Procedures & Training
- Develop verbose procedures explaining the recovery process in your environment
- Make sure your procedures are readily available to all necessary staff
- Test your procedures: practice makes perfect
51 2004 Disaster Recovery Test Overview
Recovery Overview: actual recovery times from the 2004 Offsite Recovery Test
Est (Min) | Task | Start Time | End Time | Actual (Min)
10 | Inventory hardware and log into system | 8:05 | 8:15 |
15 | Map available disks to data drives | | 8:25 |
 | Initialize disks | 8:30 | 8:35 | 5
25 | Restore SYSTEM DISK | | 8:55 | 20
 | Mount restored drive and edit pre-written restore programs with mapped drive info | 8:58 | 9:23 |
2 | Submit DATA DISK restore jobs | 9:24 | 9:25 | 1
30 | Configure startup files with mapped info | 9:30 | 9:50 |
180 | Monitor data restoration process | | 11:24 | 119
52 2004 Disaster Recovery Test Overview
Recovery Overview: actual recovery times from the 2004 Offsite Recovery Test
Est (Min) | Task | Start Time | End Time | Actual (Min)
20 | Do controlled system reboot | 11:40 | 12:00 |
15 | Perform initial system checks | 12:20 | 12:25 | 5
 | Modify startup files for “Full” startup | | 12:40 |
 | Full reboot of system | 12:45 | 12:52 | 7
 | Start database environment | 12:55 | 12:58 | 3
 | Review environment to ensure integrity | | 13:12 | 14
 | Operations startup of applications | 13:25 | 13:35 | 10
 | Notify Disaster Recovery Coordinator | | 13:40 |
180 | Departmental Testers check out system | 13:50 | 17:00 | 190
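The "Actual (Min)" figures in the 2004 test overview are elapsed minutes between start and end times, and comparing them with the estimates shows where the plan ran long or short. A small sketch of that arithmetic follows; the helper names `elapsed_minutes` and `variance` are hypothetical, and it assumes both times fall on the same day. The sample values are the three steps whose estimate, start, end, and actual figures all appear in the transcript.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M"  # 24-hour clock, as used in the test overview


def elapsed_minutes(start, end):
    """Minutes between two same-day clock times such as "9:24" and "9:25"."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


def variance(estimate_min, start, end):
    """Actual minus estimated minutes; a positive value means the step ran over."""
    return elapsed_minutes(start, end) - estimate_min


if __name__ == "__main__":
    steps = [
        ("Submit DATA DISK restore jobs", 2, "9:24", "9:25"),
        ("Perform initial system checks", 15, "12:20", "12:25"),
        ("Departmental Testers check out system", 180, "13:50", "17:00"),
    ]
    for task, estimate, start, end in steps:
        actual = elapsed_minutes(start, end)
        print(f"{task}: est {estimate}, actual {actual}, variance {variance(estimate, start, end):+d}")
```

Feeding the variances back into the next year's estimates is one way to keep the recovery timeline realistic.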
53 “What If” Planning
- At the start, focus your planning on scenarios that affect the critical three: data, hardware, and know-how
- Be proactive and not reactive. “An ounce of prevention is worth a pound of cure”, so build in redundancy to avoid single points of failure
- The old cliché holds true: if you fail to plan, then plan to fail
54 What We Do at East Carolina
- Data Redundancy
  - Nightly “Full” backups
  - Monitor vintage of tapes and rotate backups offsite
  - Monthly restore of Live data to Development system
- Hardware Availability
  - Redundant components on Live & Development systems
  - Development system capable of running Live
  - Contract with SunGard for recovery services
- Know How
  - Verbose procedures on recovering the environment
  - Yearly offsite disaster recovery test
56 ITCSDRP Sharepoint Site
https://ouritcsdrp.ecu.edu (example)
57 ITDRP Sharepoint Site
Site hierarchy: ITCSDRP (top level) with MASTER, PLANNING, and TESTING sites
- ITCSDRP: The top-level site is the central starting point for ITCS Disaster Recovery.
- MASTER: This site contains the MASTER IT Disaster Recovery Plan (DRP) manual in electronic format.
- PLANNING: Those in ITCS needing modify access will have contributor rights to the PLANNING site.
- TESTING: The TESTING site is for those in ITCS and at the department level involved in annual testing.
59 Campus Disaster Planning
- The Crisis Decision Team addresses University-wide issues such as class cancellations or other mission-oriented issues.
- Campus Operations organizes and prioritizes the physical response and recovery efforts.
- EH&S organizes the actual Emergency Operations Center to provide overall coordination of recovery efforts.
- ITCS and other critical departments operate their own EOCs, which coordinate their recovery efforts with the central EOC.
60 Campus - Emergency Operations Center (EOC)
- University Emergency Coordinator oversees campus emergencies
- Key administrators form the Emergency Management Team
- Todd Dining in the Sweetheart Banquet Room is the primary EOC location