TECHNICAL PLAN SUMMARY 2007 Disaster Recovery
Presentation Purpose
- Understand Impact of Various Disaster Situations
  - What it means to operations
  - Impact duration
- Understand Alternate Work Options and Recovery Process
- Understand Roles
Various Possible Disaster Scenarios
DECLARATION OF A DISASTER, which activates all DR procedures, would be made in the event of a facility loss or regional disaster. Activation of subsections of the plan, or of individual alternate work means, may however occur in the event of various service failures (example: phone services out of operation).
Impact to Operations
Aside from a regional disaster, the most significant impact would occur with the loss of the Farr Regional Library.

Markers for each service are listed in column order: Farr, CP, LP, CV, Erie, Member. AWW indicates available with workaround. Comments for a service group follow the group name.

ILS (Comments: Store and forward on self checks will be used for up to two weeks.)
- Horizon: -- x x x x x
- HIP: -- x x x x x
- Self Check Store and Forward (Patron checkout): -- x x x x NA
- Telecirc: -- x x x x x

Communication Services (Comments: Reroute main number to, use branch cell phones. Use FRED sharepoint for status updates/coordination.)
- Ability to make/send calls via headsets: -- x x x x x
- Voicemail: -- x x x x x
- Fred core: -- x x x x x
- Fred Sharepoint: -- x x x x x
- OWA: -- x x x x x
- District Cell Phones: -- x x x x
- Interlocation email: -- x x x x x

Internet Presence/Services (Comments: Staff to use wireless network or public computers for internet access. For the AWW workaround, contact the vendor to eliminate patron validation.)
- Staff Internet Access: -- x x x x x
- Mylibrary: -- x x x x x
- Board, planning, other sharepoints: -- x x x x x
- Download audiobooks: -- AWW

Data WAN (Comments: Staff will have access to any local shares; for those at Farr, the replicated sections would be available at CP.)
- Access to shares (replicated): -- x x x x NA
- Ability to Log in: -- x x x x x

Patron Services (Comments: Public machines will function, but without patron validation.)
- Filtering: -- x x x x x
- Public Computers Internet Access: -- x x x x x
- PC Reservation/Print: --
Impact to Operations
In the event of the loss of a branch or member library, impact would be less extensive.

Markers for each service are listed in column order: CP/LP, Farr, CV, Erie, Member. Comments for a service group follow the group name.

ILS (Comments: Remote access would be available for all CP/LP staff if needed.)
- Horizon: -- x x x x
- HIP: -- x x x x
- Self Check Store and Forward (Patron checkout): -- x x x NA
- Telecirc: -- x x x x

Communication Services
- Ability to make/send calls via headsets: -- x x x x
- Voicemail: -- x x x x
- Fred core: -- x x x x
- Fred Sharepoint: -- x x x x
- OWA: -- x x x x
- District Cell Phones: -- x x x
- Interlocation email: -- x x x x

Internet Presence/Services
- Staff Internet Access: -- x x x x
- Mylibrary: -- x x x x
- Board, planning, other sharepoints: -- x x x x
- Download audiobooks: -- x x x x

Data WAN
- Access to shares (replicated): -- x x x NA
- Ability to Log in: -- x x x x

Patron Services
- Filtering: -- x x x x
- Public Computers Internet Access: -- x x x x
- PC Reservation/Print: -- x x x x
DR PLAN - What’s Included
First, let’s confirm what we are trying to protect/manage with a disaster recovery plan.
1) Data (data only, not system configurations)
   - ILS
   - Email
   - Individual Shares
   - Department Shares
2) Systems – all systems to be rebuilt
3) Paper Copies – not included
What if we don’t do anything, is it worth the effort?
Basically, we’re looking at insurance plans to cover a risk/investment of $1.5-2.5 million.

Cost of Downtime (if down with no disaster recovery plan)

| Item | Time in Hours | Days | Cost |
|---|---|---|---|
| System down | 80 | 10 | |
| Download OCLC data | 80 | 10 | $5,000 |
| Lost fines and fees (not yet to debt collect, debt collect has more than 90 days) | NA | | $60,000 |
| Fast track labor overhead (30 sec/item/est circ of 146000 items*3..April) | 625 | 78.125 | $6,250 |
| Update the catalog (ten minutes per bib) - 315000 | 50000 | 6250 | $750,000 |
| Impact to public availability (estimated at 10% of annual taxes) - PR impact, unavailability | | | $604,170 |
| Server rebuilds (mylibrary and fred) | | | $60,000 |
| Total | | | $1,485,420 |

Magnitude of $1.5-2.5 million to recover.
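The total on this slide can be checked with a short calculation. The sketch below copies the cost line items from the slide and sums them; the 8-hour workday is an assumption inferred from the slide's own figures (80 hours shown as 10 days), and the names `COSTS`, `HOURS_PER_WORKDAY`, and `hours_to_days` are illustrative only.

```python
# Cost line items copied from the Cost of Downtime slide (US dollars).
COSTS = {
    "Download OCLC data": 5_000,
    "Lost fines and fees": 60_000,
    "Fast track labor overhead": 6_250,
    "Update the catalog": 750_000,
    "Impact to public availability": 604_170,
    "Server rebuilds (mylibrary and fred)": 60_000,
}

# Assumption: the slide converts hours to days using an 8-hour workday
# (it lists 80 hours as 10 days).
HOURS_PER_WORKDAY = 8


def hours_to_days(hours):
    """Convert labor hours to workdays under the 8-hour assumption."""
    return hours / HOURS_PER_WORKDAY


total = sum(COSTS.values())

if __name__ == "__main__":
    print(f"Total estimated cost: ${total:,}")
    print(f"Fast track labor: {hours_to_days(625)} days")
    print(f"Catalog update: {hours_to_days(50_000)} days")
```

Under that assumption the hour and day columns agree with each other, and the line items sum to the total shown on the slide.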
Optimal Solution
The optimal solution is one that determines the best fit for cost versus time to recover (and also to what point in time data is recovered).
Terminology
There are a few basic terminology items to be aware of.
- Disaster Recovery (DR): Typically this is associated with a technology recovery plan.
- Business Continuity Plan: An overarching business disaster recovery plan which includes staffing, public communication, and more.
- Recovery time: Time for the system to be operational and available for use.
- Point in time recovery: The amount (in time) of data that may be lost as a result of the process.
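To make the last two terms concrete, the sketch below computes both for a single hypothetical outage. The function names and all three timestamps are invented for illustration and are not taken from the plan.

```python
from datetime import datetime


def data_loss_window(last_good_backup, failure_time):
    """Point in time recovery: work done after the last good backup is lost."""
    if failure_time < last_good_backup:
        raise ValueError("failure cannot precede the last good backup")
    return failure_time - last_good_backup


def recovery_time(failure_time, service_restored):
    """Recovery time: how long until the system is operational again."""
    if service_restored < failure_time:
        raise ValueError("service cannot be restored before it failed")
    return service_restored - failure_time


# Hypothetical timestamps, for illustration only.
backup = datetime(2007, 11, 4, 23, 0)    # nightly backup completes
failure = datetime(2007, 11, 5, 14, 30)  # system goes down
restored = datetime(2007, 11, 8, 9, 0)   # system available for use again

if __name__ == "__main__":
    print("Data lost:", data_loss_window(backup, failure))
    print("Time to recover:", recovery_time(failure, restored))
```

A shorter backup interval shrinks the first number; a faster rebuild shrinks the second. The two are independent, which is why the plan weighs both against cost.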
The disaster recovery effort has been divided into three subject areas. The first defines the procedures to be followed during a disruption to technical services. The second covers the technical recovery design, which determines how long emergency procedures will need to be followed and how successfully data can be recovered to a specific point in time. The last looks at all services to ensure the most cost-effective approaches are being taken for recovery.
Phases of Disaster Recovery
I. Continue Operations: Ensure staff know how to continue operations during technical disruption by downtime training (now)
II. Recover Quickly: Define the best fit cost and recovery time balance for the organization's needs (in process)
III. Cost Management: Work to establish an efficient cost structure for backup and recovery through establishing best fit data management policies (2008)
I. Continue Operations
This section of a disaster recovery plan focuses on ensuring appropriate materials are available and staff are trained and can operate during downtime situations.
- Emergency Boxes at each location
- Downtime Phones
- Afterhours support information for facilities and IT
- ILS downtime procedure
- PC Res downtime procedure
- All telephone and/or network downtime procedures
- Filtering product downtime
Alternate Work Processes
This table depicts, at the highest level, the alternate means by which operations can be conducted in the event of a regional failure.
- Horizon: Store & Forward on the Self Checks and returns on the smartchutes (OCLC for inquiries)
- Phone: Cell phones
- Data network: Use public machines for internet access. Use Sharepoints for data sharing, DR issued emails for communication.
- Collaboration (email, shares): Use FRED sharepoint for communication and set up individual accounts if critical. Share data may be available depending on situation.
Or simpler yet…
1. Know where the emergency box is at your location.
2. Immediately begin to use alternate work methods (to continue operations as normally as possible).
3. Check http://www.fred.sharepointspace.com for updates.
Speed of Recovery (how long would services be down)
II: Speed of Recovery
The speed of recovery is dependent upon
- vendor response
- resource time available (priorities)
- equipment availability
- complexity of the recovery

| Service | Best Case Day Live | Worst Case Day Live |
|---|---|---|
| Data network | Day 2 | Days 5-8 |
| Main Phone | Day 2 | Days 5-8 |
| ILS | Day 3 | Days 5-8 |
| Email | Day 5 | Days 8-12 |
| Shares | Day 5 | Days 8-12 |
| HIP | Day 3 | Days 8-12 |
| MyLibrary | Days 8-15 | Days 15-20 |
| Financials | Days 8-15 | Days 15-20 |
| Telecirc, other… | Days 15-25 | Days 25-35 |
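As a rough aid for the daily status meetings, the estimates above could be held as data and queried by day. This is a sketch only: the day ranges are copied from the slide, while the names `RECOVERY_TABLE`, `possibly_live`, and `live_in_worst_case` are hypothetical.

```python
# Day ranges copied from the Speed of Recovery slide as (first_day, last_day).
# A single day such as "Day 2" is stored as (2, 2).
RECOVERY_TABLE = {
    "Data network": {"best": (2, 2), "worst": (5, 8)},
    "Main Phone": {"best": (2, 2), "worst": (5, 8)},
    "ILS": {"best": (3, 3), "worst": (5, 8)},
    "Email": {"best": (5, 5), "worst": (8, 12)},
    "Shares": {"best": (5, 5), "worst": (8, 12)},
    "HIP": {"best": (3, 3), "worst": (8, 12)},
    "MyLibrary": {"best": (8, 15), "worst": (15, 20)},
    "Financials": {"best": (8, 15), "worst": (15, 20)},
    "Telecirc, other": {"best": (15, 25), "worst": (25, 35)},
}


def possibly_live(day):
    """Services that could be back by `day` if everything goes well."""
    return [name for name, est in RECOVERY_TABLE.items()
            if est["best"][0] <= day]


def live_in_worst_case(day):
    """Services whose worst-case window has fully elapsed by `day`."""
    return [name for name, est in RECOVERY_TABLE.items()
            if est["worst"][1] <= day]


if __name__ == "__main__":
    for day in (3, 8, 12):
        print(f"Day {day}: possibly live -> {possibly_live(day)}")
        print(f"Day {day}: live even in worst case -> {live_in_worst_case(day)}")
```

The gap between the two lists on any given day shows which services staff should still plan to cover with alternate work methods.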
Awareness point
For the WLD, when backups occur, only the data of the system is being retained. In some instances system configurations are backed up, but a full system installation is typically not captured. What this means: although the data is available, the hardware must be recovered and then all software reloaded. WLD is assessing when virtual machines can be created and easily backed up.
Roles (Who Does What)

| DR ROLE | PRIMARY | BACKUP |
|---|---|---|
| IT Coordination and Communication | Susan | Mike/Gem |
| Infrastructure Recovery | Mike | 3T/MTT/Susan |
| Client Access and Recovery | Eric | Marcus |
| District and Public Communication Coordination | Janine | Kelli |
Three point in time technical designs were considered. Note that all designs assume a baseline of primary equipment designed for high availability, with redundant power supplies, RAID, and other standard features inherent in business class servers and equipment.
I. Tape Backup: Using solely tape backup as a solution
II. Self managed replication (current) + tape: Uses data replication to similar off site hardware, plus tape for lost data
III. Hosted service replication: Similar to self managed replication but directed to offsite services instead of internal hardware
Best Fit
After reviewing options, WLD will be working with Iron Mountain.
Hosted backup w/ tape archive
- Expert resources monitoring and tracking data backup process
- Additional capacity available to provide needed data in the event of an emergency (can work with other vendor partners)
- Best medium for ensuring data integrity (avoiding bad tapes, etc.)
- Estimated solution duration: two, maybe three years
  - Dependent on data and systems growth
  - Review annually to determine if best fit
What it looks like
- 7 days of onsite and offsite data backup on disk (fast recovery)
- After 7 days, historical data available on tape (at risk for older recovery)
- Virtual Machines
  - HIP
  - Horizon if possible (testing to start now)
  - MyLibrary
  - Other…
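The disk and tape split described on this slide amounts to a simple rule for where a restore would come from. The sketch below is illustrative only: the function name is hypothetical, and treating a backup that is exactly 7 days old as still on disk is an assumption, since the slide does not say which side of the boundary that falls on.

```python
# From the slide: 7 days of backups are kept on disk, older data is on tape.
DISK_RETENTION_DAYS = 7


def restore_source(backup_age_days):
    """Return which medium a restore of the given age would come from.

    Assumption: a backup exactly 7 days old is still on disk.
    """
    if backup_age_days < 0:
        raise ValueError("backup age cannot be negative")
    if backup_age_days <= DISK_RETENTION_DAYS:
        return "disk"  # fast recovery
    return "tape"      # historical data, at risk for older recovery


if __name__ == "__main__":
    for age in (0, 3, 7, 8, 90):
        print(f"{age:>2} days old -> restore from {restore_source(age)}")
```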
Time to Recover
The table to the right depicts the estimated length of time needed for various services to be available, both through temporary work means and through full recovery, where normal operations have resumed.
Cost management includes efforts to help smartly manage data growth and use. Is money being spent backing up old or inappropriate data (mp3 files, family pictures, other)?
- Continued data collection and research in 2007
  - Share data
  - Email file size
  - Work processes (IT example)
  - What about personal folders, archives?
- Review findings and develop a recommendation in 2008
  - Share policies/use
  - Email mailbox size rules
  - Other?
Key takeaways for each section of Disaster Recovery
I. Continue Operations: Know where the emergency box is at each location
II. Recover Quickly: Use hosted backup and VM instances wherever possible
III. Cost Management: Data collection in 2007. Revisit policies in 2008.
Assuming the worst case disaster (Farr destroyed) short of a regional catastrophe
1. CONTINUE OPERATIONS: Staff immediately shifts to downtime operations
2. RECOVER QUICKLY:
   - Director/associate director immediately updates FRED sharepoint with first conference call time
     - Site is http://fred.sharepointspace.com – updates will be posted on the DR page
     - All managers to participate, use 866-258-0959 meeting room ID *1338021* using 1857
   - Daily meetings at 8:15 (target: brief, 15 minute information sharing) until recovery is completed
     - Managers to update staff after daily meeting
     - Communication to staff posted on DR site
   - IT will join all morning meetings to provide updates and will post specific information on the DR site as well
What’s not defined, subjects for a business continuity plan
1. Staffing information
   1. Do staff report, where, when?
   2. Will staff be paid?
2. Public and board communication plan
   1. How to keep public notified of the status
3. Peer communication plan
   1. ILL services, other, how to operate
4. Actual physical recovery if location destroyed
   1. Rebuild/other?
   2. Insurance processes
   3. Timeline for recovery (and again, staff impact in the interim)
5. Other?
Final Next Steps
Next Steps as of Sept 12, 2007
- Present to branch managers for awareness of full plan
- Complete testing of Horizon VM instance
  - Due in 30 days
- Complete testing of Iron Mountain service
  - In process, decision due October
- Complete migration of applicable services to virtual machines (includes installing separate copy
  - Q1 2008
- Finalize the archive configuration for the ILS
- Work with Kari/Managers to train appropriate staff on store and forward uploads
- Conduct DR test on Nov 5