Disaster Management at the Tier-1. Andrew Sansum, RAL, 2nd April 2009
Do You Recognise This? 04 November 2013 Tier-1 Status. Burnt-out UPS battery at ASGC: clearly a disaster.
Do You Recognise This?
Challenger Disaster
Cause of Challenger Disaster
It was the O-rings, wasn't it? "[The Rogers Commission] found that the Challenger accident was caused by a failure in the O-rings … The failure of the O-rings was attributed to a design flaw, as their performance could be too easily compromised by factors including the low temperature on the day of launch."
Yes, but there were underlying cause(s):
– Communication problems: "...failures in communication... resulted in a decision to launch 51-L based on incomplete and sometimes misleading information, a conflict between engineering data and management judgments, and a NASA management structure that permitted internal flight safety problems to bypass key Shuttle managers."
– Management errors: "The Commission found that as early as 1977, NASA managers had not only known about the flawed O-ring, but that it had the potential for catastrophe."
Why Considered a Disaster?
People died: "Challenger disintegrated about seventy-three seconds after launch, killing the seven astronauts aboard."
NASA's reputation was badly damaged: "It also represented a serious blow to NASA's reputation, colouring the public perception of piloted spaceflight…"
Financial losses and reduced funding opportunity: "…and affecting the agency's ability to gain continued funding from Congress."
Couldn't meet operational commitments: "Following the Challenger disaster, NASA grounded the remainder of the shuttle fleet while the risks were assessed more thoroughly, design flaws were identified, and modifications were developed and implemented."
Identify Potential Disasters
When we say "disaster" we do not (usually) mean the same thing as is meant by "the Challenger Disaster".
Nevertheless, there are many outcomes we wish to avoid.
The Tier-1 Disaster Management plan seeks to identify circumstances that have the potential to significantly impact:
– Safety
– Service commitments
– Reputation
– Finances
Some Disasters
We can construct a list of obvious disasters, e.g.:
– Fire, flood, etc.
– Loss of network
– Security incident
– We did this in the form of a risk analysis: DPv0.8.mht
We also have previous experience:
– CASTOR upgrade
– Disk firmware problems that made it impossible to run delivered hardware
– R89 delays (unable to manage deliveries)
– Backplane burnout (not a disaster, but very close)
Common themes:
– The ones we generated tended to be operational and to start suddenly
– The ones we suffered were slow-moving project management problems
We also need to be able to manage disasters that nobody thought of in advance.
Evolution of a Disaster: sometimes fast, sometimes slow, but with a similar result.
A Strategy
Create a Disaster Management System that handles all potential disasters in a similar way.
Identify common features and trigger levels so that we can spot events before they blossom into disasters.
Mess with existing processes as little as possible.
Build specific contingency plans that add to the general response in specific circumstances.
Trigger early, trigger often, respond ahead of the curve:
– Make use of the system routinely
– This stops the system decaying
– It gives operational and project management benefits
Don't Confuse Disaster with Routine Ops
Loss of power is not a disaster … but failure of a routine restart may lead to one.
Routine Operations
We already have:
– Production Team (Gareth, John Kelly and Tiju)
– Admin on Duty (daytime)
– On-call (night-time)
Routine operations should be:
– Looking for problems
– Fixing things
– Calling experts
– Notifying users
– Setting downtimes
– Assessing seriousness
– Reviewing events
– Improving future response
Routine operations are not part of the Disaster Management System:
– But they prevent many things from moving into the system
Need Escalating Response
Start lightweight (Stage 1: Disaster Potential):
– Informally assess/triage
– Monitor/compare against standard contingencies
– Set deadlines
– Watch for things leaving the expected script, but avoid interfering
Add some internal management (Stage 2: Disaster Possible):
– Add internal (group) oversight
– Formally assess
– Interfere more, divert resources
Escalate response to imminent disaster (Stage 3: Disaster Likely):
– Broaden oversight and expertise (include GridPP + department)
– Regular meetings with experiments
– Prepare contingencies
Manage actual disaster (Stage 4: Disaster)
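The escalating response on this slide can be read as a small ordered state machine. The sketch below is illustrative only and is not taken from the Tier-1 system: the `ROUTINE` stage 0, the function names, and the rule that a missed deadline triggers escalation are all assumptions made for the example.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Escalation stages from the slide; ROUTINE is an assumed stage 0."""
    ROUTINE = 0             # assumption: normal operations, outside the system
    DISASTER_POTENTIAL = 1  # informal triage, deadlines set
    DISASTER_POSSIBLE = 2   # internal (group) oversight, formal assessment
    DISASTER_LIKELY = 3     # broader oversight, contingencies prepared
    DISASTER = 4            # managing the actual disaster


def escalate(stage: Stage) -> Stage:
    """Move up one stage, stopping at DISASTER."""
    return Stage(min(stage + 1, Stage.DISASTER))


def review(stage: Stage, deadline_missed: bool) -> Stage:
    """Hypothetical rule: a missed deadline moves the incident up one stage."""
    return escalate(stage) if deadline_missed else stage


if __name__ == "__main__":
    current = Stage.DISASTER_POTENTIAL
    current = review(current, deadline_missed=True)
    print(current.name)
```

Using an ordered enum keeps comparisons such as `stage >= Stage.DISASTER_LIKELY` cheap to write, which suits a scheme where each stage adds oversight on top of the previous one.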
At Each Stage
Formal list of pre-defined communications:
– Notify team of the deadline for escalation
– Notify PMB that the incident is moving onto the disaster track
– Notify esc senior staff
– Advise Press & PR (as disaster approaches)
– …
Formal list of actions that should be carried out, e.g.:
– Define roles
– Hold Incident Review Meeting
– Start process to obtain financial approval
– Arrange exceptional experiment liaison meeting
– Review policy documents
– …
Formal list of criteria that take you to the next stage
Contingency Plans
Contingency plans supplement the general disaster management system.
For each stage in the general system, supplement with:
– Criteria to reach (or avoid) this stage
– Actions to take at this stage
– Communications to make at this stage
Example contingency plan: Contingency_Plan_Major_Security_Incident.mht
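One way to picture how a contingency plan supplements the general system is as per-stage lists that are appended to the general ones. This is a minimal sketch under stated assumptions: the `StagePlan` type and `supplement` function are hypothetical names, and the security-incident entries are invented placeholders, not the contents of the actual contingency plan document.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StagePlan:
    """Criteria, actions and communications for one stage (hypothetical type)."""
    criteria: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)
    communications: list[str] = field(default_factory=list)


def supplement(general: StagePlan, contingency: StagePlan) -> StagePlan:
    """Return the general plan with the contingency items appended."""
    return StagePlan(
        criteria=general.criteria + contingency.criteria,
        actions=general.actions + contingency.actions,
        communications=general.communications + contingency.communications,
    )


# General entries taken from the "At each stage" slide.
general = StagePlan(
    actions=["Define roles", "Hold Incident Review Meeting"],
    communications=["Notify PMB incident is moving onto disaster track"],
)

# Placeholder entries for a security incident; invented for illustration.
security = StagePlan(
    criteria=["Compromise confirmed on a production host"],
    actions=["Isolate affected hosts"],
)

combined = supplement(general, security)
print(combined.actions)
```

Appending rather than overriding matches the slide's wording that contingency plans supplement the general response; the general lists are left unmodified.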
Conclusions
The Disaster Management System is working. It has already managed:
– Site DNS failure (reached Stage 1)
– Power failure (reached Stage 2)
It doesn't replace our existing processes:
– But it does make sure they are responding correctly
We expect it to manage equally well:
– Operations failures (network down and out)
– Project management failures (building delivered late)
– Unexpected problems (e.g. man from Mars at the door)
It is working well and giving immediate benefit.
It doesn't remove the need to plan for the aftermath of a building fire (but will help manage the situation):
– Still working on contingency planning and experiment requirements