Disaster Management at the Tier-1


1 Disaster Management at the Tier-1
Andrew Sansum, RAL, 2nd April 2009

2 Burnt out UPS battery at ASGC
Do you recognise this? A burnt-out UPS battery at ASGC: clearly a disaster.
22 March 2017 Tier-1 Status

3 Do You Recognise This?

4 Challenger Disaster

5 Cause of Challenger Disaster
It was the “O” rings, wasn’t it?
“[The Rogers commission] found that the Challenger accident was caused by a failure in the O-rings … The failure of the O-rings was attributed to a design flaw, as their performance could be too easily compromised by factors including the low temperature on the day of launch”
Yes, but there were underlying causes:
- Communication problems: “..failures in communication... resulted in a decision to launch 51-L based on incomplete and sometimes misleading information, a conflict between engineering data and management judgments, and a NASA management structure that permitted internal flight safety problems to bypass key Shuttle managers.”
- Management errors: “The Commission found that as early as 1977, NASA managers had not only known about the flawed O-ring, but that it had the potential for catastrophe.”

6 Why considered a disaster?
- People died: “Challenger disintegrated about seventy-three seconds after launch, killing the seven astronauts aboard”
- NASA’s reputation was badly damaged: “It also represented a serious blow to NASA's reputation, colouring the public perception of piloted spaceflight ..”
- Financial losses and reduced funding opportunity: “…and affecting the agency's ability to gain continued funding from Congress.”
- Couldn’t meet operational commitments: “Following the Challenger disaster, NASA grounded the remainder of the shuttle fleet while the risks were assessed more thoroughly, design flaws were identified, and modifications were developed and implemented.”

7 Identify Potential Disasters
We do not (usually) mean the same thing by “disaster” as is meant by the “Challenger Disaster”. Nevertheless, there are many outcomes we wish to avoid. The Tier-1 Disaster Management plan seeks to identify circumstances that have the potential to significantly impact:
- Safety
- Services
- Commitments
- Reputation
- Finances

8 Some Disasters
We can construct a list of obvious disasters, e.g.:
- Fire, flood, etc.
- Loss of network
- Security incident
We did this in the form of a risk analysis: DPv0.8.mht
We also have previous experience:
- CASTOR upgrade
- Disk firmware problems that made it impossible to run delivered hardware
- R89 delays (unable to manage deliveries)
- Backplane burnout (not a disaster, but very close)
Common themes:
- The ones we generated tended to be operational and to start suddenly
- The ones we suffered were slow-moving project management problems
We also need to be able to manage disasters that nobody thought of.

9 Evolution of a Disaster
Sometimes fast, sometimes slow, but with a similar result.

10 A Strategy
- Create a Disaster Management System which handles all potential disasters in a similar way.
- Identify common features and trigger levels to allow us to spot events before they blossom into disaster.
- Mess with existing processes as little as possible.
- Build specific contingency plans which add to the general response in specific circumstances.
- Trigger early, trigger often, respond ahead of the curve.
- Make use of the system routinely:
  - stops the system decaying
  - gives operational and project management benefits

11 Don’t Confuse Disaster with Routine OPS
Loss of power is not a disaster ... but failure of a routine restart may lead to disaster.

12 Routine Operations
We already have:
- Production Team (Gareth, John Kelly and Tiju)
- Admin on Duty (daytime)
- On-call (nighttime)
Routine operations should be:
- Looking for problems
- Fixing things
- Calling experts
- Notifying users
- Setting downtimes
- Assessing seriousness
- Reviewing events and improving future response
Routine operations are not part of the Disaster Management System, but they prevent many things moving into the system.

13 Need Escalating Response
Start lightweight (Stage 1: Disaster Potential):
- Informally assess/triage
- Monitor/compare against standard contingencies
- Set deadlines
- Watch for things leaving the expected script, but avoid interfering
Add some internal management (Stage 2: Disaster Possible):
- Add internal (group) oversight
- Formally assess
- Interfere more, divert resources
Escalate response to imminent disaster (Stage 3: Disaster Likely):
- Broaden oversight and expertise (include GRIDPP + department)
- Regular meetings with experiments
- Prepare contingencies
Manage actual disaster (Stage 4: Disaster)
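The escalating response above can be sketched as a small ordered state machine. This is only an illustration, not the Tier-1's actual tooling: the `Stage` enum, the `ROUTINE` level for normal operations, and the `next_stage` helper are hypothetical names, and the numbering of the two middle stages is inferred from the stage names on the slide.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Escalation stages named on the slide.

    ROUTINE is an assumed label for normal operations, which sit
    outside the Disaster Management System.
    """
    ROUTINE = 0
    DISASTER_POTENTIAL = 1   # start lightweight: informal triage, set deadlines
    DISASTER_POSSIBLE = 2    # add internal (group) oversight, formally assess
    DISASTER_LIKELY = 3      # broaden oversight, prepare contingencies
    DISASTER = 4             # manage the actual disaster


def next_stage(current: Stage, criteria_met: bool) -> Stage:
    """Move up exactly one stage when the criteria for the next stage are met.

    The top stage is terminal for escalation, so it is returned unchanged.
    """
    if criteria_met and current < Stage.DISASTER:
        return Stage(current + 1)
    return current
```

Using an ordered enum keeps the "trigger early, trigger often" idea cheap to apply: each review only has to answer whether the criteria for the next stage are met.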

14 At each stage
Formal list of pre-defined communications:
- Notify team of deadline to escalation
- Notify PMB that the incident is moving onto the disaster track
- Notify esc senior staff
- Advise Press & PR (as disaster approaches)
- ….
Formal list of actions that should be carried out, e.g.:
- Define roles
- Hold Incident Review Meeting
- Start process to obtain financial approval
- Arrange exceptional experiment liaison meeting
- Review policy documents
Formal list of criteria that get you to the next stage

15 Contingency Plans
Contingency plans supplement the general disaster management system. For each stage in the general system, supplement with:
- Criteria to get to (or avoid) this stage
- Actions to take at this stage
- Communications to make at this stage
Example contingency plan: Contingency_Plan_Major_Security_Incident.mht
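One way to picture how a contingency plan supplements the general system is as a per-stage record of criteria, actions and communications that is appended to the general record for the same stage. The sketch below is an assumption about structure only: `StagePlan` and `supplement` are hypothetical names, and the security-incident entry is invented for illustration rather than taken from the real contingency plan.

```python
from dataclasses import dataclass, field


@dataclass
class StagePlan:
    """Criteria, actions and communications attached to one stage."""
    criteria: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)
    communications: list[str] = field(default_factory=list)


def supplement(general: StagePlan, contingency: StagePlan) -> StagePlan:
    """Return the general plan with the contingency-specific items appended.

    Neither input is modified; the general response always comes first.
    """
    return StagePlan(
        criteria=general.criteria + contingency.criteria,
        actions=general.actions + contingency.actions,
        communications=general.communications + contingency.communications,
    )


# General items are taken from the "At each stage" slide.
general = StagePlan(
    actions=["Hold Incident Review Meeting"],
    communications=["Notify PMB incident is moving onto disaster track"],
)
# Hypothetical security-specific addition, for illustration only.
security = StagePlan(actions=["Isolate affected hosts"])

combined = supplement(general, security)
```

Keeping the general plan first reflects the slide's point that contingency plans add to the general response rather than replace it.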

16 Conclusions
The Disaster Management System is working. Already managed:
- Site DNS failure (reached Stage 1)
- Power failure (reached Stage 2)
It doesn’t replace our existing processes, but it does make sure they are responding correctly.
We expect it to manage equally well:
- Operations failures (network down and out)
- Project management failures (building delivered late)
- Unexpected problems (e.g. man from Mars at the door)
It is working well and giving immediate benefit.
It doesn’t avoid planning for the aftermath of a building fire (but will help manage the situation).
We are still working on contingency planning and experiment requirements.

