ICW Water Leak at FCC Forces a Partial Computing Outage at Feynman Computing Center
J. MacNerland, M. Thomas, G. Bellendir, A. Walters
May 31, 2007

Leak Detected by Monitoring  A leak was detected at 0618 on Thursday, May 31, 2007  The source was the main ICW water pipe in the east utility area of the Feynman Computing Center  Assessment was initiated, and at 0745 email was sent to CD management & stakeholders to advise them of a probable outage
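
The slides do not describe the monitoring mechanism itself; a minimal sketch of a leak-alert poller of this general kind, assuming a hypothetical sensor endpoint, SMTP host and addresses, might look like:

```python
# Minimal leak-alert poller (sketch). The sensor URL, SMTP host and
# addresses are hypothetical; the actual FCC monitoring system is not
# described in the slides.
import smtplib
import time
from email.message import EmailMessage
from urllib.request import urlopen

SENSOR_URL = "http://monitor.example.gov/fcc/icw/leak"  # hypothetical
ALERT_TO = "cd-management@example.gov"                   # hypothetical

def leak_detected() -> bool:
    """Poll the leak sensor; a body of b'1' means water present."""
    with urlopen(SENSOR_URL, timeout=10) as resp:
        return resp.read().strip() == b"1"

def send_alert() -> None:
    msg = EmailMessage()
    msg["Subject"] = "ICW leak detected at FCC east utility area"
    msg["From"] = "fcc-monitor@example.gov"  # hypothetical
    msg["To"] = ALERT_TO
    msg.set_content("Leak sensor tripped; assessment under way.")
    with smtplib.SMTP("smtp.example.gov") as s:  # hypothetical host
        s.send_message(msg)

if __name__ == "__main__":
    while True:            # poll once a minute until the sensor trips
        if leak_detected():
            send_alert()
            break
        time.sleep(60)
```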

End Cap on 8” Incoming ICW  The end cap had deteriorated and water was spraying out onto the floor  There was concern that at any time the pressure could increase the size of the rupture and the velocity of the leak

Affected Computing Equipment  The ICW water is required by the CRACs (Computer Room Air Conditioners) that cool the computer rooms, and by the air conditioning for the offices  There was potential impact on core servers (email, web, networking, storage, databases, scientific computing and BSS financials)

Understanding the impact  By mid-morning, the working estimate forecast a three-hour outage starting at 1300  Stakeholders were advised and the list of critical systems was refined  Stakeholders prepared plans to shut down non-critical systems and to be ready to restart later in the day

Deploying The Plan  It was calculated that if non-critical systems in the FCC computer rooms were shut down, critical services would be able to remain up for the three-hour outage  Most computing on FCC1 was shut down; networking and core servers on FCC2 remained up
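
The ride-through calculation itself does not appear in the slides; a back-of-the-envelope version, with every number assumed purely for illustration, runs along these lines:

```python
# Toy ride-through estimate behind a "shed non-critical load" decision.
# All figures are illustrative assumptions, not values from the report.
CRITICAL_LOAD_KW = 40.0        # assumed critical IT load left running
ROOM_HEAT_CAP_MJ_PER_K = 50.0  # assumed effective heat capacity of the
                               # room air plus structure and equipment
MAX_RISE_C = 15.0              # assumed tolerable temperature rise

# With ICW (and hence CRAC cooling) lost, IT power becomes sensible heat.
heat_mj_per_h = CRITICAL_LOAD_KW * 3600.0 / 1000.0  # kW -> MJ/h
rise_c_per_h = heat_mj_per_h / ROOM_HEAT_CAP_MJ_PER_K
ride_through_h = MAX_RISE_C / rise_c_per_h
print(f"~{rise_c_per_h:.1f} C/h rise, ~{ride_through_h:.1f} h ride-through")
# With these assumptions: ~2.9 C/h and roughly 5 h to the limit, which
# would comfortably cover a three-hour outage.
```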

Deploying The Plan  Stakeholders were formally advised at 1137 that the outage would be from 1300 to 1600  All CD employees were advised about the outage in FCC computer rooms and offices and asked to turn off non-essential devices  Stakeholders advised customers of affected services

During the Outage  Work on the water leak commenced  The weather was overcast and roughly 80 °F  Some investigation and action were required for a few instances of non-essential computing that remained up  At 1520, email was sent out revising the outage estimate to 1700 – Temperatures remained moderate except in the east end of FCC2, where networking and core servers reside

Critical Decision Point  At 1600 (4:00 PM), a management decision was made to shut down all systems, as the temperature around the core servers had risen significantly  Shortly thereafter, a firm estimate for completing the work was provided (~1700)  The ICW was restored and cooling was restarted  FCC1 computing systems started coming back up at 1715 and FCC2 at 1725

Near Miss?  Most of the networking and core services remained available throughout  Some key services did go offline as equipment was shut down or failed  Services that were not high-availability were down for approximately 4–6 hours  Much effort was spent keeping vital core services available and managing the other services

Lessons Learned  The Feynman Computing Center ICW supply is a high-risk single point of failure: a failure in the main, or lack of water to FCC, can take down the cooling in the building for days  This risk is exacerbated by the aging infrastructure; the water pipes are more than 20 years old

Lessons Learned  Requiring non-critical systems to be powered down some minutes prior to the outage allows the CRACs to remove the residual heat  All non-critical systems need to be powered down prior to the outage  Having networking and core servers centralized together created hot spots; physical distribution would allow systems to remain up longer under reduced cooling
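
A toy calculation of why a shutdown window of minutes is enough for the heat purge, with all figures assumed rather than taken from the slides:

```python
# Toy estimate of how long the CRACs need to purge residual heat once
# non-critical load is powered off. All figures are assumptions.
CRAC_CAPACITY_KW = 500.0  # assumed total CRAC cooling capacity
CRIT_LOAD_KW = 40.0       # assumed critical load still running
STORED_HEAT_MJ = 100.0    # assumed heat stored in equipment and air

# With the non-critical load off, the spare cooling capacity works
# down the stored heat.
spare_kw = CRAC_CAPACITY_KW - CRIT_LOAD_KW
purge_min = STORED_HEAT_MJ * 1000.0 / spare_kw / 60.0  # MJ -> kJ; kW = kJ/s
print(f"~{purge_min:.0f} min to purge stored heat")    # ~4 min here
```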

Hot Spots on FCC2  Core servers are grouped together (top photo)  Test systems in the lower photo (right side) produce warm air in direct proximity to the intake of the core servers

Hot Spots on FCC2  Considerable density in the networking area creates a hot spot  Note that the 60 kVA UPS in the lower photo produces extra heat

Risk Mitigation – Portable Chiller  The FCC water infrastructure has four flanges in the 8” pipes that allow for the connection of one or two portable chillers  The logistics of renting a generator and chiller take many hours and significant funding (having a PO in place)  Guaranteeing faster delivery requires additional funds (not regarded as a viable strategy)  Implementing such a plan requires tremendous coordination among many roles and is probably best suited to ICW outages that can be planned and anticipated

Risk Mitigation – Portable Chiller  There are two scenarios  1) A 100-ton chiller connected to house power or a portable generator to provide chilled water for five CRACs (four on FCC2: AC-13, 14, 15 and 16; one on FCC1: AC-17). These five CRACs (four on the west side of FCC2) are not ideally located to serve the core servers and networking

Risk Mitigation – Portable Chiller  2) Two 100-ton chillers could be connected to serve the entire FCC building, including the DX and CW CRACs; this requires two portable generators, one per chiller  Either scenario is best suited to scheduled, multi-day extended outages
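
For a sense of scale, refrigeration tons convert to kilowatts with the standard factor of 3.517 kW per ton (the chiller sizes are from the slides):

```python
# Cooling capacity of the two portable-chiller scenarios. Chiller sizes
# are from the slides; the kW-per-ton factor is the standard conversion.
KW_PER_TON = 3.517                       # 1 ton refrigeration = 3.517 kW

one_chiller_kw = 100 * KW_PER_TON        # scenario 1: five CRACs
two_chillers_kw = 2 * 100 * KW_PER_TON   # scenario 2: whole building
print(f"one 100-ton chiller : {one_chiller_kw:.0f} kW of cooling")
print(f"two 100-ton chillers: {two_chillers_kw:.0f} kW of cooling")
# -> about 352 kW and 703 kW respectively
```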

Risk Mitigation – Portable Chiller  Shown are the flanges used to connect an external chiller for the chilled-water CRACs, along with the electric panel

Risk Mitigation – Portable Chiller  Shown to the right are the covers of the flanges on the 8” pipes used to connect cooling for the entire building

Risk Mitigation – Portable ACs  Consideration has been given to portable air conditioners  Emergency delivery terms would need to be understood, as standard delivery takes hours at a minimum  There are many safety considerations, a large number of stakeholders involved and significant logistics coordination required

Deployment of Portable ACs  One portable 30-ton AC would require three 20” hoses; the FCC2 cooling output of 8 CRACs is 144 tons  Removal of FCC2 windows in the computer room would be required to allow the 20” air hoses to pass through
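
A quick check of the unit count these figures imply:

```python
# Portable-AC sizing check using the slide's figures: FCC2's eight
# CRACs deliver 144 tons of cooling, one portable unit delivers 30.
import math

fcc2_load_tons = 144      # from the slides
unit_tons = 30            # from the slides
hoses_per_unit = 3        # three 20-inch air hoses per unit

units = math.ceil(fcc2_load_tons / unit_tons)
print(f"{units} portable units, {units * hoses_per_unit} hoses "
      f"through removed window openings")  # -> 5 units, 15 hoses
```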

Impaired Air Flow  Distribution of the cold air from portable ACs could be a problem (lower CFM and localized distribution)  Circulating cool air under the floor is impaired by years of accumulated cabling and other obstructions

Final Comments On The Outage  The Feynman Computing Center ICW supply is a high-risk single point of failure capable of causing an outage of more than 24 hours, affecting the Laboratory’s core services (networking, email, web, databases, scientific computing, BSS financials, etc.)

Recommendation  Eliminate the root cause: the dependency on ICW for cooling the Feynman Computing Center  Replace the end-of-life fleet of DX and CW CRAC (Computer Room Air Conditioning) units with liquid-refrigerant CRACs  Address the very limited FCC cooling capacity, which restricts future growth, by providing adequate cooling with modern cooling strategies, equipment and controls

FESS Reports  Project Definition Report - FCC CRAC Replacement Study. FESS/Engineering, September 28, 2007  Project Definition Report - FCC1E Computer Room Upgrades. FESS/Engineering, June 9, 2006