
1 GCC Cooling Problems and Recommendations, Computing Sector

2 On hot days, the cooling at GCC is inadequate to operate the computing equipment at the capacity for which the rooms were designed. The problem with the GCC condensers is that they stop working when their ambient input temperature rises above 115 °F. The combination of hot weather and the enclosed area between the GCC building and the berm generates a heat pit on the condenser pad, causing this input temperature limit to be exceeded and the condensers to become ineffective. [Photo: the condensers on the pad between the GCC building and the berm]

3 A FESS-commissioned engineering study recommends constructing a raised platform for the condensers in order to reduce the formation of the heat pit and the short-circuiting of hot exhaust back to the input. [Details on all options in later slides]
– We are asking for GPP funds to proceed with fixing the GCC cooling problem.
– Total amount requested = $1,316K
– Time for completion = 7 months, including contingency

4 GCC Loads and Design Capacity
[Chart: GCC-B and GCC-C loads against the per-room design capacity; current loads are not near the GCC room design capacity]
– Significant redeployment of computing equipment occurred in Summer 2011, and ~30% of all scientific computing equipment had to be turned off during the two very hot summer periods.
– More equipment is being added to the GCC rooms.
– Extra load from GCC-C equipment increases the input temperature to the GCC-B condensers – quite worrisome.

5 Without any remediation, we can operate GCC-B/C at the 2010 levels; even in 2010 we had several hot days when some nodes tripped off due to temperature effects.
– Measured max capacity = ~865 kW (48% of design)
– This is ~60% of the expected 2012 summer load, meaning ~40% of the GCC equipment must be turned off on hot days.
With external cooling similar to last summer, our expected cooling capacity is ~1140 kW (63% of design).
– This is still below our expected 2012 load of ~1375 kW, and we will need to turn off ~20% of the GCC computing equipment.
[Temporary cooling costs ~$160K/summer]
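
As a cross-check of the percentages on this slide, here is a minimal Python sketch of the load-shedding arithmetic. The 1800 kW combined design figure is inferred from "865 kW = 48% of design" (and matches 2 x 900 kW per room from slide 24); everything else is quoted above.

```python
# Cross-check of the GCC-B/C load-shedding percentages quoted on this slide.
DESIGN_KW = 1800       # inferred: 900 kW per room x 2 rooms (slide 24)
LOAD_2012_KW = 1375    # expected 2012 summer load (this slide)

def shed_fraction(cooling_kw: float, load_kw: float) -> float:
    """Fraction of equipment that must be turned off when load exceeds cooling."""
    return max(0.0, 1.0 - cooling_kw / load_kw)

for label, capacity_kw in [("no remediation", 865.0), ("temporary cooling", 1140.0)]:
    print(f"{label}: {capacity_kw / DESIGN_KW:.0%} of design, "
          f"shed {shed_fraction(capacity_kw, LOAD_2012_KW):.0%} of equipment")
# no remediation: 48% of design, shed 37% (the slide rounds to ~40%)
# temporary cooling: 63% of design, shed 17% (the slide rounds to ~20%)
```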

6 Conditions at the Grid Computing Center
[Chart: GCC temperature and load conditions, annotated "very close to failure, max load with external cooling"]
– Outages and reduced capacity occurred at GCC because the input temperature was too high for the condensers to work.
– We are guaranteed to have cooling problems at GCC whenever there is hot weather – scientific analysis is affected.

7 The GCC facility is a vital component for all experiments at FNAL, and its proper operation at design capacity is needed to support scientific output.
– We need a cooling solution that solves the problem, not one that just gets us by.
– The long-term viability/capacity of GCC is important because we need to use all our rooms (FCC/GCC/LCC) to the maximum extent possible.
[Chart: fraction of computing resources per room]

8 Importance of Computing in FY12
2012 will be a big year for the Tevatron program.
– The CDF/DØ experiments will work hard to complete a full suite of analyses with the full data set and whatever improved sensitivity features they can muster this year.
– It is critical to be successful this year: the manpower is moving on, and most of what the experiments do not accomplish in 2012 will never get done. No one is reloading students or postdocs, and the LHC is working well.

9 Importance of Computing in FY12
CDF and DØ are targeting two major conferences:
– ICHEP (Melbourne, Australia, starting in early July)
– the Higgs Hunting Workshop (France, late July)
These are the conferences targeted for "final final" results. Unlike all prior years, this year is not entirely conference driven: CDF and DØ will also be pushing to publish all analyses in 2012.
– Normally the critical period is the 6-8 weeks before a major conference, but this year ALL weeks are important. The experiments cannot afford to lose any significant portion of this summer, or they risk enough brain drain that they will not complete specific analyses.

10 Importance of Computing in FY12
There are MOUs between FNAL and CERN for CMS production computing availability.
– The requirement is 98% availability during beam time.
– Losing the Fermilab Tier-1 and the US CMS analysis facilities to cooling problems during the 2012 LHC run could put CMS data production and US CMS analysis efforts at some risk.
– The CMS data production scenarios under discussion for the 2012 high-pileup run conditions rely on the availability of Fermilab computing resources. The CMS data coordinators would have to be informed of potential problems and would stop scheduling work at FNAL if cooling outages are anticipated.
– The CMS primary data samples we care about would not be stored at FNAL, and the U.S. sites would have to transfer data from other sites to continue performing their analyses.
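
For scale, a back-of-the-envelope look at what a 98% availability requirement permits. The 30-day continuous-beam month is an illustrative assumption, not a term of the MOU:

```python
# Downtime budget implied by a 98% availability requirement.
AVAILABILITY = 0.98
HOURS_PER_MONTH = 30 * 24  # assumed: 30-day month of continuous beam

allowed_downtime_h = (1 - AVAILABILITY) * HOURS_PER_MONTH
print(f"allowed downtime: ~{allowed_downtime_h:.1f} h per month of beam")
# -> ~14.4 h/month; a single multi-day cooling outage would consume
#    several months of this budget.
```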

11 Importance of Computing in FY12
In 2012, the LQCD computing equipment in GCC represents 90% of the TFlop capacity of the $19M DOE Office of Science LQCD project.
– Last summer's outage represented a loss of 0.6 TF-yrs, or about 2.5% of the 2011 total.
– The estimated loss this upcoming summer is ~2% per week of outage.
– A cooling loss would impact physics production for the major conference, Lattice '12 (June 24-30).
– Outages would also likely mean that some physics projects allocated time by the USQCD collaboration would not finish.
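
The per-week estimate is consistent with simple calendar arithmetic; the sketch below also backs out the implied 2011 LQCD capacity, which is derived here rather than stated on the slide:

```python
# Scale of the LQCD deployment implied by the loss figures on this slide.
loss_2011_tf_yrs = 0.6      # last summer's outage
loss_2011_fraction = 0.025  # ~2.5% of the 2011 total

total_2011_tf_yrs = loss_2011_tf_yrs / loss_2011_fraction
print(f"implied 2011 capacity: ~{total_2011_tf_yrs:.0f} TF-yrs")  # ~24 TF-yrs

week_of_year = 7 / 365      # one week of year-round running
print(f"one lost week: ~{week_of_year:.1%} of a year's output")   # ~1.9%/week
```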

12 Impact of Outages
– An unscheduled outage of a day has a residual effect of 3-4 days. One day of downtime during a critical two-week analysis is devastating and can cause major delays to, or de-scoping of, conference presentations and publications.
– The effect of last year's cooling shutdown lingered throughout the fall, as normal maintenance scheduled for the summer was delayed until the fall. This included acceptance testing for newly purchased equipment, a planning and resource-scheduling nightmare.
– High temperatures and power cycling are not good for computing equipment and can lead to early mortality incidents.
– The human resource load, both in CD and at the experiments, was large: monitoring the situation, rescheduling, and prioritizing jobs based on predicted weather conditions.

13 GCC Cooling Study
FESS contracted CMT Engineering to determine the most effective way to modify the cooling at GCC to alleviate outages while the computing rooms are operated at their designed capacity.
– The final report was delivered Dec 22, 2011 (DocDB 4587).
The report stated that the temporary equipment rented for additional cooling during the hot days:
– does not address the root cause of the problem
– is not a permanent solution
– is only partially effective in keeping the cooling systems operating

14 Available Options
[Options table: see the summary on slide 21]
– The CMT report also included a recommendation to construct a cold-aisle containment system (retractable roofs and modular doors) at a total cost of $171K.
– Recommendation: $1,316K = remove berm + raised platform for the condensers + cold-aisle containment

15 Option 1 = Remove the berm
– More air would be able to circulate around the units, but it is unclear whether enough air would flow to solve the problem.
– 4 months to complete, including contingency
– $92K
– Engineers estimate the risk of failure at ~35%

16 Option 2 = Remove the berm & stagger the condensers
– Staggering the condensers increases airflow and decreases the possibility of recirculating hot air.
– Requires an extended concrete pad and new refrigerant piping.
– Move 5 units at a time; no downtime on "cool" days.
– 5 months to complete, including contingency
– $403K
– Engineers estimate the risk of failure at ~25%

17 Option 3 = Remove the berm and replace the 95 °F condensers with staggered 105 °F condensers
– Units rated at higher ambient temperatures could provide some additional reliability on hot days.
– The higher-rated models are larger and require an enlarged equipment pad.
– Install in staggered mode as in Option 2.
– Units could be replaced one at a time; no downtime.
– 7 months to complete, including contingency
– $1,107K
– Engineers estimate the risk of failure at ~15%

18 Option 4 = Install a raised platform and relocate the condensers to a height above the building and berm
– A new open-grate support structure would be installed directly over the existing equipment pad, similar to the existing GCC-A platform.
– Greatly improves air circulation.
– Removing the berm is not required, but it would greatly facilitate construction (forklifts instead of cranes).
– Relocate the existing condensers one at a time to the platform; new refrigerant piping required.
– 7 months to complete, including contingency
– $1,053K
– Engineers estimate the risk of failure at ~0%

19 Option 5 = Install a new chilled-water cooling system
– 12 months to complete, including contingency
– $7,802K
– Engineers estimate the risk of failure at ~0%

20 Phased Deployment
Could the GCC cooling upgrade be staged over several years?
– According to FESS engineering, yes, this is possible, but there would be extra costs due to the additional engineering work to split the project into multiple pieces and the multiple yearly contracts needed to execute the staged deployment.
– It is a question of cost versus the risk the lab is willing to accept.

21 Summary

Option                                         Schedule   Risk of Failure  Total Cost
1  Remove berm                                 4 months   35-40%           $92K
2  Remove berm and stagger condensers          5 months   25-30%           $403K
3  Replace condensers with 105 °F models       7 months   15-25%           $1,107K
4  Move condensers to raised platform          7 months   0-10%            $1,053K
5  Install new chilled-water cooling system    12 months  0-5%             $7,802K

The schedule shows that we need to act now to minimize cooling outages at GCC this summer! Since schedule contingency is already included, we believe we still have an excellent chance of completing the work before the hot days.
– Cost of temporary cooling at GCC is ~$160K/summer.
– Recommendation to resolve the GCC cooling problem: $1,316K = remove berm now ($92K) + raised platform for the condensers ($1,053K) + cold-aisle containment ($171K).
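
One rough way to weigh the one-time recommendation against continuing to rent temporary cooling each summer, using only the costs quoted above. The simple break-even model (no discounting, risk weighting, or inflation) is our assumption, not part of the CMT report:

```python
# Break-even of the recommended one-time fix vs. recurring temporary cooling.
FIX_COST_K = 92 + 1053 + 171     # berm + platform + containment = $1,316K
TEMP_COOLING_K_PER_SUMMER = 160  # recurring rental cost quoted on this slide

print(f"break-even after ~{FIX_COST_K / TEMP_COOLING_K_PER_SUMMER:.1f} summers")
# -> ~8.2 summers, and temporary cooling still sheds ~20% of the equipment.
```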

22 Backup Slides

23 Details on Cooling Outages
[Charts: GCC-B load and GCC-C load during the outage periods]

24 Temporary Measures
– 2010: Soaker water hoses were deployed under the condensers to cool the concrete pad and leverage evaporative cooling.
– October 2010: 80 condenser duct chimneys were added.
– 2011: Measures included soaker hoses, supplemental cooling, and other small operational improvements.
– However, these actions did not prevent load-shed incidents, and they will not provide sufficient heat rejection for the ultimate designed power density of 10.8 kW per rack, which gives a total computing capacity of 900 kW per room.
– Temporary cooling costs ~$160K/summer.
[Photos: condensers in their original state, and with duct chimneys added on top]
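
As a sanity check of the design figures above, the rack count per room that they imply (derived here, not stated on the slide):

```python
# Rack count implied by the designed power density quoted on this slide.
KW_PER_RACK = 10.8     # designed power density per rack
ROOM_DESIGN_KW = 900   # designed computing capacity per room

print(f"~{ROOM_DESIGN_KW / KW_PER_RACK:.0f} racks per room at design density")
# -> ~83 racks per room, i.e. ~1800 kW across GCC-B and GCC-C combined.
```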

