Oxford Site Report HEPSYSMAN

Presentation transcript:

Oxford Site Report HEPSYSMAN
Pete Gronbech, GridPP Project Manager
June 22nd 2016

Oxford Local Cluster
- Current capacity: 6000 HS06, 680TB. Almost exclusively SL6 now.
- New procurement summer 2015: four Supermicro twin-squared CPU boxes provide 256 physical CPU cores. We chose Intel E5-2630v3s, which should provide an upgrade of ~4400 HS06.
- Storage from Lenovo: 1U servers with two disk shelves containing 12 × 4TB SAS disks, providing an increased capacity of ~350TB (~88TB for NFS and the rest for Lustre).
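As a back-of-envelope cross-check of the CPU figures above (a sketch only: the chassis layout and the 8-core count of the E5-2630v3 are public hardware facts; the per-core HS06 figure is merely what the slide's numbers imply):

    # Cross-check of the 2015 procurement numbers. A Supermicro
    # "twin-squared" 2U chassis holds 4 dual-socket nodes, and the
    # Intel E5-2630v3 is an 8-core part.
    chassis = 4
    nodes_per_chassis = 4
    sockets_per_node = 2
    cores_per_socket = 8
    total_cores = chassis * nodes_per_chassis * sockets_per_node * cores_per_socket
    print(total_cores)            # 256 physical cores, matching the slide
    print(4400 / total_cores)     # ~17.2 HS06 per physical core (implied)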

Oxford’s Grid Cluster


Oxford Grid Cluster
- GridPP4 status, Autumn 2015: current capacity 16,768 HS06, 980TB of DPM storage.
- Older Supermicro 26-, 24- and 36-bay servers have been decommissioned, so capacity is provided by Dell 510s and 710s: 12-bay, with 2 or 4TB disks.
- The majority of CPU nodes are 'twin-squared' Viglen Supermicro worker nodes: Intel E5, 8-core (16 hyper-threaded cores each), providing 1300 job slots with 2GB RAM each. (Some viab nodes, ~88 cores.)
- The majority of the Grid Cluster runs HTCondor behind an ARC CE.
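Since the pool runs HTCondor, a slot count like the ~1300 quoted above can be sanity-checked against the collector. A minimal sketch using the HTCondor Python bindings; it assumes the htcondor module is installed on a host that can see the pool, and the commented-out hostname is a placeholder:

    # Sketch: count job slots and their memory in an HTCondor pool.
    import htcondor

    coll = htcondor.Collector()  # or htcondor.Collector("collector.example.ac.uk")
    slots = coll.query(htcondor.AdTypes.Startd,
                       projection=["Name", "Memory", "State"])

    print("total slots:", len(slots))                      # expect ~1300 here
    claimed = sum(1 for s in slots if s.get("State") == "Claimed")
    print("claimed slots:", claimed)
    small = sum(1 for s in slots if s.get("Memory", 0) < 2048)
    print("slots with <2GB RAM:", small)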

Decommissioning old storage
- 17 Supermicro servers removed from the DPM SE (a reduction of 320TB; new total 980TB).
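For context, pulling a disk server out of a DPM storage element is normally preceded by draining its file replicas onto the remaining servers. A heavily hedged sketch of that step; the hostname is hypothetical, and the admin tooling (dpm-drain here, from the legacy DPM toolset) varies between DPM versions:

    # Sketch: migrate all replicas off one disk server before decommissioning.
    import subprocess

    server = "se-old01.physics.ox.ac.uk"  # hypothetical disk server name
    # dpm-drain moves every replica hosted on this server to the rest of
    # the pool; newer dmlite-based installs use different tooling.
    subprocess.run(["dpm-drain", "--server", server], check=True)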

Storage March 2016
- Old servers removed, switched off or repurposed.
- Software up to date; OS up to date.
- Simplified!

Plan for GridPP5 CPU
- GridPP4+ hardware money was spent on CPU. This rack is identical to the kit used by Oxford Advanced Research Computing.
- The initial plan is to plug it into our Condor batch system as worker nodes.
- When staff levels and time allow, we will investigate integrating the rack into the ARC cluster.

CPU Upgrade March 2016
- Lenovo NeXtScale: 25 nodes, each with dual E5-2640 v3 and 64GB RAM.
- 800 new cores (new total ~2200); with these 8-core CPUs that is 25 × 2 × 8 = 400 physical cores, i.e. 800 hyper-threaded cores.

Begbroke Computer Room houses the Oxford Tier 2 Cluster

Spot the difference

Air Con Failure
- Saturday Sept 26th 2015, 16:19
- Sunday, 16:13
- Monday, 17:04

- People came into site, or remotely switched off clusters, very quickly.
- Building Services reset the A/C within 1-2 hours on both weekend days.
- The bulk of the load comes from the University and Physics HPC clusters, but it turned out some critical University financial services were also running at this site.
- The incident was taken seriously: a backup A/C system was ordered on Tuesday morning and installed from 10pm to 2am that night. It provides ~100kW of backup cooling in case of further trips; normal load is ~230kW, so the main clusters were restricted.

Additional temporary A/C

Even risk mitigation has its own risks
- The pressurisation unit had two faults, both repaired. A new, improved unit is to be installed on Monday.
- A 200kW computing load heats up very, very quickly when the A/C fails, and it always seems to do this out of hours.
- You need to react very quickly. This really needs to be automated; even a fast response from staff is not fast enough (a sketch of the idea follows).
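A minimal sketch of the sort of automation meant here: poll a machine-room temperature sensor and trigger an orderly shutdown when a threshold is crossed. The sensor URL, threshold and shutdown hook below are all hypothetical and site-specific:

    # Sketch: automatic emergency shutdown on machine-room over-temperature.
    # A real deployment would also page staff and power down in stages,
    # least-critical racks first.
    import subprocess
    import time
    import urllib.request

    SENSOR_URL = "http://env-monitor.example.ac.uk/temp"  # hypothetical sensor
    THRESHOLD_C = 30.0                                    # example trip point
    POLL_SECONDS = 30

    def room_temperature():
        with urllib.request.urlopen(SENSOR_URL, timeout=5) as resp:
            return float(resp.read())

    while True:
        try:
            temp = room_temperature()
        except OSError:
            temp = None   # sensor unreachable: a real system should alert here
        if temp is not None and temp > THRESHOLD_C:
            # Site-specific script that walks the node list and shuts down.
            subprocess.run(["/usr/local/sbin/emergency-shutdown"], check=False)
            break
        time.sleep(POLL_SECONDS)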

Water and electrics!

Here we go again
- Unfortunately a thunderstorm on Sunday 12th June 2016 caused a large brown-out at the Begbroke site, which caused some pumps to switch off.
- The pattern was very similar to last time: Physics and ARC staff were alerted first by our monitoring, and it took too long for Building Services to get to site.
- There is still work to do to improve this sort of call-out.

Conclusions

Local Cluster: the 2015 upgrade went well; we now need to plan and purchase the 2016 upgrade. Recruitment is ongoing.

Grid Cluster: a time of streamlining and rationalisation. There is still a lot of work to do investigating integration with university resources. Will this be possible? Will it save time, or allow bursting to greater resources? There are possible benefits in cost savings on hardware maintenance and electricity.

A/C problems: we need a faster response from Building Services, and automatic shutdown of systems.