Oxford & SouthGrid Update, HEPiX, Pete Gronbech, GridPP Project Manager, October 2015


Oxford Particle Physics - Overview
Two computational clusters for the PP physicists:
– Grid Cluster: part of the SouthGrid Tier-2
– Local Cluster (AKA Tier-3)
A common Cobbler and Puppet system is used to install and maintain all the SL6 systems.
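
Not from the slide itself, but as a hedged illustration of how a node might be enrolled in such a Cobbler/Puppet setup (the hostname, profile name and MAC address below are made up), a minimal Python wrapper around the cobbler CLI could look like this; Puppet would then take over configuration once the node is installed.

```python
#!/usr/bin/env python
# Hedged sketch: register a new SL6 node with Cobbler so it can be PXE-installed,
# leaving ongoing configuration to Puppet. Hostname, profile name and MAC address
# below are hypothetical examples, not real Oxford values.
import subprocess

def register_node(name, mac, profile="SL6-x86_64"):
    """Add a system record to Cobbler and rebuild the netboot configuration."""
    subprocess.check_call([
        "cobbler", "system", "add",
        "--name=%s" % name,
        "--profile=%s" % profile,
        "--mac=%s" % mac,
    ])
    # Regenerate DHCP/TFTP configuration so the node can PXE boot.
    subprocess.check_call(["cobbler", "sync"])

if __name__ == "__main__":
    register_node("t2wn099.example.ox.ac.uk", "aa:bb:cc:dd:ee:ff")
```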

Oxford Grid Cluster
No major changes since last time. All of the Grid Cluster is now running HTCondor behind an ARC CE (a small status-check sketch follows this slide). The last CREAM CEs using Torque and Maui were decommissioned by 1st August.
Current capacity: 16,768 HS06
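
As a small, hedged sketch (not the site's actual tooling), the state of the HTCondor pool behind the ARC CE can be summarised from any submit or central-manager host with the standard condor_status tool:

```python
#!/usr/bin/env python
# Hedged sketch: print a one-line summary of the HTCondor pool sitting behind
# the ARC CE, using only the standard condor_status command.
import subprocess

def pool_summary():
    # "-total" restricts condor_status output to the summary totals.
    out = subprocess.check_output(["condor_status", "-total"])
    return out.decode() if isinstance(out, bytes) else out

if __name__ == "__main__":
    print(pool_summary())
```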

Oxford Local Cluster
Almost exclusively SL6 now. New procurement this summer:
– 4 Supermicro twin-squared CPU boxes provide 256 physical CPU cores. Chose Intel E5-2630v3s, which should provide an upgrade of ~4400 HS06 (see the check after this slide).
– Storage will be from Lenovo: 1U server with two disk shelves, containing 12 * 4TB SAS disks.
– Should provide an increased capacity of ~350TB: ~88TB for NFS and the rest for Lustre.
Current capacity: 6000 HS06, 680TB
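
A quick check of those numbers, assuming the usual four dual-socket nodes per Supermicro twin-squared chassis and the E5-2630v3's 8 physical cores per socket:

```python
# Quick sanity check of the quoted core count and HS06-per-core for the new kit.
# Assumptions: 4 dual-socket nodes per twin-squared chassis, 8 physical cores
# per E5-2630v3 socket; the ~4400 HS06 figure is taken from the slide.
chassis = 4
nodes_per_chassis = 4
sockets_per_node = 2
cores_per_socket = 8

cores = chassis * nodes_per_chassis * sockets_per_node * cores_per_socket
print(cores)              # 256, matching the slide
print(4400.0 / cores)     # ~17.2 HS06 per physical core
```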

Intel E5-2650v2 SL6 HEPSPEC06 – average result (figure)

Intel E5-2630v3 SL6 HEPSPEC06 (figure)
Peak average result 347 with 31 threads (~2% drop in compute power wrt the 2650v2).

Power Usage – Twin-squared chassis, 2650v2
Max 1165W, Idle 310W

Power Usage – Twin-squared chassis, 2630v3
Max 945W, Idle 250W (~19% drop in electrical power wrt the 2650v2)
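
The quoted ~19% saving follows directly from the figures measured on these two slides:

```python
# Electrical power comparison of the two twin-squared chassis, using the
# figures quoted on these slides.
max_2650v2, idle_2650v2 = 1165.0, 310.0   # watts
max_2630v3, idle_2630v3 = 945.0, 250.0    # watts

print((max_2650v2 - max_2630v3) / max_2650v2)    # ~0.19 -> ~19% lower at full load
print((idle_2650v2 - idle_2630v3) / idle_2650v2) # ~0.19 -> similar saving at idle
```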

SouthGrid Sites

RAL PPD
General
– Migrating all services to Puppet-managed SL6, actively killing off cfengine-managed SL5 services
– In the process of migrating our virtualisation cluster from a Windows 2008R2 Failover Cluster to Windows 2012R2
Tier 2
– Not much change; continues to be ARC CE/Condor/dCache – these services are now very reliable and low maintenance
– Currently deploying ELK
– New Nagios and Ganglia monitoring integrated with Puppet
Department
– Prototype analysis cluster – 5-node BeeGFS cluster
– ownCloud storage – see Chris's talk

JET
A very small cluster remains at JET. Closure of the central fusion VO gLite services by BIFI and CIEMAT will mean a reduction in the work done at this site.
Current capacity: 1772 HS06, 1.5TB

Birmingham Tier 2 Grid Site
Grid site running ATLAS, ALICE, LHCb and 10% other VOs.
The Birmingham Tier 2 site hardware consists of:
● 1200 cores across 56 machines at two campus locations
● ~470 TB of disk, mostly reserved for ATLAS and ALICE
● Using some Tier 3 hardware to run Vac during slow periods
Status of the site:
● Some very old workers are starting to fail, but still have plenty left
● Disks have been lasting surprisingly well, replacing fewer than expected
Future plans and upcoming changes:
● Replace CREAM and Torque/Maui with ARC and HTCondor
● Get some IPv6 addresses from central IT and start some tests (a sketch of such a test follows this slide)
Current capacity: 620TB
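
Not part of the slide, but a hedged sketch of the kind of first IPv6 test that could be run once addresses are assigned; the target host and port are illustrative only:

```python
#!/usr/bin/env python
# Hedged sketch: a very first IPv6 connectivity test of the kind mentioned on
# the slide. The target hostname and port below are illustrative only.
import socket

def ipv6_reachable(host, port=80, timeout=5.0):
    """Return True if a TCP connection over IPv6 to host:port succeeds."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no AAAA record / no IPv6 resolution
    for family, socktype, proto, _canon, sockaddr in infos:
        s = socket.socket(family, socktype, proto)
        s.settimeout(timeout)
        try:
            s.connect(sockaddr)
            return True
        except (socket.timeout, socket.error):
            continue
        finally:
            s.close()
    return False

if __name__ == "__main__":
    print(ipv6_reachable("www.example.org"))
```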

Birmingham Tier 3 Systems
The Birmingham Tier 3 site hardware consists of:
● Batch cluster farm: 8 nodes, 32 (logical) cores, 48GB per node
● 160TB 'new' storage + 160TB 'old' storage
● ~60 desktop machines
● 5 1U VM hosts
The following services are used, running on VMs where appropriate:
● Cluster management through Puppet
● New storage accessed through Lustre, old storage over NFS
● Condor batch system set up with dynamic queues
● Authentication through LDAP
● DHCP and mail servers
● Desktops run Fedora 21 but can chroot into SL5 and SL6 images

Cambridge (1) – GRID
Currently have 288 CPU cores and 350TB storage.
Major recent news is the move of all GRID kit to the new University Data Centre:
– The move went reasonably smoothly.
– Had to make a small reorganisation of servers in the racks to avoid exceeding power and cooling limits: the "Standard" Hall is limited to 8kW per rack; the "HPC" Hall has higher limits, but space there is limited and we would be charged for rack space (we are currently not charged for rack space in the Standard Hall).
– Kit is still on the HEP network as before (networking provision in the Data Centre is "work in progress"), hence we've had to buy a circuit from the University between the department and the Data Centre.
– At present we are not being charged for electricity and cooling in the Data Centre.
Current capacity: 3729 HS06, 350TB

Cambridge (2) – HEP Cluster
– Currently have ~40 desktops, 112 CPU cores in a batch farm and ~200TB storage.
– Desktops and farm are in a common HTCondor setup, running SLC6 on both desktop and farm machines.
– Storage is on a (small) number of DAS servers. Considering a trial of Lustre (though it may be a "sledgehammer to crack a walnut").
– Backup is currently to an LTO5 autoloader; this needs replacement in the near-ish future. The University is making noises about providing Storage as a Service, which might be a way forward, but the initial suggested pricing looked uncompetitive. Also, as usual, things move at a glacial pace and the replacement may be needed before a production service is available.

Bristol
Bristol is switching from a StoRM SE to DMLite, which has been a big change.
– The Storage Group was keen for a DMLite-with-HDFS test site, looking for an SE that will work with HDFS and the European authentication infrastructure (so not BeStMan); DMLite was suggested.
– First iterations did not work, but the developers followed up.
– Now running the DMLite SE with 2x GridFTP servers (no SRM); the GridFTP servers use tmpfs as a fast buffer (see the check sketched after this slide).
– Missing bits (but in the queue): BDII fixes (proper reporting) and a High Availability (HA) config (currently only one name node can be specified).
– Performance improvements might come for the GridFTP servers (skip buffering in the case of single streams).
Current capacity: 218TB
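
As a hedged aside (not Bristol's actual tooling), one way to confirm that the GridFTP buffer directory really sits on tmpfs is to resolve its mount point from /proc/mounts; the buffer path used here is hypothetical:

```python
#!/usr/bin/env python
# Hedged sketch: confirm that a GridFTP buffer directory is backed by tmpfs by
# walking /proc/mounts. The buffer path below is hypothetical.
import os

def filesystem_type(path):
    """Return the filesystem type of the mount point containing 'path'."""
    path = os.path.realpath(path)
    best, fstype = "", None
    with open("/proc/mounts") as mounts:
        for line in mounts:
            _dev, mountpoint, fs = line.split()[:3]
            if path == mountpoint or path.startswith(mountpoint.rstrip("/") + "/"):
                if len(mountpoint) > len(best):
                    best, fstype = mountpoint, fs
    return fstype

if __name__ == "__main__":
    # Expect "tmpfs" on the GridFTP servers for the buffer area.
    print(filesystem_type("/gridftp-buffer"))
```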

University of Sussex - Site Report
- Mixed-use cluster with both local and WLCG Tier-2 Grid usage
- ~3000 cores across 112 nodes, a mixture of Intel and AMD processors
- Housed inside the University's central Datacentre across 5 racks
- InfiniBand network, using QLogic/Intel TrueScale 40Gb QDR
- ~380TB Lustre filesystem used as scratch area, with NFS for other storage
- Grid uses a few dedicated nodes, predominantly for ATLAS and SNO+
Batch System
- Univa Grid Engine (UGE) is used as the batch system.
- Started using UGE's support for cgroup resource limiting to give proper main-memory limits to user jobs (see the sketch after this slide). Very happy with how it is working, along with a default resource allocation policy implemented via a JSV script, making usage fairer.
Current capacity: 1977 HS06, 70TB
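
As a hedged illustration of what UGE's cgroup integration does under the hood, the sketch below applies a main-memory cap through the raw cgroup v1 memory controller; the group name, limit and PID handling are made up for illustration, and this is not UGE's own code:

```python
#!/usr/bin/env python
# Hedged sketch: the kind of per-job main-memory limit that UGE's cgroup
# support applies, expressed directly against the cgroup v1 memory controller.
# Requires root and a cgroup v1 memory hierarchy; values are illustrative only.
import os

CGROOT = "/sys/fs/cgroup/memory"

def limit_job_memory(job_name, limit_bytes, pid):
    cg = os.path.join(CGROOT, job_name)
    if not os.path.isdir(cg):
        os.mkdir(cg)
    # Cap the job's resident memory.
    with open(os.path.join(cg, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    # Move the job's process into the cgroup.
    with open(os.path.join(cg, "tasks"), "w") as f:
        f.write(str(pid))

if __name__ == "__main__":
    # e.g. limit the current process to 4 GiB of main memory
    limit_job_memory("demo_job", 4 * 1024**3, os.getpid())
```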

University of Sussex - Site Report
Lustre
- Core filesystem for users to perform all I/O on. Nodes have very limited local scratch; all are expected to use Lustre.
- Have had a very good experience with Lustre for the past few years, self-maintained using community releases.
- The move to a newer Lustre release was held back by LU-1482, which prevents the Grid StoRM middleware from working on it. A recent unofficial patch to Lustre that fixes this issue enabled us to perform the upgrade.
- Bought new MDS servers, enabling us to set up a completely separate system to the pre-existing one. We can then mount both filesystems and copy data between the two.
- Currently have a 380TB Lustre system and a 280TB Lustre filesystem. After decommissioning the old system we will have a ~600TB unified Lustre filesystem.
- Experimenting with the robinhood policy engine for Lustre filesystem usage analysis (see the sketch after this slide). Good experience so far.
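
Robinhood itself builds its reports from a database fed by Lustre changelogs; as a much simpler, hedged stand-in for the idea, per-user usage of a directory tree can be summarised like this (the mount point is hypothetical):

```python
#!/usr/bin/env python
# Hedged sketch: a crude per-user usage summary of a filesystem tree, standing
# in for the kind of report robinhood produces. This is only an illustration;
# robinhood works from its own database, not a live os.walk scan.
import os
import pwd
from collections import defaultdict

def usage_by_owner(root):
    totals = defaultdict(int)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or unreadable
            totals[st.st_uid] += st.st_size
    return totals

if __name__ == "__main__":
    for uid, nbytes in sorted(usage_by_owner("/mnt/lustre").items()):
        try:
            user = pwd.getpwuid(uid).pw_name
        except KeyError:
            user = str(uid)
        print("%-12s %10.1f GB" % (user, nbytes / 1e9))
```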

The Begbroke Computer Room houses the Oxford Tier 2 Cluster.

Spot the difference (photos)

Air Con Failure
Saturday 16:19, Sunday 16:13, Monday 17:04

People came into the site or remotely switched off clusters very quickly. Building Services reset the A/C in 1-2 hours on both weekend days. The bulk of the load comes from University and Physics HPC clusters, but it turns out some critical University financial services were also being run at this site. The incident was taken seriously: a backup system was ordered on Tuesday morning and installed from 10pm to 2am that night. It provides ~100kW of backup in case of further trips. Normal load is ~230kW, so the main clusters are currently restricted.

Additional temporary A/C (photos)

The pressurisation unit had two faults, both now repaired. A new, improved unit is to be installed on Monday.
A 200kW computing load heats up very quickly when the A/C fails, and it always seems to do this out of hours. You need to react very quickly; this really needs to be automated, as even a fast response from staff is not fast enough (a sketch of such a watchdog follows). Even risk mitigation has its own risks.
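
A hedged sketch of what such automation might look like; the sensor path, trip threshold and drain command are all assumptions rather than Oxford's actual setup:

```python
#!/usr/bin/env python
# Hedged sketch of an automated over-temperature response. The sensor path,
# threshold and drain command are assumptions for illustration only.
import subprocess
import time

SENSOR = "/sys/class/thermal/thermal_zone0/temp"  # millidegrees Celsius on Linux
LIMIT_C = 35.0                                    # assumed trip point
POLL_SECONDS = 30

def temperature_c():
    with open(SENSOR) as f:
        return int(f.read().strip()) / 1000.0

def drain_cluster():
    # Example action only: ask HTCondor to stop all nodes peacefully.
    subprocess.call(["condor_off", "-peaceful", "-all"])

if __name__ == "__main__":
    while True:
        if temperature_c() > LIMIT_C:
            drain_cluster()
            break
        time.sleep(POLL_SECONDS)
```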

Water and Electrics!! (photos)