UKI-LT2-RHUL ATLAS STEP09 report. Duncan Rand, on behalf of the RHUL Grid team.


The Newton cluster
- 400 cores (2.4 kSI2K; 7.3 HEP-SPEC06/core), 2 GB RAM per core
- DPM: 12 x 22 TB pool nodes (total 264 TB, 264 disks)
- CE/SE well resourced (similar spec to the worker nodes)
- Network: WAN 1 Gbit/s; LAN ~37 Gbit/s; worker nodes 2 Gbit/s; storage 4 Gbit/s
- Located in the Imperial College machine room (LMAN PoP)
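As a quick sanity check of the headline figures above, the total compute and disk capacity implied by the quoted per-core and per-node numbers works out as follows; this is an illustrative calculation using only the values on the slide, not an additional measurement.

cores = 400
hs06_per_core = 7.3
pool_nodes = 12
tb_per_node = 22

total_hs06 = cores * hs06_per_core        # 2920 HEP-SPEC06 of compute
total_disk_tb = pool_nodes * tb_per_node  # 264 TB of DPM storage, as quoted

print(f"{total_hs06:.0f} HEP-SPEC06, {total_disk_tb} TB")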

ATLAS STEP09
1. Data distribution to the site
2. Monte Carlo production
3. HammerCloud user analysis
   - Ganga WMS: RFIO and file staging
   - Panda: file staging only

Data distribution
- 23% of the UK data, i.e. 26 TB
- No obvious problems were encountered in this part of the exercise; data was absorbed by the site as and when required, apparently without backlogs
- Disk was sufficient: no space tokens filled up
- Transfers were bursty
- The cluster moves to Egham over the summer, which should support a move to a 2 Gbit/s WAN
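To put the 26 TB into perspective against the 1 Gbit/s WAN quoted earlier, a rough back-of-the-envelope calculation; the assumption of sustained line-rate transfer is purely for illustration.

data_tb = 26            # ATLAS data imported during STEP09
wan_gbit_per_s = 1.0    # current WAN capacity

seconds_at_line_rate = data_tb * 1e12 * 8 / (wan_gbit_per_s * 1e9)
print(f"{seconds_at_line_rate / 86400:.1f} days at full line rate")  # ~2.4 days

Since the transfers were bursty rather than steady, peak demand sits well above this average, which supports the case for the 2 Gbit/s link after the move to Egham.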

Data transfer to and from RHUL

Monte Carlo production
- A steady rate of Monte Carlo production was maintained throughout STEP09
- A total of 5578 jobs completed, with an efficiency of 97.5%

MC production at RHUL

IC-HEP: MC data upload problems
- UKI-LT2-IC-HEP stages out to the RHUL SE
- Problems copying data back to RAL over the weekend of 5-6th June
- GridFTP transfers stalled until they timed out (3600 s), blocking one or more of the 4 FTS file shares on the RALLCG2-UKILT2RHUL channel
- Result: more than 2000 IC-HEP jobs in the transferring state, and consequent throttling of jobs
- The original cause of the stalls still needs to be understood
- Recommendation: improve the FTS channel configuration to mitigate similar problems in the future (a toy illustration follows)
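The sketch below illustrates why a single stalled transfer hurts so much on this channel. The 4 concurrent file slots and the 3600 s timeout are taken from the slide; the 120 s figure for a healthy file copy is an assumed, illustrative value.

concurrent_files = 4       # FTS file slots on the RALLCG2-UKILT2RHUL channel
timeout_s = 3600           # a stalled GridFTP transfer only fails after this long
healthy_copy_s = 120       # assumption: typical time to copy one healthy file

# Each stalled file occupies a slot for the full hour before failing, i.e. the
# channel loses the equivalent of this many successful copies per stall:
copies_lost_per_stall = timeout_s / healthy_copy_s     # 30 files

# While it lasts, one stall also removes a quarter of the channel outright:
capacity_lost = 1 / concurrent_files                   # 25%

print(f"Each stall costs ~{copies_lost_per_stall:.0f} file copies and "
      f"{capacity_lost:.0%} of the channel while it lasts")

With output backing up at this rate it is easy to see how the number of jobs in the transferring state climbed past 2000; shortening the timeout or raising the number of concurrent files on the channel are the kind of FTS configuration changes the recommendation presumably has in mind.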

MC production at Imperial-HEP

Batch system shares
- Appropriate fair shares between and within VOs were set in Maui (after an idea from RAL)
- The ACCOUNTCFG option was used to set percentages for the three LHC VOs and 'others'
- The GROUPCFG option (i.e. unix group) was used to set intra-VO fair shares and the 'others' share, e.g. 50% /atlas/Role=production, 25% /atlas/Role=pilot, 25% /atlas
- This was successful and allowed tight control of the resource (a simplified sketch of how the weights interact follows the configuration below)
- In practice, the capping of ATLAS analysis jobs prevented the fair share from operating fully during STEP09

FSUSERWEIGHT        10
FSGROUPWEIGHT       100
FSACCOUNTWEIGHT     1000

# ATLAS 89% overall
ACCOUNTCFG[atlas]    FSTARGET=89
# Within ATLAS: 50% /atlas/Role=production, 25% /atlas/Role=pilot, 25% /atlas
GROUPCFG[atlas]      FSTARGET=25 ADEF=atlas MAXJOB=400
GROUPCFG[atlaspil]   FSTARGET=25 ADEF=atlas MAXJOB=400
GROUPCFG[atlasprd]   FSTARGET=50 ADEF=atlas MAXJOB=400
GROUPCFG[atlassgm]   FSTARGET=1  ADEF=atlas PRIORITY= MAXJOB=4

# CMS 5% overall
ACCOUNTCFG[cms]      FSTARGET=5
# Within CMS: 50% /CMS/Role=production, 50% /CMS
GROUPCFG[cms]        FSTARGET=50 ADEF=cms
GROUPCFG[cmsprd]     FSTARGET=50 ADEF=cms
GROUPCFG[cmssgm]     FSTARGET=1  ADEF=cms PRIORITY= MAXJOB=6

# LHCb 5% overall
ACCOUNTCFG[lhcb]     FSTARGET=5
# 90% Role=pilot; 5% Role=user; 4% Role=production; 1% Role=lcgadmin
GROUPCFG[lhcb]       FSTARGET=5  ADEF=lhcb
GROUPCFG[lhcbprd]    FSTARGET=4  ADEF=lhcb
GROUPCFG[lhcbpil]    FSTARGET=90 ADEF=lhcb

# non-LHC VO's get 1% of cluster, shared equally (GROUPCFG FSTARGET=1 for all)
ACCOUNTCFG[others]   FSTARGET=1
GROUPCFG[biomd]      FSTARGET=1  ADEF=others
GROUPCFG[hone]       FSTARGET=1  ADEF=others
GROUPCFG[ilc]        FSTARGET=1  ADEF=others
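A much-simplified sketch of how the three weights above interact: the account (VO) term is weighted 1000, the unix group term 100 and the user term 10, so deviations from the inter-VO targets dominate the fairshare priority, with the intra-VO role shares acting as a secondary correction. This is an illustrative model only, not Maui's exact priority formula.

FSACCOUNTWEIGHT, FSGROUPWEIGHT, FSUSERWEIGHT = 1000, 100, 10

def fairshare_boost(acct_target, acct_used, group_target, group_used,
                    user_target=0.0, user_used=0.0):
    """Weighted sum of (target - recent usage) deviations, in percentage points."""
    return (FSACCOUNTWEIGHT * (acct_target - acct_used)
            + FSGROUPWEIGHT * (group_target - group_used)
            + FSUSERWEIGHT * (user_target - user_used))

# Example: ATLAS as a whole is under its 89% target, but atlasprd is over its
# 50% intra-VO share. The account term dominates, so ATLAS production jobs are
# still boosted until the VO as a whole reaches its target.
print(fairshare_boost(acct_target=89, acct_used=70,
                      group_target=50, group_used=65))   # 17500.0 (positive)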

Ganga WMS based analysis
- Set point of 100 slots within HammerCloud
- One pool node was down, causing lots of 'no such file' errors
- RFIO IOBUFSIZE was left at the default 128 KB throughout
- High LAN bandwidth observed (the ganglia plots overestimate by a factor of 2)
- Hard to work out what was going on

LAN ganglia plots

Panda based analysis
- Set point was 100 slots; any more than this and rfcp commands stalled
- A recent test (test_482) with a hard cap of 100 slots achieved an 11 Hz event rate (59% efficiency)

Conclusion
- RHUL took a full part in STEP09
- Data transfer into the site was successful
- Monte Carlo production was steady, apart from some temporary problems transferring IC-HEP data back to RAL
- Analysis did not scale well; network performance needs careful examination
- Nevertheless, 500 million events were analysed