LT2 London Tier 2 Status. Olivier van der Aa, for the LT2 team: M. Aggarwal, D. Colling, A. Fage, S. George, K. Georgiou, W. Hay, P. Kyberd, A. Martin, G. Mazza, D. McBride, H. Nebrinsky, D. Rand, G. Rybkine, G. Sciacca, K. Septhon, B. Waugh

Outline
– LT2 usage
– LT2 site updates
– LT2 SC4 activity
– Conclusion

Number of running jobs, January and February [plots]

Number of running jobs, March and April [plots]

Number of running jobs, May [plot]. The increase in infrastructure usage by LHCb over the last month has stressed the system and caused very slow MDS responses.

Usage and efficiency per VO [ , ]
– Wall-time consumption: ATLAS, LHCb, BIOMED and CMS are the top consumers.
– Efficiency: the fraction of the total time that results in a successful job state (a sketch of this calculation follows below). In order of efficiency: BIOMED, ATLAS, LHCb, CMS.
– The efficiency pattern is not yet understood: why is BIOMED more efficient, i.e. why does it cause fewer middleware failures? [plots per VO: BIOMED, ATLAS, LHCb, CMS]
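The efficiency figure used here can be reproduced from accounting records. Below is a minimal sketch assuming a flat list of job records; the field names and the "Done (Success)" status string are illustrative assumptions, not the LT2 accounting schema.

```python
# Minimal sketch: per-VO efficiency as the fraction of consumed wall time
# that ended in a successful job state. Field names are illustrative.
from collections import defaultdict

def efficiency_per_vo(jobs):
    """jobs: iterable of dicts with 'vo', 'walltime' (seconds) and 'status'."""
    used = defaultdict(float)   # total wall time per VO
    good = defaultdict(float)   # wall time of successfully finished jobs
    for job in jobs:
        used[job["vo"]] += job["walltime"]
        if job["status"] == "Done (Success)":
            good[job["vo"]] += job["walltime"]
    return {vo: good[vo] / used[vo] for vo in used if used[vo] > 0}

# Example with made-up numbers:
sample = [
    {"vo": "biomed", "walltime": 3600, "status": "Done (Success)"},
    {"vo": "atlas",  "walltime": 7200, "status": "Done (Success)"},
    {"vo": "atlas",  "walltime": 1800, "status": "Aborted"},
]
print(efficiency_per_vo(sample))   # {'biomed': 1.0, 'atlas': 0.8}
```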

Usage and efficiency per CE [ , ]
– Wall time: QMUL provides 55% of the total wall time. It is delicate to meet the LT2 Service Level Agreement of 95% availability with only 1 FTE at QMUL.
– Efficiency: UCL-CENTRAL has the highest job success rate, which can be explained by the fact that it mainly attracts BIOMED jobs. [plots per CE: Brunel, QMUL, RHUL, IC-HEP, IC-LeSC, UCL-HEP, UCL-CENTRAL]

CE / VO view. In London we support 18 VOs (sixt has not been used). The plots show the relative VO usage for each CE; the size of each box is proportional to the total wall clock time. [plots per CE: Brunel, QMUL, RHUL, IC-HEP, IC-LeSC, UCL-HEP, UCL-CENTRAL]

GridLoad, a tool to monitor the sites:
– Updates every 5 minutes; uses the RTM data and stores it in RRD files (sketched below).
– Shows the number of jobs in any state: a VO view (stacks the jobs by VO) and a CE view (stacks the jobs by CE).
– Still a prototype. Planned: views by GOC and ROC, error checking, usage (running CPUs / total CPUs), improved look and feel. Could interface with Nagios to raise alarms (e.g. on a high abort rate).
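As an illustration of the RRD-based storage such a tool relies on, here is a minimal sketch of creating and updating a 5-minute-step RRD via the rrdtool command line. The file name, data-source names and hard-coded counts are assumptions for illustration; GridLoad's actual implementation is not shown in the slides.

```python
# Minimal sketch: keep a 5-minute time series of running/aborted job counts
# in an RRD file via the rrdtool command line, as a GridLoad-like tool might.
import os
import subprocess

RRD = "lt2_jobs.rrd"   # hypothetical file name

def create_rrd():
    # One sample every 300 s, two gauges, ~2 days of 5-minute averages.
    subprocess.run([
        "rrdtool", "create", RRD, "--step", "300",
        "DS:running:GAUGE:600:0:U",
        "DS:aborted:GAUGE:600:0:U",
        "RRA:AVERAGE:0.5:1:576",
    ], check=True)

def update_rrd(running, aborted):
    # 'N' means "now"; values follow the DS order defined at creation time.
    subprocess.run(["rrdtool", "update", RRD, f"N:{running}:{aborted}"],
                   check=True)

if __name__ == "__main__":
    if not os.path.exists(RRD):
        create_rrd()
    update_rrd(running=120, aborted=3)   # counts would come from the RTM data
```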

GridLoad (cont.). The GridLoad plots can be useful to spot problems. Example: a high abort rate was observed at one site for LHCb jobs. This helped us be proactive for the VO: we could spot that there was a problem before receiving a ticket. [plots: number of aborted jobs, number of running jobs]
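A Nagios-style alarm on a high abort rate, as suggested on the previous slide, could be as simple as the threshold check sketched here; the 25% threshold and the example counts are purely illustrative.

```python
# Minimal sketch: flag a site/VO when aborted jobs are a large fraction of
# all jobs seen in the last monitoring interval. Numbers are illustrative.
def abort_rate_alarm(aborted, running, threshold=0.25):
    total = aborted + running
    if total == 0:
        return False
    return aborted / total > threshold

# Example: 40 aborted vs 60 running LHCb jobs at one CE -> raise an alarm.
print(abort_rate_alarm(aborted=40, running=60))   # True
```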

LT2 usage: conclusions. We now have an additional tool to monitor LT2 CPU activity in real time. The overall usage is increasing. We need to understand the efficiency patterns and what causes the differences between the VOs. We need similar real-time monitoring tools for the storage. [plot: January to May]

Outline
– LT2 usage
– LT2 site updates
– LT2 SC4 activity
– Conclusion

Brunel site update
– New cluster provided by Streamline Computing: Supermicro dual-processor, dual-core AMD Opteron nodes (40 nodes x 1.8 GHz, 4 GB memory, 80 GB disk each); head node 2 GHz, 8 GB memory, 320 GB disk; 164 cores in total. It is in the process of being configured.
– Gb connection? 1 Gb WAN at Brunel in 65 days from now. They are currently buying the appropriate switches and related hardware. There will be a throttling router that limits the LCG traffic when university demand is high; when university demand is low, LCG gets a higher allocation. The Brunel site is expected to have a ten-times-faster connection (200 Mb) by September.
– SRM: best rate was 59 Mb/s. Will remove any NFS-mounted filesystems. No real showstopper there.

IC sites update. HEP:
– Old IBM cluster (60 CPUs) running smoothly, almost full of jobs for the last two months.
– Will build a new cluster with off-the-shelf boxes: 40 dual-core AMD, 40 TB of disk (non-RAID). Will use SGE as the job manager.

IC sites update (cont.). Investigated FTS performance issues with dCache transfers: FTS using urlcopy causes high iowait. FTS/urlcopy reached 130 Mb/s, FTS/srmcp 179 Mb/s. [plots: iowait and blocked processes versus time]
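For context, the iowait observation can be reproduced on a Linux pool node by sampling /proc/stat while a transfer is running. This is a generic sketch of that measurement, not the tooling actually used at IC.

```python
# Minimal sketch: sample system-wide iowait from /proc/stat over an interval,
# roughly how "high iowait with urlcopy" could be observed on a pool node.
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()   # "cpu user nice system idle iowait ..."
    return [int(x) for x in fields[1:]]

def iowait_fraction(interval=5):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    deltas = [a - b for a, b in zip(after, before)]
    return deltas[4] / sum(deltas)      # index 4 is the iowait counter

print(f"iowait over the last 5 s: {iowait_fraction():.1%}")
```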

IC sites update (cont.). LESC:
– 33% share of the Opteron cluster.
– Running RHEL3, 64-bit.
– SGE job manager.
– DPM storage with a small disk partition.
– Currently porting DPM to Solaris to avoid NFS-mounting the filesystems used for the SRM. See the work in progress at
– Difficulties: improving usage; several VOs are not comfortable with the 64-bit architecture even though 32-bit libraries are there.
ICT:
– Deploying a new 200-Xeon cluster running PBS for college use.
– Will have a 30% share of that cluster for LCG.
– 30 TB of RAID storage that will be shared.
– Difficulties: they want to use GT

QMUL site update. Lots of activity with the commissioning of their new cluster, provided by Viglen:
– 280 dual-core Opterons (model 270, 2 GHz).
– All nodes have 2x250 GB disks, i.e. 140 TB in total! Which filesystem to use in that environment? Will consider Lustre.
– All nodes are connected at 1 Gb, with 10 Gb inter-switch links.
– Now online with ~1600 job slots.
– Problems: site stability under high job load (the NFS-mounted software area is not coping); RAID boxes giving hardware errors, seemingly due to loose SATA connectors (the disks tested OK with SMART; the cause is not yet clear); reliability of DPM on poolfs.

UCL sites update. CCC:
– Have successfully moved to the SGE job manager to serve 364 slots (91 dual-CPU nodes with hyper-threading).
– Improved their SRM performance by using a direct Fibre Channel link from the head node to the RAID array; write bandwidth moved from 90 Mb/s to 238 Mb/s.
– Will have 40 additional nodes (160 slots) soon.
– Moving their cluster from one building to another will start on July 3 and take one week.
HEP:
– New Gb switches have been bought; they still need to be cabled to the head node.
– Will have one or two boxes with mirrored 120 GB disks and a DPM pool installed on them to support the non-ATLAS VOs.
– ATLAS will still use the NFS-mounted storage.
– Problem: performance of the ATLAS storage.

RHUL site update. Cluster running smoothly:
– 142 job slots, almost full for two months, with all VOs targeting the site.
– No more NFS-mounted disks with write access from DPM.
– Broad VO usage.
Update on the 1 Gb connection:
– The purchase order was signed yesterday.
– Discussions are now starting as to when it will be installed.
Problems:
– Need to be able to drain a pool in order to remove the read-only NFS-mounted filesystem.

Transfer throughput status (rates in Mb/s):

Site          Inbound   Outbound   Update
Brunel        57        59         Gb connection signed (200 Mb by September)
IC-HEP        80        190        FTS performance problem not yet understood
IC-LeSC       156       95         DPM being built for Solaris
QMUL          –         –          poolfs needs to be recompiled with the round-robin feature
RHUL          59        58         Gb connection signed
UCL-HEP       71        63         Gb switches there
UCL-CENTRAL   90        309        Moved to a direct Fibre Channel connection; rate is now 238 Mb/s

Outline
– LT2 usage
– LT2 site updates
– LT2 SC4 activity
– Conclusion

SC4 activity. CMS: the target is CSA06 (Computing, Software and Analysis challenge).
– CSA06 objective: a 50-million-event exercise to test the workflow and dataflow associated with the data handling and data access model of CMS. It will test the new CMS reconstruction framework for large production, needs 20 MB/s of bandwidth to the T2 storage (see the arithmetic below), and will start on 15 September. More information can be found at:
– IC-HEP and IC-LESC are preparing for CSA06. The strategy is to help other sites once IC is OK.
– A new PhEDEx installation that uses FTS is in place; the FTS performance issues need to be solved. A ProdAgent configuration has been prepared for IC-LESC and IC-HEP.
– Brunel is involved in PhEDEx.
ATLAS: no commitment yet.
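As a rough sanity check of the 20 MB/s requirement (my own arithmetic, not from the slides): sustained over a day it corresponds to roughly 1.7 TB, and in network terms to about 160 Mb/s, i.e. most of a 200 Mb link.

```python
# Rough arithmetic for the CSA06 requirement of 20 MB/s into T2 storage.
rate_MB_s = 20
print(f"{rate_MB_s * 8} Mb/s on the wire")            # 160 Mb/s
print(f"{rate_MB_s * 86400 / 1e6:.2f} TB per day")    # ~1.73 TB/day
```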

Conclusions. Real-time monitoring of LT2 job states is in place:
– The usage is increasing.
Site evolution:
– SGE deployed at UCL-CENTRAL.
– QMUL more than doubled its number of job slots.
– Brunel: Gb connection on the right track; commissioning a new cluster (160 cores).
– IC: spotted FTS performance issues; porting of DPM to Solaris ongoing; will commission a new cluster at HEP.
– RHUL: very stable site; Gb connection signed.
General storage evolution: in the process of removing NFS mounts.
SC4: involvement in the CMS SC4 activity is ongoing. We need a volunteer for the ATLAS SC4.

Thanks to all of the LT2 team: M. Aggarwal, D. Colling, A. Fage, S. George, K. Georgiou, M. Green, W. Hay, P. Hobson, P. Kyberd, A. Martin, G. Mazza, D. McBride, H. Nebrinsky, D. Rand, G. Rybkine, G. Sciacca, K. Septhon, B. Waugh.