LCG Storage Workshop “Service Challenge 2 Review” James Casey, IT-GD, CERN CERN, 5th April 2005.


CERN IT-GD

Overview
- Reminder of targets for the Service Challenge
- Plan and timescales
- CERN Tier-0 configuration
- Tier-1 configuration
- Throughput phase of Service Challenge 2:
  - Transfer software
  - Monitoring
  - Outages and problems encountered
- SC2 single-site tests

Service Challenge Summary
- "Service Challenge 2": throughput test from Tier-0 to Tier-1 sites
- Started 14th March
- Set up infrastructure to 7 sites: NL, IN2P3, FNAL, BNL, FZK, INFN, RAL
- Targets:
  - 100 MB/s to each site
  - 500 MB/s combined to all sites at the same time
  - 500 MB/s to a few sites individually
- Goal: by end of March, sustained 500 MB/s at CERN

SC2 met its throughput targets
- A daily average of >600 MB/s was sustained for 10 days, from midday 23rd March to midday 2nd April
- Not without outages, but the system showed it could recover the rate after them
- Load was reasonably evenly divided over the sites (given the network bandwidth constraints of the Tier-1 sites)

Division of Data between sites

Site     Average throughput (MB/s)   Data moved (TB)
BNL         61                          51
FNAL        61                          51
GridKA
IN2P3       91                          75
INFN        81                          67
RAL         72                          58
SARA       106                          88
TOTAL      600                         500

(The GridKA figures did not survive in the transcript.)
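As a sanity check, the sustained rate and the total volume reported above are consistent with each other. A back-of-the-envelope calculation (decimal units, 1 TB = 1e12 bytes):

```python
# Cross-check the SC2 numbers: ~600 MB/s sustained over the 10-day
# window (midday 23 March to midday 2 April) should move roughly the
# ~500 TB total reported in the table above.
rate_mb_s = 600                    # daily-average throughput, MB/s
seconds = 10 * 24 * 3600           # 10 days
total_tb = rate_mb_s * 1e6 * seconds / 1e12
print(round(total_tb, 1))          # ~518 TB, in line with the ~500 TB moved
```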

CERN Tier-0 Configuration (1/3)
- Service Challenge 1 ran on an experimental hardware setup:
  - 10 HP IA64 nodes within the opencluster
  - Not a standard service: no LEMON monitoring, no operator coverage, no standard configuration
- Plan was to move to a "standard" configuration:
  - IA32 "worker nodes" running CASTOR SRM/gridftp, still serving data from local disks
- Also to move to "production" networking:
  - The opencluster is a test and research facility, again on non-monitored network hardware

CERN Tier-0 Configuration (2/3)
- This didn't all go to plan…
- The first set of 20 IA32 worker nodes wasn't connected to the right network segments
  - They were installed, configured and tested before this was discovered
- Replacement nodes (20 IA32 disk servers) were supplied too late to be installed, configured and tested before SC2
- Had to fall back to the openlab IA64 nodes for SC2
  - These weren't connected to the production network on their inward-facing interface (towards the CERN LAN)
- Added 10 extra IA64 nodes to the system; 16 in total were used for data transfer in SC2

CERN Tier-0 Configuration (3/3) (diagram slide)

Tier-1 Storage Configuration
- A small set of storage configurations:
  - Most sites ran "vanilla" Globus gridftp servers: SARA, INFN, IN2P3, FZK
  - The rest ran dCache: FNAL, BNL, RAL
- Most sites used local or system-attached disk
  - FZK used a SAN via GPFS
  - FNAL used its production CMS dCache, including tape
- Load balancing for most plain-gridftp sites was done at the RADIANT layer
  - INFN deployed an "n-1" DNS alias: the highest-loaded machine was removed from the alias every 10 minutes
  - This alleviated problems seen with pure round-robin on gridftp servers

Tier-1 Network Connectivity
- Sites are in the middle of their upgrade path to LCG networking: 10 Gb will be the norm; now it's more like 1 Gb
- All this is heavily tied to work done in the LCG T0-T1 networking group
- Most sites were able to provide dedicated network links, or at least links where we could get most of the supplied bandwidth
- IN2P3 and BNL were still on shared links with somewhat more congestion
  - These needed to be handled differently: we upped the number of concurrent TCP streams per file transfer

Tier-1 Network Topology (diagram slide)

Transfer Control Software (1/3)
- LCG RADIANT software was used to control transfers
  - Prototype software, interoperable with the gLite FTS
  - Run on a single node at CERN, controlling all Tier-0 to Tier-1 "channels"
  - Transfers done via 3rd-party gridftp
- Plan is to move to gLite FTS for SC3
  - Initial promising results presented at the Lyon SC meeting
  - More details in the LCG Storage Workshop tomorrow
- 'radiant-load-generator' was used to generate transfers
  - Configured per channel, to load-balance where appropriate
  - Specifies the number of concurrent streams for a file transfer; this was normally 1 for a dedicated link
  - Ran from cron to keep the transfer queues full
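The queue-filling behaviour of 'radiant-load-generator' can be sketched as follows. This is an illustrative reconstruction, not the actual RADIANT code; the channel names and target queue depths are hypothetical:

```python
def jobs_to_submit(queued, target_depth):
    """How many new transfer jobs a cron-driven load generator should
    submit per channel to top each queue up to its target depth.
    (Sketch only; RADIANT's real scheduling logic is not shown here.)"""
    return {ch: max(0, target_depth[ch] - queued.get(ch, 0))
            for ch in target_depth}

# Hypothetical snapshot: RAL wants 12 concurrent files and has 9 queued;
# FZK wants 5 and has none queued.
new_jobs = jobs_to_submit({"RAL": 9}, {"RAL": 12, "FZK": 5})
```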

Transfer Control Software (2/3)
- Control of load on a channel via the number of concurrent transfers
- Final configuration:

#Chan  : State  : Last Active       : Bwidth : Files : From    : To
FZK    :Active  :05/03/29 15:23:47  :10240   :5      :cern.ch  :gridka.de
IN2P35 :Active  :05/03/29 15:16:32  :204     :1      :cern.ch  :ccxfer05.in2p3.fr
IN2P33 :Active  :05/03/29 15:20:30  :204     :1      :cern.ch  :ccxfer03.in2p3.fr
INFN   :Active  :05/03/29 15:23:07  :1024    :8      :cern.ch  :cr.cnaf.infn.it
IN2P32 :Active  :05/03/29 15:21:46  :204     :1      :cern.ch  :ccxfer02.in2p3.fr
FNAL   :Inactiv :Unknown            :10240   :0      :cern.ch  :fnal.gov
IN2P3  :Active  :05/03/29 15:23:40  :204     :1      :cern.ch  :ccxfer01.in2p3.fr
IN2P34 :Active  :05/03/29 15:18:00  :204     :1      :cern.ch  :ccxfer04.in2p3.fr
BNL    :Inactiv :Unknown            :622     :24     :cern.ch  :bnl.gov
NL     :Active  :05/03/29 15:22:54  :3072    :10     :cern.ch  :tier1.sara.nl
RAL    :Active  :05/03/29 15:23:09  :2048    :12     :cern.ch  :gridpp.rl.ac.uk

Transfer Control Software (3/3)
- RADIANT-controlled transfers were pure gridftp only
  - The SRM components did not get developed in time
- FNAL and BNL did SRM transfers by pulling from their end
  - FNAL used PhEDEx to pull data
  - BNL used simple driver scripts around the srmCopy command
- BNL also started some RADIANT-controlled gridftp transfers
  - Used the same transfer nodes as the rest of the sites (radiantservice.cern.ch)
- SRM interactions helped debug issues with the dCache SRM and DNS load-balanced SRM/gridftp servers
  - Fixed Java DNS alias problems seen at FNAL
  - Did not work smoothly for most of the challenge with FNAL's mode of operation; the main problem was solved and a fix was in place by the end of the throughput phase
- How to do srmCopy's with 1000s of files is still not well understood
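BNL's driver scripts are not shown in the slides; a minimal sketch of the idea, wrapping an SRM copy client via subprocess, might look like the following. The command name and stream flag are illustrative assumptions, not the real BNL tooling:

```python
import subprocess

def srm_pull(src_surl, dst_surl, streams=1, run=subprocess.call):
    """Pull one file from CERN to the local SRM, third-party style.
    The 'srmcp' command and '-streams_num' flag here are assumptions
    standing in for whatever the actual scripts invoked."""
    cmd = ["srmcp", f"-streams_num={streams}", src_surl, dst_surl]
    return cmd, run(cmd)

# Build (but don't actually execute) the command, using a stubbed runner;
# the SURLs below are made-up placeholders.
cmd, rc = srm_pull("srm://cern.example/source/file",
                   "srm://bnl.example/dest/file",
                   streams=4, run=lambda c: 0)
```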

CERN Monitoring
- MRTG graphs of network usage out of the cluster
- LEMON monitoring of the cluster: CPU usage, disk usage, network usage
- Gridftp logfile monitoring

Tier-1 Monitoring
- Gridftp monitoring was the main monitoring tool used during SC2:
  - Hourly throughput per site
  - Hourly throughput per host
  - Daily throughput per site
  - Daily throughput per host
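Views like "hourly throughput per site" come from aggregating gridftp transfer logs. A small sketch, assuming a simplified `timestamp host bytes` line format (the real gridftp log format is richer than this):

```python
from collections import defaultdict

# Made-up sample in the simplified format assumed above.
SAMPLE_LOG = """\
2005-03-23T12:05:10 tier1.sara.nl 1073741824
2005-03-23T12:40:02 gridka.de 536870912
2005-03-23T13:01:00 tier1.sara.nl 1073741824
"""

def hourly_throughput_mb_s(log_text):
    """Sum bytes per (hour, host) and convert to an average MB/s."""
    totals = defaultdict(int)
    for line in log_text.strip().splitlines():
        ts, host, nbytes = line.split()
        totals[(ts[:13], host)] += int(nbytes)   # key on 'YYYY-MM-DDTHH'
    return {k: v / 1e6 / 3600 for k, v in totals.items()}

rates = hourly_throughput_mb_s(SAMPLE_LOG)
```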

Service Outages (1/2)
- A progress page was kept in the SC wiki, recording:
  - All tunings made to the system
  - All outages noted during the running
  - Any actions needed to clean up and recover the service
- No real 24x7 service in place
  - Manual monitoring of the monitoring webpages
  - Best-effort restart of the service
  - Also at Tier-1 sites: problems were communicated to the service challenge teams, but this was not 24x7 coverage

Service Outages (2/2)
- Capacity in the cluster meant we could recover from one site not being active
  - Other sites would up their load a bit automatically, as gridftp stream rates increased
- The only thing that killed transfers was CERN outages
- We did not do any scheduled outages during the SC
  - No procedures for starting a scheduled outage
  - If we had done one to move to managed network infrastructure, it would have removed some of the unscheduled ones

Outage Breakdown - CERN
- MyProxy instability
  - Locks up under high load; understood by the MyProxy developers
  - Can be handled by a watchdog job
  - A particular problem on restart after a network outage
- CERN LAN network outages
  - Two longish outages (~12 hours each)
  - An issue with being on non-managed network hardware
- Database quota limit hit
  - The tablespace was being monitored, but not the quota; quota monitoring added
- Database load problems
  - Caused intermittent drops in throughput, due to new jobs not being scheduled
  - In-memory job queues in the transfer agents meant these were a big problem
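The watchdog job mentioned for MyProxy can be sketched generically: probe the service, and restart it when the probe fails. The command names below are placeholders, not real MyProxy tooling:

```python
import subprocess

def watchdog_pass(check_cmd, restart_cmd, run=subprocess.call):
    """One watchdog iteration (the real job would run from cron):
    if the health check exits non-zero, restart the service."""
    if run(check_cmd) != 0:        # non-zero exit = service unhealthy
        run(restart_cmd)
        return "restarted"
    return "ok"

# Simulate a locked-up daemon with a stubbed runner: the (hypothetical)
# check command fails, so the restart command gets invoked.
result = watchdog_pass(["check-myproxy"], ["restart-myproxy"],
                       run=lambda cmd: 1 if cmd == ["check-myproxy"] else 0)
```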

DNS load-balancing of gridftpd
- An issue raised at the Lyon SC meeting
- Round-robin "load-balanced" gridftp servers are not really load-balanced: there is no awareness of the state or behaviour of the other nodes
  - Can end up with all transfers going to one host while all the others sit idle
- CNAF put "n-1" load-balancing in place
  - The heaviest-loaded node is taken out of the alias every 10 minutes
  - This smoothed out the load
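The "n-1" scheme amounts to publishing every server except the busiest one in the DNS alias. A minimal sketch of the member selection (host names and the load metric are hypothetical; the real CNAF setup also has to push the result into DNS):

```python
def n_minus_one_members(load_by_host):
    """Hosts to publish in the gridftp DNS alias: all but the most
    heavily loaded one, recomputed every 10 minutes."""
    busiest = max(load_by_host, key=load_by_host.get)
    return sorted(h for h in load_by_host if h != busiest)

# Three hypothetical gridftp servers; gftp01 is busiest, so it is
# dropped from the alias until the next refresh.
alias = n_minus_one_members({"gftp01": 0.9, "gftp02": 0.2, "gftp03": 0.5})
```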

Individual site tests
- Being scheduled right now: sites can pick days in the next two weeks when they have the capacity
- Targets: 500 MB/s to disk, 60 MB/s to tape
- FNAL is running 500 MB/s disk tests right now

Summary
- SC2 met its throughput goals: an improvement on SC1
- We still don't have something we can call a service
  - But monitoring is better: we see outages when they happen, and we understand why they happen
  - A first step towards operations guides
- Some advances in infrastructure and software will happen before SC3:
  - gLite transfer software
  - SRM service more widely deployed
- We have to understand how to incorporate these elements