Service Challenge Meeting “Service Challenge 2 update” James Casey, IT-GD, CERN IN2P3, Lyon, 15 March 2005.

Overview
- Reminder of targets for the Service Challenge
- Plan and timescales
- What is in place at CERN Tier-0
- What has been tested at Tier-1 sites
- What is still to be tested at Tier-1 sites
- Issues outstanding from the CERN meeting
- Communication
- Monitoring
- Problems seen so far

Service Challenge Summary
- "Service Challenge 2" started 14th March (yesterday)
- Set up infrastructure to 6 sites: NL, IN2P3, FNAL, FZK, INFN, RAL
  - 100 MB/s to each site
  - 500 MB/s combined to all sites at the same time
  - 500 MB/s to a few sites individually
- Use RADIANT transfer software
  - Still with dummy data, 1 GB file size
- Goal: by end of March, sustained 500 MB/s at CERN

Plans and timescale - proposals
- Monday 14th - Friday 18th: test sites individually
  - 6 sites, so started early with NL!
  - Monday: IN2P3; Tuesday: FNAL; Wednesday: FZK; Thursday: INFN; Friday: RAL
- Monday 21st - Friday 1st April: test all sites together
- Monday 4th - Friday 8th April: 500 MB/s single-site tests
- Monday 11th - ...: tape tests?
  - Overlaps with Easter and the Storage Management workshop at CERN

What is in place - CERN (1/2)
- Openlab cluster extended to 20 machines: oplapro[55-64,71-80].cern.ch
  - All IA64 with local fast disk
  - Allows for concurrent testing:
    - a 4-node setup (oplapro55-59) for Tier-1 sites
    - 2 x 2 nodes for gLite SC code (oplapro60-63)
    - radiantservice.cern.ch is now 9 nodes (oplapro71-79)
    - Radiant control node (oplapro80)
- Radiant software improvements
  - Updated schema and state changes to match the gLite software; will make future changes easier
  - Cleaned up some of the logic
  - Channel management improved, e.g. the number of concurrent transfers can be changed dynamically (see the sketch after this slide)
  - Better logging
  - Load generator improved to be more pluggable by site admins
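The RADIANT internals are not shown in these slides; the following is only a minimal sketch of the dynamic channel-concurrency idea, assuming a per-site channel object whose limit an admin can change while transfers are running. The class and method names (TransferChannel, set_concurrency) are hypothetical, not the real RADIANT API.

```python
# Minimal sketch of per-channel concurrency control, in the spirit of the
# "change number of concurrent transfers dynamically" improvement.
# All names here are hypothetical, not the real RADIANT code.
import threading

class TransferChannel:
    def __init__(self, site, max_concurrent):
        self.site = site
        self._limit = max_concurrent
        self._active = 0
        self._lock = threading.Lock()

    def try_start_transfer(self):
        """Return True if a new transfer may start on this channel."""
        with self._lock:
            if self._active < self._limit:
                self._active += 1
                return True
            return False

    def finish_transfer(self):
        with self._lock:
            self._active = max(0, self._active - 1)

    def set_concurrency(self, new_limit):
        """Admin operation: change the limit without stopping running transfers."""
        with self._lock:
            self._limit = new_limit

# Example: lower a CERN->RAL channel from 10 to 4 concurrent transfers.
ral = TransferChannel("RAL", max_concurrent=10)
ral.set_concurrency(4)
```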

What is in place - CERN (2/2)
- New IA32 cluster for SC2
  - Managed as a production service by the Castor Operations Team
  - Alarms and monitoring in place
  - 20 nodes (lxfs[51-70]), load-balanced gridftp/SRM
  - Uses local disks for storage
  - castorgridsc.cern.ch
  - Still waiting for the external 10 Gb connection; expected this week
- 4 new transfer control nodes
  - 1 for SC2 production
  - 1 for SC2 spare
  - 1 for gLite production
  - 1 for testing (iperf, etc.)

SC CERN Configuration

What is outstanding - CERN
- External network connectivity for the new SC2 cluster
  - We will use openlab machines until this is ready
- Radiant software
  - Currently only pure gridftp has been tested
  - SRM code was written a long time ago, but not tested against a remote SRM; RAL will use this code
  - More agents needed to set 'Canceling' jobs to 'Cancelled', and to debug error messages of jobs in 'Waiting' and send them back to the queue when ready
  - Some more CLI tools for admins (a possible query is sketched after this slide):
    - Summary of what transfers are in progress
    - Last hour
    - Failure/success rate per site
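The slides only name these admin reports; a minimal sketch of the "failure/success rate per site over the last hour" query is given below, assuming a hypothetical transfers table with columns site, state and finish_time. None of the table or column names come from the real RADIANT schema.

```python
# Hypothetical summary query for the admin CLI described above.
# The table and column names (transfers, site, state, finish_time) are
# assumptions for illustration, not the real RADIANT schema.
import sqlite3

def site_summary_last_hour(db_path="radiant.db"):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT site,
               SUM(CASE WHEN state = 'Done'   THEN 1 ELSE 0 END) AS ok,
               SUM(CASE WHEN state = 'Failed' THEN 1 ELSE 0 END) AS failed
        FROM transfers
        WHERE finish_time >= datetime('now', '-1 hour')
        GROUP BY site
        """
    ).fetchall()
    for site, ok, failed in rows:
        total = ok + failed
        rate = 100.0 * ok / total if total else 0.0
        print(f"{site:8s}  ok={ok:4d}  failed={failed:4d}  success={rate:5.1f}%")

if __name__ == "__main__":
    site_summary_last_hour()
```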

What is tested - SARA/NL
- Nodes in place for a few weeks
- Testing with iperf and globus-url-copy done (example invocations are sketched after this slide)
  - Up to 80 MB/s peak single-stream disk to disk
  - 40 MB/s sustained
- Network currently over 3 x 1 Gb etherchannel
  - 150 MB/s sustained disk to disk
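For context, these are the kinds of test invocations the two tools named above are typically run with: iperf for memory-to-memory network tests and globus-url-copy for gridftp disk-to-disk tests. The destination host name, stream counts and buffer sizes below are illustrative assumptions, not the values the sites actually used.

```python
# Illustrative wrappers around the two tools named above.
# Host names, stream counts and buffer sizes are assumptions for the example.
import subprocess

def iperf_memory_test(server="tier1-gridftp.example.org", streams=4, seconds=60):
    # iperf client: -c server, -P parallel streams, -t duration in seconds
    subprocess.run(["iperf", "-c", server, "-P", str(streams), "-t", str(seconds)],
                   check=True)

def gridftp_disk_test(src_url, dst_url, streams=4, tcp_buffer=2 * 1024 * 1024):
    # globus-url-copy: -p parallel streams, -tcp-bs TCP buffer size in bytes
    subprocess.run(["globus-url-copy", "-p", str(streams), "-tcp-bs", str(tcp_buffer),
                    src_url, dst_url], check=True)

if __name__ == "__main__":
    iperf_memory_test()
    gridftp_disk_test("gsiftp://oplapro71.cern.ch/data/1GB-dummy-file",
                      "gsiftp://tier1-gridftp.example.org/data/1GB-dummy-file")
```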

What is tested - IN2P3
- 2 nodes that were tested in January
- iperf memory-to-memory tests
  - ~300 Mb/s to a single host, single and multiple streams
  - Can get 600 Mb/s to both hosts combined
- Better gridftp single-stream performance now seen
  - 60 MB/s single-stream disk-to-disk peak
  - 35-40 MB/s sustained
- Problem with one RAID controller stopped full tests to RAID disk
  - Using local disk for now on ccxfer04.in2p3.fr
  - Fixed?

What is tested - FNAL
- SRM to SRM: from the CASTOR SRM to the dCache SRM
- Also tested gridftp to SRM
- Data pulled from Fermi
- Problems found and resolved in DNS handling
  - Java DNS name caching interacted poorly with the round-robin DNS load-balancing (fixed quickly by Timur)
  - Showed up with srmcp: all 40 transfers went to one host! (illustrated after this slide)
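To illustrate the failure mode only (the actual fix was on the Java DNS-caching side at FNAL, which is not reproduced here): if a client resolves a round-robin alias once and reuses the cached address, every transfer lands on the same pool node. The alias name below is taken from these slides purely as an example; it is not necessarily the alias involved in the FNAL tests.

```python
# Sketch of why caching one DNS lookup defeats a round-robin alias.
# The alias name is used only as an example; everything else is illustrative.
import socket, random

ALIAS = "castorgridsc.cern.ch"

def pick_host_fresh(alias):
    """Resolve on every call: round-robin DNS can spread transfers over all pool nodes."""
    addrs = {info[4][0] for info in socket.getaddrinfo(alias, 2811, proto=socket.IPPROTO_TCP)}
    return random.choice(sorted(addrs))

# The problematic pattern: resolve once and cache forever (what an aggressive
# JVM DNS cache effectively does) -- all 40 transfers then hit the same host.
cached_addr = pick_host_fresh(ALIAS)
pinned = [cached_addr for _ in range(40)]          # every transfer -> one host

# The intended pattern: resolve per transfer, letting DNS rotate the answers.
balanced = [pick_host_fresh(ALIAS) for _ in range(40)]
```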

What is tested - FNAL / BNL
- FNAL
  - Transfers to tape have been tested: CERN SRM to FNAL tape at 40-50 MB/s via the FNAL SRM
  - Transfers will be controlled by PhEDEx; will be put in place early next week
  - Will send data to 5 USCMS Tier-2 sites: Florida, Caltech, San Diego, Wisconsin and Purdue
- BNL
  - Tried SRM to SRM transfers
  - Had problems with the SURL format causing failures
  - Dual-attached node at BNL caused debugging difficulties (routing asymmetries)

What is outstanding - Tier-1 sites
- RAL
  - dCache in place; good performance seen (500 MB/s?)
  - Waiting on final provisioning of the UKLIGHT network line
  - As of yesterday, NetherLight had provisioned the fibre to CERN
- INFN
  - Cluster now in place (8 nodes)
  - Network tests done; can fill the 1 Gb pipe
- FZK
  - Cluster in place; will use gridftp to start
  - Network tests with GEANT done
  - Still some problems filling the whole provisioned 10 Gb pipe; will test after SC2

Communication
- Using the service-challenge-tech list to distribute information
- Each site has now specified a mail alias for its local service challenge team
  - Still waiting on one from FZK
- Service Challenge meetings: what schedule? Weekly (daily?) with all sites? What day?
- Still work to be done on documentation

Monitoring
- Ron Trompert wrote a gridftp log analysis tool for SARA
- Extended it to analyze the CERN log files for transfers to all sites (a simplified parsing sketch follows)
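Ron Trompert's actual tool is not reproduced in the slides; the following is only a minimal sketch of the same idea. It assumes the gridftp transfer log uses key=value records carrying DATE, START, NBYTES and DEST fields with YYYYMMDDhhmmss timestamps; those field names and the timestamp format are assumptions about the log format, not taken from these slides, and would need adjusting to the real log.

```python
# Simplified sketch of a gridftp transfer-log analyzer.
# Assumes key=value records with DATE, START, NBYTES and DEST fields and
# YYYYMMDDhhmmss.ffffff timestamps; adjust to the actual log format in use.
import re
from collections import defaultdict
from datetime import datetime

RECORD = re.compile(r"(\w+(?:\.\w+)?)=(\[[^\]]*\]|\S+)")

def parse_ts(value):
    return datetime.strptime(value.split(".")[0], "%Y%m%d%H%M%S")

def throughput_per_dest(logfile):
    """Print MB transferred and average MB/s per destination host."""
    totals = defaultdict(lambda: [0.0, 0.0])          # dest -> [bytes, seconds]
    with open(logfile) as fh:
        for line in fh:
            fields = dict(RECORD.findall(line))
            if not {"NBYTES", "DEST", "DATE", "START"} <= fields.keys():
                continue
            duration = (parse_ts(fields["DATE"]) - parse_ts(fields["START"])).total_seconds()
            dest = fields["DEST"].strip("[]")
            totals[dest][0] += float(fields["NBYTES"])
            totals[dest][1] += max(duration, 1.0)
    for dest, (nbytes, secs) in sorted(totals.items()):
        print(f"{dest:20s} {nbytes/1e6:10.1f} MB  avg {nbytes/1e6/secs:6.1f} MB/s")

if __name__ == "__main__":
    throughput_per_dest("gridftp-transfer.log")
```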

Monitoring
- For SC1, network statistics were gathered from routers at CERN
  - Provided by Danny Davids from the CERN IT-CS group
- For SC2, extended to other sites
  - Uses SNMP from the router; can also be from a switch (any virtual interface)
  - Gathered by hosts at CERN
  - Bytes in/out and packets in/out gathered, stored in a central database (a polling sketch follows)
- Visualization tool outstanding
  - Have identified effort from Indian LCG collaborators; will be coming to CERN this month
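The slides describe the poller only at this level; below is a minimal sketch of the idea, assuming the standard IF-MIB interface counters and the Net-SNMP snmpget command-line tool. The router host name, SNMP community, ifIndex, database layout and 5-minute interval are placeholders, not details taken from the SC2 setup.

```python
# Sketch of an SC2-style SNMP poller: read interface byte/packet counters
# from a router or switch interface and store them in a central database.
# Host name, community string, ifIndex and schema are placeholder assumptions.
import sqlite3, subprocess, time

COUNTERS = {
    "bytes_in":  "IF-MIB::ifHCInOctets",
    "bytes_out": "IF-MIB::ifHCOutOctets",
    "pkts_in":   "IF-MIB::ifHCInUcastPkts",
    "pkts_out":  "IF-MIB::ifHCOutUcastPkts",
}

def snmp_read(host, community, if_index, oid):
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Ovq", host, f"{oid}.{if_index}"],
        capture_output=True, text=True, check=True)
    return int(out.stdout.split()[-1])

def poll_once(db, host="router.example.org", community="public", if_index=1):
    row = {name: snmp_read(host, community, if_index, oid) for name, oid in COUNTERS.items()}
    db.execute("INSERT INTO net_stats VALUES (?,?,?,?,?,?)",
               (time.time(), host, row["bytes_in"], row["bytes_out"],
                row["pkts_in"], row["pkts_out"]))
    db.commit()

if __name__ == "__main__":
    db = sqlite3.connect("netstats.db")
    db.execute("CREATE TABLE IF NOT EXISTS net_stats "
               "(ts REAL, host TEXT, bytes_in INT, bytes_out INT, pkts_in INT, pkts_out INT)")
    while True:
        poll_once(db)
        time.sleep(300)   # 5-minute polling interval (an assumption)
```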

Problems seen so far
- Long timeouts/transfers (~2500 secs) in gridftp transfers
  - Intermittent failure (105 times to FNAL in total)
  - Seen by FNAL and NL
  - Now have some log messages from FNAL to aid the debugging
- Quite sensitive to changes in conditions
  - Extra traffic on disk servers seems to alter performance a lot
  - Changing the number of concurrent transfers does too
- Performance is not constant with the Radiant control software
  - Some parts need to be understood better (next slide)

Radiant load-balancing
- Some performance loss due to the way Radiant schedules transfers
  - When one transfer finishes, the next scheduled transfer is not to that host
  - Gives asymmetric loading of destination hosts
- A property of "load-balanced" gridftp servers
  - Not really load-balanced with round-robin: no awareness of the behaviour of the other nodes
  - Can end up with all transfers going to one host, with all the others idle
  - One host ends up heavily loaded and very slow; loss of all throughput

Solutions
- The longer-term solutions are:
  - Move to SRM, which does the load-balancing (and with srm-cp this might go away)
  - Move to gridftp 3.95, which could be adapted to have a load-balanced front-end
  - Modify Radiant (or the gLite software) to do better scheduling; network- and destination-SURL-aware?
  - Put in a load-balanced alias for the destination cluster which does N-of-M load-balancing (i.e. M hosts, where the N most lightly loaded are in the alias); used currently for castorgrid, and will be used for castorgridsc (sketched after this slide)
- The move to SRM for all sites is coming, so only an appropriate amount of effort should go into this: how much is appropriate?
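The N-of-M alias lives in the site's DNS and load-monitoring infrastructure; below is only a minimal sketch of the selection step, assuming a per-host load metric is available. The candidate host names reuse the lxfs[51-70] range from the earlier slide, but the load source and the 5-of-20 choice are illustrative assumptions.

```python
# Sketch of N-of-M selection: publish, behind the load-balanced alias, only the
# N most lightly loaded of the M candidate gridftp hosts.  The load source is a
# placeholder; the real castorgrid setup feeds a DNS update mechanism instead
# of printing the result.
import random

CANDIDATES = [f"lxfs{n}.cern.ch" for n in range(51, 71)]   # the M candidate hosts

def get_load(host):
    # Placeholder load metric: a real deployment would query the monitoring
    # system for the node's current load or number of active transfers.
    return random.uniform(0.0, 10.0)

def hosts_for_alias(candidates, n):
    """Return the N most lightly loaded hosts to publish behind the alias."""
    loads = {h: get_load(h) for h in candidates}
    return sorted(loads, key=loads.get)[:n]

if __name__ == "__main__":
    # e.g. refresh the alias with the 5 least-loaded of the 20 nodes.
    print(hosts_for_alias(CANDIDATES, 5))
```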

Summary
- Good progress since the last meeting
  - Especially the gridftp monitoring tool from Ron
- Sites are all nearly ready
- Still a lot of work to be done
  - But it now seems to be debugging the underlying software components rather than configuration, network and hardware issues