Service Challenge Meeting: “Review of Service Challenge 1”
James Casey, IT-GD, CERN
RAL, 26 January 2005

Overview
- Reminder of targets for the Service Challenge
- What we did
- What can we learn for SC2?

Milestone I & II Proposal (from the NIKHEF/SARA Service Challenge Meeting)
Dec 04 - Service Challenge I complete
- Mass store (disk) to mass store (disk)
- 3 Tier-1s (Lyon, Amsterdam, Chicago)
- 500 MB/sec (individually and in aggregate)
- 2 weeks sustained
- Software: GridFTP plus some scripts
Mar 05 - Service Challenge II complete
- Software: reliable file transfer service
- Mass store (disk) to mass store (disk), 5 Tier-1s (also Karlsruhe, RAL, ...)
- 500 MB/sec T0-T1, but also between Tier-1s
- 1 month sustained
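To put the milestone in perspective, a back-of-the-envelope estimate (my arithmetic, not from the slides) of the data volume implied by 500 MB/sec sustained for 2 weeks:

```python
# Rough data-volume estimate for the SC1 milestone (illustrative, not from the slides).
rate_mb_s = 500                        # target rate in MB/s
seconds = 14 * 24 * 3600               # 2 weeks sustained
total_tb = rate_mb_s * seconds / 1e6   # MB -> TB (decimal units)
print(f"{total_tb:.0f} TB over two weeks")  # ~605 TB
```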

Service Challenge Schedule (from the FZK Service Challenge Meeting, Dec 04)
Dec 04 - SARA/NIKHEF challenge
- Still some problems to work out with bandwidth to the teras system at SARA
Fermilab
- Over the CERN shutdown: best-effort support
- Can try again in January in the originally provisioned slot
Jan 05 - FZK Karlsruhe

SARA – Dec 04
- Used a SARA-specific solution: GridFTP running on 32 nodes of the SGI supercomputer (teras)
- 3 x 1 Gb network links direct to teras.sara.nl
- 3 GridFTP servers, one for each link
- Load balancing was done from the CERN side: 3 oplapro machines transmitted down each 1 Gb link
- Used the radiant-load-generator script to generate data transfers (an illustrative sketch follows this slide)
- Much effort was put in by SARA personnel (~1-2 FTEs) before and during the challenge period
- Tests ran from 6-20 December; much time was spent debugging components
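The slides do not show how the radiant-load-generator drove the transfers, so the following is only a rough sketch of the load-balancing idea described above: spread single-stream globus-url-copy transfers round-robin over one GridFTP endpoint per 1 Gb link. Host names and file paths are placeholders, not the real SC1 configuration.

```python
import itertools
import subprocess

# Hypothetical per-link GridFTP endpoints at SARA (one server per 1 Gb link);
# the real SC1 host names and paths were different.
SERVERS = ["gridftp1.sara.example", "gridftp2.sara.example", "gridftp3.sara.example"]
FILES = [f"/data/sc1/file{i:04d}" for i in range(12)]  # 12 files at a time, as in SC1

def transfer(src_path, server):
    """Run one single-stream globus-url-copy transfer; return True on success."""
    cmd = [
        "globus-url-copy", "-vb",          # -vb prints transfer performance
        f"file://{src_path}",              # source on the local transfer node
        f"gsiftp://{server}{src_path}",    # destination GridFTP server
    ]
    return subprocess.run(cmd).returncode == 0

# Round-robin the file list over the three endpoints so each 1 Gb link gets an equal
# share (run sequentially here for brevity; SC1 ran 12 transfers concurrently).
for path, server in zip(FILES, itertools.cycle(SERVERS)):
    ok = transfer(path, server)
    print(f"{path} -> {server}: {'OK' if ok else 'FAILED'}")
```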

Problems seen during SC1
Network instability
- Router electrical problem at CERN
- Interruptions due to network upgrades on the CERN test LAN
Hardware instability
- Crashes seen on the teras 32-node partition used for the challenges
- Disk failure on a CERN transfer node
Software instability
- Failed transfers from GridFTP; long timeouts resulted in a significant reduction in throughput (see the sketch after this slide)
- Problems in GridFTP with corrupted files
It was often hard to isolate a problem to the right subsystem
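The long GridFTP timeouts were one of the main throughput killers. The slides do not say how (or whether) this was worked around during SC1; purely as an illustration, a client-side driver can bound each attempt and retry, so a hung transfer costs minutes rather than a full timeout period:

```python
import subprocess

def transfer_with_retry(src_url, dst_url, attempts=3, timeout_s=600):
    """Illustrative wrapper: bound each globus-url-copy attempt and retry on failure,
    so a single hung transfer cannot stall a whole transfer slot."""
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                ["globus-url-copy", "-vb", src_url, dst_url],
                timeout=timeout_s,
            )
            if result.returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: timed out after {timeout_s}s")
    return False
```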

SARA SC1 Summary
- Sustained run of 3 days at the end
- 6 hosts at the CERN side; single-stream transfers, 12 files at a time
- Average throughput was 54 MB/s
- Error rate on transfers was 2.7%
- Could transfer down each individual network link at ~40 MB/s, but this did not translate into the expected 120 MB/s aggregate
- Load on the teras and oplapro machines was never high (~6-7 for the 32-node teras, < 2 for the 2-node oplapro)
- See the Service Challenge wiki for the logbook kept during the challenge
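A quick sanity check on those figures (my arithmetic, not from the slides): three links at ~40 MB/s each should give ~120 MB/s in aggregate, so the measured 54 MB/s is under half of that and about a tenth of the 500 MB/s milestone.

```python
# Sanity check on the SC1 aggregate numbers quoted above.
per_link_mb_s = 40                 # observed rate down each individual 1 Gb link
links = 3
expected = per_link_mb_s * links   # 120 MB/s expected in aggregate
observed = 54                      # measured average over the 3-day sustained run
milestone = 500                    # SC1 target in MB/s
print(f"achieved vs expected aggregate: {observed / expected:.0%}")   # 45%
print(f"achieved vs milestone:          {observed / milestone:.0%}")  # 11%
```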

GridFTP problems
64-bit compatibility problems
- Logs negative numbers for file sizes > 2 GB
- Logs erroneous buffer sizes to the logfile if the server is 64-bit
No checking of file length on transfer
No error message when doing a third-party transfer with corrupted files
Issues followed up with the Globus GridFTP team
- The first two will be fixed in the next version
- The issue of how to signal problems during transfers is logged as an enhancement request
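The negative file sizes are the classic symptom of a size being pushed through a signed 32-bit field. A small illustration of the wrap-around, together with the kind of post-transfer length check the slides note was missing (the check is hypothetical and assumes both copies are locally visible, which is not the case for a real T0-T1 transfer):

```python
import os

# Illustration of the >2 GB logging bug: a size forced through a signed 32-bit
# field wraps around and prints as a negative number.
size = 2_500_000_000                               # ~2.5 GB file
wrapped = size - 2**32 if size >= 2**31 else size  # signed 32-bit interpretation
print(wrapped)                                     # -1794967296

# Post-transfer length check of the sort SC1 lacked (paths are placeholders).
def lengths_match(src_path, dst_path):
    return os.path.getsize(src_path) == os.path.getsize(dst_path)
```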

FermiLab & FZK – Dec 04/Jan 05
- FermiLab declined to take part in the Dec 04 sustained challenge; they had already demonstrated 500 MB/s for 3 days in November
- FZK started this week; Bruno Hoeft will give more details in his site report

What can we learn?
- SC1 did not succeed: we did not meet the milestone of 500 MB/s for 2 weeks
- We need to do these challenges to see what actually goes wrong; a lot of things do, and did, go wrong
- We need better test plans for validating the infrastructure before the challenges (network throughput, disk speeds, etc.); Ron Trompert (SARA) has made a first version of this (see the sketch below)
- We need to proactively fix low-level components (GridFTP, etc.)
- SC2 and SC3 will be a lot of work!
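Ron Trompert's test plan itself is not reproduced in these slides; purely as an illustration of the kind of pre-challenge validation meant here (network throughput, disk speed), a minimal sketch using standard tools, with the peer host and test path as placeholders:

```python
import subprocess

def check_network(peer="t1-gridftp.example.org", seconds=30):
    """Rough network throughput check with iperf (assumes an iperf server runs on the peer)."""
    subprocess.run(["iperf", "-c", peer, "-t", str(seconds)], check=True)

def check_disk(path="/storage/sc-test/ddtest", gigabytes=4):
    """Rough sequential-write check with dd, bypassing the page cache."""
    subprocess.run(
        ["dd", "if=/dev/zero", f"of={path}", "bs=1M",
         f"count={gigabytes * 1024}", "oflag=direct"],
        check=True,
    )

if __name__ == "__main__":
    check_network()
    check_disk()
```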