LCG Service Challenges: Progress Since The Last One – July


LCG Service Challenges: Status and Plans

Introduction
- Neither SC1 nor SC2 fully met their goals
  - SC2 met / exceeded its throughput goals, but not its service goals…
- Multiple threads started early 2005 to address:
  - Bringing experiments into the loop (SC3+)
  - Bringing T2s into the loop (ditto)
  - Preparing for full production services
  - Addressing problems beyond 'throughput goals', e.g. site / experiment goals, additional services etc.
- All Tier1s are now involved! Many Tier2s! New s/w successfully deployed!
  - Will not comment on individual successes / issues – site slots for that!
- Successful workshops, tutorials (April, May, June) and site visits!
- Throughput tests gradually approaching target (more later)
  - Need to understand the problem areas and address them
- Acknowledge the progress / successes / hard work of many!

Executive Summary (updated since PEB)
- 'Pilots' – LFC & FTS
  - Scheduled originally for mid-May; multiple delays (obtaining / configuring h/w, s/w, procedures etc.)
  - LFC has been available for some weeks; testing with ATLAS, ALICE, CMS, LHCb
  - FTS fully available since Monday 11th July, using the "Quick Fix" release from the previous Friday…
- SC3 Throughput Tests have started!
  - Seeing 'SC2-level' traffic using FTS (most T1s) + PhEDEx (FNAL + others)
  - Problems at many sites at the SRM level: see the monitoring page
  - Holes in service over the w/e (as expected)
  - Need to debug the SRMs before we can look at the remaining FTS failures
  - We will learn a lot about running these basic services (whilst shaking down the services significantly)
  - Key deliverable: reliable, stress-tested core data management services
- Site preparations: work still needed for the Service Phase!
  - Valuable information through the SC Wiki
  - Experiments in direct contact with some sites (e.g. Lyon); this is helping to push the preparation!
  - See http://cern.ch/LCG -> Service Challenges
- An awful lot has been achieved since SC2 (and SC1…) but still more ahead…

Site Components – Updated
- Each T1 to provide a 10Gb network link to CERN
- Each site to provide an SRM 1.1 interface to managed storage
  - All sites involved in SC3: T0, T1s, T2s
- T0 to provide the File Transfer Service; also at named T1s for T2-T1 transfer tests
  - Named Tier1s: BNL, CNAF, FZK, RAL; others also setting up FTS
  - CMS T2s being supported by a number of T1s using PhEDEx
- LCG File Catalog – not involved in Throughput but needed for Service
  - ALICE / ATLAS: site local catalog
  - LHCb: central catalog with >1 R/O 'copies' (on ~October timescale); IN2P3 to host one copy; CNAF? Taiwan? RAL?
  - CMS: evaluating different catalogs – FNAL: Globus RLS; T0 + other T1s: LFC; T2s: POOL MySQL, GRLS, …
- T2s – many more than foreseen
  - Running DPM or dCache, depending on T1 / local preferences / support
  - [ Support load at CERN through DPM / LFC / FTS client ]
- Work still needed to have these consistently available as services
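The FTS deployment model above (a T0 service, plus instances at named T1s driving T2-T1 transfers) is channel-based: each managed queue of copies belongs to one source/destination pair. The following is only a toy sketch of that idea; the class, method names and `srm://` URLs are hypothetical and this is not the real FTS API.

```python
from collections import deque


class Channel:
    """Toy model of an FTS-style transfer channel: a managed queue of
    file copies between one source site and one destination site."""

    def __init__(self, source_site, dest_site):
        self.source_site = source_site
        self.dest_site = dest_site
        self.queue = deque()   # pending (src, dst) transfers
        self.done = []         # completed transfers

    def submit(self, src_url, dst_url):
        """Queue one file copy on this channel."""
        self.queue.append((src_url, dst_url))

    def process(self, copy, max_retries=2):
        """Drain the queue using copy(src, dst) -> bool (an SRM copy in
        real life), retrying failures; return the transfers that never
        succeeded so they can be re-queued or reported."""
        failed = []
        while self.queue:
            src, dst = self.queue.popleft()
            for _attempt in range(1 + max_retries):
                if copy(src, dst):
                    self.done.append((src, dst))
                    break
            else:
                failed.append((src, dst))
        return failed
```

A per-channel queue like this is what lets a service throttle and retry transfers between two specific sites independently of the rest of the grid.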

Tier2 participation by Tier1 – (approx) status mid-June

  ASCC, Taipei              Yes; preparing for T2 support in Asia-Pacific
  CNAF, Italy               Yes; workshop held end May in Bari
  PIC, Spain                Yes; no Oracle service for FTS; CMS transfers with PhEDEx
  IN2P3, Lyon               Yes; LAL + IN2P3
  GridKA, Germany           Yes – studying with DESY
  RAL, UK                   Yes – plan in place for several Tier2s
  BNL, USA                  Yes – named ATLAS Tier2s
  FNAL, USA                 Yes – CMS transfers with PhEDEx; already performing transfers
  TRIUMF, Canada            Yes – planning to install FTS and identify T2s for tests
  NIKHEF/SARA, Netherlands  Re-evaluate on SC4 timescale (which T2s outside NL?)
  Nordic Centre             Yes; preparing T1 / T2s in the Nordic region
  CERN                      Swiss T2 plus some others not unlikely

- Virtually all Tier1s are actively preparing for Tier2 support
- Much interest from the Tier2 side: debugging the process rapidly!
- Some Tier2s still need to identify their Tier1 centre
- This is an(other) area where things are looking good!

T2s (candidate sites, grouped by probable Tier1; per-experiment columns not preserved)

  NIKHEF/SARA:  Amsterdam; Free University (Amsterdam, NL); Univ. of Nijmegen (NL); Univ. of Utrecht (NL)
  CERN:         Geneva; CSCS (Manno, Switzerland)
  FZK?:         Prague (Czech Rep.); KFKI, SZTAKI and Eotvos University (Budapest, Hungary)
  NDGF?:        Helsinki Institute of Physics (Helsinki, Finland)
  FZK?:         Krakow (Poland); Warszawa (Poland)
  ?:            Russian Tier-2 cluster (Moscow, Russia)
  ?:            Technion (Haifa), Weizmann (Rehovot), Tel Aviv Univ. (Israel)
  ?:            PAEC-1/NCP/NUST/COMSATS (Pakistan)
  PIC?:         UERJ (Rio de Janeiro, Brazil)
  ?:            TIFR (Mumbai, India); VECC/SINP (Kolkata, India)
  ?:            Melbourne; Cape Town; etc.

Services at CERN
- Building on the 'standard service model':
1. First level support: operations team
   - Box-level monitoring, reboot, alarms, procedures etc.
2. Second level support team: Grid Deployment group
   - Alerted by operators and/or alarms (and/or production managers…)
   - Follow 'smoke-tests' for applications
   - Identify the appropriate 3rd level support team to call
   - Responsible for maintaining and improving procedures
   - Two people per week: complementary to the System Manager on Duty
   - Provide a daily report to the SC meeting (09:00); interact with experiments
   - Members: IT-GD-EIS, IT-GD-SC (including me)
   - Phone numbers: ;
3. Third level support teams: by service
   - Notified by 2nd level and / or through operators (by agreement)
   - Should be called (very) rarely… (Definition of a service?)

Services elsewhere
- Several services require a DB behind them:
  - CASTOR / dCache / DPM etc.
  - FTS
  - LFC
- LFC (today) and FTS (October?) will support MySQL as well as an Oracle database backend
  - CASTOR also does this today (PIC)
- Knowledge of the community being leveraged to provide guidance – through the Wiki – on how to do these
  - e.g. proposal for DB backup at T2s, archiving the recovery set at a T1
  - (stop server; copy file & restart; archive at T1, or hot backup, as sample options)
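The "stop server; copy file & restart; archive at T1" option above could be sketched roughly as below. This is an illustration only: the function name and paths are placeholders, and the server stop/start commands are left as comments because they are site-specific (and a hot backup would be a different procedure entirely).

```python
import tarfile
import time
from pathlib import Path


def cold_backup(datadir: str, archive_dir: str) -> Path:
    """'Stop server; copy file & restart' cold backup of a MySQL-style
    data directory, producing a tarball (the recovery set) that a T2
    would then archive at its T1."""
    # 1. Stop the DB server so the data files are consistent on disk
    #    (site-specific, e.g.):
    # subprocess.run(["mysqladmin", "shutdown"], check=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    out = Path(archive_dir) / f"db-backup-{stamp}.tar.gz"

    # 2. Copy the data files into a single compressed archive
    with tarfile.open(out, "w:gz") as tar:
        tar.add(datadir, arcname="data")

    # 3. Restart the server (site-specific, e.g.):
    # subprocess.run(["/etc/init.d/mysqld", "start"], check=True)

    # 4. Ship the tarball to the T1 (scp / GridFTP); omitted here.
    return out
```

The cost of this scheme is a short service outage per backup, which is exactly why the slide lists a hot backup as the alternative sample option.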

More on Services
- 24 x 7 services do not mean that people have to be chained to the computer 24 x 7
  - Services must be designed / deployed to be as reliable and recoverable as possible
  - Monitor to check that this is so – including end-to-end monitoring
- Cannot tolerate a major component failing on Friday evening and not being looked at until Monday morning… after coffee…
  - Eventually run in degraded mode?
- Need to use existing experience and technology…
  - Monitoring, alarms, operators, SMS to 2nd / 3rd level support…
- Now is the time to get these procedures in place
- Must be able to arrange that suitable experts can have network access within a reasonable time
  - Even from the beach / on the plane…
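One simple way to combine "monitor end to end" with "eventually run in degraded mode" is to drive alarms from consecutive probe failures rather than from any single failure, so a transient blip degrades the service while a sustained outage pages 2nd-level support. The policy function below is a toy illustration of that idea, not an existing LCG monitoring tool; the names and threshold are invented.

```python
def service_state(history, alert_after=3):
    """Classify a service from its recent end-to-end probe results
    (True = probe passed, newest result last).

    - no current failure streak      -> "OK"
    - short streak (< alert_after)   -> "DEGRADED" (keep running, no page)
    - long streak (>= alert_after)   -> "ALERT" (SMS to 2nd/3rd level)
    """
    streak = 0
    for ok in reversed(history):
        if ok:
            break
        streak += 1
    if streak == 0:
        return "OK"
    return "DEGRADED" if streak < alert_after else "ALERT"
```

With `alert_after=3` and probes every few minutes, a single Friday-evening glitch never wakes anyone, but a component that stays down does get escalated instead of waiting until Monday.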

SC3 – Deadlines and Deliverables
- May 31st 2005: basic components delivered and in place
- June 2005: integration testing
- June 13 – 15: planning workshop – experiment issues
- June 30th 2005: integration testing successfully completed
- July 1 – 10: start disk – disk throughput tests
  - Assume a number of false starts / difficulties
- July 11 – 20: disk tests
- July 21 – 27: tape tests
- July 28 – 31: T2 tests

Service Schedule (Raw-ish)
[Table: per-experiment service schedule (ALICE, ATLAS, CMS, LHCb) across Sep – Dec; the cell contents did not survive in this transcript]

SC Communication
- Service Challenge Wiki – cern.ch/LCG -> Service Challenges
  - Contains Tier-0 and Tier-1 contact / configuration information and work logs for SC teams
- Weekly phone-cons on-going
  - Dial-in number: ; Access code:
- Daily service meetings for the CERN teams from 27th June
  - B28 R-015: standing agenda and minutes via the Wiki
- Technical communication through the mailing list
- What else is required by Tier-1s?
  - Daily (or frequent) meetings during the SC?

SC Meetings / Workshops
- Not enough support for a September workshop
  - Despite +ve feedback from the April & June workshops
  - Propose to continue with the CHEP workshop nevertheless
- I believe the weekly con-calls are useful
  - Judging by their length / number of people joining etc.
  - There are still many issues we need to discuss / resolve
  - Please bring up issues that worry you!
- GDBs in September / October?

SC3 Summary
- There has been a great deal of progress since SC2!
  - Particularly in the areas of monitoring, services, procedures, documentation, delivery of pilots, the LCG 2.5 release, other s/w…
  - Integration of the remaining T1s, adding T2s, …
- Good understanding of and agreement on the goals of SC3
  - What services need to run where
  - Proposed metrics to define success
  - Outline schedule – detailed resource requirements still sketchy
- Concerns about readiness to run production-level services
  - Preparations are late, but there is lots of pressure and effort
  - Are enough resources available to run the services? Backups, single points of failure, vacations, …
- SC3 leads to real production services by the end of the year
  - Must continue to run during preparations for SC4
  - This is the build-up to the LHC service – we must ensure that appropriate resources are behind it
  - Still a number of 'pressure points' and 'single points of failure'

Postscript…
  "Wanted. Man for a hazardous journey. Low wages, intense cold, long months of darkness and constant risks. Return uncertain."
  – E. Shackleton, London newspaper, 1913

LCG Service Challenge 3: Preparation for the Service Phase

What Remains to be Done?
- Baseline services set up at all participating sites
- Validation through sample jobs provided by the experiments
- Agreement on resource requirements and schedule
- Agreement on metrics
- Resolution of outstanding issues (VO-boxes, experiment-specific services, clear definition of support lines, software components, releases and dependencies etc.)
- …
