Grid Deployment: Data challenge follow-up & lessons learned
Ian Bird, LCG Deployment Area Manager
LHCC Comprehensive Review, 22nd November 2004

Outline
- Issues seen in the data challenges
- Responses and requirements
- Lessons learned: operations, certification, support
- Directions for the future: service challenges, etc.

Problem summaries
- Outstanding middleware problems have been collected during the DCs:
  - Limitations_and_Requirements.pdf
  - and in the GAG document: gag/LCG_GAG_Docs/DCFeedBack.pdf
- First systematic confrontation of the required functionality with the capabilities of the existing middleware
- Some problems can be patched or worked around; most have to be fed directly as essential requirements into future developments

Major issues seen in the data challenges
- Job success rate
  - Due in large part to site mis-configurations and lack of fabric management (and expertise)
  - Many grid sites are new – they have never run a cluster before
  - Started to address this in the operations workshop: create a fabric management handbook (recipes and best practices, as sketched below), leveraging HEPiX, EGEE training, etc.
  - Now much improved – monitoring was significantly improved in the last 6 months, with problems followed up directly with sites
  - Should improve further – the process is now better understood and agreed, and is being implemented in the EGEE CICs + Taipei GOC; more effort involved helps, with rotating responsibility
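As an illustration of the kind of recipe such a fabric management handbook might contain, here is a minimal sketch of a worker-node sanity check. The commands, paths and endpoint it checks (lcg-cp, globus-url-copy, /opt/exp_soft, an assumed BDII host) are hypothetical choices for illustration, not part of any particular LCG release.

```python
#!/usr/bin/env python
"""Minimal worker-node sanity check (illustrative sketch only)."""
import os
import shutil
import socket
import sys

REQUIRED_COMMANDS = ["globus-url-copy", "lcg-cp"]  # assumed client tools
REQUIRED_PATHS = ["/opt/exp_soft"]                 # assumed shared software area
OUTBOUND_HOST = ("lcg-bdii.cern.ch", 2170)         # assumed information system endpoint

def main() -> int:
    failures = []

    # Check that the expected grid client commands are on the PATH.
    for cmd in REQUIRED_COMMANDS:
        if shutil.which(cmd) is None:
            failures.append(f"missing command: {cmd}")

    # Check that the shared experiment software area is mounted.
    for path in REQUIRED_PATHS:
        if not os.path.isdir(path):
            failures.append(f"missing directory: {path}")

    # Check basic outbound connectivity to the information system.
    try:
        socket.create_connection(OUTBOUND_HOST, timeout=10).close()
    except OSError as exc:
        failures.append(f"cannot reach {OUTBOUND_HOST[0]}:{OUTBOUND_HOST[1]}: {exc}")

    for f in failures:
        print("FAIL:", f)
    if not failures:
        print("OK: basic worker-node checks passed")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```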

Issues – 2: Workload Management System
- The simple Globus model does not match the complexity of modern batch systems with fair-share schedulers
- The RB does not communicate the full set of resource requirements to the batch systems (an illustrative job description is sketched below)
- The current IS schema needs expanding to improve the description of batch capabilities, heterogeneity, etc.
- Many workarounds have been put in place, but these are hard to scale
- The workload management system from gLite is anticipated very soon; we must make sure that these issues are addressed and resolved
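For context, a user's resource requirements reach the RB as a JDL Requirements expression over Glue schema attributes, which the RB matches against the information system; as noted above, not all of this reaches the site batch system. The snippet below is an illustrative sketch only: the executable, sandbox files, threshold values and ranking expression are assumptions, not a recommended configuration.

```python
# Illustrative sketch: writing a JDL file whose Requirements expression uses
# Glue schema attributes. File names and threshold values are hypothetical.
jdl = """\
Executable    = "run_analysis.sh";
Arguments     = "dataset.list";
StdOutput     = "job.out";
StdError      = "job.err";
InputSandbox  = {"run_analysis.sh", "dataset.list"};
OutputSandbox = {"job.out", "job.err"};
Requirements  = other.GlueCEPolicyMaxCPUTime > 1440
             && other.GlueHostMainMemoryRAMSize >= 1024;
Rank          = -other.GlueCEStateEstimatedResponseTime;
"""

with open("analysis.jdl", "w") as f:
    f.write(jdl)
print("wrote analysis.jdl")
```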

Issues – 3: Information system
- This was continually improved
  - Based on the BDII – a drop-in replacement for the MDS components (a simple query is sketched below)
  - Can now scale to hundreds of sites, with good robustness
- Many issues that manifest as IS problems are in fact due to underlying fabric problems
  - E.g. PBS hangs, so the IS is not updated
- The BDII allowed each experiment to have a "view" of the full system, tailored to its needs
  - Specify sites that are experiment-verified and LCG-verified
  - Provided the experiment with control over which sites it used
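Since the BDII is a plain LDAP server publishing the Glue schema, any LDAP client can read it. The minimal sketch below uses the python-ldap library; the endpoint and base DN are assumptions for illustration (a top-level BDII conventionally listens on port 2170 with base "o=grid").

```python
# Minimal sketch of querying a BDII for published computing elements.
import ldap  # python-ldap

BDII_URI = "ldap://lcg-bdii.cern.ch:2170"  # assumed endpoint
BASE_DN = "o=grid"                          # assumed base DN

conn = ldap.initialize(BDII_URI)
conn.protocol_version = ldap.VERSION3
conn.simple_bind_s()  # anonymous bind is sufficient for reading

# Find all published computing elements and a couple of their Glue attributes.
results = conn.search_s(
    BASE_DN,
    ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",
    ["GlueCEUniqueID", "GlueCEStateStatus"],
)

for dn, attrs in results:
    ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
    status = attrs.get("GlueCEStateStatus", [b"?"])[0].decode()
    print(ce, status)
```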

Issues – 4: Data management
- Replica management tools
  - EDG tools replaced by lcg-utils – additional requested functionality and better performance (an illustrative use of the commands is sketched below)
  - This was available early in the DCs but was not used by all experiments
- Replica catalogue performance and functionality
  - Re-engineering the EDG RLS to address these issues
    - Same interfaces as now – lcg-utils
    - Propose also to provide the gLite catalogue interface, to quickly provide a simple implementation with the new interface
- GFAL was originally a low-level I/O interface to grid storage; it now provides:
  - File catalogue abstraction
  - Storage element abstraction (EDG SE, EDG 'Classic SE', SRM v1)
  - Connection to the information system
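A minimal sketch of driving the lcg-utils replica management commands from a script follows. The VO name, SE hostname and logical file name are hypothetical, and the exact option set should be checked against the installed lcg-utils version.

```python
import subprocess

VO = "myvo"                              # hypothetical VO
DEST_SE = "se.example.org"               # hypothetical storage element
LFN = "lfn:/grid/myvo/user/data.root"    # hypothetical logical file name

def run(cmd):
    """Print and execute a command, failing loudly on a non-zero exit code."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Copy a local file to the SE and register it in the replica catalogue.
run(["lcg-cr", "--vo", VO, "-d", DEST_SE, "-l", LFN,
     "file:/home/user/data.root"])

# Later, copy a replica back to local disk using the logical name.
run(["lcg-cp", "--vo", VO, LFN, "file:/tmp/data.root"])
```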

Layered Data Management APIs
[Diagram: user tools and experiment frameworks sit above lcg_utils (data management: replication, indexing, querying) and vendor-specific APIs; lcg_utils builds on GFAL, which provides cataloguing (EDG, LFC), storage (SRM, Classic SE) and data transfer (gridftp, robust data transfer).]

Issues – 5
- Data transfer (gridftp) unreliability
  - Building a reliable file transfer service – the goal is to demonstrate reliability and performance in a series of service challenges (a retry sketch follows below)
  - Currently in the tuning phase with sites; Fermi is achieving 3 Gbit/s sustained transfer rates
  - Service challenges start in December: NIKHEF/SARA first, then Karlsruhe and Fermi
- Lack of managed storage elements
  - dCache is seen as the solution, but it is still not ready for simple widespread deployment; significant effort has been invested to make it a solution for large sites
  - Lightweight alternative – provide basic space management and an SRM interface, with data access via gridftp and rfio
- The disk-pool manager and the file transfer service are not LCG-2 specific – they are needed for both LCG-2 and gLite based services
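For scale, 3 Gbit/s sustained corresponds to roughly 32 TB per day. The sketch below shows the basic "retry on failure" pattern that a reliable transfer layer has to provide on top of a raw gridftp client; the source and destination URLs are hypothetical, and tuning options (parallel streams, TCP buffers) are deliberately omitted.

```python
import subprocess
import time

SRC = "gsiftp://castor.cern.ch/castor/cern.ch/grid/data/file.root"          # hypothetical
DST = "gsiftp://dcache.example-t1.org/pnfs/example-t1.org/data/file.root"   # hypothetical

def transfer_with_retries(src, dst, attempts=5, backoff=30):
    """Retry a gridftp transfer with a linearly increasing back-off."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["globus-url-copy", src, dst])
        if result.returncode == 0:
            print(f"transfer succeeded on attempt {attempt}")
            return True
        print(f"attempt {attempt} failed (rc={result.returncode}); "
              f"retrying in {backoff * attempt}s")
        time.sleep(backoff * attempt)
    return False

if __name__ == "__main__":
    if not transfer_with_retries(SRC, DST):
        raise SystemExit("transfer failed after all retries")
```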

Lessons – Operations – 1
- (Lack of) hierarchy in:
  - Middleware distribution
  - Operation and user support
  - Communication (sites <-> GDB; general) – mailing lists took over all discussions …
  -> EGEE-based structure with a hierarchy of OMC, CIC, ROC
  - Still need fast reaction for (e.g.) data challenges – task forces?
- Installation and configuration tools
  - For middleware these have to be as simple as possible and flexible
  - No single solution fits both small/large and established/novice sites
  -> Move towards generic procedures that can be used with any tools
- Certification and release preparation
  - Very effective, but slow and opaque to outside groups
  - Some problems show up only in real production and under load
  -> Streamlined certification and release process
  -> Separate core and "add-on" middleware
  -> Advertise forthcoming changes – but expediency is not always democratic …

Lessons – Operations – 2
- Monitoring and troubleshooting
  - Many tools available for users, admins and operators
  - Big overlap/duplication of effort, independent solutions
  -> Establishing a single monitoring infrastructure
    - Based on R-GMA, aggregating many sources of information; usable by many tools to produce customised views and displays
    - Already contains many of the existing tools
- Fabric management
  - Many (new) sites have no experience or expertise
  -> Produce a fabric management recipe/best-practice document (part of the EGEE "cookbook"), HEPiX, …
- Testbed for early application access and testing
  -> EIS testbed was provided
  -> Pre-production service gets sites and operators involved too
- Middleware to be installed on the WN: as close to zero as possible
  - Very important for co-existence with other grids, etc.

Lessons – Certification – 1
- The certification process was crucial
  - Almost all installations worked the first time
  - Need to tie in the EIS testbed (and now the pre-production system) as an integral part of the certification and release process … and integrate deployment preparation
- Release cycles were monthly
  - But this was too short a cycle
- Public discussion of release contents
  - Did not happen fully – expediency during the DCs – but we need more input from site managers to improve the process
- Inclusion of external efforts
  - The current process does not easily allow inclusion of work done by external groups – it needs to be modified
- Testing of experiment software
  - Was found to be extremely useful – investigate how to do this more effectively – pre-production service?
- Developers MUST be exposed to the middleware problems
  - Fixing must be their TOP priority – not the next piece of functionality

Lessons – Support – 1
- Having the EIS group was essential
  - Support people who learned the experiment environments and issues and followed the problems
  - Could certainly be expanded – needs backup people
- Experiments used LCG-2 for the first time only when the DC started!
  - A lot of time was lost finding problems that could have been seen earlier – viz. the LHCb experience
  - Only at this point did we learn how the experiments wanted to use the middleware, which meant changes and adaptations during the DC
  - Experiment interfaces were still being written after the DCs started … real functional/performance issues only surfaced then
  - LCG-2 was only ready in time for the DCs, but LCG-1 had only been lightly used and not in the DC mode
- Suspect the level of effort needed to interface the experiments' software to the middleware was underestimated

Lessons – Support – 2
- The EIS testbed (or pre-production service) needs to be an integral part of the release process, with the experiments' grid experts fully involved
  - Early feedback on the middleware
- Services and software (middleware and experiments) expect other services to be reliable
  - This is not possible in a distributed system – they must deal with failure and retries
- Must have a "one-stop shop" for all problems
  - One place where all problems are reported and dispatched
  - It has not been clear who should report what and where
  -> Outcome of the Ops workshop: user support group

Future directions
- Deployment strategy
- Working groups
- Service challenges
- Planning and milestones

Deployment and services
- The current LCG-2 based service continues as the production service for batch work
  - Experiments are moving to a continuous MC production mode
  - Together with the work in hand, this provides a well-understood baseline service
- Deploy a pre-production service in parallel
  - Deploy LCG-2 components, and gLite components as they are delivered
  - Understand how to migrate from LCG-2 to gLite:
    - Which components can be replaced
    - Which can run in parallel
    - Do the new components satisfy the requirements – functional and management/deployment?
  - Move proven components into the production system

LCG-2 vs gLite vs AliEn
- LCG-2 is the current production service
- AliEn can use LCG-2 as a resource
- The "gLite prototype" is based on AliEn
  - It has been deployed at a few sites "by hand" and is used by ARDA to start to address analysis on the grid
- gLite – development based on the ARDA prototype
  - Building the 2nd-generation middleware, using ideas from AliEn and EDG, as well as VDT etc., based on web services
  - Will be gradually deployed in pre-production as it is delivered

Interoperability issues
- LCG-2 and Grid3 are based on the same tools and schema
  - We see a way to reasonable interoperation
  - Not all issues are resolved (e.g. file catalogues)
- For LCG/EGEE, gLite is the way forward – this is where the development effort is
  - What will OSG deploy – will it be compatible with gLite?
  - What will NorduGrid/ARC do – gLite?
- How can we avoid divergence?
  - LCG-2 is reviewed/monitored by all parties
  - OSG and NorduGrid deployment decisions are not reviewed by the LCG project
  - The risk is that interoperability is not possible in that situation

Operations management
- Following the operations workshop, we have working groups on:
  - Operations support
  - User support
  - Fabric management
  - Operational security group
  with agreed strategies, currently being implemented
- Planning longer-term milestones, designed to improve service reliability and management
  - Checkpoints with service challenges and data challenges
- Many issues are common with Grid3/OSG – we want to bring these together as far as possible

Service challenges
- Proposed to be used in addition to ongoing data challenges and production use:
  - The goal is to ensure that the baseline services can be demonstrated
  - Demonstrate the resolution of the problems mentioned earlier
  - Demonstrate that operational and emergency procedures are in place
- 4 areas proposed:
  - Reliable data transfer: demonstrate this fundamental service for Tier 0 -> Tier 1 by end 2004
  - Job flooding/exerciser: understand the limitations and baseline performance of the system; may be achieved by the ongoing real productions
  - Incident response: ensure the procedures are in place and work – before real life tests them
  - Interoperability: how can we bring together the different grid infrastructures?
- Now proposed to expand to Tier 1 readiness challenges
  - A set of milestones to build up to full data rates over the next 2 years
  - Exactly what these should be is under discussion now

Planning next steps
- Planning is under way now
  - The goal is to agree a set of basic milestones by end 2004, covering:
    - Service challenges
    - Operations, support, security, fabric management
  - These should be monitored by the GDA steering group (Tier 1 and large Tier 2 managers) … and the GDB
- Coordination and management
  - Grid Deployment Board – policy and high-level agreements
  - GDA steering group – following deployment milestones and directions
  - Weekly operations meeting – coordinating operational issues
  - Operational security group
  - Working groups set up in the workshop

Summary
- Data challenges have run during 2004
  - Many issues have been raised and much has been learned
  - A lot has been addressed already
  - The experience is documented as input to new developments
- The strategy for improving operations/support is agreed, and implementation has started
- Service and data challenges, and ongoing productions, will be used to show improvements and progress
- Preparing for gLite deployment
- Interoperability/co-existence of grids is a reality
  - Issues need to be addressed at all levels – management and technical