
Service, Operations and Support Infrastructures in HEP
Processing the Data from the World’s Largest Scientific Machine
Patricia Méndez Lorenzo (IT-GS/EIS), CERN
OGF23, Barcelona (Spain)

Patricia Méndez Lorenzo
Outline
– GRID environment
  – Users: how to support, attract... and maintain them
  – Services: maintenance, operations
– Application to WLCG: experiences we gained within that Grid infrastructure
User support and highly reliable services are key aspects of any Grid project. In this talk we address the status of the application support, the operations and the services applied to WLCG, and the lessons learnt with the gridification of the LHC experiments. 2

Patricia Méndez Lorenzo
The LHC Grid Computing
– The LHC will be completed this year and will run for years to come
– The experiments will produce about 15 million Gigabytes of data per year (about 20 million CDs!)
– LHC data analysis requires a computing power equivalent to around 100,000 of today’s fastest PC processors
– It requires many cooperating computer centres, as CERN can only provide about 20% of the capacity
[Figure: CD stack with 1 year of LHC data (~20 km), compared with a balloon at 30 km, Concorde at 15 km and Mont Blanc at 4.8 km]
– The WLCG (Worldwide LHC Computing Grid) has been built to cover the computing expectations of the 4 experiments
– WLCG needs to ensure a high degree of fault tolerance and reliability of the Grid services
  – Through a set of effective and reliable operation procedures, applicable to T0, T1 and T2 sites
  – Individualized support for the 4 LHC experiments 3
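A quick back-of-the-envelope check of the CD comparison above (a rough estimate on my part, assuming roughly 700 MB per CD):

    \frac{15\ \mathrm{PB/year}}{700\ \mathrm{MB/CD}} = \frac{15\times 10^{15}\ \mathrm{B}}{7\times 10^{8}\ \mathrm{B}} \approx 2\times 10^{7}\ \mathrm{CDs} \approx 20\ \mathrm{million\ CDs\ per\ year}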

Patricia Méndez Lorenzo
The Grid solution
The WLCG technical report lists:
– Significant costs of maintaining and upgrading the necessary resources … more easily handled in a distributed environment, where individual institutes and organizations can fund local resources whilst contributing to the global goal
  – Clear from the funding point of view
– … no single points of failure. Multiple copies of the data, automatic reassigning of tasks to resources … facilitates access to data for all scientists independent of location … round-the-clock monitoring and support.
  – Similar to the 3rd point of the “Grid Checklist” following I. Foster’s book: “To deliver nontrivial qualities of service”
The utility of the combined system is significantly greater than the sum of its parts 4

Patricia Méndez Lorenzo
Projects we use for LHC Grid
WLCG depends on 2 major science Grid infrastructures provided by:
– EGEE – Enabling Grids for E-sciencE (160 communities)
– OSG – US Open Science Grid (also supporting other communities beyond HEP)
– Both infrastructures are federations of independent Grids
– WLCG uses OSG resources and many (but not all) of the EGEE resources
[Figure: WLCG shown as the overlap of the EGEE and OSG Grids] 5

Patricia Méndez Lorenzo
EGEE Sites
[Figure: site maps – 30 sites, 3200 CPUs; 25 Universities, 4 National Labs, 2800 CPUs (Grid3)]
– EGEE: steady growth over the lifetime of the project
– EGEE: > 237 sites, 45 countries, > 35,000 processors, ~ 10 PB storage 6

Patricia Méndez Lorenzo
The WLCG Distribution of Resources
– Tier-0 – the accelerator centre
  – Data acquisition and initial processing of raw data
  – Distribution of data to the different Tiers
– Tier-1 (11 centres) – “online” to the data acquisition process, high availability
  – Managed Mass Storage – grid-enabled data service
  – Data-heavy analysis
  – National, regional support
  – Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF/SARA (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois) and Brookhaven (NY)
– Tier-2 – ~200 centres in ~40 countries
  – Simulation
  – End-user analysis – batch and interactive 7

Patricia Méndez Lorenzo
Towards a production level service
– Level of the Grid services provided by WLCG
  – Signed by all sites in the MoU
  – Contains the services provided by each site per experiment, the time schedule for recovery and the level of service
  – Ranges from 99% for key services at T0 down to 95% for several services at T2 sites
– Defined for high-level functionalities rather than individual services
– Specifications of the experiments: critical services
  – Including the degradation of the experiment in case of failures
– A general procedure has been defined to ensure a good level of service
  – Checklist for new services, recommendations for middleware and DB developers, operational techniques and procedures 8
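To put the MoU availability targets in concrete terms (a rough estimate on my part, assuming a 30-day month), the allowed accumulated downtime is:

    (1 - 0.99) \times 30 \times 24\,\mathrm{h} = 7.2\ \mathrm{h/month}, \qquad (1 - 0.95) \times 30 \times 24\,\mathrm{h} = 36\ \mathrm{h/month}

i.e. roughly 7 hours per month for a 99% key service at T0, and about a day and a half per month for a 95% service at a T2 site.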

Patricia Méndez Lorenzo 9
Building a Robust Service
Essential elements towards a robust service:
– Careful planning complemented with a set of techniques
– Deployment and operational procedures included
Always keep the following ideas in mind:
– Think service: a service is much more than a piece of middleware
– Think Grid: a change to a service at a given site might affect remote users and services
A robust service means minimum human intervention
– Mandatory for scalability and for handling huge amounts of data

Patricia Méndez Lorenzo 10
The Check-list for a new service
The following aspects must be ensured by any new service (Grid or non-Grid):
– Software implementation
  – Short glitches have to be coped with
  – A load-balanced setup among different servers is a good solution, also for maintenance procedures
  – In the case of a DB backend, re-establishing lost connections is mandatory
– Hardware setup
  – Avoid single points of failure through redundant power supplies and network connections
– Operational procedures and basic tests are needed
  – Standard diagnosis procedures should be ensured
  – Site managers will also have the possibility to define their own tests
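As an illustration of the “cope with short glitches” and DB reconnection points above, here is a minimal Python sketch, assuming a generic DB-API 2.0 style driver; the module name db_driver, the host and the database name are placeholders, not part of any actual WLCG service:

    import time

    import db_driver  # placeholder for a DB-API 2.0 driver (hypothetical name)

    def run_query(sql, retries=3, backoff=5):
        """Run a query, transparently re-establishing a lost connection.

        Short glitches (dropped connections, brief DB restarts) are retried
        with a fixed back-off instead of being surfaced to the caller.
        """
        last_error = None
        for attempt in range(retries):
            try:
                conn = db_driver.connect(host="db.example.org", dbname="service_db")
                try:
                    cur = conn.cursor()
                    cur.execute(sql)
                    return cur.fetchall()
                finally:
                    conn.close()
            except db_driver.OperationalError as exc:  # connection glitch
                last_error = exc
                time.sleep(backoff)
        raise last_error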

Patricia Méndez Lorenzo 11
Middleware deployment techniques
Robustness and resilience must be observed during middleware design
– Retro-fitting a high-availability setup onto already deployed software is much more difficult
Basic characteristics to observe:
– Modular design
  – Allows easier debugging
  – The different operational features of the system are ensured in case of individual failures
– The system architecture should be ready for glitches
  – Possible errors must be known and documented
– Time spent in maintenance state must be minimized as much as possible
– Apply the rule “think service”:
  – Can the system be cleanly and transparently stopped for the user?
  – Can the node be drained with no effect on the service?
– Apply the same rules to a Grid environment
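The “can the node be drained?” question can be made concrete with a small sketch (purely illustrative Python, assuming a single-process service; the class and method names are my own and not part of any middleware):

    import threading
    import time

    class DrainableService:
        """Toy service wrapper illustrating a clean, transparent drain:
        stop accepting new work, let running work finish, then stop."""

        def __init__(self):
            self._accepting = True
            self._active = 0
            self._lock = threading.Lock()

        def submit(self, task):
            with self._lock:
                if not self._accepting:
                    raise RuntimeError("node is draining, submit elsewhere")
                self._active += 1
            try:
                task()  # run the unit of work
            finally:
                with self._lock:
                    self._active -= 1

        def drain(self, poll=1.0):
            """Drain the node: new work is refused, in-flight work completes."""
            with self._lock:
                self._accepting = False
            while True:
                with self._lock:
                    if self._active == 0:
                        return  # safe to take the node out of production
                time.sleep(poll)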

Patricia Méndez Lorenzo 12
Operations Meetings
The experience gained at CERN over many years shows the value of running daily operations meetings
– No longer than 10 minutes
– The alarms of the previous day are checked and discussed
– Open issues are assigned to the responsible parties
– Any weakness in the system is exposed
– Experiments are also welcome to join!
Weekly operations meeting with the ROC responsibles
– Escalation of open problems, mostly opened by the experiments
– Each experiment is represented during the meeting
Besides this procedure, any operation in the system must be performed as quickly and transparently as possible
– Otherwise the degradation of the system can cause serious problems for the users and other services
Once an intervention is finished, a post-mortem and documentation procedure is required for future interventions

Patricia Méndez Lorenzo 13
Critical Services
Experiments provide their own services to implement the Grid services needed by their computing models
– Failures of these services can also compromise the experiment’s performance
– The same rules applied to the Grid services will also have to be applied to the experiment services
We already know that during the first months of real data taking we will face several instabilities
– For critical services a 24x7 service will have to be provided
– On-call procedures will be defined for those services
– For the rest, a contact person will be defined (expert call-out by console operators)

Patricia Méndez Lorenzo 14
Grid vs. Experiment software
Even within the LHC, the experiments’ requirements in terms of services are quite different
Two different middleware categories have to be considered:
– Generic middleware
  – Common to all applications
– Application-oriented middleware
  – Requires a deep knowledge of the experiment models and their use of the Grid
The experiment support has to be defined in the same terms
– Each support member needs to know the computational model of the supported experiment
– This requires a big effort by the Grid support team and hand-in-hand work and collaboration
– In the case of HEP applications, and also beyond, dedicated support members are assigned to each application

Patricia Méndez Lorenzo 15
Support Infrastructure
Experiment Integration Support (EIS) team
– Born in 2002 with the LHC Computing Grid project: ~10 members, all high energy physicists
EIS main goals:
– Helping the four LHC experiments (ALICE, ATLAS, CMS, LHCb) with Grid-related issues
– Helping the VOs with running the “production”: managing Grid and VO-specific services
– Creation of documentation (gLite User Guide)
– Contributing to end user support if required
– Providing expertise for understanding and solving Grid-related issues
  – At the site and middleware level
  – Deep knowledge of both the Grid infrastructure and the VO computing model
– Integration and testing of middleware
  – Development of missing components
  – Evaluation of middleware functionality and scalability testing
– Acting as interface between the experiments and WLCG
– Not only for HEP: also Biomed and other VOs
ARDA team
– Devoted to development for the experiments (and not only!)
  – Monitoring dashboard, GANGA, Diane, AMGA, …

Patricia Méndez Lorenzo 16
VO Monitoring
VOs need to have the full picture of the Grid status: SAM
– Grid services
– VO-specific services
… and to know how they are doing on the Grid: Dashboard
– Job status, success/failure statistics
– Status of data transfers
Exactly at the boundary between the application and the Grid domain
Examples from the LHC computing:
– Usage of the Service Availability Monitoring (SAM) for VO-specific service monitoring
– ARDA dashboard to monitor the experiment workflows
– Fully inter-correlated
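To give a flavour of what an availability test in the SAM spirit looks like, here is a minimal, purely illustrative Python probe; the endpoint se.example.org and the port are placeholders, and real SAM tests are far richer and run within the SAM framework itself:

    import socket
    import sys
    import time

    # Hypothetical endpoint; real tests target actual Grid and VO services.
    HOST, PORT, TIMEOUT = "se.example.org", 8443, 10

    def probe(host, port, timeout):
        """Return (status, detail) for a basic TCP reachability check."""
        start = time.time()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return "OK", "connected in %.2fs" % (time.time() - start)
        except OSError as exc:
            return "CRITICAL", str(exc)

    if __name__ == "__main__":
        status, detail = probe(HOST, PORT, TIMEOUT)
        print("%s: %s:%d %s" % (status, HOST, PORT, detail))
        sys.exit(0 if status == "OK" else 2)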

Patricia Méndez Lorenzo 17
SAM and Dashboard monitoring displays

Patricia Méndez Lorenzo
Summary
– WLCG was born to cover the high computing and storage demands of the experiments from 2008 onwards
– Today we count more than 200 sites in 34 countries
– The whole project is the result of a big collaboration between experiments, developers and sites
– It is a reality: the Grid is being used in production
– It covers a wide range of research fields, from HEP applications to humanitarian actions 18