CHEP 2006, February 2006, Mumbai 1 LHCb use of batch systems A.Tsaregorodtsev, CPPM, Marseille HEPiX 2006, 4 April 2006, Rome

CHEP 2006, February 2006, Mumbai 2 Outline • LHCb Computing Model • DIRAC production and analysis system • Pilot agent paradigm • Application to the user analysis • Conclusion

CHEP 2006, February 2006, Mumbai 3 LHCb Computing Model

CHEP 2006, February 2006, Mumbai 4 DIRAC overview • LHCb grid system for Monte Carlo simulation data production and analysis • Integrates the computing resources available at LHCb production sites as well as on the LCG grid • Composed of a set of light-weight services and a network of distributed agents that deliver workload to the computing resources • Runs autonomously once installed and configured on production sites • Implemented in Python, using the XML-RPC protocol for service access. DIRAC – Distributed Infrastructure with Remote Agent Control
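
As an illustration of the last point, here is a minimal sketch of how a client could talk to a DIRAC-style service over XML-RPC from Python. The endpoint URL and the getJobStatus method are hypothetical, and the Python 3 xmlrpc.client module stands in for the xmlrpclib used at the time; GSI security is not shown.

import xmlrpc.client

# Hypothetical Job Monitoring service endpoint (not a real DIRAC URL)
monitor = xmlrpc.client.ServerProxy("https://dirac.example.org:9130/JobMonitorSvc")

def get_job_status(job_id):
    """Ask the (hypothetical) service for the status of one job."""
    try:
        return monitor.getJobStatus(job_id)
    except Exception as err:
        # service fault or network problem
        return {"OK": False, "Message": str(err)}

if __name__ == "__main__":
    print(get_job_status(12345))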

CHEP 2006, February 2006, Mumbai 5 DIRAC design goals • Light implementation • Must be easy to deploy on various platforms • Non-intrusive: no root privileges, no dedicated machines on sites • Must be easy to configure, maintain and operate • Uses standard components and third-party developments as much as possible • High level of adaptability • There will always be resources outside the LCG domain: sites that cannot afford LCG, desktops, … • We have to use them all in a consistent way • Modular design at each level • Easy to add new functionality

CHEP 2006, February 2006, Mumbai 6 DIRAC Services, Agents and Resources [architecture diagram] Services: DIRAC Job Management Service, JobMonitorSvc, JobAccountingSvc, ConfigurationSvc, FileCatalogSvc, BookkeepingSvc, MessageSvc. Clients: Production Manager, GANGA, DIRAC API, Job monitor, BK query webpage, FileCatalog browser. Agents deliver the workload to the Resources: LCG Grid, WN, Site Gatekeeper, Tier1 VO-box.

CHEP 2006, February 2006, Mumbai 7 DIRAC Services • DIRAC Services are permanent processes, deployed centrally or running on the VO-boxes, that accept incoming connections from clients (UI, jobs, agents) • Reliable and redundant deployment • Run under a watchdog process for automatic restart on failure or reboot • Critical services have mirrors for extra redundancy and load balancing • Secure service framework: XML-RPC protocol for client/service communication with GSI authentication and fine-grained authorization based on user identity, groups and roles
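
A minimal sketch of such a permanent service process, assuming the Python standard library XML-RPC server; the real DIRAC framework adds GSI authentication, fine-grained authorization and a watchdog, none of which are reproduced here, and the method names, port and in-memory store are invented for illustration.

from xmlrpc.server import SimpleXMLRPCServer

JOB_STATUS = {}  # in-memory store standing in for the service back-end

def set_job_status(job_id, status):
    """Record a status report sent by a job or an agent."""
    JOB_STATUS[job_id] = status
    return {"OK": True}

def get_job_status(job_id):
    """Return the last known status of a job."""
    return {"OK": True, "Value": JOB_STATUS.get(job_id, "Unknown")}

if __name__ == "__main__":
    server = SimpleXMLRPCServer(("0.0.0.0", 9130), allow_none=True)
    server.register_function(set_job_status)
    server.register_function(get_job_status)
    server.serve_forever()  # a watchdog would restart this process after a failure or reboot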

CHEP 2006, February 2006, Mumbai 8 DIRAC workload management • Realizes the PULL scheduling paradigm • Agents request jobs whenever the corresponding resource is free • Condor ClassAds and the Matchmaker are used to find jobs suitable for the resource profile (the idea is sketched below) • Agents steer the job execution on site • Jobs report their state and environment to the central Job Monitoring service
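
The production system uses Condor ClassAds for the matchmaking; the sketch below only illustrates the pull idea with plain Python dictionaries, and all field names and values are invented.

def matches(job, resource):
    """True if the free resource satisfies the job requirements."""
    return (resource["Site"] in job.get("Sites", [resource["Site"]])
            and resource["CPUTime"] >= job["CPUTime"]
            and resource["Memory"] >= job.get("Memory", 0))

def pick_job(task_queue, resource):
    """Return the first waiting job that fits the resource profile, if any."""
    for job in task_queue:
        if matches(job, resource):
            return job
    return None

resource = {"Site": "LCG.CERN.ch", "CPUTime": 86400, "Memory": 2048}
queue = [{"JobID": 1, "CPUTime": 360000},                # too long for this slot
         {"JobID": 2, "CPUTime": 3600, "Memory": 1024}]  # fits
print(pick_job(queue, resource))  # -> job 2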

CHEP 2006, February 2006, Mumbai 9 WMS Service • The DIRAC Workload Management System is itself composed of a set of central services, pilot agents and job wrappers • The central Task Queue makes it easy to apply the VO policies by prioritizing user jobs (see the sketch below), using the accounting information and the user identities, groups and roles (VOMS) • Job scheduling happens at the last moment • With pilot agents the job goes to a resource for immediate execution • Sites are not required to manage user shares/priorities • A single long queue with a guaranteed LHCb quota per site is enough
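
A toy sketch of how a central Task Queue could rank waiting jobs by VO policy: jobs are ordered by a per-group priority and, within a group, by submission order. The group names and priority values are invented, not actual LHCb policy.

import heapq
import itertools

GROUP_PRIORITY = {"lhcb_prod": 10, "lhcb_user": 5}  # hypothetical VO policy

class TaskQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # preserves FIFO order within equal priority

    def add_job(self, job_id, group):
        priority = GROUP_PRIORITY.get(group, 1)
        # heapq is a min-heap, so the priority is negated to pop the highest first
        heapq.heappush(self._heap, (-priority, next(self._counter), job_id))

    def pop_job(self):
        """Hand the highest-priority waiting job to a requesting pilot."""
        return heapq.heappop(self._heap)[2] if self._heap else None

tq = TaskQueue()
tq.add_job(101, "lhcb_user")
tq.add_job(102, "lhcb_prod")
print(tq.pop_job())  # -> 102, the production job goes first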

CHEP 2006, February 2006, Mumbai 10 DIRAC Agents • Light, easy-to-deploy software components running close to a computing resource to accomplish specific tasks • Written in Python, need only the interpreter for deployment • Modular, easily configurable for specific needs • Run in user space • Use only outbound connections (the general pattern is sketched below) • Agents based on the same software framework are used in different contexts: agents for centralized operations at CERN (e.g. the Transfer Agents used in the SC3 data transfer phase, production system agents), agents at the LHCb VO-boxes, and Pilot Agents deployed as LCG jobs
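
The common agent pattern can be summarized as follows: a light process running in user space that periodically polls a central service over an outbound connection, performs its task and sleeps. The service URL, method name and polling interval below are illustrative.

import time
import xmlrpc.client

SERVICE_URL = "https://dirac.example.org:9132/RequestSvc"  # hypothetical endpoint
POLL_INTERVAL = 60  # seconds between cycles

def execute(request):
    """Placeholder for the agent-specific work (transfer, bookkeeping, ...)."""
    print("executing request", request)

def run_agent():
    service = xmlrpc.client.ServerProxy(SERVICE_URL)
    while True:
        try:
            request = service.getNextRequest()  # outbound call only, no incoming connections
            if request:
                execute(request)
        except Exception as err:
            print("cycle failed:", err)         # keep the loop alive across failures
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    run_agent()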

CHEP 2006, February 2006, Mumbai 11 Pilot agents • Pilot agents are deployed on the Worker Nodes as regular jobs using the standard LCG scheduling mechanism • They form a distributed Workload Management system • Once started on the WN, the pilot agent performs some checks of the environment • Measures the CPU benchmark, disk and memory space • Installs the application software • If the WN is OK, the user job is retrieved from the central DIRAC Task Queue and executed • At the end of execution, some operations can be requested to be done asynchronously on the VO-box to complete the job
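
The steps above can be pictured with the following sketch of a pilot workflow; the checks, the installer and the canned job description are simplified stand-ins, not the actual DIRAC pilot code.

import shutil
import subprocess

MIN_DISK_GB = 5  # illustrative threshold

def worker_node_ok():
    """Very rough environment check: here only the free disk space."""
    free_gb = shutil.disk_usage(".").free / 1e9
    return free_gb > MIN_DISK_GB

def install_software(app, version):
    print(f"installing {app} {version}")  # stand-in for the real software installation

def fetch_job_from_task_queue():
    # In reality this is a matchmaking call to the central Task Queue;
    # here a canned job description is returned.
    return {"JobID": 42, "App": "DaVinci", "Version": "v12r15",
            "Executable": ["echo", "running the user job"]}

def run_pilot():
    if not worker_node_ok():
        return                                      # give the slot back
    job = fetch_job_from_task_queue()
    install_software(job["App"], job["Version"])
    subprocess.run(job["Executable"], check=False)  # execute the payload
    # final operations (data upload, bookkeeping) would be sent as
    # asynchronous requests to the VO-box

if __name__ == "__main__":
    run_pilot()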

CHEP 2006, February 2006, Mumbai 12 Distributed Analysis • The Pilot Agent paradigm was recently extended to the Distributed Analysis activity • The advantages of this approach for users are: • inefficiencies of the LCG grid are completely hidden from the users • fine optimization of the job turnaround; it also reduces the load on the LCG WMS • The system was demonstrated to serve dozens of simultaneous users with a submission rate of about 2 Hz • The limitation is mainly in the capacity of the LCG RB to schedule this number of jobs

CHEP 2006, February 2006, Mumbai 13 DIRAC WMS Pilot Agent Strategies • The combination of pilot agents running right on the WNs with the central Task Queue allows fine optimization of the workload at the VO level • The WN reserved by the pilot agent is a first-class resource: there is no more uncertainty due to delays in the local batch queue • DIRAC modes of submission (the ‘Filling Mode’ is sketched below): • ‘Resubmission’: Pilot Agent submission to LCG with monitoring; multiple Pilot Agents may be sent in case of LCG failures • ‘Filling Mode’: Pilot Agents may request several jobs from the same user, one after the other • ‘Multi-Threaded’: same as ‘Filling Mode’ above, except two jobs can be run in parallel on the Worker Node
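
A sketch of the ‘Filling Mode’ idea: once a pilot holds a worker node, it keeps asking the Task Queue for further jobs of the same user until the remaining batch slot time runs out. The slot length, safety margin and calls are illustrative, not the actual DIRAC implementation.

import time

SLOT_LENGTH = 4 * 3600   # assumed batch slot length, seconds
SAFETY_MARGIN = 15 * 60  # leave time to upload results and clean up

def next_job_for_owner(owner):
    """Stand-in for a Task Queue call restricted to one user."""
    return None  # no more jobs in this toy example

def run_payload(job):
    print("running", job)

def filling_mode(owner, first_job):
    started = time.time()
    job = first_job
    while job is not None:
        run_payload(job)
        remaining = SLOT_LENGTH - (time.time() - started)
        if remaining < SAFETY_MARGIN:
            break                        # do not start a job that cannot finish in time
        job = next_job_for_owner(owner)  # ask for one more job of the same user

filling_mode("some_user", {"JobID": 7})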

CHEP 2006, February 2006, Mumbai 14 Start times for 10 experiments, 30 users [plot]: the minimum start time imposed by LCG was about 9 minutes.

CHEP 2006, February 2006, Mumbai 15 VO-box • VO-boxes are dedicated hosts at the Tier1 centers running specific LHCb services for: • reliability, since failed operations are retried • efficiency, since WNs are released early and data-moving operations are delegated from jobs to the VO-box agents • Agents on VO-boxes execute requests for various operations from local jobs: • Data Transfer requests • Bookkeeping and status message requests

CHEP 2006, February 2006, Mumbai 16 LHCb VO-box architecture

CHEP 2006, February 2006, Mumbai 17 Transfer Agent example • The Request DB is populated with data transfer/replication requests from the Data Manager or from jobs • The Transfer Agent (sketched below): • checks the validity of the request and passes it to the FTS service • uses third-party transfer in case of FTS channel unavailability • retries transfers in case of failures • registers the new replicas in the catalog
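
A sketch of that cycle, with all back-end calls replaced by stand-ins; the FTS submission is made to fail here simply to show the third-party fallback path, and the request fields are invented.

MAX_RETRIES = 3

def submit_to_fts(source, target):
    raise RuntimeError("FTS channel unavailable")  # forces the fallback in this sketch

def third_party_transfer(source, target):
    print(f"gridftp third-party copy {source} -> {target}")

def register_replica(lfn, storage_element):
    print(f"registering {lfn} at {storage_element} in the File Catalog")

def process_request(request):
    for attempt in range(MAX_RETRIES):
        try:
            try:
                submit_to_fts(request["Source"], request["Target"])
            except RuntimeError:
                third_party_transfer(request["Source"], request["Target"])
            register_replica(request["LFN"], request["TargetSE"])
            return True
        except Exception as err:
            print(f"attempt {attempt + 1} failed: {err}")
    return False  # leave the request in the DB for a later agent cycle

process_request({"Source": "srm://cern.example/lhcb/file", "Target": "srm://t1.example/lhcb/file",
                 "LFN": "/lhcb/data/file", "TargetSE": "Tier1-disk"})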

CHEP 2006, February 2006, Mumbai 18 DIRAC production performance • Peaks of over 5000 simultaneous production jobs • The throughput is only limited by the capacity available on LCG • ~80 distinct sites accessed through LCG or through DIRAC directly

CHEP 2006, February 2006, Mumbai 19 Conclusions • The Overlay Network paradigm employed by the DIRAC system proved to be efficient in integrating heterogeneous resources into a single reliable system for simulation data production • The system is now being extended to deal with Distributed Analysis tasks • Workload management at the user level is effective • Real users (~30) are starting to use the system • The LHCb Data Challenge 2006 in June will test the LHCb Computing Model before data taking: an ultimate test of the DIRAC system

CHEP 2006, February 2006, Mumbai 20 Back-up slides

CHEP 2006, February 2006, Mumbai 21 DIRAC Services and Resources [architecture diagram] DIRAC services: Job Management Service, JobMonitorSvc, JobAccountingSvc with its AccountingDB, ConfigurationSvc, FileCatalogSvc, BookkeepingSvc. Clients: Production Manager, GANGA UI, DIRAC API, Job monitor, BK query webpage, FileCatalog browser. DIRAC resources: DIRAC CE and DIRAC Sites with Agents, DIRAC Storage (DiskFile, gridftp), LCG (Resource Broker, CE 1, CE 2, CE 3) with an Agent, FileCatalog.

CHEP 2006, February 2006, Mumbai 22 Configuration service • The master server at CERN is the only one allowing write access • Redundant system with multiple read-only slave servers running at sites on VO-boxes for load balancing and high availability (a client fallback sketch is given below) • Slaves are updated automatically from the master information • A watchdog restarts the server in case of failures
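
A sketch of how a client could exploit this redundancy: try the local read-only slave first, then fall back to other slaves and finally to the master. The server URLs and the getOption method are hypothetical.

import xmlrpc.client

CONFIG_SERVERS = [
    "https://vobox-local.example.org:9135/ConfigurationSvc",  # local read-only slave
    "https://vobox-t1.example.org:9135/ConfigurationSvc",     # another slave
    "https://master.example.org:9135/ConfigurationSvc",       # master at CERN (only writes go here)
]

def get_option(path):
    """Return a configuration option, trying the servers in order."""
    last_error = None
    for url in CONFIG_SERVERS:
        try:
            return xmlrpc.client.ServerProxy(url).getOption(path)
        except Exception as err:
            last_error = err  # try the next server
    raise RuntimeError(f"no Configuration server reachable: {last_error}")

print(get_option("/DIRAC/Setup"))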

CHEP 2006, February 2006, Mumbai 23 Other Services • Job monitoring service • Gets job heartbeats and status reports • Serves the job status to clients (users) via Web and scripting interfaces • Bookkeeping service • Receives, stores and serves job provenance information • Accounting service • Receives accounting information for each job • Generates reports per time period, specific productions or user groups • Provides information for taking policy decisions

CHEP 2006, February 2006, Mumbai 24 DIRAC • DIRAC is a distributed data production and analysis system for the LHCb experiment • Includes workload and data management components • Was developed originally for the MC data production tasks • The goals were to integrate all the heterogeneous computing resources available to LHCb and to minimize human intervention at LHCb sites • The resulting design led to an architecture based on a set of services and a network of light distributed agents

CHEP 2006, February 2006, Mumbai 25 File Catalog Service • LFC is the main File Catalog • Chosen after trying out several options • Good performance after optimization • One global catalog with several read-only mirrors for redundancy and load balancing • Similar client API as for the other DIRAC “File Catalog” services (see the sketch below) • Seamless file registration in several catalogs • E.g. the Processing DB automatically receives the data to be processed
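
The similar client API and the multi-catalog registration could look roughly like the following wrapper, where the class and method names are illustrative rather than the real DIRAC interfaces.

class LFCCatalog:
    def addFile(self, lfn, pfn, se):
        print(f"LFC: register {lfn} -> {pfn} at {se}")

class ProcessingDBCatalog:
    def addFile(self, lfn, pfn, se):
        print(f"ProcessingDB: new file {lfn} queued for processing")

class FileCatalogClient:
    """Single entry point presenting one catalog API to the rest of the system."""
    def __init__(self, catalogs):
        self.catalogs = catalogs

    def addFile(self, lfn, pfn, se):
        # fan the registration out to every configured catalog
        results = {}
        for catalog in self.catalogs:
            results[type(catalog).__name__] = catalog.addFile(lfn, pfn, se)
        return results

fc = FileCatalogClient([LFCCatalog(), ProcessingDBCatalog()])
fc.addFile("/lhcb/data/file.dst", "srm://t1.example/lhcb/file.dst", "Tier1-disk")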

CHEP 2006, February 2006, Mumbai 26 DIRAC performance • Performance in the 2005 RTTC production • Over 5000 simultaneous jobs, limited by the available resources • Far from the critical load on the DIRAC servers