1 Managing distributed computing resources with DIRAC
A. Tsaregorodtsev, CPPM-IN2P3-CNRS, Marseille
12-17 September 2011, NEC'11, Varna

2 Outline
- DIRAC Overview
- Main subsystems
  - Workload Management
  - Request Management
  - Transformation Management
  - Data Management
- Use in LHCb and other experiments
- DIRAC as a service
- Conclusion

Introduction  DIRAC is first of all a framework to build distributed computing systems  Supporting Service Oriented Architectures  GSI compliant secure client/service protocol Fine grained service access rules  Hierarchical Configuration service for bootstrapping distributed services and agents  This framework is used to build all the DIRAC systems:  Workload Management Based on Pilot Job paradigm  Production Management  Data Management  etc 3

4 DIRAC Workload Management with Pilot Directors (architecture diagram): a Physicist User and the Production Manager feed jobs to the central Matcher Service; dedicated Pilot Directors submit pilots to the EGI/WLCG grid (EGEE Pilot Director), the NDG grid (NDG Pilot Director), the GISELA grid (EELA Pilot Director) and to CREAM CEs (CREAM Pilot Director).

5 User credentials management
- The WMS with Pilot Jobs requires a strict user proxy management system
  - Jobs are submitted to the DIRAC central Task Queue with the credentials of their owner (VOMS proxy)
  - Pilot Jobs are submitted to a grid WMS with the credentials of a user holding a special Pilot role
  - The Pilot Job fetches the user job together with the job owner's proxy
  - The user job is executed with its owner's proxy, which is used to access SEs, catalogs, etc.
- The DIRAC Proxy Manager service provides the necessary functionality
  - Proxy storage and renewal
  - Possibility to outsource proxy renewal to a MyProxy server
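To make the sequence above concrete, here is a purely illustrative Python sketch of the pilot-side flow; the class and method names (fetch_waiting_job, download_owner_proxy, and so on) are hypothetical and do not correspond to the actual DIRAC API.

```python
# Hypothetical sketch of the pilot-side proxy handling described above.
# None of these names are real DIRAC calls; they only mirror the steps on the slide.
def run_pilot(matcher, proxy_manager):
    # The pilot itself runs with the special "pilot role" credentials.
    job = matcher.fetch_waiting_job(pilot_role="lhcb_pilot")

    # The owner's (limited, renewable) proxy is retrieved from the Proxy Manager.
    owner_proxy = proxy_manager.download_owner_proxy(job.owner, job.owner_group)

    # The payload is then executed under the owner's credentials, so that all
    # SE and catalog access is accounted to the real user.
    with owner_proxy.in_environment():
        job.execute()
```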

6 Direct submission to CEs
- The gLite WMS is now used just as a pilot deployment mechanism
  - Limited use of its brokering features: for jobs with input data the destination site is already chosen
  - Multiple Resource Brokers have to be used because of scalability problems
- DIRAC supports direct submission to CEs
  - CREAM CEs
  - Individual site policies can be applied: the site chooses how much load it takes (pull vs. push paradigm, see the sketch below)
  - Direct measurement of the site state by watching the pilot status information
- This is a general trend: all the LHC experiments have declared that they will eventually abandon the gLite WMS
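A hypothetical sketch of the pull idea behind direct CE submission: a per-site director only tops up pilots while the site's own limit allows it. All names here are illustrative, not the DIRAC director API.

```python
# Illustrative pull-model sketch: the site, not the central service, bounds the load.
def top_up_pilots(site, task_queue, ce):
    eligible = task_queue.count_eligible_jobs(site.name)   # payloads this site could run
    waiting_pilots = ce.count_waiting_pilots()
    room = site.max_waiting_pilots - waiting_pilots        # site-defined policy
    for _ in range(max(0, min(room, eligible))):
        ce.submit_pilot(site.pilot_wrapper)                # e.g. direct CREAM CE submission
```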

DIRAC sites  Dedicated Pilot Director per (group of) site(s)  On-site Director  Site managers have full control  Of LHCb payloads  Off-site Director  Site delegates control to the central service  Site must only define a dedicated local user account  The payload submission through the SSH tunnel  In both cases the payload is executed with the owner credentials 7 On-site DirectorOff-site Director

DIRAC Sites  Several DIRAC sites in production in LHCb  E.g. Yandex 1800 cores Second largest MC production site  Interesting possibility for small user communities or infrastructures e.g.  contributing local clusters  building regional or university grids 8

WMS performance  Up to 35K concurrent jobs in ~120 distinct sites  Limited by the resources available to LHCb  10 mid-range servers hosting DIRAC central services  Further optimizations to increase the capacity are possible ● Hardware, database optimizations, service load balancing, etc 9

10 Belle (KEK) use of Amazon EC2
- A VM scheduler was developed for the Belle MC production system
- Dynamic VM spawning, taking spot prices and the Task Queue state into account (see the sketch below)
(Thomas Kuhr, Belle)
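A hypothetical sketch of the spot-price-aware spawning decision; the cloud client and its methods are invented for illustration and are neither the Belle scheduler nor any real EC2 API.

```python
# Hypothetical illustration of the decision described above: start spot VMs only
# while there is waiting work and the current spot price is below our bid.
def spawn_vms(cloud, task_queue, bid_price, max_vms, instance_type="m1.large"):
    waiting = task_queue.count_waiting_jobs()
    running = cloud.count_running_vms()
    price = cloud.current_spot_price(instance_type)
    if waiting > 0 and price < bid_price and running < max_vms:
        n = min(waiting, max_vms - running)
        cloud.request_spot_instances(instance_type, count=n, bid=bid_price)
```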

11 Belle use of Amazon EC2
- Various computing resources combined in a single production system
  - KEK cluster
  - LCG grid sites
  - Amazon EC2
- Common monitoring, accounting, etc.
(Thomas Kuhr, Belle II)

Belle II  Starting at 2015 after the KEK update  50 ab -1 by 2020  Computing model  Data rate 1.8 GB/s ( high rate scenario )  Using KEK computing center, grid and cloud resources  Belle II distributed computing system is based on DIRAC 12 Raw Data Storage and Processing MC Production and Ntuple Production Ntuple Analysis Thomas Kuhr, Belle II

13 Support for MPI jobs
- An MPI service was developed for applications on the GISELA grid
  - Astrophysics, BioMed, Seismology applications
- No special MPI support on the sites is required
  - The MPI software is installed by the Pilot Jobs
- MPI ring usage optimization
  - Ring reuse for multiple jobs (see the sketch below): lower load on the gLite WMS
  - Variable ring sizes for different jobs
- Possible usage for HEP applications:
  - PROOF on Demand dynamic sessions
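A hypothetical illustration of the ring-reuse idea: a pilot-built ring of a given size keeps pulling MPI jobs that fit it before being torn down. The names are invented and are not DIRAC calls.

```python
# Hypothetical sketch of MPI ring reuse by a pilot.
def serve_mpi_ring(ring, task_queue):
    while True:
        job = task_queue.match_mpi_job(max_processes=ring.size)
        if job is None:
            break              # nothing left that fits this ring
        ring.run(job)          # reuse the already assembled ring for the next payload
    ring.shutdown()            # only now is the ring released
```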

14 Coping with failures
- Problem: distributed resources and services are unreliable
  - Software bugs, misconfiguration
  - Hardware failures
  - Human errors
- Solution: redundancy and asynchronous operations
  - DIRAC services are redundant
    - Geographically: Configuration, Request Management
    - Several instances for any service

15 Request Management System
- A Request Management System (RMS) accepts and executes asynchronously any kind of operation that can fail (see the sketch below)
  - Data upload and registration
  - Job status and parameter reports
- Requests are collected by RMS instances on VO-boxes at the 7 Tier-1 sites
  - Extra redundancy through VO-box availability
- Requests are forwarded to the central Request Database
  - To keep track of pending requests
  - For efficient bulk request execution
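As an example of how a failed operation turns into a request, here is a sketch using the Request/Operation classes of later DIRAC releases; this API postdates the 2011 talk and the class and attribute names are an assumption, so treat it as illustrative. The LFN and SE names are placeholders.

```python
# Sketch (assumed API of later DIRAC releases): record a failed replication as an
# asynchronous request that the RMS will retry until it succeeds.
from DIRAC.RequestManagementSystem.Client.Request import Request
from DIRAC.RequestManagementSystem.Client.Operation import Operation
from DIRAC.RequestManagementSystem.Client.File import File
from DIRAC.RequestManagementSystem.Client.ReqClient import ReqClient

request = Request()
request.RequestName = "replicate_user_output"

op = Operation()
op.Type = "ReplicateAndRegister"          # the kind of action to (re)try
op.TargetSE = "CERN-USER"                 # destination storage element (example value)

f = File()
f.LFN = "/lhcb/user/s/someuser/output.root"   # example LFN, not a real file
op.addFile(f)

request.addOperation(op)
ReqClient().putRequest(request)           # hand the request over for asynchronous execution
```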

16 DIRAC Transformation Management
- Data-driven payload generation based on templates (see the sketch below)
- Generates data processing and replication tasks
- LHCb-specific templates and catalogs
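A hedged sketch of how a data-driven transformation can be defined through the DIRAC Transformation client; the method names follow later DIRAC documentation, and the name, plugin and group size are example values, not LHCb production settings.

```python
# Sketch (assumed Transformation client API): define a replication transformation
# that is fed automatically as matching files are registered.
from DIRAC.TransformationSystem.Client.Transformation import Transformation

t = Transformation()
t.setTransformationName("Replicate_example_dataset")   # example name
t.setType("Replication")
t.setDescription("Replicate example files to a Tier-1")
t.setLongDescription("Data-driven replication transformation (illustration only)")
t.setPlugin("Broadcast")                                # plugin choosing the target SEs
t.setGroupSize(10)                                      # files per generated task
result = t.addTransformation()                          # register it with the system
if result["OK"]:
    t.setStatus("Active")
    t.setAgentType("Automatic")                         # tasks generated without operator action
```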

Data Management  Based on the Request Management System  Asynchronous data operations  transfers, registration, removal  Two complementary replication mechanisms  Transfer Agent user data public network  FTS service Production data Private FTS OPN network Smart pluggable replication strategies 17

18 Transfer accounting (LHCb) (accounting plot)

19 ILC using DIRAC
- ILC CERN group
  - Using the DIRAC Workload Management and Transformation systems
  - 2M jobs run in the first year, instead of the 20K planned initially
- The DIRAC File Catalog was developed for ILC
  - More efficient than the LFC for common queries
  - Includes user metadata natively (see the query sketch below)
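The native metadata support can be used roughly as follows. This is a sketch: the path and metadata fields are invented for illustration, and the client method names follow the DIRAC File Catalog documentation rather than the slides.

```python
# Sketch of a metadata query against the DIRAC File Catalog.
# The path and metadata values are illustrative, not a real ILC catalogue layout.
from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.Resources.Catalog.FileCatalogClient import FileCatalogClient

fc = FileCatalogClient()

# Attach user metadata to a directory, then select files by metadata values.
fc.setMetadata("/ilc/prod/example", {"Energy": 250, "EventType": "Zuds"})
result = fc.findFilesByMetadata({"Energy": 250, "EventType": "Zuds"}, path="/ilc")
if result["OK"]:
    for lfn in result["Value"]:
        print(lfn)
```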

DIRAC as a service  DIRAC installation shared by a number of user communities and centrally operated  EELA/GISELA grid  gLite based  DIRAC is part of the grid production infrastructure Single VO  French NGI installation   Started as a service for grid tutorials support  Serving users from various domains now Biomed, earth observation, seismology, … Multiple VOs 20

DIRAC as a service  Necessity to manage multiple VOs with a single DIRAC installation  Per VO pilot credentials  Per VO accounting  Per VO resources description  Pilot directors are VO aware  Job matching takes pilot VO assignment into account 21

DIRAC Consortium  Other projects are starting to use or evaluating DIRAC  CTA, SuperB, BES, VIP(medical imaging), … Contributing to DIRAC development Increasing the number of experts  Need for user support infrastructure  Turning DIRAC into an Open Source project  DIRAC Consortium agreement in preparation IN2P3, Barcelona University, CERN, …  News, docs, forum 22

Conclusions  DIRAC is successfully used in LHCb for all distributed computing tasks in the first years of the LHC operations  Other experiments and user communities started to use DIRAC contributing their developments to the project  The DIRAC open source project is being built now to bring the experience from HEP computing to other experiments and application domains 23

24 Backup slides

25 LHCb in brief
- Experiment dedicated to studying CP violation
  - Responsible for the dominance of matter over antimatter
  - The matter-antimatter difference is studied using the b quark (beauty)
  - High precision physics (tiny differences…)
- Single-arm spectrometer
  - Looks like a fixed-target experiment
- Smallest of the 4 big LHC experiments
  - ~500 physicists
  - Nevertheless, computing is also a challenge…

26 LHCb Computing Model (diagram)

Tier0 Center  Raw data shipped in real time to Tier-0  Resilience enforced by a second copy at Tier-1’s  Rate: ~3000 evts/s (35 kB) at ~100 MB/s  Part of the first pass reconstruction and re-reconstruction  Acting as one of the Tier1 center  Calibration and alignment performed on a selected part of the data stream (at CERN)  Alignment and tracking calibration using dimuons (~5/s) Used also for validation of new calibration  PID calibration using Ks, D*  CAF – CERN Analysis Facility  Grid resources for analysis  Direct batch system usage (LXBATCH) for SW tuning  Interactive usage (LXPLUS) 27

Tier1 Center  Real data persistency  First pass reconstruction and re-reconstruction  Data Stripping  Event preselection in several streams (if needed)  The resulting DST data shipped to all the other Tier1 centers  Group analysis  Further reduction of the datasets, μDST format  Centrally managed using the LHCb Production System  User analysis  Selections on stripped data  Preparing N-tuples and reduced datasets for local analysis 28

29 Tier-2/Tier-3 centers
- No assumption of local LHCb-specific support
- MC production facilities
  - Small local storage requirements, to buffer MC data before shipping to the respective Tier-1 center
- User analysis
  - The base computing model makes no assumption of user analysis at these sites
  - However, several distinguished centers are willing to contribute
    - Analysis (stripped) data replicated to Tier-2/Tier-3 centers by site managers, as full or partial samples
    - Increases the amount of resources capable of running user analysis jobs
    - Analysis data at Tier-2 centers available to the whole collaboration, with no special preference for local users