Slide 1: LHCb Data Challenge 2004
A. Tsaregorodtsev, CPPM, Marseille
LCG-France Meeting, 22 July 2004, CERN

Slide 2: Goals of DC'04
- Main goal: gather information to be used for writing the LHCb computing TDR/TP.
- Robustness test of the LHCb software and production system:
  - using software that is as realistic as possible in terms of performance.
- Test of the LHCb distributed computing model:
  - including distributed analysis: a realistic test of the analysis environment needs realistic analyses.
- Incorporation of the LCG application area software into the LHCb production environment.
- Use of LCG resources as a substantial fraction of the production capacity.

Slide 3: DC 2004 phases
- Phase 1 – MC data production:
  - 180M events of different signals, background and minimum bias;
  - simulation + reconstruction;
  - DSTs are copied to Tier1 centres.
- Phase 2 – Data reprocessing:
  - selection of various physics streams from the DSTs;
  - copy of the selections to all Tier1 centres.
- Phase 3 – User analysis:
  - user analysis jobs run on DST data distributed over all the Tier1 centres.

Slide 4: Phase 1 – MC production

Slide 5: DIRAC Services and Resources
[Architecture diagram: user interfaces (Production manager, GANGA UI, user CLI, job monitor, BK query webpage, FileCatalog browser), DIRAC services (Job Management Service, JobMonitorSvc, JobAccountingSvc with AccountingDB, InformationSvc, FileCatalogSvc, MonitoringSvc, BookkeepingSvc) and DIRAC resources (DIRAC CEs and site agents, the LCG Resource Broker with CEs 1-3, and DIRAC Storage: disk file, gridftp, bbftp, rfio).]

Slide 6: Software to be installed
- Before an LHCb application can run on a Worker Node, the following software components must be installed:
  - the application software itself;
  - the software packages on which the application depends;
  - the necessary (file-based) databases;
  - the DIRAC software.
- A single untar command installs everything in place.
- All the necessary libraries are included – no assumption is made about what software is available on the destination site (except a recent Python interpreter):
  - external libraries;
  - compiler libraries;
  - ld-linux.so
- The same binary distribution runs on RH

Slide 7: Software installation
- Software repository:
  - web server (http protocol);
  - LCG Storage Element.
- Installation in place, the DIRAC way:
  - by the Agent upon reception of a job with particular software requirements; OR
  - by the running job itself.
- Installation in place, the LCG2 way:
  - a special kind of job running the standard DIRAC software installation utility.
A sketch of the agent-driven variant is given below.
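As an illustration of the "DIRAC way" above, the following Python sketch shows an agent installing missing packages in place when a job with particular software requirements arrives. The repository URL, directory layout, package names and helper functions are hypothetical, not the actual DIRAC installation utility.

import os
import tarfile
import urllib.request

# Hypothetical repository URL and local software area (illustration only).
SW_REPOSITORY_URL = "http://lhcb-sw-repository.example.org/packages"
LOCAL_SW_AREA = os.path.expanduser("~/lhcb_software")


def is_installed(package, version):
    """A package is considered in place if its directory already exists."""
    return os.path.isdir(os.path.join(LOCAL_SW_AREA, package, version))


def install_package(package, version):
    """Fetch the package tarball from the repository and untar it in place."""
    os.makedirs(LOCAL_SW_AREA, exist_ok=True)
    url = f"{SW_REPOSITORY_URL}/{package}-{version}.tar.gz"
    local_tar = os.path.join(LOCAL_SW_AREA, f"{package}-{version}.tar.gz")
    urllib.request.urlretrieve(url, local_tar)   # repository is a plain web server
    with tarfile.open(local_tar) as tar:
        tar.extractall(LOCAL_SW_AREA)            # the single "untar in place" step
    os.remove(local_tar)


def ensure_software(job_requirements):
    """Called by the agent when it receives a job with software requirements."""
    for package, version in job_requirements:
        if not is_installed(package, version):
            install_package(package, version)


# Example: a job declaring the applications it needs (versions are made up).
ensure_software([("Gauss", "v15r2"), ("Brunel", "v23r1")])

The "LCG2 way" would differ only in who calls ensure_software(): there it would run as the payload of a special installation job rather than in the agent.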

Slide 8: Software installation in the job
- A job may need extra software packages that are not in place on the CE:
  - a special version of the geometry;
  - user analysis algorithms.
- Any number of packages can be installed by the job itself (up to all of them).
- Packages are installed in the job's user space.
- The structure of the standard LHCb software directory tree is imitated with symbolic links (see the sketch below).
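A minimal sketch of that symbolic-link scheme: the shared software tree visible on the CE is mirrored into the job's working directory with symlinks, and the extra packages the job needs are installed next to (or instead of) the links. The paths and the helper name are assumptions for illustration.

import os


def build_private_sw_tree(shared_area, job_area, extra_packages):
    """Mirror the shared LHCb software tree in the job's user space with
    symbolic links, so that packages installed locally complement or shadow
    the shared installation."""
    os.makedirs(job_area, exist_ok=True)
    # Link every package of the shared installation into the job area.
    for entry in os.listdir(shared_area):
        link = os.path.join(job_area, entry)
        if not os.path.exists(link):
            os.symlink(os.path.join(shared_area, entry), link)
    # Packages that are not in place (e.g. a special geometry or user analysis
    # algorithms) are installed directly into the job area, replacing the link.
    for package in extra_packages:
        target = os.path.join(job_area, package)
        if os.path.islink(target):
            os.remove(target)
        os.makedirs(target, exist_ok=True)   # placeholder for the real install


# Hypothetical usage inside a running job:
# build_private_sw_tree("/opt/lhcb", os.path.join(os.getcwd(), "sw"), ["DecFiles"])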

Slide 9: 3rd-party components
- Originally DIRAC aimed at producing the following components:
  - production database;
  - metadata and job provenance database;
  - workload management.
- Expected 3rd-party components:
  - data management (file catalogue, replica management);
  - security services;
  - information and monitoring services.
- Expectations of an early delivery of the ARDA prototype components were not met.

Slide 10: File catalog service
- The LHCb Bookkeeping was not meant to be used as a File (Replica) Catalog:
  - its main use is as a Metadata and Job Provenance database;
  - its replica catalog is based on specially built views.
- The AliEn File Catalog was chosen to get a (full) set of the necessary functionality:
  - hierarchical structure: logical organization of data with optimized queries; ACLs by directory; metadata by directory; file system paradigm;
  - robust, proven implementation;
  - easy to wrap as an independent service (inspired by the ARDA RTAG work).

Slide 11: AliEn FileCatalog in DIRAC
- The AliEn FC SOAP interface was not ready at the beginning of 2004:
  - we had to provide our own XML-RPC wrapper, compatible with the XML-RPC BK File Catalog;
  - it uses the AliEn command line ("alien -exec");
  - ugly, but it works.
- The service is built on top of AliEn and is run by the lhcbprod AliEn user:
  - the AliEn security mechanisms are not really used;
  - AliEn version 1.32 is used.
- So far in DC2004:
  - >100,000 files with >250,000 replicas;
  - very stable performance.
A sketch of such a wrapper is given below.
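The XML-RPC wrapper around the AliEn command line could look roughly like this sketch, which exposes catalogue commands over XML-RPC by shelling out to "alien -exec" (the only call taken from the slide); the port, method names and error handling are invented for illustration.

import subprocess
from xmlrpc.server import SimpleXMLRPCServer


def alien_exec(command):
    """Run a catalogue command through the AliEn command line ("alien -exec"),
    as described on the slide, and return its raw text output.
    Error handling and output parsing are omitted in this sketch."""
    result = subprocess.run(["alien", "-exec", command],
                            capture_output=True, text=True, check=True)
    return result.stdout


def catalogue_ls(directory):
    """List a catalogue directory; exposed as an XML-RPC method so that the
    DIRAC FileCatalog clients can call it the same way as the BK catalogue."""
    return alien_exec("ls " + directory)


if __name__ == "__main__":
    # The wrapper runs as a single service under the lhcbprod AliEn user,
    # so the AliEn security machinery is effectively bypassed (as noted above).
    server = SimpleXMLRPCServer(("0.0.0.0", 8085), allow_none=True)
    server.register_function(catalogue_ls, "ls")
    server.register_function(alien_exec, "execute")
    server.serve_forever()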

Slide 12: File catalogs
[Diagram: a common FileCatalog client used by DIRAC applications and services talks through XML-RPC servers to two back-ends: the AliEn FileCatalog Service (AliEn FC client and AliEn UI over a MySQL-based AliEn FC) and the BK FileCatalog Service (BK FC client over the ORACLE LHCb Bookkeeping DB).]

Slide 13: Data Production – 2004
- Currently distributed data sets:
  - CERN: complete DST (copied directly from the production centres);
  - Tier1: master copy of the DST produced at the associated sites.
- DIRAC sites: Bologna, Karlsruhe, Spain (PIC), Lyon, UK sites (RAL); everything else goes to CERN.
- LCG sites: currently only 3 Grid (MSS) SE sites – CASTOR at Bologna, PIC and CERN:
  - Bologna: ru, pl, hu, cz, gr, it;
  - PIC: us, ca, es, pt, tw;
  - CERN: elsewhere.
A sketch of this routing rule is given below.
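The country-to-SE assignment above amounts to a simple lookup table; the sketch below spells it out. The SE names are illustrative labels, while the mapping itself is exactly the one listed on the slide.

# Routing rule from the slide: LCG jobs store their master DST copy at the
# Tier1 SE associated with the site's country; anything not listed goes to CERN.
TIER1_BY_COUNTRY = {
    "ru": "Bologna", "pl": "Bologna", "hu": "Bologna",
    "cz": "Bologna", "gr": "Bologna", "it": "Bologna",
    "us": "PIC", "ca": "PIC", "es": "PIC", "pt": "PIC", "tw": "PIC",
}


def destination_se(country_code):
    """Pick the destination storage element for a site's country code."""
    return TIER1_BY_COUNTRY.get(country_code.lower(), "CERN")


assert destination_se("it") == "Bologna"
assert destination_se("fr") == "CERN"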

Slide 14: DIRAC DataManagement tools
- DIRAC Storage Element:
  - IS description + server (bbftpd, sftpd, httpd, gridftpd, xmlrpcd, file, rfio, etc.);
  - needs no special service installation on the site;
  - description in the Information Service: host, protocol, local path.
- ReplicaManager API for common operations:
  - copy(), copyDir(), get(), exists(), size(), mkdir(), etc.
- Examples of usage:
  - dirac-rm-copyAndRegister
  - dirac-rm-copy dc2004.dst CERN_Castor_BBFTP
- The Tier0 SE and Tier1 SEs are defined in the central IS.
A usage sketch follows.
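To make the ReplicaManager interface listed above concrete, here is a small stub with the method names taken from the slide and a usage example mirroring the dirac-rm-copy command; the constructor, argument order and remote path are assumptions, not the real DIRAC API.

class ReplicaManager:
    """Stub illustrating the interface named on the slide; the real DIRAC
    implementation performs the transfer with the protocol (bbftp, gridftp,
    sftp, ...) recorded for each SE in the Information Service."""

    def mkdir(self, se_name, remote_dir):
        print(f"mkdir {remote_dir} on {se_name}")

    def copy(self, local_file, se_name):
        print(f"copy {local_file} -> {se_name}")

    def exists(self, se_name, remote_file):
        return True

    def size(self, se_name, remote_file):
        return 0


# Usage mirroring the CLI example on the slide:
#   dirac-rm-copy dc2004.dst CERN_Castor_BBFTP
rm = ReplicaManager()
rm.mkdir("CERN_Castor_BBFTP", "/lhcb/production/dc2004")   # assumed path
rm.copy("dc2004.dst", "CERN_Castor_BBFTP")
if rm.exists("CERN_Castor_BBFTP", "dc2004.dst"):
    print("stored", rm.size("CERN_Castor_BBFTP", "dc2004.dst"), "bytes")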

Slide 15: Reliable Data Transfer
- Any data transfer should be accomplished despite temporary failures of various services or networks:
  - multiple retries of failed transfers, with whatever delay is necessary, until the services are up and running again (not applicable to LCG jobs);
  - multiple retries of the registration in the Catalog.
- Transfer Agent:
  - maintains a database of transfer requests;
  - transfers datasets or whole directories with log files;
  - retries transfers until success (see the sketch below).
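A minimal sketch of such a Transfer Agent: requests wait in a small database and are retried on every agent cycle until the transfer and the catalogue registration succeed, so a temporarily unavailable SE only delays the data instead of losing it. The schema, the try_transfer placeholder and the SE name are hypothetical.

import sqlite3

# Requests are kept in a database and retried until they succeed.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE transfer_requests (
                  id INTEGER PRIMARY KEY,
                  source TEXT, destination_se TEXT,
                  status TEXT DEFAULT 'waiting')""")
db.execute("INSERT INTO transfer_requests (source, destination_se) "
           "VALUES ('dc2004.dst', 'CERN_Castor_BBFTP')")


def try_transfer(source, destination_se):
    """Placeholder for the real transfer plus catalogue registration;
    returns False on any (possibly transient) failure."""
    return True


def agent_cycle():
    """One pass of the agent over all waiting requests."""
    rows = db.execute("SELECT id, source, destination_se FROM transfer_requests "
                      "WHERE status = 'waiting'").fetchall()
    for request_id, source, destination_se in rows:
        if try_transfer(source, destination_se):
            db.execute("UPDATE transfer_requests SET status = 'done' WHERE id = ?",
                       (request_id,))
        # failed requests simply stay 'waiting' and are picked up next cycle


agent_cycle()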

Slide 16: DIRAC DataManagement tools
[Diagram: the Transfer Agent and its Transfer DB of transfer requests, fed by the Job, Data Manager and Data Optimizer components, moving data from a local cache to the storage elements SE 1 and SE 2.]

Slide 17: DIRAC DC2004 performance
- In May–July:
  - Simulation + Reconstruction;
  - >80,000 jobs;
  - ~75M events;
  - ~25 TB of data, stored at the CERN, PIC, Lyon, CNAF and RAL Tier1 centres;
  - >150,000 files in the catalogs;
  - ~2000 jobs running continuously, with up to 3000 at peak.

Slide 18: DC2004 at CC/IN2P3
- The main DIRAC development site.
- The CC/IN2P3 contribution is so far very small:
  - production runs stably and continuously;
  - resources are very limited;
  - HPSS performance is stable.

Slide 19: Note on BBFTP
- A nice product:
  - stable, performant, complete, grid-enabled;
  - lightweight – easy deployment of the statically linked executable;
  - good performance.
- It would be nice to have a parallelized, load-balancing server.
- Functionality is not complete with respect to GRIDFTP:
  - remote storage management (ls(), size(), remove());
  - transfers between remote servers.

Slide 20: LCG experience

Slide 21: Production jobs
- Long jobs – 23 hours on average on a 2 GHz PIV.
- Simulation + Digitization + Reconstruction steps, 5 to 10 steps in one job.
- No event input data.
- Output data: 1–2 output files of ~200 MB, stored to the Tier1 and Tier0 SEs.
- Log files are copied to an SE at CERN.
- The AliEn and Bookkeeping Catalogues are updated.

Slide 22: Using LCG resources
- Different ways of scheduling jobs to LCG:
  - standard: jobs go via the RB;
  - direct: jobs go directly to a CE;
  - resource reservation.
- Reservation mode is used for the DC2004 production:
  - agents are deployed to the WNs as LCG jobs;
  - DIRAC jobs are fetched by the agents only if the environment is OK;
  - the agent steers the job execution, including data transfers and updates of the catalogs and bookkeeping.
A sketch of this mode is given below.
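The reservation mode can be pictured as a small pull agent: the LCG job is the agent itself, which first checks the worker-node environment and only then asks DIRAC for a real job to run. The checks, the service call and the job structure below are illustrative placeholders, not the actual DIRAC agent.

import shutil
import sys


def environment_ok(min_disk_gb=2):
    """Check the worker node before pulling work: a usable Python interpreter
    and enough scratch space (the jobs need roughly 2 GB, see the Disk Storage
    slide further on)."""
    free_gb = shutil.disk_usage(".").free / 1e9
    return sys.version_info >= (3, 0) and free_gb >= min_disk_gb


def fetch_job_from_dirac():
    """Placeholder for the call to the DIRAC Job Management Service;
    returns None when there is no matching work."""
    return {"job_id": 1234, "steps": ["simulate", "digitize", "reconstruct"]}


def run_agent():
    if not environment_ok():
        return                      # nothing is pulled, the LCG slot is released
    job = fetch_job_from_dirac()
    if job is None:
        return
    for step in job["steps"]:
        print(f"job {job['job_id']}: running {step}")
    # the agent also steers output transfers and catalogue/bookkeeping updates


run_agent()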

Slide 23: Using LCG resources (2)
- Using the DIRAC DataManagement tools:
  - DIRAC SE + gridftp + sftp.
- Starting to populate the RLS from the DIRAC catalogues:
  - for evaluation;
  - for use with the LCG ReplicaManager.

Slide 24: Resource Broker I
- No trivial way to use the tools for large numbers of jobs, i.e. for production:
  - the command is re-authenticated for every job;
  - commands produce errors when given a list of jobs (e.g. retrieving non-terminated jobs).
- Slow to respond when a few hundred jobs are in the RB:
  - e.g. 15 seconds for job scheduling.
- The ranking mechanism is supposed to provide an even distribution of jobs:
  - the number of CPUs published is per site and not per user/VO – requesting free CPUs in the JDL doesn't help.

Slide 25: Resource Broker II
- LCG, in general, does not advertise normalised time units:
  - solution: request CPU resources for the slowest CPU (500 MHz);
  - problem: only very few sites have long enough queues;
  - solution: the DIRAC agent scales the CPU requirement for the particular WN before requesting work from DIRAC (see the sketch below);
  - problem: some sites have normalised their units!
- Jobs with infinite loops:
  - a 3-day job in a week-long queue is killed by proxy expiry rather than by the CPU requirement.
- Jobs aborted with "proxy expired":
  - the RB was re-using old proxies!
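The CPU-scaling workaround can be illustrated as follows: the CPU request is expressed for the slowest reference CPU and the agent rescales it with a locally measured speed factor before asking DIRAC for work. The benchmark is replaced here by a clock-speed argument, and the 500 (taken as MHz) reference value is the one quoted on the slide.

def local_speed_factor(reference_mhz=500.0, measured_mhz=None):
    """Return how much faster this worker node is than the slowest reference
    CPU the jobs are dimensioned for. A real agent would run a small benchmark;
    here the measured clock speed is simply passed in."""
    if measured_mhz is None:
        measured_mhz = reference_mhz        # pessimistic default
    return measured_mhz / reference_mhz


def scaled_cpu_request(reference_hours, speed_factor):
    """Convert a CPU request expressed in reference-CPU hours into the wall-clock
    hours expected on this node, so the job fits in the local queue limit."""
    return reference_hours / speed_factor


# Example: a 23-hour job (on the reference CPU) landing on a 2 GHz worker node.
factor = local_speed_factor(measured_mhz=2000.0)
print(f"expected running time: {scaled_cpu_request(23.0, factor):.1f} hours")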

Slide 26: Resource Broker III
- Jobs cancelled by the RB but with the message "cancelled by user":
  - due to loss of communication between the RB and the CE – the job is rescheduled and killed on the original CE;
  - some jobs are not killed until they fail due to the inability to transfer data;
  - DIRAC also re-schedules!
- The RB lost control of the status of all jobs.
- The RB got "stuck" – not responding to any request – solved without loss of jobs.

Slide 27: Disk Storage
- Jobs run in directories without enough space:
  - running jobs need ~2 GB – a problem where a site has jobs sharing the same disk server rather than using local WN space.

Slide 28: Reliable Data Transfer
- In case of a data transfer failure the data on LCG is lost: there is no retry mechanism if the destination SE is temporarily unavailable.
- Problems with the GRIDFTP server at CERN:
  - certificates not understood;
  - refused connections.

Slide 29: Odds & Sods
- The LDAP of the globus-mds server stops:
  - OK – no jobs can be submitted to the site;
  - BUT there are also problems with authentication of GridFTP transfers.
- Empty output sandbox:
  - tricky to debug!
- Jobs cancelled by the retry count:
  - occurs on sites with many jobs running;
  - DIRAC just submits more agents.

Slide 30: Conclusions

Slide 31: Demand 2004
- CPU: 14 M UI hours (1.4 M UI hours consumed so far).
- Storage:
  - HPSS: 20 TB;
  - disk: 2 TB, accessible from the LCG grid.

Slide 32: Demand 2005
- CPU: ~15 M UI hours.
- Storage:
  - HPSS: 30 TB (~15 TB recycled);
  - disk: 2 TB.

Slide 33: Tier2 centers
- Feasible; good network connectivity is essential.
- Limited functionality: number crunchers (production/simulation-type tasks).
- Standard technical solution:
  - hardware (CPU + storage);
  - cluster software;
  - central consultancy support.
- Housing space: adequate rooms in the labs (cooling, electric power, etc.).

Slide 34: Tier2 centers (2)
- Local support:
  - staff to be found (a remote central "watch tower"?);
  - 24/7 or best-effort support.
- Serving the community:
  - regional: a possible financing source, but extra clients (security and resource-sharing policy issues);
  - national: a French grid (segment)?

Slide 35: LHCb DC'04 Accounting

Slide 36: Next Phases – Reprocessing and Analysis

Slide 37: Data reprocessing and analysis
- Preparing the data reprocessing phase:
  - stripping – selecting events from the DST files into several output streams, by physics group;
  - jobs are scheduled to the sites where the needed data are: the Tier1s (CERN, Lyon, PIC, CNAF, RAL, Karlsruhe);
  - the workload management system is capable of automatically scheduling a job to a site holding its data (see the sketch below);
  - tools are being prepared to formulate the reprocessing tasks.
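Scheduling a job "where the data is" boils down to intersecting the replica locations of its input files with the list of Tier1 sites, as in the sketch below; the file names and the replica table are made up, while the site list is the one from the slide.

# Tier1 sites from the slide; the replica table stands for what a file
# catalogue lookup would return for the job's input DSTs (contents made up).
TIER1_SITES = ["CERN", "Lyon", "PIC", "CNAF", "RAL", "Karlsruhe"]

replica_locations = {
    "/lhcb/dc2004/stream_b/00001.dst": {"CERN", "PIC"},
    "/lhcb/dc2004/stream_b/00002.dst": {"CERN", "CNAF", "PIC"},
}


def candidate_sites(input_files, replicas, sites=TIER1_SITES):
    """Return the sites holding replicas of *all* input files, so the
    stripping job can be scheduled without any prior data movement."""
    holding = set(sites)
    for lfn in input_files:
        holding &= replicas.get(lfn, set())
    return sorted(holding)


print(candidate_sites(list(replica_locations), replica_locations))
# -> ['CERN', 'PIC']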

Slide 38: Data reprocessing and analysis (2)
- User analysis:
  - GANGA is being interfaced to submit jobs to DIRAC.
  - Submitting user jobs to DIRAC sites: security concerns – jobs are executed by the agent account on behalf of the user.
  - Submitting user jobs to LCG sites: through DIRAC, to have common job Monitoring and Accounting; using the user's certificate to submit to LCG; no agent submission – a high failure rate is expected.