CMS Stress Test Report Marco Verlato (INFN-Padova) INFN-GRID Testbed Meeting, 17 January 2003


Motivations and goals

Purpose of the "stress test":
- Verify how well the EDG middleware supports CMS production
- Verify the portability of the CMS production environment to a grid environment
- Produce a reasonable fraction of the PRS-requested events

Goals:
- Aim for 1 million events (FZ files only, no Objectivity)
- Measure performance, efficiency and the reasons for job failures
- Try to make the system stable

Organization:
- Operations started November 30th and ended at Xmas (~3 weeks)
- The joint effort involved CMS, EDG and LCG people (~50 people, 17 from INFN)
- Mailing list:

Software and middleware

CMS software used is the official production one:
- CMKIN and CMSIM, installed as RPMs on all the sites

EDG middleware releases:
- (before 9/12)
- 1.4.0 (after 9/12)

Tools used (on the EDG "User Interface"):
- Modified IMPALA/BOSS system to allow Grid submission of jobs
- Scripts and ad-hoc tools to:
  - Replicate files
  - Collect monitoring information from EDG and from the jobs
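Submission went through the modified IMPALA/BOSS chain rather than hand-written JDL, but the job description that ends up at the EDG Resource Broker looks roughly like the sketch below. Every file name, executable name and RunTimeEnvironment tag here is hypothetical, chosen only for illustration.

```
# Illustrative EDG JDL sketch -- all names below are hypothetical
Executable    = "cmsim_wrapper.sh";
Arguments     = "eg02_BigJets";
StdOutput     = "cmsim.out";
StdError      = "cmsim.err";
InputSandbox  = {"cmsim_wrapper.sh", "kine.cards"};
OutputSandbox = {"cmsim.out", "cmsim.err", "boss.log"};
Requirements  = Member(other.RunTimeEnvironment, "CMS-1.0.0");
```

The Requirements expression is what restricts matchmaking to CEs advertising the CMS software installation, as in the architecture slide.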

[Architecture diagram: IMPALA on the UI builds jobs from RefDB parameters; BOSS submits them as JDL through the JobExecuter to the Grid services (RB); jobs run on CEs with the CMS software installed, write data from the WN to SEs and register it in the RC; job output filtering and runtime monitoring update the BOSS DB via dbUpdator.]

Resources

The production is managed from 4 UIs (this reduces the bottleneck due to the BOSS DB):
- Bologna / CNAF
- Ecole Polytechnique
- Imperial College
- Padova

Several RBs seeing the same Computing and Storage Elements (this reduces the bottleneck due to intensive use of the RB and the 512-owner limit in Condor-G):
- CERN (dedicated to CMS) (EP UI)
- CERN (common to all applications) (backup!)
- CNAF (common to all applications) (Padova UI)
- CNAF (dedicated to CMS) (CNAF UI)
- Imperial College (dedicated to CMS and BaBar) (IC UI)

Resources

Site       | CE         | No. of CPUs | SE          | Disk space (GB)
CERN       | lxshare…   | …           | lxshare0393 | … (= 100 x 10)
CNAF       | testbed008 | 40          | grid007g    | 1300
RAL        | gppce05    | 16          | gppse05     | 330
NIKHEF     | tbn09      | 22          | tbn03       | 430
Lyon       | ccgridli03 | 120         | ccgridli07  | 200
Legnaro    | cmsgrid001 | 50          | cmsgrid…    | …
Padova     | grid001    | 12          | grid…       | …
Ecole Pol. | polgrid1   | 4           | polgrid…    | …

Data management

Two practical approaches:
- Bologna, Padova: FZ files (~230 MB each) are stored directly at CNAF and Legnaro
- EP, IC: FZ files are stored where they were produced and later replicated to a dedicated SE at CERN. Goal: test the creation of file replicas

All sites use disk for file storage, but:
- CASTOR at CERN: FZ files replicated to CERN are also automatically copied into CASTOR (thanks to a new staging daemon from WP2)
- HPSS in Lyon: FZ files stored in Lyon are automatically copied into HPSS
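As a back-of-envelope cross-check of the storage these choices imply for the 1-million-event goal: the ~230 MB FZ file size is from the slide above, while the events-per-file figure below is a hypothetical assumption made only for this sketch.

```python
# Back-of-envelope storage estimate for the 1M-event goal.
# The ~230 MB FZ file size is from the talk; EVENTS_PER_FILE is a
# HYPOTHETICAL figure chosen here for illustration only.
EVENTS_TARGET = 1_000_000
EVENTS_PER_FILE = 125        # assumption, not stated in the talk
FZ_FILE_MB = 230             # ~230 MB per FZ file (from the talk)

n_files = EVENTS_TARGET // EVENTS_PER_FILE
total_tb = n_files * FZ_FILE_MB / 1_000_000   # MB -> TB (decimal)

print(f"~{n_files} FZ files, ~{total_tb:.2f} TB of FZ data")
```

Under that assumption the test would move on the order of a couple of TB of FZ data, which is consistent with the per-site disk figures in the resources table.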

Online Monitoring (MDS based)

Events vs. time (CMKIN)

Events vs. time (CMSIM)
- ~7 sec/event average
- ~2.5 sec/event peak (12-14 Dec)
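The two quoted rates convert into aggregate daily throughput as follows; this is simple arithmetic on the numbers from the slide, nothing more.

```python
# Convert the per-event times quoted on the slide into aggregate daily rates.
SECONDS_PER_DAY = 86_400

avg_s_per_event = 7.0    # ~7 sec/event average over the test
peak_s_per_event = 2.5   # ~2.5 sec/event at the 12-14 Dec peak

avg_events_per_day = SECONDS_PER_DAY / avg_s_per_event    # about 12,300
peak_events_per_day = SECONDS_PER_DAY / peak_s_per_event  # exactly 34,560

print(f"average: ~{avg_events_per_day:.0f} events/day")
print(f"peak:    ~{peak_events_per_day:.0f} events/day")
```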

Final results (preliminary!)

UI     | #CMKIN evts | % | #CMSIM evts | %
CNAF   | …           | … | …           | …
IC     | …           | … | …           | …
IN2P3  | …           | … | …           | …
Padova | …           | … | …           | …
total  | …           | … | …           | …

UI     | #CMKIN jobs | #success (%) | #CMSIM jobs | #success (%)
CNAF   | …           | … (83)       | …           | … (74)
IC     | …           | … (90)       | …           | … (64)
IN2P3  | …           | … (69)       | …           | … (53)
Padova | …           | … (89)       | …           | … (56)
total  | …           | … (82)       | …           | … (63)
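The success percentages in the jobs table are simply 100 x successful / submitted, rounded to the nearest percent. A minimal sketch follows; the job counts used in it are hypothetical, since the absolute numbers did not survive in the transcript, only the percentages.

```python
# Success rate as reported in the jobs table: 100 * successful / submitted,
# rounded to the nearest percent. The counts below are HYPOTHETICAL;
# only the percentages appear in the talk.
def success_pct(submitted: int, successful: int) -> int:
    return round(100 * successful / submitted)

print(success_pct(1000, 630))   # matches the overall CMSIM figure of 63%
print(success_pct(200, 166))    # matches the CNAF CMKIN figure of 83%
```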

Main issues

Symptom: no matching resources
- Cause: II stuck by too many accesses
- Solution: "fake" dbII used; slow job submission rate
- Frequency: very high before, low since the slow submission

Symptom: standard output of the job wrapper does not contain useful data
- Causes: 1) home dir not available on WN; 2) exhausted resources on CE; 3) race conditions for file updates between WN and CE; 4) glitches in the gass_transfer; 5) …
- Solution: GRAM-PBS script patch and JSS-Maradona patch under test after Xmas
- Frequency: very high, especially for "long jobs" (~12 hours)

Symptom: Condor failure
- Cause: Condor scheduler crashes (file-max parameter too low)
- Solution: increase the file-max parameter
- Frequency: low

Symptom: cannot connect to RC server
- Cause: LDAP server overloaded
- Solution: create new RCs and new collections; restart the LDAP server
- Frequency: high

Symptom: edg-job-* commands hang
- Cause: MDS not responding due to a local GRIS
- Solution: remove the offending GRIS from MDS
- Frequency: low

Symptom: Globus down / failed submission
- Cause: Gatekeeper unreachable
- Solution: ?
- Frequency: low

Symptom: cannot download InputSandbox
- Cause: globus-url-copy problem between WN and RB (security, gridftp, etc.)
- Solution: ?
- Frequency: low

Chronology
- 29/11 - 2/12: reasonably smooth
- 3/12 - 5/12: "inefficiency" due to the CMS week
- 6/12: RC problems begin; new collections created; Nagios monitoring online
- 7/12 - 8/12: II in very bad shape
- 9/12 - 10/12: deployment of 1.4.0; still problems with RC; CNAF and Legnaro resources not available; problems with the CNAF RB
- 11/12: top-level MDS stuck because of a CE in Lyon
- 14/12 - 15/12: II stuck, most submitted jobs aborted
- 16/12: failure in grid-mapfile update because the NIKHEF VO LDAP server was not reachable

Conclusions

Job failures are dominated by:
- "Standard output of job wrapper does not contain useful data":
  - many different causes
  - mainly affects "long jobs"
  - some patches with possible solutions implemented
- Replica Catalog stops responding: no real solution yet, but we will soon use RLS
- Information System (GRIS, GIIS, dbII): hopefully R-GMA will solve these problems

Lots of smaller problems (Globus, Condor-G, machine configuration, defective disks, etc.)

Short-term actions:
- New EDG release made on 14/1 and deployed on the PRODUCTION testbed
- The test continues in "no-stress" mode:
  - in parallel with the review preparation (the testbed will remain stable)
  - it will measure the effect of the new GRAM-PBS script and JSS-Maradona patches