CMS Stress Test Report, Marco Verlato (INFN-Padova), INFN-GRID Testbed Meeting, 17 January 2003.


1 CMS Stress Test Report
Marco Verlato (INFN-Padova), INFN-GRID Testbed Meeting, 17 January 2003

2 Motivations and goals
- Purpose of the "stress test":
  - verify how well the EDG middleware supports CMS production
  - verify the portability of the CMS production environment to a grid environment
  - produce a reasonable amount of the PRS-requested events
- Goals:
  - aim for 1 million events (FZ files only, no Objectivity)
  - measure performance, efficiency, and the reasons for job failures
  - try to make the system stable
- Organization:
  - operations started November 30th and ended at Christmas (~3 weeks)
  - the joint effort involved CMS, EDG and LCG people (~50 people, 17 from INFN)
  - Mailing list:

3 Software and middleware
- CMS software used is the official production version:
  - CMKIN and CMSIM, installed as RPMs on all the sites
- EDG middleware releases:
  - 1.3.4 (before 9/12)
  - 1.4.0 (after 9/12)
- Tools used (on the EDG "User Interface"):
  - modified IMPALA/BOSS system to allow Grid submission of jobs
  - scripts and ad hoc tools to:
    - replicate files
    - collect monitoring information from EDG and from the jobs

4 [Architecture diagram: the UI hosts IMPALA and the BOSS DB; IMPALA prepares jobs from RefDB parameters, and they are submitted as JDL to the Grid services (JobExecuter); jobs run on CEs with the CMS software installed, write data from the WN to an SE, and register the data in the RC; job output filtering and runtime monitoring feed the BOSS DB via dbUpdator.]
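The JDL mentioned in the diagram is the EDG/Condor ClassAd-style job description that IMPALA/BOSS hands to the broker. A minimal hedged sketch follows; the executable name, sandbox file names, tag value and run identifier are hypothetical, not taken from the slides:

```
// Hypothetical JDL for one CMSIM job submitted through the EDG broker
Executable    = "CMSIM.sh";
Arguments     = "run_0042";
StdOutput     = "cmsim.out";
StdError      = "cmsim.err";
InputSandbox  = {"CMSIM.sh", "cmsim.params"};
OutputSandbox = {"cmsim.out", "cmsim.err"};
// Steer the job only to CEs advertising the CMS software environment
Requirements  = Member(other.RunTimeEnvironment, "CMS-SW");
Rank          = other.FreeCPUs;
```

The Requirements expression is what lets the broker match jobs to the CEs marked "CMS sw" in the diagram.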

5 Resources
- The production is managed from 4 UIs (this reduces the bottleneck due to the BOSS DB):
  - Bologna / CNAF
  - Ecole Polytechnique
  - Imperial College
  - Padova
- Several RBs seeing the same Computing and Storage Elements (this reduces the bottleneck due to intensive use of the RB and the 512-owner limit in Condor-G):
  - CERN (dedicated to CMS) (EP UI)
  - CERN (common to all applications) (backup!)
  - CNAF (common to all applications) (Padova UI)
  - CNAF (dedicated to CMS) (CNAF UI)
  - Imperial College (dedicated to CMS and BaBar) (IC UI)

6 Resources

Site        CE           CPUs   SE                         Disk space (GB)
CERN        lxshare0227  122    lxshare0393, lxshare0384   100, 1000 (=100 x 10)
CNAF        testbed008   40     grid007g                   1300
RAL         gppce05      16     gppse05                    330
NIKHEF      tbn09        22     tbn03                      430
Lyon        ccgridli03   120    ccgridli07                 200
Legnaro     cmsgrid001   50     cmsgrid002                 500
Padova      grid001      12     grid005                    670
Ecole Pol.  polgrid1     4      polgrid2                   200
Total                    386                               4730
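As a quick cross-check, the per-site figures add up to the quoted totals of 386 CPUs and 4730 GB (the numbers below are transcribed from the resources table; nothing else is assumed):

```python
# CPU counts per site, as listed in the resources table
cpus = {"CERN": 122, "CNAF": 40, "RAL": 16, "NIKHEF": 22,
        "Lyon": 120, "Legnaro": 50, "Padova": 12, "Ecole Pol.": 4}

# Disk space in GB; CERN has 100 GB on one SE plus 10 x 100 GB on the other
disk_gb = {"CERN": 100 + 100 * 10, "CNAF": 1300, "RAL": 330, "NIKHEF": 430,
           "Lyon": 200, "Legnaro": 500, "Padova": 670, "Ecole Pol.": 200}

total_cpus = sum(cpus.values())     # 386
total_disk = sum(disk_gb.values())  # 4730
print(total_cpus, total_disk)
```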

7 Data management u Two practical approaches: n Bologna, Padova: FZ files (~230 MB sized) are directly stored at CNAF, Legnaro n EP, IC: FZ files are stored where they have been produced and later replicated to a dedicated SE at CERN. Goal: to test the creation of replicas of files u All sites use disk for the file storage, but: n CASTOR at CERN: FZ files replicated to CERN are also automatically copied into CASTOR (thanks to a new staging daemon from WP2) n HPSS in Lyon: FZ files stored in Lyon are automatically copied into HPSS

8 Online Monitoring (MDS based)

9 Events vs. time (CMKIN)

10 Events vs. time (CMSIM)
- ~7 sec/event on average
- ~2.5 sec/event at peak (12-14 Dec)
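Those per-event times translate into an aggregate farm throughput roughly as follows. A back-of-the-envelope sketch; only the 7 s/event and 2.5 s/event figures come from the slide, the events/day conversion is ours:

```python
SECONDS_PER_DAY = 86400

def events_per_day(sec_per_event):
    """Aggregate throughput implied by an effective time per event."""
    return SECONDS_PER_DAY / sec_per_event

print(round(events_per_day(7.0)))   # ~12343 events/day at the average rate
print(round(events_per_day(2.5)))   # 34560 events/day at the 12-14 Dec peak
```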

11 Final results (preliminary!)

UI       #CMKIN evts   %    #CMSIM evts   %
CNAF     253625        43   130375        48
IC       73125         12   23375         9
IN2P3    114250        19   32125         12
Padova   151875        26   82750         31
Total    592875             268625

UI       #CMKIN jobs   #success (%)   #CMSIM jobs   #success (%)
CNAF     2430          2029 (83)      1412          1043 (74)
IC       647           585 (90)       290           187 (64)
IN2P3    1327          914 (69)       474           253 (53)
Padova   1358          1215 (89)      1188          662 (56)
Total    5762          4743 (82)      3364          2145 (63)
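The overall success percentages are consistent with the raw job counts in the table. A minimal check (counts transcribed from the table; percentages truncated to integers, which matches the slide's figures):

```python
def success_pct(successes, jobs):
    # Integer (truncating) percentage, as the slide appears to use
    return 100 * successes // jobs

# Totals from the jobs table above
print(success_pct(4743, 5762))  # 82 -> CMKIN overall efficiency
print(success_pct(2145, 3364))  # 63 -> CMSIM overall efficiency
```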

12 Main issues

- Symptom: "no matching resources"
  Cause: II stuck by too many accesses
  Solution: "fake" dbII used since 1.4.0; slower job submission rate
  Frequency: very high before 1.4.0; low since 1.4.0 (with slow submission)
- Symptom: standard output of the job wrapper does not contain useful data
  Cause: 1) home dir not available on WN; 2) exhausted resources on CE; 3) race conditions for file updates between WN and CE; 4) glitches in the gass_transfer; 5) ...
  Solution: GRAM-PBS script patch and JSS-Maradona patch under test after Christmas, since 1.4.2
  Frequency: very high, especially for "long jobs" (~12 hours)
- Symptom: Condor failure
  Cause: Condor scheduler crashes (file-max parameter too low)
  Solution: increase the file-max parameter
  Frequency: low
- Symptom: cannot connect to RC server
  Cause: LDAP server overloaded
  Solution: create new RCs and new collections; restart the LDAP server
  Frequency: high
- Symptom: edg-job-* commands hang
  Cause: MDS not responding due to a local GRIS
  Solution: remove the offending GRIS from MDS
  Frequency: low
- Symptom: Globus down / failed submission
  Cause: gatekeeper unreachable
  Solution: ?
  Frequency: low
- Symptom: cannot download InputSandbox
  Cause: globus-url-copy problem between WN and RB (security, gridftp, etc.)
  Frequency: low
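The "file-max" parameter behind the Condor failure is the Linux kernel's system-wide limit on open file handles. On a kernel of that era it could be raised as sketched below; the value 65536 is an illustrative choice, not taken from the slide:

```
# Inspect the current system-wide open-file limit
cat /proc/sys/fs/file-max

# Raise it on the running system (as root)
echo 65536 > /proc/sys/fs/file-max

# Or make it persistent across reboots via /etc/sysctl.conf:
#   fs.file-max = 65536
```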

13 Chronology
- 29/11 - 2/12: reasonably smooth
- 3/12 - 5/12: "inefficiency" due to CMS week
- 6/12: RC problems begin; new collections created; Nagios monitoring online
- 7/12 - 8/12: II in very bad shape
- 9/12 - 10/12: deployment of 1.4.0; still problems with the RC; CNAF and Legnaro resources not available; problems with the CNAF RB
- 11/12: top-level MDS stuck because of a CE in Lyon
- 14/12 - 15/12: II stuck, most submitted jobs aborted
- 16/12: failure in grid-mapfile update because the NIKHEF VO LDAP server was not reachable

14 Conclusions
- Job failures are dominated by:
  - standard output of the job wrapper not containing useful data:
    - many different causes
    - affects mainly "long jobs"
    - some patches with possible solutions implemented
  - Replica Catalog stops responding: no real solution, but we will soon use RLS
  - Information System (GRIS, GIIS, dbII): hopefully R-GMA will solve these problems
- Lots of smaller problems (Globus, Condor-G, machine configuration, defective disks, etc.)
- Short-term actions:
  - EDG 1.4.3 released on 14/1 and deployed on the PRODUCTION testbed
  - the test is going on in "no-stress" mode:
    - in parallel with the review preparation (the testbed will remain stable)
    - it will measure the effect of the new GRAM-PBS script and JSS-Maradona patches

