1 LCG Deployment
CERN - IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it
GridPP 18, Glasgow, 21st March 2007
Tony Cass, Leader, Fabric Infrastructure & Operations Group, IT Department
Material provided by Ian Bird, Flavia Donno, Jamie Shiers and others

2 A Usual Deployment Talk…
- Continually increasing workloads: 50k-80k jobs per day; Feb '07 ≈ 12,500 cpu-months/month
- LHC experiments now transferring ~1 PB/month each
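As a rough consistency check of the workload figures above (not from the slide itself), the job rate and CPU consumption imply an average CPU time per job of a few hours, assuming the midpoint of the quoted job rate and a 30-day month:

```python
# Rough consistency check of the quoted workload numbers, assuming
# ~65K jobs/day (midpoint of 50K-80K) and a 30-day month.

jobs_per_month = 65_000 * 30                 # ~1.95M jobs/month
cpu_hours = 12_500 * 30 * 24                 # 12,500 cpu-months -> 9M cpu-hours
hours_per_job = cpu_hours / jobs_per_month   # average CPU time per job
print(round(hours_per_job, 1))               # -> 4.6 hours of CPU per job
```

A figure of roughly 4-5 CPU-hours per job is plausible for production and simulation workloads of this period, so the two numbers on the slide are mutually consistent.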

3 This Deployment Talk
Future deployment issues:
- SL(C)4
- SRM
- CE & RB
- Deployment Windows (not "Deployment of Windows"!)

4 LCG Commissioning Schedule (2006-2008)
- SC4 becomes the initial service when reliability and performance goals are met.
- Initial service commissioning: increase performance, reliability and capacity to target levels; gain experience in monitoring and 24x7 operation.
- Introduce residual services: full FTS services; 3D; gLite 3.x; SRM v2.2; VOMS roles; SL(C)4.
- Full FTS services demonstrated at 2008 data rates for all required Tx-Ty channels, over extended periods, including recovery (T0-T1).
- 01 Jul 07: service commissioned, with full 2007 capacity and performance, before first collisions in the LHC.
- The April 1st target is to allow experiments to prepare for the July 1st FDRs. The timescale is tight based on SC3/4 experience (4 months would have been better…).

5 SL(C)4 Migration
SL(C)4 is the target OS level for initial LHC operation:
- RHES5 is out, but migration to it is not feasible.
- Experiments would like to drop SL(C)3 builds.
- The middleware is still SL3-built, but SL4-compatible middleware rpms are available now, so there is no longer any problem preventing subsequent updates.
- Natively built WN rpms are under test; expected in PPS next week.
- There are still some issues with the native UI build.
Plan your migration!

6 SRM v2.2 Server Status
- DPM: version 1.6.3 available in production. SRM 2.2 features are still not officially certified, but the implementation is stable and use-case tests are OK. Copy is not available, but interoperability tests are OK. A few general issues remain to be solved. Volunteers to install and test SRM 2.2 features are welcome!
- BeStMan and StoRM: stable implementations; Copy in PULL mode is not available in StoRM. Some instability has recently been observed with BeStMan. Some use-case tests are still not passing and are under investigation.
- dCache: stable implementation. Copy is available and working with all implementations excluding DRM. Work continues on some use-case tests. Requires migration to v1.8.0 (which will support v1.1 and v2.2 transparently); beta version in April.
- CASTOR: the implementation has improved remarkably, with a lot of progress during the last 3 weeks; the main causes of instability have been found and fixed. Use-case tests are OK; Copy is not yet implemented, but interoperability tests are OK. Stress tests are running at CERN now and at CNAF from next week, but an upgrade to the underlying CASTOR version is required for efficient operation. Deployment at CERN is scheduled for mid-May; CNAF and RAL follow soon afterwards, to be ready for production use by July 1st.
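For sites volunteering to test SRM 2.2 features, a basic smoke test is a copy to and deletion from the endpoint via lcg-utils. The sketch below only builds the command lines; the endpoint URL and local file are placeholders, and the `-b -D srmv2` flags (bypass the BDII, force the SRM v2 interface) are an assumption about the lcg-utils of this era, so check `lcg-cp --help` on your UI before relying on them:

```python
# Hypothetical sketch: build lcg-utils command lines to smoke-test an
# SRM v2.2 endpoint. The endpoint URL below is a placeholder, and the
# "-b -D srmv2" flags are assumed, not taken from the slides.

def srm_copy_cmd(local_file, srm_url, vo):
    """Command to copy a local file to an SRM v2.2 endpoint."""
    return ["lcg-cp", "-v", "--vo", vo, "-b", "-D", "srmv2",
            f"file:{local_file}", srm_url]

def srm_delete_cmd(srm_url, vo):
    """Command to clean up the test file afterwards."""
    return ["lcg-del", "-v", "--vo", vo, "-b", "-D", "srmv2", srm_url]

cmd = srm_copy_cmd("/tmp/srmtest",
                   "srm://srm.example.org:8443/dpm/example.org/home/dteam/t1",
                   "dteam")
print(" ".join(cmd))
```

Running the copy, a follow-up `lcg-ls`-style listing, and the delete exercises the put, stat and remove use cases that the certification tests cover.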

7 SRM v2.2 Client Status
- FTS: the SRM client code has been unit-tested and integrated into FTS; tested against DPM, dCache and StoRM, with CASTOR and DRM tests started. Released to the development testbed.
- GFAL/lcg-utils: new rpms are available on the test UI and are being officially certified. No outstanding issues at the moment; ATLAS has started some tests.
- GLUE: v1.3 of the schema is available: http://glueschema.forge.cnaf.infn.it/Spec/V13

8 gLite WMS & LCG RB
Reliability of the gLite WMS is being addressed with high priority:
- it is not yet ready to replace the LCG RB;
- there are no plans (or effort?) to migrate the LCG RB to SL4.
Acceptance criteria for the RB have been agreed based on performance requirements from ATLAS and CMS:
- Performance, 2007 dress rehearsals: CMS not specified (but was 50K jobs/day in CSA06); ATLAS 20K successful jobs/day plus analysis load.
- Performance, 2008: CMS 200K jobs/day through <10 WMS; ATLAS 100K jobs/day through <10 WMS.
- Stability: CMS not specified; ATLAS <1 restart of WMS or L&B per month (i.e. matching the LCG RB).

9 gLite WMS criteria
- A single WMS machine should demonstrate submission rates of at least 10K jobs/day sustained over 5 days, during which time the WMS services, including the L&B, should not need to be restarted. This performance level should be reachable with both bulk and single job submission.
- During this 5-day test the performance must not degrade significantly due to filling of internal queues, memory consumption, etc.; i.e. the submission rate on day 5 should be the same as that on day 1.
- Proxy renewal must work at the 98% level: i.e. <2% of jobs should fail due to proxy renewal problems (the real failure rate should be lower because jobs may be retried).
- The number of stale jobs after 5 days must be <1%.
- The L&B data and job states must be verified:
  - a reasonable time after submission has ended, there should be no jobs in "transient" or "cancelled" states;
  - if jobs are very short, no jobs should stay in "running" state for more than a few hours;
  - after the proxy expires, all jobs must be in a final state (Done-Success or Aborted).
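The quantitative criteria above can be sketched as a simple pass/fail check over a 5-day test run. This is an illustrative sketch, not the official acceptance harness: the function name, the input summary values, and the 5% tolerance used for "no significant degradation" are all my own assumptions.

```python
# Minimal sketch (not the official test suite) evaluating a 5-day WMS
# run against the acceptance criteria. The 5% degradation tolerance is
# an assumption; the text only says "must not degrade significantly".

def wms_criteria_pass(day1_rate, day5_rate, total_jobs,
                      proxy_failures, stale_jobs, restarts):
    """Return (ok, reasons) for the gLite WMS acceptance criteria."""
    reasons = []
    if day1_rate < 10_000 or day5_rate < 10_000:
        reasons.append("submission rate below 10K jobs/day")
    if day5_rate < 0.95 * day1_rate:
        reasons.append("rate degraded between day 1 and day 5")
    if restarts > 0:
        reasons.append("WMS/L&B restart needed during the test")
    if proxy_failures / total_jobs >= 0.02:
        reasons.append("proxy-renewal failures >= 2%")
    if stale_jobs / total_jobs >= 0.01:
        reasons.append("stale jobs >= 1%")
    return (not reasons, reasons)

ok, why = wms_criteria_pass(day1_rate=11_000, day5_rate=10_800,
                            total_jobs=54_000, proxy_failures=500,
                            stale_jobs=300, restarts=0)
print(ok)  # True: 500/54000 < 2% proxy failures, 300/54000 < 1% stale
```

Note the job-state checks (no "transient", "cancelled" or long-lived "running" jobs) would need the L&B records themselves and are not captured by these summary counters.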

10 gLite CE
Similarly for the gLite CE:
- it is not yet reliable enough;
- reliability criteria have been defined;
- no port of the LCG CE to SL4 is foreseen.
For both the WMS and the CE, development progress against the reliability criteria is reviewed weekly. Deployment of the gLite versions is not recommended at this stage, but if you do have them installed, please keep them running and track developments to help in testing, and be ready to deploy when production-ready code becomes available!

11 gLite CE criteria
Performance:
- 2007 dress rehearsals: 5000 simultaneous jobs per CE node; 50 user/role/submission-node combinations (Condor_C instances) per CE node.
- End 2007: 5000 simultaneous jobs per CE node (assuming the same machine as 2007, but expect this to improve). One CE node should support an unlimited number of user/role/submission-node combinations, from at least 10 VOs, up to the limit on the number of jobs (might be achieved with one Condor_C per VO, with user switching done by glexec in blah).
Reliability:
- Job failures due to the CE in normal operation: <0.5%; job failures due to restart of CE services or CE reboot: <0.5%.
- 2007 dress rehearsals: 5 days unattended running, with performance on day 5 equivalent to that on day 1.
- End 2007: 1 month unattended running without performance degradation.
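The two reliability thresholds are counted separately, so a CE can fail the criteria on restart-induced failures alone even if normal operation is clean. A minimal sketch, assuming per-run summary counters (the function name and inputs are mine, not from the gLite test suite):

```python
# Illustrative sketch: both CE failure classes, failures in normal
# operation and failures caused by CE service restarts or reboots,
# must stay below 0.5% of all jobs, counted independently.

def ce_reliability_ok(total_jobs, normal_failures, restart_failures):
    """True if both failure classes are below the 0.5% threshold."""
    limit = 0.005 * total_jobs
    return normal_failures < limit and restart_failures < limit

# e.g. 60,000 jobs over a 5-day run: the threshold is 300 per class.
print(ce_reliability_ok(60_000, normal_failures=250, restart_failures=100))  # True
print(ce_reliability_ok(60_000, normal_failures=400, restart_failures=100))  # False
```

Keeping the two classes separate matters operationally: it distinguishes a CE that is intrinsically flaky from one that only loses jobs when it is restarted.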

13 Planning future deployments
There are still a number of components to deploy before data taking, and even before the dress rehearsals. Remember, there is no longer a "big bang" model; individual components are released as they are ready, so be prepared…
Deployment/intervention scheduling:
- discussed at the January workshop: when is "the least inconvenient time"?
- has been discussed since then at the LCG Experiment Coordination Meeting, but with no consensus. Opinion seems to be that this is not an issue for the "engineering run": last system changes in September/October, then things kept stable for the short run.
- The situation for 2008 is to be decided before the run.
- Whatever is decided, clear and early announcement of changes leads to ready acceptance by users…

14 WLCG Intervention Scheduling
1. Scheduled service interventions shall normally be performed outside of the announced period of operation of the LHC accelerator.
2. In the event of mandatory interventions during the operation period of the accelerator (such as a non-critical security patch), an announcement will be made using the Communication Interface for Central (CIC) operations portal and the period of scheduled downtime entered in the Grid Operations Centre (GOC) database (GOCDB).
3. Such an announcement shall be made at least one working day in advance for interventions of up to 4 hours.
4. Interventions resulting in significant service interruption or degradation longer than 4 hours and up to 12 hours shall be announced at the Weekly Operations meeting prior to the intervention, with a reminder sent via the CIC portal as above.
5. Interventions exceeding 12 hours must be announced at least one week in advance, following the procedure above.
6. A further announcement shall be made once normal service has been resumed.
7. [deleted]
8. Intervention planning should also anticipate any interruptions to jobs running in the site batch queues. If appropriate, the queues should be drained and closed for further job submission.
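Rules 3-5 form a simple escalation ladder keyed on the downtime length, which can be sketched as follows. The thresholds come from the rules above; the function name and return strings are my own shorthand, not WLCG wording:

```python
# Sketch of the announcement ladder in rules 3-5 above: the longer the
# scheduled downtime, the earlier and more widely it must be announced.
# Thresholds (4 h, 12 h) are from the text; names/strings are mine.

def announcement_requirement(downtime_hours):
    """Map a scheduled downtime length to its minimum announcement."""
    if downtime_hours <= 4:
        return "CIC portal + GOCDB entry, at least one working day ahead"
    if downtime_hours <= 12:
        return "Weekly Operations meeting beforehand, plus CIC portal reminder"
    return "announce at least one week in advance, same procedure"

print(announcement_requirement(3))   # a short patch window
print(announcement_requirement(8))   # a half-day degradation
print(announcement_requirement(48))  # a multi-day upgrade
```

In every case rule 6 still applies: a closing announcement is made once normal service resumes.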

15 LCG Commissioning Schedule (repeat of slide 4)
