
1 Deployment issues and SC3
Jeremy Coles
GridPP Tier-2 Board and Deployment Board, Glasgow, 1st June 2005

2 June 2005 Deployment update
Current deployment issues
Main GridPP concerns:
– gLite migration, fabric management & future of YAIM
– dCache
– Data migration – classic SE to SRM SE
– Security
– Ganglia deployment
– Use of ticketing system
– Use of UK testzone
General:
– Jobs at sites – improving (nb. Freedom of Choice is coming!)
– Few general EGEE VOs supported at GridPP sites

3 June 2005 Deployment update
2nd LCG Operations Workshop
Took place in Bologna last week: http://infnforge.cnaf.infn.it/cdsagenda//fullAgenda.php?ida=a0517
Covered the following areas:
– Daily operations
– Pre-production service
– gLite deployment and migration
– Future monitoring (metrics)
– Interoperation with OSG
– User support (Executive Support Committee!)
– VO management processes
– Fabric management
– Accounting (DGAS and APEL)
– Little on security! Romain presented potential tools.

4 June 2005 Deployment update
LCG-2_4_0 Plan
CPUs by release:
– 2_4_0: 10642
– 2_3_1: 912
– 2_3_0: 2167
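The counts above imply that most of the infrastructure's capacity was already on the new release; a quick check of the slide's numbers:

```python
# Share of CPUs already on the LCG-2_4_0 release, using the counts from the slide
cpus = {"2_4_0": 10642, "2_3_1": 912, "2_3_0": 2167}

total = sum(cpus.values())
frac_240 = 100 * cpus["2_4_0"] / total

print(total)               # 13721 CPUs in total
print(round(frac_240, 1))  # 77.6 -- i.e. ~78% of CPUs already on 2_4_0
```

This is the point made again on slide 10: measured in CPUs rather than sites, the upgrade was already well advanced.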

5 June 2005 Deployment update
Version change in the last 100 days
[Chart: versions deployed across all sites in LCG-2 over the last 100 days; “Others” = sites on older versions or down]

6–9 June 2005 Deployment update
[Charts: per-region version breakdown (regions with fewer than 5 sites are not shown): Canada, Russia, Italy, Germany/Switzerland; France, Asia Pacific, Northern, SW; Central, SE; UKI]

10 June 2005 Deployment update
LCG-2_4_0
Lessons learned:
– Harder than expected (rate independent of packaging)
– Differences between regions --> ROCs matter
– Release definition non-trivial with 3-month intervals
– Component dependencies: X without Y and V is useless…
– During certification we still find problems
– Upgrade and installation from scratch needed (time consuming)
– Test pilots for deployment are useful
– Early announcement of releases is useful
– We need to introduce “updates” via APT to fix bugs that show during deployment
– Number of sites is the wrong metric to measure success: CPUs on the new release need to be tracked, not sites
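The “updates via APT” idea amounts to pointing site nodes at a release repository and pulling incremental fixes between full releases. A minimal sketch using the apt4rpm syntax of the era (the repository URL and component names here are hypothetical, not the actual LCG repository):

```shell
# /etc/apt/sources.list.d/lcg.list -- hypothetical LCG release repository
# rpm-based apt (apt4rpm), as used on Scientific Linux 3 nodes
rpm http://repository.example.org/lcg sl3/i386 lcg_sl3 lcg_sl3_updates

# Pull bug-fix updates without waiting for the next full release:
apt-get update
apt-get -y upgrade
```

The attraction is that sites track fixes continuously instead of facing a large, disruptive upgrade every three months.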

11 June 2005 Deployment update
The next release
Why?
– SC3 is approaching and the needed components are not deployed at the sites
What?
– File transfer service (will need VDT 1.2.2): servers for Tier1 and Tier0, clients for the rest
– Improved monitoring sensors for GridFTP
– RFC proxy extension for VOMS
– New version of the GLUE schema (compatible)
– LFC production service
– Interoperability with GRID3/OSG
– User-level stdio monitoring (maybe later)
– Bug fixes… as always
When?
– Aimed at mid-June
Who?
– Tier 1 centres and Tier 2 centres participating in SC3: as fast as possible
– Others? At their own pace
– Updated release (fixes from 1st release) expected by July 1st

12 June 2005 Deployment update
Coexistence & Extended Pre-Production
[Diagram: a site running LCG and gLite side by side, sharing VOMS, LFC, SRM-SE, myProxy, UIs and WNs; LCG side: RB, LCG CE, FTS, R-GMA, BD-II, APEL; gLite side: FIREMAN, gLite WLM, gLite-CE, gLite-IO, FTS, DGAS]
Notes:
– Data from LCG is owned by VO and role; the gLite-IO service owns gLite data
– FTS for LCG uses a user proxy; gLite uses a service cert
– R-GMAs can be merged (security ON)
– CEs use the same batch system
– Independent IS
– Catalogue and access control

13 June 2005 Deployment update
Gradual Transition 1
[Diagram: shared VOMS, LFC, SRM-SE, myProxy, UIs and WNs; LCG RB and CE alongside the gLite-CE and gLite WLM; FTS, R-GMA, BD-II, APEL, DGAS]
Notes:
– FTS for LCG uses a user proxy; gLite uses a service cert
– CEs use the same batch system
– Optional additional WLM
– Data management: LCG
– Optional DGAS accounting

14 June 2005 Deployment update
Gradual Transition 2
[Diagram: shared VOMS, LFC, SRM-SE, myProxy, UIs and WNs; gLite WLM and gLite-CE, LCG CE retained; FTS, R-GMA, FIREMAN, BD-II, APEL, DGAS]
Notes:
– Removed LCG WLM
– Optional catalogue
– R-GMA in gLite mode

15 June 2005 Deployment update
Gradual Transition 3
[Diagram: as Transition 2, with gLite-IO and a second FTS path added]
Notes:
– Data from LCG is owned by VO and role; the gLite-IO service owns gLite data
– Adding gLite-IO: a second path to data, an additional security model
– Data migration phase

16 June 2005 Deployment update
Gradual Transition 4
[Diagram: gLite services only; shared VOMS, LFC, SRM-SE, myProxy, UIs and WNs; gLite WLM, gLite-CE, R-GMA, FIREMAN, gLite-IO, FTS, BD-II, APEL, DGAS]
Notes:
– Finalize switch to the new security model
– LFC, now a local catalogue under VO control
– BDII later replaced by R-GMA

17 June 2005 Deployment update
Metrics - EGEE
General agreement on the concept; detailed discussions on:
– Time windows: sliding windows (week, month, 3 months)
– Quantities to watch for (RCs, ROCs, CICs…): ROCs based on RCs, CICs based on services
– Release quality has to be measured
To make progress: a workgroup to define quantities
– Organized by Ognjen Prnjat (oprnjat@admin.grnet.gr)
– Small (~5): Ognjen, Markus, Helene, Jeff T. and Jeremy
– Ognjen will collect input
– ROCs, CICs and OMC have to agree on ONE set of quantities

18 June 2005 Deployment update
Operations summary
CIC On Duty is now well established
– COD is just 6 months old!
– Tools have evolved at a dramatic pace (portal, SFT, …) with many rapid iterations
– Truly distributed effort
– Integration of the new COD partner (Russia) went smoothly
– Tuning of procedures is an ongoing process; no dramatic changes (take resource size more into account)

19 June 2005 Deployment update
Accounting
Last November still an area of concern
– APEL now well established: support for batch systems is improving; several privacy-related problems have been understood and solved
gLite accounting: DGAS
– Some concerns about the amount of information published; can be handled by proper authorization?
– Collaboration with APEL on batch sensors (BQS, Condor, …); DGAS agreed to provide them
– Will be introduced initially on a voluntary basis; sites will give feedback (including privacy issues)

20 June 2005 Deployment update
Current deployment issues (recap)
Main GridPP concerns:
– gLite migration, fabric management & future of YAIM
– dCache
– Data migration – classic SE to SRM SE
– Security
– Ganglia deployment
– Use of ticketing system
– Use of UK testzone
General:
– Jobs at sites – improving (nb. Freedom of Choice is coming!)
– Few general EGEE VOs supported at GridPP sites

21 June 2005 Deployment update
Freedom of choice - VO Page

22 June 2005 Deployment update
Service Challenge 3

23 June 2005 Deployment update
SC timelines
[Timeline 2005–2008: SC2, SC3, SC4, LHC Service Operation, full physics run; cosmics, first beams, first physics]
– Jun05: Technical Design Report
– Sep05: SC3 Service Phase
– May06: SC4 Service Phase
– Sep06: Initial LHC Service in stable operation
– Apr07: LHC Service commissioned
SC2 – reliable data transfer (disk-network-disk): 5 Tier-1s, aggregate 500 MB/sec sustained at CERN
SC3 – reliable base service: most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput 500 MB/sec, including mass storage (~25% of the nominal final throughput for the proton period)
SC4 – all Tier-1s, major Tier-2s: capable of supporting the full experiment software chain inc. analysis; sustain nominal final grid data throughput
LHC Service in Operation – September 2006: ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput

24 June 2005 Deployment update
Service Challenge 3 - Phases
High-level view:
Throughput phase
– 2 weeks sustained in July 2005: “obvious target” – GDB of July 20th
– Primary goals: 150MB/s disk to disk to Tier1s; 60MB/s disk (T0) to tape (T1s)
– Secondary goals: include a few named T2 sites (T2 -> T1 transfers); encourage remaining T1s to start disk-to-disk transfers
Service phase
– September – end 2005
– Start with ALICE & CMS; add ATLAS and LHCb October/November
– All offline use cases except for analysis
– More components: WMS, VOMS, catalogs, experiment-specific solutions
– Implies production setup (CE, SE, …)
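As a sanity check on these targets, the sustained rates translate into sizeable total volumes; this is simple arithmetic on the stated figures, not numbers from the slides:

```python
# Data volumes implied by sustained SC throughput targets (1 MB = 1e6 bytes)
def volume_tb(rate_mb_s: float, days: float) -> float:
    """Total volume in TB moved at a sustained rate over the given number of days."""
    return rate_mb_s * 1e6 * days * 86400 / 1e12

# SC3 throughput phase: 150 MB/s disk to disk, 2 weeks sustained
print(round(volume_tb(150, 14), 1))  # 181.4 TB per channel at the primary goal rate

# SC2/SC3 aggregate target: 500 MB/s sustained for the same 2 weeks
print(round(volume_tb(500, 14), 1))  # 604.8 TB in aggregate
```

At these volumes a two-week window leaves little slack, which is why the slides stress having the release deployed and tested before July.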

25 June 2005 Deployment update
SC implications
SC3 will involve the Tier 1 sites (+ a few large Tier 2) in July
– Must have the release to be used in SC3 available in mid-June
– Involved sites must upgrade for July
– Not reasonable to expect those sites to commit to other significant work (pre-production etc.) on that timescale
– T1: ASCC, BNL, CCIN2P3, CNAF, FNAL, GridKA, NIKHEF/SARA, RAL and …
Expect the SC3 release to include FTS, LFC, DPM, but otherwise be very similar to LCG-2.4.0
September–December: experiment “production” verification of SC3 services; in parallel, set up for SC4
Expect the “normal” support infrastructure (CICs, ROCs, GGUS) to support service challenge usage
Bio-med also planning data challenges – must make sure these are all correctly scheduled

26 June 2005 Deployment update
SC3 issues
– Tier-1 network being extensively re-configured. Tests showed up to 40% packet loss! Waiting for UKLight to be fixed. Not intending to use dual-homing, but dCache have provided a solution
– Lancaster link up at the link level. What is the bandwidth of the Lancaster connection?
– Edinburgh hardware problem with the RAID array to be used as SE – IBM investigating
– Lancaster set up a test system; now deploying more hardware
– Need clarification about the classification of volatile vs permanent data in respect of Tier-2s
– The file transfer service should be ready now but has problems with the client component
– RAL would like a longer period for testing tape than suggested in the SC3 plans
– There has been an issue with CMS preferring to use PhEDEx and not FTS for transfers. We need to add into the plans a period for PhEDEx-only transfer tests
– dCache mailing list very active now. There have been problems with the installation scripts
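To see why 40% packet loss is fatal for a high-throughput link, the classic Mathis et al. approximation for single-stream TCP throughput (rate ≈ MSS / (RTT · √p)) can be evaluated. The MSS and RTT values below are illustrative assumptions, not measurements from the tests described above:

```python
from math import sqrt

def mathis_throughput_mb_s(mss_bytes: float, rtt_s: float, loss: float) -> float:
    """Approximate upper bound on single-stream TCP throughput (Mathis et al.), in MB/s."""
    return mss_bytes / (rtt_s * sqrt(loss)) / 1e6

# Illustrative values: 1460-byte segments, 10 ms RTT within the UK
clean = mathis_throughput_mb_s(1460, 0.010, 0.0001)  # 0.01% loss
lossy = mathis_throughput_mb_s(1460, 0.010, 0.40)    # 40% loss

print(round(clean, 2))  # 14.6 MB/s per stream
print(round(lossy, 3))  # 0.231 MB/s per stream
```

Even as an order-of-magnitude estimate, loss at that level reduces per-stream throughput by a factor of over sixty, well below the SC3 targets, so fixing the network comes before any transfer tuning.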

27 June 2005 Deployment update
SC3 issues continued
– We have questions about whether FTS uses SRM-put or SRM-cp
– From September onwards the SC3 infrastructure is to provide a production-quality service for all experiments – remember the comments about UKLight being a research network – risk!?
– Differing engagement with the experiments. Edinburgh needs a better relationship with LHCb
– There is an LCG workshop in mid-June where the experiment plans should be almost final!
– GridPP needs to do more load testing than is anticipated in SC3
– Planning for SC4 needs to start soon. Currently we are pushing dCache, but DPM is also supposed to be available

28 June 2005 Deployment update
Imperial (London Tier-2)
SRM/dCache status
– Production server installed: gfe02.hep.ph.ic.ac.uk (information provider still developing)
– 1.5TB pool node added: RHEL 4, 64-bit system, installed using the dcache.org instructions (http://www.dcache.org/downloads/dCache-instructions.txt)
– Extra 1.5TB ready to add when CMS is ready
– 6TB being purchased; should be in place by the start of the Setup Phase
CMS software
– Service node provided
– PhEDEx installed
– Confirmation on the FTS/PhEDEx issue sought

29 June 2005 Deployment update
Edinburgh
Current LCG production setup:
– Compute Element (CE), Classic Storage Element (SE), 3 Worker Nodes (2 machines, 3 CPUs)
– Monitoring takes place on the SE, running LCG 2.4.0
– About to add 2 Worker Nodes (2 CPUs in 1 machine); a User Interface (UI) is in testing
– We have a 22TB datastore available
Plans:
– £2000 available for 2 machines: one for dCache work and one to connect to EPCC's SAN (10 TBytes promised)
– Considering the procurement of more WNs but have no clear requirements from LHCb

30 June 2005 Deployment update
Lancaster (current)

31 June 2005 Deployment update
Lancaster (planned)
1. LightPath and terminal end-box installed
2. Still require some hardware for our internal network topology
3. Increase in storage to ~84TB, possibly ~92TB with a working resilient dCache from the CE

32 June 2005 Deployment update
Other areas…

33 June 2005 Deployment update
JRA4 request
– We have some idea of requirements from networking experts within JRA4
– Draft requirements document: https://edms.cern.ch/document/593620/1
– Draft use-case document: https://edms.cern.ch/document/591777/1
– We're looking for more input from NOCs and GOCs
– If you have requirements, use cases or opinions on interfaces or needed metrics, please send them to us
– Even if you don't have ideas at the moment but would like to be involved in the process, please get in contact
– Contact details are at the end of the talk

34 June 2005 Deployment update
DTEAM discussion
– Review of team objectives – what is the team focus for the next 3 & 5 months
– Communications with the experiments
– Using a project tool to work better as a team
– Metrics!!
– Review of plans and what needs to be done to keep them up to date, including GridPP challenges and SC4
– Web-page status
– Areas raised at the T2B and DB meetings
– Security challenge involvement
– Accounting – status and making further progress
– Libraries and understanding expt. needs
– Review dCache efforts
– Address issues with quarterly reports & weekly reports
– Next release, test-zone and test-zone machines
– Data management – guidelines required
– Improving robustness
– GI – documentation (esp. releases), multi-Tier R-GMA, intro. of new sites, LCFGng distribution (Kickstart & PXE boot…), jobs – how to get

