
1 Grid Computing for UK Particle Physics
Jeremy Coles, GridPP Production Manager (J.Coles@rl.ac.uk)
Monday 3rd September, CHEP 2007, Victoria, Canada

2 Overview
1 Background
2 Current resource status
3 Tier-1 developments
4 Tier-2 reviews
5 Future plans
6 Summary
Acknowledgements: material in this talk comes from various sources across GridPP, including the website (http://www.gridpp.ac.uk/), blogs (http://planet.gridpp.ac.uk/) and meetings such as last week's GridPP collaboration meeting (http://www.gridpp.ac.uk/gridpp19/).

3 Background
GridPP is a collaboration of particle physicists and computer scientists from the UK and CERN. The collaboration is building a distributed computing Grid across the UK for particle physicists. At the moment there is a working particle physics Grid across 17 UK institutions. A primary driver of this work is meeting the needs of WLCG.
http://www.gridpp.ac.uk/pmb/People_and_Roles.htm

4 Our user HEP community is wider than the LHC
- ESR: Earth Science Research
- MAGIC: gamma-ray telescope
- Planck: satellite for mapping the cosmic microwave background
- CDF
- D0
- H1
- ILC: International Linear Collider project (future electron-positron linear collider studies)
- MINOS: Main Injector Neutrino Oscillation Search, an experiment at Fermilab designed to study neutrino oscillations
- NA48
- SuperNEMO
- ZEUS
- CEDAR: Combined e-Science Data Analysis Resource for high-energy physics
- MICE: a neutrino factory experiment
- T2K (http://neutrino.kek.jp/jhfnu/): next-generation long-baseline neutrino oscillation experiment
- SNO

5 The GridPP2 project map shows the many areas of recent involvement
phenoGRID is a VO dedicated to developing and integrating (for the experiments) some of the phenomenological tools necessary to interpret the events produced by the LHC: HERWIG, DISENT, JETRAD and DYRAD. http://www.phenogrid.dur.ac.uk/
Job submission framework
Real Time Monitor: http://gridportal.hep.ph.ic.ac.uk/rtm/

6 However, provision of computing for the LHC experiments dominates activities
The reality of the usage is much more erratic than the target fairshare requests (charts: fairshare targets and actual CPU usage for ATLAS, BaBar, CMS and LHCb). Tier-2s are delivering a lot of the CPU time.
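To make the fairshare comparison concrete, here is a minimal Python sketch of checking delivered CPU share against a per-VO fairshare target. The VO names are real, but the target shares and CPU hours are invented placeholders, not GridPP accounting figures.

```python
# Illustrative only: target shares and delivered CPU hours are made-up numbers.
target_share = {"ATLAS": 0.40, "CMS": 0.30, "LHCb": 0.20, "BaBar": 0.10}
cpu_hours_delivered = {"ATLAS": 52000, "CMS": 11000, "LHCb": 30000, "BaBar": 7000}

total = sum(cpu_hours_delivered.values())
for vo, target in target_share.items():
    actual = cpu_hours_delivered[vo] / total
    print(f"{vo:6s} target {target:6.1%}  actual {actual:6.1%}  "
          f"deviation {actual - target:+6.1%}")
```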

7 So where are we with delivery of CPU resources for WLCG?
The overall WLCG pledge has been met, but it is not sufficiently used.
(Chart: CPU at the Tier-2 site level. Annotations: new machine room being built; shared resources.)

8 … storage is not quite so good
(Chart: storage at the Tier-2 site level. Annotations: more storage at one site but not included in dCache; shared resource; additional procurement underway.)

9 … and so far usage is not great
Experiments target sites with the largest amounts of available disk. Local interests influence involvement in experiment testing/challenges.
(Charts: London Tier-2 disk available versus used, labelled by experiment (ATLAS, CMS, BaBar); available space is growing while usage is steady.)

10 Tier-1 storage problems... aaahhhh… we now face a dCache to CASTOR2 migration
The migration to CASTOR continues to be a challenge! At the GridPP User Board meeting on 20 June it was agreed that six months' notice be given for dCache termination. Experiments have to fund storage costs past March 2008 for the ADS/vtp tape service.
(Chart: CASTOR data volume. Slide provided by Glenn Patrick, GridPP UB chair.)

11 At least we can start the migration now!
Separate CASTOR instances for the LHC experiments:
- ATLAS instance: version 2.1.3 in production.
- CMS instance: version 2.1.3 in production.
- LHCb instance: version 2.1.3 in testing.
Known "challenges" remaining: tape migration rates; monitoring at RAL (server load); SRM development (v2.2 timescale); disk1tape0 capability; bulk file deletion; repack; the experiment data challenges.
Problems faced (sample): identifying which servers are in which pools; tape access speed; address-in-use errors; disk-to-disk copies not working; changing storage class; sub-request pile-up; network tuning leading to a server crash; slow scheduling; reserve-space pile-up; small files; …
PANIC over?

12 New RAL machine room (artist's impression)

13 New RAL machine room: specs
- Shared room, 800 m2, can accommodate 300 racks + 5 robots.
- 2.3 MW power/cooling capacity.
- September 2008.
In GridPP3, extra fault-management/hardware staff are planned as the size of the farm increases. Several Tier-2 sites are also building/installing new facilities.

14 GridPP is keeping a close watch on site availability/stability
(Charts: availability for Site A and Site B alongside the UK average, which reveals non-site-specific problems; annotated incidents include a power outage, SE problems and an availability-algorithm issue.)
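A minimal sketch of the idea behind the UK-average curve, assuming per-site daily availability fractions are already to hand; the site names, numbers and the 0.7 alarm threshold are invented for illustration.

```python
# Invented example data: daily availability fraction (0-1) per site.
availability = {
    "Site A": [0.95, 0.40, 0.30, 0.96],  # day 2: site-specific outage
    "Site B": [0.94, 0.93, 0.35, 0.40],  # day 4: SE problems at this site only
    "Site C": [0.96, 0.95, 0.28, 0.97],
}

days = len(next(iter(availability.values())))
for day in range(days):
    uk_average = sum(site[day] for site in availability.values()) / len(availability)
    # A dip in the average across all sites points at a non-site-specific problem.
    flag = "  <- possible non-site-specific problem" if uk_average < 0.7 else ""
    print(f"day {day + 1}: UK average availability {uk_average:.2f}{flag}")
```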

15 We reviewed UK sites to determine their readiness for LHC startup
A questionnaire was sent out, followed by a team review covering:
- Management and Context
- Hardware and Network
- Middleware and Operating Systems
- Service Levels and Staff
- Finance and Funding
- Users
Some examples of the resulting discussions:
- Tier-2 sites are working well together (technical cooperation).
- Most Tier-2 storage is RAID5. RAM is 1 GB-8 GB per node. Some large shared (HPC) resources are yet to join. HEP influence is large.
- Widespread install methods: Kickstart, local scripts, YAIM + tarball WNs; Cfengine use is increasing (a rough sketch of the YAIM pattern follows after this list).
- Concern about the lack of full testing prior to use in production (PPS role).
- Monitoring (and metric accuracy) needs to improve.
- Still issues with site-user communication.
- Missing functionality: VOMS, storage admin tool, …
- … and much more.
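As a rough illustration of the "YAIM + tarball WN" install pattern mentioned in the list above: the sketch below fetches a relocatable worker-node tarball and runs YAIM over it. The tarball URL, install directory and site-info.def path are placeholders, and a production site would normally drive this from Kickstart or Cfengine rather than an ad-hoc script, so treat the exact paths and node type as assumptions to be checked against the local gLite release.

```python
import subprocess
import tarfile
import urllib.request

# Placeholder locations: not real GridPP endpoints or mandated paths.
TARBALL_URL = "http://example.org/glite-WN-tarball.tar.gz"
INSTALL_DIR = "/opt/glite"
SITE_INFO = "/opt/glite/yaim/etc/site-info.def"

def install_worker_node():
    # Fetch and unpack the relocatable worker-node middleware tarball.
    tarball_path, _ = urllib.request.urlretrieve(TARBALL_URL)
    with tarfile.open(tarball_path) as tar:
        tar.extractall(INSTALL_DIR)
    # Configure the node with YAIM: "-c" configures, "-s" names the site
    # configuration file and "-n WN" selects the worker-node profile.
    subprocess.run(
        [f"{INSTALL_DIR}/yaim/bin/yaim", "-c", "-s", SITE_INFO, "-n", "WN"],
        check=True,
    )

if __name__ == "__main__":
    install_worker_node()
```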

16 Generally agreed that regular (user) testing has helped improve the infrastructure
(Chart annotation: sites in scheduled maintenance.)
Observation: it is very useful to work closely with a VO in resolving problems (n.b. many of the problems solved here would have impacted others), as site admins are not able to test their site from a user's perspective. However, GridPP sites have stronger ties with certain VOs, and intensive support cannot scale to all VOs…

17 Improved monitoring (with alarms) is a key ingredient for future progress
MonAMI is being used by several sites. It is a "generic" monitoring agent: it supports monitoring of multiple services (DPM, Torque, …) and reporting to multiple monitoring systems (Ganglia, Nagios, …). A UK monitoring workshop/tutorial is being arranged for October. But you still have to understand what is going on and fix it!
http://monami.sourceforge.net/
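For readers unfamiliar with the pattern MonAMI implements, here is a hedged Python sketch of the same poll-and-report idea: read one number from the batch system and push it into Ganglia with the gmetric command-line tool. This illustrates the pattern only; it is not MonAMI's plugin interface, and the qstat parsing is a simplification that should be checked against the locally installed Torque version.

```python
import subprocess
import time

def queued_job_count():
    # Count queued jobs as reported by Torque's qstat (assumes qstat is on PATH).
    result = subprocess.run(["qstat"], capture_output=True, text=True, check=True)
    lines = result.stdout.splitlines()[2:]  # skip qstat's two header lines
    return sum(1 for line in lines if " Q " in line)  # 'Q' = queued state

def report_to_ganglia(name, value):
    # gmetric injects a one-off metric into the local Ganglia gmond.
    subprocess.run(
        ["gmetric", "--name", name, "--value", str(value),
         "--type", "uint32", "--units", "jobs"],
        check=True,
    )

if __name__ == "__main__":
    while True:
        report_to_ganglia("torque_queued_jobs", queued_job_count())
        time.sleep(60)  # poll once a minute, like a simple monitoring agent
```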

18 Examples of recent problems caught by better monitoring
- Phenomenologist and Biomed "DoS" attacks: the CPU ends up with too many gatekeeper processes active.
- Excessive resource consumption seen at some DPM sites; this turned out to be "hung" dpm.gsiftp connections from ATLAS transfers.
- Removing inefficient jobs.
A sketch of the first kind of check follows below.
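A minimal sketch of the kind of alarm that catches the "too many gatekeeper processes" case, using a plain ps listing. The threshold and the exact process name to match are assumptions for illustration, not the checks actually deployed at GridPP sites.

```python
import collections
import subprocess

THRESHOLD = 200  # invented alarm threshold for gatekeeper processes per user

def gatekeeper_processes_by_user():
    # List process owner and full command line, then count gatekeeper processes.
    result = subprocess.run(
        ["ps", "-eo", "user,args"], capture_output=True, text=True, check=True
    )
    counts = collections.Counter()
    for line in result.stdout.splitlines()[1:]:  # skip the ps header line
        user, _, command = line.strip().partition(" ")
        if "globus-gatekeeper" in command:
            counts[user] += 1
    return counts

if __name__ == "__main__":
    for user, count in gatekeeper_processes_by_user().items():
        if count > THRESHOLD:
            print(f"ALARM: {count} gatekeeper processes owned by {user}")
```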

19 Some (example) problems sites face
- Incomplete software installation.
- Slow user response.
- Disks fail or an SE is decommissioned: how to contact users?
- /tmp filled with >50 GB of log files from an ATLAS user, crippling the worker node (a sketch of this check appears after this list).
- Jobs fill storage (no quotas), so the site fails the Site Availability (rm) test and is blacklisted.
- 1790 gatekeeper processes running as the user alice001, but no ALICE jobs running.
- Confusion over queue setups for prioritisation.
- Job connections left open with considerable consumption of CPU and network resources.
- The ATLAS ACL change caused a lot of confusion: dates and requirements changed, and the script for sites (made available without the source) had bugs which caused concern.
- Very hard for sites to know if jobs are running successfully; there is a massive amount of wasted CPU time with (often automated) job resubmission.
- Knowing when to worry if no jobs are seen.
- Lack of sufficient information in tickets created by users (leads to problems assigning tickets and increased time resolving them).
- Slow or no response to site follow-up questions.
- Problems raised in multiple ways, with possible confusion about whether something is still a problem.
- CPU storm: gatekeeper processes stall, no jobs submitted, user banned.
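As a hedged illustration of catching the "/tmp full of one user's log files" problem before it cripples a worker node, a small sketch that sums /tmp usage per owner. The 10 GB threshold is an arbitrary choice for the example, not a GridPP policy.

```python
import collections
import os
import pwd

THRESHOLD_BYTES = 10 * 1024**3  # arbitrary 10 GB per-user limit for the example

def tmp_usage_by_uid(path="/tmp"):
    usage = collections.Counter()
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                info = os.lstat(os.path.join(root, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            usage[info.st_uid] += info.st_size
    return usage

if __name__ == "__main__":
    for uid, used in tmp_usage_by_uid().items():
        if used > THRESHOLD_BYTES:
            try:
                owner = pwd.getpwuid(uid).pw_name
            except KeyError:
                owner = str(uid)
            print(f"WARNING: {owner} is using {used / 1024**3:.1f} GB in /tmp")
```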

20 GridPP future
(Timeline 2006-2008: GridPP2, then GridPP2+, then GridPP3.)
GridPP3 proposal submitted July 13 (£25.9M).
End of GridPP2: 31 August 2007. Start of GridPP3: 1 April 2008.
http://www.ngs.ac.uk/access.html

21 Summary
1 GridPP has involvement with many HEP areas but WLCG dominates
2 Resource deployment OK; still a concern about utilisation
3 Some major problems for Tier-1 storage have eased
4 Availability and monitoring are now top priorities
5 Sites face "new" challenges and need more monitoring tools
6 GridPP now funded until 2011

