Production Manager’s Report PMB Jeremy Coles 13th September 2004.


1 Production Manager’s Report PMB Jeremy Coles J.Coles@rl.ac.uk 13th September 2004

2 Site status information (1)

3 Site status information (2) Up-to-date status at: https://www.gridpp.ac.uk/production_manager/status.html

4 Headline news
Tier-1
– CPU upgrades in place for some time now
– Disk servers not yet in production – awaiting riser card revision for all machines as the current series continues to break (25% failure rate)
Tier-2s
– RAL-PPD is now supporting the Babar VO
– QMUL unable to upgrade until new port of LCG2 is working
– Liverpool is ramping up CPUs
– Edinburgh and Brunel recently completed tests
General
– Babar expect to upgrade farms to SLC3 this autumn
– GridPP is now offering about 1900 job slots (almost equal to the number of CPUs)

5 Deployment areas (1)
Security – 2 incidents at CERN
Middleware – C&T will issue a release each month but this will not necessarily move to a deployment release (see slide for pre-production service)
Fabric
– Front-end nodes for UK sites due to be delivered 14th September. Many sites waiting.
– LCFG will not be supported under SLC3 and CERN will not formally release Quattor updates. Parallel discussion this week.
Documentation
– Identified need for a site administrators’ guidebook
– Intend to start using GOC news update service for site information
Support
– No change yet. EGEE operational plan due to be discussed at CERN workshop 1-3 November and also at EGEE CIC meeting today

6 EGEE Pre-production service
Roadmap for migration of pre-production service to gLite:
– Based on the following assumptions:
  JRA1 release plan v1.3
  JRA1 testing takes 4 weeks
  SA1 certification takes 2 weeks
  Pre-production takes s/w only after certification is complete
  No problems found in either JRA1 testing or SA1 certification

Component         Available from JRA1   In pre-production (earliest)
R-GMA             mid September         mid October
CE                end September         end October
Metadata Cat.     end September         end October
File I/O          end September         end October
Accounting        end October           end November
Data Scheduler    end October           end November
File Cat.         end October           end November
File Transfer     end October           end November
Logging & Book.   end October           end November
Replica Cat.      end October           end November
SE                end October           end November
VOMS              end October           end November
WMS               end October           end November
SRM               mid January ’05       early February ’05

7 Deployment areas (2)
Procedures – Also part of EGEE work. More work needs to be done in this area
Accounting & monitoring
– Deliverable for December will probably cover PBS and LSF
Metrics
– Some progress but clarity still required (see later slide)
Deployment plan
– Some progress in identifying dependencies but more work to be done
– Proving difficult to get dates from various areas!
– ..\Planning\GridPP CP v0.1.ppt

8 Operational issues (selection)
Slow response from sites
– Upgrades, response to problems, etc.
– Problems reported daily – some problems last for weeks
Lack of staff available to fix problems
– All on vacation, …
Misconfigurations (see next slide)
Lack of configuration management – problems that are fixed reappear
Lack of fabric management
– Is it GDA responsibility to provide solutions to these problems? If so, we need more available effort (see slide on workshops etc)
Lack of understanding (training?)
– Admins reformat disks of SE …
Firewall issues
– Often no good coordination between grid admins and firewall maintainers
PBS problems
– Are we seeing the scaling limits of PBS?
Forget to read documentation …
(Ian Bird talk, GDB, 8th September)

9 Site (mis)-configurations
Site mis-configuration was responsible for most of the problems that occurred during the experiments’ Data Challenges. Here is a non-exhaustive list of problems:
– The variable VO_<VO>_SW_DIR points to a non-existent area on WNs
– The ESM is not allowed to write in the area dedicated to the software installation
– Only one certificate allowed to be mapped to the ESM local account
– Wrong information published in the information system (Glue Object Classes not linked)
– Queue time limits published in minutes instead of seconds and not normalized
– /etc/ld.so.conf not properly configured. Shared libraries not found.
– Machines not synchronized in time
– Grid-mapfiles not properly built
– Pool accounts not created but the rest of the tools configured with pool accounts
– Firewall issues
– CA files not properly installed
– NFS problems for home directories or ESM areas
– Services configured to use the wrong BDII
– Wrong user profiles
– Default user shell environment too big
(Ian Bird talk, GDB, 8th September)
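
Two of the problems listed above (a software-area variable pointing at a missing directory, and queue time limits published in minutes rather than seconds) lend themselves to simple automated checks. The following is a minimal sketch only, not a tool from the talk: the function names are hypothetical, and the per-VO variable name VO_<VO>_SW_DIR is an assumption about how the "VO SW DIR" variable on the slide is spelled on a worker node.

```python
import os
from typing import List, Mapping


def check_sw_dir(vo: str, env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found with the VO software-area variable.

    Assumes the variable is named VO_<VO>_SW_DIR (hypothetical spelling).
    """
    var = f"VO_{vo.upper()}_SW_DIR"
    path = env.get(var)
    if not path:
        return [f"{var} is not set"]
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{var} points to a non-existent area: {path}")
    elif not os.access(path, os.W_OK):
        # The account running the check cannot write to the software area.
        problems.append(f"{path} is not writable by the current account")
    return problems


def looks_like_minutes(published_limit: int, expected_seconds: int) -> bool:
    """Heuristic: the published value is exactly expected_seconds / 60,
    which suggests it was given in minutes instead of seconds."""
    return published_limit * 60 == expected_seconds
```

On a worker node, `check_sw_dir("atlas", os.environ)` would be run under the account that installs the experiment software; an empty list means neither of the two problems was detected.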

10 ATLAS on LCG Preliminary UK contribution ~ 20%. Third biggest contribution from RAL.

11 Phase 1 Statistics from LHCb
DC’04 split into 3 phases:
1) Production: MC simulation (Done).
2) Stripping: Event pre-selection (To start soon).
3) Analysis (In preparation).
424 CPU · Years

12 Other VOs
Alice - active
Babar - active
CMS - active
D0 - active
ZEUS VO - active
“It may interest you to know that we have successfully processed 14000 Monte Carlo events on your cluster at RAL so far, and that virtually all of our tests there have been successful.” (James Ferrando – DESY – 9th September 2004)
No clear picture of what is happening on SAMGrid

13 What is the status of…?
ANTARES (high-energy neutrino telescope in Med) case for GridPP2 Application Interface Staff
CALICE – namely London (Imperial and UCL), NorthGrid (Manchester) and SouthGrid (Cambridge, Birmingham and RAL)
The seamless use of SAMGrid in cooperation with PPDG based on EGEE and LCG
D0 – integration by 2007
MICE at RAL
PhenoGrid
UK Dark Matter Collaboration
UKQCD

14 Deployment questions
Do we need a UK middleware integration testbed?
What level of network monitoring is required (which site(s))?
How does resource allocation work across Tier-2s?
What validation is required of MoU obligations?
Will we formally define the GridPP security policy?
What does available mean? => Tier-2 or Deployment board
UK Tier-1 supporting non-UK Tier-2s

15 Metrics (1)

16 Metrics (2)

