
1 Deployment Summary
GridPP12
Jeremy Coles J.Coles@rl.ac.uk
1st February 2005

2 Contents
– LCG operations workshop
– EGEE structures
– Operations model
– Current status
– Support
– Planning
– Metrics
– Some of the recurring issues at GridPP12
– Future activities

3 Some operational issues (background to the workshop)
Slow response from sites (central perception)
– Upgrades, response to problems, etc.
– Problems reported daily – some problems last for weeks
Lack of staff available to fix problems
– All on vacation, …
Misconfigurations (units, grid-mapfile builds, user profiles, pools, …)
Lack of configuration management – problems that are fixed reappear
Lack of fabric management
– Is it the GDA's responsibility to provide solutions to these problems?
Lack of understanding (training?)
– Admins reformat the disks of an SE …
Firewall issues
– Coordination between grid admins and firewall maintainers
PBS problems
– Are we seeing the scaling limits of PBS?
People not reading documentation …
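Since grid-mapfile builds come up above: a grid-mapfile maps certificate subject DNs to local Unix accounts, one entry per line, and a stale or malformed entry silently denies access for the affected users. An illustrative entry (the DN is invented; the leading dot on the account name requests a pool account rather than a fixed local user):

    "/C=UK/O=eScience/OU=CLRC/L=RAL/CN=jane doe" .dteam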

4 LCG Workshop Nov 2004
Operational Security
– Incident handling process
– Variance in site support availability
– Reporting channels
– Service challenges
Operational Support
– Workflow for operations & security actions
– What tools are needed to implement the model
– “24x7” global support sharing the operational load (CIC-on-duty)
– Communications (news)
– Problem tracking system
– Defining responsibilities: problem follow-up, deployment of new releases
– Interface to user support
LCG (EGEE) discussion on a superset of the topics discussed at GridPP11

5 LCG Workshop Nov 2004
Fabric Management
– System installations (tools, interfacing tools with each other)
– Batch/scheduling systems (OpenPBS/Torque, MAUI, fair-share)
– Fabric monitoring
– Software installation
– Representation of site status (load) in the information system
Software Management
– Operations on and for VOs (add/remove/service discovery)
– Fault tolerance, operations on running services (stop, upgrades, restarts)
– Link to developers
– What level of intrusion can be tolerated on the WNs (farm nodes) for application (experiment) software installation?
– Removing/(re-adding) sites with (fixed) troubles
– Multiple views in the information system (maintenance)

6 GDB
LCG Grid Deployment Board
– One representative from each country (with a Regional Centre) involved in the LCG, and one representative from each experiment
– Chairman changes annually
– Meets in person once per month
What it does!
– Explores issues of global concern to the LCG community
– Makes decisions on deployment, operations and planning for LCG
– Provides mechanisms for resource forecasting
How?
– By calling upon experts to present the latest information on specific topics
– By creating and overseeing working groups to tackle important areas; currently three groups: Security, Networking and Quattor
Who is involved in UKI?
– UK representative: John Gordon
– Security group coordinator: Dave Kelsey
– GDB secretary: Jeremy Coles

7 Proposed escalation procedure
Because unstable and badly configured sites cause a big problem:
– Unstable sites that have frequent problems will appear on a list of bad sites
– Sites that do not respond to problem reports (including not upgrading middleware versions):
  – Will be removed from the information systems and maps
  – Will have to be re-certified to get back in
  – Will be reported to the GDB (LCG) or PMB (EGEE) representative as non-responsive

8 ROCs (EGEE background)
Regional Operations Centres (ROCs)
– Part of the EGEE SA1 activity (http://egee-sa1.web.cern.ch/egee%2Dsa1/)
– The regions are CERN, France, Italy, UK & Ireland, Germany & Switzerland, Northern Europe, South West Europe, South East Europe, Central Europe and Russia.
What they do
– Coordinate regional efforts in all activities (support, operations representation, security)
– Take up operations and deployment issues at cross-project meetings
– Provide a forum for agreeing the work needed – pre-production service
How?
– Set up ROC structures within the region
– Create common groups to work on areas like pre-production services and helpdesk interfaces
– Meet fortnightly via telephone (http://agenda.cern.ch/displayLevel.php?fid=339) to discuss regional issues and problems
Who is involved for the UK?
– General: John Gordon
– Support: Andy Richards
– Security: Romain Wartel

9 CICs (EGEE background)
Core Infrastructure Centres (CICs)
– The CICs cover more than one region and deal with operations issues.
– There are currently 4 CICs: France, Italy, UK & Ireland and CERN, coordinated by the Operations Management Centre team at CERN.
– Meet weekly via telephone (http://agenda.cern.ch/displayLevel.php?fid=258)
– Each CIC is “on-duty” for 1 week in 4.
What they do!
– Operational and performance monitoring
– Troubleshooting and following up identified problems
– Operate general grid services (e.g. VO-related services)
– Provide information via the CIC portal (http://cic.in2p3.fr/)
How?
– Review monitoring data such as GStat and daily test results
– Enter identified problems into Savannah (moving to the GGUS portal soon)
– Follow up problems using email and telephone contacts
– Troubleshoot using experts, Wiki, etc.
Who is involved in UKI?
– Steve Traylen & Philippa Strange

10 CIC portal http://cic.in2p3.fr/

11 LCG-2/EGEE Operations
Regional Operations Centres (9)
– Act as front-line support for user and operations issues
– Provide local knowledge and adaptations
User Support Centre (GGUS)
– At FZK
– Provides a single point of contact (service desk)
Core Infrastructure Centres (4)
– CICs build on the LCG GOC at RAL
– Also run essential infrastructure services
– Provide support for other (non-LHC) applications
– Provide 2nd-level support to ROCs
Coordination
– At CERN (Operations Management Centre) and CIC for HEP
Taipei provides an operations centre and a 2nd instance of GGUS
– → start to build round-the-clock coverage
Discussions with Grid3/OSG on how to collaborate on ops support
– Share coverage?

12 (New) Operations Model
Operations Centre role rotates through the CICs
– CIC on duty for one week
– Procedures and tasks are currently defined: the first operations manual is available (a living document) covering tools, frequency of checks, escalation procedures and hand-over procedures
– CIC-on-duty website
Problems are tracked with a tracking tool
– Now central in Savannah
– Migration to GGUS (Remedy) with links to the ROCs' PT tools; problems can be added at GGUS or ROC level
CICs monitor the service, spot and track problems
– Interact with sites on short-term problems (service restarts etc.)
– Interact with ROCs on longer, non-trivial problems
– All communication with a site is visible to the ROC
– Build FAQs
ROCs support installation and first certification, resolving complex problems

13 Operations Model (diagram: OMC → CIC → ROC → RC, plus an “Other Grid” RC)

14 How does support map onto this? (diagram: the same OMC/CIC/ROC/RC structure, with a helpdesk at each ROC and the Savannah and GGUS tracking tools attached)

15 How does user support map onto this? (diagram: as on the previous slide, with VO1, VO2 and VO3 added)

16 How does user support map onto this? (diagram as on the previous slide)
We need to work out a better model for this in the UK.

17 Site updates
Most sites have stated an intention to move to SL3 and LCG 2.3 over the next few weeks.

18 Monitoring progress
http://goc.grid-support.ac.uk/gridsite/monitoring/
Produced:
– Certification tests
– GPPMon
– Maps
– RSS feeds
Can we:
– Have a single view?
– Integrate network info?
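On the “single view” question: since the GOC pages already publish RSS feeds, one low-tech option is to aggregate them client-side. A minimal sketch in Python using the feedparser library; the feed URLs are placeholders, not real GOC endpoints:

    # Aggregate several monitoring RSS feeds into a single view.
    # The URLs below are placeholders, not real GOC endpoints.
    import feedparser

    FEEDS = [
        "http://goc.example.org/rss/certification-tests",  # hypothetical
        "http://goc.example.org/rss/gppmon",               # hypothetical
    ]

    def latest_entries(urls, per_feed=5):
        """Yield (feed title, entry title) for the newest items of each feed."""
        for url in urls:
            feed = feedparser.parse(url)
            source = feed.feed.get("title", url)
            for entry in feed.entries[:per_feed]:
                yield source, entry.title

    for source, title in latest_entries(FEEDS):
        print(f"[{source}] {title}")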

19 Today’s functional test results
The tests show similar patterns across EGEE as a whole.
How can tests be made more usable by those who can react?

20 Accounting progress
http://goc.grid-support.ac.uk/gridsite/accounting/
Well done:
– Imperial College
– Manchester
– Oxford
– RAL Tier-1
– RAL PPD
– Edinburgh
– Glasgow
– UCL – CCC
– Durham
What next?
– More sites!!
– Provide older data
– Analyse & use
ALL sites need to keep their log files. Details in the accounting page FAQ.

21 Ganglia
http://www.gridpp.ac.uk/ganglia/
Well done:
– Manchester
– Edinburgh
– Lancaster
– QMUL
– Sheffield
– Bristol
– Oxford
– Liverpool
What next?
– We need all sites
– Review against MoUs
– Use data for warnings?

22 Status of planning

23 We have developed a plan for deployment at a high level. The deliverables form part of the GridPP2 project map. Each area has consequences for Tiers 1, 2 and 3 in, for example:
– Service challenges
– Data challenges
– Networking
– Security
– Resource provision
– Core services
– MoU commitments
– Functionality
– Accounting
– Scheduling of use
– Support
– …
It is still evolving and there is a lot of work here!

24 What metrics and why?
The list shared before… Subject of DTEAM discussion 16:00–18:00 today. What is actually useful now?
– Number of sites in production – simple count based on GOCDB information?
– Number of registered users – count of certificates issued?
– Number of active users
– Number of supported VOs
– Percentage of available resources utilised
– Peak number of concurrent jobs – measured by GStat for grid jobs
– Average number of concurrent jobs – measured by GStat for grid jobs
– Number of jobs not terminated by themselves or the batch system
– Accumulated site downtime per week (scheduled and unscheduled)
– Total CPUs deployed
– CPUs available
– Storage available and used
– CPU hours per VO
– UK relative contribution to experiments
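To make a couple of these concrete, the arithmetic is trivial once the raw samples are collected. A sketch with invented field names and numbers (not the GOCDB or GStat schema):

    # Derive peak/average concurrent jobs and utilisation from periodic
    # readings. Field names and sample values are invented for illustration.
    samples = [  # hypothetical hourly GStat-style readings for one site
        {"running_jobs": 120, "cpus_available": 200},
        {"running_jobs": 180, "cpus_available": 200},
        {"running_jobs": 150, "cpus_available": 200},
    ]

    peak_jobs = max(s["running_jobs"] for s in samples)
    avg_jobs = sum(s["running_jobs"] for s in samples) / len(samples)
    utilisation = avg_jobs / samples[0]["cpus_available"]

    print(f"peak concurrent jobs: {peak_jobs}")        # 180
    print(f"average concurrent jobs: {avg_jobs:.0f}")  # 150
    print(f"resources utilised: {utilisation:.0%}")    # 75%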

25 LHCb DC feedback
[LCG job submission summary table]
LCG efficiency: 61% … but note Tony Cass’s comments earlier about improving performance.

26 D0 MC performance
CE                                Success  Failed
bohr0001.tier2.hep.man.ac.uk          237       3   (98.8%)
cclcgceli01.in2p3.fr                    -      14
grid-ce.physik.uni-wuppertal.de         -       -
gridkap01.fzk.de                     2564      19
golias25.farm.particle.cz             198      15
heplnx131.pp.rl.ac.uk                 246       4   (98.4%)
lcgce02.gridpp.rl.ac.uk               293      10   (96.7%)
mu6.matrix.sara.nl                    397       7
tbn18.nikhef.nl                       154       2
Total                                4089      74
(The percentages in parentheses are the individual efficiencies called out for the UK CEs.)
Efficiency: 98%. Is this “much less than production quality”?

27 D0 MC performance
LCG efficiency: 99% – we need to be careful with what we mean!
Error      Count  Comment
Aborted       35  LCG error, e.g. file not found
Cancelled     21  Done by us for various reasons
Cleared        5  Done by us, enough events
Running       10  D0 software error: infinite loop
Scheduled      3  Can be OK; CZ disk crash
Total         74  Really 35 LCG errors
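The arithmetic behind the two headline figures, using only the totals quoted on slides 26 and 27:

    # Efficiency as quoted on slides 26 and 27 (totals taken from the slides).
    success = 4089     # successfully completed jobs
    failed = 74        # all failures, whatever the cause
    lcg_errors = 35    # the subset of failures actually attributable to LCG

    raw = success / (success + failed)           # ~0.982 -> the "98%" on slide 26
    lcg_only = success / (success + lcg_errors)  # ~0.992 -> the "99%" on slide 27
    print(f"raw efficiency: {raw:.1%}, LCG-only efficiency: {lcg_only:.1%}")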

28 GridPP12 deployment issues
– Ability to plan (service challenges, networking, resources)
– Responsiveness of sites
– Security
– gLite, gLite, gLite
(these all concern the “production” service)
Concept behind the “pre-production” service:
– New middleware (gLite, …) can be demonstrated and validated before being deployed in production
– Understand the migration strategy to 2nd-generation middleware
– Use the existing production service as the baseline comparison

29 GridPP12 deployment issues
– Ability to plan (service challenges, networking, resources)
– Responsiveness of sites
– Security
– gLite, gLite, gLite
– Tier-2s operating as real Tier-2s
– Use of Tier-2s (experiment models)
– Metrics (“get fit” plan)
– Use of Tier-2 SEs – SRM = Storage Really Matters!
– Engagement with experiments
– On-demand tests and other tools
– Support
– Communications
(these all concern the “production” service)

30 Deployment web-pages WORK IN PROGRESS

31 Summary
– The LCG workshop was useful. Some progress but not enough answers. Roadmaps proposed.
– EGEE has a deployment structure, and GridPP deployment works within the UKI ROC/CIC.
– We need to unravel the support problems and introduce something that works well for the UK.
– Sites are responding to requests, but sometimes slowly. Better communications are needed.
– We still have significant planning challenges to overcome (LCG SC1 failed and there is no clear gLite migration strategy; gLite could require a step back in deployment terms; implications of experiment computing models).
– By the next GridPP meeting we must be reporting on carefully defined metrics.
THANK YOU to everyone involved. Please remember – we need your feedback to improve the deployment mechanisms and the GridPP service.

