Deployment metrics and planning (aka Potentially the most boring talk this week) GridPP16 Jeremy Coles 27th June 2006

Overview
1. An update on some of the high-level metrics
2. Even more metrics… zzZ
3. zzzz zzzzz ZZZZZZ ZZZZZZZ
4. What came out of the recent deployment workshops
5. What is happening with SC4
6. Summary

Available job slots have steadily increased. Contribution to EGEE varies between 15% and 20%. From this plot, stability looks like a problem! Thanks to Fraser for the data update.

Our contribution to the EGEE work done remains significant… but be aware that not all sites have published all data to APEL. Only one GridPP site is not currently publishing.

CPU usage has been above 60% since May. This is an update of the plot shown at GridPP15.
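As a back-of-the-envelope check, the utilisation figure can be derived from accounting data as delivered CPU-hours over available slot-hours. A minimal sketch; the function name and the 5000-slot, 2.3M CPU-hour figures are illustrative, not actual APEL numbers:

```python
# Hypothetical sketch of the utilisation calculation; numbers are
# illustrative, not real GridPP accounting data.

def occupancy(cpu_hours_used: float, job_slots: int, period_hours: float) -> float:
    """Fraction of available slot-hours actually consumed by jobs."""
    available_slot_hours = job_slots * period_hours
    return cpu_hours_used / available_slot_hours

# e.g. 5000 slots over a 30-day month, 2.3M CPU-hours delivered
print(round(occupancy(2_300_000, 5000, 30 * 24), 2))  # → 0.64
```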

This is because most VOs have doubled job rates – note LHCb!

IC-HEP are developing a tool to show job histories (per CE or Tier-2). The view shown for GridPP CEs covers the last week.

…but it looks a little rough sometimes! Over 5000 jobs running.

The largest GridPP users by VO for the last 3 months: LHCb, ATLAS, BABAR, CMS, BIOMED, DZERO, ZEUS.

VOs = a big success. But we now need to make sure that schedulers are giving the correct priority to LHC VO jobs! The ops VO will be used for monitoring from the start of July.

Ranked CEs for Apr-Jun 2006. Thanks to Gidon and Olivier for this plot.

Ranked CEs for Apr-Jun 2006

Successful time / total time. Thanks to Gidon and Olivier for this plot.

An interesting view by Tier

A little out of date: the Q1 view of contribution and occupancy. Some sites appear more successful at staying full even when overall job throughput is not saturating the resources. For Q2 most sites should show decent utilisation (of course this plot involves estimates and assumes 100% availability).

Storage has seen a healthy increase – but usage is only ~40%. SRM v2.2 is delayed. There have been several workshops/meetings taking forward the details of storage types (custodial vs permanent etc.).

Scheduled downtime is better than the EGEE average… but still not really good enough to meet MoU targets. Sites need to be able to update without draining the site, and there are still open questions about what "available" means. GOCDB needs finer granularity for different services.

So are there any recent trends!? This is the percentage of time that a site was down in a given period – if a site were down for a whole month, its monthly stack (each colour) would be 100%.

% SFTs failed for UKI: this seems better than the EGEE average for April and May, but slightly worse in June so far. These figures really need translating into hours unavailable, and into the impact on the 95% annual availability target.

SFTs per site over time: generally, April and May seem to be improvements on January to March.

Number of trouble tickets: more tickets in Q2 so far! This seems correlated with the increased job load. The profile is really quite similar between Q1 and Q2 2006.

Average time to close tickets: tickets usually come from the grid operator on duty. We need to look at the factors behind these times. Note that just a few tickets staying open for a long time can distort the conclusions. We need better-defined targets: the MoU talks about a time to response of 12 hours (prime time) and 72 hours (non-prime time).
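The distortion mentioned above is the usual mean-versus-median effect: a handful of long-open tickets can dominate the average. A toy example (the ticket durations are invented):

```python
# Illustrative only: how one long-lived ticket skews the mean
# time-to-close while the median stays near the typical case.
from statistics import mean, median

hours_to_close = [4, 6, 9, 12, 15, 20, 500]  # one ticket open ~3 weeks

print(round(mean(hours_to_close), 1))  # → 80.9, dominated by the outlier
print(median(hours_to_close))          # → 12, closer to typical experience
```

This is why reporting a median (or trimming outliers) alongside the mean would give a fairer picture of response performance.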

Middleware upgrade profiles remain similar. gLite was deployed late but released on time, raising questions about project-wide communications. Our target remains 1 month from the agreed start date. EGEE wants to move to rolling updates, but there are still issues around tracking (publishing) the component versions installed.

Disk-to-disk transfer rates: the testing went well (thanks to Graeme) but we have a lot to do to improve rates. Suspected/actual problems and possible solutions are listed in the SC wiki:
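When comparing achieved rates against targets, a small unit-conversion helper avoids MB/s versus Gb/s confusion. A sketch (the 125 MB/s figure is simply the line rate of a fully utilised gigabit link, not a measured GridPP result):

```python
# Hedged helper for sanity-checking disk-to-disk rates: convert an
# achieved MB/s into the Gb/s figure usually quoted for targets.

def mb_per_s_to_gb_per_s(mb_per_s: float) -> float:
    """Megabytes/s -> gigabits/s (8 bits per byte, 1000 Mb per Gb)."""
    return mb_per_s * 8 / 1000

print(mb_per_s_to_gb_per_s(125.0))  # → 1.0, i.e. a saturated GigE link
```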

Some key work areas for Q3 and Q4
–Improving site availability/monitoring (e.g. Nagios scripts with alarms)
–Getting the transfer rates higher
–Understanding external connectivity data transfer needs
–Understanding performance differences across the sites
–Adapting to the rolling-update middleware model
–Implementing storage accounting
–Improving cross-site support
–Understanding the WLCG MoU mapping to the UK Tier-2 structure (and how we meet it)
–Taking part in LCG experiment challenges (SC4 and beyond)
–Streamlining the support structure (helpdesk)
–SRM upgrades (SRM v2.2)
–Integrating new resources (start to address the CPU:disk imbalance vs requirements)
–Security: incident response
–Exploiting SuperJanet upgrades
–Improved alignment with the UK National Grid Service
–The usual: documentation and communication

Workshop outputs
Tier-2 workshop/tutorials already covered – next planned for January 2007
OSG/EGEE operations workshop:
RELEASE AND DEPLOYMENT PROCESS
–Why do sites need to schedule downtime for upgrades?
–Release: is local certification needed? Sites required for testing against batch systems
–Links to deployment timetable and progress area
USER SUPPORT
–How to improve communications (the role of the TCG was even debated!)
–Experiment/VO experience. Improving error messaging!
SITE VALIDATION
–Site Availability Monitoring (SFTs for critical services – will remove some of the general SFT problems that end up logged against sites)
VULNERABILITY & RISK ANALYSIS
–New in EGEE-II = SA3
–Move to a new policy for going public with vulnerabilities
–RATS (risk analysis teams)
Service Challenge technical workshop:
–Review of individual Tier-1 rates and problems
–Experiment plans are getting clearer and were reviewed
–Commitment to use GGUS for problem tickets

Identified experiment interactions (please give feedback!)
ScotGrid (signed up to ATLAS SC4)
–Durham
–Edinburgh
–Glasgow – PPS site involved with work for ATLAS
NorthGrid (signed up to ATLAS SC4)
–Lancaster – involved with ATLAS SC4
–Liverpool
–Manchester – already working with ATLAS but not SC4-specific
–Sheffield
SouthGrid
–Birmingham
–Bristol
–Cambridge
–Oxford – ATLAS?
–RAL-PPD – will get involved with CMS
London Tier-2
–Brunel – offer to contribute to ATLAS MC production
–Imperial – working with CMS
–QMUL – ATLAS? (manpower concerns)
–RHUL – bandwidth concern. ATLAS MC?
–UCL

Summary
1. There is a lot of data but not in a consistent format
2. Within EGEE and WLCG our contribution remains strong
3. Some issues with SFTs and scheduled downtime
4. Workshops over the last 2 weeks have been useful
5. Some clear tasks for the next 6 months
6. We need more sites to be involved with experiment challenges