LHCOPN: Operations status. Network team, FR-CCIN2P3 (cc.in2p3.fr). LHCOPN meeting, Barcelona, 2010-06-29.

Presentation transcript:

Outline
Operations status
  – TTS stats
  – Change management
  – Backup tests
Ongoing
  – Relationships with WLCG
  – Around GGUS

What was reported in the TTS?
395 tickets in the TTS since
  – 381 solved (96%)
  – 7 in progress
      Normal ongoing issues or scheduled work
  – 5 unsolved
      Mainly performance issues not understood
      Duplicate or erroneous tickets, cancelled or postponed work
  – 2 assigned
      Twiki review pending (CA-TRIUMF, NDGF)
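The status counts on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the dictionary and variable names are illustrative and do not come from the TTS itself.

```python
# Ticket counts per status, as quoted on the slide.
tickets_by_status = {
    "solved": 381,
    "in progress": 7,
    "unsolved": 5,
    "assigned": 2,
}

total = sum(tickets_by_status.values())

# Share of each status, in percent of all tickets.
shares = {
    status: 100 * count / total
    for status, count in tickets_by_status.items()
}

for status, count in tickets_by_status.items():
    print(f"{status:12s} {count:4d} {shares[status]:5.1f}%")
print(f"{'total':12s} {total:4d}")
```

The four categories add up to the 395 tickets reported, and the solved share rounds to the 96% shown on the slide.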

5 long-standing issues
1 infrastructure
  – #55697: FR-CCIN2P3, BGP flapping with CH-CERN
      Ongoing issue, root cause not yet found, ~1 flap/day, not service affecting
4 administrative
  – #48335: Additional prefix for CA-TRIUMF
      Missing notification of acceptance from NDGF, UK-T1-RAL and US-FNAL-CMS
  – #52959: UK-T1-RAL, Review of LHCOPN twiki
      Only the routing policies remain to be updated
  – #56415: NDGF, Review of LHCOPN twiki
      Not started
  – #56417: CA-TRIUMF, Review of LHCOPN twiki
      Not started
The Ops phone conference does not seem very successful at getting these solved

Overall breakdown per category and type of problem
80% of tickets are L2 related events

Number of tickets put in the TTS per month
AVG: 23 tickets/month

Ticket ownership per site
Nearly 1/4 of tickets
NL-T1 has 6 LHCOPN links

Ownership of tickets per month per site

Kind of tickets per month

KPI-1: Infrastructure vs operations behavior
Less than 15 “significant” events / month?

Change management
Only 5 tickets flagged as « change »!
  – Is the infrastructure that stable?
Flag set on GGUS submit interface

Conclusion on TTS stats
L2 events are regular, hence well managed
NL-T1 seems to have a very good implementation of the Ops model
Administrative matters are frozen
  – Twiki review, change management etc.
      Not fascinating, but the vital minimum
Decrease in the monthly number of tickets
  – Feeling from sites that not all tickets are useful
  – Need to ensure the vital minimum is covered by correlating with monitoring

Backup tests?
Previously agreed: each resilience possibility should be demonstrated at least once a year
  – Failures can count as a test if they are properly reported (particularly paths’ symmetry)
Only two sites have reported a backup test or a demonstration of backup efficiency for
No recent change in the infrastructure, so no need to test?
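The once-a-year rule above lends itself to a simple automated check. The following is a minimal sketch, assuming each resilience possibility is recorded with the date of its last demonstration (a test or a properly reported failure). The site names, path labels and dates are hypothetical placeholders, not data from the TTS; only the reference date is the date of this meeting.

```python
from datetime import date, timedelta

# Hypothetical records: date each resilience possibility was last
# demonstrated. Names and dates are made up for illustration.
last_demonstrated = {
    ("SITE-A", "backup link 1"): date(2010, 3, 15),
    ("SITE-B", "backup link 1"): date(2009, 1, 20),
    ("SITE-C", "backup link 1"): None,  # never demonstrated
}


def overdue(records, today, max_age=timedelta(days=365)):
    """Return the entries not demonstrated within max_age of today."""
    return sorted(
        key
        for key, last in records.items()
        if last is None or today - last > max_age
    )


overdue_items = overdue(last_demonstrated, today=date(2010, 6, 29))
for site, path in overdue_items:
    print(f"{site}: {path} needs a backup test")
```

With these placeholder records, SITE-B and SITE-C are flagged while SITE-A is within the one-year window.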

Following MDM deployment related issues
Only deployment issues?
  – Physical set up etc.
  – Interaction with sites
Should be tracked through tickets
  – Still in GN3 helpdesk system?
      GN3 people have no access to GGUS
      LHCOPN people have no access to GN3 helpdesk
Should be visible
  – How, where?

What’s missing to go ahead?
Network SLD
  – What is a « significant » event requiring care etc.
Monitoring
  – Do we have service-impacting events?
  – Correlation with Operations
  – Evidence instead of feelings
      Particularly for performance issues
Fill the gap between WLCG Ops and LHCOPN Ops
  – Gap by design but bridge expected

Relationships with WLCG (1/4)
Lot of work previously done by Wayne
  – Clear overview during Vancouver’s presentation
Agreement from WLCG about!
Only missing careful implementation?
Minimum relationships should be made of
  – Exchanges during meetings
  – Operational exchange
      Clear process and KPI around
  – Facilitated with tickets’ linking
      Dashboard of service affecting issues
  – Sharing LHCOPN monitoring information

Relationships with WLCG (2/4)
Main stoppers
  – Meetings
      Not acting and represented as a whole community through an LHCOPN representative or “liaison officer”
      Too often asked to be there « just in case »
  – Operational exchanges
      Complex and hard to get used to, with very few issues involving WLCG (~1 every 3 months?)
  – Post mortem analysis is hard as a lot of exchanges seem to be off the record
  – Now a high-resiliency network
  – A lot of things are sites’ internal processes
Common use of GGUS is giving a false feeling of relationships
  – We are not doing user support!
Mistake to assume we can handle all network issues from our isolated island with our closed set of supporters
  – Need coordination and action from other teams (storage…)
  – Problem to interact with WLCG supporters

Relationships with WLCG (3/4)
Sample expected workflow for WLCG inquiries (flowchart)
Actors: Experiment, Site Contact, Site Network Team, Relevant WLCG Team, Relevant Network Team
Ticket systems: WLCG GGUS, LHCOPN GGUS, Internal Ticket System
Decision points: Networking? (Yes/No), LHCOPN related? (Yes/No)
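The flowchart itself did not survive in the transcript, only its labels and the two decision points. The sketch below is one possible reading of it, offered as an assumption rather than as the authors' actual workflow: an inquiry from an experiment reaches the site contact through WLCG GGUS, non-network issues go to the relevant WLCG team, network issues go to the site network team through the internal ticket system, and only LHCOPN-related ones are escalated through LHCOPN GGUS. The function name and the ordering of steps are hypothetical.

```python
from typing import List


def route_inquiry(is_networking: bool, is_lhcopn_related: bool) -> List[str]:
    """Return the assumed chain of hand-offs for a WLCG inquiry.

    The ordering is a guess reconstructed from the labels of the
    flattened flowchart; it is not confirmed by the source.
    """
    # Assumed entry point: the experiment contacts the site via WLCG GGUS.
    steps = ["Experiment", "WLCG GGUS", "Site Contact"]

    if not is_networking:
        # "Networking? No": handled inside the WLCG world.
        return steps + ["WLCG GGUS", "Relevant WLCG Team"]

    # "Networking? Yes": handed to the site's own network team.
    steps += ["Internal Ticket System", "Site Network Team"]

    if is_lhcopn_related:
        # "LHCOPN related? Yes": escalate to the LHCOPN helpdesk.
        steps += ["LHCOPN GGUS", "Relevant Network Team"]

    return steps


for networking, lhcopn in [(False, False), (True, False), (True, True)]:
    print(networking, lhcopn, " -> ".join(route_inquiry(networking, lhcopn)))
```

In this reading, only network teams ever touch LHCOPN GGUS, which matches the later statement that the LHCOPN helpdesk should not be customer facing.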

Relationships with WLCG (4/4)
Workplan
  – A dashboard showing tickets impacting WLCG
      Done: Particular view on the dashboard
  – Ability to link WLCG and LHCOPN tickets
      Upcoming: Parent/Child relationship
  – Cross reference still here (no associated workflow)
But problem to interact with WLCG supporters
  – No cross helpdesk access to update tickets
  – On site processes
      Push for careful implementation of « Site’s contact »?
  – Internal site’s processes

Around GGUS (1/6): GGUS status list

Around GGUS (2/6): LHCOPN submit interface

Around GGUS (3/6): WLCG submit interface

Around GGUS (4/6): Merging Pros
Should we unify/merge LHCOPN helpdesk within the standard GGUS?
  + Consider networks like other resources (computing, storage, software...)
      Networks are not a standalone resource, coordination between sites is required
  + Maybe a better fit in reporting
      True
  + Now standard way to send enquiries to sites?
      Yes for Grid issues, not always for network teams, which are less Grid centred and unwilling to go at project level
      But for a project’s dedicated network?
  + Maybe some central manpower could be gained
  + Regularly chasing pending tickets...
      Very unclear who can do that, and if this will be successful (cf. twiki review)
  + Less specific software and support from GGUS
      No major savings for them: still using same database, hosts etc. and sharing some code
  + Ease interactions with WLCG supporters
      Issues evolving in two different worlds
      Write access to our helpdesk restricted to network teams

Around GGUS (5/6): Merging Cons
  – We have something stable and working
      Definitely, but that should not prevent improvements
  – Completely tailored for us and closely matching our operational model
      Seems hard to merge frontends and unify workflows
  – Be far from interferences with Grid world
      Isolation could be achieved with particular views?
      Was a key concern from network teams
  – Not shaped to do user support
      But coordinating network teams
      Maintenance not in GGUS
No strong preference from the GGUS team
      Confirmed, not a problem for them

Around GGUS (6/6): Conclusion about merging
Our helpdesk was designed to coordinate network teams, not to support WLCG users
  – Really different from standard GGUS
  – Appears as an internal coordination tool
Benefits not so clear, was mainly thought to ease integration in WLCG Ops
  – But we are not doing Grid Ops
      Network issue ≠ Grid issue
      Networks are not standalone resources (storage, cpu etc.)
  – Similar to software issues handled externally (in savannah)
We should not be customer facing
  – Selected inquiries going through storage teams (“Site contact”)
Let’s also see how EGI will converge around user support

Conclusion about LHCOPN Operations
Ops status: clear room for improvement
  – Processes are followed unequally by sites, for lack of a clear feeling of usefulness and of evidence of network failures
  – L2 events well handled while administrative workflow is forgotten
WLCG relationships to be implemented and nurtured
  – Performance issues need smart and timely solving
  – Skeleton of coordination with WLCG Ops to be improved
No outstanding benefit in unifying the LHCOPN helpdesk with WLCG’s one
  – Maybe better and enough to carefully link our workflow with WLCG Ops
Wait for monitoring & SLDs before next set of improvements
  – Timeline?
  – Particularly revitalise ticket handling and ensure the minimum is covered

Questions
1. Pushing for administrative things to be done?
  – Twiki review, backup tests etc.
2. LHCOPN representative?
  – Maybe not responsible for Ops but more liaising as a single contact point
  – Share and justify the workload
3. GGUS merging
  – Opinion?