LHCOPN: Operations status. Network team, FR-CCIN2P3. LHCOPN meeting, Barcelona, 2010-06-29.

1 LHCOPN: Operations status
Guillaume.Cessieux@cc.in2p3.fr, Network team, FR-CCIN2P3
LHCOPN meeting, Barcelona, 2010-06-29

2 Outline
Operations status
  – TTS stats
  – Change management
  – Backup tests
Ongoing
  – Relationships with WLCG
  – Around GGUS

3 What was reported in the TTS?
395 tickets in the TTS since 2009-02
  – 381 solved (96%)
  – 7 in progress
      Normal ongoing issues or scheduled work
  – 5 unsolved
      Mainly performance issues not understood
      Duplicate or erroneous tickets, cancelled or postponed work
  – 2 assigned
      Twiki review pending (CA-TRIUMF, NDGF)
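The status breakdown on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the counts are taken from the slide, while the dictionary layout and the `share` helper are assumptions introduced here, not part of any TTS or GGUS tooling.

```python
# Ticket counts as stated on the slide (TTS contents since 2009-02).
ticket_counts = {
    "solved": 381,
    "in progress": 7,
    "unsolved": 5,
    "assigned": 2,
}

# The four categories should add up to the reported total of 395.
total = sum(ticket_counts.values())


def share(count, total):
    """Return the share of `count` in `total` as a percentage (hypothetical helper)."""
    return 100.0 * count / total


for status, count in ticket_counts.items():
    print(f"{status:12s} {count:4d}  {share(count, total):5.1f}%")
print(f"{'total':12s} {total:4d}")
```

The four categories sum to the 395 tickets reported, and the solved share rounds to the 96% quoted on the slide.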

4 Five long-standing issues
1 infrastructure
  – #55697: 2010-03-10, FR-CCIN2P3, BGP flapping with CH-CERN
      Ongoing issue, root cause not yet found, ~1 flap/day, not service affecting
4 administrative
  – #48335: 2009-04-30, Additional prefix for CA-TRIUMF
      Missing notification of acceptance from NDGF, UK-T1-RAL and US-FNAL-CMS
  – #52959: 2009-11-04, UK-T1-RAL, Review of LHCOPN twiki
      Only routing policies still to be updated
  – #56415: 2010-03-12, NDGF, Review of LHCOPN twiki
      Not started
  – #56417: 2010-03-12, CA-TRIUMF, Review of LHCOPN twiki
      Not started
The Ops phone conference has not been very successful in getting these solved

5 Overall breakdown per category and type of problem
80% of tickets are L2-related events

6 Number of tickets put in the TTS per month
Average: 23 tickets/month
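The monthly average is consistent with the 395 tickets reported earlier in the presentation. The sketch below assumes the counting window runs from 2009-02 through 2010-06 inclusive (the month of this meeting); the slide does not state the end of the window, so that bound and the `months_inclusive` helper are assumptions.

```python
from datetime import date


def months_inclusive(start, end):
    """Count calendar months from `start` to `end`, both months included."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


TOTAL_TICKETS = 395  # total reported in the TTS since 2009-02

# Assumed window: February 2009 through June 2010 inclusive.
window_months = months_inclusive(date(2009, 2, 1), date(2010, 6, 1))
average = TOTAL_TICKETS / window_months

print(f"{window_months} months, {average:.1f} tickets/month")
```

Under that assumption the window is 17 months long, and the average rounds to the 23 tickets/month shown on the slide.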

7 Ticket ownership per site
Nearly 1/4 of tickets
NL-T1 has 6 LHCOPN links

8 Ownership of tickets per month per site

9 Kind of tickets per month

10 KPI-1: Infrastructure vs operations behavior
Fewer than 15 “significant” events per month?

11 Change management
Only 5 tickets flagged as «change»!
  – Is the infrastructure that stable?
Flag set on the GGUS submit interface

12 Conclusion on TTS stats
L2 events are regular and therefore well managed
NL-T1 seems to have a very good implementation of the Ops model
Administrative matters are frozen
  – Twiki review, change management etc.
      Not fascinating, but a vital minimum
Decrease in the monthly number of tickets
  – Feeling from sites that not all tickets are useful
  – Need to ensure the vital minimum is met by correlating with monitoring

13 Backup tests?
Previously agreed: each resilience possibility should be demonstrated at least once a year
  – Failures can count as a test if they are properly reported (particularly paths’ symmetry)
Only two sites have reported a backup test or a demonstration of backup efficiency for 2010
  https://twiki.cern.ch/twiki/bin/view/LHCOPN/LhcopnBackupTestsResults2010
No recent change in the infrastructure, so no need to test?
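The "at least once a year" rule lends itself to a simple automated check. The following is a minimal sketch under stated assumptions: the path names and dates are invented placeholders rather than real LHCOPN data, and `overdue_paths` is a hypothetical helper, not an existing tool.

```python
from datetime import date, timedelta

# Agreed rule: each resilience possibility demonstrated at least once a year.
MAX_AGE = timedelta(days=365)


def overdue_paths(last_demonstrated, today):
    """Return backup paths never demonstrated, or not demonstrated within the last year."""
    overdue = []
    for path, when in last_demonstrated.items():
        if when is None or today - when > MAX_AGE:
            overdue.append(path)
    return sorted(overdue)


# Placeholder data for illustration only (not real sites or test dates).
example = {
    "site-A backup via site-B": date(2010, 3, 15),
    "site-C backup via site-D": date(2009, 1, 20),
    "site-E backup via site-F": None,  # never demonstrated
}

print(overdue_paths(example, today=date(2010, 6, 29)))
```

A properly reported failure could be recorded with its date in the same table, since the agreement allows failures to count as tests.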

14 Following up MDM deployment-related issues
Only deployment issues?
  – Physical set up etc.
  – Interaction with sites
Should be tracked through tickets
  – Still in GN3 helpdesk system?
      GN3 people have no access to GGUS
      LHCOPN people have no access to GN3 helpdesk
Should be visible
  – How, where?

15 What’s missing to go ahead?
Network SLD
  – What is a «significant» event requiring care etc.
Monitoring
  – Do we have service-impacting events?
  – Correlation with Operations
  – Evidence instead of feelings
      Particularly for performance issues
Fill the gap between WLCG Ops and LHCOPN Ops
  – Gap by design, but a bridge is expected

16 Relationships with WLCG (1/4)
A lot of work previously done by Wayne
  – Clear overview during the Vancouver presentation
      http://indico.cern.ch/materialDisplay.py?contribId=17&materialId=slides&confId=59842
      Agreement from WLCG on this! Is only careful implementation missing?
Minimum relationships should consist of
  – Exchanges during meetings
  – Operational exchange
      Clear process and KPI around it
  – Facilitated by ticket linking
      Dashboard of service-affecting issues
  – Sharing LHCOPN monitoring information

17 Relationships with WLCG (2/4)
Main stoppers
  – Meetings
      Not acting and represented as a whole community through an LHCOPN representative or “liaison officer”
      Too often asked to be there «just in case»
  – Operational exchanges
      Complex and hard to get used to, with very few issues involving WLCG (~1 every 3 months?)
  – Post-mortem analysis is hard, as a lot of exchanges seem to be off the record
  – Now a high-resiliency network
  – A lot of things are sites’ internal processes
Common use of GGUS gives a false feeling of relationships
  – We are not doing user support!
Mistake to assume we can handle all network issues from our isolated island with our closed set of supporters
  – Need coordination and action from other teams (storage…)
  – Problem interacting with WLCG supporters

18 Relationships with WLCG (3/4)
Sample expected workflow for WLCG inquiries:
[Flowchart; layout lost in the transcript. Nodes: Experiment, WLCG GGUS, Site Contact, Site Network Team, Relevant WLCG Team, Relevant Network Team, Internal Ticket System, LHCOPN GGUS. Decision points: “Networking?” (Yes/No) and “LHCOPN Related?” (Yes/No).]
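The two decision points named on this slide ("Networking?" and "LHCOPN Related?") can be read as a small routing function. Because the arrows of the original flowchart did not survive in the transcript, the branch targets below are an assumption about how the boxes connect, not a restatement of the agreed workflow; `route_inquiry` is a hypothetical name.

```python
def route_inquiry(is_networking, is_lhcopn_related):
    """Pick a destination for a WLCG inquiry received by a site contact.

    Assumed reading of the flowchart (the original arrows are lost):
      - not a networking issue        -> relevant WLCG team
      - networking and LHCOPN related -> LHCOPN GGUS
      - networking, not LHCOPN        -> site's internal ticket system
    """
    if not is_networking:
        return "Relevant WLCG Team"
    if is_lhcopn_related:
        return "LHCOPN GGUS"
    return "Internal Ticket System"


# Walk through the three possible outcomes.
for networking, lhcopn in [(False, False), (True, True), (True, False)]:
    print(networking, lhcopn, "->", route_inquiry(networking, lhcopn))
```

Whatever the exact wiring, the point of the slide is that the site contact and site network team act as the filter, so only selected inquiries reach the LHCOPN helpdesk.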

19 Relationships with WLCG (4/4)
Workplan
  – A dashboard showing tickets impacting WLCG
      Done: particular view on the dashboard
  – Ability to link WLCG and LHCOPN tickets
      Upcoming: parent/child relationship
  – Cross reference still here (no associated workflow)
But problem interacting with WLCG supporters
  – No cross-helpdesk access to update tickets
  – On-site processes
      Push for careful implementation of «Site’s contact»?
  – Internal site’s processes

20 Around GGUS (1/6): GGUS status list

21 Around GGUS (2/6): LHCOPN submit interface

22 Around GGUS (3/6): WLCG submit interface

23 Around GGUS (4/6): Merging Pros
Should we unify/merge LHCOPN helpdesk within the standard GGUS?
  + Consider networks like other resources (computing, storage, software...)
      Networks are not standalone resources, coordination between sites required
  + Maybe a better fit in reporting
      True
  + Now standard way to send enquiries to sites?
      Yes for Grid issues, not always for network teams: less Grid centred, unwilling to go at project level
      But for a project’s dedicated network?
  + Maybe some central manpower could be gained
  + Regularly chasing pending tickets...
      Very unclear who can do that, and whether it will be successful (cf. twiki review)
  + Less specific software and support from GGUS
      No key saving for them: still using same database, hosts etc. and sharing some code
  + Ease interactions with WLCG supporters
      Issues evolving in two different worlds
      Write access to our helpdesk restricted to network teams

24 Around GGUS (5/6): Merging Cons
  – We have something stable and working
      Definitely, but that should not prevent improvements
  – Completely tailored for us and closely matching our operational model
      Seems hard to merge frontends and unify workflows
  – Stay far from interference with the Grid world
      Isolation could be achieved with particular views?
      Was a key concern from network teams
  – Not shaped to do user support
      But coordinating network teams
Maintenance not in GGUS
No strong preference from the GGUS team
      Confirmed, not a problem for them

25 Around GGUS (6/6): Conclusion about merging
Our helpdesk was designed to coordinate network teams, not to support WLCG users
  – Really different from standard GGUS
  – Appears as an internal coordination tool
Benefits not so clear; merging was mainly thought to ease integration in WLCG Ops
  – But we are not doing Grid Ops
      Network issue ≠ Grid issue
      Networks are not standalone resources (storage, cpu etc.)
  – Similar to software issues handled externally (in savannah)
We should not be customer facing
  – Selected inquiries going through storage teams (“Site contact”)
Let’s also see how EGI will converge around user support

26 Conclusion about LHCOPN Operations
Ops status: clear room for improvement
  – Unequal following of processes by sites, because a clear feeling of usefulness and evidence of network failures are missing
  – L2 events well handled while administrative workflow is forgotten
WLCG relationships to be implemented and nurtured
  – Performance issues need smart and timely solving
  – Skeleton of coordination with WLCG Ops to be improved
No outstanding benefit in unifying the LHCOPN helpdesk with WLCG’s
  – Maybe better, and sufficient, to carefully link our workflow with WLCG Ops
Wait for monitoring & SLDs before the next set of improvements
  – Timeline?
  – Particularly revitalise ticket handling and ensure the minimum is met

27 Questions
1. Pushing for administrative things to be done?
  – twiki review, backup tests etc.
2. LHCOPN representative?
  – Maybe not responsible for Ops but more liaising as a single contact point
  – Share and justify the workload
3. GGUS merging
  – Opinion?

