LHCOPN operational handbook Documenting processes & procedures Presented by Guillaume Cessieux (CNRS/IN2P3-CC) on behalf of CERN & EGEE-SA2 LHCOPN meeting,

1 LHCOPN operational handbook: Documenting processes & procedures
Presented by Guillaume Cessieux (CNRS/IN2P3-CC) on behalf of CERN & EGEE-SA2
LHCOPN meeting, CERN, 2008-06-16

2 Goals
Living procedures
  – Avoid a 100-page static document that is never updated
Summarize the current view
  – Document the strict minimum… but be accurate enough
  – Define a methodology to aid discussion
Define roles & responsibilities
  – Separate roles from implementation
Clearly express where we are
  – Highlight some weaknesses
  – Begin the improvement process
Be careful: it is hard to get « agreement » from all entities

3 Operational Model
Need to identify the major operational components and formalise their interactions
  – Information repositories: GGUS, TTS, Twiki, perfSONAR, etc.
  – Actors: site network support, ENOC, E2ECU, USLHCNet, etc.; Grid Operations
  – Processes
      Who is responsible for which information?
      How does communication take place? Actor with repository; actor with actor
      For what purpose does communication take place? Resolving identified issues; authorising changes and developments
A minimal design is needed to deal with the major issues
  – Incident Management
  – Problem Management
  – Change Management
(Slide from David Foster, GDB 2008-04-02)

4 Caution
These slides only depict the minimum set of interactions that need to take place between entities.

5 Drawing conventions
Legend used in the process diagrams, illustrated with actors A (current implementation), B, C and D, information repositories 1 and 2, and process E:
  – A is responsible for 1 (its set-up, not its contents)
  – A starts process E
  – A « interacts » with B
  – B reads and writes into 1
  – C reads from 2
  – 2 notifies D (alarms…)
  – 1 and 2 exchange trouble tickets (TTs)
  – B may « interact » with C
  – * marks a possible initiator of the process on the current slide
  – A separate style marks what is optional (relations) or not yet existing (actors and information repositories)

6 Outline
  – LHCOPN actors
  – Actors and information repositories management
  – Information access: current & desired
  – Processes
      Problem management: incident management, maintenance management
      Change management
  – Handling multi-hop troubles
  – To be discussed

7 LHCOPN Actors (diagram)
Actors shown: Grid projects (LCG (EGEE)); Grid data managers; sites (T0/T1); NOC / router operators; infrastructure operators; users; L2 network providers (GÉANT2, NRENs…; European / non-European; public / private); DANTE L2 global NOC (E2ECU); LCU

8 Actors and information repositories management (diagram: "A is responsible for B"; one responsibility is marked « ? »)
Actors: DANTE; L2 NOC (E2ECU); LCU (ENOC); Grid project operation (EGEE SA1)
Information repositories: Global web repository (Twiki); LHCOPN TTS (GGUS); Grid TTS (GGUS); E2ECU's TTS (PAC); L2 monitoring (perfSONAR e2emon); L3 monitoring (MDM, BGP)
Information listed: operational procedures, operational contacts, technical information, change management DB, statistics reports, planning

9 Information access: Current (diagram; legend: A reads B, A reads and writes B)
Actors: sites; L2 network providers; L2 NOC (E2ECU); ENOC; SA2
Information repositories: L2 monitoring (perfSONAR e2emon); L3 BGP monitoring; E2ECU's TTS; Global web repository (Twiki)

10 Information access: Desired (diagram; legend: A reads B, A reads and writes B, TT exchange between A and B)
Actors: sites; L2 network providers; L2 NOC (E2ECU); LCU (ENOC); Grid projects
Information repositories: LHCOPN TTS (GGUS); Grid TTS (GGUS); E2ECU's TTS; L2 monitoring (perfSONAR e2emon); L3 monitoring; Global web repository (Twiki); statistics; planning

11 Problem management process: problem cause and location unknown (diagram, steps 1 to 5; legend: A goes to process B, A reads B, A interacts with B)
Possible initiators (*): site router operators, site Grid data manager
Information consulted: Global web repository (Twiki); L2 and L3 monitoring; LHCOPN TTS (GGUS); planning
Flow: start → L3 incident management; if L3 is OK → L2 incident management; if L2 is OK → other process?
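The decision chain on the problem management slide can be sketched as a small triage function. This is a minimal sketch, assuming the slide means "check L3 first, then L2, then fall back to some other process"; the names `triage` and `Process` are hypothetical and do not correspond to any LHCOPN or GGUS tool.

```python
from enum import Enum


class Process(Enum):
    """Follow-up processes named on the problem management slide."""
    L3_INCIDENT = "L3 incident management"
    L2_INCIDENT = "L2 incident management"
    OTHER = "other process (not yet defined)"


def triage(l3_ok: bool, l2_ok: bool) -> Process:
    """Pick the next process for a problem whose cause and location are unknown.

    Assumed reading of the slide: check L3 first; if L3 looks healthy,
    check L2; if both look healthy, the problem lies outside the network
    processes covered by the handbook.
    """
    if not l3_ok:
        return Process.L3_INCIDENT
    if not l2_ok:
        return Process.L2_INCIDENT
    return Process.OTHER


if __name__ == "__main__":
    # Hypothetical observations from L3 (BGP) and L2 (perfSONAR e2emon) monitoring
    print(triage(l3_ok=False, l2_ok=True).value)  # L3 incident management
    print(triage(l3_ok=True, l2_ok=False).value)  # L2 incident management
    print(triage(l3_ok=True, l2_ok=True).value)   # other process (not yet defined)
```

Checking L3 before L2 mirrors the slide order; since the L2 incident process can also be entered at the end of L3 incident management, an L3 result does not rule out a later L2 investigation.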

12 L3 Incident management process (diagram, steps 1.1, 1.2, (1.3), 1.4, 2.1, 2.2, 3.1, 3.2, 3.3; legend: A notifies B, A interacts with B, A reads and writes B, A goes to process B)
Possible initiator (*): router operators of the source site involved
Actors: source site involved (router operators, Grid data manager); site involved (router operators, Grid data manager); other sites; L2 NOC (E2ECU); Grid
Information repositories: LHCOPN TTS (GGUS); Grid TTS (GGUS)
May go to: L2 incident management

13 L2 Incident management process (diagram; legend: A notifies B, A interacts with B, A reads and writes B)
Possible initiators (*): L2 NOC (E2ECU); L2 network providers; end of L3 incident management
Actors: sites linked (router operators, Grid data manager); other sites; L2 network providers; L2 NOC (E2ECU); Grid
Information repositories: LHCOPN TTS (GGUS); Grid TTS (GGUS); L2 monitoring

14 L3 Maintenance management process (diagram, steps 1.1, 1.2, 2, 3.1, 3.2, 4; legend: A notifies B, A interacts with B, A reads and writes B)
A "negotiation" phase takes place between steps 2 and 3.
Possible initiator (*): router operators of the source site
Actors: source site (router operators, Grid data manager); other sites; L2 NOC (E2ECU); Grid
Information repositories: LHCOPN TTS (GGUS); Grid TTS (GGUS); planning

15 L2 Maintenance management process (diagram, steps 1 to 7; legend: A notifies B, A interacts with B, A reads and writes B)
Possible initiator (*): source L2 network provider
Actors: L2 NOC (E2ECU); linked sites (router operators, Grid data manager); all sites; Grid
Information repositories: LHCOPN TTS (GGUS); Grid TTS (GGUS); planning

16 L3 Change management process: IP addresses changed, new IP prefixes, new BGP filtering… (diagram, steps 1.1, 1.2, 1.3, 2, (3); legend: A notifies B, A interacts with B, A reads and writes B)
Possible initiator (*): router operators of the source site
Actors: source site (router operators, Grid data manager); other sites; L2 NOC (E2ECU); Grid
Information repositories: Global web repository (Twiki); L3 monitoring; Grid TTS (GGUS)
May go to: L3 maintenance management

17 L2 Change management process: new L2 link, L2 link using another physical path, change of L2 network provider for a segment… (diagram, steps 1.1, 1.2, 2.1 to 2.4, 3.1, 3.2, (4), (5); legend: A interacts with B, A reads B, A reads and writes B, A goes to process B)
Possible initiator (*): L2 network provider
Actors: L2 NOC (E2ECU); sites (router operators, Grid data manager); LCU
Information repositories: Global web repository (Twiki); L2 monitoring; L3 monitoring
May go to: L2 maintenance management; L3 change management

18 Handling multi-hop troubles
Problem example (Site 1 is the initiator *; Site 2 and Site 3 are the other sites):
  – Site 1 is unable to reach Site 3 but is able to reach Site 2
  – Site 2 is able to reach Site 3
→ The L3 problem is assigned by Site 1 to Site 3
→ If there is no resolution, Site 1 reassigns it to Site 2
→ Keep only one ticket per trouble
→ This enables serialisation of trouble resolution
→ The problem's responsibility is transferred with the ticket's reassignment
→ The initiator follows the trouble
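The single-ticket rule for multi-hop troubles can be illustrated with a tiny ticket model. This is a hypothetical sketch, not the GGUS data model: the `Ticket` class and its methods are invented for illustration, and restricting reassignment to the initiator is an assumption drawn from the slide's example, where Site 1 does the reassigning.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """One ticket per trouble (hypothetical model, not the GGUS data model)."""
    initiator: str                               # site that opened the ticket and follows the trouble
    summary: str
    assignee: str                                # site currently responsible for the trouble
    history: list = field(default_factory=list)  # successive assignees, in order
    resolved: bool = False

    def __post_init__(self) -> None:
        self.history.append(self.assignee)

    def reassign(self, by: str, new_assignee: str) -> None:
        # Assumption: only the initiator moves the ticket; responsibility moves with it.
        if by != self.initiator:
            raise PermissionError("only the initiator reassigns the ticket")
        if self.resolved:
            raise ValueError("ticket already resolved")
        self.assignee = new_assignee
        self.history.append(new_assignee)

    def resolve(self, by: str) -> None:
        # Only the site currently holding responsibility closes the trouble.
        if by != self.assignee:
            raise PermissionError("only the current assignee resolves the ticket")
        self.resolved = True


if __name__ == "__main__":
    t = Ticket(initiator="Site 1", summary="Site 1 cannot reach Site 3", assignee="Site 3")
    # No resolution at Site 3: Site 1 moves the same ticket to Site 2
    t.reassign(by="Site 1", new_assignee="Site 2")
    t.resolve(by="Site 2")
    print(t.history)  # ['Site 3', 'Site 2']
```

Keeping the assignee history on one ticket, rather than opening a new ticket per hop, is what makes the resolution serial: at any moment exactly one site holds the trouble.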

19 To be discussed (1/3)
Service quality
  – MoU checking & statistics review
  – Periodical review of open tickets
  – Escalation process
Quick notification to the Grid projects
  – Responsibility of Grid data managers
  – How, and to whom?
  – Maybe some events could be handled « silently » (e.g. backup under maintenance)
Reduce the number of events broadcast
  – Focus on important things
  – 5 sites affected by the same trouble → only one notification to the Grid
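The "one notification per trouble" idea can be sketched as grouping events by the trouble they belong to. This is a minimal sketch under stated assumptions: the `trouble_id` key (for instance, the ticket of the underlying trouble), the `silent` flag, and the site names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    trouble_id: str       # hypothetical key identifying the underlying trouble
    site: str             # affected site
    silent: bool = False  # e.g. backup under maintenance: handled without notifying the Grid


def grid_notifications(events: Iterable[Event]) -> Dict[str, List[str]]:
    """Collapse events into one notification per trouble, listing affected sites."""
    affected = defaultdict(set)
    for ev in events:
        if ev.silent:
            continue  # handled "silently", nothing broadcast to the Grid
        affected[ev.trouble_id].add(ev.site)
    return {tid: sorted(sites) for tid, sites in affected.items()}


if __name__ == "__main__":
    # Five hypothetical sites hit by the same trouble, plus one silent event
    events = [Event("TT-1", f"T1-{n}") for n in "ABCDE"]
    events.append(Event("TT-2", "T1-F", silent=True))
    print(grid_notifications(events))
    # {'TT-1': ['T1-A', 'T1-B', 'T1-C', 'T1-D', 'T1-E']}
```

The hard part in practice is deciding that five events share one cause; the sketch simply assumes that link is already known through a shared key.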

20 To be discussed (2/3)
E2ECU plays a key role
  – Are/should/could all current L2 network providers be handled by the E2ECU? What is the current status? Which segments are handled?
  – More technical information should be exchanged
Heavy dependency on perfSONAR inside NRENs
  – Sites forgotten: ticket delivery, processes, reports, monitoring access…
  – Transition to production quality?

21 To be discussed (3/3)
Role of the LCU
  – No LCU overhead on processes: offline role!
      Statistics
      Set-up of communication channels and information repositories
      Assessment of processes from a spectator's viewpoint
Many automated notifications foreseen
  – What if they break (e-mails delayed, …)?
  – Acknowledgements needed?
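To make the acknowledgement question concrete, here is one hypothetical way to detect a broken notification channel: flag recipients that have not acknowledged within a deadline. The `Notification` class, the `overdue` function, the site names and the 30-minute deadline are all assumptions for illustration, not a proposal from the slides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Notification:
    recipient: str                       # e.g. a site or a Grid data manager
    sent_at: datetime
    acked_at: Optional[datetime] = None  # set when the recipient acknowledges


def overdue(notes: List[Notification], now: datetime, deadline: timedelta) -> List[str]:
    """Recipients that have not acknowledged within the deadline.

    A non-empty result hints that the channel may be broken (e.g. delayed
    e-mail) and that someone should follow up by other means.
    """
    return [n.recipient for n in notes
            if n.acked_at is None and now - n.sent_at > deadline]


if __name__ == "__main__":
    t0 = datetime(2008, 6, 16, 9, 0)
    notes = [
        Notification("Site A", t0, acked_at=t0 + timedelta(minutes=5)),
        Notification("Site B", t0),
    ]
    print(overdue(notes, now=t0 + timedelta(hours=1), deadline=timedelta(minutes=30)))
    # ['Site B']
```

The trade-off raised on the slide remains: acknowledgements add work for every recipient, which is why they may only be worth requiring for the most important notifications.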

22 Questions & Discussion