Presentation is loading. Please wait.

Presentation is loading. Please wait.

EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks LHCOPN operations Presentation and training.

Similar presentations


Presentation on theme: "EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks LHCOPN operations Presentation and training."— Presentation transcript:

1 EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks LHCOPN operations Presentation and training CERN’s session 1- Goals and general overview of operational model Guillaume Cessieux (FR IN2P3-CC, EGEE SA2) CERN, 2009-04-02

2 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Agenda Goal Overview Actors Information repositories Events management –Incident –Maintenance –Change Grid interactions Processes tools 2

3 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Remark Everything documented and maintained on –https://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModelhttps://twiki.cern.ch/twiki/bin/view/LHCOPN/OperationalModel 3

4 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Goal of the ops model Smartly manage LHCOPN at L3 delivering best network service as possible to WLCG LHCOPN objectives –T0 – T1 traffic  T1 – T1 traffic as best effort T1-T1 links primary goal: T0-T1 backups links +Backup through generic IP LHCOPN is key block of infrastructure around WLCG 4

5 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX MoU (1/3) No particular MoU on LHCOPN operations, part of WLCG MoU signed by T1s –http://lcg.web.cern.ch/LCG/MoU/Goettingen/MoU-Goettingen- 18MAR09.pdfhttp://lcg.web.cern.ch/LCG/MoU/Goettingen/MoU-Goettingen- 18MAR09.pdf –Page A.3.2 (T0), A.3.4 (T1s) –For T0: 5

6 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX MoU (2/3) 6 For T1s:

7 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX MoU (3/3) Raw conclusion –T0:  Response delay: 6 hours  Unexpected downtimes: 3.65 days/year ~ 87 hours –T1s  Response delay: 12 hours  Unexpected downtimes: 7.3 days/year ~ 175 hours This seems really achievable –Cf. https://edms.cern.ch/document/982588/ https://edms.cern.ch/document/982588/  But true scheduled downtimes previously not correctly handled –Delays in announcements to be respected... 7

8 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Overview (1/2) Federated operational model with key responsibilities on sites –Interaction with network providers –Management of network devices on sites –Interaction with the Grid Some information centralised –Serialisation of fault resolution and avoid duplicated information –TTS, web repository… 8

9 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Overview (2/2) 9 Site ASite B NREN A* NREN BNREN C LHCOPN TTS (GGUS) All sites 1 2 3 Users 4

10 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Actors 10 Grid Projects (LCG (EGEE)) Sites (T0/T1) Sites (T0/T1) L2 Networks providers (GEANT2,NRENs) European / Non European Public/Private L2 Networks providers (GEANT2,NRENs) European / Non European Public/Private Sites (T0/T1) ENOC L2 Networks providers (GÉANT2,NRENs…) European / Non European Public/Private NOC/ Router operators Grid data contacts L2 NOC Infrastructure Operators Users DANTE Operation LQA (CH-CERN) Actors’ level

11 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Information repositories Grid TTS (GGUS) Global web repository (Twiki) DANTE Operation LHCOPN TTS (GGUS) L2 Monitoring (perfSONAR e2emon) L3 monitoring ENOC Information repository Actor MDMBGP A is responsible for B BA Operational procedures Operational contacts Technical information Change management DB Statistics reports Grid Project operation (EGEE SA1) L2 NOC CH-CERN 11

12 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Threshold for processes Any events –more than 1 hour –or more than 5 times an hour –Should have a ticket in the TTS Otherwise could be silently handled –But good to report them (statistics, cross checking…) 12

13 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX 3 kinds of events Incident –Unscheduled event –Generic process when cause and location unknown Maintenance –Scheduled event Change –Scheduled change on the infrastructure –Implemented by a maintenance if it impacts! 13

14 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L2 vs L3 LHCOPN built as L2 paths ending on sites –True, some exceptions… Shortcuts –L2: OFF-SITE: optical level, fibre cuts in NREN, etc. –L3: ON-SITE: Router down, power cut, BGP flaps, filtering, IOS upgrade etc. 14

15 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Processes 6 key processes to handle 3 kinds of event Incident management –Global Problem management processes 1)L3 incident management 2)L2 incident management –Escalated incident management  (~ trouble > 1 week) Maintenance management 3)L3 maintenance management 4)L2 maintenance management Change management process 5)L3 change management 6)L2 change management Unscheduled (Minimum for on duty people…) Scheduled Complexity 15

16 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Change VS Maintenance Change to broadcast and document the change Any change with a impact should be implemented with an associated maintenance 16

17 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Sample events Incident –L2: Dark fibre outage –L3: Router down, BGP filtering, bad routing Maintenance –L2: Fibre rerouted, fibre to be cleaned –L3: Scheduled power cut on site, IOS upgrade Major change –L2: New LHCOPN link –L3: New IP adresses, prefixes, filtering 17

18 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Who is who Router operator –People acting on sites’ network devices Network provider –NRENs, GÉANT2 etc. Grid data contact –Role supported by each sites –Typicaly FTS and Dcache managers etc. 18

19 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Responsibilities Outages on links between T0 and T1 are of responsibility of T1s (who ordered the link) Responsibility for outages on T1-T1 links are being investigated Responsibility for GGUS' ticket is on the site which the ticket is assigned to –Only one entity responsible at any time  Avoid the no one move effect 19

20 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Generic process Global web repository (Twiki) L2 - L3 Monitoring Site * Router operators * Grid Data contact LHCOPN TTS (GGUS) Start L3 incident management OK L2 incident management OK escalated incident management 1 2 3 4 5 A goes to process BABBAA reads BA B A interacts with B 20

21 Enabling Grids for E-sciencE A- INCIDENT MANAGEMENT 1.1 Incident management 21

22 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L3 incident management process 22 Source site involved Site involved A notifies B Grid Data contact * Router operators Router operators A AB B A interacts with B Affected sites 1.1 LHCOPN TTS (GGUS) L2 incident management 1.4 1.2 2(1.3) BAA reads and writes BA goes to process BAB Scope: Router down, BGP filtering, bad routing...

23 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L3 incident management process Sample use case: Power cut at FR-CCIN2P3 1.Incident registration: Put a GGUS ticket into the TTS 2.Warn Grid data contact and give them reference of network ticket 3.Update it 4.Close it 23

24 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L2 incident management process 24 Sites linked * L2 NOC Grid Data contact * Router operators LHCOPN TTS (GGUS) * End of L3 incident management A notifies B A AB B A interacts with BBAA reads and writes B 1.11.3 1.2 2 escalated incident management (3) Affected sites Scope: Dark fibres outages...

25 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L2IM: use case: RENATER fibre cut CERN-IN2P3-LHCOPN-001 down 1.Start Generic process 2.Start L3 incident management –Nothing at CH-CERN, should be L2 related 3.Then go to L2 incident management 1.See with RENATER NOC what happens  Maybe open a ticket to their NOC 2.Put a ticket in the LHCOPN TTS 3.Warn Grid data contact (and give them ticket #) 4.Follow 25

26 Enabling Grids for E-sciencE B- MAINTENANCE MANAGEMENT 1.2 Maintenance management 26

27 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Maintenance notice delay Impact durationNotice window More than 1 hour1 week Less than 1 hour2 days No impact1 day 27 Otherwise events might be computed in statistics as Incident...

28 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L3 Maintenance management 28 Impacted sites Source sites Grid Data contact * Router operators A notifies B A AB B A interacts with BBAA reads and writes B Impacted sites Router operators LHCOPN TTS (GGUS) 1.1 1.2 2 3 Affected sites Scope: scheduled power outage on site, router IOS upgrade,...

29 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L3M: use case: Scheduled power cut in network area at FR-CCIN2P3 Impact > 1h, maintenance window = 1w Warn Grid data contact and see if ok (Ask DE-KIT and CH-CERN if no overlaping event foreseen) Put a ticket about in the TTS –Yes one week in advance –Give ticket # to Grid data contact Update, follow, close ticket the D-day 29

30 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L2 maintenance management Linked Sites * L2 NOC Linked Sites Grid Data contact Router operators A notifies B A AB B A interacts with BBAA reads and writes B Linked Sites Router operators LHCOPN TTS (GGUS) 1.1 1.4 1.2 1.3 2 Affected sites Scope: fibre physically rerouted, fibre to be cleaned... 30

31 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L2M: use case : RENATER scheduled work on CERN- IN2P3-LHCOPN-001 Received ticket from RENATER Link will be down 6 hours –No impact on service to be confirmed with DE-KIT and CH- CERN –See also with Grid data contact as this may impact performance Put a ticket at least 1d before the event –Give reference to Grid data contact Update and follow ticket 31

32 Enabling Grids for E-sciencE C- CHANGE MANAGEMENT 1.3 Change management 32

33 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L3 Change Management Linked Sites Source site Grid Data contact Router * operators Affected Sites Router operators L3 maintenance management Global web repository (Twiki) A notifies B A AB B A interacts with BBAA reads and writes B Monitoring 1.1 1.2 2.1 2.2 (2.3) (4) LHCOPN TTS (GGUS) 3 Affected sites Scope: IP addresses change, new prefix propagated, new filtering 33

34 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L3C: use case: Change of p2p IP adresses No change on service delivered after –Not of interest for Grid data contact Discuss with CH-CERN and DE-KIT about the change Document the scheduled change on twiki and update technical informations –https://twiki.cern.ch/twiki/bin/view/LHCOPN/ChangeManagementDatabasehttps://twiki.cern.ch/twiki/bin/view/LHCOPN/ChangeManagementDatabase –https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome part "Technical Information"https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome Put an informational ticket on the TTS about –With DANTE Ops (e2emon) & ENOC in CC to have monitoring adapted – operations AT dante.org.uk;enoc.support AT cc.in2p3.fr Implement (=commit) the change with a maintenance 34

35 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L2 Change Management 35 Linked site Grid Data contact Router operators L2 maintenance management Global web repository (Twiki) * L2 NOC Monitoring Linked Sites Router operators A notifies B A AB B A interacts with BBAA reads and writes B 1.1 1.2 1.3 2.1 2.2 2.3 3 LHCOPN TTS (GGUS) L3 change management (4) Affected sites Scope: New LHCOPN L2 link, L2 link with new physical path, change of L2 network provider for a segment...

36 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX L2C: use case: New link CERN-IN2P3- LHCOPN-002 provided by SWITCH See with SWITCH NOC details See with CH-CERN, DE-KIT new p2p IPs and routing policy Document the scheduled change and update technical informations –https://twiki.cern.ch/twiki/bin/view/LHCOPN/ChangeManagementDatabasehttps://twiki.cern.ch/twiki/bin/view/LHCOPN/ChangeManagementDatabase –https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome part "Technical Information"https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome Put an informational ticket in the TTS, warning ENOC and DANTE Ops and all sites about what is foreseen –Operations AT dante.org.uk;enoc.support AT cc.in2p3.fr No change on infrastructure without tickets Put a L3 maintenance ticket to commit changes –IP adresses, routing and testing period before production use Then warn Grid data contact: New bandwidth and backup possiblities for project 36

37 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Grid interactions (1/4) Sites (T0/T1) Grid Project (LCG) Sites (T0/T1s) Grid Data Manager Router Operators/ Site NOC Grid Network Networks providers Network providers A B C 37

38 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Grid interactions (2/4) A.Daily operational workflow –Scheduled and unscheduled outages – what to practically do? Try to also avoid overlap of network events & Grid events –Each site is responsible – No central entity → Use simple existing things in place –Existing tools, processes and communication channel to be used EGEE broadcasting tools etc. –Grid data contacts could also report in the daily WLCG phoneconf when needed https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsMeetings Can be done offline: e-mails, reading minutes etc. –But phone turned on when needed Key point: VOs and experiments are reached here 38

39 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Grid interactions (3/4) B. Upper level and long term interactions –Regular problems, improvement and change requests, global assessment of the service delivered etc. → A LHCOPN representative could be the exchange point between LHCOPN and Grid — Report to Grid from quarterly LHCOPN network ops phoneconf Global view of infrastructure and ops Quality assessment, key incident report etc. — Import items from Grid on the agenda — Write conclusions into some quarterly reports 39

40 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Grid interactions (4/4) 40 Sample FR-CCIN2P3 implementation:

41 Enabling Grids for E-sciencE LHCOPN Ops dissemination, CERN, 2009-04-02 GCX Conclusion Only 2 incident management processes to be fully known This is light? Model should be flexible enough for site dependant implementation –From huge layered NOCs to single guy Open to improvements! 41


Download ppt "EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks LHCOPN operations Presentation and training."

Similar presentations


Ads by Google