Grid Monitoring and Operations SAM Development Team CERN IT/GD Tier2 Admin Workshop 03 Dec. 2006, Mumbai.


1 Grid Monitoring and Operations SAM Development Team CERN IT/GD Tier2 Admin Workshop 03 Dec. 2006, Mumbai

2 Outline
Monitoring and operational tools
–SAM: framework, sensors, availability metrics
–FCR
–gstat, GOCDB, SAM Admin Portal, COD Dashboard
Grid Operations (COD)

3 Monitoring tools
Service Availability Monitoring (SAM)

4 SAM -- Overview
Grid service-level monitoring framework
successor of SFT
used in Grid Operations
basis for Availability Metrics
VO-based submissions
–VO-specific tests
services currently tested: CE, gCE, SE, RB, sBDII, BDII, FTS, LFC
JobWrapper tests

5 Central SAM submissions
Official CERN submissions
–Production and Certified sites
–ops (+ dteam) VO
–jobs submitted every hour
–basis of COD alarms
–https://lcg-sam.cern.ch:8443/sam/sam.py
PPS
–ops VO
–hourly
–https://lcg-sam.cern.ch:8443/sam-pps/sam.py
SAM Admin Portal
–ops VO
–on-demand
–Certified + Uncertified sites

6 VO-specific test submission
LHCb
–successfully migrated to SAM (only CE, gCE)
–VO-specific test (Dirac installation)
Atlas
–all sensors
–submitted from SAM UI
CMS
–set up, but no regular submission yet

7 SAM Portal

8 SAM Internals
framework structure
–client: submission framework (developed by CERN team); sensors (developed by different contributors + CERN team; tests are plug-in modules)
–server: web services; portal; Oracle DB accessed by web services; static (GOCDB) + dynamic (BDIIs) info

9 Sensors – CE, gCE, SE, SRM
CE, gCE
–job submission: UI → RB → CE → WN chain
–CA certificates (on WN)
–software middleware version (WN)
–replica management: lcg-utils, default SE + 3rd-party replication
–RGMA, Apel, etc.
SE, SRM
–UI ↔ SE/SRM: lcg-utils (LFC)

10 Sensors – LFC, FTS
LFC
–lfc-ls + create file in /grid/
FTS
–BDII entry check
–listing channels: glite-transfer-channel-list (ChannelManagement service)
–transfer test (in development): submit transfer jobs between SRMs in all Tier0 and Tier1 sites (N-N testing), then check the status of the jobs
Note! The test relies on the availability of the SRMs at the sites
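The N-N transfer test above can be pictured as enumerating every source/destination combination of SRM endpoints. The sketch below is only an illustration: the endpoint names are hypothetical, and treating "N-N" as every ordered pair of distinct endpoints is an assumption, since the slides do not spell out the pairing rule.

```python
from itertools import permutations


def transfer_pairs(srm_endpoints):
    """Return every ordered (source, destination) pair of distinct endpoints."""
    return list(permutations(srm_endpoints, 2))


# Hypothetical endpoint names, for illustration only.
endpoints = ["srm.tier0.example", "srm.tier1-a.example", "srm.tier1-b.example"]
pairs = transfer_pairs(endpoints)

for source, destination in pairs:
    # A real sensor would submit a transfer job here and poll its status later.
    print(f"{source} -> {destination}")
```

With N endpoints this yields N × (N − 1) transfers, which is why a single unavailable SRM shows up as a failure in many channels at once.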

11 Standalone sensors – BDII, RB
sBDII (Gstat)
–accessibility
–sanity checks
top-level BDIIs (Gstat)
–accessibility
–reliability of data (number of entries)
RB
–job submission: UI → important RBs → "reliable" CEs
–time of matchmaking

12 JobWrapper tests
JobWrapper
–requested by experiments, also useful in operations
–testing all WNs: SAM always tests just an arbitrary one
–tests executed by the CE wrapper script, which is executed with every production job
–test results are passed to the job and published to the SAM DB
–test code: core scripts in the release, tests in the software area (signed tarball)
–soon in production

13 Availability metrics - algorithm
Status of node N: $Status(N) = \bigwedge_{t \in CriticalTests} TestResult(N, t)$
Status of service C: $Status(C) = \bigvee_{N \in instances(C)} Status(N)$
Status of site S: $Status(S) = (CE_1 \vee CE_2 \vee \dots \vee CE_n) \wedge (SRM_1 \vee SRM_2 \vee \dots \vee SRM_n) \wedge siteBDII$
$\wedge$ = boolean AND, $\vee$ = boolean OR
Everything is calculated for each VO that defined critical tests in FCR
Results make sense only if the VO submits tests!
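The boolean algorithm on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the SAM implementation: the test names and results are hypothetical, and counting a critical test with no recorded result as a failure is an assumption of the sketch.

```python
def node_status(test_results, critical_tests):
    """AND over the critical tests of one node.

    A critical test with no recorded result counts as a failure
    (an assumption of this sketch, not stated in the slides).
    """
    return all(test_results.get(t, False) for t in critical_tests)


def service_status(node_statuses):
    """OR over all instances of one service type."""
    return any(node_statuses)


def site_status(service_statuses):
    """AND over the service types of a site."""
    return all(service_statuses.values())


# Hypothetical test names and results, for illustration only.
ce_critical = ["job-submit", "ca-certs"]
srm_critical = ["put", "get"]

ce_nodes = [
    {"job-submit": True, "ca-certs": False},  # CE1 fails one critical test
    {"job-submit": True, "ca-certs": True},   # CE2 passes all critical tests
]
srm_nodes = [{"put": True, "get": True}]

services = {
    "CE": service_status([node_status(n, ce_critical) for n in ce_nodes]),
    "SRM": service_status([node_status(n, srm_critical) for n in srm_nodes]),
    "sBDII": True,
}
site_up = site_status(services)
```

Here the site counts as up because one working CE is enough for the CE service (OR), while every service type must be up for the site (AND). Note that `all()` over an empty list of critical tests returns `True`, which mirrors the warning that results are meaningless when a VO defines or submits no tests.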

14 Availability metrics - algorithm II
service and site status calculated every hour
daily, weekly, monthly availability
scheduled downtime information from GOCDB
details of the algorithm on the GOC wiki: http://goc.grid.sinica.edu.tw/gocwiki/SAME_Metrics_calculation
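As a rough illustration of how hourly status samples could be rolled up into a daily, weekly or monthly figure, the sketch below takes availability to be the fraction of hourly samples in which the site was up. That definition is an assumption; the slides defer the actual algorithm, including how scheduled downtime from GOCDB is treated, to the GOC wiki page, so downtime is deliberately left out here.

```python
def availability(hourly_status):
    """Fraction of hourly samples in which the site or service was up.

    Returns None when there are no samples, since no meaningful
    figure can be given in that case.
    """
    if not hourly_status:
        return None
    return sum(1 for up in hourly_status if up) / len(hourly_status)


# Hypothetical day: up for 18 hourly samples, down for 6.
day = [True] * 18 + [False] * 6
daily = availability(day)
print(daily)  # 0.75
```

The same function applies unchanged to a week (168 samples) or a month of hourly samples.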

15 Availability metrics - GridView

16 Availability metrics - data export

17 VO tools
Freedom of Choice for Resources (FCR)

18 FCR -- Overview
Freedom of Choice for Resources
https://lcg-fcr.cern.ch:8443/fcr/fcr.cgi
VO policy enforcement tool
critical test and resource selection for VOs by manipulating top-level BDII information
goal is to be able to
–select which aspects of site functionality are important for the VO
–blacklist unreliable sites
–always use stable, "important" sites
–use less reliable sites based on SAM results
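The selection goals listed above can be sketched as a small filter. This is a hypothetical illustration, not FCR code: the site names are made up, and letting the blacklist take precedence over the whitelist is an assumption the slides do not confirm.

```python
def select_sites(sam_ok, blacklist=frozenset(), whitelist=frozenset()):
    """Pick the sites a VO would see, given per-site SAM results.

    sam_ok maps a site name to True when all of the VO's critical
    tests pass. Blacklisted sites are always dropped, whitelisted
    ("important") sites are always kept, and every other site is
    kept only while its SAM result is good.
    """
    selected = []
    for site in sorted(set(sam_ok) | set(whitelist)):
        if site in blacklist:
            continue  # assumption: blacklist wins over whitelist
        if site in whitelist or sam_ok.get(site, False):
            selected.append(site)
    return selected


# Hypothetical sites and results, for illustration only.
sam_ok = {"SITE-A": True, "SITE-B": False, "SITE-C": True, "SITE-D": False}
visible = select_sites(sam_ok, blacklist={"SITE-C"}, whitelist={"SITE-D"})
print(visible)  # ['SITE-A', 'SITE-D']
```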

19 FCR -- Overview
integrated with SAM
–sharing the same DB
optional usage
–BDII configuration parameter
–FCR output: ldif file
information from GOCDB + BDII
DN-based authentication (2 levels)

20 FCR Admin Portal

21 FCR User Pages
read-only view of VO settings
tells if the resource is available at the moment
grouping, selection

22 FCR User Portal

23 Monitoring tools
gstat, SAM Admin Portal, COD dashboard

24 gstat (Sinica)
–http://goc.grid.sinica.edu.tw/gstat/
–Information System (BDII) monitoring
–response time, consistency (sanity), completeness
–site-BDII + top-level BDII
–aggregated and detailed views
–plots (history)
–refreshed every 5 mins (non-intrusive)

25 gstat

26 SAM Admin Portal
–https://monitoring.egee.man.poznan.pl/admin2
–on-demand SAM submission
–easy to use
–target site selection
–used by: ROCs (certification of a site); ROCs, site admins, CODs (to speed up debugging)

27 SAM Admin Portal

28 GOCDB
–https://goc.grid-support.ac.uk/gridsite/gocdb2/index.php
–central database to store static site information
–all EGEE sites have to register
–contact, security contact, certification status, site type
–scheduled maintenance
–used by: the script that generates the top-level BDII config file; monitoring tools (SAM DB → SAM, FCR, Availability calc.); operations management tools

29 GOCDB

30 CIC Operations Portal
COD management
–schedule for rotations
–COD dashboard
–COD handover notes
ROC management
–ROC contacts
–weekly reports
VO management
–VO ID cards (VO contacts, etc.)
EGEE broadcast

31 CIC Operations Portal

32 GGUS (FZK)
Global GRID User Support
http://ggus.org
ticketing system for the EGEE GRID
based on Remedy
tickets created by
–individual users (manually)
–Grid Operators (via COD Dashboard)
news, documentation

33 GGUS Portal

34 Operations
Grid Operations

35 EGEE Operations Structure
Regional Operations Centres (ROC)
–One in each region (incl. Asia-Pacific)
–Front-line support for user and operations issues: point of contact for sites in the region
–Provide local knowledge and adaptations
–Manage daily Grid operations: oversight, troubleshooting
–Run infrastructure services for Asia-Pacific region
–Asia-Pacific: roc@lists.grid.sinica.edu.tw (Jason Shih, Min-Hong Tsai, Shu-Ting Liao)
–CERN (catch-all ROC): egee-roc-cern@cern.ch (Nicholas Thackray)

36 COD
COD is Operator on Duty
–was: CIC-on-Duty
global LCG/EGEE GRID monitoring
1 (2) ROCs responsible for the whole GRID operations at a time
–12 ROCs involved
–weekly rotation
weekly WLCG-OSG-EGEE Operations meeting
–ROCs, Tier1, VOs
–all sites invited

37 COD Procedures
https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEROperationalProcedures
Look at the monitoring tools
–SAM, Certificate Monitoring pages
Open tickets using the COD Dashboard
Escalate expired tickets
Process site responses (update tickets accordingly)
End of duty: hand-over notes
Update the GOC wiki pages

38 COD Dashboard
summary of necessary monitoring information + tools for ticket processing
tickets linked to GGUS tickets
GOCDB information
–site downtime information!
SAM alarms
ticket creation and management tool
tools for related e-mail

39 COD Dashboard

40 Connection between the used tools
[Diagram linking Grid Operators (COD), the COD dashboard, the monitoring tools (SAM) and GGUS; the links are labelled: problem tracking and reporting, ticket follow-up, modifications on the tickets]

41 Escalation Procedure
defines the steps to be taken during the lifetime of a ticket
–tickets don't get forgotten!
available on CIC Portal
–(https://edms.cern.ch/document/701575)
prioritization of alarms depending on the amount of resources at the site

42 Escalation Steps
1. ticket creation
2. first mail (to: site + ROC)
3. second mail (to: site + ROC)
4. suspension from the GRID
before step 4:
a) mail to ROC
b) mail to OCC for validation
c) site is invited to the weekly operations meeting

43 Escalation Procedure -- Quarantine
site categories
–low: CPU < 20
–normal: 20 < CPU < 100
–high: 100 < CPU
quarantine between steps 2 and 3, and between steps 3 and 4
–low + normal: 3 days
–high: 1 day
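The category thresholds and quarantine periods on this slide translate into a small lookup. The sketch below is illustrative only; the slide leaves the boundary values open, so placing exactly 20 CPUs in "normal" and exactly 100 CPUs in "high" is an assumption.

```python
# Days a site gets between escalation steps 2-3 and 3-4, per category.
QUARANTINE_DAYS = {"low": 3, "normal": 3, "high": 1}


def site_category(cpu_count):
    """Classify a site by its number of CPUs.

    Boundary handling (20 -> "normal", 100 -> "high") is an
    assumption; the slide does not say where these values fall.
    """
    if cpu_count < 20:
        return "low"
    if cpu_count < 100:
        return "normal"
    return "high"


def quarantine_days(cpu_count):
    """Days before the next escalation step for a site of this size."""
    return QUARANTINE_DAYS[site_category(cpu_count)]


print(quarantine_days(12))   # 3
print(quarantine_days(450))  # 1
```

Larger sites get the shorter period, which matches the idea of prioritizing alarms by the amount of resources at the site.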

44 COD Escalation Procedure
[Flowchart with the nodes: Create ticket; When deadline reached; Problem solved? (yes/no); Close ticket; site responds; Extend deadline; last escalation? (yes/no); Escalate (mail); Suspend site]
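One plausible reading of this flowchart, combined with the numbered escalation steps (1 ticket creation, 2 first mail, 3 second mail, 4 suspension), is the decision function below. The ordering of the checks is an assumption, since the arrows of the original diagram did not survive in the transcript, and the function name and return strings are hypothetical.

```python
def next_action(step, solved, site_responded):
    """Decide what the operator does when a ticket deadline is reached.

    step is the escalation step already completed (1 to 3).
    The order of the checks is an assumed reading of the flowchart.
    """
    if solved:
        return "close ticket"
    if site_responded:
        return "extend deadline"
    if step >= 3:
        # Last escalation already sent. Per the escalation steps, the ROC
        # is mailed and the OCC validates before a site is suspended.
        return "suspend site"
    return "escalate"


print(next_action(step=1, solved=False, site_responded=False))  # escalate
print(next_action(step=3, solved=False, site_responded=False))  # suspend site
```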

45 What a site is expected to do
Look at the monitoring tools (SAM)
–try to notice & fix failures before the CODs
COD notification about a failure
–fix it ASAP
–contact the ROC for help if needed
Scheduled downtime
–enter it in GOCDB
–broadcast it in advance
–broadcast when it's finished
weekly site reports (at COD portal)
–input to weekly Operations meeting

46 What a site could do
problems → contact the ROC
–best way: GGUS ticket
question → ask the ROC
open a ticket if there is a failure in Central Services
–LFC, SAM, etc.

47 Happy End
Thanks for your attention :)

