Published by Basil Leo Webster. Modified over 9 years ago.
1
Grid Monitoring and Operations
SAM Development Team, CERN IT/GD
Tier2 Admin Workshop, 03 Dec. 2006, Mumbai
2
Grid Operations, Tier2 Admin Workshop, 03 Dec. 2006, Mumbai
Outline
Monitoring and operational tools
– SAM: framework, sensors, availability metrics
– FCR
– gstat, GOCDB, SAM Admin Portal, COD Dashboard
Grid Operations (COD)
3
Monitoring tools
Service Availability Monitoring (SAM)
4
SAM -- Overview
Grid service-level monitoring framework
successor of SFT
used in Grid Operations
basis for availability metrics
VO-based submissions
– VO-specific tests
services currently tested: CE, gCE, SE, RB, sBDII, BDII, FTS, LFC
JobWrapper tests
5
Central SAM submissions
Official CERN submissions
– Production and Certified sites
– ops (+ dteam) VO
– jobs submitted every hour
– basis of COD alarms
– https://lcg-sam.cern.ch:8443/sam/sam.py
PPS
– ops VO
– hourly
– https://lcg-sam.cern.ch:8443/sam-pps/sam.py
SAM Admin Portal
– ops VO
– on-demand
– Certified + Uncertified sites
6
VO-specific test submission
LHCb
– successfully migrated to SAM (only CE, gCE)
– VO-specific test (Dirac installation)
Atlas
– all sensors
– submitted from SAM UI
CMS
– set up, but no regular submission yet
7
SAM Portal
8
SAM Internals
framework structure
– client
  submission framework (developed by CERN team)
  sensors: developed by different contributors + CERN team
  tests: plug-in modules
– server
  web services
  portal
  Oracle DB accessed by web services
  static (GOCDB) + dynamic (BDIIs) info
9
Sensors – CE, gCE, SE, SRM
CE, gCE
– job submission: UI → RB → CE → WN chain
– CA certificates (on WN)
– software middleware version (WN)
– replica management: lcg-utils, default SE + 3rd-party replication
– RGMA, Apel, etc.
SE, SRM
– UI ↔ SE/SRM: lcg-utils (LFC)
10
Sensors – LFC, FTS
LFC
– lfc-ls + create file in /grid/
FTS
– BDII entry check
– listing channels: glite-transfer-channel-list (ChannelManagement service)
– transfer test (in development):
  submitting transfer jobs between SRMs in all Tier0 and Tier1 sites (N-N testing)
  checking the status of jobs
  Note: the test relies on the availability of the SRMs at the sites
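The N-N transfer test above pairs every Tier0/Tier1 SRM with every other one. A minimal sketch of how such a pair list could be enumerated, assuming each direction counts as a separate transfer; the site names are hypothetical placeholders, and no real FTS call is made:

```python
from itertools import permutations
from typing import List, Tuple


def transfer_pairs(sites: List[str]) -> List[Tuple[str, str]]:
    """Return every ordered (source, destination) pair of distinct sites.

    Assumption: A -> B and B -> A are tested separately.
    """
    return list(permutations(sites, 2))


# Hypothetical site names, for illustration only.
sites = ["T0", "T1-A", "T1-B"]
pairs = transfer_pairs(sites)
for source, destination in pairs:
    print(f"{source} -> {destination}")
```

With N sites this yields N * (N - 1) transfers, so the number of test jobs grows quadratically with the number of sites.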
11
Standalone sensors – BDII, RB
sBDII (Gstat)
– accessibility
– sanity checks
top-level BDIIs (Gstat)
– accessibility
– reliability of data (number of entries)
RB
– job submission: UI → important RBs → "reliable" CEs
– time of matchmaking
12
JobWrapper tests
JobWrapper
– requested by experiments, also useful in operations
– testing all WNs
  SAM always tests just an arbitrary one
– tests executed by CE wrapper script
  executed with every production job
– test results
  passed to the job
  published to the SAM DB
– test code
  core scripts in the release
  tests on software area (signed tarball)
– soon in production
13
Availability metrics - algorithm
Status of node N: $\mathrm{Status}(N) = \bigwedge_{t \in \mathrm{CriticalTests}} \mathrm{TestResult}(N, t)$
Status of service C: $\mathrm{Status}(C) = \bigvee_{N \in \mathrm{instances}(C)} \mathrm{Status}(N)$
Status of site S: $\mathrm{Status}(S) = (CE_1 \lor CE_2 \lor \dots \lor CE_n) \land (SRM_1 \lor SRM_2 \lor \dots \lor SRM_n) \land \text{site BDII}$
$\land$ = boolean AND, $\lor$ = boolean OR
Everything is calculated for each VO that defined critical tests in FCR
Results make sense only if the VO submits tests!
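The three status rules above translate directly into boolean reductions. The following is a minimal sketch, not the SAM implementation; the test names are hypothetical, and treating a missing test result as a failure is an assumption:

```python
from typing import Dict, Iterable


def node_status(results: Dict[str, bool], critical_tests: Iterable[str]) -> bool:
    """AND over the critical tests of one node.

    Assumption: a critical test with no result counts as failed.
    """
    return all(results.get(test, False) for test in critical_tests)


def service_status(instance_statuses: Iterable[bool]) -> bool:
    """OR over all instances (nodes) of one service type."""
    return any(instance_statuses)


def site_status(service_statuses: Iterable[bool]) -> bool:
    """AND over all service types at the site."""
    return all(service_statuses)


# Hypothetical critical tests and results, for illustration only.
ce_critical = ["job-submit", "ca-certs"]
ce1 = node_status({"job-submit": True, "ca-certs": False}, ce_critical)
ce2 = node_status({"job-submit": True, "ca-certs": True}, ce_critical)
srm1 = node_status({"put-file": True}, ["put-file"])
site_bdii = node_status({"accessible": True}, ["accessible"])

site_ok = site_status([
    service_status([ce1, ce2]),   # one working CE is enough
    service_status([srm1]),
    service_status([site_bdii]),
])
```

Here the first CE fails one critical test, but the site is still up because the second CE passes and the OR across CE instances succeeds.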
14
Availability metrics - algorithm II
service and site status calculated every hour
daily, weekly, monthly availability
scheduled downtime information from GOCDB
details of the algorithm on the GOC wiki: http://goc.grid.sinica.edu.tw/gocwiki/SAME_Metrics_calculation
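As a rough illustration of how hourly status values could be rolled up into a daily figure, the sketch below counts the share of hours a site was up and splits the remaining hours into scheduled and unscheduled downtime. This is an assumed simplification; the actual calculation is the one described on the GOC wiki page linked above, and the sample data is invented:

```python
from typing import Dict, Sequence


def summarize(hourly_up: Sequence[bool], scheduled: Sequence[bool]) -> Dict[str, float]:
    """Summarize one period of hourly status samples.

    hourly_up[i] is the site status in hour i; scheduled[i] says whether a
    scheduled downtime was declared for that hour (e.g. taken from GOCDB).
    """
    total = len(hourly_up)
    if total == 0 or total != len(scheduled):
        raise ValueError("need two non-empty sequences of equal length")
    up = sum(1 for u in hourly_up if u)
    sched_down = sum(1 for u, s in zip(hourly_up, scheduled) if not u and s)
    return {
        "available": up / total,
        "scheduled_down": sched_down / total,
        "unscheduled_down": (total - up - sched_down) / total,
    }


# Invented example day: 18 hours up, 2 hours scheduled down, 4 hours unscheduled down.
day_up = [True] * 18 + [False] * 6
day_sched = [False] * 18 + [True] * 2 + [False] * 4
daily = summarize(day_up, day_sched)
```

Weekly and monthly figures follow by passing longer sequences of hourly samples to the same function.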
15
Availability metrics - GridView
16
Availability metrics - data export
17
VO tools
Freedom of Choice for Resources (FCR)
18
FCR -- Overview
Freedom of Choice for Resources
https://lcg-fcr.cern.ch:8443/fcr/fcr.cgi
VO policy enforcement tool
critical test and resource selection for VOs by manipulating top-level BDII information
goal is to be able to
– select which aspects of site functionality are important for the VO
– blacklist unreliable sites
– always use stable, "important" sites
– use less reliable sites based on SAM results
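The selection goals above can be read as a simple per-VO filter. The sketch below is only an illustration of that policy, not FCR code: the function and site names are hypothetical, and letting the blacklist take precedence over the always-use list is an assumption.

```python
from typing import Dict, List, Set


def select_sites(
    sites: List[str],
    sam_ok: Dict[str, bool],
    always_use: Set[str],
    blacklist: Set[str],
) -> List[str]:
    """Return the sites a VO would keep visible.

    sam_ok maps a site to whether it currently passes the VO's critical tests.
    """
    selected = []
    for site in sites:
        if site in blacklist:
            continue  # assumption: blacklisting overrides everything else
        if site in always_use or sam_ok.get(site, False):
            selected.append(site)
    return selected


# Hypothetical sites and results, for illustration only.
chosen = select_sites(
    sites=["site-a", "site-b", "site-c", "site-d"],
    sam_ok={"site-a": False, "site-b": True, "site-c": False, "site-d": True},
    always_use={"site-a"},
    blacklist={"site-d"},
)
```

In this example site-a is kept despite failing tests because it is marked as always used, site-b is kept on its SAM results, site-c is dropped, and site-d is dropped because it is blacklisted.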
19
FCR -- Overview
integrated with SAM
– sharing the same DB
optional usage
– BDII configuration parameter
– FCR output: ldif file
information from GOCDB + BDII
DN-based authentication (2 levels)
20
FCR Admin Portal
21
FCR User Pages
read-only view of VO settings
tells if the resource is available at the moment
grouping selection
22
FCR User Portal
23
Monitoring tools
gstat, SAM Admin Portal, COD dashboard
24
gstat (Sinica)
– http://goc.grid.sinica.edu.tw/gstat/
– Information System (BDII) monitoring
– response time, consistency (sanity), completeness
– site-BDII + top-level BDII
– aggregated and detailed views
– plots (history)
– refreshed every 5 minutes (non-intrusive)
25
gstat
26
SAM Admin Portal
– https://monitoring.egee.man.poznan.pl/admin2
– on-demand SAM submission
– easy to use
– target site selection
– used by:
  ROCs: certification of a site
  ROCs, site admins, CODs: speed up debugging
27
SAM Admin Portal
28
GOCDB
– https://goc.grid-support.ac.uk/gridsite/gocdb2/index.php
– central database to store static site information
– all EGEE sites have to register
– contact, security contact, certification status, site type
– scheduled maintenance
– used by
  script that generates top-level BDII config file
  monitoring tools
  SAM DB → SAM, FCR, Availability calc.
  operations management tools
29
GOCDB
30
CIC Operations Portal
COD management
– schedule for rotations
– COD dashboard
– COD handover notes
ROC management
– ROC contacts
– weekly reports
VO management
– VO ID cards (VO contacts, etc.)
EGEE broadcast
31
CIC Operations Portal
32
GGUS (FZK)
Global GRID User Support
http://ggus.org
ticketing system for the EGEE GRID
based on Remedy
tickets created by
– individual users (manually)
– Grid Operators (via COD Dashboard)
news, documentation
33
GGUS Portal
34
Operations
Grid Operations
35
EGEE Operations Structure
Regional Operations Centres (ROC)
– One in each region (incl. Asia-Pacific)
– Front-line support for user and operations issues
  point of contact for sites in the region
– Provide local knowledge and adaptations
– Manage daily Grid operations: oversight, troubleshooting
– Run infrastructure services for Asia-Pacific region
– Asia-Pacific
  roc@lists.grid.sinica.edu.tw
  Jason Shih, Min-Hong Tsai, Shu-Ting Liao
– CERN (catch-all ROC)
  egee-roc-cern@cern.ch
  Nicholas Thackray
36
COD
COD is Operator on Duty
– was: CIC-on-Duty
global LCG/EGEE GRID monitoring
1 (2) ROCs responsible for the whole GRID operations at a time
– 12 ROCs involved
– weekly rotation
weekly WLCG-OSG-EGEE Operations meeting
– ROCs, Tier1, VOs
– all sites invited
37
COD Procedures
https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEROperationalProcedures
Look at monitoring tools
– SAM, Certificate Monitoring pages
Open tickets using COD Dashboard
Escalate expired tickets
Process site responses (update tickets accordingly)
End of duty: hand-over notes
Update the GOC wiki pages
38
COD Dashboard
summary of necessary monitoring information + tools for ticket processing
tickets linked to GGUS tickets
GOCDB information
– site downtime information!
SAM alarms
ticket creation and management tool
tools for related e-mail
39
COD Dashboard
40
Connection between the used tools
[Diagram. Labels: Grid Operators (COD); COD dashboard; Monitoring tools; SAM; GGUS; Problem tracking and reporting; Ticket follow-up; Modifications on the tickets]
41
Escalation Procedure
defines the steps to be taken during the lifetime of a ticket
– tickets don't get forgotten!
available on CIC Portal
– https://edms.cern.ch/document/701575
prioritization: alarms depending on the amount of resources at the site
42
Escalation Steps
1. ticket creation
2. first mail (to: site + ROC)
3. second mail (to: site + ROC)
4. suspension from the GRID
before 4.:
a) mail to ROC
b) mail to OCC for validation
c) site is invited to the weekly operations meeting
43
Escalation Procedure -- Quarantine
site categories
– low: CPU < 20
– normal: 20 < CPU < 100
– high: 100 < CPU
quarantine period between steps 2 and 3, and between steps 3 and 4
– low + normal: 3 days
– high: 1 day
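The quarantine rule above can be sketched as a small lookup. The slide does not say which category sites with exactly 20 or exactly 100 CPUs fall into; this sketch assumes 20 counts as normal and 100 as high, and the function names are illustrative only:

```python
def site_category(cpu_count: int) -> str:
    """Classify a site by its number of CPUs.

    Assumption: the boundary values 20 and 100 go to the larger category.
    """
    if cpu_count < 20:
        return "low"
    if cpu_count < 100:
        return "normal"
    return "high"


def quarantine_days(cpu_count: int) -> int:
    """Days to wait between two consecutive escalation steps."""
    return 1 if site_category(cpu_count) == "high" else 3


for cpus in (8, 50, 400):
    print(cpus, site_category(cpus), quarantine_days(cpus))
```

Larger sites get the shorter quarantine, so a failure that affects more resources is escalated faster.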
44
COD Escalation Procedure
[Flowchart. Nodes: Create ticket; When deadline reached; Problem solved? (yes/no); site responds; Extend deadline; last escalation? (yes/no); Escalate (mail); Suspend site (mail); Close ticket]
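The flowchart layout is lost in this transcript, so the decision order below is one plausible reading of its nodes rather than the documented procedure: when a ticket's deadline is reached, a solved problem closes the ticket, a site response extends the deadline, and otherwise the ticket is escalated until the last escalation step leads to suspension. The function name and return strings are illustrative.

```python
def next_action(problem_solved: bool, site_responded: bool, last_escalation: bool) -> str:
    """Decide what the operator on duty does when a ticket deadline is reached.

    Assumed order of checks; the original flowchart is not recoverable here.
    """
    if problem_solved:
        return "close ticket"
    if site_responded:
        return "extend deadline"
    if last_escalation:
        return "suspend site"
    return "escalate"


# A silent site that still has escalation steps left gets another mail.
print(next_action(problem_solved=False, site_responded=False, last_escalation=False))
```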
45
What a site is expected to do
Look at the monitoring tools (SAM)
– try to notice & fix failures before the CODs
COD notification about a failure
– fix it ASAP
– contact the ROC for help if needed
Scheduled downtime
– enter it in GOCDB
– broadcast it in advance
– broadcast when it's finished
weekly site reports (at COD portal)
– input to weekly Operations meeting
46
What a site could do
problems → contact the ROC
– best way: GGUS ticket
question → ask the ROC
open a ticket if there is a failure in Central Services
– LFC, SAM, etc.
47
Happy End
Thanks for your attention :)