EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks Feedback on SAM from SA1 site representatives.

Feedback on SAM from SA1 site representatives
A. Forti, D. Cesini, J. Templon
CERN – May 21st 2007
EGEE-II INFSO-RI, Enabling Grids for E-sciencE. EGEE and gLite are registered trademarks.

Enabling Grids for E-sciencE EGEE-II INFSO-RI SAM REVIEW May 21st
Introduction
As site representatives, we sent a survey asking for feedback on SAM to:
– WLCG site managers
– ROC managers
– VO managers
We received feedback from various site/ROC managers and from one VO (CMS):
– Exactly 19 mails: not many, but with quite useful content
– About 40 issues in total, which we aggregated into the following categories:
  • SAM CORE TEST
  • SAM RESULT VISUALIZATION
  • SAM-ADMIN INTERFACE
  • GRIDVIEW
  • Feedback from the VO (CMS only)

Introduction
Many answers started with something like: “SAM is a good and very useful tool…”
Site managers and the CMS VO were happy to have the opportunity to improve the tool.

Feedback on SAM TEST
SAM CORE/TEST

Feedback on SAM TEST
1) Some SAM failures are due to SAM itself rather than to the site. VOTES: 5
  “Central services used by SAM should be checked (RB/WMS, BDII). The SAM-BDII problems are a good example of this.”
  “This is true also for central SEs used to test replica”
  “Errors due to "external sources" should be filtered out, rather than asking the sys admin to go through and take them off by hand in the CIC portal.”
  “Critical tests have to only depend on a single site, not multiple sites, including transfer tests.”
  “Re-design the tests to work their way down to the site, stop at the first point of failure and bill that error to the service in question. In the example of the recurrent central BDII failure, when that occurs, the test does not proceed, marks central BDII red and leaves the rest alone.”
  “One option could be to have the relevant SAM result boxes have one or more additional color coded flags (like a small portion of the box) to indicate the status of the central service(s) upon which that test depends.”
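The "stop at the first point of failure and bill that error to the service in question" proposal in item 1 can be sketched as follows. This is only an illustration, not SAM code: the function names, the status strings and the service labels are all hypothetical.

```python
from typing import Callable, List, Optional, Set, Tuple

# A check is a (service name, probe) pair; the probe returns True on success.
# All names here are hypothetical and do not come from the SAM framework.
Check = Tuple[str, Callable[[], bool]]


def first_failure(checks: List[Check]) -> Optional[str]:
    """Run checks in order and return the first failing service, or None."""
    for service, probe in checks:
        if not probe():
            return service  # stop here: later checks are not run
    return None


def site_status(checks: List[Check], central: Set[str]) -> Tuple[str, Optional[str]]:
    """Bill a failure to the service that caused it.

    A failure in a central service leaves the site result 'unknown'
    instead of marking the site as failed.
    """
    failed = first_failure(checks)
    if failed is None:
        return ("ok", None)
    if failed in central:
        return ("unknown", failed)
    return ("failed", failed)


# Central BDII is down, so the site CE check is never reached.
chain = [("central-bdii", lambda: False), ("site-ce", lambda: True)]
print(site_status(chain, central={"central-bdii"}))
```

Ordering the chain from central services down to the site is what keeps a central outage from being billed to the site.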

Feedback on SAM TEST
2) In evaluating site metrics, failures due to SAM problems should be removed. VOTES: 3
  “A new status of Unknown would help. Tests could be pass/fail/unknown.”
  “A simple change would be daily availability = up/(up+down) and not up/24. A more sophisticated use would be that an unknown in a series of ups or downs would not break the sequence, i.e. up-up-up-unknown-up-up would be treated as all ups, and similarly for downs.”
3) Read-only DB access / API to retrieve SAM data / interface with local monitoring. VOTES: 3
  “Direct access (read-only) for gaining datasets from SAM database to local monitoring systems”
  “An official method for pulling SAM status results into NAGIOS would be good”
  “Applies also to GridView”
  “Is there any API planned for acquiring SAM results?”
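The availability proposal in item 2 can be illustrated with a short sketch. It assumes one status sample per hour and the three status strings "up", "down" and "unknown"; the function names are hypothetical and this is not how SAM or GridView compute their metrics.

```python
from typing import List, Optional


def daily_availability(samples: List[str]) -> Optional[float]:
    """Proposed metric: up / (up + down), ignoring 'unknown' samples."""
    up = samples.count("up")
    down = samples.count("down")
    if up + down == 0:
        return None  # nothing but unknowns: availability is undefined
    return up / (up + down)


def bridge_unknowns(samples: List[str]) -> List[str]:
    """Fill a run of 'unknown' when the states on both sides agree.

    up-up-up-unknown-up-up becomes all ups; a run between an 'up' and
    a 'down', or at either end of the series, is left untouched.
    """
    result = list(samples)
    i = 0
    while i < len(result):
        if result[i] != "unknown":
            i += 1
            continue
        j = i
        while j < len(result) and result[j] == "unknown":
            j += 1
        before = result[i - 1] if i > 0 else None
        after = result[j] if j < len(result) else None
        if before is not None and before == after:
            result[i:j] = [before] * (j - i)
        i = j
    return result


day = ["up"] * 18 + ["unknown"] * 6
print(day.count("up") / 24)     # current metric, up/24
print(daily_availability(day))  # proposed metric, up/(up+down)
```

With six unknown hours in an otherwise good day, the up/24 metric reports 0.75 while the proposed metric reports 1.0.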

Feedback on SAM TEST
4) On-demand notifications. VOTES: 2
  “Enabling on-demand notifications for failure of a service in a site (or region) by site (or ROC) admin, e.g. notify me if one SE in the UKI ROC fails for one VO, or notify me if a SAM test for a node in my site fails.”
5) New functionality sensors.
  “SAM functionality sensors for WMS, LB, WMSLB, MyProxy, VOMS services should be provided.”
6) Too much CPU occupation.
  “SAM framework should be improved so as to reduce the CPU occupation of a single SAM test on a WN. Special care should be devoted that, in situations in which (due to some communication problem between RB/WMS, CE and WN) the job cannot download/upload input/output sandbox, the job should be terminated in a reasonable time (i.e. not more than 5-10 minutes).”
Enabled on CIC portal but only for OPS. Selection criteria should be improved.

Feedback on SAM TEST
7) Single test submission.
  “Last time I tried it, it ran all the tests, and usually we want to concentrate on just the test that is failing”
8) Finer grain representation for failing tests.
  “Tests can fail, and then be immediately fixed. The downtime, however, seems to be until the next test. So, it is possible to be down for 3 minutes, and get flagged for 4 hours instead. I propose that tests that fail be run more frequently (every hour, every 1/2 hour?) so a more accurate representation of a site can be generated.”
9) Do not base metrics only on OPS tests.
  A site failing OPS tests could be 100% operational for VOs.
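The proposal in item 8, re-running failed tests sooner than passing ones, could look roughly like this. The 4-hour and 30-minute intervals are taken from the quote as illustrative values only; the names and the status string are hypothetical and do not describe the actual SAM scheduler.

```python
from datetime import datetime, timedelta

# Illustrative values based on the quote above, not SAM configuration.
NORMAL_INTERVAL = timedelta(hours=4)
RETRY_INTERVAL = timedelta(minutes=30)


def next_run(last_run: datetime, last_status: str) -> datetime:
    """Schedule the next test run; failed tests are retried sooner."""
    if last_status == "failed":
        return last_run + RETRY_INTERVAL
    return last_run + NORMAL_INTERVAL


noon = datetime(2007, 5, 21, 12, 0)
print(next_run(noon, "failed"))
print(next_run(noon, "ok"))
```

A site that fixes a problem within minutes would then be flagged as down for at most the retry interval rather than for the full test period.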

Feedback on SAM TEST
10) Falsification of results worries.
  “There seems to be no authorization for publishing into SAM -- this means it would be very easy to falsify test results. Considering that test results are used to generate metrics that may influence funding, there is an incentive for this”
11) Improve support for local installations of SAM.
  “In particular the requirement for Oracle is unacceptable -- surely a database connectivity library could be used so that any SQL server would do.”
12) Doc improvements. VOTES: 2
  “Documentation for adding new tests/sensors could be better.”
13) BrokerInfo test.
  “It is still having some trouble locating the .BrokerInfo file. Please fix that.”

Feedback on SAM TEST
14) SRM tests independent from external services.
  “Would be nice to have SRM tests not dependent on any external services (BDIIs, etc), but which are directly accessing the SRMs.”
15) How SAM deals with the case where an SE is full of experiment data.
  Currently this leads to sites failing tests and eventually being added to experiment blacklists. I think there needs to be some finer grained control over whether a site is available or not. It can still provide useful CPU cycles even when the disk is full.
16) SAM should avoid "babysitting" tests.
  “Like checking if a certificate is going to expire (So we wonder why SAM is doing this, and are not happy about having to change our infrastructure (like putting up a gridftp port on the SRM server) just to service a test that we are already doing internally, as all sites should be doing.)”

Feedback on SAM Result Visualization
SAM RESULT VISUALIZATION

Feedback on SAM Result Visualization
1) A site overall view.
  “All services for one site in a view.”
  “A view showing on one page all the tests of a given site for all the services.”
  “I'd like to see all failed tests for all supported/checked VOs at given site on one page.”
2) The "new" pastel colors are hardly visible.
  “The old "traffic-light" colors let you easily see which sites have problems. Would it be possible to configure things like colors using user style-sheets?”
3) Support for more generic filtering of sites (e.g. regexp).
  I often want to display all Grid-Ireland sites but currently the finest grain available is Region (i.e. UKI).
VOTES: 2
VOTES: 6
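The regexp filtering requested in item 3 is simple to express. The sketch below uses Python's standard `re` module; the site names are invented placeholders, not real GOCDB entries, and `filter_sites` is a hypothetical helper rather than part of the SAM interface.

```python
import re
from typing import List


def filter_sites(site_names: List[str], pattern: str) -> List[str]:
    """Keep the site names that match a user-supplied regular expression."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [name for name in site_names if rx.search(name)]


# Placeholder names for illustration only.
sites = ["IE-SITE-A", "IE-SITE-B", "UKI-SITE-C"]
print(filter_sites(sites, r"^IE-"))
```

A pattern anchored on a naming prefix gives a finer selection than the Region filter, provided the sites in question share such a prefix.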

Feedback on SAM Result Visualization
4) The default page layout is quite inefficient.
  “There is a lot of white space (actually blue space!) at the top of the page so again, the actual results are usually way off the page.”
5) The columns displayed could be more configurable.
  “In particular displaying Region, Sitename and NodeName by default means most of the test results are off the page. It should be possible to turn off such columns if wanted.”
6) SAM interface is slow. VOTES: 2

Feedback on SAM Result Visualization
7) Show only failed tests.
  “Usually I don't care about tests with status OK, so it would be useful to see all failed (especially critical) tests for all services.”
8) Show site FCR status.
  “And maybe it is also possible to see on SAM pages what each VO thinks about my site - I mean status in FCR (used or no).”
9) Hint for problems solution. VOTES: 3
  “It would be very helpful if, for each test, there was a handy URL to a page with common solutions to problems which can cause the test to fail.”

Feedback on SAM Result Visualization
10) Timestamps are confusing.
  “Sometimes it can be seen that test timestamps are not chronologically ordered. This especially happens for SAM tests submitted from the SAM Admin page.”
11) The use of downtimes from GOCDB is not consistent.
  “The time zones are incorrectly used (the problem may be in GOCDB or in SAM), and not all services are labeled as in SD (e.g. SE is not labeled as in SD, while CE or gCE is).”
12) Non-secure http page for test results.
  “Perhaps a non-secure view of the high-level results could be available. One does not always have a laptop with one's own certificate.”

Feedback on SAM-ADMIN
SAM ADMIN

Feedback on SAM-ADMIN
1) The SAM admin page actions are still very slow.
  “Very often it does not know the status of the submitted job.”
2) Add the possibility to submit a single test (e.g. only those that have failed).
3) Add the possibility to submit using personal certificates.
  Useful to test the infrastructure for the regional VOs.

Feedback on GridView
GridView

Feedback on GridView
1) Improve abbreviated site names in GridView.
  “Tier2 names in GridView are nonsense.”
  “The use of abbreviated site names in GridView is a real pain.”
2) Data retrieved from SAM and GridView is not always consistent.
  “In particular the Tier1 availability metrics. GridView is the more usable interface so we'd like to believe it.”
3) Check central services status: use a grey color for their inefficiencies.
  “Have a green/red/grey color code, with grey indicating the central services inefficiency.”
4) Add numbers to the plot.
5) Reduce the 10% discretization in plots.
  “The availability is only ever 70%, 80%, 90%... This is strange since we are aiming for a 95% target.”
VOTES: 3
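The complaint in item 5 can be made concrete with a small calculation. The sketch assumes hourly samples and assumes that plot values are truncated to 10% steps; how GridView actually bins its plots is not stated in the feedback, and both function names are hypothetical.

```python
def availability_pct(up_hours: int, total_hours: int) -> float:
    """Exact availability as a percentage."""
    return 100.0 * up_hours / total_hours


def binned_pct(pct: float, step: int = 10) -> int:
    """Truncate a percentage to the lower multiple of `step` (assumed binning)."""
    return int(pct // step) * step


exact = availability_pct(23, 24)  # one bad hour in a day
print(round(exact, 1), binned_pct(exact))
```

One failed hour out of 24 is an availability of about 95.8%, which meets a 95% target, yet in 10% steps it is drawn as 90% and looks the same as a site at 91%.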

Feedback from CMS
From the CMS VO

Feedback from CMS
CMS uses the SAM framework to:
– run experiment-specific tests on the worker node
– check various details of the local configuration and the accessibility of the CMS software and services
– test low-level functionalities of SRM services in accordance with their usage by the CMS data management system
– run tests on CMS-specific services (foreseen)

Feedback from CMS
CMS requests for improvement (some of them are already planned in the SAM development, but CMS would still like to mention them because of their importance):
– possibility to run different tests with different proxies or VOMS FQANs
– support for experiment-specific service types
– support for automatic renewal of VOMS proxies
– more flexible HTTP queries to the SAM database
– analogous queries to produce history or availability plots which can be included in any web page
– introduction of a new service type for SRMv2
– different levels of test criticality
– possibility to run "on demand" experiment-specific tests from a web interface
– a more user-friendly web interface for SAM results, integrated with plots, all from a single entry point
– possibility for the CIC-on-duty to act also on failures of experiment-specific tests
– support for alarms and notifications