Polish Infrastructure for Supporting Computational Science in the European Research Space EUROPEAN UNION Operations in PL-Grid M. Radecki, T. Szepieniec,

1 Operations in PL-Grid
Polish Infrastructure for Supporting Computational Science in the European Research Space
EUROPEAN UNION
M. Radecki, T. Szepieniec, M. Krakowian, T. Szymocha, M. Zdybek, D. Harezlak, and J. Andrzejewski
ACC CYFRONET AGH
Cracow Grid Workshop, Cracow, 11.10.2010

2 Outline
• Goal of Grid Operations
• PL-Grid services for users
  - User registration and account management – PL-Grid Portal
  - Incident reporting
  - Usage monitoring
• PL-Grid services for Polish NGI
  - service availability monitoring
  - grid usage accounting
  - issue tracking
• High level view on EGI, NGI and PL-Grid Operations
• Incident Management in PL-Grid
• Grid Infrastructure Monitoring
• Operations Communication and Documentation

3 Goal of PL-Grid Operations
• coordinate and fulfill activities and processes required to provide and manage services for PL-Grid users
• manage the technology required to provide and support these services

4 PL-Grid infrastructure services
• Services for users
  - access to computing power and storage space in the 5 largest Polish computing centers
  - scientific software (e.g. Gaussian, Fluent, Povray)
  - user account management system
  - facilities to report problems & service requests
  - resource usage monitoring system
  - application portals and other tools for users (soon)
• PL-Grid, as the Polish NGI, is obliged to provide some services interfaced to EGI
  - service availability monitoring system
  - issue tracking and user support system
  - accounting (resource usage) system

5 User account management
• Motivation: necessity to determine whether a user is entitled to use PL-Grid resources
  - Registration process confirms that a user is a researcher affiliated with a Polish research unit, or a ward: undergraduates and PhD students authorized by a supervisor
  - Registration must be on-line for the user
• Implementation: PL-Grid Portal based on the Liferay engine
  - Successful user registration results in a Portal account, the PL-Grid “entry point” for the user
  - Easily extended with new functionality using JSR 286 portlets
  - Ability to re-use the rich Liferay component library, e.g. forum, wiki
• PL-Grid specific features
  - Easy personal certificate access: ability to get an X.509 certificate on-line, with scope limited to PL-Grid services only
  - User account data integrated with PL-Grid tools & services; the user login is used for services allowing login/password authentication/authorization
  - Broadcast tool to contact all users

6 User account management – 1st year experiences
• PL-Grid user registration opened at last year's CGW
• PL-Grid Portal technology changed from Java Spring through Google Web Toolkit to Liferay
• Agreed formal process description documents are indispensable
  - user registration is important for all PL-Grid computing centers
  - procedure security
• User statistics (as of 10.10.2010)
  - Registered users: 204
    - PL-Grid staff: 64
    - independent researchers: 56
    - wards: 84
[Chart: number of registered users, Jan – Oct 2010]

7 PL-Grid Scientific Software & Helpdesk
• PL-Grid offers access to both commercial and free scientific applications
  - NAMD, ADF, Blender, CFour, CPMD, Dalton, Fluent, Gamess, Gaussian, Gromacs, NWChem, Povray, Turbomole
• Availability of software and its current status are monitored, and the results are fed into the incident management system
  - higher availability for users
  - Users can check whether a program failed due to their own fault or a computing center problem
• Issues with monitoring
  - the monitoring system is designed for site admins and its web interface is unacceptable for users; consider using the myEGI portal when available
• PL-Grid Helpdesk allows reporting issues, problems and service requests
  - Reporting can be done via phone call, e-mail or the PL-Grid Helpdesk web interface; phone call reports are registered by an operator
  - Registering a report returns an incident identifier to the user, which allows referring to and modifying the incident later on
  - An incident transferred to EGI level, when the solution lies beyond the scope of the Polish NGI, can still be managed via PL-Grid Helpdesk

8 Resource Usage Monitoring System
• Motivation: PL-Grid grant accounting, daily data reports for users
• In the first available prototype, users can track their resource usage
  - daily status of jobs
  - daily workload (CPU time, walltime) per computing center
• Currently used in parallel with the EGI accounting system, APEL
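
The per-center daily workload view on this slide could be computed roughly as follows. This is a minimal sketch, not the PL-Grid implementation: the `JobRecord` fields, the center names and the `daily_workload` function are assumptions made for illustration, and real accounting records (for example those handled by APEL) carry many more attributes.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical, simplified accounting record for one finished job."""
    user: str
    center: str          # computing center that ran the job
    day: date            # day on which the job finished
    cpu_seconds: float
    wall_seconds: float


def daily_workload(records, user):
    """Sum CPU time and walltime per (day, center) for a single user."""
    totals = defaultdict(lambda: {"cpu_seconds": 0.0, "wall_seconds": 0.0})
    for rec in records:
        if rec.user != user:
            continue
        key = (rec.day, rec.center)
        totals[key]["cpu_seconds"] += rec.cpu_seconds
        totals[key]["wall_seconds"] += rec.wall_seconds
    return dict(totals)


# Made-up example data; user and center names are placeholders.
records = [
    JobRecord("alice", "center-a", date(2010, 10, 1), 3600.0, 4000.0),
    JobRecord("alice", "center-a", date(2010, 10, 1), 1800.0, 2000.0),
    JobRecord("alice", "center-b", date(2010, 10, 1), 600.0, 900.0),
    JobRecord("bob", "center-a", date(2010, 10, 1), 100.0, 120.0),
]

report = daily_workload(records, "alice")
for (day, center), usage in sorted(report.items()):
    print(day, center, usage["cpu_seconds"], usage["wall_seconds"])
```

Keying the totals by (day, center) keeps the sketch close to the "per computing center, per day" report described on the slide; other groupings, such as per grant, would only change the key.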

9 EGI, NGI & PL-Grid Operations – high level view
[Diagram. Labels: Operations Support Teams; Operations Support Tools; EGI: Central Operator on Duty; NGI: Regional Operator on Duty; Regional Technical Support; Site Administrators; EGI Operations Dashboard; GGUS; PL-Grid Helpdesk; Monitoring; WebSvc; JMS; use]

10 PL-Grid Operations: Incident Management
“The main objective of incident management process is to resume regular state of affairs as quickly as possible and minimize the impact on business processes.” Service Operation based on ITIL® V3
• Identification
  - incidents are triggered by the monitoring system, users or technical staff
• Registration
  - issue tracking system (PL-Grid adapted Request Tracker)
  - an incident reported by a user or staff is always registered
  - only long-standing (>24h) problems reported by the monitoring system are registered
• Classification
  - regular middleware services / PL-Grid applications
• Escalation
  - experts are responsible for making sure the problem is solved or reassigned
  - incidents can be escalated to EGI for software problems
• Solution applied & tested => issue closed
  - administrator of the failed resource applies the solution
  - triggers execution of the monitoring system probes
  - check if the user is satisfied => if all OK, close the incident
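
The registration rule on this slide (reports from people are always registered, monitoring alarms only once they persist for more than 24h) can be written down as a small decision function. This is an illustrative sketch only: the `Report` class, its field names and `should_register` are hypothetical and do not describe how the PL-Grid Request Tracker integration is actually built.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The ">24h" threshold for monitoring-reported problems, taken from the slide.
MONITORING_THRESHOLD = timedelta(hours=24)


@dataclass
class Report:
    """Hypothetical incident report; field names are illustrative only."""
    source: str            # "user", "staff" or "monitoring"
    first_seen: datetime   # when the problem was first observed


def should_register(report, now):
    """Decide whether a report becomes a ticket in the issue tracker."""
    if report.source in ("user", "staff"):
        # Reports coming from people are always registered.
        return True
    if report.source == "monitoring":
        # Monitoring alarms are registered only once they are long-standing.
        return now - report.first_seen > MONITORING_THRESHOLD
    raise ValueError(f"unknown report source: {report.source!r}")


now = datetime(2010, 10, 11, 12, 0)
fresh_alarm = Report("monitoring", now - timedelta(hours=2))
old_alarm = Report("monitoring", now - timedelta(hours=30))
user_report = Report("user", now)

print(should_register(fresh_alarm, now))
print(should_register(old_alarm, now))
print(should_register(user_report, now))
```

With the example values, only the two-hour-old monitoring alarm is left unregistered; it falls under the pro-active handling window rather than the ticketing process.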

11 Incident Management – PL-Grid experience
• Pro-active troubleshooting procedures, involving Regional Technical Support, for incidents reported by the monitoring system during their first 24h
• Incident solution process can be a useful source of knowledge
  - PL-Grid introduced an Operational Problems Knowledge Base
  - Regional Technical Support team creates entries
  - data to be re-used when a similar problem occurs again
  - publicly available: web pages indexed by search engines
  - an entry contains the full error message and a detailed solution procedure; in case of problems, paste your error message into Google Search
  - KB population started in Aug 2009, ~50 entries
  - knowledge base link: https://weblog.plgrid.pl/category/1st-line-support/
• Incident Management Metrics – evaluate performance
  - quantitative, e.g. number of incidents, individual submitters, GGUS share etc.
  - focused on teams' response time
• Issues
  - team reaction time metrics indicate room for improvement; need to promote incident handling procedures among supporters/experts
  - Knowledge Base requires an initial investment, but the more entries it has, the more it pays off
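
The quantitative metrics named on this slide (number of incidents, individual submitters, GGUS share, team response time) could be derived from ticket data along these lines. This is a sketch under stated assumptions: the `Incident` fields, the team labels and the choice of the median as the response-time statistic are all hypothetical, since the slides do not say how the metrics are actually calculated.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Incident:
    """Hypothetical ticket summary; field names are illustrative only."""
    submitter: str
    team: str                 # team responsible for the first response
    created: datetime
    first_response: datetime
    escalated_to_ggus: bool


def incident_metrics(incidents):
    """Compute simple counts plus the median first-response time per team."""
    if not incidents:
        raise ValueError("no incidents to evaluate")
    hours_by_team = {}
    for inc in incidents:
        hours = (inc.first_response - inc.created).total_seconds() / 3600
        hours_by_team.setdefault(inc.team, []).append(hours)
    return {
        "incidents": len(incidents),
        "submitters": len({inc.submitter for inc in incidents}),
        "ggus_share": sum(inc.escalated_to_ggus for inc in incidents) / len(incidents),
        "median_response_hours": {t: median(h) for t, h in hours_by_team.items()},
    }


# Made-up example tickets.
t0 = datetime(2010, 9, 1, 8, 0)
incidents = [
    Incident("u1", "RTS", t0, t0 + timedelta(hours=2), False),
    Incident("u2", "RTS", t0, t0 + timedelta(hours=4), True),
    Incident("u3", "RTS", t0, t0 + timedelta(hours=6), False),
    Incident("u1", "site-admins", t0, t0 + timedelta(hours=1), False),
]

result = incident_metrics(incidents)
print(result)
```

The median is used here because a single long-running ticket would otherwise dominate an average; any other statistic could be substituted without changing the structure.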

12 Grid Infrastructure Monitoring System
• Motivation: it is not acceptable to wait for a user to report a service problem
• The PL-Grid monitoring system is an extended version of the EGI Nagios-based system for monitoring grid service availability
• PL-Grid extensions
  - monitoring PL-Grid scientific software
  - probes for availability of PL-Grid VO (vo.plgrid.pl)
  - other middleware services (being integrated)
• Alarms are sent to the EGI message bus (based on the ActiveMQ JMS implementation) and then displayed in the EGI Operations Dashboard (incl. PL-Grid extensions)
• Issues
  - core services are poorly monitored or not monitored at all
  - the monitoring system only triggers incidents; it would be nice to be able to monitor trends and predict failures
  - no control system; services do not have a management interface – a software maturity issue
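
To give a feel for what a software-availability probe in a Nagios-based system looks like, here is a minimal sketch. It relies only on the general Nagios plugin convention (exit codes 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN, plus a one-line status message). The check itself, looking up an executable on `PATH`, is an assumption chosen for brevity; the actual PL-Grid probes are not described in the slides and may well test applications by running jobs instead.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_software(executable):
    """Return (exit_code, status_line) for one application executable.

    Hypothetical check: the application counts as available if its
    executable can be found on PATH.
    """
    if not executable:
        return UNKNOWN, "UNKNOWN - no executable name given"
    path = shutil.which(executable)
    if path is None:
        return CRITICAL, f"CRITICAL - {executable} not found on PATH"
    return OK, f"OK - {executable} found at {path}"


def main(argv):
    """Print the status line and return the exit code for the scheduler."""
    code, message = check_software(argv[1] if len(argv) > 1 else "")
    print(message)
    return code


# Example call with a placeholder name that should not exist anywhere.
exit_code = main(["probe", "no-such-plgrid-application-xyz"])
```

In a real deployment the script would end with `sys.exit(main(sys.argv))` so that Nagios receives the exit code; that line is left out here so the sketch can be run and imported without terminating the interpreter.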

13 Operations Communication & Documentation
• PL-Grid Operations Center is distributed; resources are located in geographically distant centers, which requires means of communication other than face-to-face (F2F)
• Solving an operational problem requires interactive communication (better than e-mail)
• Coordination of distributed teams requires procedures, work descriptions and handovers
  - PL-Grid uses bi-weekly teleconferences where operations issues can be discussed
  - Jabber service with an automatically generated contact list of all registered PL-Grid staff
  - RTS fills in daily handover reports and a quarterly summary
• Operational Documentation
  - Incident Handling in PL-Grid Helpdesk: https://weblog.plgrid.pl/procedura-obslugi-helpdesku/
  - Operational Procedures for ROD, RTS and site admins: https://weblog.plgrid.pl/procedury-operacyjne-pl-grid/

14 Questions?

