1
Supporting Grid Environments
Leigh Grundhoefer, Indiana University
Thank you for inviting me to discuss operations issues for grid environments. I have heard a lot of good presentations.
2
Agenda
- Introduction
- Grid3 environment
- Operations model and implementation
- Conclusions
9 December 2018
3
Grid Support
What kind of infrastructure?
- Definition of “instrumentation” software
- Deployment policies and procedures
- Error handling methods
What is the structure for the support?
- Try to reduce duplication of effort
- Integration of grid support with a variable set of existing resource provider support mechanisms
- Interfacing support staff and grid experts
4
Integrating grid support
[Diagram labels: NOC, Facility Operations and Support, Security Czar, Grid ops, Network gods, Sys. Admin, Resources]
5
Grid Operations Mission
Deploy, maintain, and operate a grid environment as a NOC manages an inter-network, providing a single point of operations for configuration support, monitoring of status and usage (current and historical), problem management, support for users, developers and systems administrators, provision of grid services, security incident response, and maintenance of grid information repositories.
Proposed Areas of Research
- Access control and policy: Security
- Trouble Ticket System: Problem coordination
- Configuration and Information Services
- Health and Status Monitoring
- Experiment Scheduling
6
Agenda
- Introduction
- Grid3 environment
- Operations model and implementation
- Conclusions
7
Grid3: an application grid laboratory
- CERN LHC: US ATLAS testbeds & data challenges
- CERN LHC: USCMS testbeds & data challenges
- End-to-end HENP applications
- Virtual data research: Virtual Data Toolkit
- Virtual data grid laboratory: Software/Operations/Facilities
8
Grid3 Overview
- Grid environment built from core Globus 2 and Condor middleware, as delivered through the Virtual Data Toolkit (VDT) and added to a compute cluster or storage resource
- Multi-VO based security (Virtual Organisation Membership Service)
- No shell access to grid resources, no grid-based privileged access
- Monitoring instrumentation and service metrics defined by the Project Plan
- Currently 32 sites and opportunistic use of ~3200 CPUs
- Delivering the US LHC Data Challenges
9
Integrated Monitoring Framework
- Globus Meta Directory System (LDAP directory)
- MonALISA, Monitoring Agents in Large Integrated Service Architecture (Pub/Sub)
- MonALISA repository (WS/WAP)
- Ganglia performance monitoring (Multicast/Hierarchical)
- Job Monitoring System at the Advanced Center for Distributed Computing (non-invasive archive)
- The Grid Site Status Cataloging System at iGOC (human/automatic managed DB)
Our instrumentation resides in the monitoring framework, which is partially displayed here. (The Buffalo CCR Job Monitoring is a separate but equal framework.)
10
Grid3 – Monitoring Snapshots
Service monitoring
- GridCat: The GridCatalog software is freely available for download. It assists operations in visually reviewing the status of the grid software at each of the grid sites.
- MonALISA
11
Grid3 – Monitoring Snapshots
Job monitoring
- ACDC job monitoring
12
Agenda
- Introduction
- Grid3 environment
- Grid Operations model and implementation
- Conclusions
13
Grid Operations Approach
The Operations group:
- Sets up and maintains a cooperative grid community
- Facilitates work to and among responsible agents
- Has no direct control: uses notification with follow-ups
- Tunes services to the capabilities of the sites
Cooperative and mentoring principles are employed:
- Identifies community vision, i.e. the Project Plan (anchor)
- Utilizes a participatory decision-making process: Taskforce
- Makes clear agreements: Service Descriptions and MOUs
- Makes clear communication and conflict resolution a priority
- Weekly operations (problem solving) and management teleconferences
The main point of this slide is that this is a cooperative, facilitated effort. The GOC facilitates; it has no direct control.
15
Service Desk Activities
- A common face to collaboratively-provided support
- Facilitate and support communications:
  - Direct with site administrators and Grid users
  - Web page resources
  - Status reporting to mailing list
- Monitor status of Grid resources
- Coordinate and track:
  - Problems
  - Changes (software updates, resource additions)
  - Security incidents
  - Requests for assistance
16
Service Desk Activities (cont.)
- Provide reports: problem summaries, service desk activity
- Maintain the repository of support and process information
- User support, such as:
  - How to join a VO
  - How to get and maintain a cert
  - How to run an application
  - How to use monitoring tools
  - Troubleshooting application failures
- Information about policies, etc.
17
Provisioning
- Create and maintain the grid-controlled software packages and cache
- Provide site software not supported through VDT
- Verify software compatibility
- Provide ease-of-installation tools
- Develop instructions on how to plug things together
- Provide site installation and configuration support
- End-to-end troubleshooting for resources
- Provide and maintain common Grid services such as VOMS, GIIS, RLS, archives, and monitoring systems
18
Leveraging the NOC
Global NOC at Indiana University
The Global NOC provides 24x7 network engineering and operations services for research and education networks and international interconnections, including the Internet2 Abilene, National LambdaRail, TransPAC and AMPATH networks, the STAR TAP and MAN LAN layer 3 international exchange points, and the StarLight optical exchange. In addition, the Global NOC supports activities of the iVDGL Grid Operations Center and the REN-ISAC cybersecurity Watch Desk. By virtue of its R&E network, grid, and cybersecurity activities, the Global NOC possesses a unique and embracing view of R&E cyberinfrastructure.
19
Monitoring the GOC services
[Diagram labels: NOC Mon (Nagios), NOC, Contact DB, Trouble Tickets (Ticket 894), Grid Systems and Services (run every 15m), GOC]
21
Problem to Trouble Ticket
- Scope: a single resource / multiple resources, application wide, VO wide, grid wide, operations resource / operations service
- Severity: Critical, High, Elevated, Normal
- Problem Owner
- Problem Contact
- Problem Description
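The ticket fields listed on this slide can be sketched as a small record type. This is only an illustration, not the GOC's actual ticket schema: the class names, field names, and the numeric ordering of severities are assumptions, and the scope values cover only the entries whose meaning is unambiguous on the slide.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Scope(Enum):
    # Scope values named on the slide (hypothetical identifiers).
    SINGLE_RESOURCE = "a single resource"
    MULTIPLE_RESOURCES = "multiple resources"
    APPLICATION_WIDE = "application wide"
    VO_WIDE = "VO wide"
    GRID_WIDE = "grid wide"


class Severity(IntEnum):
    # Assumption: a larger value means a more urgent ticket.
    NORMAL = 1
    ELEVATED = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class TroubleTicket:
    ticket_id: int
    scope: Scope
    severity: Severity
    owner: str        # Problem Owner
    contact: str      # Problem Contact
    description: str  # Problem Description


def most_urgent_first(tickets):
    """Sort by descending severity, then by ticket id for a stable order."""
    return sorted(tickets, key=lambda t: (-t.severity, t.ticket_id))
```

With a type like this, a service desk view could call `most_urgent_first` to list Critical tickets ahead of Normal ones.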
22
Security/Incident Handling
[Diagram labels for a monitoring event: Site fails Grid Catalog test (run every 5 hours), NOC monitors Grid Catalog map, Trouble Tickets (Ticket 854), GOC, Grid Experts, GOC Mon, GridCat, MonALISA, Contact DB, Security/Incident Handling, VO support or Facility, Resource]
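The flow in this diagram, where a failed site test leads to a trouble ticket and a contact lookup, might be sketched as follows. Everything here is hypothetical: `TicketQueue`, `handle_test_results`, and the dictionary used as a contact database are stand-ins for illustration, not the GOC's tooling, and the sketch assumes one open ticket per failing site.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class TicketQueue:
    """Hypothetical in-memory stand-in for the trouble ticket system."""
    _ids: count = field(default_factory=lambda: count(1))
    open_tickets: dict = field(default_factory=dict)  # site name -> ticket

    def open_for_site(self, site, contact, reason):
        # Reuse the existing ticket if the site keeps failing across test runs.
        if site in self.open_tickets:
            return self.open_tickets[site]
        ticket = {"id": next(self._ids), "site": site,
                  "contact": contact, "reason": reason}
        self.open_tickets[site] = ticket
        return ticket


def handle_test_results(results, contact_db, queue):
    """results maps site name -> True if the site passed the status test.

    Returns the tickets (new or already open) for every failing site.
    """
    tickets = []
    for site, passed in sorted(results.items()):
        if passed:
            continue
        contact = contact_db.get(site, "unknown contact")
        tickets.append(queue.open_for_site(site, contact, "failed site status test"))
    return tickets
```

A scheduler would call `handle_test_results` after each periodic test run; reusing the open ticket keeps a persistently failing site from generating a new ticket every cycle.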
23
Reactive Support workflow
[Diagram labels: User/Admin, GOC (Web form & Telephone), Trouble Tickets (Tickets 803, 823, 833, 843), Grid Experts, Web Docs, Developers, Contact DB, Other Support Centers, Security/Incident Handling]
Request types:
- Application failure
- Planned outages
- Security problems
- Installation help
- Configuration assistance
- Identity management
- Authorization problems
24
Analysis of Effort by Area
~800 tickets total
- Issues relating to resource owners and providers: 60%
- Special issues for Virtual Organizations (VOs): (percentage missing)
- Issues relating to developers of applications and workflow environments (portals): 10%
- Support to individuals using Grid resources: (percentage missing)
Developers: e.g. VDS doesn’t work, or how to get information in order to do their own resource brokering.
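As a rough worked example, the two percentages that survive on this slide can be turned into approximate ticket counts. The sketch assumes the four areas are exhaustive and sum to 100%, which the slide does not state, so the combined remainder is an inference rather than a reported figure.

```python
TOTAL_TICKETS = 800  # the slide says "~800", so every count below is approximate

# Only the two shares that are legible on the slide.
known_shares = {
    "resource owners and providers": 60,
    "developers of applications and workflow environments": 10,
}

approx_counts = {
    area: TOTAL_TICKETS * pct // 100 for area, pct in known_shares.items()
}

# Assumption: the areas sum to 100%, so the rest is split between
# VO issues and support to individual users in unknown proportions.
remaining_pct = 100 - sum(known_shares.values())
remaining_tickets = TOTAL_TICKETS * remaining_pct // 100

for area, n in approx_counts.items():
    print(f"{area}: ~{n} tickets")
print(f"VO issues + individual user support: ~{remaining_tickets} tickets ({remaining_pct}%)")
```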
25
Agenda
- Introduction
- Grid3 environment
- Operations model and implementation
- Conclusions
26
Operations Enables Applications
Provide operational services that give applications the “instruments” to:
- Publish site policies and environment
- Know the status of grid middleware on sites
- Know the job queue for compute resources
- Know the status and load of grid resources
- Access historical monitoring information
- Manage grid services
- Keep apprised of security incidents in the collaborative
27
Lessons Learned
- Configuration management efforts in the development and deployment areas are rewarded many times over during production.
- A monitoring infrastructure gives a significant problem-solving advantage, especially redundant monitoring.
- Establishing clear communications between resource providers, users and Virtual Organizations is hard.
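One way redundant monitoring can help with problem solving is by cross-checking two independent status sources. The sketch below is an illustration under stated assumptions, not a description of how Grid3 combined its monitors: the function name, the boolean status format, and the wording of each verdict are all hypothetical.

```python
def cross_check(primary, secondary):
    """Compare per-site status (True = healthy) from two independent monitors.

    Agreement on a failure points at the site; disagreement suggests that
    one of the monitoring paths may itself be at fault.
    """
    report = {}
    for site in sorted(set(primary) | set(secondary)):
        a, b = primary.get(site), secondary.get(site)
        if a is None or b is None:
            report[site] = "only one monitor reporting"
        elif a and b:
            report[site] = "ok"
        elif not a and not b:
            report[site] = "site problem likely"
        else:
            report[site] = "monitors disagree: check the monitoring path"
    return report
```

For example, `cross_check({"site-a": False}, {"site-a": True})` flags `site-a` as a disagreement, which is a different troubleshooting starting point than two monitors both reporting a failure.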
28
More Lessons Learned
- Human interactions in grid building are costly.
- Keeping resource provider requirements light led to heavy loads on gatekeeper hosts (monitoring framework).
- A diverse set of resource configurations made exchanging job requirements difficult.
- Troubleshooting: efficiency for submitted jobs was not as high as we’d like.
29
Upcoming Challenges
- Shared problem handling with application-centric and VO-centric support structures
- Ticket passing to and from other Grid environments
- Establishing a working monitoring framework for distributed storage resources and virtual data cataloging infrastructure
30
Thank You - Questions?