1 A Service-Based SLA Model
HEPIX -- CERN, May 6, 2008
Tony Chan -- BNL

2 Overview
- Facility operations is a manpower-intensive activity at the RACF.
- Sub-groups are responsible for systems within the facility (tape storage, disk storage, Linux farm, grid computing, network, etc.), covering:
  - Software upgrades
  - Hardware lifecycle management
  - Integrity of facility services
  - User account lifecycle management
  - Cyber-security
- Experience with RHIC operations for the past 9 years.
- Support for ATLAS Tier 1 facility operations.

3 Experience with RHIC Operations
- 24x7, year-round operations since 2000.
- Facility systems are classified into 3 categories: non-essential, essential and critical.
- Response to a system failure depends on the component's classification:
  - Critical components are covered 24x7, year-round; immediate response is expected from on-call staff.
  - Essential components have built-in redundancy/duplication and are addressed the next business day. They are escalated to "critical" if a large number of essential components fail and compromise service availability.
  - Non-essential components are addressed the next business day.
- Staff provides primary coverage during normal business hours.
- Operators contact the on-call person during off-hours and weekends.

4 Experience with RHIC Operations (cont.)
- Users report problems via the ticket system, pagers and/or phone.
- Monitoring software is instrumented with an alarm system.
- The alarm system is connected to selected pagers and cell phones.
- Limited alarm escalation procedure (i.e., contact the back-up if the primary is not available) during off-hours and weekends.
- Periodic rotation of the primary and back-up on-call list for each subsystem.
- Automatic response to alarm conditions in certain cases (e.g., shutdown of the Linux Farm cluster in case of cooling failure).
- RHIC facility operations have worked well over the past 8 years.

5 Service Level Agreement
Table 1: Summary of RCF Services and Servers

Service | Server | Rank | Comments
Network to Ring | | 1 |
Internal Network | | 1 |
External Network | | 1 | ITD handles
RCF firewall | | 1 | ITD handles
HPSS | rmdsXX | 1 |
AFS Server | rafsXX | 1 |
AFS File systems | | 1 |
NFS Server | | 1 |
NFS home directories | rmineXX | 1 |
CRS Management | rcrsfm, rcras | 1 | rcrsfm is 1, rcras is 2
Web server (internet) | www.rhic.bnl.gov | 1 |
Web server (intranet) | www.rcf.bnl.gov | 1 |
NFS data disks | rmineXX | 1 |
Instrumentation | | 2 |
SAMBA | rsmb00 | |
DNS | rnisXX | 2 | Should fail over
NIS | rnisXX | 2 | Should fail over
NTP | rnisXX | 2 | Should fail over
RCF gateways | | 2 | Multiple gateway machines
ADSM backup | | 2 |
Wincenter | rnts00 | 2/3 |
CRS Farm | | 2 |
LSF | rlsf00 | 2 |
CAS Farm | | 2 |
rftp | | 2 |
Oracle | | 2 |
Objectivity | | 2 |
MySQL | | 2 |
Email | | 2/3 |
Printers | | 3 |

6 A New Operational Model for the RACF
- RHIC facility operations follow a system-based approach.
- Some systems support more than one service, and some services depend on multiple systems, leading to unclear lines of responsibility.
- A service-based operational approach is better suited to the distributed computing environment in ATLAS.
- Tighter integration of monitoring, the alarm mechanism and problem tracking; automate where possible.
- Define a system and service dependency matrix.

7 Service/System Dependency Matrix
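The dependency matrix itself appears on this slide as a figure and is not reproduced in the transcript. Purely to illustrate the idea, here is a minimal Python sketch of a service-to-system dependency matrix; the service and system names are assumptions for the example, not the RACF's actual matrix.

```python
# Minimal sketch of a service-to-system dependency matrix (illustrative only;
# the service and system names below are assumptions, not the RACF's actual matrix).

# Each service maps to the set of systems it depends on.
DEPENDENCIES = {
    "batch":        {"condor", "nfs_home", "network"},
    "dcache":       {"disk_storage", "network"},
    "tape_archive": {"hpss", "network"},
    "login":        {"afs", "nfs_home", "network"},
}

def services_affected_by(system: str) -> list[str]:
    """Return the services whose availability depends on the given system."""
    return sorted(svc for svc, systems in DEPENDENCIES.items() if system in systems)

if __name__ == "__main__":
    # A failure of the NFS home-directory server would impact these services:
    print(services_affected_by("nfs_home"))   # ['batch', 'login']
```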

8 Monitoring in the new SLA
- Monitor service and system availability, system performance and facility infrastructure (power, cooling, network).
- Mixture of open-source and RACF-written components:
  - Nagios
  - Infrastructure
  - Condor
  - RT
- Choices guided by desired features: historical logs, ease of integration with other software, support from the open-source community, ease of configuration, etc.

9 Nagios
- Monitors service availability.
- Host-based daemons are configured to use externally supplied "plugins" to obtain service status.
- Host-based alarm response is customized (e-mail notification, system reboot, etc.).
- Connected to the RT ticketing system for alarm logging and escalation.
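As a concrete illustration of the plugin mechanism, here is a minimal sketch of a Nagios-style check script following the standard plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, plus one line of status text). The host and port checked are placeholders, not an actual RACF service.

```python
#!/usr/bin/env python3
# Sketch of a Nagios-style plugin: Nagios runs the executable and reads its exit
# code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) plus one line of status text.
# The host and port below are placeholders, not an actual RACF service.
import socket
import sys

HOST, PORT, TIMEOUT = "testhost1.example.org", 22, 5.0

try:
    with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
        print(f"OK - {HOST}:{PORT} is accepting connections")
        sys.exit(0)
except OSError as exc:
    print(f"CRITICAL - cannot reach {HOST}:{PORT} ({exc})")
    sys.exit(2)
```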

10 Nagios (cont.)

11 Infrastructure (Cooling)
- The growth of the RACF has put considerable strain on power and cooling.
- UPS back-up power for RACF equipment.
- A custom RACF-written script monitors power and cooling issues.
- Alarm logging and escalation through the RT ticketing system.
- Controlled automatic shutdown of the Linux Farm during cooling or power failures.
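The RACF's actual power/cooling script is not shown in the slides; the following is only a sketch of the behavior described above (poll a temperature reading, trigger a controlled farm shutdown past a threshold). The sensor path, threshold and shutdown command are invented placeholders.

```python
# Illustrative sketch only: models the behavior described on this slide, not the
# RACF's actual script. Sensor path, threshold and shutdown command are placeholders.
import subprocess
import time

TEMP_SENSOR = "/var/run/machine_room_temp"   # hypothetical sensor readout file
SHUTDOWN_THRESHOLD_C = 35.0                  # hypothetical trip point
POLL_SECONDS = 60

def read_temperature() -> float:
    with open(TEMP_SENSOR) as f:
        return float(f.read().strip())

def shutdown_linux_farm() -> None:
    # Placeholder for the real controlled-shutdown procedure (e.g. draining batch
    # jobs first, then powering off nodes rack by rack).
    subprocess.run(["/usr/local/sbin/farm_shutdown", "--reason", "cooling-failure"],
                   check=False)

if __name__ == "__main__":
    while True:
        if read_temperature() > SHUTDOWN_THRESHOLD_C:
            shutdown_linux_farm()
            break
        time.sleep(POLL_SECONDS)
```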

12 Infrastructure (Network)
- Cacti is used to monitor network traffic and performance.
- Can be used at the switch or system level.
- Provides historical information and logs.
- To be instrumented with alarms and integrated into the alarm logging and escalation.

13 Condor
- Condor does not have a native monitoring interface.
- The RACF created its own web-based monitoring interface.
- The interface is used by staff for performance tuning.
- Connected to RT for alarm logging and escalation.
- Monitoring functions:
  - Throughput
  - Service availability
  - Configuration optimization
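Below is a minimal sketch of the kind of polling such a custom Condor monitor might perform, assuming the standard condor_status command is available; option syntax can vary between Condor versions, and this is not the RACF's actual interface code.

```python
# Sketch only: samples Condor slot states via condor_status and summarizes
# utilization, roughly the "throughput / service availability" idea on this slide.
import subprocess
from collections import Counter

def slot_states() -> Counter:
    # One slot state per output line, e.g. "Claimed", "Unclaimed", "Owner".
    out = subprocess.run(
        ["condor_status", "-format", "%s\n", "State"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(line.strip() for line in out.splitlines() if line.strip())

if __name__ == "__main__":
    states = slot_states()
    total = sum(states.values())
    claimed = states.get("Claimed", 0)
    print(f"{claimed}/{total} slots claimed "
          f"({100.0 * claimed / max(total, 1):.1f}% utilization)")
```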

14 RT
- Flexible ticketing system.
- Historical records available.
- Coupled to the monitoring software for alarm logging and escalation.
- Integrated into the service-based SLA.

15 Implementing the new SLA
- Create an Alarm Management Layer (AML) to interface monitoring to RT.
- Alarm conditions are configurable via a custom-written rule engine.
- Clearer lines of responsibility for creating, maintaining and responding to alarms.
- The AML creates an RT ticket in the appropriate category and keeps track of responses.
- The AML escalates the alarm when an RT ticket is not addressed within a (configurable) amount of time (a sketch of this escalation step follows this list).
- Service Coordinators oversee management of service alarms.
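The following is a minimal sketch of the escalation step described above, with the RT and pager integrations stubbed out. It is not the actual AML code; function names and timings are illustrative.

```python
# Sketch of the AML escalation logic described on this slide (not the actual AML code).
# RT and pager calls are stubbed; a real deployment would talk to RT and a paging gateway.
import time

ESCALATION_SECONDS = 30 * 60          # configurable per rule, cf. after_hours_PageTime

def create_rt_ticket(queue: str, subject: str) -> int:
    print(f"[stub] new RT ticket in queue '{queue}': {subject}")
    return 1                          # placeholder ticket number

def rt_ticket_status(ticket_id: int) -> str:
    return "new"                      # placeholder; the real AML would query RT

def page(contact: str, message: str) -> None:
    print(f"[stub] page {contact}: {message}")

def handle_alarm(queue: str, summary: str, first_contact: str, second_contact: str) -> None:
    ticket_id = create_rt_ticket(queue, summary)
    page(first_contact, f"RT #{ticket_id}: {summary}")
    deadline = time.time() + ESCALATION_SECONDS
    while time.time() < deadline:
        if rt_ticket_status(ticket_id) != "new":
            return                    # ticket was taken; no escalation needed
        time.sleep(60)
    page(second_contact, f"ESCALATED RT #{ticket_id}: {summary}")
```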

16 How It Works

17 What data is logged?
- Host, service, host group, and service group
- Alarm timestamp
- NRPE (Nagios) message content
- Alarm status
- Notification status
- RT ticket status (new, open, resolved)
- Timestamp of latest RT update
- Due date
- RT ticket information (number, queue, owner, priority, etc.)
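One possible way to structure this logged record is shown below as a Python dataclass; the field names are illustrative, not the AML's actual schema.

```python
# Illustrative record layout for the fields listed on this slide (not the actual schema).
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class AlarmRecord:
    host: str
    service: str
    hostgroup: str
    servicegroup: str
    alarm_time: datetime          # alarm timestamp
    nrpe_message: str             # NRPE (Nagios) message content
    alarm_status: str
    notification_status: str
    rt_status: str                # new / open / resolved
    rt_last_update: datetime      # timestamp of latest RT update
    due_date: Optional[datetime]
    rt_ticket: int                # plus queue, owner, priority, etc. in the real record
```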

18 Example Configuration (Rule) File
[linuxfarm-testrule]
host: testhost(\d)                      (regular-expression compatible)
service: condorq, condor
hostgroup: any
queue: Test
after_hours_PageTime: 30
work_hours_PageTime: 60
work_hours_response_time: 120           (when the problem must be resolved by)
after_hours_response_time: 720          (when the problem must be resolved by)
auto_up: 1                              (page people)
down_hosts: 2                           (number of down hosts that constitutes a real problem)
firstContact: test-person@pager
secondContact: test-person@bnl.gov
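The parenthetical annotations above are explanatory notes from the slide, not part of the file. Assuming the rule file keeps this INI-like [section] / key: value layout, it could be read with Python's standard configparser, as in the sketch below; the filename is an assumption and the actual RACF rule engine is custom.

```python
# Sketch of reading a rule file like the one above with configparser; the real
# RACF rule engine is custom, so this only illustrates the key/value structure.
import configparser
import re

rules = configparser.ConfigParser()
rules.read("aml_rules.cfg")              # hypothetical filename

rule = rules["linuxfarm-testrule"]
host_pattern = re.compile(rule["host"])  # e.g. testhost(\d), a regular expression
services = [s.strip() for s in rule["service"].split(",")]
work_hours_response = int(rule["work_hours_response_time"])   # minutes

print(host_pattern.pattern, services, work_hours_response)
```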

19 New Response Mechanism

20 Summary
- Well-established procedures from RHIC operational experience.
- A service-based SLA is needed for the distributed computing environment.
- Create an Alarm Management Layer (AML) to integrate RT with the monitoring tools and establish clearer lines of responsibility for staff.
- Some features are already functional.
- Full implementation is expected by late summer 2008.

