1 24x7 support status and plans at PIC Gonzalo Merino WLCG MB 13-10-2006.

1 24x7 support status and plans at PIC Gonzalo Merino WLCG MB 13-10-2006

2 Basic Infrastructure
Power supply
– 200 kVA UPS
– 500 kVA diesel generator
Air conditioning: 300 kW
– Survived last summer, which was exceptionally hot
Network
– Spanish NREN (RedIRIS), same level of support as GÉANT
– Support at the level of 24x7 (emergency telephone)

3 High Availability in Critical Servers
Today many servers are still running on "WN-like" h/w
– Many new services added in recent years
– Urgency to deploy/test/run them
Currently moving critical services to a standardized "server-like" building-block h/w
– Dual power supply
– Mirrored system disk: high-quality standard HDs, hot-swappable
– Dual Ethernet (using 2 separate switches)

4 High Availability in Critical Servers
Basic infrastructure
– DNS: use the secondary server in case of primary failure
  h/w: move to a robust platform in the near future
Databases
– FTS (Oracle) and LFC (MySQL): RAID1 system and DB disks
  Regular hot backup (FTS: 24h; LFC: 1h)
  h/w: move to a robust platform in the near future
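The two backup cadences above (FTS every 24 hours, LFC every hour) amount to a simple "is a hot backup due?" check. The sketch below is purely illustrative, not PIC's actual backup tooling; the database labels and function name are hypothetical:

```python
from datetime import datetime, timedelta

# Backup intervals taken from the slide: FTS every 24h, LFC every 1h.
INTERVALS = {"fts": timedelta(hours=24), "lfc": timedelta(hours=1)}

def due_for_backup(last_backup, now):
    """Return the databases whose last hot backup is at least as old
    as their configured interval (last_backup maps name -> datetime)."""
    return [db for db, interval in INTERVALS.items()
            if now - last_backup[db] >= interval]
```

A cron job running such a check would then invoke the appropriate dump tool (e.g. an Oracle hot backup or a MySQL dump) for each database it returns.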

5 High Availability in Critical Servers
Storage
– Castor: still using Castor1 in production. Servers are not HA; need to recover from backup in case of disaster. Now migrating to Castor2: production servers will be deployed on reliable h/w and, as much as possible, in an HA configuration.
– dCache: core services already deployed on 5 servers with reliable h/w. The deployment schema already has some HA:
  – 2 servers for PnfsManager (PostgreSQL replicated with Slony)
  – 2 servers for dCache core services
  – 1 server for the SRM service and its associated PostgreSQL DB

6 Monitoring: sensors
Currently using
– Nagios for alarm handling
  Operator also watches the SAM monitoring pages; currently in the process of interfacing these as local Nagios alarms
– Ganglia for time-dependent metric monitoring
Plan to evaluate other tools, like Lemon, with integrated capabilities and the possibility of archiving the full monitoring history
Missing a dashboard that gives the MoD a quick global status check
– Planning to create one that integrates the different views
– Interested in sharing information with other sites
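Interfacing the SAM pages as a local Nagios alarm essentially means wrapping the SAM result in a check script that speaks Nagios's exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A minimal hypothetical sketch, assuming the SAM status has already been fetched as a string (the status values themselves are illustrative):

```python
import sys

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical mapping from a SAM test result to a Nagios state.
SAM_TO_NAGIOS = {"ok": OK, "warn": WARNING, "error": CRITICAL}

def nagios_state(sam_status):
    """Translate a SAM status string into (exit code, Nagios label)."""
    code = SAM_TO_NAGIOS.get(sam_status.lower(), UNKNOWN)
    label = {OK: "OK", WARNING: "WARNING",
             CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}[code]
    return code, label

if __name__ == "__main__":
    status = sys.argv[1] if len(sys.argv) > 1 else ""
    code, label = nagios_state(status)
    # Nagios reads the first output line and the exit code.
    print("SAM %s: site status '%s'" % (label, status))
    sys.exit(code)
```

Registered as a Nagios check command, such a wrapper lets SAM results raise the same alarms the operator already handles for local services.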

7 Monitoring: actions
Two engineers from a collaborating company (TID) are developing INGRID
– INGRID: a framework for implementing an "expert system" that takes recovery actions depending on the service alarms received
– Not yet in production; plan to deploy it for the most critical services by 2007
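The idea behind such an expert system can be sketched as a rule table mapping service alarms to recovery actions, escalating to the MoD when no rule matches. The rules and names below are purely illustrative; INGRID's actual design is not described in the slides:

```python
# Hypothetical alarm -> recovery-action rules; real rules would be
# site- and service-specific and maintained by the service experts.
RULES = {
    ("gridftp", "daemon_down"): "restart gridftp daemon",
    ("srm", "db_connection_lost"): "restart SRM database connection pool",
}

def handle_alarm(service, alarm):
    """Return the recovery action for a known alarm, or an escalation
    to the Manager on Duty when no automated rule applies."""
    action = RULES.get((service, alarm))
    if action is None:
        return "escalate to MoD: %s/%s" % (service, alarm)
    return "run: %s" % action
```

The payoff of this pattern is that automated recovery handles the common failures, so the on-call person is only woken up for genuinely unknown problems.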

8 Manager on Duty
In charge of:
– Monitoring: support mailing list + alarms for critical services
– Redirecting issues to the relevant experts
– Tracking each problem until its resolution
  Internal ticketing system in place for follow-up, also used as a "knowledge database"
– Contacting the user back
– Writing a daily logbook/report with the main incidents

9 Manager on Duty
Pool of 7 people (will be 10 in 2007), weekly shifts (Wed-Wed)
Today: MoD only active during working hours
24x7 plan
– Implement an SMS service for critical-service alarms for all PIC employees
– MoD on-call during non-working hours
  Will act as 1st-line support for alarms; will be able to call a 2nd-line expert for escalation if needed
  On-call system being developed now (formal issues with contracts: pay extra hours vs. extra holidays? voluntary scheme?)
– Plan to finalise the definition of the 24x7 procedures by Dec and start operating them by March 2007

10 Summary
We are not planning to have staff on site 24x7, so emphasis is put on:
– Deploying services in a reliable/robust way
– Monitoring + automating recovery actions as much as possible
Pool of engineers taking Manager on Duty shifts
– Will evolve to cover non-working hours through an on-call schema
We are clearly not there yet, but targeting to have it by end of Q