NorthGrid
Alessandra Forti, M. Doidge, S. Jones, A. McNab, E. Korolkova
GridPP26, Brighton, 30 April 2011

Efficiency
The ratio of the effective or useful output to the total input in any system.
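Written out as a formula, the generic definition above, together with the CPU-time form commonly used for grid job plots such as the one on the next slide (neither expression is spelled out on the slides themselves):

```latex
\[
  \text{efficiency} = \frac{\text{useful output}}{\text{total input}},
  \qquad
  \text{CPU efficiency} = \frac{\text{CPU time}}{\text{wall-clock time}}
\]
```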

Pledges

Site         Job slots   HEPSPEC06   TB    Pledged HS06   Pledged TB
Lancaster    2,368       28,…        …     …              …
Liverpool    572         8,…         …     …              …
Manchester   2,770       22,…        …     …              …
Sheffield    400         4,…         …     …,198          252

CPU efficiency

Usage
NorthGrid normalised CPU time (HEPSPEC06) by site and VO, top 10 VOs (and other VOs), September 2010 – February 2011.

Successful job rates
ANALY_MANC (192046/52963)
ANALY_LANCS (155233/61368)
ANALY_SHEF (161043/25537)
ANALY_LIV (146994/21563)
UKI-NORTHGRID-MAN-HEP (494559/35496)
UKI-NORTHGRID-LANCS-HEP (252864/33889)
UKI-NORTHGRID-LIV-HEP (227395/15185)
UKI-NORTHGRID-SHEF-HEP (140804/8525)

Lancaster – keeping things smooth
Our main strategy for efficient running at Lancaster involves comprehensive monitoring and configuration management. Effective monitoring allows us to jump on incidents and spot problems before they bite us on the backside, as well as enabling us to better understand, and therefore tune, our systems. Cfengine on our nodes, and Kusu on the HEC machines, enables us to pre-empt misconfiguration issues on individual nodes, quickly rectify errors and ensure swift, homogeneous rollout of configs and changes. Whatever the monitoring, alerts keep us in the know. Among the many tools and tactics we use to keep on top of things are: syslog (with Logwatch mails), Ganglia, Nagios (with alerts), ATLAS Panda monitoring, Steve's pages, on-board monitoring and alerts for our Areca RAID arrays, Cacti for our network (and the HEC nodes), plus a whole bunch of hacky scripts and bash one-liners!
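As an illustration of the sort of custom check that sits behind Nagios alerts, here is a minimal sketch of a Nagios-style plugin; the mount point and thresholds are made up for the example and are not taken from Lancaster's setup.

```python
#!/usr/bin/env python
"""Minimal Nagios-style disk-space check (illustrative sketch only).

Exit codes follow the Nagios plugin convention:
0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
"""
import os
import sys

# Hypothetical mount point and thresholds -- adjust per site.
MOUNT = "/scratch"
WARN_FREE_GB = 50
CRIT_FREE_GB = 10

def free_gb(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / float(1024 ** 3)

try:
    free = free_gb(MOUNT)
except OSError as err:
    print("UNKNOWN - cannot stat %s: %s" % (MOUNT, err))
    sys.exit(3)

if free < CRIT_FREE_GB:
    print("CRITICAL - only %.1f GB free on %s" % (free, MOUNT))
    sys.exit(2)
elif free < WARN_FREE_GB:
    print("WARNING - %.1f GB free on %s" % (free, MOUNT))
    sys.exit(1)
print("OK - %.1f GB free on %s" % (free, MOUNT))
sys.exit(0)
```

Hooked into NRPE, a check along these lines can raise a warning or critical alert when scratch space runs low on a worker node.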

Lancaster – TODO list
We'll probably never stop finding things to polish, but some of the things at the top of the wishlist (in that we wish we could get the time to implement them!) are:
– A site dashboard (a huge, beautiful site dashboard).
– More Ganglia metrics! And more in-depth Nagios tests, particularly for batch system monitoring and RAID monitoring (recent storage purchases have 3ware and Adaptec RAIDs).
– Intelligent syslog monitoring, as the number of nodes at our site grows (a rough sketch of the idea follows this list).
– Increased network and job monitoring: the more detailed the picture we have of what's going on, the better we can tune things.
Other ideas for increasing our efficiency include SMS alerts, internal ticket management and introducing a more formalised on-call system.
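A crude sketch of what "intelligent syslog monitoring" might look like in practice; the log path, error patterns and threshold are assumptions for illustration, not Lancaster's actual tooling.

```python
#!/usr/bin/env python
"""Sketch: scan an aggregated syslog for per-host error bursts (illustrative)."""
import collections
import re

LOGFILE = "/var/log/messages"    # hypothetical central syslog file
PATTERNS = [r"I/O error", r"segfault", r"NFS.*not responding"]
THRESHOLD = 20                   # arbitrary alert threshold

counts = collections.defaultdict(int)
regex = re.compile("|".join(PATTERNS))

with open(LOGFILE) as log:
    for line in log:
        if regex.search(line):
            # syslog format: "Mon DD HH:MM:SS hostname daemon: message"
            fields = line.split()
            if len(fields) > 3:
                counts[fields[3]] += 1

for host, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    if n >= THRESHOLD:
        print("ALERT: %s matched %d error patterns" % (host, n))
```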

Liverpool hardware measures
Planning, design and testing – storage and node specifications
Network design, e.g.:
– Minimise contention
– Bonding
Extensive HW and SW soak testing, experimentation, tuning (a simple soak-test sketch follows below)
Adjustments and refinement
UPS coverage
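For the soak-testing step, something along the following lines would do as a minimal disk exerciser; the file name, block size and pass count are made up, and Liverpool's actual test harness is not described in the slides.

```python
#!/usr/bin/env python
"""Sketch: trivial disk soak test -- write, read back, verify (illustrative)."""
import hashlib
import os

TESTFILE = "/data/soaktest.bin"   # hypothetical scratch location
BLOCK = 4 * 1024 * 1024           # 4 MiB blocks
BLOCKS = 256                      # ~1 GiB written per pass
PASSES = 10

for p in range(PASSES):
    # Write random data and remember its checksum.
    digest = hashlib.md5()
    with open(TESTFILE, "wb") as f:
        for _ in range(BLOCKS):
            block = os.urandom(BLOCK)
            digest.update(block)
            f.write(block)
    written = digest.hexdigest()

    # Read it back and compare checksums.
    digest = hashlib.md5()
    with open(TESTFILE, "rb") as f:
        for chunk in iter(lambda: f.read(BLOCK), b""):
            digest.update(chunk)
    if digest.hexdigest() != written:
        print("Pass %d: checksum mismatch -- possible disk problem" % p)
        break
    print("Pass %d OK" % p)

os.remove(TESTFILE)
```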

Liverpool building and monitoring
Builds and maintenance – DHCP, kickstart, yum, Puppet, YAIM, standards.
Monitoring – Nagios (local and GridPP), Ganglia, Cacti/weathermap, log monitoring, tickets and mailing lists.
testnodes – local software that checks worker nodes to isolate potential “blackhole” conditions (a sketch of the idea is given below).
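The testnodes idea, spotting worker nodes that would silently eat jobs, could look roughly like this; the individual checks, paths, hostnames and the offline command are assumptions for illustration, not Liverpool's actual testnodes code.

```python
#!/usr/bin/env python
"""Sketch: detect likely "blackhole" worker-node conditions (illustrative only)."""
import os
import subprocess

def checks():
    """Yield (name, passed) pairs for a few plausible health checks."""
    # Experiment software area / CVMFS mount visible (path is an assumption).
    yield "sw-area", os.path.isdir("/cvmfs/atlas.cern.ch")
    # Enough scratch space for a job (10 GB threshold is arbitrary).
    st = os.statvfs("/tmp")
    yield "scratch", st.f_bavail * st.f_frsize > 10 * 1024 ** 3
    # Outbound connectivity, e.g. to the site squid (host and port assumed).
    rc = subprocess.call(["nc", "-z", "-w", "5", "squid.example.ac.uk", "3128"])
    yield "squid", rc == 0

failed = [name for name, ok in checks() if not ok]
if failed:
    print("Node failed checks: %s -- set it offline in the batch system"
          % ", ".join(failed))
    # e.g. subprocess.call(["pbsnodes", "-o", os.uname()[1]])  # assumption
else:
    print("Node looks healthy")
```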

Manchester install & config & monitor
Have to look after ~550 machines.
Install: DHCP, kickstart, YAIM, yum, Cfengine.
Monitor: Nagios (and Ganglia), Cfengine, weathermap, RAID card monitoring, custom scripts to parse log files, OS tools.
Each machine has a profile for each tool, which makes it difficult to keep changes consistent; with reduced manpower we can't afford such poor tracking.

Manchester Integration with RT
Use Nagios for monitoring nodes and services:
– both external tests (e.g. ssh to a port)
– and internal tests (via the node's NRPE daemon).
Use RT (“Request Tracker”) for tickets:
– includes Asset Tracker (AT), which has a powerful web interface and links to tickets.

Manchester Integration with RT (2)
Previously we maintained lists of hosts and group memberships in Nagios cfg files; now we generate these from the AT MySQL DB (a sketch of the generation step follows).
Obvious advantages: services are monitored only where Cfengine has installed them, and there is an automatic cross-link between AT and Nagios.
Future extensions: generate other lists the same way, e.g. DHCP, Cfengine, online and offline nodes.
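A minimal sketch of such a generator; the connection details, table and column names are invented for illustration and do not reflect Manchester's actual Asset Tracker schema.

```python
#!/usr/bin/env python
"""Sketch: generate Nagios host definitions from an Asset Tracker MySQL DB.

Illustrative only -- credentials, table and column names are assumptions.
"""
import MySQLdb  # python-mysqldb

conn = MySQLdb.connect(host="rt.example.ac.uk", user="nagiosgen",
                       passwd="secret", db="assettracker")
cur = conn.cursor()
cur.execute("SELECT name, ip, hostgroup FROM assets WHERE status = 'production'")

with open("/etc/nagios/conf.d/hosts_from_at.cfg", "w") as out:
    for name, ip, hostgroup in cur.fetchall():
        # One Nagios host definition per production asset.
        out.write("define host {\n"
                  "    use        generic-host\n"
                  "    host_name  %s\n"
                  "    address    %s\n"
                  "    hostgroups %s\n"
                  "}\n\n" % (name, ip, hostgroup))

cur.close()
conn.close()
```

Run from cron (or after an AT update) and followed by a Nagios configuration reload, a generator like this keeps the monitored host list in step with the asset database.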

Sheffield: efficiency
Two clusters; jobs requiring more network bandwidth are directed to WNs with the better backbone.
Storage: 90 TB (9 disk pools, SW RAID5, without RAID controllers); a software-RAID health check sketch follows this slide.
The absence of RAID controllers increases site efficiency:
– no common RAID-controller failures leading to unavailable disk servers and data loss.
2 TB Seagate Barracuda disks, fast and robust; 5x 16-bay units with 2 filesystems, 4x 24-bay units with 2 filesystems; a cold spare unit on standby in each server.
The simple cluster structure makes it easy to sustain high efficiency and to upgrade to new experiment requirements.
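Since the arrays are Linux software RAID, a simple health check can parse /proc/mdstat; a minimal sketch follows, with the alerting side left out and a deliberately crude degraded-array test.

```python
#!/usr/bin/env python
"""Sketch: warn about degraded Linux software-RAID (md) arrays."""
import re

degraded = []
with open("/proc/mdstat") as mdstat:
    for line in mdstat:
        # Status lines look like: "... [4/4] [UUUU]" -- an underscore
        # in the second bracket means a missing or failed member.
        match = re.search(r"\[\d+/\d+\]\s+\[([U_]+)\]", line)
        if match and "_" in match.group(1):
            degraded.append(line.strip())

if degraded:
    print("WARNING: degraded md array(s) detected:")
    for line in degraded:
        print("  " + line)
else:
    print("OK: all md arrays healthy")
```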

Sheffield: efficiency
Monitoring (checks are run on a regular basis, several times a day):
– Ganglia: general check of cluster health
– Regional Nagios, with warnings sent out by the regional Nagios
– Logwatch/syslog checks
– GridMap
– All the ATLAS monitoring tools:
– ATLAS SAM test page
– ATLAS (and LHCb) Site Status Board
– DDM dashboard
– PanDA monitor
– Detailed checks of ATLAS performance (checking the reason for particular failures of production and analysis jobs)

Sheffield: efficiency
Installation:
– PXE boot
– Red Hat kickstart install
– Bash post-install (includes YAIM)
– Many cron jobs used for monitoring
Cron jobs (a sketch of the service check follows):
– Monitor the temperature in the cluster room (in case of a temperature rise, only some of the worker nodes shut down automatically)
– Generate a web page of queues and jobs, for both grid and local work
– Check vital services (BDII, SRM) and restart them if they are down
– Generate a warning in case of a disk failure (in any server)
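The "check and restart vital services" cron job could be as simple as the following sketch; the init-script names are assumptions for illustration and the real job may differ.

```python
#!/usr/bin/env python
"""Sketch: cron job that restarts vital services if they have died."""
import subprocess

# Init-script names are assumptions for illustration (SL5-era services).
SERVICES = ["bdii", "srmv2.2"]

for svc in SERVICES:
    # "service <name> status" returns non-zero if the service is not running.
    if subprocess.call(["service", svc, "status"]) != 0:
        print("%s is down -- restarting" % svc)
        subprocess.call(["service", svc, "restart"])
```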