Northgrid Alessandra Forti M. Doidge, S. Jones, A. McNab, E. Korolkova Gridpp26 Brighton 30 April 2011.

Efficiency The ratio of the effective or useful output to the total input in any system.

Pledges

Site        Job slots  HEPSPEC06  TB     Pledged HS06  Pledged TB
Lancaster   2,368      28,627     1,116  3,395         515
Liverpool   572        8,318      406    2,996         554
Manchester  2,770      22,264     687    7,284         224
Sheffield   400        4,800      297    1,198         252

CPU efficiency

Usage NorthGrid Normalised CPU time (HEPSPEC06) by SITE and VO. TOP10 VOs (and Other VOS). September 2010 - February 2011.

Successful jobs rate
– ANALY_MANC (192046/52963)
– ANALY_LANCS (155233/61368)
– ANALY_SHEF (161043/25537)
– ANALY_LIV (146994/21563)
– UKI-NORTHGRID-MAN-HEP (494559/35496)
– UKI-NORTHGRID-LANCS-HEP (252864/33889)
– UKI-NORTHGRID-LIV-HEP (227395/15185)
– UKI-NORTHGRID-SHEF-HEP (140804/8525)
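The slide does not say what the two numbers in each pair mean. As a rough sketch, assuming each pair is (successful jobs / failed jobs), the success rate per queue could be worked out as below. The pairing is an assumption, and `success_rate` is a hypothetical helper, not something from the talk.

```python
# Assumption: each pair is (successful jobs, failed jobs).
# The slide itself does not label the two numbers.
JOB_COUNTS = {
    "ANALY_MANC": (192046, 52963),
    "ANALY_LANCS": (155233, 61368),
    "ANALY_SHEF": (161043, 25537),
    "ANALY_LIV": (146994, 21563),
    "UKI-NORTHGRID-MAN-HEP": (494559, 35496),
    "UKI-NORTHGRID-LANCS-HEP": (252864, 33889),
    "UKI-NORTHGRID-LIV-HEP": (227395, 15185),
    "UKI-NORTHGRID-SHEF-HEP": (140804, 8525),
}


def success_rate(succeeded: int, failed: int) -> float:
    """Fraction of jobs that succeeded, out of all finished jobs."""
    total = succeeded + failed
    if total == 0:
        raise ValueError("no jobs recorded")
    return succeeded / total


if __name__ == "__main__":
    # Print one line per queue, rate shown as a percentage.
    for queue, (ok, bad) in JOB_COUNTS.items():
        print(f"{queue:26s} {success_rate(ok, bad):6.1%}")
```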

Lancaster – keeping things smooth
Our main strategy for efficient running at Lancaster involves comprehensive monitoring and configuration management. Effective monitoring allows us to jump on incidents and spot problems before they bite us on the backside, as well as enabling us to better understand, and therefore tune, our systems. Cfengine on our nodes, and kusu on the HEC machines, enable us to pre-empt misconfiguration issues on individual nodes, quickly rectify errors and ensure swift, homogeneous rollout of configs and changes. Whatever the monitoring, e-mail alerts keep us in the know. Among the many tools and tactics we use to keep on top of things are: Syslog (with Logwatch mails), Ganglia, Nagios (with e-mail alerts), Atlas Panda Monitoring, Steve’s Pages, on-board monitoring and e-mail alerts for our Areca raid arrays, Cacti for our network (and the HEC nodes), plus a whole bunch of hacky scripts and bash one-liners!

Lancaster – TODO list
We’ll probably never stop finding things to polish, but some things at the top of the wishlist (in that we wish we could get time to implement them!) are:
– A site dashboard (a huge, beautiful site dashboard)
– More ganglia metrics! And more in-depth nagios tests, particularly for batch system monitoring and raid monitoring (recent storage purchases have 3ware and Adaptec raids).
– Intelligent syslog monitoring as the number of nodes at our site grows.
– Increased network and job monitoring: the more detailed the picture we have of what’s going on, the better we can tune things.
Other ideas for increasing our efficiency include SMS alerts, internal ticket management and introducing a more formalised on-call system.

Liverpool hardware measures
Planning, design and testing
– Storage and node specifications
– Network design, e.g.
  – Minimise contention
  – Bonding
– Extensive HW and SW soak testing, experimentation, tuning
– Adjustments and refinement
– UPS coverage

Liverpool Building and monitoring
Builds and maintenance
– dhcp, kickstart, yum, puppet, yaim, standards
Monitoring
– nagios (local and gridpp), ganglia, cacti/weathermap, log monitoring, tickets and mail lists
– testnodes: local software that checks worker nodes to isolate potential “blackhole” conditions
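The slides do not show what testnodes actually checks. As a minimal sketch of the general idea (a worker node that fails jobs quickly can drain the queue, so test it before it takes work), a node check might look like this. The two checks, the 1 GiB threshold and all function names are hypothetical choices, not Liverpool's implementation.

```python
import shutil
import tempfile

# Hypothetical threshold: treat less than 1 GiB of free scratch space as unhealthy.
MIN_FREE_BYTES = 1 * 1024**3


def check_free_space(path, min_free):
    """Return (ok, message) for free space on the filesystem holding path."""
    try:
        free = shutil.disk_usage(path).free
    except OSError as exc:
        return False, f"{path}: cannot stat filesystem ({exc})"
    return free >= min_free, f"{path}: {free} bytes free, need {min_free}"


def check_writable(path):
    """Return (ok, message) after trying to create a temporary file in path."""
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        return False, f"{path}: not writable ({exc})"
    return True, f"{path}: writable"


def node_is_healthy(scratch_dir, min_free=MIN_FREE_BYTES):
    """Run all checks; return (healthy, list of failure messages)."""
    results = [check_free_space(scratch_dir, min_free), check_writable(scratch_dir)]
    failures = [message for ok, message in results if not ok]
    return not failures, failures
```

A node that fails such a check could then be taken offline in the batch system before it accepts more jobs; how that is done depends on the batch system and is not covered here.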

Manchester install & config & monitor
We have to look after ~550 machines
Install
– DHCP, Kickstart, YAIM, Yum, Cfengine
Monitor
– Nagios (ganglia), cfengine, weathermap, raid card monitoring, custom scripts to parse log files, OS tools
Each machine has a profile for each tool
– Difficult to keep changes consistent
– With reduced manpower we can't afford this poor tracking

Manchester Integration with RT
Use Nagios for monitoring nodes and services
– Both external tests (e.g. ssh to port)
– And internal tests (via node's nrpe daemon)
Use RT (“Request Tracker”) for tickets
– Includes Asset Tracker (AT), which has a web interface and links to tickets
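An external test of the "ssh to port" kind can be sketched as a small Nagios-style check that only tries to open a TCP connection. The exit codes (0 for OK, 2 for CRITICAL) follow the usual Nagios plugin convention; the function name, the default port 22 and the timeout are assumptions for illustration, and Nagios ships its own plugins for this purpose.

```python
import socket

# Conventional Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def check_tcp_port(host, port=22, timeout=5.0):
    """Try to open a TCP connection; return (status, message) Nagios-style."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"TCP OK - {host}:{port} is accepting connections"
    except OSError as exc:
        # Covers refused connections, timeouts and name resolution failures.
        return CRITICAL, f"TCP CRITICAL - {host}:{port}: {exc}"
```

Used as a plugin, a wrapper script would print the message and exit with the status, so that Nagios can pick up both.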

Manchester Integration with RT (2)
Previously maintained lists of hosts and group membership in Nagios cfg files
– Now generate these from the AT MySQL DB
Obvious advantages in monitoring services only where cfengine has installed them
Automatic cross-link between AT and nagios
Future extensions to other lists such as dhcp, cfengine, online and offline nodes
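Generating the Nagios host and hostgroup lists from an asset database can be sketched roughly as follows. The AT schema is not shown in the slides, so the sketch takes plain (hostname, address, group) tuples as a stand-in for the rows a MySQL query would return. The `generic-host` template name and the function name are also assumptions.

```python
from collections import defaultdict


def render_nagios_config(assets):
    """Build Nagios host and hostgroup definitions.

    assets: iterable of (hostname, address, group) tuples, standing in for
    rows fetched from the asset database (the real schema is not known here).
    """
    groups = defaultdict(list)
    blocks = []
    for hostname, address, group in sorted(assets):
        groups[group].append(hostname)
        blocks.append(
            "define host {\n"
            "    use        generic-host\n"  # assumed template name
            f"    host_name  {hostname}\n"
            f"    address    {address}\n"
            "}\n"
        )
    for group, members in sorted(groups.items()):
        blocks.append(
            "define hostgroup {\n"
            f"    hostgroup_name  {group}\n"
            f"    members         {','.join(members)}\n"
            "}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Made-up example rows, for illustration only.
    rows = [("wn02", "10.0.0.2", "workers"), ("wn01", "10.0.0.1", "workers")]
    print(render_nagios_config(rows))
```

Writing the result to a file that the main Nagios configuration includes, then reloading Nagios, keeps the monitoring lists in step with the database without hand-editing.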

Sheffield: efficiency
2 clusters
– Jobs requiring better network bandwidth are directed to WNs with a better backbone
Storage
– 90 TB (9 disk pools, software RAID5, no raid controllers)
– Absence of raid controllers increases site efficiency: no common failures related to RAID controllers (unavailable disk servers and data loss)
– 2 TB Seagate Barracuda disks, fast and robust
– 5 x 16-bay units with 2 fs, 4 x 24-bay units with 2 fs
– Cold spare unit on standby in each server
Simple cluster structure makes it easy to maintain high efficiency and to upgrade to meet new requirements of the experiments

Sheffield: efficiency
Monitoring (checks are made regularly, several times a day)
– Ganglia: general check of the cluster health
– Regional nagios, with warnings sent via email
– Logwatch/syslog check
– GRIDMAP
– All ATLAS monitoring tools:
  – ATLAS SAM test page
  – ATLAS (and LHCb) Site Status Board
  – DDM dashboard
  – PANDA monitor
– Detailed check of ATLAS performance (checking the reason for a particular failure of production and analysis jobs)

Sheffield: efficiency
Installation
– Use PXE boot
– Redhat kickstart install
– Using many cron jobs for monitoring
– Bash post-install (includes yaim)
Cron jobs
– Monitor the temperature in the cluster room (if the temperature rises, only some of the worker nodes shut down automatically)
– Generate a web page of queues and jobs for both grid and local
– Check and restart vital services if they are down (bdii, srm)
– Generate a warning email in case of disk failure (in any server)
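The "check and restart vital services" cron job could be sketched along these lines. The slides mention bash and cron but show no scripts, so this is only an illustration: it assumes the services are controlled through the SysV-style `service` command and that the init script names match the service names on the slide (`bdii`, `srm`), neither of which is confirmed by the source.

```python
import subprocess


def is_running(name):
    """Assumption: `service <name> status` exits with 0 when the service is up."""
    result = subprocess.run(
        ["service", name, "status"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def restart(name):
    """Restart the service; raises CalledProcessError if the restart fails."""
    subprocess.run(["service", name, "restart"], check=True)


def ensure_services(services, is_running=is_running, restart=restart):
    """Restart every service that is down; return the names that were restarted."""
    restarted = []
    for name in services:
        if not is_running(name):
            restart(name)
            restarted.append(name)
    return restarted
```

Run periodically from cron, `ensure_services(["bdii", "srm"])` would return the list of services it had to restart, which could also be mailed out as a warning. The check and restart functions are passed in as parameters so the logic can be tried out without touching real services.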
