1 Large Computer Centres Tony Cass, Leader, Fabric Infrastructure & Operations Group, Information Technology Department, 14th January 2009 (…and medium)

2 Power and Power
Compute Power
– Single large system: boring
– Multiple small systems (CERN, Google, Microsoft…): multiple issues: exciting
Electrical Power
– Cooling & €€€
Characteristics

3 Box Management; What's Going On?; Power & Cooling Challenges

4 Box Management; What's Going On?; Power & Cooling Challenges

5 Box Management: Installation & Configuration, Monitoring, Workflow; What's Going On?; Power & Cooling Challenges

6 ELFms Vision
– Node Configuration Management and Node Management
– LEAF: Logistical Management
– Lemon: Performance & Exception Monitoring
A toolkit developed by CERN in collaboration with many HEP sites and as part of the European DataGrid project.

7 Quattor
[Architecture diagram: managed nodes run the Node Configuration Manager (NCM) with per-service components and the SW Package Manager (SPMA) installing RPMs/PKGs; SW server(s) host an HTTP SW Repository of RPMs; an install server drives the system installer (base OS) over HTTP/PXE via the Install Manager; a configuration server (CDB, with SQL and XML backends, accessed via SQL, SOAP, CLI, GUI and scripts) serves XML configuration profiles over HTTP.]
Used by 18 organisations besides CERN, including two distributed implementations with 5 and 18 sites.

8 Configuration Hierarchy
CERN CC: name_srv1:, time_srv1: ip-time-1
– lxbatch: cluster_name: lxbatch, master: lxmaster01, pkg_add(lsf5.1)
– lxplus: cluster_name: lxplus, pkg_add(lsf5.1)
  – lxplus001: eth0/ip:, pkg_add(lsf5.1_debug)
  – lxplus020: eth0/ip:
  – lxplus029
– disk_srv
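To make the layering idea concrete, here is a minimal sketch in Python (not Quattor's actual Pan template language) of how a node profile can be composed from site, cluster and node levels, with the more specific level winning. The names lxplus001, lsf 5.1 and 5.1_debug come from the slide; the other values are invented placeholders.

```python
# Minimal illustration of hierarchical configuration composition.
# NOT Quattor/Pan: Quattor compiles Pan templates into XML profiles;
# this only shows "more specific layer overrides the general one".

def merge(base: dict, override: dict) -> dict:
    """Recursively merge 'override' into 'base'; the override wins."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

site = {                         # "CERN CC" level: defaults for every node
    "name_srv1": "ip-name-1",    # placeholder, value stripped from the slide
    "time_srv1": "ip-time-1",
    "packages": {"ssh": "4.3"},  # hypothetical site-wide default package
}

lxplus_cluster = {               # cluster level
    "cluster_name": "lxplus",
    "packages": {"lsf": "5.1"},
}

lxplus001 = {                    # node level
    "eth0_ip": "192.0.2.1",      # placeholder address, stripped from the slide
    "packages": {"lsf": "5.1_debug"},   # node-specific override
}

profile = merge(merge(site, lxplus_cluster), lxplus001)
print(profile["packages"]["lsf"])   # -> 5.1_debug: node overrides cluster
```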

9 Scalable s/w distribution…
[Diagram: a DNS-load-balanced HTTP frontend in front of the backend ("Master", M/M'); L1 proxies form a server cluster (H, H, H, …); L2 proxies ("head" nodes) sit in each rack (Rack 1, Rack 2, …, Rack N). Distributed content: installation images, RPMs, configuration profiles.]
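As a rough sketch of the fan-out idea (not CERN's actual setup), a node could try its rack head-node proxy first, then fall back to the L1 proxy cluster and finally to the master. All hostnames and URLs below are made up for illustration.

```python
# Illustrative only: walking the proxy hierarchy from the slide
# (rack "head" node -> L1 proxy cluster -> master) to fetch an RPM or
# configuration profile over HTTP.
import urllib.request
import urllib.error

def fetch(path: str, rack: int) -> bytes:
    tiers = [
        f"http://head-rack{rack:02d}.example.org",  # L2: this rack's head node
        "http://swrep-l1.example.org",              # L1: load-balanced proxy cluster
        "http://swrep-master.example.org",          # backend ("Master")
    ]
    last_error = None
    for base in tiers:
        try:
            with urllib.request.urlopen(f"{base}/{path}", timeout=5) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as err:
            last_error = err        # tier unavailable: fall back to the next one
    raise RuntimeError(f"all tiers failed for {path}") from last_error

# e.g. fetch("RPMS/lsf-5.1.rpm", rack=7)
```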

10 … in practice!

11 Box Management: Installation & Configuration, Monitoring, Workflow; What's Going On?; Power & Cooling Challenges

12 Lemon
[Architecture diagram: each node runs a Monitoring Agent with Sensors; agents report to the Monitoring Repository over TCP/UDP; the repository has an SQL backend and a SOAP interface; correlation engines, the Lemon CLI and web displays (RRDTool / PHP / Apache over HTTP) serve users at workstations and web browsers.]

13 What is monitored
All the usual system parameters and more:
– system load, file system usage, network traffic, daemon count, software version…
– SMART monitoring for disks
– Oracle monitoring: number of logons, cursors, logical and physical I/O, user commits, index usage, parse statistics, …
– AFS client monitoring
– …
"Non-node" sensors allow integration of:
– high-level mass-storage and batch system details (queue lengths, file lifetime on disk, …)
– hardware reliability data
– information from the building management system (power demand, UPS status, temperature, …; see the power discussion later)
Full feedback is possible (although not implemented), e.g. system shutdown on power failure.
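To give a feel for the per-node data involved, here is a toy metric sampler covering a few of the parameters listed above (load, filesystem usage, process count). It is not the real Lemon sensor API or wire protocol, just a minimal sketch; a real agent would push such samples to the central repository.

```python
# Toy metric sampler: the kind of per-node data a monitoring sensor collects.
# NOT the actual Lemon sensor interface; Linux/Unix only.
import os
import shutil
import time

def sample_metrics() -> dict:
    load1, load5, load15 = os.getloadavg()      # system load averages
    disk = shutil.disk_usage("/")               # root filesystem usage
    return {
        "timestamp": int(time.time()),
        "load1": load1,
        "fs_root_used_pct": 100.0 * disk.used / disk.total,
        "proc_count": len([p for p in os.listdir("/proc") if p.isdigit()]),
    }

if __name__ == "__main__":
    # A real agent would ship samples to the repository over TCP/UDP;
    # here we just print one sample.
    print(sample_metrics())
```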

14 Monitoring displays

15 Dynamic cluster definition
As Lemon monitoring is integrated with Quattor, monitoring of clusters set up for special uses happens almost automatically.
– This has been invaluable over the past year as we have been stress-testing our infrastructure in preparation for LHC operations.
Lemon clusters can also be defined "on the fly"
– e.g. a cluster of "nodes running jobs for the ATLAS experiment"; note that the set of nodes in this cluster changes over time.
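A minimal sketch of the "cluster defined on the fly" idea: a cluster is just a predicate evaluated over current node metadata, so its membership changes as nodes pick up or finish work. The node records and field names below are invented; only the ATLAS example comes from the slide.

```python
# Sketch: a dynamic cluster as a predicate over live node metadata.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    running_vo: Optional[str]   # experiment whose job is currently running, if any

nodes = [
    Node("lxb0001", "atlas"),
    Node("lxb0002", "cms"),
    Node("lxb0003", "atlas"),
    Node("lxb0004", None),
]

def dynamic_cluster(predicate):
    """Return the current members of a cluster defined by a predicate."""
    return [n.name for n in nodes if predicate(n)]

# "Nodes running jobs for the ATLAS experiment" right now:
print(dynamic_cluster(lambda n: n.running_vo == "atlas"))   # ['lxb0001', 'lxb0003']
```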

16 Box Management: Installation & Configuration, Monitoring, Workflow; What's Going On?; Power & Cooling Challenges

17 LHC Era Automated Fabric
LEAF is a collection of workflows for high-level node hardware and state management, on top of Quattor and Lemon:
HMS (Hardware Management System):
– Tracks systems through all physical steps in their lifecycle, e.g. installation, moves, vendor calls, retirement
– Automatically requests installs, retirements etc. from technicians
– GUI to locate equipment physically
– The HMS implementation is CERN-specific, but the concepts and design should be generic
SMS (State Management System):
– Automated handling (and tracking) of high-level configuration steps, e.g. reconfigure and reboot all LXPLUS nodes for a new kernel and/or physical move; drain and reconfigure nodes for diagnosis / repair operations
– Issues all necessary (re)configuration commands via Quattor
– Extensible framework: plug-ins for site-specific operations are possible
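A minimal sketch of the SMS idea: each node has a high-level state, and moving between states triggers the reconfiguration steps. The state names (production, standby, draining) come from these slides; the transition table and action hooks are invented for illustration.

```python
# Sketch of state management: allowed transitions plus a hook where the real
# system would issue (re)configuration commands via Quattor and adjust alarms.
ALLOWED = {
    "production": {"draining", "standby"},
    "draining":   {"standby"},
    "standby":    {"production"},
}

def set_state(node: str, current: str, target: str) -> str:
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"{node}: cannot go from {current} to {target}")
    # Real SMS would reconfigure the node via Quattor and mask/unmask alarms here.
    print(f"{node}: {current} -> {target} (reconfigure, adjust alarms)")
    return target

state = "production"
state = set_state("lxfsrk123", state, "draining")   # stop accepting new work
state = set_state("lxfsrk123", state, "standby")    # ready for repair or move
```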

18 LEAF workflow example
[Workflow diagram: a node move coordinated between Operations, HMS, SMS, the technicians, the node itself, the network DB and the Quattor CDB.]
1. Import
2. Set to standby
3. Update
4. Refresh
5. Take out of production: close queues and drain jobs, disable alarms
6. Shutdown work order
7. Request move
8. Update
9. Update
10. Install work order
11. Set to production
12. Update
13. Refresh
14. Put into production

19 Integration in Action
Simple:
– Operator alarms masked according to system state
Complex:
– Disk and RAID failures detected on disk storage nodes lead automatically to a reconfiguration of the mass storage system:
[Diagram: the Lemon agent on a disk server raises a "RAID degraded" alarm; alarm monitoring and analysis ask SMS to set the node to Standby, and the mass storage system sets the disk server to Draining: no new connections allowed; existing data transfers continue.]
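The following sketch illustrates that chain in miniature: a "RAID degraded" alarm on a disk server leads to the server being drained. Class and function names are invented; in the real system the chain runs Lemon agent, alarm analysis, SMS, mass storage system.

```python
# Sketch of alarm-driven reconfiguration: RAID degraded -> drain the disk server.
class MassStorageSystem:
    def __init__(self):
        self.server_state = {}

    def set_state(self, server: str, state: str):
        self.server_state[server] = state
        print(f"MSS: {server} set to {state}")

def handle_alarm(mss: MassStorageSystem, server: str, alarm: str):
    """Tiny stand-in for the alarm-analysis step."""
    if alarm == "RAID degraded":
        # Draining: no new connections allowed; existing transfers continue.
        mss.set_state(server, "Draining")

mss = MassStorageSystem()
handle_alarm(mss, "lxfsrk123", "RAID degraded")
```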

20 Box Management: Installation & Configuration, Monitoring, Workflow; What's Going On?; Power & Cooling Challenges

21 A Complex Overall Service
System managers understand systems (we hope!).
– But do they understand the service?
– Do the users?

22 User Status CERN

23 SLS Architecture

24 SLS Service Hierarchy

25 SLS Service Hierarchy

26 Box Management: Installation & Configuration, Monitoring, Workflow; What's Going On?; Power & Cooling Challenges

27 Power & Cooling
Megawatts in: need continuity; redundancy where?
Megawatts out: air vs water
Green Computing: run high… but not too high; Containers and Clouds
You can't control what you don't measure.

28 Thank You! Thanks also to Olof Bärring, Chuck Boeheim, German Cancio Melia, James Casey, James Gillies, Giuseppe Lo Presti, Gavin McCance, Sebastien Ponce, Les Robertson and Wolfgang von Rüden