Managing Mature White Box Clusters at CERN, LCW: Practical Experience. Tim Smith, CERN/IT. 2002/10/21.

Contents
- Scale
- Behind the Scenes
  - Hardware: Complexity, Dynamics, Practical Steps
  - Software: Legacy, Projects

Scale
- ~1000 boxes
- 140k jobs/week
- 2400 interactive users
- 50 parallel reinstalls
- Parallel command engines (see the sketch below)
- 350 kSI2000
- ~7/38 in top 500 clusters
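The slide mentions parallel command engines without detail. Below is a minimal sketch of one, assuming ssh access to the nodes; the lxbatchNNN host names and the concurrency limit are illustrative, not CERN's actual tooling.

```python
"""Minimal sketch of a parallel command engine: fan a command out
to many nodes over ssh. Host names and worker count are illustrative."""
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOSTS = [f"lxbatch{i:03d}" for i in range(1, 51)]  # hypothetical node names

def run(host, cmd="uptime"):
    # BatchMode avoids hanging on a password prompt for a sick node.
    r = subprocess.run(["ssh", "-o", "BatchMode=yes", host, cmd],
                       capture_output=True, text=True)
    return host, r.returncode, r.stdout.strip()

with ThreadPoolExecutor(max_workers=20) as pool:
    for host, rc, out in pool.map(run, HOSTS):
        print(f"{host}: rc={rc} {out}")
```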

Complexity
- Hardware: 12 hardware acquisitions; 38 combinations of CPU/memory/disk
- Software: 4 versions of RedHat OS; 37 clusters (independent configurations)
- User communities: 30 experiments/user communities + public; 12,000 users

Dynamics
- Hardware drift, e.g. missing after reboot: CPUs, memory, disks; wrong Ethernet speed
- Volatile configurations, e.g. the passwd file changes every couple of hours
- Hardware failures: up to 4% of the farm "on holiday"; replacements generate new configurations
- Needed: monitoring and inventory tracking (a drift check is sketched below)
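No tool is named on the slide; a hardware-drift check of the kind it calls for could be sketched as follows. The inventory file path and record layout are hypothetical, and only the Linux /proc interfaces are real.

```python
"""Sketch of a hardware-drift check: compare detected CPUs and memory
against a per-node inventory record. Paths and field names are hypothetical."""
import json
import re

INVENTORY = "/etc/inventory.json"  # hypothetical inventory snapshot

def detected_cpus():
    # Count processor entries in /proc/cpuinfo (Linux-specific).
    with open("/proc/cpuinfo") as f:
        return sum(1 for line in f if line.startswith("processor"))

def detected_mem_kb():
    # Parse MemTotal from /proc/meminfo.
    with open("/proc/meminfo") as f:
        for line in f:
            m = re.match(r"MemTotal:\s+(\d+) kB", line)
            if m:
                return int(m.group(1))
    return 0

def main():
    with open(INVENTORY) as f:
        expected = json.load(f)  # e.g. {"cpus": 2, "mem_kb": 1048576}
    if detected_cpus() != expected["cpus"]:
        print("DRIFT: CPU count changed")
    # Small tolerance: the kernel reserves some memory for itself.
    if detected_mem_kb() < 0.95 * expected["mem_kb"]:
        print("DRIFT: memory missing")

if __name__ == "__main__":
    main()
```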

Vendor Call Analysis [chart]: 1 vendor call every 2 days!

Acquisition Cycles [chart]

Addressing the Challenge
- Interactive: refresh from uniform batch machines
- Batch: one large production facility
  - Shares (and priorities; see the sketch below)
  - Selectable resources
  - Flexibility
  - Redundancy to reduce sensitivity to failures
  - Remedy hardware workflows
- But intractable:
  - Scatter in job return times
  - Assumed but undeclared job requirements
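The slide names shares and priorities without saying how they are computed. The sketch below shows the usual fair-share idea (priority falls as a group's recent usage exceeds its allocated share); the group names, shares, and usage figures are illustrative, not the actual CERN scheduler configuration.

```python
"""Fair-share sketch: dynamic priority drops as a group's recent
usage exceeds its allocated share. All numbers are illustrative."""

shares = {"alice": 30, "atlas": 40, "public": 30}       # allocated shares (%)
usage  = {"alice": 55.0, "atlas": 20.0, "public": 25.0}  # recent CPU usage (%)

def priority(group):
    # Groups below their share get boosted; groups above it get demoted.
    return shares[group] / max(usage[group], 0.001)

for g in sorted(shares, key=priority, reverse=True):
    print(f"{g}: priority {priority(g):.2f}")
```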

SW: Legacy from Maturity [diagram]
- Layers: OS, Applications, Mgmt Tools
- Tools: KickStart, SUE, ASIS, BIS
- Spread over locations: /home, /usr/cute, /usr/local, /var, /opt

SW: Legacy from Maturity (cont.) [diagram]
- Layers: OS, Applications, Mgmt Tools
- Tools: KickStart, SUE, ASIS, BIS, BIS DB, Oracle, AFS
- Local acrontabs and crontabs
- Locations: /home, /usr/cute, /usr/local, /var, /opt
- Multiple owners, methods, formats; multiple locations

A Clean Restart
- Node Configuration System
- Monitoring System
- Installation System
- Fault Mgmt System

A Clean Restart: SnapShot [diagram]
- Node Configuration System holds per-node HW, SW, Function, and State
- Installation System: Base Installation via PXE + Kickstart; Software Update via RPM + API (sketched below)
- Plus the Monitoring System and Fault Mgmt System
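To make the flow concrete, here is a sketch of config-driven base installation: a per-node Kickstart file rendered from a record in the Node Configuration System. The record layout and template are hypothetical; only the PXE/Kickstart/RPM mechanism comes from the slide.

```python
"""Sketch: render a per-node Kickstart file from a node record.
The NODE record and TEMPLATE are hypothetical illustrations."""

NODE = {  # would come from the Node Configuration System
    "name": "lxbatch042",
    "os": "RedHat 7.3",
    "packages": ["openssh-server", "lsf", "afs-client"],
}

TEMPLATE = """\
# Kickstart for {name} ({os}), served to the node over PXE
lang en_US
rootpw --iscrypted $1$placeholder$
%packages
{pkgs}
"""

def render(node):
    return TEMPLATE.format(name=node["name"], os=node["os"],
                           pkgs="\n".join(node["packages"]))

if __name__ == "__main__":
    print(render(NODE))
```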

State and Configuration Mgt
- Clean initial state: Linux Standard Base, RPM
- Externally specified: configuration system, local cache
- Versioned + repository: CVS
- No inherent drift (checked as sketched below):
  - No external crontabs
  - No unregistered updates triggered by application providers
- Update verification nodes + release cycle
- Procedures and workflows: transactions, notifications
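A "no inherent drift" guarantee implies comparing the installed software against the externally specified state. A minimal sketch, assuming the configuration system leaves a package list in a local cache (the path is hypothetical; the rpm query is real):

```python
"""Sketch of a software-state check: compare installed RPMs with the
versioned package list from the configuration system's local cache."""
import subprocess

DESIRED = "/var/cache/config/packages.list"  # hypothetical local cache

def installed():
    # rpm -qa with a query format prints one installed package name per line.
    out = subprocess.run(["rpm", "-qa", "--qf", "%{NAME}\n"],
                         capture_output=True, text=True, check=True)
    return set(out.stdout.split())

def desired():
    with open(DESIRED) as f:
        return {line.strip() for line in f if line.strip()}

def main():
    have, want = installed(), desired()
    for pkg in sorted(want - have):
        print("MISSING:", pkg)
    for pkg in sorted(have - want):
        print("UNREGISTERED:", pkg)

if __name__ == "__main__":
    main()
```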

Conclusions
- Maturity brings:
  - Degradation of the initial state definition (HW + SW)
  - Accumulation of innocuous "temporary" procedures
- Scale brings:
  - Marginal activities become full time
  - Many hands on the systems
- Combat these with strong management automation