Managing Large Linux Farms at CERN OpenLab: Fabric Management Workshop Tim Smith CERN/IT.

Slides:



Advertisements
Similar presentations
CERN – BT – 01/07/ Cern Fabric Management -Hardware and State Bill Tomlin GridPP 7 th Collaboration Meeting June/July 2003.
Advertisements

Fabric Management at CERN BT July 16 th 2002 CERN.ch.
26/05/2004HEPIX, Edinburgh, May Lemon Web Monitoring Miroslav Šiket CERN IT/FIO
Production RedHat7.2 Services: Retooling during the upgrade FIO-LCG contribution Tim Smith.
Enterprise Web Architecture and Performance Shennon Shen & Scott Carey --- Plumtree Software Inc.
“Managing a farm without user jobs would be easier” Clusters and Users at CERN Tim Smith CERN/IT.
MCITP Guide to Microsoft Windows Server 2008 Server Administration (Exam #70-646) Chapter 11 Windows Server 2008 Virtualization.
Cacti Workshop Tony Roman Agenda What is Cacti? The Origins of Cacti Large Installation Considerations Automation The Current.
ASIS et le projet EU DataGrid (EDG) Germán Cancio IT/FIO.
Copyright 2009 FUJITSU TECHNOLOGY SOLUTIONS PRIMERGY Servers and Windows Server® 2008 R2 Benefit from an efficient, high performance and flexible platform.
The CERN Computer Centres October 14 th 2005 CERN.ch.
Current Status of Fabric Management at CERN, 26/7/2004 Current Status of Fabric Management at CERN CHEP 2004 Interlaken, 27/9/2004 CERN IT/FIO: G. Cancio,
Automating Linux Installations at CERN G. Cancio, L. Cons, P. Defert, M. Olive, I. Reguero, C. Rossi IT/PDP, CERN presented by G. Cancio.
CERN DNS Load Balancing Vladimír Bahyl IT-FIO. 26 November 2007WLCG Service Reliability Workshop2 Outline  Problem description and possible solutions.
Interfacing a Managed Local Fabric to the GRID LCG Review Tim Smith IT/FIO.
WP4-install task report WP4 workshop Barcelona project conference 5/03 German Cancio.
Managing Mature White Box Clusters at CERN LCW: Practical Experience Tim Smith CERN/IT.
DataGrid is a project funded by the European Commission under contract IST IT Post-C5, Managing Computer Centre machines with Quattor.
EDG LCFGng: concepts Fabric Management Tutorial - n° 2 LCFG (Local ConFiGuration system)  LCFG is originally developed by the.
1 Linux in the Computer Center at CERN Zeuthen Thorsten Kleinwort CERN-IT.
Performance Concepts Mark A. Magumba. Introduction Research done on 1058 correspondents in 2006 found that 75% OF them would not return to a website that.
Large Computer Centres Tony Cass Leader, Fabric Infrastructure & Operations Group Information Technology Department 14 th January and medium.
EDG WP4: installation task LSCCW/HEPiX hands-on, NIKHEF 5/03 German Cancio CERN IT/FIO
19-May-2003 Solaris service: Status and plans at CERN Ignacio Reguero IT / Product Support / Unix Infrastructure Presented by Manuel Guijarro.
© Copyright 2013 TONE SOFTWARE CORPORATION New Client Turn-Up Overview.
Olof Bärring – WP4 summary- 4/9/ n° 1 Partner Logo WP4 report Plans for testbed 2
May PEM status report. O.Bärring 1 PEM status report Large-Scale Cluster Computing Workshop FNAL, May Olof Bärring, CERN.
Computer Systems Lab The University of Wisconsin - Madison Department of Computer Sciences Linux Clusters David Thompson
PROOF Cluster Management in ALICE Jan Fiete Grosse-Oetringhaus, CERN PH/ALICE CAF / PROOF Workshop,
1 The new Fabric Management Tools in Production at CERN Thorsten Kleinwort for CERN IT/FIO HEPiX Autumn 2003 Triumf Vancouver Monday, October 20, 2003.
German Cancio – WP4 developments Partner Logo System Management: Node Configuration & Software Package Management
Large Farm 'Real Life Problems' and their Solutions Thorsten Kleinwort CERN IT/FIO HEPiX II/2004 BNL.
Deployment work at CERN: installation and configuration tasks WP4 workshop Barcelona project conference 5/03 German Cancio CERN IT/FIO.
20-May-2003HEPiX Amsterdam EDG Fabric Management on Solaris G. Cancio Melia, L. Cons, Ph. Defert, I. Reguero, J. Pelegrin, P. Poznanski, C. Ungil Presented.
G. Cancio, L. Cons, Ph. Defert - n°1 October 2002 Software Packages Management System for the EU DataGrid G. Cancio Melia, L. Cons, Ph. Defert. CERN/IT.
An Agile Service Deployment Framework and its Application Quattor System Management Tool and HyperV Virtualisation applied to CASTOR Hierarchical Storage.
Lemon Monitoring Miroslav Siket, German Cancio, David Front, Maciej Stepniewski CERN-IT/FIO-FS LCG Operations Workshop Bologna, May 2005.
Installing, running, and maintaining large Linux Clusters at CERN Thorsten Kleinwort CERN-IT/FIO CHEP
Software Management with Quattor German Cancio CERN/IT.
Olof Bärring – WP4 summary- 4/9/ n° 1 Partner Logo WP4 report Plans for testbed 2 [Including slides prepared by Lex Holt.]
Managing the CERN LHC Tier0/Tier1 centre Status and Plans March 27 th 2003 CERN.ch.
1 Putchong Uthayopas, Thara Angsakul, Jullawadee Maneesilp Parallel Research Group, Computer and Network System Research Laboratory Department of Computer.
Cluster Configuration Update Including LSF Status Thorsten Kleinwort for CERN IT/PDP-IS HEPiX I/2001 LAL Orsay Tuesday, December 08, 2015.
CERN DNS Load Balancing VladimírBahylIT-FIO NicholasGarfieldIT-CS.
Fabric Management with ELFms BARC-CERN collaboration meeting B.A.R.C. Mumbai 28/10/05 Presented by G. Cancio – CERN/IT.
CERN - IT Department CH-1211 Genève 23 Switzerland t High Availability Databases based on Oracle 10g RAC on Linux WLCG Tier2 Tutorials, CERN,
Maite Barroso - 10/05/01 - n° 1 WP4 PM9 Deliverable Presentation: Interim Installation System Configuration Management Prototype
David Foster LCG Project 12-March-02 Fabric Automation The Challenge of LHC Scale Fabrics LHC Computing Grid Workshop David Foster 12 th March 2002.
Linux Configuration using April 12 th 2010 L. Brarda / CERN (some slides & pictures taken from the Quattor website) ‏
CERN 19/06/2002 Kickstart file generator Andrea Chierici (INFN-CNAF) Enrico Ferro (INFN-LNL) Marco Serra (INFN-Roma)
Computing Facilities CERN IT Department CH-1211 Geneva 23 Switzerland t CF CC Monitoring I.Fedorko on behalf of CF/ASI 18/02/2011 Overview.
10/18/01Linux Reconstruction Farms at Fermilab 1 Steven C. Timm--Fermilab.
SMOOTHWALL FIREWALL By Nitheish Kumarr. INTRODUCTION  Smooth wall Express is a Linux based firewall produced by the Smooth wall Open Source Project Team.
Overview of cluster management tools Marco Mambelli – August OSG Summer Workshop TTU - Lubbock, TX THE UNIVERSITY OF CHICAGO.
Cofax Scalability Document Version Scaling Cofax in General The scalability of Cofax is directly related to the system software, hardware and network.
Fabric Management: Progress and Plans PEB Tim Smith IT/FIO.
Bernd Panzer-Steindel CERN/IT/ADC1 Medium Term Issues for the Data Challenges.
Quattor installation and use feedback from CNAF/T1 LCG Operation Workshop 25 may 2005 Andrea Chierici – INFN CNAF
Quattor: An administration toolkit for optimizing resources Marco Emilio Poleggi - CERN/INFN-CNAF German Cancio - CERN
Monitoring and Fault Tolerance
Status of Fabric Management at CERN
Germán Cancio CERN IT/FIO LCG workshop, 24/3/04
WP4-install status update
Status and plans of central CERN Linux facilities
German Cancio CERN IT .quattro architecture German Cancio CERN IT.
QNX Technology Overview
Web Application Server 2001/3/27 Kang, Seungwoo. Web Application Server A class of middleware Speeding application development Strategic platform for.
SUSE Linux Enterprise Desktop Administration
The Problem ~6,000 PCs Another ~1,000 boxes But! Affected by:
The EU DataGrid Fabric Management Services
Presentation transcript:

Managing Large Linux Farms at CERN OpenLab: Fabric Management Workshop Tim Smith CERN/IT

2003/07/08Farm Management: Contents  Our Challenge (non-solutions)  Scale  Complexity  Dynamics  Our Solution  Architecture  Current Status

2003/07/08Farm Management: Simple Question of Scale? HW SW: Base Install SW: Production-ising SW: Applications LSF, Web, CVS Infrastructure connections: alarms, monitoring, configuration, passwd dist… RedHat 7.3 Complexity!

2003/07/08Farm Management: Scale is important! [~]  1000 boxes  800,000 Si2k  140,000 jobs/wk  12,000 uids  30 user communities  150 simul. indep. applics  Public network  20 root priv

2003/07/08Farm Management: Complexity: Static Configuration 6.1 SW: Production-ising No Users Interactive HW SW: Base Install SW: Production-ising SW: Functionality SW: Applications 7.3RHES Interactive Users BatchBuildServers 13 acquisitions 10.x

2003/07/08Farm Management: Complexity: Dynamics  Volatile configurations  Fastpasswd files (every couple of hrs)  MedAccess lists  MedSW security updates  SlowOS upgrades  Proliferation  Hardware Failures  Asymptotic configuration changes  Node quiescence  Hardware down / at vendor  User community constraints

2003/07/08Farm Management: A Clean Restart Node Configuration System Monitoring System Installation System Fault Mgmt System Desired StateActual State

2003/07/08Farm Management: Framework Considerations  Lightweight/modular/coupling/protocols/interfaces…  Decoupling  Local config files  Local programs do all work  Avoid inherent drift  No external crontabs or remote mgmt scripts  No unregistered application provider triggered updates  No reliance on mgmt tools for parallel cmd  Reproducible in time and space  Staggered replacement of existing tools  Scalable  Load balanced servers  Time smeared transactions  Pre-deployment caches  Head-nodes ?

2003/07/08Farm Management: Config/SW Considerations  Hierarchical configuration specification  Graph rather than tree structure  Common properties set only once  Node profiles  Complete specification in one XML file  Local cache  Transactions / Notifications  Externally specified, Versioned: CVS repos.  Clean Initial State  Linux Standards Base, RPM  One tool to manage all SW: SPMA  System and application  Update verification nodes + release cycle  Procedures and Workflows

2003/07/08Farm Management: Hardware variety hardware_diskserv_elonex_1100 hardware_elonex_500 hardware_elonex_600 hardware_elonex_800 hardware_elonex_800_mem1024mb hardware_elonex_800_mem128mb hardware_seil_2002 hardware_seil_2002_interactiv hardware_seil_2003 hardware_siemens_550 hardware_techas_600 hardware_techas_600_2 hardware_techas_600_mem512mb hardware_techas_800 template hardware_diskserver_elonex_1100; "/hardware/cpus" = list(create("hardware_cpu_GenuineIntel_Pentium_III_1100"), create("hardware_cpu_GenuineIntel_Pentium_III_1100")); "/hardware/harddisks" = nlist("sda", create("pro_hardware_harddisk_WDC_20")); "/hardware/ram" = list(create("hardware_ram_1024")); "/hardware/cards/nic" = list(create("hardware_card_nic_Broadcom_BCM5701")); structure template hardware_cpu_GenuineIntel_Pentium_III_1100; "vendor" = "GenuineIntel"; "model" = "Intel(R) Pentium(R) III CPU family 1133MHz"; "speed" = 1100; structure template hardware_card_nic_Broadcom_BCM5701; "manufacturer" = "Broadcom Corporation NetXtreme BCM5701 Gigabit Ethernet"; "name" = "3Com Corporation 3C996B-T 1000BaseTX"; "media" = "GigaBit Ethernet"; "bus" = "pci";

2003/07/08Farm Management:  CERN RedHat Linux  ~ 2400 packages declared in CDB  RedHat Enterprise Server 2.1  ~ 1300 packages declared in CDB  Diff subsets are selected for diff services  Complete control over installed software Software variety software_diskserver7 software_lxbatch7 software_lxdev7 software_lxmaster7 software_lxplus7 software_tapeserver7 object template software_diskserver7; include declaration_functions; include software_packages_cern_redhat7_3_release; include software_packages_cern_redhat7_3_asis_base; include software_packages_cern_redhat7_3_cerncc_base; include software_packages_cern_redhat7_3_edgwp4; "/software/packages"=pkg_del("CASTOR-client"); "/software/packages"=pkg_add("CASTOR-disk_server"," ","i386"); "/software/packages"=pkg_add("CERN-CC-3dmd","1.0-1","i386");

2003/07/08Farm Management: What is in CDB ?  Hardware  CPU  Hard disk  Network card  Memory size  Location  Software  Repository definitions  Service definitions = groups of packages (RPMs)  System  Partition table  Cluster name and type  CCDB name  Site release  Load balancing information

2003/07/08Farm Management: SW Rep Current Implementation Cfg Cache SW Cache SPMA SW Rep CDB XML RPM http Installation Server PXE Kickstart DHCP KickStartMe SUE LXPLUS LXBATCH 1000 nodes MSA

2003/07/08Farm Management: Conclusions  Maturity brings…  Degradation of initial state definition  HW + SW  Accumulation of innocuous temporary procedures  Scale brings…  Marginal activities become full time  Many hands on the systems  Combat with strong management automation