CERN Fabric Management – Hardware and State
Bill Tomlin, GridPP 7th Collaboration Meeting, June/July 2003

CERN.ch 1 Issues
- Hardware Management
  – Where are my boxes? And what are they?
- Hardware Failure
  – #boxes × MTBF + Manual Intervention = Problem!
- Software Consistency
  – Operating system and managed components
  – Experiment software
- State Management
  – Evolve configuration with high-level directives, not low-level actions.
- Maintain service despite failures
  – or, at least, avoid dropping catastrophically below the expected service level.

CERN.ch 2 Hardware Management
- We are not used to handling boxes on this scale.
  – Essential databases were designed in the ’90s for handling a few systems at a time.
    » 2 FTE-weeks to enter 450 systems!
  – A chain of people is involved:
    » prepare racks, prepare allocations, physical install, logical install
    » and people make mistakes…
- Developing a Hardware Management System to track systems.
  – 1st benefit has been to understand what we actually do!
  – Being used to track systems as we migrate to our new machine room.
  – Would now like SOAP interfaces to all databases.
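The install chain above is a sequence of hand-offs where "people make mistakes". A minimal sketch of how a tracking system can catch out-of-order steps is a per-box state machine; the state names and `Box` class here are illustrative assumptions, not the real CERN system:

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical lifecycle states mirroring the chain on the slide:
# prepare racks -> prepare allocations -> physical install -> logical install.
class State(Enum):
    RACKED = auto()
    ALLOCATED = auto()
    PHYSICALLY_INSTALLED = auto()
    LOGICALLY_INSTALLED = auto()

# Each step must follow the previous one, so a skipped or
# out-of-order hand-off is rejected instead of silently recorded.
NEXT = {
    State.RACKED: State.ALLOCATED,
    State.ALLOCATED: State.PHYSICALLY_INSTALLED,
    State.PHYSICALLY_INSTALLED: State.LOGICALLY_INSTALLED,
}

@dataclass
class Box:
    serial: str
    state: State = State.RACKED

    def advance(self, to: State) -> None:
        if NEXT.get(self.state) != to:
            raise ValueError(f"{self.serial}: cannot go {self.state.name} -> {to.name}")
        self.state = to

box = Box("node-0001")  # hypothetical serial
box.advance(State.ALLOCATED)
box.advance(State.PHYSICALLY_INSTALLED)
print(box.state.name)  # PHYSICALLY_INSTALLED
```

Rejecting invalid transitions is what turns the database from a passive record into the "1st benefit" on the slide: a forced, shared understanding of the workflow.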

CERN.ch 3 Hardware Failure
- MTBF is high, but so is the box count.
  – 2,400 boxes at CERN today: 3.5×10⁶ disk-hours/week
    » 1 disk failure per week
- Worse, these problems need human intervention.
- Another role for the Hardware Management System:
  – Manage the list of systems needing local intervention.
    » Expect this to be a prime-shift activity only; maintain the list overnight and present it for action in the morning.
  – Track systems scheduled for vendor repair.
    » Ensure vendors meet contractual obligations for intervention.
    » Feed subsequent system changes (e.g. new disk, new MAC address) back into the configuration databases.
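The arithmetic behind "1 disk failure per week" is simply fleet-hours divided by MTBF. The slide gives the fleet exposure (3.5×10⁶ disk-hours/week); the per-disk MTBF value below is an illustrative assumption chosen to reproduce the slide's failure rate:

```python
# Expected weekly failures: failures/week = (units * hours per week) / MTBF.
HOURS_PER_WEEK = 7 * 24  # 168

def failures_per_week(units: int, mtbf_hours: float) -> float:
    return units * HOURS_PER_WEEK / mtbf_hours

# A fleet accumulating ~3.5e6 disk-hours/week is ~20,833 disks;
# with an assumed per-disk MTBF of 3.5e6 hours that is ~1 failure/week.
disks = round(3.5e6 / HOURS_PER_WEEK)
print(disks, failures_per_week(disks, 3.5e6))
```

The point of the slide follows directly: even an excellent MTBF produces a steady stream of failures once the unit count is large enough, and each one costs human time.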

CERN.ch 4 Keeping nodes in order
[diagram: a Node linked to the Configuration System, Monitoring System, Installation System and Fault Mgmt System]

CERN.ch 5 State Management
- Clusters are not static.
  – OS upgrades
  – reconfiguration for special assignments
    » cf. Higgs analysis for LEP
  – load-dependent reconfiguration
    » but best handled by common configuration!
- Today:
  – Human identification of the nodes to be moved; manual tracking of nodes through the required steps.
- Tomorrow:
  – "Give me 200, any 200. Make them like this. By then."
  – A State Management System.
    » Development starting now.
    » Again, needs tight coupling to the monitoring & configuration systems.
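The "Give me 200, any 200" phrase is a declarative request: the operator states count, target configuration and deadline, and the system picks the nodes. A minimal sketch of that interface, with all names and the node-naming scheme purely illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class StateRequest:
    count: int          # how many nodes ("any 200")
    target_config: str  # desired configuration profile ("make them like this")
    deadline: date      # completion deadline ("by then")

def plan(request: StateRequest, idle_nodes: list[str]) -> list[str]:
    """Pick any <count> free nodes; the system, not a human, chooses which."""
    if len(idle_nodes) < request.count:
        raise RuntimeError("not enough free nodes to satisfy request")
    return idle_nodes[:request.count]

# Hypothetical pool of idle batch nodes.
pool = [f"lxb{i:04d}" for i in range(400)]
chosen = plan(StateRequest(200, "batch-profile-a", date(2003, 9, 1)), pool)
print(len(chosen))  # 200
```

The contrast with "Today" on the slide is that no human ever names individual nodes; tracking each node through its reconfiguration steps is then the state machine's job, which is why tight coupling to the monitoring and configuration systems is needed.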

CERN.ch 6 Grace under Pressure
- The pressure:
  – Hardware failures
  – Software failures
    » 1 mirror failure per day
    » 1% of CPU server nodes fail per day
  – Infrastructure failures
    » e.g. AFS servers
- We need a Fault Tolerance System.
  – Repair simple local failures
    » and tell the monitoring system…
  – Recognise failures with wide impact and take action
    » e.g. temporarily suspend job submission
  – A complete system would be highly complex, but we are starting to address the simple cases.
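The two behaviours on the slide can be sketched as a single decision rule: repair a simple local failure (and report it to monitoring), but when the impact is wide, protect the service instead. The fault names, threshold, and action strings below are illustrative assumptions, not the real system:

```python
def handle_fault(fault: str, affected_nodes: int, total_nodes: int) -> list[str]:
    """Decide the response to a fault; returns the list of actions to take."""
    actions = []
    if fault == "mirror_failure" and affected_nodes == 1:
        # Simple local failure: fix it in place...
        actions.append("resync mirror locally")
        # ...and tell the monitoring system, as the slide insists.
        actions.append("notify monitoring system")
    elif affected_nodes / total_nodes > 0.10:
        # Wide impact (e.g. an AFS server outage): stop taking new work.
        actions.append("temporarily suspend job submission")
    else:
        # Everything else waits for prime-shift human intervention.
        actions.append("queue for morning intervention")
    return actions

print(handle_fault("mirror_failure", 1, 1000))
print(handle_fault("afs_server_down", 300, 1000))
```

Even this toy rule shows why the slide calls the complete system "highly complex": classifying impact correctly is the hard part, and a wrong classification either wastes human time or drops the service level.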