The Tier-0 Road to LHC Data Taking
CERN IT Department, CH-1211 Genève 23, Switzerland
www.cern.ch/it-fio

The Tier-0 approach combines commodity hardware (CPU servers, disk servers, network fabric, tape drives and robotics) with innovative software (CASTOR, quattor, LEAF and Lemon, described below).

Commodity Hardware

- At least two suppliers are chosen for each set of servers to be installed, to reduce the risk that capacity upgrades are delayed by problems with one supplier or one system component.
- Rigorous “burn-in” tests prior to moving servers into production catch hardware problems early.
- Given past experience with hardware problems, especially with disk servers, we favour ordering early to ensure capacity is delivered on schedule, rather than delaying purchases as long as possible.

CPU Servers

Tenders are issued for kSI2K capacity, measured using a SPEC benchmark run as specified by CERN. “White box” suppliers offer the best price/performance. On-site maintenance is required (3-day response), with a 3-year warranty. There are no concerns about hardware reliability; the biggest problem is power consumption: 40 dual-motherboard servers in a 19” rack give a heat load of 30 kW/m²! A power-consumption penalty is now included in tenders, but 1U boxes are still favoured over blade servers.

Disk Servers

Tenders are issued for disk capacity plus I/O performance requirements. “White box” suppliers offer the best price/performance. On-site maintenance is required (4-hour response), with a 3-year warranty. We do experience hardware problems, but these are almost always disk-related, and tier-1 vendors (IBM, HP, …) use the same disk suppliers! The commodity approach provides the greatest capacity with no significant performance or reliability penalty.

Network Fabric

Using Ethernet standards enables provision of a high-performance core network (2.4 Tb/s switching capacity, designed for fully non-blocking performance) whilst keeping per-port costs low. Using Ethernet as the unique interconnection technology (as opposed to Fibre Channel or InfiniBand) avoids costly and inefficient translation gateways to connect to the rest of CERN, the experiments and the Tier-1s. However, we use switches with custom ASICs rather than true commodity devices, as the ability to cope with congestion is an important feature in the CERN environment; significant effort is required for testing during the hardware selection process.

Tape Drives & Robotics

CERN tape system costs are dominated by the media: at €100 per 600GB cartridge, writing 15PB/year costs €2.5M/year! LTO cartridge costs are lower, but LTO media cannot be reformatted at a higher density when new drives are introduced. IBM and STK media can be reformatted; for example, next-generation T10K drives will be able to write 1TB on current media. The costs saved by reusing media more than offset the higher costs of non-commodity drives. A twin-vendor solution preserves flexibility, e.g. if one vendor is delayed in releasing a new drive. Media repack to enable volume reuse is expected to be an ongoing operation; a dedicated CASTOR2 component has been created to facilitate this task.
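The media-cost figure above follows from simple arithmetic. A minimal back-of-the-envelope check in Python, using only the numbers quoted on the poster (€100 per 600GB cartridge, 15PB written per year):

# Sanity check of the tape media cost quoted above:
# EUR 100 per 600 GB cartridge, 15 PB of new data per year.
cartridge_capacity_gb = 600
cartridge_cost_eur = 100
data_per_year_gb = 15e6  # 15 PB expressed in GB (decimal units)

cartridges_per_year = data_per_year_gb / cartridge_capacity_gb  # 25,000 cartridges
media_cost_per_year = cartridges_per_year * cartridge_cost_eur  # EUR 2.5 million

print(f"{cartridges_per_year:,.0f} cartridges/year => EUR {media_cost_per_year:,.0f}/year")

Reusing IBM/STK media at a higher density therefore directly reduces the dominant cost term, which is the rationale given above for preferring them over cheaper LTO cartridges.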
The Tier-0 fabric currently comprises:

- Tape: 3 Sun/STK SL8500 libraries (23,200 slots, 11.6PB capacity) with 50 Sun/STK T10K tape drives, and 3 IBM 3584 libraries (8,650 slots, 6PB capacity) with 60 IBM 3592 tape drives.
- Batch farm: 2,750 dual-CPU servers, 7.5 MSI2K capacity, 150,000 jobs/week; other functions: 1,600 dual-CPU servers (generally older hardware).
- Disk: 700 servers (~500 in CASTOR2); 3PB total capacity.
- Network: Ethernet everywhere; 8 Force10 10GbE routers and 100 HP switches providing the GbE ports, with 2.4Tb/s switching capacity.

Innovative Software

CASTOR2
A database-centric architecture gives increased reliability and performance: core services are isolated from user queries, and stateless daemons can be replicated for performance and restarted easily after (hardware) failure. Access request scheduling allows prioritisation of requests according to user role/privilege, guarantees I/O performance for users and avoids hardware overload (a toy illustration of such scheduling appears at the end of this section). For more information, see contribution #62, “CASTOR2: Design and Development of a scalable architecture for a hierarchical storage system at CERN” (Carson hall B).

quattor
quattor is a system administration toolkit for the automated installation, configuration and management of clusters and farms. Key quattor components are:
- CDB, the fabric-wide configuration database, which holds the desired state for all nodes;
- SPMA, the Software Package Management Agent, which handles local software installations using the system packager (e.g. RPM);
- NCM, the Node Configuration Manager, which configures/reconfigures local system and grid services using a plug-in component framework (a sketch of such a plug-in appears at the end of this section).
quattor development started in the scope of the EDG project. Development and maintenance are coordinated by CERN in collaboration with other partner institutes (including BARC, BEGrid, IN2P3/LAL, INFN/CNAF, NIKHEF, Trinity College Dublin, UAM Madrid and others).

LEAF
Workflow tracking for box management (hardware and software): install and relocate systems, track vendor maintenance actions, and oversee software upgrades across a 3,000+ node cluster. LEAF integrates monitoring, services and the configuration database: for example, on a disk server failure it reconfigures the CASTOR cluster in CDB and calls the vendor for repair. See also the poster on the Service Level Status Overview (#91; this session).

Lemon
A distributed monitoring system scalable to 10k+ nodes. Lemon:
- provides active monitoring of software, hardware and the Computer Center environment (power usage, temperature, …);
- facilitates early error detection and problem prevention;
- provides persistent storage of the monitoring data;
- executes corrective actions and sends notifications;
- offers a framework for the creation of further monitoring sensors (a sensor sketch appears at the end of this section).
Most of the functionality is site independent. Together, quattor, LEAF and Lemon form ELFms, the Extremely Large Fabric management system.
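To make the request-scheduling idea in the CASTOR2 description concrete, here is a minimal sketch of priority-based request scheduling. The class names, role priorities and throttling policy are illustrative assumptions and do not reproduce the actual CASTOR2 interfaces:

import heapq
from dataclasses import dataclass, field
from itertools import count

# Illustrative priorities per role; real CASTOR2 policies are configurable
# and are not reproduced here.
ROLE_PRIORITY = {"production": 0, "grid": 1, "user": 2}

@dataclass(order=True)
class AccessRequest:
    priority: int
    seq: int                      # tie-breaker keeps FIFO order within a role
    user: str = field(compare=False)
    path: str = field(compare=False)

class RequestScheduler:
    """Toy scheduler: requests are served by role priority, so heavy user
    activity cannot starve production transfers."""
    def __init__(self, max_active_per_server=10):
        self._queue = []
        self._seq = count()
        self.max_active_per_server = max_active_per_server

    def submit(self, user, role, path):
        prio = ROLE_PRIORITY.get(role, len(ROLE_PRIORITY))
        heapq.heappush(self._queue, AccessRequest(prio, next(self._seq), user, path))

    def next_request(self, active_on_server):
        # Throttle to protect a disk server from I/O overload.
        if active_on_server >= self.max_active_per_server or not self._queue:
            return None
        return heapq.heappop(self._queue)

In CASTOR2 itself the pending requests live in the database rather than in process memory, which is what allows the stateless daemons mentioned above to be restarted or replicated freely.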
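The NCM plug-in component framework can be pictured with a short schematic analogue in Python; the class names, the profile paths and the ntp example below are invented for illustration and are not the real NCM API:

from abc import ABC, abstractmethod

class Component(ABC):
    """A configuration plug-in: reads the desired state for one service from
    the node profile (derived from CDB) and applies it locally."""
    name = "base"

    @abstractmethod
    def configure(self, profile: dict) -> None: ...

class NtpComponent(Component):
    name = "ntp"
    def configure(self, profile: dict) -> None:
        servers = profile.get("/software/components/ntp/servers", [])
        # A real component would rewrite /etc/ntp.conf and restart the daemon;
        # here we only show where that logic would live.
        print(f"[ntp] would configure servers: {servers}")

def run_components(profile: dict, components: list[Component]) -> None:
    # The manager runs each registered component against the node's desired state.
    for comp in components:
        comp.configure(profile)

if __name__ == "__main__":
    node_profile = {"/software/components/ntp/servers": ["ntp.example.org"]}
    run_components(node_profile, [NtpComponent()])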
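Similarly, the sensor framework offered by Lemon can be sketched as a sample/store/act loop. The metric name, threshold and actions below are illustrative assumptions, not Lemon's real sensor API:

import shutil
import time

FS_USAGE_THRESHOLD = 0.90   # alarm when the root filesystem is >90% full

def sample_metrics() -> dict:
    # Active monitoring: sample a local hardware/software metric.
    usage = shutil.disk_usage("/")
    return {"fs_root_used_fraction": usage.used / usage.total,
            "timestamp": time.time()}

def store(sample: dict) -> None:
    # Persistent storage of monitoring data; here just appended to a file.
    with open("metrics.log", "a") as f:
        f.write(repr(sample) + "\n")

def check_and_act(sample: dict) -> None:
    # Early error detection: notify and trigger a corrective action
    # when a threshold is exceeded.
    if sample["fs_root_used_fraction"] > FS_USAGE_THRESHOLD:
        print("ALARM: root filesystem nearly full; would notify operators "
              "and launch a cleanup action here.")

if __name__ == "__main__":
    s = sample_metrics()
    store(s)
    check_and_act(s)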