RAL Site Report HEPiX Spring 2011, GSI 2-6 May Martin Bly, STFC-RAL.


02/05/2011 RAL Site Report - HEPiX Spring 2011

Overview
STFC Stuff
RAL Stuff
Building stuff
Tier1 Stuff

STFC Developments
UK Govt Comprehensive Spending Review (CSR)
  – Was
  – General: level funding for Core Science, i.e. no increase with inflation
  – The CSR settlement allows STFC to pursue the high priority Science programme it outlined as a result of the 2009 prioritisation. In particle physics and astronomy, this was confirmed by PPAN following the CSR
GridPP is an STFC project
  – Within the STFC programme, GridPP was rated as alpha-5, the highest priority, along with CMS and ATLAS (and other stuff)
  – The T1 is a high priority within GridPP
  – GridPP funding for the next 4 years (to 2015) has now been confirmed

RAL
Addressing:
  – Removal of old-style addresses in favour of the cross-site standard (significant resistance to this)
  – No change in aim to remove old-style addresses but...
  – mostly via natural wastage as staff leave or retire
  – Staff can ask to have their old-style address terminated
Exchange:
  – Migration from Exchange 2003 to 2010 went successfully
      Much more robust, with automatic failover in several places
      Mac users happy, as Exchange 2010 works directly with Mac Mail, so no need for Outlook clones
  – Issue for Exchange servers with MNLB and switch infrastructure
      Providing load-balancing
      Needed very precise instructions for set-up to avoid significant network problems

Building Stuff
UPS problems
  – Leading power factor due to switch-mode PSUs in hardware
  – Causes 3 kHz ringing on current, all phases (61st harmonic)
      Load is small (80 kW) compared to capacity of UPS (480 kVA)
  – Most kit stable, but EMC AX4-5 FC arrays unpredictably detect supply failure and shut down arrays
  – Previous possible solutions abandoned in favour of:
  – Local isolation transformers in feed from room distribution to in-rack distribution: Works!
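As a quick check on the figures above, the 61st harmonic of the mains frequency does land close to the quoted 3 kHz. The sketch below assumes a nominal 50 Hz UK supply, which the slide does not state. For the load comparison it assumes a unity power factor so that kW and kVA can be compared directly. The slide reports a leading power factor, so the resulting ratio is only indicative. The function names are illustrative.

```python
def harmonic_frequency_hz(order: int, fundamental_hz: float = 50.0) -> float:
    """Frequency of the given harmonic of the supply (50 Hz is an assumption)."""
    return order * fundamental_hz


def load_fraction(load_kw: float, capacity_kva: float, power_factor: float = 1.0) -> float:
    """Fraction of UPS apparent-power capacity in use.

    power_factor=1.0 is an assumption made only so kW and kVA are comparable.
    """
    apparent_kva = load_kw / power_factor
    return apparent_kva / capacity_kva


ringing_hz = harmonic_frequency_hz(61)     # 61st harmonic from the slide
ups_fraction = load_fraction(80.0, 480.0)  # 80 kW load on a 480 kVA UPS

print(f"61st harmonic: {ringing_hz:.0f} Hz")
print(f"UPS loading:   {ups_fraction:.1%}")
```

Under these assumptions the harmonic sits at 3050 Hz and the UPS is loaded to roughly one sixth of its rating, consistent with the "load is small" remark.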

Networking
Site
  – Sporadic packet loss in site core networking (few %)
      Began in December, got steadily worse
      Impact on connections to FTS control channels, LFC, other services
      Data via LHCOPN not affected other than by control failures
  – Traced to traffic shaping rules used to limit bandwidth in firewall for site commercial tenants. These were being inherited by other network segments (unintentionally!)
  – Fixed by removing shaping rules and using a hardware bandwidth limiter
LAN
  – Issue with a stack causing some ports to block access to some IP addresses: one of the stacking ports on the base switch was faulty
  – Several failed 10GbE XFP transceivers

FY 10/11 Procurements
Summary of previous report:
  – 36 x SuperMicro 4U 24-bay chassis with 2TB SATA HDD (10GbE)
  – 13 x SuperMicro Twin²: 2 x X5650, 4GB/core, 2 x 1TB HDD
  – 13 x Dell C6100: 2 x X5650, 4GB/core, 2 x 1TB HDD
  – Castor (Oracle) databases server refresh: 13 x Dell R610
  – Castor head nodes: 16 x Dell R410
  – Virtualisation: 6 x Dell R510, 12 x 300GB SAS, 24GB RAM, 2 x E5640
New since November
  – 13 x Dell R610 tape servers (10GbE) for T10KC drives
  – 14 x T10KC tape drives
  – Arista 7124S 24-port 10GbE switch + twinax copper interconnects
  – 5 x Avaya 5650 switches + various 10/100/1000 switches
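For scale, the raw capacity of the disk purchase can be estimated from the chassis count, bay count and drive size listed above. This is a rough sketch only. It assumes every one of the 24 bays holds a 2TB data drive and uses decimal terabytes. The slide gives no RAID layout, so the `parity_drives_per_chassis` parameter is hypothetical and is there only to show how parity overhead would reduce the figure.

```python
def disk_capacity_tb(chassis: int, bays_per_chassis: int, drive_tb: float,
                     parity_drives_per_chassis: int = 0) -> float:
    """Capacity in TB, optionally excluding a hypothetical number of parity drives."""
    data_drives = bays_per_chassis - parity_drives_per_chassis
    return chassis * data_drives * drive_tb


# Figures from the slide: 36 chassis, 24 bays each, 2TB drives.
raw_tb = disk_capacity_tb(36, 24, 2.0)

# Hypothetical layout with two parity drives per chassis (not from the slide).
after_parity_tb = disk_capacity_tb(36, 24, 2.0, parity_drives_per_chassis=2)

print(f"Raw capacity: {raw_tb:.0f} TB")
print(f"With 2 parity drives per chassis (assumed): {after_parity_tb:.0f} TB")
```

The usable figure in production would be lower again once hot spares and file system overhead are taken into account.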

Storage Issues
One of two batches of the FY09/10 capacity storage failed acceptance testing: 60/98 servers (~2.2PB)
  – Cards swapped (LSI -> Adaptec)
  – Acceptance testing completed
  – Released for production use
After problems with one of the two batches of the FY08/09 capacity during commissioning (now resolved), the other batch has had issues in production resulting in data loss:
  – Single-drive throws cause array lock-up and crash (array loss)
  – Whole batch (50/110) rotated out of production (data migrated)
      Updated array firmware
      Recreate arrays from scratch, new file systems
      Undergoing hammer test
      Eight drive throws in 3 months successfully handled

Castor Status
Castor manages disk and tape storage
  – 12 million files (at March 2011)
  – Used/Total capacities: 3.2PB/5.2PB on tape and 3.2PB/7.5PB on disk
Recent news:
  – Major upgrade during late 2010 introducing:
      Checksums for all files, xrootd support, proper integrated disk server draining
      Disk servers migrated to SL5/64bit with XFS capability
  – New (non-Tier1) production instance for Diamond synchrotron
Coming up:
  – T10KC drive and tape media support
      Need to update Library microcode
      Need latest version of Castor to use these drives
  – Castor v for tape servers and some backend services
  – Move to new database hardware and better resilient architecture (using Oracle DataGuard) later this year
  – New service 'head nodes'
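To illustrate what per-file checksums make possible, the sketch below computes a checksum over a file in fixed-size chunks and compares it with a stored value. The slide does not name the algorithm. Adler-32 (via Python's standard `zlib` module) is assumed here purely for illustration, and the function names are hypothetical rather than anything from Castor.

```python
import zlib


def adler32_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through Adler-32 and return the checksum as 8 hex digits."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Carry the running checksum forward so large files never
            # need to be held in memory at once.
            checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


def verify_file(path: str, expected_hex: str) -> bool:
    """Return True when the file's checksum matches the recorded value."""
    return adler32_of_file(path) == expected_hex.lower()
```

A storage system would typically record the value when a file is written and run a check like `verify_file` after a transfer or a disk server drain, so that silent corruption shows up as a mismatch.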

Virtualisation
Evaluating MS Hyper-V (inspired by CERN's successes) for services virtualisation platform
  – Offers sophisticated management/failover etc. without punitive cost of VMware
However, as Linux admins, sometimes hard to know if problems are due to ignorance of the MS world
Struggled for a long time with iSCSI storage arrays (and poor support)
  – Abandoned them recently and problems seem resolved
Have learnt a lot about administering Windows servers....
Ready to implement production platform

Projects
Quattor
  – Batch and Storage systems under Quattor management
      ~6200 cores, 700+ systems (batch), 500+ systems (storage)
      Significant time saving
  – Significant rollout on Grid services node types
CernVM-FS
  – Major deployment at RAL to cope with software distribution issues
  – Details in talk by Ian Collier later this week
Network future
  – Beginning to look at provision for Tier1 core network to continue to meet increasing data bandwidth requirements and resilience
      Various mesh structures using lower cost components are attractive

Questions?

Rollout of new hardware for services nodes
Database architecture