Tier1 - Disk Failure Stats and Networking
Martin Bly, Tier1 Fabric Manager
11/06/2010

Overview
Tier1 Disk Storage Hardware
Disk Failure Statistics
Network configuration

Tier1 Storage Model
Tier1 provides disk storage for Castor
– Particle Physics data (mostly)
Storage is:
– Commodity components
– Storage-in-a-box
– Lots of units

Storage Systems I
Typically 16- or 24-disk chassis
– Mostly SuperMicro
One or more hardware RAID cards
– 3ware/AMCC, Areca, Adaptec, LSI
– PCI-X or PCI-e
– Some active backplanes
Configured as RAID5 or, more commonly, RAID6 (capacity sketch below)
– RAID1 for system disks, on a second controller where it fitted
Dual multi-core CPUs
4GB, 8GB or 12GB RAM
1 GbE NIC
IPMI
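The capacity cost of RAID6 over RAID5 is one extra parity drive per array. A minimal sketch, assuming a hypothetical 16-bay unit with 1 TB data drives (illustrative numbers, not a specific RAL generation):

```python
# Usable capacity after parity, for the RAID levels used on these units.
def usable_tb(drives: int, drive_tb: float, parity_drives: int) -> float:
    """Capacity left for data once parity drives are set aside."""
    return (drives - parity_drives) * drive_tb

print(usable_tb(16, 1.0, 1))  # RAID5: 15.0 TB, tolerates one drive failure
print(usable_tb(16, 1.0, 2))  # RAID6: 14.0 TB, tolerates two drive failures
```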

Storage Systems II

Year  Chassis  CPU      CPU x Cores  RAM /GB  NIC    IPMI  Units
…     …-bay    AMD 275  2 x 2        4        1 GbE  ✓     …
…     …-bay    AMD 275  2 x 2        4        1 GbE  ✓     …
…A    16-bay   AMD …    … x 2        8        1 GbE  …     …
…B    16-bay   Intel …  … x 4        8        1 GbE  …     …
…A    16-bay   L5410    2 x 4        8        1 GbE  …     …
…B    16-bay   E5405    2 x 4        8        1 GbE  …     …
…A    16-bay   E5504    2 x 4        12       1 GbE  …     …
…B    16-bay   E5504    2 x 4        12       1 GbE  …     38

(… = value lost in the transcript)

Storage Systems III

Year  Controller         Data Drive  Data Drives  RAID  Capacity  Units  Drives
2005  Areca 1170         WD …        …            …     … TB      …      …
2006  3ware 9550SX-16ML  WD …        …            …     … TB      …      …
…A    3ware 9650SE-16    WD …        …            …     … TB      …      …
…B    3ware 9650SE-16    WD …        …            …     … TB      …      …
…A    Areca 1280         WD RE2      …            …     20 TB     …      …
…B    3ware 9650SE-24    WD RE2      …            …     20 TB     …      …
…A    LSI 8888-ELP       WD RE4-GP   22           6     38 TB     …      …
…B    Adaptec 52445      Hitachi     22           6     40 TB     …      …

Storage Use
Servers are assigned to Service Classes
– Each Virtual Organisation (VO) has several Service Classes
– Usually between two and many servers per Service Class
Service Classes are either:
– D1T1: data is on both disk and tape
– D1T0: data is on disk but NOT tape – if lost, the VO gets upset :-(
– D0T1: data is on tape – disk is a buffer
– D0T0: ephemeral data on disk
Service Class type is assigned per VO, depending on their data model and what they want to do with a chunk of storage
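In the DnTm names, the first digit is the number of disk copies and the second the number of tape copies, which is why D1T0 is the fragile case. A small sketch of that reading (the class list is from the slide; the helper function is illustrative):

```python
# Service classes as (disk copies, tape copies).
SERVICE_CLASSES = {
    "D1T1": (1, 1),  # on disk and on tape
    "D1T0": (1, 0),  # on disk only -- no second copy anywhere
    "D0T1": (0, 1),  # on tape; disk is just a buffer
    "D0T0": (0, 0),  # ephemeral scratch data
}

def disk_loss_is_fatal(service_class: str) -> bool:
    """True if losing the disk copy loses the only copy of the data."""
    disk, tape = SERVICE_CLASSES[service_class]
    return disk == 1 and tape == 0

print(disk_loss_is_fatal("D1T0"))  # True -- hence RAID6 for D1T0 (next slide)
print(disk_loss_is_fatal("D0T1"))  # False -- the tape copy survives
```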

System uses
Want to make sure data is as secure as possible
RAID5 is not secure enough:
– Only one disk failure can put data at risk
– Longer period of risk as rebuild time increases with array/disk sizes
– Even with a hot spare, the risk of a double failure is significant (see the sketch below)
Keep D1T0 data on RAID6 systems
– Double parity information
– Can lose two disks before data is at risk
Tape buffer systems need throughput (network bandwidth), not space
– Use smaller-capacity servers in these Service Classes
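Why rebuild time matters: a back-of-envelope sketch of the chance that a second drive fails mid-rebuild, using a crude independent-failure model. The AFR is the 2009 fleet figure from the following slides; the rebuild times are assumptions for illustration:

```python
# Probability that at least one of the surviving drives in an array
# fails while a rebuild is running.
afr = 0.094            # 9.4% annual failure rate (2009 fleet figure)
survivors = 15         # other drives in a 16-drive array
for rebuild_days in (1, 3, 7):
    p_one = afr * rebuild_days / 365        # one given drive fails in the window
    p_any = 1 - (1 - p_one) ** survivors    # at least one of the 15 does
    print(f"{rebuild_days}-day rebuild: {p_any:.1%} risk of a second failure")
# Roughly 0.4%, 1.2% and 2.7% -- small per rebuild, but across hundreds of
# arrays it adds up (21 multiple-failure incidents in 2009).
```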

Disk Statistics

High(Low!)lights
Drives failed / changed: 398 (9.4%!)
Multiple failure incidents: 21
Recoveries from multiple failures: 16
Data copied to another file system: 1
Lost file systems: 4
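The headline percentage also pins down the scale of the drive estate; a quick arithmetic check:

```python
failed = 398           # drives failed/changed in the year
rate = 0.094           # ... which was 9.4% of the fleet
print(round(failed / rate))    # ~4234 drives in service
print(round(failed / 12, 1))   # ~33.2 drive swaps per month
```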

2009 Data
Failures were tabulated per month (Jan–Dec, plus Total, Average/month and Base counts) for each generation (Clustervision 05, Viglen 06, Viglen 07a, Viglen 07i), each drive model (WD 500GB, WD 750GB, Seagate 750GB) and overall; the monthly counts did not survive transcription. The surviving failure-rate row:
% fail by generation: Clustervision 05 14.3%, Viglen 06 19.6%, Viglen 07a 3.6%, Viglen 07i 3.9%
% fail by drive model: WD 500GB 17.9%, WD 750GB 3.6%, Seagate 750GB 3.9%
% fail overall: 9.4%

Failure by generation

By drive model

Normalised Drive Failure Rates

2009 Multiple failures data
Multiple-failure incidents were tabulated per month for each generation (Clustervision 05, Viglen 06, Viglen 07a, Viglen 07i) and overall, each split into Saved / Copied / Lost / Total doubles, with totals and monthly averages; the individual counts did not survive transcription.

Lost file systems (arrays)

Closer look: Viglen 06
Why are the Viglen 06 servers less reliable?
RAID5?
– More vulnerable to a double disk failure causing a problem
Controller issues?
– Yes: failure to start rebuilds after a drive fails
– Yes: some cache RAM issues
System RAM?
– Yes: ECC not set up correctly in the BIOS (ECC check sketch below)
Age?
– Possibly: difficult to assert with confidence
Disks?
– Hmmm…
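A quick way to check whether ECC reporting is actually active on a Linux box; a sketch assuming the standard EDAC sysfs layout (paths as exposed by the kernel's edac drivers):

```python
from pathlib import Path

# Corrected/uncorrected memory error counters exposed by the Linux EDAC
# driver. If no mc* directories exist, ECC reporting is not active --
# e.g. because it was disabled or misconfigured in the BIOS.
edac = Path("/sys/devices/system/edac/mc")
controllers = sorted(edac.glob("mc*")) if edac.exists() else []
if not controllers:
    print("No EDAC memory controllers: ECC reporting is not enabled")
for mc in controllers:
    ce = (mc / "ce_count").read_text().strip()   # corrected errors
    ue = (mc / "ue_count").read_text().strip()   # uncorrected errors
    print(f"{mc.name}: corrected={ce} uncorrected={ue}")
```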

Viglen 06 – Failures per server

Viglen 06 failures – sorted
(Chart labels: Batch 1, Batch 2)

Summary
Overall failure rate of drives is high (9.4%)
Dominated by WD 500GB drives (19.6%)
Rate for 750GB drives much lower (3.75%)
Spikes in failures:
– at the time of the Tier1 migration to R89
– when the air conditioning in R89 failed
Batch effect for the Viglen 06 generation

Networking

Tier1 Network
Data network
– Central core switch: Force10 C300 (10GbE)
– Edge switches: Nortel 55xx/56xx series stacks
– Core-to-edge: 10GbE or 2 x 10GbE
– All nodes connected at 1 x GbE for data
Management network
– Various 10/100Mb/s switches: NetGear, 3Com – new, plus kit salvaged from the old data network
Uplinks
– 10Gb/s to CERN (+ failover)
– 10Gb/s to site LAN
– 10Gb/s + 10Gb/s failover to SJ5 (to be 20Gb/s each soon)
– Site routers: Nortel 8600 series


Tier1 Network 102
Tier1 has a /21 network within the site range
– ~2000 addresses
Need a dedicated IP range for the LHCOPN
– /23 network within the existing subnet
– ~500 addresses, Castor disk servers only
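The two prefixes are easy to sanity-check with Python's ipaddress module; a minimal sketch using hypothetical RFC 1918 prefixes in place of the real ranges:

```python
import ipaddress

# Placeholder prefixes -- not the actual RAL allocations.
tier1 = ipaddress.ip_network("10.8.0.0/21")  # Tier1 allocation in the site range
opn = ipaddress.ip_network("10.8.6.0/23")    # LHCOPN range for Castor disk servers

print(tier1.num_addresses)   # 2048 -> "~2000 addresses"
print(opn.num_addresses)     # 512  -> "~500 addresses"
print(opn.subnet_of(tier1))  # True: the /23 sits inside the /21
```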

Tier1 Network routing
General nodes
– Gateway address on the site router (Router A)
– All traffic goes to Router A; from there to site, or off-site via the firewall and the Site Access Router (SAR)
OPN nodes
– Gateway address on the UKLight router
– Special routes to Router A for site-only traffic
– All off-site traffic to the UKLight router
UKLight router
– BGP routing information from CERN
– T0–T1 and T1–T1 traffic directed to the CERN link
– Lancaster traffic -> link to Lancaster
– Other traffic from RAL Tier1 up to the SAR (the bypass)
SAR
– Filters inbound traffic for the Tier1 subnet: OPN machines to the core via the UKLight router, others to Router A via the firewall
– Non-Tier1 traffic not affected
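A toy version of the next-hop choice an OPN disk server makes under this scheme. All prefixes are made-up placeholders, and a real deployment does this with static routes plus a default gateway, not application code:

```python
import ipaddress

SITE      = ipaddress.ip_network("10.8.0.0/16")  # rest of the RAL site (placeholder)
OPN_LOCAL = ipaddress.ip_network("10.8.6.0/23")  # local OPN subnet (placeholder)

def next_hop(dst: str) -> str:
    """Mirror the OPN-node routing above: site traffic to Router A,
    everything else off-site via the UKLight router (the default gateway)."""
    ip = ipaddress.ip_address(dst)
    if ip in OPN_LOCAL:
        return "link-local"        # same subnet, no router needed
    if ip in SITE:
        return "Router A"          # the special site-only routes
    return "UKLight router"        # BGP there picks CERN/Lancaster/bypass

print(next_hop("10.8.42.7"))       # Router A
print(next_hop("198.51.100.9"))    # UKLight router
```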

Logical network
(Diagram labels: Firewall, SAR, UKLR, Router A, OPN, Tier1 subnet, SJ5, OPN subnet, Site)

Tier1 Network firewalling
Firewall is a NetScreen
– 10Gb/s linespeed
Site firewall policy (toy sketch below):
– Closed inbound, except by request on a port/host (local/remote) basis
– Open outbound
– Port 80 + kin redirected via web caches
Tier1 subnet:
– As site, except port 80 + kin do not use the web caches
OPN subnet:
– Traffic is filtered at the SAR – only Castor ports open
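The inbound/outbound asymmetry of the site policy, as a toy filter. The prefix and the exception list are hypothetical; the real enforcement is on the NetScreen:

```python
import ipaddress

TIER1 = ipaddress.ip_network("10.8.0.0/21")                # placeholder subnet
INBOUND_HOLES = {("10.8.1.10", 443), ("10.8.1.11", 1094)}  # opened "by request"

def allowed(src: str, dst: str, dport: int) -> bool:
    """Default-deny inbound, open outbound, per the site policy above."""
    inbound = (ipaddress.ip_address(dst) in TIER1
               and ipaddress.ip_address(src) not in TIER1)
    if not inbound:
        return True                       # outbound/internal traffic is open
    return (dst, dport) in INBOUND_HOLES  # inbound needs a requested exception

print(allowed("10.8.2.5", "198.51.100.9", 80))    # True: outbound is open
print(allowed("198.51.100.9", "10.8.1.10", 443))  # True: requested hole
print(allowed("198.51.100.9", "10.8.1.12", 22))   # False: closed inbound
```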