Reliability of KLOE Computing
Paolo Santangelo for the KLOE Collaboration, INFN LNF
Commissione Scientifica Nazionale 1, Rome, 13 October 2003

components
  database
    1 server (+ manual failover)
  DAQ
    farm: 1 run-control + 5 servers (+2 standby)
    disks: SSA
  dataflow
    2 x 6-way servers
    2 x 4-way AFS
    disks: SSA and Fibre Channel
  offline
    34 servers (+10 upgrade, not ordered yet)
  tape library
    cartridges: 5200
    drives: 12
  most components are 3 to 5 years old
  only one failing machine removed from the environment

machines and processors
  196 processors (332-600 MHz)
  10 more IBM machines will double the available processing power
  troubles
    infant mortality
      one month after installation, two 2-way 375 MHz PowerPC Power3 cards that failed intermittently were replaced
      processor failures prompted machine reboots that excluded the failing processor
    wear-out
      a 4-year-old 4-way SUN 450 machine was found dead after a scheduled shutdown

         machines   processors
  IBM       38         156
  SUN       10          40

machines and processors (continued)
  SUN servers sometimes reboot because of kernel panics
  IBM machines never stopped during operations, nor during maintenance for common problems such as
    replacement of failing disk drives
    addition of new components (e.g. Fibre Channel drives)
  all changes to the OS that require a reboot are collected and applied at scheduled maintenance shutdowns (approximately every three months)

tape data reliability
  the library never stopped because of failures, thanks to High Availability features that include a double controller and a double (active) accessor
  two stops for microcode upgrades and one for a hardware upgrade
  component failures
    1 accessor DC power supply
    1 accessor gripper
    1 controller processor

  libraries             1      automated
  cartridges            5200   310 TB
  drives in library     12     160 MB/s
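The inventory figures on this slide allow a rough back-of-envelope estimate. The sketch below is illustrative only and rests on assumptions that the slide does not state: that 310 TB is the total capacity of the 5200 cartridges, that 160 MB/s is the aggregate rate of all 12 drives rather than a per-drive rate, and that units are decimal. The function names are hypothetical.

```python
# Back-of-envelope figures for the tape library.
# Assumptions (NOT stated in the slides): 310 TB is the total capacity,
# 160 MB/s is the aggregate rate of the 12 drives, units are decimal.

CARTRIDGES = 5200
DRIVES = 12
TOTAL_CAPACITY_TB = 310.0
AGGREGATE_RATE_MB_S = 160.0


def mean_cartridge_capacity_gb(total_tb: float, cartridges: int) -> float:
    """Average capacity per cartridge in GB (1 TB = 1000 GB)."""
    return total_tb * 1000.0 / cartridges


def per_drive_rate_mb_s(aggregate_mb_s: float, drives: int) -> float:
    """Per-drive rate implied by an aggregate rate shared equally."""
    return aggregate_mb_s / drives


def full_read_days(total_tb: float, rate_mb_s: float) -> float:
    """Days needed to stream the whole library once at the given rate."""
    seconds = total_tb * 1e6 / rate_mb_s  # 1 TB = 1e6 MB
    return seconds / 86400.0


if __name__ == "__main__":
    print(mean_cartridge_capacity_gb(TOTAL_CAPACITY_TB, CARTRIDGES))
    print(per_drive_rate_mb_s(AGGREGATE_RATE_MB_S, DRIVES))
    print(full_read_days(TOTAL_CAPACITY_TB, AGGREGATE_RATE_MB_S))
```

Under these assumptions the average works out to roughly 60 GB per cartridge, and a single full pass over the library would take on the order of three weeks of uninterrupted streaming, ignoring mount and positioning time.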

tape drive and tape cartridge failures
  20 GB model: no troubles
  40 GB model: no troubles
  60 GB model: acquired to make room for more data
    three months of continuous reversal activity
    during this period 6 drives showed a significantly large rate of recoverable errors
    6 headguides have been replaced and the drives now work smoothly (lower-quality headguides, presumably due to relocation of the manufacturing plant)
  maximum observed number of mounts per cartridge: 8,000
  no correlation found between cartridge errors and cartridge usage or number of mounts
  real data loss
    1 cartridge (40 GB)
    10 reconstructed files
  40 cartridges replaced free of charge because of a modest number of recoverable errors (cartridges are now 5 years old)

disk data
  1. because of the good reliability, most disks are used without data protection
  2. in case of disk faults, all data are easily recovered from tapes
  a fraction of the data requires improved reliability
    RAID 5
      DB2 data sets
      TSM transition areas
      all AFS areas
    mirroring
      operating system for all machines
      TSM database and logs
  DB2 data is also protected by
    frequent online backups
    automatic archival of logs
  TSM database and log files are also duplicated by TSM

                    disks    capacity
  SSA                208       7.7 TB
  Fibre Channel       28       4.0 TB
  FC upgrade        +126     +18.0 TB   (not ordered yet)

disk troubles
  a well-identified set of 120 SSA disks showed a remarkable failure rate for a significant amount of time
    first failures appeared three months after installation
    accident: the start of disk usage overlapped with an extended shutdown of all cooling services to KLOE
    the failure rate apparently grew rapidly to 1 per week
    because of this accidental overlap, the claim against IBM was issued very late
    all disks belonging to the failing set have been replaced free of charge and are now working properly
  excluding this unfortunate period, the failure rate is about 5 per year
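To put the two failure rates quoted on this slide on a common scale, they can be converted to an annualized failure rate (failures per disk per year). This is a minimal sketch under stated assumptions: the peak rate of 1 per week is extrapolated to a full year, which the slide does not claim was sustained, and the "5 per year" figure is assumed to refer to the 208 SSA disks listed in the disk inventory, since the slide does not name the population. The function name is hypothetical.

```python
# Annualized failure rate (AFR) comparison for the disk slides.
# Assumptions (NOT stated in the slides):
#   - the peak rate of 1 failure/week is extrapolated to a full year;
#   - the "5 per year" figure applies to a population of 208 SSA disks.

WEEKS_PER_YEAR = 52


def annualized_failure_rate(failures_per_year: float, population: int) -> float:
    """Failures per disk per year for a given disk population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return failures_per_year / population


# Failing batch: 120 disks at a peak of 1 failure per week.
bad_batch_afr = annualized_failure_rate(1 * WEEKS_PER_YEAR, 120)

# Normal operation: about 5 failures per year over 208 disks (assumed).
normal_afr = annualized_failure_rate(5, 208)

if __name__ == "__main__":
    print(f"failing batch, peak rate: {bad_batch_afr:.1%} per disk-year")
    print(f"normal operation:         {normal_afr:.1%} per disk-year")
```

With these inputs the failing batch peaks at roughly 43% per disk-year against about 2.4% in normal operation. Both numbers change if the populations assumed here differ from the ones the author had in mind.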

new acquisitions of disks
  KLOE is now abandoning the proven, reliable and easily manageable SSA technology, since all commercial policies favor the (15 years old and yet) emerging Fibre Channel technology
  Fibre Channel disks are only available in a SAN (Storage Area Network) environment
  this is good news and bad news at the same time: you gain all the benefits of a SAN, but you also delegate most disk management to external black boxes
  disk management and recovery are thus largely out of the system administrator's reach
  there is a serious possibility that all the economic savings could turn into a technical nightmare

miscellanea
  the beginning of KLOE operations was burdened by several networking problems
    initial cheap solutions based on Xylan equipment have been abandoned
    with a modest expense, using CISCO solutions, all KLOE networking now runs smoothly and efficiently
  the maintenance contract policy covers all components except the reconstruction farm
  the cost of maintenance contracts is around 4.5% of the list price of all maintained parts

conclusions
  5 years of experience in KLOE Computing have shown how a balanced set of quality components can guarantee:
    - long periods of reliable operation, mostly constrained by external events, covering all main KLOE activities without interruptions
        4 stops in 2003
          tape library microcode upgrade
          CISCO networking redundancy upgrade
          interruption of water cooling at LNF
          blackout
    - costs comparable to solutions based on cheap components
    - operation under the control of two persons (2 FTE)
  the amount of related administrative work, not included in this description, is becoming the real issue
