1 ACAT 2007, April 22-27 2007, Nikhef, Amsterdam
Experience with Fabric Storage Area Network and HSM Software at the Tier1 INFN CNAF
Pier Paolo Ricci et al., on behalf of the INFN Tier1 staff

2 Summary
- Overall Tier1 hardware description
- Castor v.2 HSM software
- SAN fabric implementation
- Monitoring and administration tools
- GPFS and Castor performance tests

3 Overall Tier1 Fabric Hardware Description
Here is what we have in production:
Disk (SAN): ~980 TB raw (ATA RAID-5)
- 9 Infortrend A16F-R1211-M2: 56 TB
- 1 SUN STK BladeStore: 32 TB
- 4 IBM FastT900 (DS 4500): 200 TB
- 5 SUN STK FLX680: 290 TB
- 3 DELL EMC CX-380: 400 TB
Tape: 1 PB uncompressed (tapes installed for 670 TB)
- 1 SUN STK L5500 library, partitioned into 2000 LTO-2 slots (200 GB cartridges) and 3500 9940B slots (200 GB cartridges)
- 6 LTO-2 drives (20-30 MB/s each)
- 7 9940B drives (25-30 MB/s each)

4 TIER1 INFN CNAF Storage (Overview)
[Diagram: overall storage architecture]
- Farm: 800 worker nodes (LSF batch system) for 1500 KSPI2k (3000 KSPI2k in the 2nd half of 2007), accessing the storage over the LAN via RFIO, GPFS and Xroot
- ~90 diskservers with Qlogic FC HBAs (2340 and 2462) attached to the Fibre Channel SAN
- SAN: ~980 TB raw (-15/25% for net space => ~770 TB)
  - 5 Infortrend A16F-R1211-M2 + JBOD (5 x 6400 GB SATA, 2 x 2 Gb FC interfaces each) and 4 Infortrend A16F-R1A2-M1 (4 x 3200 GB SATA, 2 x 2 Gb FC interfaces each): 56 TB raw
  - 1 SUN STK BladeStore (SATA blades, 4 x 2 Gb FC interfaces): 32 TB raw
  - 4 IBM FastT900 (DS 4500) (SATA, 4 x 2 Gb FC interfaces each): 200 TB raw
  - 5 SUN STK FLX680 (500 GB SATA blades, 4 x 2 Gb FC interfaces each): 290 TB raw
  - 3 EMC CX-380 (FATA, 8 x 4 Gb FC interfaces each): 400 TB raw
- CASTOR-2 HSM (1 PB): Castor service servers and tapeservers, STK L5500 robot (5500 slots) with 6 IBM LTO-2 and 7 STK 9940B drives on Fibre Channel
- STK180 with 180 LTO-1 (18 TB native) and a W2003 server with LEGATO Networker for backup
- Access from the WAN or the TIER1 LAN via RFIO

5 Castor v.2 Hardware
Core services run on machines with SCSI disks, hardware RAID-1 and redundant power supplies; tape servers and disk servers have lower-level hardware, like the WNs.
- STK L5500 silo (5500 slots, partitioned with 2 form-factor slots: about 2000 LTO-2 and 3500 9940B, 200 GB cartridges, total capacity ~1.1 PB uncompressed)
- 6 LTO-2 and 7 9940B drives, 2 Gb/s FC interfaces, 20-30 MB/s rate (some more 9940B drives are going to be acquired in the next months)
- ACSLS 7.0 running on a Sun Blade v100 (OS Solaris 9.0) with 2 internal IDE disks in software RAID-1
- 13 tape servers on the Brocade FC SAN
- ~40 disk servers (STK FlexLine 600, IBM FastT900, ...) attached to the SAN with fully redundant 2 Gb/s or 4 Gb/s (the latest) FC connections (dual-controller hardware plus Qlogic SANsurfer Path Failover software or vendor-specific software)

6 Setup: Core Services
CASTOR core services on 4 machines (clients also deployed), plus 2 more machines for the Name Server and Stager DBs:
- castor-6: rhserver, stager, rtcpclientd, MigHunter, cleaningDaemon
- castorlsf01: Master LSF, rmmaster, expert
- dlf01: dlfserver, Cmonit, Oracle for DLF
- castor-8: nsdaemon, vmgr, vdqm, msgd, cupvd
- Name Server Oracle DB (Oracle 9.2) and Stager Oracle DB (Oracle 10.2)
2 SRMv1 endpoints, DNS balanced:
- srm://castorsrm.cr.cnaf.infn.it:8443 (used for "tape" service classes)
- srm://sc.cr.cnaf.infn.it (used for disk-only service classes for CMS and ATLAS)
- srm://srm-lhcb-durable.sc.cr.cnaf.infn.it (used as disk-only service class for LHCb)
SRM v2.1.1 also available (disk server in the TURL).

7 Setup: Disk Servers
- ~40 disk servers, about 5-6 filesystems per node (both XFS and EXT3 used), typical size of the order of terabytes
- LSF software distributed via NFS (exported by the LSF master node)
- Number of LSF slots: from 30 to 450, modified many times (lower or higher values only for tests)
- Many servers are used both for file transfers and for job reco/analysis, so a max-slots limitation is not very useful in such a case

8 Supported VOs - Svcclasses - Diskpools (270 TB net)
Svcclass       | Exp    | Disk pool  | Gar. Coll. | Disk (TB) | Tape (TB)
alice          | ALICE  | alice1     | yes        | 22.0      | 35.0
cms            | CMS    | cms1       |            | 48.3      | 120.0
cmsdisk        | CMS    | cms1disk   | no         | 51.8      | -
atlas          | ATLAS  | atlas1     |            | 22.5      | 140.0
atlasdisk      | ATLAS  | atlas1disk |            | 53.5      | -
lhcb           | LHCb   | lhcb1      |            | 13.2      | 80.0
lhcbdisk       | LHCb   | lhcb1disk  |            | 33.1      | -
dteam          | dteam  | dteam1     |            | 0.1       | 3.0
lvd            | LVD    | archive1   |            | 1.6       | 2.0
argo           | ARGO   | argo1      |            | 8.3       | 60.0
argo_download  | ARGO   | argo2      |            | 2.2       | -
virgo          | VIRGO  |            |            |           | 20.0
ams            | AMS    | ams1       |            | 2.7       | 10.0
pamela         | PAMELA | pamela1    |            | 3.6       | 8.0
magic          | MAGIC  |            |            |           |
babar          | BABAR  |            |            |           | 24.0
cdf            | CDF    |            |            |           | 15.0

9 Setup: Monitoring and Alarms
We use Nagios + RRD for alarm notifications.

10 Setup: Nagios
- Typical parameters such as disk I/O, CPU, network, number of connections and processes, available disk space, RAID status...
- CASTOR-specific parameters such as tape and disk pool free space, daemons, LSF queues
- Still missing: checks on the stager DB tables such as newrequests, subrequests...
[Screenshots: Nagios status pages for castorlsf01 and castor-6]

11 Setup: Monitoring (Lemon)
- Lemon is in production as a monitoring tool
- Lemon is the monitoring tool suggested by CERN, with strong integration with Castor v.2
- Oracle v.10.2 as database backend

12 Lemon: Monthly Usage (Production)
[Lemon plots: monthly throughput]
- Castor diskservers (disk-to-disk transfers): max and average values in MB/s
- Castor tapeservers (disk to/from tape transfers): average 41.7 MB/s

13 Lemon: Monthly Usage (Production)
[Lemon plots: monthly throughput]
- GPFS: max 205 MB/s, average 18.3 MB/s
- Xrootd: max 47.6 MB/s, average 6.28 MB/s

14 SAN Fabric Implementation
Why a Storage Area Network?
- As a Tier1 we need to grant a 24/7 service to the users (LHC and non-LHC)
- A good SAN hardware installation can be a "real" No Single Point of Failure (NSPF) system (if the software supports it!), so failures of storage infrastructure components, or planned events (like firmware upgrades), can be handled without stopping any service
- A SAN also gives the best flexibility:
  - we can dynamically vary the diskserver/disk-storage assignment (adding diskservers, changing ratios...)
  - we can use clustered filesystems like GPFS as disk pools

15 SAN Fabric Implementation
SINGLE SAN (980 TB raw), with hardware based on:
- Brocade switches: the SAN is one single fabric, managed with a single web management tool and the Fabric Manager software for failure and performance monitoring
  - 1st director, fully redundant, with 128 2 Gb/s ports (2005 tender price: 1 kEuro/port)
  - 2nd director, fully redundant, with 128 (out of 256) 4 Gb/s ports (2006 tender price: 0.5 kEuro/port)
  - 2 Silkworm 3900 switches with 32 2 Gb/s ports each, connected to the directors by 2 x 2 Gb/s trunked uplinks
- Qlogic QLA2340 (2 Gb/s) and QLA2462 (4 Gb/s) HBAs: HA failover implemented with the SANsurfer configuration and vendor-specific tools (EMC PowerPath; a path-check sketch follows this slide)
- Disk storage:
  - 4 x IBM FastT900 DS 4500 (4 x 2 Gb/s outputs per box), 200 TB => 20 diskservers with single HBA
  - 4 x FlexLine 600 (4 x 2 Gb/s each), 290 TB => 20 diskservers with double HBAs
  - 3 x CX-380 (8 x 4 Gb/s each), 400 TB => 36 diskservers with double HBAs
  - 1 x SUN STK BladeStore (4 x 2 Gb/s) => 5 diskservers with single HBA
  - 9 x Infortrend A16F-R1211-M2 + JBOD (2 x 2 Gb/s each), 56 TB => 9 primary diskservers with single HBA
About 6-12 TB raw accessed by one diskserver can be enough, depending on the filesystem/protocol. Physical Fibre Channel connections, failover and zoning are configured in the simplest way; traffic from diskservers remains in the local switch in most cases, so uplink usage is minimized.
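To illustrate the HBA failover side mentioned above, here is a minimal sketch of how the redundant paths could be inspected on a diskserver running EMC PowerPath; the commands are standard PowerPath and Linux tools, but their use here is an assumption, not the exact procedure followed at CNAF.

#!/bin/bash
# Hedged sketch: inspecting the redundant FC paths on a diskserver with
# Qlogic HBAs and EMC PowerPath. The device layout in the comments is illustrative.

# FC HBAs seen by the kernel (qla2xxx driver on SL 3/4, or sysfs on 2.6 kernels).
ls /proc/scsi/qla2xxx/ 2>/dev/null || ls /sys/class/fc_host/

# Each LUN should appear with several native paths (e.g. /dev/sda and /dev/sdc
# through different HBAs) behind one PowerPath pseudo device (/dev/emcpowera).
powermt display dev=all

# After zoning or LUN-masking changes: rediscover paths and save the configuration.
powermt config
powermt save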

16 DISK access typical case (NSPF)
LAN with Gb Ethernet connections to 12 diskservers: Dell 1950, dual-core biprocessors (2 x 1.6 GHz, 4 MB L2 cache), 4 GB RAM, 1066 MHz FSB, SL 3.0 or 4.0 OS, hardware RAID-1 on the system disks and redundant power supply.
- Every diskserver has 2 x 4 Gb Qlogic 2460 redundant FC connections to the SAN
- SAN zoning: each diskserver sees 4 paths to the storage; EMC PowerPath is used for load balancing and failover (or Qlogic SANsurfer if problems with SL arise)
- Storage: 110 TB EMC CX-380 with dual redundant controllers (storage processors A and B) and 4 x 4 Gb FC outputs per SP (A1-A4, B1-B4)
- 4 TB RAID groups (8+1) are split into 2 TB logical disks (LUN0, LUN1, ...), seen by each diskserver as /dev/sda, /dev/sdb, ...
Example of application high availability: GPFS with a Network Shared Disk configuration (a configuration sketch follows):
- /dev/sda: primary Diskserver1, secondary Diskserver2
- /dev/sdb: primary Diskserver2, secondary Diskserver3
- ...
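The NSD primary/secondary scheme above could be configured roughly as in the following sketch (GPFS 3.x-era commands); hostnames, device names and the filesystem name are illustrative assumptions, not the actual CNAF configuration.

#!/bin/bash
# Hedged sketch of the GPFS NSD setup described above: each LUN gets a primary
# and a secondary NSD server, so the loss of one diskserver does not take the
# filesystem offline.

# Disk descriptor file: DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup
cat > /tmp/disk.desc <<'EOF'
/dev/sda:diskserv01:diskserv02:dataAndMetadata:1
/dev/sdb:diskserv02:diskserv03:dataAndMetadata:2
/dev/sdc:diskserv03:diskserv01:dataAndMetadata:3
EOF

# Turn the LUNs into Network Shared Disks (mmcrnsd rewrites the descriptor
# file with the generated NSD names).
mmcrnsd -F /tmp/disk.desc

# Create and mount a single large filesystem served by all NSD servers.
mmcrfs /gpfs/tier1 gpfs_tier1 -F /tmp/disk.desc -B 512K -A yes
mmmount gpfs_tier1 -a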

17 SAN Monitoring Tools
[Screenshots: the switch web admin tool (from a browser) and the zoning configuration view]

18 SAN Monitoring Tools
[Screenshots: the Fabric Manager software (installed on a dedicated machine) and its performance monitoring view, showing PowerPath load balancing]

19 SAN Disk Distribution
Storage used in production:
- 270 TB net: Castor v.2 staging area with tape backend or disk-only pools (see above)
- 140 TB net: Xroot (BaBar)
- 130 TB net: GPFS
- 230 TB net: still unassigned (used for tests in these weeks)
- NFS: used mainly for accessing experiment software, strongly discouraged for data access (Virgo) and currently under migration (to Castor v.2 and GPFS)

20 GPFS Implementation
The idea of GPFS is to provide a fast and reliable (NSPF) pure disk-pool storage with direct access (POSIX file protocol) from the Worker Nodes farm, with an SRM interface (StoRM). One single "big filesystem" for each VO could be possible (strongly preferred by users).
Further step: creation of a single (or, in the future, multiple) cluster containing all the Worker Nodes and the ~40 NSD GPFS diskservers. Before, only the front-ends (i.e. the storage elements) accessed the GPFS cluster and the WNs used to copy the data locally; now all the WNs can access the GPFS filesystem directly (a sketch of the cluster setup follows).
Test of the whole system using part of the storage hardware infrastructure (24 dedicated NSD diskservers running SL and 2 EMC CX-380 storage arrays, 230 TB in total), locally and remotely with dedicated farm queues (280 WNs distributed in ~8 racks, for a total of 1100 LSF queue slots).
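A rough sketch of how such a cluster could be built and extended to the worker nodes (again with GPFS 3.x-era commands); node names, node designations and the cluster name are illustrative assumptions, not the actual CNAF node list.

#!/bin/bash
# Hedged sketch: the NSD diskservers form the cluster core and the worker
# nodes are added afterwards as plain client nodes.

# Create the cluster from the NSD diskservers (quorum/manager nodes).
cat > /tmp/nsd_nodes.list <<'EOF'
diskserv01:quorum-manager
diskserv02:quorum-manager
diskserv03:quorum
EOF
mmcrcluster -N /tmp/nsd_nodes.list -p diskserv01 -s diskserv02 \
            -r /usr/bin/ssh -R /usr/bin/scp -C gpfs_tier1_cluster

# Add the worker nodes as non-quorum client nodes.
cat > /tmp/wn_nodes.list <<'EOF'
wn001
wn002
wn003
EOF
mmaddnode -N /tmp/wn_nodes.list

# Start GPFS everywhere and mount the filesystem(s) on all nodes.
mmstartup -a
mmmount all -a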

21 Test Layout
[Diagram: layout of the local and remote test setup]

22 Test Phase (Local)
Local access using the 24 diskservers (actually 23), comparing:
- XFS locally mounted filesystems
- the GPFS cluster: one single "200 TB filesystem"
Test using the Linux command "dd" from memory (/dev/zero), with a block size of 1024k and 12 GB per thread (dd processes in background), equally distributed over the diskservers (see the sketch below).
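A minimal sketch of the local write test, assuming a hypothetical GPFS mount point and leaving the read variant and the distribution over diskservers aside:

#!/bin/bash
# Hedged sketch of the local write test described above: background dd
# processes, each writing 12 GB from /dev/zero with a 1024k block size.
MOUNTPOINT=/gpfs/tier1          # hypothetical GPFS mount point
BS=1024k
COUNT=12288                     # 12288 x 1 MiB = 12 GiB per thread
HOST=$(hostname -s)

NTHREADS=${1:-1}                # dd processes started on this diskserver
for i in $(seq 1 "$NTHREADS"); do
    dd if=/dev/zero of="$MOUNTPOINT/ddtest_${HOST}_$i" bs=$BS count=$COUNT &
done
wait                            # wait for all background dd processes

# Aggregate throughput is then read from the monitoring (Lemon/Nagios/RRD)
# rather than from dd itself.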

23 Test Results (Local)
[Plots: local test results]

24 Test Results (Local): Comments
GPFS uses parallel I/O, so the maximum bandwidth (plateau) is reached with a very limited number of threads (1 dd process per diskserver is enough). In general GPFS works better when reading. When writing, all the diskservers must communicate with each other to maintain synchronization; this generates "background" traffic that can limit write throughput. Anyway, the disk array controller limit is still reached at our site (the disk is the bottleneck).

25 Test Phase (Remote)
Remote access using dedicated farm nodes (dedicated slots in the LSF batch system), comparing:
- Castor (RFIO over XFS filesystems), disk pool only (no tape backend)
- the GPFS cluster: one single "200 TB filesystem"
Test using C-coded "dd"-like commands submitted as farm jobs (5 GB files, bs=64k, ~1000 jobs; see the sketch below). We were interested in the reliability of GPFS and in an overall performance comparison between our two production disk storage pool systems.
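A minimal sketch of how the remote test jobs could be submitted through LSF; the queue name and the mount point are assumptions, and plain dd stands in here for the C-coded "dd"-like programs actually used.

#!/bin/bash
# Hedged sketch of the remote test submission: ~1000 LSF jobs, each writing
# and then reading back a 5 GB file with a 64k block size on the GPFS mount.
QUEUE=gpfs_test                 # hypothetical dedicated test queue
MOUNTPOINT=/gpfs/tier1          # hypothetical GPFS mount point
NJOBS=1000

for i in $(seq 1 "$NJOBS"); do
    bsub -q "$QUEUE" -J "gpfstest_$i" -o /dev/null <<EOF
dd if=/dev/zero of=$MOUNTPOINT/remote_test_$i bs=64k count=81920   # write 5 GiB
dd if=$MOUNTPOINT/remote_test_$i of=/dev/null bs=64k               # read it back
rm -f $MOUNTPOINT/remote_test_$i
EOF
done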

26 Test Results (Remote): Castor
[Plots: Castor remote read and Castor remote write throughput]
Some job failures (10%) were detected when reading, probably due mainly to Castor2 queue timeouts; when writing, the efficiency was higher (98%). The aggregate network statistics reported on the uplink connections show an identical trend plus a ~10% overhead.

27 Test Results (Remote): GPFS
[Plots: GPFS remote write and GPFS remote read throughput]
Overall efficiency 98% (2% of the jobs failed for multiple reasons: WN problems, the NFS area with the executables being down, etc.). The aggregate network statistics reported on the uplink connections show an identical trend plus a ~10% overhead.

28 Conclusion
- The GPFS cluster is working fine in a single "big cluster" implementation (all WNs in the cluster)
- Tests show that the theoretical hardware bandwidth (the limit set by the disk array controllers) can be saturated both locally and remotely with the GPFS cluster
- Remote comparisons with Castor v.2 show that jobs writing to and reading from GPFS are the fastest (1200 MB/s vs 850 MB/s writing, 1500 MB/s vs 1300 MB/s reading). This could prove very useful in some I/O-critical activities (e.g. critical data transfers or analysis jobs)
- Reliability is also improved, since a GPFS cluster is very close to an NSPF system (while failures of a Castor diskserver node put the corresponding filesystems offline, so part of the diskpool becomes inaccessible)
- GPFS administration is also simpler compared to Castor (no Oracle, no LSF, intuitive admin commands, etc.)

29 Abstract
Title: Experience with Fabric Storage Area Network and HSM Software at the Tier1 INFN CNAF
Abstract: This paper is a report from the INFN Tier1 (CNAF) about the storage solutions we have implemented over the last few years of activity. In particular we describe the current Castor v.2 installation at our site, the HSM (Hierarchical Storage Manager) software chosen as a (low cost) tape storage archiving solution. Besides Castor, we also have in production a large GPFS cluster relying on a Storage Area Network (SAN) infrastructure to obtain a fast, disk-only solution for the users. In this paper, summarizing our experience with these two storage system solutions, we focus on the management and monitoring tools implemented and on the technical solutions needed to improve the reliability of the whole system.

