
1 Storage at TIER1 CNAF Workshop Storage INFN CNAF 20/21 March 2006 Bologna Ricci Pier Paolo, on behalf of INFN TIER1 Staff pierpaolo.ricci@cnaf.infn.it

2 Contents
- Disk SAN hardware/software status and summary
- CASTOR and CASTOR2 hardware/software status and summary
- Tools for monitoring and accounting

3 Hardware Status
Disk: FC, IDE, SCSI, NAS technologies
- 470 TB raw (~450 FC-SATA)
- 2005 tender: 200 TB raw (~2260 Euro/TB net + VAT)
- Additional 20% of the last tender acquisition requested
- Tender for 400 TB (not before Fall 2006)
Tape libraries:
- STK L180: 18 TB (only used for backups)
- STK L5500: 6 LTO-2 drives with 1200 tapes → 240 TB; 4 9940B drives (+3 to be installed in the next weeks) with 680 + 650 tapes → 130 TB (260 TB once all tapes are loaded); 1.5 kEuro/TB initial cost → 0.35 kEuro/TB pure tape cost (a quick check of these capacity figures follows this slide)
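The 240 TB and 130 TB (260 TB) figures above follow directly from the tape counts at the 200 GB native cartridge capacity quoted later in the talk; a trivial check in Python (the rounding to 130/260 is the slide's own):

```python
# Quick check of the tape-library capacity figures on this slide,
# using 200 GB native capacity per cartridge (LTO-2 and 9940B alike).
TB_PER_TAPE = 0.2

lto2_tapes = 1200
tapes_9940b_now, tapes_9940b_extra = 680, 650

print(f"LTO-2: {lto2_tapes * TB_PER_TAPE:.0f} TB")            # 240 TB
print(f"9940B: {tapes_9940b_now * TB_PER_TAPE:.0f} TB now, "
      f"{(tapes_9940b_now + tapes_9940b_extra) * TB_PER_TAPE:.0f} TB with the extra tapes")
# ~136 TB and ~266 TB, rounded down to 130 and 260 on the slide
```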

4 TIER1 INFN CNAF Storage (overview diagram)
Clients and services: Linux SL 3.0 client nodes (100-1000), reached through the WAN or the TIER1 LAN; diskservers with Qlogic 2340 FC HBAs exporting NFS, RFIO, GridFTP and other protocols; CASTOR HSM servers in H.A.; W2003 server with LEGATO Networker (backup).
HSM / tape (~400 TB): STK L180 with 100 LTO-1 (10 TB native); STK L5500 robot (5500 slots) with 6 IBM LTO-2 and 4 STK 9940B drives.
NAS (~20 TB, NFS): PROCOM 3600 FC NAS2 (7000 GB); PROCOM 3600 FC NAS3 (7000 GB); NAS1, NAS4 3ware IDE SAS (1800 + 3200 GB).
SAN 1 (400 TB raw), Brocade fabric (2 Silkworm 3900 32-port FC switches, 1 Director 24000 64-port FC): IBM FastT900 (DS 4500), 3/4 x 50000 GB, 4 FC interfaces; STK FlexLine 600, 4 x 50000 GB, 4 FC interfaces; Infortrend A16F-R1211-M2 + JBOD, 5 x 6400 GB SATA.
SAN 2 (40 TB), 2 Gadzoox Slingshot 4218 18-port FC switches: AXUS BROWIE, about 2200 GB, 2 FC interfaces; STK BladeStore, about 25000 GB, 4 FC interfaces; Infortrend A16F-R1A2-M1, 4 x 3200 GB SATA.

5 STK FlexLine 600 in production
The GPFS, dCache and CASTOR2 tests use 2 of the 4 FlexLine 600 (the 1st and 2nd) with a total of 100 TB RAW. The other 2 are in production (CDF, BABAR):
- unsupported IBM GPFS, in agreement with CDF
- Xrootd over an xfs filesystem for BABAR
We have asked INFN for the 6th/5th upgrade of the tender (48 TB RAW and a fifth controller that could be expanded if needed). Investigation of a major failure (suspected loss of a 4 TB RAID5) is under way.
Hardware: 16 diskservers with dual Qlogic FC HBA 2340, Sun Fire U20Z dual Opteron 2.6 GHz, 4 x 1 GB DDR 400 MHz RAM, 2 x 73 GB 10K SCSI U320 disks in RAID1. Brocade Director FC switch (fully licensed) with 64 ports (out of 128) in 4 x 16-port "blades". 4 FlexLine 600 with 200 TB RAW (150 TB) in RAID5 8+1, 4 x 2 Gb redundant connections to the switch.

6 DISK access (50 TB SAN unit with 4 diskservers)
- Generic diskserver: 1U, 2 Qlogic 2300 HBAs, Linux CERN SL 3.0; 2 x 2 Gb FC connections per diskserver, each going to a different switch or 24000 blade; Gb Ethernet connections towards the WAN or TIER1 LAN for nfs, rfio, xrootd, GPFS and GRID ftp.
- Storage: 50 TB IBM FastT900 (DS 4500) with dual redundant controllers (A, B) and internal mini-hubs (1, 2), 2 Gb FC connections; RAID5 arrays exported as 2 TB logical disks (LUN0 => /dev/sda, LUN1 => /dev/sdb, ...); 4 diskservers every ~50 TB; one controller can perform a maximum of 120/200 MByte/s read-write.
- FC path failover HA: Qlogic SANsurfer, zoned FC SAN.
- Application HA: NFS server and rfio server with Red Hat Cluster AS 3.0 (*); GPFS with NSD primary/secondary configuration, e.g. /dev/sda primary diskserver 1, secondary diskserver 2; /dev/sdb primary diskserver 2, secondary diskserver 3; ... (a minimal sketch of this assignment follows this slide).
- Clients: farms of rack-mountable 1U biprocessor nodes (currently about 1000 nodes for 1300 kSpecInt2000).
(*) tested but not actually used in production
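A minimal sketch of the primary/secondary NSD-server pattern listed above (plain Python, not GPFS descriptor syntax): each ~2 TB LUN gets the next diskserver as primary and the following one, wrapping around, as secondary, so no single server failure makes a disk unreachable. The host and device names are illustrative.

```python
# Sketch of the LUN -> (primary, secondary) diskserver assignment described
# on the slide above. Not GPFS configuration syntax; names are illustrative.

diskservers = ["diskserver1", "diskserver2", "diskserver3", "diskserver4"]
luns = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]   # ~2 TB logical disks

def nsd_assignment(luns, servers):
    """Round-robin with wrap-around: LUN i -> primary server i, secondary server i+1."""
    table = []
    for i, lun in enumerate(luns):
        primary = servers[i % len(servers)]
        secondary = servers[(i + 1) % len(servers)]
        table.append((lun, primary, secondary))
    return table

for lun, primary, secondary in nsd_assignment(luns, diskservers):
    print(f"{lun}: primary {primary}; secondary {secondary}")
```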

7 TIER1 CNAF SAN Disk Storage
CONSOLIDATION OF THE PRIMARY SAN (400 TB RAW) CONCLUDED. Hardware based on:
- Brocade switches: the SAN is one single fabric, managed with a single management web tool and with the Fabric Manager software for failure and performance monitoring.
  - Director 24000 with 64 2 Gb/s ports (out of 128); the tender price was 1.1 kEuro/port, while the price/port of Brocade lower-class switches would be at least 50% lower.
  - 2 x Silkworm 3900 with 32 2 Gb/s ports each (currently not on the market), each connected with a 2 x 2 Gb/s trunked uplink.
- Qlogic QLA2340 HBAs: HA failover implemented with the SANsurfer configuration.
Disk storage:
- 4 x IBM FastT900 (DS 4500), 4 x 2 Gb/s output per box, 170 TB => 14 primary diskservers with a single HBA
- 4 x FlexLine 600 (4 x 2 Gb/s), 200 TB => 16 primary diskservers with double HBAs
- 5 x Infortrend A16F-R1211-M2 (2 x 2 Gb/s) + JBOD, 30 TB => 5 primary diskservers with a single HBA
About 6-12 TB RAW accessed by one diskserver, depending on the filesystem/protocol, could be enough (see the sketch after this slide). Other diskservers (4-8) access the SAN storage for specific uses (grid SE, Oracle RAC, etc.).
Fibre Channel physical connections, failover and zoning are configured in the simplest way; traffic from the diskservers remains in the local switch in most cases, so uplink usage is minimized.
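The 6-12 TB per diskserver figure can be read straight off the capacities and server counts listed above; a quick illustrative calculation:

```python
# Per-server RAW capacity implied by the figures on this slide
# (TB RAW, number of primary diskservers).
storage = {
    "IBM FastT900 (DS 4500)":          (170, 14),
    "STK FlexLine 600":                (200, 16),
    "Infortrend A16F-R1211-M2 + JBOD": (30, 5),
}

for box, (tb_raw, n_servers) in storage.items():
    print(f"{box}: {tb_raw / n_servers:.1f} TB RAW per primary diskserver")
# about 12.1, 12.5 and 6.0 TB per server, i.e. right around the 6-12 TB range
```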

8 SAN Monitoring & Web Tools
Screenshots: Fabric Manager software and the web tool management interface (single SAN).

9 SAN Expansion
(Current core: Silkworm 3900 switches with 32 2 Gb/s ports, 2 x 2 Gb/s trunked uplinks.)
Scenario 1) Peripheral switches based on the Brocade Silkworm 4100 family (lower class than the Director): lower price/port compared to the fabric switches (24000 or 48000); they are fully compatible but not fully redundant.
Scenario 2) Include a new fabric Director, a SilkWorm 48000 (expandable up to 256 4 Gb/s ports), in one of the next tenders and provide a DUAL FABRIC SAN (logically/physically divided SAN) for the best redundancy. In the following years the SAN would then be expanded by filling up the 2 Directors and deploying low-cost peripheral switches around these two central core Directors. The price/port of the Directors could be a factor of 2-3 higher than that of the Silkworm 4100 family.

10 Disk storage summary
- Main storage (IBM FastT900, STK FLX680) organized in one fabric Storage Area Network (3 Brocade switches, star topology)
- Level-1 disk servers connected via FC, usually in a GPFS cluster: ease of administration, load balancing and redundancy
- Some level-2 disk servers connected to the storage only via GPFS (over IP): LCG and FC dependencies on the OS are decoupled
- WNs are not members of the GPFS cluster (but scalability to a large number of WNs is currently under investigation)
- Supported protocols: rfio, gridftp, xrootd (BaBar), NFS, AFS
  - NFS used mainly for accessing experiment software; strongly discouraged for data access
  - AFS used only by CDF for accessing experiment software
- We had good experience with HA for diskserver services (Red Hat Cluster 2.1 and 3.0) but hardware compatibility problems (for fencing nodes). We plan to upgrade to and test the latest 3.0 release and to evaluate 4.0.

11 Castor HSM Status
At present our CASTOR (1.7.1.5) system consists of:
- 1 STK L5500 silo, partitioned with 2 form-factor slots: about 2000 LTO-2 slots and about 3500 9940B slots
- 6 LTO-2 drives with 2 Gb/s FC interface
- 4 9940B drives with 2 Gb/s FC interface (3 more in installation for the Service Challenge requirement)
- Sun Blade V100 with 2 internal IDE disks in software RAID-0, running ACSLS 7.0
- 1300 LTO-2 tapes (240 TB)
- 650 + 700 = 1350 9940B tapes (250 TB)
THE SILO CANNOT SUPPORT THE NEXT-GENERATION T10000 DRIVE (500 GB)

12 Castor Status (2)
10 tapeservers: 1U Supermicro, 3 GHz, 2 GB RAM, 1 Qlogic 2300 FC HBA, running the STK CSC Development Toolkit provided by CERN (under a licence agreement with STK): ssi, tpdaemon and rtcpd.
The tapeservers are connected directly to the FC output of the drives:
DRIVE LTO-2 0,0,10,0 -> tapesrv-0.cnaf.infn.it
DRIVE LTO-2 0,0,10,1 -> tapesrv-1.cnaf.infn.it
DRIVE LTO-2 0,0,10,2 -> tapesrv-2.cnaf.infn.it
DRIVE LTO-2 0,0,10,3 -> tapesrv-3.cnaf.infn.it
DRIVE LTO-2 0,0,10,4 -> tapesrv-4.cnaf.infn.it
DRIVE LTO-2 0,0,10,5 -> tapesrv-5.cnaf.infn.it
DRIVE 9940B 0,0,10,6 -> tapesrv-6.cnaf.infn.it
DRIVE 9940B 0,0,10,7 -> tapesrv-7.cnaf.infn.it
DRIVE 9940B 0,0,10,8 -> tapesrv-7.cnaf.infn.it
DRIVE 9940B 0,0,10,9 -> tapesrv-7.cnaf.infn.it

13 Castor Status (3)
castor.cnaf.infn.it (central machine): 1 IBM x345 2U machine, 2 x 3 GHz Intel Xeon, RAID1, double power supply, O.S. Red Hat A.S. 3.0. It runs all the central CASTOR 1.7.1.5 services (Nsdaemon, vmgrdaemon, Cupvdaemon, vdqmdaemon, msgdaemon) and the ORACLE client for the central database.
castor-4.cnaf.infn.it (ORACLE machine): 1 IBM x345, O.S. Red Hat A.S. 3.0, running ORACLE DATABASE 9i rel 2.
One more x345 machine is in standby; it stores all the backup information of the ORACLE db (.exp, .dbf) and can replace the above machines if needed. HA on the central-services or ORACLE machine is not yet implemented (not needed so far...).
castor-1.cnaf.infn.it (monitoring machine): 1 DELL 1650, R.H. 7.2, running the CASTOR monitoring service (Cmon daemon) and the NAGIOS central service for monitoring and notification. It also hosts the rtstat and tpstat commands, usually run with the -S option against the tapeservers.

14 Castor Status (4)
Stagers with diskserver: 1U Supermicro, 3 GHz, 2 GB RAM, 1 Qlogic 2300 FC HBA accessing our SAN and running Cdbdaemon, stgdaemon and rfiod. One stager for each LHC experiment and 2 generic stagers:
- disksrv-1.cnaf.infn.it: ATLAS stager with 2 TB directly connected
- disksrv-2.cnaf.infn.it: CMS stager with 3.2 TB directly connected
- disksrv-3.cnaf.infn.it: LHCb stager with 3.2 TB directly connected
- disksrv-4.cnaf.infn.it: ALICE stager with 3.2 TB directly connected
- disksrv-5.cnaf.infn.it: TEST, PAMELA, ARGO stager
- disksrv-6.cnaf.infn.it: stager with 2 TB local (archive purposes: LVD, ALICE TOF, CDF, VIRGO, AMS, BABAR and other HEP experiments...)
Diskservers: 1U Supermicro, 3 GHz, 2 GB RAM, 1 Qlogic 2300 FC HBA accessing our SAN and running rfiod.

15 Castor Status
- CASTOR-2 has been fully installed and is currently under test.
- CASTOR-1 (production) and CASTOR-2 (test) can share the same resources and are currently "living together" in our software implementation.
- The SC will run on the CASTOR-2 environment, but production activities will remain on the CASTOR-1 services. After the SC, CASTOR-2 will be used in production and the CASTOR-1 services will be dismissed.
- LTO-2 technology drives are not usable in a real production environment with the present CASTOR release; they are used only for archiving copies of disk data, or with a big staging (disk buffer) area, which reduces tape access almost to zero.
- In 1.5 years of activity, USING THE 9940B has drastically reduced the error rate (only 1-3% of 9940 tapes marked RDONLY due to SCSI errors) with negligible hang problems. The HW problems were solved using 9940B technology drives; 3 more will be installed before the next SC phase (total of 7 9940B drives and 6 LTO-2).
- The CASTOR development team at CERN is currently suffering a critical lack of manpower. Support is granted ONLY to TIER1s; TIER2s won't be considered.

16 Castor Status (overview diagram)
Tape library: STK L5500 (2000 + 3500 slots), 6 LTO-2 drives (20-30 MB/s), 4 9940B drives (25-30 MB/s); 1300 LTO-2 tapes (200 GB native) and 1350 9940B tapes (200 GB native). Total capacity at 200 GB/tape: 250 TB LTO-2 (400 TB if all slots are filled), 260 TB 9940B (700 TB if all slots are filled). Point-to-point FC 2 Gb/s connections between drives and tapeservers.
Servers: Sun Blade V100 with 2 internal IDE disks in software RAID-1 running ACSLS 7.0 (OS Solaris 9.0); 1 CASTOR (CERN) central-services server, RH AS 3.0; 10 tapeservers, Linux RH AS 3.0, Qlogic 2300 HBA; 1 ORACLE 9i rel 2 DB server, RH AS 3.0; 6 stagers with diskserver, 15 TB local staging area; rfio diskservers, RH 3.0, staging area (variable) on SAN 1 / SAN 2; access from the WAN or TIER1 LAN.
(The diagram marks fully redundant FC 2 Gb/s connections: dual-controller HW and Qlogic SANsurfer path failover SW.)

EXPERIMENT     Staging area (TB)   Tape pool (TB)          % RDONLY
ALICE          9                   12 (LTO-2)              8%
ATLAS          20                  37 (9940) + 8 (LTO-2)   2% / 30%
CMS            12                  22 (9940)               0%
LHCb           18                  43 (LTO-2)              10%
BABAR (copy)   8                   20 (LTO-2)              2%
CDF (copy)     2                   9 (LTO-2)               5%
AMS            3                   5 (9940)                0
ARGO+oth       2                   8 (9940)                1%

17 Castor Status
Storage Element front-ends for CASTOR:
- castorgrid.cr.cnaf.infn.it (DNS alias load balanced over 4 machines for WAN gridftp)
- sc.cr.cnaf.infn.it (DNS alias load balanced over 8 machines for SC WAN gridftp with a dedicated link)
SRM1 is installed and in production on the above machines.
CASTOR2 stager installation (NOT YET IN PRODUCTION):
- castor-6 (HW HA): STAGER + Request Handler + MigHunter + rtcpclientd
- oracle01 (HW HA): STAGER DB
- castorlsf01 (HW HA): LSF MASTER
- diskserv-san-13: DLF + DLF DB + RMMASTER + EXPERT

18 CASTOR Grid SE
GridFTP access goes through the castorgrid SE, a DNS cname pointing to 4 servers, with DNS round-robin for load balancing.
During LCG Service Challenge 2 we also introduced a load-average-based selection: every M minutes the IP of the most loaded server is replaced in the cname (see graph).
This method worked well; it is still used in production and will be used in the next SC phases.
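A minimal sketch of this load-based selection, assuming a periodic job that reads each node's 1-minute load average through some agent and then regenerates the round-robin record without the most loaded host (one plausible reading of "replaced in the cname"). The hostnames, the 5-minute period and the two stub functions are illustrative, not the actual CNAF implementation:

```python
# Sketch of the load-based DNS selection described above; not the actual
# CNAF tool. get_load_average() and update_round_robin() are stubs standing
# in for whatever agent/DNS-update mechanism is really used; hostnames and
# the 5-minute period are arbitrary examples.

import time

GRIDFTP_NODES = ["gridftp-1", "gridftp-2", "gridftp-3", "gridftp-4"]  # hypothetical names
PERIOD_MINUTES = 5   # the "M minutes" of the slide, value chosen arbitrarily

def get_load_average(host):
    """Stub: return the 1-minute load average of `host` (e.g. queried via an agent)."""
    raise NotImplementedError

def update_round_robin(alias, hosts):
    """Stub: regenerate the DNS round-robin record for `alias` with `hosts`."""
    raise NotImplementedError

def rebalance_once(alias="castorgrid"):
    loads = {h: get_load_average(h) for h in GRIDFTP_NODES}
    most_loaded = max(loads, key=loads.get)
    # keep every server except the currently most loaded one in the alias
    update_round_robin(alias, [h for h in GRIDFTP_NODES if h != most_loaded])

if __name__ == "__main__":
    while True:
        rebalance_once()
        time.sleep(PERIOD_MINUTES * 60)
```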

19 Monitoring (Nagios)
Example graphs: LHCb CASTOR tape pool; number of processes on a CMS disk SE; eth0 traffic through a CASTOR LCG SE.
Other parameters, such as overall I/O performance, status of the RAID systems and space occupation on the disks, are also constantly monitored.
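As an illustration of the kind of check behind the "space occupation on disks" item, here is a minimal Nagios-style plugin sketch (Nagios plugins report OK/WARNING/CRITICAL/UNKNOWN through exit codes 0/1/2/3). The mount point and thresholds are hypothetical; this is not one of the actual CNAF plugins:

```python
#!/usr/bin/env python3
# Illustrative Nagios-style check for disk space occupation, in the spirit of
# the checks mentioned above; not one of the actual CNAF plugins. The mount
# point and thresholds are arbitrary examples.

import shutil
import sys

MOUNT_POINT = "/castor/stage"   # hypothetical staging-area mount point
WARN_PCT = 80.0
CRIT_PCT = 90.0

def main():
    try:
        usage = shutil.disk_usage(MOUNT_POINT)
    except OSError as exc:
        print(f"DISK UNKNOWN - cannot stat {MOUNT_POINT}: {exc}")
        return 3
    used_pct = 100.0 * usage.used / usage.total
    msg = f"{MOUNT_POINT} {used_pct:.1f}% used ({usage.free / 1e12:.2f} TB free)"
    if used_pct >= CRIT_PCT:
        print(f"DISK CRITICAL - {msg}")
        return 2
    if used_pct >= WARN_PCT:
        print(f"DISK WARNING - {msg}")
        return 1
    print(f"DISK OK - {msg}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```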

20 Disk Accounting
Charts: pure disk space (TB) and CASTOR disk space (TB).

