
1 CASTOR2@CNAF: Status and plans. Dejan Vitlacil, on behalf of Tier1 Storage, June '08. Istituto Nazionale di Fisica Nucleare, Centro Nazionale per la Ricerca e Sviluppo nelle Tecnologie Informatiche e Telematiche

2 Overview
- Storage & CNAF
- Setup
- History
- Future
- SRM
- Top 5

3 Storage classes (Storage & CNAF)
Implementation of the 3 (2) storage classes needed for LHC:
- Disk0Tape1 (D0T1) -> CASTOR
- Disk1Tape0 (D1T0) -> GPFS/StoRM
- Disk1Tape1 (D1T1) -> GPFS/StoRM
(see the sketch below)
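
For illustration only, a minimal sketch of how these storage-class flags translate into disk/tape copy counts and the backend serving each class at CNAF. The dictionary layout and function name are hypothetical, not actual site configuration:

```python
# Hypothetical sketch: LHC storage classes as (disk copies, tape copies)
# plus the backend that implements each class at CNAF (from the slide).

STORAGE_CLASSES = {
    # class:  (disk copies, tape copies, backend)
    "D0T1": (0, 1, "CASTOR"),      # tape-backed; disk is only a cache
    "D1T0": (1, 0, "GPFS/StoRM"),  # permanent disk, no tape copy
    "D1T1": (1, 1, "GPFS/StoRM"),  # permanent disk plus a tape copy
}

def backend_for(storage_class: str) -> str:
    """Return the backend that implements a given storage class."""
    _disk, _tape, backend = STORAGE_CLASSES[storage_class]
    return backend

if __name__ == "__main__":
    for sc, (disk, tape, backend) in STORAGE_CLASSES.items():
        print(f"{sc}: disk={disk} tape={tape} -> {backend}")
```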

4 CASTOR2 production instance (Setup)
- 4+1 core servers (stager, LSF, DLF, name server), all SLC3 32-bit, plus 1 temporary SLC4 machine
- Name server and stager DBs on Oracle RAC (SRM v2.2 DB and Repack on single instances)
- 1 server for ACSLS (library management)
- 1 library: STK L5500 (1 PB on-line)
- 15 tape servers (10 9940B drives, 6 LTO2 drives)
- 6 SRM v1.1 servers
- 6 SRM v2.2 servers
- 40 disk servers, all XFS (nightly defragmentation), SLC4 32- and 64-bit
- 2 xrootd servers (peer and manager)
Tools, monitoring and alarms:
- Quattor for configuration and installation
- LEMON for fabric monitoring (want to try LAS)
- Home-made monitoring tools for some CASTOR parts
- NAGIOS for proactive alarms: e-mail notification (SMS in progress), exclusion of problematic servers from service (via DNS alias; see the sketch below)
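
As a sketch of the "exclusion via DNS alias" idea above: host names, the port choice, and the alias-update step are assumptions, not the actual CNAF tooling; the health check here is just a TCP probe:

```python
# Hypothetical sketch of excluding unhealthy disk servers from a
# DNS-balanced alias. Host names are illustrative; the actual alias
# update mechanism used at CNAF is not shown here.
import socket

DISK_SERVERS = ["diskserv-01.example.cnaf.infn.it",
                "diskserv-02.example.cnaf.infn.it"]  # invented names
RFIO_PORT = 5001  # rfiod's usual port (assumption)

def is_alive(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

healthy = [h for h in DISK_SERVERS if is_alive(h, RFIO_PORT)]
excluded = sorted(set(DISK_SERVERS) - set(healthy))
print("keep in alias:", healthy)
print("remove from alias:", excluded)  # fed to the DNS update (assumed)
```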

5 CASTOR setup (1): SAN (Setup)
- 40 disk servers attached to a SAN; fully redundant FC 2 Gb/s or 4 Gb/s (latest) connections (dual-controller HW, and Qlogic SANsurfer Path Failover SW or vendor-specific software)
- STK FlexLine 600, IBM FAStT900, EMC Clariion, ...
- Core services run on machines with SCSI disks, hardware RAID-1 and redundant power supplies; tape servers and disk servers have lower-level hardware, like WNs
- 15 tape servers
- 1 Sun Blade V100 (2 internal IDE disks in software RAID-1) running ACSLS 7.0 on Solaris 9.0
- STK L5500 silos: 5500 slots, partitioned with 2 form-factor slots (about 2000 for LTO2 and 3500 for 9940B), 200 GB cartridges, total capacity ~1.1 PB uncompressed (arithmetic checked below)
- 6 LTO2 + 10 9940B drives, 2 Gbit/s FC interface, 20-30 MB/s rate
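
The quoted ~1.1 PB is just slot count times cartridge capacity; a quick check, assuming (as stated) 200 GB per cartridge for both media types:

```python
# Reproduce the ~1.1 PB figure: 5500 slots at 200 GB per cartridge.
lto2_slots, slots_9940b = 2000, 3500
cartridge_gb = 200
total_tb = (lto2_slots + slots_9940b) * cartridge_gb / 1000
print(f"{total_tb / 1000:.2f} PB uncompressed")  # -> 1.10 PB
```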

6 CASTOR setup (2) (Setup)
CASTOR core services v2.1.6-12 on 5 machines:
- castor-6: rhserver, stager, rtcpclientd, cleaningDaemon, MigHunter, RecHunter, expert
- castorlsf01: LSF master, rmmaster
- dlf01: dlfserver, Cmonit, Oracle for DLF
- castor-8: nsdaemon, vmgr, vdqm, msgd, cupvd
- castor-4: MigHunter, on SLC4
Databases:
- Name server Oracle DB (Oracle RAC)
- Stager Oracle DB (Oracle RAC)
- SRM v2.2 Oracle DB (single instance)
- Repack Oracle DB (single instance)
SRM endpoints (reachability sketch below):
- 1 SRM v1 endpoint, DNS-balanced: srm://castorsrm.cr.cnaf.infn.it:8443 (used for disk pools with tape backend). We hope to dismiss it soon!
- 2 SRM v2.2 endpoints: srm://srm-v2.cr.cnaf.infn.it:8443 and srm://srm-cms-v2.cr.cnaf.infn.it:8443
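
A minimal reachability probe for the endpoints listed above: a plain TLS handshake on port 8443, which only shows the port answers, not that the SRM service itself is healthy. The endpoint list comes from the slide; everything else is a sketch:

```python
# Minimal sketch: check that the SRM endpoints accept TLS connections
# on port 8443. Proves reachability only, not SRM correctness.
import socket
import ssl

ENDPOINTS = [
    "castorsrm.cr.cnaf.infn.it",   # SRM v1.1 (to be dismissed)
    "srm-v2.cr.cnaf.infn.it",      # SRM v2.2
    "srm-cms-v2.cr.cnaf.infn.it",  # SRM v2.2 (CMS)
]

ctx = ssl.create_default_context()
ctx.check_hostname = False        # grid hosts use IGTF CAs, usually
ctx.verify_mode = ssl.CERT_NONE   # absent from the default CA bundle

for host in ENDPOINTS:
    try:
        with socket.create_connection((host, 8443), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                print(f"{host}:8443 answers TLS")
    except OSError as exc:
        print(f"{host}:8443 unreachable: {exc}")
```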

7 CASTOR setup (3) (Setup)
This will probably change as soon as the new storage arrives: 1.6 PB (not all of it for CASTOR).

8 CASTOR blues (History)
Upgrade from v2.1.1-9 to v2.1.3-15 (July 9-11 '07):
- New LSF plugin: load on the Oracle DB lowered. Previously we had frequent rmmaster meltdowns with # PEND jobs > 1000; scheduler stability increased (new job limit 30000)
- Good performance
Upgrade to v2.1.3-24 (September 19-20 '07):
- D1T0 pool management still problematic: disks with no space left were not handled properly (requests still queued); nearly impossible for the user to delete files from disk; state not synchronized between stager, name server and disk servers
- Tape recall not yet efficient enough
- Still needed visible workarounds: manual load distribution on disk servers; non-negligible administrative effort for DB clean-up
- Repack not yet supported

9 CASTOR blues (2) (History)
Upgrade to v2.1.4-10-2 (November 14-15 '07):
- Hotfix requested after installation, due to FTS (dCache SRM)
- Still problems with Repack, but with dev. team support it was usable
- Cleaning script and other workaround scripts used heavily
Upgrade to v2.1.6-12 (May 05-06 '08):
- Much more stable instance
- Much more usable Repack
- SLC4 was necessary for migration and streaming policies
- Great step: integration of the workarounds (now it is the turn of the Op. Team scripts :-)). Great appreciation for your work!
- CCRC went well, and CASTOR did its part very well at CNAF
- Some experiment-specific tuning is still required (perfection needed)

10 Our next CASTOR steps (Future)
- Get rid of SRM v1.1 as soon as possible
- Move core services to new SLC4 machines
- New library (SUN has been on site since 13.06.2008): 8 tape drives (T10000, 500 GB); 8 tape servers (we will probably try 64-bit SLC4); 7000 slots (expandable to 10 000); another 12 drives (1 TB) in November
- Try to integrate the CASTOR Lemon sensor
- Try Lemon LAS
- Develop some more tools to keep CASTOR under control
- New storage, 1.6 PB, should arrive soon (end of July)

11 SRM at CNAF (SRM)
- We'd like to quit using SRM v1.1 by the end of June (very difficult, but likely)
- We are still running 1.3-21
- Some problems: DB deadlocks; BE crashes
- Great help from Shaun "GodBlessHim" DeWitt
- We'd like to implement a solution with multiple BEs as soon as possible (see the failover sketch below)
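
To make the "multiple BEs" wish concrete, a minimal failover-dispatch sketch: the back-end names and the request callable are hypothetical, and this is not the actual SRM implementation, just the shape of the idea:

```python
# Illustrative failover sketch for the "multiple back-ends" idea:
# try each BE in turn and fall through on failure.
from typing import Callable, Sequence

def dispatch(backends: Sequence[str],
             send: Callable[[str], str]) -> str:
    """Send a request to the first back-end that answers."""
    last_error = None
    for be in backends:
        try:
            return send(be)
        except ConnectionError as exc:  # a crashed BE raises here
            last_error = exc
    raise RuntimeError(f"all back-ends failed: {last_error}")

if __name__ == "__main__":
    def fake_send(be: str) -> str:
        if be == "be1.example":         # simulate a crashed BE
            raise ConnectionError("down")
        return f"handled by {be}"
    print(dispatch(["be1.example", "be2.example"], fake_send))
```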

12 Top 5 (or what bothers us)
- Stream and migration control: total control, we want to pilot and control everything
- GC policy: it seems like a step backwards (the sysadmin loses control), and it is very expensive for the system (a possible weight-function shape is sketched below)
- CASTOR monitoring tool (web-based): maybe with a common effort; the Lemon sensor for CASTOR is CC
- CASTOR workarounds control: we all need to run the "same" CASTOR
- Clean up what is not necessary: nsenterclass, svcClass, ...
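
To make the GC-policy complaint concrete, here is what a site-controllable eviction weight could look like. This is a sketch only: the real CASTOR GC policies live in the stager database, and the function, its parameters and the 0.1 coefficient are all invented for illustration:

```python
# Hypothetical LRU-with-size weight function, sketching the kind of
# control sysadmins would like over garbage collection: older and
# larger files become better eviction candidates.
import time

def gc_weight(last_access, size_bytes, now=None):
    """Higher weight = evicted sooner. All tunables are illustrative."""
    now = time.time() if now is None else now
    age_hours = (now - last_access) / 3600.0
    size_gb = size_bytes / 1e9
    return age_hours + 0.1 * size_gb  # tunable age/size trade-off

# A collector would sort disk copies by gc_weight() descending and
# delete until the pool is back under its free-space threshold.
```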


14 CASTOR: yesterday, today and tomorrow
- CASTOR was upgraded to version 2.1.6-12 (early April).
- The new Repack was set up and installed; it showed more stability than the previous version, but it still generates errors and requires baby-sitting (mid April).
- Tackled the topic of migration policies; during the tests we discovered the bug that limits their use to SLC4 only (our core services are on SLC3). To get around it, we installed a separate SLC4 machine and moved over the services needed to run the migration policies (end of April and early May).
- Performed the first configuration of the migration policies with CMS; created 8 tape pools (tape families; see the routing sketch below). The configuration possibilities are very limited and require administrator-side intervention, an issue also raised by RAL (early May and today).
- After CCRC08 we will upgrade to version 2.1.7- and all core service machines will be moved to SLC4.
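
As a sketch of the tape-family idea described above, mapping CMS file classes to dedicated tape pools at migration time. The pool and class names below are invented, not the real CMS configuration:

```python
# Hypothetical sketch of tape families: route files of a given class
# to a dedicated tape pool when a migration stream writes them out.
TAPE_FAMILIES = {
    "cms_raw":  "cms_tp_raw",   # invented names; the real setup
    "cms_reco": "cms_tp_reco",  # defined eight such pools
}

def tape_pool_for(file_class: str) -> str:
    """Pick the tape pool a migration stream should write to."""
    return TAPE_FAMILIES.get(file_class, "default_pool")

if __name__ == "__main__":
    print(tape_pool_for("cms_raw"))   # -> cms_tp_raw
    print(tape_pool_for("unknown"))   # -> default_pool
```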

