Olof Bärring (IT/FIO/FS)

Slide 13: LHCb Migration
- Production: smooth, mostly done already in March
- Users: some difficulties
  - Dependency on old ROOT3 data delayed the migration
  - Flipping STAGE_HOST was not sufficient: grid jobs have no CERN-specific environment, so most worker nodes still access stagepublic (CASTOR1)
- Usage: RFIO and ROOT access
- Challenges
  - Lots of tape writing at CERN in early summer
  - Data export in July–August
- Special requirements
  - 'Durable' disk pools
  - Special SRM endpoint: srm-durable-lhcb.cern.ch
- Pools
  - default: 28 TB
  - wan: 51 TB
  - lhcbdata: 5 TB, no GC
  - lhcblog: 5 TB, no GC
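The STAGE_HOST point above can be sketched as follows. This is an illustrative fragment only: the helper name and the fallback host are assumptions, shown to make clear why a job arriving without the CERN-specific environment ends up on the CASTOR1 stager.

```python
import os

def resolve_stage_host(default: str = "stagepublic") -> str:
    # A grid job lands on the worker node without the CERN-specific
    # environment, so flipping STAGE_HOST centrally changes nothing
    # for it: absent the variable, the client falls back to the old
    # default stager (here illustratively "stagepublic", i.e. CASTOR1).
    return os.environ.get("STAGE_HOST", default)
```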
Slide 14: Plans for non-LHC migration
- Plan: migrate all non-LHC groups to a shared CASTOR2 instance, castorpublic
  - Dedicated pools for large groups
  - Small groups will share 'default'
  - Also used for the dteam background transfers and by the 'repack' service
- NA48 first out: plan is to switch off stagena48 at the end of January 2007
- COMPASS
- Complications
  - The engineering community may require a Windows client
  - How to migrate small groups without computing coordinators?
Slide 15: Main problems and workarounds
- prepareForMigration
  - Deadlocks left the CASTOR name server not updated: the file remains at 0 size while a tape segment > 0 exists
  - Tedious cleanup
- GC: long period of instabilities during the summer; OK since 2.1.0-6
- stager_qry
  - "Now you see your file, now you don't…": users confused
  - Used by the operational procedure for draining disk servers: a manual and tedious workaround for the INVALID-status bug
- LSF plugin related problems
  - Meltdown
    - Workaround: limit PENDing jobs to 1000, but this may result in an rmmaster meltdown instead
    - Problematic with 'durable' pools, which are not properly managed
  - Recent problem with lsb_postjobmsg, 'Bad file descriptor': the plugin cannot recover; the workaround is to restart LSF
- Missing cleanups: accumulation of stageRm subrequests, diskcopies in FAILED, …
- Looping migrators
  - NBTAPECOPIESINFS inconsistency; a workaround in an early-September hotfix reduced the impact on tape mounting, but manual cleanup is still required
- Looping recallers
  - Due to zero-size files (see above)
  - Due to a stageRm bug (insufficient cleanup)
- Client/server (in)compatibility matrix: request mixing…!
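The prepareForMigration inconsistency and the looping recallers it feeds can be captured by a simple predicate. This is a sketch of the detection criterion described above, not actual CASTOR code; the function name and argument names are made up for illustration.

```python
def needs_manual_cleanup(ns_file_size: int, tape_segment_size: int) -> bool:
    # A prepareForMigration deadlock can leave the name-server entry at
    # 0 bytes while a non-empty tape segment exists for the same file;
    # such entries make recallers loop and must be cleaned up by hand.
    return ns_file_size == 0 and tape_segment_size > 0
```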
Slide 16: Tape service (TSI section)
- Both T10K and 3592 used in production during 2006
  - No preference: buy both
- Current drive park
  - 40 SUN T10K
  - 40 IBM 3592
  - 6 LTO-3
  - 44 9940B
- Current robot park
  - 1 SUN SL8500
  - 6 SUN Powderhorns (6 recently dismounted)
  - 1 IBM 3584
- Buying for next year
  - 10 more drives each of T10K and 3592: 50 of each in total
  - 1 more SUN SL8500
  - Enough media to fill the new robotics: ~18k pieces of media (12k T10K, 6k 3592 at 700 GB)
Slide 17: Tape / Robots

IBM 3584 Tape Library (monolithic solution)
- 40 x 3592-E05 IBM tape drives
- ~6000 tape slots
- 2 accessors
- ~38 m² of floor space

SUN/STK SL8500 Tape Library (modular solution)
- 40 x SUN T10K tape drives
- 21 x LTO-3 tape drives
- 10 x 9940B tape drives
- ~8000 tape slots
- 2 x 4 handbots
- Pass-through mechanism
- ~19 m² of floor space

[Photos: SUN/STK SL8500 and IBM 3584]
Slide 18: Repack of 22k 9940B tapes
- Leave 4 Powderhorn silos for 9940B tapes to be repacked to new media
- Some tapes have a huge number of small files
  - Record: 165k files on a single 9940B tape (200 GB); it will take ~1 month to repack that tape alone…
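A back-of-the-envelope check of the "~1 month" figure, assuming that per-file overhead (tape positioning plus metadata updates) dominates rather than streaming bandwidth. The 15-second per-file cost below is an assumed value for illustration, not a measured one.

```python
def repack_days(n_files: int, per_file_overhead_s: float) -> float:
    # Wall-clock estimate when the per-file overhead, not the tape's
    # streaming rate, dominates the repack of a tape full of small files.
    return n_files * per_file_overhead_s / 86_400  # 86 400 s per day

# 165k files at an assumed ~15 s each comes out near 29 days,
# consistent with the "~1 month" estimate above.
print(round(repack_days(165_000, 15)))
```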
Slide 19: SRM service
- SRM v1.1
  - Shared facility accessed through a single endpoint: srm.cern.ch
    - 9 CPU servers, DNS load-balanced
    - 1 CPU server used for the request repository
  - A dirty workaround for 'durable' space required setting up some extra endpoints (srm-durable-xyz.cern.ch)
  - All transfers initiated through srm.cern.ch (== castorsrm.cern.ch) are redirected to the disk servers
    - The old castorgrid gateway is only used by non-LHC experiments for non-SRM access (e.g. NA48 and COMPASS)
    - All CASTOR2 disk servers are on the LCG network (also visible to the Tier-2 sites through the HTAR)
- SRM v2.2
  - Test facility up and running: srm-v2.cern.ch
  - No need for additional endpoints for 'durable' storage: durable space is addressed through SRM space tokens
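The contrast between the two SRM generations can be sketched as follows. The endpoint names follow the slide, but the helper and the space-token string are illustrative assumptions (real v2.2 tokens are defined per experiment), not the actual service configuration.

```python
def srm_target(version: str, space: str):
    """Return (endpoint, space_token) for a durable-space request.

    Illustrative sketch only: contrasts SRM v1.1, where durable space
    needed its own endpoint (the 'dirty workaround'), with SRM v2.2,
    where a single endpoint plus a space token suffices.
    """
    if version == "v1.1":
        # One extra endpoint per durable space, e.g. srm-durable-lhcb.cern.ch.
        host = f"srm-durable-{space}.cern.ch" if space else "srm.cern.ch"
        return host, None
    # v2.2: single endpoint; durable space addressed via a space token
    # (token string below is a made-up example).
    return "srm-v2.cern.ch", f"{space.upper()}_DURABLE" if space else None
```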
Slide 20: Conclusions
- 4 LHC experiments successfully migrated to CASTOR2
- All major SC4 milestones completed successfully
- Non-LHC migration has 'started'
- New tape equipment running in production without any major problem
- Our next challenges
  - Dare to remove dirty workarounds when bugs get fixed
  - SRM v2.2 operation and support
  - Repack 22k 9940B tapes to new media