CASTOR 2.1.9 Upgrade, Testing and Issues Shaun de Witt GRIDPP-25 23 August 2010.

1 CASTOR 2.1.9 Upgrade, Testing and Issues Shaun de Witt GRIDPP-25 23 August 2010

2 Agenda
  Testing
    –What we planned, what we did and what the VOs are doing
  Results
  Issues
  Rollout Plan
  The Future

3 Planned Testing
  Original Plan
    –Test database Upgrade Procedure
    –Functional Test 2.1.7/8/9
    –Stress Test 2.1.7/8/9
      10K reads (1 file in, multiple reads) (rfio+gridFTP)
      10K writes (multiple files in) (rfio+gsiftp)
      10K d-2-d (1 file in, multiple reads) (rfio)
      20K read/write (rfio+gridFTP), 10K mixed tests
      10K stager_qry (database test)
      5 file sizes (100MB-2GB)

4 Required Changes
  Move to ‘local’ nameserver
    –Required to allow rolling updates
      Nameserver schema cannot be upgraded until all instances are at 2.1.9
  Move from SLC4 to SL4
    –Support for SLC4 ends this year
      SL4 supported until 2012
  Change of diskservers part way through testing

5 Actual Testing
  Columns: Stager, Local Nameserver, Central Nameserver, Tests
  2.1.7-27  -  2.1.8-3  FT/ST
  2.1.7-27  2.1.8-3  FT
  2.1.7-27  2.1.8-18  FT
  2.1.8-18  FT/ST†
  2.1.8-18  2.1.9-6*  FT
  2.1.9-6  2.1.9-6*  FT
  2.1.9-6  -  FT/ST
  (*) Indicates a schema-only upgrade; the rpms remained at the previous version
  (†) Move from SLC4 to SL4 after stress testing

6 Actual Stress Testing
  Original plan (fixed operation counts) would have taken too long
    –Moved to fixed duration testing (24 hr limit)
    –Reduced number of file sizes from 5 to 2
      100MB and 2GB
  No mixed tests
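The switch from a fixed number of operations to a fixed time budget can be sketched as below. This is a minimal illustration only, not the harness used for these tests: `run_fixed_duration`, `operation` and `clock` are hypothetical names, and `operation` stands in for whatever transfer or query a real test would perform.

```python
import time


def run_fixed_duration(operation, duration_s, clock=time.monotonic):
    """Call `operation` repeatedly until `duration_s` seconds have elapsed.

    Returns the elapsed time of each completed call. The number of calls
    is whatever fits in the time budget, rather than a fixed target
    such as 10K. An operation that starts before the deadline is allowed
    to finish, so the total run time can slightly exceed `duration_s`.
    """
    timings = []
    deadline = clock() + duration_s
    while clock() < deadline:
        start = clock()
        operation()  # hypothetical placeholder for one read, write or query
        timings.append(clock() - start)
    return timings


if __name__ == "__main__":
    # Example: a dummy operation under a 0.05 second budget.
    results = run_fixed_duration(lambda: time.sleep(0.01), 0.05)
    print(len(results), "operations completed")
```

The `clock` parameter is injectable so the loop can be exercised with a fake clock instead of waiting for a real 24 hour limit.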

7 Results
  All 2.1.8 Functional Tests pass
  Most 2.1.9 tests pass
    –With some modifications to scripts
    –Including xrootd!
    –Some fail because they require a CERN-specific setup
  Stable under stress testing
    –Changes made performance metrics less useful
    –Overall impression is no significant change

8 Issues (on Testing)
  Limit on clients
    –More stress on client machines than CASTOR
    –Unable to test extreme LSF queues
    –VO testing includes stress (hammercloud) tests
  Functional tests done with ‘matching’ client version
    –Some basic testing also done with older client versions (2.1.7) against later stager versions
    –VOs using 2.1.7 clients

9 Issues (on CASTOR)
  Remarkably few....
    –DLF not registering file id
      Fixed by CERN – we need custom version of DLF.py
    –No 32-bit xroot rpms available
      Produced for us, but not fully supported
    –gridFTP external (used @ RAL) does not support checksumming
    –Some database cleanup needed before upgrade

10 Issues (VO Testing)
  Some misconfigured disk servers
  Problems with xrootd for ALICE
    –Disk servers need firewall ports opening

11 Issues (in 2.1.9-6)
  Known issues affecting 2.1.9-6
    –Rare checksum bug affecting gridFTP internal
      Fixed in 2.1.9-8
    –Can get file inconsistencies during repack if file is overwritten
      Very unlikely (fixed in 2.1.9-7)
    –Xrootd manager core dumps at CERN
      Under investigation
    –Problem with multiple tape copies on file update

12 Change Control
  Whole testing and rollout plan has been extensively change reviewed
    –Four separate reviews, some done independently of CASTOR team
    –Included review of Update Process
    –Provided useful input for additional tests, highlighted limitations and identified impacted systems
    –Proposed regular reviews during upgrades
  Detailed update plan under development

13 Rollout Plan
  High level docs available for some time now:
    –https://www.gridpp.ac.uk/wiki/RAL_Tier1_Upgrade_Plan
  Three downtimes
  Schedule to be agreed with VOs
    –Proposed schedule sent to VOs
    –Likely LHCb will be guinea pigs
    –ALICE before Heavy Ion run

14 Schedule (draft)
  Rolling move to local nameserver starting 13/9
  Main update:
    –LHCb: 27/9
    –GEN (ALICE): 25/10
    –ATLAS: 8/11
    –CMS: 22/11
  Revert back to central n/s post Xmas

15 The Future
  More CASTOR/SRM upgrades
    –2.1.9-8 to address known issues
    –2.9 SRM more performant, safer against DoS
  Move to SL5
    –Probably next year; no rpms available yet
  CASTOR gridFTP ‘internal’
  More use of xrootd
  More stable database infrastructure (Q1 2011?)

16 Facilities Instance
  Provide CASTOR instance for STFC facilities
    –Provides (proven) massively scalable “back end” storage component of a deeper data management architectural stack
    –CASTOR for STFC facilities: production system to be deployed ~ Dec 2010
    –STFC friendly users currently experimenting with CASTOR
    –Users expected to interface to CASTOR via “Storage-D” (high performance data management pipeline)
    –E-Science aiming for a common architecture for “big data management”:
      CASTOR: back end data storage
      Storage-D: middleware
      ICAT: file and meta-data catalogue
      TopCat: multi-user web access
    –Can eventually wind down sterling (but obscure) “ADS” service (very limited expertise, non-Linux operating system, unknown code in many parts)
    –Exploits current (and future) skill set of the group

17 Summary
  New CASTOR was stable under stress testing
    –And VO testing – so far
  Performance not impacted – probably
  Very useful getting experiments on-board for testing
  ‘Ready’ for deployment

18 Results (Stress Tests, 100MB)
  Test | 2.1.7 | 2.1.8 | 2.1.9
  Rfio write | 76.3 (+/-3.92) s | 82.7 (+/-25.7) s | 39.7 (+/-22.3) s
  Rfio write+read | 330.3 (+/-107.1) s | 10.5 (+/-24.8) s | 62.4 (+/-18.0) s
  Disk-2-disk | 59.7 (+/-10.9) s | 23.2 (+/-14.5) s | 68.6 (+/-17.4) s
  gridFTP write | 85.4 (+/-10.5) s | 43.3 (+/-14.2) s | 49.9 (+/-73.0) s
  gridFTP write+read | 27.9 (+/-7.4) s | 68.9 (+/-18.9) s | 72.5 (+/-40.8) s
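One crude way to read a table like this is to ask whether two versions' "mean +/- spread" ranges overlap. The sketch below does only that. It assumes the "+/-" figures are a symmetric spread around the mean (the slides do not say how they were computed) and that "+/3.92" in the first row means +/-3.92. The function names are illustrative, and an overlap check is a rough heuristic, not a statistical test.

```python
def spread_interval(mean, spread):
    """Return the (low, high) range covered by mean +/- spread."""
    return (mean - spread, mean + spread)


def intervals_overlap(a, b):
    """True if the mean +/- spread ranges of measurements a and b overlap.

    Each argument is a (mean, spread) pair in the same units.
    """
    a_lo, a_hi = spread_interval(*a)
    b_lo, b_hi = spread_interval(*b)
    return a_lo <= b_hi and b_lo <= a_hi


# Values copied from the 100MB table (seconds).
rfio_write = {"2.1.7": (76.3, 3.92), "2.1.9": (39.7, 22.3)}
gridftp_write = {"2.1.8": (43.3, 14.2), "2.1.9": (49.9, 73.0)}

# Rfio write, 2.1.7 vs 2.1.9: the ranges do not overlap.
print(intervals_overlap(rfio_write["2.1.7"], rfio_write["2.1.9"]))
# gridFTP write, 2.1.8 vs 2.1.9: the ranges overlap.
print(intervals_overlap(gridftp_write["2.1.8"], gridftp_write["2.1.9"]))
```

Given the slide 7 caveat that changes during testing made the performance metrics less useful, even non-overlapping ranges here should be read with caution.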

19 Results (Stress Tests, 2GB)
  Test | 2.1.7 | 2.1.8 | 2.1.9
  Rfio write | 16944.3 (+/-286.4) | 1699.8 (+/-42.7) | 736.8 (+/-377.6)
  Rfio write+read | 3409.9 (+/-9.6) | 380.6 (+/-168.7) | 1421.8 (+/-597.7)
  Disk-2-disk | 7605.3 (+/-2317.7) | 402.9 (+/-175.6) | 1295.9 (+/-597.7)
  gridFTP write | 1713.8 (+/-19.8) | 765.1 (+/-83.2) | 750.5 (+/-223.0)
  gridFTP write+read | 1630.3 (+/-184.5) | 803.9 (+/-220.2) | 1287.3 (+/-638.0)

