CASTOR 2.1.9 Upgrade, Testing and Issues Shaun de Witt GRIDPP-25 23 August 2010.

Agenda
– Testing: what we planned, what we did, and what the VOs are doing
– Results
– Issues
– Rollout Plan
– The Future

Planned Testing
Original plan:
– Test database upgrade procedure
– Functional test 2.1.7/8/9
– Stress test 2.1.7/8/9:
  10K reads (1 file in, multiple reads) (rfio + gridFTP)
  10K writes (multiple files in) (rfio + gsiftp)
  10K d-2-d (1 file in, multiple reads) (rfio)
  20K read/write (rfio + gridFTP), 10K mixed tests
  10K stager_qry (database test)
  5 file sizes (100MB-2GB)

Required Changes
– Move to 'local' nameserver
  Required to allow rolling upgrades: the nameserver schema cannot be upgraded until all instances are at the new release
– Move from SLC4 to SL4
  Support for SLC4 ends this year; SL4 is supported until 2012
– Change of disk servers part way through testing

Actual Testing
Test matrix columns: Stager, Local Nameserver, Central Nameserver, Tests
Tests carried out per configuration: FT/ST, FT, FT, FT/ST†, *FT, *FT, FT/ST
(*) Indicates a schema-only upgrade; the RPMs remained at the previous version
(†) Move from SLC4 to SL4 after stress testing

Actual Stress Testing
– Original plan (fixed number of operations) would have taken too long; moved to fixed-duration testing (24 hr limit) – see the sketch below
– Reduced number of file sizes from 5 to 2: 100MB and 2GB
– No mixed tests
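To make the workload concrete, the following is a minimal sketch of such a fixed-duration driver for the read/write mixes listed in the planned testing. It is not the RAL test harness: the CASTOR namespace path, the gsiftp endpoint and the local test files are hypothetical placeholders, and it assumes the standard CASTOR client commands (rfcp, stager_qry) and globus-url-copy are installed.

```python
#!/usr/bin/env python
# Minimal sketch of a fixed-duration (24 hr) stress driver.
# Assumes the standard CASTOR client tools (rfcp, stager_qry) and globus-url-copy
# are installed; all paths and hostnames are hypothetical placeholders.
import random
import subprocess
import time

DURATION = 24 * 3600                               # fixed 24-hour limit, not a fixed operation count
LOCAL_FILES = {"100MB": "/tmp/test_100MB", "2GB": "/tmp/test_2GB"}   # pre-created local test files
CASTOR_DIR = "/castor/example.ac.uk/stresstest"    # hypothetical CASTOR namespace directory
GRIDFTP_HOST = "gsiftp://castorgrid.example.ac.uk" # hypothetical gridFTP endpoint

def rfio_write(size, i):
    """Copy a local file into CASTOR over RFIO and return (exit code, CASTOR path)."""
    dst = "%s/rfio_%s_%06d" % (CASTOR_DIR, size, i)
    return subprocess.call(["rfcp", LOCAL_FILES[size], dst]), dst

def rfio_read(castor_path, i):
    """Read a CASTOR file back over RFIO (one file in, many reads)."""
    return subprocess.call(["rfcp", castor_path, "/tmp/readback_%06d" % i])

def gridftp_write(size, i):
    """Copy a local file into CASTOR over gridFTP (gsiftp)."""
    dst = "%s%s/gftp_%s_%06d" % (GRIDFTP_HOST, CASTOR_DIR, size, i)
    return subprocess.call(["globus-url-copy", "file://" + LOCAL_FILES[size], dst])

def stager_query(castor_path):
    """Exercise the stager database with a status query."""
    return subprocess.call(["stager_qry", "-M", castor_path])

def main():
    start = time.time()
    _, seed_path = rfio_write("100MB", 0)          # one seed file for the repeated-read tests
    i = 0
    while time.time() - start < DURATION:          # stop on elapsed time, not operation count
        i += 1
        size = random.choice(["100MB", "2GB"])
        op = random.choice([lambda: rfio_write(size, i)[0],
                            lambda: rfio_read(seed_path, i),
                            lambda: gridftp_write(size, i),
                            lambda: stager_query(seed_path)])
        op()

if __name__ == "__main__":
    main()
```

In practice several such drivers would run in parallel from different client hosts to reach the 10K-20K operation counts within the time limit.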

Results
– All functional tests pass
  Most tests pass with some modifications to scripts, including xrootd! Some fail because they require a CERN-specific set-up
– Stable under stress testing
  Changes made the performance metrics less useful; overall impression is no significant change

Issues (on Testing)
– Limit on clients
  More stress on the client machines than on CASTOR
  Unable to test extreme LSF queues
  VO testing includes stress (hammercloud) tests
– Functional tests done with a 'matching' client version
  Some basic testing also done with older client versions (2.1.7) against later stager versions
  VOs using clients

Issues (on CASTOR)
Remarkably few...
– DLF not registering the file id: fixed by CERN; we need a custom version of DLF.py
– No 32-bit xroot RPMs available: produced for us, but not fully supported
– gridFTP external (at RAL) does not support checksumming (a client-side check is sketched below)
– Some database cleanup needed before upgrade
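Because the external gridFTP server in use does not support checksumming, any verification has to happen outside the transfer itself. Below is a minimal client-side sketch assuming the catalogue holds an Adler-32 checksum (the usual choice for WLCG storage); the expected value and file path are hypothetical placeholders.

```python
import zlib

def adler32_of_file(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file, streamed in 1 MB chunks."""
    checksum = zlib.adler32(b"")                  # Adler-32 seed value (1)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xffffffff                  # force an unsigned 32-bit result

# Hypothetical usage: compare against the checksum recorded for the file in the catalogue.
expected = "7d9a2e3b"                             # placeholder value, not a real catalogue entry
local = "%08x" % adler32_of_file("/tmp/readback_000001")
print("checksum OK" if local == expected else "checksum MISMATCH")
```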

Issues (VO Testing)
– Some misconfigured disk servers
– Problems with xrootd for ALICE: disk servers need firewall ports opening (a quick reachability check is sketched below)
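One quick way to confirm the firewall ports really are open is to attempt a TCP connection from outside the cluster. A minimal sketch follows, assuming xrootd listens on its default port 1094 and using hypothetical disk-server hostnames.

```python
import socket

XROOTD_PORT = 1094                                # default xrootd port; the actual setup may differ
DISK_SERVERS = ["gdss001.example.ac.uk",          # hypothetical disk-server hostnames
                "gdss002.example.ac.uk"]

for host in DISK_SERVERS:
    try:
        # A successful TCP connect shows the firewall port is open and something is listening.
        with socket.create_connection((host, XROOTD_PORT), timeout=5):
            print("%s: port %d open" % (host, XROOTD_PORT))
    except OSError as err:
        print("%s: port %d unreachable (%s)" % (host, XROOTD_PORT, err))
```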

Issues (in 2.1.9)
Known issues affecting 2.1.9:
– Rare checksum bug affecting gridFTP internal; fixed in a later release
– Can get file inconsistencies during repack if a file is overwritten: very unlikely (fixed in a later release)
– Xrootd manager core dumps at CERN: under investigation
– Problem with multiple tape copies on file update

Change Control
The whole testing and rollout plan has been extensively change-reviewed
– Four separate reviews, some done independently of the CASTOR team
– Included a review of the update process
– Provided useful input for additional tests, highlighted limitations, and identified impacted systems
– Proposed regular reviews during upgrades
Detailed update plan under development

Rollout Plan
– High-level documents have been available for some time now
– Three downtimes
– Schedule to be agreed with the VOs
  Proposed schedule sent to the VOs
  Likely LHCb will be the guinea pigs
  ALICE before the Heavy Ion run

Schedule (draft)
– Rolling move to local nameserver starting 13/9
– Main update:
  LHCb: 27/9
  GEN (ALICE): 25/10
  ATLAS: 8/11
  CMS: 22/11
– Revert to the central nameserver post-Xmas

The Future
– More CASTOR/SRM upgrades to address known issues
  SRM 2.9 is more performant and safer against DoS
– Move to SL5: probably next year; no RPMs available yet
– CASTOR gridFTP 'internal'
– More use of xrootd
– More stable database infrastructure (Q1 2011?)

Facilities Instance
Provide a CASTOR instance for STFC facilities
– Provides a (proven) massively scalable "back end" storage component of a deeper data management architectural stack
– CASTOR for STFC facilities: production system to be deployed ~Dec 2010
– STFC friendly users currently experimenting with CASTOR
– Users expected to interface to CASTOR via "Storage-D" (high-performance data management pipeline)
– E-Science aiming for a common architecture for "big data management":
  CASTOR – back-end data storage
  Storage-D – middleware
  ICAT – file and metadata catalogue
  TopCat – multi-user web access
– Can eventually wind down the sterling (but obscure) "ADS" service (very limited expertise, non-Linux operating system, unknown code in many parts)
– Exploits the current (and future) skill set of the group

Summary
– New CASTOR was stable under stress testing, and under VO testing so far
– Performance not impacted (probably)
– Very useful getting the experiments on board for testing
– 'Ready' for deployment

Results (Stress Tests, 100MB)
– Rfio write: 76.3(+/-3.92)s, 82.7(+/-25.7)s, 39.7(+/-22.3)s
– Rfio write+read: 330.3(+/-107.1)s, 10.5(+/-24.8)s, 62.4(+/-18.0)s
– Disk-2-disk: 59.7(+/-10.9)s, 23.2(+/-14.5)s, 68.6(+/-17.4)s
– gridFTP write: 85.4(+/-10.5)s, 43.3(+/-14.2)s, 49.9(+/-73.0)s
– gridFTP write+read: 27.9(+/-7.4)s, 68.9(+/-18.9)s, 72.5(+/-40.8)s

Results (Stress Tests, 2GB)
– Rfio write: (+/-286.4), 1699.8(+/-42.7), 736.8(+/-377.6)
– Rfio write+read: 3409.9(+/-9.6), 380.6(+/-168.7), 1421.8(+/-597.7)
– Disk-2-disk: 7605.3(+/- ), 402.9(+/-175.6), 1295.9(+/-597.7)
– gridFTP write: 1713.8(+/-19.8), 765.1(+/-83.2), 750.5(+/-223.0)
– gridFTP write+read: 1630.3(+/-184.5), 803.9(+/-220.2), 1287.3(+/-638.0)
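For reference, mean-and-spread figures in this "value(+/-spread)" form are straightforward to regenerate from per-transfer timing logs. The following is a minimal sketch, assuming hypothetical log files with one elapsed time in seconds per line.

```python
import statistics

def summarise(log_path):
    """Return the mean and sample standard deviation of the timings in a log file."""
    with open(log_path) as f:
        times = [float(line) for line in f if line.strip()]
    return statistics.mean(times), statistics.stdev(times)

# Hypothetical per-test log files, one elapsed time in seconds per line.
for test, log in [("Rfio write", "rfio_write_100MB.log"),
                  ("gridFTP write", "gridftp_write_100MB.log")]:
    mean, spread = summarise(log)
    print("%-20s %.1f(+/-%.1f)s" % (test, mean, spread))
```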