Operation of CASTOR at RAL Tier1 Review November 2007 Bonny Strong.

1 Operation of CASTOR at RAL Tier1 Review November 2007 Bonny Strong

2 History
Jan 2005: Castor1 installed at RAL for evaluation
Jan 2006: Castor2 first available to external institutes; installation begun at RAL
Aug 2006: Castor2 running after resolving problems for deployment outside CERN, version 2.1.0
Sep 2006: CSA06 ran successfully
Mar 2007: Upgrade to version 2.1.2; major problems and instability caused frequent meltdowns
Sep 2007: Deployed separate instances per VO and CASTOR version 2.1.3; much better stability

3 Production Architecture
[Diagram. Shared services: Name Server 1 and Name Server 2 (+vmgr), backed by an Oracle NS+vmgr database. Four stager instances, one each for CMS, Atlas, LHCb, and Repack/small users; each runs a stager, DLF, and LSF with dedicated Oracle stager and DLF databases (the repack instance also has an Oracle repack database), served by tape servers. Diskserver pools across the instances: 22 diskservers (133 TB), 20 diskservers (144 TB), 7 diskservers (48 TB), and 1 diskserver (9 TB).]

4 Test Architecture
[Diagram. Development and preproduction instances share a name server (+vmgr) with an Oracle NS+vmgr database; each has a stager, DLF, LSF, repack, Oracle stager/DLF/repack databases, a tape server, and 1 diskserver of variable capacity. A separate certification testbed has its own shared services (name server +vmgr, Oracle NS+vmgr), stager, DLF, LSF, repack, Oracle stager/DLF/repack databases, tape server, and 1 diskserver of variable capacity.]

5 Operational Management
Change management
System manager on duty
Helpdesk
Monitoring: nagios, ganglia, CASTOR-specific checks
Team:
Bonny Strong – service manager
Shaun de Witt – developer
Tim Folkes (about 50%) – tape operations
Chris Kruk – LSF manager, diskservers, sys admin
Cheney Ketley (50%) – sys admin, LSF backup
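The slide does not spell out what the CASTOR-specific checks do; a minimal sketch of the kind of probe such monitoring could use, assuming a check reduces to verifying TCP reachability of a daemon port (all hostnames and port numbers below are hypothetical, not the site's real configuration):

```python
import socket

def daemon_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical probe targets; real daemon hosts/ports would come from the
# site configuration, not from this sketch.
PROBES = {
    "stager": ("stager.example.org", 9002),
    "nameserver": ("ns.example.org", 5010),
}

def run_probes(probes):
    """Evaluate all probes; returns a {name: up?} map for the alerting layer."""
    return {name: daemon_reachable(host, port)
            for name, (host, port) in probes.items()}
```

A probe like this can be wrapped as a nagios plugin by translating the boolean into the standard OK/CRITICAL exit codes.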

6 Working with VOs
Weekly meeting with all VOs to discuss issues and plans
Individual meetings with VOs to model data flow and plan CASTOR configuration

7 Atlas Data Flow Model
[Diagram. Data flows between the T0, this T1, partner T1s, T2s, and the reconstruction farm for RAW, ESD, AOD/AODm, TAG, and simulated (simRaw, simStrip) data, mapped onto the service classes T0Raw, StripInput, D0T1, D1T0, D1T1, and D0T0.]

8 Key Improvements Planned Over Next 6 Months
Resilience:
– Oracle clusters (RAC) with Dataguard DB replication
– Redundant stagers for each VO
– Encouraging development for additional redundancy
Monitoring improvements
Development of administrative tools
Deployment and configuration management procedures
Disaster recovery documentation and testing
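Redundant stagers per VO imply some pick-first-healthy failover, whether done in DNS or in tooling; a minimal sketch of that pattern, with the health check left abstract (nothing here reflects the actual CASTOR implementation):

```python
def first_healthy(candidates, is_healthy):
    """Return the first candidate host that passes the health check, else None.

    candidates: ordered list of hostnames (primary first, standbys after).
    is_healthy: callable taking a hostname and returning a bool.
    """
    for host in candidates:
        if is_healthy(host):
            return host
    return None
```

In practice the health check could be the same TCP-reachability probe used for monitoring, so alerting and failover agree on what "up" means.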

9 SRMv2
In production at RAL by 1 Dec 2007
Separate endpoints for each VO
Front-end clusters for redundancy
Will run in parallel with SRMv1 until VOs approve v1 decommissioning
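"Separate endpoints for each VO" during the parallel-running period means clients need a per-VO, per-version lookup; a sketch of that mapping, where every hostname and path is hypothetical (real endpoints are published by the site, not by this example):

```python
# Hypothetical per-VO endpoint table for the SRMv1/SRMv2 parallel-running
# period; the hostnames and paths are illustrative only.
SRM_ENDPOINTS = {
    "atlas": {"v2": "srm://srm-atlas.example.org:8443/srm/managerv2",
              "v1": "srm://srm-atlas.example.org:8443/srm/managerv1"},
    "cms":   {"v2": "srm://srm-cms.example.org:8443/srm/managerv2",
              "v1": "srm://srm-cms.example.org:8443/srm/managerv1"},
    "lhcb":  {"v2": "srm://srm-lhcb.example.org:8443/srm/managerv2",
              "v1": "srm://srm-lhcb.example.org:8443/srm/managerv1"},
}

def endpoint_for(vo: str, version: str = "v2") -> str:
    """Look up the SRM endpoint for a VO, failing loudly for unknown VOs."""
    try:
        return SRM_ENDPOINTS[vo][version]
    except KeyError:
        raise KeyError(f"no {version} SRM endpoint configured for VO {vo!r}")
```

Once a VO approves v1 decommissioning, its "v1" entry can simply be dropped, and stale clients fail with a clear error instead of a silent redirect.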

10 Major Problems and Issues
Software reliability
Heavy operational cost
CERN-specific development
Repack delayed
Lack of administrative tools
Performance to tape
Staffing for 24/7 coverage

11 Working with CERN
External-institutes conference call every 2 weeks to review development progress and operational issues
Twice-yearly face-to-face meetings of external institutes
Monthly deployment conference call to plan development priorities
Management-level meetings over the last year to address problems of CASTOR for Tier1s:
– Improved release procedures and planning
– More involvement of Tier1s in development planning
– Improved testing with development of a certification testbed and test suite at RAL

12 Conclusions
Has not been a smooth road
Have taken, or plan, significant steps to overcome problems
Major concerns for 2008:
– 24/7 operation
– Improving tape performance
Expect system reliability to be much better in 2008 than in 2007

