
1 Tier1 Status Report
Martin Bly, RAL
27/28 April 2005

2 Topics
- Hardware
- Atlas DataStore
- Networking
- Batch services
- Storage
- Service Challenges
- Security

3 Hardware
- Approximately 550 CPU nodes
  – ~980 processors deployed in batch
  – Remainder are service nodes, servers, etc.
- 220TB disk space: ~60 servers, ~120 arrays
- Decommissioning
  – Majority of the P3/600MHz systems decommissioned Jan 05
  – P3/1GHz systems to be decommissioned in July/Aug 05, after commissioning of the Year 4 procurement
  – Babar SUN systems decommissioned by end Feb 05
  – CDF IBM systems decommissioned and sent to Oxford, Liverpool, Glasgow and London
- Next procurement
  – 64-bit AMD or Intel CPU nodes (power and cooling considerations)
  – Dual cores possibly too new
  – Infortrend arrays / SATA disks / SCSI connect
- Future
  – Evaluate new disk technologies, dual-core CPUs, etc.

4 Atlas DataStore
- Evaluating new disk systems for the staging cache
  – FC-attached SATA arrays
  – Additional 4TB/server, 16TB total
  – Existing IBM/AIX servers
- Tape drives
  – Two additional 9940B drives, FC attached
  – 1 for the ADS, 1 for the test CASTOR installation
- Developments
  – Evaluating a test CASTOR installation
  – Stress testing ADS components to prepare for the Service Challenges
  – Planning for a new robot
  – Considering the next generation of tape drives
  – SC4 (2006) requires a step up in cache performance
  – Ancillary network rationalised

5 Networking
- Planned upgrades to the Tier1 production network
  – Started November 04
  – Based on Nortel 5510-48T 'stacks' for large groups of CPU and disk server nodes (up to 8 units/stack, 384 ports)
  – High-speed backbone interconnect between units within a stack (40Gb/s bidirectional)
  – Multiple 1Gb/s uplinks aggregated to form the backbone: currently 2 x 1Gb/s, max 4 x 1Gb/s (a rough capacity sketch follows below)
  – Upgrade to 10Gb/s uplinks and head node as costs fall
  – Uplinks to separate units within each stack and to the head switch provide resilience
  – Ancillary links (APCs, disk arrays) on a separate network
- Connected to UKLight for SC2 (see Service Challenges below)
  – 2 x 1Gb/s links aggregated from the Tier1
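As an illustrative aside (not from the original slides; the only inputs are the port and uplink counts quoted above), a rough Python sketch of the oversubscription a fully populated stack implies:

# Rough oversubscription estimate for a Nortel 5510-48T stack as described
# above: up to 8 units x 48 x 1Gb/s edge ports, sharing aggregated uplinks.
# Illustrative arithmetic only.

PORTS_PER_UNIT = 48          # 5510-48T: 48 x 1 Gb/s edge ports
UNITS_PER_STACK = 8          # up to 8 units per stack (384 ports)
UPLINKS_GBPS = [2, 4, 10]    # current, maximum aggregated, and a future 10Gb/s uplink

edge_capacity_gbps = PORTS_PER_UNIT * UNITS_PER_STACK  # 1 Gb/s per port

for uplink in UPLINKS_GBPS:
    ratio = edge_capacity_gbps / uplink
    print(f"{uplink:>2} Gb/s uplink -> oversubscription ~{ratio:.0f}:1, "
          f"~{uplink * 1000 / 8:.0f} MB/s aggregate off-stack")

Even at the maximum 4 x 1Gb/s the uplinks are heavily oversubscribed for 384 edge ports, which is why the move to 10Gb/s uplinks and head node as costs fall matters.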

6 Batch Services
- Worker node configuration based on traditional-style batch workers with the LCG configuration on top
  – Running SL 3.0.3 with LCG 2_4_0
  – Provisioning by PXE/Kickstart
  – YUM/Yumit, Yaim, Sure, Nagios, Ganglia, ...
- All rack-mounted workers are dual purpose, accessed via a single batch system PBS server (Torque)
- The scheduler (MAUI) allocates resources for LCG, Babar and other experiments using fair-share allocations from the User Board
- Jobs can spill into allocations for other experiments, and from one 'side' to the other, when spare capacity is available, to make best use of the capacity (a simplified sketch of this spillover follows below)
- Some issues with jobs that use excess memory (memory leaks) not being killed by Maui or Torque – under investigation
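To make the spillover idea concrete, here is a minimal, hypothetical Python sketch (the experiment names, share targets and slot counts are invented, not the Tier1's actual MAUI configuration): each group gets its fair-share allocation, and slots left idle by groups with no queued work are handed to groups that still have demand.

# Toy fair-share allocation with spillover, loosely illustrating the scheme
# described above. Experiment names and share targets are hypothetical.

def allocate_slots(total_slots, shares, demand):
    """Give each experiment its fair share, then let others spill into
    whatever those with little waiting work leave unused."""
    # Initial allocation proportional to fair-share targets
    alloc = {exp: int(total_slots * frac) for exp, frac in shares.items()}
    # Slots an experiment cannot fill (demand below its allocation) are spare
    used = {exp: min(alloc[exp], demand.get(exp, 0)) for exp in alloc}
    spare = total_slots - sum(used.values())
    # Hand spare slots to experiments that still have queued work
    for exp in sorted(alloc, key=lambda e: shares[e], reverse=True):
        extra = min(spare, demand.get(exp, 0) - used[exp])
        if extra > 0:
            used[exp] += extra
            spare -= extra
    return used

shares = {"lcg": 0.50, "babar": 0.30, "other": 0.20}   # hypothetical targets
demand = {"lcg": 700, "babar": 100, "other": 50}       # queued jobs per group
print(allocate_slots(980, shares, demand))             # lcg spills into idle capacity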

7 Service Systems
- Service systems migrated to SL 3
  – Mail hub, NIS servers, UIs
  – Babar UIs configured as a DNS triplet (a short example of inspecting such an alias follows below)
- NFS / data servers
  – Customised RH7.n -> driver issues
  – NFS performance of SL 3 uninspiring compared with 7.n
  – dCache systems at SL 3
- LCG service nodes at SL 3, LCG-2_4_0
- Need to migrate to LCG-2_4_0 or lose work
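As a small illustration of what a DNS 'triplet' means in practice (the hostname below is a placeholder, not the real Babar UI alias): a round-robin alias resolves to three addresses, which can be listed with a few lines of Python.

# List the distinct IPv4 addresses behind a round-robin DNS alias.
# "ui.example.ac.uk" is a placeholder, not the actual Babar UI triplet name.
import socket

def resolve_all(hostname):
    """Return the distinct IPv4 addresses a DNS alias resolves to."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    return sorted({info[4][0] for info in infos})

print(resolve_all("ui.example.ac.uk"))  # a triplet should list three addresses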

8 Storage
- Moving from NFS to SRMs for data access
  – dCache successfully deployed in production; used by CMS, ATLAS, ... (see talk by Derek Ross)
  – Xrootd deployed in production; used by Babar
- Two 'redirector' systems handle Xrootd requests (a toy sketch of the redirector pattern follows below)
  – Selected by a DNS pair
  – Hand off each request to the appropriate server
  – Reduces NFS load on the disk servers
- Load issues with the Objectivity server
  – Two additional servers being commissioned
- Project to look at SL 4 for servers
  – 2.6 kernel, journaling file systems: ext3, XFS
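Purely to illustrate the redirector pattern (this is not RAL's Xrootd code; the server names and file paths are invented): the redirector only answers "which data server holds this file?", so clients fetch the data directly and the bulk traffic never passes through the redirector or an NFS mount.

# Toy illustration of the redirector pattern described above.
# Server names and paths are made up for the example.

CATALOGUE = {
    "/babar/run1/file001.root": "diskserv-a.example.ac.uk",
    "/babar/run1/file002.root": "diskserv-b.example.ac.uk",
}

def redirect(path):
    """Return the data server a client should contact for this file."""
    try:
        return CATALOGUE[path]
    except KeyError:
        raise FileNotFoundError(f"no server holds {path}")

# A client first asks the redirector, then opens the file on that server directly.
server = redirect("/babar/run1/file001.root")
print(f"client should fetch the file directly from {server}")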

9 Service Challenges I
- The Service Challenges are a program of infrastructure trials designed to test the LCG fabric at increasing levels of stress/capacity in the run-up to LHC operation.
- SC2 – March/April 05
  – Aim: T0 -> T1s aggregate of >500MB/s sustained for 2 weeks
  – 2Gb/s link via UKLight to CERN
  – RAL sustained 80MB/s for two weeks to a dedicated (non-production) dCache (11/13 gridftp servers); limited by issues with the network
  – Internal testing reached 3.5Gb/s (~400MB/s) aggregate, disk to disk (the Gb/s-to-MB/s conversions are spelt out below)
  – Aggregate to the 7 participating sites: ~650MB/s
- SC3 – July 05: Tier1 expects
  – CERN -> RAL at 150MB/s sustained for 1 month
  – T2s -> RAL (and RAL -> T2s?) at a yet-to-be-defined rate: Lancaster, Imperial, ...; some on UKLight, some via SJ4
  – Production phase Sept–Dec 05
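For clarity, a small worked conversion between the Gb/s and MB/s figures quoted on this slide (illustrative arithmetic only, using decimal units, 1 byte = 8 bits):

# Convert the rates quoted above between Gb/s and MB/s (1 Gb/s = 125 MB/s).
# Figures taken from the slide.

def gbps_to_mb_per_s(gbps):
    return gbps * 1000 / 8   # MB/s

print(f"2   Gb/s UKLight link  ~ {gbps_to_mb_per_s(2.0):.0f} MB/s ceiling")
print(f"3.5 Gb/s internal test ~ {gbps_to_mb_per_s(3.5):.0f} MB/s (the ~400 MB/s above)")
print(f"80 MB/s achieved in SC2 = {80 * 8 / 1000:.2f} Gb/s of that link")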

10 Service Challenges II
- SC4 – April 06
  – CERN -> RAL: T0-T1 expected at 220MB/s sustained for one month
  – RAL expects T2-T1 traffic at N x 100MB/s simultaneously
  – June 06 – Sept 06: production phase
- Longer term
  – There is some as-yet-undefined T1 -> T1 capacity needed; this could add 50 to 100MB/s.
  – CMS production will require 800MB/s, combined and sustained, from batch workers to the storage systems within the Tier1.
  – At some point there will be a sustained double-rate test: 440MB/s T0-T1 plus whatever is then needed for T2-T1.
  – It is clear that the Tier1 will be able to keep a significant part of a 10Gb/s link busy continuously, probably from late 2006.

11 Security
- The Badguys™ are out there
  – Users are vulnerable to losing authentication data anywhere
- Still some less than ideal practices
  – All local privilege escalation exploits must be treated as high-priority must-fixes
  – Continuing program of locking down and hardening exposed services and systems
  – You can only ever become more secure, never absolutely secure
- See talk by Roman Wartel

