Tier1 Status Report Andrew Sansum GRIDPP15 12 January 2006
Overview Usage (Make the oversight committee happy) Service Challenge Plans Robot Upgrade and CASTOR plans Other Hardware Upgrades Spec Benchmark (if time)
Utilisation Concerns July: Oversight Committee concerned that Tier-1 was under-utilised - Procurements delayed. Team perception was that the farm was quite well occupied and demand was increasing. Needed to: –Understand the mismatch between CPU use and occupancy –Maximise capacity available (minimise downtime etc.) –Address reliability issues: buy more mirror disks, investigate Linux HA etc., move to a limited on-call system –Maximise number of running experiments
January-July nominal capacity: 796KSI2K
Efficiency Monitoring Work by Matt Hodges. Automate post-processing of PBS logfiles each month, measuring JobCPUEfficiency = [CPU time]/[Elapsed time]. Important tool to improve farm throughput. Need to address low-efficiency jobs: they occupy job slots that could be used by other work.
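The efficiency measure above can be sketched in a few lines of Python. This is only an illustration, not the script used at the Tier-1: it assumes each accounting record carries `resources_used.cput` and `resources_used.walltime` fields in `HH:MM:SS` form, and the function names are hypothetical.

```python
import re

# Assumed record layout: "...;E;jobid;... resources_used.cput=HH:MM:SS resources_used.walltime=HH:MM:SS"
FIELD = re.compile(r"resources_used\.(cput|walltime)=(\d+:\d{2}:\d{2})")


def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def job_cpu_efficiency(record):
    """Return CPU time / elapsed time for one record, or None if it cannot be computed."""
    fields = dict(FIELD.findall(record))
    if "cput" not in fields or "walltime" not in fields:
        return None
    wallclock = hms_to_seconds(fields["walltime"])
    if wallclock == 0:
        return None  # avoid dividing by zero for jobs with no elapsed time
    return hms_to_seconds(fields["cput"]) / wallclock
```

A stuck job that accumulates wallclock time but no CPU time shows up with an efficiency close to 0, which is the kind of job the monthly report is meant to expose.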
[Table: Experiment by month (September, October, November, December). Rows: ATLAS, BaBar, BioMed, CMS, DTeam, D, H, LHCb, MINOS, Pheno, SNO, Theory, Zeus, Others]
BaBar December 2005. Stuck jobs clocking up wallclock time but no CPU. Probably system staff paused BaBar job queue to fix disk server. Typical analysis jobs.
Minos December 2005 NFS Server overload
CMS November 2005 dCache Server overload – 3Gb/s I/O to dCache
Outstanding Scheduling Problems How to schedule large memory jobs. A large memory job can occupy two job slots on an older 1GB memory system. –Is it better to always run whatever work you can, even at the expense of possible future job starts? –Is it better to limit large memory job starts and keep small memory systems free for small memory work that might not turn up? No inter-VO scheduling at present (give 40% to LHC for example). No intra-VO scheduling at present (give production 80%).
XMASS Availability Usually farm runs unattended over Christmas. A major failure early in the period can severely impact December CPU delivery [e.g. power failure last XMASS]. Martin Bly/Steve Traylen worked several days overtime over the period to fix faults. Fantastic availability and good demand led to one of the most productive Christmases ever.
XMASS Running
Have we succeeded Significant improvement for second half of 2005.
Scheduler Monitoring Work by Matt Hodges: Now we have heavy demand!!! – need to monitor MAUI scheduler. Put MAUI scheduling data into ganglia. Gain insight into scheduling – help experiments understand why their jobs don't start... Not an exact science – UB over-allocates CPU shares – we use this data simply to calculate target shares and schedule over a relatively short period of time (9 days, decaying by a factor of 0.7 per day)
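The effect of a 9-day window with a 0.7 per-day decay can be illustrated with a short Python sketch. This is not MAUI's implementation; the function name and the input layout (one usage figure per day, most recent day first) are assumptions made for the example.

```python
def decayed_share(daily_usage, daily_total, decay=0.7, depth=9):
    """Fraction of the farm used by one group over a decaying window.

    daily_usage[i] and daily_total[i] are the group's CPU use and the total
    CPU delivered i days ago (index 0 is the most recent day). Each older
    day counts `decay` times as much as the day after it.
    """
    weights = [decay ** age for age in range(depth)]
    used = sum(w * u for w, u in zip(weights, daily_usage))
    total = sum(w * t for w, t in zip(weights, daily_total))
    return used / total if total else 0.0
```

With these weights a day of heavy use counts in full immediately and for only about 6% (0.7 to the power 8) on the last day of the window, so a group's measured share recovers quickly once it stops running.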
Target Shares January Target shares Implemented
Service Challenges An ever-expanding project! 12 Months ago: –SC2 throughput March 2005 disk/disk –SC3 throughput July 2005 disk/tape –SC4 throughput April 2006 tape/tape 6 Months ago: –As above, but now add Service Phases and Tier-2s (significant part of Matt and Catalin's work: VO boxes, FTS, LFC, etc.)
Service Challenges 2 Months ago - as above but add: –SC3 throughput test 2 (16th January, 1 week) –Change April 2006 test to disk/disk (from 150MB/s. Motivation: several Tier-1 tape SRMs are late. –Add July 2006 test: 150MB/s. RAL will use this as an opportunity to use CASTOR –Add Tier-1 to Tier-2 throughput tests. 1 Month ago: –As above, but now request early February test to tape at 50MB/s –RAL unable to take part: too many commitments. May review closer to the date, depending on our work schedule. Other sites in a similar position.
Robot Upgrade Old Robot –6000 slots –Early 1990s hardware, but still pretty good –1 robot arm –Supports most recent generation drives but end of line –Still operational but migrate drives shortly and close New Robot (funded by CCLRC): STK SDL8530 –10,000 slots –Life expectancy of at least 2 drive generations –Up to 8 mini robots mounting tapes: faster, resilient –T10K drives/media in FY06: faster drives and bigger tapes
CASTOR Plans ADS software (that drives the tape robot) is home-grown, old and hard to support. Many limiting factors. Not viable for LCG operation. The only financially and functionally viable alternative to ADS found was CASTOR. Following negotiation with CERN, RAL is now a member of the CASTOR project (contributing SRM) – see Jens's talk. It makes no sense to operate both dCache and CASTOR. Two SRMs = double trouble**2
CASTOR PLANS Carry out migration from dCache to CASTOR2 over 12 months (CASTOR1 test system deployed in Autumn). Tension: –Deploy CASTOR as late as possible to allow proper build/testing –Deploy CASTOR as soon as possible to allow experiments to test CASTOR2 test system scheduled for end January Will be looking for trial users in May to throughput test CASTOR CASTOR will be used in SC4 tape throughput test in July. Provided we are satisfied, then CASTOR SRM will be offered as a production service in September for both disk and tape
dCache PLANS dCache is currently our production SRM to both disk and tape. dCache will remain a production SRM to disk (and part of tape capacity): –Until CASTOR is proven to work –Experiments have had adequate time to migrate data from dCache dCache phase out (probably) in 1H07 dCache may provide a plan B if CASTOR deployment is late – not desirable for all kinds of reasons.
Disk and CPU Hardware delivery scheduled for late February (evaluation nearly complete) –Modest CPU upgrade ( KSI2K) – modest demand –Spend more on disk (up to 135TB additional capacity) –CPU online early June, disk: July –Disk moving from external RAID array to internal PCI based RAID in order to reduce cost. Probably second round of purchases early in FY06. Size and timing to be agreed at Tier-1 board. Capacity available by September.
Other Hardware Oracle Hardware –See Gordon's talk for details –Mini Storage Area Network to meet Oracle requirements: 1 Fibre Channel RAID array, 4 server hosts, SAN switch (Qlogic SANbox 5200 stackable switch). Delivery in February. Upgrade 10 systems to mirror disks for critical services.
Network Plans Currently 2*1Gbit to CERN via UKLIGHT. 1*1Gbit to Lancaster. 2*1Gbit to SJ4 production. Upgrade to 4*1Gbit to CERN end January (for SC4). Upgrade site edge (lightpath) router to 10Gbit, end February. Attach Tier-1 at 10Gbit to edge, via 10Gbit uplink from Nortel 5530 switch (£5K switch stackable with our existing 5510 commodity units) (March). Attach T1 to CERN at 10Gbit early in SJ5 rollout (early summer).
Machine Rooms Extensive planning out to 2010 and beyond to identify growth constraints Major power and cooling work (additional 400KW) in A5 lower in 2004 funded by CCLRC E-Science to accommodate growth of Tier-1 and HPC systems. Sufficient to cool kit up to mid Further cooling expansion just started (>400KW) to meet profile out to 2008 hardware delivery for Tier-1 Investigating building new machine room for hardware installation.
T1/A and SPEC Work by George Prassas (hardware support) Motivation: Investigate whether the batch scaling factors used by T1/A were accurate Whether our performance/scaling mirrored the published results Help form a view about CPUs for future purchases
SPEC CPU Metrics SPECint2000/SPECint_base2000 –Geometric mean of 12 normalised ratios (one for each app) when compiled with aggressive/conservative compiler options SPECint_rate2000/SPECint_rate_base2000 –Same as above but for 12 normalised throughput ratios. The same applies to CFP2000.
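Warning: Maths! If $\alpha_1, \alpha_2, \alpha_3, \ldots, \alpha_n$ are positive real numbers, we define their geometric mean as: $(\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_n)^{1/n} = \sqrt[n]{\alpha_1 \alpha_2 \alpha_3 \cdots \alpha_n}$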
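As a hedged illustration of how a SPEC-style score follows from this definition, the Python sketch below computes the geometric mean of a list of normalised ratios. The function name is hypothetical, and the ratios in the usage note are made up for the example; they are not measured results.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive numbers, computed via logarithms.

    Summing logs instead of multiplying the ratios directly avoids
    overflow or underflow when there are many values.
    """
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

For example, `geometric_mean([2.0, 8.0])` is 4 (up to floating-point rounding), whereas the arithmetic mean of the same two ratios would be 5. The geometric mean is less dominated by a single unusually high ratio.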
Results - SPECint Scaling Factors
Results – SPECint Performance
Conclusions We have come an immense distance in 1 year. LHC service challenge work is ever expanding. Major effort to increase utilisation. Have (for planning purposes) been living in 2006 for most of Now we have arrived. CASTOR deployment will be a considerable challenge. Hardening the service will be an important part of work for 1H2006. Very little time now left and lots to do.