Prague Site Report
Jiří Chudoba, Institute of Physics, Prague
HEPiX meeting, Prague
Local Organization
Institute of Physics:
 o 2 locations in Prague, 1 in Olomouc
 o 786 employees (281 researchers + 78 doctoral students)
Department of Networking and Computing Techniques (SAVT)
 o networking up to offices, mail and web servers, central services
Computing centre (CC)
 o large scale calculations
 o part of SAVT (except the leader, Jiri Chudoba)
Division of Elementary Particle Physics
 o Section: Department of detector development and data processing, head Milos Lokajicek
   - started large scale calculations, later transferred to CC
   - the biggest hw contributor (LHC computing)
   - participates in the CC operation
Server room I
Server room I (Na Slovance)
 o 62 m2, ~20 racks
 o 350 kVA motor generator, x 100 kVA UPS
 o 108 kW air cooling, 176 kW water cooling
 o continuous changes
 o hosts computing servers and central services
Other server rooms
New server room for SAVT
 o located next to server room I
 o independent UPS (24 kW now, max 64 kW n+1), motor generator (96 kW), cooling 25 kW (n+1)
 o dedicated for central services
 o 16 m2, now 4 racks (room for 6)
 o very high reliability required
 o first servers moved in last week
Server room Cukrovarnicka
 o another building in Prague
 o 14 m2, 3 racks (max 5), 20 kW central UPS, 2x 8 kW cooling
 o backup servers and services
Server room UTIA
 o 3 racks, 7 kW cooling, 3 + 5x 1.5 kW UPS
 o dedicated to the Department of Condensed Matter Theory
Clusters in CC - Dorje
Dorje: Altix ICE8200, 1.5 rack
 o 512 cores on 64 diskless WN, IB, 2 disk arrays (6+14 TB)
 o only local users: solid state physics, condensed matter theory
 o 1 admin for administration and user support
 o relatively small number of jobs, MPI jobs up to 256 processes
 o Torque + Maui, SLES10 SP2, SGI Tempo, MKL, OpenMPI, ifort
 o users run mostly: Wien2k, VASP, Fireball, APLS
Cluster LUNA
2 servers SunFire X4600
 o 8 CPUs, 32 cores, 256 GB RAM
4 servers SunFire V20z, V40z
Operated by CESNET Metacentrum, the distributed computing activity of NGI_CZ
Metacentrum
 o 9 locations
 o 3500 cores
 o 300 TB
Cluster Thsun, small group servers
Thsun
 o "private" cluster
 o small number of users, power users with root privileges
 o 12 servers of variable hw
Servers for groups
 o managed by groups in collaboration with CC
Cluster Golias
 o upgraded every year: several subclusters of identical hw
 o 3812 cores, HS06, almost 2 PB of disk space
 o the newest (March 2012) subcluster rubus:
   - 23 nodes SGI Rackable C1001-G13
   - 2x (Opteron cores), 64 GB RAM, 2x SAS 300 GB
   - 374 W (full load)
   - 232 HS06 per node, 5343 HS06 total
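As a sanity check on the rubus figures, 23 nodes at the rounded 232 HS06 per node reproduce the quoted total to within rounding; a minimal sketch of the arithmetic (all numbers taken from the slide above):

```python
# Sanity check of the rubus subcluster capacity (figures from the slide above).
nodes = 23                    # SGI Rackable C1001-G13 nodes
hs06_per_node = 232           # rounded per-node HEP-SPEC06 rating
print(nodes * hs06_per_node)  # 5336 -- close to the quoted 5343 HS06,
                              # which presumably uses the unrounded per-node result
```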
Golias shares
[charts: 2011 HS06 shares per project (Alice+Star, Atlas, D0, Solid, Calice, Auger); subclusters' contribution to the total performance; planned vs. real usage (walltime)]
WLCG Tier2
Cluster + xrootd Rez
2012 pledges:
 o ATLAS: HS06, 1030 TiB pledged; HS06, 1300 TB available
 o ALICE: 5000 HS06, 420 TiB pledged; 7564 HS06, 540 TB available
 o delivery of almost 600 TB delayed due to floods
66% efficiency is assumed for WLCG accounting
 o sometimes under 100% of pledges
Low cputime/walltime ratio for ALICE
 o not only on our site
 o tests with limits on the number of concurrent jobs (last week):
   - "no limit" (about 900 jobs): 45%
   - limit of 600 jobs: 54%
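The cputime/walltime ratios quoted above come from batch accounting. The sketch below shows one way such a per-queue ratio can be derived from a standard Torque accounting file ("E" job-end records with resources_used.cput / resources_used.walltime fields); the file path and any queue names are illustrative assumptions, not the actual Golias configuration.

```python
#!/usr/bin/env python
"""Rough sketch: per-queue cputime/walltime ratio from one Torque accounting file."""
import sys
from collections import defaultdict

def to_seconds(hms):
    """Convert a Torque 'HH:MM:SS' duration to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

cput = defaultdict(int)
wall = defaultdict(int)

# e.g. /var/spool/torque/server_priv/accounting/20120420 (path is an assumption)
with open(sys.argv[1]) as f:
    for line in f:
        # record format: timestamp;record_type;job_id;key=value key=value ...
        parts = line.strip().split(";")
        if len(parts) < 4 or parts[1] != "E":
            continue
        fields = dict(kv.split("=", 1) for kv in parts[3].split() if "=" in kv)
        queue = fields.get("queue", "unknown")
        if "resources_used.cput" in fields and "resources_used.walltime" in fields:
            cput[queue] += to_seconds(fields["resources_used.cput"])
            wall[queue] += to_seconds(fields["resources_used.walltime"])

for queue in sorted(wall):
    if wall[queue]:
        print("%-12s cpu/wall = %.0f%%" % (queue, 100.0 * cput[queue] / wall[queue]))
```

For single-core grid jobs the ratio is a direct CPU efficiency measure; for multi-core jobs cputime can legitimately exceed walltime.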
Utilization
Very high average utilization
 o several different projects, different tools for production
 o D0: production submitted locally by 1 user
 o ATLAS: PanDA, Ganga, local users; DPM
 o ALICE: VO box; xrootd
[utilization plots: D0, ALICE, ATLAS]
Networking
CESNET upgraded our main Cisco router
 o Catalyst 6509
 o supervisor SUP720 -> SUP2T
 o new 8x 10G X2 card
 o planned upgrade of power supplies 2x 3 kW -> 2x 6 kW
 o (2 cards 48x 1 Gbps, 1 card 4x 10 Gbps, FW service module)
External connection
 o exclusive: 1 Gbps (to FZK) + 10 Gbps (CESNET)
 o shared: 10 Gbps (PASNET - GEANT)
 o not enough for the ATLAS T2D limit (5 MB/s to/from T1s)
 o perfSONAR installed
[plots: FZK -> FZU, FZU -> FZK, PASNET link]
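For scale, a back-of-the-envelope reading of the T2D requirement, assuming roughly 10 ATLAS Tier-1s transferring concurrently (both the Tier-1 count and the simultaneity are assumptions, not figures from the slide):

```python
# Rough estimate of aggregate bandwidth implied by 5 MB/s to/from each Tier-1.
mb_per_s_per_t1 = 5              # ATLAS T2D threshold per Tier-1, each direction
n_tier1 = 10                     # assumed number of ATLAS Tier-1s
aggregate_mbps = mb_per_s_per_t1 * 8 * n_tier1
print(aggregate_mbps)            # 400 Mbit/s per direction -- a large share of a 1 Gbps link
```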
Miscellaneous items
Torque server performance
 o W jobs, sometimes long response time
 o divide Golias into 2 clusters with 2 Torque instances?
 o memory limits for ATLAS and ALICE queues
CVMFS
 o used by ATLAS, works well
 o some older nodes have too small disks -> excluded for ATLAS
Management
 o Cfengine v2 used for production
 o Puppet used for the IPv6 testbed
2 new 64-core nodes
 o SGI Rackable H2106-G7, 128 GB RAM, 4x Opteron GHz, 446 HS06
 o frequent crashes when loaded with jobs
Another 2 servers with Intel SB expected
 o small subclusters with different hw
Water cooling
Active vs. passive cooling doors
 o 1 new rack with cooling doors
 o 2 new cooling doors on APC racks
Water cooling
 o good sealing is crucial
[plots: diskservers on/off (divider added); diskservers; rubus01]
Distributed Tier2, Tier3s
Networking infrastructure (provided by CESNET) connects all Prague institutions involved:
 o Academy of Sciences of the Czech Republic
   - Institute of Physics (FZU, Tier-2)
   - Nuclear Physics Institute
 o Charles University in Prague
   - Faculty of Mathematics and Physics
 o Czech Technical University in Prague
   - Faculty of Nuclear Sciences and Physical Engineering
   - Institute of Experimental and Applied Physics
Now only NPI hosts resources visible in the Grid
 o many reasons why others do not: manpower, suitable rooms, lack of IPv4 addresses
Data Storage group at CESNET
 o deployment for LHC projects discussed
Thanks to my colleagues for help with the preparation of these slides:
 o Marek Eliáš
 o Lukáš Fiala
 o Jiří Horký
 o Tomáš Hrubý
 o Tomáš Kouba
 o Jan Kundrát
 o Miloš Lokajíček
 o Petr Roupec
 o Jana Uhlířová
 o Ota Velínský