
1 WLCG Tier-2 site in Prague: a little bit of history, current status and future perspectives
Dagmar Adamova, Jiri Chudoba, Marek Elias, Lukas Fiala, Tomas Kouba, Milos Lokajicek, Jan Svec
Prague, 04.09.2014

2 Outline
–Introducing the WLCG Tier-2 site in Prague
–A couple of history flashbacks (we celebrate the 10th anniversary)
–Current issues
–Summary and Outlook

3 HEP Computing in Prague: site praguelcg2 (a.k.a. the farm GOLIAS)
A national computing center for processing data from various HEP experiments
–Located in the Institute of Physics (FZU) in Prague
–Basic infrastructure already in 2002, but OFFICIALLY STARTED IN 2004 → 10th ANNIVERSARY THIS YEAR
Certified as a Tier-2 center of the LHC Computing Grid (praguelcg2)
–Collaboration with several Grid projects; April 2008, WLCG MoU signed by the Czech Republic (ALICE + ATLAS)
Excellent network connectivity: multiple dedicated 1–10 Gb/s connections to collaborating institutions; connected to LHCONE
Provides computing services for ATLAS + ALICE, D0, solid state physics, Auger, STAR, ...
Started in 2002 with: 32 dual PIII 1.2 GHz rack servers with 1 GB RAM, 18 GB SCSI HDD, 100 Mb/s Ethernet (29 of these decommissioned in 2009)
Storage: 1 TB disk array on an HP TC4100 server

4 History: 2002 -> 2014

5 Current numbers
1 batch system (Torque + Maui)
2 main WLCG VOs: ALICE, ATLAS
–FNAL's D0 (dzero) user group
–Other VOs: Auger, STAR
~4000 cores published in the Grid
~3 PB on new disk servers (DPM, XRootD, NFS)
Regular yearly upscale of resources based on various sources of financial support, mainly academic grants
The WLCG services include:
–APEL publisher, Argus authorization service, BDII, several UIs, ALICE VOBOX, CREAM CEs, Storage Elements
The use of virtualization at the site is quite extensive
ALICE disk XRootD Storage Element ALICE::Prague::SE:
–~1.113 PB of disk space in total
–Redirector/client + 3 clients @ FZU, 5 clients @ NPI Rez → a distributed storage cluster

6 Site Usage
–ATLAS and ALICE: continuous production
–Other projects: shorter campaigns
(plots: site usage by ALICE and ATLAS)

7 Some history flashbacks (celebrating the 10th anniversary)

8 ALICE PDC 2004 resource statistics: 14 sites
ALICE 2014 resource statistics: 74 sites

9 ALICE PDC resource statistics – 2005
25 sites in operation
Running jobs (8 November 2005):
Farm      Min    Avg    Max
Sum       1160   1651   1771
CCIN2P3   134    210    231
CERN-L    268    286    304
CNAF      255    362    394
FZK       0      531    600
Houston   0      3      14
Münster   2      58     81
Prague    43     61     71
Sejong    2      2      2
Torino    3      34     143

10 2006
–ALICE vobox set-up: fixing problems with the vobox proxy (unwanted expirations), AliEn services set up, manually changing the RBs used by the JAs
–Successful participation in the ALICE PDC'06: the Prague site delivered ~5% of total computing resources (6 Tier-1s, 30 Tier-2s)
–Problems with the fair-share of the site local batch system (then PBSPro)
2007
–Still problems with functioning of the ALICE vobox proxy during the PDC'07
–Problems with job submission due to malfunctions of the default RBs → the failover submission configured
–The Prague site delivered ~2.6% of total computing resources (significant increase of the number of Tier-2s)
–Migration to gLite 3.1, ALICE vobox on a 64-bit SLC4 machine, upgrade of the local CE serving ALICE to lcg-CE 3
2008
–Repeating problems with job submission through RBs → in Oct. the site re-configured for WMS submission
–Migration to the Torque batch system on a part of the site: some WNs on 32-bit in PBS and some on 64-bit in Torque
2009
–Installation and tuning of the creamCE
–Hybrid state: 'glite' vobox and WNs (32-bit); 'cream' vobox submitting JAs directly to the creamCE → Torque, 64-bit
–Dec: ALICE jobs submitted only to the creamCE

11 2010
–creamCE 1.6 / gLite 3.2 / SL5 64-bit installed in Prague → we were the first ALICE Tier-2 where cream 1.6 was tested and put into production
–NGI_CZ set in operation
2011
–Start of IPv6 implementation:
–The site router got an IPv6 address
–Routing set up in special VLANs
–ACLs implemented directly in the router
–IPv6 address configuration: DHCPv6
–Set-up of an IPv6 testbed

12 2012
Optimization of the ALICE XRootD storage cluster performance: an extensive tuning of the cluster, motivated by a remarkably different performance of the individual machines:
–data was migrated from the machine to be tuned to free disk arrays at another machine of the cluster
–the migration procedure was done so that the data was accessible all the time
–the empty machine was re-configured:
–number of disks in one array reduced
–set-up of disk failure monitoring
–RAID controller cache carefully configured
–readahead option set to a multiple of (stripe_unit * stripe_width) of the underlying RAID array
–no partition table used, to ensure proper alignment of the file systems: they were created with the right geometry options ("-d su=Xk,sw=YY" mkfs.xfs switches)
–mounting performed with the noatime option
(Parameters of one of the optimized XRootD servers shown before and after tuning)
2013
–Almost all machines migrated to SL6
–CVMFS installed on all machines
–Connected to LHCONE
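The readahead rule above can be made concrete with a small sketch. This is only an illustration of the arithmetic under assumed values, not the site's actual tuning procedure: the device name, stripe geometry and readahead multiple below are made-up examples.

```python
#!/usr/bin/env python3
"""Sketch of the RAID/XFS tuning steps described on the slide (example values only)."""

STRIPE_UNIT_KB = 256      # hypothetical per-disk stripe unit (su)
STRIPE_WIDTH = 10         # hypothetical number of data disks in the array (sw)
READAHEAD_MULTIPLE = 4    # readahead = multiple of (su * sw), as on the slide
DEVICE = "/dev/sdb"       # example block device backing one XRootD disk array

full_stripe_kb = STRIPE_UNIT_KB * STRIPE_WIDTH
readahead_kb = READAHEAD_MULTIPLE * full_stripe_kb
readahead_sectors = readahead_kb * 2   # blockdev --setra counts 512-byte sectors

# Commands corresponding to the tuning steps; printed rather than executed.
print(f"# full stripe = {full_stripe_kb} KiB, readahead = {readahead_kb} KiB")
print(f"mkfs.xfs -d su={STRIPE_UNIT_KB}k,sw={STRIPE_WIDTH} {DEVICE}")
print(f"blockdev --setra {readahead_sectors} {DEVICE}")
print(f"mount -o noatime {DEVICE} /data/xrootd")
```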

13 praguelcg2 contribution to WLCG Tier-2 ATLAS+ALICE computing resources (http://accounting.egi.eu/)
A long-term decline due to problems with financial support

14 Current issues

15 Monitoring issues
A number of monitoring tools in use: Nagios, Munin, Ganglia, MRTG, NetFlow, Gstat, MonALISA
Nagios:
–IPv6-only or IPv4-only servers connected to the central dual-stack node via Livestatus
–Some checks can be run from IPv4-only or IPv6-only Nagios nodes (a minimal check sketch follows below)
Munin 2:
–current version 2.0.19
–IPv6 in testing
Ganglia:
–problems if the proper gai.conf is not present
–gmetad doesn't bind to the IPv6 address on aggregators
NetFlow:
–plan to switch from v5 to v9 in order to use nfdump + nfsen
Some new sensors are needed to fully deploy IPv6; some additional work is necessary
MonALISA repository:
–A simple test version installed, plans for future development
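A minimal sketch of the kind of Nagios-style check that can be pinned to one address family and therefore distributed to an IPv4-only or IPv6-only Nagios node. The host name and port are placeholders, not the site's real services.

```python
#!/usr/bin/env python3
"""Minimal Nagios-plugin-style TCP check pinned to a single address family.

Exit codes follow the Nagios convention: 0 = OK, 2 = CRITICAL.
"""
import socket
import sys

HOST = "se1.example.farm.cz"   # hypothetical dual-stack storage node
PORT = 1094                    # xrootd default port, used here as an example
FAMILY = socket.AF_INET6       # use socket.AF_INET on an IPv4-only Nagios node

try:
    # Resolve only addresses of the requested family and try to connect.
    addr = socket.getaddrinfo(HOST, PORT, FAMILY, socket.SOCK_STREAM)[0][4]
    with socket.socket(FAMILY, socket.SOCK_STREAM) as s:
        s.settimeout(5)
        s.connect(addr)
    proto = "IPv6" if FAMILY == socket.AF_INET6 else "IPv4"
    print(f"OK - {HOST}:{PORT} reachable over {proto}")
    sys.exit(0)
except OSError as err:
    print(f"CRITICAL - {HOST}:{PORT} not reachable: {err}")
    sys.exit(2)
```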

16 Network monitoring – weathermap
–The LHCONE link is heavily utilized (capacity 10 Gbps)
–Nagios for alerts

17 Network architecture at FZU

18

19 IPv6 deployment
(plots: outgoing IPv4 and IPv6 local traffic from the DPM servers)
–Currently on dual-stack: the DPM head node, all production disk nodes, all but 2 subclusters of WNs
–Over IPv6 goes: dpns between the disk nodes and the head node, srm between the WNs and the head node, actual data transfer via gridftp
–IPv6 enabled on the ALICE vobox
(a dual-stack sanity-check sketch follows below)
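A small sketch of how the dual-stack state of a node can be sanity-checked; the head node name below is a placeholder, not the real FZU hostname.

```python
#!/usr/bin/env python3
"""Quick dual-stack sanity check for a DPM head node (illustrative only)."""
import socket

HEADNODE = "dpm-head.example.farm.cz"   # hypothetical DPM head node name

def addresses(host, family):
    """Return the set of addresses of the given family published for host."""
    try:
        infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
        return {info[4][0] for info in infos}
    except socket.gaierror:
        return set()

v4 = addresses(HEADNODE, socket.AF_INET)
v6 = addresses(HEADNODE, socket.AF_INET6)
print(f"IPv4 addresses: {sorted(v4) or 'none'}")
print(f"IPv6 addresses: {sorted(v6) or 'none'}")
print("dual-stack OK" if v4 and v6 else "NOT dual-stack")
```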

20 Site services management
–Since 2008, services management done with CFEngine version 2
–cfagent Nagios sensor developed: a Python script checking CFEngine logs for fresh records (signals an error if the log is too old); a sketch follows below
–CFEngine v2 used for production
–Puppet used for the IPv6 testbed
–Migration to overall Puppet management in progress
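The slide only describes the sensor briefly, so here is a minimal sketch in its spirit: a Python check that flags a stale cfagent log. The log path and the one-hour threshold are assumptions, not taken from the talk, and the real sensor may also parse the log contents.

```python
#!/usr/bin/env python3
"""Sketch of a cfagent log-freshness sensor in the spirit of the one on the slide.

Exit codes follow the Nagios convention: 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
"""
import os
import sys
import time

CFAGENT_LOG = "/var/cfengine/cfagent.log"   # hypothetical log path
MAX_AGE_SECONDS = 3600                      # hypothetical freshness limit

try:
    age = time.time() - os.path.getmtime(CFAGENT_LOG)
except OSError as err:
    print(f"UNKNOWN - cannot stat {CFAGENT_LOG}: {err}")
    sys.exit(3)

if age > MAX_AGE_SECONDS:
    print(f"CRITICAL - {CFAGENT_LOG} last updated {int(age)} s ago (cfagent may not be running)")
    sys.exit(2)

print(f"OK - {CFAGENT_LOG} updated {int(age)} s ago")
sys.exit(0)
```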

21 NGI_CZ
Since 2010, NGI_CZ is recognized and in operation: https://wiki.metacentrum.cz/metawiki/NGI_CZ#Farma_golias_aka_praguelcg2 → all the events and relevant information about praguelcg2
–2 sites involved: praguelcg2 and prague_cesnet_lcg2
–A significant part of the services is provided by the praguelcg2 team
Services provided by NGI_CZ for the EGI infrastructure:
–Accounting (APEL, DGAS, CESGA portal)
–Resources database (GOC DB)
–Operations (https://operations-portal.egi.eu/): ROD (Regional Operator on Duty)
–Top-level BDII
–VOMS servers
–Meta VO
–User support (GGUS/RT): https://rt4.cesnet.cz/rt/
–Middleware versions: UMD 3.0.0, EMI 3.0

22 Use of external resources
Not much to choose from, really
Longer-term usage of the cluster 'skurut' in Prague: site prague_cesnet_lcg2, courtesy of the CESNET association – a long-established cooperation
NGI_CZ provided a single opportunity to use ~35 TB of disk storage in Pilsen – mostly for testing purposes:
–dCache manager used
–Evaluating the effect of switching/tuning TTreeCache and the dCap read-ahead (see the sketch below)
–Not much of a help as an extension of the home resources
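A minimal PyROOT sketch of the kind of TTreeCache / dCap read-ahead tuning being evaluated. The dcap URL, tree name and cache size are placeholders, and the DCACHE_* environment variable names are an assumption about the dcap client rather than something stated in the talk; they should be checked against the installed dcap version.

```python
#!/usr/bin/env python3
"""Sketch of TTreeCache / dCap read-ahead tuning (placeholders throughout)."""
import os
import ROOT

# Assumed dcap client read-ahead knobs; must be set before the file is opened.
os.environ.setdefault("DCACHE_RAHEAD", "1")                       # enable read-ahead (assumed name)
os.environ.setdefault("DCACHE_RA_BUFFER", str(4 * 1024 * 1024))   # 4 MB buffer (assumed name)

# Hypothetical dcap URL and tree name.
f = ROOT.TFile.Open("dcap://dcache.example.cz/pnfs/vo/data/sample.root")
if not f or f.IsZombie():
    raise SystemExit("could not open the test file")
tree = f.Get("esdTree")

# Enable TTreeCache: a 100 MB cache with all branches registered up front.
tree.SetCacheSize(100 * 1024 * 1024)
tree.AddBranchToCache("*", True)

for i in range(tree.GetEntries()):
    tree.GetEntry(i)       # reads now go through the TTreeCache

print(f.GetBytesRead(), "bytes read in", f.GetReadCalls(), "read calls")
```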

23 Summary and Outlook
–The Prague Tier-2 site has been performing as a distinguished member of the WLCG collaboration for 10 years now
–A stable upscale of resources
–High accessibility, reliable delivery of services, fast response to problems
–Into the upcoming years, we will do our best to keep up the reliability and performance level of the services
–Crucial is the high-capacity, state-of-the-art network infrastructure provided by CESNET
–However, the future LHC runs will require a huge upscale of resources, which will be impossible for us to achieve with the expected flat budget
–Like everybody else these days, we are searching for external resources: we got some help from CESNET, but need more
–As widely recommended, we will very likely try to collaborate with non-HEP scientific projects to get access to additional resources in the future

24 A couple of current plots

25 GRID for ALICE in Prague – Monitoring jobs (MonALISA)
Running ALICE jobs in Prague in 2013/2014: average = 996, maximum = 2227
Total number of processed jobs: ~5 million

26 ALICE Disk Storage Elements – 62 endpoints, ~34 PB
Prague scores with the largest Tier-2 storage

27 GRID for ALICE in Prague – Monitoring storage (MonALISA)
Network traffic on the Prague ALICE storage cluster in 2013/2014 (total disk space capacity 1.113 PB):
–Max total traffic IN/write: 195 MB/s
–Max total traffic OUT/read: 1.05 GB/s
–Total data OUT/read: 5.322 PB

