1
Oxford Experiences. Pete Gronbech, GridPP Project Manager. April 2016
2
Oxford Grid Cluster: GridPP4 status, Autumn 2015
Current capacity 16,768 HS06, with DPM storage provided by a mixture of older Supermicro 26-, 24- and 36-bay servers; the major capacity comes from Dell 510s and 710s (12-bay, with 2 or 4 TB disks). The majority of CPU nodes are 'twin-squared' Viglen Supermicro worker nodes: Intel E5 8-core CPUs (16 hyper-threaded cores each), providing 1300 job slots with 2 GB RAM per slot. The Grid Cluster now runs HTCondor behind an ARC CE.
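As an aside, the slot and memory figures quoted above can be tallied directly from the batch system. A minimal sketch, assuming the HTCondor client tools are available on the machine it runs on; the attribute names are standard slot ClassAds, not anything specific to the Oxford setup:

    # Tally slot CPUs and memory from the HTCondor collector and compare with
    # the ~1300 job slots x 2 GB RAM quoted on this slide.
    import subprocess

    out = subprocess.run(["condor_status", "-af", "Cpus", "Memory"],
                         capture_output=True, text=True, check=True).stdout
    cpus = 0
    mem_mb = 0
    for line in out.splitlines():
        c, m = line.split()
        cpus += int(c)
        mem_mb += int(m)                       # Memory is reported in MB
    print(cpus, "job slots,", mem_mb // 1024, "GB RAM")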
3
Oxford’s Grid Cluster
4
5
Primary VOs: ATLAS, LHCb, CMS Tier-3, ALICE Support, Others
ATLAS: main priority. 6th largest UK site (by metrics), with ~1300 CPU cores and 1.3 PB storage.
LHCb: 6th largest UK site (by metrics).
CMS Tier-3: supported by RALPPD's PhEDEx server. Useful for CMS, and for us, keeping the site busy in quiet times; during 2015 we delivered the same percentage as Bristol. However, it can block ATLAS jobs, which is not so desirable during the accounting period.
ALICE support: there is a need to supplement the support given to ALICE by Birmingham, and it made sense to keep this in SouthGrid, so Oxford has deployed an ALICE VO box. Oxford provides roughly 40% of ALICE CPU (no storage).
Others: 5th largest by metrics.
6
Other Oxford Work: UK Regional Monitoring, IPv6 Testing
UK Regional Monitoring: Oxford runs the Nagios-based WLCG monitoring for the UK. This includes the Nagios server itself and its support nodes (SE, MyProxy and WMS/LB), plus multi-VO Nagios monitoring.
IPv6 Testing: we have taken a leading part in IPv6 testing, with many services enabled and tested by the community. perfSONAR is IPv6 enabled; the RIPE Atlas probe is also on IPv6.
Cloud Development: OpenStack test setup (has run ATLAS jobs); VAC setup (LHCb, ATLAS and GridPP DIRAC server jobs); viab now installed.
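The basic check behind "service enabled and tested on IPv6" is simply: does the host publish an AAAA record and accept a connection over IPv6? A minimal sketch using only the standard library; the hostname and port are placeholders, not Oxford's real endpoints:

    # Minimal IPv6 reachability check for a service endpoint.
    import socket

    def ipv6_reachable(host, port, timeout=5):
        # True if the host has an AAAA record and accepts a TCP connection over IPv6.
        try:
            infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
        except socket.gaierror:
            return False                     # no AAAA record published
        for family, socktype, proto, _, sockaddr in infos:
            s = socket.socket(family, socktype, proto)
            s.settimeout(timeout)
            try:
                s.connect(sockaddr)
                return True                  # connected over IPv6
            except OSError:
                continue
            finally:
                s.close()
        return False

    print(ipv6_reachable("perfsonar.example.ac.uk", 443))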
7
Grid Tasks and Approx. FTE
HTCondor Batch (0.2 FTE): monitoring and tuning of the batch system.
DPM Storage: installation, testing, monitoring and maintenance of space tokens and hardware.
Grid Ops & Management: installation and management system (Cobbler, Puppet), GGUS tickets etc.
oVirt VM infrastructure (0.1 FTE): SL-based oVirt system set up to provide VM infrastructure for all service and test nodes.
IPv6 testing: IPv6 UI used by others in GridPP; IPv6 SE; perfSONAR early adopter; ATLAS RIPE probes.
Security Team membership: meetings and running UK security tests.
VAC testing: early adopter and tester.
Viab testing (0.05 FTE): recent testing.
National Nagios infrastructure: GridPP-wide essential service.
VO Nagios for UK: GridPP-wide service.
Backup VOMS Server.
DPM on SL7 testing: part of the storage group; early testing and bug discovery.
OpenStack Cloud test setup: OpenStack setup used for testing.
Total FTE ≈ 1.5
8
Juggling Many Tasks Successfully
9
Plan for GridPP5: manpower ramps down from 1.5 FTE to 0.5 FTE (an average of 1 FTE over GridPP5). We need to simplify the setup and reduce the number of development tasks.
10
Plan for GridPP5 Storage
Upgrade DPM to run on SL6, fully puppetised (we initially tried to install on SL7, but found too many incompatibilities to deal with in the short time available). We were the first site to install the latest DPM with the Puppet modules supplied by the DPM developers. This is a good thing, but it meant we were on the bleeding edge and fed back many missing features and bugs; Ewan worked with Sam to find all the missing parts, and we are still discovering issues now, such as publishing. Decommission out-of-warranty hardware.
11
Storage directions: reduced emphasis on storage?
We are heading in the direction of a T2C, and were acting as a test site for this way of working. ATLAS FAX is intended to deal with cases where files are missing: jobs could be sent to a site knowing the data is not there, to force use of FAX, but whether this scales is as yet unknown, and actually using xrootd redirection without an SE is as yet untested. We are currently short of staff, so we will continue in traditional mode. A question needs answering before the next hardware round: do we continue in the T2C direction, or act as a T2D for the SouthGrid sites? We have good networking and a fair-sized resource currently, and rather too much storage to be used just as a cache. What do the experiments want from us? We will continue running an SE for the foreseeable future.
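For concreteness, the access pattern under discussion amounts to: use a local copy if one exists, otherwise pull the file through the xrootd federation. A hedged sketch; the cache directory and redirector hostname are placeholders, and xrdcp is the standard XRootD copy client rather than anything FAX-specific:

    # Illustrative fallback from local cache to federation read (the FAX model).
    import os
    import subprocess

    CACHE_DIR = "/scratch/xrd-cache"                    # placeholder local cache area
    REDIRECTOR = "root://fax-redirector.example.org/"   # placeholder FAX redirector

    def fetch(lfn):
        # Return a local path for the file, pulling it through the federation
        # if no local copy exists.
        local = os.path.join(CACHE_DIR, os.path.basename(lfn))
        if os.path.exists(local):
            return local                                # already present locally
        os.makedirs(CACHE_DIR, exist_ok=True)
        # The redirector locates a site holding a replica and redirects the transfer.
        subprocess.run(["xrdcp", REDIRECTOR + lfn, local], check=True)
        return local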
12
Decommissioning old storage
17 Supermicro servers removed from the DPM SE (a reduction of 320 TB; new total 980 TB).
13
Storage March 2016 Simplified
Old servers removed, switched off or repurposed. Software up to date. OS up to date. Simplified.
14
OpenStack. [Diagram: worker-node VMs built from an ATLAS image ask Panda "I'm here, any work?" and receive jobs; OpenStack infrastructure (head node, storage holding the ATLAS and CMS VM images, networking); P. Love's ATLAS account.] This model is good if you have an existing OpenStack infrastructure. It helps with OS independence, but it is quite complex and not straightforward to set up.
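The diagram boils down to a pull model: a VM boots from the experiment image and repeatedly asks the central queue whether there is work. A heavily simplified sketch; the endpoint URL and JSON job format are invented placeholders, not the real Panda pilot protocol:

    # Highly simplified pull-model pilot ("I'm here, any work?").
    import json
    import subprocess
    import time
    import urllib.request

    QUEUE_URL = "https://panda.example.org/getjob"      # invented placeholder endpoint

    def run_pilot():
        while True:
            # "I'm here, any work?"
            with urllib.request.urlopen(QUEUE_URL) as resp:
                job = json.load(resp)
            if not job:                                 # nothing to do: ask again later
                time.sleep(60)
                continue
            subprocess.run(job["command"], shell=True)  # run the payload ("Yes, job")
            # stage-out and status reporting would go here

    if __name__ == "__main__":
        run_pilot()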
15
VAC
[Diagram: VAC factory nodes (aka worker nodes) ask "I'm here, any work?" and receive jobs.] Layers: hypervisor - KVM; libvirt manager - libvirt; virt manager - VAC; vrsh - installation/management image. Installation: install SL, install the VAC RPMs, scp the config file. The VAC layer decides when to start a VM: every 30 seconds it checks whether it has an empty slot and, if so, starts one VM, choosing the type (ATLAS, LHCb, GridPP, ...) according to the prioritisation defined in /etc/vac.conf (actually now multiple config files). Uses the site squid.
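A sketch of that decision loop as described above: every 30 seconds, if a slot is free, start one VM of the highest-priority type that wants work. The function bodies, type list and slot count are stand-ins for illustration; real VAC drives libvirt/KVM and reads its prioritisation from /etc/vac.conf:

    # Illustrative VAC factory loop, not the real implementation.
    import time

    VM_TYPES = ["atlas", "lhcb", "gridpp"]      # priority order; placeholder values
    TOTAL_SLOTS = 32                            # placeholder slot count for the factory

    def running_vms():
        return []                               # stand-in: real VAC queries libvirt

    def vo_wants_work(vm_type):
        return True                             # stand-in: target shares / back-off state

    def start_vm(vm_type):
        print("starting a", vm_type, "VM")      # stand-in: real VAC boots a KVM guest

    while True:
        if len(running_vms()) < TOTAL_SLOTS:    # do I have an empty slot?
            for vm_type in VM_TYPES:            # choose the highest-priority VM type
                if vo_wants_work(vm_type):
                    start_vm(vm_type)
                    break
        time.sleep(30)                          # re-evaluate every 30 seconds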
16
viab Configured mainly by web pages
Only the first node has to be installed manually (e.g. from a USB stick); all the rest can boot from any of the other installed nodes. Each node runs DHCP, TFTP and a squid cache so it can act as an installation server. Everything comes from the web, including certificates, but having the private parts on the web would be bad, so they are encrypted with a passphrase that is stored locally and must be copied to each node. You only have to copy the contents of /etc/viabkeys and republish the RPM via the web page. All nodes network-boot and always reinstall. Overall a simpler setup.
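The certificate handling above amounts to: encrypted private keys are published on the web, and each node uses its locally held passphrase (under /etc/viabkeys) to decrypt them. A rough sketch of that step using openssl; the cipher, file names and directory layout are assumptions for illustration, not viab's actual implementation:

    # Decrypt a web-published host key with the locally stored passphrase.
    import subprocess

    ENCRYPTED_KEY = "/var/www/viab/hostkey.pem.enc"   # placeholder: encrypted key fetched from the web
    PASSPHRASE_FILE = "/etc/viabkeys/passphrase"      # placeholder: passphrase copied to each node
    DECRYPTED_KEY = "/etc/grid-security/hostkey.pem"  # placeholder destination

    subprocess.run(
        ["openssl", "enc", "-d", "-aes-256-cbc",
         "-in", ENCRYPTED_KEY, "-out", DECRYPTED_KEY,
         "-pass", "file:" + PASSPHRASE_FILE],
        check=True,
    )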
17
viab
18
Vcycle Vcycle is VAC running on OpenStack – not tested at Oxford
Could be useful if the central Advanced Research Computing cluster starts offering an OpenStack setup.
19
Plan for GridPP5 CPU: GridPP4+ hardware money spent on CPU.
20
CPU Upgrade March 2016 Lenovo NeXtScale
25 nodes, each with dual Intel E5 v3 CPUs and 64 GB RAM. 800 new cores (new total ~2200).
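The headline figure is consistent with the node spec if hyper-threaded (logical) cores are counted, as on the earlier capacity slide; since the exact E5 v3 model is not stated, the per-socket figure below is an assumption:

    # Sanity check of the 800-core figure, counting logical cores.
    nodes = 25
    sockets_per_node = 2                   # dual-socket nodes
    logical_cores_per_socket = 16          # e.g. an 8-core E5 v3 with hyper-threading
    print(nodes * sockets_per_node * logical_cores_per_socket)   # 800, as quoted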
21
Plan for GridPP5 CPU: GridPP4+ hardware money spent on CPU. This rack is identical to the kit used by Oxford Advanced Research Computing. The initial plan is to plug it into our Condor batch system as worker nodes. When staff levels and time allow, we will investigate integrating the rack into the ARC cluster.
22
Can be part of a much bigger cluster
23
Plan for GridPP5 CPU: GridPP4+ hardware money spent on CPU. This rack is identical to the kit used by Oxford Advanced Research Computing. The initial plan is to plug it into our Condor batch system as worker nodes; when staff levels and time allow, we will investigate integrating the rack into the ARC cluster. Initially we will stick with what we know, i.e. HTCondor and viab, and work out how to integrate with a shared cluster later. Decommission out-of-warranty hardware. Should we move away from the standard grid middleware to viab?
24
Reduction in Staff count
Recent loss of staff shows We April 2016
25
Grid Tasks and Approx. FTE
HTCondor Batch (0.2 FTE): monitoring and tuning of the batch system.
DPM Storage (0.15 FTE): installation, testing, monitoring and maintenance of space tokens and hardware.
Grid Ops & Management: installation and management system (Cobbler, Puppet), GGUS tickets etc.
oVirt VM infrastructure (0.1 FTE): SL-based oVirt system providing VM infrastructure for all service and test nodes.
IPv6 testing: perfSONAR early adopter; ATLAS RIPE probes.
Security Team membership.
VAC testing.
Viab testing (0.05 FTE).
National Nagios infrastructure: GridPP-wide essential service.
VO Nagios for UK: GridPP-wide service.
Backup VOMS Server.
DPM on SL7 testing.
OpenStack Cloud test setup.
Total FTE ≈ 1.05
26
Conclusions A time of streamlining and rationalisation.
We can continue as a useful site by sticking to core tasks. We need to make a decision on the storage question. There is still a lot of work to do to investigate integrating with university resources: will this be possible, will it save time, or allow bursting to greater resources? There are possible benefits from cost savings on hardware maintenance and electricity.