
1 AGLT2 Site Report – Shawn McKee, University of Michigan – March 23, 2015 / OSG-AHM

2 Site Summary
The ATLAS Great Lakes Tier-2 (AGLT2) is a distributed LHC Tier-2 for ATLAS spanning UM/Ann Arbor and MSU/East Lansing.
– Roughly 50% of storage and compute at each site
– 6650 single-core job slots; MCORE slots: 10-508 (dynamic); 720 Tier-3 job slots usable by the Tier-2
– Average 9.54 HS06/slot; total of 62.9 kHS06, up from 49.0 kHS06 last spring
– 3.5 (3.7) petabytes of storage (adding 192 TB at MSU)
– Most Tier-2 services virtualized in VMware
– 2x40 Gb inter-site connectivity; UM has 100G to the WAN, MSU has 10G to the WAN; many 10Gb internal ports and 16 x 40Gb ports
– High-capacity storage systems have 2 x 10Gb bonded links
– 40Gb link between the Tier-2 and Tier-3 physical locations

3 LAN Network
The LAN for AGLT2 is working well; the example graph on the slide (from this morning) shows about 10 GBytes/sec during job startup. We have been working on some SDN demos and have reconfigured our network at UM to work around some problems with OpenFlow on the Dell S4810.

4 AGLT2 100G Network Details
(Slide diagram: AGLT2 100G connectivity to ESnet, LHCONE, and Internet2.)

5 Equipment Deployment
– All FY14 funds expended
– Purchased Dell R620s from another customer's cancelled order: large memory (256GB), dual 10G (RJ45), dual power supplies; got the same $/HS06 as our best prior deal
– Spent $16K of FY15 funds getting 3 of these at UM
– Purchased additional 10G switching: Dell N4032 / N4032F
– Storage purchase for the Tier-2: MD3460, 60x4TB (at MSU)
– UM purchases from January are online now
– Working on bringing up the new network, compute, and storage at MSU

6 HTCondor CE at AGLT2
Bob Ball worked for ~2 months on the AGLT2 setup
– Steep learning curve for newbies
– Lots of non-obvious niceties in preparing the job-router configuration
– RSL is no longer available for routing decisions; instead you can modify variables and place them in ClassAd variables set in the router. Used at AGLT2 to control MCORE slot access (see the sketch below)
– condor_ce_reconfig will put into effect any dynamically changed job routes; if done via cron, make sure PATH is correctly set
Currently in place on all gatekeepers, ready to complete the cut-over. Bob will present details later.
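As an illustration of the ClassAd-variable approach, a job route of roughly this shape can tag 8-core CE jobs with a custom attribute that downstream policy can match on. This is a hedged sketch only; the route name, the JobIsMCORE attribute, and the xcount test are hypothetical, not AGLT2's production route:

    JOB_ROUTER_ENTRIES @=jre
    [
      name = "Local_MCORE";
      TargetUniverse = 5;
      Requirements = (TARGET.xcount =?= 8);
      eval_set_RequestCpus = 8;
      set_JobIsMCORE = true;
    ]
    @jre

Because such routes can be edited in place, a cron job that rewrites them and runs condor_ce_reconfig (with PATH set correctly) gives the dynamic behavior described above.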

7 MCORE at AGLT2
AGLT2 has supported MCORE jobs for many months now. Condor is configured for two MCORE job types:
– Static slots (10 total, 8 cores each)
– Dynamic slots (578 of 8 cores each)
Requirements statements are added by the "condor_submit" script, depending on the count of queued MP8 jobs; the HTCondor-CE does this in its job routes. The result is instant access for a small number of jobs, with gradual release of cores for more over time. Full details at https://www.aglt2.org/wiki/bin/view/AGLT2/MCoreSetup
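For context, dynamic slots like these are normally carved out of an HTCondor partitionable slot on each worker node; a minimal sketch, with values illustrative rather than AGLT2's actual settings:

    # startd config sketch: one partitionable slot spanning the machine;
    # dynamic slots (e.g. 8 cores for an MP8 job) are split off on demand
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%,mem=100%,disk=100%
    SLOT_TYPE_1_PARTITIONABLE = True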

8 Cgroups at AGLT2
Implemented in December. Simple implementation via an added file in /etc/condor/config.d:
– BASE_CGROUP = htcondor
– CGROUP_MEMORY_LIMIT_POLICY = soft
/etc/cgconfig.conf is extended to add "group htcondor".
BEWARE: you MUST have maxMemory defined in submitted jobs; an HTCondor bug will otherwise bite you and limit all jobs, always, to 128MB RAM.
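The "group htcondor" addition mentioned above looks roughly like this in /etc/cgconfig.conf (controller list assumed from HTCondor's documentation of that era):

    group htcondor {
        cpu {}
        cpuacct {}
        memory {}
        freezer {}
        blkio {}
    }

On SL6, restart the cgconfig service after editing so the htcondor hierarchies exist before condor starts.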

9 Middleware Deployment Plans
Currently very up-to-date on middleware:
– OSG-CE 3.2.20
– HTCondor 8.2.7
– dCache 2.10.19
Three gatekeepers: 1 production CE for ATLAS, 1 test CE, 1 CE for all other VOs. All run Scientific Linux 6; prepping for SL7 now.

10 Update on DIIRT
At HEPiX, Gabriele Carcassi presented "Using Control Systems for Operation and Debugging". This effort has continued and is now called DIIRT (Data Integration In Real Time).
(Slide diagram: scripts write CSV or JSON files to NFS; the diirt server serves them over WebSockets + JSON to web pages (HTML + JavaScript) and to Control System Studio, the UI for operators.)
Currently implemented:
– Scripts populate an NFS directory from condor/ganglia
– Files are served by the diirt server through web sockets
– Control System Studio can create "drag'n'drop" UIs
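The "scripts populate NFS" step can be as small as a loop that dumps monitoring values to JSON; a hypothetical sketch (the paths, file names, and field names are assumptions, not the actual AGLT2 scripts):

    #!/usr/bin/env python
    # Hypothetical sketch: publish HTCondor queue counts as JSON on an
    # NFS export that the diirt server watches. Not the actual AGLT2 script.
    import json
    import os
    import subprocess
    import time

    NFS_DIR = "/nfs/diirt/aglt2"  # assumed NFS directory served by diirt

    def count_jobs(status):
        # JobStatus 1 = idle (queued), 2 = running
        out = subprocess.check_output(
            ["condor_q", "-constraint", "JobStatus == %d" % status,
             "-format", "%d\n", "ClusterId"])
        return len(out.splitlines())

    while True:
        data = {"queued": count_jobs(1),
                "running": count_jobs(2),
                "timestamp": int(time.time())}
        tmp = os.path.join(NFS_DIR, "jobs.json.tmp")
        with open(tmp, "w") as f:
            json.dump(data, f)
        os.rename(tmp, os.path.join(NFS_DIR, "jobs.json"))  # atomic replace
        time.sleep(60)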

11 Original DIIRT UI
The canvas allows drag-n-drop of elements to assemble views, with no programming required. The server can feed remote clients in real time. Project info at http://diirt.org/

12 DIIRT via Web
Axes and annotations via pull-downs, for either site or both.

13 Software-Defined Storage Research
An NSF proposal (multi-campus) is being submitted today. We are exploring Ceph for future software-defined storage. The goal is centralized storage that supports in-place access from CPUs across campuses. The plan intends to leverage Dell "dense" storage, MD3xxx (12 Gbps SAS), in JBOD mode.

14 Future Plans
Our Tier-3 uses Lustre 2.1 and has ~500TB:
– Approximately 35M files averaging 12MB/file
– We have purchased new hardware providing another 500TB
– Intend to go to Lustre 2.7.0, using Lustre on ZFS, for this (see the sketch after this list)
– Plan: install the new Lustre instance, migrate the existing Lustre data over, then rebuild the older hardware into the new instance, retiring some components for spare parts
Still exploring OpenStack as an option for our site; would like to use Ceph for a back-end.
New network components support Software Defined Networking (OpenFlow). Once OpenFlow v1.3 is supported we intend to experiment with SDN in our Tier-2 and as part of the LHCONE point-to-point testbed.
Working on IPv6 dual-stack for all nodes in our Tier-2.
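For reference, a ZFS-backed Lustre 2.7 OST is created roughly as below; the pool name, fsname, index, MGS node, and device names are hypothetical placeholders, not our actual layout:

    # Hypothetical sketch only: create a raidz2-backed OST and mount it.
    mkfs.lustre --ost --backfstype=zfs --fsname=umt3 --index=0 \
        --mgsnode=mgs.aglt2.org@tcp \
        ostpool/ost0 raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd
    mount -t lustre ostpool/ost0 /mnt/ost0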

15 Summary
Things are working well, and we have our purchases in place. Interesting possibilities are being worked on.
Questions?

