AGLT2 Site Report
Shawn McKee, University of Michigan
March 23, 2015 / OSG-AHM

Site Summary
The ATLAS Great Lakes Tier-2 (AGLT2) is a distributed LHC Tier-2 for ATLAS spanning UM/Ann Arbor and MSU/East Lansing.
– Roughly 50% of storage and compute at each site
– 6650 single-core job slots
– MCORE slots (dynamic)
– 720 Tier-3 job slots usable by the Tier-2
– Average 9.54 HS06/slot
– 3.5 (3.7) Petabytes of storage (adding 192 TB at MSU)
– Total of 62.9 kHS06, up from 49.0 kHS06 last spring
– Most Tier-2 services virtualized in VMware
– 2x40 Gb inter-site connectivity; UM has 100G to the WAN, MSU has 10G to the WAN; lots of 10Gb internal ports and 16 x 40Gb ports
– High-capacity storage systems have 2 x 10Gb bonded links
– 40Gb link between the Tier-2 and Tier-3 physical locations

LAN Network
The LAN for AGLT2 is working well: about 10 GBytes/sec during job start-up is shown in the example from this morning (on the right).
We have been working on some SDN demos and have reconfigured our network at UM to work around some problems with OpenFlow on the S4810.

AGLT2 100G Network Details
(Network diagram: 100G connectivity to ESnet, LHCONE and Internet2.)

Equipment Deployment
– All FY14 funds expended
– Purchased Dell R620s from another customer's cancelled order: large memory (256GB), dual 10G (RJ45), dual power supplies; got the same $/HS06 as the best previous deal
– Spent $16K of FY15 funds getting 3 of these at UM
– Purchased additional 10G switching: Dell N4032 / N4032F
– Storage purchase for the Tier-2: MD3460, 60x4TB (at MSU)
– UM purchases from January are online now
– Working on bringing up the new network, compute and storage at MSU

HTCondor CE at AGLT2
Bob Ball worked for ~2 months on the AGLT2 setup
– Steep learning curve for newbies
– Lots of non-apparent niceties in preparing the job-router configuration
– RSL is no longer available for routing decisions
Variables can be modified and placed in ClassAd variables set in the router
– Used at AGLT2 to control MCORE slot access (see the route sketch below)
condor_ce_reconfig will put into effect any dynamically changed job routes
– If done via cron, make sure PATH is correctly set
Currently in place on all gatekeepers, ready to complete the cut-over
Bob will present details later
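To make the job-route mechanism concrete, here is a minimal, hypothetical sketch of two HTCondor-CE job routes that tag multi-core pilots via ClassAd attributes. The route names, the test on the incoming xcount attribute, and the custom IsMCoreJob attribute are illustrative assumptions, not AGLT2's actual configuration.

# Hypothetical job routes (e.g. a file under /etc/condor-ce/config.d/)
JOB_ROUTER_ENTRIES @=jre
[
  name = "Local_Condor_MCORE";
  TargetUniverse = 5;
  // match pilots that asked for 8 cores (xcount assumed set by the submitter)
  Requirements = (TARGET.xcount =?= 8);
  set_RequestCpus = 8;
  // custom attribute the batch system can use to steer jobs into MCORE slots
  set_IsMCoreJob = True;
]
[
  name = "Local_Condor";
  TargetUniverse = 5;
  Requirements = (TARGET.xcount =?= undefined || TARGET.xcount == 1);
  set_RequestCpus = 1;
]
@jre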

MCORE at AGLT2
AGLT2 has supported MCORE jobs for many months now
Condor is configured for two MCORE job types:
– Static slots (10 total, 8 cores each)
– Dynamic slots (578 of 8 cores each)
Requirements statements are added by the "condor_submit" script
– Depends on the count of queued MP8 jobs
– HTCondor-CE does this in job routes
The result is instant access for a small number of jobs, with gradual release of cores for more over time (see the slot-configuration sketch below).
Full details at
(Chart: queued and running MCORE jobs.)
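As a rough illustration of the two slot types above, the following HTCondor startd configuration sketch defines static 8-core slots plus a partitionable slot from which dynamic slots are carved on demand. The file name and the per-node counts are assumptions (the slide's numbers are cluster-wide totals), not AGLT2's actual configuration.

# /etc/condor/config.d/60-mcore-slots.conf  (hypothetical file name)

# A few static 8-core slots reserved for MCORE jobs
NUM_SLOTS_TYPE_1 = 2
SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = False

# One partitionable slot holding the remaining resources; HTCondor carves
# dynamic slots (e.g. 8-core ones for MCORE pilots) out of it on demand
NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 = cpus=auto, memory=auto
SLOT_TYPE_2_PARTITIONABLE = True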

Cgroups at AGLT2
Implemented in December
Simple implementation via an added file in /etc/condor/config.d
– BASE_CGROUP = htcondor
– CGROUP_MEMORY_LIMIT_POLICY = soft
/etc/cgconfig.conf extended to add "group htcondor" (see the file sketches below)
BEWARE: you MUST have maxMemory defined in submitted jobs
– An HTCondor bug will otherwise bite you and limit all jobs, always, to 128MB RAM
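For concreteness, the two files could look roughly like the sketches below. The config.d file name is hypothetical, and the controller list in cgconfig.conf follows the commonly documented HTCondor/RHEL6 pattern rather than being a copy of AGLT2's file.

# /etc/condor/config.d/59-cgroups.conf  (hypothetical file name)
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft

# /etc/cgconfig.conf (excerpt): create the parent cgroup HTCondor will use
group htcondor {
    cpu {}
    cpuacct {}
    memory {}
    freezer {}
    blkio {}
}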

Middleware Deployment Plans
Currently very up-to-date on middleware: OSG-CE, HTCondor, dCache
Three gatekeepers:
– 1 production CE for ATLAS
– 1 test CE
– 1 CE for all other VOs
All run Scientific Linux 6; prepping for SL7 now

Update on DIIRT
At HEPiX, Gabriele Carcassi presented on "Using Control Systems for Operation and Debugging". This effort has continued and is now called DIIRT (Data Integration In Real Time).
(Architecture diagram: Control System Studio UI for operators; NFS holding CSV or JSON; diirt server; Websockets + JSON; web pages in HTML + Javascript; scripts; arrows showing dependencies and data flow.)
Currently implemented:
– Scripts populate the NFS directory from condor/ganglia
– Files are served by the diirt server through web sockets (see the collector-script sketch below)
– Control System Studio can create "drag'n'drop" UIs
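The collector scripts themselves are not shown in the slides; the Python sketch below only illustrates the kind of thing they could do, namely dump a few HTCondor pool metrics as JSON into the NFS directory that the diirt server publishes over web sockets. The path, file name and metric choice are assumptions, not AGLT2's actual implementation.

#!/usr/bin/env python
"""Hypothetical DIIRT collector sketch: summarize HTCondor slot states
and write the result as JSON into the NFS area watched by the diirt server."""
import json
import subprocess
import time

NFS_DIR = "/nfs/diirt/data"  # assumed NFS export served by diirt


def slot_state_counts():
    """Return a {state: count} summary of all startd slots in the pool."""
    out = subprocess.check_output(["condor_status", "-af", "State"])
    counts = {}
    for line in out.decode().splitlines():
        state = line.strip()
        if state:
            counts[state] = counts.get(state, 0) + 1
    return counts


def main():
    payload = {"timestamp": int(time.time()), "slots": slot_state_counts()}
    with open(NFS_DIR + "/condor_slots.json", "w") as fh:
        json.dump(payload, fh)


if __name__ == "__main__":
    main()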

Original DIIRT UI
– Canvas allows drag-n-drop of elements to assemble views, no programming required
– Server can feed remote clients in real time
– Project info at

DIIRT via Web
Axes and annotation via pull-downs, for either site or both

Software-Defined Storage Research
– NSF proposal being submitted today (multi-campus)
– Exploring Ceph for future software-defined storage
– Goal is centralized storage that supports in-place access from CPUs across campuses
– Intend to leverage Dell "dense" storage, MD3xxx (12 Gbps SAS), in JBOD mode

Future Plans
Our Tier-3 uses Lustre 2.1 and has ~500TB
– Approximately 35M files averaging 12MB/file
– We have purchased new hardware providing another 500TB
– Intend to move to a newer Lustre release, using Lustre on ZFS, for this
– Plan: install the new Lustre instance, migrate the existing Lustre data over, then rebuild the older hardware into the new instance, retiring some components for spare parts
Still exploring OpenStack as an option for our site; would like to use Ceph for a back-end.
New network components support Software Defined Networking (OpenFlow). Once v1.3 is supported we intend to experiment with SDN in our Tier-2 and as part of the LHCONE point-to-point testbed.
Working on IPv6 dual-stack for all nodes in our Tier-2.

Summary
Things are working well. We have our purchases in place. Interesting possibilities are being worked on.
Questions?