Ramping up FAX and WAN direct access
Rob Gardner, on behalf of the atlas-adc-federated-xrootd working group
Computation and Enrico Fermi Institutes, University of Chicago
ADC Development Meeting, February 3, 2014

Slide 2: Examine the Layers – as in prior reports
New results at increasing scale and complexity; tests are limited to renamed Rucio sites.
Testing layers, by capability:
– Panda re-brokering (future)
– HammerCloud functional and stress tests, WAN testing
– Network cost matrix (continuous)
– SSB functional tests (continuous)
– Failover to Federation (production)

Slide 3: The New Global Logical Filename
With Rucio we are no longer dependent on the LFC:
– Brings large gains in stability and scalability
– Simplifies joining the federation
– Speeds up file lookups
– Makes for much nicer-looking gLFNs
New gLFN format: /atlas/rucio/scope:filename
The N2N plug-in recomputes the Rucio LFN /rucio/scope/xx/yy/filename and checks each space token at the site for such a path.
– Reducing the number of space-token paths will make this even more efficient
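For illustration, the gLFN-to-LFN step can be sketched in a few lines of Python, assuming the standard Rucio md5-based path convention; the per-space-token existence check is site-specific and only indicated in a comment:

```python
import hashlib

def glfn_to_rucio_lfn(glfn):
    """Translate a gLFN (/atlas/rucio/scope:filename) into the
    deterministic Rucio path (/rucio/scope/xx/yy/filename)."""
    prefix = "/atlas/rucio/"
    if not glfn.startswith(prefix):
        raise ValueError("not a Rucio gLFN: %s" % glfn)
    scope, name = glfn[len(prefix):].split(":", 1)
    # xx/yy are the first two byte-pairs of md5("scope:filename")
    md5 = hashlib.md5(("%s:%s" % (scope, name)).encode("utf-8")).hexdigest()
    # user.* and group.* scopes are split into subdirectories
    if scope.startswith("user") or scope.startswith("group"):
        scope = scope.replace(".", "/")
    return "/rucio/%s/%s/%s/%s" % (scope, md5[0:2], md5[2:4], name)

# The site N2N then prepends each space-token path in turn
# (e.g. .../atlasdatadisk) and tests whether the file exists there.
```

Because the path is a pure function of scope and filename, no catalogue callout is needed, which is where the stability and lookup-speed gains come from.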

Slide 4: Summary of FAX Site Deployments
Standardized deployment procedures:
– Goals are largely achieved: Twiki documentation, FAX RPMs in the WLCG repo, etc.
Software components:
– The Xrootd release requirement and X.509 support are mostly in place
– Rucio N2N deployment is in progress (dCache, DPM, Xrootd, POSIX (GPFS, Lustre))
  o ~60% of sites have deployed the N2N; sites are either cautious or delayed by a libcurl bug on the SL5 platform. A fix is ready, but we would still like to hear the DPM team's validation result.
  o EOS has its own, functioning N2N plug-in
The redirection network has been stable since the switch to Rucio.
Recommended scalable FAX site configuration for Tier 1s:
– Use a small Xrootd cluster instead of a single machine, similar to running multiple GridFTP doors
– BNL and SLAC use this configuration

Slide 5: Infrastructure – 10 redirectors

Slide 6: Infrastructure – 44 SEs with XRootD

Slide 7: (figure)

Slide 8: Active FAX sites

Slide 9: Basic redirection functionality
Tested paths:
– Direct access from clients to sites
– Redirection to non-local data ("upstream")
– Redirection from the central redirectors to the site ("downstream")
A host at CERN runs the set of probes against all sites.
Some sites are still waiting on: the Rucio-based gLFN → PFN mapper plug-in, a storage software upgrade, or Rucio renaming.
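A minimal sketch of what such a probe amounts to, assuming xrdcp as the client; the endpoint and test-file names below are hypothetical:

```python
import subprocess

def probe_copy(url, timeout=300):
    """Attempt an xrdcp of the named file to /dev/null; True on success."""
    try:
        result = subprocess.run(["xrdcp", "-f", url, "/dev/null"],
                                capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

TEST_FILE = "/atlas/rucio/user.ivukotic:probe-file.1"  # hypothetical test file

# Direct access: ask the site's own xrootd door for a file it hosts
direct_ok = probe_copy("root://fax.site.example:1094/" + TEST_FILE)

# Downstream redirection: ask a regional redirector, which should
# redirect the client to the site actually holding the file
downstream_ok = probe_copy("root://redirector.example:1094/" + TEST_FILE)

# Upstream redirection: ask the site for a file it does not host;
# the request should be forwarded up the redirection tree
upstream_ok = probe_copy("root://fax.site.example:1094"
                         "/atlas/rucio/user.ivukotic:remote-only.1")
```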

Slide 10: Regular status testing from the SSB
– Functional tests run once per hour
– Checks whether direct Xrootd access is working
– Sends an e-mail with the information to cloud support and fax-ops (problem notification and problem-resolved messages)

Slide 11: FAX throughput

Slide 12: Status of the Cost Matrix
– Continuously submits jobs to the 20 largest ATLAS compute sites
– Measures average IO to each endpoint (an xrdcp of a 100 MB file)
– Stores the results in the SSB, along with FTS and perfSONAR bandwidth data
– Data are sent to Panda for use in WAN brokering decisions
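Per endpoint, the measurement reduces to timing an xrdcp of the reference file; a minimal sketch with hypothetical names:

```python
import subprocess
import time

FILE_SIZE_MB = 100  # the reference file is a 100 MB file

def measure_throughput(endpoint, glfn):
    """Time an xrdcp of the reference file; return MB/s, or None on failure."""
    url = "root://%s/%s" % (endpoint, glfn)
    start = time.time()
    result = subprocess.run(["xrdcp", "-f", url, "/dev/null"],
                            capture_output=True)
    elapsed = time.time() - start
    return FILE_SIZE_MB / elapsed if result.returncode == 0 else None

# Run from each of the ~20 compute sites against every FAX endpoint;
# the resulting (site x endpoint) matrix is what gets uploaded to the SSB.
```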

Slide 13: Comparison of data used for cost-matrix collection between a representative compute-site/storage-site pair (figure)

Slide 14: WAN performance map
Performance map for the selection of WAN links:
– Can be used as a rough control factor for WAN load
– To be tracked as we see network upgrades in the next year

Slide 15: In production: failover-to-FAX
Over a two-month window, with a mix of PROD and ANALY jobs:
– Failover rates are relatively modest
– About 60k jobs failed over, of which 60% were recovered

Slide 16: Failover-to-FAX rate comparisons (plot of job counts; spikes mark storage issues)
The low rate of usage is a measure of the existing reliability of ATLAS storage sites.

Slide 17: Failover-to-FAX rate comparisons
WAN failover IO is reasonable; thus there is no penalty to the queue for using WAN failover.

Slide 18: Failover-to-FAX enabled queues
Any Panda queue (resource) can easily be enabled to use FAX for the fallback case.
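Conceptually, the fallback is a retry of a failed local stage-in through a global redirector; a simplified sketch follows (the real logic lives in the pilot, and the redirector host shown is illustrative):

```python
import subprocess

# Illustrative global redirector; in practice the pilot uses the
# redirector configured for (or nearest to) the site
GLOBAL_REDIRECTOR = "root://glrd.example.org:1094"

def stage_in(local_pfn, glfn, dest):
    """Try the local replica first; on failure, fall back to FAX."""
    for url in (local_pfn, "%s/%s" % (GLOBAL_REDIRECTOR, glfn)):
        if subprocess.run(["xrdcp", "-f", url, dest]).returncode == 0:
            return url  # record which source actually served the file
    raise IOError("stage-in failed both locally and via FAX: %s" % glfn)
```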

Slide 19: WAN direct access testing
– Jobs directly access remote FAX endpoints
– Reveals an interesting WAN landscape
– Relative WAN event rates and CPU efficiency are very good in the DE cloud (at the scale of tens of jobs)
– The question is at what job scale one reaches diminishing returns
(HammerCloud results from Friedrich Hoenig)

Slide 20: WAN load test (200-job scale)
– Uses the HammerCloud framework in the DE cloud; SMWZ H → WW analysis
– Some uncertainty in the number of concurrently running jobs (not directly controllable)
– Indicates a reasonable opportunity for re-brokering

Slide 21: Load testing with direct WAN IO
– 744 files (~3.7 GB each), reading the FDR dataset over the WAN with TTreeCache set to 30 MB
– Limited to 250 jobs in the test queue
– "Deep read": 10% of the events, all ~8k branches
– Used most of the 10 Gbps connection
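For reference, the TTreeCache setting used in this test corresponds to a few lines of PyROOT; a sketch with placeholder URL and tree name:

```python
import ROOT

# Open a file directly over the WAN through a FAX redirector (placeholder URL)
f = ROOT.TFile.Open("root://redirector.example:1094//atlas/rucio/scope:file.root")
tree = f.Get("physics")                         # placeholder tree name
tree.SetCacheSize(30 * 1024 * 1024)             # TTC = 30 MB, as in the test
tree.AddBranchToCache("*", True)                # "deep read": cache all branches
for i in range(0, int(tree.GetEntries()), 10):  # touch ~10% of the events
    tree.GetEntry(i)
```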

Slide 22: FAX user tools
Useful for Tier 3 users or for access to ATLAS data from non-grid clusters (e.g. cloud or campus clusters).
AtlasLocalRootBase package: localSetupFAX
– Sets up the dq2 client
– Sets up grid middleware
– Sets up the xrootd client
– Sets up an optimal FAX access point
  o Uses the geographical distance from the client IP to the FAX endpoints
– FAX tools:
  o isDSinFAX.py
  o FAX-setRedirector.sh
  o FAX-get-gLFNs.sh
This removes the need for redirector knowledge and eases support.

Slide 23: Conclusions, Lessons, To-do
Significant stability improvements for sites using the Rucio namespace mapper:
– With the removal of LFC callouts, no redirector stability issues have been observed
Tier 1 Xrootd proxy stability issues:
– Observed under very large loads during stress tests (O(1000) clients), but with no impact on the backend SE
– Adjustments were made, with success on re-test
– Suggests a configuration for protecting Tier 1 storage
The WAN landscape is obviously diverse:
– The cost matrix captures the capacities
Probes at the 10 Gbps link scale:
– Indicate an appropriate WAN job level of < 500 jobs (typically 10% of CPU capacity)
Controlled load testing is on-going.