1 efi.uchicago.edu ci.uchicago.edu
Ramping up FAX and WAN direct access
Rob Gardner, on behalf of the atlas-adc-federated-xrootd working group
Computation and Enrico Fermi Institutes, University of Chicago
ADC Development Meeting, February 3, 2014

2 Examine the Layers, as in prior reports
New results at increasing scale and complexity. Tests are limited to sites renamed for Rucio.
Capability layers:
  – Panda re-broker (future)
  – HammerCloud Functional, HammerCloud Stress, WAN testing
  – Network cost matrix (continuous)
  – SSB Functional (continuous)
  – Failover to Federation (production)

3 The New Global Logical Filename
With Rucio we are no longer dependent on LFC
  – Brings a lot of stability and scalability
  – Simplifies joining the Federation
  – Speeds up file lookups
  – Makes for much nicer-looking gLFNs
New gLFN format: /atlas/rucio/scope:filename
N2N recalculates the Rucio LFN: /rucio/scope/xx/yy/filename
It then checks each space token at the site for such a path
  – Reducing the number of space token paths will make this even more efficient
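The mapping on this slide can be sketched roughly as follows. This is an illustration, not the N2N plug-in itself: it assumes that xx and yy are the first two pairs of hex digits of the MD5 digest of "scope:filename" (the slide does not spell this out), and the space token roots and function names are hypothetical.

```python
import hashlib
import os

GLFN_PREFIX = "/atlas/rucio/"


def glfn_to_rucio_lfn(glfn):
    """Turn /atlas/rucio/scope:filename into /rucio/scope/xx/yy/filename."""
    if not glfn.startswith(GLFN_PREFIX):
        raise ValueError(f"not a Rucio gLFN: {glfn!r}")
    scope, sep, filename = glfn[len(GLFN_PREFIX):].partition(":")
    if not sep or not scope or not filename:
        raise ValueError(f"expected scope:filename in {glfn!r}")
    # Assumption: xx and yy come from the MD5 hex digest of "scope:filename".
    digest = hashlib.md5(f"{scope}:{filename}".encode("utf-8")).hexdigest()
    return f"/rucio/{scope}/{digest[0:2]}/{digest[2:4]}/{filename}"


def candidate_paths(glfn, space_token_roots):
    """Build one candidate physical path per space token root at the site."""
    lfn = glfn_to_rucio_lfn(glfn)
    return [root.rstrip("/") + lfn for root in space_token_roots]


def resolve(glfn, space_token_roots, exists=os.path.exists):
    """Return the first candidate path that exists, or None if none does."""
    for path in candidate_paths(glfn, space_token_roots):
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    # Hypothetical space token roots, for illustration only.
    roots = ["/storage/atlasdatadisk", "/storage/atlasscratchdisk"]
    for path in candidate_paths("/atlas/rucio/mc12_8TeV:example.root", roots):
        print(path)
```

Each extra space token root adds one more existence check per lookup, which is why fewer space token paths make the lookup cheaper.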

4 Summary of FAX Site Deployments
Standardize the deployment procedures
  – Goals are largely achieved: Twiki doc, FAX rpms in WLCG repo, etc.
Software components
  – Xrootd release requirement and X509 are mostly achieved
  – Rucio N2N deployment in progress (dCache, DPM, Xrootd, Posix (GPFS, Lustre))
    o ~60% of sites have deployed the N2N; the others are either cautious about it or delayed by a libcurl bug on the SL5 platform. A fix is ready, but we would still like to hear the validation result from the DPM team.
    o EOS has its own, functioning N2N plug-in
Redirection network has been stable since the switch to Rucio
Recommending a scalable FAX site configuration for Tier 1s
  – Use a small xrootd cluster instead of a single machine
  – Similar to multiple GridFTP doors
  – BNL and SLAC use this configuration

5 Infrastructure: 10 redirectors

6 Infrastructure: 44 SEs with XROOTD

8 Active FAX sites

9 Basic redirection functionality
Direct access from clients to sites
Redirection to non-local data (“upstream”)
Redirection from central redirectors to the site (“downstream”)
Uses a host at CERN which runs a set of probes against sites
Waiting:
  – Rucio-based gLFN → PFN mapper plugin
  – Storage software upgrade
  – Rucio renaming

10 Regular Status Testing from the SSB
Functional tests run once per hour
Checks whether direct Xrootd access is working
Sends an email with info to cloud support and fax-ops
  – Problem notification
  – Problem resolved
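The notification pattern on this slide (one message when a problem appears, one when it clears) could be sketched as below. The slide does not describe the actual SSB test logic, so the `Notification` type, the function name, and the rule that unseen sites start healthy are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Notification:
    site: str
    kind: str  # "problem" or "resolved"


def notify_on_change(previous_ok, current_ok):
    """Compare two hourly test rounds and report only status changes.

    Both arguments map a site name to True (direct access works) or False.
    """
    notes = []
    for site, ok in sorted(current_ok.items()):
        # Assumption: a site not seen in the previous round counts as healthy.
        was_ok = previous_ok.get(site, True)
        if was_ok and not ok:
            notes.append(Notification(site, "problem"))
        elif not was_ok and ok:
            notes.append(Notification(site, "resolved"))
    return notes


if __name__ == "__main__":
    last_hour = {"SITE_A": True, "SITE_B": False}
    this_hour = {"SITE_A": False, "SITE_B": True}
    for note in notify_on_change(last_hour, this_hour):
        print(f"{note.site}: {note.kind}")
```

Reporting only transitions keeps an hourly test from mailing the same failure every hour while a site stays down.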

11 FAX Throughput

12 Status of Cost Matrix
Submits jobs into the 20 largest ATLAS compute sites (continuously)
Measures average IO to each endpoint (an xrdcp of a 100 MB file)
Stores results in SSB, along with FTS and perfSONAR bandwidth data
Data sent to Panda for use in WAN brokering decisions
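A minimal sketch of how such timing samples might be reduced to a cost matrix is shown below. It assumes each probe job reports the elapsed time of one 100 MB copy; the site names, the sample format, and the function names are hypothetical, and the real collection pipeline is not described on the slide.

```python
from collections import defaultdict
from statistics import mean

FILE_SIZE_MB = 100.0  # size of the test file named on the slide


def transfer_rate_mb_s(elapsed_seconds, size_mb=FILE_SIZE_MB):
    """Convert the elapsed time of one copy into a rate in MB/s."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return size_mb / elapsed_seconds


def build_cost_matrix(samples):
    """Average the measured rate per (compute site, storage endpoint) pair.

    samples: iterable of (compute_site, endpoint, elapsed_seconds).
    """
    rates = defaultdict(list)
    for site, endpoint, elapsed in samples:
        rates[(site, endpoint)].append(transfer_rate_mb_s(elapsed))
    return {pair: mean(values) for pair, values in rates.items()}


if __name__ == "__main__":
    # Hypothetical measurements: (compute site, endpoint, seconds per copy).
    samples = [
        ("SITE_A", "ENDPOINT_1", 10.0),
        ("SITE_A", "ENDPOINT_1", 20.0),
        ("SITE_B", "ENDPOINT_1", 50.0),
    ]
    for pair, rate in sorted(build_cost_matrix(samples).items()):
        print(pair, f"{rate:.1f} MB/s")
```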

13 Comparison of the data used for cost matrix collection, for a representative pair of compute site and storage site.

14 WAN performance map
Performance map for the selection of WAN links
Can be used as a rough control factor for WAN load
Track as we see network upgrades in the next year

15 In Production: Failover-to-FAX
Two-month window
Mix of PROD and ANALY
Failover rates are relatively modest
About 60k jobs, 60% recovered

16 Failover-to-FAX rate comparisons
Low rate of usage is a measure of the existing reliability of ATLAS storage sites
(Plot labels: # jobs; storage issues)

17 Failover-to-FAX rate comparisons
WAN failover IO is reasonable
Thus the queue incurs no penalty from using WAN failover

18 Failover-to-FAX enabled queue
Any Panda queue can easily be enabled to use FAX for the fallback case.

19 WAN Direct Access Testing
Directly access remote FAX endpoints
Reveals an interesting WAN landscape
Relative WAN event rates and CPU efficiency are very good in DE (at the scale of tens of jobs)
The question is at what job scale one reaches diminishing returns
(HammerCloud results from Friedrich Hoenig)

20 WAN Load Test (200 job scale)
Using the HC framework in the DE cloud; SMWZ H → WW
Some uncertainty in the number of concurrently running jobs (not directly controllable)
Indicates reasonable opportunity for re-brokering

21 Load Testing with Direct WAN IO
744 files (~3.7 GB each), reading the FDR dataset over WAN, TTC = 30 MB
Limited to 250 jobs in the test queue
“Deep read”: 10% of events, all 8k branches
Used most of the 10g connection
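The numbers on this slide allow a quick back-of-envelope check, sketched below. It assumes decimal units (1 TB = 1000 GB, 10 Gb/s = 1250 MB/s), that all 250 jobs run concurrently, and that they share the link evenly; the slide only says the test used "most of" the link, so the per-job figure is an upper bound rather than a measurement.

```python
N_FILES = 744
FILE_SIZE_GB = 3.7   # approximate size per file, from the slide
N_JOBS = 250         # job limit of the test queue
LINK_GBIT_S = 10.0   # the "10g" connection


def dataset_size_tb(n_files=N_FILES, file_size_gb=FILE_SIZE_GB):
    """Total dataset size in TB, using decimal units."""
    return n_files * file_size_gb / 1000.0


def per_job_share_mb_s(link_gbit_s=LINK_GBIT_S, n_jobs=N_JOBS):
    """Even share of the link per job in MB/s (an upper bound)."""
    link_mb_s = link_gbit_s * 1000.0 / 8.0  # Gb/s to MB/s
    return link_mb_s / n_jobs


if __name__ == "__main__":
    print(f"dataset size: about {dataset_size_tb():.2f} TB")
    print(f"per-job share of the link: at most {per_job_share_mb_s():.1f} MB/s")
```

Under these assumptions the dataset is about 2.75 TB, and 250 jobs saturating the link would each see at most about 5 MB/s.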

22 FAX user tools
Useful for Tier 3 users or for access to ATLAS data from non-grid clusters (e.g. cloud, campus cluster, etc.)
AtlasLocalRootBase package: localSetupFAX
  – Sets up dq2-client
  – Sets up grid middleware
  – Sets up xrootd client
  – Sets up an optimal FAX access point
    o Uses geographical distance from client IP to FAX endpoints
  – FAX tools
    o isDSinFAX.py
    o FAX-setRedirector.sh
    o FAX-get-gLFNs.sh
Removes need for redirector knowledge
Eases support
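Choosing an access point by geographical distance could look roughly like the sketch below. It is not the localSetupFAX implementation: it assumes the client IP has already been resolved to a latitude and longitude by some geolocation lookup (not shown), uses the haversine great-circle distance, and the endpoint names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_endpoint(client, endpoints):
    """Return the name of the endpoint closest to the client.

    client: (lat, lon); endpoints: dict of name -> (lat, lon).
    """
    return min(endpoints, key=lambda name: haversine_km(*client, *endpoints[name]))


if __name__ == "__main__":
    # Hypothetical endpoints with approximate city coordinates.
    endpoints = {
        "fax-us.example.org": (41.88, -87.63),  # near Chicago
        "fax-eu.example.org": (46.20, 6.14),    # near Geneva
    }
    client = (48.86, 2.35)  # a client near Paris
    print(nearest_endpoint(client, endpoints))
```

Geographical distance is only a proxy for network proximity, which is presumably why the cost matrix described earlier measures actual transfer rates.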

23 Conclusions, Lessons, To-do
Significant stability improvements for sites using the Rucio namespace mapper
  – Also, with removal of LFC callouts, no redirector stability issues observed
Tier 1 Xrootd proxy stability issues
  – Have been observed for very large loads during stress tests (O(1000) clients), but with no impact on the backend SE
  – Adjustments were made and the re-test succeeded
  – Suggests a configuration for protecting Tier 1 storage
The WAN landscape is diverse
  – Cost matrix captures capacities
Probes of 10g link scale
  – Indicate an appropriate WAN job level of < 500 jobs
  – (typically 10% of CPU capacity)
Controlled load testing is ongoing

