Efi.uchicago.edu ci.uchicago.edu FAX status report Ilija Vukotic on behalf of the atlas-adc-federated-xrootd working group S&C week Jun 2, 2014.


1 FAX status report
Ilija Vukotic, on behalf of the atlas-adc-federated-xrootd working group
S&C week, Jun 2, 2014

2 Content
Status
– Coverage
– Traffic
– Failover
– Overflow
Changes in localSetupFAX
Monitoring changes
– Changes in GLED collector, dashboard
– Failover & overflow monitoring
– FaxStatusBoard
Meetings
– Tutorial, 23-27 June: dedicated to instructing on xAOD and the new analysis model
– ROOTIO, 25-27 June

3 FAX topology
Topology change in North America
– added East and West
– will serve CA cloud
– all hosted at BNL
Will need NL cloud redirector

4 FAX in Europe
To come:
– Sara
– Nikhef
– IL cloud: IL-TAU, Technion, Weizmann

5 FAX in North America
To come:
– TRIUMF (June?)
– McGill (end of June)
– SCINET (end of June)
– Victoria (~August)

6 FAX in Asia
To come:
– Beijing (~two weeks)
– Tokyo
– Australia (few weeks)

7 Status
Most sites are running stably.
Glitches do happen but are usually fixed within a few hours.
SSB issues solved.
New sites added:
– IFAE
– PIC
– IN2P3-LPC
In need of restart:
– UNIBE-LHEP

8 Coverage
Now auto-updated Twiki page
– https://twiki.cern.ch/twiki/bin/view/AtlasComputing/FaxCoverage
Coverage is good (~85%), but we should aim for >95%!
Info fetched from http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary

9 Traffic
Slowly increasing
Max peak output record broken
Still small compared to what we expect will come

10 Failover
Running stably

11 Overflow status
The whole chain is ready.
I have set all the US queues to allow 3 Gbps to be both delivered to and delivered from sites.
Test tasks are submitted to sites that don't have the data, so that transfertype=FAX is invoked. This does not test the JEDI decision making (the one based on the cost matrix).
Waiting for actual jobs to check the full chain:
– Users not yet instructed to use JEDI client
– Waiting for JEDI monitor

12 Overflow tests
The test is the hardest IO test:
– 100% of events, all branches read, standard TTC/no AsyncPrefetch.
Site-specific FDR datasets (10 DSs, 744 files, 2.7 TB).
All the source/destination combinations of US sites.
All of it was submitted in 3 batches, but not all started simultaneously. Affected by priority degradation.
Three input files per job.
If the site is copy2scratch, the pilot does xrdcp to scratch; if not, jobs access files remotely.

13 Overflow tests
Error rate
– Total: 9188 jobs
– Finished: 9052
– Failed: 117 (1.3%)
o 24: OU reading OU (no FAX involved)
o 66: reading from WT2 (files are corrupted)
o 27 (0.29%): actual FAX errors where SWT2 did not deliver the files. Will be investigated.
o The rest are “Payload run out of memory”
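The percentages on this slide can be rechecked from the raw job counts. The sketch below assumes both rates are taken relative to the total of 9188 jobs, which the slide does not state explicitly; the variable names are illustrative only.

```python
# Job counts as quoted on the slide.
total_jobs = 9188
finished_jobs = 9052
failed_jobs = 117

# Failure breakdown as quoted on the slide (labels shortened here).
failures_by_cause = {
    "OU reading OU (no FAX involved)": 24,
    "reading from WT2 (corrupted files)": 66,
    "FAX errors (SWT2 did not deliver files)": 27,
}


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Assumption: both rates use the total job count as the denominator.
overall_failure_pct = percent(failed_jobs, total_jobs)
fax_failure_pct = percent(
    failures_by_cause["FAX errors (SWT2 did not deliver files)"], total_jobs
)
itemised_failures = sum(failures_by_cause.values())

print(f"overall failure rate: {overall_failure_pct:.1f}%")
print(f"FAX-only failure rate: {fax_failure_pct:.2f}%")
print(f"itemised failures: {itemised_failures} of {failed_jobs}")
```

With that denominator the figures round to the quoted 1.3% and 0.29%, and the three itemised causes already add up to the 117 failed jobs.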

14 Overflow tests
Jobs reading from local scratch, for comparison
Direct access site, reading locally, per job: 7.2 MB/s, 67% CPU eff, 71 ev/s
Copy2scratch site, per job: 11.0 MB/s, 97% CPU eff, 109 ev/s
Scout jobs

15 Overflow tests
Jobs reading remote sources
Direct access site, reading remotely, per job: 4.2 MB/s, 43% CPU eff, 42 ev/s
Direct access site, reading remotely, per job: 3.5 MB/s, 29% CPU eff, 34 ev/s
No saturation
Possibly a start of saturation
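To put the remote-read numbers in context, they can be expressed as fractions of the locally reading direct-access figures from the previous slide (7.2 MB/s, 67% CPU eff, 71 ev/s). This is only a rough comparison sketch; the dictionary keys are made up for illustration, and the runs may not be comparable in other respects.

```python
# Per-job figures for a direct access site reading locally (previous slide).
local_direct = {"mb_per_s": 7.2, "cpu_eff_pct": 67, "ev_per_s": 71}

# Per-job figures for the two remote-read cases on this slide.
remote_runs = [
    {"mb_per_s": 4.2, "cpu_eff_pct": 43, "ev_per_s": 42},
    {"mb_per_s": 3.5, "cpu_eff_pct": 29, "ev_per_s": 34},
]


def relative_to_local(remote, local):
    """Return each remote metric as a fraction of the local one."""
    return {key: remote[key] / local[key] for key in local}


ratios = [relative_to_local(run, local_direct) for run in remote_runs]

for run, ratio in zip(remote_runs, ratios):
    summary = ", ".join(f"{key}: {value:.0%}" for key, value in ratio.items())
    print(f"{run['mb_per_s']} MB/s remote run vs local -> {summary}")
```

On these numbers the remote jobs run at very roughly half the local per-job throughput and event rate.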

16 Overflow tests
MWT2 reading from OU and SWT2 simultaneously
In aggregate reached 850 MB/s, the limit for MWT2 at that time.

17 Cost matrix
Axes: destination, source
http://1-dot-waniotest.appspot.com/

18 localSetupFAX
Added command fax-ls
– Made by Shuwei YE.
– Will eventually replace isDSinFAX
– He will move all the other tools to Rucio
Change in fax-get-best-redirector
– Each call does three queries:
o SSB to get endpoints and their status
o AGIS to get the sites hosting the endpoints
o AGIS to get site coordinates
– Each call returns hundreds of kb
– Can't scale to a large number of requests
– Solution:
o Made GoogleAppEngine servlets that every 30 min take info from SSB and AGIS and deliver it from memory
o Information slimmed to what is actually needed: ~several kb
o Now requests are served in a few tens of ms.
o “Infinitely” scalable
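The caching idea behind the fax-get-best-redirector change can be sketched in a few lines. This is not the actual GoogleAppEngine servlet code: the class, the `slim` helper, the record fields and the fake fetch function are all hypothetical, and the sketch refreshes lazily on the first request after 30 minutes rather than on a timer as the slide describes.

```python
import time

REFRESH_SECONDS = 30 * 60  # the 30 min refresh interval quoted on the slide


def slim(records, keep=("endpoint", "status", "site", "latitude", "longitude")):
    """Keep only the fields a client needs (field names are hypothetical)."""
    return [{key: rec[key] for key in keep if key in rec} for rec in records]


class CachedEndpointInfo:
    """Serve upstream info from memory, re-fetching only when it is stale."""

    def __init__(self, fetch, refresh_seconds=REFRESH_SECONDS, clock=time.monotonic):
        self._fetch = fetch  # callable that queries the upstream services
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._data = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = (
            self._fetched_at is None
            or now - self._fetched_at >= self._refresh_seconds
        )
        if stale:
            self._data = self._fetch()
            self._fetched_at = now
        return self._data


# Demo with a fake upstream and a fake clock (no real SSB/AGIS calls).
calls = []


def fake_fetch():
    calls.append(1)
    raw = [{
        "endpoint": "root://example.invalid:1094",
        "status": "OK",
        "site": "EXAMPLE",
        "latitude": 0.0,
        "longitude": 0.0,
        "unused_detail": "x" * 1000,  # bulk that the slimming step drops
    }]
    return slim(raw)


fake_now = [0.0]
cache = CachedEndpointInfo(fake_fetch, clock=lambda: fake_now[0])
first = cache.get()                # first request triggers an upstream fetch
second = cache.get()               # served from memory, no new fetch
fake_now[0] = REFRESH_SECONDS + 1
third = cache.get()                # stale, so the upstream is queried again
print(len(calls))
```

The point of the design is that the cost of the large upstream queries is paid once per refresh interval, not once per client request.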

19 Monitoring – collector, dashboard
Problem: support of multi-VO sites
Meeting: Alex, Matevz, me
Issues:
– Site name:
o ATLAS reports it
o CMS does not, or does so badly; will fix it
– Requesting user's VO:
o ATLAS does it
o CMS is not strict about it. US-CMS uses GUMS. Will fix it.
Proposal:
– During the summer Matevz develops an XrdMon that can handle multi-VO messages
– Sends messages from multi-VO sites to a special “mixed” AMQ. Dashboard splits traffic according to user's VO.
Details: https://docs.google.com/document/d/1Syx3_vkwCfc5lj2lQzbUUrKT0Je238w6lcwVL7IY1GY/edit#
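The proposed split of a "mixed" message stream by the user's VO could look roughly like the sketch below. The message layout (plain dicts with `vo`, `site` and `bytes_read` keys) is purely an assumption for illustration and is not the real XrdMon or AMQ message format.

```python
from collections import defaultdict

UNKNOWN_VO = "unknown"  # bucket for messages that carry no usable VO


def split_by_vo(messages):
    """Group monitoring messages by the requesting user's VO."""
    by_vo = defaultdict(list)
    for message in messages:
        vo = (message.get("vo") or UNKNOWN_VO).lower()
        by_vo[vo].append(message)
    return dict(by_vo)


# Hypothetical messages from a multi-VO site.
mixed_queue = [
    {"vo": "atlas", "site": "EXAMPLE-SITE", "bytes_read": 1200},
    {"vo": "cms", "site": "EXAMPLE-SITE", "bytes_read": 800},
    {"vo": "ATLAS", "site": "EXAMPLE-SITE", "bytes_read": 300},
    {"site": "EXAMPLE-SITE", "bytes_read": 50},  # VO not reported
]

traffic = split_by_vo(mixed_queue)
bytes_per_vo = {
    vo: sum(message["bytes_read"] for message in msgs)
    for vo, msgs in traffic.items()
}
print(bytes_per_vo)
```

Messages without a VO end up in a separate bucket, which is why strict VO reporting by every experiment matters for this scheme.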

20 Monitoring
Failover
– Not flexible enough
Overflow
– No monitoring yet
– Need to compare jobs grouped by transfer type

