FAX UPDATE 1ST JULY 2013. Discussion points: FAX failover summary and issues Mailing issues Panda re-brokering to sites using FAX cost and access Issue.


1 FAX UPDATE 1ST JULY 2013

2 Discussion points:
- FAX failover summary and issues
- Mailing issues
- Panda re-brokering to sites using FAX cost and access
- Issue with redirection to CERN
- Recent deployments
Ilija Vukotic ivukotic@uchicago.edu

3 FAILOVER TO FAX
- Recall that the first FAX use case is failover to FAX when the pilot cannot stage input files.
- Turned on for all of the USA and two UK sites.
- Documentation on how to enable a queue for failover: https://twiki.cern.ch/twiki/bin/view/Atlas/FAXconfigAGIS#Enabling_FAX_failover
- No problems reported as yet.
- Does it work? YES IT DOES!

4 FAX FAILOVER OBSERVATIONS
- On one of the first days it saved 1K+ jobs from failing.
- Most were on OU_OCHEP_SWT2 (700+) and SWT2_CPM (100+), with the rest distributed across other sites.
- In some cases 5-10 files were recovered.
- Information on these jobs can be obtained by constructing a URL like this: http://pandamon.atlascloud.org/jobinfo?dump=yes&JOBMETRICS=*filesWithFAX*[1-9]{1}*&jobparam=JOBMETRICS&jobStatus=finished&hours=&tstart=2013-06-11+00:00:00&tend=2013-06-28+00:00
- However, this is (naturally) quite slow.
- We may also want daily summary statistics to assess operational performance.
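The query URL above can be assembled programmatically rather than by hand. The following is a minimal sketch using only the Python standard library; the helper name `build_failover_query_url` is hypothetical, and the parameter names and values are copied from the slide's URL, not from any documented monitor API.

```python
from urllib.parse import urlencode

# Base URL taken from the slide; the helper itself is illustrative only.
PANDAMON_JOBINFO = "http://pandamon.atlascloud.org/jobinfo"


def build_failover_query_url(tstart: str, tend: str) -> str:
    """Build a jobinfo URL using the filesWithFAX pattern shown on the slide."""
    params = {
        "dump": "yes",
        "JOBMETRICS": "*filesWithFAX*[1-9]{1}*",
        "jobparam": "JOBMETRICS",
        "jobStatus": "finished",
        "hours": "",
        "tstart": tstart,
        "tend": tend,
    }
    # Keep the pattern characters and colons literal, as in the slide's URL;
    # spaces inside the timestamps are encoded as '+'.
    return PANDAMON_JOBINFO + "?" + urlencode(params, safe="*[]{}:")


url = build_failover_query_url("2013-06-11 00:00:00", "2013-06-28 00:00")
print(url)
```

Changing the two timestamps is enough to query a different window; the query is still expected to be slow, as noted above.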

5 FAX FAILOVER DEVELOPMENT
- Need a better way to monitor how many jobs were saved, how many files were recovered, how many still failed, and the amount of data served by FAX.
- After discussions with Valeri and Torre, the decision is to send the information to the Panda logger: http://pandamon.cern.ch/logsummary
- Will require pilot modifications to send failover records.
- Also needs a Python plugin to create the web pages that we want.
- Open questions:
  - What process should we follow for switching on a site? Note: any Panda production queue can be enabled.
  - When the pilot supports Rucio file names, will the fallback mechanism still work?
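As an illustration of the kind of daily summary such a plugin might produce, here is a minimal sketch. The record layout (`date`, `status`, `files_with_fax`, `bytes_with_fax`) is an assumption made for this example; the actual format of the pilot's failover records is not defined in these slides, and the sample values are made up.

```python
from collections import defaultdict


def summarize_failover(records):
    """Aggregate hypothetical failover records into per-day totals."""
    summary = defaultdict(lambda: {
        "jobs_saved": 0,       # jobs that used FAX failover and finished
        "jobs_failed": 0,      # jobs that used FAX failover and still failed
        "files_with_fax": 0,   # files read through FAX
        "bytes_with_fax": 0,   # data volume served by FAX
    })
    for rec in records:
        day = summary[rec["date"]]
        if rec["status"] == "finished":
            day["jobs_saved"] += 1
        else:
            day["jobs_failed"] += 1
        day["files_with_fax"] += rec["files_with_fax"]
        day["bytes_with_fax"] += rec["bytes_with_fax"]
    return dict(summary)


# Made-up sample records, for illustration only.
sample = [
    {"date": "2013-06-11", "status": "finished", "files_with_fax": 2, "bytes_with_fax": 4000},
    {"date": "2013-06-11", "status": "failed", "files_with_fax": 1, "bytes_with_fax": 1000},
    {"date": "2013-06-12", "status": "finished", "files_with_fax": 5, "bytes_with_fax": 9000},
]
daily = summarize_failover(sample)
print(daily["2013-06-11"])
```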

6 STEPS TOWARD AUTOMATED OPERATIONAL NOTIFICATION
- Moving towards production operation, sites and ADC shifters will need well-defined procedures and awareness of potential problems with endpoints.
- Some failures are obvious; others will require intervention by experts.
- Perhaps start with SSB “Direct” and “Upstream” tests: http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=FAX+endpoints&fullscreen=true&highlight=false
- Align with existing site (cloud) notification channels.

7 PANDA RE-BROKERING
- One of the original use cases, discussed again at the last CERN S&C week.
- The idea is to re-broker jobs to sites with free CPUs / short queues, provided the transfer or direct-access read “cost” is reasonable.
- The FAX team is responsible for providing Panda with an estimate of the cost of moving data across the WAN.
- A cost matrix exists in SSB, ready for AGIS integration.
- The final step is for Tadashi to use that table from AGIS to actually re-broker.
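The re-brokering idea can be sketched as a simple selection rule: among sites with free CPUs, choose the one with the lowest cost from the data's current site, and only if that cost is below a threshold. This is not Panda's brokerage code; the function name, the dictionary layout of the cost matrix, the threshold, and the site names are all assumptions made for illustration.

```python
def pick_rebroker_site(source, cost, free_cpus, max_cost):
    """Return the cheapest destination with free CPUs, or None.

    source    -- site currently holding the input data
    cost      -- hypothetical cost matrix: {(source, destination): cost}
    free_cpus -- hypothetical map: {site: number of free CPUs}
    max_cost  -- highest cost still considered "reasonable"
    """
    candidates = [
        (cost[(source, dest)], dest)
        for dest, free in free_cpus.items()
        if dest != source
        and free > 0
        and (source, dest) in cost
        and cost[(source, dest)] <= max_cost
    ]
    # min() compares the cost first, then the site name to break ties.
    return min(candidates)[1] if candidates else None


# Made-up values, for illustration only.
cost = {("SITE_A", "SITE_B"): 0.8, ("SITE_A", "SITE_C"): 0.3, ("SITE_A", "SITE_D"): 0.1}
free_cpus = {"SITE_B": 50, "SITE_C": 10, "SITE_D": 0}

choice = pick_rebroker_site("SITE_A", cost, free_cpus, max_cost=0.5)
print(choice)
```

In this sketch SITE_D is the cheapest destination but has no free CPUs, and SITE_B exceeds the threshold, so SITE_C is chosen. If no site qualifies, the function returns None and the job would stay where it is.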

8 ISSUE WITH REDIRECTION TO CERN
- The high-level symptom is that downstream redirection to the CERN endpoint often fails: xrdcp -f -d 1 root://atlas-xrd-eu.cern.ch:1094//atlas/dq2/user/HironoriIto/user.HironoriIto.xrootd.cern-prod/user.HironoriIto.xrootd.cern-prod-1M
- i.e. the client gets redirected into an endless loop between lpsc-se-dpm-server.in2p3.fr and atlas-xrd-fr.cern.ch
- Note: IN2P3-LPSC has a deployed xrootd door, but it is not in AGIS. This might be the cause.
- Reported to the FR cloud contact; awaiting clarification.

9 RECENT DEPLOYMENTS
- PIC is in and validated.
- IT cloud: is the site at Bologna coming online?
- Will need to revisit at some point the strategy for Asian sites, including Australia.

