efi.uchicago.edu ci.uchicago.edu FAX status report. Ilija Vukotic, on behalf of the atlas-adc-federated-xrootd working group. S&C week, Jun 2, 2014.


Slide 1: FAX status report. Ilija Vukotic, on behalf of the atlas-adc-federated-xrootd working group. S&C week, Jun 2, 2014

Slide 2: Content
- Status: coverage, traffic, failover, overflow
- Changes in localSetupFAX
- Monitoring changes: GLED collector and dashboard; failover and overflow monitoring; FaxStatusBoard
- Meetings: Tutorial (June), dedicated to instructing on xAOD and the new analysis model; ROOT I/O (June)

Slide 3: FAX topology
- Topology change in North America: East and West redirectors added, all hosted at BNL; they will also serve the CA cloud
- An NL cloud redirector will be needed

Slide 4: FAX in Europe
- To come: SARA, Nikhef; IL cloud (IL-TAU, Technion, Weizmann)

Slide 5: FAX in North America
- To come: TRIUMF (June?), McGill (end of June), SCINET (end of June), Victoria (~August)

Slide 6: FAX in Asia
- To come: Beijing (~two weeks), Tokyo, Australia (a few weeks)

Slide 7: Status
- Most sites running stably; glitches do happen but are usually fixed within a few hours
- SSB issues solved
- New sites added: IFAE, PIC, IN2P3-LPC
- In need of restart: UNIBE-LHEP

Slide 8: Coverage
- Twiki page is now auto-updated
- Coverage is good (~85%), but we should aim for >95%!
- Info fetched from prototype.cern.ch/dashboard/request.py/dailysummary

Slide 9: Traffic
- Slowly increasing
- Maximum peak output record broken
- Still small compared to what we expect will come

Slide 10: Failover
- Running stably

Slide 11: Overflow status
- The whole chain is ready
- All US queues set to allow 3 Gbps both delivered to and delivered from sites
- Test tasks submitted to sites that do not have the data, so that transfertype=FAX is invoked; this does not test the JEDI decision making (the one based on the cost matrix)
- Waiting for actual jobs to check the full chain: users not yet instructed to use the JEDI client; waiting for the JEDI monitor

Slide 12: Overflow tests
- The test is the hardest I/O case: 100% of events, all branches read, standard TTreeCache, no AsyncPrefetch
- Site-specific FDR datasets (10 datasets, 744 files, 2.7 TB)
- All source/destination combinations of US sites
- All submitted in 3 batches, but not all started simultaneously; affected by priority degradation
- Three input files per job
- If a site is copy2scratch, the pilot does xrdcp to scratch; otherwise jobs access the files remotely
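The two access modes in the last bullet (stage-in with xrdcp for copy2scratch sites, direct remote reads otherwise) can be sketched as below. This is an illustrative assumption, not the pilot's actual code: the redirector URL and path layout are placeholders.

```python
def build_access(site_mode, lfn, redirector="root://faxredirector.example.org:1094"):
    """Return what a pilot might do for one input file.

    copy2scratch sites stage the file in with xrdcp; other sites hand the
    job a root:// URL for direct remote reading (via TTreeCache).
    The redirector host and /atlas/rucio/ prefix are hypothetical.
    """
    url = f"{redirector}//atlas/rucio/{lfn}"
    if site_mode == "copy2scratch":
        # stage-in command: copy the file to local scratch before the job starts
        return ["xrdcp", url, f"./scratch/{lfn.split('/')[-1]}"]
    # direct access: the job opens the remote URL itself
    return url
```

The same decision is what separates the two rows in the local-versus-remote comparisons on the following slides.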

Slide 13: Overflow tests (error rate)
- Total: 9188 jobs; finished: 9052; failed: 117 (1.3%)
  - 24: OU reading OU (no FAX involved)
  - 66: reading from WT2 (the files are corrupted)
  - 27 (0.29%): actual FAX errors, where SWT2 did not deliver the files; will be investigated
  - The rest: "Payload ran out of memory"
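The quoted percentages follow directly from the job counts on this slide; a quick check of the arithmetic:

```python
# Failure-rate arithmetic from the overflow test campaign.
total = 9188       # jobs submitted
failed = 117       # jobs that failed overall
fax_errors = 27    # genuine FAX errors (SWT2 did not deliver the files)

failure_rate = 100 * failed / total        # overall failure rate in percent
fax_error_rate = 100 * fax_errors / total  # FAX-specific failure rate in percent
```

Rounding gives the 1.3% and 0.29% quoted above; the FAX-specific rate is the relevant one, since the OU-to-OU and corrupted-file failures did not involve federated access.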

Slide 14: Overflow tests (jobs reading from local scratch, for comparison)
- Direct-access site, reading locally, per job: 7.2 MB/s, 67% CPU efficiency, 71 ev/s
- Copy2scratch site (scout jobs), per job: 11.0 MB/s, 97% CPU efficiency, 109 ev/s

Slide 15: Overflow tests (jobs reading remote sources)
- Direct-access site, reading remotely, per job: 4.2 MB/s, 43% CPU efficiency, 42 ev/s (no saturation)
- Direct-access site, reading remotely, per job: 3.5 MB/s, 29% CPU efficiency, 34 ev/s (possibly the start of saturation)
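A consistency check across the numbers on slides 14 and 15: dividing each per-job throughput by its event rate yields roughly the same average event size (~0.10 MB) in all four configurations, so the rate differences reflect I/O and CPU efficiency rather than different payloads. A minimal sketch of that check:

```python
# (MB/s per job, events/s per job), as reported on slides 14 and 15
measurements = [
    (7.2, 71),    # direct-access site, reading locally
    (11.0, 109),  # copy2scratch site, reading from scratch
    (4.2, 42),    # direct-access site, reading remotely
    (3.5, 34),    # direct-access site, reading remotely (saturating)
]

# implied average event size in MB for each configuration
event_sizes = [mbps / evps for mbps, evps in measurements]
```

All four values cluster tightly around 0.10 MB/event.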

Slide 16: Overflow tests (MWT2 reading from OU and SWT2 simultaneously)
- In aggregate reached 850 MB/s, the limit for MWT2 at that time

Slide 17: Cost matrix (heatmap, source sites vs. destination sites)

Slide 18: localSetupFAX
- Added command fax-ls
  - Written by Shuwei Ye; will finally replace isDSinFAX; he will move all the other tools to Rucio
- Change in fax-get-best-redirector
  - Previously each call made three queries: SSB for the endpoints and their status, AGIS for the sites hosting the endpoints, and AGIS for the site coordinates
  - Each call returned hundreds of kB and could not scale to a large number of requests
  - Solution: a Google App Engine servlet fetches the SSB and AGIS info every 30 minutes and serves it from memory; the information is slimmed to what is actually needed (~several kB); requests are now served in a few tens of ms; "infinitely" scalable
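The refresh-every-30-minutes, serve-from-memory idea behind the servlet can be sketched as a small TTL cache. This is a simplified stand-in for the actual App Engine code; fetch_snapshot is a hypothetical placeholder for the combined SSB/AGIS queries.

```python
import time

CACHE_TTL = 30 * 60  # refresh interval in seconds, as on the slide

class RedirectorInfoCache:
    """Serve a slimmed endpoint/site snapshot from memory.

    The expensive fetch (SSB status + two AGIS queries in the real service)
    runs at most once per TTL; every other request is answered from memory,
    which is what brings response times down to a few tens of ms.
    """

    def __init__(self, fetch_snapshot):
        self._fetch = fetch_snapshot  # callable doing the slow queries
        self._data = None
        self._stamp = 0.0

    def get(self):
        now = time.time()
        if self._data is None or now - self._stamp > CACHE_TTL:
            self._data = self._fetch()  # slow path: refresh the snapshot
            self._stamp = now
        return self._data               # fast path: in-memory answer
```

Because every request after the first within the TTL window is a dictionary lookup, the service scales with request rate essentially for free.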

Slide 19: Monitoring (collector, dashboard)
- Problem: support for multi-VO sites
- Meeting: Alex, Matevz, me
- Issues:
  - Site name: ATLAS reports it; CMS does not, or reports it badly (will be fixed)
  - Requesting user's VO: ATLAS reports it; CMS is not strict about it (US-CMS uses GUMS; will be fixed)
- Proposal: during the summer Matevz develops an XrdMon that can handle multi-VO messages; multi-VO sites send messages to a special "mixed" AMQ, and the dashboard splits the traffic according to the user's VO
- Details:
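The proposed dashboard-side split can be illustrated with a minimal routing function. The message shape and the user_vo field name are assumptions for illustration only; the real messages come from XrdMon over AMQ.

```python
def split_by_vo(messages, known_vos=("atlas", "cms")):
    """Group monitoring messages from a mixed multi-VO queue by the user's VO.

    Messages whose VO field is missing or unrecognized (the CMS issue noted
    above) fall into an 'unknown' bucket instead of being dropped.
    """
    streams = {vo: [] for vo in known_vos}
    streams["unknown"] = []
    for msg in messages:
        vo = (msg.get("user_vo") or "").lower()
        streams[vo if vo in streams else "unknown"].append(msg)
    return streams
```

Keeping an explicit unknown bucket makes the site-name and VO reporting problems listed above visible in the monitoring rather than silently losing those messages.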

Slide 20: Monitoring
- Failover: not flexible enough
- Overflow: no monitoring yet; need to compare jobs grouped by transfer type