Sharing Network Resources
Ilija Vukotic
Computation and Enrico Fermi Institutes, University of Chicago
Federated Storage Workshop, SLAC, April 10, 2014

Slide 2: Content
– Monitoring systems in use by ATLAS
– Issues we will have to address
– Proposals: not presenting solutions, just starting a conversation

Slide 3: Glossary
Monitoring
– Systems testing and visualizing the functionality of federation elements (endpoints, redirectors)
– Notification of problems
Accounting
– Systems storing detailed information on previous transfers
– Data mining and visualization tools
Network monitoring
– Systems measuring, storing, predicting, and delivering information on current/historical parameters describing connections between sites
– Cost matrix: for each source-destination pair, gives a value best describing the expected performance of a transfer (see the sketch below)
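As a minimal illustration of the cost-matrix idea, the sketch below models it as a keyed lookup from (source, destination) to an expected single-stream rate. The site names and numbers are invented and this is not the SSB schema.

```python
# Minimal sketch of a cost-matrix lookup: (source, destination) -> expected MB/s.
# Site names and rates are made up for illustration; the real matrix lives in SSB.

cost_matrix = {
    ("ANALY_MWT2", "MWT2_UC"): 95.0,      # local access, fast
    ("ANALY_MWT2", "AGLT2"): 40.0,        # nearby site
    ("ANALY_MWT2", "CERN-PROD"): 8.5,     # transatlantic, slower
}

def expected_rate(source_queue, endpoint, default=1.0):
    """Return the expected single-stream rate in MB/s, with a pessimistic
    default for pairs that have never been measured."""
    return cost_matrix.get((source_queue, endpoint), default)

print(expected_rate("ANALY_MWT2", "AGLT2"))    # 40.0
print(expected_rate("ANALY_MWT2", "UNKNOWN"))  # 1.0 (no measurement yet)
```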

Slide 4: Glossary
Production jobs (RECO, PILE) – high CPU, low IO
Derivation framework – high CPU, low IO
User analysis jobs (further reductions, cut flow, …) – low CPU, high IO
Failover – when jobs fail to access input data locally and fall back to reading from a remote location
Overflow – when jobs are explicitly instructed to use input data from a remote location
AGIS – ATLAS Grid Information System
– Describes the FAX topology
– Controls what sites export through FAX
– Toggles sites ON/OFF for failover
– Sets limits on rates delivered to / used from FAX (illustrated in the sketch below)
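To make the AGIS role concrete, here is a hypothetical per-site record showing the kind of knobs listed above (export toggle, failover switch, rate caps). The field names are invented for this sketch and are not the actual AGIS schema.

```python
# Hypothetical per-site FAX record illustrating the knobs AGIS holds.
# Field names are invented for this sketch; they are not the real AGIS schema.
site_fax_config = {
    "MWT2": {
        "fax_export_enabled": True,    # site exports data through FAX
        "failover_allowed": True,      # jobs may fail over to reading via FAX
        "max_rate_served_mbps": 2000,  # cap on rate delivered to remote clients
        "max_rate_pulled_mbps": 1000,  # cap on rate pulled from other sites
    },
}

def failover_target_ok(site):
    """True if jobs at this site are allowed to fall back to FAX reads."""
    cfg = site_fax_config.get(site, {})
    return cfg.get("fax_export_enabled", False) and cfg.get("failover_allowed", False)

print(failover_target_ok("MWT2"))  # True
```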

Slide 5: ATLAS Monitoring – Infrastructure
A simple cron job checks the basic functionality of endpoints and redirectors (direct, upstream, downstream, privacy, monitoring); a sketch of such a probe follows this slide.
SSB provides visualization, history, and problem reporting – the primary source of information on the state of the FAX infrastructure.
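A rough sketch of what such a cron probe could look like, assuming the standard xrootd command-line clients are installed; the endpoint and test file path are placeholders, not the real probe configuration.

```python
# Rough sketch of an endpoint functional probe, in the spirit of the cron checks.
# Assumes the xrootd CLI tool xrdfs is on PATH; the endpoint list and test file
# path are placeholders.
import subprocess

ENDPOINTS = ["root://fax.example.org:1094"]        # placeholder endpoint
TEST_FILE = "/atlas/rucio/user/test/testfile.root"  # placeholder test path

def check_endpoint(endpoint, timeout=60):
    """Return True if the endpoint answers a stat for the test file."""
    cmd = ["xrdfs", endpoint, "stat", TEST_FILE]
    try:
        res = subprocess.run(cmd, capture_output=True, timeout=timeout)
        return res.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False

for ep in ENDPOINTS:
    status = "OK" if check_endpoint(ep) else "FAILED"
    print(f"{ep}: direct access {status}")
```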

Slide 6: ATLAS Monitoring – detailed & summary IO, and more
Many other automatic tools:
– Monitor and collect endpoint information
– Run diagnostics and troubleshooting

Slide 7: ATLAS – Accounting
Dashboard monitor
– Described in detail by A. Beche
– Our authoritative source
[Dashboard plots: throughput, rate per file, active transfers, finished transfers]

Slide 8: ATLAS – Accounting Use in Production & Analysis
PanDA-specific FAX accounting
– FAX failover jobs
– FAX “overflow” (re-brokered) accounting (in progress)

Slide 9: Network Monitoring Sources
perfSONAR
– Bandwidth: every 6 hours
– pinger: every 20 minutes
– Latency and packet loss: continuous
FTS
– Collects transfer rates for all files. Source-destination combinations not used during one week are probed with test file transfers.
– Averaged over file size and over 15 days
  o 1 GB test file size
FAX cost matrix
– Estimate of the expected rate (MB/s) between a grid worker node and FAX endpoints, using a single-stream xrdcp (see the measurement sketch below)
– Memory-to-memory transfer
– Sampling interval: 30 minutes
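A hedged sketch of the kind of measurement the cost-matrix probe performs: a timed, single-stream xrdcp of a test file to local memory. The endpoint, file path, and file size are placeholders; the real probes are submitted by HammerCloud.

```python
# Sketch of a single-stream, memory-to-memory rate measurement with xrdcp,
# similar in spirit to the cost-matrix probe. Endpoint and file are placeholders.
import subprocess
import time

def measure_rate(endpoint, path, size_bytes):
    """Copy one file to /dev/null with a single stream and return MB/s."""
    url = f"{endpoint}//{path.lstrip('/')}"
    start = time.time()
    res = subprocess.run(["xrdcp", "--force", url, "/dev/null"],
                         capture_output=True)
    elapsed = time.time() - start
    if res.returncode != 0 or elapsed <= 0:
        return None
    return size_bytes / (1024 * 1024) / elapsed

rate = measure_rate("root://fax.example.org:1094",
                    "/atlas/rucio/test/1GB_testfile", 1024**3)
print(f"measured rate: {rate:.1f} MB/s" if rate else "transfer failed")
```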

Slide 10: FAX cost matrix
Data collected between 42 ANALY queues (compute sites) and every FAX endpoint
Jobs submitted by HammerCloud
Results go to ActiveMQ and are consumed by SSB together with network and throughput measurements (perfSONAR and FTS); a sketch of the publishing step follows.
[Diagram: HammerCloud (via FAX redirection and REST) and FTS feeding the SSB FAX cost matrix]
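A minimal sketch of the publishing step, assuming the third-party stomp.py client for ActiveMQ; the broker address, credentials, destination name, and message fields are placeholders, not the production configuration.

```python
# Minimal sketch of shipping one cost-matrix measurement to an ActiveMQ broker.
# Assumes the third-party stomp.py package; broker host, port, destination and
# credentials are placeholders, not the production values.
import json
import stomp

message = {
    "source_queue": "ANALY_MWT2",
    "endpoint": "AGLT2",
    "rate_mb_s": 40.0,
    "timestamp": 1397131200,
}

conn = stomp.Connection([("broker.example.org", 61613)])
conn.connect("user", "password", wait=True)
conn.send(destination="/topic/fax.cost.matrix", body=json.dumps(message))
conn.disconnect()
```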

Slide 11: FAX cost matrix
Results are averaged over the last week (see the averaging sketch below).
N.B. – this is the rate observed at the worker node, not the site bandwidth.
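The weekly averaging itself is a simple reduction over the collected samples, as in this sketch with invented numbers.

```python
# Small sketch of the weekly averaging step: keep only samples from the last
# 7 days and average per (source, destination) pair. The samples are invented.
import time
from collections import defaultdict

WEEK = 7 * 24 * 3600
now = time.time()

samples = [  # (source_queue, endpoint, rate_mb_s, unix_timestamp)
    ("ANALY_MWT2", "AGLT2", 38.0, now - 2 * 24 * 3600),
    ("ANALY_MWT2", "AGLT2", 42.0, now - 5 * 24 * 3600),
    ("ANALY_MWT2", "AGLT2", 90.0, now - 20 * 24 * 3600),  # too old, ignored
]

sums = defaultdict(lambda: [0.0, 0])
for src, dst, rate, ts in samples:
    if now - ts <= WEEK:
        sums[(src, dst)][0] += rate
        sums[(src, dst)][1] += 1

weekly_average = {pair: total / n for pair, (total, n) in sums.items()}
print(weekly_average)  # {('ANALY_MWT2', 'AGLT2'): 40.0}
```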

Slide 12: Future
More data will come our way, on top of the current volume
No big increase in storage space
Faster networks will come
All the experiments will lean on them:
– to reduce the number of data replicas
– to use available CPUs more efficiently
– to decrease the failure rate (failover)
More and more end users will use federations as locally available (Tier-3) storage becomes insufficient

Slide 13: Issues
Up to now the experiments have concentrated on getting the infrastructure stable and able to scale to the network limits.
Challenges remain:
– We need to use network resources wisely
  o Make applications “network aware”
– We need to ensure we don’t saturate any site’s links
– Come up with a way to fair-share among experiments?
Most of the links are not dedicated and we will never be able to account for all the traffic on them, but the more we know, the better we will be able to use them.
CMS and ALICE evidently did not need to worry much about this and still had almost no incidents. ATLAS has more data and higher IO requirements (for analysis jobs), so it has to be more careful.

Slide 14: Smart network usage
Make applications “network aware”
– Grid-submitted jobs
  o The submission system (JEDI) has to know the expected network performance for all combinations of sources and destinations.
  o It has to internally enforce the limits set by the sites.
– Tier-3 queues and individual end users
  o Cannot expect an end user to know or care about the network
  o When working outside of a framework, make sure they access the federation from the place closest to them
  o Instrument frameworks to choose the source with the highest available bandwidth to the client (see the selection sketch below)
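Choosing “the source with the highest available bandwidth” amounts to ranking a file’s replicas by the cost-matrix entry for the client’s location. A sketch with invented replica names and rates:

```python
# Sketch of replica selection by expected rate to the client. The replica list
# and rates are invented; in practice they would come from the cost matrix.

def pick_best_replica(client_site, replicas, expected_rate):
    """Return the replica endpoint with the highest expected rate to the client."""
    return max(replicas, key=lambda ep: expected_rate(client_site, ep))

rates = {
    ("UC_Tier3", "MWT2_UC"): 90.0,
    ("UC_Tier3", "AGLT2"): 35.0,
    ("UC_Tier3", "CERN-PROD"): 7.0,
}

best = pick_best_replica("UC_Tier3",
                         ["MWT2_UC", "AGLT2", "CERN-PROD"],
                         lambda c, e: rates.get((c, e), 0.0))
print(best)  # MWT2_UC
```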

Slide 15: Fair sharing & avoiding link saturation – ideas
Limit bandwidth usage by endpoints
– Hard way, but simple: limit access with proxy servers
  o Requires dedicated node(s) for each experiment; expensive for smaller endpoints
  o Very suboptimal network usage – if one VO is not using its allocated bandwidth, other VOs cannot use it
  o Not fine-grained
– Easier way, but implies development: the storage service accounts for and throttles bandwidth usage by VO (see the throttling sketch below)
  o Traffic often flows directly from the data servers
    – Potentially expensive to do
    – Needs VO names at the data servers
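The “account and throttle by VO” option could be modeled as a per-VO token bucket consulted before serving each transfer. This is a conceptual sketch only, with invented shares; a real storage service would implement this internally.

```python
# Conceptual sketch of per-VO bandwidth throttling with a token bucket.
# VO shares and the refill rate are invented numbers for illustration.
import time

class VoThrottle:
    def __init__(self, rate_mb_s):
        self.rate = rate_mb_s    # allocated bandwidth for this VO (MB/s)
        self.tokens = rate_mb_s  # start with one second's worth of budget
        self.last = time.time()

    def allow(self, transfer_mb):
        """Refill the bucket, then admit the transfer if enough budget remains."""
        now = time.time()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if transfer_mb <= self.tokens:
            self.tokens -= transfer_mb
            return True
        return False

throttles = {"atlas": VoThrottle(1500), "cms": VoThrottle(800), "alice": VoThrottle(400)}
print(throttles["atlas"].allow(100))  # True: within ATLAS's budget
```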

Slide 16: US usage by VO and service
[Plots of US traffic by VO (atlas, alice, cms) for the XrootD and FTS services]

Slide 17: Other possibilities
Control bandwidth usage centrally (see the sketch below)
– Needs unified site identification
– Have sites set their allocation per experiment, for both input and output
– Use accounting from the DDM dashboard
– Inform VOs when they use more than the site allocated
– Inform both the VO crossing the limit and the site when bandwidth reaches 90% of the total allocation
– VOs individually decide how they use that information
  o Stop submissions to the site
  o Move/copy data elsewhere
  o Warn users, kill jobs
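The central-control idea reduces to comparing accounted per-VO traffic at each site against the site’s declared allocation and raising an alert at a threshold. A sketch with invented numbers; real usage figures would come from the DDM dashboard accounting.

```python
# Sketch of the central check: warn a VO and the site when the VO's measured
# traffic exceeds 90% of the site's allocation for it. All numbers are invented.

allocations_mb_s = {("BNL", "atlas"): 2000, ("BNL", "cms"): 500}
measured_mb_s = {("BNL", "atlas"): 1900, ("BNL", "cms"): 200}

def check_allocations(threshold=0.9):
    """Return warning strings for every (site, VO) pair over the threshold."""
    alerts = []
    for (site, vo), allocated in allocations_mb_s.items():
        used = measured_mb_s.get((site, vo), 0.0)
        if used >= threshold * allocated:
            alerts.append(f"WARNING: {vo} at {site} uses {used}/{allocated} MB/s")
    return alerts

for alert in check_allocations():
    print(alert)  # WARNING: atlas at BNL uses 1900/2000 MB/s
```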

Slide 18: Not so urgent…
Unify cost measurements, their storage, and their distribution
Measurements between cloud providers and CEs
We need a way of communicating the experiments’ needs
Cost of transactions (stress on the SE)