PanDA Configurator and Network Aware Brokerage Fernando Barreiro Megino, Kaushik De, Tadashi Maeno 14 March 2015, US ATLAS Distributed Facilities Meeting,

Slides:



Advertisements
Similar presentations
Ch. 12 Routing in Switched Networks
Advertisements

Pankaj Kumar Qinglan Zhang Sagar Davasam Sowjanya Puligadda Wei Liu
PROMISE: Peer-to-Peer Media Streaming Using CollectCast Mohamed Hafeeda, Ahsan Habib et al. Presented By: Abhishek Gupta.
1 Internet Networking Spring 2004 Tutorial 13 LSNAT - Load Sharing NAT (RFC 2391)
Traffic Engineering Jennifer Rexford Advanced Computer Networks Tuesdays/Thursdays 1:30pm-2:50pm.
CS 268: Lecture 12 (Router Design) Ion Stoica March 18, 2002.
1 Interconnecting LAN segments Repeaters Hubs Bridges Switches.
1 Spring Semester 2007, Dept. of Computer Science, Technion Internet Networking recitation #12 LSNAT - Load Sharing NAT (RFC 2391)
Jennifer Rexford Princeton University MW 11:00am-12:20pm Wide-Area Traffic Management COS 597E: Software Defined Networking.
Outline Network related issues and thinking for FAX Cost among sites, who has problems Analytics of FAX meta data, what are the problems  The main object.
Pipelined Two Step Iterative Matching Algorithms for CIOQ Crossbar Switches Deng Pan and Yuanyuan Yang State University of New York, Stony Brook.
Efi.uchicago.edu ci.uchicago.edu FAX status report Ilija Vukotic Computation and Enrico Fermi Institutes University of Chicago US ATLAS Computing Integration.
ALICE DATA ACCESS MODEL Outline ALICE data access model - PtP Network Workshop 2  ALICE data model  Some figures.
Distributed Quality-of-Service Routing of Best Constrained Shortest Paths. Abdelhamid MELLOUK, Said HOCEINI, Farid BAGUENINE, Mustapha CHEURFA Computers.
1 Pertemuan 20 Teknik Routing Matakuliah: H0174/Jaringan Komputer Tahun: 2006 Versi: 1/0.
ALICE data access WLCG data WG revival 4 October 2013.
Dividing the Pizza An Advanced Traffic Billing System An Advanced Traffic Billing System Christopher Lawrence Burke The University of Queensland.
Tiziana FerrariNetwork metrics usage for optimization of the Grid1 DataGrid Project Work Package 7 Written by Tiziana Ferrari Presented by Richard Hughes-Jones.
 Network Segments  NICs  Repeaters  Hubs  Bridges  Switches  Routers and Brouters  Gateways 2.
Network and Transfer WG Metrics Area Meeting Shawn McKee, Marian Babik Network and Transfer Metrics Kick-off Meeting 26 h November 2014.
Computer Networks Performance Metrics. Performance Metrics Outline Generic Performance Metrics Network performance Measures Components of Hop and End-to-End.
FAX UPDATE 1 ST JULY Discussion points: FAX failover summary and issues Mailing issues Panda re-brokering to sites using FAX cost and access Issue.
FAX UPDATE 26 TH AUGUST Running issues FAX failover Moving to new AMQ server Informing on endpoint status Monitoring developments Monitoring validation.
PanDA Summary Kaushik De Univ. of Texas at Arlington ADC Retreat, Naples Feb 4, 2011.
OSG Area Coordinator’s Report: Workload Management April 20 th, 2011 Maxim Potekhin BNL
27th, Nov 2001 GLOBECOM /16 Analysis of Dynamic Behaviors of Many TCP Connections Sharing Tail-Drop / RED Routers Go Hasegawa Osaka University, Japan.
PanDA: Exascale Federation of Resources for the ATLAS Experiment
Testing the dynamic per-query scheduling (with a FIFO queue) Jan Iwaszkiewicz.
PanDA Update Kaushik De Univ. of Texas at Arlington XRootD Workshop, UCSD January 27, 2015.
EGEE-III INFSO-RI Enabling Grids for E-sciencE Overview of STEP09 monitoring issues Julia Andreeva, IT/GS STEP09 Postmortem.
IT 210: Web-based IT Winter 2012 Measuring Speed on the Internet and WWW.
Efi.uchicago.edu ci.uchicago.edu FAX status developments performance future Rob Gardner Yang Wei Andrew Hanushevsky Ilija Vukotic.
Michael Schapira Yale and UC Berkeley Joint work with P. Brighten Godfrey, Aviv Zohar and Scott Shenker.
Interconnect simulation. Different levels for Evaluating an architecture Numerical models – Mathematic formulations to obtain performance characteristics.
EGI-InSPIRE EGI-InSPIRE RI DDM Site Services winter release Fernando H. Barreiro Megino (IT-ES-VOS) ATLAS SW&C Week November
Intradomain Traffic Engineering By Behzad Akbari These slides are based in part upon slides of J. Rexford (Princeton university)
1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.
Network awareness and network as a resource (and its integration with WMS) Artem Petrosyan (University of Texas at Arlington) BigPanDA Workshop, CERN,
EGEE-III INFSO-RI Enabling Grids for E-sciencE Ricardo Rocha CERN (IT/GS) EGEE’08, September 2008, Istanbul, TURKEY Experiment.
SLACFederated Storage Workshop Summary For pre-GDB (Data Access) Meeting 5/13/14 Andrew Hanushevsky SLAC National Accelerator Laboratory.
Efi.uchicago.edu ci.uchicago.edu FAX status report Ilija Vukotic on behalf of the atlas-adc-federated-xrootd working group S&C week Jun 2, 2014.
1 Version 3.1 Module 6 Routed & Routing Protocols.
PanDA Status Report Kaushik De Univ. of Texas at Arlington ANSE Meeting, Nashville May 13, 2014.
EGI-InSPIRE EGI-InSPIRE RI DDM solutions for disk space resource optimization Fernando H. Barreiro Megino (CERN-IT Experiment Support)
Distributed Data Management Miguel Branco 1 DQ2 discussion on future features BNL workshop October 4, 2007.
Network integration with PanDA Artem Petrosyan PanDA UTA,
Dynamic Data Placement: the ATLAS model Simone Campana (IT-SDC)
ATLAS Distributed Computing ATLAS session WLCG pre-CHEP Workshop New York May 19-20, 2012 Alexei Klimentov Stephane Jezequel Ikuo Ueda For ATLAS Distributed.
PanDA & Networking Kaushik De Univ. of Texas at Arlington ANSE Workshop, CalTech May 6, 2013.
Access Link Capacity Monitoring with TFRC Probe Ling-Jyh Chen, Tony Sun, Dan Xu, M. Y. Sanadidi, Mario Gerla Computer Science Department, University of.
PD2P, Caching etc. Kaushik De Univ. of Texas at Arlington ADC Retreat, Naples Feb 4, 2011.
Joe Foster 1 Two questions about datasets: –How do you find datasets with the processes, cuts, conditions you need for your analysis? –How do.
Efi.uchicago.edu ci.uchicago.edu Sharing Network Resources Ilija Vukotic Computation and Enrico Fermi Institutes University of Chicago Federated Storage.
PanDA & Networking Kaushik De Univ. of Texas at Arlington UM July 31, 2013.
Network Layer COMPUTER NETWORKS Networking Standards (Network LAYER)
Collectd 101.
Corelite Architecture: Achieving Rated Weight Fairness
Delay-Tolerant Networks (DTNs)
Future of WAN Access in ATLAS
CS 268: Router Design Ion Stoica February 27, 2003.
PanDA in a Federated Environment
Chapter 6 Delivery & Forwarding of IP Packets
Subject Name: Digital Switching Systems Subject Code:10EC82 Prepared By: Aparna.P, Farha Kowser Department: Electronics and Communication Date:
ECE 544 Project III Description and Timeline March 23, 2018
COMP60621 Fundamentals of Parallel and Distributed Systems
lundi 25 février 2019 FTS configuration
CS 6290 Many-core & Interconnect
COMP60611 Fundamentals of Parallel and Distributed Systems
- Measurement of traffic volume -
Presentation transcript:

PanDA Configurator and Network Aware Brokerage Fernando Barreiro Megino, Kaushik De, Tadashi Maeno 14 March 2015, US ATLAS Distributed Facilities Meeting, Clemson University

Configurator: overview 2 AGISRucioNWS Configurator PanDA DB Brokerage PanDA agent running every 30 minutes collecting information useful for brokerage, in particular regarding WORLD cloud migration Adding progressively new sources as we see the need PanDA

WORLD cloud N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 WORLD is the evolution of MCP. Tasks are not confined to their cloud anymore. Nucleus: PanDA task brokerage will assign tasks to Nuclei (=T1s and selected T2s). The output will be aggregated in the Nucleus. Satellites: Run jobs and ship the output to the Nuclei. Satellites will be selected for each task, with a maximum of 10 satellites per task. Job brokerage will select satellites based on usual criteria (e.g. #jobs in different states, data availability, …) Job brokerage will not confine the task to a cloud, but will increasingly be based on the network connectivity and transfer queues between the sites. N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 N1N1 N1N1

Configurator: topology data (reminder) 4 Base info Space usage Downtime info Base info Role Base info Locality

Configurator: static network data Configurator agent downloads and processes network information every 30 min from AGIS and NWS. Data is cached in a key-value table in PanDA DB – Table structure avoids adding/removing columns every time a new metric appears/disappears AGIS closeness matrix: static closeness between each source- destination pair: AGIS closeness matrix – Value between 1 (good) and 9 (bad) based on the hierarchic cloud model. Examples: T1  T1: 1 T2  T1 in same cloud: 2 … T2  T2 in different clouds: 7 T3  T3 in different clouds: 9 – Special values: -1 to blacklist a channel 0 to define a combined site (in progress) 5

Configurator: dynamic network data The Analytics platform contains a lot of raw network information (FTS, FAX, PerfSonar). We are working with the Rucio team to get aggregates per source-destination pair combined with Rucio queue data:aggregates per source-destination pair – #files transferred in last 1 and 6 hours (by activity) – #files queued (by activity) – Throughput according to FTS, based on 1 week data – Throughput according to FAX – PerfSonar metrics (latency, packet loss, throughput) 6 Available in 1 st NWS version Available since 2 nd NWS version (work in progress)

WORLD cloud: satellite selection in JEDI production job brokerage 1.Filter out candidates with blacklisted AGIS closeness (closeness = -1) 2.Calculate network weight for remaining candidates, combining static and dynamic info 3. Multiply traditional weight (based on data availability, #jobs in different stages, etc.) by network weight Currently we are running in passive mode and sending the network brokerage decisions to Analytics platform, so we can tune the network model and algorithm 7

Monitoring NW brokerage: ES messages 8 You can explore the data yourself: There is no network information for most links over the last 6h, so static closeness will prevail. We should also include metrics that cover longer time periods

Example: Favorite satellites for nucleus BNL 9 T1-T1 mingling Average network weight for Nucleus BNL (7-14 March 2016) Intra cloud mingling This model is a reflection of the static model, we need to boost the dynamic data Inter cloud

Observations (1) This work is fairly new – we started after the Sitges TIM in December – We have improved considerably the data transmission model and are starting to use network data for a very broad case in PanDA – We have powerful building blocks, but we need to get them operational and tune the algorithms – We need a good data analysis of all the available information and recommendations (WIP by Mario et al.) AGIS closeness is too much of a reflection of the MONARC model – IMHO the data should reflect the reality, not a theoretical, obsolete model Simple, semi-static classification? Aggregation of data from NWS is work in progress 10

Observations (2) Verify, activate, improve algorithm for nuclei- satellite matching – Start using the second version of aggregated data, containing more info than nqueued and ntransferred: FTS Mbps over last week to have dynamic data over longer period PerfSonar data – Boost the dynamic data – Analyze if the network weight should be stronger Extend to other Rucio and FAX use cases 11

12

Other possible network brokerage use cases Network weight for input file transfers (AKA Rodney Walker’s case): – Input data is in site A and B, but sites are busy – Sites C and D are free, but don’t have input data  Consider network for the brokerage FAX network weight for Event Service jobs BUT: Both cases require that PanDA and the respective DM system (Rucio, FAX) follow a similar source selection logic – Otherwise PanDA might be taking useless decisions Review Overflow jobs? 13