PanDA & Networking
Kaushik De, Univ. of Texas at Arlington
UM, July 31, 2013


Introduction

 Background
   PanDA is a distributed computing workload management system
   PanDA relies on networking for workload data transfer/access
   However, networking is assumed in PanDA – it is not integrated into the workflow
   Note: data transfer/access is done asynchronously – by DQ2 in ATLAS, PhEDEx in CMS, and pandamover/FAX for special cases
   Data transfer/access systems can provide a first level of network optimization – PanDA will use these enhancements as they become available
 Ambitious goal for PanDA
   Direct integration of networking with the PanDA workflow – never attempted before for a large-scale automated WMS

Jobs/week Managed by PanDA

 Current scale – one million jobs completed daily at ~100 sites (slide shows the jobs-per-week plot)

Concept: Network as a Resource

 PanDA as workload manager
   PanDA automatically chooses the job execution site
   Multi-level decision tree – task brokerage, job brokerage, dispatcher
   Also manages predictive future workflows – at task definition, PD2P
   Site selection is based on processing and storage requirements
   Can we use network information in this decision?
   Can we go even further – network provisioning?
   Further still – network knowledge used for all phases of the job cycle?
 Network as a resource
   Optimal site selection should take network capability into account (a simple weighting sketch follows below)
   We do this already – but indirectly, using job completion metrics
   Network as a resource should be managed (i.e. provisioned)
   We also do this crudely – mostly through timeouts and self-throttling
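A minimal sketch of what "network as a resource" could mean in site selection, assuming a simple weighting scheme. The names (SiteInfo, network_cost_mbps, select_site) and the 100 Mbps reference scale are illustrative assumptions, not PanDA code.

```python
from dataclasses import dataclass

@dataclass
class SiteInfo:
    name: str
    free_cpu_slots: int        # idle job slots reported by the site
    has_input_locally: bool    # input dataset already resident at the site
    network_cost_mbps: float   # measured throughput from the data source (e.g. sonar/perfSONAR)

def site_weight(site: SiteInfo) -> float:
    """Higher weight = better candidate.  CPU availability dominates, but a
    site with fast network access to the input can beat a slower site that
    happens to hold the data locally."""
    weight = float(site.free_cpu_slots)
    if not site.has_input_locally:
        # Penalize remote input in proportion to how slow the link is;
        # the 100 Mbps reference value is an arbitrary illustrative scale.
        weight *= min(1.0, site.network_cost_mbps / 100.0)
    return weight

def select_site(candidates: list) -> SiteInfo:
    # Pick the candidate with the best combined CPU/network weight.
    return max(candidates, key=site_weight)
```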

Scope of Effort

 Three parallel efforts to integrate networking in PanDA
   US ATLAS funded – primarily to improve integration with FAX
   ASCR funded – the BigPanDA project, taking PanDA beyond the LHC
     Next Generation Workload Management and Analysis System for Big Data, DOE funded (BNL, U Texas Arlington)
   ANSE funded – this project
 We are coordinating the three efforts to maximize results

First Steps

 Work on integrating network information into PanDA started a few months ago, following community discussions
   New hires funded by ASCR and ANSE (plus the PanDA core team)
 Step 1: Introduce network information into PanDA
   Start with slowly varying (static) information from external probes
   Populate the various databases used by PanDA
 Step 2: Use network information for workload management
   Start with simple use cases that lead to measurable improvements in workflow / user experience
 From the PanDA perspective we assume
   Network measurements are available
   Network information is reliable
   We can examine these assumptions later – we need to start somewhere

Sources of Network Information

 DDM sonar measurements
   ATLAS measures file transfer rates between Tier 1 and Tier 2 sites (information used for site white/blacklisting)
   Measurements available for small, medium, and large files
 perfSONAR measurements
   All WLCG sites are being instrumented with perfSONAR boxes
   US sites are already instrumented and fully monitored
 FAX measurements
   Read times for remote files are measured for pairs of sites
   Standard PanDA test jobs (HammerCloud jobs) are used
 This is not an exclusive list – just a starting point (a sketch of merging these sources follows below)
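An illustrative sketch of merging the three measurement sources into one record per site pair. The field names, the record() helper, and the merge rule are assumptions for illustration, not the actual PanDA/AGIS schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class PairMetrics:
    sonar_rate_mbps: Optional[float] = None     # DDM sonar file-transfer rate
    perfsonar_rtt_ms: Optional[float] = None    # perfSONAR latency measurement
    fax_read_time_s: Optional[float] = None     # FAX remote-read time from HammerCloud jobs

costs: Dict[Tuple[str, str], PairMetrics] = {}

def record(source_site: str, dest_site: str, **values) -> None:
    """Update the metrics for one (source, destination) site pair."""
    pair = costs.setdefault((source_site, dest_site), PairMetrics())
    for key, value in values.items():
        setattr(pair, key, value)

# Example updates, as the three collectors would produce them:
record("BNL-ATLAS", "AGLT2", sonar_rate_mbps=450.0)
record("BNL-ATLAS", "AGLT2", perfsonar_rtt_ms=18.2)
record("BNL-ATLAS", "AGLT2", fax_read_time_s=3.4)
```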

Data Repositories

 Native data repositories
   Historical data stored from collectors
   SSB – site status board, for sonar and perfSONAR data (currently)
   HC FAX data is kept independently and uploaded
 AGIS (ATLAS Grid Information System)
   Most recent / processed data only – updated periodically
   Mixture of push/pull – moving to a JSON API (push only, sketched below)
 schedConfigDB
   Internal Oracle DB used by PanDA for fast access
   Currently updated by a test cron – switching to a standard collector
 All work is currently in the ‘dev’ branch
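A sketch of the "push only" JSON update mentioned above. The endpoint URL and the payload layout are hypothetical placeholders, not the real AGIS API.

```python
import json
import urllib.request

def push_network_metrics(payload: dict, endpoint: str) -> int:
    """POST the latest processed network metrics as JSON; return the HTTP status."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

# Example payload: most recent processed value per site pair.
metrics = {"BNL-ATLAS:AGLT2": {"sonar_mbps": 450.0, "fax_read_s": 3.4}}
# push_network_metrics(metrics, "https://example.invalid/agis/network")  # hypothetical URL
```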

SSB Data Injection and Monitoring

 Recently implemented by Artem Petrosyan (slide shows the SSB monitoring page)

Various Measurements

 SSB: (slide shows screenshots of the SSB measurement views)

PanDA Use Cases

1) Use network information for cloud selection
2) Use network information for FAX brokerage
3) Use network information for job assignment
    Improve flow of ‘activated’ jobs
    Better accounting of ‘transferring’ jobs
4) Use network information for PD2P
5) Use network information for site selection
6) Provision circuits for PD2P transfers
7) Provision circuits for input transfers
8) Provision circuits for output transfers

ATLAS Computing Model

 11 Clouds: 10 T1s + 1 T0 (CERN)
   Cloud = T1 + T2s + T2Ds (except CERN)
   T2D = multi-cloud T2 sites
   2–16 T2s in each Cloud
 Task → Cloud: task brokerage
 Jobs → Sites: job brokerage
 (slide shows the Tier1 / Tier2 / Tier2D cloud diagram)

Cloud Selection Status

 Work started – to be implemented in early Fall
   Need to finish adding information to PanDA
 Optimize the choice of T1–T2 pairings (cloud selection)
   In ATLAS, production tasks are assigned to Tier 1s
   Tier 2s are attached to a Tier 1 cloud for data processing
   Any T2 may be attached to multiple T1s
   Currently, the operations team makes this assignment manually
   This could/should be automated using network information (see the sketch below)
   For example, each T2 could be assigned to a native cloud by the operations team (e.g. AGLT2 to US), and PanDA would assign it to other clouds based on network performance metrics
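A sketch of the automation described above: keep the operator-assigned "native" cloud and attach the T2 to any additional clouds whose T1 link exceeds a threshold. The threshold, rates, and function name are illustrative assumptions.

```python
def assign_clouds(t2_site: str,
                  native_cloud: str,
                  t1_rates_mbps: dict,
                  min_rate_mbps: float = 200.0) -> list:
    """Return the clouds this T2 should serve, starting with its native cloud."""
    clouds = [native_cloud]
    # Consider the best-connected T1s first.
    for cloud, rate in sorted(t1_rates_mbps.items(), key=lambda kv: -kv[1]):
        if cloud != native_cloud and rate >= min_rate_mbps:
            clouds.append(cloud)
    return clouds

# e.g. AGLT2 natively in the US cloud, with measured rates to other T1s:
print(assign_clouds("AGLT2", "US", {"US": 800.0, "DE": 350.0, "FR": 120.0}))
# -> ['US', 'DE']
```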

PanDA Use Cases

1) Use network information for cloud selection
2) Use network information for FAX brokerage
3) Use network information for job assignment
    Improve flow of ‘activated’ jobs
    Better accounting of ‘transferring’ jobs
4) Use network information for PD2P
5) Use network information for site selection
6) Provision circuits for PD2P transfers
7) Provision circuits for input transfers
8) Provision circuits for output transfers

FAX for User Analysis

 Work advancing rapidly – to be implemented this Fall
 Goal – reduce waiting time for user jobs (FAX job brokerage)
   User analysis jobs go to sites with local input data
   This can occasionally lead to long wait times (PD2P will eventually make more copies to reduce congestion)
   Meanwhile, nearby sites with good network access may have idle CPUs
 We could use network information to assign work to ‘nearby’ sites
   Initially use a cost metric generated with HammerCloud tests – treated as the ‘typical cost’ of data transfer between two sites
   Brokerage will use the concept of ‘nearby’ sites
   Calculate a weight based on the usual brokerage criteria (availability of CPUs, …) plus the network transfer cost (sketched below)
   Jobs will be sent to the site with the best weight – not necessarily the site with local data or with available CPUs
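A sketch of the "nearby site" weight described above: the usual brokerage weight (represented here only by available CPUs) discounted by the typical FAX read cost measured by HammerCloud. The functional form and reference value are assumptions for illustration.

```python
def fax_brokerage_weight(free_cpus: int,
                         has_local_copy: bool,
                         hc_read_cost_s: float,
                         reference_cost_s: float = 2.0) -> float:
    """Higher is better.  A remote site is discounted by how much slower its
    measured remote read is compared to a reference 'local-like' read time."""
    weight = float(free_cpus)
    if not has_local_copy:
        weight *= reference_cost_s / max(hc_read_cost_s, reference_cost_s)
    return weight

# A nearby site with many idle CPUs can now outrank the site holding the data:
local_site  = fax_brokerage_weight(free_cpus=10,  has_local_copy=True,  hc_read_cost_s=0.0)
nearby_site = fax_brokerage_weight(free_cpus=200, has_local_copy=False, hc_read_cost_s=4.0)
print(local_site, nearby_site)  # 10.0 vs 100.0
```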

HC BASED WAN TESTS

 HC submits 1 job/day to each of the “client” nodes. The client node is the one using the data; it is an ANALY queue.
 All the “server” sites hold one and the same dataset. Server sites are the ones delivering the data.
 Each job, every half hour, in parallel:
   Pings all of the “server” sites
   Copies a file from a site (xrdcp/dccp)
   Reads the file from a ROOT script
   Uploads all the results to the Oracle DB at CERN
 Results are shown at:
 Results are also provided in JSON format to SSB: http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Network+Measurements&highlight=false
 (a rough sketch of one measurement cycle follows below)
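A rough sketch of one measurement cycle as described above: ping a server site, copy one file, and collect the results. The hostnames, paths, and result layout are assumptions; the ROOT read step and the upload to the Oracle DB / SSB are only indicated by comments.

```python
import subprocess
import time

def measure_pair(server_host: str, xrootd_url: str) -> dict:
    """Measure latency and copy time from one 'server' site; return a result record."""
    result = {"server": server_host, "timestamp": time.time()}

    # Latency: one ICMP ping to the server site's host (hostname is assumed).
    ping = subprocess.run(["ping", "-c", "1", server_host],
                          capture_output=True, text=True)
    result["ping_ok"] = (ping.returncode == 0)

    # Copy time: fetch one test file with xrdcp (dccp would be used at dCache sites).
    start = time.time()
    copy = subprocess.run(["xrdcp", "-f", xrootd_url, "/tmp/hc_test_file"],
                          capture_output=True, text=True)
    result["copy_time_s"] = time.time() - start if copy.returncode == 0 else None

    # A real HC job would also read the file from a ROOT script and
    # upload 'result' to the Oracle DB at CERN / the SSB JSON feed.
    return result
```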

Read Testing (HammerCloud)

 Source: Rob Gardner (slide shows HammerCloud read-test results)

Medium Term Plan

 After the Fall – improve based on tests
 Improve the sources of information
   Reliability of network information
   Dynamic network information
   Internal measurements

Proposed ANSE PanDA Use Cases

1) Use network information for cloud selection
2) Use network information for FAX brokerage
3) Use network information for job assignment
    Improve flow of ‘activated’ jobs
    Better accounting of ‘transferring’ jobs
4) Use network information for PD2P
5) Use network information for site selection
6) Provision circuits for PD2P transfers
7) Provision circuits for input transfers
8) Provision circuits for output transfers

Job States

 PanDA jobs go through a succession of states tracked in the DB
   Defined
   Assigned
   Activated
   Running
   Holding
   Transferring
   Finished/failed

Assigned Jobs

 Assigned → Activated workflow
   A group of jobs is assigned to a site by PanDA brokerage
   For missing input files, data transfer is requested asynchronously
   PanDA waits for a “transfer completed” callback from the DDM system to activate jobs for execution
   Network data transfer plays a crucial role in this workflow
 Can network technology help the assigned → activated transition? (see the sketch below)
   The number of assigned jobs depends on the number of running jobs – can we use network status to adjust the rate up/down?
   Jobs are reassigned if the transfer times out (fixed duration) – can knowledge of network status be used to set a variable timeout?
   Can we use network provisioning in this step? Per block?
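A sketch of the two knobs discussed above, with assumed names and scales: (1) scale the number of jobs assigned from the current link throughput, and (2) derive a variable reassignment timeout from the expected transfer time instead of a fixed duration.

```python
def assigned_job_quota(running_jobs: int, link_rate_mbps: float,
                       nominal_rate_mbps: float = 500.0) -> int:
    """More assigned jobs when the link to the site is healthy, fewer when degraded."""
    scale = min(2.0, max(0.1, link_rate_mbps / nominal_rate_mbps))
    return int(running_jobs * scale)

def activation_timeout_hours(input_size_gb: float, link_rate_mbps: float,
                             safety_factor: float = 3.0,
                             floor_hours: float = 6.0) -> float:
    """Replace the fixed reassignment timeout with one based on the expected transfer time."""
    # GB -> megabits, divided by the rate in Mbps, converted to hours.
    expected_hours = (input_size_gb * 8 * 1024) / (link_rate_mbps * 3600)
    return max(floor_hours, safety_factor * expected_hours)
```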

Transferring Jobs

 Transferring state
   After job execution is completed, asynchronous data transfer is requested from DDM
   The callback is required for successful job completion
 How can network technology help? (see the sketch below)
   New jobs are not sent if there are too many transferring jobs
     Can we make this limit variable using network information?
   A fixed timeout is set for transferring – this delays completion of tasks
     Can network status information help – a variable timeout?
   Can we use provisioning? Per job, per block, or tuned based on average rate?
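A sketch of a variable "transferring" cap replacing the fixed limit mentioned above. The baseline cap and scaling rule are illustrative assumptions.

```python
def transferring_limit(base_limit: int,
                       outbound_rate_mbps: float,
                       nominal_rate_mbps: float = 500.0) -> int:
    """Allow more jobs in 'transferring' when the outbound link is fast,
    throttle new dispatch when it is congested."""
    scale = max(0.2, min(2.0, outbound_rate_mbps / nominal_rate_mbps))
    return max(1, int(base_limit * scale))

def can_send_new_jobs(transferring_jobs: int, base_limit: int,
                      outbound_rate_mbps: float) -> bool:
    # The brokerage would consult this before assigning more work to the site.
    return transferring_jobs < transferring_limit(base_limit, outbound_rate_mbps)
```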

PanDA Use Cases

1) Use network information for cloud selection
2) Use network information for FAX brokerage
3) Use network information for job assignment
    Improve flow of ‘activated’ jobs
    Better accounting of ‘transferring’ jobs
4) Use network information for PD2P
5) Use network information for site selection
6) Provision circuits for PD2P transfers
7) Provision circuits for input transfers
8) Provision circuits for output transfers

PD2P – How the LHC Model Changed

 PD2P = PanDA Dynamic Data Placement
   PD2P is used to distribute data for user analysis
   For production, PanDA schedules all data flows
 The initial ATLAS computing model assumed pre-placed data distribution for user analysis – PanDA sent jobs to the data
 Soon after LHC data taking started, we implemented PD2P
   Asynchronous, usage-based data placement
   Repeated use of data → make additional copies
   Backlog in processing → make additional copies
   Rebrokerage of queued jobs → use the new data location
   Deletion service removes less-used data
   Basically, T1/T2 storage is used as a cache for user analysis
 This is perfect for network integration (see the sketch below)
   Use network status information for site selection
   Provisioning – usually large datasets of known volume are transferred
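A sketch of network-aware PD2P placement as motivated above: pick the replica destination by combining free space with the measured rate from the source, and flag large, known-volume transfers as candidates for circuit provisioning. All thresholds and field names are assumptions.

```python
def choose_pd2p_target(source: str,
                       candidates: dict,
                       dataset_size_gb: float,
                       circuit_threshold_gb: float = 500.0):
    """candidates maps site -> {'free_space_gb': ..., 'rate_from_source_mbps': ...}.
    Returns (target_site, request_circuit) or (None, False) if nothing fits."""
    viable = {site: info for site, info in candidates.items()
              if info["free_space_gb"] > 2 * dataset_size_gb}
    if not viable:
        return None, False
    # Prefer the destination with the best measured rate from the source site.
    target = max(viable, key=lambda site: viable[site]["rate_from_source_mbps"])
    # Since PD2P knows the dataset volume up front, large transfers can request a circuit.
    request_circuit = dataset_size_gb >= circuit_threshold_gb
    return target, request_circuit
```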

Proposed ANSE PanDA Use Cases

1) Use network information for cloud selection
2) Use network information for FAX brokerage
3) Use network information for job assignment
    Improve flow of ‘activated’ jobs
    Better accounting of ‘transferring’ jobs
4) Use network information for PD2P
5) Use network information for site selection
6) Provision circuits for PD2P transfers
7) Provision circuits for input transfers
8) Provision circuits for output transfers

Task Brokerage

 Matchmaking per cloud is based on:
   Free disk space in the T1 SE, MoU share of the T1
   Availability of the input dataset (a set of files)
   The amount of CPU resources = the number of running jobs in the cloud (a static information system is not used)
   Downtime at the T1
   Already queued tasks with equal or higher priorities
     A high-priority task can jump over low-priority tasks
 Can knowledge of the network help? (see the sketch below)
   Can we consider availability of the network as a resource, like we consider storage and CPU resources?
   What kind of information is useful?
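One possible way to treat the network as a resource in the cloud matchmaking listed above. The existing criteria (free T1 disk, MoU share, running jobs, downtime, queued priorities) are abstracted into a single base_weight; the network term and its normalization are assumptions.

```python
def cloud_weight(base_weight: float,
                 input_at_t1: bool,
                 rate_to_t1_mbps: float,
                 nominal_rate_mbps: float = 1000.0) -> float:
    """Treat the network like CPU and storage: discount a cloud whose T1 would
    have to pull the input dataset over a weak link."""
    if input_at_t1:
        return base_weight
    return base_weight * min(1.0, rate_to_t1_mbps / nominal_rate_mbps)
```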

Job Brokerage

 Brokerage policies define job assignment to sites
   IO-intensive or TAPE read → prefer the T1
   CPU-intensive → T1 + T2s
   Flexible: clouds may allow IO-heavy jobs at T2s with a low weight
 Matchmaking per site in a cloud
   Software availability
   Free disk space in the SE, scratch disk size on the worker node (WN), memory size on the WN
   Occupancy = the number of running jobs / the number of queued jobs, and downtime
   Locality (cache hits) of input files
 Can we add network information to the matchmaking? (see the sketch below)
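One way to fold network information into the per-site matchmaking above: extend the existing cache-locality term into an estimated transfer cost for the input files that are not already at the site. The names and the cost model are illustrative assumptions.

```python
def input_transfer_cost_s(n_input_files: int,
                          n_cached_files: int,
                          avg_file_size_gb: float,
                          rate_to_site_mbps: float) -> float:
    """Estimated seconds needed to stage the missing input files to the candidate site."""
    missing = max(0, n_input_files - n_cached_files)
    missing_gb = missing * avg_file_size_gb
    # GB -> megabits, divided by the measured rate to the site in Mbps.
    return (missing_gb * 8 * 1024) / max(rate_to_site_mbps, 1.0)

# A site with few cached files but an excellent network link can still get a low
# transfer cost, and therefore remain competitive in the matchmaking weight.
```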

Job Dispatcher

 High-performance / high-throughput module
   Sends a matching job to the CE upon pilot request
   REST, non-blocking communication
   Different from brokerage, which is asynchronous
 Matching of jobs is based on
   Data locality
   Memory and disk space
   The highest-priority job is dispatched
 At this point networking is not as important
   Is this true? We still have to transfer the output
   Can we initiate provisioning?

Summary

 Many different parts of PanDA can benefit from better integration with networking
 Development has started
 Stay tuned for results