PanDA & Networking
Kaushik De, Univ. of Texas at Arlington
ANSE Workshop, Caltech, May 6, 2013

Introduction
 Background
   PanDA is a distributed computing workload management system
   PanDA relies heavily on networking for data transfers
   ANSE provides an opportunity to integrate PanDA with networking
   Note: data transfers are done asynchronously within PanDA – by DQ2 in ATLAS, PhEDEx in CMS, and PanDA Mover
     PhEDEx is part of the ANSE scope
     DQ2 is not part of ANSE
     PanDA Mover may be within the ANSE scope – TBD
 Ambitious goals for PanDA in ANSE
   Direct integration of networking with the workflow – never attempted before for large-scale automated systems like PanDA
   We would prefer to keep this agnostic to the choice of data transfer system – thus providing a higher level of workflow optimization

Concept: Network as a Resource
 PanDA as workload manager
   PanDA automatically chooses the job execution site
   Multi-level decision tree – task brokerage, job brokerage, dispatcher (sketched below)
   Also manages predictive workflows – at task definition, and in PD2P
   Current scale – one million jobs completed daily at ~100 sites
   Site selection is based on processing and storage requirements
   Can we use network information in this decision?
   Can we go even further – network provisioning?
 Network as a resource
   Optimal site selection should take network capability into account
   We do this already – but rather crudely, using job completion metrics
   The network as a resource should be managed (i.e., provisioned)
   We also do this crudely – mostly through timeouts and self-throttling
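The multi-level decision tree can be pictured as three successive narrowing steps. The following is only a toy outline; every data structure and weight formula in it is invented for illustration, and the real criteria used at each level appear on later slides.

```python
# Toy outline of the multi-level decision tree: task -> cloud, job -> site,
# pilot -> job. Every structure and formula here is invented for
# illustration; later slides give the real criteria used at each level.

def choose_cloud(task, clouds):
    """Task brokerage: pick the cloud that will host the whole task."""
    return max(clouds, key=lambda c: c["free_disk_tb"] * c["mou_share"])

def choose_site(job, sites):
    """Job brokerage: pick a site inside the chosen cloud."""
    return max(sites, key=lambda s: s["running"] / (1.0 + s["queued"]))

def choose_job(pilot, jobs):
    """Dispatcher: give the requesting pilot the best job that fits."""
    ok = [j for j in jobs if j["memory_mb"] <= pilot["memory_mb"]]
    return max(ok, key=lambda j: j["priority"], default=None)

cloud = choose_cloud(None, [{"free_disk_tb": 800, "mou_share": 0.23},
                            {"free_disk_tb": 400, "mou_share": 0.13}])
print(cloud)
```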

Current Status
 Three parallel efforts to integrate networking in PanDA:
   US ATLAS funded – primarily to improve integration with FAX
   ASCR funded – the BigPanDA project, taking PanDA beyond the LHC
   ANSE funded – this project
 We are coordinating the three efforts to maximize results
 US ATLAS FAX status
   Preliminary work done by the UC team (HC testing) and the UTA team (pilots)
   Need help from ANSE/ASCR to move forward – the US ATLAS team is busy with many other, higher-priority items
 ASCR status
   "Next Generation Workload Management and Analysis System for Big Data" – PanDA integration with networking, DOE funded (BNL, U. Texas Arlington)
   Work has started with the BNL team (Dantong Yu)

ANSE PanDA Status
 Personnel
   Artem Petrosyan and Danila Oleynik hired at UTA
   Arriving at UTA tomorrow, May 7
   Funding: FTE from ANSE, 1.5 FTE from ASCR
   Both will be at UTA for ~2 years
   There is a commitment from Dubna for 1-2 more years
   Nominally, Artem will be paid by ANSE – but they work as a team
 Plan of work
   For discussion in this meeting

Proposed ANSE PanDA Use Cases
1) Use network information for FAX brokerage
2) Use network information for job assignment
    Improve the flow of 'activated' jobs
    Better accounting of 'transferring' jobs
3) Use network information for PD2P
4) Use network information for site selection
5) Use network information for cloud selection
6) Provision circuits for PD2P transfers
7) Provision circuits for input transfers
8) Provision circuits for output transfers

FAX Integration with PanDA
 We have developed detailed plans for integrating FAX with PanDA over the past year
 Networking plays an important role in federated storage
 This time we are paying attention to networking up front
 The most interesting use case – network information used for brokering distributed analysis jobs to FAX-enabled sites
 This is the first real use case for external network information in PanDA

FAX for Distributed Analysis
 Plan of action
   Currently, DA jobs are brokered to sites which host the input datasets
   This can limit and slow the execution of DA jobs
   Use FAX to relax the constraint on data locality
   Initially, use a cost metric generated with HammerCloud tests – treated as the 'typical cost' of data transfer between two sites
   Brokerage will use the concept of 'nearby' sites
   Calculate a weight based on the usual brokerage criteria (availability of CPUs, …) plus the network transfer cost (see the sketch below)
   Jobs will be sent to the site with the best weight – not necessarily the site with local data or available CPUs
   The cost metric is already available – soon to be tested in brokerage
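As a concrete (and entirely hypothetical) illustration of the weight just described, the sketch below combines a CPU-availability term with a HammerCloud-style transfer cost. All field names and numbers are invented, not PanDA's actual schema.

```python
# Hypothetical sketch of the brokerage weight described above: the usual
# CPU-availability weight, discounted by the measured network transfer cost
# from the nearest site that holds the data. Field names are illustrative.

def site_weight(site, data_sites, transfer_cost):
    """Higher weight = better candidate for a distributed-analysis job."""
    cpu_term = site["free_slots"] / (1.0 + site["queued_jobs"])
    if site["name"] in data_sites:
        net_term = 1.0                     # data is local, no transfer needed
    else:
        # cheapest measured cost (e.g. seconds/GB from HammerCloud) to pull
        # the input over FAX from any site that has it
        cost = min(transfer_cost[(src, site["name"])] for src in data_sites)
        net_term = 1.0 / (1.0 + cost)
    return cpu_term * net_term

sites = [
    {"name": "BNL",  "free_slots": 400, "queued_jobs": 1200},
    {"name": "MWT2", "free_slots": 900, "queued_jobs": 300},
]
cost = {("BNL", "MWT2"): 2.5}              # toy HammerCloud-style metric
best = max(sites, key=lambda s: site_weight(s, {"BNL"}, cost))
print(best["name"])                        # MWT2 wins despite remote data
```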

HC-Based WAN Tests (slide: Ilija Vukotic)
 HC submits one job per day to each of the "client" nodes. A client node is the one using the data – it is an ANALY queue.
 All the "server" sites host one and the same dataset. Server sites are the ones delivering the data.
 Each job, every half hour and in parallel:
   Pings all of the "server" sites
   Copies a file from a site (xrdcp/dccp)
   Reads the file from a ROOT script
   Uploads all the results to an Oracle DB at CERN
 Results are shown on a dashboard, and are also exported in JSON format to SSB: http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Network+Measurements&highlight=false
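A rough sketch of what one such test cycle might look like. The dataset URL, helper names, and the final upload step are all placeholders, not the actual HammerCloud code, and the site tools are stubbed with subprocess calls.

```python
# Rough sketch of one HC test cycle as described above. The real jobs wrap
# site tools (xrdcp/dccp, ROOT); here they are stubbed, and the dataset URL
# and the final upload step are placeholders.

import json
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

SERVER_SITES = {"BNL": "root://dcache.example.bnl.gov//data/testfile.root"}

def test_one_site(site, url):
    result = {"site": site, "timestamp": time.time()}
    host = url.split("//")[1].split("/")[0]
    try:
        ping = subprocess.run(["ping", "-c", "3", host], capture_output=True)
        result["ping_ok"] = (ping.returncode == 0)
    except FileNotFoundError:            # no ping binary available
        result["ping_ok"] = None
    try:
        t0 = time.time()
        copy = subprocess.run(["xrdcp", "-f", url, "/tmp/testfile.root"],
                              capture_output=True)
        result["copy_s"] = time.time() - t0 if copy.returncode == 0 else None
    except FileNotFoundError:            # xrdcp not installed locally
        result["copy_s"] = None
    # a real test would also read the file back through a ROOT script
    return result

def run_cycle():
    with ThreadPoolExecutor() as pool:   # hit all server sites in parallel
        results = list(pool.map(lambda kv: test_one_site(*kv),
                                SERVER_SITES.items()))
    print(json.dumps(results))           # stand-in for the Oracle DB upload

run_cycle()                              # the harness repeats this half-hourly
```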

HC-Based WAN Tests (cont.)
[Figure: color-coded matrix of site-to-site connection quality; the weight shown is the time to transfer a file. Slide: Ilija Vukotic]

Read Testing (HammerCloud)
[Figure: HammerCloud WAN read-test results. Source: Rob Gardner]

Details of the FAX DA Plan
 Performance data is kept in AGIS (ATLAS Grid Information System)
   A schema has been defined for the HammerCloud (HC) measurements
   Auto-update, validate, and monitor this information
   Alternative sources could be perfSONAR etc. – the schema is flexible
 Information in the schedconfig DB
   The data is massaged and stored in a special table in PandaDB (Oracle)
   The schema is fixed – irrespective of the source(s)
   Updates must be reliable – the information is used for real-time decisions
 PanDA brokerage uses the information to calculate a weight – a single weight incorporating all knowledge of CPU, storage, and networking
 This is the first use case of the network as a resource
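The slides do not show the actual table layout, so the following dataclass is only a guess at what one row of such a network-metrics table might carry; every field is illustrative.

```python
# Guess at what a row in the network-metrics table might look like; the real
# AGIS/PandaDB schema is not shown in the slides, so every field here is
# illustrative only.

from dataclasses import dataclass

@dataclass
class NetworkCostRow:
    source_site: str        # site delivering the data
    dest_site: str          # ANALY queue consuming the data
    metric: str             # e.g. "hc_copy_time", "perfsonar_throughput"
    value: float            # the measurement itself
    unit: str               # e.g. "s", "Mbit/s"
    measured_at: float      # epoch timestamp; stale rows must be refreshed

row = NetworkCostRow("BNL", "ANALY_MWT2", "hc_copy_time", 42.0, "s",
                     1367800000.0)
print(row)
```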

Job States
 PanDA jobs go through a succession of steps tracked in the DB:
   Defined
   Assigned
   Activated
   Running
   Holding
   Transferring
   Finished/failed
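A minimal sketch of this life cycle as a Python enum, with the forward path encoded in the order the slide lists; the actual PanDA state machine is richer, so the transition table below is an assumption.

```python
# Minimal sketch of the job life cycle listed above; transitions follow the
# slide's ordering and are an assumption, not PanDA's full state machine.

from enum import Enum

class JobState(Enum):
    DEFINED = "defined"
    ASSIGNED = "assigned"          # waiting for input data transfer
    ACTIVATED = "activated"        # input ready, waiting for a pilot
    RUNNING = "running"
    HOLDING = "holding"            # execution done, output being registered
    TRANSFERRING = "transferring"  # output moving to its final destination
    FINISHED = "finished"
    FAILED = "failed"

NEXT = {
    JobState.DEFINED: [JobState.ASSIGNED],
    JobState.ASSIGNED: [JobState.ACTIVATED],
    JobState.ACTIVATED: [JobState.RUNNING],
    JobState.RUNNING: [JobState.HOLDING, JobState.FAILED],
    JobState.HOLDING: [JobState.TRANSFERRING, JobState.FINISHED],
    JobState.TRANSFERRING: [JobState.FINISHED, JobState.FAILED],
}
```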

Assigned Jobs
 Assigned -> Activated workflow
   A group of jobs is assigned to a site by PanDA brokerage
   For missing input files, data transfer is requested asynchronously
   PanDA waits for a "transfer completed" callback from the DDM system before activating the jobs for execution
   Network data transfer plays a crucial role in this workflow
 Can network technology help the assigned -> activated transition?
   The number of assigned jobs depends on the number of running jobs – can we use network status to adjust the rate up/down?
   Jobs are reassigned if the transfer times out (a fixed duration) – can knowledge of network status be used to set a variable timeout instead (see the sketch below)?
   Can we use network provisioning in this step? Per block?
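A toy illustration of the variable-timeout idea: scale the reassignment timeout by the expected transfer time under current network conditions. The formula, thresholds, and numbers are all invented for illustration.

```python
# Toy illustration of the variable-timeout idea: scale the reassignment
# timeout by the expected transfer time from current network measurements.
# The formula and all numbers are invented for illustration.

def transfer_timeout(dataset_size_gb, measured_throughput_gbps,
                     floor_h=6.0, ceiling_h=72.0, safety=4.0):
    """Return a timeout in hours for an assigned job's input transfer."""
    expected_h = dataset_size_gb * 8 / (measured_throughput_gbps * 3600)
    return min(ceiling_h, max(floor_h, safety * expected_h))

# A healthy 10 Gb/s path gets a short timeout; a degraded 0.2 Gb/s path
# gets a long one instead of triggering premature reassignment.
print(transfer_timeout(500, 10.0))   # 6.0  (floor)
print(transfer_timeout(500, 0.2))    # ~22.2
```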

Transferring Jobs
 The transferring state
   After job execution is completed, asynchronous data transfer is requested from DDM
   A callback is required for successful job completion
 How can network technology help?
   New jobs are not sent if there are too many transferring jobs – can we make this limit variable, using network information (see the sketch below)?
   A fixed timeout is set for transferring – this delays the completion of tasks
   Can network status information help – a variable timeout?
   Can we use provisioning? Per job, per block, or tuned on the average rate?
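A toy version of that network-aware throttle: allow more jobs in the transferring state when the outbound path is performing well. The baseline limit, nominal rate, and clamping bounds are invented.

```python
# Toy network-aware throttle: allow more jobs in the 'transferring' state
# when the outbound path performs well. All numbers are invented.

def transferring_limit(base_limit, throughput_gbps, nominal_gbps=10.0):
    """Scale the max number of transferring jobs with measured throughput."""
    ratio = max(0.1, min(2.0, throughput_gbps / nominal_gbps))
    return int(base_limit * ratio)

def can_send_new_job(n_transferring, base_limit, throughput_gbps):
    return n_transferring < transferring_limit(base_limit, throughput_gbps)

print(transferring_limit(2000, 10.0))  # 2000 – nominal network
print(transferring_limit(2000, 2.0))   # 400  – degraded, throttle harder
```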

PD2P – How the LHC Model Changed
 PD2P = PanDA Dynamic Data Placement
 PD2P is used to distribute data for user analysis
   For production, PanDA schedules all data flows itself
 The initial ATLAS computing model assumed pre-placed data for user analysis – PanDA sent jobs to the data
 Soon after LHC data-taking started, we implemented PD2P
   Asynchronous, usage-based data placement:
     Repeated use of data → make additional copies
     Backlog in processing → make additional copies
     Rebrokerage of queued jobs → use the new data location
     A deletion service removes less-used data
   Basically, T1/T2 storage is used as a cache for user analysis
 This is perfect for network integration (see the sketch below)
   Use network status information for site selection
   Provisioning – usually large datasets of known volume are transferred
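A cartoon of the PD2P triggers just listed: replicate a dataset when it is reused often or when jobs queue up behind it, and prefer the destination the network currently favors. The thresholds, fields, and throughput numbers are all illustrative.

```python
# Cartoon of the PD2P triggers listed above; thresholds and fields are
# illustrative, not the production PD2P policy.

def should_replicate(dataset):
    return dataset["n_uses"] >= 5 or dataset["queued_jobs"] >= 200

def pick_destination(candidates, throughput_to):
    # favour the candidate site with the best measured throughput
    # from the current replica's host
    return max(candidates, key=lambda site: throughput_to[site])

ds = {"name": "data12_8TeV.NTUP", "n_uses": 7, "queued_jobs": 50}
if should_replicate(ds):
    dest = pick_destination(["MWT2", "SWT2"], {"MWT2": 8.2, "SWT2": 3.1})
    print(f"replicate {ds['name']} -> {dest}")  # known volume: could also
                                                # request a circuit here
```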

ATLAS Computing Model
[Diagram: clouds containing a Tier 1 site plus Tier 2 and Tier 2D sites; tasks flow to clouds, jobs flow to sites]
 11 clouds: 10 T1s + 1 T0 (CERN)
 Cloud = T1 + T2s + T2Ds (except CERN)
 T2D = multi-cloud T2 sites
 2-16 T2s in each cloud
 Task → Cloud: task brokerage
 Jobs → Sites: job brokerage

Task Brokerage
 Matchmaking per cloud is based on:
   Free disk space in the T1 SE, MoU share of the T1
   Availability of the input dataset (a set of files)
   The amount of CPU resources = the number of running jobs in the cloud (a static information system is not used)
   Downtime at the T1
   Already-queued tasks with equal or higher priorities
   A high-priority task can jump over low-priority tasks
 Can knowledge of the network help?
   Can we consider availability of the network as a resource, like we consider storage and CPU resources (see the sketch below)?
   What kind of information is useful?
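One speculative way to fold a network term into the per-cloud criteria above: multiply the existing weight by a per-cloud network-health factor. The weight formula, field names, and the health values are all invented for illustration.

```python
# Speculative sketch of adding a network term to per-cloud matchmaking;
# the weight formula and all fields/values are invented, not PanDA's.

def cloud_weight(cloud, net_health):
    if cloud["t1_downtime"] or not cloud["has_input_dataset"]:
        return 0.0
    base = (cloud["t1_free_disk_tb"] * cloud["mou_share"]
            * cloud["running_jobs"] / (1 + cloud["queued_high_prio_tasks"]))
    # net_health in [0, 1], e.g. derived from perfSONAR loss/throughput
    return base * net_health.get(cloud["name"], 0.5)

clouds = [
    {"name": "US", "t1_free_disk_tb": 800, "mou_share": 0.23,
     "running_jobs": 30000, "queued_high_prio_tasks": 4,
     "t1_downtime": False, "has_input_dataset": True},
    {"name": "FR", "t1_free_disk_tb": 400, "mou_share": 0.13,
     "running_jobs": 15000, "queued_high_prio_tasks": 1,
     "t1_downtime": False, "has_input_dataset": True},
]
best = max(clouds, key=lambda c: cloud_weight(c, {"US": 0.9, "FR": 0.4}))
print(best["name"])
```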

Job Brokerage
 Brokerage policies define job assignment to sites
   I/O-intensive or tape-read jobs -> prefer the T1
   CPU-intensive jobs -> T1 + T2s
   Flexible: clouds may allow I/O-heavy jobs at T2s with a low weight
 Matchmaking per site in a cloud:
   Software availability
   Free disk space in the SE; scratch disk size and memory size on the worker node (WN)
   Occupancy = the number of running jobs / the number of queued jobs, and downtime
   Locality (cache hits) of input files
 Can we add network information to the matchmaking? (see the sketch below)
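A sketch of the two-step matchmaking above – hard filters first, then a weight – with a hypothetical network factor bolted onto the weight. Field names and the weight formula are illustrative only.

```python
# Sketch of per-site matchmaking: hard filters, then a weight with a
# hypothetical network factor. Fields and formula are illustrative.

def eligible(site, job):
    return (job["release"] in site["software"]
            and site["scratch_gb"] >= job["scratch_gb"]
            and site["wn_memory_mb"] >= job["memory_mb"]
            and not site["downtime"])

def weight(site, net_quality):
    occupancy = site["running"] / max(1, site["queued"])
    locality = site["cached_input_fraction"]   # cache hits on input files
    # net_quality in [0, 1] could come from the HC/perfSONAR cost table
    return occupancy * (0.5 + 0.5 * locality) * net_quality[site["name"]]

def broker(job, sites, net_quality):
    ok = [s for s in sites if eligible(s, job)]
    return max(ok, key=lambda s: weight(s, net_quality), default=None)
```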

Job Dispatcher
 A high-performance / high-throughput module
   Sends a matching job to the CE upon pilot request
   REST, non-blocking communication
   Different from brokerage, which is asynchronous
 Matching of jobs is based on:
   Data locality
   Memory and disk space
   The highest-priority job is dispatched
 At this point networking is not as important
   Is this true? – we still have to transfer the output
   Can we initiate provisioning here? (see the sketch below)
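A sketch of the dispatcher's matching step, plus the speculative hook for starting provisioning at dispatch time; the commented-out request_circuit call is purely hypothetical, as are the field names.

```python
# Sketch of the dispatcher's matching step described above. The
# request_circuit hook is purely hypothetical.

def dispatch(pilot, activated_jobs):
    """Return the highest-priority activated job the pilot can run."""
    fits = [j for j in activated_jobs
            if j["memory_mb"] <= pilot["memory_mb"]
            and j["disk_gb"] <= pilot["disk_gb"]
            and j["input_site"] == pilot["site"]]      # data locality
    if not fits:
        return None
    job = max(fits, key=lambda j: j["priority"])
    # Speculative: output size and destination are known at this point,
    # so dispatch is one natural place to request a circuit in advance:
    # request_circuit(src=pilot["site"], dst=job["output_site"],
    #                 gb=job["output_gb"])
    return job
```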

Summary
 Many different parts of PanDA can benefit from better integration with networking
 We need to work out the details of the information to be collected
 We need to work out the details of circuit provisioning
 Developers are starting soon (actually tomorrow)!
 It would be useful to have a well-defined work plan for them, based on the discussions here