Published by Milo Austin. Modified over 9 years ago.
1
MultiJob PanDA Pilot
Oleynik Danila
28/05/2015
2
Overview
– Initial PanDA pilot concept & HPC
– Motivation
– PanDA pilot workflow in a nutshell
– MultiJob pilot in detail
3
Initial PanDA pilot concept & HPC
Pilot definition: «The Panda pilot is an execution environment used to prepare the computing element, request the actual payload (a production or user analysis job), execute it, and clean up when the payload has finished»
One limitation of HPC is the restricted number of jobs (pilots) that can be launched under one account (usually fewer than 10), although a single job may occupy a lot of resources (tens to hundreds of nodes).
For the moment, ATLAS has no payloads that can be executed on more than one node (MPI).
4
Motivation
There is no quick way to get MPI-enabled ATLAS production payloads.
HPC resources should be used as efficiently as possible. There is no gain in launching only a few PanDA jobs simultaneously if many more resources are available.
– The potential output of a machine like Titan is comparable with that of at least a Tier-2 center.
A possible solution that would significantly increase the efficiency of HPC usage is to launch a set of PanDA jobs assembled into one MPI job.
5
PanDA pilot workflow in a nutshell
The pilot workflow consists of the following basic steps:
– Retrieve job information
– Set up the environment
– Stage in input data
– Execute payloads
– Stage out output data and logs
During execution, the pilot monitors available disk resources and output files, and updates the PanDA server with the status of the PanDA job.
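The five steps above can be pictured as a simple sequential driver. This is a minimal sketch only: the step names, the `handlers` mapping and the `report_status` callback are hypothetical and are not taken from the pilot code, and the continuous monitoring described on the slide is reduced here to one status report per completed step.

```python
from typing import Callable, Dict, List

# Hypothetical step names mirroring the slide; the real pilot's names may differ.
STEPS = [
    "retrieve_job",
    "setup_environment",
    "stage_in",
    "execute_payload",
    "stage_out",
]


def run_pilot(handlers: Dict[str, Callable[[dict], None]],
              report_status: Callable[[str], None]) -> List[str]:
    """Run every workflow step in order and report status after each one."""
    job: dict = {}  # shared job description, filled in by the handlers
    completed: List[str] = []
    for step in STEPS:
        handlers[step](job)      # each handler may read and update the job dict
        completed.append(step)
        report_status(step)      # stands in for the update sent to the PanDA server
    return completed
```

A caller would supply one handler per step, for example a `retrieve_job` handler that fills in the job description and a `stage_out` handler that copies outputs and logs.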
6
MultiJob Pilot
The current realization of the MultiJob pilot is implemented with the same workflow and framework as the regular PanDA pilot.
Most core components and basic procedures of the regular pilot were modified to serve multiple jobs in different states.
The procedures for intercommunication between the runJob and Monitor processes were slightly redesigned (without changing the technology).
The current version was designed as a “proof of concept”.
7
MultiJob Pilot. Requesting jobs.
For the moment, the PanDA server has no method for retrieving a set of jobs.
The set of jobs is collected from the server in a loop, one by one. One request takes ~1 sec, so this will not scale well to large numbers of jobs.
It is important that a bunch contains jobs from only one task, to avoid complications with environment setup later.
The number of requested jobs is fitted to the available backfill resources.
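The collection loop described above might look roughly like the following. This is a sketch under stated assumptions: `get_job` is a hypothetical callable standing in for a single request to the PanDA server, the `task_id` key is an assumed field name, and a job that belongs to a different task simply ends the loop here (what the real pilot does with such a job is not stated in the slides).

```python
from typing import Callable, List, Optional


def collect_jobs(get_job: Callable[[], Optional[dict]],
                 max_jobs: int) -> List[dict]:
    """Request jobs one by one, keeping only jobs from the first job's task.

    max_jobs represents the number of jobs that fit the available backfill.
    """
    jobs: List[dict] = []
    task_id = None
    for _ in range(max_jobs):
        job = get_job()              # one server request per job
        if job is None:              # server has nothing more to offer
            break
        if task_id is None:
            task_id = job["task_id"]
        if job["task_id"] != task_id:
            break                    # sketch only: stop at the first foreign task
        jobs.append(job)
    return jobs


def estimated_request_time(n_jobs: int, seconds_per_request: float = 1.0) -> float:
    """Sequential requests cost roughly n_jobs * seconds_per_request."""
    return n_jobs * seconds_per_request
```

With the ~1 sec per request quoted on the slide, collecting 20 jobs costs about 20 seconds, while hundreds of jobs would take several minutes, which is why a bulk request method would be needed at larger scale.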
8
MultiJob Pilot. Environment setup and verification
Environment setup is in most cases experiment-specific.
It is organized for each job in the set.
It is optimized by reducing the number of repeated identical checks.
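One way to avoid repeating identical checks is to cache the result of each check by the environment it verifies. The sketch below assumes, purely for illustration, that each job carries a `release` key identifying its environment and that `check` is a hypothetical verification function; neither name comes from the pilot itself.

```python
from typing import Callable, Dict, Iterable


def verify_environments(jobs: Iterable[dict],
                        check: Callable[[str], bool]) -> Dict[int, bool]:
    """Verify the environment of every job, running each distinct check once."""
    cache: Dict[str, bool] = {}      # environment key -> check result
    results: Dict[int, bool] = {}    # job id -> check result
    for job in jobs:
        key = job["release"]
        if key not in cache:
            cache[key] = check(key)  # the expensive check runs only on a cache miss
        results[job["job_id"]] = cache[key]
    return results
```

Because all jobs in a set come from one task, most of them would be expected to share the same environment, so the check would typically run once for the whole set.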
9
MultiJob Pilot. StageIn
Optimized by skipping the remote stage-in when the data has already been copied locally (for another job in the set).
– This simple optimization gives a significant reduction in the total stage-in time.
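The stage-in optimization amounts to remembering which input files are already present locally. In this minimal sketch, `fetch_remote` is a hypothetical stand-in for the actual copy tool and the `inputs` key is an assumed field name for a job's list of input files.

```python
from typing import Callable, Iterable, Optional


def stage_in(jobs: Iterable[dict],
             fetch_remote: Callable[[str], None],
             local_files: Optional[Iterable[str]] = None) -> int:
    """Stage in inputs for all jobs and return the number of remote copies made."""
    local = set(local_files or ())   # files already available on local storage
    remote_copies = 0
    for job in jobs:
        for name in job["inputs"]:
            if name in local:
                continue             # already copied for another job in the set
            fetch_remote(name)
            local.add(name)
            remote_copies += 1
    return remote_copies
```

If many jobs in the set share the same input files, the number of remote transfers drops from one per job per file to one per distinct file.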
10
MultiJob Pilot. Payload execution
The number of jobs is adjusted once more according to the available backfill.
Jobs that do not fit will fail with the sub-status “rejected”.
PanDA jobs are launched as separate MPI ranks through a special wrapper.
– The transformation name and input parameters are passed through a file.
– CPU consumption time and the trf exit code are published in a rank report file.
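The two mechanisms above, trimming the job set to the backfill and running one job per rank, can be sketched as follows. Everything here is an assumption made for illustration: the JSON file layout, the field names (`transformation`, `parameters`, `job_id`) and the report file name are hypothetical, and the rank is passed in as a plain argument. In a real MPI launch the rank could be obtained with `mpi4py` via `MPI.COMM_WORLD.Get_rank()`.

```python
import json
import os
import subprocess
from typing import List, Tuple


def fit_to_backfill(jobs: List[dict], free_slots: int) -> Tuple[List[dict], List[dict]]:
    """Split jobs into those that fit the free slots and those to be rejected."""
    accepted = jobs[:free_slots]
    rejected = [dict(job, sub_status="rejected") for job in jobs[free_slots:]]
    return accepted, rejected


def run_rank(rank: int, jobs_file: str, report_dir: str) -> dict:
    """Run the job assigned to this rank and write a per-rank report file."""
    with open(jobs_file) as f:
        jobs = json.load(f)          # hypothetical layout: list indexed by rank
    job = jobs[rank]

    before = os.times()
    proc = subprocess.run([job["transformation"], *job["parameters"]])
    after = os.times()
    # CPU time consumed by the finished child process (user + system).
    cpu_time = ((after.children_user - before.children_user)
                + (after.children_system - before.children_system))

    report = {
        "rank": rank,
        "job_id": job["job_id"],
        "exit_code": proc.returncode,
        "cpu_time": cpu_time,
    }
    with open(os.path.join(report_dir, f"rank_{rank}.json"), "w") as f:
        json.dump(report, f)
    return report
```

After the MPI job finishes, the pilot side could read the per-rank report files to set the final state of each PanDA job.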
11
MultiJob Pilot. StageOut
No special optimization is required for the moment, since stage-out is not a time-critical operation on HPC.
– Optimization will be reviewed as the scale grows to hundreds of PanDA jobs launched simultaneously by one pilot.
12
First results
The MultiJob pilot was tested with jobs from a validated ATLAS production task.
1000 jobs were executed (100000 events generated).
The scale was increased from 3 to 20 simultaneously launched jobs.
No significant increase in the execution time of simultaneously launched jobs was observed.