HTCondor and Workflows: An Introduction HTCondor Week 2015 Kent Wenger.

Outline
› Introduction/motivation
› Basic DAG concepts
› Pre/Post scripts
› Rescue DAGs
› Running and monitoring a DAG

Why workflows?

My jobs have dependencies… Can HTCondor help solve my dependency problems? Yes! Workflows are the answer.

What are workflows?
› In general: a sequence of connected steps
› In our case:
 – Steps are HTCondor jobs
 – The sequence is defined at a higher level
 – Controlled by a Workflow Management System (WMS), not just a script

Example workflow (shown as a diagram on the slide): Preparation → 10k Simulation jobs → Analysis

Workflows – launch and forget
› Automates tasks the user could perform manually (for example, the previous slide)…
 – But the WMS takes care of them automatically
› A workflow can take days, weeks or even months
› The result: one user action can utilize many resources while maintaining complex job interdependencies and data flows

Workflow management systems
› DAGMan (Directed Acyclic Graph Manager)
 – HTCondor's WMS
 – Introduction/basic features in this talk
 – Advanced/new features in a later talk
› Pegasus
 – A higher level on top of DAGMan
 – Data- and grid-aware
 – A talk tomorrow with more details

Outline
› Introduction/motivation
› Basic DAG concepts
› Pre/Post scripts
› Rescue DAGs
› Running and monitoring a DAG

DAG (directed acyclic graph) definitions
› DAGs have one or more nodes (or vertices)
› Dependencies are represented by arcs (or edges); these are arrows that go from parent to child
› No cycles! (Diagram: nodes A, B, C, D, with a cycle-making edge marked "No!")

HTCondor and DAGs
› Each node represents an HTCondor job (or cluster)
› Dependencies define the possible orders of job execution
(Diagram: the diamond DAG of nodes A, B, C, D)

Charlie learns DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows Charlie to specify the dependencies between his HTCondor jobs, so that DAGMan manages the jobs automatically
› Dependency example: do not get married until rehab has completed successfully

Defining a DAG to DAGMan
A DAG input file defines a DAG:

# file name: diamond.dag
Job A a.submit
Job B b.submit
Job C c.submit
Job D d.submit
Parent A Child B C
Parent B C Child D

(Diagram: the diamond DAG – A at the top, B and C in the middle, D at the bottom)

Basic DAG commands
› The Job command defines a name and associates that name with an HTCondor submit file
 – The name is used in many other DAG commands
 – "Job" should really be "node"
› The Parent...Child command creates a dependency between nodes
 – A child cannot run until its parent completes successfully

Submit description files

For node B:
# file name: b.submit
universe = vanilla
executable = B
input = B.in
output = B.out
error = B.err
log = B.log
queue

For node C:
# file name: c.submit
universe = standard
executable = C
input = C.in1
output = C.out
error = C.err
log = C.log
queue
input = C.in2
queue

Jobs/clusters
› Submit description files used in a DAG can create multiple jobs, but they must all be in a single cluster
 – A submit file that creates more than one cluster causes node failure
› The failure of any job means the entire cluster fails; other jobs in the cluster are removed
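To make the single-cluster rule concrete, here is a sketch of a submit file whose one queue statement creates two procs in a single cluster (the executable and file names are illustrative, not from the talk):

```text
# Hypothetical submit file: one cluster, two procs.
universe   = vanilla
executable = work                     # illustrative name
output     = work.$(Process).out      # per-proc output files
error      = work.$(Process).err
log        = work.log
queue 2                               # two jobs in the same cluster -- fine in a DAG node
```

This yields procs 0 and 1 of one cluster. By contrast, changing the executable between two queue statements would start a new cluster, which would fail the node.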

Node success or failure
› A node either succeeds or fails
› Based on the return value of the job(s):
 – 0: success
 – not 0: failure
› In this example, C fails
› Failed nodes block execution; the DAG fails
(Diagram: the diamond DAG with node C marked as failed)

Outline
› Introduction/motivation
› Basic DAG concepts
› Pre/Post scripts
› Rescue DAGs
› Running and monitoring a DAG

PRE and POST scripts
› DAGMan allows optional PRE and/or POST scripts for any node
 – Not necessarily a script: any executable
 – Run before (PRE) or after (POST) the job
› Scripts run on the submit machine (not the execute machine)
› In the DAG input file:

Job A a.submit
Script PRE A before-script arguments
Script POST A after-script arguments

(Diagram: PRE script → HTCondor job → POST script, making up node A)

DAG node with scripts
› DAGMan treats the node as a unit (e.g., dependencies are between nodes)
› The PRE script, job, or POST script determines node success or failure (a table in the manual gives details)
› If the PRE script fails, the job is not run, but the POST script is
(Diagram: PRE script → HTCondor job → POST script, together forming the node)

Why PRE/POST scripts?
› Set up input
› Check output
› Dynamically create a submit file or sub-DAG (more later today)
› Probably lots of other reasons…
› Should be lightweight (they run on the submit machine)
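As a concrete, hypothetical example of the "check output" use: a POST script can simply test that the node's output file is non-empty. The function name and file arguments below are illustrative, not from the talk:

```shell
# Sketch of a POST-script check: succeed (exit 0) only if the given
# output file exists and is non-empty. DAGMan would invoke this via a
# line such as:  Script POST A check_output.sh A.out
check_output() {
    if [ -s "$1" ]; then
        echo "POST: $1 looks good"
        return 0    # node succeeds
    else
        echo "POST: $1 missing or empty" >&2
        return 1    # node fails
    fi
}
```

Because scripts run on the submit machine, a quick file test like this is about the right weight.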

Script argument variables
› $JOB : node name
› $JOBID : Condor ID (cluster.proc) (POST only)
› $RETRY : current retry
› $MAX_RETRIES : max number of retries
› $RETURN : exit code of the HTCondor/Stork job (POST only)
› $PRE_SCRIPT_RETURN : PRE script return value (POST only)
› $DAG_STATUS : a number indicating the state of DAGMan (see the manual for details)
› $FAILED_COUNT : the number of nodes that have failed in the DAG
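These variables are expanded on the Script line in the DAG input file. A hypothetical wiring (the script name is illustrative):

```text
# Pass DAGMan macros to the POST script so it knows which node ran,
# how the job exited, and which retry this was.
Job A a.submit
Retry A 2
Script POST A post_check.sh $JOB $RETURN $RETRY
```

Inside post_check.sh, $1 would then be the node name A, $2 the job's exit code, and $3 the current retry number.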

Outline
› Introduction/motivation
› Basic DAG concepts
› Pre/Post scripts
› Rescue DAGs
› Running and monitoring a DAG

Rescue DAGs
What if things don't complete perfectly?
› We want to re-try without duplicating work
› Rescue DAGs do this – details in a later talk
› Generated automatically when a DAG fails
› Run automatically
(Diagram: a DAG with nodes A; B1, B2, B3; C1, C2, C3; D)
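When a DAG fails, DAGMan writes the rescue DAG next to the original file, and re-running the same submit command picks it up automatically; a sketch of the flow (the DAG file name is illustrative):

```text
# A failed run of:
#   condor_submit_dag diamond.dag
# leaves a rescue file alongside the original:
#   diamond.dag.rescue001
# Re-running the same command then uses the rescue DAG,
# skipping nodes that already completed successfully:
#   condor_submit_dag diamond.dag
```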

Outline
› Introduction/motivation
› Basic DAG concepts
› Pre/Post scripts
› Rescue DAGs
› Running and monitoring a DAG

Submitting a DAG to HTCondor
› To submit an entire DAG, run condor_submit_dag DagFile
› condor_submit_dag creates a submit description file for DAGMan, and DAGMan itself is submitted as an HTCondor job (in the scheduler universe)
› The -f (force) option forces overwriting of existing files (to re-run a previously-run DAG)
› Don't try to run duplicate DAG instances!

Controlling running DAGs: remove
› condor_rm dagman_id
 – Removes the entire workflow
 – Removes all queued node jobs
 – Kills PRE/POST scripts
 – Creates a rescue DAG (more on this later today)
› Work done by partially-completed node jobs is lost
 – Relatively small jobs are good

Controlling running DAGs: hold/release
› condor_hold dagman_id
 – "Pauses" the DAG
 – Queued node jobs continue
 – No new node jobs submitted
 – No PRE or POST scripts are run
 – DAGMan stays in the queue if not released
› condor_release dagman_id
 – DAGMan "catches up", starts submitting jobs

Controlling running DAGs: the halt file
› "Pauses" the DAG (different semantics than hold)
› Queued node jobs continue
› POST scripts are run as jobs finish
› No new jobs will be submitted and no PRE scripts will be run
› When all submitted jobs complete, DAGMan creates a rescue DAG and exits (if not un-halted)

The halt file (cont)
› Create a file named DagFile.halt in the same directory as your DAG file
› Remove the halt file to resume normal operation
› Should be noticed within 5 seconds (DAGMAN_USER_LOG_SCAN_INTERVAL)
› Good if load on the submit machine is very high
› Avoids the hold/release problem of possible duplicate PRE/POST script instances
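The mechanics are just file creation and removal; assuming a DAG file named diamond.dag (illustrative), pausing and resuming looks like:

```shell
# Pause a running DAG: DAGMan notices the halt file within ~5 seconds
# (DAGMAN_USER_LOG_SCAN_INTERVAL) and stops submitting new work.
touch diamond.dag.halt

# ...later, resume normal operation by removing it:
rm diamond.dag.halt
```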

Monitoring running DAGs: condor_q -dag
› Shows current workflow state
› The -dag option associates DAG node jobs with the parent DAGMan job

> condor_q -dag
-- Submitter: llunet.cs.wisc.edu
 ID    OWNER/NODENAME    SUBMITTED    RUN_TIME   ST  PRI  SIZE  CMD
       nwp               4/25 13:…    0:00:50    R              condor_dagman -f –
 |-1                     4/25 13:…    0:00:23    R              sh
 |-0                     4/25 13:…    0:00:30    R              condor_dagman -f –
 |-A                     4/25 13:…    0:00:03    R              sh
4 jobs; 0 completed, 0 removed, 0 idle, 4 running, 0 held, 0 suspended

Monitoring a DAG: the dagman.out file
› Logs detailed workflow history
› Mostly for debugging – the first place to look if something goes wrong!
› Named DagFile.dagman.out
› Verbosity controlled by the DAGMAN_VERBOSITY configuration macro and -debug n on the condor_submit_dag command line
 – 0: least verbose
 – 7: most verbose
› Don't decrease verbosity unless really needed
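Since the file can get large, a quick scan for trouble is often enough. This is a generic grep sketch; the file name follows the DagFile.dagman.out convention and is assumed here to be diamond.dag.dagman.out:

```shell
# Show the most recent dagman.out lines mentioning errors or failures.
grep -iE 'error|fail' diamond.dag.dagman.out | tail -20
```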

Dagman.out contents

...
04/17/11 13:11:26 Submitting Condor Node A job(s)...
04/17/11 13:11:26 submitting: condor_submit -a dag_node_name' '=' 'A -a +DAGManJobId' '=' '… -a DAGManJobId' '=' '… -a submit_event_notes' '=' 'DAG' 'Node:' 'A -a +DAGParentNodeNames' '=' '"" dag_files/A2.submit
04/17/11 13:11:27 From submit: Submitting job(s).
04/17/11 13:11:27 From submit: 1 job(s) submitted to cluster …
04/17/11 13:11:27 assigned Condor ID (…)
04/17/11 13:11:27 Just submitted 1 job this cycle...
04/17/11 13:11:27 Currently monitoring 1 Condor log file(s)
04/17/11 13:11:27 Event: ULOG_SUBMIT for Condor Node A (…)
04/17/11 13:11:27 Number of idle job procs: 1
04/17/11 13:11:27 Of 4 nodes total:
04/17/11 13:11:27      Done   Pre   Queued   Post   Ready   Un-Ready   Failed
04/17/11 13:11:27      ===    ===   ===      ===    ===     ===        ===
04/17/11 13:11:27      0      0     1        0      0       3          0
04/17/11 13:11:27 0 job proc(s) currently held
...

This is a small excerpt of the dagman.out file.

More information
› More in a later talk!
› There's much more detail, as well as examples, in the DAGMan section of the online HTCondor manual
› DAGMan: dagman/dagman.html
› For more questions: htcondor-