Peter F. Couvares
Computer Sciences Department, University of Wisconsin-Madison
Condor DAGMan: Managing Job Dependencies with Condor

Condor DAGMan
› What is DAGMan?
› What is it good for?
› How does it work?
› What’s next?

DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
› (e.g., “Don’t run job B until job A has completed successfully.”)

Typical Scenarios
› Jobs whose output needs to be summarized or post-processed once they complete.
› Jobs that need data to be generated or pre-processed before they can use it.
› Jobs which require data to be staged to/from remote repositories before they start or after they finish.

What is a DAG?
› A DAG is the data structure used by DAGMan to represent these dependencies.
› Each job is a “node” in the DAG.
› Each node can have any number of “parents” or “children” (or neither), as long as there are no loops!
[Figure: DAG with nodes Job A, Job B, Job C, and Job D]
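The "no loops" rule can be illustrated with a small sketch. It is not DAGMan's own code: it represents a DAG as a plain Python dictionary mapping each node to its children (an assumed representation, and `topological_order` is a hypothetical helper name) and uses a standard topological sort to order the nodes or report a loop. The diamond shape is the one defined by the diamond.dag file in the slides.

```python
from collections import deque


def topological_order(children):
    """Return nodes so that every parent comes before its children.

    `children` maps each node name to a list of its child node names.
    Raises ValueError if the graph contains a loop (so it is not a DAG).
    """
    # Count how many parents each node has.
    parent_count = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            parent_count[kid] += 1

    # Nodes without parents can go first.
    ready = deque(sorted(n for n, c in parent_count.items() if c == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for kid in children[node]:
            parent_count[kid] -= 1
            if parent_count[kid] == 0:
                ready.append(kid)

    # Any node left over is part of (or behind) a loop.
    if len(order) != len(children):
        raise ValueError("graph has a loop, so it is not a DAG")
    return order


# The diamond from diamond.dag: A -> B, C -> D
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(topological_order(diamond))
```

A graph such as `{"A": ["B"], "B": ["A"]}` has no node without parents, so the function raises `ValueError` instead of returning an order.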

An Example DAG
› Jobs whose output needs to be summarized or post-processed once they complete:
[Figure: example DAG with nodes Job A, Job B, Job C, and Job D]

Another Example DAG
› Jobs that need data to be generated or pre-processed before they can use it:
[Figure: example DAG with nodes Job A, Job B, Job C, and Job D]

Defining a DAG
› A DAG is defined by a .dag file, listing all its nodes and any dependencies:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D

[Figure: diamond DAG in which Job A is the parent of Jobs B and C, and Jobs B and C are the parents of Job D]

Defining a DAG (cont’d)
› Each node in the DAG will run a Condor job, specified by a Condor submit file:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D

[Figure: diamond DAG in which Job A is the parent of Jobs B and C, and Jobs B and C are the parents of Job D]
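To make the file format concrete, here is a minimal sketch of reading the two kinds of lines shown in diamond.dag. It is an illustration only, not DAGMan's parser: `parse_dag` is a hypothetical helper, and it handles just the `Job` and `Parent ... Child ...` lines from the slide (matched case-insensitively) plus `#` comments.

```python
def parse_dag(text):
    """Parse 'Job' and 'Parent ... Child ...' lines from .dag file text.

    Returns (jobs, children): jobs maps node name -> submit file,
    children maps node name -> list of child node names.
    """
    jobs = {}
    children = {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].upper()
        if keyword == "JOB":
            name, submit_file = words[1], words[2]
            jobs[name] = submit_file
            children.setdefault(name, [])
        elif keyword == "PARENT":
            # Everything before CHILD is a parent, everything after a child.
            split = [w.upper() for w in words].index("CHILD")
            for parent in words[1:split]:
                for child in words[split + 1:]:
                    children.setdefault(parent, []).append(child)
    return jobs, children


DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

jobs, children = parse_dag(DIAMOND)
print(jobs)
print(children)
```

Note that `Parent B C Child D` expands to two edges, B to D and C to D, which is what gives the example its diamond shape.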

Submitting a DAG
› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon & begin running your jobs:

    % condor_submit_dag diamond.dag

› The DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.

Running a DAG
› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[Figure: DAGMan, the .dag file, and the Condor job queue; Job A is in the queue]

Running a DAG (cont’d)
› DAGMan holds & submits jobs to the Condor queue at the appropriate times.
[Figure: DAGMan and the Condor job queue; Jobs B and C are in the queue]
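The "appropriate times" rule can be sketched as follows: a job becomes eligible for submission once all of its parents have completed. This is a simplified model and not DAGMan's implementation. It groups jobs into waves and assumes every job succeeds, whereas a real meta-scheduler would react to each job completion individually. `submission_waves` is a hypothetical name.

```python
def submission_waves(parents):
    """Group jobs into waves; a job is submitted once all its parents are done.

    `parents` maps each job name to the list of its parent job names.
    """
    done = set()
    waves = []
    while len(done) < len(parents):
        ready = sorted(
            job for job, ps in parents.items()
            if job not in done and all(p in done for p in ps)
        )
        if not ready:
            raise ValueError("no job is ready; the graph has a loop")
        waves.append(ready)
        done.update(ready)  # simplification: every job in the wave succeeds
    return waves


# The diamond DAG, written as job -> parents.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(submission_waves(parents))
```

For the diamond, A is submitted alone, B and C are submitted together once A is done, and D waits for both of them.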

Running a DAG (cont’d)
› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Figure: DAGMan, the Condor job queue, and the rescue file; one job is marked as failed (X)]

Recovering a DAG
› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Figure: DAGMan, the rescue file, and the Condor job queue; Job C is in the queue]

Recovering a DAG (cont’d)
› Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Figure: DAGMan and the Condor job queue; Job D is in the queue]
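The failure and recovery behavior described on the last few slides can be sketched as a small simulation. It is a model under stated assumptions, not DAGMan's code: the rescue state is represented simply as the list of completed jobs (an assumption about its contents, not its real file format), and `run_dag` and the `run_job` callback are hypothetical names.

```python
def run_dag(parents, run_job, already_done=()):
    """Run every job whose parents are done until no more progress is possible.

    `run_job(name)` returns True on success. Returns (done, failed).
    """
    done = set(already_done)
    failed = set()
    progress = True
    while progress:
        progress = False
        for job in sorted(parents):
            if job in done or job in failed:
                continue
            if all(p in done for p in parents[job]):
                if run_job(job):
                    done.add(job)
                else:
                    failed.add(job)
                progress = True
    return done, failed


parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

# First run: job C fails, so D can never start.
done, failed = run_dag(parents, lambda job: job != "C")
rescue = sorted(done)  # stand-in for the rescue file: completed jobs only

# Second run, after C has been fixed: restore the state, then continue.
rerun = []
done2, failed2 = run_dag(
    parents, lambda job: rerun.append(job) or True, already_done=rescue
)
print(rescue, rerun)
```

Only C and D run the second time; A and B are not repeated because the restored state already lists them as complete.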

Finishing a DAG
› Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Figure: DAGMan and the Condor job queue; no jobs remain in the queue]

Additional Features
› Provides some other handy features for job management…
  - nodes can have PRE & POST scripts
  - job submission can be “throttled”

PRE & POST Scripts
› Each node can have a PRE or POST script, executed as part of the node:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    PARENT A CHILD B C
    PARENT B C CHILD D
    Script PRE B stage-in.sh
    Script POST B stage-out.sh

[Figure: diamond DAG with the PRE and POST scripts attached to Job B]

PRE & POST Scripts (cont’d)
› Useful for staging a job’s data from remote repositories, and/or putting it back afterwards.
› Ex:
  - PRE: Globus FTP the data from afar
  - Run the job
  - POST: Globus FTP the data back
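The PRE, job, POST sequence in the example above can be sketched as a single node that runs its steps in order. This is a deliberately simplified model: it stops at the first failing step and treats the node as failed, which is an assumption and may not match DAGMan's actual rules for when a POST script runs. `run_node` and the step callables are hypothetical.

```python
def run_node(pre, job, post):
    """Run PRE, then the job, then POST, stopping at the first failure.

    Each step is a zero-argument callable returning True on success,
    or None if the node has no such step. Returns (succeeded, steps_run).
    """
    steps_run = []
    for name, step in (("PRE", pre), ("JOB", job), ("POST", post)):
        if step is None:
            continue  # this node has no such step
        steps_run.append(name)
        if not step():
            return False, steps_run  # simplified: first failure fails the node
    return True, steps_run


# Node B from the slide: stage data in, run the job, stage data out.
ok, steps = run_node(lambda: True, lambda: True, lambda: True)
print(ok, steps)

# If staging in fails, the job never starts.
print(run_node(lambda: False, lambda: True, lambda: True))
```

In this model the whole node, including its scripts, counts as one unit: children of B would only start after the POST step has succeeded.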

Submit Throttling
› DAGMan can limit the maximum number of jobs it will submit to Condor at once:

    condor_submit_dag -maxjobs N

› Useful for managing resource limitations (e.g., storage).
  - Ex: 1000 jobs, each of which requires 1 GB of disk space, and you have 100 GB of disk.
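The arithmetic behind the disk example can be worked through in a few lines. The sketch below assumes that disk space is the only limiting resource and that each job holds its full 1 GB for as long as it is in the queue. The helper name and the use of diamond.dag as the file name are illustrative only.

```python
def max_jobs_for_disk(disk_gb, per_job_gb):
    """Largest number of jobs whose combined disk use fits on the disk."""
    return disk_gb // per_job_gb


limit = max_jobs_for_disk(disk_gb=100, per_job_gb=1)

# Build the throttled submit command for the slide's example.
command = f"condor_submit_dag -maxjobs {limit} diamond.dag"
print(command)
```

With 100 GB of disk and 1 GB per job, at most 100 of the 1000 jobs should be in the queue at once, so N is 100 in this example.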

Summary
› DAGMan:
  - manages dependencies, holding & running jobs only at the appropriate times
  - monitors job progress
  - is fault-tolerant
  - is recoverable in case of job failure
  - provides some additional features to Condor
  - currently DAGMan itself can only run on Unix, but its jobs can run anywhere

Future Work
› More sophisticated management of remote data transfer & staging to maximize CPU throughput.
  - Keep the pipeline full! I.e., intelligently manage disk & network to always have remote data ready where a CPU becomes available.
  - Possible integration with Kangaroo, etc.
› Better integration with Condor tools
  - condor_q, etc. displaying DAG information

Conclusion
› Interested in seeing more?
  - Come to the DAGMan demo: Wednesday, 9am - noon, Room 3393, Computer Sciences (1210 W. Dayton St.)
  - me:
  - Try it: