Peter Couvares Computer Sciences Department University of Wisconsin-Madison Condor DAGMan: Introduction &


1 Condor DAGMan: Introduction & Update
Peter Couvares, Computer Sciences Department, University of Wisconsin-Madison
pfc@cs.wisc.edu
http://www.cs.wisc.edu/condor

2 DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
› (e.g., “Don’t run job ‘B’ until job ‘A’ has completed successfully.”)

3 Why is This Important?
› Most real science involves complex sequences of tasks – on many resources at many sites.
  › E.g., move data, compute, check, move back, etc.
› … and many types of jobs working together
  › Condor, Grid (Condor-G), MPI, shell scripts, etc.
› Failures are a certainty, so recoverability of the sequence – not just the jobs – is crucial.

4 What is a DAG?
› A DAG is the data structure used by DAGMan to represent these dependencies.
› Each job is a “node” in the DAG.
› Each node can have any number of “parent” or “child” nodes – as long as there are no loops!
[Figure: DAG with nodes Job A, Job B, Job C, Job D]
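As an illustration of the "no loops" rule, here is a minimal Python sketch, not DAGMan's actual implementation. It assumes a hypothetical representation in which each node name maps to a list of its children, and checks for loops with Kahn's algorithm. The names `diamond` and `is_acyclic` are made up for this example.

```python
from collections import deque

# Hypothetical representation: each node maps to the list of its children.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def is_acyclic(children):
    """Return True if the graph has no loops (Kahn's algorithm)."""
    # Count how many parents each node has.
    indegree = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            indegree[kid] += 1
    # Start from nodes with no parents and peel the graph layer by layer.
    ready = deque(n for n, d in indegree.items() if d == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for kid in children[node]:
            indegree[kid] -= 1
            if indegree[kid] == 0:
                ready.append(kid)
    # Every node is reached only when there is no cycle.
    return visited == len(children)
```

Calling `is_acyclic(diamond)` accepts the four-node example, while a graph in which A and B are each other's children is rejected.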

5 Defining a DAG
› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
› Each node will run the Condor or Grid job specified by its accompanying Condor submit file.
[Figure: diamond DAG with Job A at the top, Job B and Job C in the middle, Job D at the bottom]
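To make the file format concrete, the following Python sketch reads the diamond.dag text from the slide into a job table and a list of parent/child edges. It is only an illustration: it handles just the `Job` and `Parent … Child` lines shown here, and the function name `parse_dag` is hypothetical, not part of Condor.

```python
DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

def parse_dag(text):
    """Return (jobs, edges): node -> submit file, and (parent, child) pairs."""
    jobs, edges = {}, []
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].lower()
        if keyword == "job":
            jobs[words[1]] = words[2]
        elif keyword == "parent":
            # Everything between "Parent" and "Child" is a parent node.
            split = [w.lower() for w in words].index("child")
            parents, children = words[1:split], words[split + 1:]
            edges += [(p, c) for p in parents for c in children]
    return jobs, edges

jobs, edges = parse_dag(DIAMOND_DAG)
```

Each `Parent … Child` line expands to one edge per parent/child pair, so the two dependency lines above yield four edges.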

6 Submitting a DAG
› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
    % condor_submit_dag diamond.dag
› condor_submit_dag submits a Scheduler Universe job to run DAGMan under Condor… so DAGMan itself will be robust in case of failure, machine reboots, etc.

7 Running a DAG
› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[Figure: DAGMan, .dag file, Condor job queue; nodes A, B, C, D]

8 Running a DAG (cont’d)
› DAGMan holds & submits jobs to the Condor queue at the appropriate times.
[Figure: DAGMan, Condor job queue; nodes A, B, C, D]
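The "hold until the parents are done, then submit" behaviour can be sketched in a few lines of Python. This is a simplified simulation under stated assumptions: every job succeeds, jobs are grouped into waves rather than submitted individually as they become ready, and the names `run_in_waves` and `diamond_parents` are hypothetical.

```python
def run_in_waves(parents):
    """Simulate DAG execution; return the waves of nodes submitted together."""
    done, waves = set(), []
    while len(done) < len(parents):
        # A node is ready when it has not run and all of its parents are done.
        ready = sorted(n for n, ps in parents.items()
                       if n not in done and all(p in done for p in ps))
        if not ready:
            raise ValueError("no runnable nodes: the graph has a loop")
        waves.append(ready)
        done.update(ready)  # assume every job in the wave succeeds
    return waves

# Each node maps to the list of its parents (the diamond example).
diamond_parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
waves = run_in_waves(diamond_parents)
# waves == [['A'], ['B', 'C'], ['D']]
```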

9 Running a DAG (cont’d)
› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Figure: DAGMan, Condor job queue, rescue file; nodes A, B, D, and one node marked X]

10 Recovering a DAG
› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Figure: DAGMan, Condor job queue, rescue file; nodes A, B, C, D]
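The failure and recovery behaviour described on the last two slides can be illustrated with a small Python simulation. It is a sketch under assumptions, not DAGMan's code: `succeeds` is a stand-in for actually running a job, and the "rescue state" here is simply the list of completed nodes, which may differ from what a real rescue file contains.

```python
def run_dag(parents, succeeds, done=()):
    """Run until no further progress is possible; return (done, failed) sets."""
    done, failed = set(done), set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if all(p in done for p in parents[node]):
                # succeeds(node) stands in for actually running the job.
                (done if succeeds(node) else failed).add(node)
                progress = True
    return done, failed

parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

# First run: C fails, so D never becomes runnable, but B still runs.
done, failed = run_dag(parents, succeeds=lambda n: n != "C")
rescue_state = sorted(done)  # hypothetical stand-in for the rescue file

# Second run: C has been fixed; nodes already done are not re-run.
rerun = []
def fixed(node):
    rerun.append(node)
    return True

final_done, final_failed = run_dag(parents, fixed, done=rescue_state)
```

In the second run only C and D are executed, because A and B are restored as already complete from the saved state.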

11 Finishing a DAG
› Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Figure: DAGMan, Condor job queue; nodes A, B, C, D]

12 Additional DAGMan Features
› Provides other knobs handy for job management…
  › nodes can have PRE & POST scripts
  › job submission can be “throttled”
  › NEW: failed nodes can be automatically retried a configurable number of times

13 PRE & POST Scripts
› Execute locally on the submit host before or after job submission…
› Example:
    # diamond.dag
    PRE A prepare-A.sh
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    POST D double-check.sh
    Parent A Child B C
    Parent B C Child D
› PRE/POST scripts are part of the node
[Figure: diamond DAG of Job A, Job B, Job C, Job D, with PRE attached to Job A and POST attached to Job D]

14 DAG “Throttling”
› You can tell DAGMan to limit the maximum number of jobs it submits at any one time
  › condor_submit_dag -maxjobs N
  › useful for managing resource limitations (e.g., licenses)
› You can also limit the number of simultaneous PRE or POST scripts.
  › Added after Vladimir Litvin’s 7000-node DAG started 7000 PRE scripts on his machine!
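The idea behind throttling can be shown with a short Python sketch. It is a simplification: work is released in fixed batches of at most `maxjobs` items, whereas a real scheduler would more likely start a new item as soon as a running one finishes. The function name `throttle` and the script names are hypothetical.

```python
from collections import deque

def throttle(ready_nodes, maxjobs):
    """Return successive batches of submissions, never more than maxjobs at once."""
    if maxjobs < 1:
        raise ValueError("maxjobs must be at least 1")
    waiting = deque(ready_nodes)
    batches = []
    while waiting:
        # Take at most maxjobs items off the front of the waiting list.
        size = min(maxjobs, len(waiting))
        batches.append([waiting.popleft() for _ in range(size)])
    return batches

# Seven hypothetical PRE scripts, with at most three allowed to run at a time.
pre_scripts = [f"pre-{i}" for i in range(7)]
batches = throttle(pre_scripts, maxjobs=3)
```

With a limit of three, the seven scripts are released in batches of three, three, and one instead of all at once.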

15 Node RETRY
› Tells DAGMan to re-run a node multiple times if necessary…
› Example:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    RETRY B 5
    Job C c.sub
    RETRY C 5
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
[Figure: diamond DAG of Job A, Job B, Job C, Job D]
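A retry policy like `RETRY B 5` can be sketched in Python as a simple loop. This sketch assumes that the number counts re-runs after the first attempt (so up to six attempts in total) and that exit code 0 means success. `run_with_retry` and the flaky job are hypothetical.

```python
def run_with_retry(run_job, retries):
    """Run a node; on failure, re-run it up to `retries` more times.

    Returns (succeeded, attempts).
    """
    attempts = 0
    while attempts <= retries:
        attempts += 1
        if run_job() == 0:  # exit code 0 means success
            return True, attempts
    return False, attempts

# Hypothetical flaky job: fails twice, then succeeds.
exit_codes = iter([1, 1, 0])
result = run_with_retry(lambda: next(exit_codes), retries=5)
# result == (True, 3)
```

A job that never succeeds is attempted `retries + 1` times before the node is reported as failed.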

16 DAGMan Progress
› Testing… lots of testing.
  › 10,000+ node DAGs run smoothly
  › Developed automated DAG testing tools to generate random DAGs and test for correct execution (Ning Lin & Will McDonald)
  › Lots of bugs fixed

17 DAGMan Progress (cont’d)
› New features
  › Improved logging (timestamps, etc.)
  › More efficient recovery
  › Node RETRY capability
  › DAG info in condor_q (with -dag flag)
  › Robust in more failure cases
  › Recursive DAGs for conditional execution
› DAGMan for Windows (Ray Pingree)

18 DAGMan Success
› DAGMan is becoming part of the common framework for running on the grid.
  › Particle Physics Data Grid (PPDG)
  › Grid Physics Network (GriPhyN)
  › Many Super Computing 2001 demos
  › more…

19 DAGMan in the GriPhyN Architecture
[Figure: GriPhyN architecture diagram. Components: Application, Planner, Executor, Catalog Services, Info Services, Policy/Security, Monitoring, Repl. Mgmt., Reliable Transfer Service, Compute Resource, Storage Resource. Labels: DAG; DAGMAN, Kangaroo; GRAM; GridFTP; GRAM; SRM; GSI, CAS; MDS; MCAT; GriPhyN catalogs; GDMP; MDS; Globus.]
Diagram by Ian Foster (Argonne)

20 DAGMan in PPDG Tools diagram by Jim Amundson (Fermilab)

21 What’s Next?
› More flexible control of node execution
  › Currently implicit: “all my parents returned 0”.
  › Why not, “all parents returned 0 AND ran for more than two hours” or “parent A returned 0 and parent B returned 42”?
› 1st step: represent DAG nodes internally as ClassAds
  › Allows DAGMan to decide when to run nodes based on arbitrary requirements
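The difference between the implicit rule and the more flexible rules proposed on this slide can be sketched as plain Python predicates over the parents' results. This only illustrates the idea: it does not use ClassAd syntax, and the record fields `exit_code` and `runtime_hours` are assumptions invented for the example.

```python
def default_rule(results):
    """The implicit rule: every parent returned 0."""
    return all(r["exit_code"] == 0 for r in results.values())

def long_running_rule(results):
    """All parents returned 0 AND ran for more than two hours."""
    return all(r["exit_code"] == 0 and r["runtime_hours"] > 2
               for r in results.values())

def custom_rule(results):
    """Parent A returned 0 and parent B returned 42."""
    return results["A"]["exit_code"] == 0 and results["B"]["exit_code"] == 42

# Hypothetical results reported by a node's two parents.
parent_results = {
    "A": {"exit_code": 0, "runtime_hours": 3.0},
    "B": {"exit_code": 42, "runtime_hours": 0.5},
}
```

With these results the implicit rule would block the child, because B did not return 0, while `custom_rule` would let it run.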

22 What’s Next? (cont’d)
› Extend DAGMan to utilize DaP Scheduler (DaP?) to intelligently schedule data transfers along with Condor and Condor-G jobs.
[Figure: DAGMan, Condor-G, Condor, DaP Scheduler]

23 Thank You!
› Interested in seeing more?
  › Come to the DAGMan BoF: Wednesday, 9am - noon, Room 3393, Computer Sciences (1210 W. Dayton St.)
  › Email us: condor-admin@cs.wisc.edu
  › Try it! http://www.cs.wisc.edu/condor

