Peter Couvares Computer Sciences Department University of Wisconsin-Madison Condor DAGMan: Introduction &

Presentation transcript:

Peter Couvares Computer Sciences Department University of Wisconsin-Madison Condor DAGMan: Introduction & Update

2 DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
› (e.g., “Don’t run job B until job A has completed successfully.”)

3 Why is This Important?
› Most real science involves complex sequences of tasks – on many resources at many sites.
  - E.g., move data, compute, check, move back, etc.
› … and many types of jobs working together
  - Condor, Grid (Condor-G), MPI, shell scripts, etc.
› Failures are a certainty, so recoverability of the sequence – not just the jobs – is crucial.

4 What is a DAG?
› A DAG is the data structure used by DAGMan to represent these dependencies.
› Each job is a “node” in the DAG.
› Each node can have any number of “parent” or “children” nodes – as long as there are no loops!
[Diagram: diamond-shaped DAG with nodes Job A, Job B, Job C, and Job D]
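As an illustration of the "no loops" rule, here is a minimal Python sketch that stores the diamond DAG as a dictionary and orders its nodes so that parents always come before children. It is not DAGMan's implementation; the names `DIAMOND` and `topological_order` are hypothetical and exist only for this example.

```python
from collections import deque

# Diamond DAG from the slides: each key is a node, each value lists its children.
DIAMOND = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}


def topological_order(children):
    """Return the nodes in an order that respects every parent/child edge.

    Raises ValueError if the graph contains a loop (so it is not a DAG).
    """
    # Count how many parents each node is still waiting on.
    pending_parents = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            pending_parents[kid] += 1

    # Nodes with no parents can run first.
    ready = deque(sorted(n for n, count in pending_parents.items() if count == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for kid in children[node]:
            pending_parents[kid] -= 1
            if pending_parents[kid] == 0:
                ready.append(kid)

    # Any node left unordered is stuck behind a loop.
    if len(order) != len(children):
        raise ValueError("graph contains a loop, so it is not a DAG")
    return order
```

Calling `topological_order(DIAMOND)` places A first and D last, while a graph such as `{"A": ["B"], "B": ["A"]}` is rejected because neither node can ever become ready.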

5 Defining a DAG
› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
› Each node will run the Condor or Grid job specified by its accompanying Condor submit file.
[Diagram: diamond-shaped DAG with nodes Job A, Job B, Job C, and Job D]
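To make the file layout concrete, the following Python sketch reads the diamond.dag text shown on the slide and extracts the nodes and their parents. It is a toy reader written for this transcript, not DAGMan's parser: it understands only the `Job` and `Parent ... Child ...` lines used here, and the function name `parse_dag` is hypothetical.

```python
DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Return (jobs, parents) for the Job and Parent/Child lines in `text`."""
    jobs = {}     # node name -> submit file
    parents = {}  # node name -> set of parent node names
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]
            parents.setdefault(words[1], set())
        elif keyword == "PARENT":
            # Everything before "Child" is a parent, everything after is a child.
            split_at = [w.upper() for w in words].index("CHILD")
            for child in words[split_at + 1:]:
                parents.setdefault(child, set()).update(words[1:split_at])
    return jobs, parents
```

For the diamond file, node D ends up with parents B and C, and node A has none.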

6 Submitting a DAG
› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
    % condor_submit_dag diamond.dag
› condor_submit_dag submits a Scheduler Universe job to run DAGMan under Condor, so DAGMan itself will be robust in case of failure, machine reboots, etc.

7 DAGMan Running a DAG
› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[Diagram: DAGMan, the .dag file, and the Condor job queue; node A is in the queue]

8 DAGMan Running a DAG (cont’d)
› DAGMan holds & submits jobs to the Condor queue at the appropriate times.
[Diagram: DAGMan and the Condor job queue; nodes B and C are in the queue]
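The "appropriate time" for a node can be read as "all of its parents have completed successfully". A minimal Python sketch of that rule for the diamond DAG follows; `PARENTS` and `ready_nodes` are hypothetical names, and this is only an illustration of the idea, not DAGMan's scheduling code.

```python
# Parents of each node in the diamond DAG.
PARENTS = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}


def ready_nodes(parents, done, submitted):
    """Return nodes whose parents have all succeeded and that are not yet queued."""
    return sorted(
        node
        for node, needs in parents.items()
        if node not in done and node not in submitted and needs <= done
    )
```

At the start only A is ready; once A is done, B and C become ready together; D waits until both B and C are done.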

9 DAGMan Running a DAG (cont’d)
› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Diagram: DAGMan, the Condor job queue, and the rescue file; one node is marked as failed (X)]

10 DAGMan Recovering a DAG
› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Diagram: DAGMan, the rescue file, and the Condor job queue; node C is in the queue]
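The failure and recovery behaviour described on the last two slides can be sketched in Python as a small simulation. This is an assumption-laden toy, not DAGMan's code or its rescue file format: `run_dag`, `will_fail`, and `already_done` are hypothetical names, and the set of finished nodes simply stands in for the rescue file.

```python
# Parents of each node in the diamond DAG.
PARENTS = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}


def run_dag(parents, will_fail, already_done=()):
    """Run nodes in dependency order; return (done, failed, executed_order).

    `will_fail` is a hypothetical set of nodes whose jobs fail in this run.
    `already_done` plays the role of the rescue file: nodes listed there are
    treated as finished and are not run again.
    """
    done, failed, executed = set(already_done), set(), []
    progress = True
    while progress:  # keep going until no further node can be started
        progress = False
        for node in sorted(parents):
            if node in done or node in failed or not parents[node] <= done:
                continue
            executed.append(node)
            (failed if node in will_fail else done).add(node)
            progress = True
    return done, failed, executed
```

If C fails on the first run, A and B still complete and D never starts. Passing the finished set back in as `already_done` on a second run executes only C and D.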

11 DAGMan Finishing a DAG
› Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Diagram: DAGMan with nodes A, B, C, and D, and the Condor job queue]

12 Additional DAGMan Features
› Provides other knobs handy for job management…
  - nodes can have PRE & POST scripts
  - job submission can be “throttled”
  - NEW: failed nodes can be automatically re-tried a configurable number of times

13 PRE & POST Scripts
› Execute locally on the submit host before or after job submission…
› Example:
    # diamond.dag
    PRE A prepare-A.sh
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    POST D double-check.sh
    Parent A Child B C
    Parent B C Child D
› PRE/POST scripts are part of the node.
[Diagram: diamond-shaped DAG in which a PRE script is attached to Job A and a POST script is attached to Job D]
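One way to picture "scripts are part of the node" is as a three-step sequence whose combined outcome is the node's outcome. The Python sketch below takes the simplest reading, in which the node stops at the first step that returns non-zero. That stopping rule is an assumption of this sketch; DAGMan's actual rules about which steps run after a failure are more detailed, so consult the DAGMan documentation for the version you use. The name `run_node` is hypothetical.

```python
def run_node(job, pre=None, post=None):
    """Run PRE, then the job, then POST; return (succeeded, steps_run).

    Each step is a callable returning an exit code (0 means success).
    Assumption of this sketch: the node stops at the first failing step.
    """
    steps_run = []
    for name, step in (("PRE", pre), ("JOB", job), ("POST", post)):
        if step is None:
            continue  # this node has no such script
        steps_run.append(name)
        if step() != 0:
            return False, steps_run
    return True, steps_run
```

A node with both scripts runs all three steps when everything succeeds, while a failing PRE script means the job is never submitted.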

14 DAG “Throttling”
› You can tell DAGMan to limit the maximum number of jobs it submits at any one time
  - condor_submit_dag -maxjobs N
  - useful for managing resource limitations (e.g., licenses)
› You can also limit the number of simultaneous PRE or POST scripts.
  - Added after Vladimir Litvin’s 7000-node DAG started 7000 PRE scripts on his machine!
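The effect of a limit such as `-maxjobs N` can be sketched as a cap on how many ready nodes are handed to the queue at once. The Python function below is a hypothetical illustration of that idea (the name `pick_submissions` and its arguments are not part of DAGMan).

```python
def pick_submissions(ready, running_count, max_jobs):
    """Return the ready nodes that may be submitted without exceeding max_jobs."""
    free_slots = max(0, max_jobs - running_count)
    return ready[:free_slots]
```

With a limit of 1, only one of B and C is submitted after A finishes; the other waits until a slot frees up.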

15 Node RETRY
› Tells DAGMan to re-run a node multiple times if necessary…
› Example:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    RETRY B 5
    Job C c.sub
    RETRY C 5
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
[Diagram: diamond-shaped DAG with nodes Job A, Job B, Job C, and Job D]
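The retry idea can be sketched in Python as a bounded loop. This sketch assumes that `RETRY B 5` means "up to 5 re-runs after the first attempt", so at most 6 attempts in total; treat that reading, and the hypothetical name `run_with_retry`, as assumptions rather than a statement of DAGMan's internals.

```python
def run_with_retry(attempt, retries):
    """Call attempt() until it returns 0 or the retries are used up.

    Returns (succeeded, attempts_made). One initial attempt is always made,
    followed by at most `retries` re-runs.
    """
    for attempts_made in range(1, retries + 2):
        if attempt() == 0:
            return True, attempts_made
    return False, retries + 1
```

A job that fails twice and then succeeds finishes on its third attempt; a job that always fails is given up on after the initial attempt plus all retries.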

16 DAGMan Progress
› Testing… lots of testing.
  - 10,000+ node DAGs run smoothly
  - Developed automated DAG testing tools to generate random DAGs and test for correct execution (Ning Lin & Will McDonald)
  - Lots of bugs fixed

17 DAGMan Progress (cont’d)
› New features
  - Improved logging (timestamps, etc.)
  - More efficient recovery
  - Node RETRY capability
  - DAG info in condor_q (with -dag flag)
  - Robust in more failure cases
  - Recursive DAGs for conditional execution
› DAGMan for Windows (Ray Pingree)

18 DAGMan Success
› DAGMan is becoming part of the common framework for running on the grid.
  - Particle Physics Data Grid (PPDG)
  - Grid Physics Network (GriPhyN)
  - Many Supercomputing 2001 demos
  - more…

19 DAGMan in the GriPhyN Architecture
[Diagram by Ian Foster (Argonne). Components: Application, Planner, Executor, Catalog Services, Info Services, Policy/Security, Monitoring, Repl. Mgmt., Reliable Transfer Service, Compute Resource, Storage Resource. Labels: DAG; DAGMAN, Kangaroo; GRAM; GridFTP, GRAM, SRM; GSI, CAS; MDS; MCAT, GriPhyN catalogs; GDMP; MDS; Globus]

20 DAGMan in PPDG Tools
[Diagram by Jim Amundson (Fermilab)]

21 What’s Next?
› More flexible control of node execution
  - Currently implicit: “all my parents returned 0”.
  - Why not, “all parents returned 0 AND ran for more than two hours” or “parent A returned 0 and parent B returned 42”?
› 1st step: represent DAG nodes internally as ClassAds
  - Allows DAGMan to decide when to run nodes based on arbitrary requirements
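The three example conditions quoted on this slide can be written as plain predicates over the parents' results. The Python sketch below is purely illustrative: it does not use ClassAds and does not describe any DAGMan feature, and every name in it (`EXAMPLE_RESULTS`, `all_parents_ok`, and so on) is hypothetical.

```python
TWO_HOURS = 2 * 60 * 60  # seconds

# Hypothetical record of what each parent did: exit code and runtime in seconds.
EXAMPLE_RESULTS = {
    "A": {"exit_code": 0, "runtime": 3 * 60 * 60},
    "B": {"exit_code": 42, "runtime": 10},
}


def all_parents_ok(results):
    """The implicit rule: every parent returned 0."""
    return all(r["exit_code"] == 0 for r in results.values())


def all_ok_and_long_running(results):
    """Every parent returned 0 AND ran for more than two hours."""
    return all(
        r["exit_code"] == 0 and r["runtime"] > TWO_HOURS for r in results.values()
    )


def a_ok_and_b_returned_42(results):
    """Parent A returned 0 and parent B returned 42."""
    return results["A"]["exit_code"] == 0 and results["B"]["exit_code"] == 42
```

With the example results, the implicit rule would block the child node because B returned 42, whereas the third rule would let it run.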

22 What’s Next? (cont’d)
› Extend DAGMan to utilize DaP Scheduler (DaP?) to intelligently schedule data transfers along with Condor and Condor-G jobs.
[Diagram: DAGMan connected to Condor-G, Condor, and the DaP Scheduler]

23 Thank You!
› Interested in seeing more?
  - Come to the DAGMan BoF, Wednesday 9am - noon, Room 3393, Computer Sciences (1210 W. Dayton St.)
  - us:
  - Try it!