Part 8: DAGMan A: Grid Workflow Management B: DAGMan C: Laboratory: DAGMan.


Part 8: DAGMan

A: Grid Workflow Management B: DAGMan C: Laboratory: DAGMan

A: Grid Workflow Management

Job Dependencies
In many applications, some jobs depend on other jobs. For example, job A must finish before job B starts, often because job B uses the output of job A. We call a set of interdependent jobs a workflow. Condor-G can run jobs in any order, so we need a workflow manager.

Two Motivating Examples
The Sloan Digital Sky Survey and the Montage Project.

Sloan Digital Sky Survey
Map one-quarter of the entire sky. Determine the positions and absolute brightness of more than 100 million celestial objects. Measure the distance to a million of the nearest galaxies, and to 100,000 quasars.

Workflow to Find Galaxy Clusters
[Figure: workflow diagram. Data labels: tsObj, field, brg, core, cluster, catalog. Steps: fieldPrep, maxBrg, maxBcg, bcgCoal, getCatalog.]

Workflow to Find Galaxy Clusters (cont’d)
[Figure: workflow steps maxBrg, maxBcg, bcgCoal, getCatalog.]

Montage
Creates a large mosaic image from many smaller images. Used for astronomy data. Corrects optical distortions and intensity differences.

Montage Workflow
[Figure: workflow graph. Legend: data stage-in nodes, Montage compute nodes, data stage-out nodes, inter-pool transfer nodes.]

Montage Workflow
[Figure: a Montage workflow with 1202 nodes.]

B: DAGMan

DAGMan
Directed Acyclic Graph Manager: the workflow manager for Condor-G. DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. By default, Condor may run your jobs in any order, or all at once, so we need DAGMan to enforce an ordering when necessary (e.g., “Don’t run job B until job A has completed successfully”).

What is a DAG?
A DAG is the data structure used by DAGMan to represent these dependencies. Each job is a “node” in the DAG. Each node can have any number of “parent” or “child” nodes, as long as there are no loops!
[Figure: diamond-shaped DAG. Job A is the parent of Job B and Job C, which are both parents of Job D.]
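The "no loops" rule can be checked mechanically. The following is a minimal Python sketch, not part of DAGMan itself: it stores the diamond DAG as a mapping from each node to its parents and uses the standard-library graphlib module (Python 3.9+) to detect cycles. The names diamond, is_acyclic, and one_valid_order are illustrative only.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a node; its value is the set of parents that must finish first.
# This mirrors the diamond DAG: A -> (B, C) -> D.
diamond = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def is_acyclic(parents_of):
    """Return True if the dependency graph contains no loops."""
    try:
        # prepare() raises CycleError when the graph has a cycle.
        TopologicalSorter(parents_of).prepare()
        return True
    except CycleError:
        return False


def one_valid_order(parents_of):
    """Return one ordering in which every parent precedes its children."""
    return list(TopologicalSorter(parents_of).static_order())


if __name__ == "__main__":
    print(is_acyclic(diamond))
    print(one_valid_order(diamond))
```

Any ordering returned for the diamond starts with A and ends with D; B and C may appear in either order, which is exactly the freedom a DAG leaves to the scheduler.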

Defining a DAG
A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
Each node will run the Condor job specified by its accompanying Condor submit file.
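To make the file format concrete, here is a small Python sketch that reads only the two kinds of lines used in diamond.dag (Job and Parent ... Child ...). It is an illustration, not DAGMan's own parser: the function name parse_dag is hypothetical, other keywords are silently ignored, and keywords are matched case-insensitively on the assumption that "Job" and "JOB" are equivalent.

```python
DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Return (submit_files, parents_of) for Job and Parent/Child lines."""
    submit_files = {}   # node name -> submit file
    parents_of = {}     # node name -> set of parent node names
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        words = line.split()
        keyword = words[0].upper()
        if keyword == "JOB":
            name, submit_file = words[1], words[2]
            submit_files[name] = submit_file
            parents_of.setdefault(name, set())
        elif keyword == "PARENT":
            # Everything between PARENT and CHILD is a parent;
            # everything after CHILD is a child.
            split = [w.upper() for w in words].index("CHILD")
            parents = words[1:split]
            children = words[split + 1:]
            for child in children:
                parents_of.setdefault(child, set()).update(parents)
    return submit_files, parents_of


if __name__ == "__main__":
    files, parents = parse_dag(DIAMOND_DAG)
    print(files)
    print(parents)
```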

Submitting a DAG
To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
    % condor_submit_dag diamond.dag
condor_submit_dag submits a job with DAGMan as the executable. This job runs on the submitting machine, not on any other computer. Because the DAGMan daemon itself runs as a Condor job, you don’t have to baby-sit it.

Running a DAG
DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[Figure: DAGMan, the .dag file, and the Condor job queue; node A is in the queue.]

Running a DAG (cont’d)
DAGMan holds and submits jobs to the Condor queue at the appropriate times.
[Figure: DAGMan and the Condor job queue; node A has finished, and nodes B and C are in the queue.]
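The "hold, then submit when ready" behaviour can be sketched in a few lines of Python. This is a simulation under simplifying assumptions, not DAGMan's implementation: every job is assumed to succeed, and "submitting" a node just means adding it to a list. The function name run_dag is hypothetical; the graph handling uses the standard-library graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Diamond DAG: each node maps to the set of parents it waits for.
diamond = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def run_dag(parents_of):
    """Return the batches of nodes released to the queue, in order."""
    sorter = TopologicalSorter(parents_of)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Nodes whose parents have all completed are ready to submit.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        # Assumption: every job in the batch runs and succeeds,
        # which releases its children in the next round.
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for batch in run_dag(diamond):
        print("submit:", batch)
```

For the diamond DAG this releases A first, then B and C together, then D, matching the sequence shown on the slides.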

Running a DAG (cont’d)
In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Figure: node C has failed (marked X); DAGMan writes a rescue file.]

Recovering a DAG
Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Figure: DAGMan reads the rescue file; node C is back in the Condor job queue.]
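The failure-and-rescue cycle can be illustrated with a toy Python simulation. It assumes, for illustration only, that the rescue state is simply the set of nodes that already completed; the real rescue file format is not modelled here, and run_until_stuck is a hypothetical name.

```python
# Diamond DAG: each node maps to the set of parents it waits for.
diamond = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def run_until_stuck(parents_of, failing, done=()):
    """Run nodes until no further progress is possible.

    failing: nodes that fail when attempted (simulation input).
    done:    nodes already completed, e.g. restored from a rescue state.
    Returns (done, failed, attempted).
    """
    done = set(done)
    failed = set()
    attempted = []
    progress = True
    while progress:
        progress = False
        for node in sorted(parents_of):
            if node in done or node in failed:
                continue
            if parents_of[node] <= done:  # all parents have completed
                attempted.append(node)
                if node in failing:
                    failed.add(node)
                else:
                    done.add(node)
                progress = True
    return done, failed, attempted


if __name__ == "__main__":
    # First run: C fails, so D can never start.
    rescue_state, failed, _ = run_until_stuck(diamond, failing={"C"})
    print("completed before getting stuck:", sorted(rescue_state))
    # Second run: restore the completed nodes and re-run only the rest.
    done, failed, attempted = run_until_stuck(diamond, set(), rescue_state)
    print("re-run attempted:", attempted)
```

The second run skips A and B because they are already recorded as complete, which is the point of keeping a rescue state: only the failed node and its descendants are run again.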

Recovering a DAG (cont’d)
Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Figure: DAGMan and the Condor job queue; node D is in the queue.]

Finishing a DAG
Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Figure: all nodes are complete and the Condor job queue is empty.]

Additional DAGMan Features
DAGMan provides other handy features for job management: nodes can have PRE and POST scripts; failed nodes can be automatically retried a configurable number of times; and job submission can be “throttled”.

Another sample DAGMan submit file
    # Filename: diamond.dag
    Job A A.condor
    Job B B.condor
    Job C C.condor
    Job D D.condor
    Script PRE A top_pre.csh
    Script PRE B mid_pre.perl $JOB
    Script POST B mid_post.perl $JOB $RETURN
    Script PRE C mid_pre.perl $JOB
    Script POST C mid_post.perl $JOB $RETURN
    Script PRE D bot_pre.csh
    PARENT A CHILD B C
    PARENT B C CHILD D
    Retry C 3
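How PRE scripts, POST scripts, and Retry interact for a single node can be sketched as follows. This Python model rests on stated assumptions rather than on DAGMan's source: a retry re-runs the whole node (PRE script, job, POST script); a failed PRE script skips the job for that attempt; and when a POST script exists, its exit status decides whether the node succeeded. "Retry C 3" is read as up to three retries, so up to four attempts. The callables pre, job, and post are hypothetical stand-ins for the real scripts and the Condor job.

```python
def run_node(job, pre=None, post=None, retries=0):
    """Run one node; return (succeeded, attempts_used).

    job:  callable returning an exit status (0 means success).
    pre:  optional callable returning an exit status.
    post: optional callable taking the job's status (like $RETURN)
          and returning the node's final exit status.
    """
    total_attempts = retries + 1
    for attempt in range(1, total_attempts + 1):
        if pre is not None and pre() != 0:
            continue  # assumption: PRE failure fails this attempt
        status = job()
        if post is not None:
            status = post(status)  # assumption: POST decides the outcome
        if status == 0:
            return True, attempt
    return False, total_attempts


if __name__ == "__main__":
    # A job that fails twice and then succeeds, like a flaky node C.
    results = iter([1, 1, 0])
    print(run_node(lambda: next(results), retries=3))
```

In this model a job that fails twice before succeeding completes on its third attempt when given three retries, while the same job with no retries fails the node after one attempt.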

Lab 8: DAGMan

In this lab, you’ll run a simple DAGMan job, run a more complex DAGMan job, and recover a failed DAGMan job.

Credits
NSF disclaimer. Portions of this presentation were adapted from the following sources: Jaime Frey, UW-Madison.