Condor Project Computer Sciences Department University of Wisconsin-Madison Condor and DAGMan Barcelona,

Presentation transcript:

Condor Project Computer Sciences Department University of Wisconsin-Madison Condor and DAGMan Barcelona, 2006

2 Agenda  Extended user's tutorial  Advanced uses of Condor: Java programs, DAGMan, Stork, MW, Grid Computing  Case studies, and a discussion of your application's needs

3 Some jobs have dependencies… Condor can help solve dependency problems

4 Frieda learns DAGMan  Directed Acyclic Graph Manager  DAGMan allows Frieda to specify the dependencies between her Condor jobs, so Condor manages the jobs automatically.  Dependency example: Do not run job B until job A has completed successfully.

5 What is a DAG?  Directed Acyclic Graph  A DAG is the data structure used by DAGMan to represent dependencies (Figure: DAG with nodes A, B, C, D)

6 DAG Definitions  DAGs have one or more nodes (or vertices).  Dependencies are represented by arcs (or edges). These are arrows that go from parent to child.  No cycles! (Figure: DAG with nodes A, B, C, D)
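
The "no cycles" rule can be checked mechanically. The following is a minimal Python sketch, assuming the graph is given as a list of (parent, child) arcs. The function name `is_acyclic` is illustrative only and is not part of Condor or DAGMan.

```python
from collections import defaultdict, deque

def is_acyclic(nodes, arcs):
    """Return True if the directed graph has no cycles (Kahn's algorithm)."""
    children = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for parent, child in arcs:
        children[parent].append(child)
        indegree[child] += 1
    # Start from nodes with no parents and peel the graph layer by layer.
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    # Any node that was never reached sits on, or behind, a cycle.
    return visited == len(nodes)

names = ["A", "B", "C", "D"]
diamond = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
print(is_acyclic(names, diamond))                 # the diamond from the slides
print(is_acyclic(names, diamond + [("D", "A")]))  # an arc D -> A closes a cycle
```

The same pass also yields a valid execution order: the sequence in which nodes leave `ready` never runs a child before its parents.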

7 Condor and DAGs  Each node represents a Condor job  Dependencies define the possible order of job execution (Figure: DAG with nodes labeled Job A, Job B, Job C, Job D)

8 Defining a DAG to Condor A DAG input file defines a DAG:
    # file name: diamond.dag
    Job A a.submit
    Job B b.submit
    Job C c.submit
    Job D d.submit
    Parent A Child B C
    Parent B C Child D
(Figure: the resulting DAG, with A as the parent of B and C, and B and C as the parents of D)
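
To make the structure of the DAG input file above concrete, here is a small Python sketch that reads its Job and Parent/Child lines into a dictionary and a list of arcs. It is a toy reader written for this example, not DAGMan's parser: it handles only these two keywords, and matching keywords case-insensitively is an assumption of the sketch.

```python
def parse_dag(text):
    """Parse Job and Parent/Child lines; return (jobs, arcs)."""
    jobs = {}   # node name -> submit description file
    arcs = []   # (parent, child) pairs
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        words = line.split()
        keyword = words[0].lower()
        if keyword == "job":
            jobs[words[1]] = words[2]
        elif keyword == "parent":
            split = [w.lower() for w in words].index("child")
            parents, kids = words[1:split], words[split + 1:]
            # Every listed parent is a parent of every listed child.
            arcs.extend((p, c) for p in parents for c in kids)
    return jobs, arcs

DIAMOND = """\
# file name: diamond.dag
Job A a.submit
Job B b.submit
Job C c.submit
Job D d.submit
Parent A Child B C
Parent B C Child D
"""

jobs, arcs = parse_dag(DIAMOND)
print(jobs)
print(arcs)
```

Note that a single Parent line with several parents and children expands to one arc per parent/child pair, which is why two lines are enough to describe the four arcs of the diamond.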

9 Submit Description File
For node B:
    # file name: b.submit
    universe = vanilla
    executable = B
    input = B.in
    output = B.out
    error = B.err
    log = B.log
    queue
For node C:
    # file name: c.submit
    universe = standard
    executable = C
    input = C.in
    output = C.out
    error = C.err
    log = C.log
    queue

10 Submitting the DAG to Condor  To submit the entire DAG, run condor_submit_dag diamond.dag  condor_submit_dag creates a submit description file for DAGMan, and DAGMan itself is submitted as a Condor job!

11 A DAGMan requirement  The submit description file for each job must specify a log file  Log files may be separate or shared by different jobs within the DAG  The log files are used to synchronize job submission

12 Nodes  Job execution at a node either succeeds or fails  Based on the return value of the job: 0 means success, non-zero means failure (Figure: DAG with nodes A, B, C, D)

13 Advanced DAGMan Tricks  Retry of a node  Abort the entire DAG  Setting a variable with a VARS entry  Throttles and DAGs  PRE and POST scripts: editing the DAG  Nested DAGs: loops and more

14 Retry  Before a node is marked as failed...  Retry N times. In the DAG input file: Retry C 4 (to rerun node C up to four times before calling the node failed)  Retry N times, unless the node returns a specific exit code. In the DAG input file: Retry C 4 UNLESS-EXIT 2
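
The retry rule can be written out as a short loop. This is a toy model in Python of the behavior described on the slide, not DAGMan's implementation; `run_with_retry` and the fake jobs below are hypothetical names made up for the example.

```python
def run_with_retry(run_job, retries, unless_exit=None):
    """Run a node until it succeeds, retries run out, or it returns unless_exit.

    Returns (succeeded, exit_codes_seen).
    """
    seen = []
    for _attempt in range(1 + retries):   # first run plus up to `retries` reruns
        code = run_job()
        seen.append(code)
        if code == 0:
            return True, seen             # 0 means success
        if unless_exit is not None and code == unless_exit:
            break                         # this exit code means: do not retry
    return False, seen

# Hypothetical job that fails twice with exit code 1, then succeeds.
codes = iter([1, 1, 0])
print(run_with_retry(lambda: next(codes), retries=4))

# Hypothetical job that returns exit code 2 on its first run.
print(run_with_retry(lambda: 2, retries=4, unless_exit=2))
```

With `retries=4`, a node that never succeeds runs five times in total: the original attempt plus four reruns, matching "Retry C 4".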

15 Abort the Entire DAG  If a specific error value should cause the entire DAG to stop  Place in the DAG input file: Abort-DAG-On B 3 (B is the name of the node, 3 is the returned error code)

16 VARS  An entry in the DAG input file intended to reduce the number of unique submit description files needed  defines a variable and value  associated with a node  use the value in a substitution macro

17 Invented Example: A Binary Tree (Figure: binary tree with nodes Root, A, B, C, D, E, F) Assume that a single executable processes each node. But, handling is different based on a node's position as a left or right child.

18 The DAG Input File
    # tree example, file is tree.dag
    Job root node.submit
    Job A node.submit
    Vars A position="left"
    Job B node.submit
    Vars B position="right"
    Job C node.submit
    Vars C position="left"
    ...
    Parent root Child A B
    ...
(Figure: binary tree with nodes Root, A, B, C, D, E, F)

19 The Submit Description File
    # file name is node.submit
    executable = process.exe
    arguments = $(position)
    log = node.log
    queue
The job at node A has the command line: process.exe left
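
The effect of a VARS entry is a textual substitution of `$(position)` in the shared submit description file. The Python sketch below imitates that substitution for illustration only; `expand_macros` is a hypothetical helper, and leaving unknown names untouched is a choice made for this example, not a statement about Condor's behavior.

```python
import re

def expand_macros(line, variables):
    """Replace each $(name) in line with variables[name]; leave unknown names as-is."""
    def substitute(match):
        name = match.group(1)
        return variables.get(name, match.group(0))
    return re.sub(r"\$\((\w+)\)", substitute, line)

# One variable table per node, mirroring the Vars lines in tree.dag.
node_vars = {"A": {"position": "left"}, "B": {"position": "right"}}
for node, variables in node_vars.items():
    print(node, "->", "process.exe", expand_macros("$(position)", variables))
```

One submit description file plus one Vars line per node replaces what would otherwise be a separate, nearly identical submit file for every node.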

20 Throttles  Throttles to control number of job submissions at one time  Maximum number of jobs submitted: % condor_submit_dag -maxjobs 40 bigdag.dag  Maximum number of jobs idle: % condor_submit_dag -maxidle 10 bigdag.dag

21 Throttling Example  Submit a DAG with 200,000 nodes  No dependencies between jobs  Use DAGMan to throttle the jobs, because Condor is scalable, but will have problems with 200,000 simultaneous job submissions (Figure: independent nodes A1, A2, A3, …)
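
The idea behind a submission throttle can be shown with a small simulation. This Python sketch is a toy model, assuming one job finishes per step and that `maxjobs` is at least 1; it does not reproduce how DAGMan or Condor schedule work, and `simulate_throttle` is a name invented for the example.

```python
from collections import deque

def simulate_throttle(node_names, maxjobs):
    """Submit nodes so that at most maxjobs are in the queue at once.

    Each simulated step, the oldest submitted job finishes. Returns the peak
    number of jobs that were in the queue at the same time.
    """
    waiting = deque(node_names)
    in_queue = deque()
    peak = 0
    while waiting or in_queue:
        # Fill free slots up to the cap.
        while waiting and len(in_queue) < maxjobs:
            in_queue.append(waiting.popleft())
        peak = max(peak, len(in_queue))
        in_queue.popleft()   # one job completes, freeing a slot
    return peak

nodes = [f"A{i}" for i in range(1, 1001)]   # 1,000 nodes to keep the demo quick
print(simulate_throttle(nodes, maxjobs=40))
```

However many independent nodes the DAG contains, the queue never holds more than the cap, which is the point of submitting a very large flat DAG through DAGMan instead of all at once.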

22 DAGMan scripts  DAGMan allows PRE and/or POST scripts  Not necessarily a script: any executable  Run before (PRE) or after (POST) job  Run on the submit machine  In the DAG input file:
    Job A a.submit
    Script PRE A before-script
    Script POST A after-script

23 (Figure: node A within the DAG, made up of before-script, then the Condor job described in a.submit, then after-script)

24 PRE script PRE script can make decisions  Should I pass different arguments to the job?  Should I change a submit description file?  Lazy decision making

25 POST script  POST script is always run, independent of the Condor job's return value  POST script can change return value  DAGMan marks the node failed for a non-zero return value from the POST script  POST script can look at error code or output files and return 0 (success) or non-zero (failure) based on deeper knowledge.

26 Pre-defined variables  In the DAG input file:
    Job A a.submit
    Script PRE A before-script $JOB
    Script POST A after-script $JOB $RETURN
$JOB and $RETURN are (optional) arguments to the script
$JOB becomes the string that defines the node name
$RETURN becomes the return value from the Condor job defined by the node
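
Since a POST script can be any executable, it could be written in Python. The sketch below shows one possible after-script that is called with $JOB and $RETURN and decides the node's final status from the job's return value and its output file. The output file name (`<node>.out`) and the "ERROR" marker are assumptions made for this example, not conventions defined by DAGMan.

```python
import sys
from pathlib import Path

def post_exit_code(job_return, output_text):
    """Decide the node's final status: 0 marks success, non-zero marks failure."""
    if job_return != 0:
        return 1                      # the Condor job itself failed
    if "ERROR" in output_text:        # hypothetical marker in the job's output
        return 2                      # exit code was 0, but the output looks bad
    return 0

def main(argv):
    # Expected call, per the DAG line: after-script $JOB $RETURN
    node_name, job_return = argv[1], int(argv[2])
    out_file = Path(f"{node_name}.out")   # assumed naming, as in B.out / C.out
    text = out_file.read_text() if out_file.exists() else ""
    return post_exit_code(job_return, text)

# Only act as a script when called with exactly two arguments.
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv))
```

Keeping the decision in a separate function (`post_exit_code`) makes the "deeper knowledge" easy to test without running a DAG.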

27 Script Throttles  Throttles to control the number of scripts running at one time: % condor_submit_dag -maxpre 10 bigdag.dag OR % condor_submit_dag -maxpost 30 bigdag.dag

28 Nested DAGs  Idea: any DAG node can be a script that does: 1. Make decision 2. Create DAG input file 3. Call condor_submit_dag -no_submit 4. Outer DAG waits for inner DAG  DAG node will not complete until the inner (nested) DAG finishes  Why?  Implement a fixed-length loop  Modify behavior on the fly
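
Step 2 above, creating a DAG input file on the fly, can be sketched in a few lines of Python. The example builds the text for a fixed-length loop as a chain of nodes, each the parent of the next. The node names (`step1`, `step2`, ...) and the submit file name `step.submit` are hypothetical; writing the text to a file and calling condor_submit_dag on it are left out.

```python
def loop_dag_text(iterations, submit_file="step.submit"):
    """Build DAG input text for a chain step1 -> step2 -> ... -> stepN."""
    names = [f"step{i}" for i in range(1, iterations + 1)]
    lines = [f"Job {name} {submit_file}" for name in names]
    # Each step is the parent of the next, so the steps run one after another.
    lines += [f"Parent {a} Child {b}" for a, b in zip(names, names[1:])]
    return "\n".join(lines) + "\n"

print(loop_dag_text(3), end="")
```

Because the script chooses `iterations` when it runs, the length of the loop can depend on results produced earlier in the outer DAG.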

29 Nested DAG Example A BC D V W Z X Y C is