
1 Condor DAGMan Warren Smith

2 Basics
12/11/2009 TeraGrid Science Gateways Telecon
Condor provides workflow support with DAGMan
Directed Acyclic Graph
–Each node has a Condor job
–Each edge is a dependency
Example DAGMan file:
    Job A setup.condor
    Job B sweep1.condor
    Job C sweep2.condor
    Job D analyze.condor
    Parent A Child B C
    Parent B C Child D
Job names a node and gives its Condor submit script
Parent/Child specifies dependencies
[Diagram: the DAG of Node A, Node B, Node C, and Node D]
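
To make the Parent/Child semantics concrete, here is a minimal Python sketch that reads the example DAG file above and groups nodes into "waves" that could be submitted together once their parents are done. It is only an illustration of the dependency rule, not how DAGMan itself is implemented; the function names `parse_dag` and `submission_waves` are made up for this example, and only the Job and Parent/Child keywords are handled.

```python
from collections import defaultdict

# The example DAG file from the slide.
DAG_TEXT = """\
Job A setup.condor
Job B sweep1.condor
Job C sweep2.condor
Job D analyze.condor
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text):
    """Return (jobs, edges) from a DAG description (Job and Parent/Child only)."""
    jobs = {}      # node name -> submit file
    edges = set()  # (parent, child) pairs
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            for parent in words[1:split]:
                for child in words[split + 1:]:
                    edges.add((parent, child))
    return jobs, edges


def submission_waves(jobs, edges):
    """Group nodes into waves; a node is ready once all of its parents are done."""
    pending_parents = {name: 0 for name in jobs}
    children = defaultdict(list)
    for parent, child in edges:
        pending_parents[child] += 1
        children[parent].append(child)
    wave = sorted(n for n, count in pending_parents.items() if count == 0)
    waves = []
    while wave:
        waves.append(wave)
        next_wave = []
        for node in wave:
            for child in children[node]:
                pending_parents[child] -= 1
                if pending_parents[child] == 0:
                    next_wave.append(child)
        wave = sorted(next_wave)
    return waves


jobs, edges = parse_dag(DAG_TEXT)
print(submission_waves(jobs, edges))
```

For the example file, A is alone in the first wave, B and C can run side by side in the second, and D waits for both of them.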

3 Managing a DAG
condor_submit_dag
–Creates a local job to manage the DAG
  –Monitors jobs that make up the DAG
  –Submits jobs when dependencies are satisfied
    lslogin2% condor_submit_dag example8.dag
    Checking all your submit files for log file names.
    This might take a while...
    Done.
    -----------------------------------------------------------------------
    File for submitting this DAG to Condor : example8.dag.condor.sub
    Log of DAGMan debugging messages : example8.dag.dagman.out
    Log of Condor library output : example8.dag.lib.out
    Log of Condor library error messages : example8.dag.lib.err
    Log of the life of condor_dagman itself : example8.dag.dagman.log
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 76.
    -----------------------------------------------------------------------
condor_q -dag and condor_q to monitor
condor_rm to remove the DAG or individual jobs in the DAG

4 DAGMan Node
Each node can have pre- and post-scripts
–Run on your submit system
–SCRIPT PRE JobName ExecutableName [arguments]
–SCRIPT POST JobName ExecutableName [arguments]
Condor job runs if PRE succeeds
POST runs if node executes
Result of POST is result of node
    Job A setup.condor
    Job B sweep1.condor
    Job C sweep2.condor
    Job D analyze.condor
    Script PRE A retrieve.sh
    Script POST B check.sh $JOB $RETURN
    Script POST C check.sh $JOB $RETURN
    Script POST D archive.sh
    Parent A Child B C
    Parent B C Child D
[Diagram: the DAG of Node A, Node B, Node C, and Node D, with a detail of one node showing PRE script, Job B, POST script]
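
The three rules on this slide (the job runs if PRE succeeds, POST runs if the node executes, the result of POST is the result of the node) can be sketched as a small Python function. This is a model of the rules as stated here, not DAGMan code. `run_node` and its callables are hypothetical; the sketch assumes that exit code 0 means success, that a failed PRE fails the node without running the job or POST, and that a node without a POST script takes the job's result.

```python
def run_node(job, pre=None, post=None):
    """Return True if the node succeeds.

    job and pre are zero-argument callables returning an exit code.
    post takes the job's exit code (like $RETURN) and returns an exit code.
    """
    if pre is not None and pre() != 0:
        # Assumption: a failed PRE fails the node; job and POST are skipped.
        return False
    job_rc = job()
    if post is None:
        # No POST script: the job's exit code decides the node's result.
        return job_rc == 0
    # The POST script's exit code is the node's result, whatever the job did.
    return post(job_rc) == 0


def failing_job():
    return 1


def forgiving_check(job_rc):
    # Hypothetical stand-in for check.sh: inspects $RETURN and accepts it.
    return 0


print(run_node(failing_job))                        # node fails with the job
print(run_node(failing_job, post=forgiving_check))  # POST overrides the job
```

The second call shows why a POST script is useful for failure handling: it can inspect the job's exit code and decide that the node is still acceptable.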

5 Managing Failures
Retry statement for any node
    Retry B 3
POST script can analyze what happened and try to correct
–Can be used with Retry
ABORT-DAG-ON to exit immediately if the DAG can’t recover
–On the exit code of a node
    ABORT-DAG-ON B 12
Rescue DAG
–Condor executes a DAG as far as it can, even when individual nodes fail
–Rescue DAG is generated if DAG didn’t fully complete
  –Includes comments and marks which nodes completed
  –Can be resubmitted as is or edited and submitted
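
As a rough illustration of how Retry and ABORT-DAG-ON interact, the sketch below models a node that is attempted once and then retried up to the configured number of times, unless it exits with the abort value. The names `run_with_retry` and `DagAborted` are invented for this example, and the node is represented by a plain Python callable that returns an exit code.

```python
class DagAborted(Exception):
    """Raised when a node exits with its ABORT-DAG-ON value."""


def run_with_retry(attempt, retries=0, abort_on=None):
    """Run a node, retrying on failure.

    attempt is a zero-argument callable returning the node's exit code.
    Returns (succeeded, attempts_made).
    """
    for attempts_made in range(1, retries + 2):  # first try plus the retries
        rc = attempt()
        if rc == 0:
            return True, attempts_made
        if abort_on is not None and rc == abort_on:
            raise DagAborted(f"node exited with {rc}; stopping the DAG")
    return False, retries + 1


# "Retry B 3": the node fails twice, then succeeds on the third attempt.
exit_codes = iter([1, 1, 0])
print(run_with_retry(lambda: next(exit_codes), retries=3))

# "ABORT-DAG-ON B 12": exit code 12 stops everything, retries or not.
exit_codes = iter([1, 12, 0])
try:
    run_with_retry(lambda: next(exit_codes), retries=3, abort_on=12)
except DagAborted as err:
    print("aborted:", err)
```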

6 TeraGrid Condor-G Matchmaking
Matchmaking is selecting a resource for a job
–A job provides requirements and preferences for a host
–A resource provides requirements and preferences for jobs
–Jobs are paired to resources
  –Satisfy all requirements of both job and resource
  –Optimize preferences of job and resource
TeraGrid supports matchmaking of Condor-G jobs
–Can be used with DAGMan
  –Can’t express everything you might want
  –For example, “run job A on the same machine as job B”
Available from several TeraGrid systems
–http://info.teragrid.org/restdemo/html/tg/services/condor-g-matchmaking
Your Condor install can be authorized
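
The pairing rule above (both sides' requirements must hold, then preferences pick the winner) can be sketched in a few lines of Python. This is a simplified illustration, not the Condor matchmaker: ads are plain dictionaries, requirements and rank are Python callables, only the job's preference is optimized, and the LoadAvg values, the MaxWallTime attribute, and the per-machine limits are made up for the example.

```python
def match(job, machines):
    """Return the machine the job prefers among mutually acceptable ones."""
    candidates = [
        m for m in machines
        if job["requirements"](m) and m["requirements"](job)
    ]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


# Hypothetical machine ads; the numbers are illustrative only.
machines = [
    {"Name": "tacc.lonestar.development", "LoadAvg": 0.5,
     "requirements": lambda job: job["MaxWallTime"] <= 120},
    {"Name": "tacc.ranger.development", "LoadAvg": 0.1,
     "requirements": lambda job: job["MaxWallTime"] <= 2},
    {"Name": "ncsa.abe.debug", "LoadAvg": 2.0,
     "requirements": lambda job: job["MaxWallTime"] <= 30},
]

job = {
    "MaxWallTime": 5,
    "requirements": lambda m: m["Name"] in {
        "tacc.lonestar.development", "tacc.ranger.development"},
    "rank": lambda m: 100 - m["LoadAvg"],
}

best = match(job, machines)
print(best["Name"] if best else "no match")
```

In this made-up data the least loaded machine would reject the job because of its own requirements, so the job is paired with the next best machine that accepts it.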

7 System Information
You can find information about systems using condor_status
–Each row describes a queue on a system
    lslogin2% condor_status
    Name               OpSys  Arch    State      Activity  LoadAv  Mem      ActvtyTime
    …
    tacc.lonestar.deve LINUX  X86_64  Unclaimed  Idle      0.000   1301382  0+00:00:00
    tacc.lonestar.high LINUX  X86_64  Unclaimed  Idle      1.750   1301382  0+00:00:00
    tacc.lonestar.norm LINUX  X86_64  Unclaimed  Idle      2.410   1301382  0+00:00:00
    tacc.lonestar.seri LINUX  X86_64  Unclaimed  Idle      3.083   1301382  0+00:00:00
    …
Load average tries to describe how busy a queue is
–(slots used + slots requested) / slots used

8 First Job with Matchmaking
    executable = /bin/hostname
    arguments = --fqdn
    transfer_executable = false
    output = example2.out
    error = example2.err
    log = example2.log
    requirements = (Name=="tacc.lonestar.development")
    universe = grid
    x509userproxy = /home/teragrid/tg458637/.globus/userproxy.pem
    grid_resource = $$(GramResource)
    globusrsl = (maxWallTime=5)(count=1)(queue=$$(Queue))
    queue

9 Notes on First Job
x509userproxy
–Don’t have to specify when not matchmaking
–Need to when matchmaking
Requirements
–What the job requires of any machine it gets matched to
–Boolean
–Variables such as “Name” are for the machine being matched to
Variables such as “$$(Queue)” are for the machine being matched to
–$$() only needed outside of requirements

10 Second Job with Matchmaking
    executable = /bin/hostname
    arguments = --fqdn
    transfer_executable = false
    output = example2-$(CLUSTER).$(PROCESS).out
    error = example2-$(CLUSTER).$(PROCESS).err
    log = example2-$(CLUSTER).$(PROCESS).log
    requirements = ((Name=="tacc.lonestar.development") || \
                    (Name=="tacc.ranger.development") || \
                    (Name=="loni-lsu.queenbee.workq") || \
                    (Name=="ncsa.abe.debug") || \
                    (Name=="ncsa.dtf.debug") || \
                    (Name=="sdsc.dtf.dque") || \
                    (Name=="purdue.steele.tg_workq"))
    rank = 100 - LoadAvg - CurMatches * 0.25
    universe = grid
    x509userproxy = /home/teragrid/tg458637/.globus/userproxy.pem
    grid_resource = $$(GramResource)
    globusrsl = (maxWallTime=5)(count=1)(queue=$$(Queue))
    queue 10

11 Notes on Second Job
$(Cluster)
–The ID for the set of jobs submitted by this script
$(Process)
–The ID (0 to n-1) of a job within a cluster
Requirements is (mostly) a boolean expression
–() to group, || for or, && for and
–<, <=, >, >=, ==, !=
–A few others since expressions actually have 3 values
  –True, false, undefined
Rank expression used to identify best machine
–Higher rank is better
–CurMatches is number of jobs matched to that machine in current round
–100 is the max load average
queue 10
–Submits 10 copies of this job
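
To see what the rank expression 100 - LoadAvg - CurMatches * 0.25 does across the 10 copies of the job, here is a small Python simulation. It assumes the matchmaker handles the jobs one at a time in a single round and that CurMatches goes up by one for each job matched to a machine, as described above. The LoadAvg numbers are invented for illustration, and `spread_jobs` is a made-up helper, not part of Condor.

```python
def rank(machine):
    """The rank expression from the submit file; higher is better."""
    return 100 - machine["LoadAvg"] - machine["CurMatches"] * 0.25


def spread_jobs(machines, job_count):
    """Match job_count jobs in one round; return jobs per machine name."""
    for machine in machines:
        machine["CurMatches"] = 0  # a new matchmaking round starts at zero
    for _ in range(job_count):
        best = max(machines, key=rank)
        best["CurMatches"] += 1    # each match lowers this machine's rank
    return {m["Name"]: m["CurMatches"] for m in machines}


# Hypothetical load averages, chosen only to show the effect.
machines = [
    {"Name": "tacc.lonestar.development", "LoadAvg": 0.0},
    {"Name": "tacc.ranger.development", "LoadAvg": 0.6},
    {"Name": "ncsa.abe.debug", "LoadAvg": 2.5},
]

print(spread_jobs(machines, 10))
```

Without the CurMatches term every job in the round would go to the least loaded queue. With it, each match costs that queue 0.25 rank points, so jobs spill over to the next best queue once the gap in load average has been used up.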

12 Ranking Machines
It’s a bit of an art at this point
–Let me know what works for you
Let Warren know if additional information is needed
Working on providing queue wait time predictions
–QBETS predictions have been available
–Most likely moving to a new technology over the next few months

13 Additional Information
Condor User Guide
–http://www.cs.wisc.edu/condor/manual/v7.2/
TeraGrid
–Condor-G page
  http://www.teragrid.org/userinfo/jobs/condorg.php
–Condor-G matchmaking wiki
  http://www.teragridforum.org/mediawiki/index.php?title=Schedwg_condorg_userguide

