Condor: Job Management

Jessica Frierson and Somadina Mbadiwe

Job Management on the Grid: Goals
We need a service that will securely:
- Create an environment for a job
- Stage files to and/or from the environment
- Cause execution of job processes
- Monitor execution
- Signal us when important state changes occur
- Enable us to access output files

Solution? Condor

Condor: Background
- Known as Condor until 2012; current name: HTCondor (HT for High Throughput)
- Created at UW-Madison in 1988 (UW for University of Wisconsin)
- It continued growing as grid and cloud technology grew

Condor
A specialized resource management system for compute-intensive jobs that provides:
- Job queueing mechanism
- Scheduling policy
- Priority scheme
- Resource monitoring
- Resource management

Condor: Core Philosophy
Satisfy the needs of users who need extra capacity without lowering the quality of service experienced by the owners of underutilized workstations

Condor: Job Management in a nutshell
Users submit their serial or parallel jobs to Condor. Condor then:
- Places them in a queue
- Chooses when and where the jobs will run, based on a policy
- Carefully monitors their progress
- Informs the user upon completion

Figure: The Condor Kernel. This figure shows the major processes in a Condor system. The common generic name for each process is given in large print. In parentheses are the technical Condor-specific names used in some publications.

Inter-Cluster Job Management: Gateway Flocking
Pros
- Completely transparent to participants
- Cross-pool matches can be made without any modification by users
- Administration is only required between adjacent pools
Cons
- Impossible to track a remote user’s usage statistics, because each pool is represented by a single gateway
- Only allows sharing at the organizational level
- Complex, both technically and administratively

Inter-Cluster Job Management: Direct Flocking
Pros
- An agent may report itself to multiple matchmakers
- Only requires agreement between an individual and an organization
- Jobs may execute in either community as resources become available
Cons
- Not as powerful as Gateway Flocking
- Only allows sharing at the organizational level
- An individual user can’t join multiple communities

Using Condor: Just 4 Steps
1. Prepare the Job
2. Select a Universe
3. Create a Submit Description File
4. Submit the Job

Step 1: Prepare the Job
- Job must be able to execute as a batch operation
  - Rewrite your program if need be
- Job must be able to run unattended
  - No user interactions
  - Don’t worry, you can give Condor sets of input arguments
- Put input data in a file where Condor can read it
- Test that the program can read these inputs from a file
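A minimal sketch of what a batch-ready program might look like, assuming a hypothetical job that sums the numbers in a file. The names `summarize` and `main` and the one-number-per-line data format are illustrative assumptions, not part of Condor. The point is that the program never prompts the user: its only input is a file named on the command line.

```python
import sys


def summarize(lines):
    """Return (count, total) for the numbers found in an iterable of text lines."""
    values = [float(line) for line in lines if line.strip()]
    return len(values), sum(values)


def main(path):
    # No prompts and no input() calls: all data comes from the named file,
    # so the program can run unattended.
    with open(path) as handle:
        count, total = summarize(handle)
    print(f"count={count} total={total}")


# Only run when an input file is given as the first command-line argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Running it by hand against a sample input file before submitting is one way to carry out the last check in the list above.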

ClassAds
The means for jobs and resources to “advertise” themselves to the matchmaker
- Job: Its ClassAd will specify some details about itself and the kind of resource it’s interested in
- Machine: Its ClassAd will specify some details about itself and when it’ll be available for use
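The two-sided nature of this advertising can be illustrated with a toy sketch. This is not the ClassAd language or Condor's matchmaker: plain Python dictionaries and lambdas stand in for ads and their requirements, and every name here (`matches`, `best_machine`, `memory_mb`, the machine names) is hypothetical.

```python
def matches(job_ad, machine_ad):
    """Two-way match: each side's requirements must accept the other ad."""
    return job_ad["requirements"](machine_ad) and machine_ad["requirements"](job_ad)


def best_machine(job_ad, machine_ads):
    """Among matching machines, pick the one the job ranks highest (None if no match)."""
    candidates = [m for m in machine_ads if matches(job_ad, m)]
    if not candidates:
        return None
    return max(candidates, key=job_ad["rank"])


# The job describes itself and the kind of resource it wants.
job = {
    "owner": "alice",
    "requirements": lambda m: m["memory_mb"] >= 64,
    "rank": lambda m: m["memory_mb"],  # prefer more memory
}

# Each machine describes itself and whose jobs it is willing to run.
machines = [
    {"name": "ws1", "memory_mb": 32, "requirements": lambda j: True},
    {"name": "ws2", "memory_mb": 128, "requirements": lambda j: j["owner"] != "bob"},
    {"name": "ws3", "memory_mb": 256, "requirements": lambda j: j["owner"] == "bob"},
]

chosen = best_machine(job, machines)
print(chosen["name"])
```

Here `ws1` is rejected by the job (too little memory) and `ws3` rejects the job (wrong owner), so only `ws2` satisfies both sides.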

Step 2: Select a Universe
Universe: refers to a Condor runtime environment
- The choice will depend on the kind of job you want Condor to run
Some Condor runtime environments:
- Standard (default) and Vanilla: for serial jobs
- Parallel and PVM: for parallel and PVM jobs
- MPI: for parallel MPI jobs
- GLOBUS: for grid applications
- Scheduler: for meta-schedulers

Step 3: Create a Submit Description File (SDF)
Forget the name; it’s just a plain ASCII text file
- File extension is irrelevant
SDF tells Condor about your job:
- Which Universe (runtime environment)
- Which executable to run, and where to find it
- Input, Output and Log file locations
- Command-line arguments, if any
  - You can specify multiple sets of input arguments
- Environment variables
- Any special preference or requirements

Step 3: Create a Submit Description File (SDF) - Examples
################################
# MPI example submit description file
universe = MPI
executable = simplempi
Arguments = arg1 arg2 arg3
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 4
queue

######################################
## MPI example submit description file
## without using a shared filesystem
universe = MPI
executable = simplempi
Arguments = arg1 arg2 arg3
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 4
should_transfer_files = yes
when_to_transfer_output = on_exit
queue
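Since a submit description file is plain text, it can also be generated. The sketch below is one possible way to do that for a job with several sets of input arguments, by repeating an `arguments` line and a `queue` statement for each set. The helper name `render_submit_description`, the vanilla universe, the file names, and the `$(Process)` macro used to keep output files distinct are illustrative choices that do not come from the slides.

```python
def render_submit_description(common, argument_sets):
    """Render shared 'name = value' lines, then one arguments/queue pair per set."""
    lines = [f"{name} = {value}" for name, value in common.items()]
    for arguments in argument_sets:
        lines.append(f"arguments = {arguments}")
        lines.append("queue")
    return "\n".join(lines) + "\n"


sdf_text = render_submit_description(
    {
        "universe": "vanilla",
        "executable": "myjob",             # hypothetical program name
        "log": "logfile",
        "output": "outfile.$(Process)",    # one output file per queued job
        "error": "errfile.$(Process)",
    },
    ["arg1 arg2", "arg3 arg4"],
)
print(sdf_text)
```

The resulting text can be written to a file of any name and passed to `condor_submit`.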

Step 4: Submit the Job
By running the "condor_submit" command:
condor_submit my_sdf
You can augment the commands in the SDF like this:
condor_submit -a "rank=Memory>=64" -a "error=err.log" my_sdf
- Condor will not run jobs submitted by user root (UID = 0) or by a user whose default group is group wheel (GID = 0); they will sit in the queue forever
- All path names specified in the SDF must be less than 256 characters in length
- Command-line arguments must be less than 4096 characters in length
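The length limits above are easy to check before submitting. The following sketch assumes two hypothetical helpers, `check_limits` and `build_submit_command`; neither is part of Condor. The first applies the stated limits, and the second only assembles the `condor_submit` command line with one `-a` option per extra command, without running it.

```python
MAX_PATH_CHARS = 256        # path names must be shorter than this
MAX_ARGUMENT_CHARS = 4096   # the argument string must be shorter than this


def check_limits(paths, arguments):
    """Return a list of problems found; an empty list means the limits are met."""
    problems = [f"path too long: {p[:40]}..." for p in paths if len(p) >= MAX_PATH_CHARS]
    if len(arguments) >= MAX_ARGUMENT_CHARS:
        problems.append("command-line arguments too long")
    return problems


def build_submit_command(sdf_path, extra_commands=()):
    """Build the argument list for condor_submit, adding -a for each extra command."""
    argv = ["condor_submit"]
    for command in extra_commands:
        argv += ["-a", command]
    argv.append(sdf_path)
    return argv


command = build_submit_command("my_sdf", ["rank=Memory>=64", "error=err.log"])
print(command)
```

Keeping the command as a list rather than a single string means it could be handed to something like `subprocess.run` without shell quoting around `>=`.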

Condor’s Promise
I am going to do whatever it takes to run your jobs, even if some machines:
- Crash
- Are disconnected
- Run out of memory
- Are added to or removed from the pool
- Are put to other uses

Monitoring your Jobs

Further Reading
- Douglas Thain, Todd Tannenbaum, and Miron Livny: Condor and the Grid. Concurrency: Pract. Exper. 2002; 0:0–20
- Condor (Version ) Manual
- Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny: Condor – A Distributed Job Scheduler.
- Michael N. Fienen and Randall J. Hunt: High-Throughput Computing Versus High-Performance Computing for Groundwater Applications.
