Condor: Job Management

Presentation transcript:

Condor: Job Management
Jessica Frierson and Somadina Mbadiwe

Job Management on the Grid: Goals
We need a service that will securely:
- Create an environment for a job
- Stage files to and/or from that environment
- Cause execution of the job's processes
- Monitor execution
- Signal us when important state changes occur
- Enable us to access output files

Solution? Condor

Condor: Background
- Known as Condor until 2012; current name: HTCondor (HT for High Throughput)
- Created at UW-Madison (University of Wisconsin-Madison) in 1988
- It has continued to grow as grid and cloud technology grew

Condor
A specialized resource management system for compute-intensive jobs that provides:
- A job queueing mechanism
- Scheduling policy
- A priority scheme
- Resource monitoring
- Resource management

Condor: Core Philosophy
Satisfy the needs of users who need extra capacity...
...without lowering the quality of service experienced by the owners of underutilized workstations.

Condor: Job Management in a Nutshell
The user submits their serial or parallel jobs to Condor. Condor then:
- Places them in a queue
- Chooses when and where the jobs will be run, based on a policy
- Carefully monitors their progress, and
- Informs the user upon completion

Figure: The Condor Kernel. This figure shows the major processes in a Condor system. The common generic name for each process is given in large print; the technical Condor-specific names used in some publications appear in parentheses.

Inter-Cluster Job Management: Gateway Flocking
Pros:
- Completely transparent to participants
- Cross-pool matches can be made without any modification by users
- Administration is only required between adjacent pools
Cons:
- Impossible to track a remote user's usage statistics, because each pool is represented by a single gateway
- Only allows sharing at the organizational level
- Complex, both technically and administratively

Inter-Cluster Job Management: Direct Flocking
Pros:
- An agent may report itself to multiple matchmakers
- Only requires agreement between an individual and an organization
- Jobs may execute in either community as resources become available
Cons:
- Not as powerful as gateway flocking
- Only allows sharing at the individual level: each user must make their own arrangements with every community they want to join
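
The slide describes direct flocking at the architecture level. As a point of reference, a minimal configuration sketch of how a present-day HTCondor pool enables direct flocking is shown below; FLOCK_TO and FLOCK_FROM are real configuration macros, but the host names are hypothetical and the security settings are site-specific.

## On the submit machine of the home pool (hypothetical host names)
FLOCK_TO = cm.remote-pool.example.edu

## On the central manager of the remote pool
FLOCK_FROM = submit.home-pool.example.edu
## The remote pool must also grant the flocked submit machine the
## permissions it needs (for example through its ALLOW_* security settings).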

Using Condor: Just 4 Steps
1. Prepare the Job
2. Select a Universe
3. Create a Submit Description File
4. Submit the Job

Step 1: Prepare the Job
- The job must be able to execute as a batch operation; rewrite your program if need be
- The job must be able to run unattended: no user interaction (don't worry, you can still give Condor sets of input arguments)
- Put input data in a file where Condor can read it
- Test that the program can read these inputs from a file (see the quick check after this list)
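
One quick way to check that a program really runs unattended is to run it once by hand with its input and output redirected to files. The program and file names below are hypothetical:

## Hypothetical program and file names; no keyboard interaction should be needed
./my_simulation < infile.txt > outfile.txt 2> errfile.txt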

ClassAds
The means for jobs and resources to "advertise" themselves to the matchmaker:
- Job: its ClassAd specifies some details about itself and the kind of resource it is interested in
- Machine: its ClassAd specifies some details about itself and when it will be available for use
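
As a rough sketch (attribute names such as Arch, OpSys, Memory, Requirements, and Rank are standard ClassAd attributes; the values shown are made up), a job ad and a machine ad might look like this:

## A job ClassAd (illustrative values only)
MyType       = "Job"
TargetType   = "Machine"
Owner        = "alice"
Cmd          = "/home/alice/simplempi"
Requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Memory >= 64)
Rank         = Memory

## A machine ClassAd (illustrative values only)
MyType       = "Machine"
TargetType   = "Job"
Name         = "node01.example.edu"
Arch         = "X86_64"
OpSys        = "LINUX"
Memory       = 2048
Requirements = (LoadAvg < 0.3) && (KeyboardIdle > 15 * 60)

The matchmaker pairs a job ad with a machine ad when the Requirements expression of each evaluates to true against the other ad, preferring matches with a higher Rank.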

Step 2: Select a Universe
- A universe is a Condor runtime environment; the choice depends on the kind of job you want Condor to run
- Some Condor universes:
  - Standard (the default) and Vanilla: for serial jobs
  - Parallel and PVM: for parallel and PVM jobs
  - MPI: for parallel MPI jobs
  - Globus: for grid applications
  - Scheduler: for meta-schedulers

Step 3: Create a Submit Description File (SDF)
- Forget the name; it is just a plain ASCII text file (the file extension is irrelevant)
- The SDF tells Condor about your job:
  - Which universe (runtime environment)
  - Which executable to run, and where to find it
  - Input, output, and log file locations
  - Command-line arguments, if any (you can specify multiple sets of input arguments)
  - Environment variables
  - Any special preferences or requirements
- A minimal example follows this list; MPI examples are on the next slide
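
For instance, a minimal vanilla-universe SDF that queues the same (hypothetical) program three times with different arguments might look like the sketch below; $(Process) is a built-in macro that expands to 0, 1, 2, ... for each queued job.

################################
## Minimal vanilla example (hypothetical program and file names)
universe   = vanilla
executable = analyze
log        = analyze.log
output     = out.$(Process)
error      = err.$(Process)

arguments  = dataset1.dat
queue
arguments  = dataset2.dat
queue
arguments  = dataset3.dat
queue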

Step 3: Create a Submit Description File (SDF) - Examples

################################
## MPI example submit description file
universe = MPI
executable = simplempi
arguments = arg1 arg2 arg3
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 4
queue

######################################
## MPI example submit description file
## without using a shared filesystem
universe = MPI
executable = simplempi
arguments = arg1 arg2 arg3
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 4
should_transfer_files = yes
when_to_transfer_output = on_exit
queue

Step 4: Submit the Job
- Submit by running the condor_submit command:
  condor_submit my_sdf
- You can augment the commands in the SDF like this:
  condor_submit -a "rank=Memory>=64" -a "error=err.log" my_sdf
- Condor will not run jobs submitted by user root (UID = 0) or by a user whose default group is group wheel (GID = 0); such jobs will sit forever in the queue
- All path names specified in the SDF must be less than 256 characters in length, and command-line arguments must be less than 4096 characters in length

Condor's Promise
I am going to do whatever it takes to run your jobs, even if some machines...
- Crash
- Are disconnected
- Run out of memory
- Are removed from, or added to, the pool
- Are put to other uses

Monitoring your Jobs
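
Condor provides standard command-line tools for watching and managing submitted jobs. A brief, non-exhaustive sketch (the job ID 42.0 is a placeholder):

condor_q                # show your jobs in the local queue
condor_q -analyze 42.0  # explain why a particular job is idle
condor_status           # show the state of machines in the pool
condor_history          # show jobs that have already left the queue
condor_rm 42.0          # remove a job from the queue
condor_hold 42.0        # put a job on hold
condor_release 42.0     # release a held job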

Further Reading
- Douglas Thain, Todd Tannenbaum, and Miron Livny: Condor and the Grid. Concurrency: Pract. Exper. 2002; 0:0-20.
- Condor (Version 7.6.10) Manual: http://research.cs.wisc.edu/htcondor/manual/v7.6/ref.html
- Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny: Condor – A Distributed Job Scheduler. https://research.cs.wisc.edu/htcondor/doc/beowulf-chapter-rev1.pdf
- Michael N. Fienen and Randall J. Hunt: High-Throughput Computing Versus High-Performance Computing for Groundwater Applications. https://www.researchgate.net/profile/Michael_Fienen/publication/271530646_High-Throughput_Computing_Versus_High-Performance_Computing_for_Groundwater_Applications/links/54cc0c440cf29ca810f4b153.pdf