Job Submission with Globus, Condor, and Condor-G Selim Kalayci Florida International University 07/21/2009 Note: Slides are compiled from various TeraGrid documentation

2 Grid Job Management using Globus
Common WS interface to schedulers
– Unix, Condor, LSF, PBS, SGE, …
More generally: interface for process execution management
– Lay down execution environment
– Stage data
– Monitor & manage lifecycle
– Kill it, clean up

3 Grid Job Management Goals
Provide a service to securely:
– Create an environment for a job
– Stage files to/from environment
– Cause execution of job process(es)
   – Via various local resource managers
– Monitor execution
– Signal important state changes to client
– Enable client access to output files
   – Streaming access during execution

4 GRAM
GRAM: Globus Resource Allocation and Management
GRAM is a Globus Toolkit component
– For Grid job management
GRAM is a unifying remote interface to Resource Managers
– Yet preserves local site security/control
Remote credential management
File staging via RFT and GridFTP

5 A Simple Example
First, log in to queenbee.loni-lsu.teragrid.org
Command example:
% globusrun-ws -submit -c /bin/date
Submitting job...Done.
Job ID: uuid:002a6ab d9-bae6-0002a5ad41e5
Termination time: 01/07/ :55 GMT
Current job state: Active
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
A successful submission creates a new ManagedJob resource with its own unique EPR for messaging.
Use the -o option to create the EPR file:
% globusrun-ws -submit -o job.epr -c /bin/date

6 A Simple Example (2)
To see the output, use the -s (stream) option:
% globusrun-ws -submit -s -c /bin/date
Termination time: 06/14/ :07 GMT
Current job state: Active
Current job state: CleanUp-Hold
Wed Jun 13 14:07:54 EDT 2007
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
If you want to send the output to a file, use the -so option:
% globusrun-ws -submit -s -so job.out -c /bin/date
…
% cat job.out
Wed Jun 13 14:07:54 EDT 2007

7 A Simple Example (3)
Submitting your job to different schedulers:
– Fork
% globusrun-ws -submit -Ft Fork -s -c /bin/date
(Actually, the default is Fork, so you can skip it in this case.)
– PBS
% globusrun-ws -submit -Ft PBS -s -c /bin/date
Submitting to a remote site:
% globusrun-ws -submit -F tg-login.frost.ncar.teragrid.org -c /bin/date

8 Batch Job Submissions
% globusrun-ws -submit -batch -o job_epr -c /bin/sleep 50
Submitting job...Done.
Job ID: uuid:f c5-11d9-97e3-0002a5ad41e5
Termination time: 01/08/ :05 GMT
% globusrun-ws -status -j job_epr
Current job state: Active
% globusrun-ws -status -j job_epr
Current job state: Done
% globusrun-ws -kill -j job_epr
Requesting original job description...Done.
Destroying job...Done.

9 Resource Specification Language (RSL)
RSL is the language clients use to describe a job submission.
All job submission parameters are described in RSL, including the executable file and its arguments.
You can specify the type and capabilities of the resources that should execute your job.
You can also coordinate stage-in and stage-out operations through RSL.

10 Submitting a job through RSL
Command:
% globusrun-ws -submit -f touch.xml
Contents of touch.xml file:
<job>
  <executable>/bin/touch</executable>
  <argument>touched_it</argument>
</job>
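To coordinate the staging operations mentioned on the previous slide, the job description can wrap transfers in fileStageIn/fileStageOut elements. A minimal sketch, not from the original slides; the GridFTP URL, file names, and the name stage.xml are illustrative assumptions:

<job>
  <executable>/bin/wc</executable>
  <argument>input.txt</argument>
  <stdout>${GLOBUS_USER_HOME}/wc.out</stdout>
  <fileStageIn>
    <transfer>
      <!-- hypothetical GridFTP source host and path -->
      <sourceUrl>gsiftp://myhost.example.org:2811/home/me/input.txt</sourceUrl>
      <destinationUrl>file:///${GLOBUS_USER_HOME}/input.txt</destinationUrl>
    </transfer>
  </fileStageIn>
</job>

It is submitted the same way: % globusrun-ws -submit -f stage.xml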

11 Condor is a software system that creates an HTC environment
– Created at UW-Madison
Condor is a specialized workload management system for compute-intensive jobs.
– Detects machine availability
– Harnesses available resources
– Uses remote system calls to send R/W operations over the network
– Provides powerful resource management by matching resource owners with consumers (broker)

12 How Condor works
Condor provides:
– a job queueing mechanism
– scheduling policy
– priority scheme
– resource monitoring, and
– resource management.
Users submit their serial or parallel jobs to Condor,
… Condor places them into a queue,
… chooses when and where to run the jobs based upon a policy,
… carefully monitors their progress, and
… ultimately informs the user upon completion.
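As a preview of the submit description file syntax introduced later in these slides, the completion notice can be requested with two lines (a minimal sketch; the address is an assumption):

notification = Complete
notify_user  = me@example.edu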

13 Condor - features
Checkpoint & migration
Remote system calls
– Able to transfer data files and executables across machines
Job ordering
Job requirements and preferences can be specified via powerful expressions
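For instance, requirements and preferences are written as ClassAd expressions in the submit description file (a minimal sketch; the attributes and thresholds shown are illustrative):

Requirements = (OpSys == "LINUX") && (Memory >= 1024)
Rank         = KFlops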

14 Condor lets you manage a large number of jobs.
Specify the jobs in a file and submit them to Condor
Condor runs them and keeps you notified of their progress
– Mechanisms to help you manage huge numbers of jobs (1000's), all the data, etc.
– Handles inter-job dependencies (DAGMan)
Users can set Condor's job priorities
Condor administrators can set user priorities

Condor-G
Condor-G is a specialization of Condor. It is also known as the “Globus universe” or “Grid universe”.
Condor-G can submit jobs to Globus resources, just like globusrun-ws.
Condor-G combines the inter-domain resource management protocols of the Globus Toolkit and the intra-domain resource and job management methods of Condor for managing Grid jobs.
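A minimal Condor-G submit description file sketch (the gatekeeper hostname and jobmanager are assumptions, for illustration only):

universe      = grid
grid_resource = gt2 gatekeeper.example.teragrid.org/jobmanager-pbs
executable    = /bin/date
output        = date.out
log           = date.log
queue

It is submitted with condor_submit, exactly like a local Condor job.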

16 Condor-G … does whatever it takes to run your jobs, even if …
– The gatekeeper is temporarily unavailable
– The job manager crashes
– Your local machine crashes
– The network goes down

Remote Resource Access: Globus
[Diagram: a user in Organization A runs “globusrun myjob …”; the request travels over the Globus GRAM protocol to a Globus JobManager in Organization B, which fork()s the job on the remote machine.]

Globus + Condor
[Diagram: the same flow, but the Globus JobManager in Organization B submits the job to a local Condor pool instead of forking it.]

Condor-G + Globus + Condor
[Diagram: Condor-G in Organization A queues myjob1 … myjob5 and submits them over the Globus GRAM protocol to the JobManager in Organization B, which hands them to the local Condor pool.]

Just to be fair…
The gatekeeper doesn’t have to submit to a Condor pool.
– It could be PBS, LSF, Sun Grid Engine…
Condor-G will work fine whatever the remote batch system is.
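Pointing Condor-G at a different remote batch system only changes the jobmanager suffix of grid_resource (hostname again illustrative):

grid_resource = gt2 gatekeeper.example.teragrid.org/jobmanager-lsf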

23 Four Steps to Run a Job with Condor
1. Choose a Universe for your job
2. Make your job batch-ready
3. Create a submit description file
4. Run condor_submit
These choices tell Condor how, when, and where to run the job, and describe exactly what you want to run.

24 1. Choose a Universe
There are many choices:
– Vanilla: any old job
– Standard: checkpointing & remote I/O
– Java: better for Java jobs
– MPI: run parallel MPI jobs
– Virtual Machine: run a virtual machine as job
– …
For now, we’ll just consider vanilla.

25 2. Make your job batch-ready
Must be able to run in the background:
– no interactive input, windows, GUI, etc.
Condor is designed to run jobs as a batch system, with pre-defined inputs for jobs
Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices
Organize data files

26 3. Create a Submit Description File
A plain ASCII text file
– Condor does not care about file extensions
Tells Condor about your job:
– Which executable to run and where to find it
– Which universe
– Location of input, output and error files
– Command-line arguments, if any
– Environment variables
– Any special requirements or preferences

27 Simple Submit Description File
# myjob.submit file
# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#       case sensitive, but filenames are!
Universe   = vanilla
Executable = analysis
Log        = my_job.log
Queue

28 4. Run condor_submit
You give condor_submit the name of the submit file you have created:
condor_submit my_job.submit
condor_submit parses the submit file
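On success, condor_submit reports the cluster it created; typical output (the cluster number is illustrative):

% condor_submit my_job.submit
Submitting job(s).
1 job(s) submitted to cluster 23.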

29 Another Submit Description File
# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#       case sensitive, but filenames are!
Universe   = vanilla
Executable = /home/wright/condor/my_job.condor
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
Arguments  = -arg1 -arg2
Queue

“Clusters” and “Processes”
If your submit file describes multiple jobs, we call this a “cluster”
Each job within a cluster is called a “process” or “proc”
If you only specify one job, you still get a cluster, but it has only one process
A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”)
Process numbers always start at 0

Example Submit Description File for a Cluster
# Example condor_submit input file that defines
# a cluster of two jobs with different iwd
Universe   = vanilla
Executable = my_job
Arguments  = -arg1 -arg2
InitialDir = run_0
Queue          ← Becomes job 2.0
InitialDir = run_1
Queue          ← Becomes job 2.1

Submit Description File for a BIG Cluster of Jobs
The initial directory for each job is specified with the $(Process) macro, and instead of submitting a single job, we use “Queue 600” to submit 600 jobs at once
$(Process) will be expanded to the process number for each job in the cluster (from 0 up to 599 in this case), so we’ll have “run_0”, “run_1”, … “run_599” directories
All the input/output files will be in different directories!

Submit Description File for a BIG Cluster of Jobs
# Example condor_submit input file that defines
# a cluster of 600 jobs with different iwd
Universe   = vanilla
Executable = my_job
Arguments  = -arg1 -arg2
InitialDir = run_$(Process)
Queue 600
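The $(Process) macro also works in file names; a hypothetical variation that keeps per-job I/O files apart without separate directories (file names are assumptions):

Input  = in_$(Process).txt
Output = out_$(Process).txt
Error  = err_$(Process).txt
Queue 600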

34 Other Condor commands
condor_q – show status of job queue
condor_status – show status of compute nodes
condor_rm – remove a job
condor_hold – hold a job temporarily
condor_release – release a job from hold
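A typical inspect-and-remove session (a sketch; the queue listing is illustrative, and the exact columns vary by Condor version):

% condor_q
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 23.0    selim   7/21 10:02   0+00:01:10 R  0   9.8  my_job
% condor_rm 23.0
Job 23.0 marked for removal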

35 Submitting more complex jobs
Express dependencies between jobs → WORKFLOWS
Condor DAGMan. Next week.
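As a preview (a minimal sketch; the file names are hypothetical), a DAG file lists the jobs and their ordering, and is submitted with condor_submit_dag:

# diamond.dag: run A, then B and C in parallel, then D
Job A a.submit
Job B b.submit
Job C c.submit
Job D d.submit
Parent A Child B C
Parent B C Child D

% condor_submit_dag diamond.dag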

Hands-on Lab