Grid Compute Resources and Job Management


How do we access the grid?
- Command line, with tools that you'll use
- Specialised applications, e.g. a program to process images that sends work to run on the grid as a built-in feature
- Web portals, e.g. I2U2, SIDGrid

Grid Middleware glues the grid together
A short, intuitive definition: the software that glues together different clusters into a grid, taking into consideration the socio-political side of things (such as common policies on who can use what, how much, and what for).

Grid middleware
Offers services that couple users with remote resources through resource brokers:
- Remote process management
- Co-allocation of resources
- Storage access
- Information
- Security
- QoS

Globus Toolkit
- The de facto standard for grid middleware
- Developed at ANL & UChicago (the Globus Alliance)
- Open source
- Adopted by different scientific communities and industries
- Conceived as an open set of architectures, services and software libraries that support grids and grid applications
- Provides services in major areas of distributed systems: core services, data management, security

GRAM: Globus Resource Allocation Manager
- GRAM provides a standardised interface for submitting jobs to LRMs
- Clients submit a job request to GRAM
- GRAM translates the request into something any LRM can understand
- The same job request can be used for many different kinds of LRM
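As a rough illustration of that standard interface (a sketch assuming a GT2-style gatekeeper; the host name is the same example CE used in the submit file later in these slides), the same client command can target different LRMs simply by changing the jobmanager named in the contact string:

  # run /bin/hostname via the site's fork jobmanager
  globus-job-run osgce.cs.clemson.edu/jobmanager-fork /bin/hostname -f
  # the same request routed through the site's PBS jobmanager, if one is configured
  globus-job-run osgce.cs.clemson.edu/jobmanager-pbs /bin/hostname -f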

Local Resource Managers (LRM)
Compute resources have a local resource manager (LRM) that controls:
- Who is allowed to run jobs
- How jobs run on a specific resource
- Example policy: each cluster node can run one job; if there are more jobs, they must wait in a queue
- LRMs allow nodes in a cluster to be reserved for a specific person
- Examples: PBS, LSF, Condor
This component deals with job submission. Naturally, users want to be able to run their programs, for example data processing or simulation. This section explains how that is accomplished on grids. We can run our jobs on other computers on the grid to make use of those resources. The challenging part is that sites on the grid have different platforms, so a uniform way of accessing the grid must be deployed. This is exactly the purpose of the grid software stack: it tries to deal with some of those differences. On a particular site, a few components are involved in job submission. Closest to the hardware, we have a piece of software called a local resource manager. This is software that manages the local resource by deciding the order and location of jobs. In its simplest form, it makes sure only one user is using each compute node at once, and keeps some kind of queue. More advanced LRMs can deal with reservations, prioritising between different users, and various other parameters. LRMs are not grid specific: an LRM manages jobs on a particular cluster. It is important to realize that an LRM doesn't deal with jobs coming in from elsewhere; you need to, for example, log in directly to the cluster to submit your jobs.

Job Management on a Grid
[Diagram: a user submits jobs through GRAM to four sites on the grid (Sites A-D), each running a different LRM: Condor, LSF, PBS, or fork.]
Each cluster has a local resource manager (LRM) on it, and each site can have a different local resource manager, chosen by the site administration. What we need now is a way to access these resources in a homogeneous way; this can be done by placing a new layer between the user (submission node) and each cluster. In the Globus world, this is accomplished by a piece of the Globus Toolkit called GRAM, the Grid Resource Allocation Manager. GRAM provides a common network interface to the various LRMs, so users can go to any site on the grid and make a job submission without really needing to know which LRM is being used. But GRAM doesn't manage the execution; it is up to the LRM to do that. Make sure you understand the distinction between the resource allocation manager (GRAM) and the LRM (for example, Condor). GRAM and the LRM work together: GRAM gets the job to a cluster in a standard way for execution, and the LRM then executes it.
http://www.globus.org/toolkit/docs/4.0/execution/

GRAM
Given a job specification, GRAM:
- Creates an environment for the job
- Stages files to and from the environment
- Submits the job to a local resource manager
- Monitors the job
- Sends notifications of job state changes
- Streams the job's stdout/stderr during execution
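For reference, a hedged sketch of what such a job specification can look like in GT2's RSL notation, submitted with globusrun (the contact string and the exact set of attributes are illustrative):

  # the RSL string describes the job; GRAM hands it to whichever LRM
  # sits behind the chosen jobmanager
  globusrun -r osgce.cs.clemson.edu/jobmanager-pbs \
    '&(executable=/bin/hostname)(arguments=-f)(count=1)(stdout=hostname.out)'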

GRAM components
[Diagram: a submitting machine (e.g. the user's workstation) runs globus-job-run, which contacts the gatekeeper at the remote site over the Internet; the gatekeeper starts a jobmanager, which passes the job to the LRM (e.g. Condor, PBS, LSF), which in turn runs it on the worker nodes/CPUs.]

Condor
- A software system that creates an HTC (high-throughput computing) environment
- Created at UW-Madison
- A specialized workload management system for compute-intensive jobs
- Detects machine availability
- Harnesses available resources
- Uses remote system calls to send R/W operations over the network
- Provides powerful resource management by matching resource owners with consumers (broker)
Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor; Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion. Condor is a batch job system that can take advantage of both dedicated and non-dedicated computers to run jobs. It focuses on high throughput rather than high performance, and provides a wide variety of features including checkpointing, transparent process migration, remote I/O, parallel programming with MPI, the ability to run large workflows, and more.
Condor-G is designed to interact specifically with Globus and can provide a resource selection service across multiple grid sites. A site often uses the same workload management system for all of its resources, but the workload management systems in use across a grid often vary. Ideally, which workload management system is ultimately used to submit a job should be transparent to the grid user. In addition to providing a suite of web services to submit, monitor and cancel jobs in a grid environment, Globus provides interface support for several common workload management systems, and directions for developing an alternative interface, to shield the user from system-specific detail.

How Condor works
Condor provides:
- a job queueing mechanism
- a scheduling policy
- a priority scheme
- resource monitoring
- resource management
Users submit their serial or parallel jobs to Condor; Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.

Condor features
- Checkpointing & migration
- Remote system calls
- Able to transfer data files and executables across machines
- Job ordering
- Job requirements and preferences can be specified via powerful expressions
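For example, a submit description file can express requirements and preferences like the following (a minimal sketch using standard machine ClassAd attributes; the thresholds are arbitrary):

  # only match 64-bit Linux machines with at least 2 GB of memory ...
  Requirements = (OpSys == "LINUX") && (Arch == "X86_64") && (Memory >= 2048)
  # ... and among those, prefer the machine with the most memory
  Rank = Memory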

Condor lets you manage a large number of jobs
- Specify the jobs in a file and submit them to Condor
- Condor runs them and keeps you notified of their progress
- Mechanisms to help you manage huge numbers of jobs (1000s), all the data, etc.
- Handles inter-job dependencies (DAGMan)
- Users can set Condor's job priorities
- Condor administrators can set user priorities
Condor can do this as:
- the local resource manager (LRM) on a compute resource
- a grid client submitting to GRAM (as Condor-G)
Commonly, when you submit a job to the grid, you do it as a large number of separate job submissions: for example, if you want to use 100 CPUs, you might make a few hundred job submissions, each one intended to use its own CPU. This can get hard to manage, so you can use a piece of software called Condor; amongst many other things, it can manage large numbers of jobs for you. It can track their progress, restart them if they fail, and handle dependencies between jobs, so that you can say that some jobs must not start until some other job has finished.
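As a sketch of how one submit file can describe many jobs (the executable and file names here are hypothetical), Condor expands $(Process) to 0, 1, ..., N-1 for a "Queue N" statement:

  # manyjobs.submit: queue 100 independent jobs from a single file
  Universe   = vanilla
  Executable = analyze
  Arguments  = input.$(Process)
  Output     = out.$(Process)
  Error      = err.$(Process)
  Log        = analyze.log
  Queue 100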

Condor-G is the grid job management part of Condor. Use Condor-G to submit to resources accessible through a Globus interface.

Condor-G … does whatever it takes to run your jobs, even if …
- the gatekeeper is temporarily unavailable
- the job manager crashes
- your local machine crashes
- the network goes down

Remote Resource Access: Condor-G + Globus + Condor
[Diagram: jobs myjob1 ... myjob5 sit in the Condor-G queue at Organization A; Condor-G talks to Globus GRAM at Organization B using the GRAM protocol, and GRAM submits the jobs to Organization B's LRM (Condor).]

Condor-G: Access non-Condor Grid resources
Globus:
- middleware deployed across the entire Grid
- remote access to computational resources
- dependable, robust data transfer
Condor:
- job scheduling across multiple resources
- strong fault tolerance with checkpointing and migration
- layered over Globus as a "personal batch system" for the Grid

Four Steps to Run a Job with Condor
- Make your job batch-ready
- Create a submit description file
- Run condor_submit
There are several steps involved in running a Condor job. These choices tell Condor how, when and where to run the job, and describe exactly what you want to run.

1. Make your job batch-ready
- It must be able to run in the background: no interactive input, windows, GUI, etc.
- Condor is designed to run jobs as a batch system, with pre-defined inputs for jobs
- You can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices
- Organize your data files
A job that expects GUI input from the user may not work well.

2. Create a Submit Description File
- A plain ASCII text file; Condor does not care about file extensions
- Tells Condor about your job:
  - which executable to run and where to find it
  - which universe
  - the location of input, output and error files
  - command-line arguments, if any
  - environment variables
  - any special requirements or preferences
You need to create a submit description file that tells Condor about your job: which universe to use, the executable to run and where to find it, the location of inputs and a place to put the outputs, any arguments to the executable, environment setup, preferences, and so on. condor_submit then uses this submit description file to make a ClassAd for your job.

Simple Submit Description File

  # myjob.submit file
  Universe      = grid
  grid_resource = gt2 osgce.cs.clemson.edu/jobmanager-fork
  Executable    = /bin/hostname
  Arguments     = -f
  Log           = /tmp/benc-grid.log
  Output        = grid.out
  Error         = grid.error
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  Queue

Here is an example of a very simple submit description file. You may include comments in this file as well. You can see that the syntax is typically "key = value". The "queue" command is also important: it tells Condor to submit this job into the job queue.

4. Run condor_submit
You give condor_submit the name of the submit file you have created:

  condor_submit myjob.submit

condor_submit parses the submit file, converts it into a ClassAd, and places the job into the job queue.

Details
- Lots of options are available in the submit file
- There are commands to watch the queue, the state of your pool, and lots more
- You'll see much of this in the hands-on exercises
The example submit files we looked at contain merely a subset of the many options available. Once you submit your job, Condor provides several tools to watch the status of your job, the state of the pool, and much more. You'll see more of how this works during the tutorial.
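A few of the many other submit file commands, as a hedged illustration (the values are arbitrary, and the availability of some commands, such as request_memory, depends on the Condor version):

  request_memory       = 1024             # MB of memory the job needs
  request_cpus         = 1
  transfer_input_files = data.txt         # extra input files to ship with the job
  notification         = Complete         # e-mail when the job finishes
  notify_user          = you@example.edu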

Other Condor commands
- condor_q: show the status of the job queue
- condor_status: show the status of compute nodes
- condor_rm: remove a job
- condor_hold: hold a job temporarily
- condor_release: release a job from hold
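Typical usage, using the job ID that appears in the condor_q listing shown later in these slides:

  condor_q                 # what is in the queue, and in what state?
  condor_status            # which machines are in the pool, and are they busy?
  condor_hold 2456.0       # pause job 2456.0
  condor_release 2456.0    # let it run again
  condor_rm 2456.0         # remove it from the queue entirely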

Submitting more complex jobs
- Express dependencies between jobs: workflows
- We would like the workflow to be managed even in the face of failures
You have seen how to submit a simple job. Now, what happens if the job you'd like to run is more complex? What if you want to establish dependencies between jobs? How can you ensure that your job is resilient to failures?

DAGMan: Directed Acyclic Graph Manager
DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you (e.g., "Don't run job B until job A has completed successfully").
In the simplest case, DAGMan manages the dependencies between your Condor jobs.

What is a DAG?
- A DAG is the data structure used by DAGMan to represent these dependencies.
- Each job is a "node" in the DAG.
- Each node can have any number of "parent" or "child" nodes, as long as there are no loops!
[Diagram: the "diamond" DAG: Job A at the top, Jobs B and C as its children, and Job D as the child of both B and C.]
A DAG is the best data structure to represent a workflow of jobs with dependencies. Children may not run until their parents have finished; this is why the graph is a directed graph: there is a direction to the flow of work. In this example, called a "diamond" DAG, job A must run first; when it finishes, jobs B and C can run together; when they are both finished, D can run; and when D is finished the DAG is finished. Loops are prohibited because they would lead to deadlock: in a loop, neither node could run until the other finished, and so neither would start. This restriction is what makes the graph acyclic.

Defining a DAG
A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

  # diamond.dag
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D

Each node will run the Condor job specified by its accompanying Condor submit file. This is all it takes to specify the example "diamond" DAG. DAGMan will monitor the log file for each job and use it to determine what to do next.
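DAGMan can also retry failed nodes automatically. As a small sketch, the same file with a standard RETRY line added (the retry count here is arbitrary):

  # diamond.dag, with node C retried on failure
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D
  Retry C 3      # re-run node C up to 3 times before declaring it failed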

Submitting a DAG
To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:

  % condor_submit_dag diamond.dag

condor_submit_dag submits a Scheduler Universe job with DAGMan as the executable. Thus the DAGMan daemon itself runs as a Condor job, so you don't have to baby-sit it. Just like any other Condor job, you get fault tolerance in case the machine crashes or reboots, or if there's a network outage, and you're notified when it's done and whether it succeeded or failed.

  % condor_q
  -- Submitter: foo.bar.edu : <128.105.175.133:1027> : foo.bar.edu
   ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
   2456.0  user     3/8 19:22   0+00:00:02 R  0   3.2  condor_dagman -f -

Running a DAG
DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor-G based on the DAG dependencies.
[Diagram: DAGMan reads the .dag file and holds nodes A, B, C and D; first, job A is submitted alone to the Condor-G job queue.]

Running a DAG (cont’d)
DAGMan holds and submits jobs to the Condor-G queue at the appropriate times.
[Diagram: once job A completes successfully, jobs B and C are submitted to the Condor-G queue at the same time, while D continues to wait in DAGMan.]

Running a DAG (cont’d)
In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG.
If job C fails, DAGMan will wait until job B completes, and then will exit, creating a rescue file. Job D will not run. In its log, DAGMan will provide additional details of which node failed and why.

Recovering a DAG -- fault tolerance
Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG. Since jobs A and B have already completed, DAGMan will start by re-submitting job C.
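Exactly how the rescue file is picked up depends on the DAGMan version: recent versions name it after the original file (e.g. diamond.dag.rescue001) and use the most recent rescue file automatically when you simply re-run condor_submit_dag on the original DAG, while older versions expect you to pass the rescue file itself on the command line. A sketch for the newer behaviour:

  % condor_submit_dag diamond.dag    # DAGMan notices and uses the latest rescue file, if one exists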

Recovering a DAG (cont’d)
Once that job completes, DAGMan will continue the DAG as if the failure never happened.

Finishing a DAG
Once the DAG is complete, the DAGMan job itself is finished, and exits.

We have seen how Condor:
- monitors submitted jobs and reports progress
- implements your policy on the execution order of the jobs
- keeps a log of your job activities

OSG & job submissions
- OSG sites present interfaces allowing remotely submitted jobs to be accepted, queued and executed locally.
- OSG supports the Condor-G job submission client, which interfaces to the Globus GRAM interface at the executing site.
- Job managers at the back end of the GRAM gatekeeper support job execution by local Condor, LSF, PBS, or SGE batch systems.
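Because GRAM hides the back-end batch system, the grid-universe submit file shown earlier works against any of these; only the contact string changes (the host and jobmanager names below are illustrative):

  Universe      = grid
  grid_resource = gt2 osgce.cs.clemson.edu/jobmanager-pbs
  Executable    = /bin/hostname
  Output        = hostname.out
  Log           = hostname.log
  Queue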