Grid Compute Resources and Job Management


1 Grid Compute Resources and Job Management

2 How do we access the grid?
Command line, with tools that you'll use
Specialised applications, e.g. a program that processes images and runs on the grid as an inbuilt feature
Web portals, e.g. I2U2, SIDGrid

3 Grid Middleware glues the grid together
A short, intuitive definition: the software that glues together different clusters into a grid, taking into consideration the socio-political side of things (such as common policies on who can use what, how much, and what for)

4 Grid middleware
Offers services that couple users with remote resources through resource brokers:
Remote process management
Co-allocation of resources
Storage access
Information
Security
QoS

5 Globus Toolkit: the de facto standard for grid middleware
Developed at ANL & UChicago (Globus Alliance)
Open source
Adopted by different scientific communities and industries
Conceived as an open set of architectures, services and software libraries that support grids and grid applications
Provides services in major areas of distributed systems:
Core services
Data management
Security

6 GRAM Globus Resource Allocation Manager
GRAM provides a standardised interface for submitting jobs to LRMs.
Clients submit a job request to GRAM
GRAM translates it into something any LRM can understand
The same job request can be used for many different kinds of LRM
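
For example, a minimal sketch of a GRAM submission from the command line with the globus-job-run client; the gatekeeper hostname and jobmanager name here are placeholders:

  # Run /bin/hostname on a remote site through its GRAM gatekeeper;
  # the jobmanager-pbs suffix selects the PBS back-end at that site.
  globus-job-run gatekeeper.example.edu/jobmanager-pbs /bin/hostname -f

The same command works against a site whose jobmanager fronts Condor or LSF instead, which is the point of the standardised interface.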

7 Local Resource Managers (LRM)
Compute resources have a local resource manager (LRM) that controls:
Who is allowed to run jobs
How jobs run on a specific resource
Example policy: each cluster node can run one job. If there are more jobs, they must wait in a queue
LRMs allow nodes in a cluster to be reserved for a specific person
Examples: PBS, LSF, Condor

This component deals with job submission. Naturally, users want to be able to run their programs, for example data processing or simulation, and this section explains how that can be accomplished on grids. We can run our jobs on other computers on the grid to make use of those resources. The challenging part is that sites on the grid have different platforms, so a uniform way of accessing the grid must be deployed; this is exactly the purpose of the grid software stack, which tries to deal with some of those differences. On a particular site, a few components are involved in job submission. Closest to the hardware is a piece of software called a Local Resource Manager. It manages the local resource by deciding the order and location of jobs. In its simplest form, it makes sure only one user is using each compute node at once and maintains some kind of queue. More advanced LRMs can deal with reservations, prioritise between different users, and handle various other parameters. LRMs are not grid specific: an LRM manages jobs on a particular cluster and does not by itself accept jobs coming in from elsewhere. You need to, for example, log in directly to the cluster to submit your jobs.
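
As an illustration of submitting directly to an LRM, bypassing the grid layer, here is a minimal sketch of a PBS batch script; the job name, resource requests and file name are invented for illustration:

  #!/bin/sh
  #PBS -N hostname-test          # job name shown in the queue
  #PBS -l nodes=1                # ask the LRM for one node
  #PBS -l walltime=00:05:00      # and five minutes of run time
  /bin/hostname -f

  # Submitted directly on the cluster's head node with:
  #   qsub myjob.pbs

Note that you would have to log in to that particular cluster and know its batch system to do this; the grid layer described next removes that requirement.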

8 Job Management on a Grid
[Diagram: a user submits through GRAM to four sites on the Grid (Sites A to D), each with its own LRM: LSF, Condor, PBS, and plain fork.]

Each cluster has a local resource manager (LRM) on it. Each site can have a different local resource manager, chosen by the site administration. What we need now is to access these resources in a homogeneous way; this can be done by placing a new layer between the user (submission node) and each cluster. In the Globus world, this is accomplished by a piece of the Globus Toolkit called GRAM, the Grid Resource Allocation Manager. GRAM provides a common network interface to the various LRMs, so users can go to any site on the grid and make a job submission without really needing to know which LRM is being used. But GRAM doesn't manage the execution; it is up to the LRM to do that. Make sure you understand the distinction between the resource allocation manager (GRAM) and the LRM (for example Condor). GRAM and the LRM work together: GRAM gets the job to a cluster in a standard way for execution, and the LRM then executes it.

9 GRAM: given a job specification, it…
Creates an environment for the job
Stages files to and from the environment
Submits the job to a local resource manager
Monitors the job
Sends notifications of job state changes
Streams the job's stdout/stderr during execution
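
In pre-WS (GT2) GRAM, that job specification can be written in the Globus RSL language; a minimal sketch, with file names invented for illustration:

  &(executable=/bin/hostname)
   (arguments=-f)
   (stdout=hostname.out)
   (stderr=hostname.err)
   (count=1)

Clients such as globus-job-run construct RSL like this on your behalf and hand it to the site's gatekeeper.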

10 GRAM components
[Diagram: globus-job-run on the submitting machine (e.g. the user's workstation) contacts the site's Gatekeeper over the Internet; the Gatekeeper starts a Jobmanager, which passes the job to the LRM (e.g. Condor, PBS, LSF), and the LRM runs it on the worker nodes / CPUs.]

11 Condor is a software system that creates an HTC environment
Created at UW-Madison
Condor is a specialized workload management system for compute-intensive jobs
Detects machine availability
Harnesses available resources
Uses remote system calls to send R/W operations over the network
Provides powerful resource management by matching resource owners with consumers (broker)

Like other full-featured batch systems, Condor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor; Condor places them into a queue, chooses when and where to run them based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion. Condor is a batch job system that can take advantage of both dedicated and non-dedicated computers to run jobs. It focuses on high throughput rather than high performance, and provides a wide variety of features including checkpointing, transparent process migration, remote I/O, parallel programming with MPI, the ability to run large workflows, and more. Condor-G is designed to interact specifically with Globus and can provide a resource selection service across multiple grid sites. A site often uses the same workload management system for all of its resources, but the workload management systems in use across a grid often vary. Ideally, which workload management system ultimately runs a job should be transparent to the grid user. In addition to providing a suite of web services to submit, monitor and cancel jobs in a grid environment, Globus provides interface support for several common workload management systems and directions for developing an alternative interface to shield the user from system-specific detail.

12 How Condor works
Condor provides:
a job queueing mechanism
scheduling policy
priority scheme
resource monitoring, and
resource management
Users submit their serial or parallel jobs to Condor; Condor…
places them into a queue, …
chooses when and where to run the jobs based upon a policy, …
carefully monitors their progress, and …
ultimately informs the user upon completion.

13 Condor - features
Checkpoint & migration
Remote system calls
Able to transfer data files and executables across machines
Job ordering
Job requirements and preferences can be specified via powerful expressions
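
For instance, a requirements expression in a submit description file might look like the sketch below; the specific thresholds are invented for illustration:

  # Only match machines running Linux with at least 1 GB of memory,
  # and prefer the ones with the most memory.
  Requirements = (OpSys == "LINUX") && (Memory >= 1024)
  Rank         = Memory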

14 Condor lets you manage a large number of jobs.
Specify the jobs in a file and submit them to Condor
Condor runs them and keeps you notified of their progress
Mechanisms to help you manage huge numbers of jobs (1000's), all the data, etc.
Handles inter-job dependencies (DAGMan)
Users can set Condor's job priorities
Condor administrators can set user priorities
Condor can do this as:
the local resource manager (LRM) on a compute resource
a grid client submitting to GRAM (as Condor-G)

Commonly, when you submit a job to the grid, you do it as a large number of separate job submissions; for example, if you want to use 100 CPUs you might make a hundred job submissions, each one intended to use its own CPU. This can get hard to manage, so you can use a piece of software called Condor, which can (amongst many other things) manage large numbers of jobs for you. It can track their progress, restart them if they fail, and handle dependencies between jobs, so that you can say that some jobs must not start until some other job has finished.
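
A minimal sketch of queueing 100 independent jobs from one submit description file, using $(Process) to give each job its own input and output; the executable and file names are invented for illustration:

  Universe   = vanilla
  Executable = process_image
  Arguments  = input.$(Process)
  Output     = out.$(Process)
  Error      = err.$(Process)
  Log        = jobs.log
  Queue 100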

15 Condor-G is the grid job management part of Condor.
Use Condor-G to submit to resources accessible through a Globus interface.

16 Condor-G … does whatever it takes to run your jobs, even if …
The gatekeeper is temporarily unavailable
The job manager crashes
Your local machine crashes
The network goes down

17 Remote Resource Access: Condor-G + Globus + Condor
[Diagram: jobs myjob1 through myjob5 sit in a Condor-G queue at Organization A; Condor-G speaks the Globus GRAM protocol to GRAM at Organization B, which submits the jobs to the local LRM.]

18 Condor-G: Access non-Condor Grid resources
Globus middleware deployed across the entire Grid
remote access to computational resources
dependable, robust data transfer
Condor job scheduling across multiple resources
strong fault tolerance with checkpointing and migration
layered over Globus as a "personal batch system" for the Grid

19 Four Steps to Run a Job with Condor
Make your job batch-ready
Create a submit description file
Run condor_submit

There are several steps involved in running a Condor job. These choices tell Condor how, when and where to run the job, and describe exactly what you want to run.

20 1. Make your job batch-ready
Must be able to run in the background: no interactive input, windows, GUI, etc.
Condor is designed to run jobs as a batch system, with pre-defined inputs for jobs
Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices
Organize data files

Having a job that expects GUI input from the user may not work well.
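
A minimal sketch of how files stand in for the keyboard and screen in a submit description file; the file names are illustrative:

  # STDIN is read from in.txt; STDOUT and STDERR are captured to files.
  Input  = in.txt
  Output = out.txt
  Error  = err.txt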

21 2. Create a Submit Description File
A plain ASCII text file
Condor does not care about file extensions
Tells Condor about your job:
Which executable to run and where to find it
Which universe to use
Location of input, output and error files
Command-line arguments, if any
Environment variables
Any special requirements or preferences

You need to create a submit description file that tells Condor about your job. Specify things like which universe to use, the executable to run and where to find it, the location of inputs and a place to put the outputs, any arguments to the executable, environment setup, preferences, and so on. Condor then uses this submit description file to make a ClassAd for your job (condor_submit builds the ClassAd from it).

22 Simple Submit Description File
# myjob.submit file
Universe = grid
grid_resource = gt2 osgce.cs.clemson.edu/jobmanager-fork
Executable = /bin/hostname
Arguments = -f
Log = /tmp/benc-grid.log
Output = grid.out
Error = grid.error
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Queue

Here is an example of a very simple submit description file. You may include comments in this file as well. You can see that the syntax is typically "key = value". The "queue" command is also important: it tells Condor to submit this job into the job queue.

23 4. Run condor_submit
You give condor_submit the name of the submit file you have created:
condor_submit my_job.submit
condor_submit parses the submit file.
The condor_submit command converts your submit file into a ClassAd and places it into the job queue.

24 Details
Lots of options are available in the submit file
Commands to watch the queue, the state of your pool, and lots more
You'll see much of this in the hands-on exercises.

The example submit files we looked at contain merely a subset of the many options available. Once you submit your job, Condor provides several tools to watch the status of your job, the state of the pool, and much more. You'll see more of how this works during the tutorial.

25 Other Condor commands
condor_q – show status of job queue
condor_status – show status of compute nodes
condor_rm – remove a job
condor_hold – hold a job temporarily
condor_release – release a job from hold
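
A few usage sketches of these commands; the job ID 42.0 is made up for illustration:

  condor_q                 # list your jobs in the queue
  condor_status            # list the machines in the pool and their state
  condor_rm 42.0           # remove job 42.0 from the queue
  condor_hold 42.0         # put job 42.0 on hold
  condor_release 42.0      # let the held job run again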

26 Submitting more complex jobs
Express dependencies between jobs → WORKFLOWS
We would like the workflow to be managed even in the face of failures.

You have seen how to submit a simple job. Now, what happens if the job you'd like to run is more complex? What if you want to establish dependencies between jobs? How can you ensure that your job is resilient to failures?

27 DAGMan Directed Acyclic Graph Manager
DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you (e.g., "Don't run job B until job A has completed successfully").
In the simplest case, DAGMan manages the dependencies between your Condor jobs.

28 What is a DAG?
A DAG is the data structure used by DAGMan to represent these dependencies.
Each job is a "node" in the DAG.
Each node can have any number of "parent" or "child" nodes, as long as there are no loops!
[Diagram: the "diamond" DAG, with Job A at the top, Jobs B and C as its children, and Job D as the child of both B and C.]

A DAG is the best data structure to represent a workflow of jobs with dependencies. Children may not run until their parents have finished; this is why the graph is a directed graph, there's a direction to the flow of work. In this example, called a "diamond" DAG, job A must run first; when it finishes, jobs B and C can run together; when they are both finished, D can run; and when D is finished the DAG is finished. Loops are prohibited because they would lead to deadlock: in a loop, neither node could run until the other finished, so neither would start. This restriction is what makes the graph acyclic.

29 Defining a DAG
A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D

Each node will run the Condor job specified by its accompanying Condor submit file.

This is all it takes to specify the example "diamond" DAG. DAGMan will watch the log file for each job and use it to determine what to do next.
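
For illustration, each node's submit file (a.sub, b.sub, …) is an ordinary Condor submit description file; a minimal sketch of what a.sub might contain, with the executable and file names invented:

  # a.sub: hypothetical submit file for node A
  Universe   = vanilla
  Executable = do_stage_a
  Output     = a.out
  Error      = a.err
  Log        = diamond.log
  Queue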

30 Submitting a DAG
To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon which begins running your jobs:

% condor_submit_dag diamond.dag

condor_submit_dag submits a Scheduler Universe job with DAGMan as the executable. Thus the DAGMan daemon itself runs as a Condor job, so you don't have to baby-sit it. Just like any other Condor job, you get fault tolerance in case the machine crashes or reboots, or if there's a network outage, and you're notified when it's done and whether it succeeded or failed. Running condor_q afterwards shows the DAGMan job itself in the queue as a running condor_dagman process.

31 Running a DAG
DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor-G based on the DAG dependencies.
[Diagram: DAGMan reads the .dag file (nodes A, B, C, D) and feeds the Condor-G job queue; job A sits in the queue.]
First, job A will be submitted alone…

32 Running a DAG (cont'd)
DAGMan holds & submits jobs to the Condor-G queue at the appropriate times.
[Diagram: job A has completed; jobs B and C now sit in the Condor-G job queue, with D still waiting in the DAG.]
Once job A completes successfully, jobs B and C will be submitted at the same time…

33 Running a DAG (cont'd)
In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG.
[Diagram: job C fails (marked X); DAGMan writes a Rescue File, and job D never enters the Condor-G job queue.]
If job C fails, DAGMan will wait until job B completes, and then will exit, creating a rescue file. Job D will not run. In its log, DAGMan will provide additional details of which node failed and why.

34 Recovering a DAG -- fault tolerance
Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Diagram: DAGMan reads the rescue file; jobs A and B are marked done, and job C goes back into the Condor-G job queue.]
Since jobs A and B have already completed, DAGMan will start by re-submitting job C.
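
A sketch of how recovery is typically started; the rescue file name and exact behaviour depend on the DAGMan version, so treat this as illustrative:

  # Hypothetical recovery step: resubmit using the rescue file DAGMan wrote.
  # (Newer DAGMan versions pick up the rescue file automatically when the
  # original .dag file is resubmitted.)
  condor_submit_dag diamond.dag.rescue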

35 Recovering a DAG (cont’d)
Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Diagram: job C finishes, and job D is submitted to the Condor-G job queue.]

36 Finishing a DAG
Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Diagram: all four jobs A, B, C and D have completed; the Condor-G job queue is empty.]

37 We have seen how Condor:
… monitors submitted jobs and reports progress
… implements your policy on the execution order of the jobs
… keeps a log of your job activities

38 OSG & job submissions
OSG sites present interfaces allowing remotely submitted jobs to be accepted, queued and executed locally.
OSG supports the Condor-G job submission client, which interfaces to the Globus GRAM interface at the executing site.
Job managers at the back end of the GRAM gatekeeper support job execution by local Condor, LSF, PBS, or SGE batch systems.
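
To make this concrete, a hedged sketch of a Condor-G submit file targeting an OSG compute element whose gatekeeper hands jobs to a PBS batch system; the hostname is a placeholder, and the structure mirrors the jobmanager-fork example on slide 22:

  # osgjob.submit: same structure as before, different back-end LRM
  Universe      = grid
  grid_resource = gt2 ce.example.org/jobmanager-pbs
  Executable    = /bin/hostname
  Arguments     = -f
  Log           = osgjob.log
  Output        = osgjob.out
  Error         = osgjob.err
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  Queue

Only the grid_resource line changes when the target site runs a different batch system, which is what keeps the back-end LRM transparent to the grid user.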

