Grid Compute Resources and Job Management


2 Grid middleware
Grid middleware "glues" all the pieces together. It offers services that couple users with remote resources through resource brokers:
- Remote process management
- Co-allocation of resources
- Storage access
- Information
- Security
- QoS

3 Terms
- Globus GRAM
- Condor
- Condor-G

4 Local Resource Managers (LRM)
Compute resources have a local resource manager (LRM) that controls:
- Who is allowed to run jobs
- How jobs run on a specific resource
- The order and location of jobs
Example policy:
- Each cluster node can run one job.
- If there are more jobs, they must wait in a queue.
Examples: PBS, LSF, Condor
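For illustration (a sketch, not from these slides), a job handed directly to one such LRM, PBS, is described with a batch script whose #PBS comment lines carry the resource request:

```
#PBS -N myjob
#PBS -l nodes=1,walltime=00:10:00
# The script body runs on the allocated node
/bin/hostname
```

Each LRM has its own job description format like this; that variety is exactly what GRAM, introduced next, abstracts away.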

5 GRAM
Globus Resource Allocation Manager
GRAM provides a standardised interface for submitting jobs to LRMs:
- Clients submit a job request to GRAM
- GRAM translates it into something the LRM can understand
- The same job request can be used with many different kinds of LRM
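Concretely, the job request that GRAM translates is written in its Resource Specification Language (RSL); a minimal sketch (filenames are illustrative):

```
& (executable = "/bin/hostname")
  (stdout = "hostname.out")
  (count = 1)
```

GRAM maps attributes such as executable, stdout, and count onto the equivalent directives of whichever LRM runs behind it.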

6 Job Management on a Grid
(Diagram: a user submits jobs through GRAM to grid sites A-D, each running a different LRM: Condor, PBS, LSF, and fork.)

7 GRAM's abilities
Given a job specification, GRAM:
- Creates an environment for the job
- Stages files to and from the environment
- Submits the job to a local resource manager
- Monitors the job
- Sends notifications of job state changes
- Streams the job's stdout/stderr during execution

8 GRAM components
(Diagram: a submitting machine, e.g. the user's workstation, runs globus-job-run and contacts the Gatekeeper over the Internet; the Gatekeeper starts a Jobmanager, which submits the job to the LRM, e.g. Condor, PBS, or LSF, which runs it on the worker nodes/CPUs.)

9 Submitting a job with GRAM
The globus-job-run command:
$ globus-job-run grid07.uchicago.edu /bin/hostname
- Runs '/bin/hostname' on the resource grid07.uchicago.edu
- We don't care which LRM is used on 'grid07'; this command works with any LRM.

10 Condor
Condor is a specialized workload management system for compute-intensive jobs. It is a software system, created at UW-Madison, that creates a high-throughput computing (HTC) environment:
- Detects machine availability
- Harnesses available resources
- Provides powerful resource management by matching resource owners with consumers (acting as a broker)

11 How Condor works
Condor provides:
- a job queueing mechanism
- a scheduling policy
- a priority scheme
- resource monitoring, and
- resource management.
Users submit their serial or parallel jobs to Condor. Condor places them into a queue, chooses when and where to run them based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.

12 Condor lets you manage a large number of jobs
- Specify the jobs in a file and submit them to Condor; Condor runs them and keeps you notified of their progress
- Mechanisms to help you manage huge numbers of jobs (1000s), all the data, etc.
- Handles inter-job dependencies (DAGMan) - workflows
- Users can set Condor's job priorities; Condor administrators can set user priorities
Condor can do this as:
- A local resource manager (LRM) on a compute resource
- A grid client submitting to GRAM (as Condor-G)
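The "huge numbers of jobs" mechanism is a single submit file that queues many instances at once; a sketch (program and file names are hypothetical):

```
# sweep.submit (hypothetical)
Universe   = vanilla
Executable = analysis
Arguments  = $(Process)
Output     = out.$(Process)
Error      = err.$(Process)
Log        = sweep.log
Queue 100
```

The Queue 100 line creates 100 jobs, and the $(Process) macro (0-99) gives each one distinct arguments and output files.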

13 Condor-G
Condor-G is the job management part of Condor.
- Hint: Install Condor-G to submit to resources accessible through a Globus interface.
Condor-G does not create a grid service; it only deals with using remote grid services.

14 Remote Resource Access: Condor-G + Globus + Condor
(Diagram: Condor-G in Organization A holds a queue of jobs, myjob1 through myjob5, and uses the Globus GRAM protocol to contact GRAM in Organization B, which submits the jobs to its LRM.)

15 Four Steps to Run a Job with Condor
You specify to Condor how, when, and where to run the job, and describe exactly what you want to run:
1. Choose a Universe for your job
2. Make your job batch-ready
3. Create a submit description file
4. Run condor_submit

16 Simple Submit Description File
# myjob.submit file
# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe = grid
Executable = analysis
Log = my_job.log
Queue
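A grid-universe job also has to name the remote resource it targets; in current HTCondor syntax (a sketch, with a hypothetical hostname) that is a grid_resource line giving the GRAM contact:

```
# myjob.submit (sketch; hostname is hypothetical)
Universe      = grid
grid_resource = gt2 grid07.uchicago.edu/jobmanager-pbs
Executable    = analysis
Log           = my_job.log
Queue
```

Here gt2 selects the pre-web-services GRAM protocol, and the /jobmanager-pbs suffix picks which LRM the remote Gatekeeper hands the job to.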

17 Run condor_submit
Give condor_submit the name of the submit file you have created:
condor_submit my_job.submit
condor_submit parses the submit file.

18 Another Submit Description File
# Example condor_submit input file
Universe = grid
Executable = /home/wright/condor/my_job.condor
Input = my_job.stdin
Output = my_job.stdout
Error = my_job.stderr
Arguments = -arg1 -arg2
InitialDir = /home/wright/condor/run_1
Queue

19 Other Condor commands
- condor_q – show status of job queue
- condor_status – show status of compute nodes
- condor_rm – remove a job
- condor_hold – hold a job temporarily
- condor_release – release a job from hold

20 Submitting more complex jobs
We would like to express dependencies between jobs:
- WORKFLOWS
And we would like the workflow to be managed even in the face of failures.

21 Want other scheduling possibilities? Use the Scheduler Universe
- In addition to VANILLA, another job universe is the Scheduler Universe.
- Scheduler Universe jobs run on the submitting machine and serve as a meta-scheduler.
- Condor's Scheduler Universe lets you set up and manage job workflows.
- The DAGMan meta-scheduler is included; DAGMan manages these jobs.

22 DAGMan
Directed Acyclic Graph Manager
- Provides a workflow engine that manages Condor jobs organized as DAGs (representing task precedence relationships)
- Focuses on the scheduling and execution of long-running jobs
DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you (e.g., "Don't run job B until job A has completed successfully.")

23 What is a DAG?
- A DAG is the data structure used by DAGMan to represent these dependencies.
- Each job is a "node" in the DAG.
- Each node can have any number of "parent" or "child" nodes, as long as there are no loops!
(Diagram: Job A is the parent of Job B and Job C; Job B and Job C are both parents of Job D.)

24 Defining a DAG
A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
Each node will run the Condor job specified by its accompanying Condor submit file.
(Diagram: the diamond-shaped DAG, with Job A at the top, Job B and Job C in the middle, and Job D at the bottom.)
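Each .sub file named in the DAG is an ordinary submit description file of the kind shown earlier; a sketch of what a.sub might contain (contents hypothetical):

```
# a.sub (hypothetical contents)
Universe   = vanilla
Executable = a_program
Output     = a.out
Error      = a.err
Log        = diamond.log
Queue
```

DAGMan itself imposes no special format on the node submit files; it simply runs condor_submit on each one when its dependencies are satisfied.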

25 Submitting a DAG
To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
% condor_submit_dag diamond.dag
condor_submit_dag submits a Scheduler Universe job with DAGMan as the executable. Thus the DAGMan daemon itself runs as a Condor job, so you don't have to baby-sit it.

26 Running a DAG
DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor-G based on the DAG dependencies.
(Diagram: DAGMan reads the .dag file and submits job A to the Condor-G job queue; B, C, and D wait.)

27 Running a DAG (cont'd)
DAGMan holds and submits jobs to the Condor-G queue at the appropriate times.
(Diagram: after A completes, DAGMan submits B and C to the queue; D still waits.)

28 Running a DAG (cont'd)
In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG.
(Diagram: one of the middle jobs fails and is marked with an X; DAGMan writes a rescue file; D cannot run.)

29 Recovering a DAG: fault tolerance
Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
(Diagram: using the rescue file, DAGMan resubmits the failed job to the queue.)

30 Recovering a DAG (cont'd)
Once that job completes, DAGMan will continue the DAG as if the failure never happened.
(Diagram: the recovered job completes; DAGMan submits D.)

31 Finishing a DAG
Once the DAG is complete, the DAGMan job itself is finished and exits.
(Diagram: all four jobs have completed; the queue is empty.)

32 We have seen how Condor:
- monitors submitted jobs and reports progress
- implements your policy on the execution order of the jobs
- keeps a log of your job activities

33 … Now go to the Lab part …

Acknowledgments: This presentation is based on "Grid Resources and Job Management" by Jaime Frey and Becky Gietzel, Condor Project, U. Wisconsin-Madison.