
6d.1 Schedulers and Resource Brokers ITCS 4146/5146, UNC-Charlotte, B. Wilkinson, Feb 12, 2007. Topics: local schedulers; Condor.

6d.2 Scheduler Job manager submits jobs to scheduler. Scheduler assigns work to resources to achieve specified time requirements.

6d.3 Scheduling (figure from "Introduction to Grid Computing with Globus," IBM Redbooks)

6d.4 Executing GT4 jobs Globus supports three job execution modes: interactive, interactive-streaming, and batch.
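A minimal sketch of invoking each mode with the GT4 client (the file names job.xml and job.epr are placeholders):

# Interactive: submit the job and wait for it to complete
globusrun-ws -submit -f job.xml

# Interactive-streaming: additionally stream the job's stdout/stderr back
globusrun-ws -submit -s -f job.xml

# Batch: return immediately; save an endpoint reference (EPR) to poll later
globusrun-ws -submit -batch -o job.epr -f job.xml
globusrun-ws -status -j job.epr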

6d.5 GT4 "Fork" Scheduler Attempts to execute the job immediately. Provided for starting and controlling a job on the local host when the job needs no specially loaded software or other requirements.

6d.6 Batch scheduling "Batch" is a term from the early days of computing, when one submitted a deck of punched cards as the program and came back after the program had been run, perhaps overnight.

6d.7 Relationship between GT4 GRAM and a local scheduler (diagram after I. Foster): a client contacts the GRAM services in the GT4 Java container; a GRAM adapter performs local job control, passing job functions to the local scheduler, which runs the user job on a compute element. Various local schedulers are possible.

6d.8 globusrun-ws -Ft flag Selects the scheduler. Default: Fork, for single jobs. Other schedulers have to be added separately and be supported by a GRAM "adapter."

6d.9 Scheduler adapters included in GT4: PBS (Portable Batch System), Condor, LSF (Load Sharing Facility). A third-party adapter is provided for SGE (Sun Grid Engine).

6d.10 globusrun-ws -Ft flag examples globusrun-ws -Ft Condor (used on coit-grid02 to coit-grid04) globusrun-ws -Ft SGE (used on coit-grid01 and toralds.cis.uncw.edu)
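Putting it together, a complete submission to a Condor-managed resource might be sketched as follows (the file name myjob.xml is a placeholder):

globusrun-ws -submit -Ft Condor -f myjob.xml

where myjob.xml is a minimal GT4 job description, e.g.:

<job>
    <executable>/bin/hostname</executable>
</job>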

6d.11 (Local) Scheduler Issues Distribute jobs based on load and characteristics of machines, available disk storage, network characteristics, ... (runtime scheduling!). Arrange data in the right place (staging): –Data replication and movement as needed –Data error checking

6d.12 Scheduler Issues (continued) Performance –Error checking, checkpointing –Job and progress monitoring –QoS (quality of service) –Cost (an area considered by the Nimrod-G scheduler) Security –Need to authenticate and authorize remote users for job submission Fault tolerance Automatic scheduling

6d.13 Scheduling policies First-in, first-out Favor certain types of jobs: shortest job first, smallest (or largest) memory first, short- (or long-) running job first Fair sharing or priority to certain users Dynamic policies –Depending upon time of day and load –Custom, preemptive, process migration

6d.14 Advance Reservation Requesting actions at times in the future. "A service level agreement in which the conditions of the agreement start at some agreed-upon time in the future" From: "The Grid 2, Blueprint for a New Computing Infrastructure," I. Foster and C. Kesselman, editors, Morgan Kaufmann, 2004.

6d.15 Resource Broker "A scheduler that optimizes the performance of a particular resource. Performance may be measured by such criteria as fairness (to ensure that all requests for the resources are satisfied) or utilization (to measure the amount of the resource used)." From: "The Grid 2, Blueprint for a New Computing Infrastructure," I. Foster and C. Kesselman, editors, Morgan Kaufmann, 2004.

6d.16 Scheduler/Resource Broker Examples We have used Condor and Sun Grid Engine: Condor/Condor-G –Used in Fall 2004 course and this year in assignment 3. Sun Grid Engine –Used in Fall 2005 course

6d.17 Condor First developed at the University of Wisconsin-Madison in the mid-1980s to convert a collection of distributed workstations and clusters into a high-throughput computing facility. Key concept: using the wasted compute power of idle workstations.

6d.18 Condor Converts collections of distributed workstations and dedicated clusters into a distributed high-throughput computing facility.

6d.19 Uses Consider following scenario: –I have a simulation that takes two hours to run on my high-end computer –I need to run it 1000 times with slightly different parameters each time. –If I do this on one computer, it will take at least 2000 hours (or about 3 months) From: “Condor: What it is and why you should worry about it,” by B. Beckles, University of Cambridge, Seminar, June 23, 2004

6d.20 –Suppose my department has 100 PCs like mine that are mostly sitting idle overnight (say 8 hours a day). –If I could use them when their legitimate users are not using them, so that I do not inconvenience them, I could get about 800 CPU hours/day. –This is an ideal situation for Condor. I could do my simulations in 2.5 days. From: “Condor: What it is and why you should worry about it,” by B. Beckles, University of Cambridge, Seminar, June 23, 2004
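The arithmetic: 1000 runs × 2 hours = 2000 CPU-hours of work; 100 PCs × 8 idle hours = 800 CPU-hours per day; 2000 / 800 = 2.5 days.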

6d.21 Condor Features Include: –Resource finder –Batch queue manager –Scheduler –Checkpoint/restart –Process migration

6d.22 Intended to run jobs even if: Machines crash Disk space is exhausted Software is not installed Machines are needed by others Machines are managed by others Machines are far away

6d.23 How does Condor work? A collection of machines running Condor is called a pool. Individual pools can be joined together in a process called flocking. From: "Condor: What it is and why you should worry about it," by B. Beckles, University of Cambridge, Seminar, June 23, 2004

6d.24 Machine Roles Machines have one or more of 4 roles: –Central manager –Submit machine (Submit host) –Execution machine (Execute host) –Checkpoint server

6d.25 Central Manager Resource broker for a pool. Keeps track of which machines are available, what jobs are running, negotiates which machine will run which job, etc. Only one central manager per pool.

6d.26 Submit Machine Machine which submits jobs to pool. Must be at least one submit machine in a pool, and usually more than one.

6d.27 Execute Machine Machine on which jobs can be run. Must be at least one execute machine in a pool, and usually more than one.

6d.28 Checkpoint Server Machine which stores all checkpoint files produced by jobs that checkpoint. There can be at most one checkpoint server in a pool, and having one is optional.

6d.29 Possible Configuration A central manager. Some machines that can be only submit hosts. Some machines that can be only execute hosts. Some machines that can be both submit and execute hosts.

6d.30 (diagram: an example pool configuration)

6d.31 Types of Jobs Jobs are classified according to the environment ("universe") provided to them. Currently seven environments: –Standard –Vanilla –PVM –MPI –Globus –Java –Scheduler

6d.32 Standard For jobs compiled with the Condor libraries. Allows checkpointing and remote system calls. Must be single-threaded. Not available under Windows.

6d.33 Checkpointing Certain jobs can checkpoint, both periodically for safety and when interrupted. If a checkpointed job is interrupted, it resumes from the last checkpointed state when it starts again. Generally no change to source code is needed - just relink with Condor's Standard Universe support library.
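A sketch of the relinking step using Condor's condor_compile wrapper (the program and file names are placeholders):

# Relink an ordinary C program against the Standard Universe library
condor_compile gcc -o myProg myProg.c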

6d.34 Vanilla For jobs that cannot be compiled with Condor libraries, and for shell scripts and Windows batch files. No checkpointing or remote system calls.

6d.35 PVM For PVM programs. MPI For MPI programs (MPICH). PVM and MPI are both message-passing libraries, typically used on local clusters of computers. MPI can also be used in grid computing - we will talk about this later in the course.

6d.36 Globus For submitting jobs to resources managed by Globus (version 2.2 and higher). Java For Java programs (written for the Java Virtual Machine). Scheduler Used with DAG-scheduled jobs; see later.

6d.37 Submitting a job Job submitted to a "submit host" using the condor_submit command. Job described in a "submit description" file, which includes details such as those given in an RSL file in Globus, i.e. the name of the executable, arguments, etc.

6d.38 Condor Submit Description File Example Describes the job to Condor. Used with the condor_submit command.

# This is a comment, condor submit file
Universe = vanilla
Executable = /home/abw/condor/myProg
Input = myProg.stdin
Output = myProg.stdout
Error = myProg.stderr
Arguments = -arg1 -arg2
InitialDir = /home/abw/condor/assignment4
Queue
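Typical usage once this file (say submit.myProg) exists, sketched with standard Condor commands:

condor_submit submit.myProg    # submit; prints the assigned cluster number
condor_q                       # inspect the queue
condor_rm 26.2                 # remove job 26.2 (cluster 26, process 2) if needed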

6d.39 Submitting Multiple Jobs The submit file can specify multiple jobs - for example, Queue 500 will submit 500 jobs at once. Condor calls a group of jobs submitted together a cluster. Each job within a cluster is called a process. A Condor job ID is the cluster number, a period, and the process number, for example 26.2. A single job also forms a cluster, but with a single process (process 0).
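A common pattern, sketched below, uses Condor's $(Process) macro so that each job in the cluster reads and writes its own files (paths are placeholders):

Universe = vanilla
Executable = /home/abw/condor/myProg
Input = input.$(Process)
Output = output.$(Process)
Error = error.$(Process)
Queue 500
# jobs consume input.0 ... input.499 and produce output.0 ... output.499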

6d.40 Submitting a job with requirements and preferences Done using Condor's "ClassAd" mechanism, which may include: –What it requires –What it desires –What it prefers, and –What it will accept These details are first given in the submit description file.

6d.41 condor-submit command creates a “ClassAd” from the submit description file, which is then used in ClassAd matchmaking mechanism. Command: condor_submit submit.prog1 ClassAd file submit description file

6d.42 Specifying Requirements A C/Java-like Boolean expression that must evaluate to TRUE for a match.

# This is a comment, condor submit file
Universe = vanilla
Executable = /home/abw/condor/myProg
InitialDir = /home/abw/condor/assignment4
Requirements = Memory >= 512 && Disk > ...
Queue 500
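Preferences (rather than hard requirements) can be expressed in the same file with a Rank expression; a minimal sketch:

Requirements = Memory >= 512
Rank = Memory
# among machines satisfying Requirements, prefer those with the most memory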

6d.43 ClassAd Matchmaking Used to ensure the job is done according to the constraints of users and owners. Example of user constraints: "I need a Pentium IV with at least 512 Mbytes of RAM and a speed of at least 3.8 GHz." Example of machine owner constraints: "Never run jobs owned by Fred."
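The user constraint might be sketched as ClassAd expressions (illustrative only: Condor machine ads expose attributes such as Arch, Memory, and KFlops rather than a clock speed in GHz):

Requirements = (Arch == "INTEL") && (Memory >= 512)
Rank = KFlops
# KFlops, a floating-point benchmark figure, stands in for raw clock speed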

6d.44 ClassAd Matchmaking Steps 1. Agents (jobs) and resources (computers) advertise their characteristics and requirements in "classified advertisements." 2. The matchmaker scans the ClassAds and creates pairs that satisfy each other's constraints and preferences. 3. The matchmaker informs both parties of the match. 4. The agent and resource make contact.

6d.45 (diagram: each job publishes a job ClassAd and each machine publishes a machine ClassAd; the matchmaker pairs a job with a matching machine)

6d.46 Job ClassAd Example

[
MyType = "Job"
TargetType = "Machine"
Requirements = ((other.Arch == "INTEL" && other.OpSys == "LINUX") && other.Disk > my.DiskUsage)
DiskUsage = 6000
]

DiskUsage = 6000 means 6 MB (the attribute is in KB). The Requirements statement must evaluate to true for a match.

6d.47 Machine ClassAd Example

[
MyType = "Machine"
TargetType = "Job"
Machine = "coit-grid01.uncc.edu"
Requirements = ((LoadAvg <= ...) && (KeyboardIdle > (15 * 60)))
Arch = "INTEL"
OpSys = "LINUX"
Disk = ...
]

The KeyboardIdle condition requires the keyboard to have been idle for more than 15 minutes; the LoadAvg condition requires a low load average.

6d.48 ClassAd's Rank Statement Can be used in the job ClassAd for selection between compatible machines - the highest rank is chosen. The Rank expression should evaluate to a floating-point number. Example: Rank = (Memory * 10000) + KFlops (KFlops reflects machine speed)

6d.49 Rank Statement Can also be used in the machine ClassAd in matchmaking. Example: Rank = (other.Department == self.Department) where Department is defined in the job ClassAd, say: Department = "Computer Science"

6d.50 Using Rank in the Machine ClassAd

Job ClassAd:
[
MyType = "Job"
TargetType = "Machine"
...
Department = "Computer Science"
...
]

Machine ClassAd:
[
MyType = "Machine"
TargetType = "Job"
...
Rank = (other.Department == self.Department)
...
]

6d.51 Directed Acyclic Graph Manager (DAGMan) A meta-scheduler. Allows one to specify dependencies between Condor jobs.

6d.52 Example "Do not run Job B until Job A has completed successfully." Especially important for jobs that work together (as in grid computing).

6d.53 Directed Acyclic Graph (DAG) A data structure used to represent dependencies. A directed graph with no cycles. Each job is a node in the DAG. Each node can have any number of parents and children, as long as there are no loops (hence acyclic).

6d.54 DAG (diagram: Job A at the top; Jobs B and C below it; Job D at the bottom) Do job A. Do jobs B and C after job A has finished. Do job D after both jobs B and C have finished.

6d.55 Defining a DAG Defined by a .dag file listing each of the nodes and their dependencies. Each "Job" statement gives an abstract job name (say A) and a submit file (say a.condor). A PARENT...CHILD statement describes the relationship between two or more jobs. Other statements are available; a couple are sketched below.
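For example (a sketch; the script names are placeholders), a node can be retried on failure and wrapped with pre/post scripts:

# in the .dag file
Job A a.sub
Retry A 3
Script PRE A stage_in.sh
Script POST A check_results.sh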

6d.56 Example

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D

(This defines the diamond DAG shown earlier: A at the top, B and C in the middle, D at the bottom.)

6d.57 To start a DAG, use the condor_submit_dag command with the .dag file: condor_submit_dag diamond.dag condor_submit_dag submits a Scheduler Universe job with DAGMan as the executable.

6d.58 Running a DAG DAGMan acts as a scheduler managing the submission of jobs to Condor based upon DAG dependencies. DAGMan holds and submits jobs to Condor queue at appropriate times.

6d.59 Job Failures DAGMan continues until it cannot make further progress, then creates a rescue file holding the current state of the DAG. When the failed job is ready to be re-run, the rescue file is used to restore the prior state of the DAG.
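A sketch of recovery, assuming the rescue file for diamond.dag is written as diamond.dag.rescue (the naming convention of Condor versions of this era):

condor_submit_dag diamond.dag.rescue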

6d.60 Summary of Key Condor Features High-throughput computing using an opportunistic environment. Provides mechanisms for running jobs on remote machines. Matchmaking. Checkpointing. DAG scheduling.

6d.61 Quiz Give one reason why a scheduler or resource broker is used in conjunction with Globus: (a) Globus does not provide the ability to submit jobs. (b) Globus does not provide the ability to make advance reservations. (c) No reason whatsoever. (d) Globus does not provide the ability to transfer files.

6d.62 (a)There are no similarities. (b)They both provide a means of specifying command line arguments for the job. (c)They both provide a means of specifying whether a named user is allowed to execute a job. (d)They both provide a means of specifying machine requirements for a job. Identify which of the following are similarities between Condor ClassAd and Globus RSL (version 1 or 2). (There may be more than one similarity.)

6d.63 In the context of schedulers, what is meant by the term “Advance Reservation”? (a)Requesting an advance. (b)Submitting a more advanced job. (c)Move onto the next job. (d)Requesting actions at a future time.

6d.64 More Information Chapter 11, "Condor and the Grid," D. Thain, T. Tannenbaum, and M. Livny, in Grid Computing: Making the Global Infrastructure a Reality, F. Berman, A. J. G. Hey, and G. Fox, editors, John Wiley, 2003. "Condor-G: A Computation Management Agent for Multi-Institutional Grids," J. Frey, T. Tannenbaum, I. Foster, M. Livny, and S. Tuecke, Proc. 10th Int. Symp. on High Performance Distributed Computing (HPDC-10), Aug. 2001.

6d.65 Questions