
1 TeraGrid Advanced Scheduling Tools
Warren Smith
Texas Advanced Computing Center
wsmith at tacc.utexas.edu

2 Outline
Advance Reservation and Coscheduling
– GUR
Metascheduling
– MCP
– Condor-G Matchmaking
Batch Queue Prediction
– QBETS
– Karnak
Serial Computing
– MyCluster
– Condor Glideins
Urgent Computing
– SPRUCE
6/9/2008 Metascheduling and Co-Scheduling Tutorial

3 Advance Reservation
Reserve resources in advance
– e.g., 128 nodes on Queen Bee at 2pm tomorrow for 3 hours
– Reservation requests from users are handled in an automated manner
The user can then submit jobs to those reserved nodes
– Typically, jobs can be submitted as soon as the reservation is accepted
Variety of uses
– Classes or training, where quick turnaround is needed
– More efficient debugging and tuning
Needs to be supported by the batch scheduler
– The capability is available in almost every scheduler
– Currently only enabled on Queen Bee
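To make the shape of such a request concrete, here is a minimal sketch of an advance-reservation request as a data structure, with the overlap check a scheduler must perform. The class and its fields are illustrative only, not GUR's or any batch scheduler's actual API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Reservation:
    """A hypothetical advance-reservation request (illustrative only)."""
    system: str          # e.g. "Queen Bee"
    nodes: int           # number of nodes to reserve
    start: datetime      # reservation start time
    duration: timedelta  # how long the nodes are held

    @property
    def end(self) -> datetime:
        return self.start + self.duration

    def overlaps(self, other: "Reservation") -> bool:
        """Two reservations contend for nodes only if they are on the
        same system and their time windows intersect."""
        return (self.system == other.system
                and self.start < other.end
                and other.start < self.end)

# 128 nodes on Queen Bee at 2pm tomorrow for 3 hours
r = Reservation("Queen Bee", 128,
                datetime(2008, 6, 10, 14, 0), timedelta(hours=3))
```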

4 Coscheduling
Simultaneous access to resources on two or more systems
Typically implemented using multiple advance reservations
– e.g., 128 nodes on Queen Bee and 128 nodes on Lonestar at 2pm tomorrow for 3 hours
– Depends on cluster schedulers supporting advance reservations
Variety of uses
– Visualization of a simulation in progress
– Multi-system simulations (e.g. MPIg)
– Teaching and training

5 Grid Universal Remote (GUR)
GUR supports both advance reservation and coscheduling
– The only TeraGrid-supported tool for this
– (Not counting the reservation request form on the web site)
Command-line program that accepts a description file specifying:
– Candidate systems
– Total number of nodes needed
– Total duration
– Earliest start and latest end
Tries different configurations within the specified bounds
Client available on: Queen Bee, new SDSC system (future)
Can reserve nodes on: Queen Bee, Ranger (future), new SDSC system (future)
https://www.teragrid.org/web/user-support/gur
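The search GUR performs is internal to the tool, but the idea can be sketched: scan candidate start times between the user's earliest-start and latest-end bounds, and return the first system and start time where the request fits. The `is_free` predicate here stands in for a real availability query to a site's scheduler:

```python
from datetime import datetime, timedelta

def find_reservation(systems, nodes, duration, earliest, latest,
                     is_free, step=timedelta(minutes=30)):
    """Sketch of a GUR-style search: try candidate start times within
    the user's bounds across the candidate systems.  `is_free` is a
    placeholder for a real availability query, not GUR's actual API."""
    start = earliest
    while start + duration <= latest:
        for system in systems:
            if is_free(system, nodes, start, duration):
                return system, start
        start += step
    return None  # no configuration fits within the bounds

# Toy availability: Queen Bee is busy until 4pm, Lonestar is free.
def is_free(system, nodes, start, duration):
    if system == "Queen Bee":
        return start >= datetime(2008, 6, 10, 16, 0)
    return True

hit = find_reservation(["Queen Bee", "Lonestar"], 128,
                       timedelta(hours=3),
                       earliest=datetime(2008, 6, 10, 14, 0),
                       latest=datetime(2008, 6, 10, 22, 0),
                       is_free=is_free)
```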

6 Metascheduling
Users have jobs that can run on any of several TeraGrid systems
Help users select where to submit them
– Automatically, on a per-job basis
– Optimize execution of jobs
Manage the execution of the jobs

7 Master Control Program (MCP)
Submits multiple copies of a job to different systems
– Once one copy starts, the others are cancelled
Command-line programs
– The user specifies a submit script for each system a copy will be submitted to
  (the script expected by the batch scheduler on that system)
– MCP annotations describe how to access each system
  (placed in each submit script, or stored in a configuration file)
Client available on: Queen Bee
Can send jobs to: Abe, Lincoln, Queen Bee, Cobalt, Big Red, NSTG
https://www.teragrid.org/web/user-support/mcp
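The submit-everywhere, cancel-the-rest strategy can be sketched as follows. The `submit`, `started`, and `cancel` callables stand in for calls to each site's batch scheduler; they are placeholders, not MCP's actual interface:

```python
def run_first_available(systems, submit, started, cancel):
    """Sketch of MCP's strategy: submit a copy of the job to every
    candidate system, then cancel the remaining copies as soon as
    one of them starts."""
    job_ids = {system: submit(system) for system in systems}
    while True:
        for system, jid in job_ids.items():
            if started(system, jid):
                for other, other_jid in job_ids.items():
                    if other != system:
                        cancel(other, other_jid)
                return system  # where the job actually ran

# Toy schedulers: only Abe ever starts the job.
ids = iter(range(100))
submit = lambda system: next(ids)
started = lambda system, jid: system == "Abe"
cancel_log = []
cancel = lambda system, jid: cancel_log.append(system)

winner = run_first_available(["Abe", "Lincoln", "Queen Bee"],
                             submit, started, cancel)
```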

8 Condor-G
Condor atop Globus
Globus provides basic mechanisms
– Authentication & authorization
– File transfer
– Remote job execution & management
Condor provides more advanced mechanisms
– Improved user interface (batch scheduling)
  The user provides a submit script
  Typical batch scheduling commands:
    condor_status – information about systems available to Condor
    condor_submit – submit a job
    condor_q – observe jobs submitted to the Condor installation on this system
    condor_rm – cancel a job
– Fault tolerance with retries
– Improved scalability of Globus v2 job management
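A minimal Condor-G submit description of the kind `condor_submit` accepts might look like the following. The `grid` universe with a `gt2` grid resource directs the job through Globus to the remote site's batch scheduler; the gatekeeper hostname and file names here are placeholders:

```
universe      = grid
grid_resource = gt2 gatekeeper.example.teragrid.org/jobmanager-pbs
executable    = my_app
transfer_executable = true
output        = my_app.out
error         = my_app.err
log           = my_app.log
queue
```

After `condor_submit job.sub`, the job's progress can be followed with `condor_q` and the Globus-level events appear in `my_app.log`.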

9 Condor-G Matchmaking
Matchmaking is Condor's term for selecting a resource for a job
– A job provides requirements and preferences for the resources it can execute on
– A resource provides requirements and preferences for the jobs that can execute on it
– Jobs are paired to resources by
  satisfying all requirements of both job and resource, and
  optimizing the preferences of job and resource
Accessible from: Ranger, Queen Bee, Lonestar, Steele
Can match jobs to: Ranger, Abe, Queen Bee, Lonestar, Cobalt, Pople, Big Red, NSTG
https://www.teragrid.org/web/user-support/condorg_match
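The two-sided pairing can be sketched in a few lines. A job and a resource match only if each satisfies the other's requirements; among the matches, the one maximizing the job's rank (preference) wins. The dict layout, site names, and core counts below are illustrative, not Condor's ClassAd syntax:

```python
def match(jobs, resources):
    """Sketch of ClassAd-style matchmaking: mutual requirements must
    hold, and preferences (rank) break ties among candidates."""
    pairs = []
    free = set(range(len(resources)))
    for job in jobs:
        candidates = [
            i for i in free
            if job["requires"](resources[i]) and resources[i]["requires"](job)
        ]
        if candidates:
            # Prefer the candidate the job ranks highest.
            best = max(candidates, key=lambda i: job["rank"](resources[i]))
            pairs.append((job["name"], resources[best]["name"]))
            free.remove(best)
    return pairs

jobs = [{"name": "sim",
         "requires": lambda r: r["cores"] >= 64,   # job's requirement
         "rank": lambda r: r["cores"]}]            # job's preference
resources = [
    {"name": "SiteA", "cores": 32,   "requires": lambda j: True},
    {"name": "SiteB", "cores": 1024, "requires": lambda j: True},
    {"name": "SiteC", "cores": 256,  "requires": lambda j: True},
]
# "sim" needs at least 64 cores and prefers more; SiteB wins.
```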

10 Batch Queue Prediction
Predict how long jobs will wait before they start
Useful information for resource selection
– Manually, by users
– Automatically, by tools

11 QBETS
Provides two types of predictions:
– The probability that a hypothetical job will start by a deadline
– The amount of time that a job is expected to wait X% of the time
– The job is described by its number of nodes and execution time
Integrated into the TeraGrid User Portal
Downgraded to experimental status, due to
– the amount of funding provided to the developers, and
– experience with the service
Provides predictions for: Ranger, Abe, Queen Bee, Lonestar, Big Red
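The "wait X% of the time" bound rests on a quantile idea that can be sketched from historical queue waits. The real QBETS method is more sophisticated (it places statistical bounds on nonparametric quantiles of the wait-time series); this only shows the underlying notion:

```python
def wait_bound(historical_waits, confidence=0.95):
    """Sketch of a QBETS-style bound: given historical queue waits (in
    minutes) for similar jobs, return a time T such that roughly a
    `confidence` fraction of those jobs waited T or less."""
    waits = sorted(historical_waits)
    # Index of the empirical `confidence` quantile.
    k = min(len(waits) - 1, int(confidence * len(waits)))
    return waits[k]

# Hypothetical waits (minutes) observed for similar jobs on one system.
history = [5, 12, 3, 45, 30, 7, 60, 22, 9, 120,
           15, 4, 33, 18, 50, 6, 90, 11, 25, 40]
bound = wait_bound(history, confidence=0.95)
# A similar job should start within `bound` minutes about 95% of the time.
```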

12 Karnak
Provides queue wait predictions for
– hypothetical jobs
– jobs already queued
Provides current and historical job statistics
Implemented as a REST service
– HTTP protocol; various data formats (HTML, XML, text; JSON in progress)
– Command-line clients
Status is beta
– TeraGrid User Portal integration in progress
Provides predictions for: Ranger, Abe, Lonestar, Cobalt, Pople, NSTG
– Any system that deploys the glue2 CTSS package and publishes job information
http://karnak.teragrid.org
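Because Karnak is a REST service, a client only needs to build an HTTP URL and parse the response. The endpoint path, parameter names, and JSON fields below are invented for illustration (only the host comes from the slide); consult the Karnak documentation for the real API:

```python
import json
from urllib.parse import urlencode

BASE = "http://karnak.teragrid.org"  # service host from the slide

def prediction_url(system, nodes, minutes):
    """Build a request URL for a hypothetical wait-time prediction
    endpoint.  The path and query parameters are illustrative only."""
    query = urlencode({"nodes": nodes, "walltime": minutes})
    return f"{BASE}/prediction/{system}?{query}"

def parse_prediction(payload):
    """Pull the predicted wait out of a hypothetical JSON response
    (Karnak's JSON support was still in progress at the time)."""
    doc = json.loads(payload)
    return doc["system"], doc["wait_minutes"]

url = prediction_url("lonestar", nodes=16, minutes=120)
system, wait = parse_prediction(
    '{"system": "lonestar", "wait_minutes": 42}')
```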

13 Serial Computing
Some TeraGrid users have a lot of serial computation to run
One place for them to do that is the Condor pool at Purdue
The Condor pool may not satisfy some requirements
– Number of nodes available
– Co-location with large data sets
TeraGrid cluster schedulers are optimized for parallel jobs, not serial jobs
– Per-user limits on number of jobs
– One job per node (even on nodes with more than one processing core)
There are a few ways to run many serial jobs on TeraGrid clusters
– Different RPs have different opinions about whether their clusters should be used this way
– I think this should generally be resolved when allocations are reviewed

14 MyCluster
MyCluster lets a user create a personal cluster
– The personal cluster is managed by a user-specified scheduler (e.g. Condor)
Parallel jobs are submitted to gather up nodes
– This matches the scheduling strategies of most TeraGrid clusters
– These jobs start up scheduler daemons
– The scheduler daemons interact with the user's personal scheduler
The user can then run serial jobs on those nodes
– Via jobs submitted to their personal scheduler
The developer is no longer with TeraGrid, so the future is uncertain
Installed on: Lonestar and Ranger
Can incorporate nodes from any TeraGrid system
https://www.teragrid.org/web/user-support/mycluster
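The mechanism above can be modeled in miniature: parallel jobs deliver batches of nodes into a personal pool, and the user's own scheduler drains serial tasks onto whatever nodes have joined. Everything here is a toy model (one task per node, no node reuse), not MyCluster's implementation:

```python
from collections import deque

def run_personal_cluster(node_batches, serial_tasks):
    """Toy model of the MyCluster idea: each parallel job that starts
    contributes a batch of nodes to the personal pool, and the user's
    personal scheduler places queued serial tasks on them."""
    pool = []                       # nodes that have joined so far
    tasks = deque(serial_tasks)
    schedule = []                   # (task, node) placements, in order
    for batch in node_batches:      # a parallel job starts, adding nodes
        pool.extend(batch)
        while tasks and pool:       # personal scheduler places tasks
            node = pool.pop(0)
            schedule.append((tasks.popleft(), node))
    return schedule

# Two parallel jobs deliver 2 + 1 nodes; five serial tasks are queued,
# so only the first three get placed in this toy run.
placements = run_personal_cluster(
    [["qb-n1", "qb-n2"], ["ls-n1"]],
    ["t1", "t2", "t3", "t4", "t5"])
```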

15 Condor Glideins
Similar idea to MyCluster
– The user runs their own Condor scheduler
– The user submits parallel jobs to TeraGrid resources that start up Condor daemons
  (those nodes then become available to the user's Condor pool)
– The user submits serial jobs to their Condor scheduler
Not officially documented/supported on TeraGrid
– But it is being used by a few science gateways
See the Condor manual for more information:
http://www.cs.wisc.edu/condor/manual/v7.5/5_4Glidein.html

16 Urgent Computing
High-priority job execution
– Elevated priority
– Next to run
– Preemption
Requested and managed in an automated way
– Historically done via a manual process

17 Special PRiority and Urgent Computing Environment (SPRUCE)
Automated setup and execution of urgent jobs
Ahead of time:
– The resource is configured to support SPRUCE
– The project gets all of its code working well on the resource
– The project is provided with tokens that can be used to request urgent access
To run an urgent job:
– The user presents a token to the resource
Was used a bit by the LEAD gateway
Not in production on TeraGrid
– SPRUCE is still installed on several TeraGrid systems
– The status of those installs is unknown
– The SPRUCE project seems somewhat dormant
http://spruce.teragrid.org/index.php
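The token workflow above can be sketched as issue-ahead-of-time, present-at-submit. The class, priority levels, and single-use token policy here are illustrative only, not SPRUCE's actual interface:

```python
import secrets

class UrgentAccess:
    """Toy model of SPRUCE-style token-based urgent access: tokens are
    issued to a project ahead of time, and presenting a valid token
    elevates the priority of a job at submission."""

    def __init__(self):
        self.tokens = {}  # token -> urgency level it grants

    def issue(self, urgency):
        """Issue a token granting the given urgency level."""
        token = secrets.token_hex(8)
        self.tokens[token] = urgency
        return token

    def submit(self, job, token=None):
        """Return (job, priority); an unknown or missing token means
        the job runs at normal priority.  Tokens are single-use here."""
        urgency = self.tokens.pop(token, None)
        return (job, urgency if urgency is not None else "normal")

spruce = UrgentAccess()
tok = spruce.issue("next-to-run")
job1 = spruce.submit("storm-forecast", token=tok)  # elevated priority
job2 = spruce.submit("storm-forecast", token=tok)  # token already spent
```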

18 Discussion
Any questions about those capabilities and tools?
Have you or any of your users used these capabilities?
Any comments for us?
Have users asked for any other scheduling capabilities?

