TeraGrid Advanced Scheduling Tools
Warren Smith
Texas Advanced Computing Center
wsmith at tacc.utexas.edu

Outline
Advance Reservation and Coscheduling
– GUR
Metascheduling
– MCP
– Condor-G Matchmaking
Batch Queue Prediction
– QBETS
– Karnak
Serial Computing
– MyCluster
– Condor Glideins
Urgent Computing
– SPRUCE

Advance Reservation
Reserve resources in advance
– For example, 128 nodes on Queen Bee at 2pm tomorrow for 3 hours
– Reservation requests from users are handled in an automated manner
The user can then submit jobs to those reserved nodes
– Typically, jobs can be submitted to the reservation once it is accepted (see the sketch below)
Variety of uses
– Classes or training, where quick turnaround is needed
– More efficient debugging and tuning
Needs to be supported by the batch scheduler
– The capability is available in almost every scheduler
– Currently only enabled on Queen Bee
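How a job is submitted into an accepted reservation depends on the local scheduler. As a rough sketch, with a Moab/TORQUE setup a job can be tied to a reservation by its identifier; the reservation name and script name below are hypothetical placeholders:

  # Submit a job so that it runs only inside reservation "myresv.1"
  # (one Moab/TORQUE form; the exact flag varies by site and scheduler)
  qsub -W x=FLAGS:ADVRES:myresv.1 myjob.pbs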

Coscheduling
Simultaneous access to resources on two or more systems
Typically implemented using multiple advance reservations
– For example, 128 nodes on Queen Bee and 128 nodes on Lonestar at 2pm tomorrow for 3 hours
– Depends on the cluster schedulers supporting advance reservations
Variety of uses
– Visualization of a simulation in progress
– Multi-system simulations (e.g. MPIg)
– Teaching and training

Grid Universal Remote (GUR)
GUR supports both advance reservation and coscheduling
– The only TeraGrid-supported tool for this (not counting the reservation form on the web site)
Command line program that accepts a description file specifying:
– Candidate systems
– Total number of nodes needed
– Total duration
– Earliest start and latest end
Tries different configurations within the specified bounds
Client available on: Queen Bee, new SDSC system (future)
Can reserve nodes on: Queen Bee, Ranger (future), new SDSC system (future)

Metascheduling
Users have jobs that can run on any of several TeraGrid systems
Help users select where to submit them
– Automatically, on a per-job basis
– Optimize the execution of the jobs
Manage the execution of the jobs

Master Control Program (MCP)
Submits multiple copies of a job to different systems
– Once one copy starts, the others are cancelled
Command line programs
– The user specifies a submit script for each system a copy will be submitted to, in the form expected by the batch scheduler on that system
– MCP annotations describe how to access each system; they can appear in each submit script or be stored in a configuration file
Client available on: Queen Bee
Can send jobs to: Abe, Lincoln, Queen Bee, Cobalt, Big Red, NSTG

Condor-G
Condor atop Globus
Globus provides the basic mechanisms:
– Authentication & authorization
– File transfer
– Remote job execution & management
Condor provides more advanced mechanisms:
– Improved user interface (batch scheduling): the user provides a submit script (see the sketch below) and uses typical batch scheduling commands
  condor_status – information about systems available to Condor
  condor_submit – submit a job
  condor_q – observe jobs submitted to the Condor install on this system
  condor_rm – cancel a job
– Fault tolerance with retries
– Improves the scalability of Globus v2 job management
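To illustrate the Condor-G user interface, a minimal grid-universe submit script might look like the sketch below. The gatekeeper host name and file names are hypothetical placeholders; the actual contact string is that of the target TeraGrid system:

  # Minimal Condor-G submit description file (sketch)
  universe      = grid
  # gt2 = pre-WS GRAM; the gatekeeper host below is a placeholder
  grid_resource = gt2 gatekeeper.example.teragrid.org/jobmanager-pbs
  executable    = my_app
  output        = my_app.out
  error         = my_app.err
  log           = my_app.log
  queue

The script is then submitted with condor_submit, monitored with condor_q, and cancelled with condor_rm, like any other Condor job.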

Condor-G Matchmaking
Matchmaking is Condor's term for selecting a resource for a job
– A job provides requirements and preferences for the resources it can execute on
– A resource provides requirements and preferences for the jobs that can execute on it
– Jobs are paired to resources so that all requirements of both job and resource are satisfied and the preferences of both are optimized (see the sketch below)
Accessible from: Ranger, Queen Bee, Lonestar, Steele
Can match jobs to: Ranger, Abe, Queen Bee, Lonestar, Cobalt, Pople, Big Red, NSTG
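A sketch of how requirements and rank expressions could appear in a matchmaking submit script is shown below. The attribute names are assumptions for illustration only; which attributes can actually be referenced depends on what the resource ads publish, and the $$() macro substitutes a value from the matched ad:

  # Matchmaking sketch; the attribute names (grid_resource, Arch, Memory)
  # are placeholders for whatever the resource ClassAds actually advertise
  universe      = grid
  grid_resource = $$(grid_resource)
  requirements  = (TARGET.Arch == "X86_64") && (TARGET.Memory >= 2048)
  rank          = TARGET.Memory
  executable    = my_app
  output        = my_app.out
  log           = my_app.log
  queue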

Batch Queue Prediction
Predict how long jobs will wait before they start
Useful information for resource selection
– Manually, by users
– Automatically, by tools

QBETS
Provides two types of predictions:
– The probability that a hypothetical job will start by a deadline
– The amount of time that a job is expected to wait, X% of the time
– Jobs are described by a number of nodes and an execution time
Integrated into the TeraGrid User Portal
Downgraded to experimental status, based on:
– The amount of funding provided to the developers
– Experience with the service
Provides predictions for: Ranger, Abe, Queen Bee, Lonestar, Big Red

Karnak
Provides queue wait predictions for:
– Hypothetical jobs
– Jobs already queued
Provides current and historical job statistics
Implemented as a REST service (see the sketch below)
– HTTP protocol, various data formats (HTML, XML, text; JSON in progress)
Command line clients
Status is beta
TeraGrid User Portal integration in progress
Provides predictions for: Ranger, Abe, Lonestar, Cobalt, Pople, NSTG
– Any system that deploys the glue2 CTSS package and publishes job information
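Because Karnak is a plain REST service, it can be queried with any HTTP client. The request below is only a shape sketch using curl; the host name, URL path, and parameter names are hypothetical and should be replaced with those given in the Karnak documentation:

  # Hypothetical wait-time prediction request; the host, path, and
  # parameters are placeholders, not Karnak's actual API
  curl -H "Accept: text/xml" \
    "http://karnak.example.org/predictions/waittime?system=ranger&nodes=128&walltime=3600"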

Serial Computing
Some TeraGrid users have a lot of serial computation to run
One place for them to do that is the Condor pool at Purdue
The Condor pool may not satisfy some requirements:
– Number of nodes available
– Co-location with large data sets
TeraGrid cluster schedulers are optimized for parallel jobs, not serial jobs:
– Per-user limits on the number of jobs
– One job per node (even when a node has more than one processing core)
There are a few ways to run many serial jobs on TeraGrid clusters
– Different RPs have different opinions about whether their clusters should be used this way
– I think this should generally be resolved when allocations are reviewed

MyCluster
MyCluster lets a user create a personal cluster
– The personal cluster is managed by a user-specified scheduler (e.g. Condor)
Parallel jobs are submitted to gather up nodes
– This matches the scheduling strategies of most TeraGrid clusters
– These jobs start up scheduler daemons
– The scheduler daemons interact with the user's personal scheduler
The user can then run serial jobs on the nodes
– Via jobs submitted to their personal scheduler
The developer is no longer with TeraGrid, so the future of the tool is uncertain
Installed on: Lonestar and Ranger
Can incorporate nodes from any TeraGrid system

Condor Glideins
Similar idea to MyCluster
The user runs their own Condor scheduler
The user submits parallel jobs to TeraGrid resources that start up Condor daemons (see the sketch below)
– Those nodes are then available in the user's Condor pool
The user submits serial jobs to their Condor scheduler
Isn't officially documented or supported on TeraGrid
– Is being used by a few science gateways
See the Condor manual for more information
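A minimal sketch of the glidein idea, not a TeraGrid-supported procedure: the parallel batch job started on the TeraGrid cluster runs condor_master with a configuration that adds the node to the user's own pool. The host name below and the assumption that the user's scheduler host is reachable from the compute nodes are hypothetical:

  # Hypothetical glidein configuration fragment for the borrowed nodes
  # (e.g. a condor_config.local shipped with the batch job).
  # CONDOR_HOST points at the user's own Condor central manager (placeholder host).
  CONDOR_HOST = my-schedd.example.edu
  # Run only a master and an execute daemon (startd) on the borrowed node.
  DAEMON_LIST = MASTER, STARTD
  # Accept jobs from the user's pool.
  START = TRUE

When the batch job's wall clock time expires, the daemons exit and the node leaves the user's pool.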

Urgent Computing
High priority job execution
– Elevated priority
– Next to run
– Preemption
Requested and managed in an automated way
– Historically done via a manual process

Special PRiority and Urgent Computing Environment (SPRUCE)
Automated setup and execution of urgent jobs
Ahead of time:
– The resource is configured to support SPRUCE
– The project gets all of its code working well on the resource
– The project is provided with tokens that can be used to request urgent access
To run an urgent job:
– The user presents a token to the resource
Was used a bit by the LEAD gateway
Not in production on TeraGrid
– SPRUCE is still installed on several TeraGrid systems, but the status of those installs is unknown
– The SPRUCE project seems somewhat dormant

Discussion
Any questions about those capabilities and tools?
Have you or any of your users used these capabilities?
Any comments for us?
Have users asked for any other scheduling capabilities?