Condor COD (Computing On Demand)
Derek Wright, Computer Sciences Department, UW-Madison / Lawrence Berkeley National Lab (LBNL)
Condor Week, 5/5/2003

What problem are we trying to solve?
› Some people want to run interactive, yet compute-intensive applications
› These jobs take lots of compute power over a relatively short period of time
› The users want to use batch computing resources, but need them right away
› Ideally, when the interactive applications are not using them, the resources would go back to the batch system

Some example applications:
› A distributed build/compilation of a large software system
› A very complex spreadsheet that takes a lot of cycles when you press “recalculate”
› High-energy physics (HEP) “analysis” jobs
› Visualization tools for data mining, rendering graphics, etc.

Example application for COD
[Diagram: a controller application and data display on the user’s workstation drive on-demand workers in the compute farm, alongside batch jobs and idle nodes]

What’s the Condor solution?
› Condor COD: “Computing on Demand”
  • Use Condor to manage the batch resources when they’re not in use by the interactive jobs
  • Allow the interactive jobs to come in with high priority and run instead of the batch job on any given resource

Why did we have to change Condor for that?
› Doesn’t Condor already notice when an interactive job starts on a CPU?
› Doesn’t Condor already provide checkpointing when that happens?
› Can’t I configure Condor to run whatever jobs I want with a higher priority on my own machines?

Well, yes… But that’s not good enough…
› Not all jobs can be checkpointed, and even those that can, take time to do so…
› We want this to be instantaneous, not waiting for the batch system to schedule tasks…
› You can configure Condor to run higher-priority jobs, but the other jobs are kicked off the machine… (see the policy sketch below)
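To make that last point concrete, here is a minimal sketch of the pre-COD way of giving one user’s jobs priority on a machine, using standard condor_config startd policy expressions; the owner name “alice” is a placeholder, not part of the talk. The point is that a higher-RANK job preempts (vacates or kills) the running batch job rather than suspending it:

# Pre-COD policy sketch (hypothetical owner "alice"):
# jobs from this user outrank anything else on the machine,
# so they preempt whatever batch job is currently running.
RANK = (Owner == "alice")

# The preempted job is vacated (checkpointed if it is able to)
# and then killed if it does not leave in time -- it is not
# merely suspended, which is exactly the gap COD fills.
WANT_VACATE = True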

What’s new about COD?
› “Checkpoint to swap space”
  • When a high-priority COD job appears, the lower-priority batch job is suspended
  • The COD job can run right away, while the batch job is suspended
  • Batch jobs (even those that can’t checkpoint) can resume instantly once there are no more active COD jobs

But wait, there’s more…
› The condor_startd can now manage multiple “claims” on each resource
  • If any COD claim becomes active, the regular Condor claim is automatically suspended
  • When no COD claim is active, the regular claim resumes
› There is a new command-line tool to request, activate, suspend, resume and release these claims
› There’s even a C++ object to do all of that, if you really want it…

COD claim-management commands
› Request: authorizes the user and returns a unique claim ID for future commands
› Activate: spawns an application on a given COD claim, with various options to define the application, job ID, etc.
  • Suspends any regular Condor job
  • You can have multiple COD claims on a single resource, and they can all be running simultaneously

COD commands (cont’d)
› Suspend:
  • The given COD claim is suspended
  • If there are no more active COD claims, a regular Condor batch job can now run
› Resume: the given COD claim is resumed, suspending the Condor batch job (if any)
› Deactivate: kill the application but hold onto the COD claim
› Release: get rid of the COD claim itself
(A command-line sketch of this whole lifecycle follows below.)
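Putting those commands together, a minimal sketch of one claim’s full lifecycle from the command line looks like the following; the hostname, the keyword, and the claim ID are placeholders (the real ID is printed by the request step and must be fed to the later commands), and the deactivate tool is assumed to follow the same naming pattern as the other tools shown in this talk:

% condor_cod_request -name c1.cluster.org -classad c1.out   # claim the node; a new claim ID is returned
% condor_cod_activate -keyword fractgen -id "<claim-id>"     # spawn the pre-configured app; the regular batch job is suspended
% condor_cod_suspend -id "<claim-id>"                        # pause the COD app; a batch job may run again
% condor_cod_resume -id "<claim-id>"                         # burst: suspend the batch job, resume the COD app
% condor_cod_deactivate -id "<claim-id>"                     # kill the app but hold onto the claim
% condor_cod_release -id "<claim-id>"                        # give the node back to the batch system for good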

COD command protocol
› All commands use ClassAds
  • Allows for a flexible protocol
  • Excellent error propagation
  • Can use existing ClassAd technology
› Similar to existing Condor protocol
  • Separation of claiming from activation, so you can have hot-spares, etc.

How does all of that solve the problem?
› The interactive COD application starts up, and goes out to claim some compute nodes
› Once the helper applications are in place and ready, these COD claims are suspended, allowing batch jobs to run
› When the interactive application has work, it can instantly suspend the batch jobs and resume the COD applications to perform the computations

Step 1: Initial state
[Diagram: the user’s workstation shows an idle shell prompt; the compute farm is running batch jobs, with some idle nodes]

Step 2: Application spawned
[Diagram: the user runs “% fractal-gen -n 4” on the workstation, spawning the controller application; the compute farm still runs batch jobs with some idle nodes]

Step 3: Compute node setup
[Diagram: the controller sends “request” and “activate” to four farm nodes, which become on-demand workers alongside the batch jobs and idle nodes. The controller reports:
  Claiming and initializing [4] compute nodes for rendering…
  Got reply from: c1.cluster.org c6.cluster.org c14.cluster.org c17.cluster.org
  SUCCESS!]

Step 3: Commands used

% condor_cod_request -name c1.cluster.org \
    -classad c1.out
Successfully sent CA_REQUEST_CLAIM to startd at
Result ClassAd written to c1.out
ID of new claim is: “ # #2”

% condor_cod_activate -keyword fractgen \
    -id “ # #2”
Successfully sent CA_ACTIVATE_CLAIM to startd at
% …
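As a rough illustration of how a controller might drive this step for several nodes at once, here is a small shell sketch; the node list matches the diagram above, while the way the claim ID is pulled out of the result ClassAd file (and the attribute name it greps for) is an assumption for illustration only:

#!/bin/sh
# Sketch: claim and activate four compute nodes for rendering.
for node in c1.cluster.org c6.cluster.org c14.cluster.org c17.cluster.org; do
    # Request a COD claim; the result ClassAd is written to $node.out
    condor_cod_request -name "$node" -classad "$node.out"

    # Assumption: the new claim ID is recorded in the result ClassAd;
    # grab it from the file (the attribute name may differ in your version).
    id=`grep -i claimid "$node.out" | cut -d'"' -f2`

    # Spawn the pre-configured fractgen application on that claim;
    # any regular Condor job on the node is suspended.
    condor_cod_activate -keyword fractgen -id "$id"
done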

Step 4: “Checkpoint” to swap
[Diagram: the controller sends “suspend” to the compute nodes, whose rendering workers are suspended; the workstation prompts “SELECT FRACTAL TYPE” and waits for more user input; batch jobs and idle nodes continue in the farm]

Step 4: Commands used
› Rendering application on each COD node is suspended while the interactive tool waits for input
› The resources are now available for batch Condor jobs

% condor_cod_suspend \
    -id “ # #2”
Successfully sent CA_SUSPEND_CLAIM to startd at
% …

Step 5: Batch jobs can run
[Diagram: batch jobs from the batch queue run on the farm’s idle nodes; the workstation prompts “SPECIFY PARAMETERS” (max_iterations, TL and BR coordinates) and waits for more user input]

Step 6: Computation burst
[Diagram: the user clicks “RENDER”; the controller sends “resume” to the compute nodes, the batch jobs there are suspended, and the on-demand workers compute; the workstation shows “CLICK TO VIEW YOUR FRACTAL…”]

Step 6: Commands used
› Batch Condor jobs on COD nodes are suspended
› All COD rendering applications are resumed on each node

% condor_cod_resume \
    -id “ # #2”
Successfully sent CA_RESUME_CLAIM to startd at
% …

Step 7: Results produced
[Diagram: the on-demand workers send their results back to the data display on the user’s workstation; the batch jobs on those nodes remain suspended]

Step 8: User input while batch work resumes
[Diagram: the controller sends “suspend” again, the rendering workers are suspended, and batch jobs resume; the workstation prompts “ZOOM BOX COORDINATES: TL = , BR = ” for more user input]

Step 9: Computation burst #2
[Diagram: the user clicks “RENDER” again; the controller sends “resume”, the batch jobs are suspended once more, and the on-demand workers compute and feed the data display]

Step 10: Clean-up
[Diagram: the user confirms “REALLY QUIT? Y/N”; the controller sends “release” to the compute nodes, which go back to being idle or running batch jobs. The controller reports:
  Releasing compute nodes…
  4 nodes terminated successfully!]

Step 10: Commands used
› The jobs are cleaned up, claims released, and resources returned to the batch system

% condor_cod_release \
    -id “ # #2”
Successfully sent CA_RELEASE_CLAIM to startd at
State of claim when it was released: "Running"
% …

Other changes for COD:
› The condor_starter has been modified so that it can run jobs without communicating with a condor_shadow
  • All the great job control features of the starter without a shadow
  • Starter can write its own UserLog
  • Other useful features for COD

condor_status -cod
› New “-cod” option to condor_status to view COD claims in a Condor pool:

Name         ID    ClaimState  TimeInState  RemoteUser  JobId  Keyword
astro.cs.wi  COD1  Idle        0+00:00:04   wright
chopin.cs.w  COD1  Running     0+00:02:05   wright      3.0    fractgen
chopin.cs.w  COD2  Suspended   0+00:10:21   wright      4.0    fractgen

             Total  Idle  Running  Suspended  Vacating  Killing
INTEL/LINUX
      Total

What else could I use all these new features for?
› Short-running system administration tasks that need quick access but don’t want to disturb the jobs in your batch system
› A “Grid Shell”
  • A condor_starter that doesn’t need a condor_shadow is a powerful job management environment that can monitor a job running under a “hostile” batch system on the grid

Future work
› More ways to tell COD about your application
  • For now, you define the important attributes in your condor_config file and pre-stage the executables (a configuration sketch follows below)
› Ability to transfer files to and from a COD job at a remote machine
  • We’ve already got the functionality in Condor, so why rely on a shared filesystem or pre-staging?
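To make the current configuration-based mechanism concrete, here is a minimal condor_config sketch for an application activated with “-keyword fractgen”, as in the earlier examples. VALID_COD_USERS controls who may issue COD commands against the startd; the per-keyword attribute names below are illustrative placeholders only (consult the COD section of the Condor manual for the exact spelling your version expects), and the executable must already be staged on the machine:

# Users allowed to issue COD commands to this startd
VALID_COD_USERS = wright

# Job definition used by "condor_cod_activate -keyword fractgen".
# NOTE: attribute names here are placeholders; the executable is pre-staged.
FRACTGEN_Cmd   = /usr/local/bin/fractgen-worker
FRACTGEN_Owner = wright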

More future work
› Accounting for COD jobs
› Working with some real-world applications and integrating these new COD features
  • Would the real users please stand up?
› Better “Grid Shell” support
  • This is really a separate-yet-related area of work…

How do you use COD?
› Upgrade to Condor version or greater… COD is already included
› There will be a new section in the Condor manual (coming soon)
› If you need more help, ask the ever-helpful
› Find me at the BoF on Wednesday, 9am to noon (room TBA)