Derek Wright Computer Sciences Department, UW-Madison Lawrence Berkeley National Labs (LBNL)

1 Condor COD (Computing On Demand)
Derek Wright
Computer Sciences Department, UW-Madison
Lawrence Berkeley National Labs (LBNL)
Condor Week 5/5/2003

2 What problem are we trying to solve?
› Some people want to run interactive, yet compute-intensive applications
› Jobs that take lots of compute power over a relatively short period of time
› They want to use batch computing resources, but need them right away
› Ideally, when they’re not in use, resources would go back to the batch system

3 Some example applications:
› A distributed build/compilation of a large software system
› A very complex spreadsheet that takes a lot of cycles when you press “recalculate”
› High-energy physics (HEP) “analysis” jobs
› Visualization tools for data-mining, rendering graphics, etc.

4 Example application for COD
[Diagram labels: User’s Workstation, Controller application, Compute Farm, Batch jobs, On-demand workers, Idle nodes, Data, Display]

5 What’s the Condor solution?
› Condor COD: “Computing on Demand”
    › Use Condor to manage the batch resources when they’re not in use by the interactive jobs
    › Allow the interactive jobs to come in with high priority and run instead of the batch job on any given resource

6 Why did we have to change Condor for that?
› Doesn’t Condor already notice when an interactive job starts on a CPU?
› Doesn’t Condor already provide checkpointing when that happens?
› Can’t I configure Condor to run whatever jobs I want with a higher priority on my own machines?

7 Well, yes… But that’s not good enough…
› Not all jobs can be checkpointed, and even for those that can, checkpointing takes some time…
› We want this to be instantaneous, not waiting for the batch system to schedule tasks…
› You can configure Condor to run higher priority jobs, but the other jobs are kicked off the machine…

8 What’s new about COD?
› “Checkpoint to swap space”
    › When a high-priority COD job appears, the lower-priority batch job is suspended
    › The COD job can run right away, while the batch job is suspended
    › Batch jobs (even those that can’t checkpoint) can resume instantly once there are no more active COD jobs

9 But wait, there’s more…
› The condor_startd can now manage multiple “claims” on each resource
    › If any COD claim becomes active, the regular Condor claim is automatically suspended
    › Once no COD claim is active, the regular claim resumes
› There is a new command-line tool to request, activate, suspend, resume and release these claims
› There’s even a C++ object to do all of that, if you really want it…
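The suspension rule on this slide can be sketched in a few lines of Python. This is an illustrative model only, not code from the condor_startd: the class name `Resource`, its fields, and the lowercase state strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical model of one resource that holds several claims."""
    has_batch_claim: bool = True
    # COD claim id -> "idle", "running", or "suspended" (illustrative states)
    cod_claims: dict = field(default_factory=dict)

    def batch_claim_state(self) -> str:
        """The regular claim runs only while no COD claim is active."""
        if not self.has_batch_claim:
            return "none"
        if any(state == "running" for state in self.cod_claims.values()):
            return "suspended"
        return "running"


node = Resource()
node.cod_claims["cod1"] = "running"    # a COD claim becomes active
state_during_cod = node.batch_claim_state()
node.cod_claims["cod1"] = "suspended"  # no active COD claim remains
state_after_cod = node.batch_claim_state()
print(state_during_cod, state_after_cod)
```

The batch claim's state is derived from the COD claims rather than stored, which mirrors the "automatically suspended" wording: nothing has to remember to resume the batch job.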

10 COD claim-management commands
› Request: authorizes the user and returns a unique claim ID for future commands
› Activate: spawns an application on a given COD claim, with various options to define the application, job ID, etc.
    › Suspends any regular Condor job
    › You can have multiple COD claims on a single resource, and they can all be running simultaneously

11 COD commands (cont’d)
› Suspend:
    › Given COD claim is suspended
    › If there are no more active COD claims, a regular Condor batch job can now run
› Resume: Given COD claim is resumed, suspending the Condor batch job (if any)
› Deactivate: Kill the application but hold onto the COD claim
› Release: Get rid of the COD claim itself
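One way to picture the six commands is as transitions in a small state machine for a single COD claim. The sketch below is a reading of the slides, not Condor's implementation: the state names are made up for the example, and allowing Deactivate on a suspended claim is an assumption (the slides only say it kills the application and keeps the claim).

```python
# (current state, command) -> next state. State names are hypothetical.
TRANSITIONS = {
    ("unclaimed", "request"): "idle",       # claim ID handed out
    ("idle", "activate"): "running",        # application spawned
    ("running", "suspend"): "suspended",
    ("suspended", "resume"): "running",
    ("running", "deactivate"): "idle",      # app killed, claim kept
    ("suspended", "deactivate"): "idle",    # assumption, see lead-in
    ("idle", "release"): "unclaimed",
    ("running", "release"): "unclaimed",
    ("suspended", "release"): "unclaimed",
}


def apply_command(state: str, command: str) -> str:
    """Return the next claim state, or raise if the command does not apply."""
    try:
        return TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(f"cannot {command} a claim that is {state}") from None


def run(commands, state="unclaimed"):
    """Apply a sequence of commands to one claim and return its final state."""
    for command in commands:
        state = apply_command(state, command)
    return state


final = run(["request", "activate", "suspend", "resume", "deactivate", "release"])
print(final)
```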

12 COD command protocol
› All commands use ClassAds
    › Allows for a flexible protocol
    › Excellent error propagation
    › Can use existing ClassAd technology
› Similar to existing Condor protocol
    › Separation of claiming from activation, so you can have hot-spares, etc.
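To illustrate why an attribute-value protocol makes error propagation easy, here is a toy command handler that uses plain Python dicts in place of ClassAds. Only the command names CA_REQUEST_CLAIM and CA_ACTIVATE_CLAIM come from the slides; the attribute names (`Command`, `ClaimId`, `Result`, `ErrorString`, `Keyword`), the claim ID format, and the function itself are hypothetical and should not be read as Condor's actual wire format.

```python
def handle_command(ad: dict, claims: dict) -> dict:
    """Answer one command ad with a reply ad (both are plain dicts here)."""
    command = ad.get("Command")
    if command == "CA_REQUEST_CLAIM":
        claim_id = f"example-claim-{len(claims) + 1}"  # made-up ID format
        claims[claim_id] = "Idle"
        return {"Result": "Success", "ClaimId": claim_id}
    if command == "CA_ACTIVATE_CLAIM":
        claim_id = ad.get("ClaimId")
        if claim_id not in claims:
            # Errors travel back as ordinary attributes of the reply.
            return {"Result": "Failure",
                    "ErrorString": f"unknown claim id: {claim_id!r}"}
        claims[claim_id] = "Running"
        return {"Result": "Success", "ClaimId": claim_id}
    return {"Result": "Failure",
            "ErrorString": f"unsupported command: {command!r}"}


claims = {}
request_reply = handle_command({"Command": "CA_REQUEST_CLAIM"}, claims)
activate_reply = handle_command(
    {"Command": "CA_ACTIVATE_CLAIM",
     "ClaimId": request_reply["ClaimId"],
     "Keyword": "fractgen"},
    claims,
)
bad_reply = handle_command(
    {"Command": "CA_ACTIVATE_CLAIM", "ClaimId": "missing"}, claims
)
print(request_reply, activate_reply, bad_reply, sep="\n")
```

Because claiming and activation are separate commands, a claim can sit idle as a hot spare until an activate ad arrives for it.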

13 How does all of that solve the problem?
› The interactive COD application starts up, and goes out to claim some compute nodes
› Once the helper applications are in place and ready, these COD claims are suspended, allowing batch jobs to run
› When the interactive application has work, it can instantly suspend the batch jobs and resume the COD applications to perform the computations
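The cycle described here (set up, wait for input, burst, repeat, clean up) can be simulated to see when batch jobs get the nodes. This is a toy timeline under stated assumptions: the phase names, the state strings, and the default of four nodes are illustrative, and every node is assumed to change state in lockstep.

```python
def batch_can_run(cod_states):
    """Batch work may use a node only while its COD claim is not running."""
    return [state != "running" for state in cod_states]


def simulate(num_nodes=4):
    """Return (phase, batch_jobs_running) pairs for one interactive session."""
    phases = [
        ("setup", "running"),          # request + activate the helpers
        ("await input", "suspended"),  # helpers suspended, batch jobs run
        ("burst", "running"),          # helpers resumed, batch suspended
        ("await input", "suspended"),
        ("burst", "running"),
        ("clean-up", "released"),      # claims released to the batch system
    ]
    timeline = []
    for name, cod_state in phases:
        nodes = [cod_state] * num_nodes
        timeline.append((name, all(batch_can_run(nodes))))
    return timeline


timeline = simulate()
for phase, batch_running in timeline:
    print(f"{phase:12s} batch jobs running: {batch_running}")
```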

14 Step 1: Initial state
[Diagram labels: User’s Workstation (shell prompt: %), Compute Farm, Idle nodes, Batch jobs]

15 Step 2: Application spawned
[Diagram labels: User’s Workstation, Compute Farm, Idle nodes, Batch jobs, Controller application spawned]
% fractal-gen -n 4

16 Step 3: Compute node setup
[Diagram labels: User’s Workstation, Compute Farm, Idle nodes, Batch jobs, On-demand workers; commands shown: request, activate]
Claiming and initializing [4] compute nodes for rendering…
Got reply from: SUCCESS!

17 Step 3: Commands used
% condor_cod_request -name \
    -classad c1.out
Successfully sent CA_REQUEST_CLAIM to startd at
Result ClassAd written to c1.out
ID of new claim is: " #1051656208#2"
% condor_cod_activate -keyword fractgen \
    -id " #1051656208#2"
Successfully sent CA_ACTIVATE_CLAIM to startd at
% …

18 Step 4: “Checkpoint” to swap
[Diagram labels: User’s Workstation, Compute Farm, Batch jobs, Idle nodes, Suspended worker; command shown: suspend]
SELECT FRACTAL TYPE
(more user input…)

19 Step 4: Commands used
› Rendering application on each COD node is suspended while interactive tool waits for input
› The resources are now available for batch Condor jobs
% condor_cod_suspend \
    -id " #1051656208#2"
Successfully sent CA_SUSPEND_CLAIM to startd at
% …

20 Step 5: Batch jobs can run
[Diagram labels: User’s Workstation, Compute Farm, Batch queue, Batch jobs, Idle nodes]
SPECIFY PARAMETERS
max_iterations: 400000
TL: -0.65865, -0.56254
BR: -0.45865, -0.71254
(more user input…)

21 Step 6: Computation burst
[Diagram labels: User’s Workstation, Compute Farm, Idle nodes, Batch jobs, Suspended batch job, Interactive workers, On-demand workers; command shown: resume]
CLICK TO VIEW YOUR FRACTAL…
RENDER

22 Step 6: Commands used
› Batch Condor jobs on COD nodes are suspended
› All COD rendering applications are resumed on each node
% condor_cod_resume \
    -id " #1051656208#2"
Successfully sent CA_RESUME_CLAIM to startd at
% …

23 Step 7: Results produced
[Diagram labels: User’s Workstation, Compute Farm, Idle nodes, Batch jobs, Suspended batch job, Interactive workers, On-demand workers, Data, Display]

24 Step 8: User input while batch work resumes
[Diagram labels: User’s Workstation, Compute Farm, Idle nodes, Batch jobs, Suspended worker; command shown: suspend]
ZOOM BOX COORDINATES:
TL = -0.60301, -0.61087
BR = -0.58037, -0.62785

25 Step 9: Computation burst #2
[Diagram labels: User’s Workstation, Compute Farm, Idle nodes, Batch jobs, Interactive workers, Suspended batch job, On-demand workers, Data, Display; command shown: resume]
RENDER

26 Step 10: Clean-up
[Diagram labels: User’s Workstation, Compute Farm, Idle nodes, Batch jobs; command shown: release]
REALLY QUIT? Y/N
Releasing compute nodes…
4 nodes terminated successfully!

27 Step 10: Commands used
› The jobs are cleaned up, claims released, and resources returned to batch system
% condor_cod_release \
    -id " #1051656208#2"
Successfully sent CA_RELEASE_CLAIM to startd at
State of claim when it was released: "Running"
% …

28 Other changes for COD:
› The condor_starter has been modified so that it can run jobs without communicating with a condor_shadow
    › All the great job control features of the starter without a shadow
    › Starter can write its own UserLog
    › Other useful features for COD

29 condor_status -cod
› New “-cod” option to condor_status to view COD claims in a Condor pool:

Name        ID   ClaimState TimeInState RemoteUser JobId Keyword
astro.cs.wi COD1 Idle       0+00:00:04  wright
chopin.cs.w COD1 Running    0+00:02:05  wright     3.0   fractgen
chopin.cs.w COD2 Suspended  0+00:10:21  wright     4.0   fractgen

             Total Idle Running Suspended Vacating Killing
INTEL/LINUX      3    1       1         1        0       0
      Total      3    1       1         1        0       0
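The summary rows at the bottom of this listing are per-state counts of the claims above them. A minimal sketch of that tally follows, using the three claims from the slide; the function name `summarize` and the row layout are assumptions for illustration, not how condor_status is implemented.

```python
from collections import Counter

# (machine, claim ID, claim state) for the three claims shown on the slide
rows = [
    ("astro.cs.wi", "COD1", "Idle"),
    ("chopin.cs.w", "COD1", "Running"),
    ("chopin.cs.w", "COD2", "Suspended"),
]

# Column order of the summary table on the slide
STATES = ["Idle", "Running", "Suspended", "Vacating", "Killing"]


def summarize(rows):
    """Return [total, idle, running, suspended, vacating, killing]."""
    counts = Counter(state for _, _, state in rows)
    return [len(rows)] + [counts[state] for state in STATES]


totals = summarize(rows)
print(totals)  # [3, 1, 1, 1, 0, 0]
```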

30 What else could I use all these new features for?
› Short-running system administration tasks that need quick access but don’t want to disturb the jobs in your batch system
› A “Grid Shell”
    › A condor_starter that doesn’t need a condor_shadow is a powerful job management environment that can monitor a job running under a “hostile” batch system on the grid

31 Future work
› More ways to tell COD about your application
    › For now, you define important attributes in your condor_config file and pre-stage the executables
› Ability to transfer files to and from a COD job at a remote machine
    › We’ve already got the functionality in Condor, so why rely on a shared filesystem or pre-staging?

32 More future work
› Accounting for COD jobs
› Working with some real-world applications and integrating these new COD features
    › Would the real users please stand up?
› Better “Grid Shell” support
    › This is really a separate-yet-related area of work…

33 How do you use COD?
› Upgrade to Condor version 6.5.3 or greater… COD is already included
› There will be a new section in the Condor manual (coming soon)
› If you need more help, ask the ever helpful
› Find me at the BoF on Wednesday, 9am to Noon (room TBA)
