Condor and GridShell: How to Execute 1 Million Jobs on the Teragrid. Jeffrey P. Gardner - PSC, Edward Walker - TACC, Miron Livny - U. Wisconsin, Todd Tannenbaum.

1 Condor and GridShell: How to Execute 1 Million Jobs on the Teragrid. Jeffrey P. Gardner - PSC, Edward Walker - TACC, Miron Livny - U. Wisconsin, Todd Tannenbaum - U. Wisconsin. And many others!

2 Scientific Motivation Astronomy is increasingly being done by using large surveys with 100s of millions of objects. Analyzing large astronomical datasets frequently means performing the same analysis task on >100,000 objects. Each object may take several hours of computing. The amount of computing time required may vary, sometimes dramatically, from object to object.

3 Solution: PBS? In theory, PBS should provide the answer: submit 100,000 single-processor PBS jobs. In practice, this does not work. Teragrid nodes are multiprocessor, but only 1 PBS job runs per node. Teragrid machines frequently restrict the number of jobs a single user may run. Chad might get really mad if I submitted 100,000 PBS jobs!

4 Solution: mprun? We could submit a single job that uses many processors. Now we have a reasonable number of PBS jobs (Chad will now be happy). Scheduling priority would reflect our actual resource usage. This still has problems. Each serial task within the job takes a different amount of time to run, so processors that finish early sit idle: we are using resources inefficiently.
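
The inefficiency can be made concrete with a small calculation. The sketch below is illustrative only: the run times are made-up numbers, and `static_utilization` is a hypothetical helper, not part of PBS or mprun. It assumes one task per processor and that the parallel job holds every processor until the slowest task finishes.

```python
def static_utilization(task_hours):
    """Fraction of reserved processor-hours that do useful work.

    Assumption (not from the slides): each processor runs exactly one
    task, and all processors stay reserved until the slowest task ends.
    """
    if not task_hours:
        raise ValueError("need at least one task")
    wall_clock = max(task_hours)             # job lasts as long as its slowest task
    reserved = wall_clock * len(task_hours)  # processor-hours charged to the job
    return sum(task_hours) / reserved


# Made-up run times (hours) for four astronomical objects.
hours = [1.0, 2.0, 3.0, 10.0]
print(f"useful fraction: {static_utilization(hours):.0%}")
```

With these invented numbers, one slow object keeps three processors idle for most of the job, which is the problem the slide describes.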

5 The Real Solution: Condor+GridShell The real solution is to submit one large PBS job, then use a private scheduler to manage serial work units within each PBS job. We can even submit large PBS jobs to multiple Teragrid machines, then farm out serial work units as resources become available.
Vocabulary:
  JOB: (n) A thing that is submitted via Globus or PBS.
  WORK UNIT: (n) An independent unit of work (usually serial), such as the analysis of a single astronomical object.

6 The Real Solution: Condor+GridShell The real solution is to submit one large PBS job, then use a private scheduler to manage serial work units within each PBS job. We can even submit large PBS jobs to multiple Teragrid machines, then farm out serial work units as resources become available.
Vocabulary:
  JOB: (n) A thing that is submitted via Globus or PBS.
  WORK UNIT: (n) An independent unit of work (usually serial), such as the analysis of a single astronomical object.
(Slide labels: Condor, GridShell)
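
Farming out work units as processors become free can be sketched as a greedy queue. This is a toy model under stated assumptions, not Condor's actual scheduling algorithm: `farm_out` is a hypothetical name, the run times are invented, and dispatch overhead is ignored.

```python
import heapq


def farm_out(work_hours, n_procs):
    """Give each work unit to whichever processor becomes free first.

    Returns the makespan: the time at which the last work unit finishes.
    Toy model only; real schedulers also account for dispatch overhead.
    """
    if n_procs < 1:
        raise ValueError("need at least one processor")
    free_at = [0.0] * n_procs  # time at which each processor is next idle
    heapq.heapify(free_at)
    for hours in work_hours:
        start = heapq.heappop(free_at)  # earliest available processor
        heapq.heappush(free_at, start + hours)
    return max(free_at)


# Made-up run times (hours) for seven work units on two processors.
units = [4.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
makespan = farm_out(units, 2)
print("makespan:", makespan)
print("utilization:", sum(units) / (makespan * 2))
```

Because a processor picks up the next work unit as soon as it finishes the previous one, short units no longer wait on long ones.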

7 Condor Overview Condor was first designed as a CPU cycle harvester for workstations sitting on people’s desks. Condor is designed to schedule large numbers of jobs across a distributed, heterogeneous and dynamic set of computational resources.

8 Condor: The User Experience
1. User writes a simple Condor submit script:
    # my_job.submit:
    # A simple Condor submit script
    Universe = vanilla
    Executable = my_program
    Queue
2. User submits the job:
    % condor_submit my_job.submit
    Submitting job(s).
    1 job(s) submitted to cluster 1.
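
For many work units, the same kind of submit script can queue them all at once. The sketch below generates such a script from Python. It assumes (this is not stated in the slides) that `my_program` accepts the work unit index as its only argument; the helper name `make_submit_description` and the output file names are hypothetical. `$(Process)` is the Condor submit macro that expands to 0, 1, 2, ... for each queued job.

```python
def make_submit_description(executable, n_units):
    """Build a Condor submit description that queues n_units jobs.

    Assumption: the executable takes the work unit index as its sole
    argument. File names below are illustrative choices.
    """
    if n_units < 1:
        raise ValueError("need at least one work unit")
    lines = [
        "# Generated submit description: one queue entry per work unit",
        "Universe   = vanilla",
        f"Executable = {executable}",
        "Arguments  = $(Process)",
        "Output     = unit_$(Process).out",
        "Error      = unit_$(Process).err",
        "Log        = units.log",
        f"Queue {n_units}",
    ]
    return "\n".join(lines) + "\n"


print(make_submit_description("my_program", 100000), end="")
```

The printed text can be saved to a file and passed to `condor_submit` in the same way as `my_job.submit`.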

9 Condor: The User Experience
3. User watches job run:
    % condor_q
    -- Submitter: perdita.cs.wisc.edu : :
     ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
     1.0     Jeff    6/16 06:52  0+00:01:21 R  0   0.0  my_program
    1 jobs; 0 idle, 1 running, 0 held
    %
4. Job completes. User is happy.
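
When many work units are queued, it can be convenient to read the summary line programmatically. The following is a minimal sketch that assumes the summary has exactly the form shown on this slide ("N jobs; N idle, N running, N held"); other Condor versions may print a different summary, and `parse_summary` is a hypothetical helper, not a Condor tool.

```python
import re

# Matches the summary line format shown on the slide above.
SUMMARY = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_summary(condor_q_output):
    """Return job counts from condor_q text, or None if no summary is found."""
    match = SUMMARY.search(condor_q_output)
    if match is None:
        return None
    total, idle, running, held = map(int, match.groups())
    return {"jobs": total, "idle": idle, "running": running, "held": held}


print(parse_summary("1 jobs; 0 idle, 1 running, 0 held"))
```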

10 Advantages of Condor
Condor user experience is simple.
Condor is flexible:
  Resources can be any mix of architectures.
  Resources do not need a common filesystem.
  Resources do not need common user accounting.
Condor is dynamic:
  Resources can disappear and reappear.
Condor is fault-tolerant:
  Jobs are automatically migrated to new resources if existing ones become unavailable.

11 Condor Daemons
condor_startd (runs on execution node): Advertises specs and availability of execution node (ClassAds). Starts jobs on execution node.
condor_schedd (runs on submit node): Handles job submission. Tracks job status.
condor_collector (runs on central manager): Collects system information from execution node.
condor_negotiator (runs on central manager): Matches schedd jobs to machines.

12 Condor Daemon Layout [Diagram: Central Manager (collector, negotiator); Submission Machine (schedd); Execution Machine (startd)] Startd sends system specifications (ClassAds) and system status to Collector.

13 Condor Daemon Layout [Diagram: Central Manager (collector, negotiator); Submission Machine (schedd); Execution Machine (startd)] User submits Condor job. Schedd sends job info to Negotiator.

14 Condor Daemon Layout [Diagram: Central Manager (collector, negotiator); Submission Machine (schedd); Execution Machine (startd)] Negotiator uses information from Collector to match Schedd jobs to available Startds.

15 Condor Daemon Layout [Diagram: Central Manager (collector, negotiator); Submission Machine (schedd); Execution Machine (startd)] Schedd sends job to Startd on assigned execution node.
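
The four steps above (advertise, submit, match, dispatch) can be mimicked with a toy negotiation cycle. This is only an illustration of the flow: real Condor matchmaking evaluates ClassAd expressions, whereas this sketch assumes memory is the only attribute that matters. The function name `negotiate`, the dictionary layout, and the machine names are all hypothetical.

```python
def negotiate(idle_jobs, machine_ads):
    """Toy negotiation cycle matching idle jobs to available machines.

    idle_jobs:   list of (job_id, required_memory_mb), as a schedd might report
    machine_ads: dict name -> {"memory_mb": int, "available": bool},
                 standing in for the ads that startds send to the collector
    Returns a list of (job_id, machine_name) matches.
    """
    free = {name: ad for name, ad in machine_ads.items() if ad["available"]}
    matches = []
    for job_id, need_mb in idle_jobs:
        # First free machine with enough memory; None if nothing fits.
        name = next((n for n, ad in free.items() if ad["memory_mb"] >= need_mb), None)
        if name is not None:
            matches.append((job_id, name))
            del free[name]  # a matched machine is no longer available
    return matches


ads = {
    "node1": {"memory_mb": 512, "available": True},
    "node2": {"memory_mb": 2048, "available": True},
    "node3": {"memory_mb": 4096, "available": False},
}
jobs = [("1.0", 1024), ("1.1", 256), ("1.2", 4096)]
print(negotiate(jobs, ads))
```

Job 1.2 stays unmatched in this example because the only machine large enough is not currently available; it would be considered again in a later cycle.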

16 “Personal” Condor on a Teragrid Platform Condor daemons can be run as a normal user. Condor's “GlideIn”™ feature supports launching condor_startd daemons on nodes within an LSF or PBS job.

17 “Personal” Condor on a Teragrid Platform (Condor runs with normal user permissions) [Diagram labels: Login Node: Central Manager (collector, negotiator), Submission Machine (schedd; could be login node). PBS Job - GlideIn: three Execution PEs, each running a startd.]

18 GridShell Overview Allows users to interact with distributed grid computing resources from a simple shell-like interface. Extends TCSH version 6.12 to incorporate grid-enabled features:
  parallel inter-script message-passing and synchronization
  output redirection to remote files
  parametric sweep

19 GridShell Examples
Redirecting the standard output of a command to a remote file location using GlobusFTP:
    a.out > gsiftp://tg-login.ncsa.teragrid.org/data
Message passing between 2 parallel tasks:
    if ( $_GRID_TASKID == 0) then
        echo "hello" > task_1
    else
        set msg=`cat < task_0`
    endif
Executing 256 instances of a job:
    a.out on 256 procs

20 Merging GridShell with Condor Use GridShell to launch Condor GlideIn jobs at multiple grid sites. All Condor GlideIn jobs report back to a central collector. This converts the entire Teragrid into your own personal Condor pool!

21 Merging GridShell with Condor [Diagram: sites SDSC, PSC, NCSA; one Login Node running a GridShell event monitor] User starts GridShell session at PSC.

22 Merging GridShell with Condor [Diagram: SDSC, PSC, and NCSA each have a Login Node running a GridShell event monitor] GridShell session starts event monitor on remote login nodes via Globus.

23 Merging GridShell with Condor [Diagram: SDSC, PSC, and NCSA each have a Login Node running a GridShell event monitor; one Login Node also runs collector, negotiator, and schedd] Local event monitor starts Condor daemons on login node.

24 [Diagram: SDSC, PSC, and NCSA each have a Login Node running a GridShell event monitor and a PBS Job running startd; one Login Node also runs collector, negotiator, and schedd] All event monitors submit Condor GlideIn PBS jobs.

25 [Diagram: SDSC, PSC, and NCSA each have a Login Node running a GridShell event monitor and a PBS Job running startd; one Login Node also runs collector, negotiator, and schedd] Condor startds tell collector that they have started.

26 [Diagram: SDSC, PSC, and NCSA each have a Login Node running a GridShell event monitor and a PBS Job running startd; one Login Node also runs collector, negotiator, and schedd] Condor schedd distributes independent work units to compute nodes.
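
The effect of glide-ins joining the pool at different times can be explored with a small simulation. This is a toy model, not a description of Condor internals: `run_pool` is a hypothetical name, the join times and run times are invented, and each startd is assumed to run one work unit at a time from a single central queue.

```python
import heapq


def run_pool(work_hours, startd_join_times):
    """Simulate one central queue feeding startds at several sites.

    startd_join_times: dict site -> list of times (hours) at which each
    glide-in startd at that site reports to the collector.
    Returns (makespan, work units completed per site).
    """
    pool = [(t, site) for site, times in startd_join_times.items() for t in times]
    if not pool:
        raise ValueError("no startds in the pool")
    heapq.heapify(pool)
    done = {site: 0 for site in startd_join_times}
    makespan = 0.0
    for hours in work_hours:
        free_at, site = heapq.heappop(pool)  # next startd to become idle
        finish = free_at + hours
        done[site] += 1
        makespan = max(makespan, finish)
        heapq.heappush(pool, (finish, site))
    return makespan, done


# Made-up scenario: the PSC glide-in starts at once, NCSA's waits 2 hours in the PBS queue.
joins = {"PSC": [0.0], "NCSA": [2.0]}
print(run_pool([1.0, 1.0, 1.0, 1.0], joins))
```

In this invented scenario the site whose PBS job starts first simply processes more work units, and the late site contributes as soon as its startd appears.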

27 GridShell in a NutShell Using GridShell coupled with Condor, one can easily harness the power of the Teragrid to process large numbers of independent work units. Scheduling can be done dynamically from a central Condor queue to multiple grid sites as clusters of processors become available. All of this fits into existing Teragrid software.

28 Merging GridShell with Condor [Diagram: SDSC, PSC, and NCSA each have a Login Node running a GridShell event monitor and a PBS Job running startd; one Login Node also runs collector, negotiator, and schedd]

