
Slide 1: LoadLeveler vs. NQE/NQS: Clash of The Titans
National Energy Research Scientific Computing Center (NERSC) User Services
Oak Ridge National Lab, 6/6/00

Slide 2: NERSC Batch Systems
– LoadLeveler - IBM SP
– NQS/NQE - Cray T3E/J90's
– This talk will focus on the MPP systems; using the batch system on the J90's is similar to the T3E
– The IBM batch system: http://hpcf.nersc.gov/running_jobs/ibm/batch.html
– The Cray batch system: http://hpcf.nersc.gov/running_jobs/cray/batch.html
– Batch differences between IBM and Cray: http://hpcf.nersc.gov/running_jobs/ibm/lldiff.html

Slide 3: About the T3E
– 644 application processors (PEs)
– 33 command PEs
– Additional PEs for the OS
– NQE/NQS jobs run on application PEs
– Interactive jobs ("mpprun" jobs) run on command PEs
– Single system image
– A single parallel job must run on a contiguous set of PEs
– A job will not be scheduled if there are enough idle PEs but they are fragmented throughout the torus
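
Later slides show mpprun used inside batch scripts; for reference, the same launcher run interactively looks like this (a minimal illustration, reusing the a.out executable from the later examples):

% mpprun -n 4 ./a.out      # interactive job; runs on the command PEs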

Slide 4: About the SP
– 256 compute nodes
– 8 login nodes
– Additional nodes for file system, network, etc.
– Each node has 2 processors that share memory
– Each node can have either 1 or 2 MPI tasks
– Each node runs a full copy of the AIX OS
– LoadLeveler jobs can run only on the compute nodes
– Interactive jobs ("poe" jobs) can run on either compute or login nodes
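
By way of comparison, an interactive parallel run on the SP is launched with poe; the flags below are a sketch and are not taken from the slides:

% poe ./a.out -nodes 2 -tasks_per_node 2 -procs 4    # 4 MPI tasks, 2 per node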

Slide 5: How To Use a Batch System
– Write a batch script
   – must use keywords specific to the scheduler
   – default values will be different for each site
– Submit your job
   – commands are specific to the scheduler
– Monitor your job
   – commands are specific to the scheduler
   – run limits are specific to the site
– Check results when complete
– Call NERSC consultants when your job disappears :o)
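
A minimal sketch of this workflow on the two NERSC systems, using only commands that appear later in this talk (the script names and job/task ids are placeholders):

# On the Cray T3E (NQS/NQE):
% cqsub -la regular my_t3e_script      # submit
% cqstatl -a | grep $USER              # monitor
% cqdel t7225                          # delete, using the task id reported by cqsub

# On the IBM SP (LoadLeveler):
% llsubmit my_sp_script                # submit
% llqs                                 # monitor
% llcancel gs01005.84.0                # delete, using the job id reported by llsubmit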

Slide 6: T3E Batch Terminology
– PE - processor element (a single CPU)
– Torus - the high-speed connection between PEs. All communication between PEs must go through the torus.
– Swapping - when a job is stopped by the system to allow a higher-priority job to run on that PE. The job may stay in memory. Also called "gang-scheduling".
– Migrating - when a job is moved to a different set of PEs to better pack the torus
– Checkpoint - when a job is stopped by the system and an image is saved to be restarted at a later time

Slide 7: More T3E Batch Terminology
– Pipe Queue - a queue in the NQE portion of the scheduler. It determines which batch queues the job may be submitted to. The user must specify it on the cqsub command line if it is anything other than "regular".
– Batch Queue - a queue in the NQS portion of the scheduler. The batch queues are served in a first-fit manner. The user should not specify any batch queue on the command line or in the script.
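
For example, to submit to the debug pipe queue instead of the default regular queue (the script name is a placeholder):

% cqsub -la debug my_script      # -la selects the pipe queue; omit it for "regular"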

Slide 8: NQS/NQE
– Developed by Cray
– Very complex set of scheduling parameters
– Complicated to understand
– Fragile
– Powerful and flexible
– Allows checkpoint/restart

Slide 9: What NQE Does
– Users submit jobs to NQE
– NQE assigns each job a unique identifier called the task id and stores it in a database
– The status of the job is "NPend"
– NQE examines various parameters and decides when to pass the job to the LWS
– The LWS then submits the job to an NQS batch queue (see next slide for NQS details)
– After the job completes, NQE stores the job information for about 4 hours

Slide 10: What NQS Does
– NQS receives the job from the LWS
– The job is placed in a batch queue, which is determined by the number of requested PEs and the requested time
– The status of the job is now "NSubm"
– NQS batch queues are served in a first-fit manner
– When the job is ready to be scheduled, it is sent to the GRM (global resource manager)
– At this point the status of the job is "R03"
– The job may be stopped for checkpointing or swapping but still have a "running" status in NQS

Slide 11: NQS/NQE Commands
– cqsub - submit your job
    % cqsub -la regular script_file
    Task id t7225 inserted into database nqedb.
– cqstatl - monitor your NQE job
– qstat - monitor your NQS job
– cqdel - delete your queued or running job
    % cqdel t7225
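
cqstatl and qstat are typically filtered by username, as on slide 13; for example (jimbob is a placeholder username):

% cqstatl -a | grep jimbob      # all of jimbob's NQE tasks
% qstat -a | grep jimbob        # all of jimbob's NQS requests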

Slide 12: Sample T3E Batch Script

#QSUB -s /bin/csh              # Specify C shell for 'set echo'
#QSUB -A abc                   # Charge account abc for this job
#QSUB -r sample                # Job name
#QSUB -eo -o batch_log.out     # Write error and output to a single file
#QSUB -l mpp_t=00:30:00        # Wallclock time
#QSUB -l mpp_p=8               # PEs to be used (required)
ja                             # Turn on job accounting
mpprun -n 8 ./a.out            # Execute on 8 PEs, reading data.in
ja -s
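
Assuming this script is saved as, say, sample.csh (an illustrative name), it is submitted to the regular pipe queue with cqsub, which responds with a task id as on the previous slide:

% cqsub -la regular sample.csh
Task id t7225 inserted into database nqedb.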

Slide 13: Monitoring Your Job on the T3E

% cqstatl -a | grep jimbob
t4417                l441h4  scheduler.main  jimbob  NQE Database  NPend
t4605 (1259.mcurie)  l513v8  lws.mcurie      jimbob  nqs@mcurie    NSubm
t4777 (1082.mcurie)  l541l2  monitor.main    jimbob  NQE Database  NComp
t4884 (1092.mcurie)  l543l1  lws.mcurie      jimbob  nqs@mcurie    NSubm
t4885 (1093.mcurie)  l545l1  lws.mcurie      jimbob  nqs@mcurie    NSubm
t4960                l546    scheduler.main  jimbob  NQE Database  NPend

% qstat -a | grep jimbob
1259.mcurie  l513v8  jimbob  pe32@mcurie  277126    255  1800  R03
1092.mcurie  l543l1  jimbob  pe32@mcurie  341626    252  1800  R03
1093.mcurie  l545l1  jimbob  pe32@mcurie     921  28672  1800  Qge

Slide 14: Monitoring Your Job on the T3E (cont’)
– Use commands pslist (see next slide) and tstat to check running jobs
– Using ps on a command PE will list all instances of a parallel job because the T3E has a single system image

% mpprun -n 4 ./a.out
% ps -u jimbob
  PID  TTY   TIME  CMD
 7523   ?    0:01  csh
 7568   ?   12:13  a.out
16991   ?   12:13  a.out
16992   ?   12:13  a.out
16993   ?   12:13  a.out

Slide 15: Monitoring Your Job on the T3E (cont’)

% pslist
S USER      RK  APID   JID    PE_RANGE NPE TTY TIME     CMD        STATUS
- --------  --  -----  -----  -------- --- --- -------- ---------  --------------
a user1      0  29451  29786  000-015   16  ?  02:50:32 sander
b buffysum   0  29567  29787  016-031   16  ?  02:57:45 osiris.e
  ACTIVE PEs = 631
q buffysum   1  18268  29715  146-161   16  ?  00:42:28 osiris.e   Swapped 1 of 16
r miyoung    1  77041  28668  172-235   64  ?  03:52:11 vasp
s buffysum   1  53202  30069  236-275   40  ?  00:18:16 osiris.e   Swapped 1 of 40
t willow     1  51069  27914  276-325   50  ?  00:53:03 MicroMag.
u hal        1  77007  30569  326-357   32  ?  00:26:09 alknemd
  ACTIVE PEs = 266   BATCH = 770   INTERACTIVE = 12

WAIT QUEUE:
user    uid    gid   acid  Label  Size  ApId   Command   Reason     Flags
giles   13668  2607  2607  -        64  55171  xlatqcdp  Ap. limit  a----
bobg    14721  2751  2751  -        54  68936  Cmdft     Ap. limit  a----
jimbo   15761  3009  3009  -        32  77407  pop.8x4   Ap. limit  af---

Slide 16: Possible Job States on the T3E

ST      Job State    Description
R03     Running      The job is currently running.
NSubm   Submitted    The job has been submitted to the NQS scheduler and is being considered to run.
NPend   Pending      The job is still residing in the NQE database and is not being considered to run. This is probably because you already have 3 jobs in the queue.
NComp   Completed    The job has completed.
NTerm   Terminated   The job was terminated, probably due to an error in the batch script.

Slide 17: Current Queue Limits on the T3E

Pipe Q       Batch Q        MAX PE   Time
debug        debug_small      32     33 min
             debug_medium    128     10 min
production   pe16             16     4 hr
             pe32             32     4 hr
             pe64             64     4 hr
             pe128           128     4 hr
             pe256           256     4 hr
             pe512           512     4 hr
long         long128         128     12 hr
             long256         256     12 hr
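
Since NQS picks the batch queue by first fit on the requested PEs and time (slide 10), the mpp_p and mpp_t keywords from the sample script determine where a job lands; an illustrative request that would normally map to the pe128 queue:

#QSUB -l mpp_p=128          # 128 PEs
#QSUB -l mpp_t=03:00:00     # 3 hours, within the 4 hr production limit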

Slide 18: Queue Configuration on the T3E

Time (PDT)   Action
7:00 am      long256 stopped; pe256 stopped
10:00 pm     pe512 started; long128, pe128 stopped and checkpointed; pe64, pe32, pe16 run as backfill
1:00 am      pe512 stopped and checkpointed; long256, pe256, long128, pe128 started

Slide 19: LoadLeveler
– Product of IBM
– Conceptually very simple
– Few commands and options available
– Packs the system well with a backfilling algorithm
– Allows MIMD jobs
– Does not have checkpoint/restart, so it cannot favor certain jobs by checkpointing others

Slide 20: SP/LoadLeveler Terminology
– Keyword - used to specify your job parameters (e.g. number of nodes and wallclock time) to the LoadLeveler scheduler
– Node - a set of 2 processors that share memory and a switch adapter. NERSC users are charged for exclusive use of a node.
– Job ID - the identifier for a LoadLeveler job, e.g. gs01013.1234.0
– Switch - a high-speed connection between the nodes. All communication between nodes goes through the switch.
– Class - a user submits a batch job to a particular class. Each class has a different priority and different limits.

Slide 21: What LoadLeveler Does
– Jobs are submitted directly to LoadLeveler
– The following keywords are set:
   – node_usage = not_shared
   – tasks_per_node = 2
– The user can override tasks_per_node but not node_usage
– Incorrect keywords and parameters are passed silently to the scheduler! NERSC only checks for valid repo and class names
– Prolog script creates $SCRATCH and $TMPDIR directories and environment variables
   – $SCRATCH is a global (GPFS) filesystem and $TMPDIR is local
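
A sketch of how these defaults show up in a job script, meant to be combined with the full sample on slide 23; the tasks_per_node override and the cd are illustrative, not required:

#@ tasks_per_node = 1      # override the default of 2; node_usage stays not_shared
#@ queue
cd $SCRATCH                # global GPFS scratch created by the prolog
./a.out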

Slide 22: LoadLeveler Commands
– llsubmit - submit your job
    % llsubmit script_file
    llsubmit: The job "gs01007.nersc.gov.101" has been submitted.
– llqs - monitor your job
– llq - get details about one of your queued or running jobs
– llcancel - delete your queued or running job
    % llcancel gs01005.84.0
    llcancel: Cancel command has been sent to the central manager.
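
For example, monitoring might look like this, using the job id returned by llsubmit above (the llqs output format appears on slide 24):

% llqs                             # summary of all queued and running jobs
% llq gs01007.nersc.gov.101        # details for one of your jobs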

Slide 23: Sample SP Batch Script

#!/usr/bin/csh
#@ job_name         = myjob
#@ account_no       = repo_name
#@ output           = myjob.out
#@ error            = myjob.err
#@ job_type         = parallel
#@ environment      = COPY_ALL
#@ notification     = complete
#@ network.MPI      = css0,not_shared,us
#@ node_usage       = not_shared
#@ class            = regular
#@ tasks_per_node   = 2
#@ node             = 32
#@ wall_clock_limit = 01:00:00
#@ queue
./a.out < input
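
With node = 32 and tasks_per_node = 2, this script requests 64 MPI tasks on 32 dedicated nodes. Assuming it is saved as, say, myjob.ll (an illustrative name), it is submitted with the llsubmit command from the previous slide:

% llsubmit myjob.ll
llsubmit: The job "gs01007.nersc.gov.101" has been submitted.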

Slide 24: Monitoring Your Job on the SP

gseaborg% llqs
Step Id          JobName          UserName  Class    ST  NDS  WallClck  Submit Time
---------------  ---------------  --------  -------  --  ---  --------  -----------
gs01007.1087.0   a240             buffy     regular  R    32  00:31:44  3/13 04:30
gs01001.529.0    s1.x             willow    regular  R    64  00:28:17  3/12 21:45
gs01001.578.0    xdnull           xander    debug    R     5  00:05:19  3/14 12:44
gs01009.929.0    gs01009.nersc.g  spike     regular  R   128  03:57:27  3/13 05:17
gs01001.530.0    s2.x             willow    regular  I    64  04:00:00  3/12 21:48
gs01001.532.0    s3.x             willow    regular  I    64  04:00:00  3/12 21:50
gs01001.533.0    y1.x             willow    regular  I    64  04:00:00  3/12 22:17
gs01001.534.0    y2.x             willow    regular  I    64  04:00:00  3/12 22:17
gs01001.535.0    y3.x             willow    regular  I    64  04:00:00  3/12 22:17
gs01001.537.0    gs01001.nersc.g  spike     regular  I   128  02:30:00  3/13 06:10
gs01009.930.0    gs01009.nersc.g  spike     regular  I   128  02:30:00  3/13 07:17

Slide 25: Monitoring Your Job on the SP (cont’)
– Issuing a ps command will show only what is running on that login node, not any instances of your parallel job
– If you could issue a ps command on a compute node running 2 MPI tasks of your parallel job, you would see:

gseaborg% ps -u jimbob
  UID    PID  TTY  TIME   CMD
14397   9444   -   58:37  a.out
14397  10878   -    0:00  pmdv2
14397  11452        0:00
14397  15634   -    0:00  LoadL_starter
14397  16828   -   58:28  a.out
14397  19696   -    0:00  pmdv2
14397  19772   -    0:02  poe
14397  20878        0:00

Slide 26: Possible Job States on the SP

ST   Job State     Description
R    Running       The job is currently running.
I    Idle          The job is being considered to run.
NQ   Not Queued    The job is not being considered to run. This is probably because you have submitted more than 10 jobs.
ST   Starting      The job is starting to run.
HU   User Hold     The user put the job on hold. You must issue the llhold -r command in order for it to be considered for scheduling.
HS   System Hold   The job was put on hold by the system. This is probably because you are over disk quota in $HOME.
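
For example, to release a job from user hold (the job step id is a placeholder in the format shown on slide 22):

% llhold -r gs01005.84.0      # release the hold so the job can be considered for scheduling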

Slide 27: Current Class Limits on the SP

CLASS        NODE  TIME    PRIORITY
debug          16  30 min     20000
premium       256   4 hr      10000
regular       256   4 hr       5000
low           256   4 hr          1
interactive     8  20 min     15000

Same configuration runs all the time.
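
Unlike the T3E, the class is chosen explicitly with the class keyword in the batch script; an illustrative request that stays within the debug limits above:

#@ class            = debug
#@ node             = 16
#@ wall_clock_limit = 00:30:00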

Slide 28: More Information
– Please see NERSC Web documentation
– The IBM batch system: http://hpcf.nersc.gov/running_jobs/ibm/batch.html
– The Cray batch system: http://hpcf.nersc.gov/running_jobs/cray/batch.html
– Batch differences between IBM and Cray: http://hpcf.nersc.gov/running_jobs/ibm/lldiff.html

