1 Critical Flags, Variables, and Other Important ALCF Minutiae
  Jini Ramprakash, Technical Support Specialist, Argonne Leadership Computing Facility

2 Presentation outline
  • It’s all about your job!
    – Job management
    – Job basics
        Submission
        Queuing
        Execution
        Termination
  • Software environment
  • Optimization for beginners
  • ALCF resources, outlined
Argonne Leadership Computing Facility - supported by the Office of Science of the U.S. Department of Energy

3 Job management
  • Cobalt (the ALCF resource scheduler) is used on all ALCF systems
    – Similar to PBS but not the same
    – Find more information at http://trac.mcs.anl.gov/projects/cobalt
  • Job management commands:
    – qsub: submit a job
    – qstat: query a job status
    – qdel: delete a job
    – qalter: alter batched job parameters
    – qmove: move job to different queue
    – qhold: place queued (non-running) job on hold
    – qrls: release hold on job
    – showres: show current and future reservations

4 Job basics – submission
  • Two modes of submitting jobs
    – Basic
    – Script mode
  • Get all flags and options by running ‘man qsub’
  • For example:
      qsub -A alchemy -n 40960 --mode c1 -t 720 --env "OMP_NUM_THREADS=4" lead_to_gold
    – In English: Charge project “alchemy” for this job. Run on 40960 nodes, with one MPI rank per node. Run for 720 minutes. Set the “OMP_NUM_THREADS” environment variable to 4. Run the “lead_to_gold” binary.
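To give a feel for the size of that request, the sketch below rebuilds the example command from variables and estimates the core-hours it asks for. The figure of 16 cores per node follows from the Mira numbers in this transcript (786,432 cores / 49,152 nodes). The charging formula (nodes × cores per node × wall-clock hours) is an assumption rather than something the slides state, and the variable names are purely illustrative.

```shell
#!/bin/sh
# Values taken from the example qsub line above.
project=alchemy
nodes=40960
minutes=720
mode=c1
binary=lead_to_gold

# Assumption: 16 cores per node (Mira: 786,432 cores / 49,152 nodes).
cores_per_node=16

# Assemble the command as text only; nothing is submitted here.
qsub_cmd="qsub -A $project -n $nodes --mode $mode -t $minutes --env OMP_NUM_THREADS=4 $binary"
echo "$qsub_cmd"

# Assumption: charge = nodes * cores per node * wall-clock hours.
core_hours=$(( nodes * cores_per_node * minutes / 60 ))
echo "Requested core-hours: $core_hours"
```

Running the arithmetic before submitting is a cheap way to check a request against what remains in the project's allocation.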

5 qsub checks your submission for sanity
  • Did you specify a nodecount and walltime? Are they legal?
  • Is the mode you specified valid?
  • Did you ask for more than the minimum runtime?
  • Are you a member of the project you specified? Does that project have a usable allocation?
  • If so … all systems go! Get a JOBID, and put it in the queue

6 Not there yet!

7 Job basics - life in the queue
  • Periodically, your job’s score will increase
  • Periodically, the scheduler will decide if there are any jobs it wants to run
  • Check current state with qstat
  • At some point, your score will be high enough, and it will be YOUR TURN!

8 Score accrual
  • Large jobs are prioritized
  • Jobs that have been waiting long are prioritized
  • INCITE/ALCC projects are prioritized
  • Jobs charged to a negative allocation have a score cap lower than the starting score of other jobs

9 Job basics - execution
  • Book-keeping
    – Put a start record in the database. Output a log file start record. Send email of job start if --notify was requested. Start job timers
  • Fire up to execute the job
    – Cobalt boots partition
    – runjob starts executable

10 Script mode jobs
  • All jobs launch via runjob on the service nodes
  • Script mode jobs launch your script on a special login node
  • That script is responsible for calling runjob to launch the actual compute-node job
  • You are charged for the duration of the script
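A minimal sketch of such a script follows. It rests on several assumptions not stated in the slides: that Cobalt exports the booted block name as COBALT_PARTNAME, that runjob accepts --np, -p, --block and --envs with the executable after a colon, and that the node count of 512 matches what was given to qsub. Check ‘man runjob’ and ‘man qsub’ before relying on any of it. When runjob is not found, the script only prints the command it would have run.

```shell
#!/bin/sh
# Sketch of a script-mode job script (all values are illustrative).
nodes=512          # assumption: same as the -n passed to qsub
ranks_per_node=1   # one MPI rank per node, as with --mode c1
np=$(( nodes * ranks_per_node ))

# Assumption: Cobalt sets COBALT_PARTNAME to the booted block.
block=${COBALT_PARTNAME:-NO_BLOCK}

launch="runjob --np $np -p $ranks_per_node --block $block --envs OMP_NUM_THREADS=4 : ./lead_to_gold"

if command -v runjob >/dev/null 2>&1; then
    $launch                      # real launch onto the compute nodes
    status=$?
else
    echo "dry run: $launch"      # off-system: only show the command
    status=0
fi
echo "runjob exit status: $status"
```

Because the charge covers the whole script, any slow pre- or post-processing placed around the runjob call is billed at the full partition size.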

11 Job basics – termination aka are we there yet?
  • Your requested wall-time ticks down. Either your runjob returns, or you run out of wall-time and your job is forcibly removed
  • Job-end cleanup happens
    – If your partition wasn’t cleaned up, that happens now
  • Job-end book-keeping happens
    – Database, log file, notify if requested

12 Job basics – Termination, life after your job
  • If another job depends on yours, it can be released to run. If your job ended with a non-zero exit code, the dependent job moves to dep_fail instead
  • That night, the log files will be fed into clusterbank (the ALCF accounting system) to create charges
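The dependency behaviour can be sketched as a two-job chain. This is a dry-run illustration under stated assumptions: that qsub prints the new job ID on standard output, and that it accepts a --dependencies option taking a job ID (confirm both with ‘man qsub’). The binary polish_gold and the placeholder job ID 1001 are hypothetical.

```shell
#!/bin/sh
# Wrapper: call the real qsub if present, otherwise fake a job ID.
submit() {
    if command -v qsub >/dev/null 2>&1; then
        qsub "$@"
    else
        echo "dry run: qsub $*" >&2   # show the command on stderr
        echo 1001                      # hypothetical placeholder job ID
    fi
}

# First job; assumption: its job ID is what qsub prints on stdout.
first_id=$(submit -A alchemy -n 512 -t 60 --mode c1 lead_to_gold)

# Second job waits on the first (assumed --dependencies option).
second_cmd="qsub -A alchemy -n 512 -t 60 --mode c1 --dependencies $first_id polish_gold"
echo "$second_cmd"
```

If the first job exits non-zero, the second would sit in dep_fail rather than run, so a script-mode job should pass its runjob exit status through faithfully.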

13 Non-standard job events
  • Reservations and/or draining
  • qsub rejection
  • Job holds
  • Job redefinition (qalter)
  • Job removal (qdel)
  • Abnormal job failure
  • Why isn’t this job running?

14 Software environment - SoftEnv
  • A tool for managing your environment
    – Sets your PATH to access desired front-end tools
    – Your compiler version can be changed here
  • Settings:
    – Maintained in the file ~/.soft
    – Add/remove keywords from ~/.soft to change environment
    – Make sure @default is at the very end
  • Commands:
    – softenv: lists all keywords defined on the system
    – resoft: reloads the initial environment from the ~/.soft file
    – soft add|remove keyword: temporarily modifies the environment by adding/removing keywords
  • http://www.mcs.anl.gov/hs/software/systems/softenv/softenv-intro.html
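The "@default goes last" rule is easy to check mechanically. The sketch below writes a sample file to a temporary location (it does not touch the real ~/.soft) and verifies that the last non-blank line is @default. The keyword +mpiwrapper-xl is an assumed example; run softenv to see the keywords that actually exist on your system.

```shell
#!/bin/sh
# Write a sample .soft file to a temporary path, not the real ~/.soft.
soft_file=$(mktemp)
cat > "$soft_file" <<'EOF'
+mpiwrapper-xl
@default
EOF

# Take the last non-blank line and compare it with @default.
last_keyword=$(grep -v '^[[:space:]]*$' "$soft_file" | tail -n 1)
if [ "$last_keyword" = "@default" ]; then
    echo "OK: @default is last"
else
    echo "WARNING: move @default to the end"
fi
rm -f "$soft_file"
```

After editing the real file, resoft reloads the environment from it.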

15 Software libraries
  • ALCF supports two sets of libraries:
    – IBM system and provided libraries: /bgsys/drivers/ppcfloor
        glibc
        mpi
    – Site supported libraries and programs: /soft/
        PETSc
        ESSL
    – And many others
  • See http://www.alcf.anl.gov/resource-guides/software-and-libraries

16 Compiler wrappers
  • MPI wrappers for IBM XL cross-compilers:

    Wrapper      Thread-Safe Wrapper   Underlying Compiler   Description
    mpixlc       mpixlc_r              bgxlc                 IBM BG C Compiler
    mpixlcxx     mpixlcxx_r            bgxlC                 IBM BG C++ Compiler
    mpixlf77     mpixlf77_r            bgxlf                 IBM BG Fortran 77 Compiler
    mpixlf90     mpixlf90_r            bgxlf90               IBM BG Fortran 90 Compiler
    mpixlf95     mpixlf95_r            bgxlf95               IBM BG Fortran 95 Compiler
    mpixlf2003   mpixlf2003_r          bgxlf2003             IBM BG Fortran 2003 Compiler

  • MPI wrappers for GNU cross-compilers:

    Wrapper   Underlying Compiler          Description
    mpicc     powerpc-bgp-linux-gcc        GNU BG C Compiler
    mpicxx    powerpc-bgp-linux-g++        GNU BG C++ Compiler
    mpif77    powerpc-bgp-linux-gfortran   GNU BG Fortran 77 Compiler
    mpif90    powerpc-bgp-linux-gfortran   GNU BG Fortran 90 Compiler

17 Optimization for beginners
  • Suggested set of optimization levels from least to most optimization:
    – -O0 # best level for use with a debugger
    – -O2 # good level for verifying correctness, baseline perf
    – -O2 -qmaxmem=-1 -qhot=level=0
    – -O3 -qstrict (preserves program semantics)
    – -O3
    – -O3 -qhot=level=1
    – -O4
    – -O5
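One way to walk this ladder is to build the same file once per level and compare results and timings at each step. The sketch below only prints the compile lines; it assumes the mpixlc wrapper from the compiler-wrappers slide, and solver.c is a hypothetical source file.

```shell
#!/bin/sh
# Print one compile line per optimization level, least to most.
# Nothing is compiled; solver.c is a hypothetical file name.
step=0
while IFS= read -r flags; do
    echo "mpixlc $flags -c solver.c -o solver_step$step.o"
    step=$(( step + 1 ))
done <<'EOF'
-O0
-O2
-O2 -qmaxmem=-1 -qhot=level=0
-O3 -qstrict
-O3
-O3 -qhot=level=1
-O4
-O5
EOF
echo "levels listed: $step"
```

Stopping at the first level where the output changes or the run time gets worse keeps the search short.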

18 Optimization tips
  • -qlistopt generates a listing with all flags used in compilation
  • -qreport produces a listing, shows how code was optimized
  • Performance can decrease at higher levels of optimization, especially at -O4 or -O5
  • May specify different optimization levels for different routines/files
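Mixing levels per file might look like the sketch below: a conservative level for setup code, a more aggressive one for the hot kernel, with the listing flags from this slide switched on. The file names setup.c and kernel.c are hypothetical, and the commands are only printed when mpixlc is not on the PATH.

```shell
#!/bin/sh
# Run a command if mpixlc is available, otherwise just print it.
run() {
    if command -v mpixlc >/dev/null 2>&1; then
        "$@"
    else
        echo "dry run: $*"
    fi
}

# Conservative level for setup code, with a listing of the flags used.
run mpixlc -O2 -qlistopt -c setup.c -o setup.o
# Aggressive level for the hot kernel, with an optimization report.
run mpixlc -O3 -qhot=level=1 -qreport -c kernel.c -o kernel.o
# Link the two objects into the final binary.
run mpixlc setup.o kernel.o -o lead_to_gold
```

Keeping the aggressive flags confined to a few files makes it easier to tell which routine is responsible if results change.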

19 ALCF Resources – BG/Q systems
  • Mira – BG/Q system
    – 49,152 nodes / 786,432 cores
    – 786 TB of memory
    – Peak flop rate: 10 PF
    – Linpack flop rate: 8.1 PF
  • Cetus (T&D) – BG/Q system
    – 1,024 nodes / 16,384 cores
    – 16 TB of memory
    – Peak flop rate: 208 TF
  • Vesta (T&D) – BG/Q system
    – 2,048 nodes / 32,768 cores
    – 32 TB of memory
    – Peak flop rate: 416 TF

20 ALCF Resources – supporting systems
  • Tukey – NVIDIA system
    – 100 nodes / 1,600 x86 cores / 200 M2070 GPUs
    – 6.4 TB x86 memory / 1.2 TB GPU memory
    – Peak flop rate: 220 TF
  • Storage
    – Scratch: 28.8 PB raw capacity, 240 GB/s bw (GPFS)
    – Home: 1.8 PB raw capacity, 45 GB/s bw (GPFS)
    – Storage upgrade planned in 2015

21 ALCF Resources
  • Mira: 48 racks / 768K cores, 10 PF
  • Cetus (Dev): 1 rack / 16K cores, 208 TF
  • Vesta (Dev): 2 racks / 32K cores, 416 TF
  • Tukey (Viz): 100 nodes / 1600 cores, 200 NVIDIA GPUs, 220 TF
  • Networks: 100Gb (via ESnet, Internet2, UltraScienceNet)

22 Coming up next…
  • Data Transfers in the ALCF - Robert Scott, ALCF

23 Thank You!
  • Questions?

