1
An Introduction to High Performance Computing
Zhiwei Wang, Research Computing Support Group of OIT, University of Texas at San Antonio
2
Shamu – the HPC Cluster at UTSA
88 nodes
4952 cores
Tflops: 28T
Memory
150 terabyte storage
2 Nvidia Tesla K80 GPU nodes
1 node with 72 Xeon CPU cores and 1.5 TB RAM
40G InfiniBand interconnect
10G interface for the login node
10G interface to TACC
CentOS Linux
3
What is SGE (Sun Grid Engine)
Sun Grid Engine is a grid computing computer cluster software system (otherwise known as a batch-queuing system). It is typically used on a computer farm or high-performance computing (HPC) cluster and is responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of large numbers of standalone, parallel or interactive user jobs.
4
How SGE Works
SGE supports both batch and interactive jobs.
There are three kinds of batch jobs: sequential, parallel, and array.
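Array batch jobs do not appear in the later examples, so here is a minimal sketch of one; the script name process.sh, the task range 1-10, and the job name are only illustrative placeholders:

#!/bin/bash
#$ -q all.q
#$ -N arrayExample
#$ -cwd
#$ -j y
#$ -o $JOB_ID.$TASK_ID.log
#$ -t 1-10                      # run this script as 10 tasks, numbered 1 through 10
. /etc/profile.d/modules.sh
# SGE sets $SGE_TASK_ID to the index of the current task
./process.sh $SGE_TASK_ID

Each task is scheduled independently, so a single qsub can cover, for example, many input files named by index.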
5
How to Access Shamu
From Linux and Mac:
ssh -p1209
ssh -Y -p1209        (adds X11 forwarding for graphical programs)
From Windows:
Install an SSH client program, such as MobaXterm, then:
ssh -p1209
ssh -Y -p1209
6
First Thing on Shamu
If you want to launch an interactive job:
module load sge
qlogin
This brings you to one of the compute nodes.
Do Not Run Jobs on the Login Node!
If you want to compile your code or submit batch jobs, stay on the login node.
7
Module System on Shamu
What are Environment Modules?
The Environment Modules package is a tool that lets users easily modify their environment during a session using modulefiles. Each modulefile contains the information needed to configure the shell for an application; typically it instructs the module command to alter or set shell environment variables such as PATH, MANPATH, etc.
Modules can be loaded and unloaded dynamically.
Modules are useful for managing different versions of applications.
Do not use absolute paths in your Makefile.
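For example, a typical sequence of module commands looks like this (the python module name is taken from a later slide; the exact names on your system may differ):

module avail                  # list all modulefiles installed on the cluster
module load python/3.6.1      # put Python 3.6.1 on PATH and related variables
module list                   # show which modules are currently loaded
module unload python/3.6.1    # undo the changes made by the load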
8
Module System on Shamu – an Example
~]$ module load sge
~]$ module list
Currently Loaded Modulefiles:
  1) slurm/   2) sge/ p1
~]$ qlogin
local configuration login01.cm.cluster not defined - using global configuration
Your job ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job has been successfully scheduled.
Establishing /cm/shared/apps/sge/var/cm/qlogin_wrapper session to host compute025.cm.cluster ...
Last login: Fri Jan 11 10:54 from login02.cm.cluster
~]$ matlab
bash: matlab: command not found...
~]$ module load matlab/R2018a
~]$ matlab
MATLAB is selecting SOFTWARE OPENGL rendering.
Note: Always use modules instead of hard-coded paths, as paths may change.
9
Interactive Job – launch MatLab
MATLAB, for instance:
module load sge
qlogin -pe threaded <#cores>
module load matlab/R2017a
matlab
10
Interactive Job – run python program
~]$ cp -r /work/training1/ .
~]$ qlogin
~]$ cd training1/python/
python]$ ls
26313.log  log  hello.py  test.ps
python]$ cat hello.py
print("Hello, World!")
python]$ module load python/3.6.1
python]$ python3 hello.py
Hello, World!
11
Submit a Batch Job
Prepare the job description file. Sample scripts are shown on the following slides.
Submit the job to a queue:
abc123]$ module load sge
abc123]$ qsub jobscript
Check the status of the job:
bin]$ qstat
job-ID  prior  name    user    state  submit/start at  queue  slots  ja-task-ID
        QLOGIN  iqr224  r   05/19/ :30:34   1
Check the details of a job:
bin]$ qstat -j jobID
Monitor cluster usage:
bin]$ qhost
Delete a job from the queue:
bin]$ qdel jobID
Note: Job submission can only be done on the login node.
12
Sequential Batch Job Description
The job script (test.ps):
#!/bin/bash
#$ -q all.q
#$ -N pHello
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log
. /etc/profile.d/modules.sh    # Load one of these
module load python/3.6.1
python3 hello.py

python]$ ls
hello.py  test.ps
python]$ qsub test.ps
Your job ("test.ps") has been submitted
python]$ qstat
job-ID  prior  name     user    state  submit/start at  queue  slots  ja-task-ID
        QLOGIN   iqr224  r   05/19/ :30:34  1
        test.ps  iqr224  qw  05/19/ :47:22  1
python]$ ls
hello.py  test.ps  test.ps.o20235
python]$ cat test.ps.o20235
Hello, World!
python]$
Note: Always include the line "#$ -q all.q". Otherwise the job might be sent to dev.q and will be killed if it runs longer than 10 minutes.
13
Parallel Batch Job
[iqr224@login-0-0 mpihello]$ cat hello.job
#$ -N mpiHello
#$ -q all.q
#$ -cwd
#$ -j y
#$ -m aes          ## email when the job is aborted, ends, or is suspended
#$ -M
#$ -o $JOB_ID.log
#$ -pe openmpi 4
. /etc/profile.d/modules.sh    # Load one of these
module load openmpi/gcc/64/1.10.1
mpirun -n $NSLOTS ./a.out
exit 0

mpihello]$ qsub hello.job
Your job ("mpiHello") has been submitted
mpihello]$ qstat
job-ID  prior  name      user    state  submit/start at  queue  slots  ja-task-ID
        QLOGIN    iqr224  r  05/19/ :30:34  1
        mpiHello  iqr224  r  05/19/ :32:58  4
        QLOGIN    iqr224  r  05/19/ :30:34  1
mpihello]$ cat log
-catch_rsh /cm/local/apps/sge/var/spool/compute024/active_jobs/ /pe_hostfile
compute024
0: We have 4 processors
sum
sum
sum
total time used:
14
Other Useful Commands
#$ -l h_vmem=XXG              # hard limit on virtual memory per slot
#$ -l mem_free=XXG            # only schedule on a node with at least XX GB of memory free
#$ -pe openmpi_roundrobin 16  # spread the 16 slots across nodes round-robin instead of filling one node first
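As a minimal sketch, these resource requests sit in the job script next to the other #$ directives; the 8G values, the job name, and a.out are only placeholders:

#!/bin/bash
#$ -q all.q
#$ -N memDemo
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log
#$ -l mem_free=8G     # only start on a node that currently has 8 GB of memory free
#$ -l h_vmem=8G       # hard per-slot limit on virtual memory
./a.out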
15
Queue Status Codes
Code  Meaning
qw    waiting in the queue to start
t     transferring to a node (about to start)
r     running
h     job held back by user request
E     error
16
Important Practices about Multi-threaded Jobs
a.out is a 4-thread application. It is the user's responsibility to tell SGE how many threads the application will run.

With "#$ -pe threaded 4" in hello.job, the scheduler reserves 4 slots for the job:
#$ -N Hello
#$ -q all.q
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log
#$ -pe threaded 4
. /etc/profile.d/modules.sh
./a.out
exit 0

mthread_sum]$ qsub hello.job
Your job ("Hello") has been submitted
mthread_sum]$ qstat
job-ID  prior  name   user    state  submit/start at  queue  slots  ja-task-ID
        Hello  iqr224  qw  05/22/ :22:14  4
        Hello  iqr224  r   05/22/ :22:24  4

Without the "-pe threaded" line, the same 4-thread job is scheduled with only 1 slot, so its extra threads compete with other users' jobs for cores on that node:
#$ -N Hello
#$ -q all.q
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log
. /etc/profile.d/modules.sh
./a.out
exit 0

mthread_sum]$ qsub hello.job
Your job ("Hello") has been submitted
mthread_sum]$ qstat
job-ID  prior  name   user    state  submit/start at  queue  slots  ja-task-ID
        Hello  iqr224  qw  05/22/ :22:14  1
        Hello  iqr224  r   05/22/ :22:24  1
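If the program takes its thread count from the environment, as OpenMP programs do, a common pattern (not shown in the original slides, so treat it as a sketch) is to derive the thread count from the granted slots so the two always match:

#$ -N Hello
#$ -q all.q
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log
#$ -pe threaded 4
. /etc/profile.d/modules.sh
export OMP_NUM_THREADS=$NSLOTS   # $NSLOTS is set by SGE to the number of slots granted
./a.out
exit 0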
17
Jobs in the Queue
QLOGIN      prh169  r   07/12/ :35:56  1
QLOGIN      quj690  r   07/12/ :52:15  1
ThaoAbaqus  rgc620  qw  07/12/ :38:18  16
ThaoAbaqus  rgc620  qw  07/12/ :38:18  16
ThaoAbaqus  rgc620  qw  07/12/ :38:18  16
Shamu is a shared resource; dozens of users might be using it at the same time.
Jobs might sit in the "qw" state for hours. Don't panic.
A maximum of 256 cores is available for OpenMPI applications.
Single- or multi-threaded applications have no limitation other than the physical core count per node, which is 32.
18
SLURM
Slurm is a highly configurable workload manager and job scheduler. It is open-source software backed by a large community and installed on many of the Top 500 supercomputers.
Tutorial: hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html
19
SLURM
sinfo gives an overview of the resources offered by the cluster:
~]$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE   NODELIST
defq*      up     infinite   2      drain*  compute[041,064]
defq*      up     infinite   1      drain   compute059
defq*      up     infinite   77     idle    compute[...]
dev        up     infinite   1      idle    compute042
bigmem     up     infinite   1      alloc   compute009
gpu        up     infinite   2      idle    gpu[01-02]
~]$ sinfo -N -l
Wed Aug 28 09:35
NODELIST    NODES  PARTITION  STATE  CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
compute001  1      defq*      idle   40    2:10:   (null)  none
compute002  1      defq*      idle   40    2:10:   (null)  none

squeue shows to which jobs those resources are currently allocated:
~]$ squeue
JOBID  PARTITION  NAME      USER    ST  TIME        NODES  NODELIST(REASON)
1770   bigmem     sys/dash  xce775  R   3-16:24:04  1      compute009
~]$ squeue --user abc123
~]$ scancel 12345

Interactive sessions:
slurm]$ srun --pty bash
slurm]$ srun -N 2 --ntasks-per-node=8 --pty bash
srun -p gpu --pty bash
20
SLURM – a sample job script
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
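The sample above only requests resources; the actual commands go after the #SBATCH lines, and the script is handed to sbatch on the login node. A hedged completion, assuming the file is saved as test.sh and using srun hostname as a stand-in for real work:

# appended at the end of test.sh:
srun hostname                    # placeholder for the real program

# then, on the login node:
~]$ sbatch test.sh               # sbatch prints the ID of the submitted job
~]$ squeue --user abc123         # watch it run; the output lands in res.txt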
21
SLURM
In the Slurm context, a task is to be understood as a process.
- Tasks are requested/created with the --ntasks option.
- CPUs, for multi-threaded programs, are requested with the --cpus-per-task option.
A sample MPI job script:
#!/bin/bash
#SBATCH --job-name=test_mpi
##SBATCH --output=out.ext
#SBATCH --mail-type=ALL
#SBATCH
#SBATCH --ntasks=40
. /etc/profile.d/modules.sh
module load openmpi/3.0.0
#srun a.out
mpirun -n 40 mpi-host
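The script above is a pure MPI example; for a single multi-threaded (e.g. OpenMP) process the CPUs would be requested with --cpus-per-task instead. A minimal sketch, with omp-prog standing in for the real executable:

#!/bin/bash
#SBATCH --job-name=test_omp
#SBATCH --output=omp.txt
#SBATCH --ntasks=1               # one process ...
#SBATCH --cpus-per-task=8        # ... with 8 CPUs for its threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # Slurm exports the cpus-per-task value
./omp-prog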
22
Other Useful Linux Commands
find /home/username/ -name "*.err"   (find files in the specified directory)
qlogin -l h=compute021               (qlogin to compute021)
qlogin -q gpu.q                      (qlogin to the GPU nodes, if you have permission)
qdel job-id                          (delete a job)
ls                                   (list the files in the current directory)
rm filename                          (remove a file)
pwd                                  (show the path of the current directory)
cd                                   (change directory)
23
Upcoming Trainings