An Introduction to High Performance Computing


1 An Introduction to High Performance Computing
Zhiwei Wang, Research Computing Support Group of OIT, University of Texas at San Antonio

2 Shamu – the HPC Cluster at UTSA
88 nodes, 4,952 cores
28 Tflops
150 Terabyte storage
2 Nvidia Tesla K80 GPU nodes
1 node with 72 Xeon CPU cores and 1.5 TB RAM
40G InfiniBand interconnect
10G interface for the login node
10G interface to TACC
CentOS Linux

3 What is SGE (Sun Grid Engine)
Sun Grid Engine is a grid computing computer cluster software system (otherwise known as a batch-queuing system). It is typically used on a computer farm or high-performance computing (HPC) cluster and is responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of large numbers of standalone, parallel or interactive user jobs.

4 How SGE Works
SGE supports batch and interactive jobs.
Three kinds of batch jobs: sequential, parallel, and array (an array-job sketch follows below).
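Array jobs are not demonstrated on the later slides, so here is a minimal sketch of an SGE array-job script. It follows the same conventions as the sequential example later (all.q, modules.sh); the task range 1-10 and the script name process.py are placeholders, not part of the training materials.
#!/bin/bash
#$ -q all.q
#$ -N arrayHello
#$ -cwd
#$ -j y
#$ -o $JOB_ID.$TASK_ID.log
#$ -t 1-10                        # run 10 tasks; SGE schedules each independently
. /etc/profile.d/modules.sh
module load python/3.6.1
python3 process.py $SGE_TASK_ID   # each task receives its own index via $SGE_TASK_ID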

5 How to Access Shamu
From Linux and Mac:
ssh -p1209 <username>@<login-host>
ssh -Y -p1209 <username>@<login-host>   (with X11 forwarding)
From Windows:
Install an SSH client program, such as MobaXterm, then connect with the same ssh commands.
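As a convenience, the non-standard port can be stored in ~/.ssh/config so you do not retype it. This is a minimal sketch; the alias "shamu", the login hostname, and the username below are placeholders, not the actual Shamu address.
Host shamu
    HostName <login-host>    # replace with the Shamu login node address
    Port 1209
    User abc123              # replace with your UTSA username
After that, connecting is simply: ssh shamu   (or ssh -Y shamu for X11 forwarding).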

6 First Thing on Shamu
If you want to launch an interactive job:
module load sge
qlogin
This brings you to one of the compute nodes.
Do Not Run Jobs on the Login Node!!!
If you want to compile your code or to submit batch jobs, stay on the login node.

7 Module System on Shamu
What are Environment Modules?
The Environment Modules package is a tool that lets users easily modify their environment during a session with modulefiles. Each modulefile contains the information needed to configure the shell for an application; typically, modulefiles instruct the module command to alter or set shell environment variables such as PATH, MANPATH, etc.
Modules can be loaded and unloaded dynamically.
Modules are useful for managing different versions of applications.
Do not use absolute paths in your Makefile (common module commands are sketched below).
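A quick reference of commonly used module commands. This is a minimal sketch; the matlab/R2018a module name is taken from the example on the next slide, the commands themselves are standard Environment Modules commands.
module avail                  # list all modules installed on the cluster
module load matlab/R2018a     # add an application to your environment
module list                   # show the modules currently loaded
module unload matlab/R2018a   # remove a module from your environment
module purge                  # unload all loaded modules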

8 Module System on Shamu – an Example
~]$ module load sge
~]$ module list
Currently Loaded Modulefiles:
  1) slurm/   2) sge/
~]$ qlogin
local configuration login01.cm.cluster not defined - using global configuration
Your job ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job has been successfully scheduled.
Establishing /cm/shared/apps/sge/var/cm/qlogin_wrapper session to host compute025.cm.cluster ...
Last login: Fri Jan 11 10:54 from login02.cm.cluster
~]$ matlab
bash: matlab: command not found...
~]$ module load matlab/R2018a
~]$ matlab
MATLAB is selecting SOFTWARE OPENGL rendering.
Note: Always use modules instead of hardcoded paths, as paths may change.

9 Interactive Job – launch MatLab
MatLab for instance:
module load sge
qlogin -pe threaded <number of cores>
module load matlab/R2017a
matlab

10 Interactive Job – run python program
~]$ cp -r /work/training1/ .
~]$ qlogin
~]$ cd training1/python/
python]$ ls
26313.log  log  hello.py  test.ps
python]$ cat hello.py
print("Hello, World!")
python]$ module load python/3.6.1
python]$ python3 hello.py
Hello, World!

11 Submit a Batch Job
Prepare the job description file. Samples can be found in the training directory (/work/training1/) copied on the previous slide.
Submit the job to a queue:
abc123]$ module load sge
abc123]$ qsub jobscript
Check the status of the job:
bin]$ qstat
job-ID  prior  name    user    state  submit/start at   queue  slots  ja-task-ID
        QLOGIN  iqr224  r      05/19/ :30:34                   1
Check the details of the job:
bin]$ qstat -j jobID
Monitor cluster usage:
bin]$ qhost
Delete a job from the queue:
bin]$ qdel jobID
Note: Job submission can only be done on the login node.

12 Sequential Batch Job Description
#!/bin/bash
#$ -q all.q
#$ -N pHello
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log
. /etc/profile.d/modules.sh
# Load one of these
module load python/3.6.1
python3 hello.py

python]$ ls
hello.py  test.ps
python]$ qsub test.ps
Your job ("test.ps") has been submitted
python]$ qstat
job-ID  prior  name     user    state  submit/start at   queue  slots  ja-task-ID
        QLOGIN   iqr224  r      05/19/ :30:34                   1
        test.ps  iqr224  qw     05/19/ :47:22                   1
python]$ ls
hello.py  test.ps  test.ps.o20235
python]$ cat test.ps.o20235
Hello, World!
Note: Always include the line "#$ -q all.q". Otherwise, the job might be sent to dev.q and will be killed if it runs longer than 10 minutes.

13 Parallel Batch Job
[iqr224@login-0-0 mpihello]$ cat hello.job
#$ -N mpiHello
#$ -q all.q
#$ -cwd
#$ -j y
#$ -m aes            ## email when the job is aborted, ends, or is suspended
#$ -M <email address>
#$ -o $JOB_ID.log
#$ -pe openmpi 4
. /etc/profile.d/modules.sh
# Load one of these
module load openmpi/gcc/64/1.10.1
mpirun -n $NSLOTS ./a.out
exit 0

mpihello]$ qsub hello.job
Your job ("mpiHello") has been submitted
mpihello]$ qstat
job-ID  prior  name      user    state  submit/start at   queue  slots  ja-task-ID
        QLOGIN    iqr224  r      05/19/ :30:34                   1
        mpiHello  iqr224  r      05/19/ :32:58                   4
        QLOGIN    iqr224  r      05/19/ :30:34                   1
mpihello]$ cat log
-catch_rsh /cm/local/apps/sge/var/spool/compute024/active_jobs/ /pe_hostfile
compute024
0: We have 4 processors
sum  sum  sum
total time used:
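The job script runs a precompiled ./a.out. As a minimal sketch of how that binary could be built on the login node (assuming an MPI source file named hello.c, which is not part of the training materials shown here):
module load openmpi/gcc/64/1.10.1     # same MPI module the job script loads
mpicc hello.c -o a.out                # compile with the OpenMPI compiler wrapper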

14 Other Useful Commands
#$ -l h_vmem=XXG                 (hard limit on virtual memory per slot)
#$ -l mem_free=XXG               (only schedule on nodes with at least XX GB of free memory)
#$ -pe openmpi_roundrobin 16     (distribute the 16 slots across nodes in round-robin fashion instead of filling one node first)
A combined example follows below.
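A minimal sketch combining these resource requests with the job-script conventions from the earlier slides; the 4G figure and the program name ./a.out are placeholders.
#!/bin/bash
#$ -q all.q
#$ -N bigMem
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log
#$ -l h_vmem=4G                  # hard virtual-memory limit per slot
#$ -l mem_free=4G                # only start on a node with at least 4 GB free
#$ -pe openmpi_roundrobin 16     # spread the 16 MPI slots across nodes
. /etc/profile.d/modules.sh
module load openmpi/gcc/64/1.10.1
mpirun -n $NSLOTS ./a.out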

15 Queue Status
Code   Meaning
qw     waiting in the queue to start
t      transferring to a node (about to start)
r      running
h      job held back by user request
E      error

16 Important Practices about Multi-threaded Jobs
a.out is a 4-thread app. It is the user's responsibility to tell SGE how many threads the app will run (see the thread-count sketch below).
With "#$ -pe threaded 4", SGE reserves 4 slots for the job:
#$ -N Hello
#$ -q all.q
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log
#$ -pe threaded 4
. /etc/profile.d/modules.sh
./a.out
exit 0

mthread_sum]$ qsub hello.job
Your job ("Hello") has been submitted
mthread_sum]$ qstat
job-ID  prior  name   user    state  submit/start at   queue  slots  ja-task-ID
        Hello   iqr224  qw    05/22/ :22:14                   4
        Hello   iqr224  r     05/22/ :22:24                   4

Without the -pe line, SGE reserves only 1 slot even though the program still runs 4 threads:
#$ -N Hello
#$ -q all.q
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log
. /etc/profile.d/modules.sh
./a.out
exit 0

mthread_sum]$ qsub hello.job
Your job ("Hello") has been submitted
mthread_sum]$ qstat
job-ID  prior  name   user    state  submit/start at   queue  slots  ja-task-ID
        Hello   iqr224  qw    05/22/ :22:14                   1
        Hello   iqr224  r     05/22/ :22:24                   1
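The slides do not show how the thread count is passed to the application. For an OpenMP program, a common pattern (an assumption, not part of the original material) is to tie the thread count to the slots SGE granted, inside the job script:
#$ -pe threaded 4
. /etc/profile.d/modules.sh
export OMP_NUM_THREADS=$NSLOTS   # run exactly as many threads as slots were reserved
./a.out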

17 Jobs in queue
QLOGIN      prh169   r    07/12/ :35:56    1
QLOGIN      quj690   r    07/12/ :52:15    1
ThaoAbaqus  rgc620   qw   07/12/ :38:18    16
ThaoAbaqus  rgc620   qw   07/12/ :38:18    16
ThaoAbaqus  rgc620   qw   07/12/ :38:18    16
Shamu is a shared resource. Dozens of users might be using it at the same time.
Jobs might stay in the "qw" state for hours. Do not panic.
OpenMPI applications are limited to 256 cores.
Single or multithreaded applications have no limitation other than the physical core count per node, which is 32.

18 SLURM
Slurm is a highly configurable workload manager and job scheduler. It is open-source software backed by a large community and installed on many of the Top 500 supercomputers.
hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html

19 SLURM
sinfo - gives an overview of the resources offered by the cluster:
~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE  NODELIST
defq*     up    infinite  2     drain* compute[041,064]
defq*     up    infinite  1     drain  compute059
defq*     up    infinite  77    idle   compute[...]
dev       up    infinite  1     idle   compute042
bigmem    up    infinite  1     alloc  compute009
gpu       up    infinite  2     idle   gpu[01-02]
~]$ sinfo -N -l
Wed Aug 28 09:35
NODELIST   NODES PARTITION STATE CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
compute001 1     defq*     idle  40   2:10:                         (null)   none
compute002 1     defq*     idle  40   2:10:                         (null)   none
squeue - shows to which jobs those resources are currently allocated:
~]$ squeue
JOBID PARTITION NAME     USER   ST TIME       NODES NODELIST(REASON)
1770  bigmem    sys/dash xce775 R  3-16:24:04 1     compute009
~]$ squeue --user abc123
scancel - cancels a job:
~]$ scancel 12345
srun - starts an interactive session:
slurm]$ srun --pty bash
slurm]$ srun -N 2 --ntasks-per-node=8 --pty bash
slurm]$ srun -p gpu --pty bash

20 SLURM – a sample job script
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100
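The slides do not show how this script is submitted; a minimal sketch using standard Slurm commands (the script file name job.sh and the username abc123 are placeholders):
~]$ sbatch job.sh            # submit the batch script; Slurm prints the assigned job ID
~]$ squeue --user abc123     # watch the job until it leaves the queue
~]$ cat res.txt              # inspect the output file named by --output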

21 SLURM
- In the Slurm context, a task is to be understood as a process.
- Tasks are requested/created with the --ntasks option.
- CPUs, for multithreaded programs, are requested with the --cpus-per-task option (a multithreaded sketch follows below).
A sample MPI job script:
#!/bin/bash
#SBATCH --job-name=test_mpi
##SBATCH --output=out.ext
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email address>
#SBATCH --ntasks=40
. /etc/profile.d/modules.sh
module load openmpi/3.0.0
#srun a.out
mpirun -n 40 mpi-host
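For completeness, a minimal sketch of a multithreaded (single-task, many-CPU) Slurm script using --cpus-per-task; the program name ./a.out and the count of 8 CPUs are placeholders.
#!/bin/bash
#SBATCH --job-name=test_threads
#SBATCH --output=res.txt
#SBATCH --ntasks=1                            # one process ...
#SBATCH --cpus-per-task=8                     # ... with 8 CPUs for its threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # common OpenMP convention: match threads to the CPUs granted
./a.out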

22 Other Useful Linux Commands
find /home/username/ -name "*.err"   (find files in the specified directory)
qlogin -l h=compute021   (qlogin to compute021)
qlogin -q gpu.q   (qlogin to the GPU nodes, if you have permission)
qdel job-id   (delete a job)
ls   (list the files in the current directory)
rm filename   (remove a file)
pwd   (show the path of the current directory)
cd   (change directory)

23 Upcoming Trainings

