
1 HPC@UAntwerp introduction
Stefan Becuwe, Franky Backeljauw, Kurt Lust, Bert Tijskens

2 Structure of the tutorial text
This lecture follows loosely the structure of the tutorial text.
Can it solve my computational needs? (Chapter 1: Introduction to HPC)
How to get an account? (Chapter 2: Getting a VSC account)
How do I connect to the HPC and transfer my files and programs? (Chapter 3: Preparing the environment)
How to start background jobs? (Chapter 4: Running batch jobs)
How to start jobs with user interaction? (Chapter 5: Running interactive jobs)
Where to collect my results? (Chapter 6: Running jobs with input/output data)
Can I speed up my program by exploring parallel programming techniques? (Chapter 7: Multi-core jobs / Parallel programming)
Can I start many jobs at once? (Chapter 8: Multi-job submission)
What are the policies? (Chapter 9: HPC policies)

3 Introduction to the VSC
CalcUA

4 VSC – Flemish Supercomputer Center
Vlaams Supercomputer Centrum (VSC)
Established in December 2007
Partnership between 5 university associations: Antwerp, Brussels, Ghent, Hasselt, Leuven
Funded by the Flemish Government through the FWO (Research Foundation – Flanders)
Goal: make HPC available to all researchers in Flanders (academic and industrial)
Local infrastructure (Tier-2) in Antwerp: the HPC core facility CalcUA

5 VSC: An integrated infrastructure according to the European model
(Diagram: the European pyramid model. Tier-0: European infrastructure (PRACE, EGI). Tier-1: national/regional systems. Tier-2: regional/university clusters, connected through the VSC network with shared storage and job management, and BEgrid. Tier-3: desktop machines.)

6 UAntwerp hardware
hopper (2014): 168 compute nodes (144 with 64 GB and 24 with 256 GB memory), each with 2 2.8 GHz 10-core Ivy Bridge CPUs (E5-2680v2) and ±400 GB local disk for swap and /tmp combined; 4 login nodes.
leibniz (2017): 152 compute nodes (144 with 128 GB and 8 with 256 GB memory), each with 2 2.4 GHz 14-core Broadwell CPUs (E5-2680v4), no swap and 10 GB local /tmp; 2 GPU compute nodes (dual Tesla P100); 1 Xeon Phi coprocessor node (KNL); 2 login nodes; 1 visualisation node (hardware-accelerated OpenGL).
The storage is a joint setup for hopper and leibniz with 100 TB capacity spread over a number of volumes.

7 The processor
Comparison hopper vs. leibniz:
CPU: Ivy Bridge E5-2680v2 vs. Broadwell E5-2680v4
Base clock speed: 2.8 GHz vs. 2.4 GHz (-14%)
Cores/node: 20 vs. 28 (+40%)
Vector instructions: AVX vs. AVX2+FMA
DP flops/core/cycle: 8 vs. 16 (e.g., FMA instruction) (+100%)
Memory speed: 8*64b*1866 MHz DDR3 vs. 8*64b*2400 MHz DDR4 (+29%)
DGEMM Gflops (50k x 50k): 456 Gflops vs. 974 Gflops (+113%)
GMPbench: 3875 vs. 3230/4409 (-17%/+14%)
DGEMM: used the Intel MKL library, matrix size almost as large as possible on a 64GB hopper node, checked over a couple of runs and different core numbers to get the highest result.
GMPbench: old GNU MP library benchmark. This is integer code that does not vectorize and runs single-threaded. Results on a compute node and a login node: the login nodes of leibniz have turbo mode enabled, the compute nodes haven't, while all nodes of hopper have turbo mode enabled.
Despite IPC improvements, some single-core applications will run slower on leibniz.
Face it: CPU performance is going down in the future unless you exploit parallelism (SIMD/vector and multi-core).
A recompile is needed to use all features of the leibniz CPU.

8 Our hardware

9 Tier-1 infrastructure BrENIAC (KU Leuven)
Inaugurated October 2016
580 compute nodes, each with 2 14-core Broadwell processors and 128 GB (435 nodes) or 256 GB (145 nodes) of memory, EDR-IB network
Nodes are essentially the same as on leibniz, but the interconnect is a half generation older
634 TB net storage capacity, peak bandwidth 20 GB/s
16,240 cores, theoretical peak performance 623 Tflops
#196 in the Top500 of June 2016

10 Tier-1 infrastructure BrENIAC (KU Leuven)

11 Tier-1 infrastructure BrENIAC (KU Leuven)

12 Characteristics of an HPC cluster
Shared infrastructure, used by multiple users simultaneously: so you have to request the amount of resources you need, and you may have to wait a little.
Expensive infrastructure: so software efficiency matters, and we can't afford nodes idling for user input. Therefore it is mostly for batch applications rather than interactive applications.
Built for parallel jobs. No parallelism = no supercomputing. Not meant for running a single single-core job.
Remote use model. But you're running (a different kind of) remote applications all the time on your phone or web browser…
Linux-based, so no Windows or macOS software, but most standard Linux applications will run.

13 Getting a VSC account CalcUA Tutorial Chapter 2

14 SSH keys
Almost all communication with the cluster happens through SSH (Secure SHell).
We use public/private key pairs rather than passwords to enhance security: the private key is never passed over the internet if you do things right.
Keys:
The private key is the part of the key pair that you have to keep very secure. Protect it with a passphrase!
The public key is put on the computer you want to access.
A hacker can do nothing with your public key without the private key.

15 Create SSH keys
ssh-keygen (OpenSSH):
/home/<username>/.ssh/id_rsa : this is your private key. Keep safe!
/home/<username>/.ssh/id_rsa.pub : this is your public key.
On the cluster, the public key ends up in /user/antwerpen/2xx/<vsc-username>/.ssh/authorized_keys
On Windows: PuTTY with the puttygen key generator:
But one file with public+private key (extension .ppk), which stays on the PC
And a second file with the public key (extension .pub)
Can convert to and from OpenSSH key files
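As a rough sketch (assuming the default OpenSSH key type and file names), generating a key pair on Linux or macOS looks like this; you then upload the .pub file during account registration:
$ ssh-keygen -t rsa -b 4096        # creates ~/.ssh/id_rsa (private) and ~/.ssh/id_rsa.pub (public); choose a strong passphrase when prompted
$ cat ~/.ssh/id_rsa.pub            # this is the public key text you upload/copy to the cluster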

16 Register your account
Web-based procedure (via the VSC account page, account.vscentrum.be).
Your VSC username is vsc20xxx.

17 Connecting to the cluster
CalcUA Tutorial Chapter 3

18 A typical workflow
Connect to the cluster
Transfer your files to the clusters
Select software and build your environment
Define and submit your job
Wait while your job gets scheduled, your job gets executed, your job finishes
Move your results

19 Connecting to the cluster
Required software:
Windows: PuTTY graphical SSH client, or a Linux emulation + the ssh command
macOS and Linux: terminal window + the ssh command
You may need a VPN client to connect to the UAntwerpen network.
A cluster has:
login nodes: the part of the cluster to which you connect and where you launch your jobs
compute nodes: the part of the cluster where the actual work is done
storage and management nodes: as a regular user, you don't get on them
Generic names of the login nodes:
VSC standard: login.hpc.uantwerpen.be => hopper
Hopper: login-hopper.uantwerpen.be
Leibniz: login-leibniz.uantwerpen.be

20 Connecting to the cluster
Windows: open PuTTY and follow the instructions on the VSC web site to fill in all fields.
macOS: open a terminal (Terminal, which comes with macOS, or iTerm2) and log in via secure shell:
$ ssh
Linux: open a terminal window (Ubuntu: Applications > Accessories > Terminal, or Ctrl+Alt+T) and log in via secure shell:
$ ssh
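As a hedged example (the generic login node name is taken from the previous slide and the vsc20xxx username format from the account slide; substitute your own username):
$ ssh vsc20xxx@login.hpc.uantwerpen.be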

21 Do not use the login nodes for running your jobs
Currently restricted access: only accessible from within the campus network; external access (e.g., from home) is possible using UA-VPN (the VPN requirement will likely be dropped in the future).
Use the login nodes for:
editing and submitting jobs, also via some GUI applications
small compile jobs – submit large compiles as a regular job
visualisation (we even have a special login node for that purpose)
Do not use the login nodes for running your jobs.

22 The login nodes
In rare cases you may need specific login node names rather than the generic ones:
login1-hopper.uantwerpen.be (and login2-, login3-, login4-)
login1-leibniz.uantwerpen.be (and login2-)
viz1-leibniz.uantwerpen.be (for hardware-accelerated visualisation using TurboVNC)
Some cases where you may need those: when using TurboVNC for GUI access, or for Eclipse remote projects.
These are names for external access. Once on the cluster, you can also use the internal names:
ln01.hopper.antwerpen.vsc (and ln02, ln03, ln04)
ln1.leibniz.antwerpen.vsc (and ln2)
viz1.leibniz.antwerpen.vsc

23 A typical workflow
Connect to the cluster
Transfer your files to the clusters
Select software and build your environment
Define and submit your job
Wait while your job gets scheduled, your job gets executed, your job finishes
Move your results

24 Transfer your files
sftp protocol: an ftp-like protocol based on ssh, using the same keys.
Graphical ssh/sftp file managers for Windows, macOS, Linux:
Windows: PuTTY, WinSCP (PuTTY keys)
Multiplatform: FileZilla (OpenSSH keys)
macOS: CyberDuck
Command line: scp (secure copy), to copy from the local computer to the cluster and from the clusters to the local computer:
$ scp file.ext
$ scp .
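A hedged pair of examples (login node name taken from the earlier slides, vsc20xxx to be replaced by your own username; the remote file names are made up):
$ scp file.ext vsc20xxx@login.hpc.uantwerpen.be:          # local computer -> cluster (home directory)
$ scp vsc20xxx@login.hpc.uantwerpen.be:results.txt .      # cluster -> current local directory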

25 A typical workflow
Connect to the cluster
Transfer your files to the clusters
Select software and build your environment
Define and submit your job
Wait while your job gets scheduled, your job gets executed, your job finishes
Move your results

26 System software
Operating system: CentOS 7.4 (leibniz, hopper update) or Scientific Linux 6.9 (hopper), both Red Hat Enterprise Linux (RHEL) 6/7 clones. The hopper upgrade to CentOS 7.4 is ongoing.
Resource management and job scheduling:
Torque: resource manager (based on PBS)
MOAB: job scheduler and management tools

27 Development software
C/C++/Fortran compilers: Intel Parallel Studio XE Cluster Edition and GCC.
The Intel compiler comes with several tools for performance analysis and assistance in vectorisation/parallelisation of code.
Fairly up-to-date OpenMP support in both compilers.
Support for other shared memory parallelisation technologies such as Cilk Plus, other sets of directives, etc. to a varying level in both compilers.
PGAS Co-Array Fortran support in the Intel compiler (and some in GNU Fortran).
Due to some VSC policies, support for the latest versions may be lacking, but you can request them if you need them.
Message passing libraries: Intel MPI, Open MPI (standard and Mellanox).
Mathematical libraries: Intel MKL, OpenBLAS, FFTW, MUMPS, GSL, …
Many other libraries, including HDF5, NetCDF, Metis, ...
Python

28 Application software
ABINIT, CP2K, OpenMX, QuantumESPRESSO, Gaussian, VASP, CHARMM, GAMESS-US, Siesta, Molpro, CPMD, LAMMPS, Gromacs, Telemac, SAMtools, TopHat, Bowtie, BLAST, MaSuRCA, Gurobi, Comsol, R, MATLAB, …
Often multiple versions of the same package. Additional software can be installed on demand.
ABINIT: DFT, hybrid MPI/OpenMP with CUDA support for some computations
CP2K: DFT, but a larger framework; hybrid MPI/OpenMP with CUDA support
OpenMX: DFT, hybrid MPI/OpenMP
QuantumESPRESSO: DFT, hybrid MPI/OpenMP
Gaussian: distributed memory + multithreaded
VASP: quantum mechanics from first principles ("ab initio"), DFT or Hartree-Fock; MPI, but limited shared memory support through the linear algebra libraries
CHARMM: ab initio; MPI, some multithread support through the use of the MKL library
Siesta: ab-initio molecular dynamics and electronic structure calculations
Molpro: ab initio simulation and DFT; GlobalArrays/MPI-2 for distributed memory, OpenMP support is still incomplete
CPMD: ab-initio DFT for atomic simulations
LAMMPS: molecular dynamics, distributed memory, though several modules also support SMP
Gromacs: molecular dynamics, also hybrid MPI/OpenMP with support for CUDA and OpenCL (you'll find all these terms in the Gromacs manual)
Telemac: CFD
Bio-informatics tools: SAMtools etc.

29 Application software The fine print
Restrictions on commercial software:
5 licenses for Intel Parallel Studio
MATLAB: toolboxes limited to those in the campus agreement
Other licenses are paid for and provided by the users; the owner of the license decides who can use the license (see next slide)
Typical user applications don't come as part of the system software, but additional software can be installed on demand:
Take into account the system requirements
Provide working building instructions and licenses
Since we cannot have domain knowledge in all science domains, users have to do the testing with us
Compilation support is limited: best effort, no code fixing
Installed in /apps/antwerpen

30 Using licensed software
VSC or campus-wide license: you can just use the software. Some restrictions may apply if you don't work at UAntwerp but at some of the institutions we have agreements with (ITG, VITO) or at a company.
Restricted licenses: typically licenses paid for by one or more research groups, sometimes just other license restrictions that must be respected.
Talk to the owner of the license first. If no one in your research group is using a particular package, your group probably doesn't contribute to the license cost.
Access is controlled via UNIX groups, e.g., avasp: VASP users from UAntwerp.
To get access: request group membership via account.vscentrum.be, "New/Join group" menu item. The group moderator will then grant or refuse access.

31 Building your environment: The module concept
Dynamic software management through modules:
Allows you to select the version that you need and avoids version conflicts.
Loading a module sets some environment variables: $PATH, $LD_LIBRARY_PATH, ..., application-specific environment variables, and VSC-specific variables that you can use to point to the location of a package (typically starting with EB).
Automatically pulls in required dependencies.
Software on our system is organised in toolchains:
System toolchain: some packages are compiled against system libraries or installed from binaries.
Compiler toolchains: most packages are compiled from source with up-to-date compilers for better performance. Packages in the same compiler toolchain will interoperate and can be combined with packages from the system toolchain.

32 Building your environment: Toolchains
Compiler toolchains:
"intel" toolchain: Intel and GNU compilers, Intel MPI and math libraries.
"foss" toolchain: GNU compilers, Open MPI, OpenBLAS, FFTW, ...
VSC-wide and refreshed twice per year with more recent compiler versions: 2016b, 2017a, ... We sometimes do silent updates within a toolchain at UAntwerp.
intel/2017a is our main toolchain and is kept synchronous on all systems. We have skipped 2017b at UAntwerp as it offers no advantages over our 2017a toolchain.
Software that did not compile correctly with the 2017a toolchain or showed a large performance regression is installed in the 2016b toolchain.
There is a page listing all software installed in the 2017a toolchains.

33 Building your environment: The module command
A module name is typically of the form <name of package>/<version>-<toolchain info>-<additional info>, e.g.:
intel/2016a : the Intel toolchain package, version 2016a
PLUMED/2.1.0-intel-2014a : PLUMED library, version 2.1.0, built with the intel/2014a toolchain
SuiteSparse/4.5.5-intel-2017a-ParMETIS-4.0.3 : SuiteSparse library, version 4.5.5, built with the Intel compiler, and using ParMETIS
In <additional info>, we don't add all additional components, only those that matter to distinguish between different available modules for the same package.
No two modules with the same name can be loaded simultaneously (exception: hopper SL6).

34 Building your environment: Two kinds of modules
Cluster modules:
leibniz/supported : enables all currently supported application modules: up to 4 toolchain versions and the system toolchain modules
hopper/2017a : enables only the 2017a compiler toolchain modules and the system toolchain modules
So it is good practice to always load the appropriate cluster module first before loading any other module!
Hint: module load $VSC_INSTITUTE_CLUSTER/supported
2 types of application modules:
Built with a specific version of a compiler toolchain, e.g., 2017a: modules in a subdirectory that contains the toolchain name. We try to support a toolchain for 2 years.
Applications installed in the system toolchain: modules in the subdirectory centos7.

35 Building your environment: The module command
module avail : list installed software packages
module avail intel : list all available versions of the intel package
module spider matlab : search for all modules whose name contains matlab, not case-sensitive
module spider MATLAB/R2017a : display additional information about the MATLAB/R2017a module, including other modules you may need to load
module help : display help about the module command
module help foss : show help about a given module
module list : list all loaded modules in the current session
module load hopper/2016b : load a module (here the cluster module); a list of modules can be loaded at once
module load MATLAB/R2017a : load a specific version
module unload MATLAB : unload MATLAB settings
module purge : unload all modules (incl. dependencies and the cluster module)
module -t av |& sort -f : produce a sorted list of all modules, case-insensitive
A typical session combining these commands is sketched below.
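A minimal sketch of a typical sequence (module names taken from this and the previous slides; versions may differ on the system you use):
$ module load leibniz/supported      # always load the cluster module first
$ module avail MATLAB                # see which MATLAB versions are available
$ module load MATLAB/R2017a          # load a specific version
$ module list                        # check what is loaded
$ module purge                       # start from a clean slate again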

36 Exercises
Getting ready (see also §3.1):
Log on to the cluster
Go to your data directory: cd $VSC_DATA
Copy the examples for the tutorial to that directory: cp -r /apps/antwerpen/tutorials/Intro-HPC/examples .
Experiment with the module command as in §3.3 "Preparing your environment: Modules"
Which module offers support for VNC? Which command(s) does that module offer?
Solution to the last question:
module spider VNC
module help vsc-vnc/0.1 or module spider vsc-vnc/0.1

37 Starting jobs CalcUA Tutorial Chapter 4

38 A typical workflow
Connect to the cluster
Transfer your files to the clusters
Select software and build your environment
Define and submit your job
Wait while your job gets scheduled, your job gets executed, your job finishes
Move your results

39 Why batch? A cluster is a large and expensive machine
So the cluster has to be used as efficiently as possible, which implies that we cannot lose time waiting for input as in an interactive program.
Few programs can use the whole capacity (this also depends on the problem to solve), so the cluster is a shared resource: each simultaneous user gets a fraction of the machine depending on his/her requirements.
Moreover there are a lot of users, so one sometimes has to wait a little.
Hence batch jobs (a script with resource specifications) submitted to a queueing system with a scheduler that selects the next job in a fair way based on available resources and scheduling policies.

40 Job workflow What happens behind the scenes
(Diagram: you submit the job script to Torque/Moab, which query the cluster and start the job; don't worry too much who does what…)
#!/bin/bash
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
module load leibniz/supported
module load MATLAB
cd $PBS_O_WORKDIR
matlab -r fibo

41 The job script
Specifies the resources needed for the job and other instructions for the resource manager:
Number and type of CPUs
How much compute time?
How much memory?
Output files (stdout and stderr) – optional
Can instruct the resource manager to send mail when a job starts or ends
Followed by a sequence of commands to run on the compute node:
The script runs in your home directory, so don't forget to go to the appropriate directory
Load the modules you need for your job
Then specify all the commands you want to execute – these are really all the commands that you would also run interactively or in a regular script to perform the tasks you want to perform

42 The job script This is a bash script (but could be, e.g., Perl)
#!/bin/bash
#PBS -L tasks=1:lprocs=7:memory=10gb:swap=10gb
#PBS -l walltime=1:00:00
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
module load leibniz/supported
module load MATLAB
cd $PBS_O_WORKDIR
matlab -r fibo
The #PBS lines specify the resources for the job (and some other instructions for the resource manager), but look like comments to bash.
The module load and cd lines build a suitable environment for your job; remember that the script runs in your home directory!
The last line is the command (or commands) that you want to execute in your job.

43 Specifying resources in job scripts
example.sh:
#!/bin/bash                                    ← the shell (language) in which the script is executed
#PBS -N name_of_job                            ← "name" of the job (optional)
#PBS -L tasks=1:lprocs=1:memory=1gb:swap=1gb   ← resource specifications (e.g., tasks, processors per task), amount of resident and amount of virtual memory per task
#PBS -l walltime=01:00:00                      ← requested running time (required)
#PBS -m e                                      ← mail notification at begin, end or abort (optional)
#PBS -M <your e-mail address>                  ← your e-mail address (should always be specified)
#PBS -o stdout.file                            ← redirect standard output (-o) and standard error (-e) (optional);
#PBS -e stderr.file                            ←   if omitted, a generic filename will be used
cd $PBS_O_WORKDIR                              ← change to the current working directory (optional, if it is the directory where you submitted the job)
echo "Job started :" `/bin/date`
./hello_world                                  ← the program itself
echo "Job finished :" `/bin/date`              ← finished
$ qsub example.sh

44 Intermezzo Why do Torque commands start with PBS?
In the ’90s of the previous century, there was a resource manager called Portable Batch System, developed by a contractor for NASA. This was open-sourced. But that company was acquired by another company that then sold the rights to Altair Engineering that evolved the product into the closed-source product PBSpro (now open-source again since the summer of 2016). And the open-source version was forked by what is now Adaptive Computing and became Torque. Torque remained open-source. Torque = Terascale Open-source Resource and QUEue manager. The name was changed, but the commands remained. Likewise, Moab evolved from MAUI, an open-source scheduler. Adaptive Computing, the company behind Moab, contributed a lot to MAUI but then decided to start over with a closed source product. They still offer MAUI on their website though. MAUI used to be widely used in large USA supercomputer centres, but most now throw their weight behind SLURM with or without another scheduler.

45 Specifying resources in job scripts: Two versions of the syntax
Two versions of the syntax for specifying CPU and memory:
Version 1 (-l) is what you'll find in most job scripts of colleagues: an old and buggy implementation that doesn't reflect modern hardware and cluster software well.
Version 2 (-L) is the version we encourage you to use on all CentOS 7 cluster nodes:
It is less buggy
It better mirrors modern cluster hardware as it supports the hierarchy of processing elements on a cluster node (hardware thread, core, NUMA node, socket)
It was created for use with more advanced resource control mechanisms in new Linux kernels
At some point we will likely enforce it as it enables better cluster management

46 Specifying resources in job scripts: Version 2 concepts
Task: a job consists of a number of tasks. Think of a task as a process (e.g., one rank of an MPI job). It must run in a single node (exception: SGI shared memory systems); in fact, it must run in a single OS instance.
Logical processor: each task uses a number of logical processors (hardware threads or cores), specified per task. In a system with hyperthreading enabled, an additional parameter allows you to specify whether the system should use all hardware threads or run one lproc per core.
Placement: sets the locality of hardware resources for a task.
Feature: each compute node can have one or more special features. We use this for nodes with more memory.

47 Specifying resources in job scripts: Version 2 concepts
Physical/resident memory: RAM per task. In our setup this is not enforced per task but per node. Processes won't crash if you exceed this, but will start to use swap space.
Virtual memory: RAM + swap space, again per task. Processes will crash if you exceed this. We will enforce the use of the parameter related to virtual memory.

48 Specifying resources in job scripts: Version 2 syntax: Cores
Specifying tasks and lprocs:
#PBS -L tasks=4:lprocs=7
Requests 4 tasks using 7 lprocs each (cores in the current setup of hopper and leibniz). The scheduler should be clever enough to put these 4 tasks on a single 28-core node and assign cores to each task in a clever way.
Some variants:
#PBS -L tasks=1:lprocs=all:place=node : will give you all cores on a node. Does not work without place=node!
#PBS -L tasks=4:lprocs=7:usecores : stresses that cores should be used instead of hardware threads

49 Specifying resources in job scripts: Version 2 syntax: Memory
Adding physical memory:
#PBS -L tasks=4:lprocs=7:memory=20gb
Requests 4 tasks using 7 lprocs each and 20gb memory per task.
Adding virtual memory:
#PBS -L tasks=4:lprocs=7:memory=20gb:swap=25gb
Confusing name! This actually requests 25GB of virtual memory, not 25GB of swap space (which would mean 20GB+25GB=45GB of virtual memory).
We have no swap space on leibniz, so it is best to take swap=memory (at least if there is enough physical memory). In that case there is no need to specify memory, as the default is memory=swap:
#PBS -L tasks=4:lprocs=7:swap=20gb
Specifying swap will be enforced!
Available memory: roughly 56GB per node on hopper, 112GB per node on leibniz, 240GB for the 256GB nodes.

50 Specifying resources in job scripts: Version 2 syntax: Additional features
Adding additional features:
#PBS -L tasks=4:lprocs=7:swap=50gb:feature=mem256
Requests 4 tasks using 7 lprocs each and 50GB virtual memory per task, and instructs the scheduler to use nodes with 256GB of memory so that all 4 tasks can run on the same node rather than using two regular nodes and leaving the cores half empty.
The most important features are listed on the VSC website (UAntwerp hardware pages).
Specifying multiple features is possible, both in an "and" or an "or" combination (see the manual).

51 Specifying resources in job scripts: Requesting cores or memory
What happens if you don’t use all memory and cores on a node? Other jobs – but only yours – can use it so that other users cannot accidentally crash your job and so that you have a clean environment for benchmarking without unwanted interference You’re likely wasting expensive resources! Don’t complain you have to wait too long until your job runs if you don’t use the cluster efficiently. If you really require a special setup where no other jobs of you are run on the nodes allocated to you, contact user support If you submit a lot of 1 core jobs that use more than 2.75 GB each on hopper or more than 4 GB each on leibniz: Please specify the “mem256” feature for the nodes so that they get scheduled on nodes with more memory and also fill up as many cores as possible And a program requiring 16 GB of memory but using just a single thread does not make any sense on today’s computers

52 Specifying resources in job scripts: Single node per user policy
Policy motivation:
parallel jobs can suffer badly if they don't have exclusive access to the full node
sometimes the wrong job gets killed if a user uses more memory than requested and a node runs out of memory
a user may accidentally spawn more processes/threads and use more cores than expected (depends on the OS and how the process was started); typical case: a user does not realize that a program is an OpenMP or hybrid MPI/OpenMP program, resulting in running 28 processes with 28 threads each
Remember: the scheduler will (try to) fill up a node with several jobs from the same user, but could use some help from time to time. If you don't have enough work for a single node, you need a good PC/workstation and not a supercomputer.

53 Specifying resources in job scripts: Syntax 1.0 (deprecated) - Requesting cores
#PBS -l nodes=2:ppn=20 : request 2 nodes, using 20 cores (in fact, hardware threads) per node. Or in the terminology of the Torque manual: 2 tasks with 20 lprocs (logical processors) each.
Different node types have a different number of cores per node: on hopper, all nodes have 20 cores; on leibniz, nodes have 28 cores.
#PBS -l nodes=2:ppn=20:ivybridge : request 2 nodes, using 20 cores per node, and those nodes should have the property "ivybridge" (which currently means all nodes of hopper).
See the "UAntwerpen clusters" page in the "Available hardware" section of the User Portal on the VSC website for an overview of properties for the various node types.

54 Specifying resources in job scripts: Syntax 1.0 (deprecated) - Requesting memory
As with the version 2 syntax, there are 2 kinds of memory: resident memory (RAM) and virtual memory (RAM + swap space).
But now there are 2 ways of requesting memory:
Specify the quantity for the job as a whole and let Torque distribute it evenly across tasks/lprocs (or cores): mem for resident memory and vmem for virtual memory (the Torque manual advises against its use, but it often works best...).
Specify the amount of memory per lproc/core: pmem for resident memory and pvmem for virtual memory.
It is best not to mix both options (total and per core). However, if both are specified, the Torque manual promises to take the least restrictive of (mem, pmem) for resident memory and of (vmem, pvmem) for virtual memory. However, the implementation is buggy and Torque doesn't keep its promises...

55 Specifying resources in job scripts: Syntax 1.0 (deprecated) - Requesting memory
Guidelines:
If you want to run a single multithreaded process: mem and vmem will work for you (e.g., Gaussian).
If you want to run an MPI job with single-threaded processes (#cores = #MPI ranks): any of the above will work most of the time, but remember that mem/vmem is for the whole job, not for one node!
Hybrid MPI/OpenMP job: best to use mem/vmem.
However, in more complicated resource requests than explained here, mem/vmem may not work as expected either, so when enforcing, we'll likely enforce vmem.
Source of trouble: Linux has two mechanisms to enforce memory limits:
ulimit : the older way, per process on most systems
cgroups : the more modern and robust mechanism, covering the whole process hierarchy
Torque often sets ulimit -m and -v to a value which is only relevant for non-threaded jobs, or even smaller (the latter due to another bug), and smaller than the (correct) values for the cgroups.

56 Specifying resources in job scripts: Syntax 1.0 (deprecated) - Requesting memory
Example:
#PBS -l pvmem=2gb : this job needs 2 GB of virtual memory per core.
See the "UAntwerpen clusters" page in the "Available hardware" section of the User Portal on the VSC website for the usable amounts of memory for each node type (which is always less than the physical amount of memory in the node).
As swapping makes no sense (definitely not on leibniz), it is best to set pvmem = pmem or mem = vmem, but you still need to specify both parameters. A combined version 1.0 request is sketched below.
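A hedged sketch of a full version 1.0 request that follows this guideline (the values are illustrative only):
#PBS -l nodes=1:ppn=20
#PBS -l pmem=2gb,pvmem=2gb
#PBS -l walltime=01:00:00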

57 Specifying resources in job scripts: Requesting compute time
The same procedure applies to both versions of the resource syntax:
#PBS -l walltime=30:25:55 : this job is expected to run for 30h25m55s.
#PBS -l walltime=1:6:25:55 : 1 day, 6 hours, 25 minutes and 55 seconds.
What happens if I underestimate the time? Bad luck, your job will be killed when the wall time has expired.
What happens if I overestimate the time? Not much, but… your job may have a lower priority for the scheduler, or may not run because we limit the number of very long jobs that can run simultaneously, and the estimates the scheduler produces for the start time of other jobs become even less accurate.
The maximum wall time is 3 days on leibniz and 7 days on hopper.
But remember what we said in last week's lecture about reliability? Longer jobs need a checkpoint-and-restart strategy, and most applications can restart from intermediate data.

58 Specifying resources in job scripts
All resource specifications start with #PBS -l or #PBS -L.
These #PBS lines look suspiciously like command line arguments? That's because they can also be specified on the command line of qsub.
The full list of options is in the Torque manual; search for "Requesting resources" and "-L NUMA Resource Request". We discussed the most common ones.
Remember that available resources change over time as new hardware is installed and old hardware is decommissioned.
Share your scripts, but understand what those scripts do (and how they do it), and don't forget to check the resources from time to time!
And remember from yesterday's lecture: the optimal set of resources also depends on the size of the problem you are solving! Share experience.

59 Non-resource-related #PBS parameters Redirecting I/O
The output to stdout and stderr is redirected to files by the resource manager.
Place: by default in the directory where you submitted the job ("qsub test.pbs"), when the job finishes and if you have enough quota.
Default naming convention: stdout: <job-name>.o<job-ID>, stderr: <job-name>.e<job-ID>. The default job name is the name of the job script.
To change the job name: #PBS -N my_job : change the job name to my_job (stdout: my_job.o<job-ID>, stderr: my_job.e<job-ID>).
To rename the output files: #PBS -o stdout.file : redirect stdout to the file stdout.file instead; #PBS -e stderr.file : redirect stderr to the file stderr.file instead.
Other files are put in the current working directory. This is your home directory when your job starts!

60 Non-resource-related #PBS parameters Notifications
The cluster can notify you by mail when a job starts, ends or is aborted:
#PBS -m abe -M <your e-mail address> : send mail when the job is aborted (a), begins (b) or ends (e) to the given mail address. Without -M, mail is sent to the e-mail address linked to your VSC account.
You may be bombarded with mails if you submit a lot of jobs, but it is useful if you need to monitor a job.
All other qsub command line arguments can also be specified in the job script, but these are the most useful ones. See the qsub manual page in the Torque manual.
There are example job scripts in /apps/antwerpen/examples.
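A hedged example of these two directives in a job script (the address is a placeholder, not a real one):
#PBS -m abe
#PBS -M firstname.lastname@uantwerpen.be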

61 Submitting jobs
If all resources are specified in the job script, submitting a job is as simple as:
$ qsub jobscript.sh
qsub returns a unique jobid consisting of a unique number that you'll need for further commands, and a cluster name.
All parameters specified in the job script in #PBS lines can also be specified on the command line (and overwrite those in the job script), e.g.:
$ qsub -L tasks=x:lprocs=y:swap=zgb example.sh
x : the number of tasks
y : the number of logical cores per task (maximum: 20 for hopper, 28 for leibniz until we enable hyperthreading)
z : the amount of virtual memory needed per task
A filled-in example follows below.
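A hedged, filled-in variant (the numbers are illustrative, reusing the 4 tasks x 7 lprocs pattern from the earlier slides):
$ qsub -L tasks=4:lprocs=7:swap=20gb example.sh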

62 Environment variables in job scripts Predefined variables
The resource manager passes some environment variables:
$PBS_O_WORKDIR : the directory from where the script was submitted
$PBS_JOBID : the job identifier
$PBS_NUM_PPN : when using resource syntax 1.0, the number of cores/hardware threads per node available to the job, i.e., the value of the ppn parameter; otherwise 1
$PBS_NP : total number of cores/hardware threads for the job
$PBS_NODES, $PBS_JOBNAME, $PBS_QUEUE, ...
Look for "Exported batch environment variables" in the Torque manual (currently version 6.1.1).
Some VSC-specific environment variables:
Variables that start with VSC: point to several file systems that you can use (see later), the name of the cluster, etc.
Variables that start with EB: the installation directory of loaded modules.

63 Environment variables in job scripts Passing variables
With the -v argument, it is possible to pass variables from the shell to a job or define new variables. Example:
$ export myvar="This is a variable"
$ myothervar="This is another variable"
$ qsub -I -L tasks=1:lprocs=1 -l walltime=10:00 -v myvar,myothervar,multiplier=
Next on the command line once the job starts:
$ echo $myvar
This is a variable
$ echo $myothervar

$ echo $multiplier
What happens?
myvar is passed from the shell (export)
myothervar is an empty variable since it was not exported in the shell from which qsub was called
multiplier is a new variable, its value is defined on the qsub command line
qsub doesn't like multiple -v arguments; only the last one is used

64 Dependent jobs
It is also possible to instruct Torque/Moab to only start a job after one or more other jobs have finished:
$ qsub -W depend=afterany:jobid1:jobid2 jobdepend.sh
will only start the job defined by jobdepend.sh after the jobs with jobid1 and jobid2 have finished.
Many other possibilities, including:
Start another job only after a job has ended successfully
Start another job only after a job has failed
And many more; check the Moab manual for the section "Job Dependencies"
Useful to organize jobs, e.g.:
You need to run a sequence of simulations, each one using the result of the previous one, but bump into the 3-day wall time limit, so it is not possible in a single job
You need to run simulations using results of the previous one, but with a different optimal number of nodes
Extensive sequential pre- or postprocessing of a parallel job
Powerful in combination with -v

65 Dependent jobs Example
A job whose result is used by 2 other jobs.
job_first.sh:
#!/bin/bash
#PBS -L tasks=1:lprocs=1:swap=1gb
#PBS -l walltime=10:00
cd $PBS_O_WORKDIR
echo "10" >outputfile
sleep 300
job_depend.sh:
#!/bin/bash
#PBS -L tasks=1:lprocs=1:swap=1gb
#PBS -l walltime=10:00
cd $PBS_O_WORKDIR
mkdir mult-$multiplier ; cd mult-$multiplier
number=$(cat ../outputfile)
echo $(($number*$multiplier)) >outputfile
sleep 300
job_launch.sh:
#!/bin/bash
first=$(qsub -N job_leader job_first.sh)
qsub -N job_mult_5 -v multiplier=5 -W depend=afterany:$first job_depend.sh
qsub -N job_mult_10 -v multiplier=10 -W depend=afterany:$first job_depend.sh

66 Dependent jobs Example
The slide shows the output of qstat -u vsc20259 after the start of the first job, some time later, and when finished, plus the output of ls (omitted here).
When everything has finished:
$ cat outputfile
10
$ cat mult-5/outputfile
50
$ cat mult-10/outputfile
100

67 Queue manager
(Diagram: the job script is submitted to Torque, the queue manager, which queries the cluster.)
#!/bin/bash
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
module load leibniz/supported
module load MATLAB
cd $PBS_O_WORKDIR
matlab -r fibo

68 Queue manager: Torque Jobs are submitted to the queue manager
The queue manager provides functionality to start, delete, hold, cancel and monitor jobs on the queue.
Some commands (see the examples below):
qsub : submit a job
qdel : delete a submitted (queued and/or running) job
qstat : get the status of your jobs on the system
Under the hood:
There are multiple queues, but qsub will place the job in the most suitable queue. On our system, queues are based on the requested wall time.
There are limits on the number of jobs a user can submit to a queue or have running from one queue, to avoid that a user monopolises the system for a prolonged time.
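A hedged example session with these three commands (the jobid shown is made up; use the number qsub actually printed for your job):
$ qsub example.sh
12345.leibniz
$ qstat 12345
$ qdel 12345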

69 Queue manager
(Same diagram, now highlighting MOAB: submit job, query cluster.)
#!/bin/bash
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
module load leibniz/supported
module load MATLAB
cd $PBS_O_WORKDIR
matlab -r fibo

70 Job scheduling: Moab Job scheduler
Decides which job will run next based on (continuously evolving) priorities.
Works with the queue manager, and works with the resource manager to check available resources.
Tries to optimise (long-term) resource use, keeping fairness in mind.
The scheduler associates a priority number with each job:
the highest-priority job will (usually) be the next one to run, not first come, first served
jobs with negative priority will (temporarily) be blocked
backfill: allows some jobs to be run "out of order" with respect to the assigned priorities, provided they do not delay the highest-priority jobs in the queue
Lots of other features: QoS, reservations, ...

71 Job scheduling Scheduler priorities
Currently, the priority calculation is based on:
the time a job is queued: the longer a job is waiting in the queue, the higher the priority
fairshare: jobs of users who have recently used a lot of compute time get a lower priority compared to other jobs, and jobs may get blocked for a while if your fair share passes a critical value
(Chart: effective fairshare is computed over a fairshare depth of 7 days, with a fairshare interval of 24 hours and a fairshare decay of 80% per interval.)

72 Getting information from the scheduler
showq : displays information about the queue as the scheduler sees it (i.e., taking into account the priorities assigned by the scheduler). 3 categories: starting/running jobs, eligible jobs and blocked jobs. You can only see your own jobs!
checkjob : detailed status report for the specified job
showstart : shows an estimate of when the job can/will start. This estimate is notoriously inaccurate! A job can start earlier than expected because other jobs take less time than requested, but it can also start later if a higher-priority job enters the queue.
showbf : shows the resources that are immediately available to you (for backfill) should you submit a job that matches those resources

73 Information shown by showq
active jobs : running or starting and have resources assigned to them
eligible jobs : queued and considered eligible for scheduling and backfilling
blocked jobs : jobs that are ineligible to be run or queued
Job states in the "STATE" column:
idle : the job violates a fairness policy
userhold or systemhold : user or administrative hold
deferred : a temporary hold when unable to start after a specified number of attempts
batchhold : requested resources are not available, or repeated failure to start the job
notqueued : e.g., a dependency for a job is not yet fulfilled when using job chains

74 Queue manager
(Same diagram as before, again highlighting Torque: submit job, query cluster.)
#!/bin/bash
#PBS -o stdout.$PBS_JOBID
#PBS -e stderr.$PBS_JOBID
module load leibniz/supported
module load MATLAB
cd $PBS_O_WORKDIR
matlab -r fibo

75 Resource manager
The resource manager manages the cluster resources (e.g., CPUs, memory, ...). This includes the actual job handling at the compute nodes:
Start, hold, cancel and monitor jobs on the compute nodes
Should enforce resource limits (e.g., available memory for your job, ...)
Logging and accounting of used resources
It consists of a server component and a small daemon process that runs on each compute node.
2 very technical but sometimes useful resource manager commands:
pbsnodes : returns information on the configuration of all nodes
tracejob : looks back into the log files for information about the job and can be used to get more information about job failures

76 Working interactively
CalcUA Tutorial Chapter 5 and more

77 Interactive jobs
Starting an interactive job:
qsub -I : open a shell on a compute node. Add the other resource parameters (wall time, number of processors); see the example below.
qsub -I -X : open a shell with X support. You must log in to the cluster with X forwarding enabled (ssh -X) or start from a VNC session; this works only when you are requesting a single node.
When to start an interactive job:
to test and debug your code
to use (graphical) input/output at runtime
only for short duration jobs
to compile on a specific architecture
We (and your fellow users) cannot appreciate idle nodes waiting for interactive commands while other jobs are waiting to get compute time!
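A hedged example of a small interactive request with explicit resources (the values are illustrative, using the version 2 syntax from the earlier slides):
$ qsub -I -L tasks=1:lprocs=4:swap=4gb -l walltime=1:00:00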

78 GUI programs
The cluster is not meant to be your primary GUI working platform. However, sometimes running GUI programs is necessary:
Licensed programs with a GUI for setting the parameters of a run that cannot be run locally due to licensing restrictions (e.g., NUMECA FINE-Marine)
Visualisations that need to be done while a job runs, or of data that is too large to transfer to a local machine
Leibniz has one visualisation node with a GPU and a VNC & VirtualGL-based setup:
VNC is the software to create a remote desktop; a simple desktop environment with the xfce desktop/window manager.
VirtualGL redirects OpenGL calls to the graphics accelerator and sends the resulting image to the VNC software. Start OpenGL programs via vglrun.
The same software stack, minus the OpenGL libraries, is available on all login nodes of hopper and leibniz.
You need a VNC client on your PC. TurboVNC gives good results.

79 GUI programs How does VNC work?
Basic idea: Don’t send graphics commands over the network, but render the image on the cluster and send images. Step 1: Load the vsc-vnc module and start the (Turbo)VNC server withe the vnc-xfce command. You can then log out again. Step 2: Create an SSH tunnel to the port given when starting the server. Step 3: Connect to the tunnel with your VNC client. Step 4: When you’ve finished, don’t forget to kill the VNC server. Disconnecting will not kill the server! We’ve configured the desktop in such a way that logging out on the desktop will also kill the VNC server, at least if you started through the vnc-* commands shown in the help of the module (vnc- xfce). Full instructions: Page “Remote UAntwerp” on the VSC web site (

80 GUI programs Using OpenGL:
The VNC server currently emulates an X server without the GLX extension, so OpenGL programs will not run right away.
On the visualisation node: start OpenGL programs through the command vglrun. vglrun will redirect OpenGL commands from the application to the GPU on the node and transfer the rendered 3D image to the VNC server.
E.g., vglrun glxgears will run a sample OpenGL program.
Our default desktop for VNC is xfce. GNOME is not supported due to compatibility problems.

81 Interactive jobs Do the exercises in Chapter 5 “Interactive jobs”
Files in examples/Running-interactive-jobs Text example in §5.2 GUI program in §5.3 (assuming you have an X server installed)

82 Files on the cluster CalcUA Tutorial Chapter 6

83 Directories
$VSC_SCRATCH = /scratch/antwerpen/2xx/vsc2xxyy : fast scratch for temporary files, per cluster/site, highest performance
$VSC_DATA = /data/antwerpen/2xx/vsc2xxyy : to store long-term data, exported to other VSC sites, slower than /scratch
$VSC_HOME = /user/antwerpen/2xx/vsc2xxyy : only for account configuration files; do not run your jobs in this directory!
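A hedged job-script fragment illustrating how these variables are typically combined (the directory, file and program names are made up):
cd $VSC_SCRATCH
mkdir -p myrun && cd myrun              # work on the fast scratch file system
cp $VSC_DATA/input.dat .                # copy the input from the data file system
./my_program input.dat > output.log     # run the actual computation (hypothetical program)
cp output.log $VSC_DATA/                # copy the results back for longer-term storage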

84 Directories: Quota
Block quota: limits the amount of data you can store on a file system.
File quota: limits the number of files you can store on a file system.
Current defaults (block quota / file quota):
/scratch and /data : 25 GB / 100,000 files
/home : 5 GB / 20,000 files

85 Best practices for file storage
The cluster is not for long-term file storage. You have to move your results back to your laptop or to a server in your department.
There is a backup for the home and data file systems. We can only continue to offer this if these file systems are used in the proper way, i.e., not for very volatile data.
Old data on scratch can be deleted if scratch fills up (so far this has never happened at UAntwerp, but it is done regularly on some other VSC clusters).
Remember that the cluster is optimised for parallel access to big files, not for using tons of small files (e.g., one per MPI process).
If you run out of block quota, you'll get more without needing much justification. If you run out of file quota, you'll have a lot of explaining to do before getting more.
Text files are good for summary output while your job is running, or for data that you'll read into a spreadsheet, but not for storing 1000x1000 matrices. Use binary files for that!

86 Exercises Do the exercises in Chapter 6 ”Jobs with input/output”.
Files in examples/Running-jobs-with-input-output-data §6.3 – Writing output files : Optional §6.4 – Reading an input file : Homework, not in this session (and most useful if you understand Python scripts)

87 Multi-core jobs CalcUA Tutorial Chapter 7

88 Why parallel computing?
Faster time to solution: distributing code over N cores, hoping for a speedup by a factor of N.
Larger problem size: distributing your code over N nodes increases the available memory by a factor N, hoping to tackle problems which are N times bigger.
In practice the gain is limited due to communication and memory overhead and sequential fractions in the code; the optimal number of nodes is problem-dependent.
But there is no escape possible, as computers don't really become faster for serial code… Parallel computing is here to stay.

89 Starting a shared memory job
Shared memory programs start like any other program, but you may need to tell the program how many threads it can use. This depends on the program, and the autodetect feature of a program usually only works when the program gets the whole node.
E.g., MATLAB: use maxNumCompThreads(N).
Many OpenMP programs use the environment variable OMP_NUM_THREADS: export OMP_NUM_THREADS=7 tells the program to use 7 threads (see the sketch below).
Check the manual of the program you use!
There is also a command line option to run MATLAB single-threaded. But do you need a supercomputer when you use only one thread?
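A hedged sketch of a shared-memory job script that matches the thread count to the requested lprocs (the program name is hypothetical):
#!/bin/bash
#PBS -L tasks=1:lprocs=7:swap=10gb
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=7     # one thread per requested logical processor
./my_openmp_program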

90 Starting an MPI job
generic-mpi.sh (Intel MPI) example script:
#!/bin/bash
#PBS -N mpihello
#PBS -L tasks=56:lprocs=1:swap=1gb       ← 56 MPI processes, i.e., 2 nodes with 28 processes each on leibniz
#PBS -l walltime=00:05:00
module load leibniz/supported            ← load the cluster module
module load intel/2017a                  ← load the Intel toolchain (for the MPI libraries and mpirun)
cd $PBS_O_WORKDIR                        ← change to the current working directory
mpirun ./mpihello                        ← run the MPI program (mpihello)
mpirun communicates with the resource manager to acquire the number of nodes, cores per node, the node list, etc.

91 Starting an MPI job
The module vsc-mympirun contains an alternative mpirun command (mympirun) that makes it easier to start hybrid programs:
The command line option --hybrid=X specifies the number of processes per node.
Check mympirun --help.
This is not the ideal solution, but the best we have right now, since Torque doesn't communicate enough information to the MPI starter. It may be best to set tasks to the number of nodes and lprocs to the number of cores you want, just to be sure that you get full nodes (see the sketch below).
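A hedged sketch of a hybrid MPI/OpenMP job using mympirun (the program name is hypothetical; 7 processes per node with 4 threads each on a 28-core leibniz node is just one possible split):
#!/bin/bash
#PBS -L tasks=2:lprocs=28:swap=100gb      # 2 full leibniz nodes
#PBS -l walltime=01:00:00
module load leibniz/supported
module load intel/2017a
module load vsc-mympirun
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=4
mympirun --hybrid=7 ./my_hybrid_program   # 7 MPI processes per node, 4 OpenMP threads each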

92 Exercises Do the exercises in Chapter 7 ”Multi-core jobs”
Scripts in examples/Multi-core-jobs-Parallel-Computing §7.2 – Example pthreads program (shared memory) §7.3 – Example OpenMP program (shared memory) : Only look at the first example (omp1.c) §7.4 – Example MPI program (distributed memory)

93 Policies CalcUA Tutorial Chapter 9

94 Some site policies Our policies on the cluster:
Single user per node policy.
Priority-based scheduling (so not first come – first get). We reserve the freedom to change the parameters of the priority computation at will to ensure the cluster is used well.
Fairshare mechanism: make sure one user cannot monopolise the cluster.
Crediting system: dormant, not enforced.
Implicit user agreement: the cluster is valuable research equipment.
Do not use it for other purposes than your research for the university.
No and similar initiatives. Not for private use.
And you have to acknowledge the VSC in your publications (see the user portal on the VSC web site).
Don't share your account.

95 Credits
At UAntwerp we monitor cluster use and send periodic reports to group leaders.
On the Tier-1 system at KU Leuven, you get a compute time allocation (number of node days). This is enforced through project credits (Moab Accounting Manager):
a free test ride (motivation required)
requesting compute time through a proposal
At the KU Leuven Tier-2, you need compute credits, which are enforced as on the Tier-1. Credits can be bought via the KU Leuven.
Charge policy (subject to change), based upon:
fixed start-up cost
used resources (number and type of nodes)
actual duration (used wall time)

96 Advanced topics CalcUA Part B

97 Agenda part B Multi-job submission Compiling Program examples
Good practices

98 Multi-Job submission CalcUA Tutorial Chapter 10

99 Worker framework
Many users run many serial jobs, e.g., for a parameter exploration.
Problem: submitting many short jobs to the queueing system puts quite a burden on the scheduler, so try to avoid it.
Solution: the Worker framework:
parameter sweep: many small jobs and a large number of parameter variations; parameters are passed to the program through command line arguments
job arrays: many jobs with different input/output files
Alternatives:
Torque array jobs
The Torque pbsdsh command
GNU parallel (module name: parallel); we may not be able to support GNU parallel in the future for jobs that span multiple nodes

100 Worker framework Example 1: Parameter exploration
data.csv (data in CSV format):
temperature, pressure, volume
293.0, 1.0e05, 87
..., ..., ...
..., 1.0e05, 75
weather.sh:
#!/bin/bash -l
#PBS -L tasks=28:lprocs=1
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
weather -t $temperature -p $pressure -v $volume
$ module load worker
$ wsub -data data.csv -batch weather.sh
weather will be run for all data, until all computations are done. Can also run across multiple nodes.

101 jobarray.sh
#!/bin/bash -l
#PBS -L tasks=28:lprocs=1
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
INPUT_FILE="input_${PBS_ARRAYID}.dat"          ← for every run, there is a separate input file
OUTPUT_FILE="output_${PBS_ARRAYID}.dat"        ← and an associated output file
program -input ${INPUT_FILE} -output ${OUTPUT_FILE}
$ module load worker
$ wsub -t 1-100 -batch jobarray.sh
program will be run for all input files (100)

102 Exercises Do the exercises in Chapter 10 “Multi-job submission”
Files in the subdirectories of examples/Multi-job-submission Have a look at the parameter sweep example “weather” from §10.1 (subdirectory par_sweep) Have a look at the job array example from §10.2 (subdirectory job_array) (Time left: Have a look at the map-reduce example from §10.3 in subdirectory map_reduce)

103 Best practices CalcUA Tutorial Chapter 13

104 Debug & test Before starting you should always check :
Are there any errors in the script? Are the required modules loaded? Is the correct executable used?
Check your jobs at runtime:
log in on the node and use, e.g., top, htop, vmstat, ...
the monitor command (see the User Portal information on running jobs)
showq -r (data not always 100% reliable)
alternatively: run an interactive job (qsub -I) for the first run of a set of similar runs
If you will be doing a lot of runs with a package, try to benchmark the software for:
scaling issues when using MPI
I/O issues
If you see that the CPU is idle most of the time, that might be the problem.

105 Hints & tips
Your job will log on to the first assigned compute node and start executing the commands in the job script:
it starts in your home directory ($VSC_HOME), so go to the current directory with cd $PBS_O_WORKDIR
don't forget to load the software with module load
Why is my job not running?
use the checkjob command
commands might time out with an overloaded scheduler
Submit your job and wait (be patient) …
Using $VSC_SCRATCH_NODE (/tmp on the compute node) might be beneficial for jobs with many small files. But that scratch space is node-specific and hence will disappear when your job ends, so copy what you need to your data directory or to the shared scratch.

106 Warnings: what a cluster is not
A cluster is not a computer that will automatically run the code of your (PC) application much faster or for much bigger problems.
Avoid submitting many small jobs by grouping them, using a job array or the Worker framework.
Runtime is limited by the maximum wall time of the queues; for longer wall times, use checkpointing.
Requesting many processors could imply long waiting times (though we're still trying to favour parallel jobs).

107 User support CalcUA

108 User support
Try to get a bit self-organized: there are limited resources for user support, it is best effort, and we do no code fixing. When you contact us, be as precise as possible.
VSC documentation: vscentrum.be/en/user-portal
External documentation: docs.adaptivecomputing.com (manuals of Torque and MOAB; we're currently at version of Torque and version of Moab)
Mailing lists
Web sites: VSC: www.vscentrum.be; CalcUA: uantwerpen.be/hpc

109 User support
Mailing lists for announcements: (for official announcements and communications)
Contacts:
phone: 3860 (Stefan), 3855 (Franky), 3852 (Kurt), 3879 (Bert)
office: G (CMI)
Don't be afraid to ask, but being as precise as possible when specifying your problem definitely helps in getting a quick answer.

