nci.org.au @NCInews Raijin Essentials
Resources:
These slides: http://nci.org.au/user-support/training/using-raijin-course/
NCI guides: http://nci.org.au → User Support
Training material: http://nci.org.au/services-support/training/
nci.org.au 2/86 Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
nci.org.au 3/86 Connecting: Basics
Raijin: Unix cluster (CentOS 6.6), 3592 compute nodes.
Interactive terminal (text only): ssh -l abc123 raijin.nci.org.au, or ssh abc123@raijin.nci.org.au
Windows ssh clients: PuTTY, MobaXterm, Xming.
Graphics-enabled session (i.e., X-Windows/X11): ssh -X ... (PC), ssh -Y ... (Mac)
Remote file transfer: scp / sftp / rsync commands, or a graphical ftp client.
It's good practice to log out of xterm sessions (or ctrl-d, or exit).
nci.org.au 4/86 Connecting: Login nodes
If you can't connect to raijin.nci.org.au, try raijin1 through raijin6 directly.
Use login nodes for small tasks: editing files, submitting jobs, small file transfers, compiling small programs, etc.
'Intensive' tasks are killed automatically (>2GB ram, >30 mins cpu time).
nci.org.au 5/86 Connecting: Data-mover nodes
Small file transfers: raijin.nci.org.au
Small/Large transfers: r-dm.nci.org.au
Can use scp / rsync / sftp.
Syntax (scp / rsync): scp [source-file/dir] [destination-file/dir]
Server: Machine running the scp server (usually r-dm.nci.org.au).
Client: Machine that initiates the copy (usually your PC).
Push: Client → Server. Pull: Client ← Server.
nci.org.au 6/86 Connecting: Using scp (in your own time)
Examples. Use your PC as the client (run scp in a local terminal, not on Raijin).
Push, i.e., Client (your PC) → Server (Raijin):
scp myfile abc123@r-dm.nci.org.au:mydir  # mydir must exist in your home (~) dir.
scp myfile abc123@r-dm.nci.org.au:  # Copies to home dir; don't forget the colon.
scp -r mydir abc123@r-dm.nci.org.au:parentdir  # parentdir must exist in your home dir.
Pull, i.e., Client (your PC) ← Server (Raijin): swap the order of the arguments, e.g.,
scp abc123@r-dm.nci.org.au:myfile mydir  # mydir must exist in the current dir.
nci.org.au 7/86 Connecting: Using rsync (in your own time)
rsync uses the same basic syntax as scp, with many more options:
rsync -avPS myfile abc123@r-dm.nci.org.au:mydir
-a : Archive. (Recursive copy; preserves permissions/owner/group/mtime, ...)
-P : Resume partial file transfers.
-S : Handle sparse files.
Also, -z to enable compression, --progress to show progress, etc.
Consult the manual pages: man scp and man rsync.
nci.org.au 8/86 Connecting: Passphrase-less access
Sometimes necessary, e.g., automation of remote data transfers.
Don't put your password in a file/script; use passphrase-less ssh keys.
ssh keys can be configured to...
... allow only certain commands.
... restrict arguments, such as directory names.
Passphrase-less file transfer: use rrsync instead of rsync (also rscp).
Strongly discouraged. Weakens the security of NCI and your system.
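As an illustration only (the key type, target directory, and rrsync path are placeholders; check where your rsync installation ships the rrsync script), a forced command in ~/.ssh/authorized_keys on the server is one way to restrict a passphrase-less key to rsync transfers into a single directory:

# Restrict this key to rsync transfers under ~/incoming; no shell, no forwarding.
command="/usr/share/rsync/scripts/rrsync ~/incoming",no-pty,no-port-forwarding,no-agent-forwarding ssh-rsa AAAA...yourkey...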
nci.org.au 9/86 Exercise 1: Connecting (1/2)
Login to Raijin: ssh your-course-id@raijin.nci.org.au
Service disruptions are reported in the 'Message of the Day'.
Who/Where am I?
whoami
man hostname  # 'man' is Unix's help system, q to exit.
hostname
pwd  # Shows name of current directory.
echo .  # '.' refers to the current directory.
echo ~  # '~' is an alias for your home directory.
cat /etc/motd
You might wish to refer to Handy Unix Commands.
nci.org.au 10/86 Exercise 1: Connecting (2/2)
Try to run the xeyes or xclock commands (they won't work).
logout (or ctrl-d, or exit), then reconnect with X11 forwarding: ssh -Y ... (Mac), ssh -X ... (PC)
Now try, e.g., xeyes (ctrl-c to stop).
Remote file transfer (if time permits): logout and 'push' a file from your computer to Raijin:
scp myfile your-course-id@r-dm.nci.org.au:  # You must include the colon!
Log back in and use ls or ls -la to check the result.
nci.org.au 11/86 Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
nci.org.au 12/86 Resource Quotas
The two resources that are metered are storage and compute time.
Compute grant divided into quarterly amounts.
Resource usage is accounted against projects.
Projects implemented as Unix groups.
Users can belong to multiple projects.
nci.org.au 13/86 Resource Quotas: Default project
Usage is 'charged' to your default project, unless otherwise specified.
Edit your .rashrc to change your default project:
setenv PROJECT c25
setenv SHELL /bin/bash
.rashrc is hidden in your home folder (ls -a lists all files).
Settings in .rashrc are applied each time you login.
Modifying .rashrc is the best way to set your project.
nci.org.au 14/86 Resource Quotas: Overriding default project
$PROJECT : Name of active project (see environment variables).
newgrp doesn't update $PROJECT; instead use...
... switchproj. Changes active project for the current session.
... nfnewgrp. Changes active project for a specified command.
For more info, run switchproj / nfnewgrp without arguments.
Most commands relating to project accounting/job-scheduling let you override the default project.
Never put switchproj in login scripts. (More on these later.)
nci.org.au 15/86 Resource Quotas: Data grant (1/2)
Storage grant has two components:
1. Amount of data.
2. # of files/dirs ('inodes').
Project's data usage is based on file ownership.
Location of a file has no bearing on quota usage.
Use chgrp to change which project owns a file:
chgrp NameOfNewProject myfile
Output files produced by jobs submitted to the scheduler belong to your default project...
... unless you specify otherwise (more on jobs later).
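For example (a sketch; the project codes and path are placeholders), to re-own a whole directory tree to project c25, and to catch stragglers still owned by another group:

chgrp -R c25 /short/c25/$USER/mydir  # -R recurses into subdirectories.
find . -group oldproj -exec chgrp c25 {} +  # Re-own any files still owned by group 'oldproj'.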
nci.org.au 16/86 Resource Quotas: Data grant (2/2)
Projects only have storage grants for /short, /g/data, massdata.
/home capped at 2GB and 80,000 inodes.
Project dirs: /short/projcode, /g/data/projcode, /massdata/projcode
To access massdata: mdss command, or ftp to r-dm.nci.org.au.
/g/data is a symbolic link to /g/data1 or /g/data2. E.g., /g/data/c25 → /g/data2/c25
nci.org.au 17/86 Resource Quotas: Data usage (1/2)
nci_account : Summary of all resources used by a project.
Total disk usage (data, inodes) for main filesystems.
Project (-P) and time period (-p) options: nci_account -P c25 -p 2014.q4
Lustre filesystem stats can be ~30 minutes old.
lquota : Quotas and overall usage for Lustre filesystems (/home, /short, /g/data).
Shows usage for all of your projects.
Unlike nci_account, lquota queries the filesystem directly.
To estimate the amount of data in the current dir: du -sh . and find . | wc -l
nci.org.au 18/86 Resource Quotas: Data usage (2/2)
short_files_report : Project's disk usage for /short only.
-G and -P options show breakdown by user.
Stats can be up to ~1 day old.
Example 1. To show where the data owned by project c25 is: short_files_report -G c25
Locating misplaced files / files with incorrect ownership.
Project's overall disk usage.
Who is using the most space.
Example 2. To show which projects own the data in /short/c25 : short_files_report -P c25
Locating misplaced files / files with incorrect ownership.
Also gdata1_files_report, gdata2_files_report, etc.
nci.org.au 19/86 Resource Quotas: Compute policy (1/2)
You are only charged for cpu time used by 'batch jobs', i.e., jobs submitted to the scheduler.
This applies to both compute and copy jobs.
All other cpu usage is free.
Login nodes are subject to cpu/mem limits.
* ssh can issue noninteractive commands to r-dm.nci.org.au.
nci.org.au 20/86 Resource Quotas: Compute policy (2/2)
Remote transfers to r-dm.nci.org.au are 'unlimited'.
In general, you can't transfer files remotely to/from massdata.
All other large file operations must use the job scheduler, e.g., /g/data → massdata
nci.org.au 21/86 Resource Quotas: Compute usage
Compute grant specified as a number of 'Service Units' (SUs).
SU = one hour of walltime on one cpu. walltime = real-world time.
Each cluster node has 16 cpus ('cpu' = core).
E.g., a job that uses 3 compute nodes for 3 hrs costs 3 × 16 × 3 = 144 SUs (excludes 'express' jobs, discussed later).
Project's compute usage is updated after each job finishes.
Use nci_account to view project's overall compute usage.
Also shows costs of running and queued jobs.
-v option: breakdown of compute usage by user.
nci.org.au 22/86 Exercise 2: Accounting
Amount of data in your /home (~) dir: du ~ -sh
Approx. # of files in your /home dir: find ~ | wc -l
(The pipe '|' directs output of find to input of wc. See man wc and man find. Try the find command by itself.)
nci_account, also try the -v and -vv options for information overload.
lquota
short_files_report -G c25
short_files_report -P c25
Also gdata1_files_report and gdata2_files_report. Notice that c25 doesn't have a gdata1 allocation.
nci.org.au 23/86 Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
nci.org.au 24/86 Software Environment: The shell
Each terminal runs a separate shell.
The shell interprets and executes commands. (Handy Unix Commands)
Many shells to choose from. bash is the most popular (default), followed by tcsh.
Edit your .rashrc to change your default shell:
setenv PROJECT c25
setenv SHELL /bin/bash
Shell commands can be grouped into scripts. Each script runs in a subshell.
Append & to run a command in the background (related to wait, ctrl-z, fg, bg; see, e.g., man wait).
nci.org.au 25/86 Software Environment: Shell variables
Can define shell variables that can be used from the command line.
N=10
OUTPUTFILE=myfile.out  # (bash syntax)
set N=10
set OUTPUTFILE=myfile.out  # (tcsh syntax)
Retrieve the value by prepending a $.
echo "The value of N is $N"
To make an existing variable visible to subshells such as scripts:
N=10
export N
export OUTPUTFILE=myfile.out  # (bash syntax)
setenv OUTPUTFILE myfile.out  # (tcsh syntax)
Useful pre-defined vars: $PROJECT, $USER, $HOME, $0 (shell type).
Also see Canonical Environment Variables.
nci.org.au 26/86 Software Environment: Shell scripts
First line of a script invokes a new shell. Usually #!/bin/bash or #!/bin/tcsh
Next come user commands. E.g., (contents of myscript.sh):
#!/bin/bash
N=10  # Anything after a '#' is a comment.
echo $N
To make the script executable: chmod +x myscript.sh
To run the script: ./myscript.sh
When the script finishes, the original values of variables are restored.
To make changes persistent, use source (or '.'). E.g., . myscript.sh
source runs the script in the current (parent) shell rather than a subshell, so its changes persist.
nci.org.au 27/86 Software Environment: Defaults
System scripts for setting the default environment ('dot files')...
... run each time a shell is created/destroyed.
... hidden in your home (~) directory.
When a shell is created (bash / tcsh):
Login: .profile / .login, .cshrc
Non-interactive: $BASH_ENV / .cshrc
Interactive: .bashrc / .cshrc
When you logout: .bash_logout (bash), .logout (tcsh).
Raijin: BASH_ENV=.bashrc. Default .profile executes .bashrc.
Keep the # of commands in your dot files to a minimum.
Avoids conflicts/recursive execution of dot files.
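A minimal sketch of the kind of .bashrc this advice implies (the module name is a placeholder): keep it short, and guard interactive-only commands with a $PS1 test so non-interactive shells, such as batch jobs, skip them.

# ~/.bashrc -- keep minimal; runs for interactive and (via BASH_ENV) non-interactive shells.
module load pbs  # 'Core' modules only (placeholder name).
if [ -n "$PS1" ]; then  # $PS1 is set only in interactive shells.
    alias ll='ls -la'  # Interactive conveniences go here.
fi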
nci.org.au 28/86 Software Environment: Editors
Several editors are installed on Raijin. Convenient for modifying job scripts.
Editors with text-based interfaces: vi/vim, emacs, nano.
vi/vim/emacs are powerful but not intuitive at first.
Editors with a graphical interface:
nedit is a simple graphical editor.
emacs (unless the -nw option is specified).
These require an X-Windows enabled session (ssh -X or -Y ...).
MS Windows uses a slightly different text file format.
Can use dos2unix / unix2dos to convert your scripts.
nci.org.au 29/86 Software Environment: Modules (1/2)
Many software packages are available on Raijin (see the software catalogue).
Configuring your environment for each package isn't trivial.
Modules take care of this for you: module load / unload SoftwareName
module avail shows the list of available modules.
module avail SoftwareName shows available versions.
module show SoftwareName shows the changes a module makes.
module list lists the modules you have currently loaded.
Some packages require more than one module to be loaded.
E.g., many packages require OpenMPI. Prerequisites are documented in the software catalogue.
nci.org.au 30/86 Software Environment: Modules (2/2)
Can put module commands in dot files...
... preferably in .profile / .login instead of .bashrc / .cshrc.
Default dot files contain a small number of 'core' modules.
Putting module commands in dot files can lead to conflicts.
It's better to put such commands in your job scripts instead.
Putting module purge in dot files can result in strange errors.
You can define your own modules: see the Module user-guide.
nci.org.au 31/86 Exercise 3: Software environment (1/4)
As always, Handy Unix Commands.
Inspect some predefined variables. E.g., echo $HOME (compare with echo ~).
printenv shows the list of defined environment variables.
Which shell are you running? echo $0
When you enter a command that isn't built-in, the shell searches the directories named in $PATH :
echo $PATH
PATH=$PATH:~/mydir
echo $PATH
Inspect your login scripts:
ls -la ~  # What happens if you omit the 'a'?
cat ~/.profile  # Note the 'default' modules.
cat ~/.bashrc  # ...or ~/.login and ~/.cshrc if using tcsh.
nci.org.au 32/86 Exercise 3: Software environment (2/4)
Try the following module commands:
module avail
module avail python
module show python  # Shows what environment vars will be set.
echo $PYTHON_BASE
module load python  # Loads the default version.
echo $PYTHON_BASE
which python
Use module list to check which version is loaded.
Try loading the module for a different version of python. You must unload the previous version first:
module unload python  # No need to specify which version.
module list
echo $PYTHON_BASE
which python
nci.org.au 33/86 Exercise 3: Software environment (3/4)
(If time permits) Make a simple script: nano myscript.sh
(... or use, e.g., nedit if X-Windows is enabled).
Insert the following lines:
#!/bin/bash
echo This script is running in a non-interactive subshell.
Save the script (ctrl-o) and exit nano (ctrl-x).
ls -l myscript.sh  # Look at file permissions.
chmod +x myscript.sh  # Make the script executable.
Check the file permissions once more.
nci.org.au 34/86 Exercise 3: Software environment (4/4)
Add the following to the end of .profile :
echo Starting a new login shell.
Add the following to the end of .bashrc :
if [ -z "$PS1" ]; then  # Don't forget the spaces!
echo This shell is interactive.
else
echo Either default .profile was executed, or this is a non-interactive shell and BASH_ENV=.bashrc.
fi  # This isn't a typo.
For each of the next three steps, which, if any, dot files are executed?
1. logout (or ctrl-d) and then log back in.
2. Run ./myscript.sh. Try echo $BASH_ENV.
3. Start an interactive subshell by typing bash. Type exit to close the subshell and return to the parent shell.
nci.org.au 35/86 Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
nci.org.au 36/86 Filesystems: Capacities
Gratuitous slide showing disk capacities.
/g/data1 and /g/data2 together have more capacity than 20,000 laptop hard drives (@700GB ea) combined. Not too shabby!
nci.org.au 37/86 Filesystems: Purpose and performance
The purpose of each filesystem is reflected in that FS's performance.
/short : Frequently accessed files, esp. large IO files of running/recent jobs, source code/libs.
/g/data : Data sets that must be available on demand. Global: visible to the NCI cloud.
/home : Source code, scripts, local packages.
JOBFS : Node-local scratch space for each job.
massdata : Archive files. Reads are slow and not immediate if the file isn't in disk cache.
File IO of running jobs is usually directed to /short or /jobfs.
The particular choice depends on the IO pattern (more in a moment).
nci.org.au 38/86 Filesystems: Speeds
The real-world read/write speeds per job are:
/short : ~1GB/s*
/g/data : ~500MB/s*
JOBFS : ≤ 100MB/s
/massdata : 0.5-1TB/hr (write)
* Lustre filesystems, conditions apply.
Lustre FS speeds are achievable because Lustre...
... is highly parallel.
... communicates over the 'Infiniband' network (56Gb/s).
massdata : Accessed over a 10GbE link.
Read speed depends on whether the file is in cache or on tape.
nci.org.au 39/86 Filesystems: Use and abuse of Lustre
Lustre filesystems (/short, /g/data, /home) are distributed:
Files 'striped' across multiple disks for parallel IO.
Each file potentially accessible to many users/cpus.
Consistent view maintained across 1000's of Lustre clients.
Each file operation generates a lot of metadata.
Filesystem bandwidth is shared by all users.
IO-intensive tasks should...
... issue file operations no more than once per second.
... read/write in 'large' chunks (>1MB).
Lustre performance plummets if file ops are small and frequent.
Misusing Lustre degrades performance for everyone.
nci.org.au 40/86 Filesystems: High-frequency IO
Each cluster node has 396GB of local scratch space (JOBFS).
Only available to 'batch jobs'.
Allocated when the job starts. Deleted when the job finishes.
Not visible to other jobs/users.
Very frequent, small file ops...
... can degrade Lustre performance for everyone.
... should be avoided, but are handled much better by JOBFS.
Typical speed: JOBFS ≤ 100MB/s; /short ~1GB/s.
Suitable for frequent IO: JOBFS yes; /short no.
nci.org.au 41/86 Filesystems: Use and abuse of massdata
Intended for archiving large data files.
Storing large numbers of small files is a misuse of massdata.
It increases access times. Avg. file size should be ≥ 1MB.
Bundle small files into archives (.tar files) first.
Use the archive option (-t) of the netcp / netmv commands (discussed later).
The compress option (-z) is recommended.
Supplying mdss get with a large file list facilitates parallel retrieval.
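As a sketch of that workflow (directory and archive names are placeholders, and mdss usage is covered in detail on the following slides), bundle and compress before archiving:

tar czvf results.tar.gz results/  # Bundle + compress the small files first.
mdss put results.tar.gz $USER/  # One large file instead of many small ones.
mdss get $USER/results.tar.gz .  # With a list of several files, mdss can stage them in parallel.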
nci.org.au 42/86 Filesystems: Data-backup policies
Only /home and massdata are backed up automatically.
Snapshot of /home taken every ~2 days.
/home snapshots are thinned out over time.
A duplicate copy of massdata is kept in a separate building.
Backing up data is otherwise your responsibility. Use tar liberally!
Many groups neglect to use massdata because...
... they don't have a plan for organising/managing data.
... turning large data sets into archive files is laborious.
... no one wants to touch other people's files, esp. large dirs.
... they are unsure how to use massdata.
... it's easier to leave files as they are on /short or /g/data.
If you are unsure, ask us for assistance.
nci.org.au 43/86 Filesystems: Accessing (1/2)
Lustre filesystems are used like regular directories: ls, cd, cp, mkdir, rm, etc.
JOBFS exists for the duration of a job.
Its location is stored in $PBS_JOBFS (discussed later).
Accessing massdata :
mdss command. Provides put / get, rm, ls, etc.
netcp / netmv commands.
More details in a moment!
nci.org.au 44/86 Filesystems: Accessing (2/2)
Project directories: filesystem/projcode
filesystem = /short, /g/data, /massdata, etc.
All project members have read/write/execute permissions for these dirs.
Each person has their own subdirectory: filesystem/projcode/$USER
To check filesystem status: /opt/rash/bin/modstatus -n opt
opt = gdata1_status, gdata2_status or mdss_status
Message of the Day (cat /etc/motd). Emergency downtime notice.
nci.org.au 45/86 Filesystems: Accessing massdata (1/3)
Method 1: mdss command.
Provides familiar file operations as subcommands: mdss ls, mdss mkdir mydir, mdss rm -r mydir, etc.
Login and datamover nodes only. The latter usually requires the job scheduler.
mdss assumes filenames are relative to /massdata/projcode.
Use the -P option to specify a project other than the default.
nci.org.au 46/86 Filesystems: Accessing massdata (2/3)
Method 1: mdss command (cont'd).
put, get, stage subcommands use the same syntax. E.g., mdss put myfile target, mdss put -r mydir.
target is optional, and must already exist if it specifies a directory.
stage transfers files to cache for later use. get stages and retrieves.
mdss dmls is similar to mdss ls. Also indicates state: REG = cached, OFL = on tape, DUL = both.
mdss creates checksums (mdss -v to verify). See man mdss !
nci.org.au 47/86 Filesystems: Accessing massdata (3/3)
Method 2: The netcp / netmv commands.
cp / mv commands for massdata. E.g., netcp myfile target
target is optional.
Can't be used to retrieve files from massdata.
Can push files to remote ssh servers (requires passphrase-less access).
Options to archive (-t) and compress (-z): netcp -z -t myfile.tar mydir
Implemented as a copy job:
Requires familiarity with the job scheduler (more later).
Uses default resource limits unless -l is specified.
Produces job summary files.
nci.org.au 48/86 Exercise 4: Filesystems (1/4)
For the training project, /g/data is a link to /g/data2 :
cd /g/data/$PROJECT  # All group members can access this.
pwd
ls -la /g/data/$PROJECT
cd /short/$PROJECT/$USER; pwd  # Only you can access this.
Try the ls and du subcommands for mdss (mdss assumes filenames are relative to /massdata/$PROJECT):
mdss ls
mdss ls ..  # '..' is the parent directory.
mdss ls -la
mdss du -h  # Also mdss du -sh
nci.org.au 49/86 Exercise 4: Filesystems (2/4)
Create two test files, and bundle them into a tar file:
cd /short/$PROJECT/$USER
rm *  # Remove existing files.
touch file1.$USER file2.$USER
tar cvf testfiles.tar file*  # See man tar for c, v, f options.
ls
tar --list -f testfiles.tar  # Check contents of the archive.
Create a user directory on massdata:
mdss rm -r $USER  # Delete the old directory, if it exists.
mdss mkdir $USER
mdss ls
Next, put testfiles.tar into your massdata directory. Syntax:
mdss put [-r] myfile [targetname]
targetname and -r are optional; -r copies a directory (man mdss).
Check the result using mdss ls $USER
nci.org.au 50/86 Exercise 4: Filesystems (3/4)
Where is the file stored?
mdss dmls -l  # REG = cached, OFL = tape, DUL = both.
Remove the files before retrieving archived copies:
rm file1.$USER file2.$USER testfiles.tar
Use mdss get to retrieve testfiles.tar. Syntax same as mdss put. See man mdss.
Check the result using ls.
Unpack the archive:
tar xvf testfiles.tar  # extract, verbose, filename; see man tar.
ls
nci.org.au 51/86 Exercise 4: Filesystems (4/4). If time permits...
Use netcp to copy files to massdata :
cd /short/$PROJECT/$USER
mkdir ex4; cp file* ex4
mdss rm -r ex4  # Remove the old copy from massdata.
netcp ex4 $USER/ex4
Notice that netcp returns a job ID (creates a copy job).
mdss ls  # 'ex4' won't appear until the job finishes.
qstat jobid  # Displays job status.
watch -n 4 qstat jobid  # Might help, see 'man watch'.
When the job finishes, check the result: mdss ls $USER/ex4
Inspect the contents of the job's output (.o) and error (.e) files.
Repeat the copy, this time using -t (archive) and -z (compress):
mdss rm $USER/ex4/*  # Clean the directory.
netcp -z -t myfile.tar /short/$PROJECT/$USER/ex4 $USER
Check that the copy was successful, then mdss get and 'untar' the file.
nci.org.au 52/86 Outline
I. Connecting
II. Resource Quotas
III. Software Environment
IV. Filesystems
V. Scheduling Jobs
nci.org.au 53/86 Scheduling Jobs: Overview
Tasks that are too large for login nodes must be submitted to the job scheduler (a modified version of PBS Pro).
'Large' means >30 mins cpu time or >2GB mem.
Only tasks submitted to the scheduler are charged for compute time!
The scheduler optimises throughput and gives a fair share to each project.
nci.org.au 54/86 Scheduling Jobs: Cluster nodes (1/2)
Compute nodes are only accessible via the job scheduler.
Datamover nodes are accessible remotely or via the scheduler.
Remote transfers to r-dm.nci.org.au aren't charged for compute time.
Long remote transfers are permitted.
nci.org.au 55/86 Scheduling Jobs: Cluster nodes (2/2)
Raijin: 3592 compute nodes, 5 datamover ('copyq') nodes.
Each node comprises dual 8-core Intel Xeon Sandy Bridge 2.6 GHz processors, i.e., 16 cores (core = 'CPU').
High-speed communication between cluster nodes (Infiniband).
Compute node memory capacities:
32GB : r1..r2395 (67% of all nodes)
64GB : r2396..r3520 (31% of all nodes)
126GB : r3521..r3592 (2% of all nodes)
Each copyq node has 32GB ram.
nci.org.au 56/86 Scheduling Jobs: Compute jobs vs copy jobs
A job can use compute nodes, or datamover nodes, but not both.
Compute jobs (compute nodes)...
... can't access the massdata filesystem.
... shouldn't be used for tasks that are mostly disk-based.
... can't access the internet.
Copy jobs (datamover nodes)...
... disk-intensive tasks: moving/compressing/tarring large data.
... copying input/output files to/from massdata.
... can only use a single CPU.
... can access the internet (wget, sftp, svn, git, etc.).
nci.org.au 57/86 Scheduling Jobs: Job queues
Jobs are submitted from login nodes (use the qsub command, more later).
Three job queues: normal, express, copyq.
Compute jobs: normal, express. Copy jobs: copyq.
normal is the default.
A job waits in the queue until resources become available...
... at which point the job is executed on compute or dm nodes.
nci.org.au 58/86 Scheduling Jobs: Which queue? (1/2)
normal (default):
Can request a large # of CPUs (10,000+).
Can request any memory type (32/64/126GB nodes).
express : High-priority jobs.
Small jobs often start shortly after being submitted.
Charged at three times the rate of other queues.
E.g., a 5 hr, 2 CPU express job costs 5 × 2 × 3 = 30 SUs.
Small per-job resource limits:
1 node ≤ 24 hours.
2 - 8 nodes ≤ 5 hours.
Can request any memory type.
nci.org.au 59/86 Scheduling Jobs: Which queue? (2/2)
copyq :
Intended for manipulation of large files.
Only queue that can access massdata / the internet.
Single CPU only, ≤ 32GB ram.
nf_limits shows your project's walltime limits for the specified # of CPUs.
Mem limit equal to the maximum available.
We can extend walltime limits on a per job/user/project basis.
nci.org.au 60/86 Scheduling Jobs: Job costs
Cost of a job (in SUs, i.e., service units) is calculated as: walltime × # CPUs × W
walltime = real-world time.
normal / copyq queues: W = 1. express queue: W = 3.
Charged for walltime used, not walltime requested.
Try not to request far more than needed.
Charged for # of CPUs (i.e., cores) requested, not # used.
Project's SU quota is updated after each job finishes.
nci_account also shows...
... project's total SU usage.
... (with the -v option) breakdown of SU usage by user.
... cost of running and queued jobs.
nci.org.au 61/86 Scheduling Jobs: Queue times (1/2)
The scheduler doesn't use a FIFO policy. However,...
... older jobs are given higher priority.
Jobs wait for resources. Requesting more than needed...
... increases the job's queue time.
... delays other jobs.
... wastes resources and compute grant.
express queue jobs often start soon after being submitted.
Requesting higher-mem nodes increases queue time, esp. 126GB nodes.
Jobs won't start if the storage grant is exceeded.
nci.org.au 62/86 Scheduling Jobs: Queue times (2/2)
The larger the cpu request, the higher the priority because...
... large jobs tie up too many resources while waiting to start.
... it's hard for the scheduler to fit other jobs around a large job (the TETRIS effect).
Priority decreases if a project has a large # of running jobs.
If you use your allocation too quickly (slowly), priority decreases (increases).
Jobs run with lower priority once the grant is exhausted ('bonus' jobs).
Detailed scheduling policy.
Load on Raijin spikes at the end of each quarter.
Don't leave it to the last minute to use your quarterly SU grant!
nci.org.au 63/86 Scheduling Jobs: Submitting jobs
Jobs are submitted using qsub options. Returns a job ID #.
Use -q normal (default) / express / copyq to specify the queue: qsub -q express ...
Use the -l option to specify resources: qsub -l walltime=01:00:00 -l ncpus=32 -l mem=2GB
Licensed software requires -l software=packagename.
To override the default project: -P projectcode.
Non-interactive job: qsub options scriptname
Alternatively, qsub options -- ListOfCommands
Avoid using the '--' syntax. It doesn't 'source' dot files.
Interactive job: qsub -I options
nci.org.au 64/86 Scheduling Jobs: Non-interactive jobs
Non-interactive job: qsub options scriptname.
The script will be executed on the first ('head') node assigned to the job.
Most options can be placed in the job script.
When the job ends, the scheduler creates two summary files (more later).
Job scripts have a fixed structure:
1. Shell invocation.
2. PBS directives (essentially qsub options).
3. User commands (must come last).
Contents of myscript.sh:
#!/bin/bash
#PBS -l walltime=20:00:00
#PBS -l mem=100MB
#PBS -l ncpus=16
#PBS (other pbs directives)
echo "This job does very little."
nci.org.au 65/86 Scheduling Jobs: Interactive jobs
Interactive job: qsub -I options
If walltime is not specified, queue defaults are used.
ctrl-c to cancel the job before it starts.
Prompted when the job starts...
... commands typed into the terminal are executed on compute/dm nodes.
... be sure to use the exit command to close the session.
For programs that require X-Windows, use qsub 's -X option.
NB. The scheduler won't save output/error msgs to file.
nci.org.au 66/86 Scheduling Jobs: CPU/Mem requests
-l ncpus=
Single-node job: ncpus ≤ 16.
Job can share a node if there are enough free memory and CPUs.
Multinode job: Must request whole nodes (ncpus a multiple of 16).
Nodes won't be shared; can request all available memory.
-l mem=
Specifies the total amount of memory required.
Nodes assigned to the job will have the same memory capacity.
Per-node memory request is calculated as mem / #nodes.
E.g., mem=80GB, ncpus=32 (2 nodes) ⇒ 40GB mem per node.
∴ the job will be assigned to the 64GB memory nodes.
Multinode job, so might as well use mem=128GB, i.e., 2×64GB.
Try to choose mem and ncpus so that the job uses the 32GB mem nodes (see the sketch below).
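For instance, a minimal header for a two-node job that fits on the plentiful 32GB nodes (the executable is a placeholder, and the exact headroom left for the OS is an assumption):

#!/bin/bash
#PBS -l ncpus=32  # 2 whole nodes (16 cpus each).
#PBS -l mem=60GB  # 30GB per node => fits the 32GB nodes.
#PBS -l walltime=02:00:00
./myprogram  # Placeholder executable.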
nci.org.au 67/86 Scheduling Jobs: Handy PBS directives
(In your own time. See man qsub for more options. A combined example follows below.)
-l wd
At the start of the job, the working dir is set to the submission dir.
The job's '.o' and '.e' summary files are placed in this dir.
Otherwise the working directory defaults to the home dir (~).
Suppresses execution of login and logout files.
-o filename, -e filename
Tells PBS where to put the '.o' and '.e' summary files.
-m EmailEvent
Send email notifications for specific events: a : job aborted, b : job began, e : job ended. E.g., -m abe (default -m a).
-M Email1,Email2,...
Recipients for job notification emails.
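Putting those directives together in one script header (the filenames and email address are placeholders):

#PBS -l wd  # Run from the submission directory.
#PBS -o run.out  # Where to put the stdout summary file...
#PBS -e run.err  # ...and the stderr file.
#PBS -m abe  # Mail on abort, begin, and end.
#PBS -M you@example.edu.au  # Placeholder recipient.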
nci.org.au 68/86 Scheduling Jobs: Job environment (1/2)
When a job starts, PBS...
... saves job parameters as environment variables.
... executes 'dot files', except .bash_logout / .logout.
Logout scripts are executed when the job ends.
The -l wd option suppresses execution of login/logout files.
PBS environment variables...
Useful for programs/scripts that require info about the execution environment.
Only visible to the job script or the terminal running an interactive job.
The -V option copies predefined environment variables to the job environment.
nci.org.au 69/86 Scheduling Jobs: Job environment (2/2)
Some useful PBS variables:
$PBS_JOBID : Job identifier (!).
$PBS_NCPUS : # of cpus requested, i.e., ncpus.
$PBS_NODEFILE : File that lists the nodes assigned to the job.
$PBS_JOBFS : Job's assigned JOBFS (scratch) directory.
$PBS_VMEM : Memory request, i.e., mem, not Vmem.
$PBS_O_WORKDIR : Name of the job submission directory.
qstat -f JobId shows the PBS variables for the specified job.
Also see the PBS Pro manual (there might be small discrepancies).
nci.org.au 70/86 Scheduling Jobs: Postmortems (1/2)
PBS captures the standard output/error produced by non-interactive jobs.
stdout: Jobname.oJobId
stderr: Jobname.eJobId
Automatically copied over to the working dir when the job ends.
A summary of resource usage is appended to the '.o' file.
If PBS detects an error, PBS appends a message to the '.e' file.
Check these files if a job terminates abnormally!
Sometimes the OS kills a job before PBS realises there's a problem, especially if mem usage spikes.
nci.org.au 71/86 Scheduling Jobs: Postmortems (2/2)
Contents of myscript.sh.o123456:
==========================================================
Resource Usage on 2013-07-20 12:48:04.355160:
JobId: 123456.r-man2 Project: abc
Exit Status: 0 (Linux Signal 0)
Service Units: 32.00
NCPUs Requested: 32    CPUs Used: 32
CPU Time Used: 18:50:43
Memory Requested: 900mb    Memory Used: 80mb
Vmem Used: 94mb
Walltime requested: 02:00:00    Walltime Used: 01:00:00
jobfs request: 100mb    jobfs used: 1mb
==========================================================
Memory Used : Mem used by the head node.
Vmem : Ignore this.
jobfs used : JOBFS used by all nodes. Details to come.
CPU utilisation is low if CPU Time ≪ Walltime Used × NCPUS.
nci.org.au 72/86 Scheduling Jobs: JOBFS requests (1/2)
JOBFS : Node-local scratch space, 396GB/node.
Slow, but can outperform /short, /g/data for small/frequent IO.
Only lasts for the duration of the job. Don't write checkpoint files to JOBFS.
qsub option/PBS directive is -l jobfs=amount.
amount is the total jobfs request. E.g., 100MB, 25GB, ...
Per-node jobfs request is calculated as amount / #nodes.
PBS stores the path to JOBFS in $PBS_JOBFS.
$PBS_JOBFS is only visible to your job!
nci.org.au 73/86 Scheduling Jobs: JOBFS requests (2/2)
Example JOBFS usage. Contents of myscript.sh:
#!/bin/bash
#PBS -l ncpus=64
#PBS -l jobfs=2GB
(OTHER PBS DIRECTIVES)
echo The JOBFS directory for this job is $PBS_JOBFS
cp my_input_file $PBS_JOBFS
myprogram $PBS_JOBFS/my_input_file $PBS_JOBFS/my_output_file
cp $PBS_JOBFS/my_output_file /short/c25/$USER
The effective per-node JOBFS request is 2GB/(64/16) = 512MB.
The script is executed on the head node only. ∴ cp copies to/from the head node only.
mdss and netcp / netmv commands don't work for JOBFS.
Also see 'What is the JOBFS filesystem?'.
nci.org.au 74/86 Scheduling Jobs: Other filesystems
To prevent a job running if /g/data or massdata is offline:
#PBS -l other=filesystem
filesystem = gdata1, gdata2, mdss (i.e., massdata).
Not mandatory, but good practice.
massdata is not available to compute jobs, i.e., normal / express queues*.
* The mdss command only works from copyq jobs and login nodes.
You can also use modstatus to check filesystem availability:
/opt/rash/bin/modstatus -n status
status = gdata1_status, gdata2_status, or mdss_status.
nci.org.au 75/86 Scheduling Jobs: Modifying jobs
qalter JobId : Change resource requests of jobs waiting to start. walltime, mem, ncpus, project, ...
qdel JobId : Delete queued or running jobs.
exit : Stop the currently-running interactive job.
qhold : Prevent a queued job from starting, e.g., job dependencies.
qselect : Lists jobs that meet criteria, e.g., belong to project X.
qmove : Move a waiting job to a different queue.
We can increase the walltime of running jobs (32, 64GB mem nodes only).
nci.org.au 76/86 Scheduling Jobs: Job status
To display job status: qstat options JobId1 JobId2 ...
Some useful options (see man qstat for many more):
-u username : List the user's queued/running jobs.
-q queuename : Show jobs for the specified queue.
-x : Include jobs that have finished in the last day.
-f : Show all information about the job(s). Resource usage is the aggregate of all nodes.
-s : System comments. Good for troubleshooting.
-n : List hostnames of the nodes assigned to the job.
-w : Use wider output fields.
nqstat, nqstat_anu : Status of jobs belonging to your projects*.
* New jobs might not show up immediately.
nci.org.au 77/86 Scheduling Jobs: Job progress
qps : Resources used by the job's processes. Same options as ps.
qstat -n, qstat -f : Show the list of nodes assigned to the job.
qstat -f, nqstat_anu : Give a rough indication of cpu utilisation %.
pbs_rusage : Summary of resource usage, as given in the '.o' file.
qls : List contents of a running job's JOBFS dir.
qcat : Show the job script or the std output/error produced so far ('.o' and '.e' files).
qcp : Copy files to/from a running job's JOBFS dir.
nci.org.au 78/86 Scheduling Jobs: Checkpoints/Automation
Long run-times expose jobs to system/program instabilities.
You won't be reimbursed for lost SUs.
Consider implementing a checkpoint mechanism. Don't save checkpoint files to JOBFS.
Self-submitting jobs can resume automatically if interrupted.
Job dependencies: -W depend=type:JobId1:JobId2...
type:
after : Start after dependencies have started.
afterok : Start if dependencies finish successfully.
afternotok : Start if dependencies finish with errors.
afterany : Start after all dependencies finish.
Also on, before, beforeok, etc. See the PBS Pro manual.
Multiple levels of dependencies can fail if jobs take too long. A short example follows below.
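A minimal sketch of chaining two jobs (the script names are placeholders); qsub prints the new job's ID, which the second submission then depends on:

JOB1=$(qsub stage1.sh)  # qsub prints the job ID, e.g., 123456.r-man2.
qsub -W depend=afterok:$JOB1 stage2.sh  # stage2 starts only if stage1 exits successfully.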
nci.org.au 79/86 Scheduling Jobs: Note on parallelism (1/3)
Many packages take advantage of parallelism automatically.
Options for parallelising custom code:
Option 1. The job script starts multiple copies of your program (see the sketch below).
(a) 'for' loop to start processes in the background (&), then wait.
(b) pbsdsh, pbsdsh_anu (like ssh): Can detect multiple nodes.
(c) pbs_tmrsh (like ssh): Flexible, but you must give it node names from $PBS_NODEFILE.
pbsdsh, etc., only work from within a job script (or interactive job).
Option 1 works for serial code.
Contention when multiple processes access the same file.
1000's of simultaneous IO ops can degrade Lustre speed.
Work and memory are replicated unnecessarily.
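A sketch of variant (a), assuming a serial executable ./mytask that takes an input file (all names are placeholders):

#!/bin/bash
#PBS -l ncpus=16  # One node; one background process per cpu.
#PBS -l wd
for i in $(seq 1 16); do
    ./mytask input.$i > output.$i &  # '&' runs each copy in the background.
done
wait  # Don't exit until all background processes finish.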
nci.org.au 80/86 Scheduling Jobs: Note on parallelism (2/3)
Option 2. Shared-memory parallelism via OpenMP. (Not to be confused with OpenMPI.)
CPUs must reside on the same node. ∴ limited to 16 CPUs.
Imposes parallelism onto serial code via embedded compiler directives.
Can combine with Option 1 to overcome the node limit (cumbersome). A job-script sketch follows below.
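The job-script side of an OpenMP run might look like this sketch (the executable name is a placeholder, and compiling it with OpenMP support is assumed):

#!/bin/bash
#PBS -l ncpus=16  # OpenMP threads must share one node.
#PBS -l wd
export OMP_NUM_THREADS=$PBS_NCPUS  # One thread per requested cpu.
./my_openmp_program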
nci.org.au 81/86 Scheduling Jobs: Note on parallelism (3/3)
Option 3. Distributed parallelism via the MPI library.
Arbitrary number of CPUs/nodes. Overcomes the limitations of the previous two options.
Many programs can be implemented using just the basic MPI calls.
A highly-optimised version of OpenMPI is installed on Raijin.
Once you're accustomed to MPI you'll never look back...
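And the corresponding job-script sketch (the executable is a placeholder; loading the default openmpi module, rather than a pinned version, is an assumption):

#!/bin/bash
#PBS -l ncpus=32  # Two whole nodes.
#PBS -l wd
module load openmpi  # Default version assumed here.
mpirun -np $PBS_NCPUS ./my_mpi_program  # One MPI rank per requested cpu.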
nci.org.au 82/86 Exercise 5. Using job scheduler (1/3)
Create the following job script, and call it, e.g., exercise5.sh :
#!/bin/bash
#PBS -q express
#PBS -l walltime=00:04:00
#PBS -l ncpus=2,mem=10MB,jobfs=10MB
#PBS -l wd
echo "ncpus = $PBS_NCPUS, total mem = $PBS_VMEM bytes"
echo "jobfs dir = $PBS_JOBFS"
echo "Contents of node file:"
cat $PBS_NODEFILE
NUM_NODES=$(cat $PBS_NODEFILE | wc -l)  # NB. $(command) is replaced by the output of command.
NODE_NAMES=$(uniq $PBS_NODEFILE)  # See man uniq.
echo "# of nodes: $NUM_NODES"
echo "Hostnames of nodes: $NODE_NAMES"
sleep 300  # Sleep for 5 minutes.
echo "Some things just aren't meant to be."
nci.org.au 83/86 Exercise 5. Using job scheduler (2/3)
Make the script executable and submit it to the scheduler:
chmod +x exercise5.sh
qsub exercise5.sh
Experiment with qstat / nqstat / nqstat_anu, e.g.,
qstat -Q, qstat normal, qstat -u $USER, qstat -saw JobId, nqstat_anu -P $PROJECT
Once the job starts, check progress using, e.g.,
qstat -f JobId, qps JobId, qcat -o JobId
Wait for the job to finish:
watch -n 4 qstat JobId  # ctrl-c to stop watching.
Did the job finish successfully? Inspect the .o and .e files, e.g.,
cat exercise5.sh.oJobId
nci.org.au 84/86 Exercise 5. Using job scheduler (3/3)
If you like, submit an interactive job with the X-Windows option for qsub (-X):
You must be connected to Raijin using ssh with the -X (PC) or -Y (Mac) option. Then,
qsub -I -X -q express -l walltime=00:02:00,ncpus=1
(when the job starts) xeyes
ctrl-c to close xeyes.
Make sure you use the exit command to end the job! Just to be certain: qdel JobId
nci.org.au 85/86 Finally... Raijin fun facts!
Time-lapse video of Raijin being assembled.
Watch our tape robot at work.