Carrie Brown, Adam Caprez


1 Carrie Brown, Adam Caprez
June Workshop Series
June 27th: All About SLURM
University of Nebraska-Lincoln, Holland Computing Center
Carrie Brown, Adam Caprez

2 Setup Instructions
Please complete these steps before the lessons start at 1:00 PM:
- If you need to use a demo account, please speak with one of the helpers.
- If you need help with the setup, please put a red sticky note at the top of your laptop.
- When you are done with the setup, please put a green sticky note at the top of your laptop.

3 June Workshop Series Schedule
June 6th: Introductory Bash
June 13th: Advanced Bash and Git
June 20th: Introductory HCC
June 27th: All About SLURM - Learn all about the Simple Linux Utility for Resource Management (SLURM), HCC's workload manager (scheduler), and how to select the best options to streamline your jobs.

Upcoming Software Carpentry workshops:
UNL: HCC Kickstart (Bash, Git, and HCC Basics) - September 5th and 6th
UNO: Software Carpentry (Bash, Git, and R) - October 16th and 17th

4 Logistics Name tags, sign-in sheet
Sticky notes: Red = need help, Green = all good
Link to Workshop Materials:
Etherpad:
Terminal commands are in this font.
Any entries surrounded by <brackets> need to be filled in with your own information. Example: <username> becomes demo01 if your username is demo01.
Today we will be using the reservation "hccjune" for all jobs. Make sure your submit scripts include the line:
#SBATCH --reservation=hccjune

5 What is a Cluster?

6 Exercises
If you aren't already, connect to the Crane cluster.
Navigate to your $WORK directory.
If you were not here last week, or do not have the tutorial directory, clone the files to your $WORK directory with the command: git clone
Make a new directory inside the tutorial directory (./HCCWorkshops/) named slurm - this is where we will put all of our tutorial files for today.
Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.

7 SLURM Simple Linux Utility for Resource Management
Open-source, scalable cluster management and job scheduling system
Used on ~60% of the TOP500 supercomputers
Three key functions:
- Allocates exclusive or non-exclusive access to resources
- Provides a framework for starting, executing, and monitoring work
- Manages a queue of pending jobs
Uses a best-fit algorithm to assign tasks and the Fair Tree fairshare algorithm to prioritize them

8 Slurm vs PBS

To do this...                       PBS/SGE command          Slurm equivalent
Submit a job                        qsub <script_file>       sbatch <script_file>
Cancel a job                        qdel <job_id>            scancel <job_id>
Check the status of a job           qstat <job_id>           squeue -j <job_id>
Check the status of all jobs        qstat -u <user_name>     squeue -u <user_name>
  by user
Hold a job                          qhold <job_id>           scontrol hold <job_id>
Release a job                       qrls <job_id>            scontrol release <job_id>

More commands and schedulers:

9 sinfo
Shows a listing of all partitions on a cluster.
Use #SBATCH --partition=<partition_name>
All partitions have a 7-day run-time limit.
Publicly available partitions:

Partition    Description                                         Limitations                        Clusters
batch        Default partition                                   2000 max CPUs per user             Crane, Tusker
guest        Uses free time on owned or leased InfiniBand (IB)   Pre-emptable; max 158 IB CPUs      Crane
             or Omni-Path Architecture (OPA) nodes               and 2000 OPA CPUs per user
highmem      High-memory nodes (512 and 1024 GB)                 192 max CPUs per user              Tusker
gpu_k20      GPU nodes - 3x Tesla K20m per node, with IB         48 max CPUs per user
gpu_m2070    GPU nodes - 2x Tesla M2070 per node, non-IB
gpu_p100     GPU nodes - 2x Tesla P100 per node, with OPA        40 max CPUs per user

10 Fair Tree Fairshare Algorithm
Fair Tree prioritizes users such that if accounts A and B are siblings and A has a higher fairshare factor than B, all children of A will have higher fairshare factors than all children of B.
Benefits:
- All users in a higher-priority account receive a higher fairshare factor than all users from a lower-priority account
- Users in a more active group have lower priority than users in a less active group
- Users are sorted and ranked to prevent precision loss
- Priority is calculated based on rank, not directly off of the Level FS value
- New jobs are immediately assigned a priority
- User ranking is recalculated at 5-minute intervals

11 Calculation of Level FS (LF)
LF = S / U        (0 ≤ LF)

Where:
S = Shares Norm: assigned shares normalized to the shares assigned to itself and its siblings
    S = S_raw(self) / S_raw(self + siblings)        (0 ≤ S ≤ 1)
U = Effective Usage: usage normalized to the account's usage
    U = U_raw(self) / U_raw(self + siblings)        (0 ≤ U ≤ 1)
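As a concrete illustration, here is the Level FS arithmetic for two hypothetical sibling accounts (all share and usage numbers are made up for this sketch):

```shell
# Hypothetical sibling accounts A and B under the same parent.
S_raw_A=40; S_raw_B=60     # assigned raw shares
U_raw_A=10; U_raw_B=90     # raw usage

# Normalize shares and usage against self + siblings, then divide.
S_A=$(awk "BEGIN{print $S_raw_A/($S_raw_A+$S_raw_B)}")  # 0.4
U_A=$(awk "BEGIN{print $U_raw_A/($U_raw_A+$U_raw_B)}")  # 0.1
LF_A=$(awk "BEGIN{print $S_A/$U_A}")                    # 4

echo "Account A: S=$S_A U=$U_A LevelFS=$LF_A"
```

Account A holds 40% of the shares but has only 10% of the usage, so its Level FS is well above 1: it is under-served, and its jobs get higher priority.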

12 Fairshare Algorithm
[Tree diagram: root, with group accounts (gProf1, gProf2) as children and users (uProf1, uStudent3, uProf2, uCollab78, uPhd17) as leaves]
Uses a rooted plane tree (aka rooted ordered tree), sorted by Level FS, descending from left to right.
The tree is traversed depth-first - users are assigned a rank and given a fairshare factor.
Process:
1. Calculate Level FS for the subtree's children
2. Sort the children of the subtree
3. Visit the children in descending order and assign the fairshare factor:
   fairshare factor = rank / (total # of users)

13 Exercises
You can check on the share division and usage on Holland clusters with the sshare command. The output of this command can be quite long; combine it with head or grep to see individual portions of it.
1. Can you write a command so you only see the first 10 lines of output?
2. Modify the previous command to use grep to find your user and group information.
3. Compare your EffectvUsage to your NormShares - have you used more than your NormShares? How about your group overall? How does the group's EffectvUsage compare to its NormShares?
4. The sshare argument -l shows extended output, including the current calculated LevelFS values. Repeat the steps in #1, but with the -l argument this time. How does your LevelFS value compare to your group's LevelFS value? Does the calculated LevelFS value correspond to the differences you observed in EffectvUsage?
Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.
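A sketch of the filtering pattern the exercise asks for, run here against a made-up sshare listing (the account names, the user demo01, and all numbers are invented; on Crane you would pipe the output of the real sshare command instead):

```shell
# Fake stand-in for `sshare -l` output, just to practice head/grep.
sample='Account       User    NormShares  EffectvUsage  LevelFS
root                          1.000000    1.000000
 grp_alpha            0.5000  0.500000    0.700000      0.714286
  grp_alpha  demo01   0.2500  0.250000    0.300000      0.833333'

printf '%s\n' "$sample" | head -n 2        # first lines: header + root
printf '%s\n' "$sample" | grep demo01      # just your own row
```

On the cluster the same shape applies: `sshare | head -n 10` and `sshare -l | grep <your_username>`.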

14 sbatch
Used to asynchronously submit a batch job to execute on allocated resources.
Sequence of events:
1. User submits a script via sbatch
2. When resources become available, they are allocated to the job
3. The script is executed on one node (the master node); the script must launch any other tasks on the allocated nodes
4. STDOUT and STDERR are captured and redirected to the output file(s)
5. When the script terminates, the allocation is released
Any non-zero exit code will be interpreted as a failure.

15 Submit Scripts
Name of the submit file: This can be anything. Here we are using "invert_single.slurm"; the .slurm extension makes it easy to recognize that this is a submit file.
Shebang: The shebang tells Slurm what interpreter to use for this file. This one is for the shell (Bash).
SBATCH options: These must come immediately after the shebang and before any commands. The only required SBATCH options are time, nodes, and mem, but there are many more you can use to fully customize your allocation.
Commands: Any commands after the SBATCH lines will be executed by the interpreter specified in the shebang - similar to what would happen if you typed the commands interactively.
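The anatomy above, as a minimal skeleton (the resource values and the echo command are placeholders, not the workshop's actual invert_single.slurm contents):

```shell
# Write a skeleton submit file: shebang, then SBATCH options, then commands.
cat > invert_single.slurm <<'EOF'
#!/bin/bash
#SBATCH --time=00:30:00          # required: walltime
#SBATCH --nodes=1                # required: node count
#SBATCH --mem=4gb                # required: memory per node
#SBATCH --job-name=invert_single
#SBATCH --output=invert_single.%j.out

echo "commands go here"
EOF

# On Crane you would submit with: sbatch invert_single.slurm
# Locally, the #SBATCH lines are plain comments, so bash can preview the body:
bash invert_single.slurm
```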

16 Submit Files Best Practices
- Put all module loads immediately after the SBATCH lines, so you can quickly locate what modules and versions were used.
- Specify versions on module loads; this lets you see exactly what versions were used during the analysis.
- Use a separate submit file for each analysis. Instead of editing and resubmitting a submit file, copy a previous one and make changes to it - this keeps a running record of your analyses.
- Redirect output and error to separate files, so you can see quickly whether a job completed with errors or not.
- Separate individual workflow steps into individual jobs; avoid putting too many steps into a single job.

17 Shebang! - Interpreters
- Must be included in the first line of the submit script
- Must be an absolute path
- Specifies which program is used to execute the contents of the script
The shebang in the submit file can be one of the following:
#!/bin/bash - the most common shell and also the default shell at HCC
#!/bin/csh - symlink to tcsh
#!/usr/bin/perl
#!/usr/bin/python
Using Perl or Python interpreters can make loading modules difficult.
Scripts that return anything but 0 will be interpreted as a failed job by Slurm.

18 Common SBATCH Options

Option               What it does
--nodes              Number of nodes requested
--time               Maximum walltime for the job, in DD-HH:MM:SS format - maximum of 7 days on the batch partition
--mem                Real memory (RAM) required per node - can use KB, MB, and GB units - default is MB.
                     Request less memory than the total available on the node: the maximum available on a
                     512 GB RAM node is 500, on a 256 GB RAM node 250.
--ntasks-per-node    Number of tasks per node - used to request a specific number of cores
--mem-per-cpu        Minimum memory required per allocated CPU - default is 1 GB
--output             Filename where all STDOUT will be directed - default is slurm-<jobid>.out
--error              Filename where all STDERR will be directed - default is slurm-<jobid>.out
--job-name           How the job will show up in the queue

For more information: sbatch --help
SLURM Documentation:

19 scancel
Used to cancel jobs prior to completion.
Usage: scancel <job_id>
Use other arguments to cancel multiple jobs at once, or combine them to prevent accidentally canceling the wrong job.

Argument                   Cancels...
--name=<job_name>          jobs with this name
--partition=<partition>    jobs in this partition
--user=<user_name>         jobs of this user
--state=<job_state>        jobs in this state (valid states: PENDING, RUNNING, and SUSPENDED)

20 Short qos
Increases a job's priority, allowing it to run as soon as possible. Useful for testing and developmental work.
Limitations:
- 6-hour runtime
- 1 job of 16 CPUs or fewer
- Max of 2 jobs per user
- Max of 256 CPUs in use for all short jobs from all users
To use, include this line in your submit script:
#SBATCH --qos=short
For more information:

21 Exercise
Write a submit script from scratch (no copying previous ones!). The script should use the following parameters:
- Uses 1 node
- Uses 10 GB RAM
- 10 minutes runtime
- Executes the command: echo "I can write submit scripts!"
Submit your script and watch for output. If you run into errors, copy the error to the Etherpad. If you were able to fix the error, add a brief note explaining how you did it.
Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.

22 Exercise Solution
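One way to write the requested script - a sketch, with the hccjune reservation line from the logistics slide included (the file name is a placeholder):

```shell
cat > first_submit.slurm <<'EOF'
#!/bin/bash
#SBATCH --nodes=1                    # 1 node
#SBATCH --mem=10gb                   # 10 GB RAM
#SBATCH --time=00:10:00              # 10 minutes runtime
#SBATCH --reservation=hccjune        # workshop reservation
#SBATCH --output=first_submit.%j.out

echo "I can write submit scripts!"
EOF

# Submit on Crane with: sbatch first_submit.slurm
# Local preview of the job body (#SBATCH lines are comments to bash):
bash first_submit.slurm
```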

23 squeue
Job ID: The ID number assigned to your job by Slurm
Name: The name you gave the job, as specified in the submit script
Time: The length of time the job has been running
Nodes: The number of nodes the job is running on
State: The current status of the job. Common states include:
  CD - Completed
  CA - Canceled
  F - Failed
  PD - Pending
  R - Running
Nodelist: If the job is running, the names of the nodes the job is running on; if the job is pending, the reason the job is pending
Partition: The partition the job is running on or assigned to
User: The user that owns the job
For more information:

24 Common Reason Codes

Reason code        Description
Dependency         This job is waiting for a dependent job to complete.
NodeDown           A node required by the job is down.
PartitionDown      The partition (queue) required by this job is in a DOWN state and is temporarily accepting
                   no jobs, for instance because of maintenance. Note that this message may be displayed for
                   a time even after the system is back up.
Priority           One or more higher-priority jobs exist for this partition or advanced reservation; other
                   jobs in the queue have higher priority than yours.
ReqNodeNotAvail    No nodes can be found satisfying your limits, for instance because maintenance is scheduled
                   and the job cannot finish before it starts.
Reservation        The job is waiting for its advanced reservation to become available.

More information: squeue --help

25 Common squeue Options

Option                                   Displays information about...
-u <user_name> / --user=<user_name>      jobs owned by the specified user(s) *
-j <job_list>                            the specified job(s) *
-p <part_list>                           jobs in the specified partition(s) *
-t <state_list>                          jobs in the specified state(s) - {PD, R, S, CG, CD, CF, CA, F, TO, PR, NF} *
-i <interval> / --iterate=<interval>     jobs, reported repeatedly at the given interval (in seconds)
-S <sort_list> / --sort=<sort_list>      jobs sorted by the specified field(s)
--start                                  pending jobs and their scheduled start times

* Indicates arguments that can take a comma-separated list

For more options:

26 Exercise
Use the squeue command to determine the following (hint: don't forget about wc -l):
- How many jobs are currently Running?
- How many jobs are currently Pending?
- The grid partition is composed of resources that are made available to the Open Science Grid. How many jobs are currently in the queue for this partition?
- How many jobs are currently in the queue for the user "root"?
Edit the submit script you made previously. Add the following command to execute after the echo command: sleep 120
Submit the updated script file and monitor its progress with squeue. If it is pending for a while, use --start to see how much longer until it is expected to start. How accurate was the estimate?
Can you guess what sleep does just by how your job changes? If not, take a look at the documentation (sleep --help).
Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.

27 Customizing squeue output
Use the --Format argument (must be capitalized).
Fields you want displayed are specified in a comma-separated list, without spaces, after the argument.
Fields of note: priority, reason, dependency, eligibletime, endtime, state / statecompact, submittime
Even more customization options are available for --Format and the --format flag - check out man squeue for more information.

28 Environmental Variables and Replacement Symbols
Can be used in the command section of a submit file (passed to scripts or programs via arguments).
Cannot be used within an #SBATCH directive - use replacement symbols instead.

Replacement symbols:
Symbol    Value
%A        Job array's master job allocation number
%a        Job array ID (index) number
%j        Job allocation number (job id)
%N        Node name - will be replaced by the name of the first node in the job (the one that runs the script)
%u        User name
%%        The character "%"

Environment variables:
Variable                Description
SLURM_JOB_ID            batch job id assigned by Slurm upon submission
SLURM_JOB_NAME          user-assigned job name
SLURM_NNODES            number of nodes
SLURM_NODELIST          list of nodes
SLURM_NTASKS            total number of tasks
SLURM_QUEUE             queue (partition)
SLURM_SUBMIT_DIR        directory of submission
SLURM_TASKS_PER_NODE    number of tasks per node

A number can be placed between % and the following character to zero-pad the result. For example, with --output=job%j.out the output file is named after the job id, and job%9j.out zero-pads the job id to 9 digits.
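A quick way to see the difference: replacement symbols like %j are expanded by Slurm inside #SBATCH lines, while variables like $SLURM_JOB_ID are only available to the commands at run time. This sketch fakes the job id so the command section can be previewed off the cluster:

```shell
# Inside a real job, Slurm sets SLURM_JOB_ID; we fake it for a local preview.
export SLURM_JOB_ID=12345    # placeholder job id

# Fine in the command section:
outdir="run_${SLURM_JOB_ID}"
mkdir -p "$outdir"
echo "results for job $SLURM_JOB_ID" > "$outdir/summary.txt"
cat "$outdir/summary.txt"

# NOT fine in a directive: "#SBATCH --output=$SLURM_JOB_ID.out" would not expand.
# Use the replacement symbol instead:  #SBATCH --output=job%j.out
```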

29 Additional sbatch Options
Argument                         Details
--begin=<time>                   The controller will wait to allocate the job until the specified time.
                                 Specific time: HH:MM:SS
                                 Specific date: MMDDYY or MM/DD/YY or YYYY-MM-DD
                                 Specific date and time: YYYY-MM-DD[THH:MM:SS]
                                 Keywords can be used (now, today, tomorrow), and times can be relative, in the
                                 format now+<time>
--deadline=<time>                Remove the job if it cannot finish before the deadline. Valid time formats:
                                 HH:MM[:SS] [AM|PM]
                                 MMDD[YY] or MM/DD[/YY] or MM.DD[.YY]
                                 MM/DD[/YY]-HH:MM[:SS]
                                 YYYY-MM-DD[THH:MM[:SS]]
--hold                           Hold the job in a "held state" until it is released manually with
                                 scontrol release <job_id>
--immediate                      Only release the job if the resources are immediately available
--mail-type=<type>               Notify the user by email when certain event types occur. Valid types include:
                                 BEGIN, END, FAIL, ALL, TIME_LIMIT, TIME_LIMIT_X (when X% of the time is up,
                                 where X is 90, 80, or 50)
--mail-user=<email>              Specify an email address to send event notifications to
--open-mode=<append|truncate>    Specify how to open output files - default is truncate
--test-only                      Validates the script and returns a start-time estimate based on the current
                                 queue and job requirements; does not submit the job
--tmp=<MB>                       Minimum amount of temporary disk space on the allocated node

30 Exercises
1. Edit the submit script you created previously to include at least two of the additional options we discussed. Submit the script to see how they work.
2. Try changing some of the parameters (number of nodes, memory, or time) and use the #SBATCH --test-only argument to see how the estimated start time changes. Which parameter seems to affect it the most?
3. Using the cd command, navigate to the matlab directory inside of HCCWorkshops. Use less to view the contents of the invertRand.submit file. Can you find all of the environmental variables and replacement symbols used? What role does each of them play in this script?
4. Navigate back into the directory which contains the submit script you made today. Edit the script to include one environmental variable and one replacement symbol. Submit the script and check whether your changes worked the way you expected.
Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.

31 Array Job Submissions Submits a specified number of identical jobs
Use environmental variables and replacement symbols to separate output.
Usage: #SBATCH --array=<array numbers or ranges>
The array list can be any combination of the following:
- A comma-separated list of values. #SBATCH --array=1,5,10 submits 3 array jobs with array ids 1, 5, 10
- A range of values with a - separator. #SBATCH --array=0-5 submits 6 array jobs with array ids 0, 1, 2, 3, 4, 5
- A range of values with a : to indicate a step value. #SBATCH --array=1-9:2 submits 5 array jobs with array ids 1, 3, 5, 7, 9
- A % can be used to specify the maximum number of simultaneous tasks (default is 1000). #SBATCH --array=1-10%4 submits 10 array jobs with at most 4 running simultaneously
To cancel array jobs:
- Usage: scancel <job_id>_<array numbers>
- Cancel all array jobs: scancel <job_id>
- Cancel a single array id: scancel <job_id>_<array id>
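Each array task runs the same script; Slurm tells the tasks apart via SLURM_ARRAY_TASK_ID (and the %a symbol in #SBATCH lines). A sketch of a task body, with the id set by hand so one task can be previewed off the cluster (the input file name is a placeholder):

```shell
# On the cluster this would sit under e.g. #SBATCH --array=1-9:2 with
# #SBATCH --output=task_%a.out; here we fake the task id for a local preview.
export SLURM_ARRAY_TASK_ID=3

# Each task picks its own input based on its array id.
input="input_${SLURM_ARRAY_TASK_ID}.txt"
echo "task ${SLURM_ARRAY_TASK_ID} would process ${input}"
```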

32 Exercises
1. Specify how many jobs these commands will create. What are their array ids? How many will run simultaneously?
   #SBATCH --array=5-10
   #SBATCH --array=0-4,15-20
   #SBATCH --array=1,3-10:2
   #SBATCH --array=0-20:2%10
2. When we looked at the output of the example array job, the output was not in numeric order. Can you think of a reason why that happens?
3. Edit the example array job to do the following:
   - Run 15 array tasks, each one with an odd array id
   - Run 5 array tasks, each one with a unique 3-digit id
Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.

33 Job Dependencies
Allows you to queue multiple jobs that depend on the completion of one or more previous jobs.
When submitting a job, use the -d argument followed by a specification of which jobs to wait for and when to execute: <when_to_execute>:<job_id>
- After successful completion: afterok:<job_id>
- After non-successful completion: afternotok:<job_id>
- Multiple job ids can be specified, separated with colons: afterok:<job_id1>:<job_id2>
Dependent jobs can use output and files created by previous jobs.
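The chaining pattern looks like this. sbatch's --parsable flag prints just the job id, which makes it easy to capture in a variable. Since no cluster is available here, sbatch is stubbed out with a fake function returning a made-up id; on Crane you would delete the stub and the real sbatch (with real JobA.submit/JobB.submit files) would be used:

```shell
# Stub so the pattern can be run without a cluster; remove on Crane.
sbatch() { echo "1001"; }   # fake job id

# Capture JobA's id, then queue jobs that wait for its successful completion:
jid_a=$(sbatch --parsable JobA.submit)
echo "sbatch -d afterok:${jid_a} JobB.submit"
echo "sbatch -d afterok:${jid_a} JobC.submit"
```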

34 Exercises
1. Copy the JobB.submit script, calling the new one JobC.submit, and edit the contents accordingly (replace all instances of "B" with "C").
2. Using sbatch, queue JobA. Then queue JobB and JobC, setting them both to begin after the successful completion of JobA.
3. Using the previous three submit scripts, create a new submit script which will do the following:
   - Combine the output from both JobB and JobC into a text file called "JobD.txt"
   - Add the line "Sample job D output" to this new text file
4. Using these four submit scripts, run them so the jobs trigger in order: JobA first, then JobB and JobC, then JobD.
Once you have finished, put up your green sticky note. If you have issues, put up your red sticky note and one of the helpers will be around to assist.

35 Exercise Solution

36 srun
Used to synchronously submit a single command; commonly used to start interactive sessions.
Sequence of events:
1. User submits a command for execution. It may include command-line arguments and will be executed exactly as specified.
2. If an allocation exists, the job executes immediately; otherwise, the job will block until a new allocation is established.
3. n identical copies of the command are run simultaneously on the allocated resources as individual tasks.
4. Once all tasks terminate, the srun session terminates. If the allocation was created by srun, it is released.
--pty enables pseudo-terminal mode - input and output are directed to the user's shell.

37 Using srun to monitor batch jobs
Connect to the node running the job:
srun --jobid=<job_id> --pty bash (or top)
srun --nodelist=<node_id> --pty bash (or top)
Monitor:
- top (if not already running): use to monitor core use - ideal for multi-core processes. Press 'u' to search for your username.
- cat /cgroup/memory/slurm/uid_<uid>/job_<job_id>/memory.max_usage_in_bytes: use to monitor memory use. To determine your uid, use: id -u <user_name>
- Pair either with watch -n <seconds> to specify a refresh interval - the default is 2 seconds. CTRL+C to exit.
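A small helper for the memory check above; the job id is a placeholder, and the cgroup path follows the slide (the file itself only exists on the compute node running the job):

```shell
# Build the cgroup path for the current user and a given job id.
uid=$(id -u)                 # numeric uid of the current user
job_id=123456                # placeholder: your real Slurm job id
memfile="/cgroup/memory/slurm/uid_${uid}/job_${job_id}/memory.max_usage_in_bytes"

echo "$memfile"
# On the node: watch -n 5 cat "$memfile"   (refresh every 5 seconds)
```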

