High-Performance Computing Survival Guide

1 High-Performance Computing Survival Guide
James R. Knight Yale Center for Genome Analysis Department of Genetics Yale University January 14, 2015

2 1950’s – The Beginning...

3 2015 – Looking very similar...

4 ...but there are differences
Not a single computer but thousands of them, called a cluster
  Hundreds of physical “computers”, called nodes
  Each with 4-64 CPUs, called cores
Nobody works in the server rooms anymore
  IT is there to fix what breaks, not to run computations (or help you run computations)
  Everything is done by remote connections
Computation is performed by submitting jobs for running
  This actually hasn’t changed...but how you run jobs has...

5 A Compute Cluster
300+ users. 90 compute nodes for general use. 300 TB disk space.
[Diagram: “You are here!”, the login node Login-0-1, and the compute nodes Compute-1-1, Compute-1-2, Compute-2-1, Compute-2-2, Compute-3-1 and Compute-3-2, all connected by a network.]

6 You Use a Compute Cluster! Surfing the Web
[Diagram: “You are here!”, connected through a network to several compute nodes. You click on a link, the compute nodes construct the webpage contents, and the webpage is returned to you.]

7 How you’ll be using Louise
300+ users. 90 compute nodes for general use. 300 TB disk space.
[Diagram: “You are here!”, the login node Login-0-1, and the compute nodes Compute-1-1, Compute-1-2, Compute-2-1, Compute-2-2, Compute-3-1 and Compute-3-2 on a network.]
  Connect by ssh to the login node
  Connect by qsub -I to a compute node
  Run commands on compute nodes (and submit qsub jobs to the rest of the cluster)

8 1970’s – Terminals, In the Beginning...

9 2015 – Pretty much the same... Terminal app on Mac
Look in the “Other” folder in Launchpad

10 Your “New” User Interface – Hunt and Peck!
Type a command at the prompt, hit the return key
  program arguments...
This runs the program, which will read the arguments, read inputs, perform computations and produce outputs
When it completes, the prompt is displayed, telling you it is ready for the next command
Key commands to learn:
  ssh
  qsub -I


The faster you can type, the faster you will be done
Select and learn a text editor
  Vi or Emacs
Select and learn a programming language
  Perl, Python or R
Ask these questions to keep you oriented
  What computer am I on?
  What directory am I in?
  Where are the files for my analysis?
  What program(s) do I have running?
  What jobs do I have running?

13 Directories and Paths
The Linux directory structure is the same as the Mac/Windows folder structure
  Folders/directories containing files and other sub-folders/sub-dirs
  “Easy-to-access” directories: HOME directory
A path is a string naming a file or directory in the structure
  The slash character (‘/’) is the separator for directories
  /Users/jamesknight/Desktop/hpc_survival_guide_jan_2015.pptx
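As a small sketch of how a path breaks apart at its slashes, the snippet below splits the example path from this slide into its directory and its file name. `dirname` and `basename` are standard Linux utilities; the variable names are only illustrative, and the file does not need to exist.

```shell
#!/bin/bash
# Example path from the slide (nothing is read from disk).
path="/Users/jamesknight/Desktop/hpc_survival_guide_jan_2015.pptx"

dir=$(dirname "$path")     # everything before the last slash
file=$(basename "$path")   # everything after the last slash

echo "directory: $dir"
echo "file:      $file"
```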

14 The Shell
When you type commands and run programs, you are actually running a program called a shell
  Designed to take user input, run programs and display output
  Started automatically when the Terminal app starts or when you log into a computer
  Linux runs the bash shell by default
Maintains useful environment variables
  $PWD, which holds your current working directory path
  $HOME (also written ~), which holds your home directory path
  $PATH, which holds the locations of programs
Powerful tool for organizing and executing commands
  Useful to combine programs or redirect inputs and outputs, without having to write a program to do that
  Full-fledged programming language, used to write shell scripts to run sets of commands
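One way to look at these variables yourself is to print them. This is a minimal sketch; the values you see depend on your own account and machine, and `path_dirs` is just an illustrative variable name.

```shell
#!/bin/bash
echo "Working directory: $PWD"
echo "Home directory:    $HOME"

# $PATH is a colon-separated list; show one directory per line.
path_dirs=$(echo "$PATH" | tr ':' '\n')
echo "$path_dirs"

# The shell expands an unquoted ~ to your home directory.
echo "Tilde expands to:  " ~
```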

15 The Program’s Viewpoint
Programs start knowing nothing, and must figure out what to do
  Lines of code are generalized instructions
  Specifics come from reading the program’s environment
[Diagram: “The Program” in the center, connected to Command-line Arguments (what you typed), Standard Input (keyboard), Standard Output (screen), Standard Error (screen), Files to read and Files to write.]
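To make the program's viewpoint concrete, here is a sketch of a tiny "program" written as a shell function that touches each channel: it reads its command-line arguments, counts the lines arriving on standard input, writes results to standard output, and writes a diagnostic to standard error. The name `report` is hypothetical and chosen only for this illustration.

```shell
#!/bin/bash
# report: a tiny "program" that uses each channel from the slide.
report() {
    # Command-line arguments arrive in "$@"; $# is how many there are.
    echo "arguments ($#): $*"

    # Standard input: count the lines we were given.
    local count=0
    while IFS= read -r _line; do
        count=$((count + 1))
    done
    echo "stdin lines: $count"

    # Standard error: diagnostics go to file descriptor 2.
    echo "report: finished" >&2
}

# Feed two lines on stdin and pass two arguments.
printf 'first\nsecond\n' | report alpha beta
```

Because the diagnostic goes to standard error, it still reaches the screen even when standard output is redirected to a file.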

16 Shell Redirection, Piping and Multiple Commands
The shell lets you redirect stdin, stdout and stderr to configure how your program communicates
  myprog < inFile > outFile 2> errFile
    “< inFile” redirects stdin so that the program reads the contents of “inFile”
    “> outFile” redirects stdout so that the program writes standard output to “outFile”
    “2> errFile” redirects stderr so that the program writes standard error to “errFile”
  echo Hello | sed s/Hello/Goodbye/
    The “|” (called a pipe) redirects the echo program’s standard output so that it writes to the standard input of the sed program
    This command writes “Goodbye” to the screen
  echo Hello ; echo Goodbye
    The semi-colon separates commands, allowing multiple programs to run from one command-line
    This command writes “Hello” then “Goodbye” to the screen
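The redirections above can be tried safely with a sketch like the following, which uses sed in place of “myprog” and works in a temporary directory so none of your own files are touched. The file names match the slide; the `workdir` and `piped` variables are illustrative.

```shell
#!/bin/bash
# Work in a throwaway directory so nothing of yours is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# stdout of echo goes into inFile instead of the screen.
echo Hello > inFile

# sed reads inFile on stdin and writes to outFile; any errors go to errFile.
sed s/Hello/Goodbye/ < inFile > outFile 2> errFile
cat outFile

# The same substitution with a pipe instead of intermediate files.
piped=$(echo Hello | sed s/Hello/Goodbye/)
echo "$piped"
```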

17 Writing Scripts
Sometimes Linux’s built-in programs, and existing bioinformatics programs, are not enough
  To combine programs together in a specific way
  To run programs on many different files/datasets
  To perform custom statistical analyses on data files
Scripting languages make it easy to write your own programs
  bash, perl, python, R
  Write the lines of the script using a text editor
  Use the language’s program to run the script
    perl myscript arguments...
  Then, test, debug and rewrite...

18 Writing Scripts
A script is like a lab protocol
  Instructions on how to perform a task
  Executed in order, from beginning to end
  Just as protocol steps can have sub-steps, repeated steps and sub-protocols, script statements can have sub-statements, loops and function calls
Types of statements in a script
  Computation (assignment, input/output), if-then-else, for and while loops, functions
  Each programming language has its own unique syntax that you must follow
REMEMBER: You are the protocol writer writing for someone very, very, very stupid
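The protocol analogy can be sketched in a few lines of shell: a function plays the role of a sub-protocol, an if-then-else makes a decision, and a for loop is the repeated step. The task, the 1000-read threshold and the name `label_for` are all made up for this illustration.

```shell
#!/bin/bash
# Protocol: label each sample as "pass" or "fail" by its read count.

# Sub-protocol (function): decide the label for one count.
label_for() {
    if [ "$1" -ge 1000 ]; then
        echo pass
    else
        echo fail
    fi
}

# Repeated step (loop): apply the sub-protocol to each count.
for reads in 250 1000 4800; do
    echo "$reads reads: $(label_for "$reads")"
done
```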

19 Writing Scripts
Instead of reagents, tubes and plates, scripts operate on values, variables, data structures and files
  Values: numbers (1, 2, 87.5), strings (“I am a string!”)
  Variables: holder for a value
  Data structures: holder for collections of values
  Files: series of strings (text files) or numbers (binary files) stored on disk
Important data structures:
  List or Array – ordered collection of values [ 1, 2, 4, 3 ]
  Hash or Dictionary – collection of “name, value” pairs, like a telephone book
  Record or Struct – collection of named variables/data-structures
  Matrix – two-dimensional collection of numbers
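Here is a rough sketch of values, variables, a list and a hash in bash, the shell the cluster runs by default. The hash (an "associative array") assumes bash version 4 or later, and the names and phone numbers are invented for the example. Bash has no built-in record or matrix type, so those are not shown.

```shell
#!/bin/bash
# Values held in variables
count=87
greeting="I am a string!"
echo "$greeting (count is $count)"

# List / array: an ordered collection of values
values=(1 2 4 3)
echo "second value: ${values[1]}"        # indexes start at 0
echo "number of values: ${#values[@]}"

# Hash / dictionary: name-value pairs, like a telephone book
declare -A phone
phone[alice]="555-0100"
phone[bob]="555-0199"
echo "alice: ${phone[alice]}"

# Loop over the list, adding up its values
total=0
for v in "${values[@]}"; do
    total=$((total + v))
done
echo "total: $total"
```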

20 That’s fine, but how do you do this, really???
My best recommendation: Think about it, and write it down, as a protocol, then translate it into the programming language
  Make the step descriptions comments in the script
    Comments are lines beginning with ‘#’, which are ignored when executing the script
  Refine into sub-steps when translation is difficult
Example: writing echo in Perl
  Echo takes the command-line arguments and writes them to standard output
    ~]$ echo Hello from the cluster!
    Hello from the cluster!
    ~]$

Attempt #1: Implement that description
  Perl has a list, @ARGV, with the command-line arguments
  Perl has a print statement to write to standard output
Program:
    #
    # Write the command-line arguments to stdout.
    print @ARGV;

    ~]$ perl Hello from the cluster!
    Hellofromthecluster!~]$

Attempt #2: Refine, write each argument separately, so that the output can be formatted better.
  Perl can loop over the values of a list
  The print statement can write string values like " " (a space)
Program:
    #
    # 1. for each command-line argument,
    #    a. write the argument
    #    b. write a space
    for $arg (@ARGV) {
        print $arg;
        print " ";
    }

    ~]$ perl Hello from the cluster!
    Hello from the cluster! ~]$

Attempt #3: Fix where the prompt is shown. (Ignore the extra space.)
  Printing a special "\n" string value outputs a newline character
Program:
    #
    # 1. for each command-line argument,
    #    a. write the argument
    #    b. write a space
    # 2. write a newline
    for $arg (@ARGV) {
        print $arg;
        print " ";
    }
    print "\n";

    ~]$ perl Hello from the cluster!
    Hello from the cluster!
    ~]$

Attempt #4: Try a different approach, construct the string to be output, then print it.
  Perl has a join function that combines a list of strings into a string, and can include a separator.
Program:
    #
    # 1. Combine the command-line arguments into a
    #    string, separating them by spaces
    # 2. Write that string
    # 3. write a newline
    my $line = join(" ", @ARGV);
    print $line;
    print "\n";

    ~]$ perl Hello from the cluster!
    Hello from the cluster!
    ~]$
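For comparison, the same protocol from Attempt #4 can be translated into bash instead of Perl. This is only a sketch, with the hypothetical function name `my_echo`; it relies on the fact that "$*" joins the arguments with a space when the shell's field separator has its default value.

```shell
#!/bin/bash
# 1. Combine the command-line arguments into a
#    string, separating them by spaces
# 2. Write that string
# 3. write a newline
my_echo() {
    line="$*"
    printf '%s\n' "$line"
}

my_echo Hello from the cluster!
```

The protocol comments are unchanged from the Perl version; only the statements under them differ.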

Why scripting/programming is hard:
  You must think of everything
    Use testing, iteration and refinement to make sure that you have thought of everything
    You can get to “good enough”
  You have to write everything in a foreign language, with no allowance for error
My best recommendation: Think about it, and write it down, as a protocol, then translate it into the programming language
  Design what you want the program to do as you would a protocol, in English (or your favorite language)
  Match program statements to the steps, refining the steps so that they can be translated

26 Running Jobs on the Cluster
You must make reservations!
  The cluster is a shared resource, so you must ask for exclusive use of nodes and cores
  The job request goes into a queue, and is granted when resources are available
How to do this? qsub!
  Interactive jobs
    “qsub -I”: request 1 core on 1 node
    “qsub -I -l nodes=1:ppn=8”: request 1 node, with 8 cores
  Batch jobs
    “qsub myjob.pbs”: request to run the bash script myjob.pbs
    Louise runs PBS/Torque to manage the queues, so the “.pbs” suffix marks this as a script that can be submitted to the cluster

Example myjob.pbs file
    #PBS -q general
    #PBS -l nodes=3:ppn=8
    #PBS -o myjob_outFile.txt
    #PBS -e myjob_errFile.txt
    source ~/.bashrc
    cd /data/scratch/firstjob_Jan2015
    echo Hello
    echo Goodbye
  The “#PBS” lines contain options for the job request
  “source ~/.bashrc”: just do this
  The “cd” line sets the working directory
  The remaining lines are the lines of your script

What if I have to run a program on 100 datasets?
  You could make 100 scripts, or you could use Simplequeue!
    Write a text file, where each line is a one-line shell command
    Use the program to make a PBS script
    Submit the PBS script
Perl program that can write the text file (let’s call it “”)
    use Cwd;
    my $pwd = cwd();
    for $arg (@ARGV) {
        print "source ~/.bashrc ; cd $pwd ; perl myscript $arg\n";
    }
Commands to run
    perl dataset*.gz > runit.smplq
    general 3.2 jk2269 myscript runit.smplq > runit.pbs
    qsub runit.pbs
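The first step, writing a text file with one shell command per line, could also be sketched in bash. The function name `make_task_list` is hypothetical; “myscript” and “runit.smplq” are the names used on the slide, and the dataset file names in the demonstration call are placeholders. This covers only the task list, not the step that turns it into a PBS script.

```shell
#!/bin/bash
# Write one shell command per line, one line per dataset file.
make_task_list() {
    local dir
    dir=$(pwd)
    for f in "$@"; do
        echo "source ~/.bashrc ; cd $dir ; perl myscript $f"
    done
}

# Typical use, from the directory holding the data:
#   make_task_list dataset*.gz > runit.smplq
make_task_list dataset1.gz dataset2.gz
```

Each output line is a complete command on its own, so the order in which the lines run does not matter.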

How to get into the cluster, and back out again.
How to run commands in the shell.
How to type statements into R.
How to navigate around the directories (and make and remove them).
How to create, look at and edit text files.
How to write scripts to do the computations you need to do.
How to submit jobs, to run things on the compute nodes.


What computer am I on?
  Look at the prompt, ‘hostname’
What directory am I in?
  Look at the prompt and window top
  ‘pwd’, ‘cd’
Where are the files for my analysis?
  ‘ls’
  ‘mkdir’, ‘rm’, ‘rmdir’
  ‘more’ or ‘less’, ‘head’, ‘tail’
What program(s) do I have running?
  ‘ps’, ‘top’, ‘screen’
What jobs do I have running?
  ‘qstat’
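As a sketch, the orientation questions can be answered in one short script. The `here` variable is illustrative, and the qstat step is skipped on machines where PBS/Torque is not installed.

```shell
#!/bin/bash
# What computer am I on?
echo "computer:  $(hostname)"

# What directory am I in?
here=$(pwd)
echo "directory: $here"

# Where are the files for my analysis? Show the newest few entries here.
ls -lt | head -n 5

# What programs do I have running? (processes owned by me)
ps -u "$(id -un)" | head -n 5

# What jobs do I have running? Only ask if qstat is available.
if command -v qstat >/dev/null 2>&1; then
    qstat -u "$(id -un)"
fi
```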

Never, ever, ever read and write SAM files. If the software doesn’t support native BAM files, always pipe its SAM output through samtools to convert it to BAM.

