More advanced BASH usage
What we know already
- How to run existing programs
  - Those which are in the PATH
  - Those which are somewhere else
- How to modify the options for a program using switches
- How to supply data to programs using file paths and wildcards
What else can we do?
- Record the output of programs
- Check for errors in programs which are running
- Link programs together into small pipelines
- Automate the running of programs over batches of files
All of these are possible with some simple BASH scripting.
Recording the output of programs
Part of the POSIX standard is a uniform system for sending data to and from programs. Three data streams exist for every Linux program (though a program does not have to use them all):
- STDIN (standard input - a way to send data into the program)
- STDOUT (standard output - a way to send expected data out of the program)
- STDERR (standard error - a way to send errors or warnings out of the program)
By default STDOUT and STDERR are connected to your shell, so any text you see coming from a program arrives through these two streams.
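As a minimal illustration that the two output streams are independent (this previews the `>&2` redirection syntax covered on the next slides):

```shell
# echo writes to STDOUT by default
echo "This is normal output"

# >&2 sends the message to STDERR instead - on screen it looks
# identical, but anything capturing the streams sees them separately
echo "This is a warning" >&2
```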
Recording the output of programs
[Diagram: a program with its STDIN, STDOUT and STDERR streams]
Rather than leaving these streams connected to the screen, you can link them to files or to other programs, either to create logs or to build small pipelines.
Redirecting standard streams
You redirect streams by adding redirection operators to the end of your command:
- > [file]   redirects STDOUT to a file
- < [file]   reads STDIN from a file
- 2> [file]  redirects STDERR to a file
- 2>&1       sends STDERR to STDOUT, so you only have one output stream

$ find . -print > file_list.txt 2> errors.txt
$ ls
Data       Desktop           Documents      Downloads  errors.txt
examples.desktop  file_list.txt  Music      Pictures
Public     Templates         Videos
$ head file_list.txt
.
./Downloads
./Pictures
./Public
./Music
./.bash_logout
./.local
./.local/share
./.local/share/icc
./.local/share/icc/edid-33d524c378824a7b78c6c679234da6b1.icc
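A common use of `2>&1` is capturing everything a command prints, including errors, in one log file. A sketch, using a hypothetical log name:

```shell
# Send STDOUT to run.log, then point STDERR at the same place.
# The order matters: 2>&1 must come after > run.log
find . -print > run.log 2>&1

# The reverse order sends STDERR to wherever STDOUT pointed
# *before* the redirection, which is usually not what you want:
# find . -print 2>&1 > run.log    # errors still go to the screen
```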
Linking programs together
Part of the original UNIX design was to have lots of small programs, each doing one specific job, and then to link them together to perform more complex tasks. There are two main mechanisms for doing this:
- Pipes
- Backticks
Linking programs together using pipes
Pipes connect the STDOUT of one program to the STDIN of another, and you can chain them to build small pipelines. To create a pipe, put a pipe character | between two programs:

pr1 | pr2 is equivalent to pr1 > temp; pr2 < temp (but without the temporary file)

$ ls | head -2
Data
Desktop
Useful programs for pipes
Whilst you can in theory use pipes to link any programs, some are particularly useful:
- wc - word and line counting
- grep - pattern searching
- sort - sorting lines
- uniq - deduplicating adjacent lines
- less - paging through large amounts of output
- zcat/gunzip/gzip - decompression or compression
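As a sketch of how these combine, a classic pattern counts how often each distinct line occurs (here fed with `printf` rather than a real data file):

```shell
# uniq only collapses *adjacent* duplicates, so sort first;
# uniq -c prefixes each line with its count, and the final
# sort -rn puts the most common lines at the top
printf 'cat\ndog\ncat\ncat\ndog\nbird\n' | sort | uniq -c | sort -rn
# cat appears 3 times, dog 2, bird 1, most frequent first
```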
Small example pipeline
Take a compressed fastq sequence file, extract from it all of the entries containing the telomere repeat sequence (TTAGGG) and count them:

zcat file.fq.gz | grep TTAGGGTTAGGG | wc -l

$ zcat file.fq.gz | wc -l
179536960
$ zcat file.fq.gz | grep TTAGGGTTAGGG | wc -l
3925
Using backticks to capture output
Pipes are great for linking streams of data together, but the connection has to make sense to both programs. In some cases you instead want to use the output of one program as an argument to another. You can do this using backticks - they look like quotes, but their contents are executed, and then replaced with the STDOUT of the executed command.
Backticks example
I want to find out where a program (fastqc) is installed, and then whether that location is a real file or a symlink to somewhere else. I could do this in two separate steps, but that involves copying and pasting:
1. which fastqc
2. ls -l [whatever came out of step 1]
Backticks let you do this in a single step:

$ ls -l `which fastqc`
-rwxr-xr-x 1 andrewss bioinf 14400 Dec 21 2017 /bi/apps/fastqc/0.11.7/fastqc
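The same substitution can be written with the `$( )` syntax, which does exactly the same job as backticks but is easier to read and can be nested (this assumes fastqc is on your PATH, as in the slide above):

```shell
# Equivalent to the backtick version
ls -l $(which fastqc)

# $( ) forms nest without any escaping, which backticks cannot do easily
echo "fastqc lives in $(dirname $(which fastqc))"
```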
Iterating over files
When processing data it is common to need to run the same command multiple times with different input and output files. Some programs accept multiple input files, but many do not. You can use the automation features of the BASH shell to script the running of these types of programs.
The BASH for loop
- A simple looping construct
- Can loop over a set of files
- Can loop over a set of values
- Creates a temporary environment variable which you can use when building commands
Examples of for loops

for file in *txt
do
    echo $file
    grep .sam $file | wc -l
done

for value in {5,10,20,50}
do
    run_simulation --iterations=$value > ${value}_iterations.log 2>&1
done

for value in {10..100}
do
    run_simulation --iterations=$value > ${value}_iterations.log 2>&1
done
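A common refinement of the first loop is deriving each output file name from the input file name, using BASH parameter expansion to strip the extension (a sketch with hypothetical file names):

```shell
# Count the matching lines in every .txt file, writing one .counts
# file per input; ${file%.txt} is $file with the trailing .txt removed,
# so data.txt produces data.counts
for file in *txt
do
    grep -c .sam $file > ${file%.txt}.counts
done
```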