Using the Unix Shell: There is No ‘Undelete’
The Unix Shell “A Unix shell is a command-line interpreter or shell that provides a traditional user interface for the Unix operating system and for Unix-like systems. Users direct the operation of the computer by entering commands as text for a command line interpreter to execute or by creating text scripts of one or more such commands.” - Wikipedia
Things to Keep in Mind
– There is no ‘undelete’
– Shell commands are case-sensitive (CaPitaLizaTIoN mAttErs)
– Do NOT use a space, ?, *, \, / or $ in file names, because these have special meanings to the shell
– Filenames that begin with a dot (.) are ‘hidden’
– There is no ‘undelete’
The Importance of Being ‘Root’ ‘Root’ or ‘Superuser’ is the administrator account, which has phenomenal cosmic power. The ‘sudo’ command allows you to “do as superuser” from an account with ‘sudo privileges’. As root in the shell, you can literally ‘delete’ the operating system or operating system files (like choosing to delete Microsoft Windows while using Windows)… and then watch the stars go out… – Moral of the story: If you don’t know what a file is… it’s better to ask or leave it alone. – Installing software can require use of ‘sudo’
Unix Tutorial Science.txt file location for tutorial: – – Unix command: wget Additional help/tutorial/walkthrough
Grep
grep science science.txt
grep science science.txt > newfile1.txt
grep -B 1 -A 2 science science.txt > newfile1.txt
– > is a ‘redirect’ symbol: it sends output that would normally go to the screen to a text file instead.
– -B 1 and -A 2 are command-line ‘options’ that change the behavior of the ‘grep’ program; the numerical parameters specify the new behavior (here, one line of context before each match and two lines after it).
– Use man grep to learn more about grep
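The three grep commands can be tried safely on a throwaway file. This is only a sketch: demo.txt and its contents are made up for illustration and stand in for the tutorial's science.txt, and the second output file is named newfile2.txt here so both results can be compared.

```shell
# Work in a scratch directory so nothing real is overwritten.
cd "$(mktemp -d)"

# demo.txt is a made-up stand-in for science.txt.
printf 'line one\nbefore\nscience is fun\nafter 1\nafter 2\nafter 3\n' > demo.txt

# Print matching lines to the screen.
grep science demo.txt

# Redirect the same output into a file instead of the screen.
grep science demo.txt > newfile1.txt

# -B 1 adds one line of context before each match, -A 2 adds two lines after it.
grep -B 1 -A 2 science demo.txt > newfile2.txt
cat newfile2.txt
```

Note that > silently replaces an existing file of the same name, which is one more place where there is no ‘undelete’.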
Permissions
Type ls -l (note: those are both lower-case L characters)
-rw-r--r-- 1 krmerrill staff Feb 2 13:00 AJB_Merrill-d _au.doc
drwxr-xr-x 47 krmerrill staff 1598 Jul My Pictures
– The first character is the file type: - means regular file, d means directory, l (lower-case L) means link
– The first triplet is the user read, write, and execute permissions
– The second triplet is the group permissions
– The last triplet is permissions for everyone else, or ‘other’
– ls -al shows the above information for all files, including hidden files
chmod = change permissions
– u = user; g = group; o = other; a = all (user, group, and other)
– r = read; w = write; x = execute
– chmod u+x filename adds user execute permission on filename
– chmod g-wx filename removes group write and execute permissions from filename
– Permissions that are not mentioned in this format of the chmod command are not affected
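A quick way to see chmod at work is to change a scratch file and compare the ls -l output before and after. In this sketch the file name myscript.sh is hypothetical, and the first chmod uses the = form (set permissions exactly) only to start from a known state; it is not part of the slide.

```shell
# Work in a scratch directory with a made-up file.
cd "$(mktemp -d)"
touch myscript.sh

# Start from a known state: -rw-rw-r--
chmod u=rw,g=rw,o=r myscript.sh
ls -l myscript.sh

# Add execute permission for the user only.
chmod u+x myscript.sh

# Remove group write and execute; read is not mentioned, so it is kept.
chmod g-wx myscript.sh

# The permission string should now read -rwxr--r--
ls -l myscript.sh
```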
Useful Shell Commands
See the Linux Command Line Reference document on the course website
Directory commands
– Change to a sub-directory within the current directory: cd xyz
– Change to a directory in another part of the directory tree: cd /path/to/directory
– Create directory: mkdir newdir
– Remove empty directory: rmdir xyz
Wildcard characters: ? matches any single character, * matches zero or more characters
– Example: rm *.txt will remove all files with a name ending in .txt
– rm file?.fastq will remove file1.fastq, file2.fastq, …, filex.fastq
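Because there is no ‘undelete’, it can help to preview what a wildcard matches with echo before giving the same pattern to rm. The sketch below uses made-up file names in a scratch directory.

```shell
# Build a scratch directory with a few made-up files.
cd "$(mktemp -d)"
mkdir newdir
cd newdir
touch notes.txt data.txt file1.fastq file2.fastq file10.fastq

# Preview the matches first: echo only prints the expanded names.
echo *.txt
echo file?.fastq    # file10.fastq is not listed: ? matches exactly one character

# Now remove the same sets of files.
rm *.txt
rm file?.fastq

# Record and show what is left.
remaining=$(ls)
echo "$remaining"

# rmdir only works on an empty directory, so remove the last file first.
cd ..
rm newdir/file10.fastq
rmdir newdir
```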
Regular Expressions
See the RegularExpressions.pdf document on the course website for an overview of literal characters and metacharacters
– Regular expressions (also called regexps or regex) are useful within grep, awk, sed and other command-line tools as well as in Java, Perl, Python, and other scripting languages.
– Some text editor programs in Linux also use regular expressions. We will use nedit as an example.
Replacing a space character with a new-line character in a file of barcodes
– find ‘(OWB\d+) ’ and replace with ‘\1\n’
– note the trailing space in the first expression.
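The same find-and-replace can be sketched on the command line with sed instead of nedit. This assumes GNU sed (which accepts \n in the replacement), and barcodes.txt with its contents is a made-up example. [0-9]+ stands in for \d+, because the POSIX regular expressions used by sed have no \d shorthand.

```shell
# barcodes.txt is a made-up file: barcode IDs separated by single spaces.
cd "$(mktemp -d)"
printf 'OWB1 OWB22 OWB333\n' > barcodes.txt

# -E turns on extended regular expressions so ( ) and + work unescaped.
# \1 is the captured barcode; the trailing space in the pattern is replaced
# by a new line (\n in the replacement is a GNU sed feature).
sed -E 's/(OWB[0-9]+) /\1\n/g' barcodes.txt > barcodes_by_line.txt
cat barcodes_by_line.txt
```

The last barcode has no trailing space, so it is left unchanged and simply ends up on its own line.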
Command-line example
Testing analyses on a small random sample of a sequence dataset is a good idea: you can find and fix problems quickly.
How do you randomly sample the same reads from a set of paired-end files? A one-line command is saved on the course website to do this.
time paste Bigfile1.fastq Bigfile2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf | head -n 2000000 | sed 's/\t\t/\n/g' | awk '{print $1 > "Subfile1.fastq"; print $2 > "Subfile2.fastq"}'
Note that the output files must have different names from the input files, or the originals will be overwritten.
Let’s look at this step by step
Command-line example
time
– this tells the system to display the time required to execute the command
paste Bigfile1.fastq Bigfile2.fastq |
– this joins two files of paired-end sequence reads as tab-delimited columns, line by line; the files should have the same number of lines, with reads in the same order in both files
awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' |
– this uses the ‘awk’ program to convert the four lines of FASTQ format to tab-separated fields on a single line per sequence record
shuf |
– this utility shuffles the lines into a random order
head -n 2000000 |
– this utility takes the first 2 million lines of the re-ordered file; each line is one pair of reads at this point
sed 's/\t\t/\n/g' |
– this uses the ‘sed’ stream editor to convert the double-tab delimiters back into new-line characters to restore the 4-line FASTQ format
awk '{print $1 > "Subfile1.fastq"; print $2 > "Subfile2.fastq"}'
– this uses ‘awk’ to split the two tab-delimited columns back into two separate files
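The whole pipeline can be tried on a tiny made-up dataset before running it on real files. This is a sketch under stated assumptions: it needs GNU shuf and GNU sed, the five read pairs and their names are invented, and it samples 2 pairs instead of 2 million. It also adds -F '\t' to the final awk, which is not in the original command; this is an assumption made so that header lines containing spaces are not split in the wrong place.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"

# Build two tiny paired FASTQ files (made-up reads r1..r5, four lines each).
for i in 1 2 3 4 5; do
  printf '@r%s 1:N:0:1\nACGT\n+\nIIII\n' "$i" >> Bigfile1.fastq
  printf '@r%s 2:N:0:1\nTGCA\n+\nIIII\n' "$i" >> Bigfile2.fastq
done

# 1. paste: put matching lines of both files side by side (single tab).
# 2. awk:   fold every 4 lines into one line per read pair (double tabs).
# 3. shuf:  shuffle the read pairs.
# 4. head:  keep the first 2 pairs.
# 5. sed:   turn the double tabs back into new lines.
# 6. awk:   split the two tab-delimited columns into two output files.
paste Bigfile1.fastq Bigfile2.fastq \
  | awk '{ printf("%s", $0); n++; if (n % 4 == 0) { printf("\n") } else { printf("\t\t") } }' \
  | shuf \
  | head -n 2 \
  | sed 's/\t\t/\n/g' \
  | awk -F '\t' '{ print $1 > "Subfile1.fastq"; print $2 > "Subfile2.fastq" }'

# Each output file should hold 2 reads (8 lines), in the same order.
wc -l Subfile1.fastq Subfile2.fastq
```

Comparing the header lines of Subfile1.fastq and Subfile2.fastq is a simple check that the pairs stayed together.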
How do you come up with this stuff?
Someone else has probably had this problem
Search for help on SeqAnswers or StackExchange The Bioinformatics Forum on SeqAnswers:
SolexaQA.pl
This Perl script assumes that header lines of sequence files are written in one of several formats. The code uses regular expressions to sort out formats:
if( $line =~ /\S+\s\S+/ ){
    # Cassava 1.8 variant
    if( $line =~ ){
        $number_of_tiles = $1 + 1;
    # Sequence Read Archive variant
    }elsif( $line =~ ){
        $number_of_tiles = $1 + 1;
    }
# All other variants
}elsif( $line =~ ){
    $number_of_tiles = $1 + 1;
}
Alternate Formats
This Perl script assumes that header lines of sequence files are written in one of several formats. The code uses regular expressions to sort out formats:
if( $line =~ /\S+\s\S+/ ){ # Cassava 1.8 variant
– does the header line contain a space surrounded by non-space characters?
$line =~ ) # NCBI SRA variant
– does the header line contain a string with -, _, or . before the first
_SLXA-EAS1_s_7:5:1:817:345 length=36
SolexaQA.pl
$line =~ ) # Two other variants
1. Does the first field contain -, ., or _ followed by two more colon-delimited fields?
$line =~ )
2. Does the first field contain -, ., :, or _ followed by four colon-delimited fields, followed by ., /, or # at the end of the line?
Example header line from GSL sequence
This would be described by $line =~
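The first test in the script, /\S+\s\S+/ (a space surrounded by non-space characters), can be tried from the shell with grep. This is only a sketch: the two header lines are made up to show the two shapes, and POSIX character classes stand in for Perl's \S and \s, which grep -E does not portably support.

```shell
# Two made-up header lines; only their shape matters here.
with_space='@READ1 1:N:0:ACGT'
without_space='@READ1:7:5:1:817:345'

# [^[:space:]] plays the role of \S, [[:space:]] the role of \s.
pattern='[^[:space:]]+[[:space:]][^[:space:]]+'

for header in "$with_space" "$without_space"; do
  # -q suppresses output; only the exit status is used.
  if printf '%s\n' "$header" | grep -Eq "$pattern"; then
    echo "contains a space: $header"
  else
    echo "no space:         $header"
  fi
done
```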