1 Welcome to lecture 3: An introduction to programming in PERL IGERT – Sponsored Bioinformatics Workshop Series Michael Janis and Max Kopelevich, Ph.D. Dept. of Chemistry & Biochemistry, UCLA

2 Last time… We covered a bit of material… Try to keep up with the reading – it’s all in there! How’s it coming along?
–regex examples? (TATA box, palindrome)…
    > grep -E --color 'TA(TAAA|TAAT|TATT|ATAA|ATAT)' *.fsa
    > grep -E --color '(.)(.).\2\1'
–Using emacs?
–Let’s ignore the long version of the prosite match for now… we’ll deal with that soon…

3 Shell scripting is useful, but… It does not port or scale well; complex data structures may be somewhat challenging. Having said that, Shell scripting skills have many applications, including: –Ability to automate tasks, such as Backups Administration tasks Periodic operations on a database via cron Any repetitive operations on files –Increase your general knowledge of UNIX Use of environment Use of UNIX utilities Use of features such as pipes and I/O redirection

4 For bioinformatics, we need a fully featured programming language There’s a problem with our search of fasta files – can you guess what? We’ll be dealing with this using a programming language with arbitrarily complex data structures Perl is a scriptable, portable, interpreted and compiled language: –Scriptable and portable and networks well The code remains in text format The code is compiled to an internal form and interpreted at runtime The interpreter has been written for use on every (?) platform Can control a vast number of other devices (files, programs, either local or remote) –Drawbacks of the language Since the code is compiled at runtime to an internal form rather than to machine code, it will usually run slower than equivalent C code There’s a double edged sword called TMTOWTDI (“There’s more than one way to do it”) Not truly OO; not the most elegant language for algorithm implementation (arguable!)

5 PERL: starting point for bioinformatics Easy to learn (a bit forgiving) Easy to process text files; good language for pattern searching –Most biological file formats are text files –Most sequence analysis tasks deal with pattern finding at some point Easy to run other programs and process their results –Similar to shell programming in this regard!

6 Extending the shell: Creating Our Own Commands Use programming language to create the new command We will use perl TASK: write a PERL program that –A.) reads a fasta sequence file –B.) reverse complements the sequence –C.) prints the output to STDOUT –D.) Then modify program to write to a file 1. Using command line REDIRECTION 2. Using PERL to open and write to OUTPUT FILE

7 PERL vocabulary – similar to bash functionality print chomp while open close $ARGV[0], $ARGV[1] $_ if...else =~ /^>/

8 PERL vocabulary...EXPLAINED
print: works like the echo command
chomp: removes the ‘newline character’
while: repetitive loop until breaking condition met
open: used to open a file
close: used to close a file
$ARGV[0], $ARGV[1]: command line arguments
$_: variable that holds current line from in-file
if...else: [if true perform a, else perform b]
=~: binding operator (compare text w/ reg. exp)
/^>/: match “>” at very beginning of line ONLY

9 Running a perl script 1.Create a file –Specify location of perl –Write program 2.Make it executable 3.Run it!

10 Example: “Hello world!”
Write the program:
    #!/usr/bin/perl
    print("Hello, world!\n");
The first line gives the location of PERL; the second is a PERL command.
Make it executable:
    > chmod 744
This tells the computer to allow the user to read, write AND execute it. Others can only read it.
Run it:
    >
The output:
    Hello, world!

11 Data Data is stored in variables. A variable is like a box. We put values in it. There are three ways of storing data: –Scalar variables –Arrays –Hashes A single variable (a ‘scalar variable’) can be called anything, but must start with a ‘$’

12 Scalar variables: example
    #!/usr/bin/perl
    $dna = "TGACT";      # defining a variable
    print("$dna\n");     # using it
Output:
    TGACT

13 Scalar variables (cont.) PERL doesn’t differentiate between strings (e.g. “Fred”), integers (e.g. “13”) or floating point numbers (e.g. “16.9”). If there’s one piece of information, it’s a scalar variable. PERL understands the context you’re working in.

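14 Scalar variables (cont.)
    #!/usr/bin/perl
    $dna = "TGACT";              # defining a variable (here it's a string)
    print("$dna\n");             # using it
    $dna = 11;                   # redefine variable
    print($dna + 2 . "\n");      # use it in an integer context
Output:
    TGACT
    13
Perl worked out what to do. Note the spaces around the ‘.’ operator: written as 2."\n", Perl would read ‘2.’ as a number and report a syntax error.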

15 Limitations of scalar variables Imagine we want to find the average of a list of numbers we could do it like this: program 1 $number1 = 5.4; $number2 = 7.3; $number3 = 4.1; $average = ( $number1 + $number2 + $number3 ) / 3; but this is obviously extremely limited

16 Lists Of course there is a way to make lists in Perl. You can always enclose a list of items in parentheses... ( 5.6, 8.22, 14.9 ); # list of floating point numbers ( "hello", "Canada" ); # list of strings ( "hello", $country ); # mixed list ( "blah", 18, 22, 'x', 3.14 ); # mixed list ( 0.. 5 ); # list of integers between 0 and 5 ( 'a'.. 'z' ); # list of strings a,b,c,d......

17 Array variables There is a special type of variable in perl which can hold lists - The array Perl knows a variable is an array when we use a special character @ –Remember, scalars (single valued variables) start with a dollar ($) sign, arrays start with an @ sign. Arrays can have as many elements as you need (up to the limits of your available memory, anyway) @numbers = (5.6, 8.22, 14.9); # list of floating point numbers

18 Printing arrays
    @words = ("Hello", "Canada!");
    print "@words";   # prints Hello Canada!
    print @words;     # prints HelloCanada!
Double quoted strings will print array elements with spaces in between them.
–No quotes will print array elements all smashed together.

19 Accessing array elements An array wouldn't be very useful if we couldn't look at the individual members of the list.
    print "Enter an index number between 0 and 25\n";
    $index = <STDIN>;
    chomp $index;
    @letters = ('A'..'Z');
    print "letter index $index = $letters[$index] \n";
What does it mean?

20 Accessing array elements Arrays are stored in perl's memory in order. –Each position (element) in the array has a number –This number is called the index Each element in an array is a single (scalar) value There is magic syntax for addressing individual array elements. –This syntax can be a bit bewildering. To access an element we type: –$array_name[element_number] Elements are numbered starting at zero, not one!!

21 Setting the values in an array Remember ‘ls -1’? We’ll use that here…
    @files = `ls -1 *.CEL`;   # BACKQUOTES here
-this is an \n separated list
-Any delimiter is ok
-Any element can be accessed as a scalar and any function that acts upon a scalar can be used on it ($file = $files[2];)

22 Indexing arrays with negative numbers You can index from the end of an array backwards by using negative numbers: @letters = ('A'..'Z'); print "last letter = $letters[-1] \n"; print "penultimate letter = $letters[-2] \n";

23 Getting the length of an array You can use the function scalar to turn an array into a single valued scalar variable; –the value of this variable will be the number of elements in the array. @numbers = (0..100); print scalar(@numbers); # prints 101
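
Getting the length is what removes the limitation of the earlier three-scalar average. A minimal sketch, assuming a helper named average (a hypothetical name, not from the slides) and using foreach, which is introduced later in this lecture:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: arithmetic mean of any number of values.
sub average {
    my @numbers = @_;
    return unless @numbers;            # empty list: nothing to average
    my $total = 0;
    foreach my $number (@numbers) {
        $total += $number;
    }
    # scalar(@numbers) is the number of elements in the array
    return $total / scalar(@numbers);
}

my @numbers = (5.4, 7.3, 4.1);
print "average = ", average(@numbers), "\n";
```

The same sub works unchanged for three numbers or three thousand, which is the point of storing them in an array.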

24 Functions that act on arrays push Adds a value (or values) to the end of an array @numbers = (1, 2, 3); push(@numbers, 4, 5); print "@numbers \n"; # prints 1 2 3 4 5

25 Functions that act on arrays pop Removes a single value from the end of an array @words = ('the', 'quick', 'brown', 'fox'); print pop(@words); # fox print pop(@words); # brown print pop(@words); # quick

26 Functions that act on arrays shift Removes a single value from the beginning of an array @words = ('the', 'quick', 'brown', 'fox'); print shift(@words); # the print shift(@words); # quick

27 Functions that act on arrays unshift Pushes a value (or values) onto the front of an array
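
This slide has no example, so here is a small sketch in the style of the push example (the values are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @numbers = (3, 4, 5);
unshift(@numbers, 1, 2);     # add 1 and 2 to the front, keeping their order
print "@numbers\n";          # prints 1 2 3 4 5
```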

28 Functions that act on arrays reverse @words = ('the', 'quick', 'brown', 'fox'); print reverse(@words), "\n"; # foxbrownquickthe

29 Functions that act on arrays sort sort does what you think it does. You give it a list (or array), and it returns a list that is sorted in some way. @words = ('The', 'quick', 'brown', 'fox', 'jumped'); @sorted = sort(@words); print "sorted words = @sorted\n"; # The brown fox jumped quick

30 Functions that act on arrays join @words = ('The', 'quick', 'brown', 'fox', 'jumped'); print join("+", @words), "\n"; # The+quick+brown+fox+jumped You specify what string you want to join with as the first argument. You can use anything.

31 Array summary An array is a variable that has multiple values simultaneously. We refer to the different values using a number called the index.

32 Array example
    #!/usr/bin/perl
    $dna[0] = "TATA";      # defining different entries of an array
    $dna[1] = "ATG";
    print("$dna[0]\n");    # print them both
    print("$dna[1]\n");
Output:
    TATA
    ATG
Note square brackets enclose index

33 What is a hash? Hashes are similar to arrays in many respects. Remember, arrays are simple lists stored as a series of elements, and each element has a number (index). The elements are stored in numeric order. It is a bit like a shopping list. Arrays are limited, in that you need to know which index position contains your value of interest. It might be nice if we could give these index positions names of our choice.

34 What is a hash? Perl has a way to do this; it is called a hash. Perl denotes a hash with a % (percent) sign. If arrays are shopping lists, hashes are telephone directories: you look up a phone number by a person's name, not by a position number. They look something like this, for a hash called %astronomy:
    value      | key      | to get the value:
    -----------+----------+--------------------
    'string'   | 'word'   | $astronomy{'word'}

35 Making a hash %re_lookup = ( 'Eco47III'=> 'AGCGCT', 'EcoNI' => 'CCTNNNNNAGG', 'EcoRI' => 'GAATTC', 'EcoRII' => 'CCWGG', 'HincII' => 'GTYRAC', 'HindII' => 'GTYRAC', 'HindIII' => 'AAGCTT', 'HinfI' => 'GANTC' );

36 Accessing a hash
    print "Enter restriction enzyme name\n";
    $re = <STDIN>;
    chomp $re;
    $seq = $re_lookup{$re};
    if (defined($seq)) {
        print "RE sequence for $re is: $seq\n";
    } else {
        print "Sorry, I don't know about \"$re\"\n";
    }

37 Changing values in a hash Just like we can change individual elements in an array by referring to them by number, we can change values in a hash by referring to them by their key. $space{'moon'} = 'Titan'; # change "Luna" to "Titan"

38 Useful Hash Functions The keys function takes a hash as argument and returns a list of keys in that hash The values function takes a hash as argument and returns a list of values in that hash

39 Useful Hash Functions KEYS
    %accession_hash = (
        "BACR01A01" => "AC005555",
        "BACR48E02" => "AC005577",
        "BACR24K17" => "AC005101",
    );
    # get all the keys in the hash
    @clones = keys %accession_hash;
    print "Clone IDs: @clones\n";
    # prints BACR01A01 BACR48E02 BACR24K17 (hash order is not guaranteed, so the order may differ)

40 Useful Hash Functions VALUES
    # get all the values in the hash (hash is a lookup for accessions):
    @accs = values %accession_hash;
    print "GenBank Accessions: @accs\n";
    # prints AC005555 AC005577 AC005101 (hash order is not guaranteed, so the order may differ)
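
Because keys and values come back in no particular order, a common pattern is to sort the keys before looping over them. A short sketch using the clone-to-accession hash from these slides (the @lines array is just an illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %accession_hash = (
    "BACR01A01" => "AC005555",
    "BACR48E02" => "AC005577",
    "BACR24K17" => "AC005101",
);

# Visit the keys in sorted order and look up each value by its key.
my @lines;
foreach my $clone (sort keys %accession_hash) {
    push(@lines, "$clone => $accession_hash{$clone}");
}
print "$_\n" foreach @lines;
```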

41 Removing elements from a hash To remove a key value pair from a hash, you can use the delete function:
    delete $re_lookup{"EcoRI"};
If you just want to delete a value, but keep the key, you could do this:
    $re_lookup{"EcoRI"} = "";   # set value to the empty string

42 Counting things with a hash One of the most popular things to do with a hash is to count the number of times something has been seen.

43 Counting things with a hash @things = qw(YOR382W YML383W YML280W); # a list of accession numbers %counting = (); # initialize a hash foreach $item (@things){ $counting{$item}++; # increment the value associated with the key } foreach $key (keys %counting) { print "$key is found $counting{$key} times \n";}
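
The same counting idiom applies to sequence data. A minimal sketch, assuming a helper named count_bases (hypothetical) that tallies each letter of a DNA string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count how often each base occurs in a sequence.
sub count_bases {
    my ($dna) = @_;
    my %count;
    foreach my $base (split(//, uc($dna))) {   # split(//, ...) gives one character per element
        $count{$base}++;                       # a missing key starts from zero
    }
    return %count;
}

my %count = count_bases("TGACTTA");
foreach my $base (sort keys %count) {
    print "$base is found $count{$base} times\n";
}
```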

44 Hashes summary Hashes are like arrays except instead of a numerical index, we use keys. A key can be any string; a number used as a key is converted to a string. The value stored under a key can be any scalar. Until you learn to use hashes, you aren’t really using Perl!

45 Hashes: example
    #!/usr/bin/perl
    $wife{"Fred"} = "Hannah";        # defining different entries of the hash
    $wife{"Bill"} = "Josephine";
    print($wife{"Bill"}."\n");
    print($wife{"Fred"}."\n");
Output:
    Josephine
    Hannah
Note curly braces enclose key

46 More stuff on variables We’ve used the ‘$’ to talk about individual entries for hashes or arrays. But referring to the whole array, we use ‘@’. Referring to the whole hash, we use ‘%’.

47 More stuff on variables This becomes useful when looking at properties of an entire array or hash. For example, the length of an array:
    #!/usr/bin/perl
    $names[0] = "Bill";
    $names[1] = "Fred";
    $names[2] = "Bartholomew";
    print(scalar(@names)."\n");    # ‘@’ means we’re referring to the whole array
Output:
    3

48 Control structures All our programs so far have run from start to finish. Each line has been executed in turn. What if we only want to run some lines some of the time? This is where control structures come in.

49 Control structures PERL has a number of control structures. I’ll talk about four: –if –while –for & foreach There are others (e.g. unless)

50 ‘if’ control structure
    #!/usr/bin/perl
    $name = "Bill";
    if ($name eq "Bill") {
        print("The name is Bill!\n");
    } else {
        print("The name isn't Bill!\n");
    }
Output:
    The name is Bill!

51 ‘if’ control structure
    #!/usr/bin/perl
    $name = "Fred";
    if ($name eq "Bill") {
        print("The name is Bill!\n");
    } else {
        print("The name isn't Bill!\n");
    }
Output:
    The name isn't Bill!

52 Perl has great regular expression support Usually, we compare two strings of characters using an equality test:
    #!/usr/bin/perl
    if ($name eq "Bill") {
        print("The name is Bill!\n");
    }

53 The real world is fuzzier… Maybe we want to see if the name is ‘Bill’ OR ‘bill’. The if statement would need to be more complex:
    #!/usr/bin/perl
    if (($name eq "Bill") || ($name eq "bill")) {
        print("The name is Bill!\n");
    }

54 This is where regular expressions come in. Regular expressions describe generalised patterns of strings instead of exact strings. For example, the first problem was:
    if (($name eq "Bill") || ($name eq "bill")) {
        print("The name is Bill!\n");
    }
But it can be re-written:
    if ($name =~ /[Bb]ill/) {
        print("The name is Bill!\n");
    }

55 Another example… The phone number problem from before (solved using GREP) can also easily be tackled in perl. Clearly the pattern syntax is very similar… we only need to tell perl that the expression should be treated as a regular expression. –We do this by prepending and appending ‘/’ (forward slashes) to the expression
    if ($number =~ /([0-9]{3} ){0,1}[0-9]{3} [0-9]{4}/) {
        print("The number is a valid phone number!\n");
    }

56 First principles of regex in perl if ($name =~ /red/) { print(“Name contains the text ‘red’!\n”); } Variable Regular expression

57 Special characters (metachars) (the following is a review of what we learned for grep!) ‘.’ is a wildcard and matches any single character
    $input = $ARGV[0];
    if ($input =~ /.ed/) { print("Yes!\n"); }
Results:
    bed      Yes!
    red      Yes!
    head     (no match)
    edward   (no match: ‘.’ needs a character in front of “ed”)

58 Special characters (‘metacharacters’) ‘*’ means ‘zero or more of the previous character’. $input = $ARGV[0]; if ($input =~ /be*d/) { print(“Yes!\n”); } > bed Yes! > red > beeeed Yes! > bd Yes! >

59 Special characters (‘metacharacters’) ‘+’ means ‘one or more of the previous character’. $input = $ARGV[0]; if ($input =~ /be+d/) { print(“Yes!\n”); } > bed Yes! > red > beeeed Yes! > bd >

60 Start and end of line ‘^’ designates the start of the line, ‘$’ the end.
    $input = $ARGV[0];
    if ($input =~ /bed/) { print("Yes!\n"); }
Results:
    bed         Yes!
    bedbed      Yes!
    xxxbedxxx   Yes!
With anchors:
    $input = $ARGV[0];
    if ($input =~ /^bed$/) { print("Yes!\n"); }
Results:
    bed         Yes!
    bedbed      (no match)
    xxxbedxxx   (no match)

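61 Grouping with parentheses Parentheses group characters, so a quantifier applies to the whole group
    $input = $ARGV[0];
    if ($input =~ /(bed)+/) { print("Yes!\n"); }
Results:
    bed      Yes!
    bedbed   Yes!
    beddd    Yes! (the pattern is not anchored, so the leading “bed” is enough; /^(bed)+$/ would reject beddd)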

62 Character classes The square brackets are used to denote whole groups of characters $input = $ARGV[0]; if ($input =~ /[brf]ed/) { print(“Yes!\n”); } > bed Yes! > red Yes! > led >

63 Character classes (cont) A hyphen designates a range: $input = $ARGV[0]; if ($input =~ /[a-z]ed/) { print(“Yes!\n”); } > bed Yes! > fed Yes! > Bed >

64 Character class shortcuts Some character classes are so common there are in-built shortcuts: –[0-9]=\d –[A-Za-z0-9_]=\w (note the underscore) –[\f\t\n\r ]=\s
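
To show the shortcuts working together, here is a small sketch. The sub name looks_like_record and the "ID, whitespace, number" line format are assumptions made up for this example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical check: a word-like ID, some whitespace, then a whole number.
sub looks_like_record {
    my ($line) = @_;
    return ($line =~ /^\w+\s+\d+$/) ? 1 : 0;
}

foreach my $line ("BACR01A01 12", "BACR01A01 twelve") {
    print "'$line': ", (looks_like_record($line) ? "Yes!" : "No"), "\n";
}
```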

65 Negating a character ‘^’ as the first character inside square brackets negates the character class. Note the context determines whether ‘^’ is negation or start-of-line!
    $input = $ARGV[0];
    if ($input =~ /[^b]ed/) { print("Yes!\n"); }
Results:
    red   Yes!
    bed   (no match)
Compare with ‘^’ as start-of-line:
    $input = $ARGV[0];
    if ($input =~ /^bed/) { print("Yes!\n"); }
Results:
    red   (no match)
    bed   Yes!

66 Quantifying Curly brackets quantify repeats better than ‘*’ (0+) or ‘+’ (1+) a{3,5}=three, four or five ‘a’’s. $input = $ARGV[0]; if ($input =~ /la{3,5}d/) { print(“Yes!\n”); } > laaaad Yes! > laaaaaaad >

67 Using parentheses as memory Remember that parentheses group things? What they match is stored in variables $1, $2, $3… $input = $ARGV[0]; if ($input =~ /^(.*)e(.)$/) { print(“$1\n$2\n”); } > fred fr d > bad >
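
Captures are handy for FASTA header lines. A minimal sketch, assuming a header of the form ">ID description"; the sub name parse_header and the sample description are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a FASTA header into its ID and description.
sub parse_header {
    my ($line) = @_;
    if ($line =~ /^>(\S+)\s*(.*)$/) {
        return ($1, $2);    # $1 = first group, $2 = second group
    }
    return;                 # not a header line
}

my ($id, $description) = parse_header(">YOR382W hypothetical protein");
print "id = $id\ndescription = $description\n";
```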

68 Interpolating variables We can place variables inside regular expressions $input = $ARGV[0]; $name = “fred”; if ($input =~ /$name/) { print(“Contains $name!\n”); } > fred Contains fred! > bill >

69 Using regular expressions to substitute parts of strings. Another useful thing with regular expressions is to use them to substitute parts of a string for other parts. My favourite use: strip a trailing forward slash from a path:
    $input = $ARGV[0];
    $input =~ s/\/$//;
    print("$input\n");
Example:
    /usr/bin/tmp/   becomes   /usr/bin/tmp
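
Two further substitution sketches. Both sub names are hypothetical, and the /g modifier (replace every match, not just the first) and the s{...}{} delimiters are standard Perl features that the slide itself does not cover:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a DNA string into RNA by replacing every T with U.
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ s/T/U/g;    # copy first, then substitute in the copy
    return $rna;
}

# Hypothetical helper: remove one or more trailing slashes from a path.
sub strip_trailing_slashes {
    my ($path) = @_;
    $path =~ s{/+$}{};              # braces as delimiters avoid escaping the slash
    return $path;
}

print transcribe("TGACT"), "\n";
print strip_trailing_slashes("/usr/bin/tmp///"), "\n";
```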

70 The ‘for’ control structure The ‘for’ control structure is ideal for looping through arrays

71 For Loops Consider the standard while loop in pseudocode: initialization code while ( Test code ) { Code to execute in body } continue { Update code }

72 For Loops This can be generalized into the concise for loop: for ( initialization code; test code; update code ) { body code }
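
The two pseudocode forms can be compared directly. This sketch runs the same count with a while/continue loop and with a for loop (the array names are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# while form: initialization, test, body, update in a continue block
my @from_while;
my $i = 0;
while ($i < 3) {
    push(@from_while, $i);
} continue {
    $i++;
}

# for form: the same three pieces gathered on one line
my @from_for;
for (my $j = 0; $j < 3; $j++) {
    push(@from_for, $j);
}

print "while: @from_while\n";
print "for:   @from_for\n";
```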

73 ‘for’ example
    #!/usr/bin/perl
    $name[0] = "Bill";
    $name[1] = "Fred";
    $name[2] = "Bartholomew";
    for ($nameIndex = 0; $nameIndex < scalar(@name); $nameIndex++) {
        print("$name[$nameIndex]\n");
    }
Output:
    Bill
    Fred
    Bartholomew

74 Foreach Loop has similar application foreach will process each element of an array or list: foreach $loop_variable ('item1','item2','item3') { print $loop_variable,"\n"; }

75 ‘foreach’ example
    #!/usr/bin/perl
    $name[0] = "Bill";
    $name[1] = "Fred";
    $name[2] = "Bartholomew";
    foreach $currentName (@name) {
        print("$currentName\n");
    }
Output:
    Bill
    Fred
    Bartholomew
$currentName is assigned each value in the array @name in turn.

76 Opening files We can open other files with our PERL script. This is the real strength of PERL: processing text files. It’s easy!

77 Opening files (cont.) To open a file, we need to assign it a ‘file handle’ – this is the unique identifier we use to refer to the file:
    open(INPUTFILE, "names.txt");
INPUTFILE is the filehandle; "names.txt" is the name of the file we want to open and assign to the filehandle. When we’re finished, we should close the file:
    close(INPUTFILE);

78 While Loops A while loop has a condition at the top. The code within the body will execute until the test becomes false. while ( TEST ) { Code to execute } continue { Optional code to execute at the end of each loop }

79 The ‘while’ control structure The ‘while’ control structure keeps looping while a given condition is satisfied
    #!/usr/bin/perl
    while (1 == 1) {
        print("This is a really annoying infinite loop\n");
    }
Output:
    This is a really annoying infinite loop
    (ad nauseam…)

80 Combining while loops with opening files ‘while’ and open files go together very well:
    #!/usr/bin/perl
    open(INPUTFILE, "names.txt");
    while ($inputLine = <INPUTFILE>) {
        chomp($inputLine);          # remove the newline read from the file
        print("$inputLine\n");
    }
    close(INPUTFILE);
names.txt looks like this:
    Fred
    Bill
    Bartholomew
Output:
    Fred
    Bill
    Bartholomew
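
A slightly more defensive sketch of the same loop. It swaps in the three-argument form of open with a lexical filehandle and an "or die" check, which differ from the bareword style on the slide; the temporary file stands in for names.txt and read_names is a hypothetical name:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small stand-in for names.txt so the sketch runs on its own.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print $tmp_fh "Fred\nBill\nBartholomew\n";
close($tmp_fh);

# Hypothetical helper: read a file and return its lines without newlines.
sub read_names {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    my @names;
    while (my $line = <$in>) {
        chomp($line);
        push(@names, $line);
    }
    close($in);
    return @names;
}

print "$_\n" foreach read_names($filename);
```

Checking the result of open means a missing or unreadable file is reported immediately instead of silently producing no output.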

81 split A good use for regular expressions is to use them to define delimiting character(s). My favorite use: separating tab-delimited lines into an array:
    $input = <STDIN>;
    @lineContents = split(/\t/, $input);
    print($lineContents[0]."\n");
> < data.txt X Y Z > X1Y3Z6X1Y3Z6 (data.txt)

82 Until Loops Sometimes you want to loop until some condition becomes true, rather than until some condition becomes false. The until loop is easier to read than the equivalent while (!TEST). my $counter = 5; until ( $counter < 0 ) { print $counter--,"\n"; }

83 Executing external programs Another strength of PERL is that it can be used to run external programs. For example, say we have a C++ program that takes a PDB file and calculates inter-Cα distances, outputting them like this: 1109.23 One Cα The other Cα Distance between them in angstroms (tab separated)

84 Example We could write a PERL script to calculate the average inter-Cα distances:
    #!/usr/bin/perl
    $PDBFile = "1a8l.pdb";
    @results = `getDistances $PDBFile`;
    $total = 0;
    $count = 0;
    foreach $line (@results) {
        chomp($line);    # chomp with no argument would act on $_, not on $line
        ($carbon1, $carbon2, $distance) = split(/\t/, $line);
        $total = $total + $distance;
        $count++;
    }
    print("Average distance: " . ($total / $count) . "\n");
The little reverse quotes (backquotes) tell PERL to execute the program and collect the results in the array ‘@results’. The ‘split’ command splits the line at every tab.
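
The parsing half of this script can be tried without the external program. In this sketch the input lines are made-up stand-ins for the getDistances output, and average_distance is a hypothetical name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: average the third tab-separated field of each line.
sub average_distance {
    my @lines = @_;
    my ($total, $count) = (0, 0);
    foreach my $line (@lines) {
        chomp($line);
        my ($carbon1, $carbon2, $distance) = split(/\t/, $line);
        next unless defined $distance;    # skip lines with fewer than three fields
        $total += $distance;
        $count++;
    }
    return $count ? $total / $count : undef;   # avoid dividing by zero
}

# Made-up values standing in for the program's output.
my @results = ("1\t10\t9.0\n", "1\t11\t11.0\n");
print "Average distance: ", average_distance(@results), "\n";
```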

85 Our FASTA pattern problem Our problem with pattern matching across FASTA files is the lack of cohesive sequence (it runs across many lines) Furthermore, our DNA sequence download only has one strand direction (why? Think programmatically!) We need to solve that –To do so, we need to read in the file and choose a data structure appropriate for our needs –Which one should we use?

86 PERL data structures we can use
$stringName –scalars – strings, perl handles datatype conversions
@arrayName –arrays – indexed by position, starting at 0
Function(@arrayName) –manipulation of arrays
$($array) –scalar conversion of an array element
% hashes –index non-sequentially (aka “associative arrays”) – we’ll talk more about these in coming lectures

87 Basic concept for our task Read Command Line Arguments Open Fasta File While open { –Read each line of Fasta File –If line starts with “>”, print to out file –Else, reverse complement the line } Close Fasta File Use Control Structures To Impose Logic
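
The "reverse complement" step of this outline can be sketched on its own. This is one possible approach, not necessarily the intended homework solution; it assumes plain A, C, G, T input (upper or lower case) and uses reverse plus the tr/// operator, and the sub name is hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reverse complement of a DNA string.
sub reverse_complement {
    my ($seq) = @_;
    my $revcomp = reverse($seq);          # assigning to a scalar reverses the string
    $revcomp =~ tr/ACGTacgt/TGCAtgca/;    # swap each base for its complement
    return $revcomp;
}

my $line = ">seq1";                        # header lines are printed unchanged
print "$line\n" if $line =~ /^>/;
print reverse_complement("TGACT"), "\n";
```

One caution when fitting this into the outline: reverse complementing each line separately is not the same as reverse complementing the whole sequence when a record spans several lines, so the lines of a record may need to be joined first.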

88 Emacs commands (in your reading material -> copious emacs cmds)

89 Emacs text editor Use either term or GUI –(‘> emacs –nw’) –(‘> emacs’) Able to load ASCII and binary files and show metadata (windows conversions) Spell check, search, replace (see readings) Markup language handling for all file types, formatting (LaTeX, etc.)

90 date & version program description explanation of major steps place holder for the remaining steps Write Seq File & Program

91 WATCH OUT FOR TYPOS!! /^>/ vs. /^>?


93 Homework problem 3
Finish writing the perl program for reverse complementing a fasta sequence
Use cat “file_of_fields” | awk...
–To reorder the first and last field on each line
–To select just the 1st and 5th fields of each line
–To select the 1st and 5th fields and add “human” as a field between them
Use cat “file_of_fields” | awk... | grep... to select only lines containing ‘trans_factors’
Use the redirection operator to write the output to a file called “human disease genes”
Estimated time –perl 15 – 90 mins –cat, awk, grep 5 to 15 mins

94 Homework Set 4 Use STDIN instead of a command line argument to read the file; make the program work using STDIN. (Hint: pipe the file in with cat seq.fa | … and read it in the program with while (<STDIN>) { … }) (Estimated time: 15 – 60 minutes)

95 Homework Set #5 Modify the output portion of the program to make a 2nd command line argument ($ARGV[1]) provide the name of an output file for the reverse complemented sequence.
    open(OUTPUT, ">$out_put_file_name");
    print OUTPUT "$_\n";
    close(OUTPUT);
(Estimated time: 15 – 60 mins)
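
A hedged sketch of the output step in isolation. It uses the three-argument open with an error check instead of the two-argument form above; the temporary file stands in for $ARGV[1], and write_lines and the sample lines are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for $ARGV[1]: a temporary file name (an assumption for this sketch).
my ($tmp_fh, $out_put_file_name) = tempfile(UNLINK => 1);
close($tmp_fh);

# Hypothetical helper: write each element as one line of the named file.
sub write_lines {
    my ($path, @lines) = @_;
    open(my $out, '>', $path) or die "Cannot write to $path: $!";
    foreach my $line (@lines) {
        print $out "$line\n";
    }
    close($out) or die "Cannot close $path: $!";
    return scalar(@lines);
}

my $written = write_lines($out_put_file_name, ">seq1 reverse complement", "AGTCA");
print "wrote $written lines to $out_put_file_name\n";
```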

96 Important Advice!!! Save your program frequently!! cp Save intermediate versions –cp –cp –Etc……

