Download presentation
Presentation is loading. Please wait.
1
Introduction to Perl for Bioinformatics Thursday, April 7
2
Programming languages Self-contained language –Platform-independent –Used to write O/S –C (imperative, procedural) –C++, Java (object-oriented) –Lisp, Haskell, Prolog (functional) Scripting language –Closely tied to O/S –Perl, Python, Ruby Domain-specific language –R (statistics) –MatLab (numerics) –SQL (databases) An O/S typically manages… –Devices (see above) –Files & directories –Users & permissions –Processes & signals
3
Bioinformatics “pipelines” often involve chaining together multiple tools
4
Perl is the most-used bioinformatics language Most popular bioinformatics programming languages Bioinformatics career survey, 2008 Michael Barton
5
Perl overview Interpreted, not compiled –Fast edit-run-revise cycle Procedural & imperative –Sequence of instructions (“control flow”) –Variables, subroutines Syntax close to C (the de facto standard minimal language) –Weakly typed (unlike C) –Redundant, not minimal (“there’s more than one way to do it”) –High-level data structures & algorithms –Hashes, arrays –Operating System support (files, processes, signals) –String manipulation
6
Perl basics Basic syntax of a Perl program: #!/usr/local/bin/perl # Elementary Perl program print "Hello World\n"; " \n " means new line print statement tells Perl to print the following text to the screen Single or double quotes enclose a "string literal" (double quotes are "interpolated") All statements end with a semicolon Lines beginning with " # " are comments, and are ignored by Perl Hello World
7
Variables We can tell Perl to "remember" a particular value, using the assignment operator “ = “: The $x is referred to as a "scalar variable". Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_" Scalar variable names are all prefixed with the dollar symbol. $x = 3; print $x; 3 $x = "ACGCGT"; print $x; ACGCGT Binding site for yeast transcription factor MCB
8
Arithmetic operations Basic operators are + - / * % Can also use += -= /= *= ++ -- $x = 14; $y = 3; print "Sum: ", $x + $y, "\n"; print "Product: ", $x * $y, "\n"; print "Remainder: ", $x % $y, "\n"; Sum: 17 Product: 42 Remainder: 2 $x = 5; print "x started as $x\n"; $x = $x * 2; print "Then x was $x\n"; $x = $x + 1; print "Finally x was $x\n"; x started as 5 Then x was 10 Finally x was 11 Could write $x *= 2; Could write $x += 1; or even ++$x;
9
String operations Concatenation. and.= Can find the length of a string using the function length($x) $a = "pan"; $b = "cake"; $a = $a. $b; print $a; pancake $a = "soap"; $b = "dish"; $a.= $b; print $a; soapdish $mcb = "ACGCGT"; print "Length of $mcb is ", length($mcb); Length of ACGCGT is 6
10
More string operations $x = "A simple sentence"; print $x, "\n"; print uc($x), "\n"; print lc($x), "\n"; $y = reverse($x); print $y, "\n"; $x =~ tr/i/a/; print $x, "\n"; print length($x), "\n"; A simple sentence A SIMPLE SENTENCE a simple sentence ecnetnes elpmis A A sample sentence 17 Convert to upper case Convert to lower case Reverse the string Transliterate "i"'s into "a"'s Calculate the length of the string
11
Concatenating DNA fragments $dna1 = "accacgt"; $dna2 = "taggtct"; print $dna1. $dna2; "Transcribing" DNA to RNA accacguuaggucu $dna = "accACgttAGGTct"; $rna = lc($dna); $rna =~ tr/t/u/; print $rna; Make it all lower case DNA string is a mixture of upper & lower case Transliterate "t" to "u" accacgttaggtct
12
Conditional blocks The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Perl, this looks like this: if (condition) { action } else { alternative } $x = 149; $y = 100; if ($x > $y) { print "$x is greater than $y\n"; } else { print "$x is less than $y\n"; } 149 is greater than 100 These braces { } tell Perl which piece of code is contingent on the condition.
13
Conditional operators Numeric: > >= < <= != == String: eq ne gt lt ge le $x = 5 * 4; $y = 17 + 3; if ($x == $y) { print "$x equals $y"; } 20 equals 20 "equals" "does not equal" "is alphabetically greater than" "is alphabetically less than" "is alphabetically greater-or-equal" "is alphabetically less-or-equal" Note that the test for "$x equals $y" is $x==$y, not $x=$y ($x, $y) = ("Apple", "Banana"); if ($y gt $x) { print "$y after $x "; } Banana after Apple "does not equal" Shorthand syntax for assigning more than one variable at a time
14
Logical operators Logical operators : && means " and ", || means " or " An exclamation mark ! is used to negate what follows Thus !($x = $y) In computers, the value zero is often used to represent falsehood, while any non-zero value (e.g. 1) represents truth. Thus: if (1) { print "1 is true\n"; } if (0) { print "0 is true\n"; } if (-99) { print "-99 is true\n"; } 1 is true -99 is true $x = 222; if ($x % 2 == 0 and $x % 3 == 0) { print "$x is an even multiple of 3\n"; } 222 is an even multiple of 3
15
Loops Here's how to print out the numbers 1 to 10: This is a while loop. The code is executed while the condition is true. $x = 1; while ($x <= 10) { print $x, " "; ++$x; } 1 2 3 4 5 6 7 8 9 10 The code inside the braces is repeatedly executed as long as the condition $x<=10 remains true Equivalent to $x = $x + 1;
16
A common kind of loop Let's dissect the code of the while loop again: This form of while loop is common enough to have its own shorthand: the for loop. $x = 1; while ($x <= 10) { print $x, " "; ++$x; } Initialization Test for completion Continuation for ($x = 1; $x <= 10; ++$x) { print $x, " "; } Initialization Test for completion Continuation
17
defined and undef The function defined($x) is true if $x has been assigned a value: A variable that has not yet been assigned a value has the special value undef Often, if you try to do something "illegal" (like reading from a nonexistent file), you end up with undef as a result if (defined($newvar)) { print "newvar is defined\n"; } else { print "newvar is not defined\n"; } newvar is not defined
18
Reading a line of data To read from a file, we first need to open the file and give it a filehandle. Once the file is opened, we can read a single line from it into the scalar $x : This code snippet opens a file called "sequence.txt", and associates it with a filehandle called FILE open FILE, "sequence.txt"; $x = ; This reads the next line from the file, including the newline at the end, " \n ". if the end of the file is reached, $x is assigned the special value undef
19
Reading an entire file The following piece of code reads every line in a file and prints it out to the screen: A shorter version of this is as follows: open FILE, "sequence.txt"; while (defined ($x = )) { print $x; } close FILE; open FILE, "sequence.txt"; while ($x = ) { print $x; } close FILE; This reads a line of data into $x, then checks if $x is defined. If $x is undef, then the file must have ended. this is equivalent to defined($x= )
20
The default variable, $_ Many operations that take a scalar argument, such as length($x), are assumed to work on $_ if the $x is omitted: So we can also read a whole file like this: $_ = "Hello"; print; print length; Hello5 open FILE, "sequence.txt"; while ( ) { print; } close FILE; This line is equivalent to while (defined($_= )) {
21
Summary: scalars and loops Assignment operator Arithmetic operations String operations Conditional tests Logical operators Loops defined and undef Reading a file $x = 5; $y = $x * 3; if ($y > 10) { print $s; } $s = "Value of y is ". $y; if ($y>10 && $s eq "") { exit; } for ($x=1; $x<10; ++$x) { print $x, "\n"; }
22
Pattern-matching A more sophisticated kind of logical test is to ask whether a string contains a pattern e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT? $name = "YBR007C"; $dna="TAATAAAAAACGCGTTGTCG"; if ($dna =~ /ACGCGT/) { print "$name has MCB!\n"; } 20 bases upstream of the yeast gene YBR007C The pattern binding operator =~ The pattern for the MCB binding site YBR007C has MCB!
23
FASTA format A format for storing multiple named sequences in a single file This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488 >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC >CG11455 TAGACGGAGACCCGTTTTTC TTGGTTAGTTTCACATTGTA AAACTGCAAATTGTGTAAAA ATAAAATGAGAAACAATTCT GGT >CG11488 TAGAAGTCAAAAAAGTCAAG TTTGTTATATAACAAGAAAT CAAAAATTATATAATTGTTT TTCACTCT Name of sequence is preceded by > symbol NB sequences can span multiple lines Call this file fly3utr.txt
24
Printing all sequence names in a FASTA database The key to this program is this block: open FILE, "fly3utr.txt"; while ($x = ) { if ($x =~ />/) { print $x; } } close FILE; >CG11604 >CG11455 >CG11488 if ($x =~ />/) { print $x; } This pattern matches (and returns TRUE) if the default variable $_ contains the FASTA sequence-name symbol > This line prints $_ if the pattern matched
25
Pattern replacement open FILE, "fly3utr.txt"; while ( ) { if (/>/) { s/>//; print; } } close FILE; CG11604 CG11455 CG11488 New statement removes the ">" The new statement s/>// is an example of a replacement. General form: s/OLD/NEW/ replaces OLD with NEW Thus s/>// replaces ">" with "" (the empty string) $_ is the default variable for these operations
26
Finding all sequence lengths Open file Read line End of file? Line starts with “>” ? Remove “\n” newline character at end of line Sequence nameSequence data Add length of line to running total Record the name Reset running total of current sequence length First sequence? Print last sequence length Stop noyes no Start Print last sequence length
27
Finding all sequence lengths open FILE, "fly3utr.txt"; while ( ) { chomp; if (/>/) { if (defined $len) { print "$name $len\n"; } $name = $_; $len = 0; } else { $len += length; } print "$name $len\n"; close FILE; >CG11604 58 >CG11455 83 >CG11488 68 The chomp statement trims the newline character " \n " off the end of the default variable, $_. Try it without this and see what happens – and if you can work out why >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC >CG11455 TAGACGGAGACCCGTTTTTC TTGGTTAGTTTCACATTGTA AAACTGCAAATTGTGTAAAA ATAAAATGAGAAACAATTCT GGT >CG11488 TAGAAGTCAAAAAAGTCAAG TTTGTTATATAACAAGAAAT CAAAAATTATATAATTGTTT TTCACTCT
28
Reverse complementing DNA $dna = "accACgttAGgtct"; $revcomp = lc($dna); $revcomp = reverse($revcomp); $revcomp =~ tr/acgt/tgca/; print $revcomp; agacctaacgtggt Start by making string lower case again. This is generally good practice Reverse the string Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a' A common operation due to double-helix symmetry of DNA
29
Arrays An array is a variable holding a list of items We can think of this as a list with 4 entries @nucleotides = ('a', 'c', 'g', 't'); print "Nucleotides: @nucleotides\n"; Nucleotides: a c g t acgt element 0 element 1 element 2 element 3 the array is the set of all four elements Note that the element indices start at zero.
30
Array literals There are several, equally valid ways to assign an entire array at once. @a = (1,2,3,4,5); print "a = @a\n"; @b = ('a','c','g','t'); print "b = @b\n"; @c = 1..5; print "c = @c\n"; @d = qw(a c g t); print "d = @d\n"; a = 1 2 3 4 5 b = a c g t c = 1 2 3 4 5 d = a c g t This is the most common: a comma- separated list, delimited by parentheses
31
Accessing arrays To access array elements, use square brackets; e.g. $x[0] means "element zero of array @x " Remember, element indices start at zero! If you use an array @x in a scalar context, such as @x+0, then Perl assumes that you wanted the length of the array. @x = ('a', 'c', 'g', 't'); print $x[0], "\n"; $i = 2; print $x[$i], "\n"; agag @x = ('a', 'c', 'g', 't'); print @x + 0; 4
32
Array operations You can sort and reverse arrays... You can read the entire contents of a file into an array (each line of the file becomes an element of the array) @x = ('a', 't', 'g', 'c'); @y = sort @x; @z = reverse @y; print "x = @x\n"; print "y = @y\n"; print "z = @z\n"; x = a t g c y = a c g t z = t g c a open FILE, "sequence.txt"; @x = ;
33
push, pop, shift, unshift @x = (‘A’, ‘T’, ‘W’); print "I started with @x\n"; $y = pop @x; push @x, ‘G’; print "Then I had @x\n"; $z = shift @x; unshift @x, ‘C’; print "Now I have @x\n"; print "I lost $y and $z\n"; I started with A T W Then I had A T G Now I have C T G I lost W and A pop removes the last element of an array push adds an element to the end of an array shift removes the first element of an array unshift adds an element to the start of an array
34
foreach Finding the total of a list of numbers: Equivalent to: @val = (4, 19, 1, 100, 125, 10); $total = 0; foreach $x (@val) { $total += $x; } print $total; 259 @val = (4, 19, 1, 100, 125, 10); $total = 0; for ($i = 0; $i < @val; ++$i) { $total += $val[$i]; } print $total; 259 foreach statement loops through each entry in an array
35
The @ARGV array A special array is @ARGV This contains the command-line arguments when the program is invoked at the Unix prompt It's a way for the user to pass information into the program
36
Exploding a sequence into an array The programming language C treats all strings as arrays $dna = "accggtgtgcg"; print "String: $dna\n"; @array = split( //, $dna); print "Array: @array\n"; String: accggtgtgcg Array: a c c g g t g t g c g The split statement turns a string into an array. Here, it splits after every character, but we can also split at specific points, like a restriction enzyme
37
Taking a slice of an array The syntax @x[i,j,k... ] returns a (3-element) array containing elements i,j,k... of array @x @nucleotides = ('a', 'c', 'g', 't'); @purines = @nucleotides[0,2]; @pyrimidines = @nucleotides[1,3]; print "Nucleotides: @nucleotides\n"; print "Purines: @purines\n"; print "Pyrimidines: @pyrimidines\n"; Nucleotides: a c g t Purines: a g Pyrimidines: c t
38
Finding elements in an array The grep command is used to select some elements from an array The statement grep(EXPR,LIST) returns all elements of LIST for which EXPR evaluates to true (when $_ is set to the appropriate element) e.g. select all numbers over 100: @numbers = (101, 235, 10, 50, 100, 66, 1005); @numbersOver100 = grep ($_ > 100, @numbers); print "Numbers: @numbers\n"; print "Numbers over 100: @numbersOver100\n"; Numbers: 101 235 10 50 100 66 1005 Numbers over 100: 101 235 1005
39
Applying a function to an array The map command applies a function to every element in an array Similar syntax to list: map(EXPR,LIST) applies EXPR to every element in LIST Example: multiply every number by 3 @numbers = (101, 235, 10, 50, 100, 66, 1005); @numbersTimes3 = map ($_ * 3, @numbers); print "Numbers: @numbers\n"; print "Numbers times 3: @numbersTimes3\n"; Numbers: 101 235 10 50 100 66 1005 Numbers times 3: 303 705 30 150 300 198 3015
40
Review: pattern-matching The following code: prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the default variable, $_ Instead of using $_ we can "bind" the pattern to another variable (e.g. $dna ) using this syntax: We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax: We can replace all occurrences by appending a ' g ': if (/ACGCGT/) { print "Found MCB binding site!\n"; } if ($dna =~ /ACGCGT/) { print "Found MCB binding site!\n"; } $dna =~ s/ACGCGT/_MCB_/; $dna =~ s/ACGCGT/_MCB_/g;
Similar presentations
© 2025 SlidePlayer.com Inc.
All rights reserved.