Bioinformatics is … - the use of computers and information technology to assist biological studies - a multi-dimensional and multi-lingual discipline Chapters.


1 Bioinformatics is …
- the use of computers and information technology to assist biological studies
- a multi-dimensional and multi-lingual discipline
Chapters 1-4, Tisdall

2 Multiple platforms, multiple languages
Windows, Mac, UNIX, Linux
– UNIX remains the standard for bioinformatics software development, while PCs and Macs are typically end-user machines.
Java, Python, CORBA, C++, Ruby, Perl
– There's more than one way of doing things.
– Lack of uniformity continues to be one of the biggest problems faced in bioinformatics.

3 Why Perl?
Ease of use by novice programmers
Fast software prototyping
– Flexible language
– Compact code (sometimes)
Powerful pattern matching via "regular expressions"
Availability of programs and modules (BioPerl)
Portability
Open source: easy to extend and customize
No licensing fees

4 Perl is easy to get…
Many computers come with Perl already installed
– Check by typing perl -v in a Unix, Linux, or Mac OS X shell, or in a Windows MS-DOS shell
If not, go to www.perl.com or www.activestate.com to download a recent version of Perl (download a binary whenever possible; source code requires compiling)
ActiveState provides several tools for Perl developers
Although some think Perl is an "old" language, it is constantly undergoing revision and improvement

5 What is Perl?
Practical Extraction and Report Language
An interpreted programming language optimized for scanning text files, extracting information, and printing reports
The string-based language of DNA and protein sequence data makes Perl an obvious choice

6 What is a Perl program?
A program consists of a text file containing a series of Perl statements
– Perl programs can be written in a variety of text editors, including MS Word (saved as plain text), WordPad, NotePad, or, as you will use, Komodo from ActiveState
Perl statements end with a semicolon (;)
Multiple spaces, tabs, and blank lines are ignored
Anything following a # on a line is ignored (a comment)
Perl is case sensitive

7 Perl has three data types
$ - Scalar: holds a single value, which can be a number or a string, e.g. $EcoRI = 'GAATTC';
@ - Array: stores multiple scalar values in order, indexed 0, 1, 2, etc.
% - Hash: an associative array of keys and values
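A minimal sketch of the three data types in use. Only the EcoRI site comes from the slide; the other enzyme entries and all variable names are illustrative additions.

```perl
use strict;
use warnings;

# Scalar: a single value (a string or a number)
my $EcoRI = 'GAATTC';

# Array: an ordered list of scalars, indexed from 0
my @bases = ('A', 'C', 'G', 'T');

# Hash: key/value pairs (here: enzyme name => recognition site)
my %site = (
    EcoRI   => 'GAATTC',
    BamHI   => 'GGATCC',
    HindIII => 'AAGCTT',
);

print "EcoRI cuts at $EcoRI\n";
print "First base: $bases[0]\n";
print "Number of bases: ", scalar(@bases), "\n";
print "BamHI cuts at $site{BamHI}\n";
```

Note that a single array element is written $bases[0] and a single hash value $site{BamHI}: the sigil becomes $ because each element is itself a scalar.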

8 Using Scalar Variables
Example 4-1 in Tisdall provides a simple example; a thorough description of this exercise is supplied in the text

9 Some additional comments regarding strings
Quotes:
– 'XYZ' Text between a pair of single quotes is interpreted literally
– To get a single quote into a single-quoted string, precede it with a backslash
– To get a backslash into a single-quoted string, precede the backslash with another backslash
'hello' # hello
'can\'t' # can't
'http:\\\\www' # http:\\www

10 Double quotes interpolate variables
"" Variable names within the string are replaced by their current values
$x = 1;
print '$x'; # will print out $x
print "$x"; # will print out 1
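The quoting rules for single and double quotes can be checked side by side with a short script. This is a sketch; the variable names are chosen here for illustration.

```perl
use strict;
use warnings;

my $x = 1;

my $literal      = '$x';            # single quotes: no interpolation
my $interpolated = "$x";            # double quotes: $x is replaced by its value
my $apostrophe   = 'can\'t';        # \' gives a single quote
my $backslashes  = 'http:\\\\www';  # each \\ gives one backslash

print $literal, "\n";       # $x
print $interpolated, "\n";  # 1
print $apostrophe, "\n";    # can't
print $backslashes, "\n";   # http:\\www
```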

11 Arithmetic operators
+ Addition
- Subtraction
* Multiplication
** Exponentiation
/ Division
% Modulus

12 Other important operators
= is the assignment operator
== tests numeric equality; eq tests string equality
+= and -= are assignment operators that add or subtract: $a += 2; # means $a = $a + 2;
++ and -- are the autoincrement and autodecrement operators; they add or subtract one from the variable they follow ($a++ means $a = $a + 1)
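A short sketch of these operators in action, including the difference between == (numbers) and eq (strings). The variable names and values are made up for illustration.

```perl
use strict;
use warnings;

my $power = 2 ** 3;     # exponentiation: 8
my $rest  = 17 % 3;     # modulus: 2 (the remainder of 17 / 3)

my $count = 5;
$count += 2;            # same as $count = $count + 2; now 7
$count++;               # add one; now 8
$count--;               # subtract one; back to 7

# == compares values as numbers, eq compares them as strings
my $same_number = (1.0 == 1)     ? 'yes' : 'no';   # yes
my $same_string = ('1.0' eq '1') ? 'yes' : 'no';   # no

print "$power $rest $count $same_number $same_string\n";
```

Using == on sequence strings is a common mistake: 'ATG' == 'TAG' is true, because both strings convert to the number 0. Compare strings with eq.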

13 \n = newline
Often you would like to introduce some spacing into your output
\n starts a new line wherever it appears in a string
print "apple"; print "grape";
Output looks like:
applegrape
print "apple\n"; print "grape\n";
Output looks like:
apple
grape

14 chomp and chop
chop removes the last character from a string
– $a = "Dr. Barber is hip";
– chop($a); # $a is now "Dr. Barber is hi"
chomp removes a trailing newline from the end of a string
– $a = "Dr. Barber is hip\n";
– chomp($a); # $a is now "Dr. Barber is hip"
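The difference matters most when a string has no trailing newline: chomp then leaves it alone, while chop still removes a character. A small sketch, with a made-up sequence line:

```perl
use strict;
use warnings;

my $line = "ATGGCC\n";   # a line as read from a file, newline included
chomp($line);            # removes the trailing newline only
# $line is now "ATGGCC"

my $again = $line;
chomp($again);           # no newline left, so nothing is removed
# $again is still "ATGGCC"

my $chopped = $line;
chop($chopped);          # removes the last character, whatever it is
# $chopped is now "ATGGC"

print "$line $again $chopped\n";
```

For lines read from a sequence file, chomp is usually the safer choice, since it cannot accidentally delete a base.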

15 Do examples 4-2, 4-3, 4-4

16 Working with Files
Biological data can come in a variety of file formats, and our job is to utilize these files and extract what we want
One such file format is FASTA

17 Scalar vs. Array
Example 4-5 provides a simple distinction between the use of a scalar variable and an array; read it, but don't necessarily do it
It also shows how to use a filehandle with your file. The angle brackets < > around a filehandle are the input operator; you will become better acquainted with this when we use it later
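The scalar versus array distinction shows up directly when reading from a filehandle: assigning to a scalar reads one line, assigning to an array reads all remaining lines. The sketch below is not the textbook's example; it writes its own small temporary file (with a made-up peptide sequence) so that it runs anywhere, and it uses a lexical filehandle ($fh) rather than a bareword one.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small stand-in peptide file so the sketch is self-contained.
# The sequence is made up for illustration.
my ($out, $filename) = tempfile(UNLINK => 1);
print $out "MNIDDKLEGL\nFLKCGGIDEM\nQSSRTMVVMG\n";
close($out);

open(my $fh, '<', $filename) or die "Cannot open $filename: $!";

# Scalar context: <$fh> returns one line per call
my $first_line = <$fh>;

# List context: <$fh> returns all remaining lines at once
my @rest = <$fh>;

close($fh);

# Strip the trailing newlines
chomp($first_line);
chomp(@rest);

print "First line: $first_line\n";
print "Remaining lines: ", scalar(@rest), "\n";
```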

18 adhI.pep
Replace NM_021964fragment.pep with adhI.pep, which can be downloaded from the website to a folder called "BIOS482" that you need to create on your computer
Do Example 4-7; if time permits, write code analogous to the code that follows this example to test out arrays

19 The Power of Perl Regular Expressions

20 What is a regular expression (regex)?
It is a description of a group of characters you want to search for in a string, a file, a website, etc.
Think of the group of characters as a pattern that you want to find within a string
Use regular expressions to search text quickly and accurately

21 Pattern Matching Syntax
$variable_name =~ /pattern/;
– $variable_name: the variable containing the string you want to search
– =~: the binding operator, used to apply a regular expression to a variable
– Letters before the first / and after the last / are operators and modifiers, respectively, that affect the regular expression search

22 Matching operator
You have already been introduced to the substitution and translation operators
m// or just // is used to find patterns in a string
Test if a string contains the sequence ATG:
$dnastr = 'TTCGATGCCAC';
if ($dnastr =~ /ATG/) {
    print "ATG found.\n";
}
else {
    print "ATG not found.\n";
}

23 Case modifier
/atg/ would not find a match in the previous example
However, /atg/i would
i is the case-insensitive modifier
We will introduce additional modifiers when necessary

24 Global modifier
If there were more than one ATG in the sequence, the previous examples would only find the first one they run into
/ATG/g
g is the modifier for a global search: it searches a string for ALL instances of the pattern, not just the first one
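A sketch of the i and g modifiers working together. The mixed-case sequence and the variable names are made up for illustration; pos() is a built-in that reports where the last /g match ended.

```perl
use strict;
use warnings;

my $dna = 'ttcgATGccacATGtaaatg';

# /i ignores case, so a lowercase pattern still matches
my $has_start = ($dna =~ /atg/i) ? 1 : 0;

# /g in list context returns every match; counting them gives the total
my @upper_only = $dna =~ /ATG/g;    # case-sensitive: 2 matches
my @any_case   = $dna =~ /ATG/gi;   # case-insensitive: 3 matches

# /g in a while loop steps through the matches one at a time
my @starts;
while ($dna =~ /ATG/gi) {
    push @starts, pos($dna) - 3;    # 0-based start of each match
}

print "upper: ", scalar(@upper_only), ", any case: ", scalar(@any_case), "\n";
print "start positions: @starts\n";
```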

25 Other operators for regex
s/// - the substitution operator is used to change strings. Put the old string between the first and second /, and the new string between the second and third
tr/// - the translation operator is used to change individual characters. Put the old characters between the first and second /, and the new characters between the second and third
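Two classic sequence tasks illustrate the difference: transcription is a substitution (every T becomes U), while complementing is a character-by-character translation. A minimal sketch with a made-up sequence:

```perl
use strict;
use warnings;

my $dna = 'ATGCGTTAA';

# s/// replaces text: with /g, every T becomes U (DNA -> RNA)
(my $rna = $dna) =~ s/T/U/g;

# tr/// maps characters one-to-one: A<->T and C<->G give the complement.
# Reversing first yields the reverse complement.
(my $revcom = reverse $dna) =~ tr/ACGT/TGCA/;

# tr/// also returns how many characters it matched; with an empty
# replacement list the string is left unchanged, so this just counts
my $gc = ($dna =~ tr/GC//);

print "RNA:                $rna\n";
print "Reverse complement: $revcom\n";
print "G+C count:          $gc\n";
```

The (my $copy = $original) =~ ... idiom copies the string first, so the original sequence is not modified.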

26 Metacharacters help search for complicated patterns
\d or [0-9] – match any digit
\w or [a-zA-Z_0-9] – match a word character
\D – match a non-digit character
\W – match a non-word character
\s or [ \t\n\r\f] – match a whitespace character
\S – match a non-whitespace character
\n – match a newline character
\r – match a carriage return
\t – match a tab
\f – match a formfeed
. – match any SINGLE character (except a newline, by default)
There are more!
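As a sketch of how metacharacters combine, the script below pulls a name and two coordinates out of a tab-separated annotation line. The line format is invented for illustration, and the pattern also uses the + quantifier, anchors, and capturing parentheses, which are covered on the slides that follow.

```perl
use strict;
use warnings;

# A made-up annotation line, for illustration only
my $line = "exon_3\tstart 1045\tend 1298";

# \w+ matches a run of word characters, \s+ a run of whitespace,
# \d+ a run of digits; parentheses capture the pieces we want
my ($name, $start, $end);
if ($line =~ /^(\w+)\s+start\s+(\d+)\s+end\s+(\d+)$/) {
    ($name, $start, $end) = ($1, $2, $3);
    print "$name runs from $start to $end\n";
}

# \D matches anything that is not a digit
my $has_non_digit = ($start =~ /\D/) ? 'yes' : 'no';
print "Non-digit in start? $has_non_digit\n";
```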

27 Regex quantifiers
These syntax structures allow you to specify how many times part of a regular expression pattern should match
– * match 0 or more times
– + match 1 or more times
– ? match 1 or 0 times
– {n} match exactly n times
– {n,} match at least n times
– {n,m} match at least n, but not more than m times

28 Examples of quantifier use
/A+CGC?A/ # match one or more A's followed by CG, followed by an optional C, followed by an A
/A{3}/ # match exactly 3 A's
/A{3,}/ # match 3 or more A's
/A{3,8}/ # match 3 to 8 A's
The transcription factor binding site for SSP protein is GGCGGCGGCTGGCTAGGG
– /(GGC){3}TG{2}CTAG{3}/
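A sketch that exercises these quantifiers. The SSP site is written here as /(GGC){3}TG{2}CTAG{3}/, which spells out GGCGGCGGCTGGCTAGGG; the flanking bases and the short test strings are made up for illustration.

```perl
use strict;
use warnings;

# GGC three times, then T, two G's, CTA, and three G's
my $promoter = 'TTAGGCGGCGGCTGGCTAGGGACCA';
my $found = ($promoter =~ /(GGC){3}TG{2}CTAG{3}/) ? 1 : 0;

# {3,} is greedy: it takes as many A's as it can
my ($run) = 'CCAAAAAGG' =~ /(A{3,})/;

# + requires at least one A; ? makes the second C optional.
# ^ and $ force the whole string to fit the pattern.
my @candidates = qw(ACGA AACGCA CGA ACGCCA);
my @hits;
foreach my $candidate (@candidates) {
    push @hits, $candidate if $candidate =~ /^A+CGC?A$/;
}

print "SSP site found: $found\n";
print "Run of A's: $run\n";
print "Matches for A+CGC?A: @hits\n";
```

CGA fails because A+ needs at least one leading A, and ACGCCA fails because C? allows at most one C after CG.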

29 Alternation
Vertical bar (|) allows you to match one of several alternatives
/song|blue/ # match either 'song' or 'blue'
/a|b|c/ # match a, b, or c; same as [abc]
The GATA-1 TF binding site is defined by a T or an A, followed by GATA, followed by an A or G. In regex that would be:
/(T|A)GATA(A|G)/

30 Anchoring patterns
^ matches the beginning of a string, while $ matches the end of a string
/^this/ # matches 'this one' but not 'watch this'
/this$/ # matches 'watch this' but not 'this one'
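Anchors and alternation combine naturally. The sketch below first replays the two anchor examples, then uses alternation plus $ to ask whether a sequence ends in a stop codon (TAA, TAG, or TGA). The short ORF is made up for illustration.

```perl
use strict;
use warnings;

my @phrases = ('this one', 'watch this');

foreach my $phrase (@phrases) {
    my $at_start = ($phrase =~ /^this/) ? 'yes' : 'no';
    my $at_end   = ($phrase =~ /this$/) ? 'yes' : 'no';
    print "'$phrase': starts with this? $at_start; ends with this? $at_end\n";
}

# Alternation lists the three stop codons;
# $ requires the match to sit at the end of the string
my $orf = 'ATGGCCTGA';
my $ends_with_stop = ($orf =~ /(TAA|TAG|TGA)$/) ? 1 : 0;
print "Ends with a stop codon: $ends_with_stop\n";
```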

31 Pattern memory
You know how to match characters; now you need a way to find out what was matched by saving the matching portions
Putting parentheses around any pattern allows the part of the string matched by that pattern to be remembered and stored in a special variable called $1
If there are multiple parenthesized patterns, they are stored in $2, $3, …

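32 Finding and storing GATA-1 binding site
$seq = "AAAGAGAGGGATAGAATAGAGATGATAAGAAA";
$seq =~ /((T|A)GATA(A|G))/;
print "$1\n";
Output:
TGATAA
The outer parentheses store the whole site in $1; the inner groups (T|A) and (A|G) are stored in $2 and $3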

33 Other special variables
$& the part of the string that actually matched
$` everything before the match
$' everything after the match
– Modify the previous program to:
print "$`\n";
print "$&\n";
print "$'\n";
Output:
AAAGAGAGGGATAGAATAGAGA
TGATAA
GAAA
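A runnable sketch that puts pattern memory and the special variables together on the GATA-1 example. One adjustment is assumed here: an extra pair of outer parentheses is added so that $1 holds the whole site. With only the inner groups (T|A) and (A|G), $1 would hold just the first base.

```perl
use strict;
use warnings;

my $seq = 'AAAGAGAGGGATAGAATAGAGATGATAAGAAA';

my ($site, $first, $last, $before, $matched, $after);
if ($seq =~ /((T|A)GATA(A|G))/) {
    # Groups are numbered by the position of their opening parenthesis
    ($site, $first, $last) = ($1, $2, $3);
    # $` is the text before the match, $& the match, $' the text after it
    ($before, $matched, $after) = ($`, $&, $');
}

print "Whole site: $site\n";     # TGATAA
print "First base: $first\n";    # T
print "Last base:  $last\n";     # A
print "Before:     $before\n";   # AAAGAGAGGGATAGAATAGAGA
print "Match:      $matched\n";  # TGATAA
print "After:      $after\n";    # GAAA
```

Copying $1, $`, and the others into named variables right after the match is a good habit, because the next successful match overwrites them.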

34 Websites on RegEx
http://www.perldoc.com/perl5.6.1/pod/perlre.html
http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
http://www.devshed.com/Server_Side/Administration/RegExp/page2.html
http://www.javaworld.com/javaworld/jw-07-2001/jw-0713-regex.html

35 Exercises
Try some regular expressions with your motif.pl program (pp. 67-69)
Read pages 70-75 and work through Example 5-4 (pick your own nucleotide file from NCBI)
Next, do Example 5-7 to learn how to write to files

36 Homework

