Computer Programming for Biologists Class 5 Nov 20th, 2014 Karsten Hokamp


1 Computer Programming for Biologists Class 5 Nov 20th, 2014 Karsten Hokamp http://bioinf.gen.tcd.ie/GE3M25/programming

2 Overview
- Project
- Program Exit
- Random numbers
- Regular Expressions

3 Project
Task 1: Report length of a sequence in Fasta format
Understand the problem, consider input/output:

    >Tmsb10
    ATGGCAGACAAGCCGGACATGGGGGAAATCGCCAGCTTCGATAAGGCCAAGCTGAAGAAA
    ACCGAGACGCAGGAGAAGAACACCCTGCCGACCAAAGAGACCATTGAACAGGAAAAGAGG
    AGTGAAATCTCCTAA

-> Sequence length is 135 bp.

4 Project
Problems:
1. File contains header line
2. Sequence contains line-breaks

    >Tmsb10
    ATGGCAGACAAGCCGGACATGGGGGAAATCGCCAGCTTCGATAAGGCCAAGCTGAAGAAA
    ACCGAGACGCAGGAGAAGAACACCCTGCCGACCAAAGAGACCATTGAACAGGAAAAGAGG
    AGTGAAATCTCCTAA

5 Project
Steps:
1. Read in file content (line-by-line)
2. Remove line-breaks
3. Skip header line
4. Concatenate sequence into one long string
5. Calculate and report length

6 Project
Steps:

    # 1. Read in file content (line-by-line)
    while ($input = <>) {
    }

7 Project
Steps:

    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        # 2. Remove line-breaks
        # 3. Skip header line
        # 4. Concatenate sequence into one long string
    }

8 Project
Steps:

    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        # 2. Remove line-breaks
        chomp $input;
        # 3. Skip header line
        # 4. Concatenate sequence into one long string
    }

9 Project
Steps:

    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        # 2. Remove line-breaks
        chomp $input;
        # 3. Skip header line
        # 4. Concatenate sequence into one long string
        $sequence .= $input;
    }

10 Project

    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        # 2. Remove line-breaks
        chomp $input;
        # 3. Skip header line
        # 4. Concatenate sequence into one long string
        $sequence .= $input;
    }
    # 5. Calculate and report length
    $length = length($sequence);
    print "Sequence length: $length bp\n";

11 Project

    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        # 2. Remove line-breaks
        chomp $input;
        # 3. Skip header line (check for '>' in first position)
        # extract first character:
        $first = substr $input, 0, 1;
        # is it a '>'?
        if ($first eq '>') {
            # skip this line
            next;
        }
        $sequence .= $input;
    }

13 Project (alternative version)

    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        # 2. Remove line-breaks
        chomp $input;
        # 3. Skip header line (check for '>' in first position)
        # extract first character:
        $first = substr $input, 0, 1;
        # is it a '>'?
        unless ($first eq '>') {
            # this must be part of the sequence
            $sequence .= $input;
        }
    }

14 Project

    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        # 2. Remove line-breaks
        chomp $input;
        # 3. Skip header line (check for '>' in first position)
        # extract first character:
        $first = substr $input, 0, 1;
        # is it a '>'?
        if ($first eq '>') {
            # skip this line
            next;
        }
        # 4. Concatenate sequence into one long string
        $sequence .= $input;
    }
    # 5. Calculate and report length
    $length = length($sequence);
    print "Sequence length: $length bp\n";
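
The finished script reads its input through <>, so it needs a file to run. As a minimal sketch (not part of the original slides), the same five steps can be wrapped in a subroutine that takes the lines as a list, which makes them easy to try out. The name sequence_length and the use of strict and warnings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the five steps from the slide, applied to a
# list of lines instead of a file read through <>.
sub sequence_length {
    my @lines = @_;                            # 1. lines to process
    my $sequence = '';
    foreach my $input (@lines) {
        chomp $input;                          # 2. remove line-break
        next if substr($input, 0, 1) eq '>';   # 3. skip header line
        $sequence .= $input;                   # 4. concatenate
    }
    return length $sequence;                   # 5. calculate length
}

# The Tmsb10 record from the Task 1 slide
my @fasta = (
    ">Tmsb10\n",
    "ATGGCAGACAAGCCGGACATGGGGGAAATCGCCAGCTTCGATAAGGCCAAGCTGAAGAAA\n",
    "ACCGAGACGCAGGAGAAGAACACCCTGCCGACCAAAGAGACCATTGAACAGGAAAAGAGG\n",
    "AGTGAAATCTCCTAA\n",
);

print "Sequence length: ", sequence_length(@fasta), " bp\n";   # 135 bp
```

Keeping the counting logic in a subroutine separates it from file handling, so the header check can later be swapped for a regular expression without touching the rest.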

15 Project

    # Suggestions for the start of the script:
    # make sure a file has been provided
    unless (@ARGV) {
        die "Please specify file name on command line!";
    }
    # initialise sequence variable
    $sequence = '';
    # 1. Read in file content (line-by-line)
    while ($input = <>) {
        …

16 Exiting a program
1. automatic exit at end of script
2. explicit exit with value:
       exit 0;   # default
   or
       exit 1;   # normally indicates an error
3. exit on failure:
       die "error message";
   (ending the message with "\n" suppresses the line number)
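
To see what the trailing "\n" changes, the sketch below captures the message that die would print, using eval so the program itself keeps running. The helper name die_message is hypothetical, and eval goes slightly beyond what the slides cover.

```perl
use strict;
use warnings;

# Hypothetical helper: run die inside eval and return the message
# that would otherwise be printed before the program exits.
sub die_message {
    my ($text) = @_;
    eval { die $text };
    return $@;
}

# With a trailing newline only the message itself is reported
print die_message("Please specify file name!\n");

# Without it, Perl appends " at <script> line <number>."
print die_message("Please specify file name!");
```

The version with the line number is handy while debugging; the version without is friendlier for messages meant for the user of the script.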

17 Computer Programming for Biologists Example: Exiting a program

18 Project
Practical: http://bioinf.gen.tcd.ie/GE3M25/programming/class5

19 Regular Expressions
- constructs that describe patterns
- powerful methods for text processing
- search for patterns in a string
- search and extract patterns
- search and replace patterns
- pattern at which to split a string

20 Regular Expressions
Examples:
- Look for a motif in a DNA/protein sequence
- Find low complexity repeats and mask with x's
- Find start of sequence string in GenBank record
- Extract e-mail addresses from a web-page
- Replace strings, e.g.: '@tcd.ie' with '@gmail.com'

21 Regular Expressions
Find a pattern in a string (stored in a variable):

    $sequence = 'ataggctagctaga';
    if ( $sequence =~ /ctag/ ) {
        print 'Found!';
    }

string in which to search: $sequence

22 Regular Expressions
Find a pattern in a string (stored in a variable):

    $sequence = 'ataggctagctaga';
    if ( $sequence =~ /ctag/ ) {
        print 'Found!';
    }

binding operator: =~

23 Regular Expressions
Find a pattern in a string (stored in a variable):

    $sequence = 'ataggctagctaga';
    if ( $sequence =~ /ctag/ ) {
        print 'Found!';
    }

pattern: ctag

24 Regular Expressions
Find a pattern in a string (stored in a variable):

    $sequence = 'ataggctagctaga';
    if ( $sequence =~ /ctag/ ) {
        print 'Found!';
    }

delimiters: the two slashes around the pattern

25 Regular Expressions
Find a pattern in a string (stored in a variable):

    $sequence = 'ataggctagctaga';
    if ( $sequence =~ /ctag/ ) {
        print 'Found!';
    }

- string in which to search: $sequence
- binding operator: =~
- pattern: ctag
- delimiters: the two slashes around the pattern

26 Regular Expressions
Find a pattern in a string (stored in a variable):

    $_ = 'ataggctagctaga';
    if ( /ctag/ ) {
        print 'Found!';
    }

Without binding // to a variable, the regular expression works on $_.

27 Regular Expressions
Search modifier:
i = make search case-insensitive

    $sequence = 'ataggctagctaga';
    if ( $sequence =~ /TAG/i ) {
        print 'Found!';
    }

28 Regular Expressions
Metacharacters:
^ = match at the beginning of a line
$ = match at the end of the line
. = match any character (except newline)
\ = escape the next metacharacter

    $sequence = ">sequence1\natgacctggaataggat";
    if ( $sequence =~ /^>/ ) {
        # line starts with '>'
        print 'Found Fasta header!';
    }

/\.$/ matches dot at end of line
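
One detail worth knowing: without a modifier, ^ and $ anchor to the start and end of the whole string, and only the m modifier makes them match at every line inside it. The sketch below illustrates this, together with the effect of escaping the dot. The test strings are made up for illustration.

```perl
use strict;
use warnings;

# Illustrative record in which the header is NOT the first line
my $record = "atgacc\n>sequence2\nggaata";

my $at_string_start = ($record =~ /^>/)  ? 1 : 0;   # '>' is not at the start of the string
my $at_line_start   = ($record =~ /^>/m) ? 1 : 0;   # m lets ^ match after each newline

# An unescaped dot matches any character, an escaped dot only a literal '.'
my $loose_dot   = ('seqXfa' =~ /seq.fa$/)  ? 1 : 0;
my $literal_dot = ('seqXfa' =~ /seq\.fa$/) ? 1 : 0;

print "$at_string_start $at_line_start $loose_dot $literal_dot\n";   # prints "0 1 1 0"
```

When a file is read line by line, as in the course project, each string holds a single line, so the plain /^>/ is sufficient.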

29 Project
Exercise: Modify your course project (sequanto.pl) to use a regular expression for detecting the header line, instead of 'substr' and 'eq' to check the first character.

30 Regular Expressions
Matching repetition:
a? = match 'a' 1 or 0 times
a* = match 'a' 0 or more times, i.e., any number of times
a+ = match 'a' 1 or more times, i.e., at least once
a{n,m} = match at least "n" times, but not more than "m" times
a{n,} = match "n" or more times
a{n} = match exactly "n" times

    $sequence =~ /a{5,}/;   # finds repeats of 5 or more 'a's
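
Quantifiers combine naturally with the earlier example of masking low complexity repeats with x's. The sketch below is one possible way to do it, not the course solution: it relies on substitution (introduced a few slides later) and on the e modifier, which evaluates the replacement as Perl code. The name mask_poly_a is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every run of 5 or more 'a's with the
# same number of 'x's, so the sequence keeps its original length.
sub mask_poly_a {
    my ($sequence) = @_;
    # g = replace all matches, e = evaluate the replacement as code
    $sequence =~ s/(a{5,})/'x' x length($1)/ge;
    return $sequence;
}

print mask_poly_a('ctgaaaaaaatgcaaag'), "\n";   # prints "ctgxxxxxxxtgcaaag"
```

The short run of three 'a's near the end is left alone because it does not reach the minimum of five.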

31 Regular Expressions
Search for classes of characters:
\d = match a digit character
\w = match a word character (alphanumeric and '_')
\D = match a non-digit character
\W = match a non-word character
\s = match a whitespace character
\S = match a non-whitespace character

    $date = '30 Jan 2009';
    if ( $date =~ /\d{1,2} \w+ \d{2,4}/ ) {
        print 'Correct date format!';
    }

also matches '1 February 09'
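
Because the pattern above is not anchored, it also succeeds when a date-like fragment sits inside a longer string. A minimal sketch of a stricter check, assuming the whole string should be the date; the name looks_like_date is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: same pattern as on the slide, but anchored
# with ^ and $ so that the entire string has to fit the format.
sub looks_like_date {
    my ($text) = @_;
    return $text =~ /^\d{1,2} \w+ \d{2,4}$/ ? 1 : 0;
}

foreach my $candidate ('30 Jan 2009', '1 February 09', 'id 12 abc 20091') {
    print "$candidate: ", looks_like_date($candidate), "\n";
}
```

The last candidate passes the unanchored pattern from the slide (it contains '12 abc 2009') but is rejected here.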

32 Regular Expressions
Match special characters:
\t = matches a tabulator (tab)
\b = matches a word boundary
\r = matches return
\n = matches UNIX newline
\cM = matches Control-M (line-ending in Windows)

    while (my $line = <>) {
        if ($line =~ /\cM/) {
            warn "Windows line-ending detected!";
        }
    }
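
Detecting Windows line-endings is one step; removing them is the usual follow-up, because on a Unix system chomp only strips the "\n" and leaves the "\r" behind. A minimal sketch, with strip_line_ending as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: remove a trailing UNIX ("\n") or
# Windows ("\r\n") line-ending from a line.
sub strip_line_ending {
    my ($line) = @_;
    $line =~ s/\r?\n$//;   # optional return, then newline, at the end
    return $line;
}

print length(strip_line_ending("acgt\r\n")), "\n";   # prints 4
print length(strip_line_ending("acgt\n")), "\n";     # prints 4
```

Without this step a leftover "\r" would be counted as part of the sequence and inflate the reported length.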

33 Regular Expressions
Search for range of characters:
[ ] = match any one of the characters listed within the brackets
- = specifies a range, e.g. [a-z], or [0-9]
^ = as first character inside the brackets: match any character not in the list, e.g. [^A-Z]

    $sequence = 'ataggctapgctaga';
    if ( $sequence =~ /[^acgt]/ ) {
        print "Sequence contains non-DNA character: $&";
    }

$& is a special variable containing the last pattern match.
$` and $' contain the strings before and after the match.
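
The test above stops at the first offending character. As a sketch of how all of them could be collected, a match with parentheses and the g modifier in list context returns every captured character (both features are introduced on later slides). The name non_dna_chars and the test sequence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every character that is not a, c, g or t.
sub non_dna_chars {
    my ($sequence) = @_;
    # In list context, a match with /g returns all captured substrings
    my @found = $sequence =~ /([^acgt])/g;
    return @found;
}

my @bad = non_dna_chars('ataggctapgctnga');
print "Non-DNA characters: @bad\n";   # prints "Non-DNA characters: p n"
```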

34 Regular Expressions
Search and replace (substitute): s/pattern1/pattern2/

    $sequence = 'ataggctagctaga';
    $rna = $sequence;
    $rna =~ s/t/u/;

-> 'auaggctagctaga'
Only the first match will be replaced!

35 Regular Expressions
Modifiers for substitution:
i = case-insensitive
g = global (replace all matches)
s = let '.' match newline as well

    $sequence = 'ataggctagctaga';
    $rna = $sequence;
    $rna =~ s/t/u/g;

-> 'auaggcuagcuaga'
replaces all 't' in the line with 'u'
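
A substitution also returns the number of replacements it made, which gives a quick way to count a base while converting it. A minimal sketch using the sequence from the slide; the variable name $count is chosen for illustration.

```perl
use strict;
use warnings;

my $sequence = 'ataggctagctaga';
my $rna = $sequence;

# s///g returns how many replacements were made
my $count = ($rna =~ s/t/u/g);

print "$rna ($count replacements)\n";   # prints "auaggcuagcuaga (3 replacements)"
```

If nothing matches, the substitution returns an empty (false) value, so the result can also be used directly in an if condition.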

36 Regular Expressions
Example: Clean up a sequence string:

    $sequence = " 1 ataggctagctagat 16 ttagagctagta ";
    $sequence =~ s/[^actg]//g;

-> 'ataggctagctagatttagagctagta'
Deletes everything that is not a, c, t, or g.

37 Regular Expressions
Extract matched patterns:
- put patterns in parentheses
- \1, \2, \3, … refer back to ()'s within the pattern match
- $1, $2, $3, … refer back to ()'s after the pattern match

    $sequence = ">test\natgtagagctagta";
    if ($sequence =~ /^>(.*)/) {
        $id = $1;
    }

or

    $email =~ s/(.*)\@(.*)\.(.*)/$1 at $2 dot $3/;
    print "Changed address to $1 at $2 dot $3\n";

changes 'kahokamp@tcd.ie' to 'kahokamp at tcd dot ie'
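
Captures are a convenient way to pull apart a Fasta header. The sketch below assumes the common layout of an identifier followed by an optional free-text description; parse_header is a hypothetical helper and the example header is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a Fasta header line into identifier
# and description. Returns an empty list for non-header lines.
sub parse_header {
    my ($line) = @_;
    # \S+ = identifier without whitespace, .* = rest of the line
    if ($line =~ /^>(\S+)\s*(.*)/) {
        return ($1, $2);
    }
    return;
}

my ($id, $description) = parse_header('>Tmsb10 thymosin beta 10');
print "ID: $id\n";
print "Description: $description\n";
```

Copying $1 and $2 into named variables right after the match is a good habit, since the next successful match overwrites them.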

38 Project
Practical: http://bioinf.gen.tcd.ie/GE3M25/programming/class5

39 Regular Expressions in split
Change a string into an array of characters:

    @array = split //, $string;

Split input line at tabs:

    @columns = split /\t/, $input_line;

Default splits $_ on whitespace:

    while (<>) {
        @columns = split;
        …
    }
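
As a small worked example of both uses of split, the sketch below counts each base in a sequence and pulls the columns out of a tab-separated line. The name base_counts and the example line are assumptions for illustration, and the hash goes beyond what this class covers.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each character occurs.
sub base_counts {
    my ($sequence) = @_;
    my %count;
    # split // turns the string into a list of single characters
    $count{$_}++ foreach split //, $sequence;
    return %count;
}

my %count = base_counts('ataggctagctaga');
foreach my $base (sort keys %count) {
    print "$base: $count{$base}\n";   # a: 5, c: 2, g: 4, t: 3
}

# Made-up tab-separated line: name, length, chromosome
my @columns = split /\t/, "Tmsb10\t135\tchr2";
print "Second column: $columns[1]\n";   # prints "Second column: 135"
```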

