
Finding substrings my $sequence = "gatgcaggctcgctagcggct"; #Does this string contain a startcodon? if ($sequence =~ m/atg/) { print "Yes"; } else { print.


1 Finding substrings

    my $sequence = "gatgcaggctcgctagcggct";
    # Does this string contain a start codon?
    if ($sequence =~ m/atg/) {
        print "Yes";
    } else {
        print "No";
    }

2 Finding substrings

=~ is the binding operator: it applies the operation on its right to the variable on its left. Here the operation is m/atg/, a substring search, where "m" stands for "match" and "atg" is the substring to look for.

3 Finding substrings

If the substring occurs, the match returns true and the if block is executed. The match does not change the value of $sequence.
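The behaviour described on this slide can be checked with a small sketch. The helper name has_start_codon is hypothetical (it does not appear in the slides); it only wraps the m/atg/ match so that its true/false result and the unchanged string can be inspected.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): 1 if "atg" occurs, else 0.
sub has_start_codon {
    my ($sequence) = @_;
    return $sequence =~ m/atg/ ? 1 : 0;
}

my $sequence = "gatgcaggctcgctagcggct";
my $before   = $sequence;

print has_start_codon($sequence) ? "Yes\n" : "No\n";

# The match only inspects the string; it does not modify it.
print "sequence unchanged\n" if $sequence eq $before;

# !~ is the negated binding operator: true when the pattern does NOT match.
print "no start codon in ccccc\n" if "ccccc" !~ m/atg/;
```

The negated form !~ is handy when the interesting case is the absence of a motif.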

4 Finding substrings, repeated

    my $sequence = "gatgcaggctcgctagcggct";
    my $count = 0;
    while ($sequence =~ m/ggc/g) {
        $count++;
    }
    print "$count matches for ggc\n";

5 m//g

The 'g' option allows repeated matching because the position of the last match is remembered: each new match attempt starts where the previous match ended.
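As a rough illustration of how m//g behaves, the sketch below counts matches in two ways: with a loop in scalar context, and by collecting all matches at once in list context. The sub names are hypothetical, and \Q...\E is used on the assumption that the motif should be treated as literal text.

```perl
use strict;
use warnings;

# Scalar context: each m//g call finds the next match, resuming at pos().
sub count_with_loop {
    my ($sequence, $motif) = @_;
    my $count = 0;
    $count++ while $sequence =~ m/\Q$motif\E/g;
    return $count;
}

# List context: m//g returns all (non-overlapping) matches at once.
sub count_with_list {
    my ($sequence, $motif) = @_;
    my @hits = $sequence =~ m/\Q$motif\E/g;
    return scalar @hits;
}

my $sequence = "gatgcaggctcgctagcggct";
print count_with_loop($sequence, "ggc"), "\n";   # prints 2
print count_with_list($sequence, "ggc"), "\n";   # prints 2
```

Both variants count non-overlapping matches only, because the search resumes after the end of the previous match.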


7 Finding substrings, repeated

    my $sequence = "gatgcaggctcgctagcggct";
    my $codon = "ggc";
    my $count = 0;
    while ($sequence =~ m/$codon/g) {
        $count++;
    }
    print "$count matches for $codon\n";

8 Position after last match

    my $sequence = "gatgcaggctcgctagcggct";
    my $codon = "ggc";
    print "looking for $codon from 0\n";
    while ($sequence =~ m/$codon/g) {
        print "found, will continue from: ";
        print pos($sequence), "\n";
    }

9 Position after last match

    my $sequence = "gatgcaggctcgctagcggct";
    my $codon = "ggc";
    pos($sequence) = 10;
    print "looking for $codon from 10\n";
    while ($sequence =~ m/$codon/g) {
        print "found, will continue from: ";
        print pos($sequence), "\n";
    }
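Since pos() reports the offset just after a match, the start of each match can be derived by subtracting the motif length. The following is a minimal sketch of that idea; match_starts and its optional third argument are hypothetical names introduced here, not part of the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: 0-based start offsets of every match of $motif,
# optionally beginning the search at offset $from.
sub match_starts {
    my ($sequence, $motif, $from) = @_;
    pos($sequence) = defined $from ? $from : 0;
    my @starts;
    while ($sequence =~ m/\Q$motif\E/g) {
        # pos() is the offset just after the match.
        push @starts, pos($sequence) - length($motif);
    }
    return @starts;
}

my $sequence = "gatgcaggctcgctagcggct";
print join(",", match_starts($sequence, "ggc")), "\n";       # prints 6,17
print join(",", match_starts($sequence, "ggc", 10)), "\n";   # prints 17
```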

10 Replacing substrings

    my $sequence = "gatgcagaattcgctagcggct";
    print $sequence, "\n";
    # Replace the EcoRI site with '******'
    $sequence =~ s/gaattc/******/;   # gatgca******gctagcggct
    # Replace all the other characters with space
    $sequence =~ s/[^*]/ /g;
    print $sequence, "\n";

Output:

    gatgcagaattcgctagcggct
          ******

11 Examples of regular expressions

s/World/Wur/                  replaces World with Wur, turning "Hello World" into "Hello Wur"
s/t/u/                        replaces the first 't' with 'u'; "atgtag" becomes "augtag"
s/t/u/g                       replaces all 't's with 'u's; "atgtag" becomes "auguag"
s/[gatc]/N/g                  replaces every g, a, t and c with N; "atgtag" becomes "NNNNNN"
s/[^gatc]//g                  deletes every character that is not g, a, t or c
s/a{3}/NNN/g                  replaces every 'aaa' with 'NNN'; "taaataa" becomes "tNNNtaa"
m/sq/i                        matches 'sq', 'Sq', 'sQ' and 'SQ' (case insensitive)
m/^SQ/                        matches 'SQ' at the beginning of the string
m/^[^S]/                      matches strings that do not begin with 'S'
m/att?g/                      matches 'attg' and 'atg'
m/a.g/                        matches 'atg', 'acg', 'aag', 'agg', 'a g', 'aHg', etc.
s/(\w+) (\w+)/$2 $1/          swaps two words; "one two" becomes "two one"
m/atg(...)*?(ta[ag]|tga)/     matches an ORF
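The ORF pattern at the end of the list can be tried on a short made-up sequence. This is only a sketch under a few assumptions: the helper first_orf is hypothetical, the codon placeholder is narrowed to [acgt]{3} instead of three arbitrary characters, and only the first ORF found from the left is returned.

```perl
use strict;
use warnings;

# Hypothetical helper: the first start-to-stop stretch, or undef if none.
# atg               start codon
# (?:[acgt]{3})*?   as few whole codons as possible (non-greedy)
# (?:ta[ag]|tga)    stop codon: taa, tag or tga
sub first_orf {
    my ($sequence) = @_;
    return $sequence =~ m/(atg(?:[acgt]{3})*?(?:ta[ag]|tga))/ ? $1 : undef;
}

my $orf = first_orf("ccatgaaatttgggtaacc");
print defined $orf ? "$orf\n" : "no ORF\n";   # prints atgaaatttgggtaa
```

The non-greedy *? matters here: with a greedy * the match would run on to the last in-frame stop codon instead of the first.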

12 The matched strings are stored

    my $text = "This is a piece of text\n";
    print $text;
    my $word = 0;
    while ($text =~ /(\w+)\W/g) {
        $word++;
        print "word $word: $1\n";
    }

13 The matched strings are stored

    my $text = "one two";
    $text =~ /(\w+) (\w+)/g;
    print "word one:$1 ";
    print "word two:$2 ";
    print "complete string: $&";
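Besides $1 and $2, the captured groups can be assigned directly by evaluating the match in list context. The sketch below shows this with a hypothetical swap_words helper, next to the equivalent substitution from the examples slide.

```perl
use strict;
use warnings;

# Hypothetical helper: swaps the first two space-separated words.
sub swap_words {
    my ($text) = @_;
    # In list context, a match without /g returns the captured groups.
    my ($first, $second) = $text =~ /(\w+) (\w+)/;
    return unless defined $second;   # no match: return nothing
    return "$second $first";
}

print swap_words("one two"), "\n";   # prints two one

# The same swap written as a substitution on a copy of the string.
(my $swapped = "one two") =~ s/(\w+) (\w+)/$2 $1/;
print "$swapped\n";                  # prints two one
```

Assigning captures to named variables avoids relying on $1 and $2, which keep their old values when a later match fails.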

14 The matched strings are stored

    my $sequence = "gatgcaggctcgctagcggct";
    while ($sequence =~ m/([acgt]{3})/g) {
        print "$1\n";
    }

15 Special characters

\t      tab
\n      newline
\r      return (CR)
\b      "word" boundary
\B      not a "word" boundary
\w      matches any single character classified as a "word" character (alphanumeric or _)
\W      matches any non-"word" character
\s      matches any whitespace character (space, tab, newline)
\S      matches any non-whitespace character
\d      matches any digit character, equiv. to [0-9]
\D      matches any non-digit character
\xhh    character with hex. code hh
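A few of these character classes can be seen at work on a FASTA header line. The sketch below is illustrative only: parse_fasta_header is a hypothetical helper, and it assumes the header has the form ">ID description".

```perl
use strict;
use warnings;

# Hypothetical helper: splits a FASTA header into ID and description.
# ^>      a literal '>' at the beginning of the line
# (\S+)   the ID: one or more non-whitespace characters
# \s+     one or more whitespace characters
# (.*)    the description: the rest of the line (. does not match newline)
sub parse_fasta_header {
    my ($line) = @_;
    if ($line =~ /^>(\S+)\s+(.*)/) {
        return ($1, $2);
    }
    return;
}

my ($id, $description) =
    parse_fasta_header(">BTBSCRYR Bovine mRNA for lens beta-s-crystallin\n");
print "ID: $id\n";                    # prints ID: BTBSCRYR
print "Description: $description\n";

# \d+ with /g in list context collects every run of digits.
my @numbers = "exon 12 to 345" =~ /(\d+)/g;
print join(",", @numbers), "\n";      # prints 12,345
```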

16 Metacharacters

^       beginning of string
$       end of string
.       any character except newline
*       match 0 or more times
+       match 1 or more times
?       match 0 or 1 times; or shortest match
|       alternative
( )     grouping, or storing
[ ]     set of characters
{ }     repetition modifier
\       quote or special

17 Repetition

a*        zero or more a's
a+        one or more a's
a?        zero or one a's (i.e., optional a)
a{m}      exactly m a's
a{m,}     at least m a's
a{m,n}    at least m but at most n a's
a{0,n}    at most n a's

    $mRNAsequence = "aaaauaaaaa";
    $mRNAsequence =~ m/a{2,}ua{3,}/;
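To see what the {m,} quantifiers in the last example accept and reject, the pattern can be run against a few test strings. This is a small sketch; has_motif and the extra candidate strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string contains at least two a's,
# then a u, then at least three a's.
sub has_motif {
    my ($sequence) = @_;
    return $sequence =~ m/a{2,}ua{3,}/ ? 1 : 0;
}

my $mRNAsequence = "aaaauaaaaa";
if ($mRNAsequence =~ m/(a{2,}ua{3,})/) {
    print "matched: $1\n";   # prints matched: aaaauaaaaa
}

for my $candidate ("aaaauaaaaa", "auaaa", "aauaa", "xaauaaax") {
    printf "%-12s %s\n", $candidate,
        has_motif($candidate) ? "match" : "no match";
}
```

"auaaa" fails because only one a precedes the u, and "aauaa" fails because only two a's follow it.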

18 Greediness

Pattern matching in Perl is greedy by default: a quantifier matches as many characters as possible. Appending ? to the quantifier makes it non-greedy, so it matches as few characters as possible.

    $sequence = "atgtagtagtagtagtag";
    # Applied to the original string, this removes the entire string:
    $sequence =~ s/atg(tag)*//;
    # Applied to the original string, this removes only "atg",
    # because the non-greedy (tag)*? is satisfied with zero repetitions:
    $sequence =~ s/atg(tag)*?//;
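The difference between greedy and non-greedy quantifiers is easiest to see by applying each pattern to a fresh copy of the same string. The sketch below does that; strip_with is a hypothetical helper, and the third pattern (with an explicit trailing tag) is an added variation, not taken from the slide.

```perl
use strict;
use warnings;

# Hypothetical helper: removes the first match of $regex from a copy
# of $sequence and returns the copy; the original is left untouched.
sub strip_with {
    my ($sequence, $regex) = @_;
    (my $copy = $sequence) =~ s/$regex//;
    return $copy;
}

my $sequence = "atgtagtagtagtagtag";

# Greedy: (tag)* takes all five tags, so nothing is left.
print "[", strip_with($sequence, qr/atg(tag)*/), "]\n";       # prints []

# Non-greedy: (tag)*? takes zero tags, so only atg is removed.
print "[", strip_with($sequence, qr/atg(tag)*?/), "]\n";      # prints [tagtagtagtagtag]

# Non-greedy followed by a required tag: stops at the first tag.
print "[", strip_with($sequence, qr/atg(tag)*?tag/), "]\n";   # prints [tagtagtagtag]
```

A non-greedy quantifier only extends as far as the rest of the pattern forces it to, which is why the third pattern needs something after (tag)*? to consume any tag at all.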

19

    open SEQFILE, "example1.fasta" or die "Cannot open example1.fasta: $!";
    my $sequence = "";
    # The first line of the file is the FASTA header
    my $ID = <SEQFILE>;
    # The remaining lines are joined into one sequence string
    while (<SEQFILE>) {
        chomp;
        $sequence .= $_;
    }
    print $ID;
    print $sequence, "\n";

    # SmaI restriction (ccc^ggg)
    $sequence =~ s/cccggg/ccc^ggg/g;
    # PvuII restriction (cag^ctg)
    $sequence =~ s/cagctg/cag^ctg/g;

    # Split the sequence at the marked cut positions
    my @sequenceFragments = split '\^', $sequence;
    print "\n", "-"x90, "\n";
    print "Digested sequence:\n", $sequence, "\n\n";
    print "-"x90, "\n";
    print "Fragments:\n";
    foreach my $fragment (@sequenceFragments) {
        print $fragment, "\n";
        print "-"x90, "\n";
    }

20

    >BTBSCRYR Bovine mRNA for lens beta-s-crystallin...
    tgcaccaaacatgtctaaagctggaaccaaaattactttctttgaagacaaaaactttcaaggccgccactatgacagcgattgcgactgtgcagatttcc
    acatgtacctgagccgctgcaactccatcagagtggaaggaggcacctgggctgtgtatgaaaggcccaattttgctgggtacatgtacatcctaccccgg
    ggcgagtatcctgagtaccagcactggatgggcctcaacgaccgcctcagctcctgcagggctgttcacctgtctagtggaggccagtataagcttcagat
    ctttgagaaaggggattttaatggtcagatgcatgagaccacggaagactgcccttccatcatggagcagttccacatgcgggaggtccactcctgtaagg
    tgctggagggcgcctggatcttctatgagctgcccaactaccgaggcaggcagtacctgctggacaagaaggagtaccggaagcccgtcgactggggtgca
    gcttccccagctgtccagtctttccgccgcattgtggagtgatgatacagatgcggccaaacgctggctggccttgtcatccaaataagcattataaataa
    aacaattggcatgc

    ------------------------------------------------------------------------------------------
    Digested sequence:
    tgcaccaaacatgtctaaagctggaaccaaaattactttctttgaagacaaaaactttcaaggccgccactatgacagcgattgcgactgtgcagatttcc
    acatgtacctgagccgctgcaactccatcagagtggaaggaggcacctgggctgtgtatgaaaggcccaattttgctgggtacatgtacatcctacccc^g
    gggcgagtatcctgagtaccagcactggatgggcctcaacgaccgcctcagctcctgcagggctgttcacctgtctagtggaggccagtataagcttcaga
    tctttgagaaaggggattttaatggtcagatgcatgagaccacggaagactgcccttccatcatggagcagttccacatgcgggaggtccactcctgtaag
    gtgctggagggcgcctggatcttctatgagctgcccaactaccgaggcaggcagtacctgctggacaagaaggagtaccggaagcccgtcgactggggtgc
    agcttccccag^ctgtccagtctttccgccgcattgtggagtgatgatacagatgcggccaaacgctggctggccttgtcatccaaataagcattataaat
    aaaacaattggcatgc

    ------------------------------------------------------------------------------------------
    Fragments:
    tgcaccaaacatgtctaaagctggaaccaaaattactttctttgaagacaaaaactttcaaggccgccactatgacagcgattgcgactgtgcagatttcc
    acatgtacctgagccgctgcaactccatcagagtggaaggaggcacctgggctgtgtatgaaaggcccaattttgctgggtacatgtacatcctacccc
    ------------------------------------------------------------------------------------------
    ggggcgagtatcctgagtaccagcactggatgggcctcaacgaccgcctcagctcctgcagggctgttcacctgtctagtggaggccagtataagcttcag
    atctttgagaaaggggattttaatggtcagatgcatgagaccacggaagactgcccttccatcatggagcagttccacatgcgggaggtccactcctgtaa
    ggtgctggagggcgcctggatcttctatgagctgcccaactaccgaggcaggcagtacctgctggacaagaaggagtaccggaagcccgtcgactggggtg
    cagcttccccag
    ------------------------------------------------------------------------------------------
    ctgtccagtctttccgccgcattgtggagtgatgatacagatgcggccaaacgctggctggccttgtcatccaaataagcattataaataaaacaattggc
    atgc
    ------------------------------------------------------------------------------------------
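The mark-then-split approach used for the digest can be wrapped in a reusable sub. The following is a sketch under stated assumptions: the name digest is hypothetical, each recognition site is passed as a string with ^ at the cut position (the notation used in the script's comments), and the short test sequence is made up rather than read from example1.fasta.

```perl
use strict;
use warnings;

# Hypothetical helper: cuts $sequence at every occurrence of the given
# sites. Each site is written with ^ at the cut position, e.g. "ccc^ggg".
sub digest {
    my ($sequence, @sites) = @_;
    for my $site (@sites) {
        # The recognition sequence is the site without the cut marker.
        (my $plain = $site) =~ s/\^//;
        # Mark every occurrence by inserting ^ at the cut position.
        $sequence =~ s/\Q$plain\E/$site/g;
    }
    # Split at the markers to obtain the fragments.
    return split /\^/, $sequence;
}

my @fragments = digest("aacccgggttcagctgaa", "ccc^ggg", "cag^ctg");
print "$_\n" for @fragments;
# prints:
# aaccc
# gggttcag
# ctgaa
```

Passing the sites as arguments keeps the cutting logic in one place, so other enzymes can be tried without editing the substitutions themselves.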

21 Exercises

6. Create a script to find the DNA fragments you get after cutting the sequence in the example1.fasta file with AluI and with AvaI.
7. Find the open reading frames in the example1.fasta sequence.
8. Translate the open reading frames to protein, using the standard genetic code from the Geneticcode database (http://srs.bioinformatics.nl).

