
BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING


1 Topics
    Quiz 3 Solutions
    More functions on arrays
    References and Dereferencing
    Two-dimensional arrays
    Using Hashes to Pass Parameters to subroutines
    Arrays of Hashes
    Hashes of Hashes
    Parsing
    Chapter 10 of Tisdall
    Program 3

2 Quiz 3 Solution: Problem 1
    #!/usr/bin/perl
    use strict;
    use warnings;
    {
        my($a) = -3;
        my($b) = 7;
        my($n) = 10;
        srand(0);
        my @numbers = myrandom($n, $a, $b);
        for (my $i = 0; $i < $n; $i++) {
            print "$numbers[$i] \n";
        }
        exit;
    }
    # return $n random numbers x with $a <= x < $b
    sub myrandom {
        my($n, $a, $b) = @_;
        my @nums;
        for (my $i = 0; $i < $n; $i++) {
            $nums[$i] = ($b-$a)*rand(1) + $a;
        }
        return @nums;
    }
Output: ten random numbers, one per line.
There is actually no simple way to generate a number x such that a <= x <= b. Because rand(1) returns a value in [0, 1), this generates numbers x such that a <= x < b.

3 Quiz 3 Solution: Problem 2
    #!/usr/bin/perl
    use strict;
    use warnings;
    {
        # define our hash
        my(%coins) = ("Cys", 25, "Asp", 10, "Glu", 5);
        my @keys = keys(%coins);
        my $key;
        # loop through our hash:
        foreach $key (@keys) {
            print "The value of $key is $coins{$key}\n";
        }
        exit;
    }
Output (the order of hash keys is not guaranteed):
    The value of Glu is 5
    The value of Cys is 25
    The value of Asp is 10

4 More Functions on Arrays
splice ARRAY,OFFSET,LENGTH,LIST
Removes the elements designated by OFFSET and LENGTH from an array, and replaces them with the elements of LIST, if any. The array grows or shrinks as necessary. In list context, returns the list of elements removed from the array.
    @a = qw/ A list of words /;
    splice(@a, 1, 0, "short");
    print "@a\n";
    @b = splice(@a, 1, 3, "few");
    print "@a\n";
    print "spliced out: @b\n";
Output:
    A short list of words
    A few words
    spliced out: short list of
Note: OFFSET is the offset from the start of the array. LENGTH is the number of elements, starting at OFFSET, to remove.

5 More Functions on Arrays
grep BLOCK LIST
grep EXPR, LIST
Evaluates the BLOCK or EXPR for each element of LIST (setting $_ to each element) and returns the list of those elements for which the expression evaluated to true. In scalar context, returns the number of times the expression was true.
    @nums = (12, 6, 3, 20, 22);
    @big = grep($_ > 10, @nums);     # @big = (12, 20, 22)
    $big = grep $_ > 10, @nums;      # $big = 3
    @words = qw/ A list of words /;
    @o_words = grep /o/, @words;     # @o_words = ("of", "words")

6 Using grep to Remove Comments from a File
    #!/usr/bin/perl -w
    # Thomas Bonham
    # 06/06/08
    if ($#ARGV != 0) {
        print "usage: path to the configuration\n";
        exit;
    }
    $fileName = $ARGV[0];
    open(O, "<$fileName") || die($!);
    open(N, ">$fileName.free") || die($!);
    # copy every line that does not start with # to the new file
    while (<O>) {
        next if ($_ =~ /^#.*/);
        print N $_;
    }
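The script above filters with a while loop and next rather than with grep itself. A grep-based version of the same idea might look like the following sketch. It works on a small in-memory list of lines that is made up for illustration; reading from and writing to files is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input; in the script above these lines come from a file.
my @lines = (
    "# a comment\n",
    "name = value\n",
    "# another comment\n",
    "path = /tmp\n",
);

# Keep only the lines that do not start with '#'.
my @kept = grep { !/^#/ } @lines;

print @kept;
```

For very large files the line-by-line loop is the gentler choice, since grep needs the whole list in memory at once.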

7 More Functions on Arrays
map BLOCK LIST
map EXPR, LIST
Evaluates the BLOCK or EXPR for each element of LIST (setting $_ to each element) and returns the list value of the results of each such evaluation.
    @words = qw/ A list of words /;
    @lengths = map { length($_) } @words;
    print "@lengths\n";     # prints: 1 4 2 5
    @upper = map { uc($_) } @words;
    print "@upper\n";       # prints: A LIST OF WORDS

8 Creating References
    my $fruit = fruit_i_like();
    sub fruit_i_like {
        my @fruit = ('apple', 'banana', 'orange');
        return \@fruit;
    }
What does $fruit hold after calling the subroutine? (A reference to the array.)
The backslash operator works with scalars, hashes and subroutines too:
    my $scalar_ref = \$a_scalar;
    my $hash_ref = \%a_hash;
    my $subroutine_ref = \&a_subroutine;
References to anonymous arrays and hashes:
    my $array_ref = ['apple', 'banana', 'orange'];
    my $hash_ref = {name => 'Becky', age => 23};
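As a small sketch of what the returned reference can be used for, the following script checks what kind of reference the subroutine hands back and then changes the array through it. The extra element 'kiwi' is only there for illustration.

```perl
use strict;
use warnings;

# Returns a reference to a lexical array; the array stays alive
# as long as some reference to it exists.
sub fruit_i_like {
    my @fruit = ('apple', 'banana', 'orange');
    return \@fruit;
}

my $fruit = fruit_i_like();

print ref($fruit), "\n";        # the kind of thing $fruit refers to
print scalar(@$fruit), "\n";    # number of elements in the referenced array
push @$fruit, 'kiwi';           # modify the array through the reference
print "@$fruit\n";
```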

9 Dereferencing – Part 1 (Via Sigils)
    #!/usr/bin/perl
    use strict;
    use warnings;
    my $scalar = "This is a scalar";
    my $scalar_ref = \$scalar;
    print "Reference: " . $scalar_ref . "\n";
    print "Dereferenced: " . $$scalar_ref . "\n";
Output:
    Reference: SCALAR(0x182a2b4)
    Dereferenced: This is a scalar
Similarly for arrays:
    #!/usr/bin/perl
    use strict;
    use warnings;
    my $array_ref = ['apple', 'banana', 'orange'];
    print "Reference: $array_ref\n";
    print "Dereferenced: @$array_ref\n";
Output:
    Reference: ARRAY(0x22a0a4)
    Dereferenced: apple banana orange

10 Dereferencing – Part 1
Similarly for hashes:
    #!/usr/bin/perl
    use strict;
    use warnings;
    my $hash_ref = {name => 'Becky', age => 23};
    my %hash = %$hash_ref;
    print "Reference: $hash_ref\n";
    print "Dereferenced:\n";
    foreach my $k (keys %hash) {
        print "$k: $hash{$k}\n";
    }
Output:
    Reference: HASH(0x22a0a4)
    Dereferenced:
    name: Becky
    age: 23

11 Dereferencing – Part 2
    #!/usr/bin/perl
    use strict;
    use warnings;
    my $scalar = "This is a scalar";
    my $scalar_ref = \$scalar;
    print "Reference: " . $scalar_ref . "\n";
    print "Dereferenced: " . ${$scalar_ref} . "\n";
Output:
    Reference: SCALAR(0x182a2b4)
    Dereferenced: This is a scalar

12 Dereferencing – Part 3
The arrow operator also allows you to dereference references to arrays or hashes.
    #!/usr/bin/perl
    use strict;
    use warnings;
    my $array_ref = ['apple', 'banana', 'orange'];
    print "My first fruit is: " . $array_ref->[0] . "\n";
Output:
    My first fruit is: apple
Here is similar code for a hash:
    #!/usr/bin/perl
    use strict;
    use warnings;
    my $hash_ref = {name => 'Becky', age => 23};
    foreach my $k (keys %$hash_ref) {
        print "$k: " . $hash_ref->{$k} . "\n";
    }
Output:
    name: Becky
    age: 23

13 Arrays, Lists and References
We saw that it is useful to pass arrays and hashes to subroutines using references, for example \@array and \%hash, which pass pointers instead of copies.
Sometimes it is useful to create a reference to an anonymous list by using square brackets [ ]:
    $B = [1, 2, 3, 4];      # $B points to a list
    print "@$B\n";          # this is called "dereferencing" $B
    1 2 3 4
    print "$$B[1]\n";       # think of "$B" as the "name" of the list
    2
We can create a two-dimensional array by creating an array of references:
    @A = ([1, 2, 3, 4], [5, 6, 7, 8]);
    print "A: @A\n";
    A: ARRAY(0x80dd60) ARRAY(0x80dda8)   -- these are the references in A
    print "$A[1][2]\n";
    7
Notice that the 0-th row is the first row and that the 0-th column is the first column.

14 Two-dimensional arrays
We can create a two-dimensional array by creating an array of references:
    @A = ([1, 2, 3, 4], [5, 6, 7, 8]);
Access a two-dimensional array by using double brackets:
    print "$A[1][2]\n";
    7
The number of rows is just the size of the array:
    $rows = @A;
We can get the size of the first row as follows:
    $cols = @{$A[0]};    # braces are needed: @$A[0] means something different (a slice of @$A)
Print out all rows and columns:
    for ($i = 0; $i < $rows; $i++) {
        for ($j = 0; $j < $cols; $j++) {
            print "$A[$i][$j] ";
        }
        print "\n";      # newline after each row
    }

15 Two-dimensional arrays
There is no need to declare the size of an array, so arrays can be created dynamically:
    my @A = ();
    my $rows = 100;
    my $cols = 100;
    # create a matrix with 1's on the diagonal
    for (my $i = 0; $i < $rows; $i++) {
        for (my $j = 0; $j < $cols; $j++) {
            $A[$i][$j] = 0;
        }
        $A[$i][$i] = 1;
    }
Array sizes can be changed dynamically:
    $A[0][200] = 123;    # first row now has 201 items
                         # but other rows are unaffected!

16 Two-dimensional arrays
Arrays can contain any scalar values.
Two-dimensional arrays in Perl do not have to be "rectangular": each row can have a different length.
    # this array has three rows with lengths 3, 4 and 1
    my @A = ( ["John", "Jim", "Bill"],
              [ 100, 23.5, "ATCGTTGA", \%codons ],
              [ 0 ] );

17 Example: pretty print a 2D array
    #!/usr/bin/perl
    use strict;
    use warnings;
    # File: print_array.pl
    my @A = ();
    # initialize the two dimensional array
    # with some numbers
    my $rows = 6;
    my $cols = 5;
    for (my $i=0; $i < $rows; $i++) {
        for (my $j=0; $j < $cols; $j++) {
            $A[$i][$j] = $i*100 + $j*17;
        }
    }
    # print and quit
    print_2Darray(@A);
    exit;
    # a subroutine to print out a two
    # dimensional rectangular array using
    # 5 digits per array element
    sub print_2Darray {
        my @a = @_;
        my $rows = @a;
        my $cols = @{$a[0]};
        for (my $i=0; $i < $rows; $i++) {
            for (my $j=0; $j < $cols; $j++) {
                printf "%5d ", $a[$i][$j];
            }
            print "\n";    # newline after each row
        }
    }
% print_array.pl
        0    17    34    51    68
      100   117   134   151   168
      200   217   234   251   268
      300   317   334   351   368
      400   417   434   451   468
      500   517   534   551   568

18 Example: transpose a 2D array (exchange rows and columns)
    #!/usr/bin/perl
    use strict;
    use warnings;
    # File: transpose.pl
    my @A = ();
    # initialize the two dimensional array
    my $rows = 6;
    my $cols = 5;
    for (my $i=0; $i < $rows; $i++) {
        for (my $j=0; $j < $cols; $j++) {
            $A[$i][$j] = $i*100 + $j*17;
        }
    }
    print "A:\n";
    print_2Darray(@A);
    my @B = transpose(@A);
    print "B:\n";
    print_2Darray(@B);
    exit;
    # return a new 2D array with rows and columns exchanged
    sub transpose {
        my @a = @_;
        my @b = ();
        my $rows = @a;
        my $cols = @{$a[0]};
        for (my $i=0; $i < $rows; $i++) {
            for (my $j=0; $j < $cols; $j++) {
                $b[$j][$i] = $a[$i][$j];
            }
        }
        return @b;
    }
    sub print_2Darray { ... }    # as on the previous slide

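19 % transpose.pl
A:
        0    17    34    51    68
      100   117   134   151   168
      200   217   234   251   268
      300   317   334   351   368
      400   417   434   451   468
      500   517   534   551   568
B:
        0   100   200   300   400   500
       17   117   217   317   417   517
       34   134   234   334   434   534
       51   151   251   351   451   551
       68   168   268   368   468   568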

20 Conditional Operator
    $a = COND ? VALUE_1 : VALUE_2;
is short for
    if COND then $a = VALUE_1 else $a = VALUE_2;
Examples:
    # silly way to compute abs($x)
    my $abs_x = ($x > 0) ? $x : -$x;
    # a safe way to compute an average
    my $ave = ($n > 0) ? $sum / $n : 0.0;
    # get max of a, b:
    my $max = ($a > $b) ? $a : $b;
    # safely update a hash count (the first occurrence counts as 1):
    $count{$x} = exists $count{$x} ? $count{$x} + 1 : 1;
Note that ?: binds more tightly than =, so the assignment has to be on the left; writing "exists $count{$x} ? $count{$x}++ : $count{$x} = 0" does not compile.
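A runnable sketch of the hash-count use of the conditional operator follows. The DNA string is made up for illustration, and the assignment is kept on the left-hand side so that operator precedence does not get in the way.

```perl
use strict;
use warnings;

my %count;
foreach my $x (split //, "ATGCATTA") {
    # the first sighting sets the count to 1, later ones add 1
    $count{$x} = exists $count{$x} ? $count{$x} + 1 : 1;
}

foreach my $base (sort keys %count) {
    print "$base: $count{$base}\n";
}
```

In everyday Perl the same effect is usually written as $count{$x}++, because incrementing a missing hash entry starts from zero without a warning.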

21 Hashes for Passing Parameters
Problem: suppose you have some subroutines that take many arguments.
    You don't want to have to remember the right order.
    You don't want to have to remember the default values.
Solution: pass in a hash of the form (arg1 => value1, arg2 => value2, etc.).
    This allows arguments to appear in any order in the calling program.
    The subroutine can detect missing arguments (using exists) and supply default values.

22 Example
Suppose we need to write a subroutine that prints DNA, with an optional header line, optional protein translation, and optional line numbers:
    # assume subroutines "print_header()", "print_protein()" and
    # "print_dna()" are already defined
    sub output_dna {
        my ($dna, $header, $linelength, $translate, $linenumbers) = @_;
        print_header($header) if (length($header) > 0);
        if ($translate) {
            print_protein($dna, $linelength, $linenumbers);
        } else {
            print_dna($dna, $linelength, $linenumbers);
        }
    }
Calls in the main program would look like this:
    output_dna($dna, $header, 60, 0, 1);   # header, no protein, line numbers
    output_dna($dna, "", 60, 1, 0);        # no header, protein, no line numbers
Problems:
    1. Even if you just want to print the DNA, you still have to give all the arguments (in the right order).
    2. Code is not self-documenting (will you remember what it does in 6 months?)
    3. Clumsy for other people to use.

23 REMINDER -- The conditional operator:
    $a = COND ? VALUE_1 : VALUE_2;   # if COND then $a = VALUE_1 else $a = VALUE_2
Version 2: note that missing arguments are set to a default value
    sub output_dna {
        my (%args) = @_;
        my $dna         = exists $args{dna}         ? $args{dna}         : "";
        my $header      = exists $args{header}      ? $args{header}      : "";
        my $linelength  = exists $args{linelength}  ? $args{linelength}  : 60;
        my $translate   = exists $args{translate}   ? $args{translate}   : 0;
        my $linenumbers = exists $args{linenumbers} ? $args{linenumbers} : 0;
        print_header($header) if (length($header) > 0);
        if ($translate) {
            print_protein($dna, $linelength, $linenumbers);
        } else {
            print_dna($dna, $linelength, $linenumbers);
        }
    }
Calls in the main program now look like this:
    output_dna(linelength=>30, dna=>$x);         # arguments can be in any order
    output_dna(dna=>$dnastring, translate=>1);   # missing args set to default values
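Version 2 above tests each key with exists. A common alternative, shown here only as a sketch, is to merge a hash of defaults with the caller's arguments. The subroutine name dna_options is hypothetical and not part of the lecture code; it returns the settings instead of printing DNA so that the defaults are easy to inspect.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the effective settings as a hash.
sub dna_options {
    my %defaults = (
        dna         => "",
        header      => "",
        linelength  => 60,
        translate   => 0,
        linenumbers => 0,
    );
    # The caller's pairs come last in the list, so they override the defaults.
    my %args = (%defaults, @_);
    return %args;
}

my %opt = dna_options(linelength => 30, dna => "ACGT");
foreach my $name (sort keys %opt) {
    print "$name => '$opt{$name}'\n";
}
```

The trade-off is that a misspelled argument name is silently accepted as an extra key, so a check against the keys of %defaults may be worth adding.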

24 Arrays of Hashes
    #!/usr/bin/perl -w
    # demonstrates an array of hashes
    use strict;
    use warnings;
    my $role;
    my @AoH = (
        { husband => "barney", wife => "betty", son => "bamm bamm", },
        { husband => "george", wife => "jane", son => "elroy", },
        { husband => "homer", wife => "marge", son => "bart", },
    );
    print "@AoH\n";
Output:
    HASH(0x22a0ac) HASH(0x229f8c) HASH(0x...)

25 Arrays of Hashes – Manipulating the Variables
You can set a key/value pair of a particular hash as follows:
    $AoH[0]{husband} = "fred";
To capitalize the husband of the second hash, apply a substitution:
    $AoH[1]{husband} =~ s/(\w)/\u$1/;
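The two statements above can be tried in a short self-contained script. This is a sketch that uses a cut-down version of the array from the previous slide; the push of a third hash is an extra step added for illustration.

```perl
use strict;
use warnings;

my @AoH = (
    { husband => "barney", wife => "betty", son => "bamm bamm" },
    { husband => "george", wife => "jane",  son => "elroy" },
);

$AoH[0]{husband} = "fred";            # overwrite a value in the first hash
$AoH[1]{husband} =~ s/(\w)/\u$1/;     # capitalize the first letter
push @AoH, { husband => "homer", wife => "marge", son => "bart" };  # append a hash

foreach my $href (@AoH) {
    print "$href->{husband}\n";
}
```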

26 Arrays of Hashes - Printing Method 1
    #!/usr/bin/perl -w
    # demonstrates an array of hashes
    use strict;
    use warnings;
    my $role;
    my $href;
    my @AoH = (
        { husband => "barney", wife => "betty", son => "bamm bamm", },
        { husband => "george", wife => "jane", son => "elroy", },
        { husband => "homer", wife => "marge", son => "bart", },
    );
    for $href ( @AoH ) {
        print "{ ";
        for $role ( keys %$href ) {
            print "$role=$href->{$role} ";
        }
        print "}\n";
    }
Output:
    { son=bamm bamm wife=betty husband=barney }
    { son=elroy wife=jane husband=george }
    { son=bart wife=marge husband=homer }
What does the -> do in the print statement?

27 Arrays of Hashes - Printing Method 2
    #!/usr/bin/perl -w
    # demonstrates an array of hashes
    use strict;
    use warnings;
    my $role;
    my $i;
    my @AoH = (
        { husband => "barney", wife => "betty", son => "bamm bamm", },
        { husband => "george", wife => "jane", son => "elroy", },
        { husband => "homer", wife => "marge", son => "bart", },
    );
    for $i ( 0 .. $#AoH ) {
        print "$i is { ";
        for $role ( keys %{ $AoH[$i] } ) {
            print "$role=$AoH[$i]{$role} ";
        }
        print "}\n";
    }
Output:
    0 is { son=bamm bamm wife=betty husband=barney }
    1 is { son=elroy wife=jane husband=george }
    2 is { son=bart wife=marge husband=homer }
Note that $#AoH is the subscript of the last element of the array @AoH.

28 Hashes of Hashes
    #!/usr/bin/perl -w
    # demonstrates a hash of hashes
    use strict;
    use warnings;
    my $family;
    my $role;
    my %HoH = (
        flintstones => { lead => "fred", pal => "barney", },
        jetsons => { lead => "george", wife => "jane", "his boy" => "elroy", },   # key quotes needed
        simpsons => { lead => "homer", wife => "marge", kid => "bart", },
    );
    # print the whole thing
    foreach $family ( keys %HoH ) {
        print "$family: ";
        foreach $role ( keys %{ $HoH{$family} } ) {
            print "$role=$HoH{$family}{$role} ";
        }
        print "\n";
    }
Output:
    simpsons: kid=bart lead=homer wife=marge
    jetsons: his boy=elroy lead=george wife=jane
    flintstones: lead=fred pal=barney

29 Hashes of Hashes - Printing
    #!/usr/bin/perl -w
    # demonstrates a hash of hashes
    use strict;
    use warnings;
    my $family;
    my $roles;
    my $role;
    my $person;
    my %HoH = (
        flintstones => { lead => "fred", pal => "barney", },
        jetsons => { lead => "george", wife => "jane", "his boy" => "elroy", },   # key quotes needed
        simpsons => { lead => "homer", wife => "marge", kid => "bart", },
    );
    # print the whole thing, using temporaries
    while ( ($family,$roles) = each %HoH ) {
        print "$family: ";
        while ( ($role,$person) = each %$roles ) {   # using each precludes sorting
            print "$role=$person ";
        }
        print "\n";
    }
Output:
    simpsons: kid=bart lead=homer wife=marge
    jetsons: his boy=elroy lead=george wife=jane
    flintstones: lead=fred pal=barney
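Because each hands back pairs in the hash's internal order, a version that needs a predictable order has to go through keys and sort instead. The following sketch does that for the same data; collecting the output lines in @lines is just a convenience for this example.

```perl
use strict;
use warnings;

my %HoH = (
    flintstones => { lead => "fred",   pal  => "barney" },
    jetsons     => { lead => "george", wife => "jane", "his boy" => "elroy" },
    simpsons    => { lead => "homer",  wife => "marge", kid => "bart" },
);

my @lines;
foreach my $family (sort keys %HoH) {
    my @pairs;
    foreach my $role (sort keys %{ $HoH{$family} }) {
        push @pairs, "$role=$HoH{$family}{$role}";
    }
    push @lines, "$family: @pairs";
}

print "$_\n" foreach @lines;
```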

30 Chapter 10: Parsing GenBank

31 Data Types and Parsers
A databank or data store
A flat file as compared to a database
We are going to explore the art of parsing using this data
Alternative parsing software repositories include
    National Institutes of Health (NIH)
    European Bioinformatics Institute (EBI)
    European Molecular Biology Laboratory (EMBL)

32 GenBank Files - I
    LOCUS       AB bp mRNA PRI 27-MAY-2000
    DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds.
    ACCESSION   AB
    VERSION     AB GI:
    KEYWORDS    .
    SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to mRNA.
      ORGANISM  Homo sapiens
                Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
    REFERENCE   1 (sites)
      AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and Takano,T.
      TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain, is regulated by proteolysis
      JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), (2000)
      MEDLINE
    REFERENCE   2 (bases 1 to 2487)
      AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and Takano,T.
      TITLE     Direct Submission
      JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases. Tadahiro Fujino, Keio University School of Medicine, Department of Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo , Japan Tel: (ex.62692), Fax: )
    FEATURES             Location/Qualifiers
         source          /organism="Homo sapiens"
                         /db_xref="taxon:9606"
                         /sex="male"
                         /cell_line="HuS-L12"
                         /cell_type="lung fibroblast"
                         /dev_stage="embryo"
         gene            /gene="PCCX1"
         CDS             /gene="PCCX1"
                         /note="a nuclear protein carrying a PHD finger and a CXXC domain"
                         /codon_start=1
                         /product="protein containing CXXC domain 1"
                         /protein_id="BAA "
                         /db_xref="GI: "

33 GenBank Files - II
    /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
    NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
    RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
    QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
    FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
    EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
    KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
    DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
    FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
    YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
    PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
    AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"

34 GenBank Files - III
    BASE COUNT 564 a 715 c 768 g 440 t
    ORIGIN
       1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
      61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
     121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
     181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
     241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
     301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
     361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
     421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
     481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat
     541 gagggtggag ggcgcaagag gcctgtccct gatccagacc tgcagcgccg ggcagggtca
     601 gggacagggg ttggggccat gcttgctcgg ggctctgctt cgccccacaa atcctctccg
     661 cagcccttgg tggccacacc cagccagcat caccagcagc agcagcagca gatcaaacgg
     721 tcagcccgca tgtgtggtga gtgtgaggca tgtcggcgca ctgaggactg tggtcactgt
     781 gatttctgtc gggacatgaa gaagttcggg ggccccaaca agatccggca gaagtgccgg
     841 ctgcgccagt gccagctgcg ggcccgggaa tcgtacaagt acttcccttc ctcgctctca
     901 ccagtgacgc cctcagagtc cctgccaagg ccccgccggc cactgcccac ccaacagcag
     961 ccacagccat cacagaagtt agggcgcatc cgtgaagatg agggggcagt ggcgtcatca
    1021 acagtcaagg agcctcctga ggctacagcc acacctgagc cactctcaga tgaggaccta
    …
    2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa
    //

35 A Little More About GenBank
FEATURES table: specific information about locations of exons, regulatory regions, important mutations, etc.
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2008 Jan;36(Database issue):D25-30).
There are approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions and 191,401,393,188 bases in 62,715,288 sequence records in the WGS division as of April 2011.

36 GenBank: Separating Sequence and Annotation
Method 1: slurp the GenBank record into an array and look through the lines.
Method 2: put the whole GenBank record into a scalar and use regular expressions to look through it.
Example files:
    record.gb
    library.gb

37 Parsing GenBank Files Using Arrays – Example 10-1
    #!/usr/bin/perl
    # Example 10-1 Extract annotation and sequence from GenBank file
    use strict;
    use warnings;
    use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
    use BeginPerlBioinfo;    # see Chapter 6 about this module
    # declare and initialize variables
    my @annotation = ( );
    my $sequence = '';
    my $filename = 'record.gb';
    parse1(\@annotation, \$sequence, $filename);
    # Print the annotation, and then
    # print the DNA in new format just to check if we got it okay.
    print @annotation;
    print_sequence($sequence, 50);
    exit;
    ################################################################
    # Subroutine
    ################################################################
    # parse1
    #
    # -parse annotation and sequence from GenBank record
    sub parse1 {
        my($annotation, $dna, $filename) = @_;
        # $annotation-reference to array
        # $dna       -reference to scalar
        # $filename  -scalar
        # declare and initialize variables
        my $in_sequence = 0;
        my @GenBankFile = ( );
        # Get the GenBank data into an array from a file
        @GenBankFile = get_file_data($filename);
        # Extract all the sequence lines
        foreach my $line (@GenBankFile) {

38 Parsing GenBank Files Using Arrays – Example 10-1 (continued)
            if( $line =~ /^\/\/\n/ ) {         # If $line is end-of-record line //\n,
                last;                          # break out of the foreach loop.
            } elsif( $in_sequence) {           # If we know we're in a sequence,
                $$dna .= $line;                # add the current line to $$dna.
            } elsif ( $line =~ /^ORIGIN/ ) {   # If $line begins a sequence,
                $in_sequence = 1;              # set the $in_sequence flag.
            } else {                           # Otherwise
                push( @$annotation, $line);    # add the current line to @$annotation.
            }
        }
        # remove whitespace and line numbers from DNA sequence
        $$dna =~ s/[\s0-9]//g;
    }

39 Some Musings on the Array Approach
    if( $line =~ /^\/\/\n/ ) {    # If $line is end-of-record line //\n,
        last;                     # break out of the foreach loop.
    }
If a pattern contains many forward slashes, we can change the delimiter to ! and write m!//\n! instead.
The tests appear in the reverse of the order in which things occur in the file. For example, the annotation lines are gathered first (by the final else), and a flag is set when the ORIGIN start-of-sequence line is found.
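The delimiter change can be checked with a small sketch that matches the same end-of-record line three ways. The variable names are made up for this example.

```perl
use strict;
use warnings;

my $line = "//\n";

# Default delimiters: every / in the pattern needs a backslash.
my $escaped  = ($line =~ /^\/\/\n/) ? 1 : 0;

# Alternative delimiters: with m!...! the slashes can be written plainly.
my $readable = ($line =~ m!^//\n!) ? 1 : 0;

# Braces work as delimiters too.
my $braces   = ($line =~ m{^//\n}) ? 1 : 0;

print "$escaped $readable $braces\n";
```

All three patterns are equivalent; only the amount of escaping differs.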

40 Parsing GenBank Files Using Scalars
We may end up with several lines in the same scalar; there are modifiers to our search code that can make our life easier.
Pattern modifiers:
    /g (What does this do?)
    /i (How about this one?)
Regular expression matching symbols:
    ^ (What does this do?)
    $ (How about this one?)
    . (What does this do?)

41 Special Multiline Regular Expression Modifiers
/m  Treat string as multiple lines. That is, change "^" and "$" from matching only at the very start or end of the string to matching at the start or end of any line anywhere within the string.
/s  Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which it normally would not match.
The /s and /m modifiers both override the $* setting ($* is an obsolete variable that was removed in Perl 5.10). That is, no matter what $* contains, /s without /m will force "^" to match only at the beginning of the string and "$" to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.

42 Pattern Modifiers in Action
    #!/usr/bin/perl
    use strict;
    use warnings;
    "AAC\nGTT" =~ /^.*$/;
    print $&, "\n";
    exit;
This fails because there is no match and $& has no value.
Suppose we use
    "AAC\nGTT" =~ /^.*$/m;
Output:
    AAC
Suppose we use
    "AAC\nGTT" =~ /^.*$/s;
Output:
    AAC
    GTT

43 Separating Annotations from Sequence
Recall that a GenBank record starts with LOCUS and ends with //
The input record separator:
    This is normally set to newline, so each call to read a scalar from a filehandle gets one line.
    The input record separator is denoted by $/
        $/ = "//\n";
    Now a call to read a scalar from a filehandle takes all data up to the GenBank end-of-record separator.
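A minimal sketch of record-at-a-time reading follows. To keep it self-contained it reads from an in-memory filehandle over a made-up two-record string instead of a real GenBank file, and it uses local so that $/ is restored automatically when the block ends.

```perl
use strict;
use warnings;

# Two tiny made-up records standing in for a GenBank library file.
my $library = "LOCUS A\nORIGIN\n1 acgt\n//\nLOCUS B\nORIGIN\n1 ttaa\n//\n";

# Open a read filehandle on the string (an in-memory file).
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "//\n";    # the old value comes back at the end of this block
    while (my $record = <$fh>) {
        push @records, $record;
    }
}
close($fh);

print scalar(@records), " records\n";
print $records[1];
```

Using local avoids having to save the old separator in a variable and assign it back by hand.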

44 Example 10-2 – Extract Annotation and Sequence from a GenBank Record
    #!/usr/bin/perl
    # Example 10-2 Extract the annotation and sequence sections from the first
    # record of a GenBank library
    use strict;
    use warnings;
    use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
    use BeginPerlBioinfo;    # see Chapter 6 about this module
    # Declare and initialize variables
    my $annotation = '';
    my $dna = '';
    my $record = '';
    my $filename = 'record.gb';
    my $save_input_separator = $/;
    # Open GenBank library file
    unless (open(GBFILE, $filename)) {
        print "Cannot open GenBank file \"$filename\"\n\n";
        exit;
    }
    # Set input separator to "//\n" and read in a record to a scalar
    $/ = "//\n";
    $record = <GBFILE>;
    # reset input separator
    $/ = $save_input_separator;
    # Now separate the annotation from the sequence data
    ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
    # Print the two pieces, which should give us the same as the
    # original GenBank file, minus the // at the end
    print $annotation, $dna;
    exit;

45 The Regular Expressions In Depth
    ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
What do the two sets of parentheses do?
What do we do with the contents of these parentheses?
What does the /s modifier do?
    ^(LOCUS.*ORIGIN\s*\n)    Can you explain this?
    \/\/\n                   What does this match?
Notice that it took one line of Perl code to extract annotation and sequence.

46 Parsing Annotations
    #!/usr/bin/perl -w
    # Example 10-3 Parsing GenBank annotations using arrays
    use strict;
    use warnings;
    use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
    use BeginPerlBioinfo;    # see Chapter 6 about this module
    # Declare and initialize variables
    my @genbank = ( );
    my $locus = '';
    my $accession = '';
    my $organism = '';
    # Get GenBank file
    @genbank = get_file_data('record.gb');
    # Let's start with something simple. Let's get some of the identifying
    # information, let's say the locus and accession number (here the same
    # thing) and the definition and the organism.
    for my $line (@genbank) {
        if($line =~ /^LOCUS/) {
            $line =~ s/^LOCUS\s*//;
            $locus = $line;
        }elsif($line =~ /^ACCESSION/) {
            $line =~ s/^ACCESSION\s*//;
            $accession = $line;
        }elsif($line =~ /^ ORGANISM/) {
            $line =~ s/^\s*ORGANISM\s*//;
            $organism = $line;
        }
    }
    print "*** LOCUS ***\n";
    print $locus;
    print "*** ACCESSION ***\n";
    print $accession;
    print "*** ORGANISM ***\n";
    print $organism;
    exit;
Notice that this program handles only single-line entries.

47 Tackling the Definition Field
    #!/usr/bin/perl -w
    # Example 10-4 Parsing GenBank annotations using arrays, take 2
    use strict;
    use warnings;
    use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
    use BeginPerlBioinfo;    # see Chapter 6 about this module
    # Declare and initialize variables
    my @genbank = ( );
    my $locus = '';
    my $accession = '';
    my $organism = '';
    my $definition = '';
    my $flag = 0;            # true while we are inside the DEFINITION field
    # Get GenBank file
    @genbank = get_file_data('record.gb');
    # Let's start with something simple. Let's get some of the identifying
    # information, let's say the locus and accession number (here the same
    # thing) and the definition and the organism.
    for my $line (@genbank) {
        if($line =~ /^LOCUS/) {
            $line =~ s/^LOCUS\s*//;
            $locus = $line;
        }elsif($line =~ /^DEFINITION/) {
            $line =~ s/^DEFINITION\s*//;
            $definition = $line;
            $flag = 1;
        }elsif($line =~ /^ACCESSION/) {
            $line =~ s/^ACCESSION\s*//;
            $accession = $line;
            $flag = 0;
        }elsif($flag) {
            chomp($definition);
            $definition .= $line;
        }elsif($line =~ /^ ORGANISM/) {
            $line =~ s/^\s*ORGANISM\s*//;
            $organism = $line;
        }
    }

48 Parsing the Annotations Using Regular Expressions
sub open_file: given the filename, return the filehandle
sub get_next_record: given the filehandle, get the record (we can get the offset by first calling tell)
sub get_annotation_and_dna: given a record, split it into annotation and cleaned-up sequence
sub search_sequence: given a sequence and a regular expression, return an array of locations of hits
sub search_annotation: given a GenBank annotation and a regular expression, return an array of locations of hits
sub parse_annotation: separate out the fields of the annotation in a convenient form
sub parse_features: given the features field, separate out the components

49 Byte offset
The Perl function tell allows us to save the byte offset of the record of interest.
The byte offset is the number of bytes into the file where the information of interest lies.
We can return "instantaneously" to this location in the file using the Perl function seek.
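How tell and seek fit together can be sketched as follows. The example uses an in-memory filehandle over a made-up three-record string so that it runs without a GenBank file; the hash %offset_of is a name chosen for this sketch.

```perl
use strict;
use warnings;

my $library = "LOCUS A\n//\nLOCUS B\n//\nLOCUS C\n//\n";
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!";

my %offset_of;    # first line of a record => byte offset where it starts
{
    local $/ = "//\n";
    my $offset = tell($fh);          # offset before reading the record
    while (my $record = <$fh>) {
        my ($first_line) = split /\n/, $record;
        $offset_of{$first_line} = $offset;
        $offset = tell($fh);         # offset of the next record
    }
}

# Jump straight back to the second record and read it again.
seek($fh, $offset_of{"LOCUS B"}, 0) or die "seek failed: $!";
my $again;
{
    local $/ = "//\n";
    $again = <$fh>;
}
close($fh);

print "LOCUS B starts at byte $offset_of{'LOCUS B'}\n";
print $again;
```

The third argument of seek is 0 here, which means the position is counted from the start of the file.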

50 Parsing of the Data
    1. Separate out the annotation and the sequence (some capability to search exists at this stage)
    2. Extract the fields
    3. Parse the features table

51 Example GenBank Library Subroutines
    #!/usr/bin/perl
    # Example test program of GenBank library subroutines
    use strict;
    use warnings;
    # Don't use BeginPerlBioinfo
    # since all subroutines are defined in this file
    # use BeginPerlBioinfo;    # see Chapter 6 about this module
    # Declare and initialize variables
    my $fh;    # variable to store filehandle
    my $record;
    my $dna;
    my $annotation;
    my $offset;
    my $library = 'library.gb';
    # Perform some standard subroutines for test
    $fh = open_file($library);
    $offset = tell($fh);
    while( $record = get_next_record($fh) ) {
        ($annotation, $dna) = get_annotation_and_dna($record);
        if( search_sequence($dna, 'AAA[CG].')) {
            print "Sequence found in record at offset $offset\n";
        }
        if( search_annotation($annotation, 'homo sapiens')) {
            print "Annotation found in record at offset $offset\n";
        }
        $offset = tell($fh);
    }
    exit;

52 Subroutines
    # open_file
    #
    # - given filename, set filehandle
    sub open_file {
        my($filename) = @_;
        my $fh;
        unless(open($fh, $filename)) {
            print "Cannot open file $filename\n";
            exit;
        }
        return $fh;
    }
    # get_next_record
    #
    # - given filehandle to open GenBank library file, get next record
    sub get_next_record {
        my($fh) = @_;
        my($offset);
        my($record) = '';
        my($save_input_separator) = $/;
        $/ = "//\n";
        $record = <$fh>;
        $/ = $save_input_separator;
        return $record;
    }

53 Subroutines
    # get_annotation_and_dna
    #
    # - given GenBank record, get annotation and DNA
    sub get_annotation_and_dna {
        my($record) = @_;
        my($annotation) = '';
        my($dna) = '';
        # Now separate the annotation from the sequence data
        ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);
        # clean the sequence of any whitespace, / characters or line-number digits
        # (the / has to be written \/ in the character class, because
        # / is the pattern delimiter, so it must be "escaped" with \)
        $dna =~ s/[\s\/\d]//g;
        return($annotation, $dna);
    }
    # search_sequence
    #
    # - search sequence with regular expression
    sub search_sequence {
        my($sequence, $regularexpression) = @_;
        my(@locations) = ( );
        while( $sequence =~ /$regularexpression/ig ) {
            # pos needs the variable being matched; a bare pos refers to $_
            push( @locations, pos($sequence) );
        }
        return (@locations);
    }

54 Subroutines
    # search_annotation
    #
    # - search annotation with regular expression
    sub search_annotation {
        my($annotation, $regularexpression) = @_;
        my(@locations) = ( );
        # note the /s modifier: . matches any character including newline
        while( $annotation =~ /$regularexpression/isg ) {
            push( @locations, pos($annotation) );
        }
        return (@locations);
    }
Output:
    Sequence found in record at offset 0
    Annotation found in record at offset 0
    Sequence found in record at offset 6358
    Annotation found in record at offset 6358
    Sequence found in record at offset
    Annotation found in record at offset
    Sequence found in record at offset
    Annotation found in record at offset
    Sequence found in record at offset
    Annotation found in record at offset 22722

55 BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING55 Parsing the Annotations at the Top Level
Let’s start with one of the simpler annotations:
DEFINITION Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds.
ACCESSION AB
VERSION AB GI:
We need a regular expression that matches everything from a word at the beginning of a line to a newline that just precedes another word at the beginning of a line.

56 BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING56 Parsing the Annotations at the Top Level
/^[A-Z].*\n(^\s.*\n)*/m
What does /m do? It lets ^ match at the start of every line in the string, not only at the start of the string.
^[A-Z].*\n matches a capital letter at the beginning of a line, followed by any number of characters (except newlines), followed by a newline.
(^\s.*\n)* matches a whitespace character (such as a space or tab) at the beginning of a line, followed by any number of characters (except newlines), followed by a newline. ()* means zero or more of these lines.
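To see the pattern at work, here is a small sketch that applies it to a made-up annotation. The field contents and the name DEMO0001 are invented for illustration; only the regular expression comes from the slide.

```perl
use strict;
use warnings;

# Hypothetical annotation text; continuation lines start with whitespace
my $annotation = <<'END';
LOCUS       DEMO0001     12 bp    mRNA
DEFINITION  A made-up definition that runs
            onto a second line.
ACCESSION   DEMO0001
END

# Each match is one top-level field, including its continuation lines
my @fields;
while ($annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm) {
    push @fields, $&;
}

printf "%d fields found\n", scalar @fields;
print "---\n$_" for @fields;
```

The DEFINITION field comes back as a single two-line string because its second line begins with whitespace and is therefore picked up by the (^\s.*\n)* part of the pattern.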

57 BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING57 Example 10-6 Parsing GenBank Data
#!/usr/bin/perl
# Example test program for parse_annotation subroutine
use strict;
use warnings;
use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
use BeginPerlBioinfo; # see Chapter 6 about this module

# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my $library = 'library.gb';

# Open library and read a record
$fh = open_file($library);
$record = get_next_record($fh);

# Parse the sequence and annotation
($annotation, $dna) = get_annotation_and_dna($record);

# Extract the fields of the annotation
%fields = parse_annotation($annotation);

# Print the fields
foreach my $key (keys %fields) {
    print "******** $key *********\n";
    print $fields{$key};
}
exit;

58 BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING58 Subroutines
# parse_annotation
#
# given a GenBank annotation, returns a hash with
# keys: the field names
# values: the fields
sub parse_annotation {
    my($annotation) = @_;
    my(%results) = ( );
    while( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) {
        my $value = $&;
        (my $key = $value) =~ s/^([A-Z]+).*/$1/s;
        $results{$key} = $value;
    }
    return %results;
}

59 BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING59 Extracting the key
while( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) {
    my $value = $&;
    (my $key = $value) =~ s/^([A-Z]+).*/$1/s;
    $results{$key} = $value;
}
What does the line (my $key = $value) =~ s/^([A-Z]+).*/$1/s; do?
First, it assigns $key the value of $value.
It uses the /s modifier so that . also matches the embedded newlines.
It then replaces the whole contents of $key with $1, a special variable that holds the text matched by the first pair of parentheses (here, the leading run of capital letters).
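The copy-then-substitute idiom can be tried on its own. The sketch below uses a made-up field value, and the second form (capturing in list context into a hypothetical variable $key2) is shown only for comparison; it is not from the slides.

```perl
use strict;
use warnings;

# Hypothetical field value, as one match of the top-level pattern would return it
my $value = "DEFINITION  A made-up definition that runs\n"
          . "            onto a second line.\n";

# The idiom from the slide: copy $value into $key, then edit $key in place
(my $key = $value) =~ s/^([A-Z]+).*/$1/s;

# An equivalent capture in list context; $value is left untouched here too
my ($key2) = $value =~ /^([A-Z]+)/;

print "key via substitution: $key\n";
print "key via capture:      $key2\n";
```

Because the substitution is bound to the copy, $value still holds the full two-line field afterwards, which is what gets stored in the hash.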

60 BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING60 Parsing the Features Table
Let’s tackle the source, gene and CDS feature keys:
source
    /organism="Homo sapiens"
    /db_xref="taxon:9606"
    /sex="male"
    /cell_line="HuS-L12"
    /cell_type="lung fibroblast"
    /dev_stage="embryo"
gene
    /gene="PCCX1"
CDS
    /gene="PCCX1"
    /note="a nuclear protein carrying a PHD finger and a CXXC
We will use an array rather than a hash, since there can be multiple instances of the same feature in a record.

61 BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING 61 Parsing the Features Table
#!/usr/bin/perl
# - main program to test parse_features
use strict;
use warnings;
use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
use BeginPerlBioinfo; # see Chapter 6 about this module

# Declare and initialize variables
my $fh;
my $record;
my $dna;
my $annotation;
my %fields;
my $library = 'library.gb';

# Get the fields from the first GenBank record in a library
$fh = open_file($library);
$record = get_next_record($fh);
($annotation, $dna) = get_annotation_and_dna($record);
%fields = parse_annotation($annotation);

# Extract the features from the FEATURES table
my @features = parse_features($fields{'FEATURES'});

# Print out the features
foreach my $feature (@features) {
    # extract the name of the feature (or "feature key")
    my($featurename) = ($feature =~ /^ {5}(\S+)/);
    print "******** $featurename *********\n";
    print $feature;
}
exit;

62 BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING62 Subroutine
# parse_features
#
# extract the features from the FEATURES field of a GenBank record
sub parse_features {
    my($features) = @_; # entire FEATURES field in a scalar variable

    # Declare and initialize variables
    my(@features) = (); # used to store the individual features

    # Extract the features
    while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {
        my $feature = $&;
        push(@features, $feature);
    }
    return @features;
}

63 BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING63 A Closer Look at the Crucial Regular Expression
while( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm )
^ {5}\S.*\n matches a line that begins with exactly 5 spaces, followed by a non-whitespace character, followed by any number of non-newline characters, followed by a newline.
(^ {21}\S.*\n)* then matches zero or more lines that each begin with exactly 21 spaces, followed by a non-whitespace character (\S), followed by any number of non-newline characters (.*), followed by a newline.
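The sketch below runs this pattern on a made-up FEATURES body that contains two gene features, and it also illustrates why the features are kept in an array: indexing them by name in a hash keeps only the last one. The feature text, the DEMO names and the variables $key_indent, $qual_indent and %by_name are assumptions for illustration. The indentation is built with the x operator so that the 5 and 21 space counts are explicit.

```perl
use strict;
use warnings;

my $key_indent  = ' ' x 5;    # feature keys start after 5 spaces
my $qual_indent = ' ' x 21;   # qualifier lines start after 21 spaces

# Hypothetical FEATURES body with two features of the same kind
my $features =
      "${key_indent}gene            1..12\n"
    . "${qual_indent}/gene=\"DEMO1\"\n"
    . "${key_indent}gene            20..30\n"
    . "${qual_indent}/gene=\"DEMO2\"\n"
    . "${qual_indent}/note=\"made-up qualifier\"\n";

# Same pattern as on the slide: one match per feature
my @features;
while ($features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm) {
    push @features, $&;
}

# Index the same features by name to show what a hash would lose
my %by_name;
for my $feature (@features) {
    my ($name) = $feature =~ /^ {5}(\S+)/;
    $by_name{$name} = $feature;   # a later "gene" overwrites the earlier one
}

printf "array holds %d features, hash holds %d\n",
    scalar @features, scalar keys %by_name;
```

A qualifier line can never be mistaken for the start of a new feature, because its sixth character is a space and ^ {5}\S requires a non-whitespace character there.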

64 BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING64 Homework for Next Week
Read Tisdall Chapter 10, with particular attention to the Indexing GenBank with DBM section
Exercises 10.3 and 10.6
Begin working on Program 3
Quiz 4 next week

