# BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING1 Topics Quiz 3 Solutions More functions on arrays References and Dereferencing Two-dimensional arrays.

## Topics

Quiz 3 solutions; more functions on arrays; references and dereferencing; two-dimensional arrays; using hashes to pass parameters to subroutines; arrays of hashes; hashes of hashes; parsing (Chapter 10 of Tisdall); Program 3.

## Quiz 3 Solution: Problem 1

    #!/usr/bin/perl
    use strict;
    use warnings;

    {
        my ($a) = -3;
        my ($b) = 7;
        my ($n) = 10;
        srand(0);
        my (@numbers) = myrandom($n, $a, $b);
        for (my $i = 0; $i < $n; $i++) {
            print "$numbers[$i]\n";
        }
        exit;
    }

    # return $n random numbers x with $a <= x < $b
    sub myrandom {
        my ($n, $a, $b) = @_;
        my @nums;
        for (my $i = 0; $i < $n; $i++) {
            $nums[$i] = ($b - $a) * rand(1) + $a;
        }
        return (@nums);
    }

Output:

    -2.9884033203125
    -0.64434814453125
    3.4813232421875

There is actually no simple way to generate a number x such that a <= x <= b. Because rand(1) returns a value r with 0 <= r < 1, this code generates numbers x such that a <= x < b.
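The remark about closed intervals applies to floating-point values. For integers, both endpoints can be included by widening the range by one before truncating. The following is a minimal sketch of that idea; `random_int` is a hypothetical helper name, not part of the quiz solution.

```perl
use strict;
use warnings;

# Hypothetical helper: random integer x with $lo <= x <= $hi.
# rand($k) returns a value in [0, $k), so int() yields 0 .. $k-1.
sub random_int {
    my ($lo, $hi) = @_;
    return $lo + int(rand($hi - $lo + 1));
}

srand(0);
my @draws = map { random_int(-3, 7) } 1 .. 1000;
my @bad   = grep { $_ < -3 || $_ > 7 } @draws;    # should stay empty
print "out of range: ", scalar(@bad), "\n";
```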

## Quiz 3 Solution: Problem 2

    #!/usr/bin/perl
    use strict;
    use warnings;

    {
        # define our hash
        my (%coins) = ("Cys", 25, "Asp", 10, "Glu", 5);
        my (@key_list) = keys(%coins);
        my $key;
        # loop through our hash:
        foreach $key (@key_list) {
            print "The value of $key is $coins{$key}\n";
        }
        exit;
    }

Output (hash keys come back in no particular order):

    The value of Glu is 5
    The value of Cys is 25
    The value of Asp is 10

## More Functions on Arrays: splice

splice ARRAY,OFFSET,LENGTH,LIST

Removes the elements designated by OFFSET and LENGTH from an array, and replaces them with the elements of LIST, if any. The array grows or shrinks as necessary. In list context, returns the list of elements removed from ARRAY.

    @A = qw/ A list of words /;
    splice @A, 1, 0, "short";
    print "@A\n";                 # A short list of words
    @B = splice @A, 1, 3, "few";
    print "@A\n";                 # A few words
    print "spliced out: @B\n";    # spliced out: short list of

Note: OFFSET is the index of the first affected element, counted from the start of the array. LENGTH is the number of elements to remove, starting at OFFSET.

## More Functions on Arrays: grep

grep BLOCK LIST
grep EXPR, LIST

Evaluates the BLOCK or EXPR for each element of LIST (setting \$_ to each element) and returns the list of those elements for which the expression evaluated to true. In scalar context, returns the number of times the expression was true.

    @A = (12, 6, 3, 20, 22);
    @B = grep($_ > 10, @A);     # @B = (12, 20, 22)
    $big = grep $_ > 10, @A;    # $big = 3
    @A = qw/ A list of words /;
    @B = grep /o/, @A;          # @B = ("of", "words")

## Using grep to Remove Comments from a File

    #!/usr/bin/perl -w
    # Thomas Bonham
    # 06/06/08
    if ($#ARGV != 0) {
        print "usage: path to the configuration\n";
        exit;
    }
    $fileName = $ARGV[0];
    open(O, "<$fileName") || die($!);
    open(N, ">$fileName.free") || die($!);
    while (<O>) {
        next if ($_ =~ /^#.*/);    # skip lines that start with '#'
        print N $_;
    }

Source: http://linuxgazette.net/152/misc/lg/2_cent_tip__removing_the_comments_out_of_a_configuration_file.html
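The script filters with a `while` loop and `next`. The same filter can be written with `grep` itself, as the slide title suggests. This is a minimal sketch that assumes the lines are already in memory; `strip_comments` and the sample lines are hypothetical and not part of the original tip.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the lines that do not start with '#'.
sub strip_comments {
    my @lines = @_;
    return grep { !/^#/ } @lines;
}

my @config = ("# a comment\n", "host = example\n", "#port = 1\n", "port = 22\n");
my @kept   = strip_comments(@config);
print @kept;    # prints the two non-comment lines
```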

## More Functions on Arrays: map

map BLOCK LIST
map EXPR, LIST

Evaluates the BLOCK or EXPR for each element of LIST (setting \$_ to each element) and returns the list value of the results of each such evaluation.

    @A = qw/ A list of words /;
    @L = map length, @A;
    print "@L\n";    # prints: 1 4 2 5
    @L = map uc, @A;
    print "@L\n";    # prints: A LIST OF WORDS

## Creating References

    my $fruit = fruit_i_like();

    sub fruit_i_like {
        my @fruit = ('apple', 'banana', 'orange');
        return \@fruit;
    }

What does \$fruit hold after the subroutine is called? The backslash operator works with scalars, hashes and subroutines too:

    my $scalar_ref     = \$a_scalar;
    my $hash_ref       = \%a_hash;
    my $subroutine_ref = \&a_subroutine;

References to anonymous arrays and hashes:

    my $array_ref = ['apple', 'banana', 'orange'];
    my $hash_ref  = {name => 'Becky', age => 23};

## Dereferencing – Part 1 (Via Sigils)

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $scalar = "This is a scalar";
    my $scalar_ref = \$scalar;
    print "Reference: " . $scalar_ref . "\n";
    print "Dereferenced: " . $$scalar_ref . "\n";

Output:

    Reference: SCALAR(0x182a2b4)
    Dereferenced: This is a scalar

Similarly for arrays:

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $array_ref = ['apple', 'banana', 'orange'];
    my @array = @$array_ref;
    print "Reference: $array_ref\n";
    print "Dereferenced: @array\n";

Output:

    Reference: ARRAY(0x22a0a4)
    Dereferenced: apple banana orange

Definition of "sigil": http://www.merriam-webster.com/dictionary/sigil

## Dereferencing – Part 1 (continued)

Similarly for hashes:

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $hash_ref = {name => 'Becky', age => 23};
    my %hash = %$hash_ref;
    print "Reference: $hash_ref\n";
    print "Dereferenced:\n";
    foreach my $k (keys %hash) {
        print "$k: $hash{$k}\n";
    }

Output:

    Reference: HASH(0x22a0a4)
    Dereferenced:
    name: Becky
    age: 23

## Dereferencing – Part 2

The reference can also be wrapped in braces after the sigil, as in \${\$scalar_ref}:

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $scalar = "This is a scalar";
    my $scalar_ref = \$scalar;
    print "Reference: " . $scalar_ref . "\n";
    print "Dereferenced: " . ${$scalar_ref} . "\n";

Output:

    Reference: SCALAR(0x182a2b4)
    Dereferenced: This is a scalar

## Dereferencing – Part 3

The arrow operator also allows you to dereference references to arrays or hashes.

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $array_ref = ['apple', 'banana', 'orange'];
    print "My first fruit is: " . $array_ref->[0] . "\n";

Output:

    My first fruit is: apple

Here is similar code for a hash:

    #!/usr/bin/perl
    use strict;
    use warnings;
    my $hash_ref = {name => 'Becky', age => 23};
    foreach my $k (keys %$hash_ref) {
        print "$k: " . $hash_ref->{$k} . "\n";
    }

Output:

    name: Becky
    age: 23

## Arrays, Lists and References

We saw that it is useful to pass arrays and hashes to subroutines using references:

    some_subroutine(\@a, \@b);    # passes references ("pointers") to @a and @b

Sometimes it is useful to create a reference to an anonymous array by using square brackets [ ]:

    $B = [1, 2, 3, 4];    # $B refers to an anonymous array
    print "@$B\n";        # this is called "dereferencing" $B; prints: 1 2 3 4
    print "$$B[1]\n";     # think of "$B" as the "name" of the array; prints: 2

We can create a two-dimensional array by creating an array of array references:

    @A = ([1, 2, 3, 4], [5, 6, 7, 8]);
    print "@A\n";         # ARRAY(0x80dd60) ARRAY(0x80dda8) -- these are the references in @A
    print "$A[1][2]\n";   # 7

Notice that row 0 is the first row and column 0 is the first column.

## Two-dimensional arrays

We can create a two-dimensional array by creating an array of array references:

    @A = ([1, 2, 3, 4], [5, 6, 7, 8]);

Access a two-dimensional array by using double brackets:

    print "$A[1][2]\n";    # 7

The number of rows is just the size of the array:

    $rows = scalar @A;

We can get the size of the first row as follows:

    # braces are needed: @$A[0] would be parsed as a slice of @$A, not as @{$A[0]}
    $cols = scalar @{$A[0]};

Print out all rows and columns:

    for ($i = 0; $i < $rows; $i++) {
        for ($j = 0; $j < $cols; $j++) {
            print "$A[$i][$j] ";
        }
        print "\n";    # newline after each row
    }

## Two-dimensional arrays

There is no need to declare the size of an array, so arrays can be created dynamically:

    my @A = ();
    my $rows = 100;
    my $cols = 100;
    # create a matrix with 1's on the diagonal
    for (my $i = 0; $i < $rows; $i++) {
        for (my $j = 0; $j < $cols; $j++) {
            $A[$i][$j] = 0;
        }
        $A[$i][$i] = 1;
    }

Array sizes can be changed dynamically:

    $A[0][200] = 123;    # first row now has 201 items
                         # but other rows are unaffected!

## Two-dimensional arrays

Arrays can contain any scalar values. Two-dimensional arrays in Perl do not have to be "rectangular": each row can have a different length.

    # this array has three rows with lengths 3, 4 and 1
    my @A = (
        ["John", "Jim", "Bill"],
        [100, 23.5, "ATCGTTGA", \%codons],
        [0],
    );
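Because each row is its own array reference, the row lengths have to be measured one row at a time. A small sketch of that follows; the one-entry `%codons` hash is only a stand-in for the hash the slide assumes.

```perl
use strict;
use warnings;

my %codons = (ATG => 'M');    # stand-in for the %codons hash on the slide
my @A = (
    ["John", "Jim", "Bill"],
    [100, 23.5, "ATCGTTGA", \%codons],
    [0],
);

# Each row is an array reference, so its length is scalar @{ $A[$i] }.
my @lengths = map { scalar @$_ } @A;
print "row lengths: @lengths\n";

# The last cell of row 1 is a hash reference and can be indexed further.
print "ATG codes for $A[1][3]{ATG}\n";
```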

## Example: pretty print a 2D array

    #!/usr/bin/perl
    use strict;
    use warnings;
    # File: print_array.pl
    my @A = ();
    # initialize the two dimensional array
    # with some numbers
    my $rows = 6;
    my $cols = 5;
    for (my $i=0; $i < $rows; $i++) {
        for (my $j=0; $j < $cols; $j++) {
            $A[$i][$j] = $i*100 + $j*17;
        }
    }
    # print and quit
    print_2Darray(@A);
    exit;

    # a subroutine to print out a two
    # dimensional rectangular array using
    # 5 digits per array element
    sub print_2Darray {
        my (@a) = @_;
        my $rows = scalar @a;
        my $cols = scalar @{$a[0]};
        for (my $i=0; $i < $rows; $i++) {
            for (my $j=0; $j < $cols; $j++) {
                printf "%5d ", $a[$i][$j];
            }
            print "\n";    # newline after each row
        }
    }

Output:

    % print_array.pl
        0    17    34    51    68
      100   117   134   151   168
      200   217   234   251   268
      300   317   334   351   368
      400   417   434   451   468
      500   517   534   551   568

## Example: transpose a 2D array (exchange rows and columns)

    #!/usr/bin/perl
    use strict;
    use warnings;
    # File: transpose.pl
    my @A = ();
    # initialize the two dimensional array
    my $rows = 6;
    my $cols = 5;
    for (my $i=0; $i < $rows; $i++) {
        for (my $j=0; $j < $cols; $j++) {
            $A[$i][$j] = $i*100 + $j*17;
        }
    }
    print "A:\n";
    print_2Darray(@A);
    my @B = transpose(@A);
    print "B:\n";
    print_2Darray(@B);
    exit;

    sub transpose {
        my (@a) = @_;
        my @b = ();
        my $rows = scalar @a;
        my $cols = scalar @{$a[0]};
        for (my $i=0; $i < $rows; $i++) {
            for (my $j=0; $j < $cols; $j++) {
                $b[$j][$i] = $a[$i][$j];    # row i, column j becomes row j, column i
            }
        }
        return @b;
    }

    sub print_2Darray { ... }    # as defined in print_array.pl

Output:

    % transpose.pl
    A:
        0    17    34    51    68
      100   117   134   151   168
      200   217   234   251   268
      300   317   334   351   368
      400   417   434   451   468
      500   517   534   551   568
    B:
        0   100   200   300   400   500
       17   117   217   317   417   517
       34   134   234   334   434   534
       51   151   251   351   451   551
       68   168   268   368   468   568

## Conditional Operator

    $a = COND ? VALUE_1 : VALUE_2;

is short for

    if (COND) { $a = VALUE_1 } else { $a = VALUE_2 }

Examples:

    # silly way to compute abs($x)
    my $abs_x = ($x > 0) ? $x : -$x;
    # a safe way to compute an average
    my $ave = ($n > 0) ? $sum / $n : 0.0;
    # get max of a, b:
    my $max = ($a > $b) ? $a : $b;
    # safely update a hash count (the first occurrence counts as 1):
    $count{$x} = exists $count{$x} ? $count{$x} + 1 : 1;
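The counting and guarded-division idioms can be tried on a short DNA string. The sketch below assumes a hypothetical `count_bases` helper and made-up input values.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each base occurs,
# using the conditional operator to handle the first occurrence.
sub count_bases {
    my ($dna) = @_;
    my %count;
    foreach my $base (split //, $dna) {
        $count{$base} = exists $count{$base} ? $count{$base} + 1 : 1;
    }
    return %count;
}

my %count = count_bases("AACGT");

my $n   = 0;
my $sum = 10;
my $ave = ($n > 0) ? $sum / $n : 0.0;    # guarded division: no divide-by-zero

print "A occurs $count{A} times; average is $ave\n";
```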

## Hashes for Passing Parameters

Problem: suppose you have some subroutines that take many arguments. You don't want to have to remember the right order, and you don't want to have to remember the default values.

Solution: pass in a hash of the form (arg1 => value1, arg2 => value2, etc.). This allows arguments to appear in any order in the calling program. The subroutine can detect missing arguments (using exists) and supply default values.

Example: Suppose we need to write a subroutine that prints DNA, with an optional header line, optional protein translation, and optional line numbers:

    # assume subroutines "print_header()", "print_protein()" and
    # "print_dna()" are already defined
    sub output_dna {
        my ($dna, $header, $linelength, $translate, $linenumbers) = @_;
        print_header($header) if (length $header > 0);
        if ($translate) {
            print_protein($dna, $linelength, $linenumbers);
        } else {
            print_dna($dna, $linelength, $linenumbers);
        }
    }

Calls in the main program would look like this:

    output_dna($dna, $header, 60, 0, 1);    # header, no protein, line numbers
    output_dna($dna, "", 60, 1, 0);         # no header, protein, no line numbers

Problems:

1. Even if you just want to print the DNA, you still have to give all the arguments (in the right order).
2. Code is not self-documenting (will you remember what it does in 6 months?).
3. Clumsy for other people to use.

REMINDER -- The conditional operator:

    $a = COND ? VALUE_1 : VALUE_2;    # if COND then $a = VALUE_1 else $a = VALUE_2

Version 2: note that missing arguments are set to a default value.

    sub output_dna {
        my (%args) = @_;
        my $dna         = exists $args{dna}         ? $args{dna}         : "";
        my $header      = exists $args{header}      ? $args{header}      : "";
        my $linelength  = exists $args{linelength}  ? $args{linelength}  : 60;
        my $translate   = exists $args{translate}   ? $args{translate}   : 0;
        my $linenumbers = exists $args{linenumbers} ? $args{linenumbers} : 0;
        print_header($header) if (length $header > 0);
        if ($translate) {
            print_protein($dna, $linelength, $linenumbers);
        } else {
            print_dna($dna, $linelength, $linenumbers);
        }
    }

Calls in the main program now look like this:

    output_dna(linelength => 30, dna => $x);           # arguments can be in any order
    output_dna(dna => $dnastring, translate => 1);     # missing args set to default values
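A common alternative to one `exists` test per argument is to merge the caller's hash over a hash of defaults. The following is a sketch of that variant, not the version from the slides; `parse_args` is a hypothetical helper, and the default values are the ones used in Version 2.

```perl
use strict;
use warnings;

# Hypothetical helper: merge caller arguments over a table of defaults.
sub parse_args {
    my (%args) = @_;
    my %defaults = (
        dna         => "",
        header      => "",
        linelength  => 60,
        translate   => 0,
        linenumbers => 0,
    );
    return (%defaults, %args);    # later pairs override earlier ones
}

my %opts = parse_args(linelength => 30, dna => "ACGT");
print "$_=$opts{$_}\n" for sort keys %opts;
```

The merge relies on list flattening: when the combined list is assigned to a hash, a key that appears twice keeps the value that came last, so caller values win over defaults.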

## Arrays of Hashes

    #!/usr/bin/perl -w
    # demonstrates an array of hashes
    use strict;
    use warnings;
    my @AoH;
    my $role;
    my $href;
    @AoH = (
        { husband => "barney", wife => "betty", son => "bamm bamm", },
        { husband => "george", wife => "jane",  son => "elroy", },
        { husband => "homer",  wife => "marge", son => "bart", },
    );
    print "@AoH \n";

Output:

    HASH(0x22a0ac) HASH(0x229f8c) HASH(0x1846024)

## Arrays of Hashes – Manipulating the Variables

You can set a key/value pair of a particular hash as follows:

    $AoH[0]{husband} = "fred";

To capitalize the husband in the second hash of the array, apply a substitution:

    $AoH[1]{husband} =~ s/(\w)/\u$1/;

## Arrays of Hashes - Printing Method 1

    #!/usr/bin/perl -w
    # demonstrates an array of hashes
    use strict;
    use warnings;
    my @AoH;
    my $role;
    my $href;
    @AoH = (
        { husband => "barney", wife => "betty", son => "bamm bamm", },
        { husband => "george", wife => "jane",  son => "elroy", },
        { husband => "homer",  wife => "marge", son => "bart", },
    );
    for $href ( @AoH ) {
        print "{ ";
        for $role ( keys %$href ) {
            print "$role=$href->{$role} ";
        }
        print "}\n";
    }

Output:

    { son=bamm bamm wife=betty husband=barney }
    { son=elroy wife=jane husband=george }
    { son=bart wife=marge husband=homer }

What does the -> do in the print statement?

## Arrays of Hashes - Printing Method 2

    #!/usr/bin/perl -w
    # demonstrates an array of hashes
    use strict;
    use warnings;
    my @AoH;
    my $role;
    my $href;
    my $i;
    @AoH = (
        { husband => "barney", wife => "betty", son => "bamm bamm", },
        { husband => "george", wife => "jane",  son => "elroy", },
        { husband => "homer",  wife => "marge", son => "bart", },
    );
    for $i ( 0 .. $#AoH ) {
        print "$i is { ";
        for $role ( keys %{ $AoH[$i] } ) {
            print "$role=$AoH[$i]{$role} ";
        }
        print "}\n";
    }

Output:

    0 is { son=bamm bamm wife=betty husband=barney }
    1 is { son=elroy wife=jane husband=george }
    2 is { son=bart wife=marge husband=homer }

Note that \$#AoH is the subscript of the last element of the array @AoH.

## Hashes of Hashes

    #!/usr/bin/perl -w
    # demonstrates a hash of hashes
    use strict;
    use warnings;
    my $family;
    my $role;
    my %HoH = (
        flintstones => { lead => "fred", pal => "barney", },
        jetsons => {
            lead => "george",
            wife => "jane",
            "his boy" => "elroy",    # key quotes needed
        },
        simpsons => { lead => "homer", wife => "marge", kid => "bart", },
    );
    # print the whole thing
    foreach $family ( keys %HoH ) {
        print "$family: ";
        foreach $role ( keys %{ $HoH{$family} } ) {
            print "$role=$HoH{$family}{$role} ";
        }
        print "\n";
    }

Output:

    simpsons: kid=bart lead=homer wife=marge
    jetsons: his boy=elroy lead=george wife=jane
    flintstones: lead=fred pal=barney

## Hashes of Hashes - Printing

    #!/usr/bin/perl -w
    # demonstrates a hash of hashes
    use strict;
    use warnings;
    my $family;
    my $roles;
    my $role;
    my $person;
    my %HoH = (
        flintstones => { lead => "fred", pal => "barney", },
        jetsons => {
            lead => "george",
            wife => "jane",
            "his boy" => "elroy",    # key quotes needed
        },
        simpsons => { lead => "homer", wife => "marge", kid => "bart", },
    );
    # print the whole thing, using temporaries
    while ( ($family, $roles) = each %HoH ) {
        print "$family: ";
        while ( ($role, $person) = each %$roles ) {    # using each precludes sorting
            print "$role=$person ";
        }
        print "\n";
    }

Output:

    simpsons: kid=bart lead=homer wife=marge
    jetsons: his boy=elroy lead=george wife=jane
    flintstones: lead=fred pal=barney
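Since `each` precludes sorting, a reproducible listing needs `sort keys` at both levels. A minimal sketch using the same data as the slide follows; collecting the lines in `@lines` before printing is a choice made here for illustration only.

```perl
use strict;
use warnings;

my %HoH = (
    flintstones => { lead => "fred",   pal  => "barney" },
    jetsons     => { lead => "george", wife => "jane", "his boy" => "elroy" },
    simpsons    => { lead => "homer",  wife => "marge", kid => "bart" },
);

# Build one line per family, with families and roles in sorted order.
my @lines;
foreach my $family (sort keys %HoH) {
    my @pairs = map { "$_=$HoH{$family}{$_}" } sort keys %{ $HoH{$family} };
    push @lines, "$family: @pairs";
}
print "$_\n" for @lines;
```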

## Chapter 10: Parsing GenBank

## Data Types and Parsers

A databank or data store. A flat file as compared to a database. We are going to explore the art of parsing using this data. Alternative parsing software repositories include:

- National Institutes of Health (NIH): www.ncbi.nlm.nih.gov
- European Bioinformatics Institute (EBI): www.ebi.ac.uk
- European Molecular Biology Laboratory (EMBL): www.embl.de

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING32 GenBank Files - I LOCUS AB031069 2487 bp mRNA PRI 27-MAY-2000 DEFINITION Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds. ACCESSION AB031069 VERSION AB031069.1 GI:8100074 KEYWORDS. SOURCE Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to mRNA. ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (sites) AUTHORS Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and Takano,T. TITLE PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain, is regulated by proteolysis JOURNAL Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000) MEDLINE 20261256 REFERENCE 2 (bases 1 to 2487) AUTHORS Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and Takano,T. TITLE Direct Submission JOURNAL Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases. Tadahiro Fujino, Keio University School of Medicine, Department of Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan (E-mail:fujino@microb.med.keio.ac.jp, Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508) FEATURES Location/Qualifiers source 1..2487 /organism="Homo sapiens" /db_xref="taxon:9606" /sex="male" /cell_line="HuS-L12" /cell_type="lung fibroblast" /dev_stage="embryo" gene 229..2199 /gene="PCCX1" CDS 229..2199 /gene="PCCX1" /note="a nuclear protein carrying a PHD finger and a CXXC domain" /codon_start=1 /product="protein containing CXXC domain 1" /protein_id="BAA96307.1" /db_xref="GI:8100075"

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING34 GenBank Files - III BASE COUNT 564 a 715 c 768 g 440 t ORIGIN 1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg 61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg 121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt 181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat 241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat 301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac 361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc 421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat 481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat 541 gagggtggag ggcgcaagag gcctgtccct gatccagacc tgcagcgccg ggcagggtca 601 gggacagggg ttggggccat gcttgctcgg ggctctgctt cgccccacaa atcctctccg 661 cagcccttgg tggccacacc cagccagcat caccagcagc agcagcagca gatcaaacgg 721 tcagcccgca tgtgtggtga gtgtgaggca tgtcggcgca ctgaggactg tggtcactgt 781 gatttctgtc gggacatgaa gaagttcggg ggccccaaca agatccggca gaagtgccgg 841 ctgcgccagt gccagctgcg ggcccgggaa tcgtacaagt acttcccttc ctcgctctca 901 ccagtgacgc cctcagagtc cctgccaagg ccccgccggc cactgcccac ccaacagcag 961 ccacagccat cacagaagtt agggcgcatc cgtgaagatg agggggcagt ggcgtcatca 1021 acagtcaagg agcctcctga ggctacagcc acacctgagc cactctcaga tgaggaccta … 2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa //

## A Little More About GenBank

The FEATURES table holds specific information about locations of exons, regulatory regions, important mutations, etc. More information is available at http://www.ncbi.nlm.nih.gov/genbank/GenBankFtp.html and http://www.ncbi.nlm.nih.gov/Genbank/

GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2008 Jan;36(Database issue):D25-30). There are approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions and 191,401,393,188 bases in 62,715,288 sequence records in the WGS division as of April 2011.

## GenBank: Separating Sequence and Annotation

Method 1: Slurp the GenBank record into an array and look through the lines.

Method 2: Put the whole GenBank record into a scalar and use regular expressions to look through it.

Example files: record.gb, library.gb

## Parsing GenBank Files Using Arrays – Example 10-1

    #!/usr/bin/perl
    # Example 10-1 Extract annotation and sequence from GenBank file
    use strict;
    use warnings;
    use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
    use BeginPerlBioinfo;    # see Chapter 6 about this module

    # declare and initialize variables
    my @annotation = ( );
    my $sequence = '';
    my $filename = 'record.gb';

    parse1(\@annotation, \$sequence, $filename);

    # Print the annotation, and then
    # print the DNA in new format just to check if we got it okay.
    print @annotation;
    print_sequence($sequence, 50);
    exit;

    ################################################################################
    # Subroutine
    ################################################################################
    # parse1
    #
    # -parse annotation and sequence from GenBank record
    sub parse1 {
        my($annotation, $dna, $filename) = @_;
        # $annotation - reference to array
        # $dna        - reference to scalar
        # $filename   - scalar

        # declare and initialize variables
        my $in_sequence = 0;
        my @GenBankFile = ( );

        # Get the GenBank data into an array from a file
        @GenBankFile = get_file_data($filename);

        # Extract all the sequence lines
        foreach my $line (@GenBankFile) {
            if ( $line =~ /^\/\/\n/ ) {         # If $line is end-of-record line //\n,
                last;                           # break out of the foreach loop.
            } elsif ( $in_sequence ) {          # If we know we're in a sequence,
                $$dna .= $line;                 # add the current line to $$dna.
            } elsif ( $line =~ /^ORIGIN/ ) {    # If $line begins a sequence,
                $in_sequence = 1;               # set the $in_sequence flag.
            } else {                            # Otherwise
                push( @$annotation, $line );    # add the current line to @annotation.
            }
        }

        # remove whitespace and line numbers from DNA sequence
        $$dna =~ s/[\s0-9]//g;
    }

## Some Musings on the Array Approach

    if ( $line =~ /^\/\/\n/ ) {    # If $line is end-of-record line //\n,
        last;                      # break out of the foreach loop.
    }

If a pattern contains many forward slashes, we can change the delimiter to a ! and avoid the backslashes, like this: m!^//\n!

The tests for what we are currently gathering appear in the reverse order of the sections in the file. For example, the annotation lines come first in the file but are handled by the last branch, and the flag is set when the ORIGIN start-of-sequence line is found.

## Parsing GenBank Files Using Scalars

We may end up with several lines in the same scalar; there are modifiers to our search code that can make our life easier.

Pattern modifiers: /g (What does this do?) /i (How about this one?)

Regular expression matching symbols: ^ (What does this do?) \$ (How about this one?) . (What does this do?)

## Special Multiline Regular Expression Modifiers

m: Treat string as multiple lines. That is, change "^" and "\$" from matching at only the very start or end of the string to the start or end of any line anywhere within the string.

s: Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which it normally would not match.

The /s and /m modifiers both override the \$* setting. That is, no matter what \$* contains, /s without /m will force "^" to match only at the beginning of the string and "\$" to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the "." match any character whatsoever, while yet allowing "^" and "\$" to match, respectively, just after and just before newlines within the string.

## Pattern Modifiers in Action

    #!/usr/bin/perl
    use strict;
    use warnings;
    "AAC\nGTT" =~ /^.*$/;
    print $&, "\n";
    exit;

This fails: without modifiers, . does not match the newline and \$ cannot match in the middle of the string, so there is no match and \$& has no value.

Suppose we use /m:

    "AAC\nGTT" =~ /^.*$/m;

Output:

    AAC

Suppose we use /s:

    "AAC\nGTT" =~ /^.*$/s;

Output:

    AAC
    GTT
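The three cases can be checked side by side. The sketch below uses a capture group instead of `$&`, so a failed match simply leaves the variable undefined; the variable names are chosen here for illustration.

```perl
use strict;
use warnings;

my $text = "AAC\nGTT";

# A failed match in list context returns the empty list, leaving undef.
my ($plain)  = $text =~ /^(.*)$/;     # no modifier
my ($multi)  = $text =~ /^(.*)$/m;    # ^ and $ match at line boundaries
my ($single) = $text =~ /^(.*)$/s;    # . also matches newline

for my $pair (["none", $plain], ["/m", $multi], ["/s", $single]) {
    my ($label, $got) = @$pair;
    printf "%-5s %s\n", $label, defined $got ? "matched [$got]" : "no match";
}
```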

## Separating Annotations from Sequence

Recall that a GenBank record starts with LOCUS and ends with //.

The input record separator is normally set to newline, so each call to read a scalar from a filehandle gets one line. The input record separator is denoted by \$/:

    $/ = "//\n";

Now a call to read a scalar from a filehandle takes all data up to the GenBank end-of-record separator.
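The effect of changing the separator can be seen without a real GenBank file by reading from an in-memory filehandle. In this sketch the two short records in `$library` are made-up stand-ins, and `local` is used so the separator is restored automatically when the block ends.

```perl
use strict;
use warnings;

# Two tiny stand-in records; real GenBank records are much longer.
my $library = "LOCUS A\nORIGIN\n 1 acgt\n//\nLOCUS B\nORIGIN\n 1 ttga\n//\n";
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "//\n";    # restored automatically at the end of this block
    @records = <$fh>;     # each read now returns one whole record
}
close $fh;

print scalar(@records), " records read\n";
```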

## Example 10-2 – Extract Annotation and Sequence from a GenBank Record

    #!/usr/bin/perl
    # Example 10-2 Extract the annotation and sequence sections from the first
    # record of a GenBank library
    use strict;
    use warnings;
    use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
    use BeginPerlBioinfo;    # see Chapter 6 about this module

    # Declare and initialize variables
    my $annotation = '';
    my $dna = '';
    my $record = '';
    my $filename = 'record.gb';
    my $save_input_separator = $/;

    # Open GenBank library file
    unless (open(GBFILE, $filename)) {
        print "Cannot open GenBank file \"$filename\"\n\n";
        exit;
    }

    # Set input separator to "//\n" and read in a record to a scalar
    $/ = "//\n";
    $record = <GBFILE>;

    # reset input separator
    $/ = $save_input_separator;

    # Now separate the annotation from the sequence data
    ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

    # Print the two pieces, which should give us the same as the
    # original GenBank file, minus the // at the end
    print $annotation, $dna;
    exit;

## The Regular Expressions In Depth

    ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

What do the two sets of parentheses do? What do we do with the contents of these parentheses? What does the /s modifier do?

^(LOCUS.*ORIGIN\s*\n) Can you explain this?

\/\/\n What does this match?

Notice that it took one line of Perl code to extract annotation and sequence.
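To see what each pair of parentheses captures, the same pattern can be applied to a very small record. The record below is a made-up stand-in, not real GenBank data.

```perl
use strict;
use warnings;

# A tiny stand-in record, not real GenBank data.
my $record = "LOCUS X\nDEFINITION demo\nORIGIN\n 1 acgt ttga\n//\n";

# First parentheses: everything from LOCUS through the ORIGIN line.
# Second parentheses: everything after that, up to the closing //.
my ($annotation, $dna) = ($record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s);

$dna =~ s/[\s0-9]//g;    # strip whitespace and line numbers

print "annotation:\n$annotation";
print "dna: $dna\n";
```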

## Parsing Annotations

    #!/usr/bin/perl -w
    # Example 10-3 Parsing GenBank annotations using arrays
    use strict;
    use warnings;
    use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
    use BeginPerlBioinfo;    # see Chapter 6 about this module

    # Declare and initialize variables
    my @genbank = ( );
    my $locus = '';
    my $accession = '';
    my $organism = '';

    # Get GenBank file data
    @genbank = get_file_data('record.gb');

    # Let's start with something simple. Let's get some of the identifying
    # information, let's say the locus and accession number (here the same
    # thing) and the definition and the organism.
    for my $line (@genbank) {
        if ($line =~ /^LOCUS/) {
            $line =~ s/^LOCUS\s*//;
            $locus = $line;
        } elsif ($line =~ /^ACCESSION/) {
            $line =~ s/^ACCESSION\s*//;
            $accession = $line;
        } elsif ($line =~ /^  ORGANISM/) {    # ORGANISM is indented in the record
            $line =~ s/^\s*ORGANISM\s*//;
            $organism = $line;
        }
    }

    print "*** LOCUS ***\n";
    print $locus;
    print "*** ACCESSION ***\n";
    print $accession;
    print "*** ORGANISM ***\n";
    print $organism;
    exit;

Notice the use of the flags and that this program handles single-line entries.

## Tackling the Definition Field

    #!/usr/bin/perl -w
    # Example 10-4 Parsing GenBank annotations using arrays, take 2
    use strict;
    use warnings;
    use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples';
    use BeginPerlBioinfo;    # see Chapter 6 about this module

    # Declare and initialize variables
    my @genbank = ( );
    my $locus = '';
    my $accession = '';
    my $organism = '';
    my $definition = '';
    my $flag = 0;

    # Get GenBank file data
    @genbank = get_file_data('record.gb');

    # Let's start with something simple. Let's get some of the identifying
    # information, let's say the locus and accession number (here the same
    # thing) and the definition and the organism.
    for my $line (@genbank) {
        if ($line =~ /^LOCUS/) {
            $line =~ s/^LOCUS\s*//;
            $locus = $line;
        } elsif ($line =~ /^DEFINITION/) {
            $line =~ s/^DEFINITION\s*//;
            $definition = $line;
            $flag = 1;                 # the definition may continue on later lines
        } elsif ($line =~ /^ACCESSION/) {
            $line =~ s/^ACCESSION\s*//;
            $accession = $line;
            $flag = 0;                 # ACCESSION ends the definition
        } elsif ($flag) {
            chomp($definition);        # join continuation lines into one line
            $definition .= $line;
        } elsif ($line =~ /^  ORGANISM/) {
            $line =~ s/^\s*ORGANISM\s*//;
            $organism = $line;
        }
    }

    print "*** LOCUS ***\n";
    print $locus;
    print "*** DEFINITION ***\n";
    print $definition;
    print "*** ACCESSION ***\n";
    print $accession;
    print "*** ORGANISM ***\n";
    print $organism;
    exit;

## Parsing the Annotations Using Regular Expressions

- sub open_file: Given the filename, return the filehandle.
- sub get_next_record: Given the filehandle, get the record (we can get the offset by first calling tell).
- sub get_annotation_and_dna: Given a record, split it into the annotation and the cleaned-up DNA sequence.
- sub search_sequence: Given a sequence and a regular expression, return an array of locations of hits.
- sub search_annotation: Given a GenBank annotation and a regular expression, return an array of locations of hits.
- sub parse_annotation: Separate out the fields of the annotation in a convenient form.
- sub parse_features: Given the features field, separate out the components.

## Byte offset

The Perl function tell allows us to save the byte offset of the record of interest. The byte offset is the number of bytes into the file where the information of interest lies. We can return "instantaneously" to this location in the file using the Perl function seek.
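A short sketch of the tell/seek pattern follows. It records the offset before each record is read and later jumps back to the second record. The data in `$library` is a made-up stand-in, read through an in-memory filehandle so the example needs no file on disk.

```perl
use strict;
use warnings;

my $library = "LOCUS A\n//\nLOCUS B\n//\nLOCUS C\n//\n";    # stand-in data
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!";

local $/ = "//\n";    # read one record per call

my @offsets;
while (1) {
    my $offset = tell($fh);    # byte offset before reading the record
    my $record = <$fh>;
    last unless defined $record;
    push @offsets, $offset;
}

seek($fh, $offsets[1], 0);     # whence 0: offset counted from start of file
my $second = <$fh>;
close $fh;

print "offsets: @offsets\n";
print "second record:\n$second";
```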

## Parsing of the Data

1. Separate out the annotation and the sequence (some capability to search exists at this stage).
2. Extract the fields.
3. Parse the features table.

## Example 10-5. GenBank Library Subroutines

    #!/usr/bin/perl
    # Example 10-5 - test program of GenBank library subroutines
    use strict;
    use warnings;
    # Don't use BeginPerlBioinfo
    # Since all subroutines defined in this file
    # use BeginPerlBioinfo;    # see Chapter 6 about this module

    # Declare and initialize variables
    my $fh;    # variable to store filehandle
    my $record;
    my $dna;
    my $annotation;
    my $offset;
    my $library = 'library.gb';

    # Perform some standard subroutines for test
    $fh = open_file($library);
    $offset = tell($fh);
    while ( $record = get_next_record($fh) ) {
        ($annotation, $dna) = get_annotation_and_dna($record);
        if ( search_sequence($dna, 'AAA[CG].') ) {
            print "Sequence found in record at offset $offset\n";
        }
        if ( search_annotation($annotation, 'homo sapiens') ) {
            print "Annotation found in record at offset $offset\n";
        }
        $offset = tell($fh);
    }
    exit;

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING52 Subroutines # open_file # # - given filename, return filehandle sub open_file { my(\$filename) = @_; my \$fh; unless(open(\$fh, \$filename)) { print "Cannot open file \$filename\n"; exit; } return \$fh; } # get_next_record # # - given filehandle to open GenBank library file, get next record sub get_next_record { my(\$fh) = @_; my(\$offset); my(\$record) = ''; my(\$save_input_separator) = \$/; \$/ = "//\n"; \$record = <\$fh>; \$/ = \$save_input_separator; return \$record; }
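Reading a whole record at a time depends on setting the input record separator `$/` to the GenBank end-of-record marker. The sketch below shows the same idea with `local`, which restores `$/` automatically when the subroutine returns. The subroutine name `get_next_record_local` and the tiny in-memory library are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# get_next_record_local (hypothetical name): like get_next_record, but
# uses local so that $/ is restored automatically on return.
sub get_next_record_local {
    my ($fh) = @_;
    local $/ = "//\n";      # a record ends with // on a line of its own
    my $record = <$fh>;     # scalar context: read exactly one record
    return $record;         # undef at end of file
}

# Hypothetical two-record library held in a string.
my $text = "LOCUS A\n//\nLOCUS B\n//\n";
open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";

my @records;
while ( defined( my $record = get_next_record_local($in) ) ) {
    push @records, $record;
}
close($in);

print scalar(@records), " records read\n";
```

Assigning `<$fh>` to a scalar before returning matters: returning `<$fh>` directly from a subroutine called in list context would read every remaining record at once.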

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING 53 Subroutines # get_annotation_and_dna # # - given GenBank record, get annotation and DNA sub get_annotation_and_dna { my(\$record) = @_; my(\$annotation) = ''; my(\$dna) = ''; # Now separate the annotation from the sequence data (\$annotation, \$dna) = (\$record =~ /^(LOCUS.*ORIGIN\s*\n)(.*)\/\/\n/s) ; # clean the sequence of any whitespace or / characters # (the / has to be written \/ in the character class, because # / is the pattern delimiter here, so it must be "escaped" with \) \$dna =~ s/[\s\/]//g; return(\$annotation, \$dna); } # search_sequence # # - search sequence with regular expression sub search_sequence { my(\$sequence, \$regularexpression) = @_; my(@locations) = ( ); # pos needs the searched variable as its argument; a bare pos refers to \$_ while( \$sequence =~ /\$regularexpression/ig ) { push( @locations, pos(\$sequence) ); } return (@locations); }

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING54 Subroutines # search_annotation # # - search annotation with regular expression sub search_annotation { my(\$annotation, \$regularexpression) = @_; my(@locations) = ( ); # note the /s modifier: . matches any character including newline while( \$annotation =~ /\$regularexpression/isg ) { push( @locations, pos(\$annotation) ); } return (@locations); } Output: Sequence found in record at offset 0 Annotation found in record at offset 0 Sequence found in record at offset 6358 Annotation found in record at offset 6358 Sequence found in record at offset 12573 Annotation found in record at offset 12573 Sequence found in record at offset 18032 Annotation found in record at offset 18032 Sequence found in record at offset 22722 Annotation found in record at offset 22722
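The search subroutines collect match positions with `pos` inside a `while` loop over a `/g` match. Called with no argument, `pos` reports the position for `$_`, so this sketch names the searched variable explicitly. The test sequence is made up for the illustration; the pattern is the one used in Example 10-5.

```perl
use strict;
use warnings;

# Made-up sequence containing two hits for the pattern AAA[CG].
my $sequence = 'ccAAAGtttaaacg';

my @ends;
while ( $sequence =~ /AAA[CG]./ig ) {
    # pos($sequence) is the offset just past the end of the latest match.
    push @ends, pos($sequence);
}

print "Matches end at offsets: @ends\n";   # prints: Matches end at offsets: 7 14
```

Because the subroutines return the list of positions, calling them in a condition such as `if( search_sequence(...) )` evaluates the number of hits, which is true whenever at least one match was found.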

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING55 Parsing the Annotations at the Top Level Let’s start with one of the simpler annotations DEFINITION Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1, complete cds. ACCESSION AB031069 VERSION AB031069.1 GI:8100074 We need a regular expression that matches everything from a word at the beginning of a line to a newline that just precedes another word at the beginning of a line.

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING56 Parsing the Annotations at the Top level /^[A-Z].*\n(^\s.*\n)*/m What does /m do? ^[A-Z].*\n Capital letter at the beginning of the line followed by any number of characters (except newlines) followed by a newline (^\s.*\n)* Matches a space or tab at the beginning of the line followed by any number of characters (except newlines) followed by a newline ()* means 0 or more of these
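To see the pattern at work: the /m modifier lets ^ match at the start of every line inside the string rather than only at the start of the string. The sketch below applies the pattern to the three annotation lines quoted on the previous slide. The column spacing in the sample text is an assumption, and the /g modifier is added here so the loop can step through every field.

```perl
use strict;
use warnings;

# Sample annotation text; exact column spacing is assumed for illustration.
my $annotation =
      "DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,\n"
    . "            complete cds.\n"
    . "ACCESSION   AB031069\n"
    . "VERSION     AB031069.1  GI:8100074\n";

my @fields;
# Each match is one header line plus any indented continuation lines.
while ( $annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) {
    push @fields, $&;    # $& holds the text of the whole match
}

print scalar(@fields), " fields found\n";
print "---\n$_" for @fields;
```

The DEFINITION field comes back as two lines because its second line starts with whitespace, while ACCESSION and VERSION each start a new field.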

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING57 Example 10-6 Parsing GenBank Data #!/usr/bin/perl # Example 10-6 - test program for parse_annotation subroutine use strict; use warnings; use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples'; use BeginPerlBioinfo; # see Chapter 6 about this module # Declare and initialize variables my \$fh; my \$record; my \$dna; my \$annotation; my %fields; my \$library = 'library.gb'; # Open library and read a record \$fh = open_file(\$library); \$record = get_next_record(\$fh); # Parse the sequence and annotation (\$annotation, \$dna) = get_annotation_and_dna(\$record); # Extract the fields of the annotation %fields = parse_annotation(\$annotation); # Print the fields foreach my \$key (keys %fields) { print "******** \$key *********\n"; print \$fields{\$key}; } exit;

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING58 Subroutines # parse_annotation # # given a GenBank annotation, returns a hash with # keys: the field names # values: the fields sub parse_annotation { my(\$annotation) = @_; my(%results) = ( ); while( \$annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) { my \$value = \$&; (my \$key = \$value) =~ s/^([A-Z]+).*/\$1/s; \$results{\$key} = \$value; } return %results; }

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING59 Extracting the key while( \$annotation =~ /^[A-Z].*\n(^\s.*\n)*/gm ) { my \$value = \$&; (my \$key = \$value) =~ s/^([A-Z]+).*/\$1/s; \$results{\$key} = \$value; } What does the line (my \$key = \$value) =~ s/^([A-Z]+).*/\$1/s; do? First it assigns \$key the value of \$value. It uses the /s modifier so that . also matches embedded newlines. It then replaces the entire contents of \$key with \$1, the special variable holding the text matched by the first pair of parentheses, which here is the field name.
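The copy-then-substitute idiom can be tried in isolation. In this sketch the field text is a made-up sample, and `$key2` is a hypothetical alternative that captures the field name directly instead of substituting. It is offered for comparison, not as the method used in the slides.

```perl
use strict;
use warnings;

# Sample two-line field; the spacing is assumed for illustration.
my $value = "DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,\n"
          . "            complete cds.\n";

# Copy $value into $key, then edit the copy in place; $value is unchanged.
# With /s the .* runs across the newline, so the whole text is replaced by $1.
(my $key = $value) =~ s/^([A-Z]+).*/$1/s;

# Alternative: capture the leading capital letters in list context.
my ($key2) = $value =~ /^([A-Z]+)/;

print "key  = $key\n";     # prints: key  = DEFINITION
print "key2 = $key2\n";    # prints: key2 = DEFINITION
```

The parentheses around `my $key = $value` make the assignment happen first, so the substitution operates on `$key` and leaves the full field text in `$value` for use as the hash value.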

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING60 Parsing the Features Table Let’s tackle the source, gene and CDS feature keys source 1..2487 /organism="Homo sapiens" /db_xref="taxon:9606" /sex="male" /cell_line="HuS-L12" /cell_type="lung fibroblast" /dev_stage="embryo" gene 229..2199 /gene="PCCX1" CDS 229..2199 /gene="PCCX1" /note="a nuclear protein carrying a PHD finger and a CXXC We will use an array rather than a hash, since there can be multiple instances of the same feature in a record

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING 61 Parsing the Features Table #!/usr/bin/perl # - main program to test parse_features use strict; use warnings; use lib 'C:\Documents and Settings\Owner\workspace\binf634_book_examples'; use BeginPerlBioinfo; # see Chapter 6 about this module # Declare and initialize variables my \$fh; my \$record; my \$dna; my \$annotation; my %fields; my @features; my \$library = 'library.gb'; # Get the fields from the first GenBank record in a library \$fh = open_file(\$library); \$record = get_next_record(\$fh); (\$annotation, \$dna) = get_annotation_and_dna(\$record); %fields = parse_annotation(\$annotation); # Extract the features from the FEATURES table @features = parse_features(\$fields{'FEATURES'}) ; # Print out the features foreach my \$feature (@features) { # extract the name of the feature (or "feature key") my(\$featurename) = (\$feature =~ /^ {5}(\S+)/); print "******** \$featurename *********\n"; print \$feature; } exit;

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING62 Subroutine # parse_features # # extract the features from the FEATURES field of a GenBank record sub parse_features { my(\$features) = @_; # entire FEATURES field in a scalar variable # Declare and initialize variables my(@features) = (); # used to store the individual features # Extract the features while( \$features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) { my \$feature = \$&; push(@features, \$feature); } return @features; }

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING63 A Closer Look at the Crucial Regular Expression while( \$features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) ^ {5}\S.*\n The line begins with exactly 5 spaces followed by a non-whitespace character \S, followed by any number of non-newline characters .*, followed by a newline (^ {21}\S.*\n)* Next come zero or more continuation lines, each beginning with exactly 21 spaces followed by a non-whitespace character \S, followed by any number of non-newline characters .*, followed by a newline
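The column counts in the pattern can be checked on a small sample. In this sketch the FEATURES text is rebuilt from the qualifiers quoted earlier, with the indentation generated by `' ' x 5` and `' ' x 21` so that the spacing is exact. The variable names and the spacing between feature key and location are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $pad5  = ' ' x 5;     # indentation of a feature key line
my $pad21 = ' ' x 21;    # indentation of a qualifier (continuation) line

# Small hand-built FEATURES field with two features.
my $features =
      "FEATURES             Location/Qualifiers\n"
    . "${pad5}source          1..2487\n"
    . "${pad21}/organism=\"Homo sapiens\"\n"
    . "${pad21}/db_xref=\"taxon:9606\"\n"
    . "${pad5}gene            229..2199\n"
    . "${pad21}/gene=\"PCCX1\"\n";

my @features;
while ( $features =~ /^ {5}\S.*\n(^ {21}\S.*\n)*/gm ) {
    push @features, $&;
}

foreach my $feature (@features) {
    my ($name) = ( $feature =~ /^ {5}(\S+)/ );   # the feature key
    my $lines  = ( $feature =~ tr/\n// );        # count lines in this feature
    print "$name: $lines line(s)\n";
}
```

A qualifier line cannot be mistaken for the start of a new feature: after its first 5 spaces the next character is another space, so `\S` fails there. The FEATURES header line is skipped for the same kind of reason, since it does not begin with spaces at all.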

BINF634_FALL14 COMPLEX DATA STRUCTURES AND PARSING64 Homework for Next Week Read Tisdall Chapter 10 Particular attention to the Indexing GenBank with DBM section Exercises 10.3 and 10.6 Begin working on Program 3 Quiz 4 next week
