
1 Biol 59500-033 - Practical Biocomputing: BioPerl
General capabilities (packages)
Sequences
  ○ fetching, reading, writing, reformatting, annotating, groups
  ○ Access to remote databases
Applications
  ○ BLAST, Blat, FASTA, HMMer, Clustal, Alignment, many others
Gene modeling
  ○ Genscan, Sim4, Grail, Genemark, ESTScan, MZEF, EPCR
XML formats
  ○ GAME, BSML and AGAVE
GFF
Trees
Genetic maps
3D structure
Literature
Graphics

2 BioPerl: Auxiliary packages
Possibly of less general interest; they require additional modules
BioPerl-run - running applications
  ○ EMBOSS
  ○ PISE
Bioperl-ext - extensions
Bioperl-db and BioSQL

3 BioPerl: Simple use
Bio::Perl: easy access to a small part of Bioperl's functionality in an easy-to-use manner

    use Bio::Perl;

    # This script will only work if you have an internet connection on the
    # computer you're using. The databases you can get sequences from
    # are 'swiss', 'genbank', 'genpept', 'embl', and 'refseq'.
    my $seq_object = get_sequence('swiss', "ROA1_HUMAN");
    write_sequence(">roa1.fasta", 'fasta', $seq_object);

    use Bio::Perl;

    my $seq = get_sequence('swiss', "ROA1_HUMAN");
    # uses the default database - nr in this case
    my $blast_result = blast_sequence($seq);
    write_blast(">roa1.blast", $blast_result);

4 BioPerl: Bio::Perl
Bio::Perl has a number of other easy-to-use functions, including
get_sequence - gets a sequence from standard, internet-accessible databases
read_sequence - reads a sequence from a file
read_all_sequences - reads all sequences from a file
new_sequence - makes a Bioperl sequence just from a string
write_sequence - writes a single sequence or an array of sequences to a file
translate - provides a translation of a sequence
translate_as_string - provides a translation of a sequence, returning just the sequence as a string
blast_sequence - BLASTs a sequence against standard databases at NCBI
write_blast - writes a BLAST report out to a file
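A few of the functions listed above can be tried without a network connection. The following is a minimal sketch, assuming BioPerl is installed; the sequence string and the ID 'example_1' are made up for illustration.

```perl
use strict;
use warnings;
use Bio::Perl;

# Build a sequence object from a plain string (no database access needed).
# The sequence and the ID are invented for this example.
my $seq = new_sequence("ATGGCCAAATAA", "example_1");

# translate() returns a sequence object; translate_as_string() returns a string.
my $protein_obj = translate($seq);
my $protein     = translate_as_string($seq);

print "DNA:     ", $seq->seq, "\n";
print "Protein: $protein\n";
```

get_sequence and blast_sequence follow the same calling style but contact remote servers, so they are left out of this offline sketch.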

5 BioPerl: Sequence Objects
Seq, PrimarySeq, LocatableSeq, RelSegment, LiveSeq, LargeSeq, RichSeq, SeqWithQuality, SeqI
Common formats are interpreted automatically
Simple formats - without features
  ○ FASTA (Pearson), Raw, GCG
Rich formats - with features and annotations
  ○ GenBank, EMBL
  ○ Swissprot, GenPept
  ○ XML - BSML, GAME, AGAVE, TIGRXML, CHADO

6 BioPerl: Sequences, Features and Annotations
Sequence - DNA, RNA, Amino Acid
Sequences are feature containers
  ○ Feature - information with a sequence location
  ○ Annotation - information without an explicit sequence location
Parsing sequences
Bio::SeqIO
  ○ for automatically reading most types
  ○ multiple drivers: genbank, embl, fasta, ...
Sequence objects
  ○ Bio::PrimarySeq
  ○ Bio::Seq
  ○ Bio::Seq::RichSeq

7 BioPerl: Simple examples

    #!/bin/perl -w
    use Bio::Seq;
    $seq_obj = Bio::Seq->new( -seq      => "aaaatgggggggggggccccgtt",
                              -alphabet => 'dna' );

    #!/bin/perl -w
    use Bio::Seq;
    $seq_obj = Bio::Seq->new( -seq        => "aaaatgggggggggggccccgtt",
                              -display_id => "#12345",
                              -desc       => "example 1",
                              -alphabet   => "dna" );
    print $seq_obj->seq();

8 BioPerl: Reading sequences from files & databases

    #!/bin/perl -w
    use Bio::SeqIO;
    # no '>' prefix on the file name: the file is opened for reading
    $seqio_obj = Bio::SeqIO->new( -file   => 'sequence.fasta',
                                  -format => 'fasta' );
    # loop in case there is more than one sequence in the file
    while ( $seq_obj = $seqio_obj->next_seq ) {
        # print the sequence
        print $seq_obj->seq, "\n";
    }

    #!/bin/perl -w
    use Bio::DB::GenBank;
    $db_obj  = Bio::DB::GenBank->new;
    $seq_obj = $db_obj->get_Seq_by_id('AE006468');

9 BioPerl: Getting sequences directly from database

    #!/bin/perl -w
    use Bio::DB::GenBank;
    use Bio::DB::Query::GenBank;
    # also Bio::DB::GenPept, Bio::DB::SwissProt, Bio::DB::RefSeq and Bio::DB::EMBL

    # keyword query
    $query_obj = Bio::DB::Query::GenBank->new(
        -query => 'gbdiv est[prop] AND Trypanosoma brucei [organism]',
        -db    => 'nucleotide' );

    $gb = Bio::DB::GenBank->new;
    # this returns a Seq object:
    $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
    # this also returns a Seq object:
    $seq2 = $gb->get_Seq_by_acc('AF303112');
    # this returns a SeqIO object, which can be used to get a Seq object:
    $seqio = $gb->get_Stream_by_id( [ "J00522", "AF303112", "2981014" ] );
    $seq3  = $seqio->next_seq;

10 BioPerl: Getting more sequence information
Some methods
  − accession_number()      get the accession number
  − display_id()            get identifier string
  − description() or desc() get description string
  − seq()                   get the sequence as a string
  − length()                get the sequence length
  − subseq($start, $end)    get a subsequence (char string)
  − translate()             translate to protein (seq obj)
  − revcom()                reverse complement (seq obj)
  − species()               returns a Bio::Species object

    #!/usr/bin/env perl
    use strict;
    use Bio::SeqIO;
    use Bio::DB::GenBank;

    my $genBank = Bio::DB::GenBank->new;
    my $seq  = $genBank->get_Seq_by_acc('AF060485');   # get a record by accession
    my $dna  = $seq->seq;                # get the sequence as a string
    my $id   = $seq->display_id;         # identifier
    my $acc  = $seq->accession_number;   # accession number
    my $desc = $seq->desc;               # get the description
    print "ID: $id\naccession: $acc\nDescription: $desc\n$dna\n";
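The methods listed above can also be tried on a locally built object, without contacting GenBank. This is a small sketch, assuming BioPerl is installed; the sequence, ID and description are invented for illustration.

```perl
use strict;
use warnings;
use Bio::Seq;

# A made-up 12-base coding sequence; no database lookup is involved.
my $seq = Bio::Seq->new(
    -seq        => "ATGGCCAAATAA",
    -display_id => "example_2",
    -desc       => "made-up coding sequence",
    -alphabet   => "dna",
);

my $len     = $seq->length;          # number of residues
my $sub     = $seq->subseq(4, 6);    # plain string, 1-based inclusive coordinates
my $revcom  = $seq->revcom->seq;     # revcom() returns an object, so ask it for its string
my $protein = $seq->translate->seq;  # translate() also returns an object

print "ID: ", $seq->display_id, "\n";
print "Description: ", $seq->desc, "\n";
print "Length: $len\n";
print "Bases 4-6: $sub\n";
print "Reverse complement: $revcom\n";
print "Protein: $protein\n";
```

Note the difference in return types: subseq() gives back a string, while revcom() and translate() give back new sequence objects.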

11 BioPerl: Sequence Objects

    LOCUS       ECORHO       1880 bp    DNA     linear   BCT 26-APR-1993
    DEFINITION  E.coli rho gene coding for transcription termination factor.
    ACCESSION   J01673 J01674
    VERSION     J01673.1  GI:147605
    KEYWORDS    attenuator; leader peptide; rho gene; transcription terminator.
    SOURCE      Escherichia coli
      ORGANISM  Escherichia coli
                Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
                Enterobacteriaceae; Escherichia.
    REFERENCE   1  (bases 1 to 1880)
      AUTHORS   Brown,S., Albrechtsen,B., Pedersen,S. and Klemm,P.
      TITLE     Localization and regulation of the structural gene for
                transcription-termination factor rho of Escherichia coli
      JOURNAL   J. Mol. Biol. 162 (2), 283-298 (1982)
      MEDLINE   83138788
       PUBMED   6219230
    REFERENCE   2  (bases 1 to 1880)
      AUTHORS   Pinkham,J.L. and Platt,T.
      TITLE     The nucleotide sequence of the rho gene of E. coli K-12
    COMMENT     Original source text: Escherichia coli (strain K-12) DNA.
                A clean copy of the sequence for [2] was kindly provided by
                J.L.Pinkham and T.Platt.
    FEATURES             Location/Qualifiers
         source          1..1880
                         /organism="Escherichia coli"
                         /mol_type="genomic DNA"
                         /strain="K-12"
                         /db_xref="taxon:562"
         mRNA            212..>1880
                         /product="rho mRNA"
         gene            468..1727
                         /gene="rho"
         CDS             468..1727
                         /gene="rho"
                         /note="transcription termination factor"
                         /codon_start=1
                         /translation="MNLTELKNTPVSELITLGENMGLENLARMRKQDIIFAILKQHAK...
                         IDAMEFLINKLAMTKTNDDFFEMMKRS"
    ORIGIN      15 bp upstream from HhaI site.
            1 aaccctagca ctgcgccgaa atatggcatc cgtggtatcc cgactctgct gctgttcaaa
           61 aacggtgaag tggcggcaac caaagtgggt gcactgtcta aaggtcagtt gaaagagttc
    ...deleted...
         1801 tgggcatgtt aggaaaattc ctggaatttg ctggcatgtt atgcaatttg catatcaaat
         1861 ggttaatttt tgcacaggac
    //

12 BioPerl: Bio::Seq object methods
add_SeqFeature($feature) - attach feature(s)
get_SeqFeatures() - get all the attached features
species() - a Bio::Species object
annotation() - Bio::Annotation::Collection
Features
Bio::SeqFeatureI - interface
Bio::SeqFeature::Generic - basic implementation
SeqFeature::Similarity - some score info
SeqFeature::FeaturePair - pair of features

13 BioPerl: Sequence Features
Bio::SeqFeatureI - interface - GFF derived
  ○ start(), end(), strand() for location information
  ○ location() - Bio::LocationI object (to represent complex locations)
  ○ score, frame, primary_tag, source_tag - feature information
  ○ spliced_seq() - for an attached sequence, get the spliced sequence
Bio::SeqFeature::Generic
  ○ add_tag_value($tag, $value) - add a tag/value pair
  ○ get_tag_values($tag) - get all the values for this tag
  ○ has_tag($tag) - test if a tag exists
  ○ get_all_tags() - get all the tags
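The feature methods above fit together roughly as follows. This is an illustrative sketch only, assuming BioPerl is installed; the sequence, coordinates, and the 'gene' tag value are invented.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;

# Made-up sequence to hang a feature on.
my $seq = Bio::Seq->new(
    -seq        => "ATGGCCAAATAA",
    -display_id => "example_3",
    -alphabet   => "dna",
);

# A feature is information with a location: here bases 1..9 on the plus strand.
my $feature = Bio::SeqFeature::Generic->new(
    -start       => 1,
    -end         => 9,
    -strand      => 1,
    -primary_tag => 'CDS',
    -source_tag  => 'example',
);
$feature->add_tag_value('gene', 'demo');

# Attaching the feature to the sequence also lets the feature see its sequence.
$seq->add_SeqFeature($feature);

for my $f ($seq->get_SeqFeatures) {
    print $f->primary_tag, " ", $f->start, "..", $f->end,
          " strand ", $f->strand, "\n";
    for my $tag ($f->get_all_tags) {
        print "  /$tag=", join(",", $f->get_tag_values($tag)), "\n";
    }
    # seq() on a feature returns the covered region as a sequence object
    print "  region: ", $f->seq->seq, "\n";
}
```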

14 BioPerl: Sequence Annotations
Each Bio::Seq has a Bio::Annotation::Collection via $seq->annotation()
Annotations are stored with keys like 'comment' and 'reference'
Annotation::Comment
  ○ comment field
Annotation::Reference
  ○ author, journal, title, etc.
Annotation::DBLink
  ○ database, primary_id, optional_id, comment
Annotation::SimpleValue

    @com = $annotation->get_Annotations('comment');
    $annotation->add_Annotation('comment', $an);
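A short sketch of how the annotation collection might be used, assuming BioPerl is installed. The comment text, the 'note' key and its value are made up for illustration.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::Annotation::Collection;
use Bio::Annotation::Comment;
use Bio::Annotation::SimpleValue;

my $seq = Bio::Seq->new(
    -seq        => "ATGGCCAAATAA",
    -display_id => "example_4",
    -alphabet   => "dna",
);

# Every Bio::Seq carries a Bio::Annotation::Collection.
my $annotation = $seq->annotation;

# Annotations are stored under a key; the texts here are invented.
my $comment = Bio::Annotation::Comment->new(-text => "made-up comment");
$annotation->add_Annotation('comment', $comment);

my $note = Bio::Annotation::SimpleValue->new(-value => "demo value");
$annotation->add_Annotation('note', $note);

# Walk all keys and print each stored annotation.
for my $key ($annotation->get_all_annotation_keys) {
    for my $item ($annotation->get_Annotations($key)) {
        print "$key: ", $item->as_text, "\n";
    }
}
```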

15 BioPerl: Sequences, Features, and Annotations
(Diagram: a Bio::Seq has-a set of Features (Bio::SeqFeature::Generic, which has-a Bio::LocationI) and has-a Bio::Annotation::Collection of Annotations (e.g. Bio::Annotation::Comment).)

16 BioPerl: Writing sequences
Writing in a different format than the one read = reformatting

    use Bio::SeqIO;
    # convert swissprot to fasta format
    my $in  = Bio::SeqIO->new( -format => 'swiss', -file => 'file.sp' );
    my $out = Bio::SeqIO->new( -format => 'fasta', -file => '>file.fa' );
    while ( my $seq = $in->next_seq ) {
        $out->write_seq($seq);
    }
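The same read-then-write pattern can be tried end to end without an existing input file. The sketch below assumes BioPerl is installed; it first writes a made-up FASTA record into a temporary directory and then converts it to GenBank format. The file names and the sequence are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Bio::Seq;
use Bio::SeqIO;

# Work in a temporary directory that is removed when the script exits.
my $dir   = tempdir( CLEANUP => 1 );
my $fasta = "$dir/demo.fa";
my $gb    = "$dir/demo.gb";

# Step 1: create a small FASTA file to convert (invented sequence).
my $writer = Bio::SeqIO->new( -format => 'fasta', -file => ">$fasta" );
$writer->write_seq(
    Bio::Seq->new(
        -seq        => "ATGGCCAAATAA",
        -display_id => "example_5",
        -alphabet   => "dna",
    )
);
$writer->close;

# Step 2: read FASTA, write GenBank. A '>' prefix means "open for writing".
my $in  = Bio::SeqIO->new( -format => 'fasta',   -file => $fasta );
my $out = Bio::SeqIO->new( -format => 'genbank', -file => ">$gb" );
my $count = 0;
while ( my $seq = $in->next_seq ) {
    $out->write_seq($seq);
    $count++;
}
$out->close;

print "converted $count sequence(s)\n";
```

Going from a simple format to a rich one like this cannot add features or annotations that were never in the input; the GenBank record only carries what the FASTA record had.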

17 BioPerl: Remote Blast
Retrieve sequence, setup and submit

    use Bio::DB::GenBank;
    use Bio::Tools::Run::RemoteBlast;

    # retrieve sequence from genbank
    my $db_obj  = Bio::DB::GenBank->new;
    my $seq_obj = $db_obj->get_Seq_by_acc('1942116');
    my $seq     = $seq_obj->seq;
    print "seq:$seq\n";

    # remote BLAST setup and query submission
    my $v = 1;    # turn on verbose output
    my $remote_blast = Bio::Tools::Run::RemoteBlast->new(
        '-prog'   => 'blastp',
        '-data'   => 'swissprot',
        '-expect' => '1e-10' );
    my $r = $remote_blast->submit_blast($seq_obj);
    print STDERR "waiting..." if ( $v > 0 );

18 BioPerl: Remote Blast
Retrieve sequence, setup and submit (output)

    --------------------- WARNING ---------------------
    MSG: Unrecognized DBSOURCE data: pdb: molecule 2NLL, chain 65, release Aug 27, 2007; deposition: Nov 20, 1996; class: TranscriptionDNA; source: Mol_id: 1; Organism_scientific: Homo Sapiens; Organism_common: Human; Genus: Homo; Species: Sapiens; Expression_system: Escherichia Coli; Expression_system_common: Bacteria; Expression_system_genus: Escherichia; Expression_system_species: Coli; Mol_id: 2; Organism_scientific: Homo Sapiens; Organism_common: Human; Genus: Homo; Species: Sapiens; Expression_system: Escherichia Coli; Expression_system_common: Bacteria; Expression_system_genus: Escherichia; Expression_system_species: Coli; Mol_id: 3; Synthetic: Yes; Mol_id: 4; Synthetic: Yes; Exp. method: X-Ray Diffraction.
    ---------------------------------------------------
    seq:CAICGDRSSGKHYGVYSCEGCKGFFKRTVRKDLTYTCRDNKDCLIDKRQRNRCQYCRYQKCLAMGM

19 BioPerl: Remote Blast Results
The list of search RIDs is stored in the RemoteBlast object

    while ( my @rids = $remote_blast->each_rid ) {
        foreach my $rid (@rids) {
            # Try to retrieve a search; $rc is not a reference until the search is done.
            # When the search is complete, $rc is a Bio::SearchIO object.
            my $rc = $remote_blast->retrieve_blast($rid);
            if ( !ref($rc) ) {
                # If the search is not done, wait 5 sec and try again.
                # It would be a good idea to put a maximum limit here so the script
                # doesn't run forever in the event of an error.
                if ( $rc < 0 ) {
                    $remote_blast->remove_rid($rid);
                }
                print STDERR "." if ( $v > 0 );
                sleep 5;
            }
            else {
                # search result successfully retrieved
                my $result = $rc->next_result();    # see Bio::Search::Result
                # save the output
                my $filename = $result->query_name() . ".out";
                $remote_blast->save_output($filename);
                $remote_blast->remove_rid($rid);
                print "\nQuery Name: ", $result->query_name(), "\n";
                while ( my $hit = $result->next_hit ) {    # see Bio::Search::Hit::HitI
                    next unless ( $v > 0 );
                    print "\thit name is ", $hit->name, "\n";
                    while ( my $hsp = $hit->next_hsp ) {
                        print "\t\tscore is ", $hsp->score, "\n";
                    }
                }
            }
        }
    }
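The comment in the loop above recommends a maximum limit on retries. One way to sketch that idea in plain Perl is a small bounded-polling helper. Note that poll_until is a hypothetical helper written for this example and is not part of BioPerl, and the "job" here is a stand-in closure rather than a real BLAST request.

```perl
use strict;
use warnings;

# poll_until is a hypothetical helper, not a BioPerl function.
# $check is a code ref that returns a defined value once the job is done
# and undef while it is still pending. Gives up after $max_tries checks.
sub poll_until {
    my ( $check, $max_tries, $delay ) = @_;
    for my $try ( 1 .. $max_tries ) {
        my $result = $check->();
        return $result if defined $result;
        # wait between attempts, but not after the last one
        sleep $delay if $delay && $try < $max_tries;
    }
    return;    # undef in scalar context: the limit was reached
}

# Stand-in for a remote job that finishes on the third check.
my $calls    = 0;
my $fake_job = sub { return ++$calls >= 3 ? "done" : undef };

my $status = poll_until( $fake_job, 10, 0 );
print defined $status
    ? "finished after $calls checks\n"
    : "gave up after $calls checks\n";
```

In a RemoteBlast script, the callback could wrap retrieve_blast($rid) and return its value only once ref() on it is true, with a delay of a few seconds between checks; that wiring is left as an assumption here.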

20 BioPerl: Remote Blast

    waiting....
    Query Name: 2NLL_A
        hit name is sp|P28700.1|RXRA_MOUSE
            score is 275
        hit name is sp|P19793.1|RXRA_HUMAN
            score is 275
        hit name is sp|Q05343.1|RXRA_RAT
            score is 275
        hit name is sp|Q90415.1|RXRAB_DANRE
            score is 273
        hit name is sp|A2T929.2|RXRAA_DANRE
            score is 272
        hit name is sp|Q7SYN5.1|RXRBA_DANRE
            score is 270
        hit name is sp|Q90416.2|RXRGA_DANRE
            score is 268
        hit name is sp|P51128.1|RXRA_XENLA
            score is 268
        hit name is sp|P28701.1|RXRG_CHICK
            score is 268
        hit name is sp|Q90417.1|RXRBB_DANRE
            score is 266
        hit name is sp|Q0GFF6.2|RXRG_PIG
            score is 264
        hit name is sp|Q0VC20.1|RXRG_BOVIN
            score is 264
        hit name is sp|Q5BJR8.1|RXRG_RAT
            score is 264
        hit name is sp|Q5REL6.1|RXRG_PONAB
            score is 264
        hit name is sp|P51129.1|RXRG_XENLA
            score is 264
        hit name is sp|P48443.1|RXRG_HUMAN
            score is 264
        hit name is sp|P28705.2|RXRG_MOUSE
            score is 264
        hit name is sp|Q6DHP9.1|RXRGB_DANRE
            score is 261
        hit name is sp|Q5TJF7.1|RXRB_CANFA
            score is 258
        …
        hit name is sp|Q505F1.2|NR2C1_MOUSE
            score is 200
        hit name is sp|Q9TTR7.1|COT2_BOVIN
            score is 200
        hit name is sp|Q90733.1|COT2_CHICK
            score is 200
        hit name is sp|P16375.1|7UP1_DROME
            score is 200
        hit name is sp|P16376.3|7UP2_DROME
        hit name is sp|P24468.1|COT2_HUMAN
        hit name is sp|O09018.1|COT2_RAT
        hit name is sp|A0JNE3.1|NR2C1_BOVIN
        hit name is sp|Q6PH18.1|N2F1B_DANRE
        hit name is sp|Q9N4B8.4|NHR41_CAEEL
        hit name is sp|O45666.2|NHR49_CAEEL
        hit name is sp|P49866.2|HNF4_DROME
        hit name is sp|P79926.1|HNF4B_XENLA

21 BioPerl: Database Search
BLAST - 3 components
  ○ Result: Bio::Search::Result::ResultI
  ○ Hit: Bio::Search::Hit::HitI
  ○ HSP: Bio::Search::HSP::HSPI

22 BioPerl: Blast

    use Bio::Perl;
    my $seq = get_sequence('swiss', "ROA1_HUMAN");
    # uses the default database - nr in this case
    my $blast_result = blast_sequence($seq);
    write_blast(">roa1.blast", $blast_result);

    use Bio::SearchIO;
    # parse a saved BLAST report and print the HSPs above 75% identity
    my $report_obj = Bio::SearchIO->new( -format => 'blast',
                                         -file   => 'report.bls' );
    while ( my $result = $report_obj->next_result ) {
        while ( my $hit = $result->next_hit ) {
            while ( my $hsp = $hit->next_hsp ) {
                if ( $hsp->percent_identity > 75 ) {
                    print "Hit\t",        $hit->name,             "\n",
                          "Length\t",     $hsp->length('total'),  "\n",
                          "Percent_id\t", $hsp->percent_identity, "\n";
                }
            }
        }
    }

23 BioPerl: BLAST - Processed result

    Query is: BOSS_DROME Bride of sevenless protein precursor. 896 aa
    Matrix was BLOSUM62
    Hit is F35H10.10
        HSP Len is 315 E-value is 4.9e-11 Bit score 182
        Query loc: 511 813 Sbject loc: 1006 1298
        HSP Len is 28 E-value is 1.4e-09 Bit score 39
        Query loc: 508 535 Sbject loc: 427 454

24 BioPerl: BLAST - Using the Search::Hit object

    use Bio::SearchIO;
    use strict;
    my $parser = Bio::SearchIO->new( -format => 'blast', -file => 'file.bls' );
    while ( my $result = $parser->next_result ) {
        while ( my $hit = $result->next_hit ) {
            print "hit name=", $hit->name, " desc=", $hit->description,
                  "\n len=", $hit->length, " acc=", $hit->accession, "\n";
            print "raw score ", $hit->raw_score, " bits ", $hit->bits,
                  " significance/evalue=", $hit->evalue, "\n";
        }
    }

25 BioPerl: Search::Hit methods
start(), end()
  ○ get overall alignment start and end for all HSPs
strand()
  ○ get best overall alignment strand
matches()
  ○ get total number of matches across the entire set of HSPs
  ○ can specify only exact 'id' or conservative 'cons'

26 BioPerl: Using Search::HSP

    use Bio::SearchIO;
    use strict;
    my $parser = Bio::SearchIO->new( -format => 'blast', -file => 'file.bls' );
    while ( my $result = $parser->next_result ) {
        while ( my $hit = $result->next_hit ) {
            while ( my $hsp = $hit->next_hsp ) {
                print "hsp evalue=", $hsp->evalue, " score=", $hsp->score, "\n";
                print "total length=", $hsp->hsp_length,
                      " qlen=", $hsp->query->length,
                      " hlen=", $hsp->hit->length, "\n";
                print "qstart=", $hsp->query->start, " qend=", $hsp->query->end,
                      " qstrand=", $hsp->query->strand, "\n";
                print "hstart=", $hsp->hit->start, " hend=", $hsp->hit->end,
                      " hstrand=", $hsp->hit->strand, "\n";
                print "percent identical ", $hsp->percent_identity,
                      " frac conserved ", $hsp->frac_conserved(), "\n";
                print "num query gaps ", $hsp->gaps('query'), "\n";
                print "hit str =", $hsp->hit_string, "\n";
                print "query str =", $hsp->query_string, "\n";
                print "homolog str=", $hsp->homology_string, "\n";
            }
        }
    }

27 BioPerl: Search::HSP methods
rank()
  ○ order in the alignment
  ○ by score, size
matches
seq_inds
  ○ residue positions that are
    − conserved, identical, mismatches, gaps

28 BioPerl: SearchIO
SearchIO objects correspond to many result formats
BLAST (WU-BLAST, NCBI, XML, PSIBLAST, BL2SEQ, MEGABLAST, TABULAR (-m8/m9))
FASTA (m9 and m0)
HMMER (hmmpfam, hmmsearch)
UCSC formats (WABA, AXT, PSL)
Gene based alignments
  ○ Exonerate, SIM4, {Gene,Genome}wise
Can write searches in alternative formats

29 BioPerl: Sequence Alignment
Bio::AlignIO to read alignment files
Produces Bio::SimpleAlign objects
  ○ Phylip
  ○ Clustal
Interface and objects designed for round-tripping and some functional work

30 BioPerl: Graphics

    use Bio::Graphics;
    use Bio::SeqIO;
    use Bio::SeqFeature::Generic;

    my $file = shift or die "provide a sequence file as the argument";
    my $io   = Bio::SeqIO->new( -file => $file ) or die "couldn't create Bio::SeqIO";
    my $seq  = $io->next_seq or die "couldn't find a sequence in the file";
    my @features = $seq->all_SeqFeatures;

    # sort features by their primary tags
    my %sorted_features;
    for my $f (@features) {
        my $tag = $f->primary_tag;
        push @{ $sorted_features{$tag} }, $f;
    }

    my $panel = Bio::Graphics::Panel->new(
        -length    => $seq->length,
        -key_style => 'between',
        -width     => 800,
        -pad_left  => 10,
        -pad_right => 10 );

    # scale arrow across the top
    $panel->add_track(
        arrow => Bio::SeqFeature::Generic->new( -start => 1, -end => $seq->length ),
        -bump   => 0,
        -double => 1,
        -tick   => 2 );

    # one bar for the whole sequence
    $panel->add_track(
        generic => Bio::SeqFeature::Generic->new( -start => 1, -end => $seq->length ),
        -bgcolor => 'blue',
        -label   => 1 );

    # general case: one track per feature type
    my @colors = qw(cyan orange blue purple green chartreuse magenta yellow aqua);
    my $idx    = 0;
    for my $tag ( sort keys %sorted_features ) {
        my $features = $sorted_features{$tag};
        $panel->add_track(
            $features,
            -glyph       => 'generic',
            -bgcolor     => $colors[ $idx++ % @colors ],
            -fgcolor     => 'black',
            -font2color  => 'red',
            -key         => "${tag}s",
            -bump        => +1,
            -height      => 8,
            -label       => 1,
            -description => 1 );
    }

    print $panel->png;

31 BioPerl: Graphics

