INTRODUCTION TO BIOPERL Gautier Sarah & Gaëtan Droc.

1 INTRODUCTION TO BIOPERL Gautier Sarah & Gaëtan Droc

2 BioPerl is …
- A set of Perl modules for manipulating genomic and other biological data
- An open source toolkit with many contributors
- A flexible and extensible system for bioinformatics data manipulation

3 Some things we can do
- Read in sequence data from a file in standard formats (FASTA, GenBank, EMBL, SwissProt, ...)
- Convert between file formats (sequences and alignments)
- Manipulate sequences: reverse complement, translate a coding DNA sequence to protein
- Parse a BLAST-like report and get access to every bit of data in the report

4 Sequence file formats
- Simple formats, without features
    - FASTA
- Rich formats, with features and annotations
    - EMBL, GenBank, GFF3
    - SwissProt, GenPept
    - TIGRXML, BSML, InterPro (XML)

5 Simple formats
    >ID Description(Free text)
    AGTGATGATAGTGAGTAGGA
    >gi|number|emb|ACCESSION
    AGATAGTAGGGGATAGAG
    >gi|number|sp|BOSS_7LES
    MTMFWQQNVDHQSDEQDKQAKGAAPTKRLN
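
To make the layout of a FASTA record concrete, here is a minimal sketch in plain Perl (no BioPerl needed) that splits FASTA text into ID, description and sequence. The record names and the in-memory `$fasta` string are made up for illustration; in practice Bio::SeqIO does this job and handles many more edge cases.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical FASTA text held in memory; a real script would read a file.
my $fasta = <<'END';
>seq1 first example record
AGTGATGATAG
TGAGTAGGA
>seq2
AGATAGTAGGGGATAGAG
END

my @records;    # one { id, desc, seq } hash per record, in input order
for my $line ( split /\n/, $fasta ) {
    if ( $line =~ /^>(\S+)\s*(.*)$/ ) {
        # Header line: the first word is the ID, the rest is free text
        push @records, { id => $1, desc => $2, seq => '' };
    }
    elsif (@records) {
        # Sequence line: drop whitespace and append to the current record
        $line =~ s/\s+//g;
        $records[-1]{seq} .= $line;
    }
}

# Report ID, sequence length and description for each record
for my $rec (@records) {
    print join( "\t", $rec->{id}, length( $rec->{seq} ), $rec->{desc} ), "\n";
}
```

A sequence may span several lines; the sketch joins them until the next `>` header.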

6 Building a sequence
    #!/usr/bin/perl -w
    use strict;
    use Bio::Seq;
    # Build a sequence object from a sequence string and an identifier
    my $seq = Bio::Seq->new(
        -seq        => 'ATGGGACCAAGTA',
        -display_id => 'example1',
    );
    print "Sequence name is ", $seq->display_id, "\n";
    print "Sequence length is ", $seq->length, "\n";
    print "Sub-sequence is ", $seq->subseq(1,3), "\n";
Running the script:
    % perl ex2.pl
    Sequence name is example1
    Sequence length is 13
    Sub-sequence is ATG

7 Bio::PrimarySeq : Primary Information
    Method                     Description
    $seq->seq                  Get/Set the sequence string
    $seq->display_id           Get/Set the sequence identifier string
    $seq->desc                 Get/Set the description string
    $seq->length               Return the length of the sequence
    $seq->subseq(start,end)    Get a sub-sequence as a string
    $seq->trunc(start,end)     Get a sub-sequence as an object
    $seq->revcom               Get the reverse complement (DNA only)
    $seq->translate            Get the protein translation (DNA only)
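
The difference between methods that return a string and methods that return a new object is easiest to see in use. The following is a small sketch, assuming BioPerl is installed and reusing the example sequence from the previous slide; the variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;

my $seq = Bio::Seq->new(
    -seq        => 'ATGGGACCAAGTA',
    -display_id => 'example1',
    -alphabet   => 'dna',
);

# subseq() returns a plain Perl string
my $sub_string = $seq->subseq( 1, 6 );

# trunc() returns a new sequence object, so object methods can be chained.
# Positions 1..12 are used so the length is a multiple of three.
my $coding = $seq->trunc( 1, 12 );

# revcom() and translate() also return new sequence objects;
# call ->seq on them to get the underlying string
my $revcom  = $seq->revcom;
my $protein = $coding->translate;

print "Sub-sequence:       ", $sub_string,   "\n";
print "Reverse complement: ", $revcom->seq,  "\n";
print "Translation:        ", $protein->seq, "\n";
```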

8 Rich formats
- Primary information
- Taxonomic information
- Bibliographic references
- Features (with location) + annotations
- Sequence data

9 Features & Annotations
- Derived from the GFF format

10 GFF format
- « Generic Feature Format »
- Tab-delimited format
- 9 columns: sequence_id, source, primary_tag, start, stop, score, strand, frame, description
- Different versions of GFF (GFF1, GFF2 & GFF3)
    - The variation is in how the description column is formatted
    - For GFF3, 'primary_tag' column values must be terms from the Sequence Ontology
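
As a rough illustration of the nine columns, the sketch below splits one GFF3 line in plain Perl. The line itself (chr1, example_source, gene0001) is invented for the example, the column names are the ones listed on the slide, and the attribute handling only covers simple `tag=value` pairs; Bio::Tools::GFF is the tool to use for real files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical GFF3 line, built with join() so the tabs are explicit
my $line = join( "\t",
    'chr1', 'example_source', 'gene', 1000, 2000, '.', '+', '.',
    'ID=gene0001;Name=demo' );

# Column names as listed on the slide
my @names = qw(sequence_id source primary_tag start stop
               score strand frame description);

my @cols = split /\t/, $line;
die "expected 9 columns, got " . scalar(@cols) . "\n" unless @cols == 9;

# Hash slice: pair each column name with its value
my %feature;
@feature{@names} = @cols;

# In GFF3 the last column holds semicolon-separated tag=value pairs
my %attr = map { my ( $k, $v ) = split /=/, $_, 2; ( $k => $v ) }
    split /;/, $feature{description};

print "$feature{primary_tag} $attr{ID}: ",
    "$feature{sequence_id}:$feature{start}-$feature{stop} ($feature{strand})\n";
```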

11 Features & Annotations
- Derived from the GFF format
- Have a location on a sequence
    - start(), end() & strand() for location information
    - score(), frame(), primary_tag(), source_tag() for feature information
    - tag(): hash reference of tag/value
- Bio::SeqFeature::Generic
- More details
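
A short sketch of how these accessors look on a Bio::SeqFeature::Generic object, assuming BioPerl is installed. The coordinates, source name and tag values are made up; tag values are read here with has_tag() and get_tag_values(), which return the values stored under a given tag name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqFeature::Generic;

# Hypothetical feature: a gene on the forward strand
my $feat = Bio::SeqFeature::Generic->new(
    -start       => 10,
    -end         => 100,
    -strand      => 1,
    -primary_tag => 'gene',
    -source_tag  => 'example_source',
    -tag         => { ID => 'gene0001', Name => 'demo' },
);

# Location information
print "Location: ", $feat->start, "..", $feat->end,
    " strand ", $feat->strand, "\n";

# Feature information
print "Type: ", $feat->primary_tag, "  Source: ", $feat->source_tag, "\n";

# Tag values come back as a list, since a tag may hold several values
if ( $feat->has_tag('ID') ) {
    my ($id) = $feat->get_tag_values('ID');
    print "ID: $id\n";
}
```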

12 Convert format : Bio::SeqIO
- Read/write sequences
- Initialize
    - -file: filename for input; prepend '>' to the filename for writing
    - -format: format used for reading or writing
- Some supported formats
    - fasta: FASTA
    - genbank: GenBank DB
    - embl: EMBL DB
    - swiss: SwissProt DB

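13 Read in sequence and write out in different format
    use strict;
    use Bio::SeqIO;
    # Input stream: GenBank records read from in.gb
    my $in = Bio::SeqIO->new(
        -format => 'genbank',
        -file   => 'in.gb',
    );
    # Output stream: the leading '>' opens out.fa for writing
    my $out = Bio::SeqIO->new(
        -format => 'fasta',
        -file   => '>out.fa',
    );
    # Copy every sequence from the input to the output
    while ( my $seq = $in->next_seq ) {
        $out->write_seq($seq);
    }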

14 Read GFF
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Bio::Tools::GFF;
    my $file = shift;    # GFF3 file to read
    my $tag  = shift;    # feature type to report
    my $in = Bio::Tools::GFF->new( -gff_version => 3, -file => $file );
    # Print ID, sequence, start, end and strand of each matching feature
    while ( my $feature = $in->next_feature ) {
        if ( $feature->primary_tag() eq $tag ) {
            my ($id) = $feature->get_tag_values("ID");
            print join("\t", $id, $feature->seq_id, $feature->start, $feature->end, $feature->strand), "\n";
        }
    }
    $in->close;

15 Bio::SearchIO
- Parsing analysis reports
- Can be split into 3 components
    - Result: one per query
    - Hit: sequence which matches the query (component of a Result)
    - HSP: High Scoring Segment Pair (component of a Hit)
- Implemented for BLAST, BLAT, FASTA, HMMER, Exonerate…

16 Bio::SearchIO
Can be split into 3 components:
- Result: one per query
- Hit: sequence which matches the query (component of a Result)
- HSP: High Scoring Segment Pair (component of a Hit)
[Diagram: a Result containing Hit 1 and Hit 2, with HSP 1, HSP 2 and HSP 1 under the hits]

17 Bio::SearchIO
    use strict;
    use Bio::SearchIO;
    my $in = Bio::SearchIO->new(
        -format => 'blast',
        -file   => 'report.bls',
    );
    # Walk the report: results, then hits, then HSPs
    while ( my $result = $in->next_result ) {
        while ( my $hit = $result->next_hit ) {
            while ( my $hsp = $hit->next_hsp ) {
                # Keep HSPs longer than 50 with at least 75% identity
                if ( $hsp->length('total') > 50 ) {
                    if ( $hsp->percent_identity >= 75 ) {
                        print "Query=", $result->query_name,
                            " Hit=", $hit->name,
                            " Length=", $hsp->length('total'),
                            " Percent_id=", $hsp->percent_identity, "\n";
                    }
                }
            }
        }
    }

18 HOWTO Parsing with Bio::SearchIO
- Table of methods

19 Things I'm skipping (here)
- Bio::Tools::SeqStats - base-pair frequencies, dicodon frequencies, etc.
- Bio::Tools::SeqWords - count n-mer words in a sequence
- Bio::SeqUtils - mixed helper functions
- Bio::Restriction - find restriction enzyme sites and cut sequences
- Bio::Graphics - represent information graphically

20 Link
- HOWTO :
- CPAN BioPerl : Modules Documentation

