Ensembl Compara Perl API Stephen Fitzgerald EBI - Wellcome Trust Genome Campus, UK compara.

1 Ensembl Compara Perl API Stephen Fitzgerald EBI - Wellcome Trust Genome Campus, UK compara

2 What is Ensembl Compara? A single database which contains precalculated comparative genomics data. Access is via the Perl API and MySQL. There is also a production system for generating that database (not covered in this presentation).

3 Compara data. 46 species in Ensembl release 52. Raw genomic sequence: whole genome alignments (tBLAT, BlastZ-net, PECAN); syntenic regions (based on BlastZ-net). Protein sequences: raw protein alignments; protein family clusters; protein trees; gene orthology / paralogy predictions.

4 Compara database & the Ensembl core databases. Since there is minimal primary data inside Compara, the external links with the core DBs must be re-established to gain full access to the data. Example: compara_52 must be linked with the Ensembl core_52 databases. Proper Registry configuration is critical; load_registry_from_db is probably the best choice here.

5 The Compara Perl API. Written in object-oriented Perl. Used to retrieve data from and store data into the ensembl-compara database. Generalized to extend to non-Ensembl genomic data (Uniprot). Follows the same Data Object & Object Adaptor / DBAdaptor design as the other Ensembl APIs.

6 Compara object model overview. [Diagram] Objects shown: NCBITaxon, GenomeDB, DnaFrag, Member, MethodLinkSpeciesSet, GenomicAlign, GenomicAlignBlock, SyntenyRegion, DnaFragRegion, Homology, Family, Attribute, ProteinTree, AlignedMember, grouped under PRIMARY DATA, ANALYSIS and RESULTS.

7 Primary data
GenomeDB: relates to a particular Ensembl core DB. Object methods: name(), assembly(), genebuild(), taxon(). Adaptor methods: fetch_by_name_assembly(), fetch_by_registry_name(), fetch_by_Slice(), fetch_all().
DnaFrag: represents a top-level SeqRegion. Object methods: name(), length(), genome_db(), slice(), coord_system_name(). Adaptor methods: fetch_by_Slice(), fetch_by_GenomeDB_and_name().
Member: lists all Ensembl genes + SwissProt + SPTrEMBL. Object methods: source_name(), stable_id(), genome_db(), taxon(), sequence(), get_all_peptide_Members(), get_longest_peptide_Member(), gene_member(). Adaptor methods: fetch_by_source_stable_id().

8 Analysis
MethodLinkSpeciesSet provides a handle to isolate specific data from the shared tables (homology, genomic_align_block).
MethodLink: each individual analysis in Compara is tagged with a unique name called a method_link_type: BLASTZ_NET, TRANSLATED_BLAT, PECAN, SYNTENY, FAMILY, ENSEMBL_ORTHOLOGUES, ENSEMBL_PARALOGUES, PROTEIN_TREES.
SpeciesSet: the set of species, as (a reference to) an array of GenomeDBs.
Adaptor methods: fetch_by_method_link_type_GenomeDBs(), fetch_by_method_link_type_registry_aliases(). Object methods: name(), method_link_type(), species_set(), source().

9 Exercises
GenomeDB: 1. Find out the versions of the human and mouse genomes in the database. 2. Print the names of all the GenomeDBs in the database.
DnaFrag: 1. Get the DnaFrag for chromosome 1 of the macaque genome (using a genome_db object as an argument). 2. Get the DnaFrag for chromosome X of the mouse genome (using a core slice object as an argument).
MethodLinkSpeciesSet: 1. Find out how many analyses are stored in the database. 2. Get the name of the MethodLinkSpeciesSet corresponding to the BlastZ-net analysis for human and mouse. 3. Get the names of all the species using the mlss corresponding to the Pecan analyses.

10 GenomeDB example code
use strict;
use warnings;
use Bio::EnsEMBL::Registry;
# Load the Ensembl databases into the registry
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(-host => "", -user => "anonymous");
# Fetch the GenomeDB object for human from the Compara database
my $genome_db_adaptor = $reg->get_adaptor("Multi", "compara", "GenomeDB");
my $genome_db = $genome_db_adaptor->fetch_by_registry_name("human");
print "Name: ", $genome_db->name, "\n";
print "Assembly: ", $genome_db->assembly, "\n";
print "GeneBuild: ", $genome_db->genebuild, "\n";

11 GenomeDB example code $> perl Homo sapiens NCBI Ensembl Mus musculus NCBIM Ensembl

12 DnaFrag example code
use strict;
use warnings;
use Bio::EnsEMBL::Registry;
# Load the Ensembl databases into the registry
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(-host => "", -user => "anonymous");
# The GenomeDB is needed to look up a DnaFrag by name
my $genome_db_adaptor = $reg->get_adaptor("Multi", "compara", "GenomeDB");
my $genome_db = $genome_db_adaptor->fetch_by_registry_name("human");
# Fetch the DnaFrag for human chromosome 13
my $dnafrag_adaptor = $reg->get_adaptor("Multi", "compara", "DnaFrag");
my $dnafrag = $dnafrag_adaptor->fetch_by_GenomeDB_and_name($genome_db, "13");
print "Name:", $dnafrag->name, "\n";
print "Length:", $dnafrag->length, "\n";
print "CoordSystem:", $dnafrag->coord_system_name, "\n";

13 DnaFrag example code $> perl Name :13 Length : CoordSystem :chromosome

14 MethodLinkSpeciesSet example code
use strict;
use warnings;
use Bio::EnsEMBL::Registry;
# Load the Ensembl databases into the registry
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(-host => "", -user => "anonymous");
# Fetch the human-mouse BLASTZ_NET MethodLinkSpeciesSet
my $mlssa = $reg->get_adaptor("Multi", "compara", "MethodLinkSpeciesSet");
my $mlss = $mlssa->fetch_by_method_link_type_registry_aliases("BLASTZ_NET", ["human", "mouse"]);
print $mlss->name, "\n";
print "type: ", $mlss->method_link_type, "\n";
# species_set() returns a reference to an array of GenomeDBs
my $species_set = $mlss->species_set();
foreach my $this_genome_db (@$species_set) {
    print $this_genome_db->name(), "\n";
}

15 MethodLinkSpeciesSet example code $ > perl blastz-net (on

16 Genomic Alignments. BlastZ-net: used to compare closely related pairs of species (BlastZ-raw -> BlastZ-chain -> BlastZ-net). Translated BLAT: used to compare more distant pairs of species. Pecan: multiple global alignments (all vs all coding exons wublastp -> Mercator -> Pecan on each syntenic block).

17 GenomicAlignBlock: represents a genomic alignment; contains 1 GenomicAlign per sequence. Adaptor method: fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice). Methods: method_link_species_set(), score(), length(), perc_id(), get_all_GenomicAligns(), get_SimpleAlign(). GenomicAlign methods: dnafrag(), genome_db(), get_Slice(), dnafrag_start(), dnafrag_end(), dnafrag_strand(), aligned_sequence().

18 GenomicAlignBlock
$all_GAlign = $GABlock->get_all_GenomicAligns()   # returns an array reference
$Simplealign = $GABlock->get_SimpleAlign()        # returns an object
$Simplealign: a BioPerl object which contains the whole alignment; it can be printed in various formats using BioPerl modules.
$Galign: an object which represents only one of the sequences in the alignment.
Hsap.X : ACCTTC-A   <- $ga
Cfam.X : ACC--CGA   <- $ga
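
To illustrate how an aligned sequence relates to the original one, here is a minimal, self-contained sketch using the two aligned strings from this slide. It does not use the Compara API: the helper names ungapped_sequence and count_identical_columns are hypothetical, and the identity count is a plain column-by-column comparison, which is not necessarily how perc_id() is computed.

```perl
use strict;
use warnings;

# Remove gap characters ("-") to recover the original, unaligned sequence.
sub ungapped_sequence {
    my ($aligned) = @_;
    (my $original = $aligned) =~ tr/-//d;
    return $original;
}

# Count alignment columns where both sequences carry the same base.
# A gap never counts as a match.
sub count_identical_columns {
    my ($first, $second) = @_;
    die "aligned sequences must have equal length\n"
        unless length($first) == length($second);
    my $identical = 0;
    for my $i (0 .. length($first) - 1) {
        my ($x, $y) = (substr($first, $i, 1), substr($second, $i, 1));
        $identical++ if $x ne '-' && $x eq $y;
    }
    return $identical;
}

# The two aligned sequences shown on the slide.
my %aligned = ('Hsap.X' => 'ACCTTC-A', 'Cfam.X' => 'ACC--CGA');
for my $name (sort keys %aligned) {
    print "$name aligned: $aligned{$name} original: ",
        ungapped_sequence($aligned{$name}), "\n";
}
print "Identical columns: ",
    count_identical_columns(@aligned{qw(Hsap.X Cfam.X)}), "\n";
```

With the API, the same comparison would presumably start from aligned_sequence() on each GenomicAlign and the sequence of the Slice returned by get_Slice().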

19 Synteny: based on BlastZ-net alignments. SyntenyRegionAdaptor: fetch_all_by_MethodLinkSpeciesSet_Slice(), fetch_all_by_MethodLinkSpeciesSet_DnaFrag(). SyntenyRegion methods: get_all_DnaFragRegions(), method_link_species_set(). DnaFragRegion methods: slice(), dnafrag(), dnafrag_start(), dnafrag_end(), dnafrag_strand().

20 Exercises GenomicAlignBlock 1. Fetch all the BLASTZ_NET alignments between the first 130K nucleotides of the human chromosome X and the mouse genome. 2. Print the exact location of the alignment blocks. 3. Compare the original and the aligned sequences. 4. Find the BLASTZ_NET alignments between human gene BRCA2 and the mouse genome. 5. Print the BLASTZ_NET alignments between the rat gene ECSIT and the mouse genome. 6. Print the PECAN multiple alignments between the rat gene ECSIT and 11 other amniote vertebrates. 7. Print the constrained-element alignments within the rat ECSIT locus (use the constrained elements generated from the 12-way alignments). Synteny 1. Get the human-mouse syntenic map for human chromosome X.

21 GenomicAlignBlock example code
[...]
# Define the query region as a core Slice
my $slice_adaptor = $reg->get_adaptor("human", "core", "Slice");
my $slice = $slice_adaptor->fetch_by_region("chromosome", "12", 1e4, 2e4);
# Fetch all alignment blocks overlapping the Slice for this analysis
my $gaba = $reg->get_adaptor("Multi", "compara", "GenomicAlignBlock");
my $genomic_align_blocks = $gaba->fetch_all_by_MethodLinkSpeciesSet_Slice($method_link_species_set, $slice);
foreach my $this_gab (@$genomic_align_blocks) {
    # One GenomicAlign per aligned sequence in the block
    my $all_gas = $this_gab->get_all_GenomicAligns();
    foreach my $this_ga (@$all_gas) {
        print $this_ga->genome_db->name(), ":", $this_ga->get_Slice()->name(), "\n";
        print $this_ga->aligned_sequence(), "\n";
    }
    print "\n";
}

22 GenomicAlignBlock example code
$>perl
Mus musculus:chromosome:NCBIM37:6: : :-1
CCTCTTAATAAACATTATTGTCAA[…]
Homo sapiens:chromosome:NCBI36:12:19128:19507:1
CCTCTTAATAAGCACACATATCCT[..]

23 Synteny example code
[...]
my $synteny_region_adaptor = $reg->get_adaptor("Multi", "compara", "SyntenyRegion");
my $synteny_regions = $synteny_region_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice($human_mouse_synteny_method_link_species_set, $human_slice);
foreach my $this_synteny_region (@$synteny_regions) {
    # One DnaFragRegion per species in the syntenic region
    my $these_dnafrag_regions = $this_synteny_region->get_all_DnaFragRegions();
    foreach my $this_dnafrag_region (@$these_dnafrag_regions) {
        print $this_dnafrag_region->dnafrag->genome_db->name, ": ", $this_dnafrag_region->slice->name, "\n";
    }
    print "\n";
}

24 Homology. Up to e! 38: orthologue predictions based on best reciprocal BLAST hits; paralogues for a selected set of species only; no global view of the evolutionary history of the gene considered. From e! 39 onwards: orthologues and paralogues are inferred from protein trees. Phylogeny: orthology and paralogy in one go.

25 BSR: Blast Score Ratio. When two proteins P1 and P2 are compared, BSR = score(P1, P2) / max(self-score(P1), self-score(P2)). The default threshold used in the initial clustering step is 0.33.
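
The BSR definition on this slide can be sketched as a small, self-contained Perl function. The function name blast_score_ratio and the scores in the usage lines are hypothetical, and treating the 0.33 threshold as "greater than or equal to" is an assumption; the slide does not say whether the comparison is strict.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Blast Score Ratio: the pairwise score divided by the larger of the two self-scores.
sub blast_score_ratio {
    my ($score_p1_p2, $self_score_p1, $self_score_p2) = @_;
    my $denominator = max($self_score_p1, $self_score_p2);
    die "self-scores must be positive\n" unless $denominator > 0;
    return $score_p1_p2 / $denominator;
}

# Hypothetical scores, for illustration only.
my $threshold = 0.33;    # default threshold quoted on the slide
my $bsr = blast_score_ratio(150, 400, 300);
printf "BSR = %.3f (%s the %.2f threshold)\n",
    $bsr, ($bsr >= $threshold ? "meets" : "is below"), $threshold;
```

Dividing by the larger self-score keeps the ratio at or below 1 for a pair whose mutual score never exceeds either self-score.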

26 Homology types

27 Homology. A Homology object contains 1 pair of Member/Attribute per gene/protein. Adaptor methods: fetch_all_by_Member(), fetch_all_by_MethodLinkSpeciesSet(), fetch_all_by_Member_MethodLinkSpeciesSet(). Methods: method_link_species_set(), description(), subtype(), perc_id(), get_all_Member_Attribute(), get_SimpleAlign().

28 Family. Compara computes gene family clusters. The analysis runs on all Ensembl transcripts plus all Uniprot/SWISSPROT and Uniprot/SPTREMBL metazoan proteins. The algorithm is based on: all vs all blastp; MCL clustering; Muscle multiple aligner. Results are stored in the family and family_member tables.

29 Family. A Family object contains 1 pair of Member/Attribute per gene/protein. Adaptor method: fetch_all_by_Member(). Methods: method_link_species_set(), description(), description_score(), get_all_Member_Attribute(), get_SimpleAlign().

30 Exercises
Members: 1. Find the Member corresponding to SwissProt protein O 2. Find the Member for the human gene BRCA2. 3. Find all the peptide Members corresponding to the human gene CTDP1.
Homology: 1. Get all the predicted homologues for the human gene BRCA2. 2. Get all the mouse orthologues predicted for the human gene CTDP1.
Family: 1. Get the family predicted for the human gene BRCA2. 2. Get the alignments corresponding to the family of the human gene HBEGF.

31 Member example code
use strict;
use warnings;
use Bio::EnsEMBL::Registry;
# Load the Ensembl databases into the registry
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(-host => "", -user => "anonymous");
# Fetch the gene Member by its source and stable ID
my $member_adaptor = $reg->get_adaptor("Multi", "compara", "Member");
my $member = $member_adaptor->fetch_by_source_stable_id("ENSEMBLGENE", "ENSG ");
print "All proteins:\n";
# get_all_peptide_Members() returns a reference to an array of peptide Members
my $all_peptide_members = $member->get_all_peptide_Members();
foreach my $this_peptide (@$all_peptide_members) {
    print $this_peptide->stable_id(), "\n";
}

32 Member example code $> perl All proteins: ENSP ENSP ENSP

33 Homology example code
[...]
my $ma = $reg->get_adaptor("Multi", "compara", "Member");
my $member = $ma->fetch_by_source_stable_id("ENSEMBLGENE", "ENSG ");
my $homology_adaptor = $reg->get_adaptor("Multi", "compara", "Homology");
my $homologies = $homology_adaptor->fetch_all_by_Member($member);
foreach my $this_homology (@$homologies) {
    print $this_homology->description, "\n";
    # Each entry is a [Member, Attribute] pair
    my $member_attributes = $this_homology->get_all_Member_Attribute();
    foreach my $this_mem_attr (@$member_attributes) {
        my ($this_member, $this_attribute) = @$this_mem_attr;
        print $this_member->genome_db->name, " ", $this_member->source_name, " ", $this_member->stable_id, "\n";
    }
    print "\n";
}

34 Family example code
[...]
my $ma = $reg->get_adaptor("Multi", "compara", "Member");
my $member = $ma->fetch_by_source_stable_id("ENSEMBLGENE", "ENSG ");
my $family_adaptor = $reg->get_adaptor("Multi", "compara", "Family");
my $families = $family_adaptor->fetch_all_by_Member($member);
foreach my $this_family (@$families) {
    print $this_family->description, "\n";
    # Each entry is a [Member, Attribute] pair
    my $member_attributes = $this_family->get_all_Member_Attribute();
    foreach my $this_mem_attr (@$member_attributes) {
        my ($this_member, $this_attribute) = @$this_mem_attr;
        print $this_member->taxon->binomial, " ", $this_member->source_name, " ", $this_member->stable_id, "\n";
    }
    print "\n";
}

35 Getting More Information perldoc – Viewer for inline API documentation. shell> perldoc Bio::EnsEMBL::Compara::GenomeDB shell> perldoc Bio::EnsEMBL::Compara::DBSQL::MemberAdaptor online at: Tutorial document: cvs: ensembl-compara/docs/ComparaTutorial.pdf ensembl-dev mailing list: Exercise solutions:

36 Ensembl-dev mailing list and HelpDesk. The ensembl-dev mailing list is great for questions about the API and the DB. HelpDesk is very helpful. Give detailed information on what you are trying to do. Check that you have the modules installed ($PERL5LIB pointing to them).
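
One way to check that the modules are visible to Perl before asking for help is sketched below. The helper name module_available is hypothetical; the two module names are the ones used elsewhere in this presentation. The script only reports whether Perl can load each module from the current @INC (which includes $PERL5LIB); it does not diagnose why a module is missing.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded from the current @INC, 0 otherwise.
sub module_available {
    my ($module) = @_;
    # Convert Some::Module into the Some/Module.pm path that require expects
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(Bio::EnsEMBL::Registry Bio::EnsEMBL::Compara::GenomeDB)) {
    printf "%-35s %s\n", $module,
        module_available($module) ? "found" : "NOT found (check PERL5LIB)";
}
```

Including this output in a message to the mailing list or HelpDesk gives the detailed information asked for above.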

37 Ensembl Team
Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Benedict Paten, Daniel Zerbino
VectorBase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
Zebrafish Annotation: Kerstin Jekosch, Mario Caccamo, Ian Sealy
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Browen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerck Vogel, Kevin Howe, Felix Kokocinski, Stephen Rice, Simon White
Comparative Genomics: Javier Herrero, Kathryn Beal, Benoît Ballester, Stephen Fitzgerald, Albert Vilella, Leo Gordon
Web Team: James Smith, Fiona Cunningham, Anne Parker, Steve Trevanion (VEGA)
Outreach: Xosé M Fernández, Bert Overduin, Giulietta Spudich, Michael Schuster
Distributed Annotation System (DAS): Eugene Kulesha
BioMart: Arek Kasprzyk, Damian Smedley, Richard Holland, Syed Haldar
Database Schema and Core API: Glenn Proctor, Ian Longden, Patrick Meidl, Andreas Kähäri

38 A special case of ortholog
