HOGENOM a phylogenomic database

Presentation transcript:

HOGENOM a phylogenomic database
Simon Penel, Pascal Calvat, Jean-Francois Dufayard, Vincent Daubin, Laurent Duret, Manolo Gouy, Dominique Guyot, Daniel Kahn, Vincent Miele, Vincent Navratil, Guy Perrière, Rémi Planel

Several phylogenomic databases developed at LBBE/PRABI
HOVERGEN: vertebrate proteins from UniProt; clustering with SiLiX.
HOMOLENS: proteins from Ensembl complete genomes; clustering from Ensembl; trees calculated and annotated (S, D, L) with new methods (PhylDog, LBBE).
HOGENOM: proteins from all available complete genomes (Bacteria, Eukaryota, Archaea); clustering with SiLiX and post-processing with HiFiX; trees will be annotated (S, D, L, T).

HOGENOM characteristics
All complete genomes from the whole tree of life (not restricted to a particular phylum).
Proposes « gene families »: full-length homologous sequences (as distinct from « domain families »).

Domain vs. gene families
Families of homologous protein domains (ProDom):
- Evolution by domain shuffling (duplication, loss, translocation)
Homologous gene families (HOGENOM):
- Evolution of homologous genes by speciation, by gene duplication, or by horizontal transfer
- Sequences are homologous over their entire length (or almost)

Orthologs and paralogs in HOGENOM
HOGENOM is centered on phylogenetic trees of gene families. Information on orthologs and paralogs can be deduced from gene trees:
- from the annotation of gene trees (Duplication, Speciation, Transfer)
- from query tools such as tree-pattern matching

Building
1. Compare all proteins against each other (BLAST)
2. Cluster homologous sequences into families (SiLiX + HiFiX)
3. Compute multiple alignments for each family
4. Compute phylogenetic trees for each family
5. Annotate phylogenetic trees (gene duplications, losses, transfers)

Compare all proteins against each other
Iterative BLAST calculation.
Use of a non-redundant protein sequence database (all known proteins, about 20,000,000 non-redundant sequences), associated with a resulting BLAST hits database (from which BLAST hits may be extracted).
Cluster, grid and cloud computing.

SiLiX 1st step: similarity search
Protein database → local pairwise alignments (BLASTP, BLOSUM62, E ≤ 10⁻⁴)
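The E-value cut-off can be illustrated with a short filter. This is only a sketch: it assumes the hits were saved in BLAST's standard 12-column tabular format, where column 3 is the percent identity and column 11 the E-value; the slides do not say how HOGENOM stores its hits, and the function name and example rows are invented.

```python
import csv
import io

E_VALUE_MAX = 1e-4  # threshold quoted on the slide


def filter_hits(handle, max_evalue=E_VALUE_MAX):
    """Yield (query, subject, percent_identity, evalue) for retained hits.

    Assumes the standard 12-column BLAST tabular layout:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        evalue = float(row[10])
        if evalue <= max_evalue:
            yield row[0], row[1], float(row[2]), evalue


# Two invented hits: only the first passes E <= 1e-4.
example = (
    "A\tB\t62.5\t200\t75\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "A\tC\t30.0\t40\t28\t0\t5\t44\t9\t48\t0.5\t25\n"
)
kept = list(filter_hits(io.StringIO(example)))
print(kept)
```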

SiLiX 2nd step: SiLiX clustering
Use the all-against-all BLAST hits.

SiLiX: selection of consistent HSPs
[Figure: HSPs between Seq. A and Seq. B, labelled S1’, S2, lgHSP1, lgHSP2, ∆lg1, ∆lg2 and ∆lg3.]

SiLiX: single linkage clustering
Linking criteria: HSP ≥ 80 % of the length, identity ≥ 35 %.
[Figure: sequences A, B and C are grouped into one cluster A, B, C.]
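Single linkage clustering under these two thresholds can be sketched with a union-find structure. This is an illustration, not the SiLiX implementation: the function names and the input tuples are hypothetical, and the coverage value is taken as already computed, since the slide does not specify which sequence length it is measured against.

```python
def find(parent, x):
    """Return the cluster representative of x, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def single_linkage(sequences, hits, min_coverage=0.80, min_identity=35.0):
    """Group sequences connected by at least one hit passing both thresholds.

    hits: iterable of (seq_a, seq_b, coverage_fraction, percent_identity).
    """
    parent = {s: s for s in sequences}
    for a, b, coverage, identity in hits:
        if coverage >= min_coverage and identity >= min_identity:
            root_a, root_b = find(parent, a), find(parent, b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters
    clusters = {}
    for s in sequences:
        clusters.setdefault(find(parent, s), set()).add(s)
    return sorted(clusters.values(), key=sorted)


# A-B and B-C pass the thresholds, so A, B and C end up together even
# though A and C share no direct hit; C-D fails on coverage.
hits = [("A", "B", 0.90, 40.0), ("B", "C", 0.85, 36.0), ("C", "D", 0.50, 90.0)]
families = single_linkage(["A", "B", "C", "D"], hits)
print(families)
```

The example also shows the property that motivates the later post-processing: one chain of pairwise links is enough to place distant sequences in the same cluster.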

SiLiX
SiLiX: single linkage clustering with alignment coverage constraints (Miele et al. BMC Bioinformatics 2011)
Computing efficiency:
- Ultra-fast
- Memory efficient
- Scalable (parallel architecture)
Clustering quality:
- At least as good as the previously published methods

However…
Because of over-extension of BLAST alignments, some sequences that share only partial homology may be clustered in the same family. The risk of alignment over-extension is low, but it becomes a problem for very large protein families.
Use more stringent clustering criteria? No: optimal clustering criteria are not the same for all families.

HiFiX
The mode and tempo of evolution are specific to each protein family. A multiple alignment provides information about the specific pattern of evolution of a family, which can be used to decide whether or not a new sequence belongs to that family.

HiFiX
Step 1: rapid clustering (SiLiX), giving pre-families.
Step 2: sub-clustering of pre-families into homogeneous protein clusters, giving sub-families.
Step 3: progressive merging of sub-families into families, with evaluation of multiple alignment quality at each step, giving families.

Results of clustering
About 7,000,000 proteins clustered into 300,000 families.
Family size distribution:
Number of sequences    Number of families
at least 2             296,920
2:10                   242,398
10:500                 53,450
500:2000               1,026
more than 2000         79

Compute multiple alignments
All alignments (~300,000) have been calculated with ClustalΩ.

Compute phylogenetic tree
Question: what about alternative splicing?

Alternative splicing
In eukaryotes, due to alternative splicing, a single gene may be transcribed into several transcripts.

Transcripts in HOGENOM6
We selected all the transcripts for each gene, because the longest transcript is not always the best!

Selection of a representative isoform in HOGENOM
Because:
- We do not want several proteins for the same gene in a phylogenetic tree: they may be seen as a duplication
- We want 1 protein per gene for statistical comparison among organisms

Selection of a representative isoform: how?
Eukarya: 1 or more transcripts per gene.
Archaea and bacteria: 1 transcript per gene.
[Figure: sequences from Eukarya, Archaea and bacteria entering the clustering step.]

Selection of a representative isoform: how?
First step: when a gene has isoforms in different families, choose a family for the gene.

Selection of a representative isoform: how?
We select the family with the highest number of eukaryotic genes (and not proteins).
If the numbers of eukaryotic genes are identical, we select the family with the highest number of eukaryotic proteins.
If the numbers of eukaryotic proteins are identical, we select the family with the highest number of proteins.
[Figure: three candidate families containing 2, 2 and 3 genes.]
The « rejected » isoforms are called « ISOFORMEX ».
Some families may end up empty after this step.
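The three successive criteria amount to a lexicographic comparison, which can be sketched as a single `max` over a tuple key. The record type, field names and counts below are hypothetical, and the slides do not say what happens when all three counts are equal; in this sketch the first candidate listed is kept in that case.

```python
from typing import NamedTuple


class FamilyCounts(NamedTuple):
    """Hypothetical summary of one candidate family for a gene."""
    name: str
    euk_genes: int      # number of eukaryotic genes in the family
    euk_proteins: int   # number of eukaryotic proteins in the family
    proteins: int       # total number of proteins in the family


def choose_family(candidates):
    """Pick the family by eukaryotic genes, then eukaryotic proteins,
    then total proteins. On a full tie, max() keeps the first candidate
    (an assumption, not stated in the slides)."""
    return max(candidates, key=lambda f: (f.euk_genes, f.euk_proteins, f.proteins))


candidates = [
    FamilyCounts("fam1", euk_genes=2, euk_proteins=3, proteins=10),
    FamilyCounts("fam2", euk_genes=2, euk_proteins=3, proteins=12),
    FamilyCounts("fam3", euk_genes=3, euk_proteins=3, proteins=5),
]
chosen = choose_family(candidates)
print(chosen.name)
```

Isoforms of the gene that sit in the families not chosen would then be the ones tagged ISOFORMEX.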

Selection of a representative isoform: how?
Second step: when a gene has isoforms in a family, choose a representative isoform for the gene.
[Figure: families containing 2, 2 and 3 genes; which isoform should represent each gene?]

Selection of a representative isoform: how?
We use the alignment:
- Suppression of the ISOFORMEX sequences
- Selection of the positions with < 50% gaps
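The position filter can be sketched as follows, assuming the alignment is held as a list of equal-length strings with "-" as the gap character. The function names and the toy alignment are illustrative only.

```python
def keep_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Return the indices of columns whose gap fraction is below the limit."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    kept = []
    for i in range(length):
        n_gaps = sum(seq[i] == gap for seq in alignment)
        if n_gaps / n_seqs < max_gap_fraction:
            kept.append(i)
    return kept


def trim(alignment, columns):
    """Rebuild each aligned sequence from the retained columns only."""
    return ["".join(seq[i] for i in columns) for seq in alignment]


# Column 3 is a gap in 3 of 4 sequences (75%), so it is dropped.
alignment = ["MKT-A", "MK--A", "M-T-A", "MKTLA"]
cols = keep_columns(alignment)
trimmed = trim(alignment, cols)
print(cols, trimmed)
```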

Selection of a representative isoform: how?
For each isoform of a given gene, and for each position, we count 1 each time the residue is identical to the residue in at least one of the isoforms of the other eukaryotic genes. The isoform with the highest total is selected; the other isoforms are tagged as ISOFORMIN.
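One possible reading of this scoring rule is sketched below: at each position, an isoform earns one point per other gene that has at least one isoform carrying the same residue. That interpretation, the choice to ignore gap positions, the data layout and all names are assumptions; the slides do not give the exact implementation.

```python
def score_isoform(isoform, other_genes, gap="-"):
    """Count, over all positions, the other genes matching the residue.

    other_genes: list of lists of aligned isoform sequences, one inner
    list per other gene. Gap positions score nothing (an assumption).
    """
    score = 0
    for pos, residue in enumerate(isoform):
        if residue == gap:
            continue
        for isoforms in other_genes:
            if any(seq[pos] == residue for seq in isoforms):
                score += 1
    return score


def pick_representative(gene, aligned):
    """Return (best_isoform_name, scores) for one gene.

    aligned: {gene_name: {isoform_name: aligned_sequence}}.
    """
    others = [list(iso.values()) for g, iso in aligned.items() if g != gene]
    scores = {name: score_isoform(seq, others) for name, seq in aligned[gene].items()}
    best = max(scores, key=scores.get)
    return best, scores


# Invented alignment: gene1 has a full-length and a shorter isoform.
aligned = {
    "gene1": {"g1.a": "MKTLA", "g1.b": "MK--A"},
    "gene2": {"g2.a": "MKTLA"},
    "gene3": {"g3.a": "MRTLA", "g3.b": "MKTIA"},
}
best, scores = pick_representative("gene1", aligned)
isoformin = sorted(name for name in scores if name != best)
print(best, scores, isoformin)
```

Here the full-length isoform wins, but the score depends on agreement with the other genes rather than on length alone, which matches the remark that the longest transcript is not always the best.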

Tree calculation
Tools: Gblocks, PhyML, FastTree.
[Figure: sequences a to g, some of them tagged ISOFORMIN or ISOFORMEX.]

Annotate phylogenetic trees
Several methods are currently being developed in the ANCESTROME project:
- Speciation, Duplication and Loss
- Speciation, Duplication, Transfer and Loss
See Vincent Daubin's talk tomorrow.

Querying the database
ACNUC server (client-server application, R package, Python package, C API, Bio++ API)

Querying the database
Web interface on PRABI

Querying the database
Homologous families detected with HMM (D. Guyot)

Querying the database
New tools! (R. Planel, J.F. Dufayard)

Querying the database
Displaying the gene tree and the synteny context of the gene

Querying the database
Search for orthologous vertebrate genes between mouse and man

Thank you for your attention
Ancestrome: Integrative phylogenetic approaches for reconstructing ancestral "-omes"