HOGENOM a phylogenomic database


1 HOGENOM a phylogenomic database
Simon Penel, Pascal Calvat, Jean-Francois Dufayard, Vincent Daubin, Laurent Duret, Manolo Gouy, Dominique Guyot, Daniel Kahn, Vincent Miele, Vincent Navratil, Guy Perrière, Rémi Planel

2 Several phylogenomic databases developed at LBBE/PRABI
HOVERGEN: Vertebrate proteins from UniProt. Clustering with SiLiX.
HOMOLENS: Proteins from Ensembl complete genomes. Clustering from Ensembl. Trees calculated and annotated (S, D, L) with new methods (PhylDog, LBBE).
HOGENOM: Proteins from all available complete genomes (Bacteria, Eukaryota, Archaea). Clustering with SiLiX and post-processing with HiFiX. Trees will be annotated (S, D, L, T).

3 HOGENOM characteristics
All complete genomes from the whole tree of life (not restricted to a particular phylum)
Proposes « gene families »: full-length homologous sequences (as opposed to « domain families »)

4 Domain vs. gene families
Protein domain family
Families of homologous protein domains (ProDom):
- Evolution by domain shuffling (duplication, loss, translocation)

5 Domain vs. gene families
Homologous gene family
Families of homologous protein domains (ProDom):
- Evolution by domain shuffling (duplication, loss, translocation)
Homologous gene families (HOGENOM):
- Evolution of homologous genes by speciation, by gene duplication, or by horizontal transfer
- Sequences are homologous over their entire length (or almost)

6 Orthologs and paralogs in HOGENOM
HOGENOM is centered on phylogenetic trees of gene families. Information on orthologs and paralogs can be deduced from gene trees: - from the annotation of gene trees (Duplication, Speciation, Transfer) - from query tools such as tree-pattern matching
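As an illustration of how orthologs and paralogs can be read off an annotated gene tree, here is a minimal Python sketch. It assumes a toy tree structure (the `Node` class, the event labels "S", "D", "T" and the gene names are all hypothetical, not HOGENOM's data model) and applies the usual rule: two genes are orthologs if their last common ancestor is a speciation node, paralogs if it is a duplication node.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # Leaves carry a gene name; internal nodes carry an event label:
    # "S" (speciation), "D" (duplication) or "T" (transfer).
    name: str = ""
    event: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaves(node):
    """Return the names of all leaves below a node."""
    if not node.children:
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaves(child))
    return names


RELATION = {"S": "orthologs", "D": "paralogs", "T": "xenologs"}


def relation(root, gene_a, gene_b):
    """Classify two distinct genes by the event at their last common ancestor."""
    node = root
    while True:
        for child in node.children:
            names = leaves(child)
            if gene_a in names and gene_b in names:
                node = child  # both genes are deeper in this subtree
                break
        else:
            # No single child holds both genes: node is the last common ancestor.
            return RELATION.get(node.event, "unknown")


# Toy family: an ancient duplication followed by a human/mouse speciation.
tree = Node(event="D", children=[
    Node(event="S", children=[Node("human_A"), Node("mouse_A")]),
    Node(event="S", children=[Node("human_B"), Node("mouse_B")]),
])

print(relation(tree, "human_A", "mouse_A"))
print(relation(tree, "human_A", "mouse_B"))
```

The sketch recomputes leaf sets at every step, which is fine for a toy tree but would be replaced by a cached or bottom-up traversal on real families.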

7 Building
Compare all proteins against each other (BLAST)
Cluster homologous sequences into families (SiLiX + HiFiX)
Compute multiple alignments for each family
Compute phylogenetic trees for each family
Annotate phylogenetic trees (gene duplications, losses, transfers)

8 Compare all proteins against each other
Iterative BLAST calculation
Use of a non-redundant protein sequence database …
… (all known proteins, about 20,000,000 non-redundant sequences) …
… associated with a database of the resulting BLAST hits (from which BLAST hits may be extracted)
Cluster, grid and cloud computing

9 Building
Compare all proteins against each other (BLAST)
Cluster homologous sequences into families (SiLiX + HiFiX)
Compute multiple alignments for each family
Compute phylogenetic trees for each family
Annotate phylogenetic trees (gene duplications, losses, transfers)

10 SiLiX 1st step: similarity search
Protein database, BLASTP (BLOSUM62, E ≤ 10⁻⁴), local pairwise alignments
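The E-value threshold of this first step can be sketched in Python as a filter over BLAST tabular output. This assumes the hits were written in the standard 12-column tabular format (`-outfmt 6`). The function name `significant_hits` and the sample lines are made up for illustration and are not part of SiLiX.

```python
import csv
import io

E_VALUE_MAX = 1e-4  # threshold quoted on the slide

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def significant_hits(handle, e_value_max=E_VALUE_MAX):
    """Yield the hits whose E-value does not exceed the threshold."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if float(row["evalue"]) <= e_value_max:
            yield row


# Two invented hits: one clearly significant, one not.
sample = (
    "A\tB\t62.5\t200\t75\t0\t1\t200\t1\t200\t1e-50\t250\n"
    "A\tC\t30.0\t50\t35\t0\t10\t59\t5\t54\t0.5\t28.1\n"
)
hits = list(significant_hits(io.StringIO(sample)))
print([(h["qseqid"], h["sseqid"]) for h in hits])
```

With a real file, the same function would be called on an open file handle instead of the in-memory sample.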

11 SiLiX 2nd step : SiLiX clustering Use the all-against-all BLAST hits

12 SiLiX : Selection of consistent HSPs
[Figure: HSPs between Seq. A and Seq. B, with labels S1’, S2, lgHSP1, lgHSP2, ∆lg1, ∆lg2 and ∆lg3]

13 SiLiX : single linkage clustering
Criteria: HSP ≥ 80 % of length, identity ≥ 35 %
[Figure: sequences A, B and C joined by single linkage into one cluster A, B, C]
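Single linkage clustering under these two thresholds can be sketched with a union-find structure. This is only an illustration, not the SiLiX implementation. In particular, it assumes that the 80 % coverage is measured against the longer of the two sequences, and the input layout (a dict of sequence lengths plus a list of `(a, b, identity, hsp_length)` tuples) is hypothetical.

```python
MIN_COVERAGE = 0.80   # HSP must span at least 80 % of the length
MIN_IDENTITY = 35.0   # percent identity threshold


def find(parent, x):
    """Return the representative of x, compressing the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def single_linkage(lengths, hits):
    """Group sequences linked, directly or transitively, by accepted hits."""
    parent = {name: name for name in lengths}
    for a, b, identity, hsp_length in hits:
        # Assumption: coverage is relative to the longer sequence.
        coverage = hsp_length / max(lengths[a], lengths[b])
        if coverage >= MIN_COVERAGE and identity >= MIN_IDENTITY:
            parent[find(parent, a)] = find(parent, b)
    clusters = {}
    for name in lengths:
        clusters.setdefault(find(parent, name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


lengths = {"A": 100, "B": 105, "C": 110, "D": 300}
hits = [
    ("A", "B", 60.0, 95),   # accepted
    ("B", "C", 40.0, 100),  # accepted
    ("A", "C", 20.0, 90),   # rejected: identity too low
    ("C", "D", 80.0, 100),  # rejected: covers only a third of D
]
print(single_linkage(lengths, hits))
```

A and C end up in the same cluster even though their direct hit is rejected, because both are linked to B. That transitivity is what single linkage means.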

14 SiLiX
SiLiX: single linkage clustering with alignment coverage constraints (Miele et al. BMC Bioinformatics 2011)
Computing efficiency: ultra-fast, memory efficient, scalable (parallel architecture)
Clustering quality: at least as good as the previously published methods

15 However …
Because of over-extension of BLAST alignments, some sequences that share only partial homology may be clustered in the same family.
The risk of alignment over-extension is low, but it becomes a problem for very large protein families.
Use more stringent clustering criteria? No: optimal clustering criteria are not the same for all families.

16 HiFiX
The mode and tempo of evolution are specific to each protein family.
A multiple alignment provides information about the specific pattern of evolution of a family => this can be used to decide whether or not a new sequence belongs to that family.

17 HiFiX
Step 1: rapid clustering (SiLiX), giving pre-families
Step 2: sub-clustering of pre-families into homogeneous protein clusters, giving sub-families
Step 3: progressive merging of sub-families into families, with evaluation of multiple alignment quality at each step, giving families

18 HiFiX

19 HiFiX

20 HiFiX

21 Results of clustering
About 7,000,000 proteins clustered into 300,000 families
Family size distribution (number of sequences, number of families):
[Table residue, digits lost in extraction: at least ,920 2: ,398 10: ,450 500: ,026 more than]

22 Building
Compare all proteins against each other (BLAST)
Cluster homologous sequences into families (SiLiX + HiFiX)
Compute multiple alignments for each family
Compute phylogenetic trees for each family
Annotate phylogenetic trees (gene duplications, losses, transfers)

23 Compute multiple alignments
All alignments (~300,000) have been calculated with ClustalΩ

24 Building
Compare all proteins against each other (BLAST)
Cluster homologous sequences into families (SiLiX + HiFiX)
Compute multiple alignments for each family
Compute phylogenetic trees for each family
Annotate phylogenetic trees (gene duplications, losses, transfers)

25 Compute phylogenetic tree
Question: what about alternative splicing?

26 Alternative splicing
In eukaryotes, due to alternative splicing, a single gene may be transcribed into several transcripts

27 Transcripts in HOGENOM6
We selected all the transcripts for each gene, because the longest transcript is not always the best!

28 Selection of a representative isoform in HOGENOM
Because:
We don’t want several proteins for the same gene in a phylogenetic tree: they may be seen as a duplication
We want 1 protein per gene for statistical comparisons among organisms

29 Selection of a representative isoform : how ?

30 Selection of a representative isoform : how ?
Eukarya: 1 or more transcripts per gene
Archaea and bacteria: 1 transcript per gene

31 Selection of a representative isoform : how ?
[Figure: clustering of the Eukarya, Archaea and bacteria sequences]

32 Selection of a representative isoform : how ?
First step: when a gene has isoforms in different families ( ), choose a family for the gene

33 Selection of a representative isoform : how ?
We select the family with the highest number of eukaryotic genes (and not proteins).
[Figure: three candidate families, containing 2 genes, 2 genes and 3 genes]

34 Selection of a representative isoform : how ?
We select the family with the highest number of eukaryotic genes (and not proteins).
If the numbers of eukaryotic genes are identical, we select the family with the highest number of eukaryotic proteins.
[Figure: three candidate families, containing 2 genes, 2 genes and 3 genes]

35 Selection of a representative isoform : how ?
We select the family with the highest number of eukaryotic genes (and not proteins).
If the numbers of eukaryotic genes are identical, we select the family with the highest number of eukaryotic proteins.
If the numbers of eukaryotic proteins are identical, we select the family with the highest number of proteins.
[Figure: three candidate families, containing 2 genes, 2 genes and 3 genes]

36 Selection of a representative isoform : how ?
We select the family with the highest number of eukaryotic genes (and not proteins).
If the numbers of eukaryotic genes are identical, we select the family with the highest number of eukaryotic proteins.
If the numbers of eukaryotic proteins are identical, we select the family with the highest number of proteins.
[Figure: three candidate families, containing 2 genes, 2 genes and 3 genes]
The « rejected » isoforms are called « ISOFORMEX ».
Some families may end up empty after this step.
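The three tie-breaking rules of this first step amount to a lexicographic comparison, which the following Python sketch makes explicit. The record layout (`euk_genes`, `euk_proteins`, `proteins`), the family names and the counts are invented for illustration. The slides do not say what happens if all three counts tie; this sketch simply keeps the first candidate in that case.

```python
def choose_family(candidates):
    """Pick the family for a gene whose isoforms fall in several families.

    Ranking, in order: number of eukaryotic genes, then number of
    eukaryotic proteins, then total number of proteins.
    """
    best = max(
        candidates,
        key=lambda fam: (fam["euk_genes"], fam["euk_proteins"], fam["proteins"]),
    )
    return best["name"]


candidates = [
    {"name": "FAM1", "euk_genes": 2, "euk_proteins": 3, "proteins": 10},
    {"name": "FAM2", "euk_genes": 3, "euk_proteins": 3, "proteins": 5},
    {"name": "FAM3", "euk_genes": 3, "euk_proteins": 4, "proteins": 4},
]
chosen = choose_family(candidates)
print(chosen)

# Isoforms of the gene sitting in the other families would be tagged ISOFORMEX.
rejected = [fam["name"] for fam in candidates if fam["name"] != chosen]
print(rejected)
```

FAM2 and FAM3 tie on eukaryotic genes, so the second rule decides between them; the larger total size of FAM1 never comes into play.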

37 Selection of a representative isoform : how ?
Second step: when a gene has several isoforms in a family, choose a representative isoform for the gene.
[Figure: three families, containing 2 genes, 2 genes and 3 genes]

38 Selection of a representative isoform : how ?
Second step: when a gene has several isoforms in a family, choose a representative isoform for the gene.
[Figure: the same three families (2 genes, 2 genes, 3 genes), with question marks on the isoforms to choose between]

39 Selection of a representative isoform : how ?
We use the alignment

40 Selection of a representative isoform : how ?
We use the alignment
Removal of ISOFORMEX

41 Selection of a representative isoform : how ?
We use the alignment
Selection of positions with < 50% gaps

42 Selection of a representative isoform : how ?
For each isoform of a given gene, and for each position, we add 1 for every other eukaryotic gene in which at least one isoform has the identical residue at that position. The isoform with the highest total is selected; the other isoforms are tagged as ISOFORMIN.
[Figure: alignment with per-position counts]

43 Selection of a representative isoform : how ?
For each isoform of a given gene, and for each position, we add 1 for every other eukaryotic gene in which at least one isoform has the identical residue at that position. The isoform with the highest total is selected; the other isoforms are tagged as ISOFORMIN.
[Figure: alignment with per-position counts]

44 Selection of a representative isoform : how ?
For each isoform of a given gene, and for each position, we add 1 for every other eukaryotic gene in which at least one isoform has the identical residue at that position. The isoform with the highest total is selected; the other isoforms are tagged as ISOFORMIN.
[Figure: alignment with per-position counts]
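The scoring rule can be sketched in Python as follows. This is one reading of the slide, not the HOGENOM code. It assumes that the alignment holds only eukaryotic genes with ISOFORMEX already removed, that gap characters never count as matching residues, and that positions are kept when fewer than 50 % of the sequences have a gap there. The nested-dict layout and the sequences are invented.

```python
GAP = "-"


def kept_columns(alignment, max_gap_fraction=0.5):
    """Indices of the alignment columns with less than 50 % gaps."""
    seqs = [s for isoforms in alignment.values() for s in isoforms.values()]
    n_cols = len(seqs[0])
    return [
        i for i in range(n_cols)
        if sum(s[i] == GAP for s in seqs) / len(seqs) < max_gap_fraction
    ]


def representative_isoform(alignment, gene):
    """Score each isoform of `gene` and return (best isoform, all scores)."""
    columns = kept_columns(alignment)
    scores = {}
    for iso_name, iso_seq in alignment[gene].items():
        score = 0
        for i in columns:
            if iso_seq[i] == GAP:
                continue  # assumption: a gap is not a residue
            for other_gene, other_isoforms in alignment.items():
                if other_gene == gene:
                    continue
                # +1 if at least one isoform of the other gene agrees here.
                if any(s[i] == iso_seq[i] for s in other_isoforms.values()):
                    score += 1
        scores[iso_name] = score
    best = max(scores, key=scores.get)
    return best, scores


# alignment[gene][isoform] = aligned sequence (all of equal length)
alignment = {
    "gene1": {"g1.a": "MKTAY", "g1.b": "MK--Y"},
    "gene2": {"g2.a": "MKTAY"},
    "gene3": {"g3.a": "MRTAF", "g3.b": "MKTAF"},
}
best, scores = representative_isoform(alignment, "gene1")
print(best, scores)
```

Here g1.a is kept as the representative of gene1 and g1.b would be tagged ISOFORMIN. The shorter isoform loses because its two gapped positions contribute nothing to its total.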

45 Tree calculation

46 Tree calculation
[Figure: sequences a to g, several of them tagged ISOFORMIN or ISOFORMEX]

47 Tree calculation
[Figure: sequences a to g, several of them tagged ISOFORMIN or ISOFORMEX]

48 Tree calculation
Tools: Gblocks; PhyML, FastTree
[Figure: sequences a to g, several of them tagged ISOFORMIN or ISOFORMEX]

49 Building
Compare all proteins against each other (BLAST)
Cluster homologous sequences into families (SiLiX + HiFiX)
Compute multiple alignments for each family
Compute phylogenetic trees for each family
Annotate phylogenetic trees (gene duplications, losses, transfers)

50 Annotate phylogenetic trees
Several methods are currently being developed in the Ancestrome project:
Speciation, Duplication and Loss
Speciation, Duplication, Transfer and Loss
See Vincent Daubin's talk tomorrow

51 Querying the database ACNUC server (client-server application, R package, Python package, C API, Bio++ API)

52 Querying the database Web interface on PRABI

53 Querying the database Web interface on PRABI

54 Querying the database Web interface on PRABI

55 Querying the database Homologous families detected with HMM (D. Guyot)

56 Querying the database New tools ! (R. Planel, J.F. Dufayard)

57 Querying the database Displaying the gene tree and the synteny context of the gene

58 Querying the database Displaying the gene tree and the synteny context of the gene

59 Querying the database Search for orthologous vertebrate genes between mouse and man

60 Querying the database Search for orthologous vertebrate genes between mouse and man

61 Thank you for your attention
Ancestrome: Integrative phylogenetic approaches for reconstructing ancestral "-omes"

