Automatic genome-wide reconstruction of phylogenetic gene trees

1 Automatic genome-wide reconstruction of phylogenetic gene trees
Ilan Wapinski, Avi Pfeffer, Nir Friedman, Aviv Regev. Bioinformatics, Vol. 23, No. 13 (1 July 2007). Presented by: Hsin-Ta Wu, CSCI-2950C Topics in Computational Biology, 2008.NOV.4

2 Outline Introduction Objective Methods Results Conclusion Discussion

3 Gene Duplication and Loss
Gene duplication and loss is a powerful source of functional innovation, including the development of new functions or the pruning of old ones. Gene duplication is the most important mechanism for generating new genes and new biochemical processes that have facilitated the evolution of complex organisms from simpler ones. [Figure: a normal fly next to a fly with an extra set of wings, labeled Ubx (Ultrabithorax) gene duplication. Source: Carroll S.B. et al., From DNA to Diversity (2001), Blackwell Science]

4 Relation between Evolutionary History and Gene Duplication / Loss?
What classes of genes readily evolve through duplication and loss? What innovations typically arise from gene duplication events? Studies addressing such questions have been limited by the difficulty of tracing the exact evolutionary history of genes. Reconstructing gene trees with reliable resolution of orthology and paralogy enables a systematic study of gene duplication and loss events.

5 Outline Introduction Objective Methods Results Conclusion Discussion

6 OBJECTIVE Develop a scalable algorithm to reconstruct the underlying evolutionary history of all genes in a large group of species.

7 Outline Introduction Objective Methods Results Conclusion Discussion
Background Knowledge Introduce Orthogroup SYNERGY Algorithm Results Conclusion Discussion

8 Phylogenetic species tree
Leaf: extant species (modern observations). Internal node: common ancestral species. Root: ancestral species. Speciation: the evolutionary process by which new biological species arise. [Figure: species tree with nodes labeled a, b, c, X, Y and speciation events marked]

9 Difference between Species and Gene tree
[Figure: a species tree next to a gene tree] We denote by ga a gene in species a. The gene tree shows the evolutionary descent of the ancestral gene g1x (each node in the tree is a gene or a duplication event). The gene tree carries more information about duplication and loss events, and paralogy and orthology relations are explicit in it.

10 Definition – orthologs & paralogs
Orthologs: genes that share a common ancestor at a speciation event. Paralogs: genes that are related through duplication events.

11 Outline Introduction Objective Methods Results Conclusion Discussion
Background Knowledge Introduce Orthogroup SYNERGY Algorithm Results Conclusion Discussion
[Figure: genes g1a, g2a, g1b, g2b, g1c grouped into orthogroups OG1Y, OG2Y and OG1x] An orthogroup OGxi is defined with respect to an ancestral species x in T and includes only and all of those genes from the extant species under x that descended from a single common ancestral gene in x.

12 Why orthogroup?
Orthogroups contain the shared ancestral relationships between genes at each internal node of a phylogenetic species tree, which makes them useful for reconstructing the gene tree. [Figure: genes g1a, g2a, g1b, g2b, g1c grouped into OG1Y, OG2Y and OG1x] OGix ← {g1a, g1c, g2a, g2b, g1b}; OGiY ← {g1a, g1b}; OGiZ ← {g2a, g2b}

13 What is an orthogroup?
An orthogroup is the set of genes that descended from a single common ancestral gene. In species tree T, OGiX contains all of those genes from the extant species under X (a and b). The genes in OGiX are descended from a single common ancestral gene gix in X. [Figure: species tree T with x above a and b; gene gix in x gives rise to gia in a and gib in b] OGix ← {gia, gib}

14 Gene Tree Pix
Each orthogroup OGix has a corresponding gene tree Pix. The leaves of Pix are the genes in OGix. [Figure: species tree T with x above a and b, and gene tree Pix with a duplication event leading to g1a and g2a] OGix ← {g1a, g2a, g1b}

15 Important Definitions
Sound orthogroup (Definition 1): contains only the genes that descended from a single common ancestor (specificity). Complete orthogroup (Definition 2): contains all the genes that descended from a single common ancestor (sensitivity). ※ The importance of the two definitions is discussed later.

16 Outline Introduction Objective Methods Results Conclusion Discussion
Background Knowledge Introduce Orthogroup SYNERGY Algorithm Results Conclusion Discussion

17 SYNERGY Algorithm
SYNERGY recursively traverses the nodes of the given species tree T from its leaves to its root (bottom-up strategy), identifying orthogroups with respect to each node. Input: a set of species, including the species tree T, the sequences of the predicted genes for each extant species, and the chromosomal positions of the genes in each species. Output: the gene tree.
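
The bottom-up strategy can be pictured as a post-order walk of the species tree. The sketch below is only an illustration of the visiting order, not SYNERGY's code; the tree (Y above X and c, X above a and b) and the function names are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical species tree: each internal node maps to its children.
SPECIES_TREE: Dict[str, List[str]] = {"Y": ["X", "c"], "X": ["a", "b"]}


def postorder(tree: Dict[str, List[str]], root: str) -> List[str]:
    """Return all nodes so that every child appears before its parent."""
    order: List[str] = []

    def visit(node: str) -> None:
        for child in tree.get(node, []):
            visit(child)
        order.append(node)

    visit(root)
    return order


def stages(tree: Dict[str, List[str]], root: str) -> List[str]:
    """Internal nodes in the order a bottom-up method would process them."""
    return [node for node in postorder(tree, root) if node in tree]


print(postorder(SPECIES_TREE, "Y"))
print(stages(SPECIES_TREE, "Y"))
```

With this tree, the orthogroups at X are resolved before those at Y, so each stage can reuse the results of the nodes below it.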

18 Pipeline
Input data: species tree T; sequence and chromosomal location for each extant species in T. Pre-processing: scoring gene similarity and building the gene similarity graph. SYNERGY algorithm: identifying orthogroups and reconstructing the gene tree. Output data: gene trees. [Figure: pipeline diagram showing a pairwise distance matrix, a gene similarity graph over genes A1-A4, B1-B3, C1, C2, C4, and the resulting orthogroups and gene tree]

19 Previewing SYNERGY Algorithm
Considering species A and B below X in the species tree T: determine OG1X from g1A and g1B. The last common ancestor is X and the common ancestral gene is g1X. Considering species X and C below Y in the species tree T: determine OG1Y from OG1X and g1C. The last common ancestor is Y and the common ancestral gene is g1Y. [Figure: species tree with Y above X and C, X above A and B; genes A1-A4, B1-B3, C1, C2, C4; gene trees P1Y, P2Y, P4Y]

20 Two issues in the algorithm
1) How to identify the orthogroup OGix? 2) Given an orthogroup OGix, how to reconstruct a gene tree? [Figure: orthogroups OGix, OGiy, OGiz; g1A and g1B form OG1X, which joins g1C in OG1Y]

21 SYNERGY Algorithm – Matching orthogroups
If A1 and B1 are similar, record the relation in the data structure. [Figure: species tree T with X above A and B; genes A1 and B1; candidate orthogroup OG1x containing g1A and g1B; orthology assignment] How do we determine that two genes are similar? Recall the pre-processing step: scoring the gene similarity.

22 Selecting similar genes for Orthogroup
SYNERGY relies on pre-computed distances between genes to make orthology assignments. It executes all-versus-all FASTA alignments between all genes in the input. The relations between genes are represented by a gene similarity graph, a weighted directed graph G = (V, E).

23 Pipeline for gene similarity graph
[Figure: FASTA alignment of g1a and g1b; the method derives from the logic of the dot plot and computes the best diagonals of the alignment] A gene pair (gia, gib) is defined as significantly similar when (1) the FASTA E-value is below 0.1 (significance), and (2) either gib is the best FASTA hit in species b to gia, or the percent identity between gia and gib is above 50% of that between gia and its best hit in b (similarity). The E-value is an estimate of the likelihood of a similar match occurring by chance. Should percent identity or E-value be used as the edge weight? No, because identity and E-values are unsuitable for representing the nearest phylogenetic neighbor (Koski and Golding, 2001; Wall et al., 2003).
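
The two edge criteria can be written as a small filter. This is a minimal sketch, assuming the hits from one query gene to one other species are already parsed into records, and assuming "best hit" means the hit with the lowest E-value; the `Hit` record and the function name are hypothetical, not part of FASTA or SYNERGY.

```python
from dataclasses import dataclass
from typing import List

E_VALUE_CUTOFF = 0.1     # significance threshold from the slide
IDENTITY_FRACTION = 0.5  # "above 50% of the best hit's identity"


@dataclass(frozen=True)
class Hit:
    query: str       # gene gia in species a
    target: str      # gene gib in species b
    evalue: float    # FASTA E-value
    identity: float  # percent identity


def similar_targets(hits: List[Hit]) -> List[str]:
    """Targets in ONE species that earn an edge from the query gene."""
    if not hits:
        return []
    # Assumption: the best hit is the one with the lowest E-value.
    best = min(hits, key=lambda h: h.evalue)
    kept = []
    for h in hits:
        if h.evalue >= E_VALUE_CUTOFF:
            continue  # not significant
        if h is best or h.identity > IDENTITY_FRACTION * best.identity:
            kept.append(h.target)
    return kept


hits = [
    Hit("g1a", "g1b", 1e-50, 80.0),  # best hit
    Hit("g1a", "g2b", 1e-10, 45.0),  # 45 > 0.5 * 80, kept
    Hit("g1a", "g3b", 1e-5, 30.0),   # 30 < 0.5 * 80, dropped
    Hit("g1a", "g4b", 0.5, 60.0),    # E-value too high, dropped
]
print(similar_targets(hits))
```

The filter only decides which edges exist; the edge weights come from the separate distance scores described on the following slides.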

24 How to get the weight of edge?
Weighting each edge by the distance similarity method. Pre-compute peptide sequence similarity and synteny similarity scores as distance between two genes. Peptide sequence score – globally align two proteins based on JTT amino acid substitution matrix. Synteny similarity score – the fraction of their neighbors that are orthologous to each other

25 Synteny similarity score
Synteny: similar (syntenic) blocks comprised of multiple genes on the chromosome. The synteny similarity score quantifies the similarity between the chromosomal neighborhoods of two genes. SYNERGY computes the score between two genes as the fraction of their neighbors that are orthologous to each other. [Figure: syntenic blocks on the chromosomes of species a and b below X] In the example, the synteny similarity score ds for g3a and g3b is 2/3.
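
A rough sketch of the neighbor-fraction idea follows. The slides do not say how the neighborhood window is chosen or which count forms the denominator, so both are assumptions here: the neighborhoods are passed in explicitly and the fraction is taken over the larger neighborhood. The example neighborhoods are hypothetical, chosen so that the result matches the 2/3 quoted on the slide. The later scaling of this value into a distance is not reproduced.

```python
from fractions import Fraction
from typing import FrozenSet, List, Set


def synteny_fraction(
    neighbors_a: List[str],
    neighbors_b: List[str],
    orthologs: Set[FrozenSet[str]],
) -> Fraction:
    """Fraction of neighboring genes that have an orthologous partner
    in the other gene's neighborhood (denominator choice is an assumption)."""
    if not neighbors_a or not neighbors_b:
        return Fraction(0)
    matched = sum(
        1
        for ga in neighbors_a
        if any(frozenset((ga, gb)) in orthologs for gb in neighbors_b)
    )
    return Fraction(matched, max(len(neighbors_a), len(neighbors_b)))


# Hypothetical neighborhoods of g3a and g3b.
neighbors_of_g3a = ["g1a", "g2a", "g4a"]
neighbors_of_g3b = ["g1b", "g4b"]
known_orthologs = {frozenset(("g1a", "g1b")), frozenset(("g4a", "g4b"))}

print(synteny_fraction(neighbors_of_g3a, neighbors_of_g3b, known_orthologs))
```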

26 Peptide similarity score
Globally align two proteins based on JTT amino acid substitution matrix (Jones et al. 1992).

27 Distance of two genes - example
[Figure: genes g1A, g2A, g3A, g4A in species A and g1B, g3B, g4B in species B] Both dp and ds are scaled and treated as distances for assessing protein and chromosomal evolution between pairs of genes. Two genes with high similarity have scores close to 0; genes sharing no similarity have a score of 2.0. The peptide similarity score dp for g3a and g3b is 48. The synteny similarity score ds for g3a and g3b is 2/3.

28 SYNERGY Algorithm – Matching orthogroups
If A1 and B1 are similar, record the relation in the graph. [Figure: species tree T with X above A and B; genes A1 and B1; candidate orthogroup OG1x containing g1A and g1B; orthology assignment] Gene similarity is stored in the graph structure. What is the condition for making an orthology assignment?

29 Generate Candidate Orthogroup
SYNERGY assigns orthogroups (genes) to the same candidate orthogroup if they have reciprocal edges between them. [Figure: species tree T with X above A and B; gene similarity graph over A1, A2, A3 and B1, B2, B3; candidate orthogroups OG1X = {g1A, g1B} and OG2X = {g2A, g2B}]

30 Generate Candidate Orthogroup
SYNERGY then applies transitive closure to these reciprocal relations, so that orthogroups (genes) linked through a chain of reciprocal edges fall into the same candidate orthogroup. [Figure: species tree T with X above A and B; gene similarity graph over A1, A2, A3 and B1, B2, B3; candidate orthogroups OG1X = {g1A, g1B} and OG2X = {g2A, g2B}]
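
Reciprocal edges followed by transitive closure amount to finding connected components over the reciprocal pairs. The sketch below uses a union-find structure for that; the graph, its edges and the function name are hypothetical and only illustrate the grouping rule.

```python
from typing import Dict, List, Set, Tuple


def candidate_orthogroups(
    nodes: List[str], edges: Set[Tuple[str, str]]
) -> List[List[str]]:
    """Group nodes connected through chains of reciprocal directed edges."""
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(n: str) -> str:
        # Follow parent links to the representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in edges:
        if (v, u) in edges:  # keep only reciprocal relations
            parent[find(u)] = find(v)

    groups: Dict[str, Set[str]] = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


nodes = ["A1", "A2", "A3", "B1", "B2", "B3"]
edges = {
    ("A1", "B1"), ("B1", "A1"),  # reciprocal
    ("A2", "B2"), ("B2", "A2"),  # reciprocal
    ("A2", "B3"), ("B3", "A2"),  # reciprocal, joins B3 to A2 and B2
    ("A3", "B3"),                # one-way only, ignored
}
print(candidate_orthogroups(nodes, edges))
```

Because the closure is lenient, a group such as {A2, B2, B3} may mix genes from more than one ancestral gene; the slides handle that later by splitting orthogroups during tree reconstruction.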

31 Two issues in the algorithm
1) How to identify the orthogroup OGix? 2) Given an orthogroup OGix, how to reconstruct a gene tree? [Figure: orthogroups OGix, OGiy, OGiz; g1A and g1B form OG1X, which joins g1C in OG1Y]

32 Reconstruction of Gene Tree
[Figure: species tree with nodes x, y, z, a, b, c, d; orthogroups OGix, OGiy, OGiz over genes g1a, g1b, g2b, g1c, g1d; one pairing is labeled as a one-to-one relation] Recall that the trees {Py} and {Pz} were already resolved in the previous iteration. How do we resolve one-to-many or many-to-many relations caused by duplications and/or losses?

33 Solving one-to-many relation during reconstruction
Use the modified Neighbor-Joining method, applied to the distance matrix between the orthogroups that comprise OGix. [Figure: g1a, g1b, g2b in OG1Y, joined with g1Z]
Distance matrix:
        g1b   g2b   g1a
g1Z     24    44    15
g1b           11    20
g2b                 26
Dist[g1Z, g1b_g2b] = (Dist[g1Z, g1b] + Dist[g1Z, g2b] - Dist[g1b, g2b]) / 2 = (24 + 44 - 11) / 2 = 28.5
After merging g1b and g2b:
            g1b_g2b   g1a
g1Z         28.5      15
g1b_g2b               17.5
The result is an unrooted phylogenetic gene tree, so tree rooting still needs to be solved.
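
The distance update on this slide can be checked numerically. The sketch below reads the six distance values on the slide as an upper-triangular matrix over g1Z, g1b, g2b, g1a (an assumption about the layout, though it reproduces both 28.5 and 17.5 from the slide). It shows only the merge step, not the full modified Neighbor-Joining procedure.

```python
from typing import Dict, FrozenSet


def pair(x: str, y: str) -> FrozenSet[str]:
    """Unordered key for a symmetric distance table."""
    return frozenset((x, y))


# Distances as read from the slide's matrix.
dist: Dict[FrozenSet[str], float] = {
    pair("g1Z", "g1b"): 24, pair("g1Z", "g2b"): 44, pair("g1Z", "g1a"): 15,
    pair("g1b", "g2b"): 11, pair("g1b", "g1a"): 20,
    pair("g2b", "g1a"): 26,
}


def merged_distance(
    dist: Dict[FrozenSet[str], float], other: str, left: str, right: str
) -> float:
    """Distance from `other` to the new node that joins `left` and `right`."""
    return (
        dist[pair(other, left)] + dist[pair(other, right)] - dist[pair(left, right)]
    ) / 2


print(merged_distance(dist, "g1Z", "g1b", "g2b"))  # 28.5
print(merged_distance(dist, "g1a", "g1b", "g2b"))  # 17.5
```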

34 The importance of Tree Rooting
Correct rooting is important since the selected root position may determine whether all of an orthogroup’s members descended from a single gene or from multiple genes.

35 How to choose a tree’s root?
Assumption: rates of evolution among all the leaves in a tree are equal, so a tree's root should be approximately equidistant to all the leaves. SYNERGY computes every possible rooting r at an internal branch and assigns a score to each rooting. The score is proportional to the variance in both peptide sequence and synteny scores, termed πr and σr, respectively.
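
The equidistance idea on its own can be sketched as follows: try a root on each branch and prefer the one whose root-to-leaf distances vary least. This is not SYNERGY's scoring function (which combines πr, σr and a duplication/loss term). Placing the trial root at the midpoint of each branch, using a single distance type, and the example tree with its branch lengths are all assumptions made for illustration.

```python
from statistics import pvariance
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]


def leaf_distances(
    adj: Dict[str, Dict[str, float]], start: str, blocked: str, offset: float
) -> Dict[str, float]:
    """Distances from the trial root to each leaf on `start`'s side of a branch."""
    out: Dict[str, float] = {}
    stack = [(start, blocked, offset)]
    while stack:
        node, prev, d = stack.pop()
        onward = [(n, w) for n, w in adj[node].items() if n != prev]
        if not onward:  # nothing beyond this node, so it is a leaf
            out[node] = d
        for n, w in onward:
            stack.append((n, node, d + w))
    return out


def most_equidistant_rooting(edges: List[Edge]) -> Tuple[float, Tuple[str, str]]:
    """Return (variance, branch) for the branch whose midpoint is most equidistant."""
    adj: Dict[str, Dict[str, float]] = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    scored = []
    for u, v, w in edges:
        d = {**leaf_distances(adj, u, v, w / 2), **leaf_distances(adj, v, u, w / 2)}
        scored.append((pvariance(list(d.values())), (u, v)))
    return min(scored)


# Hypothetical unrooted tree: leaves a, b join at p; leaves c, d join at q.
tree_edges = [("a", "p", 1.0), ("b", "p", 1.0), ("p", "q", 2.0),
              ("c", "q", 1.0), ("d", "q", 1.0)]
print(most_equidistant_rooting(tree_edges))
```

In this toy tree, rooting on the p-q branch puts every leaf at the same distance from the root, so that branch wins.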

36 Scoring Function for Tree Rooting
πr: amino acid score. σr: synteny score. SYNERGY selects the rooting that maximizes: [Equation: rooting score] Following a gene duplication, one or both of the paralogs are often under relaxed selection and can evolve at an accelerated rate (Lynch and Katju, 2004; Ohno, 1970). This conflicts with the assumption above that all branches of the tree evolve at an equal rate, and complicates tree rooting. Therefore, SYNERGY introduces a score ωr for root locations, based on the number of duplications and losses that each rooting invokes. δs: rate of duplication at branch s. λs: rate of loss at branch s.

37 Multiple Ancestral Genes?
We may find that the root of Pix does not represent a single gene, because of an earlier duplication event (Fig. c). This violates Definition 1 (sound orthogroup); details are presented in the discussion. In that case, SYNERGY splits Pix, and it iterates this until each orthogroup represents a single ancestral gene and no orthogroups need to be partitioned.

38 Updating the Gene Similarity Graph
[Figure: updating the gene similarity graph. Genes g1A/g1B, g2A/g2B and g3A/g3B are collapsed into orthogroups OG1X, OG2X and OG3X, which then sit alongside the genes C1, C2, C4 of species C below Y.]

39 Updating the Gene Similarity Graph
[Figure: species tree with X above A and B, and C as sister to X; genes A1-A4, B1-B3, C1, C2, C4; orthogroups OG1X, OG2X, OG3X] New edges such as C1-OG1X, C2-OG2X and C2-OG3X need to be computed. The Neighbor-Joining distance formula is used to update the edges, for example for A2, B2, C2 and OG2X: C2OG2X = ½ (C2A2 + C2B2 – A2B2). If one of the distances in the equation is not defined in the original graph, SYNERGY uses the maximal distance value.
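
A small sketch of this edge update, including the fallback for missing distances. The slide does not state the maximal distance value; 2.0 is assumed here because an earlier slide gives 2.0 as the score of genes sharing no similarity. The distance values and the function name are hypothetical.

```python
from typing import Dict, FrozenSet

MAX_DISTANCE = 2.0  # assumption: the "no similarity" score from slide 27


def orthogroup_edge(
    dist: Dict[FrozenSet[str], float],
    gene: str,
    member_a: str,
    member_b: str,
    max_distance: float = MAX_DISTANCE,
) -> float:
    """Distance from `gene` to the orthogroup formed by `member_a` and `member_b`."""

    def d(x: str, y: str) -> float:
        # Undefined edges fall back to the maximal distance.
        return dist.get(frozenset((x, y)), max_distance)

    return 0.5 * (d(gene, member_a) + d(gene, member_b) - d(member_a, member_b))


# Hypothetical scaled distances; the C2-B2 edge is missing from the graph.
graph = {
    frozenset(("C2", "A2")): 0.4,
    frozenset(("A2", "B2")): 0.2,
}
print(orthogroup_edge(graph, "C2", "A2", "B2"))

# Same update once the C2-B2 edge is defined.
graph[frozenset(("C2", "B2"))] = 0.6
print(orthogroup_edge(graph, "C2", "A2", "B2"))
```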

40 Review the SYNERGY algorithm
1) Pre-processing (scoring gene similarity and building the gene similarity graph)
2) Identifying orthogroups (using gene similarity and orthology assignments)
3) Reconstructing the gene tree (rooting, breaking orthogroups)
4) Updating the gene similarity graph (using the Neighbor-Joining formula to update edges)
5) Repeating steps 2 to 4 recursively for the next stage
6) The result is the whole gene tree.

41 Outline Introduction Objective Methods Results Conclusion Discussion
Test on Ascomycota fungi Comparison to curated resource Conclusion Discussion

42 SYNERGY Application on fungal species
Why fungi? The lineage underwent a whole-genome duplication (WGD) event, followed by widespread loss of paralogous genes (Byrne and Wolfe, 2005; Dietrich et al., 2004; Kellis et al., 2004). It also includes a well-studied model organism, Saccharomyces cerevisiae, which supports studies of genome evolution and function (Kellis et al., 2003). Source: nine Ascomycota fungal species with a total of 52,092 protein-coding genes.

43 Results of the test case
[Figure panels: (c) a species tree of nine fungi (Scannell et al., 2006); the gene tree reconstructed by SYNERGY for OG#3184; the gene tree constructed for the same set of genes using CLUSTALW's Neighbor-Joining] A much larger number of duplication and loss events must be invoked to reconcile the CLUSTALW tree with the known species phylogeny.

44 Results of the test case
Predicted protein coding genes The number and percent of singleton in SYNERGY’s prediction The number of ancestral genes inferred from SYNERGY’s gene trees 15 The number of duplication events 645 The number of loss events (c) A species tree of nine fungi (Scannell et al., 2006)

45 Results of the test case
Three species have a large number of duplication events!! Faulty ORF predictions amongst three sensu stricto species. (c) A species tree of nine fungi (Scannell et al., 2006)

46 SYNERGY versus RBH (reciprocal best hits)
Comparing SYNERGY's results with those attained by RBH anchored by S. cerevisiae shows a marked improvement in performance: 1. Fewer singletons in S. cerevisiae. 2. Orthologs identified for 106 more S. cerevisiae genes than with RBH (data not shown). 3. 298 more orthogroups spanning all species than with RBH. Many orthogroups have more than nine genes, a result of gene duplication events.
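
For reference, the RBH baseline can be sketched in a few lines. The example assumes each gene's best hit in the other species has already been computed; the dictionaries and gene names are hypothetical.

```python
from typing import Dict, List, Tuple


def reciprocal_best_hits(
    best_a_to_b: Dict[str, str], best_b_to_a: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# a2 and a3 are duplicates that both hit b2; only one of them can pair with it.
best_a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
best_b_to_a = {"b1": "a1", "b2": "a3"}
print(reciprocal_best_hits(best_a_to_b, best_b_to_a))
```

Here a2 is left without an ortholog even though it has a clear hit in the other species, which illustrates why a strictly one-to-one method tends to report more singletons when duplications are present.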

47 Measuring Orthogroup Robustness
Jackknife-based approach: repeatedly exclude different portions of the data (perturbations) to measure orthogroup robustness to (1) the choice of species included and (2) the accuracy of gene predictions within each species. Species confidence score: estimated by systematically hiding each branch of the species tree T and running SYNERGY separately, resulting in 31 holdout experiments. Gene confidence score: estimated by repeatedly withholding a random proportion of genes from each genome, with the probability of hiding each gene set to 0.1, over 50 holdout experiments.
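
The gene holdout scheme can be sketched as a generator of perturbed inputs, using the parameters quoted on the slide (hide probability 0.1, 50 experiments). The genome contents, the seed and the function name are placeholders; running the orthology method on each perturbed input and scoring soundness and completeness is not shown.

```python
import random
from typing import Dict, Iterator, List


def gene_holdouts(
    genomes: Dict[str, List[str]],
    hide_prob: float = 0.1,
    n_experiments: int = 50,
    seed: int = 0,
) -> Iterator[Dict[str, List[str]]]:
    """Yield one perturbed copy of the genomes per holdout experiment."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    for _ in range(n_experiments):
        yield {
            species: [g for g in genes if rng.random() >= hide_prob]
            for species, genes in genomes.items()
        }


# Placeholder genomes with 100 genes each.
genomes = {
    "A": [f"gA{i}" for i in range(100)],
    "B": [f"gB{i}" for i in range(100)],
}
runs = list(gene_holdouts(genomes))
print(len(runs), len(runs[0]["A"]))
```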

48 Measuring Orthogroup Confidence
For both species and gene confidence, SYNERGY tests the soundness and completeness of the identified orthogroups. The non-singleton orthogroups SYNERGY obtained are remarkably robust to a systematic perturbation of the set of included species: 93.5% are complete and 99.7% are sound at the 80% confidence level (Wapinski et al., Nature 2007). Perturbations to gene content were more disruptive than perturbations to species, and gene soundness was more robust than completeness. When removing up to 20% of the genes in each genome at random, 96.3% of the orthogroups are complete, and 78% are sound (Wapinski et al., Nature 2007).

49 Outline Introduction Objective Methods Results Conclusion Discussion
Test on Ascomycota fungi Comparison to curated resource Conclusion Discussion

50 Comparison to curated resource - YGOB
The Yeast Gene Order Browser (YGOB; Byrne and Wolfe, 2005) provides a "gold standard" of orthology and paralogy relations. YGOB assumes that the WGD is the only duplication event in the lineage and relies predominantly on synteny to assign orthology relations (manually curated). The authors also compared the quality of SYNERGY's paralogy assignments to that of INPARANOID (Remm et al., 2001). INPARANOID is a hit-clustering method designed to identify paralogous relations.

51 Comparison with INPARANOID
SYNERGY identified more known paralogs dating to the WGD than INPARANOID did. SYNERGY also showed greater sensitivity (orange cells) than INPARANOID when identifying orthology relations. Some of the reduced specificity may be the result of a limitation of the gold standard. YGOB is limited by two assumptions: (1) all duplication events originated in the WGD, and thus orthology is at most a two-to-one relationship; (2) gene order is nearly always conserved and thus can be used as the primary source of evidence for shared ancestry. Assumption 1 relegates a greater portion of genes to singletons without orthologs. Under Assumption 2, a far smaller proportion of YGOB's orthologous loci are ancestral to all of its species than of those that SYNERGY identified. [Table legend: SYNERGY (top number), INPARANOID (bottom number); sensitivity (orange cells), specificity (green cells), paralogues reported (blue cells)]

52 Outline Introduction Objective Methods Results Conclusion Discussion

53 Conclusion
SYNERGY performs genome-wide reconstruction of homology relations across multiple genomes. SYNERGY combines hit-clustering approaches with the phylogenetic reconstruction of tree-based methods. The results of SYNERGY markedly improve over the widely used RBH approach. They are also of comparable quality to a manually curated gold standard.

54 Outline Introduction Objective Methods Results Conclusion Discussion

55 Complete and Sound At each recursive Stage, SYNERGY assumes that sound and complete orthogroups are resolved for the lower nodes in the tree. SYNERGY ensures completeness by allowing many edges (candidate homology relations) into the input gene similarity graph and by applying a lenient criterion to derive candidate orthogroups. Then, SYNERGY achieves soundness by refining these coarse relations as we progress through the species tree, breaking orthogroups using phylogenetic principles at each Stage.

56 Violation of soundness during generating Candidate Orthogroup
SYNERGY ensures completeness by allowing many edges (candidate homology relations) into the input gene similarity graph and by applying a lenient criterion to derive candidate orthogroups. However, such a lenient criterion can violate soundness. [Figure: species tree with X above A and B; genes A1, A2 and B1; candidate orthogroup OG1X containing g1A, g2A and g1B, compared with the sound grouping OG1X = {g1A, g1B} and OG2X = {g2A, g2B}] Here, candidate orthogroup OG1X contains g2A, which is related through duplication events that predate X. Such violations of the orthogroup soundness condition (Definition 1) are handled later.

57 Thanks for Your Attention!

58 Orthogroup Confidence
A complete orthogroup (Definition 2) contains all the genes that descended from a single common ancestor and thus its genes should not ‘migrate out’ of it in the holdout experiments. where, h(gj , gk) and OGi (gj , gk) specify the last species in the tree in which gj and gk share a common ancestor in the holdout experiment h and the original orthogroup

59 A sound orthogroup (Definition 1) contains only the genes that descended from a single common ancestor, and thus new genes should not ‘migrate into’ the orthogroup in the holdout experiments. count the number of pairs of non-orthologous genes (gj, gk), gj in OGi , gk not in OGi that became orthologous

60 Introduction Current methods for inferring homology relation
Pair-wise sequence comparison: best bi-directional BLAST hits; focuses on one-to-one orthologs (no duplications).
Hit clustering methods: detect clusters in a graph of pair-wise hits.
Synteny methods: detect conserved regions, i.e. stretches of nearby hits.
Phylogenetic methods: a phylogeny of the family clusters orthologs near each other.

