Automatic genome-wide reconstruction of phylogenetic gene trees

Presentation transcript:

Automatic genome-wide reconstruction of phylogenetic gene trees. Ilan Wapinski, Avi Pfeffer, Nir Friedman, Aviv Regev. Bioinformatics, Vol. 23, No. 13 (1 July 2007). Presented by: Hsin-Ta Wu, CSCI-2950C Topics in Computational Biology, 4 November 2008.

Outline Introduction Objective Methods Results Conclusion Discussion

Gene Duplication and Loss. Gene duplication and loss are a powerful source of functional innovation, including the development of new functions and the pruning of old ones. Gene duplication is the most important mechanism for generating new genes and new biochemical processes that have facilitated the evolution of complex organisms from simpler ones. (Figure: a normal fly next to a fly with an extra set of wings, labeled Ubx (Ultrabithorax) gene duplication. Source: Carroll S.B. et al., From DNA to Diversity (2001), Blackwell Science.)

Relation between Evolutionary History and Gene Duplication / Loss? What classes of genes readily evolve through duplication and loss? What innovations typically arise from gene duplication events? Studies addressing such questions have been limited by the difficulty of tracing the exact evolutionary history of genes. Reconstructing gene trees with reliable resolution of orthology and paralogy makes a systematic study of gene duplication and loss events possible.

OBJECTIVE Develop a scalable algorithm to reconstruct the underlying evolutionary history of all genes in a large group of species.

Outline Introduction Objective Methods (Background Knowledge, Introduce Orthogroup, SYNERGY Algorithm) Results Conclusion Discussion

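Phylogenetic species tree. Leaves: extant species (modern observations). Internal nodes: common ancestors. Root: the ancestral species. Speciation: the evolutionary process by which new biological species arise. (Figure: a species tree with extant species a, b, c and ancestral species X and Y, with the speciation events marked.)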

Difference between Species Tree and Gene Tree. We denote by ga a gene in species a. The gene tree shows the evolutionary descent of the ancestral gene g1x; each node in the tree is a gene or a duplication event. A gene tree carries more information about duplication and loss events, and it makes paralogy and orthology relations explicit.

Definition: orthologs & paralogs. Orthologs: genes that share a common ancestor at a speciation event. Paralogs: genes that are related through a duplication event.
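
The two definitions can be illustrated with a minimal sketch, assuming a toy gene tree whose internal nodes are labeled by event type. The tree, the node names (root, n1, n2) and the helper functions are hypothetical and not taken from the slides; a pair of genes is classified by the event at its last common ancestor.

```python
# Hypothetical toy gene tree: node -> (parent, event).
# Leaves are genes; internal nodes are speciation or duplication events.
GENE_TREE = {
    "root": (None, "duplication"),
    "n1": ("root", "speciation"),
    "n2": ("root", "speciation"),
    "g1a": ("n1", "gene"),
    "g1b": ("n1", "gene"),
    "g2a": ("n2", "gene"),
    "g2b": ("n2", "gene"),
}


def ancestors(node):
    """Return the path from a node up to the root, starting with the node."""
    path = []
    while node is not None:
        path.append(node)
        node = GENE_TREE[node][0]
    return path


def relation(gene_x, gene_y):
    """Classify two distinct genes by the event at their last common ancestor."""
    seen = set(ancestors(gene_x))
    lca = next(n for n in ancestors(gene_y) if n in seen)
    return "orthologs" if GENE_TREE[lca][1] == "speciation" else "paralogs"


print(relation("g1a", "g1b"))  # orthologs: they meet at a speciation node
print(relation("g1a", "g2a"))  # paralogs: they meet at the duplication node
```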

Introduce Orthogroup. An orthogroup OGix is defined with respect to an ancestral species x in T and includes all and only those genes from the extant species under x that descended from a single ancestral gene in x. (Figure: orthogroups OG1x, OG1Y and OG2Y over the genes g1a, g2a, g1b, g2b, g1c.)

Why orthogroups? Orthogroups contain the shared ancestral relationships between genes at each internal node of a phylogenetic species tree, which makes them useful for reconstructing the gene tree. Example from the figure: OGix ← {g1a, g1c, g2a, g2b, g1b}; OGiY ← {g1a, g1b}; OGiZ ← {g2a, g2b}.

What is an orthogroup? An orthogroup is the set of genes that descended from a single common ancestral gene. Given a species tree T in which x is the parent of a and b, OGix contains all of those genes from the extant species under x (a and b). The genes in OGix are descended from a single common ancestral gene gix in x. Example: OGix ← {gia, gib}.

Gene Tree Pix. Each orthogroup OGix has a corresponding gene tree Pix, whose leaves are the genes in OGix. Example: in a species tree T where x is the parent of a and b, OGix ← {g1a, g2a, g1b}. Because species a contributes two genes (g1a and g2a) that descend from the same ancestral gene gix, the gene tree Pix contains a duplication event.

Important Definitions. Sound orthogroup (Definition 1): contains only genes that descended from a single common ancestor (specificity). Complete orthogroup (Definition 2): contains all the genes that descended from a single common ancestor (sensitivity). ※ The importance of the two definitions will be discussed later.
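
As a small illustration of the two definitions, the sketch below treats an orthogroup as a set of gene names and compares it with the (normally unknown) set of true descendants of one ancestral gene. The function names and the example sets are assumptions made for illustration only.

```python
def is_sound(predicted, true_descendants):
    """Sound: the group contains only genes descended from the ancestor."""
    return set(predicted) <= set(true_descendants)


def is_complete(predicted, true_descendants):
    """Complete: the group contains all genes descended from the ancestor."""
    return set(true_descendants) <= set(predicted)


truth = {"g1a", "g2a", "g1b"}                 # hypothetical true descendants
too_small = {"g1a", "g1b"}                    # sound but not complete
too_large = {"g1a", "g2a", "g1b", "g3b"}      # complete but not sound

print(is_sound(too_small, truth), is_complete(too_small, truth))
print(is_sound(too_large, truth), is_complete(too_large, truth))
```

A group that is both sound and complete is exactly the set of true descendants.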

SYNERGY Algorithm. SYNERGY recursively traverses the nodes of the given species tree T from its leaves to its root (a bottom-up strategy), identifying orthogroups with respect to each node. Input: a set of species, including (1) their species tree T, (2) the sequences of the predicted genes of each extant species, and (3) the chromosomal positions of those genes in each species. Output: the gene trees.
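
The bottom-up order can be sketched as a post-order traversal. This is only an illustration of the visiting order, not SYNERGY's implementation; the dictionary encoding of the species tree and the function names are assumptions, and the tree shape (Y above X and C, X above A and B) follows the preview example used later in the talk.

```python
# Species tree as parent -> children; leaves do not appear as keys.
SPECIES_TREE = {"Y": ["X", "C"], "X": ["A", "B"]}


def postorder(node, tree):
    """List the nodes so that every child comes before its parent."""
    order = []
    for child in tree.get(node, []):
        order.extend(postorder(child, tree))
    order.append(node)
    return order


def ancestral_nodes_bottom_up(root, tree):
    """Ancestral species in the order orthogroups would be resolved."""
    return [n for n in postorder(root, tree) if n in tree]


print(postorder("Y", SPECIES_TREE))                  # ['A', 'B', 'X', 'C', 'Y']
print(ancestral_nodes_bottom_up("Y", SPECIES_TREE))  # ['X', 'Y']
```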

Pipeline. Input data: species tree T; sequence and chromosomal location for each extant species in T. Pre-processing: scoring gene similarity and building the gene similarity graph. SYNERGY algorithm: identify orthogroups and reconstruct gene trees. Output data: the gene trees. (Figure: an example gene similarity graph with weighted edges over genes A1 to A4, B1 to B3 and C1, C2, C4, and an example orthogroup with its gene tree.)

Previewing the SYNERGY Algorithm. First, consider species A and B below X in the species tree T and determine OG1X from g1A and g1B: their last common ancestor is X and the common ancestral gene is g1X. Next, consider X and C below Y in the species tree T and determine OG1Y from OG1X and g1C: the last common ancestor is Y and the common ancestral gene is g1Y. (Figure: species tree with Y above X and C, and X above A and B; genes A1 to A4, B1 to B3, C1, C2, C4; gene trees P1Y, P2Y, P4Y.)

Two issues in the algorithm. (1) How to identify the orthogroup OGix? (2) Given an orthogroup OGix, how to reconstruct a gene tree? (Figure: orthogroups OGix, OGiy, OGiz; g1A and g1B form OG1X, which together with g1C forms OG1Y.)

SYNERGY Algorithm: Matching orthogroups. If A1 and B1 are similar, record the relation in the data structure. (Figure: species tree T with X above A and B; genes A1 in species A and B1 in species B; the orthology assignment yields the candidate orthogroup OG1X = {g1A, g1B}.) How do we determine whether two genes are similar? Recall the pre-processing step: scoring the gene similarity.

Selecting similar genes for an orthogroup. SYNERGY relies on pre-computed distances between genes to make orthology assignments. It runs all-versus-all FASTA alignments between all genes in the input. The relations between genes are represented by a gene similarity graph, a weighted directed graph G = (V, E).

Pipeline for the gene similarity graph. FASTA alignment: derived from the logic of the dot plot; computes the best diagonals of the alignment. A gene pair (gia, gib) is considered significantly similar if (1) the FASTA E-value is below 0.1 (significance), and (2) either gib is the best FASTA hit in species b to gia, or the percent identity between gia and gib is above 50% of that between gia and its best hit in b (similarity). E-value: an estimate of the likelihood of a similar match occurring by chance. Should percent identity or the E-value be used as the edge weight? No, because identity and E-values are unsuitable for representing the nearest phylogenetic neighbor (Koski and Golding, 2001; Wall et al., 2003).
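
The edge-selection rule can be sketched as a small filter. This is an illustration under stated assumptions, not the published implementation: the tuple layout of a hit, the function name, and the choice of "best hit" as the hit with the lowest E-value in a species are all hypothetical; the thresholds 0.1 and 50% come from the slide.

```python
def candidate_edges(query, hits, max_evalue=0.1, identity_fraction=0.5):
    """Select directed edges query -> target.

    hits: list of (target_gene, target_species, evalue, percent_identity).
    """
    significant = [h for h in hits if h[2] < max_evalue]

    # Assumed definition: best hit per species = lowest E-value.
    best = {}
    for hit in significant:
        species = hit[1]
        if species not in best or hit[2] < best[species][2]:
            best[species] = hit

    edges = []
    for target, species, _evalue, identity in significant:
        best_target, _, _, best_identity = best[species]
        if target == best_target or identity > identity_fraction * best_identity:
            edges.append((query, target))
    return edges


hits = [
    ("g1b", "b", 1e-50, 90.0),  # best hit in species b
    ("g2b", "b", 1e-10, 50.0),  # 50 > 0.5 * 90, kept
    ("g3b", "b", 1e-3, 30.0),   # 30 < 0.5 * 90, dropped
    ("g4b", "b", 0.5, 80.0),    # E-value not below 0.1, dropped
]
print(candidate_edges("g1a", hits))  # [('g1a', 'g1b'), ('g1a', 'g2b')]
```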

How is the edge weight obtained? Each edge is weighted by a distance derived from two pre-computed scores between the two genes: a peptide sequence similarity score and a synteny similarity score. Peptide sequence score: globally align the two proteins based on the JTT amino acid substitution matrix. Synteny similarity score: the fraction of the two genes' neighbors that are orthologous to each other.

Synteny similarity score. Synteny: similar (syntenic) blocks made up of multiple genes on the chromosome. The synteny similarity score quantifies the similarity between the chromosomal neighborhoods of two genes. SYNERGY computes the score between two genes as the fraction of their neighbors that are orthologous to each other. (Figure: syntenic blocks in species a and b under X.) In the example, the synteny similarity score ds for g3a and g3b is 2/3.
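
A minimal sketch of the neighbor fraction follows. The neighborhoods, the set of orthologous pairs and the choice of denominator (the larger of the two neighborhoods) are assumptions chosen so that the toy data reproduces the value 2/3 quoted on the slide; the slides do not state how SYNERGY normalizes the count or rescales it into a distance.

```python
from fractions import Fraction


def synteny_similarity(neighbors_a, neighbors_b, orthologous_pairs):
    """Fraction of neighbors of one gene with an orthologous neighbor of the other."""
    matched = sum(
        1
        for ga in neighbors_a
        if any((ga, gb) in orthologous_pairs for gb in neighbors_b)
    )
    size = max(len(neighbors_a), len(neighbors_b))  # assumed normalization
    return Fraction(matched, size) if size else Fraction(0)


# Hypothetical neighborhoods of g3a and g3b.
neighbors_g3a = ["g1a", "g2a", "g4a"]
neighbors_g3b = ["g1b", "g4b"]
orthologs = {("g1a", "g1b"), ("g4a", "g4b")}

print(synteny_similarity(neighbors_g3a, neighbors_g3b, orthologs))  # 2/3
```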

Peptide similarity score Globally align two proteins based on JTT amino acid substitution matrix (Jones et al. 1992).

Distance of two genes: example. Both dp and ds are scaled and treated as distances for assessing protein and chromosomal evolution between pairs of genes. Two genes with high similarity have scores close to 0; genes sharing no similarity have a score of 2.0. In the example (genes g1A, g2A, g3A, g4A in species A and g1B, g3B, g4B in species B), the peptide similarity score dp for g3a and g3b is 48 and the synteny similarity score ds for g3a and g3b is 2/3.

SYNERGY Algorithm: Matching orthogroups. If A1 and B1 are similar, record the relation in the graph. (Figure: species tree T with X above A and B; genes A1 and B1; candidate orthogroup OG1X = {g1A, g1B}.) Gene similarity is stored in the graph structure. What is the condition for making an orthology assignment?

Generate Candidate Orthogroups (1). SYNERGY assigns orthogroups (genes) to the same candidate orthogroup if they have reciprocal edges between them in the gene similarity graph. (Figure: species tree T with X above A and B; gene similarity graph over A1, A2, A3 and B1, B2, B3; candidate orthogroups OG1X = {g1A, g1B} and OG2X = {g2A, g2B}.)

Generate Candidate Orthogroups (2). SYNERGY then applies transitive closure to these reciprocal relations: orthogroups (genes) connected through a chain of reciprocal edges are assigned to the same candidate orthogroup. (Figure: species tree T with X above A and B; gene similarity graph over A1, A2, A3 and B1, B2, B3; candidate orthogroups OG1X = {g1A, g1B} and OG2X = {g2A, g2B}.)
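
Reciprocal edges followed by transitive closure amount to taking connected components of the reciprocal-edge graph. The sketch below uses a small union-find; the function name, the edge lists and the decision to leave out genes without any reciprocal edge are assumptions for illustration.

```python
def candidate_orthogroups(directed_edges):
    """Group nodes linked, directly or through a chain, by reciprocal edges."""
    edges = set(directed_edges)
    reciprocal = [(u, v) for (u, v) in edges if (v, u) in edges]

    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for u, v in reciprocal:
        parent[find(u)] = find(v)  # union: this yields the transitive closure

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


edges = [("A1", "B1"), ("B1", "A1"), ("A2", "B2"), ("B2", "A2"),
         ("A2", "B1"), ("A3", "B3")]
print(candidate_orthogroups(edges))
# [['A1', 'B1'], ['A2', 'B2']]

# Making A2 <-> B1 reciprocal merges the two groups through the closure.
print(candidate_orthogroups(edges + [("B1", "A2")]))
# [['A1', 'A2', 'B1', 'B2']]
```

Genes with only one-directional edges (A3 and B3 here) stay outside any candidate orthogroup in this sketch.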

Two issues in the algorithm. (1) How to identify the orthogroup OGix? (2) Given an orthogroup OGix, how to reconstruct a gene tree? (Figure: orthogroups OGix, OGiy, OGiz; g1A and g1B form OG1X, which together with g1C forms OG1Y.)

Reconstruction of the Gene Tree. Recall that the trees {Py} and {Pz} were already resolved in the previous iteration. (Figure: orthogroups OGix, OGiy and OGiz in a species tree over extant species a, b, c, d, with genes g1a, g1b, g2b, g1c, g1d and a one-to-one relation marked.) How can one-to-many or many-to-many relations, which arise from duplications and/or losses, be resolved?

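Solving one-to-many relations during reconstruction. SYNERGY uses a modified Neighbor-Joining method applied to the distance matrix between the orthogroups that comprise OGix. Example distance matrix over g1Z, g1b, g2b, g1a: Dist[g1Z, g1b] = 24, Dist[g1Z, g2b] = 44, Dist[g1Z, g1a] = 15, Dist[g1b, g2b] = 11; the remaining entries, 20 and 26, are the distances from g1a to g1b and g2b. Joining g1b and g2b: Dist[g1Z, g1b_g2b] = (Dist[g1Z, g1b] + Dist[g1Z, g2b] - Dist[g1b, g2b]) / 2 = (24 + 44 - 11) / 2 = 28.5. Reduced matrix over g1Z, g1b_g2b, g1a: Dist[g1Z, g1b_g2b] = 28.5, Dist[g1Z, g1a] = 15, Dist[g1a, g1b_g2b] = 17.5. The result is an unrooted phylogenetic gene tree, so tree rooting still needs to be solved.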
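
The arithmetic of the join step can be checked with a short sketch. It covers only the distance update shown on the slide, not the selection of which pair to join or SYNERGY's modifications to Neighbor-Joining. The function name is hypothetical, and assigning 20 to (g1a, g1b) and 26 to (g1a, g2b) is an assumption; swapping them gives the same result.

```python
def joined_distance(d_xi, d_xj, d_ij):
    """Distance from an outside node x to the node that joins i and j."""
    return (d_xi + d_xj - d_ij) / 2


# Distances read from the slide (the 20/26 split is assumed, see above).
dist = {
    ("g1Z", "g1b"): 24, ("g1Z", "g2b"): 44, ("g1Z", "g1a"): 15,
    ("g1b", "g2b"): 11, ("g1a", "g1b"): 20, ("g1a", "g2b"): 26,
}

z_to_join = joined_distance(
    dist[("g1Z", "g1b")], dist[("g1Z", "g2b")], dist[("g1b", "g2b")]
)
a_to_join = joined_distance(
    dist[("g1a", "g1b")], dist[("g1a", "g2b")], dist[("g1b", "g2b")]
)
print(z_to_join, a_to_join)  # 28.5 17.5
```

Both values match the reduced matrix on the slide (28.5 and 17.5), with Dist[g1Z, g1a] = 15 carried over unchanged.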

The importance of Tree Rooting Correct rooting is important since the selected root position may determine whether all of an orthogroup’s members descended from a single gene or from multiple genes.

How to choose a tree's root? Assumption: the rates of evolution among all the leaves in a tree are equal, so a tree's root should be approximately equidistant to all the leaves. SYNERGY evaluates every possible rooting r on an internal branch and assigns a score to each rooting. The score is based on the variance in both peptide sequence and synteny scores, termed πr and σr, respectively.

Scoring Function for Tree Rooting. πr: amino acid score. σr: synteny score. SYNERGY selects the rooting that maximizes the rooting score (equation shown on the slide). Following a gene duplication, one or both of the paralogs are often under relaxed selection and can evolve at an accelerated rate (Lynch and Katju, 2004; Ohno, 1970). This conflicts with the assumption above that all branches of the tree evolve at an equal rate, and it complicates tree rooting. Therefore, SYNERGY introduces a score ωr that evaluates a root location in terms of the number of duplications and losses it invokes. δs: rate of duplication on branch s. λs: rate of loss on branch s.

Multiple Ancestral Genes? We may find that the root of Pix represents not a single gene but several, because of an earlier duplication event (Fig. c). This violates Definition 1 (sound orthogroup); details are presented in the discussion. In that case SYNERGY splits Pix, and it iterates until each orthogroup represents a single ancestral gene and no orthogroups need to be partitioned.

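Updating the Gene Similarity Graph. (Figure: in the species tree with Y above X and C, and X above A and B, the orthogroups OG1X = {g1A, g1B}, OG2X = {g2A, g2B} and OG3X = {g3A, g3B} identified at X take the place of the genes of A and B in the gene similarity graph, next to the genes C1, C2, C4 of species C.)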

Updating the Gene Similarity Graph. New edges such as C1–OG1X, C2–OG2X and C2–OG3X need to be computed. SYNERGY uses the Neighbor-Joining distance formula to update the edges, for example for A2, B2 and C2 with OG2X: C2OG2X = ½ (C2A2 + C2B2 – A2B2). If one of the distances in the equation is not defined in the original graph, SYNERGY uses the maximal distance value.
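
The edge update, including the fallback for missing distances, might look like the sketch below. The dictionary layout, the function name and the example distances are hypothetical; the maximal distance of 2.0 is an assumption taken from the earlier statement that genes sharing no similarity score 2.0.

```python
MAX_DISTANCE = 2.0  # assumed maximal distance for undefined edges


def edge_to_orthogroup(dist, outsider, member_a, member_b,
                       max_distance=MAX_DISTANCE):
    """Distance from an outside gene to the orthogroup joining two members."""

    def lookup(u, v):
        # Treat the stored distances as symmetric; fall back to the maximum.
        return dist.get((u, v), dist.get((v, u), max_distance))

    return 0.5 * (
        lookup(outsider, member_a)
        + lookup(outsider, member_b)
        - lookup(member_a, member_b)
    )


full = {("C2", "A2"): 0.5, ("C2", "B2"): 0.75, ("A2", "B2"): 0.25}
print(edge_to_orthogroup(full, "C2", "A2", "B2"))     # 0.5

partial = {("C2", "A2"): 0.5, ("A2", "B2"): 0.25}     # C2-B2 is undefined
print(edge_to_orthogroup(partial, "C2", "A2", "B2"))  # 1.125
```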

Review of the SYNERGY algorithm. 1) Pre-processing (scoring gene similarity and building the gene similarity graph). 2) Identifying orthogroups (using gene similarity and orthology assignments). 3) Reconstructing the gene tree (rooting, breaking orthogroups). 4) Updating the gene similarity graph (using the Neighbor-Joining formula to update edges). 5) Repeating steps 2 to 4 recursively for the next stage. 6) The whole gene tree is obtained.

Outline Introduction Objective Methods Results (Test on Ascomycota fungi, Comparison to curated resource) Conclusion Discussion

SYNERGY Application to fungal species. Why fungi? Their history includes a whole-genome duplication (WGD) event followed by widespread loss of paralogous genes (Byrne and Wolfe, 2005; Dietrich et al., 2004; Kellis et al., 2004). They include a well-studied model organism, Saccharomyces cerevisiae, which supports studies of genome evolution and function (Kellis et al., 2003). Source data: nine Ascomycota fungal species with a total of 52,092 protein-coding genes.

Results of the test case. Figure panels: the gene tree reconstructed by SYNERGY for OG#3184; the gene tree constructed for the same set of genes using CLUSTALW's Neighbor-Joining, for which a much larger number of duplication and loss events must be invoked to reconcile the tree with the known species phylogeny; (c) a species tree of nine fungi (Scannell et al., 2006).

Results of the test case. (c) A species tree of nine fungi (Scannell et al., 2006). Figure legend: the number of predicted protein-coding genes; the number and percentage of singletons in SYNERGY's prediction; the number of ancestral genes inferred from SYNERGY's gene trees; the number of duplication events (15 in the legend example); the number of loss events (645 in the legend example).

Results of the test case. Three species show a large number of duplication events, which is attributed to faulty ORF predictions among the three sensu stricto species. (c) A species tree of nine fungi (Scannell et al., 2006).

SYNERGY versus RBH (reciprocal best hits). The authors compared SYNERGY's results with those attained by RBH anchored by S. cerevisiae and observed a marked improvement in performance: 1. Fewer singletons in S. cerevisiae. 2. Orthologs identified for 106 more genes in S. cerevisiae than with RBH (data not shown). 3. 298 more orthogroups spanning all species than with RBH. Many orthogroups have more than nine genes, a result of gene duplication events.
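
For reference, the RBH baseline can be sketched in a few lines. The input format (one best-hit dictionary per direction) and the gene names are hypothetical; the sketch shows why RBH yields only one-to-one pairs and therefore cannot represent duplications.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Hypothetical best hits between species A and B.
best_a_to_b = {"A1": "B1", "A2": "B1", "A3": "B3"}
best_b_to_a = {"B1": "A1", "B3": "A3"}

print(reciprocal_best_hits(best_a_to_b, best_b_to_a))
# [('A1', 'B1'), ('A3', 'B3')]
```

Here A2 is left as a singleton even though its best hit is B1, which is the kind of paralog a duplication-aware method can still place in an orthogroup.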

Measuring Orthogroup Robustness. A jackknife-based approach repeatedly excludes different portions of the data (perturbations) to measure orthogroup robustness to (1) the choice of species included and (2) the accuracy of gene predictions within each species. Species confidence score: estimated by systematically hiding each branch of the species tree T and running SYNERGY separately, resulting in 31 holdout experiments. Gene confidence score: estimated by repeatedly withholding a random proportion of genes from each genome; each gene is hidden with probability 0.1, over 50 holdout experiments.
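
The gene-level holdouts can be sketched as repeated random subsampling. The function name, the seed and the toy genome are assumptions; only the hiding probability (0.1) and the number of experiments (50) come from the slide, and running SYNERGY on each holdout is outside the scope of the sketch.

```python
import random


def gene_holdouts(genes, hide_probability=0.1, experiments=50, seed=0):
    """Return one list of visible genes per holdout experiment."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [
        [gene for gene in genes if rng.random() >= hide_probability]
        for _ in range(experiments)
    ]


genome = [f"g{i}a" for i in range(1, 101)]  # hypothetical genome of 100 genes
holdouts = gene_holdouts(genome)
print(len(holdouts))  # 50
```

Each holdout would then be passed through the full pipeline and its orthogroups compared with those obtained from the complete data.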

Measuring Orthogroup Confidence. For both species and gene confidence, the soundness and completeness of the identified orthogroups are tested. The non-singleton orthogroups SYNERGY obtained are remarkably robust to a systematic perturbation of the set of included species (93.5% are complete and 99.7% are sound at the 80% confidence level) (Wapinski et al., Nature 2007). Perturbations to gene content were more disruptive than perturbations to species, and gene soundness was more robust than completeness. When removing up to 20% of the genes in each genome at random, 96.3% of the orthogroups are complete, and 78% are sound (Wapinski et al., Nature 2007).

Comparison to a curated resource: YGOB. The Yeast Gene Order Browser (YGOB; Byrne and Wolfe, 2005) provides a "gold standard" of orthology and paralogy relations. YGOB assumes that the WGD is the only duplication event in the lineage and relies predominantly on synteny to assign orthology relations (manually curated). The authors also compared the quality of SYNERGY's paralogy assignments to that of INPARANOID (Remm et al., 2001). INPARANOID is a hit-clustering method designed to identify paralogous relations.

Comparison with INPARANOID. SYNERGY identified more known paralogs dating to the WGD than INPARANOID did. SYNERGY also showed greater sensitivity (orange cells) than INPARANOID when identifying orthology relations. Some of the reduced specificity may be the result of limitations of the gold standard, since YGOB is limited by two assumptions. Assumption 1: all duplication events originated in the WGD, and thus orthology is at most a two-to-one relationship; this relegates a greater portion of genes to singletons without orthologs. Assumption 2: gene order is nearly always conserved and thus can be used as the primary source of evidence for shared ancestry; as a result, a far smaller proportion of YGOB's orthologous loci are ancestral to all of its species than of those that SYNERGY identified. Table legend: SYNERGY (top number), INPARANOID (bottom number); sensitivity (orange cells), specificity (green cells), paralogues reported (blue cells).

Conclusion. SYNERGY performs genome-wide reconstruction of homology relations across multiple genomes. SYNERGY combines hit-clustering approaches with the phylogenetic reconstruction of tree-based methods. The results of SYNERGY markedly improve over the widely used RBH approach, and they are of comparable quality to a manually curated gold standard.

Complete and Sound. At each recursive stage, SYNERGY assumes that sound and complete orthogroups have been resolved for the lower nodes in the tree. SYNERGY ensures completeness by allowing many edges (candidate homology relations) into the input gene similarity graph and by applying a lenient criterion to derive candidate orthogroups. SYNERGY then achieves soundness by refining these coarse relations as it progresses through the species tree, breaking orthogroups using phylogenetic principles at each stage.

Violation of soundness while generating candidate orthogroups. SYNERGY ensures completeness by allowing many edges (candidate homology relations) into the input gene similarity graph and by applying a lenient criterion to derive candidate orthogroups. However, such a lenient criterion can violate soundness. (Figure: species tree with X above A and B; genes A1, A2 in species A and B1 in species B; candidate orthogroup OG1X contains g1A, g2A and g1B, whereas the sound grouping is OG1X = {g1A, g1B} and OG2X = {g2A, g2B}.) In this example, candidate orthogroup OG1X contains g2A, which is related to the other members through duplication events that predate X. Such violations of the orthogroup soundness condition (Definition 1) are handled later.

Thanks for Your Attention!

Orthogroup Confidence. A complete orthogroup (Definition 2) contains all the genes that descended from a single common ancestor, and thus its genes should not 'migrate out' of it in the holdout experiments. In the completeness score (equation shown on the slide), h(gj, gk) and OGi(gj, gk) specify the last species in the tree in which gj and gk share a common ancestor in the holdout experiment h and in the original orthogroup, respectively.

A sound orthogroup (Definition 1) contains only the genes that descended from a single common ancestor, and thus new genes should not 'migrate into' the orthogroup in the holdout experiments. The soundness score counts the number of pairs of non-orthologous genes (gj, gk), with gj in OGi and gk not in OGi, that became orthologous in a holdout experiment.
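
The 'migrate out' and 'migrate in' checks can be sketched at the level of gene sets. This is a simplification and an assumption: the scores described above compare the last common ancestral species of gene pairs, whereas the sketch only asks whether the visible members of an original orthogroup stay together and whether outside genes join them. All names and example groups are hypothetical.

```python
def holdout_flags(original, holdout_groups, visible):
    """Return (complete, sound) for one original orthogroup in one holdout."""
    original = set(original)
    members = original & set(visible)
    # Holdout orthogroups that contain at least one visible original member.
    touching = [set(g) for g in holdout_groups if set(g) & members]
    complete = len(touching) <= 1                  # no member migrated out
    sound = all(g <= original for g in touching)   # no outsider migrated in
    return complete, sound


original = {"a1", "b1", "c1"}
visible = {"a1", "b1", "c1", "c2"}

print(holdout_flags(original, [{"a1", "b1"}, {"c1"}, {"c2"}], visible))
# (False, True): c1 migrated out
print(holdout_flags(original, [{"a1", "b1", "c1", "c2"}], visible))
# (True, False): c2 migrated in
```

Averaging such flags over all holdout experiments would give one possible confidence value per orthogroup.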

Introduction: current methods for inferring homology relations. Pairwise sequence comparison: best bidirectional BLAST hits; focuses on one-to-one orthologs (no duplications). Hit-clustering methods: detect clusters in the graph of pairwise hits. Synteny methods: detect conserved regions, that is, stretches of nearby hits. Phylogenetic methods: the phylogeny of a gene family clusters orthologs near each other.