Trees and Sequence Space J. Peter Gogarten University of Connecticut Dept. of Molecular and Cell Biology Sculpture at Royal Botanical Gardens, Kew.



Ways to construct Protein Space
Construction of a high-dimensional sequence space (from Eigen et al. 1988). Each additional sequence position adds another dimension, doubling the diagram for the shorter sequence. Shown is the progression from a single sequence position (line) to a tetramer (hypercube). A four (or twenty) letter code can be accommodated either by allowing four (or twenty) values for each dimension (Rechenberg 1973; Casari et al. 1995), or through additional dimensions (Eigen and Winkler-Oswatitsch 1992).
References:
Casari, G., C. Sander and A. Valencia (1995). "A method to predict functional residues in proteins." Nat Struct Biol 2(2):
Eigen, M. and R. Winkler-Oswatitsch (1992). Steps Towards Life: A Perspective on Evolution. Oxford; New York, Oxford University Press.
Eigen, M., R. Winkler-Oswatitsch and A. Dress (1988). "Statistical geometry in sequence space: a method of quantitative comparative sequence analysis." Proc Natl Acad Sci U S A 85(16):
Rechenberg, I. (1973). Evolutionsstrategie; Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Stuttgart-Bad Cannstatt, Frommann-Holzboog.
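The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names (sequence_space, hamming, neighbours) are made up for this example. Every sequence is a vertex, and two vertices are joined by an edge when the sequences differ at exactly one position, so a binary alphabet with four positions yields the hypercube.

```python
from itertools import product


def sequence_space(alphabet, length):
    """Return every sequence of the given length (the vertices of sequence space)."""
    return ["".join(p) for p in product(alphabet, repeat=length)]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def neighbours(seq, alphabet):
    """All sequences that are exactly one substitution away from seq."""
    result = []
    for i, old in enumerate(seq):
        for new in alphabet:
            if new != old:
                result.append(seq[:i] + new + seq[i + 1:])
    return result


if __name__ == "__main__":
    # Each added position doubles the number of vertices for a binary alphabet.
    for n in range(1, 5):
        vertices = sequence_space("01", n)
        # Every edge is counted once from each end, hence the division by 2.
        edges = sum(len(neighbours(v, "01")) for v in vertices) // 2
        print(n, "positions:", len(vertices), "vertices,", edges, "edges")
```

With a four-letter alphabet the same functions apply unchanged; each position then offers three substitutions instead of one, which corresponds to the "more values per dimension" option mentioned above.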

Diversion: From Multidimensional Sequence Space to Fractals

one symbol -> 1D coordinate of dimension = pattern length

Two symbols -> Dimension = length of pattern; length 1 = 1D:

Two symbols -> Dimension = length of pattern; length 2 = 2D. Dimensions correspond to positions; each dimension has two possibilities. Note: here is a possible bifurcation: a larger alphabet could be represented as more choices along the axis of each position!

Two symbols -> Dimension = length of pattern; length 3 = 3D:

Two symbols -> Dimension = length of pattern; length 4 = 4D, aka hypercube

Two symbols -> Dimension = length of pattern

Three Symbols (the other fork)

Four Symbols: I.e.: with an alphabet of 4, we have a hypercube (4D) already with a pattern size of 2, provided we stick to a binary pattern in each dimension.

Hypercubes with 2- and 4-character alphabets: 2-character alphabet, pattern size 4; 4-character alphabet, pattern size 2

Three Symbols Alphabet suggests fractal representation

3 fractal: enlarge, fill in; outer pattern repeats inner pattern = self-similar = fractal

3 character alphabet 3 pattern fractal

3 character alphabet 4 pattern fractal. Conjecture: for n -> infinity, the fractal might fill a 2D triangle. Note: check Mandelbrot
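One way to reproduce a self-similar picture for a three-letter alphabet is sketched below. It assumes an iterated halving map (often called the chaos game representation), which is not necessarily the bead construction used on the slides. The corner coordinates and the names CORNERS, pattern_to_point and fractal_points are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical assignment of the three symbols to the corners of a triangle.
CORNERS = {
    "A": (Fraction(0), Fraction(0)),
    "B": (Fraction(1), Fraction(0)),
    "C": (Fraction(1, 2), Fraction(1)),
}


def pattern_to_point(pattern):
    """Map a pattern to a 2D point: position i contributes its corner scaled by 1/2**(i+1)."""
    x = y = Fraction(0)
    weight = Fraction(1, 2)
    for symbol in pattern:
        cx, cy = CORNERS[symbol]
        x += weight * cx
        y += weight * cy
        weight /= 2
    return x, y


def fractal_points(length):
    """Points for all 3**length patterns over the alphabet A, B, C."""
    return {"".join(p): pattern_to_point(p) for p in product("ABC", repeat=length)}


if __name__ == "__main__":
    for n in range(1, 5):
        points = fractal_points(n)
        print(n, "positions:", len(set(points.values())), "distinct points")
```

Under this particular map the first symbol selects one of three half-size sub-triangles, the second symbol selects a sub-triangle within it, and so on, so the outer pattern repeats the inner one. The limit set of this map is the Sierpinski triangle, whose dimension is log 3 / log 2 (about 1.585), so it would not fill the 2D triangle. Whether that carries over to the construction on the slides is not established here.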

Same for 4 character alphabet: 1 position, 2 positions, 3 positions

4 character alphabet continued (with cheating I didn’t actually add beads) 4 positions

4 character alphabet continued (with cheating I didn’t actually add beads) 5 positions

4 character alphabet continued (with cheating I didn’t actually add beads) 6 positions

4 character alphabet continued (with cheating I didn’t actually add beads) 7 positions

Animated GIF: 1-12 positions

Protein Space in JalView

Alignment of V F A ATPase ATP binding SU (catalytic and non- catalytic SU)

UPGMA tree of V F A ATPase ATP binding SU with line dropped to partition (and colour) the 4 SU types (VA cat and non cat, F cat and non cat). Note that details of the tree
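For readers who have not seen UPGMA spelled out, the following is a minimal sketch of the clustering and of the "line dropped to partition" step. It is not Jalview's implementation. The function names and the toy distance matrix are invented for illustration, and the ATPase data are not used.

```python
def upgma(names, distances):
    """Cluster by UPGMA.

    names: list of leaf labels (strings).
    distances: dict mapping (a, b) tuples to a distance, one entry per unordered pair.
    Returns the merges in order as ((child_a, child_b), height) tuples.
    """
    d = {frozenset(pair): float(value) for pair, value in distances.items()}
    size = {name: 1 for name in names}
    merges = []
    while len(size) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(pair, key=repr)     # fixed order so results are reproducible
        new = (a, b)
        merges.append((new, d[pair] / 2))  # node height is half the distance
        for c in [c for c in size if c not in pair]:
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            # Size-weighted average keeps this the mean over all leaf pairs.
            d[frozenset((new, c))] = (size[a] * dac + size[b] * dbc) / (size[a] + size[b])
        del d[pair]
        size[new] = size.pop(a) + size.pop(b)
    return merges


def leaves(tree):
    """Leaf labels below a node of the nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def cut(names, merges, threshold):
    """Groups obtained by dropping a line across the tree at the given height."""
    clusters = set(names)
    for (a, b), height in merges:
        if height <= threshold:
            clusters -= {a, b}
            clusters.add((a, b))
    return sorted(sorted(leaves(c)) for c in clusters)


if __name__ == "__main__":
    toy = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
           ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
    merges = upgma(["A", "B", "C", "D"], toy)
    print(merges)
    print(cut(["A", "B", "C", "D"], merges, 2.0))
```

UPGMA assumes a constant rate along all branches (a molecular clock), so the tree is best read as a convenient way to assign colours to groups rather than as a phylogenetic estimate.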

PCA analysis of V F A ATPase ATP binding SU using colours from the UPGMA tree

Same PCA analysis of V F A ATPase ATP binding SU using colours from the UPGMA tree, but turned slightly. (Giardia A SU selected in grey.)

Same PCA analysis of V F A ATPase ATP binding SU Using colours from the UPGMA tree, but replacing the 1st with the 5th axis. (Eukaryotic A SU selected in grey.)

Same PCA analysis of V F A ATPase ATP binding SU Using colours from the UPGMA tree, but replacing the 1st with the 6th axis. (Eukaryotic B SU selected in grey - forgot rice.)

Problems: Jalview's approach requires an alignment; only homologous sequences can be depicted in the same space. Solution: one could use pattern absence / presence as coordinates.
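The alignment-free alternative suggested above might look like the following sketch, in which each possible pattern of length k becomes one coordinate with value 1 (present) or 0 (absent). The choice of k, the DNA alphabet and the function names are assumptions made for this example.

```python
from itertools import product


def pattern_profile(seq, k, alphabet="ACGT"):
    """Presence/absence vector over all patterns of length k, in a fixed order."""
    patterns = ["".join(p) for p in product(alphabet, repeat=k)]
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return [1 if p in present else 0 for p in patterns]


def profile_distance(a, b):
    """Number of coordinates at which two profiles differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    # The two sequences need not be homologous or of equal length.
    first = pattern_profile("ACGTACGT", 2)
    second = pattern_profile("AAAAC", 2)
    print(len(first), "coordinates; distance =", profile_distance(first, second))
```

Because every sequence is mapped into the same 4**k dimensional space, sequences that cannot be aligned can still be placed relative to each other, at the cost of losing positional information.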

muscle alignment

muscle vs clustal: more on alignment programs (statalign, pileup, SAM) here

the same region using tcoffee with default settings; more on alignment programs (statalign, pileup, SAM) here

Sequence editors and viewers: Jalview (Homepage, Description). Jalview as Java Web Start Application (other Java applications are here). Jalview is easy to install and run. Test file is here (ATPase subunits). (Intro to ATPases: 1bmf in spdbv; gif of rotation here, movies of the rotation are here and here.) (Load all.txt into Jalview, colour options, mouse use, PID tree, Principal component analysis -> sequence space.) More on sequence space here.

seaview – phylo_win: Another useful multiple alignment editor is seaview, the companion sequence editor to phylo_win. It runs on PC and most unix flavors, and is the easiest way to get alignments into phylo_win.

Why phylogenetic reconstruction of molecular evolution?
Systematic classification of organisms. E.g.: Who were the first angiosperms? (i.e., where are the first angiosperms located relative to present-day angiosperms?) Where in the tree of life is the last common ancestor located?
Evolution of molecules. E.g.: domain shuffling, reassignment of function, gene duplications, horizontal gene transfer, drug targets, detection of genes that drive evolution of a species/population (e.g. influenza virus, see here for more examples).

Small subunit ribosomal RNA (16S) based tree of life. Carl Woese, George Fox, and many others.

Steps of the phylogenetic analysis
Phylogenetic analysis is an inference of evolutionary relationships between organisms. Phylogenetics tries to answer the question "How did groups of organisms come into existence?" Those relationships are usually represented by tree-like diagrams. Note: the assumption of a tree-like process of evolution is controversial!

Trees: an unrooted tree. A branch, a split, or a bipartition is sometimes represented as ***... or ...*** with the sequences in order. A leaf, or an OTU, is attached to a terminal branch. [Figure: unrooted and rooted versions of a tree, with labels #1, #2, #3 and #A, #B, #C.] Molecular phylogenies are usually scaled with respect to substitutions and not with respect to time.

What is in a tree? Trees from molecular data are usually calculated as unrooted trees (at least they should be; if they are not, this is usually a mistake). To root a tree you can either assume a molecular clock (substitutions occur at a constant rate; again, this assumption is usually not warranted and needs to be tested), or you can use an outgroup (i.e. something that you know forms the deepest branch). For example, to root a phylogeny of birds, you could use the homologous characters from a reptile as outgroup; to find the root in a tree depicting the relations between different human mitochondria, you could use the mitochondria from chimpanzees or from Neanderthals as an outgroup; to root a phylogeny of alpha hemoglobins you could use a beta hemoglobin sequence, or a myoglobin sequence, as outgroup. Trees have a branching pattern (also called the topology) and branch lengths. Often the branch lengths are ignored in depicting trees (these trees are often referred to as cladograms; note that cladograms should be considered rooted). You can swap branches attached to a node, and in an unrooted tree you can depict the tree as rooted in any branch you like without changing the tree.
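The statement that re-rooting or swapping branches does not change an unrooted tree can be checked by comparing sets of bipartitions. The sketch below assumes trees written as nested Python tuples with single-letter leaf names. The function names are illustrative, and the star notation follows the "***..." convention mentioned earlier, with the arbitrary choice of marking the side that does not contain the first leaf.

```python
def leaf_set(tree):
    """Set of leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions (internal branches) of the tree, ignoring the root position."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        other = all_leaves - side
        # Branches leading to a single leaf occur in every tree and carry no information.
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def star(split, order):
    """Write a split as dots and stars; stars mark the side without the first leaf."""
    side = next(s for s in split if order[0] not in s)
    return "".join("*" if leaf in side else "." for leaf in order)


if __name__ == "__main__":
    rooted_one_way = (("A", "B"), ("C", ("D", "E")))
    rooted_another_way = ("A", ("B", (("E", "D"), "C")))
    different_topology = (("A", "C"), ("B", ("D", "E")))
    print(sorted(star(s, "ABCDE") for s in splits(rooted_one_way)))
    print(splits(rooted_one_way) == splits(rooted_another_way))
    print(splits(rooted_one_way) == splits(different_topology))
```

Two drawings that yield the same set of bipartitions depict the same unrooted topology, which is one way to approach the "which of these trees is different" exercise.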

Test: Which of these trees is different? More tests here

Homology
Two sequences are homologous if there existed a molecule in the past that is ancestral to both of the sequences.
Types of Homology
Orthologs: "deepest" bifurcation in molecular tree reflects speciation. These are the molecules people interested in the taxonomic classification of organisms want to study.
Paralogs: "deepest" bifurcation in molecular tree reflects gene duplication. The study of paralogs and their distribution in genomes provides clues on the way genomes evolved. Gene and genome duplication have emerged as the most important pathway to molecular innovation, including the evolution of developmental pathways.
Xenologs: gene was obtained by organism through horizontal transfer. The classic examples of xenologs are antibiotic resistance genes, but the history of many other molecules also fits into this category: inteins, self-splicing introns, transposable elements, ion pumps, other transporters.
Synologs: genes ended up in one organism through fusion of lineages. The paradigmatic examples are genes that were transferred into the eukaryotic cell together with the endosymbionts that evolved into mitochondria and plastids.
(The -logs are often spelled with "ue", as in orthologues.) See Fitch's article in TIG 2000 for more discussion.

Branches, splits, bipartitions. In a rooted tree: clades. Mono-, para-, polyphyletic groups, cladists and a natural taxonomy.
Terminology: The term cladogram refers to a strictly bifurcating diagram, where each clade is defined by a common ancestor that only gives rise to members of this clade. I.e., a clade is monophyletic (derived from one ancestor) as opposed to polyphyletic (derived from many ancestors). (Note: you need to know where the root is!) A clade is recognized and defined by shared derived characters (= synapomorphies). Shared primitive characters (= symplesiomorphies) do not define a clade (see in-class example drawing a la Hennig). To use these terms you need to have polarized characters; for most molecular characters you don't know which state is primitive and which is derived (exceptions:....).
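Given a rooted tree, checking whether a group forms a clade amounts to asking whether some node has exactly that group as its set of descendants. The sketch below uses nested tuples for the tree. The function names and the deliberately simplified four-taxon example are assumptions made for illustration.

```python
def leaf_set(tree):
    """Set of leaf names below a node of a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Leaf sets of every node in the rooted tree, root first."""
    result = [leaf_set(tree)]
    if not isinstance(tree, str):
        for child in tree:
            result.extend(clades(child))
    return result


def is_monophyletic(tree, group):
    """True if the group consists of an ancestor's complete set of descendants."""
    return frozenset(group) in clades(tree)


if __name__ == "__main__":
    # Simplified, illustrative topology; the root position matters for the answer.
    tree = ("mammal", ("lizard", ("crocodile", "bird")))
    print(is_monophyletic(tree, {"crocodile", "bird"}))
    print(is_monophyletic(tree, {"lizard", "crocodile"}))
    print(is_monophyletic(tree, {"lizard", "crocodile", "bird"}))
```

Moving the root changes which leaf sets count as clades, which is why the root position has to be known before these terms can be applied.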

The Coral of Life (Darwin)

Coalescence – the process of tracing lineages backwards in time to their common ancestors. Every two extant lineages coalesce to their most recent common ancestor. Eventually, all lineages coalesce to the cenancestor. t/2 (Kingman, 1982) Illustration is from J. Felsenstein, “Inferring Phylogenies”, Sinauer, 2003
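The expected waiting times under the standard (Kingman) coalescent can be computed directly, which may help in reading the t/2 label. The sketch assumes a neutral, constant-size population and measures time in coalescent units (N generations for a haploid Wright-Fisher population of size N). The function names are made up for this example.

```python
from fractions import Fraction


def expected_waiting_time(k):
    """Expected time during which exactly k lineages persist: 1 / (k choose 2)."""
    if k < 2:
        raise ValueError("at least two lineages are needed")
    return Fraction(2, k * (k - 1))


def expected_tmrca(n):
    """Expected time until n sampled lineages reach their most recent common ancestor."""
    return sum(expected_waiting_time(k) for k in range(2, n + 1))


if __name__ == "__main__":
    n = 20
    total = expected_tmrca(n)
    last_pair = expected_waiting_time(2)
    print("expected TMRCA:", total)
    print("share spent with only two lineages:", float(last_pair / total))
```

The sum telescopes to 2(1 - 1/n), so for large samples roughly half of the expected time back to the common ancestor is spent waiting for the final two lineages to coalesce.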

Coalescence of ORGANISMAL and MOLECULAR Lineages
Setup: 20 lineages. One extinction and one speciation event per generation. One horizontal transfer event once in 5 generations (i.e., speciation events). RED: organismal lineages (no HGT). BLUE: molecular lineages (with HGT). GRAY: extinct lineages. Axis: time.
RESULTS: Most recent common ancestors are different for organismal and molecular phylogenies. Different coalescence times. Long coalescence time for the last two lineages.
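A simulation in the spirit of the one summarized above can be sketched as follows. This is a simplified reading of the slide's description, not the original simulation code: each step replaces one randomly chosen lineage (extinction) with a daughter of another (speciation), and every fifth step one lineage receives its gene from a randomly chosen donor. All names, the number of steps and the seed are assumptions.

```python
import random


def simulate(n=20, steps=5000, hgt_every=5, seed=1):
    """Return per-step parent tables for organismal and gene lineages.

    table[t][i] is the slot at time t that slot i at time t + 1 descends from.
    """
    rng = random.Random(seed)
    org_table, gene_table = [], []
    for t in range(steps):
        org_parents = list(range(n))
        dead, mother = rng.sample(range(n), 2)
        org_parents[dead] = mother           # extinct slot refilled by a daughter of 'mother'
        gene_parents = list(org_parents)     # genes follow the organisms by default
        if (t + 1) % hgt_every == 0:
            recipient, donor = rng.sample(range(n), 2)
            # The recipient's gene is replaced by a copy of the donor's gene.
            gene_parents[recipient] = org_parents[donor]
        org_table.append(org_parents)
        gene_table.append(gene_parents)
    return org_table, gene_table


def coalescence_time(table):
    """Steps back from the present until all lineages share one ancestor (None if never)."""
    current = set(range(len(table[0])))
    for back, parents in enumerate(reversed(table), start=1):
        current = {parents[i] for i in current}
        if len(current) == 1:
            return back
    return None


if __name__ == "__main__":
    organisms, genes = simulate()
    print("organismal coalescence:", coalescence_time(organisms), "steps back")
    print("molecular coalescence:", coalescence_time(genes), "steps back")
```

Running it with different seeds shows how the organismal and molecular lineages can trace back to different ancestors at different times, and how long the final two lineages typically take to coalesce.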

Adam and Eve never met
Albrecht Dürer, The Fall of Man, 1504
Y chromosome Adam: lived approximately 50,000 years ago. Thomson, R. et al. (2000) Proc Natl Acad Sci U S A 97, Underhill, P.A. et al. (2000) Nat Genet 26,
Mitochondrial Eve: lived 166, ,000 years ago. Cann, R.L. et al. (1987) Nature 325, 31-6 Vigilant, L. et al. (1991) Science 253,
The same is true for ancestral rRNAs, EF, ATPases!

EXTANT LINEAGES FOR THE SIMULATIONS OF 50 LINEAGES Modified from Zhaxybayeva and Gogarten (2004), TIGs 20,

Lineages Through Time Plot
Green: organismal lineages; red: molecular lineages (with gene transfer). 10 simulations of organismal evolution assuming a constant number of species (200) throughout the simulation; 1 speciation and 1 extinction per time step (green O). 25 gene histories simulated for each organismal history assuming 1 HGT per 10 speciation events (red x). Axis: log (number of surviving lineages).

Bacterial 16S rRNA based phylogeny (from P. D. Schloss and J. Handelsman, Microbiology and Molecular Biology Reviews, December 2004). The deviation from the "long branches at the base" pattern could be due to undersampling, or to an actual radiation (due to an invention that was not transferred, or following a mass extinction).

More Terminology
Related terms:
autapomorphy = a derived character that is only present in one group; an autapomorphic character does not tell us anything about the relationship of the group that has this character to other groups.
homoplasy = a derived character that was derived twice independently (convergent evolution). Note that the characters in question might still be homologous (e.g. a position in a sequence alignment, front limbs turned into wings in birds and bats).
paraphyletic = a taxonomic group that is defined by a common ancestor; however, the common ancestor of this group also has descendants that do not belong to this taxonomic group. Many systematists despise paraphyletic groups (and consider them to be polyphyletic). Examples of paraphyletic groups are reptiles and protists. Many consider the archaea to be paraphyletic as well.
holophyletic = same as above, but the common ancestor gave rise only to members of the group.