Building Trees



What is a Tree? A tree is a visualization of the mathematical analysis of a comparison of characteristics in multiple individuals or species. The multiples can also be tissues or developmental stages in the case of microarrays. The closer branches share more similarities and the more distant branches are less similar.

Phylogeny (phylo =tribe + genesis) 1.Phylogeny inference or “tree building” — the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.) 2.Character and rate analysis — using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest

Start with a group of species and establish relationships based on measurements birds snakes rodents primates crocodiles marsupials lizards

This is an example of a phylogenetic tree. crocodiles birds lizards snakes rodents primates marsupials

Homology & Similarity
Homology: conserved sequences arising from a common ancestor.
Orthologs: homologous genes that share a common ancestor in the absence of any gene duplication (mouse and human hemoglobin).
Paralogs: genes related through gene duplication (one gene is a copy of another, e.g. fetal and adult hemoglobin).
Similarity: genes that share common sequences but are not necessarily related.
Before we continue, I would like to re-emphasize Fran’s earlier discussion of the definitions of homology and similarity. Homology is…for example the RELAPSE and GREASER sequences. You must be able to prove a common ancestor. Similarity, by contrast, describes any set of genes that share common sequences but are not necessarily related. Evolution can drive two vastly different ancestors to converge on similar sequences, even though they are not related via a common ancestor.

Sequences As Modules Proteins are derived from a limited number of basic building blocks (Modules) Evolution has shuffled these modules giving rise to a diverse repertoire of protein sequences Proteins can share a global or local relationships specific to a single DOMAIN Global Local

Sequence Domains Modules Define Functional/Structural Domains

Defining A Sequence Family Family B Family E Family D Family A Family C

Global vs. Local Alignments
Global: search for alignments matching over entire sequences.
Local: examine regions of sequence for conserved segments.
Both consider: matches, mismatches, gaps.
MSAs can be examined at two levels: globally or locally. Global MSAs attempt to match residues over the entire length of the sequences. Local MSAs, on the other hand, examine selected regions of sequence for conserved segments. In both cases, you are considering matches, mismatches, and gaps.

Global Sequence Alignments: Yeast Prion-Like Proteins. Here is an example MSA of a portion of several related kinases. Each position is aligned into a column, and columns are colored according to similar properties. A star indicates perfect conservation across all the sequences, whereas a colon or dot indicates sequence similarity. The more sequences you analyze, the less likely it is that every position will be conserved. However, you can see from this example that there is a very distinct set of conserved columns, where the sequences share the identical residue at a given position. This may indicate the importance of these residues for structure and function.

How To Make A Global MSA
On The Web: http://pir.georgetown.edu/pirwww/search/multaln.html
On Your Computer: ClustalX: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/

MSA Example Sequences (Standard FASTA Sequence Format)
>KSYK_HUMAN
FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVAHGRKAHHYTIERELNGTYAIAGGRTHASPADLCHYH
>ZA70_HUMAN
WYHSSLTREEAERKLYSGAQTDGKFLLRPRKEQGTYALSLIYGKTVYHYLISQDKAGKYCIPEGTKFDTLWQLVEYL
>KSYK_PIG
WFHGKISRDESEQIVLIGSKTNGKFLIRARDNGSYALGLLHEGKVLHYRIDKDKTGKLSIPGGKNFDTLWQLVEHY
>MATK_HUMAN
WFHGKISGQEAVQQLQPPEDGLFLVRESARHPGDYVLCVSFGRDVIHYRVLHRDGHLTIDEAVFFCNLMDMVEHY
>CSK_CHICK
WFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCEGKVEHYRIIYSSSKLSIDEEVYFENLMQLVEHY
>CRKL_HUMAN
WYMGPVSRQEAQTRLQGQRHGMFLVRDSSTCPGDYVLSVSENSRVSHYIINSLPNRRFKIGDQEFDHLPALLEFY
>YES_XIPHE
WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKLDNGGYYITTRTQFMSLQMLVKHY
>FGR_HUMAN
WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKLDMGGYYITTRVQFNSVQELVQHY
>SRC_RSVP
WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKLYSGGFYITSRTQFGSLQQLVAYY

MSA Example Result
YES_XIPHE   WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKL
FGR_HUMAN   WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKL
SRC_RSVP    WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKL
MATK_HUMAN  WFHGKISGQEAVQQLQPPED--GLFLVRESARHPGDYVLCVS-----FGRDVIHYRVLHR
CSK_CHICK   WFHGKITREQAERLLYPPET--GLFLVRESTNYPGDYTLCVS-----CEGKVEHYRIIYS
CRKL_HUMAN  WYMGPVSRQEAQTRLQGQRH--GMFLVRDSSTCPGDYVLSVS-----ENSRVSHYIINSL
ZA70_HUMAN  WYHSSLTREEAERKLYSGAQTDGKFLLRPRK-EQGTYALSLI-----YGKTVYHYLISQD
KSYK_PIG    WFHGKISRDESEQIVLIGSKTNGKFLIRAR--DNGSYALGLL-----HEGKVLHYRIDKD
KSYK_HUMAN  FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVA-----HGRKAHHYTIERE
Conservation symbols for this block (column positions not preserved): :: . : :: : * :*:* * : * : ** :

YES_XIPHE   DNGGYYITTRTQFMSLQMLVKHY
FGR_HUMAN   DMGGYYITTRVQFNSVQELVQHY
SRC_RSVP    YSGGFYITSRTQFGSLQQLVAYY
MATK_HUMAN  -DGHLTIDEAVFFCNLMDMVEHY
CSK_CHICK   -SSKLSIDEEVYFENLMQLVEHY
CRKL_HUMAN  PNRRFKIGDQE-FDHLPALLEFY
ZA70_HUMAN  KAGKYCIPEGTKFDTLWQLVEYL
KSYK_PIG    KTGKLSIPGGKNFDTLWQLVEHY
KSYK_HUMAN  LNGTYAIAGGRTHASPADLCHYH
Conservation symbols for this block (column positions not preserved): * . : .

Steps to Build Trees from MSA
1) identify taxa to be considered
2) choose characters (independent, “unit”)
3) construct character matrix for each taxon
4) after performing alignment, use a mathematical formula to describe the degree of similarity between each pair of taxa, e.g. the simple matching coefficient: S = (# matches) / (total # of characters)
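As a minimal sketch of step 4, the simple matching coefficient can be computed directly from two aligned character strings. The function name and the two example strings are hypothetical illustrations, not taken from the slides.

```python
def simple_matching(chars_a, chars_b):
    """S = (# matches) / (total # of characters) for two aligned character strings."""
    if len(chars_a) != len(chars_b):
        raise ValueError("both taxa must be scored for the same characters")
    matches = sum(1 for x, y in zip(chars_a, chars_b) if x == y)
    return matches / len(chars_a)


# Hypothetical taxa scored for ten binary characters (1 = present, 0 = absent)
taxon_a = "1101001110"
taxon_d = "1101011010"
print(simple_matching(taxon_a, taxon_d))
```

Every character counts equally here, which is the "unweighted" choice the next slide describes.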

Steps to Build Trees 5) construct matrix with pairwise S values 6) use clustering technique to produce a tree (dendrogram) Unweighted/Equal weighting = all characters given equal consideration UPGMA (Unweighted Pair Group Method with Arithmetic Averaging) Neighbour-joining Unweighting is a form of weighting

Building Matrices
Character Matrix: taxa A, B, C, D (rows) scored for characters 1-10 (columns).
S-value Matrix:
Taxon   A     B     C     D
A       --    0.3   0.4   0.7
B             --    0.5   0.4
C                   --    0.3
D                         --

Joining Clusters into a Tree
Closest: A&D = 0.7
2nd closest: B&C = 0.5
When does A&D join B&C? At the average of the four between-cluster S values:
[(A&B) + (A&C) + (D&B) + (D&C)] / 4 = (0.3 + 0.4 + 0.4 + 0.3) / 4 = 0.35
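The averaging step can be sketched in a few lines. The S values below are the ones quoted on this slide; the dictionary layout and the function name are assumptions made for illustration.

```python
from itertools import product

# Pairwise S values quoted on the slide
S = {
    frozenset("AB"): 0.3, frozenset("AC"): 0.4, frozenset("AD"): 0.7,
    frozenset("BC"): 0.5, frozenset("BD"): 0.4, frozenset("CD"): 0.3,
}


def average_similarity(cluster_1, cluster_2, s_values=S):
    """Arithmetic mean of S over every pair with one member in each cluster."""
    pairs = list(product(cluster_1, cluster_2))
    return sum(s_values[frozenset(pair)] for pair in pairs) / len(pairs)


level = average_similarity("AD", "BC")
print(round(level, 2))
```

This is the "arithmetic averaging" in UPGMA: the two clusters join at the mean of all between-cluster values.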

Problems
Different methods or characters = different dendrograms.
If we use all possible characteristics, this would be a natural classification.
The tree is an accurate phylogeny if differences in characters between taxa are proportional to the time elapsed since their common ancestor.

Convergent Evolution Similar phenotypic response to similar ecological conditions Different developmental pathways

Reversal of Evolution An altered character reverts to the ancestral form. In a DNA molecule, a nucleotide position may change from a C to a T and then back to a C. This frog reverted to teeth.

Trees are hypotheses about evolutionary history. Different methods may result in different trees. How to choose between the different models? One way is to compare different types of character data and see if the trees make sense.

Haplotype Network in 3 Elephant Species with 3 DNA sequences [Figure]

Parsimonious choices reflect fewer changes The assumptions of parsimony Reversals and convergence require more changes Parsimonious trees represent best estimates of phylogenetic relationships

Use of DNA, RNA, or Protein For phylogeny, DNA can be more informative. The protein-coding portion of DNA has synonymous and nonsynonymous substitutions. Some DNA changes do not have corresponding protein changes See arrows 14, 21, 25, 27, 29 in the retinol-binding protein figure.

For phylogeny, DNA can be more informative. If the synonymous substitution rate (dS) is greater than the nonsynonymous substitution rate (dN), the DNA sequence is under negative (purifying) selection. This limits change in the sequence. If dS < dN, positive selection occurs. For example, a duplicated gene may evolve rapidly to assume new functions.

Models of nucleotide substitution: Transitions > Transversions [Figure: nucleotides G, C, T and their partners, with arrows labeled "transition" and "transversion"]

Some substitutions in a DNA sequence alignment can be directly observed: single nucleotide substitutions sequential substitutions coincidental substitutions

Additional mutational events can be inferred by analysis of ancestral sequences. These changes include parallel substitutions convergent substitutions back substitutions

Advantages of DNA
Noncoding regions (such as 5’ and 3’ untranslated regions) may be analyzed using molecular phylogeny. See Figure 11.10 (arrows 4-10 and 35-38).
Pseudogenes (nonfunctional genes) are studied by molecular phylogeny.
Rates of transitions and transversions can be measured.
Transitions: purine to purine (A to G) or pyrimidine to pyrimidine (C to T) substitutions.
Transversions: purine to pyrimidine, or pyrimidine to purine.
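A rough sketch of how transitions and transversions might be counted between two aligned DNA sequences follows. The function names and the short example sequences are hypothetical; gaps and ambiguity codes are not handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(x, y):
    """Return 'transition', 'transversion', or 'none' for one aligned pair of bases."""
    if x == y:
        return "none"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ts_tv(seq_a, seq_b):
    """Count (transitions, transversions) between two aligned, gap-free sequences."""
    kinds = [classify_substitution(x, y) for x, y in zip(seq_a, seq_b)]
    return kinds.count("transition"), kinds.count("transversion")


# Hypothetical toy sequences: A>G is a transition, G>T and T>A are transversions
print(count_ts_tv("ACGT", "GCTA"))
```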

Protein sequences are also used for phylogeny Proteins have 20 states (amino acids) instead of only four for DNA, so there is more phylogenetic information. Nucleotides are unordered characters: any one nucleotide can change to any other in one step. An ordered character must pass through one or more intermediate states before reaching the final state. Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value.

Amino acid sequences
From the standpoint of the genetic code, some amino acid changes can be made by a single DNA mutation, while others require two or even three changes in the DNA sequence.
Some amino acids can replace one another with relatively little effect on the structure and function of the final protein, while other replacements can be functionally devastating.
Tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks are used: PAM and BLOSUM.

Sequence-Based Comparisons
Identify sequences that are related to each other within an organism and/or across different species.
Within: fetal and adult hemoglobin
Across: human and chimpanzee hemoglobin
Generate an evolutionary history of related genes.
Locate insertions, deletions, and substitutions that have occurred during evolution.
Residue key: (C) Cysteine, (R) Arginine, (E) Glutamate, (A) Alanine, (T) Threonine, (S) Serine, (L) Leucine, (P) Proline, (G) Glycine
Draw comparisons among different proteins to answer these questions. Begin by identifying proteins within an organism that are related to each other and across different species. Generate an evolutionary history of the genes to see which ones are more closely related. Then locate the specific insertions, deletions, and substitutions. For example, how did the sequences RELAPSE and GREASER evolve from a common ancestor?
[Figure: sequences CREATE, CREASE, -RELAPSE, and GREASER, with the labels [Ancestor] and [Progenitors]]

Multiple Sequence Alignments
Place residues in columns that are derived from a common ancestral residue.
Identify matches, mismatches, and gaps.
MSA can reveal sequence patterns, for instance:
  Demonstration of homology between >2 sequences
  Identification of functionally important sites
  Protein function prediction
  Structure prediction
Sequences: CREASE, CREATE, RELAPSE, GREASER
SeqA  CRE-A-TE-
SeqB  CRE-A-SE-
SeqC  GRE-A-SER
SeqD  -RELAPSE-
      123456789
The idea behind a multiple sequence alignment is to place residues into columns that are derived from a common ancestral residue. This is similar to a phylogenetic tree, but on a discrete residue-by-residue basis. MSAs are very useful in revealing sequence patterns.

MSA and Tree Relationship
“The optimal alignment of several sequences can be thought of as minimizing the number of mutational steps in an evolutionary tree for which the sequences are the leaves” (Mount, 2001)
MSAs and trees are interrelated: one assists in the construction of the other. For instance, here is an MSA of 4 related sequences, showing matches, insertions, deletions, and mismatches. It can be represented as a tree structure.
SeqA  CRE-A-TE-
SeqB  CRE-A-SE-
SeqC  GRE-A-SER
SeqD  -RELAPSE-
[Figure: tree with nodes CREATE, CREASE, and GREASE leading to SeqA-SeqD; branch labels: T to S, C to G, +R, +L, +P, -G]

Multiple Sequence Alignments Confirm that all sequences are homologous Adjust gap creation and extension penalties as needed to optimize the alignment Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data). Many experts recommend that you delete any column of an alignment that contains gaps (even if the gap occurs in only one taxon)

Problems in Reconstructing Phylogeny
Characters sometimes conflict.
It is sometimes difficult to tell homology from homoplasy.
Analogy: characters similar because of convergent evolution.
Reversal: character reverts to ancestral form.
With morphological characters, careful examination may distinguish homoplasy (analogy) from homology.
With molecular characters (DNA/protein sequences), homoplasy is sometimes impossible to distinguish from homology, and orthologs from paralogs.

A Phylogenetic Tree Taxon -- Any named group of organisms – evolutionary theory not necessarily involved. Clade -- A monophyletic taxon (evolutionary theory utilized)

A phylogenetic tree with branch lengths. Branch length can be significant. In this case it is: mouse is slightly more similar to fly than human is to fly (the sum of branches 1+2+3 is less than the sum of 1+2+4).

Common Phylogenetic Tree Terminology
Terminal Nodes (A, B, C, D, E): represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
Branches or Lineages
Internal Nodes or Divergence Points: represent hypothetical ancestors of the taxa
Ancestral Node or ROOT of the Tree

Phylogenetic trees diagram the evolutionary relationships between the taxa (Taxon A, Taxon B, Taxon C, Taxon D, Taxon E).
There is no meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
The branch-length dimension either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
((A,(B,C)),(D,E)) = the above phylogeny as nested parentheses.
These say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
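To make the nested-parentheses notation concrete, here is a small sketch that reads a label-only string such as ((A,(B,C)),(D,E)) into nested tuples and lists the clades it implies. Both function names are hypothetical, and the parser assumes no branch lengths, no spaces, and no trailing semicolon.

```python
def parse_nested(text):
    """Parse a label-only nested-parentheses tree into nested tuples of strings."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(parse_node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]            # a leaf label

    return parse_node()


def collect_clades(node, out):
    """Append the sorted leaf list of every internal node to `out`; return the leaf set."""
    if isinstance(node, str):
        return {node}
    leaves = set()
    for child in node:
        leaves |= collect_clades(child, out)
    out.append(sorted(leaves))
    return leaves


tree = parse_nested("((A,(B,C)),(D,E))")
clades = []
collect_clades(tree, clades)
for clade in clades:
    print(clade)
```

The clades come out innermost first: B with C, then A joining them, then D with E, and finally the whole tree.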

Three types of trees
Cladogram: branch lengths have no meaning
Phylogram: branch lengths represent genetic change
Ultrametric tree: branch lengths represent time
All show the same evolutionary relationships, or branching orders, between the taxa (Taxon A, B, C, D).

Types of trees: Cladogram. Shows relative recency of common descent. Does not imply that ancestors on the same line necessarily speciated at the same time: t1 can be before or after t2, but not before t3 (no time scale).

Types of trees: Phylogram (additive tree: branch lengths can be summed). Shows relative recency of common descent, and branch lengths = amount of change.

Types of trees: Ultrametric tree (linearized tree). All tree tips are equidistant from the root. Amount of change (divergence) can be scaled to time; scale = time.

The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees
Completely unresolved or "star" phylogeny
Partially resolved phylogeny
Fully resolved, bifurcating phylogeny
[Figure: taxa A, B, C, D, E drawn in each of the three ways, with one node labeled "polytomy or multifurcation" and one labeled "a bifurcation"]

There are three possible unrooted trees for four species (A, B, C, D). The goal of phylogenetic tree building methods is to discover which of the possible unrooted trees is "correct". This should be the “true” biological tree, accurately representing the evolutionary history of the species. However, it is only possible to discover the computationally correct or optimal tree for the phylogenetic method of choice.

The number of unrooted trees increases in a greater than exponential manner with number of species (taxa): (2N - 5)!! = # unrooted trees for N taxa
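A short sketch shows how quickly these counts grow. The double factorial n!! multiplies every second integer down to 1; the function names are illustrative, and the rooted count (2N - 3)!! is the standard companion to the unrooted formula on the slide.

```python
def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2 (returns 1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def unrooted_trees(n_taxa):
    """Number of bifurcating unrooted trees for n_taxa >= 3: (2N - 5)!!"""
    return double_factorial(2 * n_taxa - 5)


def rooted_trees(n_taxa):
    """Number of bifurcating rooted trees for n_taxa >= 2: (2N - 3)!!"""
    return double_factorial(2 * n_taxa - 3)


for n in (4, 5, 10, 20):
    print(n, unrooted_trees(n), rooted_trees(n))
```

For four taxa this gives 3 unrooted and 15 rooted trees, consistent with three unrooted topologies that can each be rooted on any of five branches.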

Inferring evolutionary relationships between the taxa requires rooting the tree. To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root. [Figure: unrooted tree of A, B, C, D with the root position marked, and the resulting rooted tree] Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

Try it again with the root at another position. [Figure: the same unrooted tree of A, B, C, D with the root moved, and the resulting rooted tree] Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees. [Figure: the unrooted tree 1 of A, B, C, D with root positions 1-5 marked, and the corresponding rooted trees 1a, 1b, 1c, 1d, 1e] These trees show five different evolutionary relationships among the taxa!

Trick Question Warning Sometimes two trees may look very different but, in fact, differ only in the position of the root. Don’t forget rotational symmetry!

All of these rearrangements show the same evolutionary relationships between the taxa. [Figure: rooted tree 1a redrawn with its branches rotated, so that taxa A, B, C, D appear in different orders]

There are two major ways to root trees
By outgroup: Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).
By midpoint or distance: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.
Example (taxa A, B, C, D; branch lengths 10, 3, 2, 2, 5): d (A,D) = 10 + 3 + 5 = 18, Midpoint = 18 / 2 = 9
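The midpoint arithmetic can be sketched as a walk along the path between the two most distant taxa. This assumes the path and its branch lengths are already known (here 10, 3, 5 from A to D, as in the slide's example); the function name is hypothetical.

```python
def midpoint_on_path(branch_lengths):
    """Given branch lengths along a tip-to-tip path, return (branch index, offset into it)."""
    half = sum(branch_lengths) / 2
    walked = 0.0
    for index, length in enumerate(branch_lengths):
        if walked + length >= half:
            # The midpoint lies on this branch, `half - walked` from its start
            return index, half - walked
        walked += length
    raise ValueError("path must contain at least one branch")


# Path from A to D: branches of length 10, 3, and 5 (total 18, midpoint at 9)
branch, offset = midpoint_on_path([10, 3, 5])
print(branch, offset)
```

In this example the root falls on A's own branch, 9 units from A and 1 unit short of the first internal node.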

Rooting Using an Outgroup The outgroup should be a sequence (or set of sequences) known to be less closely related to the rest of the sequences than they are to each other. It should ideally be as closely related as possible to the rest of the sequences while still satisfying the first condition. The root must be somewhere between the outgroup and the rest (either on the node or in a branch).

Automatic Rooting Many software packages will root trees automatically (e.g. mid-point rooting in NJPlot) This normally involves assumptions… BE AWARE what those are.

Each unrooted tree theoretically can be rooted anywhere along any of its branches, so: (2N - 3)!! = # rooted trees for N taxa

Molecular phylogenetic tree building methods Mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available, each with strengths and weaknesses.

Types of data used in phylogenetic inference
Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.
Taxa        Characters
Species A   ATGGCTATTCTTATAGTACG
Species B   ATCGCTAGTCTTATATTACA
Species C   TTCACTAGACCTGTGGTCCA
Species D   TTGACCAGACCTGTGGTCCG
Species E   TTGACCAGTTCTCTAGTTCG
Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building.
            A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----
Example 1 (above the diagonal): Uncorrected “p” distance (=observed percent sequence difference)
Example 2 (below the diagonal): Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
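As a hedged sketch of the two distance measures named here, the code below computes the uncorrected p distance and the Kimura 2-parameter distance, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The sequences are the slide's Species A to D; the function names are illustrative, and gaps are not handled.

```python
import math

PURINES = {"A", "G"}


def p_distance(seq_a, seq_b):
    """Observed proportion of aligned sites that differ."""
    diffs = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return diffs / len(seq_a)


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    transitions = transversions = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    P = transitions / len(seq_a)
    Q = transversions / len(seq_a)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


species = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
}
for x, y in (("A", "B"), ("C", "D")):
    print(x, y, round(p_distance(species[x], species[y]), 2),
          round(k2p_distance(species[x], species[y]), 2))
```

The corrected distance is always at least as large as the p distance, because it estimates substitutions hidden by multiple changes at the same site.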

Similarity vs. Evolutionary Relationship
Similarity and relationship are not the same thing, even though evolutionary relationship is inferred from certain types of similarity.
Similar: having likeness or resemblance (an observation)
Related: genetically connected (an historical fact)
Two taxa can be most similar without being most closely related. [Figure: tree of Taxon A, Taxon B, Taxon C, Taxon D with branch lengths 1, 6, 3, 5] C is more similar in sequence to A (d = 3) than to B (d = 7), but C and B are most closely related (that is, C and B shared a common ancestor more recently than either did with A).

Types of Similarity
Observed similarity between two entities can be due to:
Evolutionary relationship: shared ancestral characters (‘plesiomorphies’), shared derived characters (‘synapomorphies’)
Homoplasy (independent evolution of the same character): convergent events (in either related or unrelated entities), parallel events (in related entities), reversals (in related entities)
Character-based methods can tease apart types of similarity and theoretically find the true evolutionary tree. Similarity = relationship only if certain conditions are met (if the distances are ‘ultrametric’).

METRIC DISTANCES between any two or three taxa (a, b, and c) have the following properties:
Property 1: d (a, b) ≥ 0 (Non-negativity)
Property 2: d (a, b) = d (b, a) (Symmetry)
Property 3: d (a, b) = 0 if and only if a = b (Distinctness)
Property 4: d (a, c) ≤ d (a, b) + d (b, c) (Triangle inequality)
[Figure: triangle on taxa a, b, c with sides 6, 9, 5]

ULTRAMETRIC DISTANCES must satisfy the previous four conditions, plus:
Property 5: d (a, b) ≤ maximum [d (a, c), d (b, c)]
This implies that the two largest distances are equal, so that they define an isosceles triangle. [Figures: triangles on taxa a, b, c with sides labeled 4 and 6, and 2 and 4]
Similarity = Relationship if the distances are ultrametric!
If distances are ultrametric, then the sequences are evolving in a perfectly clock-like manner, and thus can be used in UPGMA trees and for the most precise calculations of divergence dates.
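Property 5 can be checked mechanically: in every triple of taxa, the two largest distances must be equal. The sketch below assumes distances are stored in a dictionary keyed by unordered pairs; the function name and both example matrices are illustrative values, not data from the slides.

```python
from itertools import combinations


def is_ultrametric(taxa, dist, tol=1e-9):
    """Three-point check: in every triple, the two largest distances must be equal."""
    for a, b, c in combinations(taxa, 3):
        low, mid, high = sorted(
            dist[frozenset(pair)] for pair in ((a, b), (a, c), (b, c))
        )
        if high - mid > tol:
            return False
    return True


# Illustrative distances: the first triple is isosceles, the second is not
clocklike = {frozenset("ab"): 2, frozenset("ac"): 4, frozenset("bc"): 4}
not_clocklike = {frozenset("ab"): 6, frozenset("ac"): 9, frozenset("bc"): 5}

print(is_ultrametric("abc", clocklike))
print(is_ultrametric("abc", not_clocklike))
```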

ADDITIVE DISTANCES: Property 6: d (a, b) + d (c, d) ≤ maximum [d (a, c) + d (b, d), d (a, d) + d (b, c)] For distances to fit into an evolutionary tree, they must be either metric or ultrametric, and they must be additive. Estimated distances often fall short of these criteria, and thus can fail to produce correct evolutionary trees.
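Property 6 is often called the four-point condition: for every four taxa, of the three ways to pair them up, the two largest sums of distances must be equal. A minimal sketch follows; the function name and the example distances are assumptions chosen to fit a tree ((A,B),(C,D)) with tip branches 1, 2, 3, 4 and an internal branch of 5.

```python
from itertools import combinations


def is_additive(taxa, dist, tol=1e-9):
    """Four-point check: for every quartet, the two largest pairwise sums must be equal."""
    def d(x, y):
        return dist[frozenset((x, y))]

    for a, b, c, e in combinations(taxa, 4):
        sums = sorted([d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)])
        if sums[2] - sums[1] > tol:
            return False
    return True


# Path lengths on the hypothetical tree described above
tree_like = {
    frozenset("AB"): 3, frozenset("AC"): 9, frozenset("AD"): 10,
    frozenset("BC"): 10, frozenset("BD"): 11, frozenset("CD"): 7,
}
# Same distances with one estimate perturbed, so they no longer fit any tree exactly
perturbed = dict(tree_like)
perturbed[frozenset("AD")] = 12

print(is_additive("ABCD", tree_like))
print(is_additive("ABCD", perturbed))
```

The perturbed case illustrates the slide's warning: estimated distances can fall short of additivity even when the underlying history is a tree.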

Tree-building methods: UPGMA
UPGMA is: unweighted pair group method using arithmetic mean.
Step 1: Compute the pairwise distances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree.
Step 2: Find the two proteins with the smallest pairwise distance. Cluster them.
Step 3: Do it again. Find the next two proteins with the smallest pairwise distance. Cluster them.
Step 4: Keep going. Cluster.
Step 5: Last cluster! This is your tree.

Distance-based methods: UPGMA trees. UPGMA is a simple approach for making trees. A UPGMA tree is always rooted. An assumption of the algorithm is that the molecular clock is constant for sequences in the tree. If there are unequal substitution rates, the tree may be wrong. While UPGMA is simple, it is less accurate than the neighbor-joining approach.
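The clustering steps can be sketched compactly. This is a minimal illustration rather than a production implementation: the function name, the dictionary-of-pairs input format, and the five-taxon distances are all assumptions, and ties are broken by whichever pair is encountered first.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster labels by UPGMA; return (nested-tuple tree, node heights).

    dist maps frozenset({x, y}) to the distance between labels x and y.
    """
    sizes = {label: 1 for label in labels}        # cluster -> number of tips in it
    d = dict(dist)
    heights = {label: 0.0 for label in labels}
    node = labels[0]
    while len(sizes) > 1:
        # Join the two closest clusters
        x, y = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        node = (x, y)
        heights[node] = d[frozenset((x, y))] / 2  # tips are equidistant from the node
        nx, ny = sizes.pop(x), sizes.pop(y)
        # Distance to the new cluster is the size-weighted mean of the old distances
        for other in sizes:
            d[frozenset((node, other))] = (
                nx * d[frozenset((x, other))] + ny * d[frozenset((y, other))]
            ) / (nx + ny)
        sizes[node] = nx + ny
    return node, heights


# Hypothetical clock-like distances among five proteins labeled "1".."5"
pairs = {
    ("1", "2"): 2, ("4", "5"): 4, ("3", "4"): 6, ("3", "5"): 6,
    ("1", "3"): 10, ("1", "4"): 10, ("1", "5"): 10,
    ("2", "3"): 10, ("2", "4"): 10, ("2", "5"): 10,
}
dist = {frozenset(pair): value for pair, value in pairs.items()}
tree, heights = upgma(["1", "2", "3", "4", "5"], dist)
print(tree)
print(heights[tree])
```

Because every node height is half the distance between the clusters it joins, all tips end up equidistant from the root, which is exactly the molecular clock assumption described above.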

Making trees using Neighbor-Joining. The neighbor-joining method of Saitou and Nei (1987) is especially useful for making a tree having a large number of taxa. Begin by placing all the taxa in a star-like structure.

Tree-building methods: Neighbor joining Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.

Tree-building methods: Neighbor joining. Define the distance from X to Y by d_XY = 1/2 (d_1Y + d_2Y - d_12)
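The distance update above is a single line of arithmetic. The sketch below applies it after joining two neighbors; it covers only this one formula, not the full neighbor-joining procedure (the choice of which pair to join is not shown). The function name and the distances are hypothetical, chosen to fit a tree ((A,B),(C,D)) with tip branches 1, 2, 3, 4 and an internal branch of 5.

```python
def distance_from_new_node(dist, i, j, other):
    """d_XY = 1/2 (d_iY + d_jY - d_ij), where X is the node joining neighbors i and j."""
    def d(x, y):
        return dist[frozenset((x, y))]

    return 0.5 * (d(i, other) + d(j, other) - d(i, j))


# Path lengths on the hypothetical tree described above
dist = {
    frozenset("AB"): 3, frozenset("AC"): 9, frozenset("AD"): 10,
    frozenset("BC"): 10, frozenset("BD"): 11, frozenset("CD"): 7,
}
# Join neighbors A and B into a new node X, then measure X to C and X to D
print(distance_from_new_node(dist, "A", "B", "C"))
print(distance_from_new_node(dist, "A", "B", "D"))
```

On these additive distances the formula recovers the true path lengths from the new node: internal branch plus tip branch, 5 + 3 and 5 + 4.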

Example of a neighbor-joining tree: phylogenetic analysis of 13 Retinol Binding Proteins

Tree-building methods: character based Rather than pairwise distances between proteins, evaluate the aligned columns of amino acid residues (characters). Tree-building methods based on characters include maximum parsimony and maximum likelihood.

Making trees using character-based methods The main idea of character-based methods is to find the tree with the shortest branch lengths possible: the most parsimonious (“simple”) tree. Identify informative sites. For example, constant characters are not parsimony-informative. Construct trees, counting the number of changes required to create each tree. For about 12 taxa or fewer, evaluate all possible trees exhaustively; for >12 taxa perform a heuristic search. Select the shortest tree (or trees).

As an example of tree-building using maximum parsimony, consider these four taxa: AAG AAA GGA AGA How might they have evolved from a common ancestor such as AAA?

Tree-building methods: Maximum parsimony
Three possible trees for the taxa AAG, AAA, GGA, AGA, each drawn from the ancestor AAA:
Tree 1 groups (AAG, AAA) and (GGA, AGA): Cost = 3
Tree 2 groups (AAG, AGA) and (AAA, GGA): Cost = 4
Tree 3 groups (AAG, GGA) and (AAA, AGA): Cost = 4
In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).
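The three costs can be reproduced with Fitch's counting procedure, a standard way to score a fixed tree under parsimony (the slide does not name a specific algorithm, so this is a swapped-in technique). The leaf names t1 to t4 and the function names are hypothetical labels for the four example sequences.

```python
def fitch(tree, site):
    """Return (possible ancestral states, minimum changes) for one alignment column."""
    if isinstance(tree, str):
        return {site[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch(left, site)
    right_states, right_cost = fitch(right, site)
    shared = left_states & right_states
    if shared:
        return shared, left_cost + right_cost
    # No state in common: one change is needed on a branch below this node
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_cost(tree, seqs):
    """Sum the per-column Fitch costs over an alignment of equal-length sequences."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


seqs = {"t1": "AAG", "t2": "AAA", "t3": "GGA", "t4": "AGA"}
candidate_trees = [
    (("t1", "t2"), ("t3", "t4")),
    (("t1", "t4"), ("t2", "t3")),
    (("t1", "t3"), ("t2", "t4")),
]
costs = [parsimony_cost(tree, seqs) for tree in candidate_trees]
print(costs)
```

Scoring all three topologies for four taxa is an exhaustive search in miniature; with more taxa the same scoring function would be called inside a branch-and-bound or heuristic search.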

Types of computational methods

Clustering algorithms: Use pairwise distances. Are purely algorithmic methods, in which the algorithm itself defines the tree selection criterion. Tend to be very fast programs that produce single trees rooted by distance. Have no objective function for comparing the result to other trees, even if numerous other trees could explain the data equally well. Warning: Finding a single tree is not necessarily the same as finding the "true" evolutionary tree.

Optimality approaches: Use either character or distance data. First define an optimality criterion (minimum branch lengths, fewest number of events, highest likelihood), and then use a specific algorithm for finding trees with the best value for the objective function. Can identify many equally optimal trees, if such exist. Warning: Finding an optimal tree is not necessarily the same as finding the "true" tree.

Computational methods for finding optimal trees: Exact algorithms: "Guarantee" to find the optimal or "best" tree for the method of choice. Two types used in tree building: Exhaustive search: Evaluates all possible unrooted trees, choosing the one with the best score for the method. Branch-and-bound search: Eliminates the parts of the search tree that only contain suboptimal solutions. Heuristic algorithms: Approximate or “quick-and-dirty” methods that attempt to find the optimal tree for the method of choice, but cannot guarantee to do so. Heuristic searches often operate by “hill-climbing” methods.

Exact searches become increasingly difficult, and eventually impossible, as the number of taxa increases. [Figure: example unrooted trees for 3, 4, 5, and 6 taxa, labeled A to F.] (2N - 5)!! = # unrooted trees for N taxa
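To get a feel for how fast the search space grows, the double factorial can be evaluated directly. This is a small sketch; the function name `num_unrooted_trees` is chosen for the example.

```python
def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # 3 * 5 * ... * (2n - 5)
        count *= k
    return count


for n in (3, 4, 5, 6, 10, 20):
    print(n, num_unrooted_trees(n))
```

The counts run 1, 3, 15, 105 for 3 to 6 taxa and already exceed two million at 10 taxa, which is why exhaustive search is only practical for small data sets.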

Heuristic search algorithms are input-order dependent and can get stuck in local minima or maxima. Rerunning heuristic searches using different input orders of taxa can help find global minima or maxima. [Figure: two search landscapes, one for a search for the global maximum and one for a search for the global minimum, each marking a local optimum and the global optimum.]
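The effect of the starting point can be shown on a toy one-dimensional landscape. This is only an analogy: real tree searches move between tree topologies, not list positions, and the scores and the function name `hill_climb` are invented for the illustration.

```python
def hill_climb(scores, start):
    """Greedy ascent: move to the better neighbor until none improves."""
    pos = start
    while True:
        neighbors = [p for p in (pos - 1, pos + 1) if 0 <= p < len(scores)]
        best = max(neighbors, key=lambda p: scores[p])
        if scores[best] <= scores[pos]:
            return pos  # no neighbor is better: a local (maybe global) maximum
        pos = best


# Hypothetical landscape with a local maximum (3) and a global maximum (7).
landscape = [1, 3, 2, 1, 4, 7, 5]

stuck = hill_climb(landscape, 0)    # ends at index 1, the local maximum
found = hill_climb(landscape, 3)    # ends at index 5, the global maximum

# Restarting from every position and keeping the best result.
best_score = max(landscape[hill_climb(landscape, s)] for s in range(len(landscape)))
```

Starting from position 0 the search stops at the local peak, while starting from position 3 reaches the global peak, which is the reason for rerunning heuristic searches from several starting orders.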

Assumptions made by phylogenetic methods: The sequences are correct; the sequences are homologous; each position is homologous; the sampling of taxa or genes is sufficient to resolve the problem of interest; sequence variation is representative of the broader group of interest; sequence variation contains sufficient phylogenetic signal (as opposed to noise) to resolve the problem of interest; each position in the sequence evolved independently.

Problems with Phylogenetic Inference How do we know what the potential candidate trees are? How do we choose which tree is (most likely) the true tree? The best tree is the one that produces consistent results.

Recipe for reconstructing a phylogeny: Select an optimality criterion. Select a search strategy. Use the selected search strategy to generate a series of trees, and apply the selected optimality criterion to each tree, always keeping track of the "best" tree examined thus far. How do you know the "best" tree? Which is the "true" tree?

Search strategy: Which is the right tree? When m is the number of taxa, the number of possible rooted trees is (2m-3)! / [2^(m-2) (m-2)!]. For 10 taxa, the number of trees is 34,459,425. Many trees can be discarded because they are obviously wrong. Sometimes there is a general or even specific grouping that can serve as a start for the tree search. A number of approaches to tree searching can be used.
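The count quoted for 10 taxa can be reproduced from the formula, reading the flattened denominator as 2 raised to the power m-2 times (m-2) factorial. A minimal check, with the function name `num_rooted_trees` chosen for the example:

```python
from math import factorial


def num_rooted_trees(m):
    """Number of rooted bifurcating trees for m >= 2 taxa.

    (2m - 3)! / (2**(m - 2) * (m - 2)!), which equals (2m - 3)!!.
    """
    if m < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))


trees_for_10 = num_rooted_trees(10)  # 34459425
```

The division is exact, so integer division is safe here; the result for 10 taxa matches the 34,459,425 given on the slide.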

Evaluating the best tree Maximum likelihood (ML) tests the hypothesis using a mathematical model that gives the probability that a nucleotide substitution will occur. It takes a tree with known branch lengths and asks how likely the observed DNA sequences are on that tree. If similar trees have nearly the same likelihood as the highest-scoring tree, the hypothesis is weaker.

Evaluating the best tree Bayesian Markov Chain Monte Carlo (BMCMC): asks the probability that a particular tree is correct given the data and a model of how traits change. Distance methods: look at the changes in characters and convert them into distances, assuming a specific model of character change; taxa are then clustered so that the more similar forms are close together.

Current strategy: Produce a consensus tree using parsimony. Evaluate the best tree using the statistical tests found in ML and BMCMC. Compare the best trees from parsimony, ML, and BMCMC. The best tree is the one that produces consistent results.

Evaluating branches Evaluation is done statistically. Methods based on ML and BMCMC compare trees with and without the branch. Trees based on maximum parsimony use bootstrapping.

Bootstrapping In bootstrapping, for example, you analyze 300 base pairs of a gene. The computer program draws 300 sites at random from that alignment, with replacement, builds a tree from the resampled data, and repeats this many times to determine how often the branch in question occurs among all of the trees generated. If the branch occurs under 50% of the time, there is too much uncertainty and the branch is collapsed into a polytomy, or point of uncertainty. Bootstrap support of around 70% or higher is usually taken as a sign that the branch reflects the true phylogeny.
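The resampling step can be sketched as follows. This is a minimal illustration under stated assumptions: the alignment is a plain dictionary of equal-length strings, the tree-building step is left out, and the names `bootstrap_alignment` and `support` are made up for the example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate).

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    rng:       a random.Random instance, so runs can be made repeatable.
    """
    length = len(next(iter(alignment.values())))
    # Draw as many column indices as the alignment has sites.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def support(replicate_has_branch):
    """Percentage of replicate trees that contain the branch of interest."""
    return 100.0 * sum(replicate_has_branch) / len(replicate_has_branch)


# Hypothetical two-taxon alignment, only to show the column resampling.
aln = {"t1": "ACGT", "t2": "AGGT"}
replicate = bootstrap_alignment(aln, random.Random(0))
```

Whole columns are resampled together, so each taxon keeps its alignment with the others; in practice a tree would be built from every replicate and `support` applied to a list recording whether each replicate tree contains the branch.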

Resolving conflict Researchers have more confidence in trees that use: larger data sets; characters that are not subject to homoplasy; appropriate inference methods. Sometimes researchers have to wait for more data.

Molecular clocks The timing and rate of evolution can be estimated by looking at changing molecular traits. Changes in DNA sequences that are not tied to phenotypes, and so cannot be selected against, can be tracked. The neutral theory of molecular evolution predicts that neutral changes in DNA should accumulate at the same rate as the neutral mutation rate. The method takes the number of neutral differences observed between two species and divides it by a calibration rate that represents the number of changes that occur per million years. This helps estimate when the two species diverged.
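The arithmetic behind a molecular clock estimate is short enough to write out. The sketch assumes the calibration rate is expressed as differences accumulated between two lineages per million years, and all of the numbers are invented for illustration.

```python
def calibrate_rate(differences, known_divergence_myr):
    """Differences accumulated between two lineages per million years."""
    return differences / known_divergence_myr


def divergence_time(differences, rate_per_myr):
    """Estimated time since two lineages split, in million years."""
    return differences / rate_per_myr


# Hypothetical calibration pair: 20 neutral differences and a split dated
# independently (for example from fossils) at 10 million years ago.
rate = calibrate_rate(20, 10)          # 2.0 differences per million years

# Hypothetical pair of interest: 6 neutral differences.
estimate = divergence_time(6, rate)    # 3.0 million years
```

The estimate is only as good as the calibration, so a rate measured in one gene or lineage should not be carried over to another without checking that the clock behaves similarly there.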

Caveats for molecular dating Rates of mutation to neutral alleles vary among genes, lineages, and bases. Third codon positions are more likely to be neutral and to change in a clock-like fashion. Rapid changes in allele frequencies in response to selection pressure produce unreliable clocks. Calibration rates for a particular gene or lineage cannot be used for other groups that have different generation times and selection histories.

When did humans start wearing clothes? The origin of body lice. [Images: Pediculus corporis and Pediculus capitis.] Both of the lice above are restricted in their location. Ralf Kittler and colleagues hypothesized that body lice adapted to live in clothing and therefore diverged from head lice at the time when humans started wearing clothes. They sequenced mtDNA and two nuclear genes, RNA polymerase II and elongation factor 1-alpha. They used a chimpanzee louse as the outgroup.

Kittler et al., 2003 Figure 1. Neighbor-Joining Tree Based on Kimura-2-Parameter Distances for the Concatenated Sequences of ND4 and CYTB from 40 Lice. Identical topologies were obtained for maximum parsimony and minimum evolution trees for these sequences (results not shown). The tree was rooted with the corresponding sequence of P. schaeffi (chimpanzee louse); alternative placements of the root at any of the first three deepest branches (with three African, 6 European, and one African head lice sequence, respectively) are not significantly different and do not alter any conclusions. Bootstrap values (500 replications) are indicated on each interior branch. The arrows indicate the estimated age of particular nodes of the tree, based on Poisson-corrected amino acid distances. The tree based on amino acid distances (not shown) is virtually identical in topology to the tree shown, except for some sequences that differ only by silent substitutions. B: body louse, H: head louse; the frequency of a haplotype is indicated in brackets. Geographic origin of lice: Et: Ethiopia, Pa: Panama, Ge: Germany, Ph: Philippines, Ir: Iran, Ec: Ecuador, La: Laos, PNG: Papua New Guinea, Fl: Florida (USA), Ta: Taiwan, Ne: Nepal, UK: United Kingdom.

Summary of the lousy results: Greater diversity in African than in non-African lice. Lice probably originated in Africa along with humans. Clothing appeared 30,000 to 114,000 years ago. The expansion of lice diversity represents the migration of humans out of Africa. Clothing may have allowed the successful movement of humans out of Africa into colder climates.

Which species are the closest living relatives of modern humans? [Figure: two alternative trees of humans, chimpanzees, bonobos, gorillas, and orangutans, with time scales marked at 14 MYA and 15-30 MYA.] Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas. The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.

Did the Florida Dentist infect his patients with HIV? [Figure: phylogenetic tree of HIV sequences from the dentist, his patients, and local HIV-infected people (local controls).] Yes for Patients A, B, C, E, and G: the HIV sequences from these patients fall within the clade of HIV sequences found in the dentist. No for Patients D and F: their sequences group with the local controls. From Ou et al. (1992) and Page & Holmes (1998)

What was the most likely geographical location of the common ancestor of the African apes and humans? Scenario A: Africa as species fountain. Scenario B: Eurasia as ancestral homeland. Scenario B requires four fewer dispersal events. [Figure legend: black = Eurasia, red = Africa; dispersal events are marked on the trees.] Modified from: Stewart, C.-B. & Disotell, T.R. (1998) Current Biology 8: R582-588.

How can we choose between competing hypotheses on phylogeny of whales?

Phylogenetic Reconstruction of Whales Whales belong to Artiodactyla (even-toed ungulate mammals), which includes camels, pigs, hippos, cows, and deer. The outgroup is rhinos/horses. Whales are difficult to place because they lack many characters present in terrestrial mammals (e.g. hind limbs). Are whales sister to the entire group or to hippos?

DNA Sequence Data and Whale Evolution Data were collected from the beta-casein gene for all taxa and the sequences were aligned. Nucleotide changes between outgroup and ingroup species indicate shared derived homologies. Most nucleotides are identical in all taxa; these are uninformative for phylogeny. Some nucleotides indicate that whales belong with cows, deer, and hippos (site 162). Others indicate that whales and hippos are sister groups (site 166). Others contradict the sister-group status of whale/hippo and cow/deer (site 177) and may indicate a reversal.

Phylogeny results should be treated as informative but not authoritative