
1 Building Trees

2 What is a Tree? A tree is a visualization of the mathematical analysis of a comparison of characteristics in multiple individuals or species. The multiples can also be tissues or developmental stages in the case of microarrays. The closer branches share more similarities and the more distant branches are less similar.

3 Phylogeny (phylo = tribe + genesis)
1. Phylogeny inference or “tree building”: the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.)
2. Character and rate analysis: using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest

4 Start with a group of species and establish relationships based on measurements
birds snakes rodents primates crocodiles marsupials lizards

5 This is an example of a phylogenetic tree.
crocodiles birds lizards snakes rodents primates marsupials

6 Homology & Similarity
Homology: conserved sequences arising from a common ancestor.
Orthologs: homologous genes that share a common ancestor in the absence of any gene duplication (mouse and human hemoglobin).
Paralogs: genes related through gene duplication (one gene is a copy of another, e.g. fetal and adult hemoglobin).
Similarity: genes that share common sequences but are not necessarily related.
Before we continue, I would like to re-emphasize Fran’s earlier discussion of the definitions of homology and similarity. Homology is… for example, the RELAPSE and GREASER sequences. You must be able to prove a common ancestor. Similarity, in contrast, describes any set of genes that share common sequences but are not necessarily related. Evolution can drive two vastly different ancestors to converge on similar sequences, even though they are not related via a common ancestor.

7 Sequences As Modules
Proteins are derived from a limited number of basic building blocks (modules).
Evolution has shuffled these modules, giving rise to a diverse repertoire of protein sequences.
Proteins can share a global relationship, or a local relationship specific to a single domain.

8 Sequence Domains Modules Define Functional/Structural Domains

9 Defining A Sequence Family
Family B Family E Family D Family A Family C

10 Global vs. Local Alignments
Global: search for alignments matching over entire sequences.
Local: examine regions of sequence for conserved segments.
Both consider matches, mismatches, and gaps.
MSAs can be examined at two levels: globally or locally. Global MSAs attempt to match residues over the entire length of the sequences. Local MSAs, on the other hand, examine select regions of sequence for conserved segments. In both cases, you are considering matches, mismatches, and gaps.

11 Global Sequence Alignments
Yeast Prion-Like Proteins
Here is an example MSA of a portion of several related kinases. Each position is aligned into columns. Columns are colored according to similar properties. A star indicates perfect conservation across all the sequences, whereas a colon or dot indicates sequence similarity. The more sequences you analyze, the less likely it is that there will be conservation at each position. However, you can see from this example that there is a very distinct set of conserved columns, where the sequences share the identical residue at a given position. This may indicate the importance of these residues in structure and function.

12 How To Make A Global MSA On The Web On Your Computer
On Your Computer ClustalX:

13 MSA Example Sequences
Standard FASTA Sequence Format
>KSYK_HUMAN
FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVAHGRKAHHYTIERELNGTYAIAGGRTHASPADLCHYH
>ZA70_HUMAN
WYHSSLTREEAERKLYSGAQTDGKFLLRPRKEQGTYALSLIYGKTVYHYLISQDKAGKYCIPEGTKFDTLWQLVEYL
>KSYK_PIG
WFHGKISRDESEQIVLIGSKTNGKFLIRARDNGSYALGLLHEGKVLHYRIDKDKTGKLSIPGGKNFDTLWQLVEHY
>MATK_HUMAN
WFHGKISGQEAVQQLQPPEDGLFLVRESARHPGDYVLCVSFGRDVIHYRVLHRDGHLTIDEAVFFCNLMDMVEHY
>CSK_CHICK
WFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCEGKVEHYRIIYSSSKLSIDEEVYFENLMQLVEHY
>CRKL_HUMAN
WYMGPVSRQEAQTRLQGQRHGMFLVRDSSTCPGDYVLSVSENSRVSHYIINSLPNRRFKIGDQEFDHLPALLEFY
>YES_XIPHE
WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKLDNGGYYITTRTQFMSLQMLVKHY
>FGR_HUMAN
WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKLDMGGYYITTRVQFNSVQELVQHY
>SRC_RSVP
WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKLYSGGFYITSRTQFGSLQQLVAYY

14 MSA Example Result
YES_XIPHE  WYFGKLSRKDTERLLLLPGNERGTFLIRESETTKGAYSLSLRDWDETKGDNCKHYKIRKL
FGR_HUMAN  WYFGKIGRKDAERQLLSPGNPQGAFLIRESETTKGAYSLSIRDWDQTRGDHVKHYKIRKL
SRC_RSVP   WYFGKITRRESERLLLNPENPRGTFLVRKSETAKGAYCLSVSDFDNAKGPNVKHYKIYKL
MATK_HUMAN WFHGKISGQEAVQQLQPPED--GLFLVRESARHPGDYVLCVS-----FGRDVIHYRVLHR
CSK_CHICK  WFHGKITREQAERLLYPPET--GLFLVRESTNYPGDYTLCVS-----CEGKVEHYRIIYS
CRKL_HUMAN WYMGPVSRQEAQTRLQGQRH--GMFLVRDSSTCPGDYVLSVS-----ENSRVSHYIINSL
ZA70_HUMAN WYHSSLTREEAERKLYSGAQTDGKFLLRPRK-EQGTYALSLI-----YGKTVYHYLISQD
KSYK_PIG   WFHGKISRDESEQIVLIGSKTNGKFLIRAR--DNGSYALGLL-----HEGKVLHYRIDKD
KSYK_HUMAN FFFGNITREEAEDYLVQGGMSDGLYLLRQSRNYLGGFALSVA-----HGRKAHHYTIERE
Conservation: :: . : :: : * :*:* * : * : ** :

YES_XIPHE  DNGGYYITTRTQFMSLQMLVKHY
FGR_HUMAN  DMGGYYITTRVQFNSVQELVQHY
SRC_RSVP   YSGGFYITSRTQFGSLQQLVAYY
MATK_HUMAN DGHLTIDEAVFFCNLMDMVEHY
CSK_CHICK  SSKLSIDEEVYFENLMQLVEHY
CRKL_HUMAN PNRRFKIGDQE-FDHLPALLEFY
ZA70_HUMAN KAGKYCIPEGTKFDTLWQLVEYL
KSYK_PIG   KTGKLSIPGGKNFDTLWQLVEHY
KSYK_HUMAN LNGTYAIAGGRTHASPADLCHYH
Conservation: * : .

15 Steps to Build Trees from MSA
1) identify taxa to be considered
2) choose characters (independent, “unit”)
3) construct character matrix for each taxon
4) After performing alignment, use a mathematical formula to describe the degree of similarity between each pair of taxa, e.g. the simple matching coefficient:
S = (# matches) / (total # of characters)

16 Steps to Build Trees
5) construct matrix with pairwise S values
6) use clustering technique to produce a tree (dendrogram)
Unweighted/equal weighting = all characters given equal consideration (note that unweighting is itself a form of weighting)
UPGMA (Unweighted Pair Group Method with Arithmetic Averaging)
Neighbour-joining
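Steps 3 to 5 can be sketched in a few lines of Python. This is only an illustration: the character matrix below is hypothetical (it is not the matrix from the slides), and the function names `simple_matching` and `s_matrix` are made up for this example.

```python
from itertools import combinations


def simple_matching(a, b):
    """Simple matching coefficient: S = matches / total number of characters."""
    if len(a) != len(b):
        raise ValueError("character strings must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def s_matrix(characters):
    """Pairwise S values for every pair of taxa, keyed by (taxon, taxon)."""
    return {
        (p, q): simple_matching(characters[p], characters[q])
        for p, q in combinations(sorted(characters), 2)
    }


# Hypothetical character matrix: 4 taxa scored for 10 binary characters.
characters = {
    "A": "1100110011",
    "B": "0011001111",
    "C": "0111000011",
    "D": "1100110110",
}

S = s_matrix(characters)
# List the pairs from most similar to least similar.
for pair, value in sorted(S.items(), key=lambda item: -item[1]):
    print(pair, value)
```

The most similar pair in the S-value matrix is the one a clustering method such as UPGMA would join first.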

17 Building Matrices
Character Matrix: taxa A, B, C, D (rows) scored for characters 1 to 10 (columns).
S-value Matrix: pairwise S values among taxa A, B, C, D (values shown on the slide: 0.3, 0.4, 0.7, 0.5).

18 Joining Clusters into a Tree
Closest: A&D = 0.7
2nd closest: B&C = 0.5
When does A&D join B&C? At the average of the four between-cluster S values:
[S(A,B) + S(A,C) + S(D,B) + S(D,C)] / 4 = 0.35

19 Problems
Different methods or characters = different dendrograms
If we used all possible characteristics, this would be a natural classification
The tree is an accurate phylogeny only if differences in characters between taxa are proportional to the time elapsed since their common ancestor

20 Convergent Evolution Similar phenotypic response to similar ecological conditions Different developmental pathways

21 Reversal of Evolution An altered character reverts to the ancestral form. In a DNA molecule, a nucleotide position may change from a C to a T and then back to a C. This frog reverted to teeth.

22 Trees are hypotheses about evolutionary history
Different methods may result in different trees. How to choose between the different models? One way is to compare different types of character data and see if the trees make sense.

23 Haplotype Network in 3 Elephant Species with 3 DNA sequences

24 Parsimonious choices reflect fewer changes
The assumptions of parsimony Reversals and convergence require more changes Parsimonious trees represent best estimates of phylogenetic relationships

25 Use of DNA, RNA, or Protein
For phylogeny, DNA can be more informative. The protein-coding portion of DNA has synonymous and nonsynonymous substitutions. Some DNA changes do not have corresponding protein changes See arrows 14, 21, 25, 27, 29 in the retinol-binding protein figure.

26

27

28 For phylogeny, DNA can be more informative.
If the synonymous substitution rate (dS) is greater than the nonsynonymous substitution rate (dN), the DNA sequence is under negative (purifying) selection. This limits change in the sequence. If dS < dN, positive selection occurs. For example, a duplicated gene may evolve rapidly to assume new functions.

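29 Models of nucleotide substitution: Transitions > Transversions
Transitions: A <-> G (purine to purine) and C <-> T (pyrimidine to pyrimidine)
Transversions: any purine <-> pyrimidine change (A or G <-> C or T)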

30 Some substitutions in a DNA sequence alignment can be directly observed:
single nucleotide substitutions sequential substitutions coincidental substitutions

31

32 Additional mutational events can be inferred by analysis of ancestral sequences. These changes include parallel substitutions convergent substitutions back substitutions

33

34 Advantages of DNA
Noncoding regions (such as 5’ and 3’ untranslated regions) may be analyzed using molecular phylogeny. See Figure (arrows 4-10 and 35-38).
Pseudogenes (nonfunctional genes) are studied by molecular phylogeny.
Rates of transitions and transversions can be measured.
Transitions: purine to purine (A to G) or pyrimidine to pyrimidine (C to T) substitutions
Transversions: purine to pyrimidine substitutions, or the reverse

35 Protein sequences are also used for phylogeny
Proteins have 20 states (amino acids) instead of only four for DNA, so there is more phylogenetic information. Nucleotides are unordered characters: any one nucleotide can change to any other in one step. An ordered character must pass through one or more intermediate states before reaching the final state. Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value.

36 Amino acid sequences
From the standpoint of the genetic code, some amino acid changes can be made by a single DNA mutation, while others require two or even three changes in the DNA sequence.
Some amino acids can replace one another with relatively little effect on the structure and function of the final protein, while other replacements can be functionally devastating.
Tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks are used: PAM and BLOSUM.

37 Sequence-Based Comparisons
Identify sequences that are related to each other within an organism and/or across different species
Within: fetal and adult hemoglobin
Across: human and chimpanzee hemoglobin
Generate an evolutionary history of related genes
Locate insertions, deletions, and substitutions that have occurred during evolution
Residue key: (C) Cysteine, (R) Arginine, (E) Glutamate, (A) Alanine, (T) Threonine, (S) Serine, (L) Leucine, (P) Proline, (G) Glycine
Figure labels: CREATE, CREASE, RELAPSE, GREASER, marked [Ancestor] and [Progenitors]
Draw comparisons among different proteins to answer these questions. Begin by identifying proteins within an organism that are related to each other and across different species. Generate an evolutionary history of the genes to see which ones are more closely related. Proceed to specifically locate insertions, deletions, and substitutions. For example, how did the sequences RELAPSE and GREASER evolve from a common ancestor?

38 Multiple Sequence Alignments
Place residues in columns that are derived from a common ancestral residue
Identify matches, mismatches, and gaps
MSA can reveal sequence patterns:
Demonstration of homology between >2 sequences
Identification of functionally important sites
Protein function prediction
Structure prediction
Example (CREATE, CREASE, GREASER, RELAPSE):
SeqA CRE-A-TE-
SeqB CRE-A-SE-
SeqC GRE-A-SER
SeqD -RELAPSE-
The idea behind a multiple sequence alignment is to place residues into columns that are derived from a common ancestral residue. It is similar to a phylogenetic tree, but on a discrete residue-by-residue basis. MSAs are very useful in revealing sequence patterns.

39 MSA and Tree Relationship
“The optimal alignment of several sequences can be thought of as minimizing the number of mutational steps in an evolutionary tree for which the sequences are the leaves” (Mount, 2001)
SeqA CRE-A-TE- (CREATE)
SeqB CRE-A-SE- (CREASE)
SeqC GRE-A-SER (GREASER)
SeqD -RELAPSE- (RELAPSE)
Tree annotations: CREATE to CREASE (T to S); CREASE to GREASE (C to G); GREASE to GREASER (+R); GREASE to RELAPSE (-G, +L, +P)
MSAs and trees are interrelated. One assists in the construction of the other. For instance, here is an MSA of 4 related sequences, showing matches, insertions, deletions, and mismatches. This can be represented in a tree structure.

40 Multiple Sequence Alignments
Confirm that all sequences are homologous Adjust gap creation and extension penalties as needed to optimize the alignment Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data). Many experts recommend that you delete any column of an alignment that contains gaps (even if the gap occurs in only one taxon)

41 Problems in Reconstructing Phylogeny
Characters sometimes conflict
It is sometimes difficult to tell homology from homoplasy:
Analogy: characters similar because of convergent evolution
Reversal: character reverts to ancestral form
With morphological characters, careful examination may distinguish homoplasy (analogy) from homology
With molecular characters (DNA/protein sequences), homoplasy is sometimes impossible to distinguish from homology, and orthologs from paralogs

42 A Phylogenetic Tree Taxon -- Any named group of organisms – evolutionary theory not necessarily involved. Clade -- A monophyletic taxon (evolutionary theory utilized)

43 A phylogenetic tree with branch lengths
Branch length can be significant… In this case it is, and mouse is slightly more similar to fly than human is to fly (the sum of its branches is less than the sum of 1+2+4)

44 Common Phylogenetic Tree Terminology
Terminal nodes (A, B, C, D, E): represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
Branches or lineages
Internal nodes or divergence points: represent hypothetical ancestors of the taxa
Ancestral node or ROOT of the tree

45 Phylogenetic trees diagram the evolutionary relationships between the taxa
Taxa shown: Taxon A, Taxon B, Taxon C, Taxon D, Taxon E
There is no meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
The branch-length dimension either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
((A,(B,C)),(D,E)) = the above phylogeny as nested parentheses
These say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.
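The nested-parentheses notation is easy to read by machine. The following is a minimal sketch, assuming a topology-only string (no branch lengths or support values); `parse_newick` and `clades` are names invented for this illustration, not functions from any phylogenetics package.

```python
def parse_newick(text):
    """Parse a topology-only nested-parentheses string into nested tuples."""
    text = text.replace(" ", "").rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                      # consume ")"
            return tuple(children)
        start = pos                       # otherwise read a taxon label
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        if start == pos:
            raise ValueError(f"missing taxon label at position {pos}")
        return text[start:pos]

    tree = node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def clades(tree):
    """Return the set of taxa below every internal node (one set per clade)."""
    found = []

    def leaves(n):
        if isinstance(n, str):
            return frozenset([n])
        below = frozenset().union(*(leaves(child) for child in n))
        found.append(below)
        return below

    leaves(tree)
    return found


tree = parse_newick("((A,(B,C)),(D,E))")
for clade in clades(tree):
    print(sorted(clade))
```

The printed groups include B and C together, then A with B and C, then D and E, matching the reading given above.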

46 Three types of trees
Cladogram (axis has no meaning), Phylogram (axis = genetic change), Ultrametric tree (axis = time)
Each panel shows Taxon A, Taxon B, Taxon C, and Taxon D (branch-length labels on the slide: 6, 1, 1, 3, 1, 5).
All show the same evolutionary relationships, or branching orders, between the taxa.

47 Types of trees: Cladogram
Shows relative recency of common descent. Does not imply that ancestors on the same line necessarily speciated at the same time: t1 can be before or after t2, but not before t3 (no time scale).

48 Types of trees: Phylogram
(additive tree: branch lengths can be summed) Shows relative recency of common descent, and branch lengths = amount of change

49 Types of trees: Ultrametric
Ultrametric tree (linearized tree)
All tree tips are equidistant from the root
Amount of change can be scaled to time (scale = time)

50 Completely unresolved bifurcating phylogeny
The goal of phylogeny inference is to resolve the branching orders of lineages in evolutionary trees Completely unresolved or "star" phylogeny Partially resolved phylogeny Fully resolved, bifurcating phylogeny A B C E D Polytomy or multifurcation A bifurcation

51 There are three possible unrooted trees for four species (A, B, C, D)
The goal of phylogenetic tree building methods is to discover which of the possible unrooted trees is "correct". This should be the “true” biological tree, accurately representing the evolutionary history of the species. However, it is only possible to discover the computationally correct or optimal tree for the phylogenetic method of choice.

52 The number of unrooted trees increases in a greater than exponential manner with number of species (taxa) C A B D E F (2N - 5)!! = # unrooted trees for N taxa

53 Inferring evolutionary relationships between the taxa requires rooting the tree:
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.
[Figure: unrooted tree of A, B, C, D with the root position marked, and the resulting rooted tree]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

54 Try it again with the root at another position
[Figure: the same unrooted tree of A, B, C, D with the root at a different position, and the resulting rooted tree]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

55 An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees Rooted tree 1b A B C D 2 A Rooted tree 1d C D A B 4 C Rooted tree 1a B A C D 1 The unrooted tree 1: Rooted tree 1e D C A B 5 Rooted tree 1c A B C D 3 B D These trees show five different evolutionary relationships among the taxa!

56 Trick Question Warning
Sometimes two trees may look very different but, in fact, differ only in the position of the root. Don’t forget rotational symmetry!

57 All of these rearrangements show the same evolutionary relationships between the taxa
C D B A C D Rooted tree 1a B A C D B C A D D C A B A B D C A B C D

58 There are two major ways to root trees
By outgroup: Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).
By midpoint or distance: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods.
Example (tree of A, B, C, D with branch lengths 10, 3, 2, 2, 5): d (A,D) = 10 + 3 + 5 = 18, so the midpoint is 18 / 2 = 9

59 Rooting Using an Outgroup
The outgroup should be a sequence (or set of sequences) known to be less closely related to the rest of the sequences than they are to each other. It should ideally be as closely related as possible to the rest of the sequences while still satisfying the first condition. The root must be somewhere between the outgroup and the rest (either on the node or in a branch).

60 Automatic Rooting Many software packages will root trees automatically (e.g. mid-point rooting in NJPlot) This normally involves assumptions… BE AWARE what those are.

61 Each unrooted tree theoretically can be rooted anywhere along any of its branches
(# unrooted trees) x (# branches per unrooted tree, 2N - 3) = (2N - 3)!! = # rooted trees for N taxa

62 Molecular phylogenetic tree building methods
Mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available, each with strengths and weaknesses.

63 Types of data used in phylogenetic inference
Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.
Taxa       Characters
Species A  ATGGCTATTCTTATAGTACG
Species B  ATCGCTAGTCTTATATTACA
Species C  TTCACTAGACCTGTGGTCCA
Species D  TTGACCAGACCTGTGGTCCG
Species E  TTGACCAGTTCTCTAGTTCG
Distance-based methods: Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix (Species A to E by Species A to E) during tree building.
Example 1: Uncorrected “p” distance (= observed percent sequence difference)
Example 2: Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
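Both example distances can be computed directly from the five sequences on this slide. The sketch below assumes gap-free sequences containing only A, C, G, T and uses the standard Kimura 2-parameter formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the observed proportions of transitions and transversions. The function name `p_and_k2p` is invented for this example.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}

# Sequences copied from the slide.
seqs = {
    "Species A": "ATGGCTATTCTTATAGTACG",
    "Species B": "ATCGCTAGTCTTATATTACA",
    "Species C": "TTCACTAGACCTGTGGTCCA",
    "Species D": "TTGACCAGACCTGTGGTCCG",
    "Species E": "TTGACCAGTTCTCTAGTTCG",
}


def p_and_k2p(s1, s2):
    """Return (uncorrected p distance, Kimura 2-parameter distance)."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to equal length")
    transitions = transversions = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        # Same chemical class (both purines or both pyrimidines) = transition.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    P = transitions / len(s1)
    Q = transversions / len(s1)
    a, b = 1 - 2 * P - Q, 1 - 2 * Q
    # The correction is undefined for saturated pairs; report infinity there.
    k2p = math.inf if a <= 0 or b <= 0 else -0.5 * math.log(a * math.sqrt(b))
    return P + Q, k2p


for x, y in combinations(seqs, 2):
    p, k2p = p_and_k2p(seqs[x], seqs[y])
    print(f"{x} vs {y}: p = {p:.3f}, K2P = {k2p:.3f}")
```

The corrected distance is never smaller than the uncorrected one, because it adds an estimate of substitutions hidden by multiple hits at the same site.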

64 Similarity vs. Evolutionary Relationship
Similarity and relationship are not the same thing, even though evolutionary relationship is inferred from certain types of similarity.
Similar: having likeness or resemblance (an observation)
Related: genetically connected (an historical fact)
Two taxa can be most similar without being most closely related (tree of Taxon A, Taxon B, Taxon C, Taxon D with branch lengths 1, 6, 3, 5):
C is more similar in sequence to A (d = 3) than to B (d = 7), but C and B are most closely related (that is, C and B shared a common ancestor more recently than either did with A).

65 Types of Similarity
Observed similarity between two entities can be due to:
Evolutionary relationship: shared ancestral characters (‘plesiomorphies’) or shared derived characters (‘synapomorphies’)
Homoplasy (independent evolution of the same character): convergent events (in either related or unrelated entities), parallel events (in related entities), reversals (in related entities)
Character-based methods can tease apart types of similarity and theoretically find the true evolutionary tree. Similarity = relationship only if certain conditions are met (if the distances are ‘ultrametric’).

66 METRIC DISTANCES between any two or three taxa (a, b, and c) have the following properties:
Property 1: d (a, b) ≥ 0 (Non-negativity)
Property 2: d (a, b) = d (b, a) (Symmetry)
Property 3: d (a, b) = 0 if and only if a = b (Distinctness)
Property 4: d (a, c) ≤ d (a, b) + d (b, c) (Triangle inequality)
Example: a triangle on a, b, c with sides 6, 9, and 5

67 ULTRAMETRIC DISTANCES must satisfy the previous four conditions, plus:
Property 5: d (a, b) ≤ maximum [d (a, c), d (b, c)]
This implies that the two largest distances are equal, so that they define an isosceles triangle.
Similarity = Relationship if the distances are ultrametric!
If distances are ultrametric, then the sequences are evolving in a perfectly clock-like manner, and thus can be used in UPGMA trees and for the most precise calculations of divergence dates.

68 ADDITIVE DISTANCES
Property 6: d (a, b) + d (c, d) ≤ maximum [d (a, c) + d (b, d), d (a, d) + d (b, c)]
For distances to fit into an evolutionary tree, they must be either metric or ultrametric, and they must be additive. Estimated distances often fall short of these criteria, and thus can fail to produce correct evolutionary trees.
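Properties 1 to 6 can be checked mechanically on a distance matrix. The sketch below is illustrative only: the two small matrices are hypothetical, the function names are made up, and symmetry (Property 2) is built in by storing each distance once under an unordered pair. Property 5 over all labelings is equivalent to "the two largest of the three distances are equal", and Property 6 to "the two largest of the three pairwise sums are equal", which is what the code tests.

```python
from itertools import combinations


def make_distances(pairs):
    """Store each distance under an unordered pair, so d(a, b) = d(b, a)."""
    return {frozenset(pair): value for pair, value in pairs.items()}


def dist(d, x, y):
    return d[frozenset((x, y))]


def is_metric(d, taxa, tol=1e-9):
    """Non-negativity plus the triangle inequality on every triple."""
    if any(dist(d, a, b) < 0 for a, b in combinations(taxa, 2)):
        return False
    for a, b, c in combinations(taxa, 3):
        x, y, z = sorted([dist(d, a, b), dist(d, a, c), dist(d, b, c)])
        if z > x + y + tol:
            return False
    return True


def is_ultrametric(d, taxa, tol=1e-9):
    """Three-point condition: the two largest distances are equal."""
    for a, b, c in combinations(taxa, 3):
        x, y, z = sorted([dist(d, a, b), dist(d, a, c), dist(d, b, c)])
        if z - y > tol:
            return False
    return True


def is_additive(d, taxa, tol=1e-9):
    """Four-point condition: the two largest pairwise sums are equal."""
    for a, b, c, e in combinations(taxa, 4):
        sums = sorted([
            dist(d, a, b) + dist(d, c, e),
            dist(d, a, c) + dist(d, b, e),
            dist(d, a, e) + dist(d, b, c),
        ])
        if sums[2] - sums[1] > tol:
            return False
    return True


# Hypothetical clock-like distances: an isosceles triangle.
clock = make_distances({("a", "b"): 2, ("a", "c"): 4, ("b", "c"): 4})

# Hypothetical distances read off a tree with unequal branch lengths
# (leaf branches 1, 2, 3, 4 and an internal branch of 5 separating AB from CD).
tree_like = make_distances({
    ("A", "B"): 3, ("C", "D"): 7,
    ("A", "C"): 9, ("A", "D"): 10,
    ("B", "C"): 10, ("B", "D"): 11,
})

print(is_metric(clock, "abc"), is_ultrametric(clock, "abc"))
print(is_metric(tree_like, "ABCD"), is_additive(tree_like, "ABCD"),
      is_ultrametric(tree_like, "ABCD"))
```

The second matrix shows the distinction that matters in practice: distances can fit a tree exactly (additive) without being clock-like (ultrametric).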

69 Tree-building methods: UPGMA
UPGMA is: unweighted pair group method using arithmetic mean 1 2 3 4 5

70 Tree-building methods: UPGMA
Step 1: compute the pairwise distances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree. 1 2 3 4 5

71 Tree-building methods: UPGMA
Step 2: Find the two proteins with the smallest pairwise distance. Cluster them. 1 2 3 4 5 6 1 2

72 Tree-building methods: UPGMA
Step 3: Do it again. Find the next two proteins with the smallest pairwise distance. Cluster them. 1 2 3 4 5 6 7 1 2 4 5

73 Tree-building methods: UPGMA
Step 4: Keep going. Cluster. 1 2 3 4 5 8 7 6 1 2 4 5 3

74 Tree-building methods: UPGMA
Step 5: Last cluster! This is your tree. 1 2 3 4 5 9 8 7 6 1 2 4 5 3

75 Distance-based methods: UPGMA trees
UPGMA is a simple approach for making trees. An UPGMA tree is always rooted. An assumption of the algorithm is that the molecular clock is constant for sequences in the tree. If there are unequal substitution rates, the tree may be wrong. While UPGMA is simple, it is less accurate than the neighbor-joining approach
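The clustering steps on the preceding slides can be sketched as follows. This is a minimal illustration, not a production implementation: the function name `upgma`, the nested-tuple tree representation, and the four-taxon distance matrix are all assumptions made for this example. Distances between clusters are size-weighted averages, which is the arithmetic mean over all pairs of original taxa.

```python
from itertools import combinations


def upgma(taxa, distances):
    """Cluster taxa by UPGMA.

    distances maps frozenset({x, y}) -> distance between two taxa.
    Returns (tree, height): the tree as nested tuples, and a dict giving
    the height (half the joining distance) of every cluster.
    """
    d = dict(distances)
    size = {t: 1 for t in taxa}        # number of original taxa per cluster
    height = {t: 0.0 for t in taxa}    # tips sit at height 0
    active = list(taxa)

    while len(active) > 1:
        # Find the two closest active clusters and join them.
        x, y = min(combinations(active, 2), key=lambda pair: d[frozenset(pair)])
        new = (x, y)
        size[new] = size[x] + size[y]
        height[new] = d[frozenset((x, y))] / 2
        active = [c for c in active if c not in (x, y)]
        # Distance from the new cluster to each remaining cluster is the
        # average over all pairs of original taxa.
        for c in active:
            d[frozenset((new, c))] = (
                size[x] * d[frozenset((x, c))] + size[y] * d[frozenset((y, c))]
            ) / size[new]
        active.append(new)

    return active[0], height


# Hypothetical clock-like distances among four taxa.
example = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("B", "C")): 4.0,
    frozenset(("A", "D")): 6.0,
    frozenset(("B", "D")): 6.0,
    frozenset(("C", "D")): 6.0,
}

tree, height = upgma(["A", "B", "C", "D"], example)
print(tree, height[tree])
```

Because every join is placed at half the joining distance, all tips end up equidistant from the root. That is the constant molecular clock assumption in action, and it is why unequal substitution rates can mislead the method.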

76 Making trees using Neighbor-Joining
The neighbor-joining method of Saitou and Nei (1987) is especially useful for making a tree having a large number of taxa. Begin by placing all the taxa in a star-like structure.

77 Tree-building methods: Neighbor joining
Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.

78 Tree-building methods: Neighbor joining
Define the distance from X to Y by dXY = 1/2(d1Y + d2Y – d12)

79 Example of a neighbor-joining tree: phylogenetic analysis of 13 Retinol Binding Proteins

80 Tree-building methods: character based
Rather than pairwise distances between proteins, evaluate the aligned columns of amino acid residues (characters). Tree-building methods based on characters include maximum parsimony and maximum likelihood.

81 Making trees using character-based methods
The main idea of maximum parsimony, a character-based method, is to find the tree with the shortest branch lengths possible: the most parsimonious (“simple”) tree.
Identify informative sites. For example, constant characters are not parsimony-informative.
Construct trees, counting the number of changes required to create each tree. For about 12 taxa or fewer, evaluate all possible trees exhaustively; for >12 taxa perform a heuristic search.
Select the shortest tree (or trees).

82 As an example of tree-building using maximum parsimony, consider these four taxa:
AAG AAA GGA AGA
How might they have evolved from a common ancestor such as AAA?

83 Tree-building methods: Maximum parsimony
Three candidate trees for the taxa AAG, AAA, GGA, AGA, each drawn with ancestor AAA:
Tree 1 pairs (AAG, AAA) and (GGA, AGA): Cost = 3
Tree 2 pairs (AAG, AGA) and (AAA, GGA): Cost = 4
Tree 3 pairs (AAG, GGA) and (AAA, AGA): Cost = 4
In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).
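The costs for the three groupings of AAG, AAA, GGA, and AGA can be checked with Fitch's small-parsimony algorithm. Note the swap: the slide fixes the ancestor as AAA, whereas Fitch's method takes the minimum over all possible ancestral states. For these four taxa the minimum cost of each grouping comes out as 3, 4, and 4. The function name `fitch_cost` and the nested-tuple tree encoding are assumptions made for this sketch.

```python
def fitch_cost(tree):
    """Minimum number of changes on a rooted binary tree (Fitch's algorithm).

    A tree is either a sequence string (a leaf) or a pair of subtrees.
    """
    cost = 0

    def states(node):
        nonlocal cost
        if isinstance(node, str):
            # A leaf: one possible state per site.
            return [{ch} for ch in node]
        left, right = (states(child) for child in node)
        merged = []
        for l, r in zip(left, right):
            common = l & r
            if common:
                merged.append(common)      # children agree: no change needed
            else:
                merged.append(l | r)       # children disagree: one change
                cost += 1
        return merged

    states(tree)
    return cost


candidates = {
    "(AAG,AAA),(GGA,AGA)": (("AAG", "AAA"), ("GGA", "AGA")),
    "(AAG,AGA),(AAA,GGA)": (("AAG", "AGA"), ("AAA", "GGA")),
    "(AAG,GGA),(AAA,AGA)": (("AAG", "GGA"), ("AAA", "AGA")),
}

for label, tree in candidates.items():
    print(label, "cost =", fitch_cost(tree))
```

With only four taxa all three groupings can be scored exhaustively, which is the exact-search case described on the character-based methods slide.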

84 Types of computational methods

85 Clustering algorithms:
Use pairwise distances.
Are purely algorithmic methods, in which the algorithm itself defines the tree selection criterion.
Tend to be very fast programs that produce singular trees rooted by distance.
Have no objective function to compare to other trees, even if numerous other trees could explain the data equally well.
Warning: Finding a singular tree is not necessarily the same as finding the “true” evolutionary tree.

86 Optimality approaches:
Use either character or distance data. First define an optimality criterion (minimum branch lengths, fewest number of events, highest likelihood), and then use a specific algorithm for finding trees with the best value for the objective function. Can identify many equally optimal trees, if such exist. Warning: Finding an optimal tree is not necessarily the same as finding the "true” tree.

87 Computational methods for finding optimal trees:
Exact algorithms: "Guarantee" to find the optimal or "best" tree for the method of choice. Two types used in tree building: Exhaustive search: Evaluates all possible unrooted trees, choosing the one with the best score for the method. Branch-and-bound search: Eliminates the parts of the search tree that only contain suboptimal solutions. Heuristic algorithms: Approximate or “quick-and-dirty” methods that attempt to find the optimal tree for the method of choice, but cannot guarantee to do so. Heuristic searches often operate by “hill-climbing” methods.

88 Exact searches become increasingly difficult, and eventually impossible, as the number of taxa increases:
[Figure: unrooted trees for 3, 4, 5, and 6 taxa (A to F)]
(2N - 5)!! = # unrooted trees for N taxa

89 Heuristic search algorithms are input order dependent and can get stuck in local minima or maxima
Rerunning heuristic searches using different input orders of taxa can help find global minima or maxima.
[Figure: two search landscapes, one searching for the global maximum and one for the global minimum, each with a local optimum marked alongside the global one]

90 Assumptions made by phylogenetic methods:
The sequences are correct
The sequences are homologous
Each position is homologous
The sampling of taxa or genes is sufficient to resolve the problem of interest
Sequence variation is representative of the broader group of interest
Sequence variation contains sufficient phylogenetic signal (as opposed to noise) to resolve the problem of interest
Each position in the sequence evolved independently

91 Problems with Phylogenetic Inference
How do we know what the potential candidate trees are? How do we choose which tree is (most likely) the true tree? The best tree is the one that produces consistent results.

92 Recipe for reconstructing a phylogeny
Select an optimality criterion Select a search strategy Use the selected search strategy to generate a series of trees, and apply the selected optimality criterion to each tree, always keeping track of the “best” tree examined thus far. How do you know the “best” tree? Which is the “true” tree?

93 Search strategy: Which is the right tree?
When m is the number of taxa, the number of possible rooted, bifurcating trees is: (2m-3)! / [2^(m-2) (m-2)!]
For 10 taxa, the number of trees is 34,459,425
Many trees can be discarded because they are obviously wrong
Sometimes, there is a general or even specific grouping that can serve as a start for the tree search
There are a number of approaches to tree searches that can be used
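The figure for 10 taxa can be checked numerically. The sketch below evaluates the closed form (2m-3)! / [2^(m-2) (m-2)!], which counts rooted bifurcating trees and equals the double factorial (2m-3)!!, alongside the unrooted count (2m-5)!!. The function names are made up for this example.

```python
from math import factorial


def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (defined as 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(m):
    """Closed form for the number of rooted bifurcating trees on m >= 2 taxa."""
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))


def unrooted_trees(m):
    """(2m - 5)!! unrooted bifurcating trees on m >= 3 taxa."""
    return double_factorial(2 * m - 5)


for m in (4, 5, 10, 20):
    print(m, unrooted_trees(m), rooted_trees(m))
```

Each unrooted tree on m taxa has 2m - 3 branches that can carry the root, which is why the rooted count is the unrooted count multiplied by 2m - 3.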

94 Evaluating the best tree
Maximum likelihood (ML) tests the hypothesis by using a mathematical formula that:
Models the probability that a nucleotide substitution will occur
Takes a tree with known branch lengths and calculates how likely the observed DNA sequences are to occur on it
If similar trees have nearly the same likelihood as the tree with the highest likelihood, then the hypothesis is weaker.

95 Evaluating the best tree
Bayesian Markov Chain Monte Carlo (BMCMC): asks the probability that a particular tree is correct given the data and a model of how traits change.
Distance methods: look at the changes in characters and convert them into distances. They assume a specific model of character change, and taxa are clustered so that the more similar forms are close together.

96 Current strategy
Produce a consensus tree using parsimony
Evaluate the best tree using the statistical tests found in ML and BMCMC
Compare the best trees from parsimony, ML, and BMCMC
The best tree is the one that produces consistent results

97 Evaluating branches
Evaluation is done statistically
Trees based on ML and BMCMC compare trees with and without the branch
Trees based on maximum parsimony use bootstrapping

98 Bootstrapping
In bootstrapping, for example, you analyze a 300 base pair region of a gene. The computer program draws 300 sites from that sequence at random, with replacement, builds a tree from the resampled data, and repeats this many times to determine how frequently the branch in question occurs among all of the trees that are generated. If the branch occurs under 50% of the time, there is too much uncertainty and the branch is collapsed into a polytomy or point of uncertainty. Bootstrap support of around 70% or more is commonly taken to indicate a well-supported branch.
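The resampling step can be sketched as follows. This is a deliberately simplified stand-in: a real bootstrap rebuilds a full tree from every replicate, whereas this example only asks how often a given pair of taxa remains the closest pair by raw sequence difference. The alignment reuses the five 20-site sequences from the "Types of data" slide, and the function names, replicate count, and seed are assumptions made for illustration.

```python
import random

alignment = {
    "Species A": "ATGGCTATTCTTATAGTACG",
    "Species B": "ATCGCTAGTCTTATATTACA",
    "Species C": "TTCACTAGACCTGTGGTCCA",
    "Species D": "TTGACCAGACCTGTGGTCCG",
    "Species E": "TTGACCAGTTCTCTAGTTCG",
}


def bootstrap_columns(aln, rng):
    """Resample alignment columns with replacement, keeping the length fixed."""
    n = len(next(iter(aln.values())))
    columns = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[i] for i in columns) for name, seq in aln.items()}


def closest_pair(aln):
    """Pair of taxa with the fewest differences (ties go to the first pair)."""
    names = sorted(aln)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda p: sum(x != y for x, y in zip(aln[p[0]], aln[p[1]])))


def support(aln, pair, replicates=200, seed=0):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_columns(aln, rng)) == pair
        for _ in range(replicates)
    )
    return hits / replicates


observed = closest_pair(alignment)
print(observed, support(alignment, observed))
```

With only 20 sites the support value is noisy, which illustrates why larger data sets give more confidence in a branch.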

99 Resolving conflict
Researchers have more confidence in trees that use:
Larger data sets
Characters that are not subject to homoplasy
Appropriate inference methods
Sometimes researchers have to wait for more data

100 Molecular clocks
Timing and rate of evolution can be estimated by looking at changing molecular traits.
Changes in DNA sequences that are not tied to phenotypes, and so cannot be selected against, can be tracked.
The neutral theory of molecular evolution predicts that neutral changes in DNA should accumulate at a rate equal to the mutation rate.
The method counts the neutral differences observed between two species and divides that count by a calibration rate representing the number of changes that occur per million years. This gives an estimate of when the two species diverged.

101 Caveats for molecular dating
Mutation rates to neutral alleles vary among genes, lineages, and bases
Third positions of codons are more likely to be neutral and to change in a clock-like fashion
Rapid changes in allele frequencies in response to selection pressure produce unreliable clocks
Calibration rates for a particular gene or lineage cannot be used for other groups that have different generation times and selection histories

102 When did humans start wearing clothes?
Pediculus corporis (body louse) and Pediculus capitis (head louse). The origin of body lice: both of the lice above are restricted in where they live on the host. Ralf Kittler and colleagues hypothesized that body lice adapted to live in clothing and therefore diverged from head lice at the time when humans started wearing clothes. They analyzed mtDNA and two nuclear genes, RNA polymerase II and elongation factor 1-alpha. They used a chimpanzee louse as the outgroup.

103 Kittler et al., 2003 Figure 1. Neighbor-Joining Tree Based on Kimura-2-Parameter Distances for the Concatenated Sequences of ND4 and CYTB from 40 Lice. Identical topologies were obtained for maximum parsimony and minimum evolution trees for these sequences (results not shown). The tree was rooted with the corresponding sequence of P. schaeffi (chimpanzee louse); alternative placements of the root at any of the first three deepest branches (with three African, 6 European, and one African head lice sequence, respectively) are not significantly different and do not alter any conclusions. Bootstrap values (500 replications) are indicated on each interior branch. The arrows indicate the estimated age of particular nodes of the tree, based on Poisson-corrected amino acid distances. The tree based on amino acid distances (not shown) is virtually identical in topology to the tree shown, except for some sequences that differ only by silent substitutions. B: body louse, H: head louse; the frequency of a haplotype is indicated in brackets. Geographic origin of lice: Et: Ethiopia, Pa: Panama, Ge: Germany, Ph: Philippines, Ir: Iran, Ec: Ecuador, La: Laos, PNG: Papua New Guinea, Fl: Florida (USA), Ta: Taiwan, Ne: Nepal, UK: United Kingdom.

104 Summary of the lousy results
Greater diversity in African than in non-African lice. Lice probably originated in Africa along with humans. Clothing appeared 30,000 to 114,000 years ago. The expansion of lice diversity represents the migration of humans out of Africa. Clothing may have allowed the successful movement of humans out of Africa into colder climates.

105 Which species are the closest living relatives of modern humans?
[Figure: two alternative trees of humans, chimpanzees, bonobos, gorillas and orangutans, with the labels 14 MYA and 15-30 MYA.] Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas. The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least MYA.

106 Did the Florida Dentist infect his patients with HIV?
Phylogenetic tree of HIV sequences from the dentist, his patients, and local HIV-infected people (local controls). Yes: the HIV sequences from Patients A, B, C, E and G fall within the clade of HIV sequences found in the dentist. No: the sequences from Patients D and F fall outside that clade, alongside the local controls. From Ou et al. (1992) and Page & Holmes (1998).
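The test "does a sequence fall within the clade defined by a set of reference sequences?" can be sketched in Python on a tree written as nested tuples. The tree and tip names below are hypothetical and are not the data of Ou et al.; the function names are also assumptions of this sketch.

```python
def leaves(tree):
    """Set of tip names under a node; a tip is a plain string."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def smallest_clade_containing(tree, tips):
    """Tip set of the smallest clade that contains every name in `tips`."""
    tips = set(tips)
    children = () if isinstance(tree, str) else tree
    for child in children:
        if tips <= leaves(child):
            return smallest_clade_containing(child, tips)
    return leaves(tree)

def falls_within(tree, reference_tips, query):
    """True if `query` lies inside the clade spanned by `reference_tips`."""
    return query in smallest_clade_containing(tree, reference_tips)

# Hypothetical tree: two sequences from the source individual bracket patient_1.
toy_tree = ((("source_x", "patient_1"), "source_y"),
            ("control_1", ("patient_2", "control_2")))

print(falls_within(toy_tree, {"source_x", "source_y"}, "patient_1"))  # True
print(falls_within(toy_tree, {"source_x", "source_y"}, "patient_2"))  # False
```

Using more than one sequence from the suspected source matters here: the clade is defined by the most recent common ancestor of those sequences, and a patient counts as "within" only if it descends from that ancestor.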

107 What was the most likely geographical location of the common ancestor of the African apes and humans?
Scenario A: Africa as species fountain. Scenario B: Eurasia as ancestral homeland. Scenario B requires four fewer dispersal events. Modified from: Stewart, C.-B. & Disotell, T.R. (1998) Current Biology 8: R. Figure key: Eurasia = black, Africa = red; a separate symbol marks dispersal events.
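Counting the minimum number of changes in location that a tree requires is a parsimony calculation. The Python sketch below uses the standard Fitch algorithm on a small, hypothetical tree with made-up tip names; it is a generic illustration and not the analysis of Stewart & Disotell, who compared dispersal counts under two stated scenarios.

```python
def fitch(tree, states):
    """Fitch parsimony on a binary tree given as nested 2-tuples.

    Returns (set of most-parsimonious states at this node,
             minimum number of state changes below this node).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Children can agree on a state: no extra change needed here.
        return shared, left_cost + right_cost
    # Children disagree: one change (here, one dispersal) is required.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical taxa and locations, for illustration only.
toy_tree = ((("t1", "t2"), "t3"), "t4")
toy_states = {"t1": "Africa", "t2": "Africa", "t3": "Eurasia", "t4": "Eurasia"}

root_states, n_changes = fitch(toy_tree, toy_states)
print(root_states, n_changes)  # {'Eurasia'} 1
```

On this toy tree a Eurasian ancestor needs a single dispersal into Africa, whereas an African ancestor would need more; comparing such counts is the logic behind preferring the scenario with fewer dispersal events.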

108 How can we choose between competing hypotheses on the phylogeny of whales?

109 Phylogenetic Reconstruction of Whales
Whales belong to Artiodactyla (even-toed ungulate mammals), which includes camels, pigs, hippos, cows and deer. The outgroup is rhinos/horses. Whales are difficult to place because they lack many characters present in terrestrial mammals (e.g. hind limbs). Are whales sister to the entire group or to hippos?

110 DNA Sequence Data and Whale Evolution
Data were collected from the beta-casein gene for all taxa and the sequences were aligned. Nucleotide changes between outgroup and ingroup species indicate shared derived homologies. Most nucleotides are identical in all taxa; these are uninformative for phylogeny. Some nucleotides indicate that whales belong with cows, deer, and hippos (site 162). Others indicate that whales and hippos are sister groups (site 166). Others contradict the sister-group status of whale/hippo and cow/deer (site 177) and may indicate a reversal.
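The distinction between uninformative and informative sites can be sketched in Python. The alignment below is a made-up toy, not the beta-casein data, and the function names are hypothetical. The rule used is the usual parsimony criterion: a site is informative when at least two different nucleotides each occur in at least two taxa.

```python
from collections import Counter

def classify_site(column):
    """Classify one alignment column (one nucleotide per taxon)."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # Parsimony-informative: two or more states, each seen in two or more taxa.
    if sum(1 for n in counts.values() if n >= 2) >= 2:
        return "informative"
    return "variable but uninformative"

def informative_sites(alignment):
    """1-based positions of the parsimony-informative columns."""
    columns = zip(*alignment.values())
    return [pos for pos, col in enumerate(columns, start=1)
            if classify_site(col) == "informative"]

# Hypothetical toy alignment, for illustration only.
toy = {
    "outgroup": "ACGTA",
    "cow":      "ACGTA",
    "hippo":    "ATGCA",
    "whale":    "ATGCG",
}
print(informative_sites(toy))  # [2, 4]
```

In this toy data, sites 2 and 4 group hippo with whale, site 5 is variable but carries no grouping information, and the remaining sites are identical in all taxa.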

111 Phylogeny results should be treated as informative but not authoritative

