Intro. To Phylogenetic Analysis Slides modified by David Ardell From Caro-Beth Stewart, Paul Higgs, Joe Felsenstein and Mikael Thollesson.

1 Intro to Phylogenetic Analysis. Slides modified by David Ardell, from Caro-Beth Stewart, Paul Higgs, Joe Felsenstein and Mikael Thollesson.

2 C-B Stewart, NHGRI lecture, 12/5/00
What is phylogenetic analysis and why should we perform it?
Phylogenetic analysis has two major components:
1. Phylogeny inference or “tree building”: evolutionary relationships between genes or species
2. Character and rate analysis: mapping information onto trees

3 Common Phylogenetic Tree Terminology
[Figure: rooted tree with tips A, B, C, D, E, labeled with the terms below.]
- Ancestral node or ROOT of the tree
- Internal nodes (represent hypothetical ancestors of the taxa)
- Branches or lineages
- Terminal nodes: represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny
- CLADE

4 X and Y are defined to be more closely related to each other than to Z if, and only if, they share a more recent common ancestor with each other than either does with Z.
[Figure: trees on taxa A, B, C and D.]

5 All of these rearrangements show the same evolutionary relationships between the taxa.
[Figure: rooted tree 1a on taxa A, B, C, D, redrawn several times with its branches rotated so that the tips appear in different orders.]

9 Three types of trees
[Figure: the same tree on Taxon A, Taxon B, Taxon C and Taxon D drawn three ways.]
- Cladogram: groupings only; branch lengths have no meaning
- Phylogram: groupings + distance; branch lengths show evolutionary distance (example lengths 1, 1, 1, 6, 3, 5)
- Ultrametric tree: groupings + time; branch lengths show time
All show the same branching orders between taxa.

10 Similarity vs. Evolutionary Relationship
Since taxa evolve at different rates, your closest relative could be very different from you.
[Figure: phylogram of Taxon A, Taxon B, Taxon C (think lamprey) and Taxon D with branch lengths 1, 1, 1, 6, 3, 5.]
C is closer (more similar) to A but more closely related to B.
This is why the closest BLAST hit is not necessarily the closest relative, and why you need to make trees.

11 Types of Similarity
Observed similarity between two entities can be due to:
- Evolutionary relationship:
  - Shared ancestral characters (‘plesiomorphies’)
  - Shared derived characters (‘synapomorphies’)
- Homoplasy (independent evolution of the same character):
  - Convergent events
  - Parallel events
  - Reversals
[Figure: example trees with nucleotide states (C, G, T) at the tips.]

12 A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data:
- Which species are the closest living relatives of modern humans?
- Did the infamous Florida Dentist infect his patients with HIV?
- What were the origins of specific transposable elements?

14 Which species are the closest living relatives of modern humans?
[Figure: two trees of humans, chimpanzees, bonobos, gorillas and orangutans.]
- Classical view: time axis from 0 to 15-30 MYA
- Molecular view: time axis from 0 to 14 MYA

15 Did the Florida Dentist infect his patients with HIV?
Phylogenetic tree of HIV sequences from the DENTIST, his patients, and local HIV-infected people.
[Figure: tree with tips DENTIST, Patients A to G, and local controls 2, 3, 9 and 35; groups of patients are marked “Yes” or “No”.]
Yes: the HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.
No: the sequences from these patients do not.
From Ou et al. (1992) and Page & Holmes (1998).

16 Uses of character mapping:
- Dating adaptive evolutionary events
- Ancestral reconstruction
- Testing biological hypotheses of correlated function or change

17 Ex: Where geographically was the common ancestor of African apes and humans?
[Figure: two dispersal scenarios mapped onto the same tree. Legend: Eurasia = black; Africa = red; a third symbol marks dispersal events.]
- Scenario A: Africa as species fountain
- Scenario B: Eurasia as ancestral homeland
Scenario B requires four fewer dispersal events.
Modified from: Stewart, C.-B. & Disotell, T.R. (1998) Current Biology 8: R582-588.

18 Building Trees
                 COMPUTATIONAL METHOD
DATA TYPE        Optimality criterion               Clustering algorithm
Characters       PARSIMONY, MAXIMUM LIKELIHOOD
Distances        MINIMUM EVOLUTION, LEAST SQUARES   UPGMA, NEIGHBOR-JOINING

21 Types of data
Character data (taxa in rows, characters in columns):
  Species A  ATGGCTATTCTTATAGTACG
  Species B  ATCGCTAGTCTTATATTACA
  Species C  TTCACTAGACCTGTGGTCCA
  Species D  TTGACCAGACCTGTGGTCCG
  Species E  TTGACCAGTTCTCTAGTTCG
Distance-based data: pairwise distances (dissimilarities)
             A     B     C     D     E
  Species A  ----  0.20  0.50  0.45  0.40
  Species B  0.23  ----  0.40  0.55  0.50
  Species C  0.87  0.59  ----  0.15  0.40
  Species D  0.73  1.12  0.17  ----  0.25
  Species E  0.59  0.89  0.61  0.31  ----
Above the diagonal: uncorrected “p” distance (proportion of sites that differ). Below the diagonal (Example 2): Kimura 2-parameter distance.
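The two kinds of distance in the table can be recomputed from the character data. The sketch below is a minimal illustration, assuming gap-free sequences of equal length and the standard Kimura 2-parameter formula; the function names are made up for this example and do not come from the slides.

```python
import math

# Sequences copied from the character-data table on the slide.
ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}

PURINES = {"A", "G"}


def count_changes(seq1, seq2):
    """Return (transitions, transversions) between two aligned sequences."""
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    return transitions, transversions


def p_distance(seq1, seq2):
    """Uncorrected distance: the proportion of sites that differ."""
    ts, tv = count_changes(seq1, seq2)
    return (ts + tv) / len(seq1)


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance from transition (P) and transversion (Q) proportions."""
    ts, tv = count_changes(seq1, seq2)
    P = ts / len(seq1)
    Q = tv / len(seq1)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


if __name__ == "__main__":
    a, b = ALIGNMENT["A"], ALIGNMENT["B"]
    print(f"A-B  p = {p_distance(a, b):.2f}  K2P = {k2p_distance(a, b):.2f}")
```

For species A and B this gives p = 0.20 and a K2P distance that rounds to 0.23, matching the two triangles of the table. The correction grows with divergence because multiple changes at the same site become more likely.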

25 Parsimony
- Given two trees, the one requiring the lowest number of character changes to explain the observations is the better
  - Parsimony score for a tree is the minimum number of required changes
  - This score is frequently referred to as number of steps or tree length

26 Parsimony: an example
Four aligned sequences:
  acgtatgga
  acgggtgca
  aacggtgga
  aactgtgca
[Figure: two alternative trees for these sequences, with the character states (a, c) of selected sites mapped onto the tips.]
Total tree length: 7 for one tree, 8 for the other.
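The tree lengths in this example can be checked with Fitch's algorithm, a standard way to count the minimum number of changes on a given tree. The sketch below is illustrative only: the taxon names `s1` to `s4` are placeholders (the labels on the slide were not preserved), the sequences are assigned in the order listed, and trees are written as nested Python tuples.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, number of changes) for one site.

    `tree` is either a tip name (str) or a pair (left, right) of subtrees.
    `states` maps each tip name to its character state at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra change needed.
        return common, left_changes + right_changes
    # No shared state: one change is required on this pair of branches.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Parsimony score (tree length): sum of minimum changes over all sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        states = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Sequences from the slide; the names s1..s4 are hypothetical.
ALIGNMENT = {
    "s1": "acgtatgga",
    "s2": "acgggtgca",
    "s3": "aacggtgga",
    "s4": "aactgtgca",
}

TREE_1 = (("s1", "s2"), ("s3", "s4"))
TREE_2 = (("s1", "s3"), ("s2", "s4"))

if __name__ == "__main__":
    print("Tree 1 length:", tree_length(TREE_1, ALIGNMENT))
    print("Tree 2 length:", tree_length(TREE_2, ALIGNMENT))
```

Under this assignment the tree grouping the first two sequences has length 7 and the alternative has length 8, so parsimony prefers the first.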

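28 Using models
Models relate observed differences to actual changes.
[Figure: plot of observed differences against actual changes; diagram of substitutions among A, C, G and T, with the A-G and C-T changes marked; 4 x 4 matrix with rows and columns A, C, G, T.]
Example: Jukes-Cantor, where the probability of going from base i to base j takes one form if i = j and another if i ≠ j.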
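For reference, the standard Jukes-Cantor transition probabilities can be written as below. The symbol α (the rate of change from one base to each specific other base) is an assumption here, since the slide's own notation is not preserved; this parameterization is, however, consistent with the numbers in the one-branch likelihood example on slide 34.

```latex
P_{ij}(t) =
\begin{cases}
\dfrac{1}{4} + \dfrac{3}{4}\, e^{-4\alpha t} & \text{if } i = j,\\[2ex]
\dfrac{1}{4} - \dfrac{1}{4}\, e^{-4\alpha t} & \text{if } i \neq j.
\end{cases}
```

As t grows, every entry tends to 1/4, which is why observed differences level off while actual changes keep accumulating.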

34 Likelihood of a one-branch tree…
30 nucleotides from globin genes of two primates on a one-edge tree (asterisks mark the differing sites):
              * *
  Gorilla    GAAGTCCTTGAGAAATAAACTGCACACTGG
  Orangutan  GGACTCCTTGAGAAATAAACTGCACACTGG
There are two differences and 28 similarities.
[Plot: lnL against branch length t.] The maximum is at t = 0.02327, lnL = -51.133956.
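The two numbers on this slide can be reproduced under the Jukes-Cantor model. The sketch below assumes equal base frequencies of 1/4 and measures branch length as the product of the per-base rate and time (so the probability of no change is 1/4 + 3/4·e^(-4t)); the function names are invented for this example.

```python
import math

GORILLA = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
ORANGUTAN = "GGACTCCTTGAGAAATAAACTGCACACTGG"


def jc_log_likelihood(seq1, seq2, t):
    """Log-likelihood of two aligned sequences joined by one branch of length t.

    Assumes the Jukes-Cantor model with equal base frequencies (1/4 each).
    """
    decay = math.exp(-4.0 * t)
    p_same = 0.25 + 0.75 * decay   # probability that a site is unchanged
    p_diff = 0.25 - 0.25 * decay   # probability of one specific change
    lnl = 0.0
    for x, y in zip(seq1, seq2):
        # 1/4 for the base at one end, times the transition probability.
        lnl += math.log(0.25) + math.log(p_same if x == y else p_diff)
    return lnl


def jc_ml_branch_length(seq1, seq2):
    """Closed-form maximum-likelihood branch length under Jukes-Cantor."""
    differences = sum(x != y for x, y in zip(seq1, seq2))
    p = differences / len(seq1)
    return -0.25 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    t_hat = jc_ml_branch_length(GORILLA, ORANGUTAN)
    print(f"t = {t_hat:.5f}")
    print(f"lnL = {jc_log_likelihood(GORILLA, ORANGUTAN, t_hat):.6f}")
```

With two differences in 30 sites, the closed-form estimate and the log-likelihood at that point agree with the values printed on the slide. Evaluating `jc_log_likelihood` over a grid of t values would redraw the likelihood curve.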

35 A recipe for phylogenetic inference
- Collect your data
- Select an optimality criterion (“which tree is better?”, tree score)
- Optional: do data transformation (“corrections”)
- Select a search strategy to find the best tree
- Find the best hypothesis according to that criterion
- Assess the variation in your data in some way

36 Finding the best tree
- Number of (rooted) trees
  - 3 taxa -> 3 trees
  - 4 taxa -> 15 trees
  - 10 taxa -> 34 459 425 trees
  - 25 taxa -> 1.19 × 10^30 trees
  - 52 taxa -> 2.75 × 10^80 trees
- Finding the optimal tree is an NP-complete problem
- Search strategies
  - Exact
    - Exhaustive
    - Branch and bound
  - Algorithmic
    - Greedy algorithms, a.k.a. hill-climbing (including Neighbor-joining)
  - Heuristic
    - Systematic: branch-swapping (NNI, SPR, TBR)
    - Stochastic: Markov Chain Monte Carlo (MCMC), genetic algorithms
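The tree counts listed here follow from the double-factorial formulas: (2N - 3)!! rooted and (2N - 5)!! unrooted bifurcating trees for N taxa. A minimal sketch, with function names chosen for this example:

```python
def double_factorial(k):
    """Return k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_rooted_trees(n_taxa):
    """Number of rooted bifurcating trees: (2N - 3)!!, valid for N >= 2."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees: (2N - 5)!!, valid for N >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


if __name__ == "__main__":
    for n in (3, 4, 10, 25, 52):
        print(f"{n:2d} taxa -> {count_rooted_trees(n):.3g} rooted trees")
```

Python integers have arbitrary precision, so the counts stay exact even for 52 taxa. This growth is the reason exhaustive search is only feasible for a handful of taxa.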

37 “Star-Decomposition”
[Figure: three trees on taxa A, B, C, D, E, in order of increasing resolution.]
- Completely unresolved or "star" phylogeny
- Partially resolved phylogeny (still contains a polytomy or multifurcation)
- Fully resolved, bifurcating phylogeny (every split is a bifurcation)

38 There are three possible unrooted trees on four taxa (A, B, C, D).
[Figure: Tree 1, Tree 2 and Tree 3, which together pair the taxa as AB|CD, AC|BD and AD|BC.]

39 The number of unrooted trees increases in a greater than exponential manner with number of taxa.
(2N - 5)!! = # unrooted trees for N taxa
[Figure: unrooted trees on three to six taxa (A to F).]

41 What is a “good” method?
- Efficiency: time to find a/the solution
- Power: rate of convergence / how much data are needed
- Consistency: convergence to “correct” solution as data are added
- Robustness: performance when assumptions are violated
- Falsifiability: rejection of the model when inadequate

45 Performance on simulated data
[Plot: frequency of correct inference against sequence length. Panel labels: “All 0.50” and “0.30 and 0.05 respectively”.]

46 + and - of the methods
Pair-wise, NJ, distance approach
  + Fast (efficiency)
  + Models can be used to make distances (can be consistent)
  - Pairwise distances throw out information (loss of power)
  - One will get a tree, but no score to compare with other trees or hypotheses
Parsimony and tree-search
  + Philosophically appealing (Occam’s razor)
  - Can be inconsistent
  - Can be computationally slow due to a huge number of possible trees
Maximum likelihood and tree-search
  + Model-based, can be consistent, powerful, gain biological info
  - Model-based, bad when you have the wrong model
  - Computationally veeeeery slow due to heavy calculations in determining the tree score and a huge number of possible trees

47 The quick and dirty, pretty good tree
- Calculate model-based pairwise distances
- Make a Neighbor-Joining tree
- Do a bootstrap

48 A recipe for phylogenetic inference
- Collect your data
- Select an optimality criterion (“which tree is better?”)
- Optional: do data transformation (“corrections”)
- Select a search strategy to find the best tree
- Find the best hypothesis according to that criterion
- Assess the variation in your data in some way

53 Assessing the variation
- Jackknife: resampling without replacement
- Bootstrap: resampling with replacement
  1. Resample columns from an alignment with replacement to make a simulated sample of the same size
  2. Analyze this resampled dataset in the same way as you did the original sample
  3. Repeat this 100+ times, making 100+ bootstrap trees
  4. Summarize, for example, as a majority-rule consensus tree
  5. Clades found in more than 50% of the trees will be shown; a clade needs 70% to be called “weakly supported”

54 Bootstrap
[Figure: flow diagram using example trees on the taxa Aus, Beus, Ceus and Deus.]
1. Original data set with n characters. Original analysis, e.g. MP, ML, NJ.
2. Draw n characters randomly with replacement. Repeat m times, giving m pseudo-replicates, each with n characters.
3. Repeat the original analysis on each of the pseudo-replicate data sets.
4. Evaluate the results from the m analyses (in the example, one clade receives 75% support).
NB! The consensus tree is not a phylogenetic hypothesis, but a way to summarize other trees, in this case bootstrapped trees.
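The resampling step can be sketched in a few lines. In the example below, `bootstrap_replicate` draws n columns with replacement, and `bootstrap_support` repeats an analysis on m pseudo-replicates and reports how often each result appears. The `closest_pair` function is only a toy stand-in for a real analysis (MP, ML or NJ), and all names and the example alignment (taken from slide 21) are choices made for this illustration.

```python
import random
from collections import Counter

# Example alignment (the five sequences from the "Types of data" slide).
ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}


def bootstrap_replicate(alignment, rng):
    """Draw n columns with replacement from an alignment of n columns."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][i] for i in columns) for name in names}


def closest_pair(alignment):
    """Toy analysis (NOT a tree method): the pair of taxa with fewest differences."""
    names = sorted(alignment)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = sum(x != y for x, y in zip(alignment[a], alignment[b]))
            if best is None or d < best[0]:
                best = (d, frozenset((a, b)))
    return best[1]


def bootstrap_support(alignment, analyze, n_reps=100, seed=0):
    """Run `analyze` on n_reps pseudo-replicates; return result frequencies."""
    rng = random.Random(seed)
    counts = Counter(
        analyze(bootstrap_replicate(alignment, rng)) for _ in range(n_reps)
    )
    return {result: count / n_reps for result, count in counts.items()}


if __name__ == "__main__":
    for group, support in bootstrap_support(ALIGNMENT, closest_pair).items():
        print(sorted(group), f"{support:.0%}")
```

In a real analysis, `analyze` would build a tree and return its clades, and the frequencies would become the support values drawn on the consensus tree. Fixing the seed keeps the pseudo-replicates reproducible.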

55 Rooting
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.
[Figure: unrooted tree on A, B, C, D with the root position marked, and the resulting rooted tree.]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

56 Now, try it again with the root at another position.
[Figure: the same unrooted tree on A, B, C, D with the root marked elsewhere, and the resulting rooted tree.]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

57 An unrooted, four-taxon tree can be rooted in five different places.
[Figure: the unrooted tree 1 on A, B, C, D with its five branches numbered, and the rooted tree obtained from each.]
- Branch 1: rooted tree 1a
- Branch 2: rooted tree 1b
- Branch 3: rooted tree 1c
- Branch 4: rooted tree 1d
- Branch 5: rooted tree 1e

58 There are two major ways to root trees:
- Outgroup rooting: uses taxa or sequences (the “outgroup”) known to fall outside all the others (the “ingroup”). Requires prior knowledge.
- Midpoint rooting: roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes clock-like evolution.
[Figure: tree on A, B, C, D with branch lengths 10, 2, 3, 5, 2 and the outgroup marked.]
d(A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
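The midpoint arithmetic can be made explicit: walk along the path between the two most distant taxa and stop halfway. The sketch below only handles that final step, assuming the path from A to D has already been found and consists of branches of length 10, 3 and 5 as on the slide; finding the most distant pair in a full tree is not shown, and the function name is invented for this example.

```python
def midpoint_on_path(branch_lengths):
    """Locate the midpoint of a path given its branch lengths in order.

    Returns (total path length, index of the branch holding the midpoint,
    distance from the start of that branch to the midpoint).
    """
    total = sum(branch_lengths)
    half = total / 2
    walked = 0.0
    for index, length in enumerate(branch_lengths):
        if walked + length >= half:
            return total, index, half - walked
        walked += length
    raise ValueError("path must contain at least one branch")


if __name__ == "__main__":
    # Path from A to D on the slide: 10 + 3 + 5.
    total, branch, offset = midpoint_on_path([10, 3, 5])
    print(f"d(A,D) = {total}, root on branch {branch}, {offset} from its start")
```

For the slide's example the root falls on the first branch of the path (the one of length 10 leading to A), 9 units from A.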

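59 Each unrooted tree theoretically can be rooted anywhere along any of its branches.
(number of unrooted trees) x (number of branches per unrooted tree) = number of rooted trees
(2N - 5)!! x (2N - 3) = (2N - 3)!! = # rooted trees for N taxa
[Figure: unrooted trees on four, five and six taxa (A to F).]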

60 We have arrived at a tree. Can we trust it as a good hypothesis of the phylogeny?
What can go wrong?
- Sampling error
  - Assessed by, for example, the bootstrap
- Too superficial tree search
  - Remember: finding the best tree is really hard
- Systematic error (inconsistent method)
  - Tests of the adequacy of models used
  - Premeditated use of different methods
- Reality
  - A tree may be a poor model of the real history
  - Information has been lost by subsequent evolutionary changes
- “Species” vs. “gene” trees

61 What is wrong with this tree?
[Figure: tree of Canis, Mus and Gadus with a support value of 100.]
- Negligible (within sequence) sampling error
- Tree estimated by a consistent method

62 The expected tree…
[Figure: a “species” tree and the “gene” trees that follow a gene duplication.]

63 Two copies (paralogs) present in the genomes
[Figure: gene tree for Canis, Mus and Gadus, with the relationships between sequences labeled “Paralogous” and “Orthologous”.]

65 What we have studied…
[Figure: tree with tips Canis, Gadus and Mus.]
Message: specific loss patterns of paralogs can disrupt species trees if we don’t know what is a paralog and what is an ortholog.

66 To conclude
- Phylogenetic inference deals with historical events and information transfer through time
- Results from phylogenetic analyses are hypotheses for further testing; the true history will remain unknown
- Inference is mathematically intricate and computationally heavy, and as a result methods for phylogenetic inference are legion
- There are several pitfalls to avoid when doing the analyses and when interpreting them
- But… ignoring the shared histories can sometimes give completely bogus results in comparative studies

67 Phylogenetic trees diagram the evolutionary relationships between the taxa.
[Figure: tree with tips Taxon A, Taxon B, Taxon C, Taxon E, Taxon D.]
In parenthetical notation: ((A,(B,C)),(D,E))
There is no meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
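One way to see that tip order carries no meaning is to reduce a tree to its set of clades and compare. The sketch below parses the parenthetical notation (without branch lengths or internal labels, which is an assumption of this minimal parser) into nested tuples and collects the clades; the function names are invented for this example.

```python
def parse_newick(text):
    """Parse parenthetical notation without branch lengths into nested tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                          # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                      # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                          # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos].strip()        # a tip name

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def clades(tree):
    """Return every clade of the tree as a frozenset of tip names."""
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    tips(tree)
    return found


if __name__ == "__main__":
    original = clades(parse_newick("((A,(B,C)),(D,E))"))
    rotated = clades(parse_newick("((E,D),((C,B),A))"))
    print("Same relationships:", original == rotated)
```

Rotating branches changes the string and the drawing but not the clade set, so the two trees above describe the same relationships.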

