Phylogenetic reconstruction

Types of data used in phylogenetic inference

Distance-based methods: Transform the sequence data into pairwise distances, and use the matrix during tree building.

              A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----

Example 1 (above the diagonal): Uncorrected “p” distance (= observed percent sequence difference)
Example 2 (below the diagonal): Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)

Character-based methods: Use the aligned characters directly.

Taxa        Characters
Species A   ATGGCTATTCTTATAGTACG
Species B   ATCGCTAGTCTTATATTACA
Species C   TTCACTAGACCTGTGGTCCA
Species D   TTGACCAGACCTGTGGTCCG
Species E   TTGACCAGTTCTCTAGTTCG
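
The two distance measures named on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the software used for the slides: the function names `p_distance` and `k2p_distance` are made up here, and the Kimura formula is the standard textbook one, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], with P and Q the observed fractions of transitions and transversions.

```python
import math

# The five aligned sequences from the slide.
ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}

PURINES = {"A", "G"}


def p_distance(s1, s2):
    """Uncorrected p distance: fraction of aligned sites that differ."""
    diffs = sum(1 for x, y in zip(s1, s2) if x != y)
    return diffs / len(s1)


def k2p_distance(s1, s2):
    """Kimura 2-parameter distance (hypothetical helper name)."""
    n = len(s1)
    transitions = transversions = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        # A<->G and C<->T are transitions; all other changes are transversions.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    P, Q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


if __name__ == "__main__":
    a, b = ALIGNMENT["A"], ALIGNMENT["B"]
    print(round(p_distance(a, b), 2), round(k2p_distance(a, b), 2))
```

For species A and B the sequences differ at 4 of 20 sites (one transition, three transversions), which reproduces the 0.20 and 0.23 entries of the matrix.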

Tree-Building Methods
Distance: UPGMA, NJ, FM, ME
Character: Maximum Parsimony, Maximum Likelihood

Distance Methods
Measure: distance (dissimilarity)
Methods:
- UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
- NJ (Neighbor joining)
- ME (Minimum Evolution)

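UPGMA: Distance measure
Clustering: All leaves are assigned to clusters, which are then iteratively merged according to their distance. The distance between two clusters i and j can be defined as the average distance between pairs of sequences drawn from the two clusters:

$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$

where $d_{pq}$ is, for example, the number of mismatches in a gap-free alignment of sequences p and q.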

UPGMA: Replacing
Node k replaces nodes i and j with their union:

$C_k = C_i \cup C_j$   (1)

The new distances between the new node k and all other clusters l are computed according to:

$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$   (2)

UPGMA: Algorithm
Initialization:
- Assign each sequence i to its own cluster C_i.
- Define one leaf of T for each sequence, and place it at height zero.
Iteration:
- Determine the two clusters i, j for which d_ij is minimal.
- Define a new cluster k by C_k = C_i ∪ C_j, and define d_kl for all l by (2).
- Define a node k with daughter nodes i and j, and place it at height d_ij/2.
- Add k to the current clusters and remove i and j.
Termination:
- When only two clusters i, j remain, place the root at height d_ij/2.
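
The algorithm above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code behind the slides: the function name `upgma`, the tuple-of-leaves representation of clusters and the `frozenset` keys are choices made here, and the input is the uncorrected p distance matrix for species A to E from the first slide. New distances are the size-weighted averages of update rule (2).

```python
from itertools import combinations


def upgma(names, dist):
    """Return the merges as a list of (cluster_i, cluster_j, node_height).

    names: list of leaf labels.
    dist:  dict mapping frozenset({x, y}) to the distance between leaves x, y.
    Clusters are represented as tuples of leaf labels.
    """
    clusters = [(n,) for n in names]
    d = {frozenset({(a,), (b,)}): dist[frozenset({a, b})]
         for a, b in combinations(names, 2)}
    merges = []
    while len(clusters) > 1:
        # Iteration: find the two clusters with minimal distance.
        ci, cj = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset({ci, cj})]
        ck = ci + cj
        # Update rule (2): size-weighted average of the distances to i and j.
        for cl in clusters:
            if cl in (ci, cj):
                continue
            d[frozenset({ck, cl})] = (
                d[frozenset({ci, cl})] * len(ci)
                + d[frozenset({cj, cl})] * len(cj)
            ) / (len(ci) + len(cj))
        # Add k to the current clusters and remove i and j.
        clusters = [c for c in clusters if c not in (ci, cj)] + [ck]
        # The new node (or finally the root) sits at height d_ij / 2.
        merges.append((ci, cj, dij / 2))
    return merges


# Uncorrected p distances for species A-E (upper triangle of the first slide).
P_DIST = {
    ("A", "B"): 0.20, ("A", "C"): 0.50, ("A", "D"): 0.45, ("A", "E"): 0.40,
    ("B", "C"): 0.40, ("B", "D"): 0.55, ("B", "E"): 0.50,
    ("C", "D"): 0.15, ("C", "E"): 0.40,
    ("D", "E"): 0.25,
}
DIST = {frozenset(pair): value for pair, value in P_DIST.items()}
MERGES = upgma(list("ABCDE"), DIST)

if __name__ == "__main__":
    for ci, cj, height in MERGES:
        print(ci, cj, round(height, 4))
```

On these distances the sketch joins C with D first, then A with B, then E with CD, and finally places the root between AB and CDE.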

UPGMA example

Step 1: Alignment -> distance
Example: observed percent sequence difference. Distance matrix:

          A     B     C     D     E     F
    B    63
    C    94    79
    D   111    96    47
    E    67    23    83   100
    F    20    58    89   106    62
    G   107    92    43    16     ?   102

DNA/RNA overview

Step 2: distance -> clade
The smallest entry is d(D,G) = 16, so D and G form the first clade.

Step 3: merge D and G
Distances to the new cluster DG are size-weighted averages, e.g. d(A,DG) = (111 + 107)/2 = 109.

          A     B     C     E     F
    B    63
    C    94    79
    E    67    23    83
    F    20    58    89    62
    DG  109    94    45    98   104

Step 4
The smallest entry is now d(A,F) = 20.

Step 5: merge A and F

         AF     B     C     E
    B    61
    C    92    79
    E    65    23    62
    DG  107    94    45    98

Step 6
The smallest entry is now d(B,E) = 23.

Step 7: merge B and E

         AF    BE     C
    BE   63
    C    92    71
    DG  107    96    45

Step 8
The smallest entry is now d(C,DG) = 45.

Step 9: merge C and DG

         AF    BE
    BE   63
    CDG 102    88

Step 10
The smallest entry is now d(AF,BE) = 63, so AF and BE are merged.

UPGMA: distance -> phylogeny

        AFBE
    CDG   94

The last two clusters are joined at the root, giving the tree (((A,F),(B,E)),(C,(D,G))).

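Additivity
[Figure: unrooted four-taxon tree. Neighbors A and B are attached by branches a and b, neighbors C and D by branches c and d, and the two pairs are connected by the internal branch e.]
Four-point condition: for neighbors (A,B) and (C,D),

$d_{AB} + d_{CD} < d_{AC} + d_{BD} = d_{AD} + d_{BC}$

because $d_{AB} + d_{CD} = a + b + c + d$, while each of the other two sums equals $a + b + c + d + 2e$.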
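
A small check of the four-point condition may make the idea concrete. This is a sketch under assumptions: the function name `four_point`, the tolerance and the branch lengths of the additive example (a=1, b=2, c=3, d=4, e=5) are invented here for illustration. The second matrix uses the uncorrected p distances of species A to D from the first slide.

```python
def four_point(d, w, x, y, z, tol=1e-9):
    """Check the four-point condition for taxa w, x, y, z.

    d maps frozenset({p, q}) to a distance. Returns (is_additive, pairing),
    where pairing is the neighbor split with the smallest distance sum.
    """
    def dist(p, q):
        return d[frozenset({p, q})]

    sums = {
        ((w, x), (y, z)): dist(w, x) + dist(y, z),
        ((w, y), (x, z)): dist(w, y) + dist(x, z),
        ((w, z), (x, y)): dist(w, z) + dist(x, y),
    }
    ordered = sorted(sums.items(), key=lambda kv: kv[1])
    smallest, middle, largest = ordered
    # Additive distances: the two largest sums are equal.
    is_additive = abs(largest[1] - middle[1]) <= tol
    return is_additive, smallest[0]


# Path lengths on a tree with (hypothetical) branches a=1, b=2, c=3, d=4, e=5.
TREE_DIST = {
    frozenset("AB"): 3, frozenset("CD"): 7,
    frozenset("AC"): 9, frozenset("BD"): 11,
    frozenset("AD"): 10, frozenset("BC"): 10,
}

# Observed p distances from the first slide (species A-D).
P_DIST = {
    frozenset("AB"): 0.20, frozenset("CD"): 0.15,
    frozenset("AC"): 0.50, frozenset("BD"): 0.55,
    frozenset("AD"): 0.45, frozenset("BC"): 0.40,
}

if __name__ == "__main__":
    print(four_point(TREE_DIST, "A", "B", "C", "D"))
    print(four_point(P_DIST, "A", "B", "C", "D"))
```

Distances read off a tree satisfy the condition exactly. Observed distances usually do not, although the smallest sum still points to the same neighbor pairs in this case.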

Neighbor-joining (N-J)
- Starts with a star-like tree.
- Neighbors are sequentially merged if they minimize the total length of branches.
- Find the nodes 1 and 2 that give the minimum Q.
[Figure: star-like tree of taxa A to F, and the same tree after two neighbors have been joined.]
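
The slide does not spell out Q. A common formulation, assumed here, is Q(i,j) = (n - 2) d_ij - r_i - r_j, where r_i is the sum of the distances from i to all other taxa. The sketch below covers only the first joining step, with the hypothetical function name `nj_first_join` and the uncorrected p distances of species A to E from the first slide as input.

```python
from itertools import combinations


def nj_first_join(names, dist):
    """Return (best_pair, q) for the first neighbor-joining step.

    dist maps frozenset({x, y}) to a distance.
    Q(i, j) = (n - 2) * d_ij - r_i - r_j is an assumed, commonly used form.
    """
    n = len(names)
    # r_i: sum of distances from taxon i to every other taxon.
    r = {a: sum(dist[frozenset({a, b})] for b in names if b != a)
         for a in names}
    q = {(a, b): (n - 2) * dist[frozenset({a, b})] - r[a] - r[b]
         for a, b in combinations(names, 2)}
    best_pair = min(q, key=q.get)
    return best_pair, q


P_DIST = {
    ("A", "B"): 0.20, ("A", "C"): 0.50, ("A", "D"): 0.45, ("A", "E"): 0.40,
    ("B", "C"): 0.40, ("B", "D"): 0.55, ("B", "E"): 0.50,
    ("C", "D"): 0.15, ("C", "E"): 0.40,
    ("D", "E"): 0.25,
}
DIST = {frozenset(pair): value for pair, value in P_DIST.items()}
BEST, Q = nj_first_join(list("ABCDE"), DIST)

if __name__ == "__main__":
    print(BEST, round(Q[BEST], 2))
```

On these distances the minimum Q belongs to the pair (A, B), even though the smallest raw distance is between C and D. Q corrects each distance for how far the two taxa are from everything else.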

Clustering methods (UPGMA & N-J)
Optimality criterion: NONE. The algorithm itself builds ‘the’ tree.
UPGMA: rooted tree
N-J: unrooted tree
Advantages:
- Can be used on indirectly-measured distances (immunological, hybridization).
- Distances can be ‘corrected’ for unseen events.
- The fastest of the methods available. Can therefore analyze very large datasets quickly (needed for HIV, etc.).
- Can be used for some types of rate and date analysis.
Disadvantages:
- Similarity and relationship are not necessarily the same thing, so clustering by similarity does not necessarily give an evolutionary tree.
- Cannot be used for character analysis!
- Have no explicit optimization criteria, so one cannot even know if the program worked properly to find the correct tree for the method.

Minimum evolution (ME) methods
Optimality criterion: The tree with the shortest overall tree length is chosen as the best tree.
Advantages:
- Can be used on indirectly-measured distances (immunological, hybridization).
- Distances can be ‘corrected’ for unseen events.
- Usually faster than character-based methods.
- Can be used for some rate analyses.
- Has an objective function (as compared to clustering methods).
Disadvantages:
- Information lost when characters transformed to distances.
- Cannot be used for character analysis.
- Slower than clustering methods.

Character Methods
- Maximum Parsimony: minimal changes to produce data
- Maximum Likelihood

Parsimony methods
Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.
Occam’s razor: “Of two competing theories or explanations, all other things being equal, the simpler one is to be preferred.” (William of Occam, 1280-1347)
Advantages:
- Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’).
- Can be used on molecular and non-molecular (e.g., morphological) data.
- Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy).
- Can be used for character (can infer the exact substitutions) and rate analysis.
- Can be used to infer the sequences of the extinct (hypothetical) ancestors.
Disadvantages:
- Are simple, intuitive, and logical (derived from “Medieval logic”, not statistics!)
- Can become positively misleading in the “Felsenstein Zone”.

Parsimony

1   CCCAGG
2   CCCAAG
3   CCCAAA
4   CCCAAC

Comparison   # changes
1-2          1
1-3          2
1-4          2
2-3          1
2-4          1
3-4          1

1,2 can be sister taxa AND 3,4 can be sister taxa. Infer ancestor of 1,2 and 3,4.

Parsimony

1 CCCAGG, 2 CCCAAG -> ancestor CCCAAG
3 CCCAAA, 4 CCCAAC -> ancestor CCCAAA

3 changes

Calculate # changes | tree

α   acgtatgga
β   acgggtgca
γ   aacggtgga
δ   aactgtgca

The tips of each tree are labelled with the state at site 2 (α: c, β: c, γ: a, δ: a).

Tree 1: ((α,β),(γ,δ))   Total tree length: 7
Tree 2: ((α,δ),(β,γ))   Total tree length: 8
Tree 3: ((α,γ),(β,δ))   Total tree length: 8
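
The tree lengths on this slide can be checked with Fitch's small-parsimony counting, sketched below. The function name `fitch_length` and the nested-tuple tree encoding are choices made for this illustration, and the four taxa of the slide are spelled out as alpha, beta, gamma and delta. Rooting each four-taxon tree on its internal branch does not change the count.

```python
SEQS = {
    "alpha": "acgtatgga",
    "beta": "acgggtgca",
    "gamma": "aacggtgga",
    "delta": "aactgtgca",
}


def fitch_length(tree, seqs):
    """Minimum number of changes on a binary tree (Fitch's algorithm).

    tree: nested 2-tuples of taxon names, e.g. (("alpha", "beta"), ...).
    seqs: dict mapping taxon name to its aligned sequence.
    """
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for site in range(n_sites):
        changes = 0

        def state_set(node):
            nonlocal changes
            if isinstance(node, str):
                return {seqs[node][site]}
            left, right = (state_set(child) for child in node)
            common = left & right
            if common:
                return common
            # Empty intersection: one change is needed below this node.
            changes += 1
            return left | right

        state_set(tree)
        total += changes
    return total


TREES = {
    "((alpha,beta),(gamma,delta))": (("alpha", "beta"), ("gamma", "delta")),
    "((alpha,delta),(beta,gamma))": (("alpha", "delta"), ("beta", "gamma")),
    "((alpha,gamma),(beta,delta))": (("alpha", "gamma"), ("beta", "delta")),
}

if __name__ == "__main__":
    for label, tree in TREES.items():
        print(label, fitch_length(tree, SEQS))
```

The three topologies come out at 7, 8 and 8 changes, so the tree grouping alpha with beta is the most parsimonious.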

Maximum likelihood (ML) methods
Optimality criterion: ML methods evaluate phylogenetic hypotheses in terms of the probability that a proposed model of the evolutionary process and the proposed unrooted tree would give rise to the observed data. The tree found to have the highest ML value is considered to be the preferred tree.
Advantages:
- Are inherently statistical and evolutionary model-based.
- Usually the most ‘consistent’ of the methods available.
- Can be used for character (can infer the exact substitutions) and rate analysis.
- Can be used to infer the sequences of the extinct (hypothetical) ancestors.
- Can help account for branch-length effects in unbalanced trees.
- Can be applied to nucleotide or amino acid sequences, and other types of data.
Disadvantages:
- Are not as simple and intuitive as many other methods.
- Are computationally very intense (limits number of taxa and length of sequence).
- Like parsimony, can be fooled by high levels of homoplasy.
- Violations of the assumed model can lead to incorrect trees.

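Using models
Example: Jukes-Cantor
Substitution rates between the nucleotides A, G, C, T:

$q_{ij} = \alpha$ if $i \neq j$;   $q_{ij} = -3\alpha$ if $i = j$

Transition probabilities after time t:

$P_{ij}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$ if $i = j$;   $P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$ if $i \neq j$

$p_t$: proportion of different nucleotides, $p_t = \frac{3}{4}\left(1 - e^{-4\alpha t}\right)$

[Figure: observed differences plotted against actual changes.]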

Maximum likelihood
Given two trees, the one with the higher likelihood, i.e. the one with the higher conditional probability of observing the data, is the better tree.

30 nucleotides from ψη-globin genes of two primates on a one-edge tree

             * *
Gorilla     GAAGTCCTTGAGAAATAAACTGCACACTGG
Orangutan   GGACTCCTTGAGAAATAAACTGCACACTGG

There are two differences and 28 similarities. The log-likelihood is maximal at αt = 0.02327, where lnL = -51.133956.
[Figure: lnL plotted against αt.]
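
The numbers on this slide can be reproduced with a short calculation, assuming the Jukes-Cantor model with rate α per specific substitution and equal base frequencies of 1/4. The function names below are made up for this sketch. The closed-form estimate αt = -(1/4) ln(1 - 4p/3), with p the observed proportion of differing sites, is the standard Jukes-Cantor result.

```python
import math

GORILLA = "GAAGTCCTTGAGAAATAAACTGCACACTGG"
ORANGUTAN = "GGACTCCTTGAGAAATAAACTGCACACTGG"


def jc_log_likelihood(alpha_t, s1, s2):
    """Log-likelihood of two aligned sequences joined by one edge (JC model)."""
    e = math.exp(-4.0 * alpha_t)
    p_same = 0.25 + 0.75 * e   # probability that a site keeps its state
    p_diff = 0.25 - 0.25 * e   # probability of one specific different state
    lnl = 0.0
    for x, y in zip(s1, s2):
        # 1/4 is the assumed equilibrium frequency of the starting base.
        lnl += math.log(0.25 * (p_same if x == y else p_diff))
    return lnl


def jc_ml_alpha_t(s1, s2):
    """Closed-form maximum-likelihood estimate of alpha*t under JC."""
    p = sum(x != y for x, y in zip(s1, s2)) / len(s1)
    return -0.25 * math.log(1.0 - 4.0 * p / 3.0)


ALPHA_T = jc_ml_alpha_t(GORILLA, ORANGUTAN)
LNL = jc_log_likelihood(ALPHA_T, GORILLA, ORANGUTAN)

if __name__ == "__main__":
    print(round(ALPHA_T, 5), round(LNL, 5))
```

With two differences in 30 sites this gives αt of about 0.02327 and lnL of about -51.13396, in agreement with the values quoted on the slide.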

Likelihoods of a more interesting tree
- Data for one site is shown on the tree.
- Edge lengths are defined as d_i = 3(αt)_i.
- Computational root is chosen arbitrarily (homogeneous models) at an internal node (arrow).
- u is the state at the root node, v at the other internal node.
- Maximize L with respect to the d's.
[Figure: four-taxon tree with tip states A, C, A, T and edge lengths d1 to d5.]
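
The likelihood of a single site on such a tree is a sum over the unknown states u and v at the two internal nodes. The sketch below assumes the Jukes-Cantor model and a particular layout that the transcript does not confirm: tips 1 and 2 attach to the root node u by edges d1 and d2, tips 3 and 4 attach to v by d3 and d4, and d5 joins u and v. The function names and the edge lengths in the example are hypothetical.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGT"


def jc_prob(x, y, d):
    """JC transition probability along an edge of length d = 3 * alpha * t."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def site_likelihood(tips, d):
    """Likelihood of one site on a four-taxon tree (assumed layout).

    tips = (t1, t2, t3, t4): t1, t2 hang off the root node u; t3, t4 off v.
    d = (d1, d2, d3, d4, d5): d1..d4 lead to t1..t4, d5 joins u and v.
    """
    t1, t2, t3, t4 = tips
    d1, d2, d3, d4, d5 = d
    total = 0.0
    # Sum over all 16 combinations of internal states u and v.
    for u, v in product(NUCLEOTIDES, repeat=2):
        total += (0.25                      # prior probability of state u
                  * jc_prob(u, t1, d1) * jc_prob(u, t2, d2)
                  * jc_prob(u, v, d5)
                  * jc_prob(v, t3, d3) * jc_prob(v, t4, d4))
    return total


if __name__ == "__main__":
    # Tip states as on the slide; the edge lengths are arbitrary examples.
    print(site_likelihood(("A", "C", "A", "T"), (0.1, 0.1, 0.1, 0.1, 0.05)))
```

Maximizing L with respect to the d's would then mean repeating this computation over all sites and adjusting the five edge lengths with a numerical optimizer.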

Confidence assessment: Bootstrap

Bootstrap
1. Original data set with n characters (taxa Aus, Beus, Ceus, Deus). Original analysis, e.g. MP, ML, NJ.
2. Draw n characters randomly with replacement. Repeat m times. This gives m pseudo-replicates, each with n characters.
3. Repeat the original analysis on each of the pseudo-replicate data sets.
4. Evaluate the results from the m analyses (e.g. a grouping found in 75% of the pseudo-replicates gets a support value of 75%).
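
The resampling step can be sketched as follows. This is an illustration under assumptions, not a full bootstrap of trees: the function names are invented here, the alignment is the five-species example from the first slide, and the "analysis" is a deliberately simple stand-in (which pair of taxa has the smallest p distance) where a real study would rerun MP, ML or NJ.

```python
import random
from itertools import combinations

ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}


def bootstrap_replicates(alignment, m, seed=0):
    """Return m pseudo-replicates, each with n columns drawn with replacement."""
    rng = random.Random(seed)
    names = list(alignment)
    n = len(alignment[names[0]])
    replicates = []
    for _ in range(m):
        # The same sampled columns are used for every taxon.
        cols = [rng.randrange(n) for _ in range(n)]
        replicates.append(
            {name: "".join(alignment[name][c] for c in cols) for name in names}
        )
    return replicates


def closest_pair(alignment):
    """Stand-in analysis: the pair of taxa with the fewest differences."""
    def p_distance(s1, s2):
        return sum(x != y for x, y in zip(s1, s2)) / len(s1)

    return min(combinations(sorted(alignment), 2),
               key=lambda pair: p_distance(alignment[pair[0]],
                                           alignment[pair[1]]))


def bootstrap_support(alignment, analysis, m, seed=0):
    """Fraction of pseudo-replicates that reproduce the original result."""
    original = analysis(alignment)
    hits = sum(1 for rep in bootstrap_replicates(alignment, m, seed)
               if analysis(rep) == original)
    return original, hits / m


if __name__ == "__main__":
    print(bootstrap_support(ALIGNMENT, closest_pair, m=200))
```

The support value is simply the fraction of pseudo-replicates that agree with the original analysis, which is how a figure such as 75% on a branch is obtained.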

Pros and cons of some methods
Pair-wise, algorithmic approach
+ Fast
+ Models can be used when transforming to distances
- Information is lost when transforming to pair-wise distances
- One will get a tree, but no measure of goodness to compare with other hypotheses
Parsimony
+ Philosophically appealing
- Models cannot be used
- Can be computationally slow
Maximum likelihood
+ Model based
- Model based
- Computationally veeeeery slow

Computation
For large data sets (many taxa), exact solutions for any method employing an optimality criterion (parsimony, maximum likelihood, minimum evolution) are not computationally feasible => Use Neighbor-Joining

What can go wrong?
- Reality: a tree may be a poor model of the real history
- Information has been lost by subsequent evolutionary changes
- “Species” vs. “gene” trees

What is wrong with this tree? Canis Mus Gadus 100 100

The expected tree… Gene duplication “Species” tree “Gene” trees

Two copies (paralogs) present in the genomes
[Figure: gene tree of Canis, Mus and Gadus with the two gene copies; copies within one clade are orthologous, copies in different clades are paralogous.]

What we have studied… Canis Gadus Mus

HIV Genome Diversity
- Error prone (RT) replication
- High rate of replication: 10^10 virions/day
- In vivo selection pressure

HIV tree
[Figure: HIV trees for ENV and GAG, from AIDS 1996, 10:S13.]
Recombinants!

To conclude:
- Trash in, trash out: alignment is crucial
- Try several methods, for consistency
- Beware of paralogs
- Choose the outgroup wisely: a related sequence of more ancient origin than the “ingroup”
- If recombination is possible, each site has its own tree.