COMPUTATIONAL MODELS FOR PHYLOGENETIC ANALYSIS K. R. PARDASANI DEPTT OF APPLIED MATHEMATICS MAULANA AZAD NATIONAL INSTITUTE OF TECHNOLOGY (MANIT) BHOPAL.

Phylogenetic Analysis From a given set of sequences, it should be possible to reconstruct the evolutionary (i.e. ancestral) relationships among genes and among organisms. Phylogenetic analysis involves creating a branching or tree structure, termed a phylogeny, which illustrates the relationships between sequences. A phylogenetic analysis of a family of related nucleic acid or protein sequences is a determination of how the family might have been derived during evolution.

Phylogenetic Trees Sequence alignment methods identify similar sequences, and multiple sequence alignment is applied to a set of related sequences before a phylogenetic analysis can be performed. From the aligned sequences, the evolutionary (ancestral) relationships among the genes and among the organisms can then be reconstructed. This involves creating a branching structure, called a phylogeny or tree, that illustrates the relationships between the sequences.

Basics of Trees A tree is a two-dimensional graph showing evolutionary relationships among organisms, or among certain genes from separate organisms. The separate sources of sequences are referred to as taxa (singular: taxon), defined as phylogenetically distinct units on the tree. A tree is composed of nodes representing the taxa and branches representing the relationships among the taxa.

Basic Properties of Trees The root is the common ancestor of all taxa. If we do not have taxa to define the root, we can still represent the relationships with an unrooted tree. Leaves represent the things being compared, such as genes or species. Paralogous genes are genes that diverged by duplication within the same species. Orthologous genes are genes that diverged along with the species, i.e. through speciation.

Rooted & Unrooted Trees In rooted trees a single node is designated as a common ancestor, and a unique path leads from it through evolutionary time to any other node. In a rooted tree, the path from the root to a node represents an evolutionary path. An unrooted tree specifies relationships among things, but not evolutionary paths: it only specifies the relationships between nodes and says nothing about the direction in which evolution occurred. Roots can usually be assigned to unrooted trees through the use of an outgroup. Outgroup: a species that unambiguously separated from the other species being studied before they diverged from one another.

Styles of Trees - I Cladogram: nodes are connected to other nodes and to tips by straight lines going directly from one to the other, which gives a V-shaped appearance. Curvogram: nodes are connected to other nodes and to tips by a curve which is one fourth of an ellipse, starting out horizontally and then curving upwards to become vertical. Phenogram: nodes are connected to other nodes and to tips by a horizontal and then by a vertical line. This gives a precise idea of horizontal levels.

Styles of Trees - II Eurogram: so called because it is a version of the cladogram popular in Europe. Nodes are connected to other nodes and to tips by a diagonal line that goes outward and at most one-third of the way up to the next node, then turns sharply upwards and is vertical. Swoopogram: connects two nodes, or a node and a tip, using two curves that are each one-quarter of an ellipse. The first part starts out vertical and then bends over to become horizontal. The second part starts out horizontal and then bends up to become vertical.

Steps in Phylogenetic analysis In general it is a four-step method: 1. Alignment strategy. 2. Determination of the substitution model. 3. Tree building. 4. Tree evaluation.

Methods of phylogenetic analysis 1) Distance Matrix Methods (MD): methods of calculation of distance matrices; the Neighbor-Joining method (NJ); the Fitch / Margoliash method; UPGMA. 2) Character Based Methods: Maximum Parsimony (MP); Maximum Likelihood (ML).

Distance Matrix Methods (MD) 1. Methods of calculation of distance matrices: DNA distance matrices are calculated such that each mismatch between two sequences adds to the distance between them. The simplest scoring method is that of Jukes and Cantor, in which all possible nucleotide substitutions are of equal value. This model also assumes that each base will eventually have the same frequency in DNA sequences once equilibrium has been reached.
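As a minimal sketch of the Jukes and Cantor scoring described above: the proportion p of mismatched sites between two aligned sequences is converted to a distance d = -(3/4) ln(1 - 4p/3), which corrects for multiple substitutions at the same site when all substitutions are equally likely. The function names and the example sequences are illustrative assumptions and do not come from the slides.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)

def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance: expected substitutions per site."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The correction is undefined once sequences look random to each other.
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical example: one mismatch in ten sites (p = 0.1).
print(round(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAT"), 4))
```

The corrected distance is always at least as large as the raw mismatch proportion, since some substitutions are hidden by later changes at the same site.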

GENERAL SUBSTITUTION MATRIX
      A                C                G                T
A     -(a1+a2+a3)      a1               a2               a3
C     a4               -(a4+a5+a6)      a5               a6
G     a7               a8               -(a7+a8+a9)      a9
T     a10              a11              a12              -(a10+a11+a12)

2. Unweighted pair group method with arithmetic mean (UPGMA) UPGMA is the oldest and simplest distance matrix method for tree reconstruction. It is largely statistically based and, like all distance-based methods, requires data that can be condensed to a measure of genetic distance between all pairs of taxa being considered.

UPGMA The UPGMA method requires a distance matrix such as one that might be created for a group of four taxa called A, B, C, D. Assume that the pairwise distances between each of the taxa are given in the following matrix:
Species   A     B     C
B         dAB   -     -
C         dAC   dBC   -
D         dAD   dBD   dCD
Here dAB represents the distance between species A and B, while dAC is the distance between taxa A and C, and so on.

UPGMA UPGMA begins by clustering the two species with the smallest distance separating them into a single, composite group. Assume that the smallest value in the distance matrix corresponds to dAB, in which case species A and B are the first to be grouped (AB). After the first clustering, a new distance matrix is computed, with the distances between the new group (AB) and species C and D calculated as d(AB)C = 1/2 (dAC + dBC) and d(AB)D = 1/2 (dAD + dBD). The process is repeated until all the species have been grouped.
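The clustering loop above can be sketched as follows. This is one possible implementation under stated assumptions, not the slides' own code: the distances in the example are hypothetical, and later merges average over all member taxa by weighting each cluster by its size, which reduces to the 1/2 (dAC + dBC) formula when A and B are single species.

```python
from itertools import combinations

def upgma(names, dist):
    """names: list of taxa; dist: dict frozenset({a, b}) -> distance.
    Returns a list of (merged_cluster, node_height) in merge order."""
    clusters = {name: 1 for name in names}  # cluster label -> number of taxa
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters separated by the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        new = (a, b)
        # Distance from the new cluster to each remaining cluster:
        # arithmetic mean over all member taxa (size-weighted average).
        for c in clusters:
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        # The new node sits at half the distance between the merged clusters.
        merges.append((new, d[frozenset((a, b))] / 2))
        clusters[new] = size_a + size_b
    return merges

# Hypothetical distances for four taxa A, B, C, D.
example_names = ["A", "B", "C", "D"]
example_pairs = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
                 ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0}
example_dist = {frozenset(p): v for p, v in example_pairs.items()}

for cluster, height in upgma(example_names, example_dist):
    print(cluster, height)
```

The last cluster in the returned list is the full tree as nested tuples, and each height is the depth of the corresponding internal node below the tips.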

3. THE NEIGHBOR-JOINING METHOD The Neighbor-Joining method begins by choosing the two most closely related sequences, and then adding the next most distant sequence as a third branch to the tree. Consider a tree with 3 sequences A, B, and C joined at a single internal node, where x, y, and z are the lengths of the branches leading to A, B, and C respectively.

THE NEIGHBOR-JOINING METHOD
     B    C
A    24   28
B         32
Simultaneous linear equations can be used to calculate the branch lengths: A to B: x + y = 24; A to C: x + z = 28; B to C: y + z = 32. Thus with 3 equations and 3 unknowns we can calculate that x = 10, y = 14, and z = 18.
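The three simultaneous equations have a closed-form solution: adding the two equations that contain a branch and subtracting the third isolates that branch, for example x = (dAB + dAC - dBC) / 2. A short sketch, with a function name chosen here purely for illustration:

```python
def three_taxon_branch_lengths(d_ab, d_ac, d_bc):
    """Solve x + y = d_ab, x + z = d_ac, y + z = d_bc for (x, y, z)."""
    x = (d_ab + d_ac - d_bc) / 2  # branch leading to A
    y = (d_ab + d_bc - d_ac) / 2  # branch leading to B
    z = (d_ac + d_bc - d_ab) / 2  # branch leading to C
    return x, y, z

# Distances from the slide: A-B = 24, A-C = 28, B-C = 32.
print(three_taxon_branch_lengths(24, 28, 32))  # (10.0, 14.0, 18.0)
```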

4. The Fitch / Margoliash method The Neighbor-Joining method attempts to build only one tree. However, the raw pairwise distances may not always be perfectly additive. Fitch and Margoliash showed that different sets of internal branch lengths could be obtained by considering alternate trees in which one or more branches are moved to different parts of the tree. Consider a distance matrix for 4 sequences with pairwise distances Dij.

The Fitch / Margoliash method

If we recalculate the pairwise distances dij from the tree, they are different from the original distances Dij. For each tree considered, a different matrix of distances dij will be generated. The best tree is defined as that tree which minimizes:
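The formula from this slide did not survive in the transcript. A commonly cited form of the Fitch / Margoliash criterion is the weighted sum of squared differences, the sum over all pairs i < j of (Dij - dij)^2 / Dij^2; the sketch below assumes that form, and the numbers in the example are hypothetical.

```python
def fitch_margoliash_score(observed, tree_derived):
    """Weighted least-squares misfit between observed distances Dij and
    tree-derived distances dij (assumed form: sum of (D - d)^2 / D^2)."""
    total = 0.0
    for pair, big_d in observed.items():
        small_d = tree_derived[pair]
        total += (big_d - small_d) ** 2 / big_d ** 2
    return total

# Hypothetical observed distances and the distances read back from one tree.
observed = {("A", "B"): 10.0, ("A", "C"): 20.0}
from_tree = {("A", "B"): 9.0, ("A", "C"): 22.0}
print(fitch_margoliash_score(observed, from_tree))
```

Under this assumption, the score is computed for each candidate tree and the tree with the lowest score is preferred; a tree that reproduces the observed distances exactly scores zero.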

Character Based Methods ► Maximum Parsimony (MP): character methods such as MP attempt to reconstruct the mutational events leading to the currently observed sequences. The most parsimonious tree is therefore the tree that requires the fewest mutational steps to explain the sequences at all of its nodes.
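One standard way to count the minimum number of mutational steps a given tree requires is Fitch's algorithm, which is not named in the slides and is used here as an assumption. The sketch below scores two alternative rooted binary trees for a small hypothetical alignment; trees are written as nested 2-tuples of leaf names.

```python
def fitch_site_cost(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.
    tree: leaf name (str) or a 2-tuple of subtrees; states: leaf -> base."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site_cost(left, states)
    right_set, right_cost = fitch_site_cost(right, states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: one substitution is required at this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Total minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site_cost(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

# Hypothetical aligned sequences for four taxa.
alignment = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "ACT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))  # 2 3
```

Here tree_1 needs two changes and tree_2 needs three, so tree_1 is the more parsimonious of the two.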

The output from the PHYLIP DNAPARS program lists 3 most parsimonious trees, one such tree is -

Maximum Likelihood (ML) The term maximum likelihood does not refer to a single statistical method, but rather to a general approach. ML methods in their simplest form begin by listing all possible models, and then calculating the probability that each model would generate the data actually observed. The model with the highest probability of generating the observed data is chosen as the best model.
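As a small illustration of this approach, the sketch below treats each candidate evolutionary distance between two sequences as a "model", computes the probability of the observed data under the Jukes-Cantor substitution model for each candidate, and keeps the candidate with the highest probability. The choice of Jukes-Cantor, the grid of candidates, and the counts in the example are assumptions made for illustration; real ML programs search over tree topologies and branch lengths.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-probability of two aligned sequences that differ at n_diff of
    n_sites sites, given distance d (substitutions per site) under Jukes-Cantor."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site keeps its base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    return (n_sites * math.log(0.25)
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_diff))

def ml_distance(n_sites, n_diff, step=0.001, max_d=2.0):
    """List candidate distances on a grid and return the most likely one."""
    candidates = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(candidates, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

# Hypothetical data: 10 differences in 100 aligned sites.
print(ml_distance(100, 10))
```

For this two-sequence case the grid search lands next to the closed-form Jukes-Cantor distance -(3/4) ln(1 - 4p/3), which is the exact maximum of the same likelihood.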

Maximum Likelihood (ML)

Methods of Phylogenetic Evaluation All phylogenetic trees represent hypotheses regarding the evolutionary history of the sequences that make up a data set. As with any good hypothesis, it is reasonable to ask two questions about how well it describes the underlying data: 1. How much confidence can be attached to the overall tree and its component parts, i.e. branches? 2. How much more likely is one tree to be correct than a particular or randomly chosen alternative tree?

Methods of Phylogenetic Evaluation It is important to remember that the output from a phylogenetic analysis is one answer obtained using one set of conditions. The input data may simply not be robust, i.e. the data itself may contain more noise than evolutionary signal. Two methods of phylogenetic evaluation are: 1. Jumbling Sequence Addition Order 2. Bootstrapping

Jumbling Sequence Addition Order The simplest way to test a phylogeny is to repeat the analysis several times with different addition orders. All PHYLIP programs and most other phylogeny programs have an option called JUMBLE, that uses a random number generator to choose which sequence to add at each step, rather than adding them in the order in which they appear in the file. It is important to remember the order in which sequences appear in a file. Non-random sequence order might introduce a bias into the data set. Therefore, even when doing only one run on a phylogeny, it is probably a good idea to jumble the order of sequences.
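The idea of jumbling can be sketched independently of any particular program: generate several random addition orders from a seeded random number generator and rerun the tree-building step once per order. The function below only produces the orders; its name and the seed value are illustrative assumptions, and it does not reproduce how PHYLIP implements its option.

```python
import random

def jumbled_orders(names, n_runs, seed=1):
    """Return n_runs random addition orders of the given sequence names."""
    rng = random.Random(seed)  # seeded so the runs can be reproduced
    orders = []
    for _ in range(n_runs):
        order = list(names)
        rng.shuffle(order)
        orders.append(order)
    return orders

for order in jumbled_orders(["A", "B", "C", "D"], n_runs=3):
    print(order)
```

If the same tree comes out for every order, the result does not depend on the order in which the sequences were added.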

Bootstrapping When sequences are short or polymorphism is minimal, we can have little confidence that the tree inferred from that data is the correct one. The more data there is, the less likely it is that an artifactual phylogeny will be produced. This method is based on the assumption that the statistical properties of a sample should be similar to the statistical properties of the population from which that sample was drawn. The larger the sample, the more representative it should be of the population.

Bootstrapping In a physical sense the process is equivalent to taking the printout of a multiple alignment; cutting it up into pieces, each of which contains a different column from the alignment; placing all those pieces into a bag; randomly reaching into the bag and drawing out a piece; copying down the information from that piece before returning it to the bag; and then repeating the drawing step until an artificial data set has been created that is as long as the original alignment. The whole process is repeated to create hundreds or thousands of resampled data sets, and portions of the inferred tree that have the same groupings in many of the repetitions are those that are especially well supported by the entire original data set.

Bootstrapping Bootstrap resampling is sampling with replacement. In the case of an MSA, sites are sampled at random until the data set is equal in length to the original alignment. In each of the bootstrapped replicates, most sites are sampled once, some are sampled twice, and a small number of sites are sampled three times. Some sites are never sampled. For bootstrap resampling of a sequence alignment, it is best to create at least 100 bootstrapped data sets and redo the phylogeny for each one. The one major disadvantage of bootstrap resampling is that it drastically increases the time required to construct a phylogeny.
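The resampling step can be sketched as below: columns of the alignment are drawn at random with replacement until the replicate is as long as the original, and the same drawn columns are used for every sequence so that each site stays intact. The alignment, function names, and seed are hypothetical; building a tree from each replicate is left to whichever method is being evaluated.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # Apply the same column choices to every sequence.
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}

def bootstrap_replicates(alignment, n_replicates=100, seed=1):
    """Create n_replicates bootstrapped data sets from one alignment."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]

# Hypothetical multiple sequence alignment.
msa = {"A": "ACGTAC", "B": "ACGTTC", "C": "ACCTAC"}
replicates = bootstrap_replicates(msa, n_replicates=100)
print(len(replicates), replicates[0])
```

Each replicate is then passed through the same tree-building procedure, and the fraction of replicates in which a grouping reappears is reported as its bootstrap support.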

Assumptions of multiple alignment process: All sequences are homologous. No duplicate sequences are present. In each column, amino acid residues are homologous. The alignment is optimal, with minimal gaps.
Assumptions of phylogenetic analysis process: All sequences are homologous. No duplicate sequences are present. In each column, amino acid residues are homologous. The alignment is optimal, with minimal gaps. No back mutation has occurred. All sequences are of the same length.