Multiple Alignment, Distance Estimation, and Phylogenetic Analysis


Overview
Topics: Database search (keyword, similarity); Conserved regions; Multiple alignment; Distance estimation; Phylogenetic reconstruction; Function prediction
Multiple alignment is used to:
• find oligonucleotide primers for PCR
• predict secondary and tertiary structures of new sequences
• detect similarity between new sequences and existing sequence families
• find diagnostic patterns to characterize protein families
April 2, 2004 BIOS816/VBMS818

Distance estimation
The simplest measure: the observed number of nucleotide differences per site (p)
p = nd / L
nd: the number of different nucleotides between the two sequences
L: the number of nucleotides compared
Example:
ACTGTAGGAATCGC
:X::X:X:::::::
AATGAAAGAATCGC
nd = 3, L = 14, p = 3/14 = 0.214
• It can be applied to both DNA and protein sequences
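The p-distance above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `p_distance` is hypothetical, and the sketch assumes the two sequences are already aligned and contain no gaps.

```python
def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of differing sites (p = nd / L) between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n_diff = sum(a != b for a, b in zip(seq1, seq2))  # nd
    return n_diff / len(seq1)                         # nd / L


# The pair of sequences from the slide: nd = 3, L = 14.
p = p_distance("ACTGTAGGAATCGC", "AATGAAAGAATCGC")
print(round(p, 3))  # 0.214
```

A real analysis would also need a rule for gapped or ambiguous sites (for example, skipping them), which this sketch leaves out.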

Distance estimation: hidden substitutions
Ancestral sequence: AATGTAGGAATCGC
Descendant sequences: ACTGTAGGAATCGC and AATGAAAGAATCGC
• No hidden substitution: each differing site changed once (single substitutions), giving 3 substitutions.
• Multiple hit: a site can pass through an intermediate state (G and C in the example, 2 substitutions each), so more substitutions occurred than can be seen.
Observed number of differences = 3 < Actual number of substitutions = 6

Effect of multiple substitutions (hits)
[Figure: divergence plotted against time, with two curves: actual divergence (actual number of substitutions) and observed divergence. Annotations along the time axis: Actual = Observed, then Actual > Observed, then Actual >> Observed.]

Distance estimation with multiple-hit corrections (nucleotide substitutions)
Jukes-Cantor method (one-parameter method)
• All substitutions are equally likely
• k = -(3/4) ln(1 - 4p/3)
  k: the expected number of nucleotide substitutions per site
  p: the proportion of nucleotide differences
Kimura’s 2-parameter method
• Transitions and transversions have different rates
• k = (1/2) ln[1/(1 - 2P - Q)] + (1/4) ln[1/(1 - 2Q)]
  P: the proportion of transitional (Ts) differences
  Q: the proportion of transversional (Tv) differences
[Figure: substitution-rate matrices over A, C, G, T for the two models]
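As a rough sketch of the two corrections, the Python below computes the Jukes-Cantor distance from p and the Kimura 2-parameter distance from a pair of aligned sequences. It uses the standard form of Kimura's formula, k = (1/2) ln[1/(1 - 2P - Q)] + (1/4) ln[1/(1 - 2Q)]. The function names are hypothetical, and the sketch assumes gap-free, upper-case DNA.

```python
import math

# Transitions are A<->G (purines) and C<->T (pyrimidines); all other
# differences are transversions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance k = -(3/4) ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("JC distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance from two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    length = len(seq1)
    n_ts = sum((a, b) in TRANSITIONS for a, b in zip(seq1, seq2))
    n_diff = sum(a != b for a, b in zip(seq1, seq2))
    P = n_ts / length             # proportion of transitional differences
    Q = (n_diff - n_ts) / length  # proportion of transversional differences
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("K2P distance is undefined for these P and Q")
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))


s1, s2 = "ACTGTAGGAATCGC", "AATGAAAGAATCGC"
print(round(jukes_cantor(3 / 14), 4))  # 0.2524
print(round(kimura_2p(s1, s2), 4))     # 0.2524
```

For this pair (one transition, two transversions in 14 sites) the two corrections happen to give the same value; both are larger than the uncorrected p = 0.214.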

Distance estimation with multiple-hit corrections (nucleotide substitutions)
k = -(3/4) ln(1 - 4p/3)
If p ≥ 0.75, the JC distance cannot be estimated.
[Figure: Jukes-Cantor distance k plotted against the uncorrected nucleotide difference p, with the line k = p for reference and p = 0.75 marked.]

Distance estimation with multiple-hit corrections (nucleotide substitutions)
There are many distance estimation methods based on different models.
1. More parameters:
• 1-parameter (Jukes-Cantor method)
• 2-parameter (Kimura’s 2-p method)
• 3, 4, 6, ... up to 12 parameters
2. Variation in base composition (A ≠ C ≠ G ≠ T):
• 1-p & base composition (Tajima & Nei or F81 method)
• 2-p & base composition (HKY85 or F84 method), etc.
3. Rate variation among sites: approximated by a gamma distribution
• CV (coefficient of variation of the rate): smaller CV means less variation
• 1-p & rate variation (Jin & Nei method), etc.
4. LogDet method:
• No constraints on parameters; base composition can vary among sequences
• Among-site rate variation cannot be considered

Distance estimation with multiple-hit corrections (nucleotide substitutions)
Which distance method should we choose?
Things to consider:
• Substitution pattern (e.g., Ts/Tv)
• Base composition bias
• Rate heterogeneity among sites
1. More parameters → more flexible, more realistic
2. More parameters → larger sampling errors (lower precision)
3. More parameters → more “undefined” distance problems (e.g., if p ≥ 0.75 in the JC method, k = -(3/4) ln(1 - 4p/3) becomes “undefined” or “infinite”)

Distance estimation with multiple-hit corrections (amino acid substitutions)
1. Poisson distance: k = -ln(1 - p)
   k: the expected number of amino acid substitutions per site
   p: the proportion of amino acid differences
2. Kimura’s distance: k = -ln(1 - p - 0.2p^2)
   • Approximation of the PAM distance below (accurate when p < 0.75)
   • Distance becomes infinite when p ≥ 0.8541
3. PAM distance & JTT distance
   • Distance based on the PAM or JTT amino acid substitution matrix
   • The JTT matrix is newer and based on a much larger protein sample
NOTE for ClustalW/ClustalX: its “Kimura’s distance” is a hybrid between Kimura’s and PAM distances
   p ≤ 0.75: use Kimura’s correction
   0.75 < p ≤ 0.93: use a conversion table with 0.001 interval (.75, .751, ...)
   p > 0.93: k = 10.0
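The first two amino acid corrections are simple enough to sketch directly. The function names below are hypothetical, and the PAM/JTT distances and the ClustalW conversion table are not reproduced here because they need the substitution matrices themselves.

```python
import math


def poisson_distance(p: float) -> float:
    """Poisson-corrected distance k = -ln(1 - p)."""
    if p >= 1.0:
        raise ValueError("Poisson distance is undefined for p >= 1")
    return -math.log(1.0 - p)


def kimura_protein_distance(p: float) -> float:
    """Kimura's protein distance k = -ln(1 - p - 0.2 p^2)."""
    arg = 1.0 - p - 0.2 * p * p
    if arg <= 0.0:  # happens once p reaches about 0.8541
        raise ValueError("Kimura distance is undefined for this p")
    return -math.log(arg)


print(round(poisson_distance(0.5), 4))         # 0.6931
print(round(kimura_protein_distance(0.5), 4))  # 0.7985
```

The guard in `kimura_protein_distance` corresponds to the point where 1 - p - 0.2p^2 reaches zero, which is where the slide says the distance becomes infinite.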

Phylogenetic reconstruction methods
Neighbor Joining (NJ)
• Data type: distance
• Optimality criterion: minimum evolution (ME; shortest total branch length)
• NJ does not search for the ME tree; it provides a simplified (approximate) algorithm to find it
Maximum Parsimony (MP)
• Data type: sequence (or other) data
• Optimality criterion: maximum parsimony (smallest number of evolutionary changes)
Maximum Likelihood (ML)
• Data type: sequence (or other) data
• Optimality criterion: maximum likelihood (highest probability of observing the data under a given tree and a given model of substitutions)
Speed: NJ is the fastest, ML the slowest.

Phylogenetic reconstruction: distance matrix methods
UPGMA (unweighted pair-group method with arithmetic mean)
Example: a distance matrix for 5 sequences.
      B     C     D     E
A    .53   .99  1.02   .82
B          .80   .93   .73
C                .65   .81
D                      .94
• The pair with the smallest distance is grouped (A and B first), and the matrix is reduced: A/B to C = .90, A/B to D = .98, A/B to E = .78
• Grouping is repeated (the slide goes on to form C/D and A/B/E, showing the further reduced values .94 and .86) until all of the sequences are clustered in a tree
• assumes all sequences evolve at the same rate
• generates a rooted tree
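The grouping procedure can be sketched as follows. This is an illustrative implementation rather than the one used by any particular program; the helper names are hypothetical, and the 5-sequence matrix is the one read off the slides (an assumption, since the slide layout was flattened in transcription). New distances are size-weighted averages of the two merged clusters, which is equivalent to averaging over all original pairs.

```python
from itertools import combinations

# Pairwise distances as read from the slides (assumed layout).
PAIRS = {
    ("A", "B"): 0.53, ("A", "C"): 0.99, ("A", "D"): 1.02, ("A", "E"): 0.82,
    ("B", "C"): 0.80, ("B", "D"): 0.93, ("B", "E"): 0.73,
    ("C", "D"): 0.65, ("C", "E"): 0.81, ("D", "E"): 0.94,
}


def upgma(names, pairs):
    """Return the list of (merged_label, merge_distance) in clustering order."""
    d = {frozenset(k): v for k, v in pairs.items()}
    size = {name: 1 for name in names}  # cluster label -> number of members
    merges = []
    while len(size) > 1:
        # Group the pair of clusters with the smallest distance.
        a, b = min(combinations(size, 2), key=lambda pr: d[frozenset(pr)])
        merged = a + "/" + b
        for other in size:
            if other in (a, b):
                continue
            # Arithmetic mean over all members of the two merged clusters.
            d[frozenset((merged, other))] = (
                size[a] * d[frozenset((a, other))]
                + size[b] * d[frozenset((b, other))]
            ) / (size[a] + size[b])
        merges.append((merged, d[frozenset((a, b))]))
        size[merged] = size.pop(a) + size.pop(b)
    return merges


for label, dist in upgma("ABCDE", PAIRS):
    print(label, round(dist, 3))
```

With this matrix the sketch groups A with B first, then C with D, then E with A/B, and finally joins the two remaining clusters.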

Phylogenetic reconstruction: distance matrix methods
NJ (neighbor joining method)
Example: a distance matrix for 5 sequences.
      B     C     D     E
A    .53   .99  1.02   .82
B          .80   .93   .73
C                .65   .81
D                      .94
1) Start with a star-like phylogeny.
2) The total length of the tree (the sum of the branch lengths) is estimated.
3) Find neighbors sequentially that minimize the total length of the tree.
• does not assume a constant rate
• generates an unrooted tree
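A sketch of the first neighbor-joining step is given below. It uses the Q criterion, Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r is the row sum of distances; this is the commonly used formulation that selects the same pair as minimizing the total tree length. The function name is hypothetical, the matrix is the one read off the slides (an assumption), and the later steps (branch lengths, matrix reduction, iteration) are omitted.

```python
from itertools import combinations

# Pairwise distances as read from the slides (assumed layout).
PAIRS = {
    ("A", "B"): 0.53, ("A", "C"): 0.99, ("A", "D"): 1.02, ("A", "E"): 0.82,
    ("B", "C"): 0.80, ("B", "D"): 0.93, ("B", "E"): 0.73,
    ("C", "D"): 0.65, ("C", "E"): 0.81, ("D", "E"): 0.94,
}


def nj_first_pair(names, pairs):
    """Return the pair of taxa joined first by neighbor joining."""
    d = {frozenset(k): v for k, v in pairs.items()}
    n = len(names)
    # r(i): sum of distances from taxon i to every other taxon.
    row_sum = {
        i: sum(d[frozenset((i, j))] for j in names if j != i) for i in names
    }

    def q(pair):
        i, j = pair
        return (n - 2) * d[frozenset(pair)] - row_sum[i] - row_sum[j]

    return min(combinations(names, 2), key=q)


print(nj_first_pair("ABCDE", PAIRS))
```

Note that with this matrix NJ joins C and D first, even though A and B have the smallest raw distance (the pair UPGMA starts with); the row-sum terms correct for unequal rates.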

Rooted trees vs. unrooted trees
[Figure: rooted trees of A, B, C, D drawn along a time axis with the root marked, next to unrooted trees of the same four sequences; the final panel marks an outgroup used to place the root.]

Phylogenetic reconstruction: bootstrap analysis
Used to estimate the confidence level of phylogenetic hypotheses.
1. Start from the multiple alignment (sequences S1 to S5, sites 1 to 8). Each column is treated as an independent sample.
2. Resample the columns with replacement many times (~1000 replications). Three example resamplings:
   Sites:  14614853    74761232    85851124
   S1      GGAGGTTA    CGCAGCAC    TTTTGGCG
   S2      GGAGGTTA    CGCAGTAT    TTTTGGTG
   S3      AAAAACTA    CACAACAC    CTCTAACA
   S4      AAAAATCA    CACAACAC    TCTCAACA
   S5      GGAGGTCG    TGTAGCGC    TCTCGGCG
3. A phylogeny is reconstructed from each pseudoreplicate.
4. Bootstrap support (%) for each grouping is the percentage of replicate trees that contain it (100, 100, and 40 in the example tree of S1 to S5).
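The column resampling step can be sketched as below. The 5 x 8 alignment is not printed legibly in the transcript; the one used here is inferred from the three pseudoreplicates and their site indices shown on the slide, so treat it as an assumption. The function names are hypothetical.

```python
import random

# Original alignment, inferred from the slide's pseudoreplicates (assumption).
ALIGNMENT = {
    "S1": "GCAGTACT",
    "S2": "GTAGTACT",
    "S3": "ACAATACC",
    "S4": "ACAACACT",
    "S5": "GCGGCATT",
}


def resample(alignment, columns):
    """Build a pseudoreplicate from 1-based column indices."""
    return {
        name: "".join(seq[c - 1] for c in columns)
        for name, seq in alignment.items()
    }


def random_columns(n_sites, rng):
    """Draw n_sites column indices with replacement."""
    return [rng.randint(1, n_sites) for _ in range(n_sites)]


# Reproduce the first pseudoreplicate from the slide (sites 1,4,6,1,4,8,5,3).
first = resample(ALIGNMENT, [1, 4, 6, 1, 4, 8, 5, 3])
print(first["S1"])  # GGAGGTTA

# A random pseudoreplicate; the seed value is arbitrary.
rng = random.Random(2004)
replicate = resample(ALIGNMENT, random_columns(8, rng))
print(replicate)
```

In a full analysis each pseudoreplicate would be passed to the distance and tree-building steps, and the resulting trees summarized as a consensus with support values.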

Phylogenetic reconstruction by ClustalX/W & Phylip
Phylogeny programs:
• ClustalW/ClustalX (DNA/protein distance, NJ)
• Phylip3.5 & Phylip3.6a3 (standalone, web-interface)
• PAUP (also included in GCG)
Visualization: Phylip (treegram, etc.), TreeView, PAUP, NJplot
More phylogeny programs: http://evolution.genetics.washington.edu/phylip/software.html

Phylip programs
• Bootstrap: seqboot (sequence data)
• DNA distance: dnadist (nucleotide sequence data)
• Protein distance: protdist (amino acid sequence data)
• Neighbor joining: neighbor (distance matrix)
• Consensus tree: consense (tree file)
• Tree drawing: drawgram, drawtree, retree (treefile)
Differences between versions:
Phylip3.5
• Input file: infile
• Output file: outfile, treefile
• dnadist: Kimura, Jin/Nei, ML (F84), JC
• protdist: Kimura, PAM (Dayhoff)
Phylip3.6a3
• Input file: infile, intree
• Output file: outfile, outtree
• dnadist: F84, Kimura, JC, LogDet (rate variation can be incorporated with all but LogDet)
• protdist: Kimura, PAM (Dayhoff), JTT

Phylogenetic reconstruction by ClustalW & Phylip
Web servers:
• Bioinformatics Core Facility Web server (Phylip3.5): http://biocore.unl.edu/WEBPHYLIP
• Bioinformatics Web: IU Center for Genomics & Bioinformatics (Phylip3.5): http://sunflower.bio.indiana.edu/bioweb/phylip/index.html
• Institut Pasteur, Biological Software list (Phylip3.6a3): http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
Phylip download site (Windows, Macintosh, Linux/Unix):
• Phylip3.5: http://evolution.genetics.washington.edu/phylip/getme.html
• Phylip3.6b: http://evolution.genetics.washington.edu/phylip/phylip36.html
TreeView download site (Windows, Macintosh, Linux/Unix): http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
NJPlot download site (Windows, Macintosh, Linux/Unix): http://pbil.univ-lyon1.fr/software/njplot.html

Multiple Alignment by CLUSTALW
• Bioinformatics Core Facility Web server: http://biocore.unl.edu/Pise/5.a/clustalw.html
• Bioinformatics Web: IU Center for Genomics & Bioinformatics: http://sunflower.bio.indiana.edu/bioweb/seqanal/interfaces/clustalw.html
• Institut Pasteur, Biological Software list: http://bioweb.pasteur.fr/seqanal/interfaces/clustalw.html
• EMBL-EBI ClustalW Form: http://www.ebi.ac.uk/clustalw
• ClustalX FTP site (Windows, Macintosh, Linux/Unix): ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/

ClustalW/Phylip Exercise
1. Download the two sample data files from the course web site: http://bioinfolab.unl.edu/unlbioinfo/docs/bios816/spring_2004/
   bglobin.seq - protein sequences
   Dloop.seq - DNA sequences
   [Use either the DOS format or the non-DOS format, whichever works for you.]
   Note: These sequences are in FASTA format.
>HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
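For readers who want to inspect such files locally, a minimal FASTA reader might look like the sketch below. The function name `parse_fasta` and the two short records are hypothetical (they are not the course data). Because it relies on `str.splitlines`, it accepts both DOS (CRLF) and non-DOS line endings.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():  # handles both LF and CRLF line endings
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()  # header line without the '>' marker
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-record example with a wrapped sequence and DOS line endings.
example = ">seq_one\r\nVHLTPEEK\r\nSAVTAL\r\n>seq_two\r\nVQLSGEEK\r\n"
for header, sequence in parse_fasta(example).items():
    print(header, len(sequence), sequence)
```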

ClustalW/Phylip Exercise (continued)
2. Go to this ClustalW web site: http://bioweb.pasteur.fr/seqanal/interfaces/clustalw.html
3. Enter your email address.
4. Copy and paste the bglobin.seq data.
5. Check “Phylip alignment output format”.
6. Check the available options.
7. Click the “Run clustalw” button to start the alignment. Wait until the page changes to the results page.
8. Click the “infile.phy” link to open the multiple alignment in Phylip format.
9. From the pull-down menu, choose the “protdist” program.
10. Click the “Run the selected program on infile.phy” button to start protdist.
11. Choose “Jones-Taylor-Thornton (JTT) matrix” as the distance model.
12. Check the “Perform a bootstrap ...” option. Enter a “Random number seed”. Enter 10 as the number of replicates.
    Note: For a real analysis, use 500 or more replicates (~1000).
13. Click the “Run protdist” button to run the program. Wait until the page changes to the results page.
14. Click the “outfile” link to check the distance matrix file you generated.
15. From the pull-down menu, choose the “neighbor” program.
16. Click the “Run the selected program on outfile” button to start neighbor.
17. Choose “Neighbor-joining” as the distance method.
18. Check “Randomize (jumble) input order”. Enter a “Random number seed”.
    Note: Using the randomization option slows down the program, but it is usually a good idea to use it to avoid artifacts of input order.
19. Check “Analyze multiple data set”. Enter the number of data sets (10 for this example). Check “Compute a consense tree”.
    Note: If you are not doing bootstrap analysis, you do not have to check “Analyze multiple data set”.
20. Click “Run neighbor” to run the program. Wait until the page changes to the results page.
21. Click “outfile.consense” to open the output file.
22. Click “outtree.consense” to open the output file.
    Note: The numbers after the taxon names are not branch lengths but bootstrap values.
23. Click “outtree” to open the output file. This file contains 10 trees based on the bootstrapped alignments. Save the first tree in a file to use for the TreeView demonstration.
    Note: We are using this tree only as an example. For a real analysis, build the NJ tree from the original multiple alignment without bootstrapping: in item 12, uncheck the “Perform a bootstrap ...” option.

TreeView Exercise
24. Find the “TreeView” software on your machine and start the program.
25. From the File menu, open the tree file you saved.
26. Click the different tree icons to change the phylogeny format.
    Question: Which format shows different branch lengths?
27. From the Tree menu, select “Define outgroup”. Choose one sequence as an outgroup.
28. From the Tree menu, select “Root with outgroup”.
29. From the Edit menu, select “Edit tree”. Check how you can edit your tree.
The assignment from my lectures (Assignment #4) is found on the Blackboard Assignment page.