Phylogenetic trees School B&I TCD Bioinformatics May 2010.

Why do trees?

Phylogeny 101
OTUs: operational taxonomic units (species, populations, individuals)
Nodes, internal: often ancestors
Nodes, external: terminal, often living species or individuals
Branches, length scaled: length proportional to evolutionary distance
Branches, length unscaled: nominal, arbitrary
Outgroup: an OTU that is most distantly related to all the other OTUs in the study. Choose the outgroup carefully.

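Phylogeny 102
Trees rooted: $N = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
Trees unrooted: $N = \dfrac{(2n-5)!}{2^{n-3}\,(n-3)!}$
where $n$ is the number of OTUs.
Table of OTUs against # rooted trees and # unrooted trees: the counts grow explosively, with entries of order $10^{6}$ and $8 \times 10^{21}$.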
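The two counting formulas can be evaluated directly. The following is a minimal sketch in Python; the function names are illustrative and not from the slides, and the formulas are assumed to apply to strictly bifurcating trees with labelled tips.

```python
from math import factorial


def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n labelled OTUs (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n labelled OTUs (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


# Print a small table: OTUs, # rooted trees, # unrooted trees.
for n in (4, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

For 4 OTUs this gives 15 rooted and 3 unrooted trees. Adding a root is equivalent to adding one more tip, so the unrooted count for n+1 OTUs equals the rooted count for n OTUs.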

Four key aspects of a tree: topology, branch lengths, root, confidence
[Figure: a basic tree of taxa A, B, C and D, redrawn to illustrate each aspect]

Distances from sequence
Use Phylip Protdist or DNAdist
D = non-identical residues / total sequence length
Correction for multiple hits is necessary because repeated substitutions at the same site are seen as at most one difference
Jukes-Cantor: assumes all substitutions are equally likely
Kimura: transition rate ≠ transversion rate; Ts usually > Tv
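The distance measures named on this slide can be sketched as follows. This is an illustration only, not the Phylip implementation: it assumes two ungapped DNA sequences of equal length, uses the standard Jukes-Cantor and Kimura two-parameter correction formulas, and the function names are hypothetical.

```python
from math import log

PURINES = {"A", "G"}


def p_distance(s1: str, s2: str) -> float:
    """Uncorrected distance D: non-identical residues / sequence length."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor correction: all substitutions assumed equally likely."""
    return -0.75 * log(1 - 4 * p / 3)


def kimura_2p(s1: str, s2: str) -> float:
    """Kimura two-parameter distance: transitions and transversions
    are counted separately because their rates differ."""
    n = len(s1)
    diffs = [(a, b) for a, b in zip(s1, s2) if a != b]
    # A transition stays within purines (A<->G) or within pyrimidines (C<->T).
    ts = sum((a in PURINES) == (b in PURINES) for a, b in diffs)
    tv = len(diffs) - ts
    P, Q = ts / n, tv / n
    return -0.5 * log(1 - 2 * P - Q) - 0.25 * log(1 - 2 * Q)


seq1, seq2 = "AAGAGTGCA", "AGCCGTGCG"
p = p_distance(seq1, seq2)
print(p, jukes_cantor(p), kimura_2p(seq1, seq2))
```

Both corrected distances come out larger than the raw proportion of differences, which is the point of correcting for multiple hits.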

Methods
Distance matrix
–UPGMA
–Neighbour joining (NJ)
Maximum parsimony (MP)
–Tree requiring fewest changes
Maximum likelihood (ML)
–Most likely tree
Bayesian: sort of ML
–Samples a large number of “pretty good” trees

Trees: NJ
Distance-matrix method
Neighbour joining is very fast
Often gives a “good enough” tree
Embedded in ClustalW

Trees: MP
Maximum parsimony: the minimum number of mutations needed to construct the tree
Better than NJ, because information is lost in a distance matrix, but much slower
Sensitive to long-branch attraction
–Long branches are clustered together
No explicit evolutionary model
Protpars refuses to estimate branch lengths
Based on informative sites

Long-branch attraction
Rodents evolve faster than primates
[Figure: the true tree and the false “LBA” tree for MusHBA, MusHBB, HumHBA and HumHBB; in the LBA tree the long mouse branches cluster together]

Maximum parsimony
Site:  1 2 3 4 5 6 7 8 9
OTU1   A A G A G T G C A
OTU2   A G C C G T G C G
OTU3   A G A T A T C C A
OTU4   A G A G A T C C G
               *   *   *
Here we have a representative alignment. It is a good alignment, clearly aligning homologous sites without gaps. We want to determine the phylogenetic relationships among the OTUs.

There are 3 possible unrooted trees for 4 taxa (OTUs): (1,2)(3,4), (1,3)(2,4) and (1,4)(2,3).
Aim to identify (phylogenetically) informative sites and use these to determine which tree is most parsimonious.
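Whether a site is informative can be checked mechanically. The sketch below assumes the usual definition, which the slides illustrate only by example: a site is parsimony-informative when at least two different states each occur in at least two OTUs. The function names are illustrative.

```python
from collections import Counter

# The four-OTU alignment from the maximum parsimony slide.
ALIGNMENT = {
    "OTU1": "AAGAGTGCA",
    "OTU2": "AGCCGTGCG",
    "OTU3": "AGATATCCA",
    "OTU4": "AGAGATCCG",
}


def is_informative(column) -> bool:
    """True if at least two states each appear in at least two OTUs."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(alignment) -> list:
    """Return the 1-based positions of the informative sites."""
    columns = zip(*alignment.values())
    return [i + 1 for i, col in enumerate(columns) if is_informative(col)]


print(informative_sites(ALIGNMENT))  # -> [5, 7, 9]
```

The result matches the three sites marked with asterisks in the alignment.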

Sites 1, 6 and 8 are identical in all OTUs, so they are useless for phylogenetic purposes.

Site 2 is also useless: OTU1’s A could be grouped with any of the Gs.

Site 4 is uninformative as each base is different, UNLESS transitions are weighted, in which case it supports (1,4)(2,3).

For site 3 each tree can be made with a minimum of 2 mutations:

(1,2)(3,4)
G       A     G       A     G       A
 \     /       \     /       \     /
  G---A         C---A         A---A
 /     \       /     \       /     \
C       A     C       A     C       A

(1,3)(2,4)         can do worse:
G       C          G       C
 \     /            \     /
  A---A              G---A
 /     \            /     \
A       A          A       A

(1,4)(2,3)
G       C
 \     /
  A---A
 /     \
A       A
So site 3 is (counterintuitively) NOT informative.

Site 5, however, is informative because one tree is shortest.

(1,2)(3,4)      (1,3)(2,4)      (1,4)(2,3)
G       A       G       G       G       G
 \     /         \     /         \     /
  G---A           A---A           G---G
 /     \         /     \         /     \
G       A       A       A       A       A

Likewise sites 7 and 9. By majority rule the most parsimonious tree is (1,2)(3,4), supported by 2 of the 3 informative sites.
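The site-by-site reasoning above can be checked by computing the total number of changes each tree requires. This is a minimal sketch using Fitch's set-intersection counting, restricted to four OTUs; Fitch's method is swapped in here for the hand-drawn reconstructions on the slides, and all names are illustrative.

```python
ALIGNMENT = {
    "OTU1": "AAGAGTGCA",
    "OTU2": "AGCCGTGCG",
    "OTU3": "AGATATCCA",
    "OTU4": "AGAGATCCG",
}

# Each unrooted four-OTU tree is written as its two pairs of neighbours.
TOPOLOGIES = {
    "(1,2)(3,4)": (("OTU1", "OTU2"), ("OTU3", "OTU4")),
    "(1,3)(2,4)": (("OTU1", "OTU3"), ("OTU2", "OTU4")),
    "(1,4)(2,3)": (("OTU1", "OTU4"), ("OTU2", "OTU3")),
}


def fitch_join(x: set, y: set):
    """Fitch step: keep shared states at no cost, else take the union
    and count one mutation."""
    common = x & y
    return (common, 0) if common else (x | y, 1)


def site_length(states: dict, topology) -> int:
    """Minimum number of mutations one site needs on one tree."""
    (a, b), (c, d) = topology
    left, cost_left = fitch_join({states[a]}, {states[b]})
    right, cost_right = fitch_join({states[c]}, {states[d]})
    _, cost_middle = fitch_join(left, right)
    return cost_left + cost_right + cost_middle


def tree_length(alignment: dict, topology) -> int:
    """Total mutations over all sites for one tree."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        site_length({name: seq[i] for name, seq in alignment.items()}, topology)
        for i in range(n_sites)
    )


lengths = {label: tree_length(ALIGNMENT, t) for label, t in TOPOLOGIES.items()}
print(lengths)  # -> {'(1,2)(3,4)': 10, '(1,3)(2,4)': 11, '(1,4)(2,3)': 12}
```

The shortest tree, (1,2)(3,4), is the one favoured by two of the three informative sites, so the majority-rule shortcut and the full count agree here.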

Protpars infile:
BRU MSQNSLRLVE DNSV-DKTKA LDAALSQIER
RLR V-DKSKA LEAALSQIER
NGR MSD-DKSKA LAAALAQIEK
ECO AIDE-NKQKA LAAALGQIEK
YPR M AIDE-NKQKA LAAALGQIEK
PSE MDD-NKKRA LAAALGQIER
TTH MEE-NKRKS LENALKTIEK
ACD MDEPGGKIE FSPAFMQIEG

Protpars treefile:
(((((ACD,TTH),(PSE,(YPR,ECO))),NGR),RLR),BRU);
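The treefile is a nested-parenthesis (Newick) string, so its groupings can be read programmatically. The following is a minimal sketch of a recursive parser for strings like this one, with tip names only and no branch lengths; it is not a full Newick parser and the function names are illustrative.

```python
def parse_newick(text: str):
    """Parse a tip-names-only Newick string into nested tuples."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and final ';'
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]  # a tip name

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def tips(node) -> list:
    """List the tip names under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in tips(child)]


tree = parse_newick("(((((ACD,TTH),(PSE,(YPR,ECO)) ),NGR),RLR),BRU);")
print(tree)
print(tips(tree))
```

The nested tuples mirror the parentheses, so `tree[0]` holds everything except BRU, and `tips` recovers the eight taxon labels in file order.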

outfile:
One most parsimonious tree found:
[Tree diagram with tips ACD, TTH, PSE, YPR, ECO, NGR, RLR and BRU]
remember: this is an unrooted tree!
requires a total of steps

Clustalw
****** PHYLOGENETIC TREE MENU ******
1. Input an alignment
2. Exclude positions with gaps? = ON
3. Correct for multiple substitutions? = ON
4. Draw tree now
5. Bootstrap tree
6. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu

Trees: general guidelines – NOT rules
More data is better
Excellent alignment = few informative sites
Exclude unreliable data – toss all gaps?
Use seqs/sites evolving at an appropriate rate
–Phylip DISTANCE
–3rd positions saturated
–2nd positions invariant
–Fast-evolving seqs for closely related taxa
–Eliminate transitions – homoplasy

Trees
Beware base composition bias in unrelated taxa
Are sites (hairpins?) independent?
Are substitution rates equal across the dataset?
Long branches are prone to error – remove them?
–Choose outgroup carefully

Bootstrapping

Random re-sampling of the data
–with replacement
The MSA stays the same
Each column of aligned residues in the MSA is a “site”
The sites are what is re-sampled

Bootstrap 2
Having resampled the data
–to get a new dataset/alignment
–based on the original
–the same length
Redraw the tree from that dataset
For each node
–ask: is this node retained in the tree from the resampled data?
Repeat 100, 1000 or 10,000 times

Bootstrap dataset
Site:  1 2 3 4 5 6 7 8 9
OTU1   A A G A G T G C A
OTU2   A G C C G T G C G
OTU3   A G A T A T C C A
OTU4   A G A G A T C C G
               *   *   *
4 OTUs and 9 “sites”
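The bootstrap procedure can be sketched on this dataset. The code below is an illustration under stated assumptions: each pseudosample is scored by Fitch parsimony over the three possible four-OTU trees (the slides do not prescribe a tree-building method for this step), a tie between trees counts as support for none, and all names are hypothetical.

```python
import random
from collections import Counter

ALIGNMENT = {
    "OTU1": "AAGAGTGCA",
    "OTU2": "AGCCGTGCG",
    "OTU3": "AGATATCCA",
    "OTU4": "AGAGATCCG",
}

# The three possible unrooted trees, each written as two pairs of neighbours.
SPLITS = {
    "(1,2)(3,4)": (("OTU1", "OTU2"), ("OTU3", "OTU4")),
    "(1,3)(2,4)": (("OTU1", "OTU3"), ("OTU2", "OTU4")),
    "(1,4)(2,3)": (("OTU1", "OTU4"), ("OTU2", "OTU3")),
}


def resample_columns(alignment, rng):
    """Draw as many sites as the original has, with replacement."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def tree_length(alignment, split):
    """Fitch parsimony length of one four-OTU tree, summed over all sites."""
    (a, b), (c, d) = split
    total = 0
    for w, x, y, z in zip(alignment[a], alignment[b], alignment[c], alignment[d]):
        left = ({w} & {x}) or ({w} | {x})
        right = ({y} & {z}) or ({y} | {z})
        total += (w != x) + (y != z) + (not (left & right))
    return total


def best_split(alignment):
    """Return the uniquely shortest tree, or None if there is a tie."""
    lengths = {label: tree_length(alignment, s) for label, s in SPLITS.items()}
    shortest = min(lengths.values())
    winners = [label for label, value in lengths.items() if value == shortest]
    return winners[0] if len(winners) == 1 else None


def bootstrap_support(alignment, replicates=1000, seed=0):
    """Fraction of pseudosamples in which each tree is recovered."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        winner = best_split(resample_columns(alignment, rng))
        if winner is not None:
            counts[winner] += 1
    return {label: counts[label] / replicates for label in SPLITS}


print(bootstrap_support(ALIGNMENT))
```

Each pseudosample has the same 4 OTUs and 9 sites, but some original columns appear several times and others not at all. The reported fractions play the role of the bootstrap values written on the nodes of a tree.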

What do the little numbers mean?

Why does it work?
The tree based on the real data is the best tree: the best estimate of what happened in evolution.
If a node is based on many bits of info then some of these will be resampled.
If the node is based on a single site then that site will often be missing from the resampled data, so we are less confident in that node.