Molecular Phylogenetics (part 2 of 2)

Computational Biology Course João André Carriço

Neighbor-joining

Neighbor joining algorithm
Similar to UPGMA, but allows unequal rates of evolution on different branches of the tree.
If the input distance matrix is correct, i.e. an accurate reflection of the real tree, the inferred tree will be the real tree.
Fast algorithm: it runs in polynomial time, O(n^3), which makes it suitable for large datasets.
Variants such as BIONJ and Weighbor improve accuracy by making use of the fact that shorter distances are better known than longer distances.
The algorithm always starts with an unrooted tree with a star-like topology and all branches of unknown length.

Neighbor joining algorithm
The algorithm always starts with an unrooted tree with a star-like topology and all branches of unknown length.
Yang, Z. & Rannala, B., Molecular phylogenetics: principles and practice. Nature Reviews Genetics, 13(5), pp.303–314.

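Neighbor joining algorithm
M(U2,F) = ... = -11
Choose one of these (C,U1 here)
Cycle 2 example (A and B have been joined into node U1):
D(C,U1) = (D(A,C) + D(B,C) - D(A,B)) * 0.5 = (4 + 7 - 5) * 0.5 = 3
D(D,U1) = (D(A,D) + D(B,D) - D(A,B)) * 0.5 = (7 + 10 - 5) * 0.5 = 6
...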
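The distance update in the cycle 2 example can be checked with a few lines of Python. This is a minimal sketch, not the course's implementation: `distance_to_new_node` reproduces the formula quoted on the slide, while `select_pair` assumes the commonly used selection criterion M(i,j) = d(i,j) - (r_i + r_j)/(n - 2), where r_i is the sum of distances from taxon i to all other taxa. The four-taxon matrix `toy` is hypothetical.

```python
from itertools import combinations


def distance_to_new_node(d_ik, d_jk, d_ij):
    """Distance from taxon k to the node that replaces the joined pair (i, j)."""
    return 0.5 * (d_ik + d_jk - d_ij)


def select_pair(dist, taxa):
    """Return the pair with the smallest M value and the full M table.

    Assumed criterion: M(i,j) = d(i,j) - (r_i + r_j) / (n - 2).
    """
    n = len(taxa)

    def d(i, j):
        return dist[(i, j)] if (i, j) in dist else dist[(j, i)]

    r = {i: sum(d(i, k) for k in taxa if k != i) for i in taxa}
    m = {(i, j): d(i, j) - (r[i] + r[j]) / (n - 2)
         for i, j in combinations(taxa, 2)}
    return min(m, key=m.get), m


# Distances quoted in the cycle 2 example (A and B already joined into U1).
d_C_U1 = distance_to_new_node(4, 7, 5)   # uses D(A,C), D(B,C), D(A,B)
d_D_U1 = distance_to_new_node(7, 10, 5)  # uses D(A,D), D(B,D), D(A,B)
print(d_C_U1, d_D_U1)

# Hypothetical additive matrix for the unrooted tree ((A:1,B:2):1,(C:3,D:4)).
toy = {("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 6,
       ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 7}
pair, m_table = select_pair(toy, ["A", "B", "C", "D"])
print(pair, m_table[pair])
```

In the toy matrix the pairs (A,B) and (C,D) tie for the smallest M value, which mirrors the "choose one of these" situation on the slide; `min` simply returns the first of the tied pairs.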

Rooted form Unrooted form

Disadvantages of NJ:
Information is reduced (the method works on a distance matrix, not on the sequences themselves).
It gives only one tree (out of several possible trees).
The resulting tree depends on the model of evolution used.
It has the undesirable feature that it often assigns negative lengths to some of the branches.

Maximum Parsimony: Occam’s Razor
William of Ockham (c. 1285–1349): "Pluralitas non est ponenda sine necessitate", i.e. "Plurality is not to be posited without necessity".
Occam’s Razor: among competing hypotheses, the one that makes the fewest assumptions should be selected.
Also known as the law of parsimony, economy or succinctness.
In phylogeny, the tree that requires the least evolutionary change is the preferred explanation of the observed data.

Maximum Parsimony: Identification of parsimony informative sites
An informative site is a position in the aligned set of sequences at which:
there are at least two different character states, and
at least two of those states each occur in at least two of the sequences.
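The definition above translates directly into a column-wise check. The following is a minimal sketch in Python; the four-sequence alignment is hypothetical, and gap or ambiguity characters are not treated specially.

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(alignment):
    """Return the 0-based indices of parsimony-informative columns."""
    return [i for i, col in enumerate(zip(*alignment))
            if is_parsimony_informative(col)]


# Hypothetical alignment: only columns 1 and 2 have two states seen twice each.
alignment = [
    "AAGCT",
    "AAGTT",
    "ACTCT",
    "ACTGT",
]
print(informative_sites(alignment))
```

Column 3 varies (C, T, C, G) but only one state occurs twice, so it cannot favour one tree over another and is not counted.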

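Maximum Parsimony
[Figure: candidate trees for four taxa (a, b, c, d), scored by the number of changes each requires. Tree 1: 1 + 1 + 2 = 4 changes. Tree 2: alternative grouping of the same taxa.]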
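Counting the changes a tree requires, site by site, can be sketched with Fitch's algorithm. This is an illustrative example only: the slide does not say which counting method it used, and the alignment and the two trees below are hypothetical. Trees are written as nested tuples with taxon names at the leaves.

```python
def fitch(tree, states):
    """Return (state_set, changes) for one site on a nested-tuple binary tree."""
    if isinstance(tree, str):          # leaf: its observed state, no change
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:                          # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        states = {taxon: seq[site] for taxon, seq in alignment.items()}
        total += fitch(tree, states)[1]
    return total


# Hypothetical three-site alignment for taxa a, b, c, d.
alignment = {"a": "AAG", "b": "AAT", "c": "GCG", "d": "GCT"}
tree_ab = (("a", "b"), ("c", "d"))
tree_ac = (("a", "c"), ("b", "d"))
print(parsimony_score(tree_ab, alignment), parsimony_score(tree_ac, alignment))
```

Under maximum parsimony the tree with the lower score would be preferred; for this toy data that is the tree grouping a with b.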

Maximum Parsimony
Computationally intensive. Finding the most parsimonious tree is an NP-hard problem, so most implementations rely on heuristics or search-space pruning to find the best tree: branch and bound algorithm, Sankoff-Morel-Cedergren algorithm, etc.
How many trees are there?
Two counts are distinguished: labeled rooted bifurcating trees and labeled unrooted bifurcating trees. Number of labeled rooted bifurcating trees:
Species    #trees
3          3
4          15
10         34,459,425
...        ... x 10^38
...        ... x 10^57
...        ... x 10^76
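The growth in the number of trees follows a double factorial: for n labeled taxa there are (2n - 3)!! rooted and (2n - 5)!! unrooted bifurcating trees. A short sketch to reproduce the counts (function names are illustrative):

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_rooted_trees(n):
    """Labeled rooted bifurcating trees for n >= 2 taxa: (2n - 3)!!"""
    return double_factorial(2 * n - 3)


def num_unrooted_trees(n):
    """Labeled unrooted bifurcating trees for n >= 3 taxa: (2n - 5)!!"""
    return double_factorial(2 * n - 5)


for n in (3, 4, 10):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

Because every extra taxon multiplies the count by a growing odd number, exhaustive search stops being feasible after a few tens of taxa, which is why heuristics are used.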

Maximum Parsimony
Appropriate taxon sampling is needed: not only the ratio of taxa to characters (informative sites) is relevant, but also the combination of character states that new taxa can bring.
A major weakness of parsimony analysis is the long branch attraction problem: highly divergent taxa on long branches can end up clustered together even when they are not closely related, because changes that arose independently along each long branch are counted as shared similarity.

Maximum Likelihood methods
Statistical approach.
The “best” tree is the one with the highest probability of producing the observed character data, assuming a particular model of how characters change over time: L = Prob(Data | Tree).
L is the product of the likelihoods of the individual sites in the sequence alignment, so log L is calculated by summing the per-site log-likelihoods.
An explicit probabilistic model of character state change in the MSA (multiple sequence alignment) is needed; the data is the MSA.
Very computationally intensive: these methods do not scale well to large datasets (thousands of taxa).
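To make the per-site calculation concrete, here is a minimal sketch for the simplest possible "tree": two aligned sequences separated by a single branch of length d, under the Jukes-Cantor (JC69) model. The model choice, the two sequences and the grid search are assumptions made for illustration; real programs optimise over tree topologies and many branch lengths.

```python
import math


def jc69_log_likelihood(seq1, seq2, d):
    """Log-likelihood of two aligned sequences at distance d under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    log_l = 0.0
    for a, b in zip(seq1, seq2):
        site_likelihood = 0.25 * (p_same if a == b else p_diff)
        log_l += math.log(site_likelihood)   # sum of per-site log-likelihoods
    return log_l


seq1 = "ACGTACGTAC"
seq2 = "ACGTACGTTT"   # hypothetical: 2 differences in 10 sites

# Crude maximum-likelihood estimate of d by grid search (step 0.001).
grid = [i / 1000 for i in range(1, 2001)]
d_hat = max(grid, key=lambda d: jc69_log_likelihood(seq1, seq2, d))

# Closed-form JC69 estimate for a proportion p of differing sites.
p = 0.2
d_analytic = -0.75 * math.log(1 - 4 * p / 3)
print(d_hat, round(d_analytic, 4))
```

The grid search and the closed-form estimate agree to within the grid resolution, which is a useful sanity check on the likelihood function.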

Bayesian inference methods
Bayesian phylogenetic reconstruction methods.
Similar to Maximum Likelihood, but they try to calculate the posterior probability, Prob(Tree | Data).
Instead of calculating the probability that the tree could have generated the data, Bayesian methods aim to calculate the probability of the hypothesis (the tree) given the data, which requires assigning prior distributions to the tree and the model parameters.
Computationally challenging: Markov Chain Monte Carlo simulations are used to scan through the tree and parameter space.
They do not scale up to large datasets.
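The relationship between likelihood, prior and posterior is Bayes' theorem: Prob(Tree | Data) = Prob(Data | Tree) x Prob(Tree) / Prob(Data). A toy sketch with three candidate trees shows the arithmetic; the likelihood values and the flat prior are hypothetical. With realistic numbers of trees the normalising sum cannot be enumerated, which is why MCMC sampling is used instead.

```python
def posterior(likelihoods, priors):
    """Posterior probability of each tree from its likelihood and prior."""
    joint = {tree: likelihoods[tree] * priors[tree] for tree in likelihoods}
    evidence = sum(joint.values())   # Prob(Data), summed over all trees
    return {tree: value / evidence for tree, value in joint.items()}


# Hypothetical values of Prob(Data | Tree) for three candidate topologies.
likelihoods = {"T1": 2e-5, "T2": 1e-5, "T3": 1e-5}
priors = {tree: 1 / 3 for tree in likelihoods}   # flat prior over topologies

post = posterior(likelihoods, priors)
print(post)
```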

(Short/Long Interspersed Elements)
Yang, Z. & Rannala, B., Molecular phylogenetics: principles and practice. Nature Reviews Genetics, 13(5), pp.303–314.

Maximum Parsimony
Up to now, all the methods we saw share one characteristic: all the taxa are on the leaves, and internal nodes are assumed not to have been sampled, i.e. we never recovered the ancestor of a clade.
But what if we know that the ancestors are in our dataset?

goeBURST applied to Staphylococcus aureus MLST data
[Figure: the 3 largest S. aureus clonal complexes]

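MLST: MultiLocus Sequence Typing
Each unique gene sequence (allele) of a housekeeping gene on the bacterial chromosome is assigned an integer ID, by comparison with online databases.
Allelic profile: the combination of allele IDs across the loci. Each allelic profile, aka Sequence Type (ST), is unequivocally identified by an integer.
Single locus variant (SLV): a profile that differs from another at exactly one locus.
Double locus variant (DLV): a profile that differs from another at two loci.
Triple locus variant (TLV): a profile that differs from another at three loci.
[Figure: example allelic profiles; allele numbers shown: 12 10 - 10 - 11 - 7 - 11 - 11 - 20 - 2 - 3]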
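Classifying two allelic profiles as SLV, DLV or TLV comes down to counting the loci at which they differ. A minimal sketch follows; the three 7-locus profiles are hypothetical and the function names are illustrative.

```python
def locus_differences(profile_a, profile_b):
    """Number of loci at which two allelic profiles carry different alleles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(1 for a, b in zip(profile_a, profile_b) if a != b)


def variant_level(profile_a, profile_b):
    """Name the relationship: identical, SLV, DLV, TLV or nLV in general."""
    names = {0: "identical", 1: "SLV", 2: "DLV", 3: "TLV"}
    n = locus_differences(profile_a, profile_b)
    return names.get(n, f"{n}LV")


# Hypothetical 7-locus allelic profiles.
st_x = (1, 4, 1, 4, 12, 1, 10)
st_y = (1, 4, 1, 4, 12, 1, 3)
st_z = (2, 4, 1, 4, 12, 5, 3)

print(variant_level(st_x, st_y), variant_level(st_y, st_z), variant_level(st_x, st_z))
```

Note that the comparison only asks whether the allele IDs differ, not by how many nucleotides, which is what lets a single recombination event count as one change.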

Why use allelic profiles
We assume that we do not really know whether a change was due to mutation or recombination.
Using multiple loci tries to buffer the effect of recombination.
Allelic profiles count genetic events: a locus changes allele, and the change could be due to either mutation or recombination.

Trees from allelic profiles
Assume that you have only 3 genes and that each number corresponds to a different allele of each gene.
The minimum assumption is that an SLV link may correspond to a possible phylogenetic descent.
Profiles: 1-1-1, 1-1-2, 1-2-1, 1-2-2, 1-2-3
Linking only SLVs gives 11 possible trees.
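The figure of 11 possible trees can be verified by building the SLV graph for the five profiles and counting its spanning trees. The sketch below uses Kirchhoff's matrix-tree theorem (the determinant of the graph Laplacian with one row and column removed); this is just one way to do the count and is not taken from the slides.

```python
from fractions import Fraction
from itertools import combinations

profiles = ["1-1-1", "1-1-2", "1-2-1", "1-2-2", "1-2-3"]


def is_slv(a, b):
    """True if two profile strings differ at exactly one locus."""
    return sum(x != y for x, y in zip(a.split("-"), b.split("-"))) == 1


edges = [(a, b) for a, b in combinations(profiles, 2) if is_slv(a, b)]


def count_spanning_trees(nodes, edges):
    """Matrix-tree theorem: determinant of a cofactor of the Laplacian."""
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    lap = [[Fraction(0)] * n for _ in range(n)]
    for a, b in edges:
        i, j = index[a], index[b]
        lap[i][i] += 1
        lap[j][j] += 1
        lap[i][j] -= 1
        lap[j][i] -= 1
    m = [row[1:] for row in lap[1:]]      # drop the first row and column
    size = n - 1
    det = Fraction(1)
    for col in range(size):               # exact Gaussian elimination
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return int(det)


print(len(edges), count_spanning_trees(profiles, edges))
```

The SLV graph has six links for five profiles, so two links must be left out of any tree, and there are 11 ways to do that without disconnecting the graph.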

Model – inferring relationships
More similar STs should denote strains that are closely related from an evolutionary point of view.
An ST with more SLVs can be regarded as a common ancestor.
Links between STs depict descent relations.
With these assumptions, connected STs should share an evolutionary path.
eBURST: Feil E. et al, J Bac 2004; Maynard Smith J., et al Bioessays 22:

goeBURST: Inferring phylogenetic relationships
Implementing the eBURST rules as a graphic matroid problem allows for a globally optimal solution for the placement of the ST links.
Algorithm (Francisco et al, BMC Bioinf, 2009):
1) Determine which Sequence Type (ST) / allelic profile has the most SLVs. This is the founder and the first to be drawn.
2) Draw and connect all of the SLVs of the chosen allelic profile.
3) Go to each of the SLVs drawn in 2) and choose the one with the most SLVs. Go to 2).
4) Proceed until all the STs / allelic profiles are joined or the trees drawn have no more SLVs.

goeBURST: Inferring phylogenetic relationships
However, when looking only at SLVs we usually have many possible solutions. Therefore, to obtain a unique tree, the following tie-break rules are used. In case of a tie, you simply move to the next rule.
Tie-break rules:
1) Choose the ST with the most SLVs.
2) Choose the ST with the most DLVs.
3) Choose the ST with the most TLVs.
4) Frequency: choose the ST / allelic profile with more identical strains in the dataset.
5) ST ID: choose the lowest ST ID. It should be the oldest in the database. If no IDs are assigned, choose the one that appears first in the list.
This final tie-break guarantees a consistent solution, i.e. a unique tree, for a given dataset.
Francisco et al, BMC Bioinf, 2009
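The tie-break order lends itself to a sort key. The sketch below picks a founder among the five three-locus profiles used earlier; the ST IDs 1 to 5 (assigned in listing order) and the frequencies are assumptions, and this is an illustration of the rules rather than the published goeBURST code.

```python
from collections import Counter


def tie_break_key(st_id, profiles, freq):
    """Sort key: most SLVs, then DLVs, then TLVs, then frequency, then lowest ST ID."""
    diffs = Counter(
        sum(a != b for a, b in zip(profiles[st_id], other))
        for other_id, other in profiles.items()
        if other_id != st_id
    )
    # Counts are negated so that "more" sorts first; the ST ID sorts ascending.
    return (-diffs[1], -diffs[2], -diffs[3], -freq.get(st_id, 0), st_id)


def choose_founder(profiles, freq):
    """Return the ST that wins under the tie-break rules."""
    return min(profiles, key=lambda st: tie_break_key(st, profiles, freq))


# Assumed ST IDs for the example profiles 1-1-1, 1-1-2, 1-2-1, 1-2-2, 1-2-3.
profiles = {1: (1, 1, 1), 2: (1, 1, 2), 3: (1, 2, 1), 4: (1, 2, 2), 5: (1, 2, 3)}
freq = {st: 1 for st in profiles}        # assumed: one isolate per ST

print(choose_founder(profiles, freq))
print(choose_founder(profiles, {**freq, 4: 5}))
```

With equal frequencies, ST3 and ST4 tie on SLVs, DLVs and TLVs, so the lowest ST ID decides; raising the frequency of ST4 makes rule 4 decide instead.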

Inferring phylogenetic relationships - goeBURST
Francisco et al, BMC Bioinf, 2009
[Figure: table of tie-break values (#SLVs, #DLVs, #TLVs, Freq, ST ID) for the profiles 1-1-1, 1-1-2, 1-2-1, 1-2-2 and 1-2-3, with the annotations "Connects to ST4 because #SLVs" and "More SLVs / lower ID".]
Final goeBURST tree: unique solution guaranteed.
Implementing the eBURST rules as a graphic matroid problem allows for a globally optimal solution for the placement of the ST links.

Applying goeBURST
Profiles: 1-1-1, 1-1-2, 1-2-1, 1-2-2, 1-2-3
Linking only SLVs gives 11 possible trees. All of these are valid goeBURST solutions before the tie-break rules are applied. If all the STs had the same frequency in the dataset, the tie-break would have to be the ST ID.

goeBURST (http://goeburst.phyloviz.net)
[Screenshot labels: link confidence; identification of the best DLV and TLV links; group statistics; edge statistics; SLVs highlight; DLVs highlight; tie-break level of each link (SLV, DLV, TLV, no ties); ST on mouse focus]

Why use allelic profiles instead of sequences
We assume that we do not really know whether a change in an allele was due to mutation or recombination.
Using multiple loci tries to buffer the effect of recombination.
Allelic profiles count genetic events: a locus changes allele, and the change could be due to either mutation or recombination.
It is easier to create a nomenclature.

goeBURST
The algorithm can be used like a Minimum Spanning Tree algorithm, proceeding until all sequences / STs are connected in a single tree.
In those cases the tie-break rules are still used, but they go up to the nLV level (and not only up to the number of TLVs), where n is the number of loci being used.
goeBURST can be used on allelic profiles or on comparisons of sequences of the same length (gaps can be disregarded if needed).
The same evolutionary hypothesis can be applied to any data.
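The Minimum Spanning Tree view can be sketched with Kruskal's algorithm, ordering candidate links by the number of differing loci and breaking ties between equally distant links. The edge ordering used here (distance, then the SLV counts of the two endpoints, then ST IDs) is a simplified assumption and not the full nLV comparison of the published algorithm; ST IDs 1 to 5 are also assumed.

```python
from itertools import combinations


def hamming(a, b):
    """Number of loci at which two profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def goeburst_like_mst(profiles):
    """Kruskal's algorithm with a simplified goeBURST-style edge ordering."""
    slv_count = {
        st: sum(hamming(p, q) == 1 for other, q in profiles.items() if other != st)
        for st, p in profiles.items()
    }

    def edge_key(edge):
        u, v = edge
        counts = sorted((slv_count[u], slv_count[v]), reverse=True)
        return (hamming(profiles[u], profiles[v]),
                -counts[0], -counts[1], min(u, v), max(u, v))

    parent = {st: st for st in profiles}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for u, v in sorted(combinations(sorted(profiles), 2), key=edge_key):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:              # link only if it joins two components
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


profiles = {1: (1, 1, 1), 2: (1, 1, 2), 3: (1, 2, 1), 4: (1, 2, 2), 5: (1, 2, 3)}
print(goeburst_like_mst(profiles))
```

For the five example profiles this ordering links ST2 to ST4 (the neighbour with more SLVs) and ST5 to ST3 (equal SLV counts, lower ST ID), and the result is a single tree because every ST has at least one SLV.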

goeBURST
goeBURST analyses are very useful when dealing with micro-evolution (short-term evolution), such as in outbreak situations.
The traditional use of the goeBURST algorithm in MLST dictates that clonal complexes are defined including SLV links only (or, more rarely, links up to TLV level), which may result in disjoint trees for a given dataset.
In those cases, the ST with the most SLVs in each disjoint tree (clonal complex) is assumed to be the founder of that tree. No relationship between trees is assumed.

Integrating Data Analysis and Visualization (www.phyloviz.net)
Allelic profiles
Analysis (goeBURST)
Accessory data (“metadata”): antibiogram, serotype, origin info (patient), other typing methods, ...
Present the data in a meaningful way

Integrating Data Analysis and Visualization
goeBURST algorithm

“Live” PHYLOViZ online Demonstration

Using Phyloviz (http://www.phyloviz.net)
And PHYLOViZ Online version in

Molecular epidemiology in Clinical Microbiology: How to identify different strains within a species?

What is Microbial Typing?
Identification of characteristics that can discriminate strains at the subspecies level.
Study of the relationships between strains based on the assessed characteristics.

Legionella outbreak in VFX

In the news… (7 November 2014)
Legionella pneumophila outbreak How to find the source of the outbreak ?

Microbial Typing Methods
Phenotypic: antibiotic resistance
Genotypic (molecular methods): Pulsed-Field Gel Electrophoresis
Genotypic (sequence based): MultiLocus Sequence Typing
1) PCR amplification of ... bp fragments of 7 housekeeping genes
2) Sequencing the 7 genes
3) Compare the sequences to known alleles (online database)
4) Compare the allelic profile to the database to assign a Sequence Type (ST)
5) Infer relationships (goeBURST)

Making the case for High Throughput Sequencing (HTS)
The ideal scenario: microbiological sample -> “Magic Box of NGS Wonders for Clinical Microbiology” -> completely characterized strain
Completely characterized strain:
Species identification
Serotype
Multilocus Sequence Type (MLST)
cgMLST / wgMLST / SNPs
Antibiotic resistance profile
Virulence factors
Other SBTM information, e.g. spa (S. aureus), emm (Group A Streptococcus)
Actionable information for: diagnostics, surveillance, outbreak detection

Three current strategies
K-mer based approaches:
Do not need reference strains
Very fast calculation
The resulting trees are not phylogenetic trees
Single Nucleotide Polymorphism (SNP) analysis:
Needs a reference strain
Useful in outbreaks, comparative studies and monomorphic (clonal) species
Difficult to create a nomenclature
Can be used for quick target finding
Gene-by-gene methods:
Evolution of Multi-locus Sequence Typing (MLST) to hundreds or thousands of loci
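As an illustration of why k-mer based comparisons need no reference strain and are fast, here is a minimal sketch that compares two sequences through the Jaccard distance of their k-mer sets. The choice of Jaccard distance, the value of k and the sequences are assumptions; production tools typically work on sketches of much larger k-mer sets.

```python
def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=4):
    """1 - |shared k-mers| / |all k-mers| for two sequences."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical sequences.
print(jaccard_distance("ACGTACGTGG", "ACGTACGTGG"))
print(jaccard_distance("AAAAAAAA", "CCCCCCCC"))
print(jaccard_distance("ACGTACGTGG", "ACGTACCTGG"))
```

The resulting distance reflects shared sequence content rather than an explicit model of descent, which is why trees built from it are not phylogenetic trees in the strict sense.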

Gene-by-gene approaches
[Figure: a draft genome (contigs) is compared by BLAST against a core genome / whole genome schema held on a central nomenclature server (schemas, allele definitions and identifiers). Output: allelic profile (whole genome / core genome MLST).]
Gene-by-gene approaches:
No need for a reference strain
Buffers the effect of recombination
Simpler to create a nomenclature
Population structure of non-monomorphic species

Gene-by-gene: databases and software
Non-commercial:
BIGSdb (Jolley, K. A. & Maiden, M. C. J. BMC Bioinf 11, 595 (2010))
Enterobase
GEP (Genome Profiler) (JCM May;53(5):1765-7)
chewBBACA (BLAST Score Ratio Based Allele Calling Algorithm) - JAC
Commercial:
Ridom SeqSphere+ (Ridom)
BioNumerics (Applied Maths)

Applications of microbial typing
Bacterial population genetics
Pathogenesis and natural history of infection
Outbreak investigation and control
Surveillance of infectious diseases

Evaluating the methods and the trees
All methods have strengths and weaknesses, and the choice of method should reflect them.
Criteria for the evaluation of the methods:
Efficiency: how fast a method performs the analysis.
Consistency: given sufficient data, a method should reproduce the correct tree.
Power: how much data is required to produce the correct result.
Robustness: how dependent the results are on the assumptions.
Falsifiability: whether or not the results will allow us to determine if the underlying evolutionary assumptions have been violated.
Other ways to validate the methods are testing the congruence of the results with external criteria and using simulated data.

Evaluating the trees Bootstrapping
Statistical technique created by Bradley Efron.
Application to phylogenetics proposed by Joseph Felsenstein.
It is performed by sampling the dataset with replacement (sequence alignment columns = traits).
“To pull oneself up by the bootstraps” (The Adventures of Baron Munchausen)

Evaluating the trees
For each resampled dataset, a new evolutionary distance matrix and a new tree are calculated.
The process is repeated n times (typically 100 or 1000).
For each clade in the tree, determine how many times it occurred in the n trees created.
This measures the robustness of the tree topology to the characters used to create it.
Other methods: jackknifing (removing k characters or samples at a time, n times).
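The resampling and counting steps can be sketched as follows. The tree-building step is deliberately left out (any method from this course could be plugged in), clades are represented as frozensets of taxon names, and the alignment and replicate clade sets are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def clade_support(reference_clades, replicate_clade_sets):
    """Fraction of replicate trees in which each reference clade occurs."""
    n = len(replicate_clade_sets)
    return {clade: sum(clade in rep for rep in replicate_clade_sets) / n
            for clade in reference_clades}


# One pseudo-replicate of a hypothetical alignment (seeded for reproducibility).
alignment = ["ACGT", "AGGT", "ACGA"]
replicate = bootstrap_alignment(alignment, random.Random(0))

# Hypothetical clades recovered from four bootstrap trees.
ab, ac = frozenset("ab"), frozenset("ac")
support = clade_support([ab], [{ab}, {ac}, {ab}, {ab}])
print(replicate, support[ab])
```

Every column of the pseudo-replicate is a column of the original alignment; some columns appear more than once and others not at all, which is what perturbs the signal supporting each clade.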

Bootstrapping can be used to create a consensus tree, i.e. the tree most likely to be observed, inferred from the bootstrap values.
It follows a majority rule: a clade must appear in more than 50% of the trees.
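A majority-rule consensus can be sketched by counting how often each clade appears across the bootstrap trees and keeping those above the threshold. Trees are represented here simply as sets of clades (frozensets of taxon names), which is an assumption made for brevity; the sketch requires strictly more than half of the trees, since two clades that each occur in exactly half of the trees could conflict with each other.

```python
from collections import Counter


def majority_rule_clades(trees, threshold=0.5):
    """Clades found in more than `threshold` of the trees, with their frequency."""
    counts = Counter(clade for tree in trees for clade in tree)
    n = len(trees)
    return {clade: c / n for clade, c in counts.items() if c / n > threshold}


# Hypothetical clade sets from four bootstrap trees on taxa a, b, c, d, e.
ab, cd, ac, abe = frozenset("ab"), frozenset("cd"), frozenset("ac"), frozenset("abe")
trees = [{ab, cd}, {ab, abe}, {ac, abe}, {ab, cd}]

consensus = majority_rule_clades(trees)
print(consensus)
```

In this toy example only the clade (a, b) passes the majority rule; (c, d) and (a, b, e) each occur in exactly half of the trees and are left unresolved.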

Moving past trees
[Figure: lineages plotted against TIME (generations)]
In a Horizontal Gene Transfer event, who is the ancestor?

Moving past trees
In these cases we can have phylogenetic networks, of which phylogenetic trees are just a particular case.
Different algorithms produce different kinds of networks, but they always represent either possible evolutionary paths or the uncertainty of the path in some areas of the network.

goeBURST vs SLV graph
[Figure panels: SLV graph; goeBURST (SLV level)]
SLV graph: links all SLVs instead of following the goeBURST rules, taking SLVs as the minimum assumption about evolutionary pathways.

Moving past the trees…. Huson, D.H. & Bryant, D., Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution, 23(2), pp.254–267.

Tree of Life project Ciccarelli, F.D. et al., Toward automatic reconstruction of a highly resolved tree of life. Science, 311(5765), pp.1283–1287.

Tree of Life Eukaryota (Animals, plants and fungi) Archaea Bacteria

Tree of Life (all sequenced genomes in 2006)
A phylogenetic tree of life, showing the relationship between species whose genomes had been sequenced as of 2006. The very center represents the last universal ancestor of all life on earth. The different colors represent the three domains of life: pink represents eukaryota (animals, plants and fungi); blue represents bacteria; and green represents archaea. Note the presence of Homo sapiens (humans) second from the rightmost edge of the pink segment. The light and dark bands along the edge correspond to clades: the rightmost light red band is Metazoa, with dark red Ascomycota to its left, and light blue Firmicutes to its right.

‘Tree of Life’ for 2.3 Million Species Released
“A survey of more than 7,500 phylogenetic studies published between 2000 and 2012 in more than 100 journals found that only one out of six studies had deposited their data in a digital, downloadable format that the researchers could use.”
Tree limited to lineages with at least 500 descendants.

Conflicts in the ‘Tree of Life’
In practice it is a Multi Level SuperTree, i.e., a method to join multiple trees

Not a tree ( LUCA) Doolittle, W. F. (2000). Uprooting the tree of life. Scientific American, 282(2), 90–95.

Not a tree Doolittle, W. F. (2000). Uprooting the tree of life. Scientific American, 282(2), 90–95.

Take home messages
Trees represent working hypotheses that explain phylogenetic patterns of descent.
The presence of recombination can disrupt the phylogenetic signal, resulting in unreliable trees.
The process of making a tree is full of assumptions that need to be met or verified to validate the tree.
Trees need to be validated, since most methods are just sampling a subset of all possible trees. Metadata can provide one of the best validations.
When you see a phylogenetic tree, remember that it is only one possibility from a “forest” of possibilities!

The future : more data integration and visualization
GenGIS
Providing appealing, easy to interpret trees or networks

Felsenstein’s phylogeny software repository
“Here are 392 phylogeny packages and 54 free web servers, (almost) all that I know about. It is an attempt to be completely comprehensive” Joseph “Joe” Felsenstein

If you are interested in:
Evolution
Algorithms
NGS data analysis
Web interfaces
Data visualization
And having a few laughs…
MSc thesis projects available! Contact us!!
Molecular Microbiology and Infection Unit / Mramirez

END OF CLASS
