# Lecture 13, CS566: Phylogenetics (Motivation, Concepts, Algorithms)


## Slide 2: Motivation

- Life arose just once: “Thou art my brethren but a few sequences removed”
- Phylogenetic trees = topologies of evolutionary relationships between sequences and, possibly, species: “Is it man or cow that is the true heir of the fabulous treasures of the woolly mammoth dynasty?”
- Phylogenetic tree as guide tree for multiple sequence alignment (déjà vu)

## Slide 3: Concepts

- Mutation and evolution
  - Mutations that persist over generations = evolution
- Tree, not a lattice
  - Each species arose just once
- Species phylogeny (often) != sequence phylogeny
  - Sequences evolve at different rates
    - Within a single species
    - Between different species
    - Within a single sequence
  - Especially in bacteria, horizontal transfer is quite common (“Napster’s been around for ages”)

## Slide 4: Concepts

- Molecular clock assumption
  - Sequences drift apart at a constant rate
  - Aka edge length proportional to time
  - Aka satisfaction of ultrametricity: for any 3 sequences, either all pairwise distances are equal, or 2 distances are equal and the third one is smaller
  - If true, then all path lengths from the root to the leaf nodes are equal
- Additivity
  - The distance metric chosen is a true distance (fulfils the triangle inequality)
  - Such that the cumulative sum of edge lengths along the path between 2 sequences equals the distance between those 2 sequences
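Both conditions can be checked mechanically on a distance matrix. The sketch below is illustrative only: the function names and the two example matrices (`clock_like`, `additive_only`) are assumptions, not part of the lecture. Ultrametricity is tested with the three-point rule stated above; additivity is tested with the four-point condition, a standard equivalent test for tree-additive distances.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """Three-point condition: in every triple, the two largest distances are equal."""
    for i, j, k in combinations(range(len(d)), 3):
        _, mid, largest = sorted((d[i][j], d[i][k], d[j][k]))
        if abs(largest - mid) > tol:
            return False
    return True


def is_additive(d, tol=1e-9):
    """Four-point condition: of the three pairwise sums, the two largest are equal."""
    for i, j, k, m in combinations(range(len(d)), 4):
        sums = sorted((d[i][j] + d[k][m], d[i][k] + d[j][m], d[i][m] + d[j][k]))
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Hypothetical distances generated from a tree with equal root-to-leaf paths.
clock_like = [
    [0, 2, 8, 8],
    [2, 0, 8, 8],
    [8, 8, 0, 4],
    [8, 8, 4, 0],
]

# Hypothetical distances from a tree with unequal leaf branches (1, 4, 2, 3)
# and an internal edge of length 1: additive, but not clock-like.
additive_only = [
    [0, 5, 4, 5],
    [5, 0, 7, 8],
    [4, 7, 0, 5],
    [5, 8, 5, 0],
]

print(is_ultrametric(clock_like), is_additive(clock_like))
print(is_ultrametric(additive_only), is_additive(additive_only))
```

The second matrix shows why the two assumptions are listed separately: a tree can reproduce all pairwise distances exactly even when lineages evolve at different rates.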

## Slide 5: Concepts

- Heuristic forays into intractable space
- Start with pairwise “distances”
- Path length = distance (~ evolutionary time)
- Work from the leaves to the nodes to generate the tree
  - (opposite of binary tree generation)
- “It’s easier to be rootless than to be rooted”
- Binary tree approximation of higher-order trees
- Edges do not imply direct links (missing links/incomplete data), only a representation of sequence evolution

## Slide 6: Algorithms

- Parsimony (character-based)
- Distance-based methods
  - Neighbor joining
  - UPGMA
- Maximum likelihood

Annotation on slide: increasing sequence similarity

## Slide 7: Algorithms

- UPGMA (unweighted pair group method with arithmetic averages)
  - Caveat if the molecular clock is not applicable: “If my cousin looks more like me than my brother, he must be my lost brother, and perhaps my brother my cousin?”
- Neighbor joining
  - “Give me additive distances, and I shall give thee a tree, even if some sequences morph faster than others”
- Parsimony
  - “It’s just a bruise, not Kaposi’s sarcoma!”
- Maximum likelihood
  - “Given the facts, Watson, the answer is elementary!”

## Slide 8: UPGMA

- Easiest to use if molecular clock and additivity are valid
- Initially, no. of clusters = no. of sequences = no. of leaf nodes
- Inter-cluster distance = average pairwise distance
- While (no. of clusters > 1):
  - Connect the pair of closest clusters (at distance d) with an intermediate node at distance d/2 from each of them
- Caveat: satisfies the minimal distance requirement, but may result in spurious topologies because of the constant-rate evolution assumption
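The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's own code: the function name `upgma`, the nested-tuple tree format `(left, right, node_height)`, and the example matrix are all hypothetical. Each merge places the new node at height d/2, and distances to the merged cluster are size-weighted so that they remain the average over all leaf pairs.

```python
def upgma(names, d):
    """Cluster by average linkage; return nested tuples (left, right, node_height)."""
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # distances keyed by (smaller id, larger id)
    dist = {(i, j): float(d[i][j])
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # closest pair of clusters, at distance dab
        (a, b), dab = min(dist.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del dist[(a, b)]
        # average distance from the merged cluster to every remaining cluster
        for c in clusters:
            dac = dist.pop((min(a, c), max(a, c)))
            dbc = dist.pop((min(b, c), max(b, c)))
            dist[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        # new internal node sits at distance dab / 2 from the leaves below it
        clusters[next_id] = ((tree_a, tree_b, dab / 2), size_a + size_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


# Hypothetical clock-like distances for four sequences A-D.
example = [
    [0, 2, 8, 8],
    [2, 0, 8, 8],
    [8, 8, 0, 4],
    [8, 8, 4, 0],
]
print(upgma(["A", "B", "C", "D"], example))
```

Because every node height is simply half the merge distance, all root-to-leaf paths come out equal, which is exactly the constant-rate assumption the caveat warns about.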

## Slide 9: Parsimony

- Parsimony (“miserliness in model space”): pick the simplest explanation that fits the facts. “If I hear a blood-curdling scream, it’s just one of my sons trying to kill the other, not an invasion by aliens!”
- Every possible tree is evaluated in terms of the total number of steps needed to convert each sequence to another
  - Practical for only a few sequences
- A high percentage of similarity is a prerequisite
  - Neither identical nor ‘completely different’ sequence positions are useful
  - Each difference should represent a single step (WYSIWYG) and not a ‘full circle’ or ‘non-shortest route’

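## Slide 10: Parsimony

| Sequence | Residues (positions 1 to 14) |
|----------|------------------------------|
| 1        | ACCEFAHIKLKNPR               |
| 2        | ACCEFGHILLLNPR               |
| 3        | ACDEFGHIKLINPK               |
| 4        | AADEFGHILLNNPK               |

Three positions are starred on the slide. Positions 3, 9, and 14 are the columns in which two different residues each occur at least twice.

[Figure: candidate tree for position 3, with leaves 1 (C), 2 (C), 3 (D), 4 (D) and internal nodes labeled C and D]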
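As a small illustration of which columns carry information for parsimony, the sketch below scans the four sequences from this slide and keeps the columns in which at least two different residues each occur at least twice. The function name `informative_sites` is an assumption made for this example.

```python
from collections import Counter

# The four sequences shown on the slide.
sequences = [
    "ACCEFAHIKLKNPR",
    "ACCEFGHILLLNPR",
    "ACDEFGHIKLINPK",
    "AADEFGHILLNNPK",
]


def informative_sites(seqs):
    """Return 1-based positions where >= 2 residues each appear in >= 2 sequences."""
    sites = []
    for position, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        repeated = [residue for residue, n in counts.items() if n >= 2]
        if len(repeated) >= 2:
            sites.append(position)
    return sites


print(informative_sites(sequences))  # [3, 9, 14]
```

Columns that are identical in all sequences, differ in only one sequence, or differ in every sequence cost the same on every tree, so they cannot help choose between topologies.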

## Slide 11: Parsimony

- 3 sets of 3 trees each are compared
- The one with the lowest total number of substitutions is selected
- Refinements:
  - Branch and bound: abandon a tree if a subtree has a higher score than the current minimal-score tree
  - Heuristic branch-pattern representatives
  - Non-boolean costs: transversion > transition, OR use of amino-acid substitution matrices
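One way to read "3 sets of 3 trees" is as the three informative positions of the four-sequence example, each scored on the three possible unrooted topologies for four sequences; that reading is an assumption. The sketch below counts substitutions with Fitch's small-parsimony algorithm, a standard method swapped in for illustration (the slides do not name a counting method), using unit costs for every change. The names `fitch`, `tree_score`, and `topologies` are hypothetical.

```python
sequences = [
    "ACCEFAHIKLKNPR",
    "ACCEFGHILLLNPR",
    "ACDEFGHIKLINPK",
    "AADEFGHILLNNPK",
]
informative = (3, 9, 14)  # 1-based informative columns of this alignment

# Leaves are 0-based sequence indices. Each unrooted topology is written with an
# arbitrary root; the Fitch count does not depend on where the root is placed.
topologies = {
    "(1,2),(3,4)": ((0, 1), (2, 3)),
    "(1,3),(2,4)": ((0, 2), (1, 3)),
    "(1,4),(2,3)": ((0, 3), (1, 2)),
}


def fitch(tree, column):
    """Return (candidate states, minimum substitutions) for one alignment column."""
    if isinstance(tree, int):
        return {column[tree]}, 0
    left_states, left_cost = fitch(tree[0], column)
    right_states, right_cost = fitch(tree[1], column)
    shared = left_states & right_states
    if shared:
        return shared, left_cost + right_cost
    # no shared state: one substitution is needed at this node
    return left_states | right_states, left_cost + right_cost + 1


def tree_score(tree, seqs, positions):
    """Total substitutions over the chosen 1-based positions."""
    return sum(fitch(tree, [s[p - 1] for s in seqs])[1] for p in positions)


scores = {name: tree_score(t, sequences, informative) for name, t in topologies.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

With these sequences the topology grouping 1 with 2 and 3 with 4 needs the fewest substitutions (4, against 5 and 6 for the alternatives).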

## Slide 12: Neighbor Joining

- Generates an unrooted tree, allowing for unequal branches
- Given: distance matrix for sequences
- Steps (repeat 1-3 until all branches are generated):
  1. Take the closest sequences i, j
  2. Find the branch lengths between i and j by treating the remaining sequences as a composite (C)
     1. Calculate the average i-C and j-C distances
     2. Calculate branch lengths i and j
  3. Treat ij as a composite sequence now and generate a new distance table
  4. Generate multiple trees by starting with different pairs
  5. Compare the resulting trees in terms of best fit to the original distance matrix
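Steps 1-3 can be sketched as follows. Note the assumptions: this sketch follows the standard Saitou and Nei formulation, in which the "closest" pair is the one minimizing the rate-corrected criterion Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r is a row sum, rather than the smallest raw distance; whether the slide intends that criterion is not stated. Steps 4 and 5 are not covered. The function name, the `U1, U2, ...` labels for internal nodes, and the example matrix are hypothetical.

```python
def neighbor_joining(names, d):
    """Return the unrooted tree as a list of (node, node, branch_length) edges."""
    nodes = list(names)
    dist = {a: {b: float(d[i][j]) for j, b in enumerate(names)}
            for i, a in enumerate(names)}
    edges, joined = [], 0
    while len(nodes) > 2:
        n = len(nodes)
        # r[a]: summed distance from a to all other current nodes (the composite)
        r = {a: sum(dist[a][b] for b in nodes) for a in nodes}
        pairs = ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:])
        # step 1: pick the pair minimizing the Q criterion
        i, j = min(pairs, key=lambda p: (n - 2) * dist[p[0]][p[1]] - r[p[0]] - r[p[1]])
        dij = dist[i][j]
        # step 2: branch lengths from i and j to their new parent node
        length_i = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        length_j = dij - length_i
        joined += 1
        u = f"U{joined}"
        edges += [(i, u, length_i), (j, u, length_j)]
        # step 3: replace i and j by the composite node u in the distance table
        dist[u] = {u: 0.0}
        for k in nodes:
            if k not in (i, j):
                dist[u][k] = dist[k][u] = (dist[i][k] + dist[j][k] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, dist[a][b]))
    return edges


# Hypothetical additive distances for five sequences a-e.
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
for edge in neighbor_joining(list("abcde"), example):
    print(edge)
```

In this example a and b are joined first even though d and e have the smallest raw distance, and the branches to a and b come out unequal (2 and 3), which is the behaviour the first bullet describes.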

## Slide 13: Rooting trees

- Based on a “proxy ancestor”
  - Include a distant relative (“outgroup”) as the proxy ancestor
  - Add the outgroup as the last node
  - Point of attachment of the outgroup represents the root
- Diameter center
  - Place the root at the center of the longest path through the tree
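The "diameter center" rule can be sketched for an unrooted tree given as an edge list. Everything named here is an assumption made for illustration: the function `midpoint`, the `(node, node, length)` edge format, and the example tree. The longest path is found with two farthest-node searches, a standard approach for trees.

```python
from collections import defaultdict


def midpoint(edges):
    """Return (u, v, offset): the root lies on edge u-v, `offset` away from u."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def farthest(start):
        """Walk the tree from `start`; return (distance, path) to the farthest node."""
        best = (0.0, [start])
        stack = [(start, None, 0.0, [start])]
        while stack:
            node, parent, travelled, path = stack.pop()
            if travelled > best[0]:
                best = (travelled, path)
            for nxt, w in adj[node]:
                if nxt != parent:
                    stack.append((nxt, node, travelled + w, path + [nxt]))
        return best

    # The farthest node from any leaf is one end of the longest path;
    # the farthest node from that end is the other.
    leaf = next(node for node in adj if len(adj[node]) == 1)
    _, path = farthest(leaf)
    length, path = farthest(path[-1])

    # Walk along the longest path until half its length is reached.
    half, walked = length / 2, 0.0
    for u, v in zip(path, path[1:]):
        w = next(weight for node, weight in adj[u] if node == v)
        if walked + w >= half:
            return u, v, half - walked
        walked += w


# Hypothetical unrooted tree: leaves a-e, internal nodes U1-U3.
example_edges = [
    ("a", "U1", 2.0), ("b", "U1", 3.0), ("c", "U2", 4.0), ("U1", "U2", 3.0),
    ("d", "U3", 2.0), ("e", "U3", 1.0), ("U2", "U3", 2.0),
]
print(midpoint(example_edges))
```

For this tree the longest path has length 10 (from b to c or d), so the root is placed on the U1-U2 edge, 1.0 away from U2.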

## Slide 14: Summary

- Parsimony and ML-based approaches are computationally intensive; scalability is poor
- Neighbor joining is adequate if the additivity assumption is valid
- UPGMA is adequate if both the molecular clock and additivity assumptions are valid for the given set of sequences

## Slide 15: Summary

- Phylogenetics is useful for understanding sequence evolution
- Phylogenetics makes sense for
  - sequences with a high percentage of sequence identity
  - sequences not subject to ‘selection’
- A sequence tree is not the same as a species tree