Phylogenetics I.


1 Phylogenetics I

2 Evolution
Evolution of new organisms is driven by:
Mutations: the DNA sequence can be changed by single base changes, deletion/insertion of DNA segments, etc.
Selection bias

3 Theory of Evolution
Basic idea: speciation events lead to the creation of different species.
Speciation is caused by physical separation into groups in which different genetic variants become dominant.
Any two species share a (possibly distant) common ancestor.

4 The Tree of Life

5 Primate evolution
A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species; it is also called a phylogenetic tree.

6 Morphological vs. Molecular
Classical phylogenetic analysis uses morphological features: number of legs, lengths of legs, etc.
Modern biological methods allow the use of molecular features:
Gene sequences
Protein sequences

7 Morphological topology
[Figure: mammalian tree based on morphology (based on McKenna and Bell, 1997), with leaves ranging from Bonobo, Chimpanzee and Man to Opossum and Platypus, and clades labeled Archonta, Glires, Ungulata, Carnivora, Insectivora, Xenarthra.]

8 From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
There are many possible types of sequences to use (e.g. mitochondrial vs. nuclear proteins).

9 Mitochondrial topology
[Figure: mammalian tree based on mitochondrial sequences (based on Pupko et al.,), with clades labeled Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Hedgehogs, Rodentia 2.]

10 Nuclear topology
[Figure: mammalian tree based on nuclear sequences (based on Pupko et al. slide; tree by Madsen), with four major groups numbered 1 to 4 and clades labeled Cetartiodactyla, Afrotheria, Chiroptera, Eulipotyphla, Glires, Xenarthra, Carnivora, Perissodactyla, Scandentia + Dermoptera, Pholidota, Primate.]

11 Phylogenetic trees
Leaves: current-day species (or taxa, the plural of taxon), e.g. Aardvark, Bison, Chimp, Dog, Elephant
Internal vertices: hypothetical common ancestors
Edge lengths: “time” from one speciation to the next

12 Twists in molecular phylogenies
We have to emphasize that gene/protein sequences can be homologous for several different reasons:
Orthologs: sequences diverged after a speciation event
Paralogs: sequences diverged after a duplication event
Xenologs: sequences diverged after a horizontal transfer (e.g., by virus)

13 Paralogs
Consider an evolutionary tree of three taxa (1, 2, 3) and assume that at some point in the past a gene duplication event occurred.
[Figure: tree of taxa 1, 2, 3 with the gene duplication marked.]

14 Paralogs
The gene evolution is described by this tree (A, B are the copies of the same gene).
[Figure: gene tree with a gene duplication followed by speciation events; leaves 1A, 2A, 3A, 3B, 2B, 1B.]

15 Paralogs
If we happen to consider genes 1A, 2B, and 3A of species 1, 2, 3, we get a wrong tree that does not represent the phylogeny of the host species.
[Figure: the same gene tree, with the gene duplication and the speciation events (S) marked; leaves 1A, 2A, 3A, 3B, 2B, 1B.]

16 Types of Trees
A natural model to consider is that of rooted trees.
[Figure: rooted tree with the root labeled "Common Ancestor".]

17 Types of trees
An unrooted tree represents the same phylogeny without the root node.
Depending on the model, data from current-day species does not distinguish between different placements of the root.

18 Rooted versus unrooted trees
[Figure: an unrooted tree with leaves a, b, c, alongside rooted trees labeled Tree a, Tree b, Tree c.]
The unrooted tree represents the three rooted trees.

19 Total numbers of trees
For n taxa:
Rooted bifurcating trees: $(2n-3)!! = \dfrac{(2n-3)!}{2^{\,n-2}\,(n-2)!}$
Unrooted bifurcating trees: $(2n-5)!!$
Tree shapes
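To get a feel for how fast these counts grow, the double factorials can be evaluated directly. The following is a minimal sketch; the function names are illustrative and not part of the slides.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k-2) * (k-4) * ...; taken to be 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees on n >= 2 labeled leaves: (2n-3)!!."""
    return double_factorial(2 * n - 3)


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees on n >= 3 labeled leaves: (2n-5)!!."""
    return double_factorial(2 * n - 5)


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

Already for 10 taxa there are tens of millions of rooted trees, which is why exhaustive search over trees is impractical beyond small inputs.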

20 Positioning Roots in Unrooted Trees
We can estimate the position of the root by introducing an outgroup: a set of species that are definitely distant from all the species of interest.
[Figure: tree of Aardvark, Bison, Chimp, Dog, Elephant with Falcon added as the outgroup and the proposed root marked.]

21 Type of Data
Distance-based: the input is a matrix of distances between species. A distance can be the fraction of residues they disagree on, or the alignment score between them, or …
Character-based: examine each character (e.g., residue) separately.

22 Two methods of tree construction
Distance: a weighted tree that realizes the distances between the objects.
Parsimony: a tree with the minimum total number of character changes between nodes.
We start with distance-based methods, considering the following question: given a set of species (leaves in a supposed tree) and distances between them, construct a phylogeny which best “fits” the distances.

23 Distance Matrix
Given n species, we can compute the n x n distance matrix $D_{ij}$.
$D_{ij}$ may be defined as the edit distance between a gene in species i and species j, where the gene of interest is sequenced for all n species.
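As an illustration of this definition, here is a minimal sketch that builds a distance matrix from unit-cost edit distance (Levenshtein distance). The unit costs and the function names are assumptions made for the example; the slides do not fix a particular scoring. The sequences are the four fragments shown on slide 8.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: unit cost for substitution, insertion and deletion."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]


def distance_matrix(seqs: dict):
    """Return (names, D) where D[i][j] is the edit distance between sequences i and j."""
    names = list(seqs)
    matrix = [[edit_distance(seqs[a], seqs[b]) for b in names] for a in names]
    return names, matrix


if __name__ == "__main__":
    seqs = {
        "Rat": "QEPGGLVVPPTDA",
        "Rabbit": "QEPGGMVVPPTDA",
        "Gorilla": "QEPGGLVVPPTDA",
        "Cat": "REPGGLVVPPTEG",
    }
    names, D = distance_matrix(seqs)
    for name, row in zip(names, D):
        print(f"{name:8s}", row)
```

In practice a model-based distance (see the following slides) is usually preferred over raw edit distance, because it corrects for multiple substitutions at the same site.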

24 The distance between two sequences
Protein sequences: PAM, BLOSUM
DNA sequences: Jukes-Cantor, HKY, Kimura 2-Parameter

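25 General Stationary Time-reversible Model
$$
R = \begin{pmatrix}
\cdot & \pi_C r_{CA} & \pi_G r_{GA} & \pi_T r_{TA} \\
\pi_A r_{AC} & \cdot & \pi_G r_{GC} & \pi_T r_{TC} \\
\pi_A r_{AG} & \pi_C r_{CG} & \cdot & \pi_T r_{TG} \\
\pi_A r_{AT} & \pi_C r_{CT} & \pi_G r_{GT} & \cdot
\end{pmatrix}
$$
(Rows and columns ordered A, C, G, T; diagonal elements such that rows sum to zero)
Time reversibility: $\pi_i r_{ij} = \pi_j r_{ji}$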

26 General Stationary Time-reversible Model
$P(t) = e^{Rt}$
Given rates, one can find transition probabilities, and vice versa.
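To make the relation between rates and transition probabilities concrete, the matrix exponential can be approximated by a truncated Taylor series. This is only a sketch in plain Python: the helper names are hypothetical, and the series is adequate only for small matrices with modest entries. It is checked here against the Jukes-Cantor rate matrix of the next slide (off-diagonal entries u/3, diagonal -u), for which the closed form is known.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=40):
    """Approximate exp(M) by the truncated series sum_{k < terms} M^k / k!."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, m)]  # M^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def transition_probabilities(rate_matrix, t):
    """P(t) = exp(R t) for a rate matrix R and time t."""
    return mat_exp([[r * t for r in row] for row in rate_matrix])


def jukes_cantor_rates(u):
    """Jukes-Cantor rate matrix: u/3 off the diagonal, -u on it (rows sum to zero)."""
    return [[-u if i == j else u / 3 for j in range(4)] for i in range(4)]


if __name__ == "__main__":
    u, t = 1.0, 0.5
    P = transition_probabilities(jukes_cantor_rates(u), t)
    same = 0.25 + 0.75 * math.exp(-4 * u * t / 3)  # closed form for P_ii(t)
    print(P[0][0], same)
```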

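27 Jukes-Cantor
$$
R = \begin{pmatrix}
\cdot & u/3 & u/3 & u/3 \\
u/3 & \cdot & u/3 & u/3 \\
u/3 & u/3 & \cdot & u/3 \\
u/3 & u/3 & u/3 & \cdot
\end{pmatrix}
$$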

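28 Jukes-Cantor
$P(\text{no mutation}) = e^{-\frac{4}{3}ut}$
$P(\text{at least one mutation}) = 1 - e^{-\frac{4}{3}ut}$
$D_s = \frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$
$D \approx ut = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D_s\right)$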
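The last formula is the usual Jukes-Cantor correction: it converts a fraction of differing sites $D_s$ into an estimated distance $ut$. A minimal sketch of both directions follows; the function names are illustrative.

```python
import math


def jc_expected_difference(ut: float) -> float:
    """D_s = 3/4 * (1 - exp(-4/3 * ut)): fraction of differing sites after distance ut."""
    return 0.75 * (1.0 - math.exp(-4.0 * ut / 3.0))


def jc_distance(ds: float) -> float:
    """Inverse relation: ut = -3/4 * ln(1 - 4/3 * D_s), defined for 0 <= D_s < 3/4."""
    if not 0.0 <= ds < 0.75:
        raise ValueError("D_s must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * ds / 3.0)


if __name__ == "__main__":
    for ds in (0.01, 0.1, 0.3, 0.6):
        print(ds, round(jc_distance(ds), 4))
```

The corrected distance is always at least as large as $D_s$, and the gap widens as $D_s$ approaches the saturation value 3/4.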

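29 Kimura 2-Parameter
Rate matrix (rows and columns ordered A, C, G, T; diagonal elements such that rows sum to zero):
$$
\begin{pmatrix}
\cdot & b & a & b \\
b & \cdot & b & a \\
a & b & \cdot & b \\
b & a & b & \cdot
\end{pmatrix}
$$
$a$ is the rate of each transition and $b$ the rate of each transversion; $a/b$ reflects the transition/transversion bias, which is parameterized by $R$ (here a scalar, not the rate matrix).
$a + 2b = 1$ per unit time

30 Kimura 2-Parameter
$a = R/(R+1)$, $b = 0.5/(R+1)$, so that $a + 2b = 1$ and $a/(2b) = R$.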
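A small sketch can confirm that this parameterization normalizes the total substitution rate to 1. It assumes the usual Kimura layout in which A↔G and C↔T are transitions (rate a) and all other changes are transversions (rate b); the function names are illustrative.

```python
def k2p_rates(R: float):
    """Return (a, b) with a = R/(R+1) and b = 0.5/(R+1), so that a + 2b = 1."""
    return R / (R + 1.0), 0.5 / (R + 1.0)


def k2p_rate_matrix(R: float):
    """4x4 rate matrix over A, C, G, T with transition rate a and transversion rate b."""
    a, b = k2p_rates(R)
    bases = "ACGT"
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    matrix = []
    for x in bases:
        row = []
        for y in bases:
            if x == y:
                row.append(-(a + 2.0 * b))  # diagonal makes the row sum to zero
            elif (x, y) in transitions:
                row.append(a)
            else:
                row.append(b)
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    a, b = k2p_rates(2.0)
    print(a, b, a + 2 * b, a / (2 * b))
    for row in k2p_rate_matrix(2.0):
        print([round(x, 3) for x in row])
```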

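31 HKY (Hasegawa, Kishino, Yano)
$$
R = \begin{pmatrix}
\cdot & \mu\pi_C & \mu\kappa\pi_G & \mu\pi_T \\
\mu\pi_A & \cdot & \mu\pi_G & \mu\kappa\pi_T \\
\mu\kappa\pi_A & \mu\pi_C & \cdot & \mu\pi_T \\
\mu\pi_A & \mu\kappa\pi_C & \mu\pi_G & \cdot
\end{pmatrix}
$$
(Rows and columns ordered A, C, G, T)
$\kappa$ = transition / transversion rate ratio (it multiplies the A↔G and C↔T entries)
Some rules of thumb:
Use simpler models with shorter sequences (< 200 bp).
Otherwise, use a model as complex as necessary.
Compare results from more than one method.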

32 Distances in Trees
Edges may have weights reflecting:
Number of mutations on the evolutionary path from one species to another
Time estimate for the evolution of one species into another
In a tree T, we often compute $d_{ij}(T)$, the length of the path between leaves i and j.
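The path length between two leaves can be computed by walking the tree and summing edge weights. The sketch below stores the tree as a nested dictionary; this representation, the node labels and the edge weights are all made up for illustration.

```python
def build_tree(edges):
    """Build an undirected weighted tree {node: {neighbor: weight}} from (u, v, w) triples."""
    tree = {}
    for u, v, w in edges:
        tree.setdefault(u, {})[v] = w
        tree.setdefault(v, {})[u] = w
    return tree


def path_length(tree, start, goal):
    """Sum of edge weights on the unique path from start to goal (depth-first walk)."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbor, weight in tree[node].items():
            if neighbor != parent:  # in a tree, never step back to where we came from
                stack.append((neighbor, node, dist + weight))
    raise ValueError("goal is not reachable from start")


if __name__ == "__main__":
    # Hypothetical four-leaf tree: A and B hang off x, C and D hang off y.
    tree = build_tree([("A", "x", 11), ("B", "x", 2), ("x", "y", 4),
                       ("C", "y", 7), ("D", "y", 2)])
    print(path_length(tree, "A", "B"), path_length(tree, "B", "D"))
```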

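33 Distance in Trees: an Example
[Figure: weighted tree with leaves i and j marked; the edge weights on the path between leaves 1 and 4 sum to $d_{1,4} = 68$.]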

34 Fitting Distance Matrix
Given n species, we can compute the n x n distance matrix $D_{ij}$.
Evolution of these genes is described by a tree that we don’t know.
We need an algorithm to construct a tree that best fits the distance matrix $D_{ij}$.

35 Reconstructing a 3 Leaved Tree
Tree reconstruction for any 3x3 matrix is straightforward.
We have 3 leaves i, j, k and a center vertex c.
Observe:
$d_{ic} + d_{jc} = D_{ij}$
$d_{ic} + d_{kc} = D_{ik}$
$d_{jc} + d_{kc} = D_{jk}$

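36 Reconstructing a 3 Leaved Tree
Adding the first two equations:
$d_{ic} + d_{jc} = D_{ij}$
$+\; d_{ic} + d_{kc} = D_{ik}$
$2d_{ic} + d_{jc} + d_{kc} = D_{ij} + D_{ik}$
$2d_{ic} + D_{jk} = D_{ij} + D_{ik}$
$d_{ic} = (D_{ij} + D_{ik} - D_{jk})/2$
Similarly,
$d_{jc} = (D_{ij} + D_{jk} - D_{ik})/2$
$d_{kc} = (D_{ki} + D_{kj} - D_{ij})/2$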
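These three formulas translate directly into code. A minimal sketch follows; the function name and the example distances are illustrative.

```python
def three_leaf_tree(d_ij: float, d_ik: float, d_jk: float):
    """Edge lengths (d_ic, d_jc, d_kc) from leaves i, j, k to the center vertex c."""
    d_ic = (d_ij + d_ik - d_jk) / 2
    d_jc = (d_ij + d_jk - d_ik) / 2
    d_kc = (d_ik + d_jk - d_ij) / 2
    return d_ic, d_jc, d_kc


if __name__ == "__main__":
    d_ic, d_jc, d_kc = three_leaf_tree(13, 15, 6)
    print(d_ic, d_jc, d_kc)
    # The edge lengths reproduce the three input distances.
    print(d_ic + d_jc, d_ic + d_kc, d_jc + d_kc)
```

Note that an edge length comes out negative whenever the input violates the triangle inequality, so the result is a valid tree only for metric input.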

37 Trees with > 3 Leaves
A tree with n leaves (unrooted, binary) has 2n-3 edges.
This means fitting a given tree to a distance matrix D requires solving a system of “n choose 2” equations with 2n-3 variables.
This is not always possible to solve for n > 3.

38 Additive Distance Matrices
Matrix D is ADDITIVE if there exists a tree T with $d_{ij}(T) = D_{ij}$, and NON-ADDITIVE otherwise.

39 Distance Based Phylogeny Problem
Goal: Reconstruct an evolutionary tree from a distance matrix
Input: n x n distance matrix $D_{ij}$
Output: weighted tree T with n leaves fitting D
If D is additive, this problem has a solution and there is a simple algorithm to solve it.

40 Using Neighboring Leaves to Construct the Tree
Find neighboring leaves i and j with parent k.
Remove the rows and columns of i and j.
Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as: $D_{km} = (D_{im} + D_{jm} - D_{ij})/2$
Compress i and j into k, and iterate the algorithm for the rest of the tree.
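One compression step can be sketched as follows. The dictionary-of-dictionaries layout, the function name and the example distances are assumptions made for illustration; the step takes the neighboring pair as given and does not say how to find it.

```python
def collapse_neighbors(dist, i, j, k):
    """Replace neighboring leaves i and j by their parent k.

    dist is a symmetric dict of dicts, dist[a][b] = distance between a and b.
    Returns a new dict of dicts over the remaining leaves plus k, using
    D_km = (D_im + D_jm - D_ij) / 2.
    """
    others = [m for m in dist if m not in (i, j)]
    new = {m: {n: dist[m][n] for n in others if n != m} for m in others}
    new[k] = {}
    for m in others:
        d_km = (dist[i][m] + dist[j][m] - dist[i][j]) / 2
        new[k][m] = d_km
        new[m][k] = d_km
    return new


if __name__ == "__main__":
    # Additive distances of a hypothetical tree in which A and B are neighbors.
    dist = {
        "A": {"B": 13, "C": 22, "D": 17},
        "B": {"A": 13, "C": 13, "D": 8},
        "C": {"A": 22, "B": 13, "D": 9},
        "D": {"A": 17, "B": 8, "C": 9},
    }
    print(collapse_neighbors(dist, "A", "B", "AB"))
```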

41 Finding Neighboring Leaves
To find neighboring leaves we simply select a pair of closest leaves.

42 Finding Neighboring Leaves
To find neighboring leaves we simply select a pair of closest leaves. WRONG

43 Finding Neighboring Leaves
Closest leaves aren’t necessarily neighbors: i and j are neighbors, but ($d_{ij}$ = 13) > ($d_{jk}$ = 12).
Finding a pair of neighboring leaves is a nontrivial problem!

44 Neighbor Joining Algorithm
In 1987 Naruya Saitou and Masatoshi Nei developed a neighbor joining algorithm for phylogenetic tree reconstruction.
It finds a pair of leaves that are close to each other but far from other leaves: it implicitly finds a pair of neighboring leaves.
Advantages: it works well for additive and other non-additive matrices, and it does not have the flawed molecular clock assumption.

45 Constructing additive trees: The neighbor joining algorithm
Let i, j be neighboring leaves in a tree, let k be their parent, and let m be any other vertex.
The formula $d(k,m) = \frac{1}{2}\,(d(i,m) + d(j,m) - d(i,j))$ shows that we can compute the distances of k to all other leaves. This suggests the following method to construct a tree from a distance matrix:
Find neighboring leaves i, j in the tree.
Replace i, j by their parent k and recursively construct a tree T for the smaller set.
Add i, j as children of k in T.

46 Neighbor Finding
How can we find, from distances alone, a pair of nodes which are neighboring leaves?
Closest nodes aren’t necessarily neighboring leaves.
[Figure: tree with leaves A, B, C, D.]
Next we show one way to find neighbors from distances.

47 Neighbor Finding: Saitou & Nei algorithm
Definitions: [formula image]
Theorem (Saitou & Nei): Assume all edge weights are positive. If D(i,j), as given by the definitions, is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.
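The definitions for this slide did not survive as text, so the sketch below assumes the criterion usually attributed to Saitou and Nei: for n leaves, $Q(i,j) = (n-2)\,d(i,j) - \sum_m d(i,m) - \sum_m d(j,m)$. Whether this matches the slide's D(i,j) exactly is an assumption. The example matrix is hypothetical: it comes from a four-leaf tree in which leaves 0, 1 are neighbors and leaves 2, 3 are neighbors.

```python
from itertools import combinations


def nj_criterion(d, i, j):
    """Assumed Saitou-Nei criterion: (n-2)*d(i,j) minus the row sums of i and j."""
    n = len(d)
    return (n - 2) * d[i][j] - sum(d[i]) - sum(d[j])


def pick_neighbors(d):
    """Pair of indices minimizing the criterion (first such pair on ties)."""
    return min(combinations(range(len(d)), 2), key=lambda p: nj_criterion(d, *p))


def closest_pair(d):
    """Pair of indices with the smallest raw distance, for comparison."""
    return min(combinations(range(len(d)), 2), key=lambda p: d[p[0]][p[1]])


if __name__ == "__main__":
    d = [[0, 13, 22, 17],
         [13, 0, 13, 8],
         [22, 13, 0, 9],
         [17, 8, 9, 0]]
    print("closest pair:", closest_pair(d))       # leaves 1 and 3, which are not neighbors
    print("criterion pair:", pick_neighbors(d))   # a true neighboring pair
```

On this matrix the raw minimum picks a non-neighboring pair, while the criterion picks a pair that really shares a parent.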

48 Complexity of Neighbor Joining Algorithm
Naive implementation:
Initialization: $\Theta(L^2)$ to compute d(r,i) and C(i,j) for all $i, j \in L$.
Each iteration:
$O(L^2)$ to find the maximal C(i,j).
$O(L)$ to compute $\{C(m,k) : m \in L\}$ for the new node k.
Total of $O(L^3)$.
[Figure: tree showing the object r, the new node k, a leaf m, and C(m,k).]

49 Complexity of Neighbor Joining Algorithm
Using a heap to store the C(i,j)’s:
Input: distance matrix D = d(i,j), and an arbitrary object r.
Initialization: $\Theta(L^2)$ to compute and heapify the C(i,j)’s in a heap H.
Each iteration:
$O(\log L)$ to find and delete the maximal C(i,j) from H.
$O(L)$ to add the values {d(k,m)} to D, for all objects m.
$O(L)$ to delete {d(m,i), d(m,j)} from D (for all m).
$O(L \log L)$ to delete {C(i,m), C(j,m)} from H and add C(k,m) to H, for all objects m.
Total of $O(L^2 \log L)$.
(Implementation details are omitted.)

50 Neighbor Joining Algorithm
Applicable to matrices which are not additive.
Known to work well in practice.
The algorithm and its variants are the most widely used distance-based algorithms today.
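For reference, a compact neighbor joining loop might look like the sketch below. It is not taken from the slides: it uses the commonly stated Saitou-Nei selection criterion and branch-length formulas rather than the C(i,j) formulation of the complexity slides, runs in cubic time, and names internal nodes with hypothetical labels "u0", "u1", and so on.

```python
from itertools import combinations


def neighbor_joining(names, d):
    """Return the edges (node, node, length) of an unrooted tree built from matrix d."""
    # Dict-of-dicts copy of the matrix so that nodes can be added and removed by label.
    dist = {a: {b: d[i][j] for j, b in enumerate(names)} for i, a in enumerate(names)}
    edges = []
    counter = 0
    while len(dist) > 2:
        n = len(dist)
        total = {a: sum(dist[a].values()) for a in dist}
        # Pair minimizing (n-2)*d(i,j) - total(i) - total(j).
        i, j = min(combinations(dist, 2),
                   key=lambda p: (n - 2) * dist[p[0]][p[1]] - total[p[0]] - total[p[1]])
        new = f"u{counter}"
        counter += 1
        # Branch lengths from i and j to their new parent.
        len_i = dist[i][j] / 2 + (total[i] - total[j]) / (2 * (n - 2))
        len_j = dist[i][j] - len_i
        edges += [(i, new, len_i), (j, new, len_j)]
        # Distances from the new parent to every remaining node.
        row = {m: (dist[i][m] + dist[j][m] - dist[i][j]) / 2
               for m in dist if m not in (i, j)}
        del dist[i], dist[j]
        for m in dist:
            del dist[m][i], dist[m][j]
            dist[m][new] = row[m]
        row[new] = 0.0
        dist[new] = row
    a, b = dist  # the last two nodes are joined by a single edge
    edges.append((a, b, dist[a][b]))
    return edges


if __name__ == "__main__":
    d = [[0, 13, 22, 17],
         [13, 0, 13, 8],
         [22, 13, 0, 9],
         [17, 8, 9, 0]]
    for edge in neighbor_joining(["A", "B", "C", "D"], d):
        print(edge)
```

On additive input such as this hypothetical matrix, the returned edge lengths reproduce the input distances; on non-additive input the result is only an approximation, and some edge lengths can come out negative.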

51 The Four Point Condition
Compute:
1. $D_{ij} + D_{kl}$
2. $D_{ik} + D_{jl}$
3. $D_{il} + D_{jk}$
For a quartet tree in which i and j are neighbors (and so are k and l):
2 and 3 represent the same number: the length of all edges + the middle edge (it is counted twice).
1 represents a smaller number: the length of all edges – the middle edge.

52 The Four Point Condition: Theorem
The four point condition for the quartet i, j, k, l is satisfied if two of these sums are the same, with the third sum smaller than these first two.
Theorem: An n x n matrix D is additive if and only if the four point condition holds for every quartet $1 \le i, j, k, l \le n$.
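The theorem gives a direct, if slow, additivity test: check every quartet. The sketch below assumes D is a symmetric matrix with zero diagonal, allows a small numerical tolerance, and accepts the degenerate case where all three sums are equal; the function names are illustrative.

```python
from itertools import combinations


def four_point_holds(d, i, j, k, l, tol=1e-9):
    """True if the two largest of the three pair sums are equal (within tol)."""
    sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
    return abs(sums[2] - sums[1]) <= tol


def is_additive(d, tol=1e-9):
    """Check the four point condition on every quartet of distinct indices."""
    return all(four_point_holds(d, *quartet, tol=tol)
               for quartet in combinations(range(len(d)), 4))


if __name__ == "__main__":
    additive = [[0, 13, 22, 17],
                [13, 0, 13, 8],
                [22, 13, 0, 9],
                [17, 8, 9, 0]]
    perturbed = [row[:] for row in additive]
    perturbed[0][2] = perturbed[2][0] = 23  # break additivity
    print(is_additive(additive), is_additive(perturbed))
```

Checking all quartets takes time proportional to $n^4$, which is fine for a sanity check but not for large inputs.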

53 Least Squares Distance Phylogeny Problem
If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best:
Squared Error: $\sum_{i,j} \left(d_{ij}(T) - D_{ij}\right)^2$
Squared Error is a measure of the quality of the fit between the distance matrix and the tree: we want to minimize it.
Least Squares Distance Phylogeny Problem: finding the best approximation tree T for a non-additive matrix D (NP-hard).
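Evaluating the squared error for one candidate tree is simple once its leaf-to-leaf path lengths are known; the hard part is the search over trees. In this sketch the sum runs over unordered pairs i < j, which is half of the sum over all ordered pairs; that convention and the function name are assumptions.

```python
def squared_error(tree_dist, d):
    """Sum over pairs i < j of (d_ij(T) - D_ij)^2.

    tree_dist[i][j] is the path length between leaves i and j in the candidate tree T,
    d[i][j] is the observed distance.
    """
    n = len(d)
    return sum((tree_dist[i][j] - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


if __name__ == "__main__":
    tree_dist = [[0, 13, 22, 17],
                 [13, 0, 13, 8],
                 [22, 13, 0, 9],
                 [17, 8, 9, 0]]
    observed = [row[:] for row in tree_dist]
    observed[0][2] = observed[2][0] = 23  # one observed distance disagrees with the tree
    print(squared_error(tree_dist, observed))
```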

