Phylogenetic Trees Lecture 12

Based on pages in Durbin et al. (the black textbook). This class has been edited from Nir Friedman's lecture, which was available at … Pictures from Tal Pupko's slides. Changes by Dan Geiger and Shlomo Moran.

Evolution
Evolution of new organisms is driven by:
Diversity: different individuals carry different variants of the same basic blueprint.
Mutations: the DNA sequence can change through single-base changes, deletion or insertion of DNA segments, etc.
Selection bias.

The Tree of Life
[Figure. Source: Alberts et al.]

Tree of life: a better picture
[Figure. After Ernst Haeckel, 1891.]

Primate evolution
A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species; it is also called a phylogenetic tree.

Morphological vs. Molecular
Classical phylogenetic analysis uses morphological features: number of legs, lengths of legs, etc.
Modern biological methods allow the use of molecular features: gene sequences and protein sequences.
The analysis is based on homologous sequences (e.g., globins) in different species.
It is important for many aspects of biology: classification, and understanding biological mechanisms.

Morphological topology
(Based on McKenna and Bell, 1997)
[Figure: mammalian tree derived from morphological features.]
Leaves, in figure order: Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Tree shrew, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Horseshoe bat, Little red flying fox, Ryukyu flying fox, Mouse, Rat, Vole, Cane-rat, Guinea pig, Squirrel, Dormouse, Rabbit, Pika, Pig, Hippopotamus, Sheep, Cow, Alpaca, Blue whale, Fin whale, Sperm whale, Donkey, Horse, Indian rhino, White rhino, Elephant, Aardvark, Grey seal, Harbor seal, Dog, Cat, Asiatic shrew, Long-clawed shrew, Small Madagascar hedgehog, Hedgehog, Gymnure, Mole, Armadillo, Bandicoot, Wallaroo, Opossum, Platypus.
Groups marked on the tree: Archonta, Glires, Ungulata, Carnivora, Insectivora, Xenarthra.

From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
There are many possible types of sequences to use (e.g., mitochondrial vs. nuclear proteins).

Mitochondrial topology
(Based on Pupko et al.)
[Figure: mammalian tree derived from mitochondrial sequences.]
Leaves, in figure order: Donkey, Horse, Indian rhino, White rhino, Grey seal, Harbor seal, Dog, Cat, Blue whale, Fin whale, Sperm whale, Hippopotamus, Sheep, Cow, Alpaca, Pig, Little red flying fox, Ryukyu flying fox, Horseshoe bat, Japanese pipistrelle, Long-tailed bat, Jamaican fruit-eating bat, Asiatic shrew, Long-clawed shrew, Mole, Small Madagascar hedgehog, Aardvark, Elephant, Armadillo, Rabbit, Pika, Tree shrew, Bonobo, Chimpanzee, Man, Gorilla, Sumatran orangutan, Bornean orangutan, Common gibbon, Barbary ape, Baboon, White-fronted capuchin, Slow loris, Squirrel, Dormouse, Cane-rat, Guinea pig, Mouse, Rat, Vole, Hedgehog, Gymnure, Bandicoot, Wallaroo, Opossum, Platypus.
Groups marked on the tree: Perissodactyla, Carnivora, Cetartiodactyla, Primates, Chiroptera, Moles + Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Rodentia 1, Hedgehogs, Rodentia 2.

Nuclear topology
(Based on Pupko et al. slide; tree by Madsen)
[Figure: mammalian tree derived from nuclear sequences, with four major clades numbered 1 to 4.]
Leaves, in figure order: Round Eared Bat, Flying Fox, Hedgehog, Mole, Pangolin, Whale, Hippo, Cow, Pig, Cat, Dog, Horse, Rhino, Rat, Capybara, Rabbit, Flying Lemur, Tree Shrew, Human, Galago, Sloth, Hyrax, Dugong, Elephant, Aardvark, Elephant Shrew, Opossum, Kangaroo.
Groups marked on the tree: Cetartiodactyla, Afrotheria, Chiroptera, Eulipotyphla, Glires, Xenarthra, Carnivora, Perissodactyla, Scandentia + Dermoptera, Pholidota, Primate.

Theory of Evolution
Basic idea: speciation events lead to the creation of different species.
Speciation is caused by physical separation into groups where different genetic variants become dominant.
Any two species share a (possibly distant) common ancestor.

Phylogenetic trees
Leaves: current-day species.
Nodes: hypothetical most recent common ancestors.
Edge lengths: "time" from one speciation to the next.
[Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant.]

Dangers in Molecular Phylogenies
Gene and protein sequences can be homologous for various reasons:
Orthologs: sequences diverged after a speciation event, which is indicative of a new species.
Paralogs: sequences diverged after a duplication event.
Xenologs: sequences diverged after a horizontal transfer (e.g., by a virus).

Gene Phylogenies
Phylogenies can be constructed to describe the evolution of genes.
[Figure: a species phylogeny of three species, termed 1, 2, 3, with two paralog genes A and B. The gene tree marks a gene duplication and the speciation events; its leaves are 1A, 2A, 3A, 3B, 2B, 1B.]

Dangers of Paralogs
If we happen to consider only the sequences 1A, 2B, and 3A, we get a wrong tree that does not represent the phylogeny of the host species of the given sequences, because duplication does not create new species.
[Figure: the gene tree with a gene duplication, speciation events, and leaves 1A, 2A, 3A, 3B, 2B, 1B.]
In the sequel we assume all given sequences are orthologs.

Types of Trees
A natural model to consider is that of rooted trees.
[Figure: a rooted tree whose root is labelled "Common Ancestor".]

Types of trees
An unrooted tree represents a phylogeny without the root node. Depending on the model, data from current-day species does not distinguish between different placements of the root. In this example there are seven possible ways to place a root.
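The count of seven is what one gets if the example is an unrooted binary tree on five leaves and a root may be placed on any edge; the figure is not available here, so that reading is an assumption. A minimal sketch of the count, with a hypothetical function name:

```python
def root_placements(num_leaves: int) -> int:
    """Number of edges of an unrooted binary tree with num_leaves leaves.

    Assumption: every internal node has degree 3 and a root can be
    placed on any edge, so the answer is the edge count 2n - 3.
    """
    if num_leaves < 2:
        raise ValueError("need at least two leaves")
    return 2 * num_leaves - 3


print(root_placements(5))  # five leaves give seven candidate root positions
```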

Rooted versus unrooted trees
[Figure: an unrooted tree on leaves a, b, c represents the three rooted trees labelled Tree a, Tree b and Tree c.] Slide by Tal Pupko.

Positioning Roots in Unrooted Trees
We can estimate the position of the root by introducing an outgroup: a set of species that are definitely distant from all the species of interest.
[Figure: Falcon is added as an outgroup to Aardvark, Bison, Chimp, Dog and Elephant, and the proposed root is marked.]

Type of Data
Distance-based: the input is a matrix of distances between species. A distance can be the fraction of residues they disagree on, or an alignment score between them, or …
Character-based: examine each character (e.g., residue) separately.
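As an illustration of the "fraction of residues they disagree on" option, the sketch below computes pairwise distances for the four aligned fragments shown earlier in the lecture. The function name `p_distance` and the dictionary layout are choices made for this example only.

```python
from itertools import combinations

# Aligned fragments from the "From sequences to a phylogenetic tree" slide.
seqs = {
    "Rat": "QEPGGLVVPPTDA",
    "Rabbit": "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat": "REPGGLVVPPTEG",
}


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Distance matrix stored as {frozenset({species1, species2}): distance}.
dist = {
    frozenset((s, t)): p_distance(seqs[s], seqs[t])
    for s, t in combinations(seqs, 2)
}

for pair, value in dist.items():
    print(sorted(pair), round(value, 3))
```

Rat and Gorilla are identical on this fragment, so their distance is 0; a real analysis would use much longer alignments.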

Three Methods of Tree Construction
Distance: a tree that recursively combines the two nodes with the smallest distance.
Parsimony: a tree with the minimum total number of character changes between nodes.
Maximum likelihood: finding the best Bayesian network of a tree shape. The method of choice nowadays. The well-known PHYLIP software package implements this method.

Distance-Based (First Method)
Input: distance matrix between species.
Outline: cluster species together. Initially clusters are singletons. At each iteration, combine the two "closest" clusters to get a new one.

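UPGMA Clustering
Let $C_i$ and $C_j$ be clusters, and define the distance between them to be
$$d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q).$$
When we combine two clusters, $C_i$ and $C_j$, to form a new cluster $C_k$, then for every other cluster $C_l$
$$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}.$$
Define a node $K$ and place its daughter nodes at depth $d(C_i, C_j)/2$.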
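A minimal sketch of the clustering loop, assuming the usual UPGMA rule in which the distance between two clusters is the average pairwise distance, updated as a size-weighted average when clusters merge. The function name, the nested-tuple tree and the `heights` dictionary are illustrative choices, not part of the lecture.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves by average linkage.

    names: list of leaf labels.
    dist:  {frozenset({a, b}): distance} for every pair of leaves.
    Returns (tree, heights): tree is a nested tuple of merged clusters,
    heights maps each internal node to d(Ci, Cj) / 2 at its creation.
    """
    d = dict(dist)                       # work on a copy
    size = {name: 1 for name in names}   # active clusters and their sizes
    heights = {}
    while len(size) > 1:
        # Pick the two closest active clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        new = (a, b)
        heights[new] = d[frozenset((a, b))] / 2
        na, nb = size.pop(a), size.pop(b)
        # Size-weighted average keeps d equal to the mean pairwise distance.
        for c in size:
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        size[new] = na + nb
    return next(iter(size)), heights


# Hypothetical clock-like distances: A,B join at height 1, C at 2, D at 3.
names = ["A", "B", "C", "D"]
dist = {
    frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 4,
    frozenset("AD"): 6, frozenset("BD"): 6, frozenset("CD"): 6,
}
tree, heights = upgma(names, dist)
print(tree, heights[tree])
```

The pair search is quadratic per iteration, which is fine for a sketch; practical implementations keep the distances in a matrix or priority queue.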

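Example: UPGMA construction on five objects.
The length of an edge = its (vertical) height.
[Figure: UPGMA tree on leaves 1 to 5 with internal nodes 6 to 9; the node joining 2 and 3 is at height 0.5d(2,3), and the node joining 7 and 8 is at height 0.5d(7,8).]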

Molecular clock
This phylogenetic tree has all leaves at the same level. When this property holds, the phylogenetic tree is said to satisfy a molecular clock. Namely, the time from a speciation event to the formation of current species is identical for all paths (a wrong assumption in reality).
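The "all leaves at the same level" property can be checked directly on a rooted tree with edge lengths. The sketch below assumes a hypothetical representation in which a leaf is a string and an internal node is a list of (child, edge length) pairs.

```python
def leaf_depths(node, depth=0.0):
    """Map each leaf label to its total edge length from the root."""
    if isinstance(node, str):            # a leaf
        return {node: depth}
    depths = {}
    for child, length in node:           # an internal node
        depths.update(leaf_depths(child, depth + length))
    return depths


def satisfies_clock(tree, tol=1e-9):
    """True when every leaf is at the same distance from the root."""
    depths = leaf_depths(tree).values()
    return max(depths) - min(depths) <= tol


# Hypothetical trees: in the second one, leaf B sits on a longer edge.
clock_tree = [([("A", 1.0), ("B", 1.0)], 1.0), ("C", 2.0)]
non_clock_tree = [([("A", 1.0), ("B", 3.0)], 1.0), ("C", 2.0)]

print(satisfies_clock(clock_tree), satisfies_clock(non_clock_tree))
```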

Molecular Clock
UPGMA constructs trees that satisfy a molecular clock, even if the true tree does not satisfy a molecular clock.
[Figure: a tree on leaves 1, 2, 3, 4 and the tree that UPGMA constructs for it.]

Restrictive Correctness of UPGMA
Proposition: If the distance function is derived by adding edge distances in a tree T with a molecular clock, then UPGMA will reconstruct T.
Proof idea: Move a horizontal line from the bottom of T to the top. Whenever an internal node is formed, the algorithm will create it.

Additivity
A molecular clock defines additive distances, namely, distances between objects that can be realized by a tree.
[Figure: three leaves i, j, k joined at an internal node by edges of lengths a, b, c.]

Basic property of Additivity
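Suppose the input distances are additive. For any three leaves $i, j, m$ there is an internal node $k$ at which the three paths between them meet, so
$$d(i,m) = d(i,k) + d(k,m), \qquad d(j,m) = d(j,k) + d(k,m), \qquad d(i,j) = d(i,k) + d(k,j).$$
Thus
$$d(k,m) = \tfrac{1}{2}\bigl(d(i,m) + d(j,m) - d(i,j)\bigr).$$
[Figure: leaves i, j, m joined at node k by edges of lengths a, b, c.]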

Constructing additive trees: The neighbor finding problem
Can we use this fact to construct trees assuming only additivity (but not a molecular clock)? Yes. The formula shows that if we knew that i and j are neighboring leaves, then we can construct their parent node k and compute the distances of k to all other leaves m. We remove nodes i,j and add k.
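The replacement step can be sketched as follows, assuming the three-leaf identity d(k,m) = (d(i,m) + d(j,m) - d(i,j)) / 2 for additive distances. The function name and the frozenset-keyed dictionary are choices made for this example.

```python
def join_neighbors(leaves, dist, i, j, k):
    """Replace neighboring leaves i, j by their parent k.

    dist maps frozenset({a, b}) to an additive distance.
    Returns the new leaf list and the new distance dictionary.
    """
    others = [m for m in leaves if m not in (i, j)]
    # Keep every distance that does not involve i or j.
    new_dist = {p: v for p, v in dist.items() if i not in p and j not in p}
    d_ij = dist[frozenset((i, j))]
    for m in others:
        new_dist[frozenset((k, m))] = 0.5 * (
            dist[frozenset((i, m))] + dist[frozenset((j, m))] - d_ij
        )
    return others + [k], new_dist


# Hypothetical tree: A-k 1, B-k 2, k-x 3, C-x 4, D-x 5.
leaves = ["A", "B", "C", "D"]
dist = {
    frozenset("AB"): 3, frozenset("AC"): 8, frozenset("AD"): 9,
    frozenset("BC"): 9, frozenset("BD"): 10, frozenset("CD"): 9,
}
new_leaves, new_dist = join_neighbors(leaves, dist, "A", "B", "k")
print(new_leaves, new_dist[frozenset(("k", "C"))], new_dist[frozenset(("k", "D"))])
```

In this example the computed values 7 and 8 match the path lengths k-x-C and k-x-D in the tree the distances were generated from.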

Neighbor Finding
How can we find from distances alone that a pair of nodes i,j are neighboring leaves? Closest nodes aren't necessarily neighbors.
[Figure: a tree on leaves A, B, C, D.]
Next we show one way to find neighbors from additive distances.

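Neighbor Finding
Theorem (Saitou & Nei): Assume all edge weights are positive. For $L$ leaves, let $r_i = \sum_{k} d(i,k)$ and $D(i,j) = (L-2)\,d(i,j) - (r_i + r_j)$. If $D(i,j)$ is minimal (among all pairs of leaves), then $i$ and $j$ are neighboring leaves in the tree.
[Figure: leaves i and j with the path (i, l, …, k, j) between them, a further leaf m, and subtrees T1 and T2.]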
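A small numerical check of the theorem, using the definition that appears in the proof, D(i,j) = (L-2)d(i,j) - (r_i + r_j), and assuming r_i is the sum of distances from i to all other leaves. The four-leaf tree below is hypothetical: A and C hang off one internal node, B and D off the other.

```python
from itertools import combinations

# Edge lengths: A-u 1, C-u 4, u-v 1, B-v 1, D-v 4 (u, v internal).
leaves = ["A", "B", "C", "D"]
d = {
    frozenset("AB"): 3, frozenset("AC"): 5, frozenset("AD"): 6,
    frozenset("BC"): 6, frozenset("BD"): 5, frozenset("CD"): 9,
}


def dist(a, b):
    return d[frozenset((a, b))]


L = len(leaves)
# r[i] = sum of distances from leaf i to every other leaf.
r = {i: sum(dist(i, k) for k in leaves if k != i) for i in leaves}


def D(i, j):
    return (L - 2) * dist(i, j) - (r[i] + r[j])


pairs = list(combinations(leaves, 2))
closest = min(pairs, key=lambda p: dist(*p))   # smallest raw distance
best = min(pairs, key=lambda p: D(*p))         # smallest D value

# A and B are closest but are not neighbors; the pair minimizing D is.
print(closest, best)  # ('A', 'B') ('A', 'C')
```

Here both neighboring pairs, (A, C) and (B, D), tie for the minimum D value; `min` reports the first one.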

Neighbor Joining Algorithm
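Set L to contain all leaves.
Iteration: choose i,j such that D(i,j) is minimal. Create a new node k, and set $d(k,m) = \tfrac{1}{2}\bigl(d(i,m) + d(j,m) - d(i,j)\bigr)$ for every other node m in L. Remove i,j from L, and add k.
Terminate: when |L| = 2, connect the two remaining nodes.
[Figure: leaves i and j joined to a new node k, with another node m.]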
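The loop can be sketched as below. Two things are assumptions of this sketch rather than statements from the slides: D(i,j) = (n-2)d(i,j) - (r_i + r_j) with r_i the sum of distances from i to the other active nodes, and the branch-length rule d(i,k) = (d(i,j) + (r_i - r_j)/(n-2)) / 2, which is the commonly used choice and is exact for additive input. The internal node labels "k0", "k1", … are hypothetical and must not clash with leaf names.

```python
from itertools import combinations


def neighbor_joining(leaves, dist):
    """Return a list of edges (u, v, length) of the reconstructed tree.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    d = dict(dist)
    nodes = list(leaves)
    edges = []

    def dd(a, b):
        return d[frozenset((a, b))]

    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(dd(a, b) for b in nodes if b != a) for a in nodes}
        # Choose the pair minimizing D(i, j).
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * dd(*p) - r[p[0]] - r[p[1]],
        )
        k = f"k{counter}"
        counter += 1
        # Branch lengths from i and j to the new node k.
        d_ik = 0.5 * (dd(i, j) + (r[i] - r[j]) / (n - 2))
        edges += [(i, k, d_ik), (j, k, dd(i, j) - d_ik)]
        # Distances from k to every remaining node m.
        nodes = [m for m in nodes if m not in (i, j)]
        for m in nodes:
            d[frozenset((k, m))] = 0.5 * (dd(i, m) + dd(j, m) - dd(i, j))
        nodes.append(k)
    a, b = nodes
    edges.append((a, b, dd(a, b)))
    return edges


# Hypothetical additive input: A-u 1, C-u 4, u-v 1, B-v 1, D-v 4.
dist = {
    frozenset("AB"): 3, frozenset("AC"): 5, frozenset("AD"): 6,
    frozenset("BC"): 6, frozenset("BD"): 5, frozenset("CD"): 9,
}
for edge in neighbor_joining(["A", "B", "C", "D"], dist):
    print(edge)
```

On this input the sketch recovers the generating tree, with k0 playing the role of u and k1 the role of v.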

Neighbor Finding: notations used in the proof
p(i,j) = the path from vertex i to vertex j. Example: p(D,C) = (e1,e2,e3) = (D,E,F,C).
For a vertex i and an edge e=(i',j'): N_i(e) = |{leaves k : e is on p(i,k)}|. Example: N_D(e1) = 3, N_D(e2) = 2, N_D(e3) = 1, N_C(e1) = 1.
[Figure: a tree on leaves A, B, C, D with internal vertices E and F; the edges e1, e2, e3 form the path from D to C.]
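The counts in the example can be reproduced with a short script. The tree shape is an assumption, since the figure is not available: leaves A and D are attached to E, leaves B and C are attached to F, and E is joined to F, which is consistent with the four values quoted above. The function names are illustrative.

```python
def path(adj, src, dst, seen=None):
    """Vertices on the unique path from src to dst in a tree."""
    seen = seen or {src}
    if src == dst:
        return [src]
    for nxt in adj[src]:
        if nxt not in seen:
            rest = path(adj, nxt, dst, seen | {nxt})
            if rest:
                return [src] + rest
    return []


def N(adj, leaves, i, e):
    """Number of leaves k (other than i) whose path from i uses edge e."""
    count = 0
    for k in leaves:
        if k == i:
            continue
        p = path(adj, i, k)
        steps = {frozenset(step) for step in zip(p, p[1:])}
        if frozenset(e) in steps:
            count += 1
    return count


# Assumed tree: A-E, D-E, E-F, B-F, C-F.
adj = {
    "A": ["E"], "D": ["E"], "B": ["F"], "C": ["F"],
    "E": ["A", "D", "F"], "F": ["E", "B", "C"],
}
leaves = ["A", "B", "C", "D"]
e1, e2, e3 = ("D", "E"), ("E", "F"), ("F", "C")

print(path(adj, "D", "C"))
print([N(adj, leaves, "D", e) for e in (e1, e2, e3)], N(adj, leaves, "C", e1))
```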

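Neighbor Finding
Notation: For e=(i,m), we denote d(i,m) by d(e).
Lemma: For leaves i,j connected by a path (i,l,…,k,j),
$$r_i - r_j = (L-2)\bigl(d(i,l) - d(j,k)\bigr) + \sum_{e \in P(l,k)} \bigl(N_i(e) - N_j(e)\bigr)\, d(e).$$
[Figure: leaves i and j attached at l and k, with the rest of T between l and k.]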

Neighbor Finding
Proof of Theorem: Assume by contradiction that D(i,j) is minimal for i,j which are not neighboring leaves. Let (i,l,...,k,j) be the path from i to j. Let T1 and T2 be the subtrees rooted at l and k. Let |T| denote the number of leaves in T.
[Figure: leaves i and j, the path vertices l and k, and the subtrees T1 and T2.]

Neighbor Finding
Case 1: i or j has a neighboring leaf. WLOG j and m are such leaves.
A. $D(i,j) - D(m,j) = (L-2)\bigl(d(i,j) - d(j,m)\bigr) - (r_i + r_j) + r_m + r_j$ {Definition}
$= (L-2)\bigl(d(i,k) - d(k,m)\bigr) + r_m - r_i$ {Figure}
B. $r_m - r_i \ge (L-2)\bigl(d(k,m) - d(i,l)\bigr) + (4-L)\,d(k,l)$ {Lemma + Figure}
(since for each edge $e \in P(k,l)$, $N_m(e) \ge 2$ and $N_i(e) \le L-2$, so $N_m(e) - N_i(e) \ge 4-L$).
Substituting B in A: $D(i,j) - D(m,j) \ge (L-2)\bigl(d(i,k) - d(i,l)\bigr) + (4-L)\,d(k,l) = 2\,d(k,l) > 0$, contradicting the minimality assumption.
[Figure: leaves i, j, m, the path vertices l and k, and the subtree T2.]

Neighbor Finding
Case 2: Not case 1. Then both $T_1$ and $T_2$ contain 2 neighboring leaves. We show that if $D(i,j)$ is minimal, then we must have both $|T_1| > |T_2|$ and $|T_2| > |T_1|$, which is a contradiction; hence $D(i,j)$ is not minimal.
We prove that $|T_1| > |T_2|$ by assuming that $|T_1| \le |T_2|$ and reaching a contradiction. The proof that $|T_2| > |T_1|$ is similar. Let n,m be neighboring leaves in $T_1$.
[Figure: leaves i and j, the path vertices l and k, the subtrees T1 and T2, and the neighboring leaves m and n with their parent p.]

Neighbor Finding
A. $0 \le D(m,n) - D(i,j) = (L-2)\bigl(d(m,n) - d(i,j)\bigr) + (r_i + r_j) - (r_m + r_n)$
B. $r_j - r_m < (L-2)\bigl(d(j,k) - d(m,p)\bigr) + (|T_1| - |T_2|)\,d(k,p)$ (because $N_j(e) - N_m(e) < |T_1| - |T_2|$).
C. $r_i - r_n < (L-2)\bigl(d(i,k) - d(n,p)\bigr) + (|T_1| - |T_2|)\,d(l,p)$
Adding B and C, noting that $d(l,p) > d(k,p)$ and using the assumption $|T_1| - |T_2| \le 0$:
D. $(r_i + r_j) - (r_m + r_n) < (L-2)\bigl(d(i,j) - d(n,m)\bigr) + 2(|T_1| - |T_2|)\,d(k,p)$
Substituting D in the right-hand side of A: $0 \le D(m,n) - D(i,j) < 2(|T_1| - |T_2|)\,d(k,p)$, hence $|T_1| - |T_2| > 0$, a contradiction.
[Figure: leaves i and j, the path vertices l and k, the subtrees T1 and T2, and the neighboring leaves m and n with their parent p.]
