Phylogenetic Analysis

Review of Linux
Commands: ls, cd, mkdir, less, cp, mv, cat, pwd
Output redirection: >

Perl
Variables: $DNA="A"; @DATA=('A', 'B'); %TABLE=(A=>'A', N=>'[AC]',);
Statements and functions: print, length, open, close, substr, push, pop, shift, unshift

#!/usr/bin/perl -w
# Compare $word against known peptides and report the first match.
$word = 'MNIDDKL';
if ($word eq 'QSTVSGE') {
    print "QSTVSGE\n";
} elsif ($word eq 'MRQQDMISHDEL') {
    print "MRQQDMISHDEL\n";
} elsif ($word eq 'MNIDDKL') {
    print "MNIDDKL-the magic word!\n";
} else {
    print "Is \"$word\" a peptide?\n";
}
exit;

$x = 10;
$y = -20;
if ($x <= 10) { print "1st true\n"; }
if ($x > 10) { print "2nd true\n"; }
if ($x <= 10 || $y > -21) { print "3rd true\n"; }
if ($x > 5 && $y < 0) { print "4th true\n"; }
if (($x > 5 && $y < 0) || $y > 5) { print "5th true\n"; }

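# Count the C and G bases in $DNA (assumes $DNA already holds a DNA string).
$position = 0;
while ($position < length $DNA) {
    $base = substr($DNA, $position, 1);
    if ($base eq 'C' or $base eq 'G') {
        ++$count_of_CG;
    }
    $position++;
}

# The same loop written with for (use one form or the other, not both):
for ($position = 0; $position < length $DNA; ++$position) {
    $base = substr($DNA, $position, 1);
    if ($base eq 'C' or $base eq 'G') {
        ++$count_of_CG;
    }
}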

The Most Common Sequence Formats

Converting Formats
Don't re-compute your MSA if it is not in the right format
Convert your file using one of the online conversion tools
The 3 most popular reformatting utilities:
  Fmtseq: the most complete
  Readseq: very popular and robust
  SeqCheck: can clean FASTA sequences

Editing your MSA
If your MSA looks bad . . .
Don't torture the online server
Edit the MSA yourself locally
Never, ever, ever (ever) use a standard word processor
Always use a dedicated MSA editor
The most popular online tool is Jalview
You can get it at

MSA => LOGO Graph
A LOGO graph summarizes an MSA
Tall letters indicate highly conserved positions
Short letters indicate poorly conserved positions
LOGO graphs are ideal for identifying conserved patterns
weblogo.berkeley.edu/
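The idea that conserved positions get tall letters can be sketched as a per-column information content. This is a minimal Python sketch, assuming a gap-free DNA alignment and ignoring any small-sample correction a logo tool may apply; the function name `column_information` and the toy alignment are illustrative only.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Information content in bits: log2(alphabet size) minus the column entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

# Toy alignment (hypothetical data), one string per sequence.
msa = ["ACGT", "ACGA", "ACTC", "AGCG"]

# zip(*msa) walks the alignment column by column.
heights = [column_information(col) for col in zip(*msa)]
```

A fully conserved column reaches the maximum of 2 bits for DNA, while a column where every base is equally frequent drops to 0 bits.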

Molecular Evolution and Phylogenetic Reconstruction

Evolutionary Tree of Bears and Raccoons

Human Evolutionary Tree (cont’d)

Human Migration Out of Africa
1. Yorubans
2. Western Pygmies
3. Eastern Pygmies
4. Hadza
5. !Kung

Reading Your Tree
There's a lot of vocabulary in a tree
Nodes correspond to common ancestors
The root is the oldest ancestor
  Often artificial
  Only meaningful with a good outgroup
Trees can be unrooted
Branch lengths are only meaningful when the tree is scaled
  Cladograms are usually unscaled
  Phenograms are usually scaled

Rooted and Unrooted Trees
In the unrooted tree the position of the root (“oldest ancestor”) is unknown. Otherwise, they are like rooted trees

Type of Trees (Cladogram)

Type of Trees (Phylogram)

Orthology and Paralogy
Orthology (vertical descent) and paralogy (parallel descent)
Orthologous genes
  Separated by speciation
  Often have the same function
Paralogous genes
  Separated by duplications
  Can have different functions
In the graph:
  A is paralogous with B
  A1 is orthologous with A2

Which Sequences?
Orthologous sequences
  Produce a species tree
  Show how the considered species have diverged
Paralogous sequences
  Produce a gene tree
  Show the evolution of a protein family

Building the Right MSA
Your MSA should have as few gaps as possible.
In most cases, you should remove the columns that contain gaps.
Some variability but not too much!
Some conservation but not too much!
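Removing gapped columns can be sketched in a few lines. This is a minimal Python sketch, assuming the alignment is a list of equal-length strings and that `-` marks a gap; the function name `drop_gap_columns` and the toy alignment are illustrative only.

```python
def drop_gap_columns(msa, gap="-"):
    """Return the alignment with every column that contains a gap removed."""
    # zip(*msa) yields the columns; keep only those without a gap character.
    kept = [col for col in zip(*msa) if gap not in col]
    if not kept:
        return ["" for _ in msa]
    # Transpose the kept columns back into rows.
    return ["".join(row) for row in zip(*kept)]

# Toy alignment (hypothetical data).
msa = ["AC-GT", "ACTGT", "A-TGA"]
cleaned = drop_gap_columns(msa)
```

Here columns 2 and 3 each contain a gap, so only columns 1, 4 and 5 survive.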

Building the Right Tree
There are three types of tree-reconstruction methods
  Distance-based methods
  Statistical methods
  Parsimony methods
Statistical methods are the most accurate
  Maximum likelihood
  Bayesian methods
Statistical methods take more time
  Limited to small datasets

Distance in Trees: an Example
d1,4 = 68 (the sum of the edge lengths on the path between leaves 1 and 4 in the figure)

Compute a Distance Matrix
Evolutionary distance: number of substitutions per 100 amino acids (for proteins) or nucleotides (for DNA)

ACTGTAGGAATCGC
AATGAAAGAATCGC
3 observed changes

ACTGTAGGAATCGC
ACTGCAGGAATAGC
AATGAAAGAATCGC
6 actual changes
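The difference between observed and actual changes can be checked by counting mismatches. This is a minimal Python sketch, assuming the three sequences on the slide are a start sequence, an intermediate state, and an end sequence (that reading of the figure is an assumption); the function names are illustrative only.

```python
def observed_changes(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

def distance_per_100(seq_a, seq_b):
    """Observed differences scaled to 100 positions."""
    return 100.0 * observed_changes(seq_a, seq_b) / len(seq_a)

start = "ACTGTAGGAATCGC"
middle = "ACTGCAGGAATAGC"  # assumed intermediate state
end = "AATGAAAGAATCGC"

# Comparing only the two end points hides substitutions that were overwritten or reverted.
observed = observed_changes(start, end)
actual = observed_changes(start, middle) + observed_changes(middle, end)
```

Counting along the assumed path gives twice as many changes as comparing the end points directly, which is why observed distances tend to underestimate the true ones.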

Edit Distance vs Tree Distance
Tree distance: d1,4 = 68
The edit distance D1,4 may be smaller than 68, as some changes may not be observed

Fitting Distance Matrix
Given n species, we can compute the n x n distance matrix Dij
Evolution of these genes is described by a tree that we don't know.
We need an algorithm to construct a tree that best fits the distance matrix Dij

Reconstructing a 3 Leaved Tree
Tree reconstruction for any 3x3 matrix is straightforward
We have 3 leaves i, j, k and a center vertex c
Observe:
  dic + djc = Dij
  dic + dkc = Dik
  djc + dkc = Djk

Reconstructing a 3 Leaved Tree (cont'd)
Add the first two equations:
  dic + djc = Dij
  dic + dkc = Dik
  2dic + djc + dkc = Dij + Dik
Substitute djc + dkc = Djk:
  2dic + Djk = Dij + Dik
  dic = (Dij + Dik - Djk)/2
Similarly,
  djc = (Dij + Djk - Dik)/2
  dkc = (Dki + Dkj - Dij)/2
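The three formulas translate directly into code. This is a minimal Python sketch; the function name `three_leaf_branches` and the input distances are illustrative values, not taken from a figure in the slides.

```python
def three_leaf_branches(d_ij, d_ik, d_jk):
    """Branch lengths from leaves i, j, k to the center vertex c."""
    d_ic = (d_ij + d_ik - d_jk) / 2
    d_jc = (d_ij + d_jk - d_ik) / 2
    d_kc = (d_ik + d_jk - d_ij) / 2
    return d_ic, d_jc, d_kc

# Illustrative distances: Dij = 5, Dik = 4, Djk = 7.
d_ic, d_jc, d_kc = three_leaf_branches(5, 4, 7)
```

Adding the branch lengths back together pairwise recovers the three input distances, which is a quick sanity check on the result.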

Additive Distance Matrices
Matrix D is ADDITIVE if there exists a tree T with dij(T) = Dij for all i, j
NON-ADDITIVE otherwise

The Four Point Condition (cont'd)
For a quartet tree in which i and j are neighbors, and k and l are neighbors, compute:
  1. Dij + Dkl
  2. Dik + Djl
  3. Dil + Djk
Sums 2 and 3 represent the same number: the length of all edges + the middle edge (it is counted twice)
Sum 1 represents a smaller number: the length of all edges - the middle edge

The Four Point Condition: Theorem
The four point condition for the quartet i, j, k, l is satisfied if two of these sums are the same, with the third sum smaller than these first two
Theorem: An n x n matrix D is additive if and only if the four point condition holds for every quartet 1 ≤ i, j, k, l ≤ n
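The theorem suggests a direct additivity test. This is a minimal Python sketch, assuming D is a symmetric list of lists and treating "smaller" as "smaller or equal" so that a star-shaped quartet also passes; the function names, the tolerance and the example matrices are illustrative only.

```python
from itertools import combinations

def four_point_holds(D, i, j, k, l, tol=1e-9):
    """True if the two largest of the three pairwise sums are equal."""
    sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    return abs(sums[1] - sums[2]) <= tol

def is_additive(D):
    """Check the four point condition on every quartet of distinct indices."""
    return all(four_point_holds(D, *quartet)
               for quartet in combinations(range(len(D)), 4))

# Distances read off a hypothetical tree: leaves 0 and 1 hang off one internal
# vertex (edges 1 and 2), leaves 2 and 3 off another (edges 4 and 5), and the
# middle edge has length 3.
additive = [[0, 3, 8, 9],
            [3, 0, 9, 10],
            [8, 9, 0, 9],
            [9, 10, 9, 0]]

# Same matrix with one entry changed so that no tree fits it.
broken = [row[:] for row in additive]
broken[2][3] = broken[3][2] = 20
```

The check costs O(n^4) time, which is fine for a classroom example but not how large matrices would be handled in practice.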

Distance Based Phylogeny Problem
Goal: Reconstruct an evolutionary tree from a distance matrix
Input: n x n distance matrix Dij
Output: weighted tree T with n leaves fitting D
If D is additive, this problem has a solution and there is a simple algorithm to solve it

Using Neighboring Leaves to Construct the Tree
Find neighboring leaves i and j with parent k
Remove the rows and columns of i and j
Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as:
  Dkm = (Dim + Djm - Dij)/2
Compress i and j into k, iterate algorithm for rest of tree
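One compression step can be sketched as follows. This is a minimal Python sketch, assuming the matrix is stored as a dict of dicts and that i and j are already known to be neighbors; the function name `collapse_neighbors` and the four-leaf example are illustrative only.

```python
def collapse_neighbors(D, i, j, parent):
    """Replace neighboring leaves i and j by their parent, using
    D[parent][m] = (D[i][m] + D[j][m] - D[i][j]) / 2 for every other leaf m."""
    others = [m for m in D if m not in (i, j)]
    new = {m: {n: D[m][n] for n in others} for m in others}
    new[parent] = {parent: 0.0}
    for m in others:
        d_km = (D[i][m] + D[j][m] - D[i][j]) / 2
        new[parent][m] = d_km
        new[m][parent] = d_km
    return new

# Hypothetical additive matrix: a and b are neighbors (edges 1 and 2),
# c and d are neighbors (edges 4 and 5), middle edge 3.
D = {
    "a": {"a": 0, "b": 3, "c": 8, "d": 9},
    "b": {"a": 3, "b": 0, "c": 9, "d": 10},
    "c": {"a": 8, "b": 9, "c": 0, "d": 9},
    "d": {"a": 9, "b": 10, "c": 9, "d": 0},
}
reduced = collapse_neighbors(D, "a", "b", "k")
```

The new distances from k to c and d equal the path lengths in the underlying tree (3 + 4 and 3 + 5), so the reduced matrix is again additive.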

Finding Neighboring Leaves
To find neighboring leaves we simply select a pair of closest leaves. WRONG

Closest leaves aren't necessarily neighbors
i and j are neighbors, but (dij = 13) > (djk = 12)
Finding a pair of neighboring leaves is a nontrivial problem!

Neighbor Joining Algorithm
In 1987 Naruya Saitou and Masatoshi Nei developed a neighbor joining algorithm for phylogenetic tree reconstruction
Finds a pair of leaves that are close to each other but far from other leaves: implicitly finds a pair of neighboring leaves
Advantages: works well for additive matrices and also for non-additive ones; it does not rely on the flawed molecular clock assumption

Overview
1. Based on the current distance matrix, calculate the matrix Q (defined later).
2. Find the pair of taxa i, j for which Qij has its lowest value. Add a new node to the tree, joining these taxa to the rest of the tree.
3. Calculate the distance from each of the taxa in the pair to this new node.
4. Calculate the distance from each of the taxa outside of this pair to the new node.
5. Start the algorithm again, replacing the pair of joined neighbors with the new node and using the distances calculated in the previous step.

Basic Algorithm

Another Example
Distance matrix:
     A   B   C   D   E
B    5
C    4   7
D    7  10   7
E    6   9   6   5
F    8  11   8   9   8

Q(ij)=(N-2)d(ij) - [r(i) + r(j)]
N is the current number of taxa (N = 6 at the start) and r(i) is the sum of the distances from taxon i to all other taxa

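Q matrix:
     A    B    C    D    E
B  -52
C  -46  -46
D  -40  -40  -42
E  -40  -40  -42  -52
F  -42  -42  -44  -46  -46
The lowest value, -52, is reached by both (A, B) and (D, E); A and B are joined first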
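The Q values can be recomputed from the formula Q(ij) = (N-2)d(ij) - [r(i) + r(j)]. This is a minimal Python sketch; the pairwise distances below are an assumed reconstruction of the example matrix (the slide text only preserves its distinct values), and the names `pairs`, `dist` and `q_matrix` are illustrative only.

```python
from itertools import combinations

taxa = ["A", "B", "C", "D", "E", "F"]

# Assumed example distances (reconstructed, not confirmed by the slide text).
pairs = {
    ("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 7, ("A", "E"): 6, ("A", "F"): 8,
    ("B", "C"): 7, ("B", "D"): 10, ("B", "E"): 9, ("B", "F"): 11,
    ("C", "D"): 7, ("C", "E"): 6, ("C", "F"): 8,
    ("D", "E"): 5, ("D", "F"): 9,
    ("E", "F"): 8,
}

def dist(i, j):
    """Symmetric lookup into the pair table."""
    if i == j:
        return 0
    return pairs.get((i, j), pairs.get((j, i)))

def q_matrix(taxa, dist):
    """Q(ij) = (N-2) d(ij) - [r(i) + r(j)], with r(i) the row sum of taxon i."""
    n = len(taxa)
    r = {i: sum(dist(i, k) for k in taxa) for i in taxa}
    return {(i, j): (n - 2) * dist(i, j) - (r[i] + r[j])
            for i, j in combinations(taxa, 2)}

Q = q_matrix(taxa, dist)
lowest = min(Q.values())
best_pairs = [pair for pair, value in Q.items() if value == lowest]
```

With these assumed distances the set of distinct Q values matches the ones listed on the slide, and two pairs tie for the minimum.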

Tree (So far)
d(AU) = d(AB)/2 + [r(A) - r(B)] / [2(N-2)] = 1
d(BU) = d(AB) - d(AU) = 4

d(CU) = [d(AC) + d(BC) - d(AB)] / 2 = 3
d(DU) = [d(AD) + d(BD) - d(AB)] / 2 = 6
d(EU) = [d(AE) + d(BE) - d(AB)] / 2 = 5
d(FU) = [d(AF) + d(BF) - d(AB)] / 2 = 7
New Matrix:
     U   C   D   E
C    3
D    6   7
E    5   6   5
F    7   8   9   8

Q(ij)=(N-2)d(ij) - [r(i) + r(j)]
Now N = 5 and the row sums are:
r(U) = 3 + 6 + 5 + 7 = 21
r(C) = 24
r(D) = 27
r(E) = 24
r(F) = 32

Q matrix:
     U    C    D    E
C  -36
D  -30  -30
E  -30  -30  -36
F  -32  -32  -32  -32
U and C are joined at a new node W:
d(UW) = d(UC)/2 + [r(U) - r(C)] / [2(N-2)] = 3/2 + (21 - 24)/6 = 1
d(CW) = d(UC) - d(UW) = 2
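The whole procedure can be sketched as a short loop. This is a minimal Python sketch of neighbor joining, not a reference implementation: the input distances are an assumed reconstruction of the example matrix, internal nodes are named `node0`, `node1`, ... instead of U, W, and ties in Q are broken by taking the first pair found.

```python
from itertools import combinations

def neighbor_joining(D):
    """D: dict of dicts of pairwise distances (no diagonal).
    Returns the tree as a list of (node, neighbor, branch_length) edges."""
    D = {i: dict(row) for i, row in D.items()}  # work on a copy
    edges = []
    next_id = 0
    while len(D) > 2:
        n = len(D)
        r = {i: sum(D[i].values()) for i in D}
        # Pair with the lowest Q(ij) = (N-2) d(ij) - [r(i) + r(j)].
        i, j = min(combinations(D, 2),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new node.
        d_iu = D[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((i, u, d_iu))
        edges.append((j, u, D[i][j] - d_iu))
        # Distances from the new node to every remaining taxon.
        row = {m: (D[i][m] + D[j][m] - D[i][j]) / 2 for m in D if m not in (i, j)}
        del D[i], D[j]
        for m in D:
            del D[m][i], D[m][j]
            D[m][u] = row[m]
        D[u] = row
    a, b = D  # two nodes remain; connect them directly
    edges.append((a, b, D[a][b]))
    return edges

# Assumed example distances (reconstructed, not confirmed by the slide text).
pairs = {
    ("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 7, ("A", "E"): 6, ("A", "F"): 8,
    ("B", "C"): 7, ("B", "D"): 10, ("B", "E"): 9, ("B", "F"): 11,
    ("C", "D"): 7, ("C", "E"): 6, ("C", "F"): 8,
    ("D", "E"): 5, ("D", "F"): 9, ("E", "F"): 8,
}
taxa = "ABCDEF"
D = {i: {j: pairs.get((i, j), pairs.get((j, i))) for j in taxa if j != i}
     for i in taxa}
edges = neighbor_joining(D)
```

The first two joins reproduce the branch lengths worked out on the slides (1 and 4 for A and B, then 2 and 1 for C and the first internal node). The matrix is stored as a dict of dicts so that rows and columns can be deleted by name when a pair is collapsed.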

Programs
BIONJ
WEIGHBOR
FastME

UPGMA: Unweighted Pair Group Method with Arithmetic Mean
UPGMA is a clustering algorithm that:
  computes the distance between clusters using average pairwise distance
  assigns a height to every vertex in the tree, effectively assuming the presence of a molecular clock and dating every vertex

UPGMA's Weakness
The algorithm produces an ultrametric tree: the distance from the root to any leaf is the same
UPGMA assumes a constant molecular clock: all species represented by the leaves in the tree are assumed to accumulate mutations (and thus evolve) at the same rate.
This is a major pitfall of UPGMA.

UPGMA’s Weakness: Example
Figure: the correct tree compared with the tree produced by UPGMA, for leaves 1 to 4

Clustering in UPGMA
Given two disjoint clusters Ci, Cj of sequences,
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
Note that if Ck = Ci ∪ Cj, then the distance to another cluster Cl is:
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$
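The two formulas give the same number, which can be checked on a small case. This is a minimal Python sketch, assuming leaves are indexed 0 to 3 and distances are stored in a symmetric list of lists; the function names and the matrix are illustrative only.

```python
def cluster_distance(ci, cj, d):
    """Average pairwise distance between two disjoint clusters of leaves."""
    return sum(d[p][q] for p in ci for q in cj) / (len(ci) * len(cj))

def merged_distance(d_il, d_jl, size_i, size_j):
    """Distance from the merged cluster Ci ∪ Cj to Cl, as a size-weighted average."""
    return (d_il * size_i + d_jl * size_j) / (size_i + size_j)

# Hypothetical leaf distances.
d = [[0, 2, 6, 10],
     [2, 0, 6, 10],
     [6, 6, 0, 8],
     [10, 10, 8, 0]]

ci, cj, cl = [0, 1], [2], [3]
d_il = cluster_distance(ci, cl, d)
d_jl = cluster_distance(cj, cl, d)

# Updating from the two old distances agrees with averaging from scratch.
via_update = merged_distance(d_il, d_jl, len(ci), len(cj))
from_scratch = cluster_distance(ci + cj, cl, d)
```

The update formula matters in practice because it avoids revisiting every leaf pair after each merge.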

UPGMA Algorithm
Initialization:
  Assign each xi to its own cluster Ci
  Define one leaf per sequence, each at height 0
Iteration:
  Find two clusters Ci and Cj such that dij is min
  Let Ck = Ci ∪ Cj
  Add a vertex connecting Ci, Cj and place it at height dij/2
  Delete Ci and Cj
Termination:
  When a single cluster remains
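The steps above can be sketched as a short loop. This is a minimal Python sketch, assuming distances arrive as a dict of dicts keyed by leaf name; the distance update after each merge uses the size-weighted average of the two merged clusters, which the step list does not spell out. The function name `upgma`, the nested-tuple tree format and the example distances are illustrative only.

```python
def upgma(dist):
    """dist: dict of dicts of leaf distances.
    Returns the tree as nested tuples (left, right, height); leaves are plain labels."""
    dist = {a: {b: v for b, v in row.items() if b != a} for a, row in dist.items()}
    size = {a: 1 for a in dist}
    while len(dist) > 1:
        # Find the two closest clusters.
        i, j = min(((a, b) for a in dist for b in dist[a]),
                   key=lambda p: dist[p[0]][p[1]])
        k = (i, j, dist[i][j] / 2)  # new vertex placed at height d_ij / 2
        size[k] = size[i] + size[j]
        # Size-weighted average distance from the merged cluster to the others.
        row = {l: (dist[i][l] * size[i] + dist[j][l] * size[j]) / size[k]
               for l in dist if l not in (i, j)}
        del dist[i], dist[j]
        for l in dist:
            del dist[l][i], dist[l][j]
            dist[l][k] = row[l]
        dist[k] = row
    (root,) = dist  # a single cluster remains
    return root

# Hypothetical leaf distances.
d = {
    "A": {"B": 2, "C": 6, "D": 10},
    "B": {"A": 2, "C": 6, "D": 10},
    "C": {"A": 6, "B": 6, "D": 8},
    "D": {"A": 10, "B": 10, "C": 8},
}
tree = upgma(d)
```

Because every vertex is placed at half the distance between the clusters it joins, the resulting tree is ultrametric by construction, whether or not the data support a molecular clock.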

UPGMA Algorithm (cont’d)

UPGMA
Building Phylogenetic Trees by UPGMA: Example
The distance matrix

UPGMA
What is the distance between W and X? (Calculate.)

UPGMA
What is the distance between Y and C? (Calculate.)
