Multiple Alignment (Morphological Data)

EXAMPLE:
  LOON (bird): RED EYES, FEATHERS, 28 VERTEBRAE
  DOG: BROWN EYES, HAIR, 23 VERTEBRAE
  CROC: GREEN EYES, SCALES, 28 VERTEBRAE

We would construct the matrix:
  LOON (bird): 000
  DOG: 111
  CROC: 220

With DNA sequences, every character has the same 4 possible states (A, C, G, T). Protein sequences have 20 possible states.
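The coding of the loon/dog/croc example can be sketched in a few lines. This is only an illustration: the function name `encode_characters` and the rule "number the states of each character in order of first appearance" are assumptions made here, chosen because they reproduce the matrix in the example.

```python
def encode_characters(table):
    """Turn a taxon -> tuple-of-observations table into strings of state codes.

    Assumption (not stated in the slides): within each character (column),
    states are numbered 0, 1, 2, ... in order of first appearance.
    """
    n_chars = len(next(iter(table.values())))
    codes = [dict() for _ in range(n_chars)]  # one value -> state map per character
    matrix = {}
    for taxon, observations in table.items():
        row = []
        for j, value in enumerate(observations):
            # Reuse the state already given to this value, else assign the next one.
            state = codes[j].setdefault(value, len(codes[j]))
            row.append(str(state))
        matrix[taxon] = "".join(row)
    return matrix


table = {
    "LOON": ("red eyes", "feathers", 28),
    "DOG": ("brown eyes", "hair", 23),
    "CROC": ("green eyes", "scales", 28),
}
matrix = encode_characters(table)
for taxon, states in matrix.items():
    print(taxon, states)
```

CROC shares state 0 with LOON in the third character because both have 28 vertebrae.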
Multiple Sequence Alignment - Definition

A multiple alignment of sequences S1, S2, ..., Sk is a series of sequences S1', S2', ..., Sk' with gaps such that:
- all Si' sequences are of equal length;
- each Sj' is an extension of Sj, obtained by inserting gaps.

Example: ACTCGT, CAGTG, ACATCG

  AC__TCGT
  _CAGT_G_
  ACA_TCG_
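The two conditions of the definition translate directly into a check. A minimal sketch, assuming `_` is the gap symbol as in the example; the function name `is_valid_alignment` is made up for this illustration.

```python
def is_valid_alignment(sequences, aligned, gap="_"):
    """Check the two conditions of the definition for a candidate alignment."""
    if len(sequences) != len(aligned):
        return False
    # Condition 1: all aligned rows have the same length.
    same_length = len({len(row) for row in aligned}) == 1
    # Condition 2: deleting the gaps from each row gives back the original sequence.
    gaps_only = all(
        row.replace(gap, "") == seq for seq, row in zip(sequences, aligned)
    )
    return same_length and gaps_only


sequences = ["ACTCGT", "CAGTG", "ACATCG"]
aligned = ["AC__TCGT", "_CAGT_G_", "ACA_TCG_"]
print(is_valid_alignment(sequences, aligned))
```

The check only says whether the rows form a legal alignment, not whether it is a good one; that is the job of the scoring scheme.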
The Size Problem

If we consider only short sequences and only two taxa, we can handle the comparison manually, for example with a two-taxa matrix:

[Figure: two-taxa comparison matrix, axes Taxa 1 and Taxa 2]

But if you were to do this for 75 taxa, you would need a 75-dimensional space! In general, MSA methods are based on pairwise alignments between the sequences.
Determining Score

Most alignment algorithms determine the cost of an alignment column-wise.

Example:
  LOON: AAC
  DOG:  ACA
  CROC: CCA
  RAT:  CAC

Each column contains two states, i.e. one difference, so the column score for the alignment is 3.

Usually we align the sequences in pairs, and then align the pairs.

Possible scoring schemes include:
- Sum of pairs: the sum of pairwise distances between all pairs of sequences.
- Distance from consensus: the consensus is a string of the most common character in each column; the score is the total distance of the sequences from it.
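The column-wise schemes can be compared on the four-taxon example. A minimal sketch, assuming a unit cost for every mismatch (the slides do not fix the cost); the function names are made up for this illustration.

```python
from collections import Counter
from itertools import combinations


def column_changes(column):
    """Number of states in the column minus one (the count used in the example)."""
    return len(set(column)) - 1


def sum_of_pairs(column):
    """Number of mismatching pairs in the column (assumed cost: 1 per mismatch)."""
    return sum(a != b for a, b in combinations(column, 2))


def distance_from_consensus(column):
    """Number of characters that differ from the most common character."""
    most_common_count = Counter(column).most_common(1)[0][1]
    return len(column) - most_common_count


alignment = ["AAC", "ACA", "CCA", "CAC"]  # LOON, DOG, CROC, RAT
columns = list(zip(*alignment))           # transpose rows into columns

print(sum(column_changes(c) for c in columns))           # 3
print(sum(sum_of_pairs(c) for c in columns))             # 12
print(sum(distance_from_consensus(c) for c in columns))  # 6
```

Each column here holds two A's and two C's, so the consensus character is tied; the consensus distance is the same whichever of the two is picked.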
MSA Approaches

Progressive approach: build the MSA starting from the most related sequences, and then progressively add less related sequences. Examples: ClustalW, Pileup.
Problem: errors in the initial alignment are propagated to the MSA.

Iterative approach: repeatedly realign subgroups of sequences. Objective: improve the MSA score according to the scoring scheme, e.g. the sum-of-pairs score. Subgroups are based on the phylogenetic tree or on random selection. Examples: MultAlin, DiAlign.
ClustalW Algorithm

1. Compute pairwise alignments for all pairs of sequences.
2. Build a phylogenetic guide tree such that similar sequences are neighbors in the tree and distant sequences are far from each other in the tree.
3. Progressively align the sequences according to the branching order in the guide tree.
[Figure: input data, pairwise alignment, multiple alignment]
PHYLOGENETIC RECONSTRUCTION

Goal: Given a set of species, reconstruct the tree which best explains their evolutionary history.
EVOLUTION and PHYLOGENY

All organisms undergo a slow process of transformation through the ages: evolution. The process of speciation (the creation of new species) is described by phylogenetic trees. Trees are acyclic connected graphs.

Example: primate phylogenetic tree

[Figure: tree with leaves chimpanzee, human, gorilla, orangutan, gibbon, siamang; marked nodes: the common ancestor of human and chimp, and the common ancestor of all six primates]
Tree Features

Nodes:
- External nodes (tips of the tree) represent extant (existing) species.
- Internal nodes represent ancestral species (usually extinct).

Branches: the length corresponds to the number of mutations. A longer branch means more mutations, usually implying a longer evolutionary time. The typical time scale is mya (million years ago).

[Figure: primate tree (chimpanzee, human, gorilla, orangutan, gibbon, siamang) with external nodes, internal nodes and a branch labeled]
Phylogenetic Reconstruction

Goal: Given a set of taxa (a group of related biological species), build a tree which best represents the course of evolution for this set over time.

Trees: rooted or unrooted. Most reconstruction methods produce unrooted trees. To root a tree we need "external information" (e.g. an outgroup).

[Figure: unrooted and rooted trees of human, chimpanzee, gorilla and orangutan]
Trees are Based on What?

Classical phylogenetic analysis: Darwin (On the Origin of Species, November 24, 1859) and his contemporaries based their work on morphological and physiological properties (e.g. cold/warm blood, existence of scales, number of teeth, existence of wings, etc.).

Modern biological methods are based on molecular features: homologous sequences (e.g. globins) in different species, using DNA or protein sequences.
Homologous genes have a common ancestor. However, gene duplication and loss events obscure evolutionary events.
Input, Algorithm, Tree

Morphology-based input: n-by-m table, with rows = species, columns = properties.
Sequence-based input: n aligned sequences, one per species.

[Figure: properties table or aligned sequences -> algorithm -> phylogenetic tree]

Major types of algorithms:
- Distance-based methods: UPGMA, Neighbor Joining.
- Character-based methods: Maximum Parsimony, Maximum Likelihood.
The Methods

- Distance: a tree built by recursively combining the two nodes with the smallest distance.
- Parsimony: a tree with the minimum total number of character changes between nodes.
- Maximum likelihood: finds the most probable tree under a mutation model. The method of choice nowadays.
Distance Based Methods

Iterative process with n-1 stages. Each stage consists of two steps:
Step 1: Determine the closest pair of species v, u. "Merge" these two "neighbors" into a new species w.
Step 2: Update the distance matrix: determine the distances from the new species w to the n-2 other species.

There are many distance-based methods; the most popular are UPGMA and Bio-NJ. They differ in how the closest pair is chosen and in how ties are resolved.
UPGMA - Unweighted Pair Group Method with Arithmetic mean

Algorithm - 2 stages:
1. Build a simple distance matrix: the distance between a pair of species may be the number of sites in which they differ.
2. Construct a tree by iteratively clustering species with small distances ("neighbors").

Example distance matrix:

      A    B    C
  B   6
  C   5    7
  D   10   12   7
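Stage 1 (counting the sites in which two aligned sequences differ) can be sketched as follows. The sequences are the four short ones from the column-score example, reused here purely for illustration; they are not the source of the A-D matrix. Storing distances in a dictionary keyed by unordered pairs is a representation chosen for this sketch.

```python
from itertools import combinations


def site_differences(a, b):
    """Number of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(aligned):
    """Map each unordered pair of species to its number of differing sites."""
    return {
        frozenset((s, t)): site_differences(aligned[s], aligned[t])
        for s, t in combinations(aligned, 2)
    }


aligned = {"LOON": "AAC", "DOG": "ACA", "CROC": "CCA", "RAT": "CAC"}
distances = distance_matrix(aligned)
for pair, d in distances.items():
    print(sorted(pair), d)
```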
EXAMPLE for UPGMA

Find the pair with the smallest distance: A and C (distance 5). Join A and C at a node of height 5/2 = 2.5.

Merge A and C into AC and update the distance matrix: Dist(AC,x) = [dist(A,x) + dist(C,x)]/2.

Before:

      A    B    C
  B   6
  C   5    7
  D   10   12   7

After:

      AC    B
  B   6.5
  D   8.5   12
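EXAMPLE for UPGMA (continued)

Next pair: AC and B (distance 6.5). Join them at a node of height 6.5/2 = 3.25 and update the matrix:

      AC    B
  B   6.5
  D   8.5   12

becomes

      ACB
  D   10.25

Next pair: ACB and D (distance 10.25). Join them at the root, at height 10.25/2 = 5.125.

Resulting tree (node heights): A and C join at 2.5; B joins AC at 3.25; D joins ACB at the root, at 5.125. The branch from the (AC,B) node up to the root has length 5.125 - 3.25 = 1.875.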
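The worked example can be replayed in code. A minimal sketch, assuming distances are kept in a dictionary keyed by unordered pairs (a representation chosen here) and using the update rule exactly as written in the example, Dist(w,x) = [dist(u,x) + dist(v,x)]/2. Note that this plain average ignores cluster sizes; UPGMA as usually defined weights each cluster by its number of leaves, which would change the second update, so the sketch reproduces the example rather than a reference implementation. The function name is made up for this illustration.

```python
def cluster_by_simple_average(dist):
    """Repeatedly merge the closest pair; return a list of (left, right, height)."""
    dist = dict(dist)
    nodes = {name for pair in dist for name in pair}
    merges = []
    while len(nodes) > 1:
        pair = min(dist, key=dist.get)          # closest pair; first found wins ties
        u, v = sorted(pair)
        w = u + v                               # name of the merged cluster
        merges.append((u, v, dist[pair] / 2))   # node height is half the distance
        nodes -= {u, v}
        for x in nodes:
            # Update rule from the example: plain average of the two old distances.
            dist[frozenset((w, x))] = (
                dist[frozenset((u, x))] + dist[frozenset((v, x))]
            ) / 2
        # Drop every entry that still mentions one of the merged clusters.
        dist = {p: d for p, d in dist.items() if u not in p and v not in p}
        nodes.add(w)
    return merges


d = {
    frozenset(("A", "B")): 6, frozenset(("A", "C")): 5, frozenset(("B", "C")): 7,
    frozenset(("A", "D")): 10, frozenset(("B", "D")): 12, frozenset(("C", "D")): 7,
}
merges = cluster_by_simple_average(d)
for left, right, height in merges:
    print(left, right, height)
```

The three merges come out as (A, C) at height 2.5, (AC, B) at 3.25 and (ACB, D) at 5.125, matching the example.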
UPGMA Properties

- Builds a rooted tree.
- The output tree is ultrametric: the distance between the root and any leaf is the same. This amounts to assuming a molecular clock (the same rate of change along every lineage), which is too good to be true.
- The tree is additive: the distance between any two nodes equals the sum of the lengths of the branches connecting them.
Neighbor Joining

Builds an additive tree which does not assume an equal molecular clock. The tree is unrooted.

The algorithm is similar, but the pair to merge is chosen by a corrected distance M rather than by the raw distance: merge nodes A and B such that M(A,B) is smallest.

\[ r(A) = \frac{1}{N-2} \sum_{x} d(A,x) \]
\[ M(A,B) = d(A,B) - [r(A) + r(B)] \]
\[ d(A,AB) = \tfrac{1}{2}\,[d(A,B) + r(A) - r(B)] \]
\[ d(B,AB) = d(A,B) - d(A,AB) \]
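Neighbor Joining

Set N to contain all leaves.

Iteration:
- Choose i, j such that M(i,j) is minimal.
- Create a new node k, and set its distance to every remaining node m; in standard neighbor joining, d(k,m) = [d(i,m) + d(j,m) - d(i,j)]/2.
- Remove i, j from N, and add k.

Terminate: when |N| = 2, connect the two remaining nodes.

[Figure: nodes i and j joined at a new node k, with another node m]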
Neighbor Joining Example

Distance matrix:

      A    B    C
  B   6
  C   5    7
  D   10   12   7

Compute r for every node, N = 4:
  r(A) = 0.5*(6+5+10); r(B) = 0.5*(6+7+12); r(C) = 0.5*(5+7+7); r(D) = 0.5*( );

Compute M for every pair of nodes:
  M(A,B) = dist(A,B) - [r(A)+r(B)] = 6 - ( ).

In this example C and D are merged first.

[Figure: tree with leaves A, B, C, D]
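The r and M computations for this matrix can be checked with a short script. A minimal sketch: the dictionary-of-pairs representation and the function names are choices made here, while the formulas are the ones given for r, M and the branch lengths.

```python
from itertools import combinations


def net_divergence(d, nodes):
    """r(a) = [sum over x of d(a, x)] / (N - 2)."""
    n = len(nodes)
    return {
        a: sum(d[frozenset((a, x))] for x in nodes if x != a) / (n - 2)
        for a in nodes
    }


def m_matrix(d, nodes):
    """M(a, b) = d(a, b) - [r(a) + r(b)] for every pair of nodes."""
    r = net_divergence(d, nodes)
    return {
        frozenset((a, b)): d[frozenset((a, b))] - (r[a] + r[b])
        for a, b in combinations(nodes, 2)
    }


def branch_lengths(d, r, a, b):
    """Lengths of the branches from a and from b to their new parent node."""
    to_a = 0.5 * (d[frozenset((a, b))] + r[a] - r[b])
    return to_a, d[frozenset((a, b))] - to_a


nodes = ["A", "B", "C", "D"]
d = {
    frozenset(("A", "B")): 6, frozenset(("A", "C")): 5, frozenset(("B", "C")): 7,
    frozenset(("A", "D")): 10, frozenset(("B", "D")): 12, frozenset(("C", "D")): 7,
}
r = net_divergence(d, nodes)
M = m_matrix(d, nodes)
for pair, value in M.items():
    print(sorted(pair), value)
print(branch_lengths(d, r, "C", "D"))
```

For this matrix the pairs (A,B) and (C,D) tie for the smallest M, so which pair is merged first depends on how ties are broken.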
If you break ties "systematically", that is, according to the order of appearance in the matrix, and complete this procedure, you get the UPGMA tree on the left. If you break ties randomly, you might get the tree on the right.
Maximum Parsimony

We are looking for an "evolutionary explanation" for the existing species that minimizes the number of mutations.

Evolutionary explanation: a tree together with sequences assigned to its internal nodes. The internal nodes stand for the steps required to generate the observed variation in the sequences.

This problem is NP-hard. However, for a given tree it is easy to find an assignment for the internal nodes that minimizes the number of mutations.
Calculating the minimal number of steps

[Figure: step-by-step labeling of the internal nodes of a tree, with a running length counter]

- The intersection of A and A is A, so we assign A to the node. Length = 0.
- We add a length of 1. Length = 1.
- We add a length of 1. Length = 2.
- The intersection of {C, T} and {C} is (of course) C.
- The intersection of {A, C} and {C} is C.
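The intersection rule can be written as a short recursion over one alignment column; this procedure is commonly attributed to Fitch. A minimal sketch: the tree and the leaf states below are hypothetical (the tree from the figure is not recoverable), and representing the tree as nested pairs is a choice made for this illustration.

```python
def fitch(tree, leaf_states):
    """Bottom-up pass for one column.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    Returns (set of candidate states at this node, number of changes below it).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        # Non-empty intersection: keep it, no extra change needed.
        return common, left_cost + right_cost
    # Empty intersection: take the union and add a length of 1.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical four-leaf tree and one column of states.
tree = (("t1", "t2"), ("t3", "t4"))
column = {"t1": "A", "t2": "A", "t3": "C", "t4": "T"}
root_states, length = fitch(tree, column)
print(root_states, length)
```

Here t1 and t2 intersect in A at no cost, t3 and t4 have nothing in common (length 1), and the root again finds an empty intersection (length 2). Summing the lengths over all columns gives the parsimony score of the tree.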
Maximum Parsimony Problems

For small datasets it is possible to evaluate all possible tree topologies. This is done by adding taxa to the growing tree in all possible locations.

Specifically, when the number of taxa is t = 4, there are 3 unrooted trees. The number of possible trees increases rapidly with t. Number of trees:

\[ \frac{(2t-5)!}{2^{t-3}\,(t-3)!} \]

When t = 10, the number is more than two million.

The most parsimonious tree is not always the true tree.
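The growth of the tree count is easy to tabulate. A minimal sketch of the formula for unrooted trees, (2t-5)!/[2^(t-3) (t-3)!]; the function name is made up for this illustration.

```python
from math import factorial


def unrooted_tree_count(t):
    """Number of unrooted bifurcating trees on t taxa: (2t-5)! / [2^(t-3) (t-3)!]."""
    if t < 3:
        raise ValueError("the formula applies for t >= 3")
    return factorial(2 * t - 5) // (2 ** (t - 3) * factorial(t - 3))


for t in (4, 5, 10):
    print(t, unrooted_tree_count(t))
```

For t = 4 this gives 3, and for t = 10 it gives 2,027,025, the "more than two million" quoted above.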
Maximum Likelihood

- Uses probability calculations to find the tree that best accounts for the variation in a set of sequences.
- In each tree the number of sequence changes is considered.
- Allows for variation in mutation rates, and can incorporate evolutionary models such as Jukes-Cantor.
- As in maximum parsimony, the analysis is performed on each column in turn, and all possible trees are considered.
- Computationally intensive!
Comparison

When the sequences are very similar, all methods will produce a tree close to the real tree. When the sequences are less related, neighbor joining and maximum likelihood are usually better than maximum parsimony.