Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances Ancestral reconstructions and estimating branch lengths Lecture 23
Phylogenetic analysis and sequence alignment If two nucleic acid or protein molecules show a sufficient level of similarity, they are likely derived from a common ancestor. The main confounding factor is horizontal gene transfer, in which a DNA sequence is introduced from a remote source by viruses, mobile elements or another process. A second confounding factor is an unequal rate of molecular evolution for the same gene in different species. A third is comparison between paralogous sequences. Once these three obstacles are taken into account, a comparison of similar sequences provides a sound foundation for reconstructing phylogenetic (evolutionary) relationships from molecular data.
Multiple sequence alignment and phylogenetic analysis Multiple alignment is a critical step in phylogenetic analysis. Although the procedure is reliable in many cases, it has a potential weakness: the result depends on the composition of the group of aligned sequences. If the group is dominated by closely similar sequences, they bias the alignment, and the positions of gaps in particular may be affected. Biased multiple alignments will inevitably affect phylogenetic reconstructions.
Unrooted phylogenetic tree [Figure: an unrooted tree connecting Seq A, Seq B, Seq C and Seq D, with a branch (edge) and a node labelled.]
The number of rooted trees is much higher than the number of unrooted trees.
Number of OTUs   Number of unrooted trees   Number of rooted trees
2                1                          1
3                1                          3
4                3                          15
5                15                         105
6                105                        945
7                945                        10,395
8                10,395                     135,135
9                135,135                    2,027,025
10               2,027,025                  34,459,425
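The counts in this table follow the standard double-factorial formulas for strictly bifurcating trees: (2n - 5)!! unrooted trees and (2n - 3)!! rooted trees for n OTUs. The slide does not state these formulas, so the sketch below is an illustration under that assumption; the function names are made up for the example.

```python
def double_factorial(k):
    """Return k * (k - 2) * (k - 4) * ... down to 1 (defined as 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_otus):
    """Number of unrooted bifurcating trees: (2n - 5)!!, with one tree for n = 2."""
    if n_otus < 2:
        raise ValueError("need at least two OTUs")
    return double_factorial(2 * n_otus - 5)


def num_rooted_trees(n_otus):
    """Number of rooted bifurcating trees: (2n - 3)!!."""
    if n_otus < 2:
        raise ValueError("need at least two OTUs")
    return double_factorial(2 * n_otus - 3)


if __name__ == "__main__":
    for n in range(2, 11):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Each unrooted tree with n OTUs can be rooted on any of its 2n - 3 branches, which is why the rooted column is the unrooted column shifted by one row.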
A strategy of phylogenetic reconstruction Choose a set of related sequences and obtain a multiple sequence alignment. Is there strong sequence similarity? If yes, use maximum parsimony methods. If no, is there clearly recognizable sequence similarity? If yes, use distance methods; if no, use maximum likelihood methods. In every case, analyse how well the data support the prediction. Maximum parsimony produces better results when similarity between the sequences is high and the amount of variation is small. Distance methods are generally better when variation between sequences is intermediate. Maximum likelihood is less sensitive to sequence variation but computationally very demanding.
Simplifying assumptions for phylogenetic reconstructions All nucleotide sites change independently. The substitution rate is constant over time and in different lineages. The base composition is at equilibrium. The conditional probabilities of nucleotide substitutions are the same for all sites and do not change over time. The number of gaps in the multiple alignment (MA) should not be too large. An MA of similar sequences is preferable. The simplest evolutionary models are preferable: a) no reversal (A → T → A), b) no multiple steps (A → T → G). Gaps in an MA are usually not scored because there is no suitable model for the evolutionary mechanisms that produce them. MOST OF THESE CONDITIONS ARE VIOLATED TO SOME DEGREE AND FROM TIME TO TIME.
Measuring genetic distances Calculation of distances between pairs of sequences is essential for phylogenetic reconstructions. Calculating a genetic distance amounts to asking “how much evolutionary change has occurred between two sequences?” If all the earlier assumptions were correct, calculating a genetic distance would be simple. Unfortunately this is not the case, and evolutionary time itself makes a significant contribution.
Number of nucleotide substitutions between pairs of bovid mtDNA sequences. The observed number of substitutions does not increase linearly with time.
The need to correct observed sequence differences Numerous distance correction techniques have been proposed to estimate the actual amount of evolutionary change. Many of the methods are interrelated.
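As one concrete illustration of a distance correction, the sketch below uses the Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The slide does not name a specific correction, so the choice of Jukes-Cantor and the function names are assumptions made for this example. Gapped columns are skipped, in line with the earlier remark that gaps are usually not scored.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Observed proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # gapped columns are not scored
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differences."""
    if not 0 <= p < 0.75:
        raise ValueError("Jukes-Cantor correction is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    p = p_distance("ACGTACGTAC", "ACGTTCGAAC")
    print(p, jukes_cantor_distance(p))
```

The corrected distance is always at least as large as p, and the gap between the two widens as p grows, which reflects the saturation of observed differences over time.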
Models of sequence evolution: basic approach Within a general framework that assumes the probability of a given nucleotide substitution remains constant over time, together with the other assumptions mentioned earlier, the substitution matrix is given by:
\[
P_t = \begin{pmatrix}
p_{AA} & p_{AC} & p_{AG} & p_{AT} \\
p_{CA} & p_{CC} & p_{CG} & p_{CT} \\
p_{GA} & p_{GC} & p_{GG} & p_{GT} \\
p_{TA} & p_{TC} & p_{TG} & p_{TT}
\end{pmatrix}
\]
In most models the matrix is symmetric: \(p_{AC} = p_{CA}\). If the numbers 1-4 are assigned to the nucleotides (e.g. A = 1, C = 2, etc.), then the value of the diagonal elements is given by
\[
p_{ii} = 1 - \sum_{j \neq i} p_{ij}
\]
In other words, the probability of observing A at a given site at time 0 and again at time t is 1 minus the probability of observing the substitution of A by any of C, G, or T. The base composition of the sequences can be represented by a vector \(f = [f_A \; f_C \; f_G \; f_T]\). In some models the f values are equal, in others they differ.
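To make the matrix properties concrete, the sketch below builds the substitution matrix for the simplest case, the Jukes-Cantor model, in which every substitution occurs at the same rate. The slide does not single out this model; it and the parameter names `alpha` (rate per specific substitution) and `t` (time) are assumptions chosen for illustration. The check at the end confirms that the matrix is symmetric and that each diagonal element equals 1 minus the sum of the off-diagonal elements in its row.

```python
import math

NUCLEOTIDES = "ACGT"


def jukes_cantor_matrix(alpha, t):
    """Return the 4x4 matrix P_t under Jukes-Cantor as a list of rows (order A, C, G, T)."""
    decay = math.exp(-4 * alpha * t)
    same = 0.25 + 0.75 * decay   # p_ii: same nucleotide at time 0 and time t
    diff = 0.25 - 0.25 * decay   # p_ij, i != j: a specific substitution
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def check_matrix(matrix, tol=1e-12):
    """Verify symmetry and the diagonal rule p_ii = 1 - sum of p_ij over j != i."""
    for i, row in enumerate(matrix):
        off_diagonal = sum(p for j, p in enumerate(row) if j != i)
        if abs(row[i] - (1 - off_diagonal)) > tol:
            return False
        for j in range(4):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                return False
    return True


if __name__ == "__main__":
    P = jukes_cantor_matrix(alpha=0.01, t=10.0)
    for base, row in zip(NUCLEOTIDES, P):
        print(base, [round(p, 4) for p in row])
    print("valid:", check_matrix(P))
```

In this model the base composition vector f has four equal entries of 0.25; models with unequal f values need more parameters than the single rate used here.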
Different models estimating nucleotide substitutions among a pair of DNA sequences
Observed and expected numbers of nucleotide pairs between human and chimpanzee mtDNA sequences for three models. As the models add parameters, they more closely approximate the observed pattern.
Measuring evolutionary change on a tree The evolutionary distance between b and d is the sum of the edge lengths along the path in the tree between the two sequences.
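A minimal sketch of this path-sum calculation follows. The tree shape, node names and edge lengths are invented for illustration and are not taken from the slide's figure; the tree is stored as a child-to-parent map, which is one of several possible representations.

```python
def path_distance(parent, edge_length, x, y):
    """Sum the edge lengths along the path between nodes x and y.

    parent maps each node to its parent (the root maps to None);
    edge_length maps each non-root node to the length of the edge above it.
    """
    # Walk from x to the root, recording the distance to every ancestor.
    dist_from_x = {}
    node, total = x, 0.0
    while node is not None:
        dist_from_x[node] = total
        total += edge_length.get(node, 0.0)
        node = parent[node]
    # Walk up from y until reaching a node that is also an ancestor of x.
    node, total = y, 0.0
    while node not in dist_from_x:
        total += edge_length[node]
        node = parent[node]
    return total + dist_from_x[node]


# Hypothetical tree: root r with internal nodes u = (a, b) and v = (c, d).
PARENT = {"a": "u", "b": "u", "c": "v", "d": "v", "u": "r", "v": "r", "r": None}
EDGE_LENGTH = {"a": 0.1, "b": 0.2, "c": 0.15, "d": 0.25, "u": 0.05, "v": 0.3}

if __name__ == "__main__":
    print(path_distance(PARENT, EDGE_LENGTH, "b", "d"))
```

For leaves b and d in this made-up tree, the path runs b, u, r, v, d, so the distance is the sum of four edge lengths.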
Inferring branch lengths There are many methods for inferring branch lengths, and these are directly related to the methods of tree construction. Here we shall focus on one approach (the parsimony method) because this method, unlike the others, explicitly seeks to reconstruct the ancestral sequences rather than just the edge lengths. In the vast majority of cases the ancestral sequences are never known.
1 CGA
2 CGA
3 ATT
4 TTT
5 TGT
6 TGG
TGT - consensus sequence
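The consensus for the six sequences on this slide can be reproduced with a short majority-vote sketch. The function name is made up for the example, and tie handling (the base seen first among equally frequent ones wins) is an arbitrary choice that the slide does not specify.

```python
from collections import Counter


def consensus(sequences):
    """Return the most frequent base at each position of equal-length sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must have the same length")
    # zip(*sequences) yields one alignment column at a time.
    # On ties, Counter.most_common keeps the base encountered first.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sequences))


if __name__ == "__main__":
    print(consensus(["CGA", "CGA", "ATT", "TTT", "TGT", "TGG"]))
```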
Inferring ancestor sequences: is consensus always correct? Given a star tree, the most frequent base at each position is the most parsimonious estimate of the ancestral nucleotide. If, however, the tree is known, the common ancestral sequence reconstructed by parsimony can differ from the consensus. A simple consensus is therefore not appropriate in many cases.
Basic rules and assigning state sets to internal nodes Two basic rules of parsimony are: a) if two sequences (nodes) have the same state, their common ancestor had the same state, b) if the two sequences have different states then the state set of the ancestor has both states or in other words there is a lack of information about its state.
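The two rules can be sketched for a single alignment column as follows. The slide states them for pairs of single states; the extension to state sets used here (take the intersection of the children's sets if it is non-empty, otherwise their union and count one change) is the usual Fitch-style generalization and is an assumption of this example, as are the nested-tuple tree encoding and the leaf names.

```python
def fitch_state_sets(tree, leaf_states):
    """Assign a state set to the root of a binary tree for one site.

    tree is either a leaf name (str) or a pair (left_subtree, right_subtree);
    leaf_states maps each leaf name to its nucleotide at this site.
    Returns (root_state_set, minimum_number_of_changes).
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {leaf_states[node]}
        left, right = node
        left_set = visit(left)
        right_set = visit(right)
        common = left_set & right_set
        if common:
            return common          # rule (a): shared state is inherited
        changes += 1               # rule (b): disagreement implies a change
        return left_set | right_set

    return visit(tree), changes


if __name__ == "__main__":
    tree = (("s1", "s2"), ("s3", "s4"))  # hypothetical four-leaf tree
    site = {"s1": "C", "s2": "C", "s3": "A", "s4": "T"}
    print(fitch_state_sets(tree, site))
```

A root set with more than one state is exactly the "lack of information" case described in rule (b); the count of union steps gives the parsimony length of the site on this tree.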
Resolving the ambiguity in ancestral reconstructions By going back up from the root of the tree the ambiguities can be resolved. However, in more complex situations this is not always possible.
Estimating branch lengths The distance method of estimating branch lengths (next lecture) uses direct or indirect measurements of sequence similarity (genetic distances). The parsimony method uses the number of events between two compared nodes. A lack of data about the intermediate sequences prevents reconstruction of the true evolutionary history, and incorrect conclusions can be drawn (a). In reality at least 3 events took place between nodes 1 and 2 (b).
Summary Multiple alignment is an essential first step in any phylogeny reconstruction effort. Multiple substitutions at the same sequence position cause the observed differences to underestimate the actual number of substitutions that have taken place. The longer the time since the common ancestor, the more severe this problem becomes. Several correction methods can “soften” the problem. A number of basic assumptions used in phylogenetic reconstructions, such as a constant mutation/substitution rate and independence of substitutions at different sites, are generally not correct. The quality of a phylogenetic reconstruction depends on how many sequences were sampled. A lack of data may significantly bias the reconstruction.