Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Why do we care about sequence alignment? It can tell us something about the evolution of organisms. We can see which regions of a gene (or its derived protein) are susceptible to mutation and which can have one residue replaced by another without changing function. Homologous genes (genes that share an evolutionary origin) have similar sequences. Orthologs are genes that are evolutionarily related and have a similar function but now appear in different species. Paralogs are evolutionarily related (they share an origin, typically through gene duplication) but no longer have the same function. You can uncover either orthologs or paralogs through sequence alignment.
Multiple Sequence Alignment: Often applied to proteins. Proteins that are similar in sequence are often similar in structure and function. Sequence changes more rapidly in evolution than do structure and function.
Overview of Methods. Dynamic programming: a complete search is too computationally expensive, so heuristics are used. Progressive: starts with a pair-wise alignment of the most similar sequences and adds to that. Iterative: makes an initial alignment of groups of sequences, then refines it repeatedly (e.g. genetic algorithms). Locally conserved patterns. Statistical and probabilistic methods.
Dynamic Programming. Computational complexity: even worse than for pair-wise alignment, because we’re finding all the paths through an n-dimensional hyperspace, with one dimension per sequence. (We can picture this in 2 or 3 dimensions.) Can align about 7 relatively short (200-300 residue) protein sequences in a reasonable amount of time; not much beyond that.
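A rough count shows why the full search stops being practical. The sketch below assumes n sequences of equal length L, so the lattice has (L+1)^n cells and each cell can be reached from up to 2^n - 1 neighbouring cells. The function names are illustrative and not taken from the slides or the book.

```python
def lattice_cells(num_seqs: int, length: int) -> int:
    """Number of cells in the full dynamic-programming lattice."""
    return (length + 1) ** num_seqs


def predecessors_per_cell(num_seqs: int) -> int:
    """Each sequence either advances or takes a gap; the all-gap move is excluded."""
    return 2 ** num_seqs - 1


# Compare pair-wise alignment (n = 2) with 3 and 7 sequences of length 300.
for n in (2, 3, 7):
    print(n, lattice_cells(n, 300), predecessors_per_cell(n))
```

Even before any scoring work, the cell count grows by a factor of about L for every sequence added, which is why heuristics that shrink the search space are needed.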
A Heuristic for Reducing the Search Space in Dynamic Programming Let’s picture this in 3 dimensions (pp. 146-157 in book). It generalizes to n. Consider the pair-wise alignments of each pair of sequences. Create a phylogenetic tree from these scores. Consider a multiple sequence alignment built from the phylogenetic tree. These alignments circumscribe a space in which to search for a good (but not necessarily optimal) alignment of all n sequences.
Phylogenetic Tree. Dynamic programming uses a phylogenetic tree to build a “first-cut” msa. The tree shows how the proteins could have evolved from shared origins over evolutionary time. See page 143 in Bioinformatics by Mount. Chapter 6 goes into detail on this.
Dynamic Programming -- MSA. Create a phylogenetic tree based on pair-wise alignments. (Pairs of sequences that have the best scores are paired first in the tree.) Do a “first-cut” msa by incrementally doing pair-wise alignments in the order of “alikeness” of sequences as indicated by the tree; the most alike sequences are aligned first. Use the pair-wise alignments and the “first-cut” msa to circumscribe a space within which to do a full msa search. The score for a given alignment of all the sequences is the sum of the scores for each pair, where each pair-wise score is multiplied by a weight ε indicating how far the pair-wise score differs from the first-cut msa alignment score.
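The weighted sum-of-pairs score can be sketched as follows. This is a minimal illustration, assuming a simple match/mismatch/gap scheme in place of a real substitution matrix, and assuming the pair weights are supplied by the caller. How the weights ε are derived is not shown here, and the names `pair_score` and `weighted_sum_of_pairs` are hypothetical.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score two rows of an alignment column by column (toy scoring scheme)."""
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column that is a gap in both rows contributes nothing
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def weighted_sum_of_pairs(msa, weights=None) -> float:
    """Sum the pair-wise scores over all pairs of rows, each times its weight.

    weights maps (i, j) with i < j to a weight; every weight is 1.0 if omitted.
    """
    total = 0.0
    for i, j in combinations(range(len(msa)), 2):
        w = 1.0 if weights is None else weights[(i, j)]
        total += w * pair_score(msa[i], msa[j])
    return total


msa = ["ACG-T", "A-GCT", "ACGCT"]
print(weighted_sum_of_pairs(msa))
print(weighted_sum_of_pairs(msa, {(0, 1): 0.5, (0, 2): 1.0, (1, 2): 1.0}))
```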
Heuristic Dynamic Programming Method for MSA Does not guarantee an optimal alignment of all the sequences in the group. Does get an optimal alignment within the space chosen.
Progressive Methods Similar to dynamic programming method in that it uses the first step (i.e., it creates a phylogenetic tree, aligns the most-alike pair, and incrementally adds sequences to the alignment in order of “alikeness” as indicated by the tree.) Differs from dynamic programming method for MSA in that it doesn’t refine the “first-cut” MSA by doing a full search through the reduced search space. (This is the computationally expensive part of DP MSA in that, even though we’ve cut down the search space, it’s still big when we have many sequences to align.)
Progressive Method. Generally proceeds as follows: choose a starting pair of sequences and align them; align each next sequence to those already aligned, one at a time. Heuristic method: doesn’t guarantee an optimal alignment. Details vary in implementation: How to choose the first sequences to align? Align all subsequent sequences cumulatively or in subfamilies? How to score?
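The general procedure can be sketched in a few dozen lines. This is a toy version under several assumptions: the input is already ordered from most to least alike, scoring is plain match/mismatch/gap, and each new sequence is aligned against a majority-residue consensus of the rows aligned so far (real programs align against a profile). All function names are illustrative.

```python
from collections import Counter


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pair-wise alignment by dynamic programming; returns two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def consensus(rows):
    """Most common non-gap residue in each column."""
    return "".join(
        Counter(c for c in col if c != "-").most_common(1)[0][0] for col in zip(*rows)
    )


def add_sequence(rows, seq):
    """Align seq to the consensus, copying any new gap into every existing row."""
    cons_aln, seq_aln = needleman_wunsch(consensus(rows), seq)
    new_rows = [[] for _ in rows]
    col = 0
    for c in cons_aln:
        if c == "-":  # gap opened in the consensus: pad all existing rows
            for r in new_rows:
                r.append("-")
        else:
            for r, old in zip(new_rows, rows):
                r.append(old[col])
            col += 1
    return ["".join(r) for r in new_rows] + [seq_aln]


def progressive_align(seqs):
    """Align the first two sequences, then add the rest one at a time."""
    rows = list(needleman_wunsch(seqs[0], seqs[1]))
    for s in seqs[2:]:
        rows = add_sequence(rows, s)
    return rows


for row in progressive_align(["GATTACA", "GATACA", "GTTACA"]):
    print(row)
```

Gaps placed early are never revisited, which illustrates why the result depends so strongly on the order in which sequences are added.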
ClustalW. Based on phylogenetic analysis. A phylogenetic tree is created using a pairwise distance matrix and the neighbor-joining algorithm. The most closely related pairs of sequences are aligned using dynamic programming. Each of the alignments is analyzed and a profile of it is created. Alignment profiles are aligned progressively for a total alignment. The W in ClustalW refers to a weighting of scores depending on how far a sequence is from the root on the phylogenetic tree. (See p. 154 of Bioinformatics by Mount.)
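The first step, building a pairwise distance matrix and finding the most closely related pair, can be illustrated as below. This is not how ClustalW computes distances: as a stand-in, the sketch uses the similarity ratio from Python's standard `difflib.SequenceMatcher` and takes distance as one minus that ratio. The tree-building step is omitted, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance_matrix(seqs):
    """Symmetric matrix of distances, with distance = 1 - similarity ratio (stand-in measure)."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        similarity = SequenceMatcher(None, seqs[i], seqs[j], autojunk=False).ratio()
        dist[i][j] = dist[j][i] = 1.0 - similarity
    return dist


def closest_pair(dist):
    """Indices (i, j), i < j, of the pair with the smallest distance."""
    n = len(dist)
    return min(combinations(range(n), 2), key=lambda p: dist[p[0]][p[1]])


seqs = ["MKVLAAGIV", "MKVLSAGIV", "MRPQTWYHE"]
dist = distance_matrix(seqs)
print(closest_pair(dist))  # the pair a progressive method would align first
```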
Problems with Progressive Method. Highly sensitive to the choice of initial pair to align. If they aren’t very similar, it throws everything off. It’s not trivial to come up with a suitable scoring matrix or gap penalties.
Iterative Methods for Multiple Sequence Alignment. Get an alignment. Refine it. Repeat until one msa doesn’t change significantly from the next. An example is the genetic algorithm approach.
Genetic Algorithms A general problem solving method modeled on evolutionary change. Create a set of candidate solutions to your problem, and cause these solutions to evolve and become more and more fit over repeated generations. Use survival of the fittest, mutation, and crossover to guide evolution.
Evolutionary Change in Genetic Algorithms survival of the fittest – the best solutions survive and reproduce to the next generation mutation – some solutions mutate in random ways (but they must always remain viable solutions) crossover – solutions “exchange parts”
Laying Out the Problem What would a candidate solution look like in a multiple sequence alignment program? (an msa of ~20 proteins) How many candidate solutions should there be? (~100)
Evolving to a Next Generation Which candidate solutions should survive to the next generation? First, take the top half based on best sum of pairs scores Then randomly select second half, giving more chance to an msa’s being selected in proportion to how good its score is
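The selection rule can be sketched as follows. The sketch assumes the second half is drawn with replacement from the whole population, and it shifts scores so that every selection weight is positive (sum-of-pairs scores can be negative). Both choices are assumptions made for illustration, and `next_generation` is a hypothetical name.

```python
import random


def next_generation(population, scores, rng=random):
    """Keep the top half by score, then fill the rest by score-weighted random draws.

    population: list of candidate msas; scores: parallel list of sum-of-pairs scores.
    """
    ranked = sorted(range(len(population)), key=lambda i: scores[i], reverse=True)
    half = len(population) // 2
    survivors = [population[i] for i in ranked[:half]]
    # Assumption: shift scores so the worst candidate still has a small positive weight.
    low = min(scores)
    weights = [s - low + 1 for s in scores]
    survivors += rng.choices(population, weights=weights, k=len(population) - half)
    return survivors


population = ["msa_a", "msa_b", "msa_c", "msa_d"]
scores = [10, 5, 1, 7]
print(next_generation(population, scores, random.Random(0)))
```

In a full genetic algorithm, the randomly selected half would then be passed to mutation and crossover before joining the next generation.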
How would mutation work? You can’t change a sequence in the msa; otherwise you would be creating a solution that isn’t really a solution. You can only insert or rearrange gaps.
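A gap-only mutation might look like the sketch below. It assumes one simple operator: insert a gap at a random position in one randomly chosen row, pad the other rows at the end to keep the rows equal in length, and drop any column that ends up all gaps. Other gap operators are possible, and the name `mutate_insert_gap` is hypothetical.

```python
import random


def mutate_insert_gap(msa, rng=random):
    """Insert one gap into a random row without altering any underlying sequence."""
    target = rng.randrange(len(msa))
    pos = rng.randint(0, len(msa[target]))
    mutated = []
    for i, row in enumerate(msa):
        if i == target:
            mutated.append(row[:pos] + "-" + row[pos:])
        else:
            mutated.append(row + "-")  # trailing gap keeps all rows the same length
    # Remove columns that contain only gaps; they carry no information.
    keep = [c for c in range(len(mutated[0])) if any(r[c] != "-" for r in mutated)]
    return ["".join(r[c] for c in keep) for r in mutated]


msa = ["GATTACA", "GAT-ACA", "G-TTACA"]
child = mutate_insert_gap(msa, random.Random(1))
print(child)
```

Stripping the gaps from each row of the result gives back the original sequences, so the mutated msa is still a valid candidate solution.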
How would crossover work? See page 160 in Bioinformatics by Mount.
Profiles and Motifs. A sequence motif is a relatively short pattern that appears consistently within a family of proteins. (Motifs can also appear in families of DNA or RNA molecules.) Frequently, motif-based analysis is used to detect patterns of amino acids in proteins that correspond to structural or functional features. Motifs are generated during multiple sequence alignment. They can be displayed as patterns of amino acids, as sequence logos, or as profile scoring matrices.
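A profile scoring matrix can be built from an aligned motif roughly as follows. The sketch assumes an ungapped block of aligned motif instances, a DNA alphabet for brevity, a uniform background frequency, and a pseudocount of 1. All of these are illustrative choices, as are the names `build_profile` and `score_segment`.

```python
import math
from collections import Counter


def build_profile(motif_rows, alphabet="ACGT", pseudocount=1.0):
    """One dict per column mapping each letter to a log-odds score (base 2)."""
    n = len(motif_rows)
    background = 1.0 / len(alphabet)  # assumption: uniform background
    denom = n + pseudocount * len(alphabet)
    profile = []
    for col in zip(*motif_rows):
        counts = Counter(col)
        profile.append(
            {a: math.log2(((counts[a] + pseudocount) / denom) / background) for a in alphabet}
        )
    return profile


def score_segment(profile, segment):
    """Sum the column scores for a segment the same length as the profile."""
    return sum(col[ch] for col, ch in zip(profile, segment))


motif = ["TATA", "TATA", "TAAA"]
profile = build_profile(motif)
print(score_segment(profile, "TATA"), score_segment(profile, "GCGC"))
```

Segments that resemble the motif score above zero and unrelated segments score below it, which is how a profile is used to scan new sequences for the motif.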