UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 14.10: Common Multiple Alignment Methods Lecturer: Dr. Rose Slides by: Dr. Rose April 1, 2003

Common Multiple Alignment Methods: Iterative pairwise alignment
Naïve method:
1. Merge the two strings with minimum edit distance.
2. Successively merge in the string with the smallest edit distance from any string in the multiple alignment.
Observation: Like Prim’s algorithm for minimum spanning tree.
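The merge order of the naïve method can be sketched as follows. This is only an illustration of the Prim-like selection, not the alignment itself: it assumes unit-cost edit distance, and the function names `edit_distance` and `prim_merge_order` are made up for this example.

```python
from itertools import combinations

def edit_distance(a, b):
    """Unit-cost edit distance (assumed cost model), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def prim_merge_order(strings):
    """Indices of the strings in the order the naive method would merge them."""
    n = len(strings)
    d = {(i, j): edit_distance(strings[i], strings[j])
         for i, j in combinations(range(n), 2)}

    def dist(i, j):
        return d[(min(i, j), max(i, j))]

    # Step 1: start from the closest pair.
    i0, j0 = min(d, key=d.get)
    order = [i0, j0]
    remaining = set(range(n)) - {i0, j0}
    # Step 2: repeatedly add the string closest to ANY string already merged.
    while remaining:
        nxt = min(remaining, key=lambda j: min(dist(i, j) for i in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

print(prim_merge_order(["ACGT", "ACGA", "TTTT", "ACCA"]))
```

Each step grows a single alignment by one string, which is why the selection mirrors Prim’s algorithm.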

Common Multiple Alignment Methods: Iterative pairwise alignment
Alternative naïve method:
1. Merge the two strings with minimum edit distance.
2. Successively merge multiple alignments of subsets of strings on the basis of pairwise edit distance.
Observation: Like Kruskal’s algorithm for minimum spanning tree.
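A minimal sketch of the Kruskal-like merge order, assuming the pairwise edit distances are already available in a dictionary keyed by index pairs. The name `kruskal_merge_order` and the example distances are hypothetical; only the order of merges is shown, not the alignment of the merged subsets.

```python
def kruskal_merge_order(dist, n):
    """dist: {(i, j): distance} with i < j, for n strings.

    Returns the pairs (i, j, distance) whose subsets get merged, in order.
    """
    parent = list(range(n))  # union-find forest over the string indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    # Consider pairs from smallest to largest distance.
    for (i, j), d in sorted(dist.items(), key=lambda kv: kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:  # the two strings lie in different partial alignments
            parent[rj] = ri
            merges.append((i, j, d))
    return merges

example = {(0, 1): 2, (0, 2): 6, (1, 2): 8, (0, 3): 10, (1, 3): 10, (2, 3): 4}
print(kruskal_merge_order(example, 4))
```

Unlike the Prim-like variant, several partial alignments can exist at once and are merged with each other later.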

Common Multiple Alignment Methods: Iterative pairwise alignment
Less naïve method: UPGMA
1. Merge the two strings with minimum edit distance.
2. Successively merge multiple alignments of subsets of strings on the basis of average edit distance.
Observation: Average linkage is a common clustering linkage method.
UPGMA: unweighted pair-group method using arithmetic averages.
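The average-linkage selection can be sketched like this. It assumes the pairwise edit distances are given as a dictionary, recomputes each average directly from the original distances (simple rather than fast), and uses the made-up name `upgma_merge_order`.

```python
from itertools import combinations

def upgma_merge_order(dist, n):
    """dist: {(i, j): distance} with i < j, for n strings.

    Returns a list of merges (cluster_a, cluster_b, average_distance).
    """
    clusters = [frozenset([i]) for i in range(n)]

    def avg(a, b):
        # Arithmetic mean of all pairwise distances between the two clusters.
        total = sum(dist[(min(x, y), max(x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        # Pick the two clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda p: avg(*p))
        merges.append((sorted(a), sorted(b), avg(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

example = {(0, 1): 2, (0, 2): 6, (1, 2): 8, (0, 3): 10, (1, 3): 10, (2, 3): 4}
for step in upgma_merge_order(example, 4):
    print(step)
```

The only change from the Kruskal-like variant is the merge criterion: the average over all cross pairs instead of the single closest pair.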

Common Multiple Alignment Methods: Iterative pairwise alignment
Problem: aligning protein sequences to reveal conserved β-strands.
Q: What is a β-strand?
Let’s refresh our memories concerning proteins and their structures.
For a good overview visit: http://public-1.cryst.bbk.ac.uk/PPS2/course/section3/index.html

Common Multiple Alignment Methods
Q: What is a protein?
A: One or more polypeptide chains.
Q: Ughh, what is a polypeptide chain?
A: A linear polymer of amino acid residues, i.e., a sequence of amino acids.
Defn. The primary structure of a protein is the sequence of amino acids comprising it.

Common Multiple Alignment Methods
General form of an amino acid (borrowed from Jon Cooper, Birkbeck Crystallography Dept.)

Common Multiple Alignment Methods
Amino acids are joined together by peptide bonds. (borrowed from Jon Cooper, Birkbeck Crystallography Dept.)
Here the sequence of R-groups along the chain is called the primary structure.
Secondary structure refers to the local folding of the polypeptide chain.

Common Multiple Alignment Methods
The right-handed spiral conformation is known as the 'alpha-helix'. (borrowed from Jon Cooper, Birkbeck Crystallography Dept.)

Common Multiple Alignment Methods
A section of polypeptide with residues in the beta-conformation is referred to as a beta-strand. (from J. Cooper, Birkbeck Crystallography Dept.)

Common Multiple Alignment Methods
Beta-strands can form beta-sheets. (J. Cooper, Birkbeck Crystallography Dept.)

Common Multiple Alignment Methods
Here we see beta-strands in an antiparallel beta-sheet. (from J. Cooper, Birkbeck Crystallography Dept.)

Common Multiple Alignment Methods: Iterative alignment
Problem: aligning protein sequences to reveal conserved β-strands.
Q: First of all, what is meant by conserved β-strands?
A: These are β-strands that are preserved through evolutionary changes.
We want to find the location of these conserved β-strands.

Common Multiple Alignment Methods: Iterative alignment
Scenario: we have one protein sequence with known locations of β-strands.
Q: How do you think we are able to know where the β-strands are located?
A: They were probably found by x-ray crystallography.

Common Multiple Alignment Methods: Iterative alignment
Note: any multiple alignment will entail gaps.
Q: How can we use multiple alignment to find conserved β-strands?
We need an approach that will:
1. Align the conserved β-strands.
2. Not insert gaps into the conserved β-strands.
The method should be tuned to favor similarities in secondary structure.
(Recall: secondary structure refers to the local folding of the polypeptide chain.)

Common Multiple Alignment Methods: Iterative alignment
Broad outline of method:
1. Greedy algorithm.
2. Variant maximum spanning tree method.
3. Add strings to alignment in order of similarity.
First we need to define the similarity metric.
Next we see that it is not simply a multiple alignment consistent with a node labeling of the maximum spanning tree.

Common Multiple Alignment Methods: Iterative alignment
Similarity metric:
For each pair of strings S_i, S_j:
1. Compute the pairwise similarity score.
2. Repeat 100 times:
   - Randomly permute the characters in the two strings.
   - Compute the pairwise similarity score.
3. Compute the mean and standard deviation.
Define sd(i, j) as the ratio of the similarity score and the standard deviation computed from the permuted strings.
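A rough sketch of the sd(i, j) computation, following the slide’s definition literally (observed score divided by the standard deviation of the permuted scores; the mean is not used in the ratio here). The similarity function is an assumption: a plain global alignment score with match +1, mismatch -1, gap -1. The names `similarity` and `sd_score`, and the choice to return infinity when the permuted scores do not vary, are also made up for this illustration.

```python
import random
import statistics

def similarity(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score under an assumed, very simple scoring."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def sd_score(a, b, trials=100, rng=None):
    """Observed similarity divided by the std. dev. of shuffled-string scores."""
    rng = rng or random.Random()
    observed = similarity(a, b)
    shuffled_scores = []
    for _ in range(trials):
        pa, pb = list(a), list(b)
        rng.shuffle(pa)  # permuting destroys any common nonrandom structure
        rng.shuffle(pb)
        shuffled_scores.append(similarity(pa, pb))
    sigma = statistics.pstdev(shuffled_scores)
    # Degenerate case (assumption): no variation among permuted scores.
    return float("inf") if sigma == 0 else observed / sigma

print(sd_score("ACGTTGCAACGT", "ACGTAGCAACGT", rng=random.Random(0)))
```

Passing a seeded `random.Random` makes a run repeatable; without it the value changes slightly from run to run, as expected for a permutation test.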

Common Multiple Alignment Methods: Iterative alignment
Q: Why should sd(i, j) favor secondary structure similarity?
Expectations:
- Common nonrandom structures will raise sd(i, j).
- These structures will be destroyed in the permuted strings.
Certainly, sd(i, j) favors similarity.
It is not clear that the favored similarity is necessarily secondary structure similarity.

Common Multiple Alignment Methods: Iterative alignment
Limited empirical results with test data:
If sd(i, j) score > 5 then > 70% agreement with reference alignment (reference alignment from x-ray data).
Gusfield states, “So sd(i, j) values can be used to give some confidence that the optimal alignment is biologically informative, even when the alignment is obtained from proteins where the secondary structure is not known.”
Anybody skeptical?

Common Multiple Alignment Methods: Iterative alignment
Iterative alignment:
1. Select the pair of strings with maximal sd(i, j) score.
2. Optimally align these two strings.
3. Repeat:
   - Compute the profile of the current multiple alignment.
   - Find the largest sd(i, j) score where S_i is included in the multiple alignment but S_j is not.
   - Merge S_j into the multiple alignment by aligning S_j with the profile.
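The selection order and the profile computation can be sketched as below. This assumes the sd scores are already available as a symmetric matrix and takes a profile to be the per-column character frequencies of the current alignment. Aligning a string against the profile is omitted, and the names `iterative_merge_order` and `column_profile` are hypothetical.

```python
from collections import Counter

def iterative_merge_order(sd):
    """sd: symmetric n x n matrix of sd(i, j) scores.

    Returns the selected pairs (i, j): S_j joins the alignment because of
    its score with S_i, which is already in the alignment.
    """
    n = len(sd)
    # Steps 1 and 2: the pair with the maximal score starts the alignment.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: sd[p[0]][p[1]])
    aligned = [i0, j0]
    steps = [(i0, j0)]
    # Step 3: largest score with S_i inside the alignment and S_j outside.
    while len(aligned) < n:
        i, j = max(((i, j) for i in aligned
                    for j in range(n) if j not in aligned),
                   key=lambda p: sd[p[0]][p[1]])
        steps.append((i, j))
        aligned.append(j)
    return steps

def column_profile(alignment):
    """Per-column character frequencies of equal-length aligned rows."""
    rows = len(alignment)
    return [{ch: count / rows for ch, count in Counter(col).items()}
            for col in zip(*alignment)]

scores = [[0, 9, 2, 4],
          [9, 0, 7, 3],
          [2, 7, 0, 5],
          [4, 3, 5, 0]]
print(iterative_merge_order(scores))
print(column_profile(["A-C", "AGC"]))
```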

Common Multiple Alignment Methods: Iterative alignment
Q: Are the sd scores selected in the same order that Prim’s maximum spanning tree algorithm would select them?
A: Yes. So this is a maximum spanning tree clustering method.
Q: Is the multiple alignment consistent with some node labeling of the maximum spanning tree?
A: No.
Q: Why not?

Common Multiple Alignment Methods: Iterative alignment
Do we remember what it means for an alignment to be consistent with a tree?
M is consistent with T if the induced pairwise alignment of S_i and S_j has optimal weighted edit distance for each pair of strings (S_i, S_j) that label adjacent nodes in T.
Q: So why is the multiple alignment not consistent with some node labeling of the maximum spanning tree?

Common Multiple Alignment Methods: Iterative alignment
A: Because S_j is not aligned to have optimal weighted edit distance to S_i. Recall, S_j is optimally aligned with the profile at the time it is merged.
Q: So how well does this algorithm work?
A: Contradictory results compared to optimal pairwise alignments:
- Secondary structure alignment is improved when optimal pairwise alignment gives a poor result.
- Poor result when optimal pairwise alignment does a good job.

Common Multiple Alignment Methods: Iterative alignment
Q: Why have we been interested in multiple alignment?
A: We have focused on two problems:
1. Characterizing protein families & superfamilies.
2. Identifying important conserved features.
There is a third reason for investigating multiple alignment:
Q: Any ideas?
A: Deducing evolutionary history.

Common Multiple Alignment Methods: Iterative alignment
Consider: iterative alignment successively merges distinct subsets of strings.
- This can be represented by a binary tree T.
- Each leaf is a single string representing a taxon.
- Each internal node v represents:
  1. The merge of the strings in the subtree rooted at v.
  2. The multiple alignment of v’s descendants.
Idea: choose merge criteria that reflect evolutionary history.
- Then T represents a deduced evolutionary tree.

Common Multiple Alignment Methods: Iterative alignment
Approach: progressive alignment
Key idea: pairs of strings with minimum edit distance probably represent recent divergence of taxa.
- Merging such pairs should provide the best information.
- This alignment should conserve the maximum amount of common structure.

Common Multiple Alignment Methods: Iterative alignment
Consequence of optimal structure conservation: gap preservation.
- Never remove a gap in subsequent merges. Any gap inserted in a pairwise alignment should be preserved in the multiple alignment.
Note: if subsequent merges are effected by aligning profiles, then gaps are automatically preserved.

Common Multiple Alignment: Repeated-Motif Methods
Approach (for local & global alignments):
1. Find a “good” motif.
   Q: What is a good motif?
   A: One that is wide (long) and high (common to many of the strings).
2. Shift the strings containing the motif to align the occurrence of the motif.
3. Recursively align the substrings on either side of the motif.

Common Multiple Alignment: Repeated-Motif Methods
Q: What happens when we run out of high & wide motifs?
A: Use iterative alignment to finish up.
Q: What about the strings that didn’t contain the first “good” motif?
A: Align these strings separately, starting with their own “good” motif.
Note: these separately aligned strings will have to be merged afterwards.

Common Multiple Alignment: Repeated-Motif Methods
Q: How can “good” motifs be found?
1. Collect fixed-size substrings as candidates.
2. Examine the “goodness” of the substrings using:
   1. Hashing techniques, or
   2. Standard substring comparison, or
   3. Suffix trees, or
   4. Sorting methods.
Q: Suggest how sorting could be used.
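One way the hashing option might look, as a minimal sketch: it only finds exact fixed-length substrings (real motifs may need approximate matching), and it measures “height” as the number of strings containing the substring. The name `candidate_motifs` and the `min_height` parameter are made up for this example.

```python
from collections import defaultdict

def candidate_motifs(strings, k, min_height=2):
    """Exact length-k substrings that occur in at least min_height strings.

    Returns (substring, {string index: first position}) pairs, highest first.
    """
    seen = defaultdict(dict)  # hash table: substring -> where it occurs
    for idx, s in enumerate(strings):
        for pos in range(len(s) - k + 1):
            # Record only the first occurrence in each string.
            seen[s[pos:pos + k]].setdefault(idx, pos)
    hits = {m: occ for m, occ in seen.items() if len(occ) >= min_height}
    # Highest motifs first; ties broken alphabetically for a stable order.
    return sorted(hits.items(), key=lambda kv: (-len(kv[1]), kv[0]))

for motif, where in candidate_motifs(["XXACGTYY", "ACGTZZ", "QQACGQ"], 3):
    print(motif, where)
```

The recorded positions are what step 2 of the approach needs in order to shift the strings so that the motif occurrences line up.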

Common Multiple Alignment: Repeated-Motif Methods
Next, try to extend the motif on both ends. Recall, the candidate motifs are all fixed-length.
Gusfield notes: there are many ways to realize this approach. There does not appear to be a best variant.

Common Multiple Alignment: Repeated-Motif Methods
Vingron/Argos repeated-motif method: (We will limit our discussion to 3 strings for simplicity.)
1. Create a graph of 3n nodes.
2. Look for a similar l-length (chosen by modeler) substring at nodes (i, j) & (i´, j´), j ≠ j´.
3. If the l-length substrings are sufficiently similar, connect node (i, j) to (i´, j´).
4. Remove any edge in the graph that is not part of a clique of size 3.
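A brute-force sketch of steps 1 to 4 for three strings. Instead of building the graph and then deleting edges, it enumerates the size-3 cliques directly, which leaves the same surviving edges. “Sufficiently similar” is assumed here to mean at most `max_mismatch` mismatching positions, and both that parameter and the name `three_cliques` are hypothetical.

```python
from itertools import product

def three_cliques(s1, s2, s3, l, max_mismatch=0):
    """Triples (i, i2, i3) of start positions whose length-l windows are
    pairwise similar across the three strings."""
    strings = (s1, s2, s3)

    def similar(a, i, b, j):
        # Edge test between node (i, a) and node (j, b): Hamming-style check.
        x, y = strings[a][i:i + l], strings[b][j:j + l]
        return sum(c != d for c, d in zip(x, y)) <= max_mismatch

    starts = [range(len(s) - l + 1) for s in strings]
    cliques = []
    for i, j, k in product(*starts):
        # All three edges must exist for the nodes to form a clique of size 3.
        if similar(0, i, 1, j) and similar(1, j, 2, k) and similar(0, i, 2, k):
            cliques.append((i, j, k))
    return cliques

print(three_cliques("XXACGTYY", "ACGTZZ", "QQACGQ", 3))
```

This takes time cubic in the string length, so it is only meant to make the definition concrete.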

Common Multiple Alignment: Repeated-Motif Methods
Q: What is the significance of a clique of size 3?
A: This represents a similar l-length substring that appears in each of the 3 strings.
- The clique forms a motif.
- We delete edges that are not part of such cliques since we are only interested in finding motifs.
- Represent the clique at (i, 1), (i´, 2), and (i´´, 3) by (i, i´, i´´).

Common Multiple Alignment: Repeated-Motif Methods
Clique (i, i´, i´´) is to the left of clique (z, z´, z´´) iff i < z, i´ < z´, and i´´ < z´´.
Define two cliques as non-crossing if one is to the left of the other.
5. Find a set of nice non-crossing cliques.
We want “nicely spaced” motifs.
Q: What does this mean?
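The two definitions translate directly into code. The sketch below also picks a largest set of pairwise non-crossing cliques by a simple dynamic program, ignoring the “nicely spaced” criterion; that selection rule and the function names are assumptions for illustration, not the Vingron/Argos procedure itself.

```python
def is_left_of(c, z):
    """Clique c is to the left of z iff every coordinate is strictly smaller."""
    return all(a < b for a, b in zip(c, z))

def non_crossing(c, z):
    return is_left_of(c, z) or is_left_of(z, c)

def longest_noncrossing_chain(cliques):
    """A largest set of pairwise non-crossing cliques, in left-to-right order."""
    order = sorted(cliques)  # left-of implies lexicographically smaller
    if not order:
        return []
    best = [1] * len(order)     # best[k]: longest chain ending at order[k]
    prev = [None] * len(order)  # back-pointers to rebuild the chain
    for k, z in enumerate(order):
        for m in range(k):
            if is_left_of(order[m], z) and best[m] + 1 > best[k]:
                best[k], prev[k] = best[m] + 1, m
    k = max(range(len(order)), key=best.__getitem__)
    chain = []
    while k is not None:
        chain.append(order[k])
        k = prev[k]
    return chain[::-1]

print(longest_noncrossing_chain([(5, 6, 7), (1, 2, 3), (2, 1, 4), (8, 9, 10)]))
```

Because “left of” is transitive, every pair of cliques in the returned chain is non-crossing.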

Common Multiple Alignment: Repeated-Motif Methods
Graphically, we would prefer a set of “nicely spaced” cliques to a set of cliques that are not “nicely spaced”.

Common Multiple Alignment: Repeated-Motif Methods
We can do this by considering the relative starting position of the motif:
- Compare how the differences i´ - i & i´´ - i´ match with z´ - z & z´´ - z´.
- Give a high weight to pairs of cliques that are thus “nicely spaced”.
- Give a small weight to pairs of “poorly spaced” cliques.
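One possible weighting, purely as an illustration: the slides do not give a formula, so the function below is an assumption. It compares the offsets between consecutive coordinates of the two cliques and returns 1 when they agree exactly, with the weight shrinking as they diverge. The name `spacing_weight` is made up.

```python
def spacing_weight(c, z):
    """Hypothetical weight in (0, 1] for a pair of cliques c and z.

    c = (i, i2, i3) and z = (z1, z2, z3) are motif start positions in the
    three strings. Equal offsets between neighbouring strings mean both
    motifs can be aligned with the same relative shifts.
    """
    offsets_c = (c[1] - c[0], c[2] - c[1])  # i' - i and i'' - i'
    offsets_z = (z[1] - z[0], z[2] - z[1])  # z' - z and z'' - z'
    mismatch = sum(abs(a - b) for a, b in zip(offsets_c, offsets_z))
    return 1.0 / (1.0 + mismatch)

print(spacing_weight((2, 0, 2), (10, 8, 10)))  # same offsets in both cliques
print(spacing_weight((1, 2, 3), (5, 9, 10)))   # offsets differ by 3 in total
```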
