UNIVERSITY OF SOUTH CAROLINA, College of Engineering & Information Technology
Bioinformatics Algorithms and Data Structures
Chapter 11, Sections 4-7
Lecturer: Dr. Rose
Slides by: Dr. Rose
February 1 & 6, 2007

Edit Graphs
Key idea: weighted edit graph.
Defn. Given strings S1 and S2 of lengths n and m, respectively, a weighted edit graph has (n+1) by (m+1) nodes, labelled (i,j), 0 ≤ i ≤ n, 0 ≤ j ≤ m. The edges and edge weights are problem specific.

Edit Graphs
Example: edit distance problem
–The weighted graph for the edit distance problem has directed edges from node (i, j) to the nodes (i + 1, j), (i, j + 1), and (i + 1, j + 1), provided they exist.
–The weight of the directed edges to nodes (i + 1, j) and (i, j + 1) is 1.
–The weight of the directed edge to (i + 1, j + 1) is t(i + 1, j + 1), where t(i, j) = 0 if S1(i) = S2(j) and 1 otherwise.
–Figure 11.4 in the textbook shows an edit graph.

Edit Graphs
Thm. An edit transcript for strings S1 and S2 has the minimum number of edit operations ⇔ it corresponds to a shortest path from (0,0) to (n,m) in the edit graph.
Cor. The set of all shortest paths from (0,0) to (n,m) in the edit graph specifies all optimal edit transcripts of S1 to S2.
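The path view can be checked on small inputs. The following is a minimal sketch, assuming unit weights on horizontal and vertical edges and a 0/1 weight on diagonal edges; the function name `edit_graph_shortest_path` is illustrative and does not come from the slides or the textbook.

```python
def edit_graph_shortest_path(s1, s2):
    """Length of a shortest (0,0) -> (n,m) path in the unit-cost edit graph."""
    n, m = len(s1), len(s2)
    inf = float("inf")
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0
    # Row-major order is a topological order of the graph, because every
    # edge leads to a node with a larger i and/or a larger j.
    for i in range(n + 1):
        for j in range(m + 1):
            here = dist[i][j]
            if i < n:  # edge to (i+1, j), weight 1
                dist[i + 1][j] = min(dist[i + 1][j], here + 1)
            if j < m:  # edge to (i, j+1), weight 1
                dist[i][j + 1] = min(dist[i][j + 1], here + 1)
            if i < n and j < m:  # diagonal edge, weight 0 on a match, else 1
                t = 0 if s1[i] == s2[j] else 1
                dist[i + 1][j + 1] = min(dist[i + 1][j + 1], here + t)
    return dist[n][m]
```

For example, `edit_graph_shortest_path("kitten", "sitting")` evaluates to 3, the usual edit distance of that pair.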

Weighted Edit Distance
There are two ways of assigning weights or costs to calculate edit distance:
1.By edit operation
2.By alphabet, i.e., different costs for different characters
Our initial approach was to assign weight by edit operation, i.e., 1 for insert, delete, or replace, and 0 for match.
We can generalize our approach by assigning the weight d for an insertion or deletion, r for a replacement, and e for a match.

Weighted Edit Distance
Q: What values for d, r, and e have we been using?
A: d = 1, r = 1, and e = 0.
Q: What would happen if r > 2*d?
A: Replacements would never occur in an optimal transcript: a deletion followed by an insertion has the same effect at cost 2*d < r.
Defn. The operation-weight edit distance problem entails finding an edit transcript transforming S1 to S2 with the minimum total operation weight.

Weighted Edit Distance
Q: What changes should we make to the definition of edit distance, D(i,j), to reflect operation weight?
We have to specify an operation-specific definition. The base conditions become:
–D(i,0) = i * d. Why?
–D(0,j) = j * d. Why?

Weighted Edit Distance
The general recurrence becomes:
–D(i,j) = min[D(i,j-1) + d, D(i-1,j) + d, D(i-1,j-1) + t(i,j)]
–Where t(i,j) = e if S1(i) = S2(j), otherwise t(i,j) = r
–Q: Why?
–A: the cost of
1.Delete (from i-1,j) is d
2.Insert (from i,j-1) is d
3.Match (from i-1,j-1) is e
4.Replace (from i-1,j-1) is r
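The base conditions and recurrence for operation-weight edit distance translate directly into a table computation. This is a minimal sketch; the function name and the default weights (d = 1, r = 1, e = 0, the values used earlier in the lecture) are choices made for illustration.

```python
def operation_weight_distance(s1, s2, d=1, r=1, e=0):
    """Operation-weight edit distance D(n, m) for indel weight d,
    replacement weight r and match weight e."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * d  # delete the first i characters of s1
    for j in range(1, m + 1):
        D[0][j] = j * d  # insert the first j characters of s2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = e if s1[i - 1] == s2[j - 1] else r
            D[i][j] = min(D[i][j - 1] + d,      # insert
                          D[i - 1][j] + d,      # delete
                          D[i - 1][j - 1] + t)  # match or replace
    return D[n][m]
```

With r = 3 and d = 1 (so r > 2*d), `operation_weight_distance("a", "b", d=1, r=3)` gives 2 rather than 3: the optimal transcript deletes and inserts instead of replacing.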

Weighted Edit Distance
The alternative to operation-weight edit distance is alphabet-weight edit distance.
Idea: different characters have different costs.
Q: How would we modify the edit distance function, D(i,j), to support alphabet-weight edit distance?
A: Let weight(x) denote the weight associated with character x for all x in the alphabet.
–Then D(i,0) = Σ weight(S1(k)), summed over 1 ≤ k ≤ i
–And D(0,j) = Σ weight(S2(k)), summed over 1 ≤ k ≤ j
Q: What about the general recurrence D(i,j)?

Weighted Edit Distance
A: D(i,j) = min[D(i,j-1) + weight(S2(j)), D(i-1,j) + weight(S1(i)), D(i-1,j-1) + t(i,j)]
–Where t(i,j) = weight(S2(j)) if S1(i) ≠ S2(j), otherwise 0.
Note: for proteins, edit distance usually refers to alphabet-weight edit distance. As the text mentions, the weights are usually derived from the PAM matrices of Dayhoff or the BLOSUM matrices of Henikoff.
Edit distance for DNA strings is usually either unweighted or operation-weighted edit distance.
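The alphabet-weight recurrence as stated on these slides can be sketched the same way. The `weight` dictionary below is a made-up toy table for illustration only; it is not derived from PAM or BLOSUM.

```python
def alphabet_weight_distance(s1, s2, weight):
    """Alphabet-weight edit distance following the slide recurrence:
    an indel costs the weight of the character involved, and a
    replacement costs weight(S2(j))."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + weight[s1[i - 1]]  # running sum of weights
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + weight[s2[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = s1[i - 1], s2[j - 1]
            t = 0 if a == b else weight[b]
            D[i][j] = min(D[i][j - 1] + weight[b],
                          D[i - 1][j] + weight[a],
                          D[i - 1][j - 1] + t)
    return D[n][m]

# Hypothetical per-character weights, chosen only to make the example concrete.
toy_weight = {"a": 1, "c": 2, "g": 3, "t": 4}
```

With this toy table, `alphabet_weight_distance("a", "g", toy_weight)` is 3: replacing a by g costs weight(g) = 3, which beats deleting a and inserting g at cost 1 + 3 = 4.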

String Similarity
The relatedness of two strings can be expressed in terms of similarity. This similarity is usually expressed in terms of alignment rather than in terms of edit distance.
Defn. Let Σ be the alphabet for strings S1 and S2. Let Σ´ be Σ with the additional character ‘-’ denoting a space. Let s(x,y) denote the value obtained by aligning character x with character y.

String Similarity
Defn. The value of alignment A is defined as:
Σ s(S1´(i), S2´(i)), summed over 1 ≤ i ≤ l
where S1´ and S2´ denote the strings after the insertion of spaces and l denotes their common length.
If s(x,y) is greater than or equal to zero when x and y match and negative when they mismatch, then we look for the alignment with the largest value.
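Evaluating an alignment is a column-by-column sum. The sketch below assumes the scoring scheme that these slides use later for the local alignment example (match 2, mismatch -2, character against space -1); the names `simple_score` and `alignment_value` are illustrative.

```python
def simple_score(x, y):
    """Assumed scoring: match 2, mismatch -2, character against space -1."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2

def alignment_value(a1, a2, s=simple_score):
    """Sum of s over the columns of an alignment given as two equal-length
    strings in which '-' marks an inserted space."""
    if len(a1) != len(a2):
        raise ValueError("aligned strings must have equal length")
    return sum(s(x, y) for x, y in zip(a1, a2))
```

For the alignment of axab-cs over ax-bacs from the local alignment slides, `alignment_value("axab-cs", "ax-bacs")` gives 5 * 2 + 2 * (-1) = 8.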

String Similarity
Example: Σ = {a, g, c, t}. Let s(x,y) be defined by: [scoring table not reproduced in this transcript]
Q: What is the value of the following alignment?
a t a - a c t g t
g t a g a c - g t

String Similarity
Defn. Given a scoring matrix over Σ´, define the similarity of two strings S1 and S2 as the value of the alignment A that maximizes the total alignment value of S1´ and S2´.
This also defines the optimal alignment value of the strings S1 and S2.

Computing Similarity
Q: How can we compute the optimal alignment value of the strings S1 and S2?
A: Use dynamic programming.
Defn. Let V(i,j) denote the value of the optimal alignment of prefixes S1[1..i] and S2[1..j].
If strings S1 and S2 have lengths n and m, respectively, then the value of the optimal alignment of these strings is given by V(n,m).
Q: What do you guess the time complexity will be?
A: O(nm)

Computing Similarity
Define the general recurrence relation as:
V(i,j) = max[V(i - 1, j - 1) + s(S1(i), S2(j)), V(i - 1, j) + s(S1(i), _), V(i, j - 1) + s(_, S2(j))]
The optimal alignment value relation is defined similarly to the edit distance relation.
Base Conditions:
V(0,j) = Σ s(_, S2(k)), summed over 1 ≤ k ≤ j
V(i,0) = Σ s(S1(k), _), summed over 1 ≤ k ≤ i

Computing Similarity
V(i,j) = max[V(i - 1, j - 1) + s(S1(i), S2(j)), V(i - 1, j) + s(S1(i), _), V(i, j - 1) + s(_, S2(j))]
Q: What does this recurrence relation say?
A: The value of the optimal alignment of the prefixes S1[1..i] and S2[1..j] is the maximum of the values of:
1.The optimal alignment of S1[1..i-1] and S2[1..j-1] extended by aligning S1(i) and S2(j).
2.The optimal alignment of S1[1..i-1] and S2[1..j] extended by aligning S1(i) with a space.
3.The optimal alignment of S1[1..i] and S2[1..j-1] extended by aligning a space with S2(j).
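A minimal sketch of the similarity computation follows. It assumes base conditions that charge every leading space (as in global alignment) and a hypothetical scoring function with match 2, mismatch -2, space -1; `similarity` and `simple_score` are names introduced here.

```python
def simple_score(x, y):
    """Assumed scoring: match 2, mismatch -2, character against space -1."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2

def similarity(s1, s2, s=simple_score):
    """Optimal global alignment value V(n, m)."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + s(s1[i - 1], "-")  # s1 prefix against spaces
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + s("-", s2[j - 1])  # spaces against s2 prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),
                          V[i - 1][j] + s(s1[i - 1], "-"),
                          V[i][j - 1] + s("-", s2[j - 1]))
    return V[n][m]
```

Each of the (n+1)(m+1) cells is filled in constant time, which is where the O(nm) bound comes from. As a check, `similarity("axabcs", "axbacs")` is 8: five matches and two spaces.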

Longest Common Subsequence
Defn. A subsequence of a string S is a subset of the characters of S arranged in their original relative order.
Example: S = interdepartmentaladministratorstaskforce
subsequence => idiots
Obviously every substring of S is also a subsequence of S.
Defn. A common subsequence of two strings is a subsequence that appears in both strings.
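The subsequence definition can be tested with a single left-to-right scan. This is a small sketch; `is_subsequence` is a name chosen here.

```python
def is_subsequence(sub, s):
    """True if the characters of sub occur in s in the same relative order."""
    remaining = iter(s)
    # 'ch in remaining' consumes the iterator up to and including the first
    # occurrence of ch, so later characters must be found after that point.
    return all(ch in remaining for ch in sub)
```

`is_subsequence("idiots", "interdepartmentaladministratorstaskforce")` is True, and so is the call for any substring such as "depart".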

Longest Common Subsequence
Defn. The longest common subsequence problem entails finding the longest common subsequence (lcs) of two strings.
Thm. The matched characters of an optimal alignment A form a longest common subsequence if a scoring scheme is used in which each matching pair of characters scores 1 and each mismatch or space scores 0.
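Under that scoring scheme the similarity table doubles as an lcs table, and a traceback through it recovers one longest common subsequence. A minimal sketch, with `lcs` as an illustrative name:

```python
def lcs(s1, s2):
    """One longest common subsequence, via the alignment recurrence with
    score 1 for a match and 0 for a mismatch or a space."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if s1[i - 1] == s2[j - 1] else 0
            V[i][j] = max(V[i - 1][j - 1] + match, V[i - 1][j], V[i][j - 1])
    # Trace back from (n, m), collecting the matched characters.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1] and V[i][j] == V[i - 1][j - 1] + 1:
            out.append(s1[i - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For example, `lcs("kitten", "sitting")` returns "ittn".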

Alignment Graphs
Like distance, similarity can be viewed as a path problem: the graph that is analogous to the edit graph (section 11.4) is called an alignment graph.
Defn. An alignment graph is a DAG similar to an edit graph in which the edge weights correspond to the values for aligning specific character pairs.
The optimal alignment corresponds to the longest path, in terms of the sum of edge weights, from (0,0) to (n,m) of the dynamic programming table.
The longest paths (optimal alignments) can be found in O(nm) time.

End-Space Free Alignment
End-space free alignment: an alignment variant in which leading and trailing spaces contribute zero weight.
Example:
e x a m p l e - h e c o u l d a - - - h a d a - - b e e r
- - - - - - - - h e w o u l d n t a s h o t h i s d e a r
The first eight spaces are free.
This encourages (biases towards):
1.Alignment of one string inside the other or
2.Alignment of the prefix of one string with the suffix of the other

End-Space Free Alignment
Q: When should interior or prefix/suffix matching be preferred?
A: When it matches the nature of the problem being modeled.
An example is shotgun sequence assembly. Explain!
1.Start with a large collection of partially overlapping substrings that come from multiple copies of one original, but unknown, string.
2.Use comparisons of pairs of substrings to infer the original string.

End-Space Free Alignment
Q: Would you expect substrings that overlap in the original string to show significant alignment?
A: Perhaps. In any case, with some slop for sequencing errors, either:
1.one string would align inside the other or
2.the prefix of one string would align with the suffix of the other
In contrast, a significant alignment of randomly selected substrings from this collection is unlikely.
An end-space free alignment would detect this difference and score overlapping substrings higher.

End-Space Free Alignment
We can deduce candidate neighbor pairs by:
1.Computing the end-space free alignment for every pair of substrings.
2.Treating high-scoring alignments as likely neighbors.
To compute this:
1.Use the recurrence for global alignment, where spaces count.
2.Change the definition of V(i,0) and V(0,j) to address leading spaces: V(i,0) = V(0,j) = 0 for all i and j.
3.Compute the table (alignment graph) in O(nm) time. How?

End-Space Free Alignment
Unlike global alignment, the value of the optimal alignment is not necessarily in cell (n,m). It will now be found in:
1.A cell in row n, if the last character of S1 contributes to the value of the alignment but the last characters of S2 do not.
2.A cell in column m, if the last character of S2 contributes to the value of the alignment but the last characters of S1 do not.
The value of the optimal alignment is the largest value found in row n or column m.
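The changes relative to global alignment are small: zero base conditions, and a final scan of row n and column m. A minimal sketch under an assumed scoring scheme (match 2, mismatch -2, space -1); the function names are illustrative.

```python
def simple_score(x, y):
    """Assumed scoring: match 2, mismatch -2, character against space -1."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2

def end_space_free_value(s1, s2, s=simple_score):
    """Value of the best alignment in which leading and trailing spaces
    are free."""
    n, m = len(s1), len(s2)
    # Row 0 and column 0 stay at 0: leading spaces cost nothing.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1]),
                          V[i - 1][j] + s(s1[i - 1], "-"),
                          V[i][j - 1] + s("-", s2[j - 1]))
    # Trailing spaces are free, so take the best cell in row n or column m.
    best_in_row_n = max(V[n])
    best_in_col_m = max(V[i][m] for i in range(n + 1))
    return max(best_in_row_n, best_in_col_m)
```

Both situations from the slides score well: `end_space_free_value("xxxabc", "abcyyy")` (suffix against prefix) and `end_space_free_value("abc", "xxabcxx")` (one string inside the other) are each 6.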

And now for something completely different: Approximate Matching

Approximate Matching
Basic idea: threshold-defined similarity
Defn. A substring T´ of T is an approximate occurrence of P ⇔ the optimal alignment of P to T´ has value at least δ, the threshold parameter.
Approach:
1.Use the standard recurrence for global alignment.
2.Do not charge for the part of T that precedes the occurrence: V(0,j) = 0 for all j. V(i,0) keeps its global alignment value, so every character of P is accounted for.
3.Leave backpointers while computing the table.

Approximate Matching
Q: How can we recognize an approximate occurrence of P in T from the table computation?
A: If the length of P is n, then for some j, V(n,j) ≥ δ.
More specifically:
Thm. An approximate occurrence of P in T ends at position j of T ⇔ V(n,j) ≥ δ.
This tells us where in T the approximate occurrence ends. Where in T does it start?

Approximate Matching
Thm. (version 0) An approximate occurrence of P in T ends at position j of T ⇔ V(n,j) ≥ δ.
This tells us where in T the approximate occurrence ends. Where in T does it start?
We can find the start by following the path of backpointers from cell (n,j) back to (0,k). k is the starting position in T.
Thm. (version 1) T[k..j] is an approximate occurrence of P in T ⇔ V(n,j) ≥ δ and there is a path of backpointers from (n,j) to (0,k).

Approximate Matching
The table computation takes O(nm) time.
Consider: depending on the threshold δ, T may contain a great many approximate occurrences of P.
Q: Can all approximate occurrences be explicitly output in O(nm)?
A: Perhaps not.
The textbook suggests locating all j s.t. V(n,j) ≥ δ and, for each, explicitly outputting a shortest approximate occurrence:
1.Traverse backpointers from (n,j) until reaching (0,k).
2.Choose vertical pointers over diagonal pointers.
3.Choose diagonal pointers over horizontal pointers.

Approximate Matching
How does this particular preference produce a shortest occurrence?
1.Choose vertical pointers over diagonal pointers.
2.Choose diagonal pointers over horizontal pointers.
Recall:
1.Horizontal edges correspond to inserting a space in P; this lengthens the occurrence in T. Clearly this is to be avoided.
2.Diagonal edges correspond to matches or mismatches.
3.Vertical edges correspond to inserting a space in T.
At first glance there is no obvious reason to prefer vertical over diagonal edges, and some preference is needed for tie-breaking anyway. However, choosing vertical edges results in the match that is shortest in T.
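The table computation and the preferred traceback can be sketched together. This sketch assumes that only row 0 is free (V(0,j) = 0) while column 0 is charged for spaces, so that every backpointer path ends in row 0; the scoring scheme and all names are illustrative. Backpointers are not stored but re-derived from the table values.

```python
def simple_score(x, y):
    """Assumed scoring: match 2, mismatch -2, character against space -1."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2

def approximate_occurrences(P, T, delta, s=simple_score):
    """For each end position j with V(n, j) >= delta, return (k, j) such
    that the Python slice T[k:j] is an approximate occurrence of P, found
    by preferring vertical, then diagonal, then horizontal moves."""
    n, m = len(P), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 is free
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + s(P[i - 1], "-")  # column 0 is charged
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + s(P[i - 1], T[j - 1]),
                          V[i - 1][j] + s(P[i - 1], "-"),
                          V[i][j - 1] + s("-", T[j - 1]))
    hits = []
    for j in range(1, m + 1):
        if V[n][j] < delta:
            continue
        i, k = n, j
        while i > 0:
            if V[i][k] == V[i - 1][k] + s(P[i - 1], "-"):
                i -= 1                      # vertical: space in T
            elif k > 0 and V[i][k] == V[i - 1][k - 1] + s(P[i - 1], T[k - 1]):
                i, k = i - 1, k - 1         # diagonal: match or mismatch
            else:
                k -= 1                      # horizontal: space in P
        hits.append((k, j))
    return hits
```

For P = "abc", T = "xxabcxx" and δ = 6 the only hit is (2, 5), and `T[2:5]` is "abc".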

Global Alignment vs Local Alignment

Local Alignment
So far we have focused on global alignment. This makes sense if
–We expect one string to be contained in the other or
–We expect the strings to be closely related.
Example: comparing amino acid sequences from the same protein family.

Local Alignment
Local alignment exposes regions of high similarity.
–This may be interesting even if we expect the strings to be globally dissimilar.
–Can you think of examples?
Comparing proteins from different protein families.
How about searching for lateral gene transfer from prokaryotic genomes to eukaryotic genomes?
–Huh????

Local Alignment
Local alignment problem. Find maximally similar (optimal global alignment) substrings α and β of S1 and S2, respectively.
Example from text: S1 = pqraxabcstvq, S2 = xyaxbacsll
α = a x a b - c s
β = a x - b a c s
This global alignment is predicated on:
–a score of 2 for a match
–a score of –2 for a mismatch
–a score of –1 for a space
–Resulting in a value of 8.

Computing Local Alignment
Q: How can local alignment be computed?
Q: Can global alignment be used to find local alignment?
A: Not efficiently. Global alignment effectively averages out local similarity.
Use an explicit search for local similarity.

Computing Local Alignment
Q: Assuming S1 and S2 have respective lengths n and m, how many pairs of substrings are there?
A: There are O(n²m²) pairs of substrings.
Q: If we wanted to, how could we show there are this many pairs of substrings?
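One way to see the bound: a non-empty substring is fixed by its start and end positions, so a string of length n has n(n+1)/2 of them (counted by position), and the number of pairs is the product for the two strings. A small sketch comparing the closed form against brute-force enumeration; both function names are made up for this illustration.

```python
def count_substring_pairs(n, m):
    """Closed form: n(n+1)/2 substrings of S1 times m(m+1)/2 of S2."""
    return (n * (n + 1) // 2) * (m * (m + 1) // 2)

def brute_force_pairs(n, m):
    """Enumerate every (start, end) choice in both strings explicitly."""
    return sum(1
               for a in range(n) for b in range(a, n)
               for c in range(m) for d in range(c, m))
```

For n = 3 and m = 2 both give 6 * 3 = 18, and the closed form grows as Θ(n²m²).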

Computing Local Alignment
Observation: Computing a global alignment for each of the O(n²m²) pairs of substrings takes far more than O(nm) time.
Surprisingly, we can compute local alignment in O(nm) time even though there are O(n²m²) pairs of substrings.
Assumption: the global alignment of two empty strings has value zero.

Computing Local Alignment
First consider a restricted version of local alignment.
–Defn. The local suffix alignment problem entails finding a suffix α of S1[1..i] and a suffix β of S2[1..j] s.t. V(α, β) is the maximum over all pairs of suffixes of S1[1..i] and S2[1..j].
–Let v(i,j) denote the value of the optimal local suffix alignment for the index pair i,j.

Computing Local Alignment
Local suffix alignment example:
–S1 = abcxdex, S2 = xxcxdeabc. Score 2 for matches and –1 for mismatches or spaces.
–v(3,4) = 1, how?
–The c’s match but there is an additional ‘-’ aligned with x.
–v(4,4) = 4, how?
–The c’s match and the final x’s match.
–v(5,4) = 3, how?
–Same as v(4,4) but extended with d aligned with ‘-’.
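These three values can be confirmed straight from the definition by trying every pair of suffixes and globally aligning each pair. The brute-force sketch below is meant only as a check on small inputs; the function names are illustrative, and the scoring is the one given in the example (match 2, mismatch or space -1).

```python
def global_value(a, b, match=2, other=-1):
    """Optimal global alignment value with 'match' for equal characters
    and 'other' for a mismatch or a space."""
    prev = [j * other for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * other]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else other)
            cur.append(max(diag, prev[j] + other, cur[j - 1] + other))
        prev = cur
    return prev[len(b)]

def local_suffix_value(s1, s2, i, j):
    """v(i, j): best global value over all suffix pairs of s1[1..i] and
    s2[1..j], including the empty suffixes."""
    p1, p2 = s1[:i], s2[:j]
    return max(global_value(p1[a:], p2[b:])
               for a in range(i + 1) for b in range(j + 1))
```

With S1 = "abcxdex" and S2 = "xxcxdeabc", `local_suffix_value` returns 1, 4 and 3 for the index pairs (3,4), (4,4) and (5,4).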

Computing Local Alignment
Observation: v(i,j) ≥ 0.
Q: Why is this true?
A: We can always choose α and β to be the empty string.
Let v* denote the value of the optimal local alignment for strings of length n and m.
Thm. v* = max[v(i,j): i ≤ n, j ≤ m]

Computing Local Alignment
We need to understand why this theorem, v* = max[v(i,j): i ≤ n, j ≤ m], is true.
Proof:
(≥)
–v* ≥ max[v(i,j): i ≤ n, j ≤ m] since any optimal local suffix alignment is also a local alignment.

Computing Local Alignment
(≤)
–WLOG assume v* is derived from the optimal solution involving substrings α and β with end indices i* and j*. Then α and β define a local suffix alignment for indices i* and j*, thus v* ≤ v(i*,j*) ≤ max[v(i,j): i ≤ n, j ≤ m].
From this it is clear that a solution to the local suffix alignment problem also solves the local alignment problem.

Computing Local Alignment
Thm. v(i,j) = max[0, v(i – 1, j – 1) + s(S1(i), S2(j)), v(i – 1, j) + s(S1(i), _), v(i, j – 1) + s(_, S2(j))]
–Where v(i, 0) = 0 and v(0, j) = 0 for all i,j
Q: What does this recurrence say?
A: The solution to the local suffix alignment problem v(i, j) is the largest of:
1.0, punt and choose α and β to be empty strings
2.v(i – 1, j – 1) extended by aligning S1(i) and S2(j)
3.v(i – 1, j) extended by aligning S1(i) with ‘_’
4.v(i, j – 1) extended by aligning ‘_’ with S2(j)

Computing Local Alignment
Q: What is the difference between the equations for global alignment and local suffix alignment?
A: There are two differences:
1.The inclusion of 0 in the maximum for local suffix alignment.
2.The base conditions for local suffix alignment: v(i,0) = 0 and v(0,j) = 0 for all i,j. This is similar to finding approximate occurrences but not to general global alignment.

Computing Local Alignment
Approach to computing v*:
1.Compute the table for v(i, j).
2.Search the entire table for the largest value; let (i*, j*) denote the cell containing the largest value.
3.Follow backpointers from cell (i*, j*) to a cell (i´, j´) which has the value zero. This gives the optimal local alignment.
4.The optimal local alignment substrings are then α = S1[i´+1..i*] and β = S2[j´+1..j*]; the zero-valued cell (i´, j´) itself contributes no characters.
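The four steps can be sketched in one function. It assumes the scoring of the textbook example (match 2, mismatch -2, space -1) as defaults, keeps the first cell that attains the largest value, and re-derives backpointers from the table values instead of storing them; `local_alignment` is an illustrative name.

```python
def local_alignment(s1, s2, match=2, mismatch=-2, space=-1):
    """Return (v*, alpha, beta) for one optimal local alignment."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]  # v(i,0) = v(0,j) = 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sc = match if s1[i - 1] == s2[j - 1] else mismatch
            v[i][j] = max(0,
                          v[i - 1][j - 1] + sc,
                          v[i - 1][j] + space,
                          v[i][j - 1] + space)
            if v[i][j] > best:
                best, bi, bj = v[i][j], i, j
    # Trace back from (i*, j*) until a zero-valued cell (i', j') is reached.
    i, j = bi, bj
    while v[i][j] > 0:
        sc = match if s1[i - 1] == s2[j - 1] else mismatch
        if v[i][j] == v[i - 1][j - 1] + sc:
            i, j = i - 1, j - 1
        elif v[i][j] == v[i - 1][j] + space:
            i -= 1
        else:
            j -= 1
    # In 0-based slicing, s1[i:bi] holds the characters i'+1 .. i*.
    return best, s1[i:bi], s2[j:bj]
```

On the textbook strings, `local_alignment("pqraxabcstvq", "xyaxbacsll")` returns `(8, "axabcs", "axbacs")`.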

Computing Local Alignment
Analysis of computing v*:
1.We know that computing the table to solve v* takes time O(nm).
2.The table contains all optimal local alignments for v(i, j). An alignment can be found by locating a cell with value v* and tracing back from it.
