
1 Inexact Matching
General Problem
– Input: strings S and T
– Questions
    How distant is S from T?
    How similar is S to T?
Solution Technique
– Dynamic programming with a cost/similarity/scoring matrix

2 Biological Motivation
Read pages 210-214 in textbook
“First Fact of Biological Sequence Analysis”
– In biomolecular sequences (DNA, RNA, amino acid sequences), high sequence similarity usually implies significant functional or structural similarity
– In short: sequence similarity implies functional/structural similarity
– The converse is NOT true
– Evolution reuses, builds upon, duplicates, and modifies “successful” structures

3 Measuring Distance of S and T
Consider S and T
We can transform S into T using the following four operations
– insertion of a character into S
– deletion of a character from S
– substitution (replacement) of a character in S by another character (typically in T)
– matching (no operation)

4 Example
S = vintner
T = writers
    vintner
    wintner (Replace v with w)
    wrintner (Insert r)
    writner (Delete first n)
    writer (Delete second n)
    writers (Insert s)

5 Example
Edit Transcript (or just transcript):
– a string that describes the transformation of one string into the other
  (R = replace, I = insert, D = delete, M = match)
Example
– R I M D M D M M I
– v   i n t n e r
– w r i   t   e r s

6 Edit Distance
Edit distance of strings S and T
– The minimum number of edit operations (insertion, deletion, replacement) needed to transform string S into string T
– Also called Levenshtein distance [299]; Levenshtein appears to have been the first to define this concept
Optimal transcript
– An edit transcript of S and T that has the minimum number of edit operations
– There may be several optimal transcripts; these are called cooptimal transcripts

7 Alignment
A global alignment of strings S and T is obtained
– by inserting spaces (dashes) into S and T
    they should have the same number of characters (including dashes) at the end
– then placing the two strings over each other, matching one character (or dash) in S with a unique character (or dash) in T
– Note: ALL positions in both S and T are involved
– Later, we will consider local alignments

8 Alignments and Edit transcripts
Example Alignment
– v-intner-
– wri-t-ers
Alignments and edit transcripts are interrelated
– edit transcript: emphasizes process
    the specific mutational events
– alignment: emphasizes product
    the relationship between the two strings
– Alignments are often easier to work with and visualize
    they also generalize better to more than 2 strings

9 Edit Distance Problem
Input
– 2 strings S and T
Task
– Output edit distance of S and T
– Output optimal edit transcript
– Output optimal alignment
Solution method
– Dynamic Programming

10 Definition of D(i,j)
Let D(i,j) be the edit distance of S[1..i] and T[1..j]
– The edit distance of the first i characters of S with the first j characters of T
– Let |S| = n, |T| = m
D(n,m) = edit distance of S and T
We will compute D(i,j) for all i and j such that 0 <= i <= n, 0 <= j <= m

11 Recurrence Relation
Base Case:
– For 0 <= i <= n, D(i,0) = i
– For 0 <= j <= m, D(0,j) = j
Recursive Case:
– For 0 < i <= n, 0 < j <= m:
    D(i,j) = min { D(i-1,j) + 1,  D(i,j-1) + 1,  D(i-1,j-1) + d(i,j) }
– d(i,j) = 0 if S(i) = T(j) and is 1 otherwise
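The recurrence above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the course; the function name `edit_distance_table` is an assumption, and Python's 0-based indexing means S(i) is written `S[i - 1]`.

```python
def edit_distance_table(S, T):
    """Return the table D, where D[i][j] is the edit distance of S[:i] and T[:j]."""
    n, m = len(S), len(T)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: D(i,0) = i and D(0,j) = j.
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    # Recursive case, filled row by row from the upper left corner.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if S[i - 1] == T[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,      # S(i) against a dash
                          D[i][j - 1] + 1,      # dash against T(j)
                          D[i - 1][j - 1] + d)  # S(i) against T(j)
    return D


D = edit_distance_table("vintner", "writers")
print(D[-1][-1])  # the edit distance D(n,m)
```

The whole table is returned, not just D(n,m), so that individual entries can be inspected.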

12 What the various cases mean
D(i,j) = min of
– D(i-1,j) + 1
    Align S[1..i-1] with T[1..j] optimally
    Match S(i) with a dash in T
– D(i,j-1) + 1
    Align S[1..i] with T[1..j-1] optimally
    Match a dash in S with T(j)
– D(i-1,j-1) + d(i,j)
    Align S[1..i-1] with T[1..j-1] optimally
    Match S(i) with T(j)

13 Computing D(i,j) values
D(i,j)           w   r   i   t   e   r   s
             0   1   2   3   4   5   6   7
         0
     v   1
     i   2
     n   3
     t   4
     n   5
     e   6
     r   7

14 Initialization: Base Case
D(i,j)           w   r   i   t   e   r   s
             0   1   2   3   4   5   6   7
         0   0   1   2   3   4   5   6   7
     v   1   1
     i   2   2
     n   3   3
     t   4   4
     n   5   5
     e   6   6
     r   7   7

15 Row i=1
D(i,j)           w   r   i   t   e   r   s
             0   1   2   3   4   5   6   7
         0   0   1   2   3   4   5   6   7
     v   1   1   1   2   3   4   5   6   7
     i   2   2
     n   3   3
     t   4   4
     n   5   5
     e   6   6
     r   7   7

16 Entry i=2, j=3
D(i,j)           w   r   i   t   e   r   s
             0   1   2   3   4   5   6   7
         0   0   1   2   3   4   5   6   7
     v   1   1   1   2   3   4   5   6   7
     i   2   2   2   2   ?
     n   3   3
     t   4   4
     n   5   5
     e   6   6
     r   7   7
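For reference, the entry marked "?" can be worked out from the three neighbouring cells, assuming the unit-cost recurrence for D(i,j) given earlier. Here S(2) = i and T(3) = i, so d(2,3) = 0:

```latex
\begin{aligned}
D(2,3) &= \min\{\, D(1,3) + 1,\; D(2,2) + 1,\; D(1,2) + d(2,3) \,\} \\
       &= \min\{\, 3 + 1,\; 2 + 1,\; 2 + 0 \,\} \\
       &= 2
\end{aligned}
```

The minimum comes from the diagonal neighbour: align "v" with "wr" at cost 2, then match the two i characters for free.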

17 Calculation methodologies
Location of edit distance
– D(n,m)
Example was to calculate row by row
Can also calculate column by column
Can also use antidiagonals
Key is to build from upper left corner

18 Traceback
Using table to construct optimal transcript
Pointers in cell D(i,j)
– Set a pointer from cell (i,j) to
    cell (i, j-1) if D(i,j) = D(i, j-1) + 1
    cell (i-1,j) if D(i,j) = D(i-1,j) + 1
    cell (i-1,j-1) if D(i,j) = D(i-1,j-1) + d(i,j)
– Follow path of pointers from (n,m) back to (0,0)
Example: Figure 11.3 on page 222

19 What the pointers mean
horizontal pointer: cell (i,j) to cell (i, j-1)
– Align T(j) with a space in S
– Insert T(j) into S
vertical pointer: cell (i,j) to cell (i-1, j)
– Align S(i) with a space in T
– Delete S(i) from S
diagonal pointer: cell (i,j) to cell (i-1, j-1)
– Align S(i) with T(j)
– Replace S(i) with T(j), or match if S(i) = T(j)
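The traceback can be sketched as follows. This is an illustrative version under stated assumptions: the function name `edit_transcript` is hypothetical, pointers are not stored but re-derived from the table while walking back, and ties are broken in the fixed order diagonal, vertical, horizontal, so only one of the possibly many cooptimal transcripts is returned.

```python
def edit_transcript(S, T):
    """Return one optimal edit transcript of S into T over the letters M, R, I, D."""
    n, m = len(S), len(T)
    # Fill the unit-cost edit distance table.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if S[i - 1] == T[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + d)
    # Walk from (n,m) back to (0,0), following one valid pointer per cell.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (S[i - 1] != T[j - 1]):
            ops.append("M" if S[i - 1] == T[j - 1] else "R")  # diagonal pointer
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")  # vertical pointer: delete S(i)
            i -= 1
        else:
            ops.append("I")  # horizontal pointer: insert T(j)
            j -= 1
    return "".join(reversed(ops))


print(edit_transcript("vintner", "writers"))
```

The walk visits at most n + m cells, which matches the O(n+m) traceback bound.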

20 Table and transcripts
The pointers represent all optimal transcripts
Theorem:
– Any path from (n,m) to (0,0) following the pointers specifies an optimal transcript.
– Conversely, any optimal transcript is specified by such a path.
– The correspondence between paths and transcripts is one to one.

21 Running Time
Initialization of table
– O(n+m)
Calculating table and pointers
– O(nm)
Traceback for one optimal transcript or optimal alignment
– O(n+m)

22 Operation-Weight Edit Distance
Consider S and T
We can assign weights to the various operations
– insertion/deletion of a character: cost d
– substitution (replacement) of a character: cost r
– matching: cost e
– Previous case: d = r = 1, e = 0

23 Modified Recurrence Relation
Base Case:
– For 0 <= i <= n, D(i,0) = i · d
– For 0 <= j <= m, D(0,j) = j · d
Recursive Case:
– For 0 < i <= n, 0 < j <= m:
    D(i,j) = min { D(i-1,j) + d,  D(i,j-1) + d,  D(i-1,j-1) + d(i,j) }
– d(i,j) = e if S(i) = T(j) and is r otherwise
– Note: d alone is the insertion/deletion cost; d(i,j) is the cost of aligning S(i) with T(j)

24 Alphabet-Weight Edit Distance
Define weight of each possible substitution
– r(a,b) where a is being replaced by b, for all a,b in the alphabet
– For example, with DNA, maybe r(A,T) > r(A,G)
– Likewise, the insertion/deletion weight I(a) may vary by character
Operation-weight edit distance is a special case of this variation
Weighted edit distance refers to this alphabet-weight setting

25 Modified Recurrence Relation
Base Case:
– For 0 <= i <= n, $D(i,0) = \sum_{1 \le k \le i} I(S(k))$
– For 0 <= j <= m, $D(0,j) = \sum_{1 \le k \le j} I(T(k))$
Recursive Case:
– For 0 < i <= n, 0 < j <= m:
    $D(i,j) = \min \{\, D(i-1,j) + I(S(i)),\; D(i,j-1) + I(T(j)),\; D(i-1,j-1) + d(i,j) \,\}$
– $d(i,j) = r(S(i), T(j))$
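A minimal sketch of the alphabet-weight recurrence, assuming the weights are supplied as two Python callables: `r(a, b)` for substitution (with `r(a, a)` giving the cost of a match, typically 0) and `I(a)` for inserting or deleting character `a`. The function name and the DNA weights in the usage lines are hypothetical, chosen only for illustration.

```python
def weighted_edit_distance(S, T, r, I):
    """Alphabet-weight edit distance of S and T for cost functions r(a, b) and I(a)."""
    n, m = len(S), len(T)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases are running sums of the insertion/deletion weights.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + I(S[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + I(T[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + I(S[i - 1]),
                          D[i][j - 1] + I(T[j - 1]),
                          D[i - 1][j - 1] + r(S[i - 1], T[j - 1]))
    return D[n][m]


# Hypothetical DNA weights: A<->G and C<->T cost 1, other substitutions cost 2.
CHEAP = {frozenset("AG"), frozenset("CT")}


def dna_r(a, b):
    if a == b:
        return 0
    return 1 if frozenset((a, b)) in CHEAP else 2


print(weighted_edit_distance("GATTACA", "GACTATA", dna_r, lambda a: 3))
```

Passing `r = lambda a, b: int(a != b)` and `I = lambda a: 1` recovers the unit-cost edit distance, and constant functions recover the operation-weight variant.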

26 Measuring Similarity of S and T
Definitions
– Let $\Sigma$ be the alphabet for strings S and T
– Let $\Sigma'$ be the alphabet $\Sigma$ with the character - added
– For any two characters x,y in $\Sigma'$, s(x,y) denotes the value (or score) obtained by aligning x with y
– For a given alignment A of S and T, let S’ and T’ denote the strings after the chosen insertion of spaces, and l their new length
– The value of alignment A is $\sum_{1 \le i \le l} s(S'(i), T'(i))$

27 Example
Alignment
– a b a a - b a b
– a a a a a b - b
Value: 1 - 2 + 1 + 1 + 0 + 2 + 0 + 2 = 5
Scoring matrix
    s     a    b    -
    a     1   -2    0
    b          2
    -               0
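The value of this alignment can be checked mechanically. The sketch below assumes the scoring matrix is symmetric, which the slide's own sum relies on, and stores only the entries that survive in the transcript; the names `SCORES`, `s`, and `alignment_value` are illustrative.

```python
# Entries recovered from the slide; s(b,-) is not given there and is left out.
SCORES = {
    ("a", "a"): 1,
    ("a", "b"): -2,
    ("a", "-"): 0,
    ("b", "b"): 2,
    ("-", "-"): 0,
}


def s(x, y):
    """Symmetric lookup (an assumption): s(x,y) = s(y,x)."""
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    return SCORES[(y, x)]


def alignment_value(S_aligned, T_aligned):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(S_aligned) != len(T_aligned):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(s(x, y) for x, y in zip(S_aligned, T_aligned))


print(alignment_value("abaa-bab", "aaaaab-b"))  # 5
```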

28 String Similarity Problem
Input
– 2 strings S and T
– Scoring matrix s for alphabet $\Sigma'$
Task
– Output optimal alignment value of S and T
    The alignment of S and T with maximal, not minimal, value
– Output this alignment

29 Modified Recurrence Relation
Base Case:
– For 0 <= i <= n, $V(i,0) = \sum_{1 \le k \le i} s(S(k), -)$
– For 0 <= j <= m, $V(0,j) = \sum_{1 \le k \le j} s(-, T(k))$
Recursive Case:
– For 0 < i <= n, 0 < j <= m:
    $V(i,j) = \max \{\, V(i-1,j) + s(S(i), -),\; V(i,j-1) + s(-, T(j)),\; V(i-1,j-1) + s(S(i), T(j)) \,\}$
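A minimal sketch of this similarity recurrence, assuming the scoring matrix is passed as a callable `s(x, y)` that also accepts the space character "-". The function name and the two example scoring schemes are hypothetical.

```python
def global_similarity(S, T, s):
    """Value V(n,m) of an optimal global alignment of S and T under score s."""
    n, m = len(S), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: a prefix aligned entirely against spaces.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + s(S[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + s("-", T[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + s(S[i - 1], "-"),
                          V[i][j - 1] + s("-", T[j - 1]),
                          V[i - 1][j - 1] + s(S[i - 1], T[j - 1]))
    return V[n][m]


def simple_score(x, y):
    """Hypothetical scheme: +1 for a match, -1 for a mismatch or a space."""
    return 1 if x == y and x != "-" else -1


def negated_unit_cost(x, y):
    """Score 0 for a match and -1 otherwise, i.e. unit edit costs negated."""
    return 0 if x == y else -1


print(global_similarity("abc", "ac", simple_score))
print(global_similarity("vintner", "writers", negated_unit_cost))
```

With `negated_unit_cost`, the maximum similarity is the negative of the unit-cost edit distance, which illustrates how a distance problem turns into a similarity problem by changing the scoring matrix.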

30 Longest Common Subsequence Problem
Given 2 strings S and T, a common subsequence is a subsequence that appears in both S and T.
The longest common subsequence problem is to find a longest common subsequence (lcs) of S and T
– subsequence: characters need not be contiguous
– different from a substring
O(nm) solution:
– Make scoring matrix 1 for match, 0 for mismatch (and 0 for spaces)
– The matched characters in an alignment of maximal value form a longest common subsequence
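The reduction to alignment can be sketched as below, assuming a score of 1 for a match and 0 for mismatches and spaces. The function name `lcs` is illustrative, and the traceback returns just one longest common subsequence when several exist.

```python
def lcs(S, T):
    """Return one longest common subsequence of S and T."""
    n, m = len(S), len(T)
    # V[i][j] is the best alignment value of S[:i] and T[:j]: the number of matches.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if S[i - 1] == T[j - 1] else 0
            V[i][j] = max(V[i - 1][j], V[i][j - 1], V[i - 1][j - 1] + match)
    # Trace back and collect the matched characters.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if S[i - 1] == T[j - 1] and V[i][j] == V[i - 1][j - 1] + 1:
            out.append(S[i - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("vintner", "writers"))
```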

31 Similarity and Distance
If we are focused on aligning both entire strings, maximizing similarity is essentially identical to minimizing distance
– Just need to modify scoring matrices appropriately
When we consider substrings of uncertain length, maximizing similarity often makes more sense than minimizing distance
– Overlapping strings
– Local alignment

32 Overlapping Strings
Find best alignment where the two strings overlap without penalizing for the unmatched ends
– Application: sequence assembly problem
    strings are likely to overlap without being substrings of each other
Solution method
– End-space free variant of dynamic programming
– Change base conditions so that V(i,0) = V(0,j) = 0
– Need to search over row n and column m for optimal value
    Optimal value may not be in entry (n,m)
Why is max similarity better than min distance?
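The end-space free variant changes only the base conditions and where the answer is read from. A minimal sketch, assuming a scoring callable `s(x, y)` that accepts "-"; the function name and the example scoring scheme are hypothetical.

```python
def overlap_value(S, T, s):
    """Best end-space free alignment value of S and T under score s."""
    n, m = len(S), len(T)
    # Base conditions V(i,0) = V(0,j) = 0: leading spaces are free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + s(S[i - 1], "-"),
                          V[i][j - 1] + s("-", T[j - 1]),
                          V[i - 1][j - 1] + s(S[i - 1], T[j - 1]))
    # Trailing spaces are free too, so search row n and column m.
    best_in_last_row = max(V[n])
    best_in_last_column = max(row[m] for row in V)
    return max(best_in_last_row, best_in_last_column)


def simple_score(x, y):
    """Hypothetical scheme: +1 for a match, -1 for a mismatch or a space."""
    return 1 if x == y and x != "-" else -1


# The suffix "abc" of the first string overlaps the prefix "abc" of the second.
print(overlap_value("xxxabc", "abcyyy", simple_score))
```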

33 Maximally Similar Substrings
Local alignment problem
– Input
    Two strings S and T
– Task
    Find substrings s and t of S and T that have the maximum possible alignment value, as well as this value.
    Let v* denote this value.
Why is max similarity better than min distance?
Read pages 230-231 for motivation

34 Local suffix alignments
Define v(i,j) to be the value of the optimal alignment of any of the i+1 suffixes of S[1..i] with any of the j+1 suffixes of T[1..j].
– We bound v(i,j) to be at least 0 by scoring the alignment of two empty suffixes to be 0
Theorem
– v* (the value of the optimal local alignment) = max{ v(i,j) | 1 <= i <= n, 1 <= j <= m }

35 Recurrences for local suffix alignments
Base Case:
– For 0 <= i <= n, v(i,0) = 0
– For 0 <= j <= m, v(0,j) = 0
Recursive Case:
– For 0 < i <= n, 0 < j <= m:
    v(i,j) = max { 0,  v(i-1,j) + s(S(i),-),  v(i,j-1) + s(-,T(j)),  v(i-1,j-1) + s(S(i),T(j)) }
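A minimal sketch of the local suffix alignment recurrence that returns only the value v*. The function name and the example scoring scheme (match +2, mismatch and space -1) are assumptions made for illustration.

```python
def local_alignment_value(S, T, s):
    """Return v*, the maximum of v(i,j) over the whole table."""
    n, m = len(S), len(T)
    # Base cases v(i,0) = v(0,j) = 0.
    v = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(0,  # two empty suffixes
                          v[i - 1][j] + s(S[i - 1], "-"),
                          v[i][j - 1] + s("-", T[j - 1]),
                          v[i - 1][j - 1] + s(S[i - 1], T[j - 1]))
            best = max(best, v[i][j])
    return best


def local_score(x, y):
    """Hypothetical scheme: +2 for a match, -1 for a mismatch or a space."""
    return 2 if x == y and x != "-" else -1


print(local_alignment_value("xxabcxx", "yyabcyy", local_score))
```

To recover the substrings themselves, one would also record the cell where the maximum occurs and trace back from it until a cell with value 0 is reached.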

36 Comments
Traceback
– No longer start from cell (n,m)
– Search whole table for max value and start from there
– Still O(mn) running time
Terminology
– In the literature, the distinction between problem statements and solution methods is not clear
– Global alignment is often referred to as Needleman-Wunsch alignment
    Their solution method was cubic in terms of m,n
– Smith-Waterman is often used to refer to both local alignments and their solution method

37 Comments continued
Scoring schemes
– The utility of optimal local alignments is highly dependent on the scoring scheme
– Examples
    matches 1, mismatches & spaces 0 leads to longest common subsequence
    mismatches and spaces big negatives leads to longest common substring
– Average score in matrix must be negative, otherwise local alignments tend to be global
– There is a theory developed about scoring schemes that we will cover later.

38 Aligning with Gaps
Gaps: Any maximal run of spaces in a single string of a given alignment
Example
– S = aaabbbcccdddeeefff
– T = aaabbbdddeeefffggg
– Alignment
    aaabbbcccdddeeefff---
    aaabbb---dddeeefffggg

39 Scoring with gaps
Example Scoring
– aaabbbcccdddeeefff---
– aaabbb---dddeeefffggg
– 6 matches (+1 each), one gap (-1), 9 matches (+1 each), one gap (-1): 6 - 1 + 9 - 1 = 13
Why include gaps in scoring schemes?
– Read 236-240
– When an insertion/deletion event occurs, often more than a single character is inserted or deleted.
– A single gap cost helps model the fact that a sequence of insertions/deletions is really one mutational event
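The score of 13 can be reproduced by counting matches and gaps directly. The sketch assumes the scheme used in the example: +1 per match, a cost of 1 per gap regardless of its length, and no mismatch penalty (the example has no mismatches). The function names and parameter defaults are illustrative.

```python
import re


def count_gaps(row):
    """Number of maximal runs of '-' in one row of an alignment."""
    return len(re.findall(r"-+", row))


def gap_score(S_aligned, T_aligned, match=1, mismatch=0, gap_cost=1):
    """Score an alignment, charging once per gap rather than once per space."""
    score = 0
    for x, y in zip(S_aligned, T_aligned):
        if x != "-" and y != "-":
            score += match if x == y else -mismatch
    gaps = count_gaps(S_aligned) + count_gaps(T_aligned)
    return score - gap_cost * gaps


print(gap_score("aaabbbcccdddeeefff---", "aaabbb---dddeeefffggg"))  # 13
```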

40 Constant gap weight model
We present a series of possible gap weight models, each of which is a special case of the next one
Constant gap weight model
– each individual space is free ($W_s = 0$)
– each gap has constant cost $W_g$
– Alignment problem boils down to finding an alignment that maximizes
    Match scores - mismatch scores - $W_g$ (# of gaps)
– Dynamic programming can still solve in O(nm) time

41 Affine gap weight model
Gap opening versus gap extension penalties
– each gap has constant cost $W_g$
– each individual space has cost $W_s$; typically $W_s < W_g$
Alignment problem boils down to finding an alignment that maximizes
    Match scores - mismatch scores - $W_g$ (# of gaps) - $W_s$ (# of spaces)
Dynamic programming can still solve in O(nm) time
Probably the most commonly used model because of its efficiency and generality

42 Convex gap weight model
Extension penalty should not be a constant but rather decrease as length of gap increases
– One example
    each gap has cost $W_g + \log q$ where q is the length of the gap
Solution now requires more than O(nm) time
– In chapter 13 is an O(nm log m) time solution
– Further improvement is possible, but costly

43 Arbitrary gap weight model
Gap cost is an arbitrary function of gap length
– each gap has cost w(q) where q is the length of the gap
– no properties of w(q) are assumed, such as its second derivative being negative
Solution time is now $O(nm^2 + n^2 m)$
– cubic cost, similar to original Needleman-Wunsch solution

44 Recurrences for arbitrary gap weights
Base Case:
– V(0,0) = 0
– For 1 <= i <= n, $V(i,0) = -w(i)$
– For 1 <= j <= m, $V(0,j) = -w(j)$
Recursive Case:
– For 0 < i <= n, 0 < j <= m, V(i,j) is the max of
    $V(i-1,j-1) + s(S(i),T(j))$
    $\max_{0 \le k \le j-1} [V(i,k) - w(j-k)]$
      Align S[1..i] with T[1..k], then T[k+1..j] against a gap of length j-k
    $\max_{0 \le k \le i-1} [V(k,j) - w(i-k)]$
      Align S[1..k] with T[1..j], then S[k+1..i] against a gap of length i-k
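The cubic recurrence can be sketched directly. This assumes the gap weight is given as a callable `w(q)` for gap length q >= 1 and the character scores as `s(x, y)`; the function name and the example scheme are hypothetical.

```python
def arbitrary_gap_value(S, T, s, w):
    """Optimal global alignment value when a gap of length q costs w(q)."""
    n, m = len(S), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: a whole prefix aligned against a single gap.
    for i in range(1, n + 1):
        V[i][0] = -w(i)
    for j in range(1, m + 1):
        V[0][j] = -w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # S(i) aligned with T(j).
            diagonal = V[i - 1][j - 1] + s(S[i - 1], T[j - 1])
            # T[k+1..j] aligned against a gap of length j - k.
            gap_in_S = max(V[i][k] - w(j - k) for k in range(j))
            # S[k+1..i] aligned against a gap of length i - k.
            gap_in_T = max(V[k][j] - w(i - k) for k in range(i))
            V[i][j] = max(diagonal, gap_in_S, gap_in_T)
    return V[n][m]


def match_mismatch(x, y):
    """Hypothetical scheme: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


# Constant gap weight: every gap costs 1 whatever its length.
print(arbitrary_gap_value("aaabbbcccdddeeefff", "aaabbbdddeeefffggg",
                          match_mismatch, lambda q: 1))
```

Each cell scans a full row prefix and a full column prefix, which is where the extra factor over O(nm) comes from.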

45 Recurrences for affine gap weights
Base Case:
– V(0,0) = 0
– For 1 <= i <= n, $V(i,0) = E(i,0) = -W_g - iW_s$
– For 1 <= j <= m, $V(0,j) = F(0,j) = -W_g - jW_s$
Recursive Case:
– For 0 < i <= n, 0 < j <= m:
    $V(i,j) = \max[E(i,j), F(i,j), G(i,j)]$
    $G(i,j) = V(i-1,j-1) + s(S(i),T(j))$
    $E(i,j) = \max[E(i,j-1),\; V(i,j-1) - W_g] - W_s$
      max checks if gap begins at S(i) or if it began earlier
    $F(i,j) = \max[F(i-1,j),\; V(i-1,j) - W_g] - W_s$
      max checks if gap begins at T(j) or if it began earlier
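A minimal sketch of the affine recurrences in O(nm) time, with the parameter names `Wg` and `Ws` standing for the gap opening and space costs. One deviation is labeled in the code: the boundary entries E(i,0) and F(0,j) are set to negative infinity, a common convention that prevents a gap along the table border from being extended into a gap in the other string without paying the opening cost again. The function and scoring names are hypothetical.

```python
def affine_gap_value(S, T, s, Wg, Ws):
    """Optimal global alignment value when a gap of length q costs Wg + q * Ws."""
    n, m = len(S), len(T)
    NEG_INF = float("-inf")
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # E: alignments ending with T(j) against a space; F: S(i) against a space.
    # Assumption: undefined boundary entries are -infinity (see lead-in).
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -Wg - i * Ws
    for j in range(1, m + 1):
        V[0][j] = -Wg - j * Ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G = V[i - 1][j - 1] + s(S[i - 1], T[j - 1])
            # Either extend an existing gap or open a new one.
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - Wg) - Ws
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - Wg) - Ws
            V[i][j] = max(E[i][j], F[i][j], G)
    return V[n][m]


def match_mismatch(x, y):
    """Hypothetical scheme: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


# Wg = 1 and Ws = 0 is the constant gap weight model.
print(affine_gap_value("aaabbbcccdddeeefff", "aaabbbdddeeefffggg",
                       match_mismatch, Wg=1, Ws=0))
```

Keeping E and F as separate tables is what lets each cell decide between opening and extending a gap in constant time.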

