Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Inexact Matching, Sequence Alignment, and Dynamic Programming.


1 Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Inexact Matching, Sequence Alignment, and Dynamic Programming

2 Inexact Matching and Alignment. Inexact (approximate) matching means that a match may contain some errors. Alignment generally means lining up the characters of two strings, allowing mismatches as well as matches, and allowing characters of one string to be placed opposite spaces inserted into the other string.

3 Importance of Alignment or Approximate Matching. It is central in computational molecular biology because of the active mutational process: “Duplication and Modification” is the central part of protein evolution. In DNA, RNA, and amino acid sequences, high sequence similarity usually implies significant functional or structural similarity.

4 Edit Distance Between Two Strings. Edit distance measures the difference between two strings. It focuses on transforming (or editing) one string into the other by a series of edit operations on individual characters. The permitted edit operations are – Insertion (I) of a character into the first string – Deletion (D) of a character from the first string – Substitution (or replacement) (R) of a character in the first string with a character in the second string. For a Match (M) no operation is necessary. Example, transforming vintner into writers:
    R I M D M D M M I
    v   i n t n e r
    w r i   t   e r s

5 Edit Transcript vs. Edit Distance. Edit Transcript: A string over the alphabet I, D, R, M that describes a transformation of one string to another is called an edit transcript, or transcript for short, of the two strings. Edit Distance: The minimum number of edit operations (insertions, deletions and substitutions) needed to transform the first string into the second. Also known as Levenshtein distance.
    R I M D M D M M I
    v   i n t n e r
    w r i   t   e r s
What is the edit distance in this example? 5

6 Optimal Transcript. An optimal transcript is an edit transcript that uses the minimum number of edit operations. There may be more than one optimal transcript for two strings.

7 String Alignment. A (global) alignment of two strings S 1 and S 2 is obtained by first inserting chosen spaces, either into or at the ends of S 1 and S 2, and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string. Two examples:
    v_intner_
    wri_t_ers

    qac_dbd
    qawx_b_

8 Alignment vs. Edit Transcript. From a mathematical viewpoint these are equivalent ways to describe the relationship between two strings. An alignment can easily be converted to an edit transcript and vice versa. From a modeling standpoint they are quite different – An edit transcript emphasizes the putative mutational events that transform one string into another – An alignment displays only the relationship between the strings – So one is the process (edit transcript), the other is the product (alignment)
    v_intner_
    wri_t_ers

    qac_dbd
    qawx_b_
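The conversion between the two views can be sketched in a few lines of Python. This is only an illustration: the function names and the use of "_" as the space character are assumptions made here, and the first row is taken to be S 1.

```python
def alignment_to_transcript(row1: str, row2: str, space: str = "_") -> str:
    """Read an alignment column by column and emit one of I, D, R, M per column."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    ops = []
    for a, b in zip(row1, row2):
        if a == space and b == space:
            raise ValueError("a column may not hold two spaces")
        if a == space:
            ops.append("I")  # character of S2 opposite a space in S1
        elif b == space:
            ops.append("D")  # character of S1 opposite a space in S2
        elif a == b:
            ops.append("M")
        else:
            ops.append("R")
    return "".join(ops)


def transcript_to_alignment(s1: str, s2: str, transcript: str, space: str = "_"):
    """Replay a transcript over S1 and S2 to rebuild the two alignment rows."""
    row1, row2 = [], []
    i = j = 0
    for op in transcript:
        if op == "I":
            row1.append(space)
            row2.append(s2[j])
            j += 1
        elif op == "D":
            row1.append(s1[i])
            row2.append(space)
            i += 1
        else:  # M or R: both strings advance
            row1.append(s1[i])
            row2.append(s2[j])
            i += 1
            j += 1
    return "".join(row1), "".join(row2)


print(alignment_to_transcript("v_intner_", "wri_t_ers"))  # RIMDMDMMI
```

Applying `transcript_to_alignment("vintner", "writers", "RIMDMDMMI")` gives back the two rows of the first example alignment.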

9 Dynamic Programming Calculation of Edit Distance. How do we compute the edit distance of two strings along with the accompanying edit transcript or alignment? Definition: For two strings S 1 and S 2, D(i, j) is defined to be the edit distance of S 1 [1…i] and S 2 [1…j]. D(n, m) is the desired value, where n and m are the lengths of S 1 and S 2.

10 Steps of Dynamic Programming: 1. Recurrence relation 2. Tabular computation 3. Traceback

11 The Recurrence Relation. The recurrence relation establishes a relationship between the value of D(i, j), for i and j both positive, and values of D with index pairs smaller than i, j. Base conditions are – D(i, 0) = i, i.e. delete i characters – D(0, j) = j, i.e. j characters to be inserted. The recurrence relation is – D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)], where t(i, j) = 1 if S 1 (i) ≠ S 2 (j) and t(i, j) = 0 if S 1 (i) = S 2 (j)

12 Tabular Computation: Bottom Up Approach D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)]

13 Tabular Computation: Bottom Up Approach. D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)]. Filling the whole (n+1) × (m+1) table takes O(nm) time.
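A minimal Python sketch of the bottom-up tabular computation follows. The function name is an assumption made for illustration; strings are 0-indexed in Python, so S 1 (i) is `s1[i - 1]`.

```python
def edit_distance_table(s1: str, s2: str):
    """Fill the (n+1) x (m+1) table D row by row using the recurrence."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # base condition: delete i characters
    for j in range(1, m + 1):
        D[0][j] = j  # base condition: insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,      # deletion
                          D[i][j - 1] + 1,      # insertion
                          D[i - 1][j - 1] + t)  # match or substitution
    return D


D = edit_distance_table("vintner", "writers")
print(D[7][7])  # 5
```

Each cell is computed once from three already-filled neighbours, which is where the O(nm) bound comes from.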

14 The Traceback. For an optimal edit transcript, follow any path of traceback pointers from cell (n, m) to cell (0, 0): 1. A horizontal edge, from (i, j) to (i, j-1), is an insertion (I) of character S 2 (j) into S 1. 2. A vertical edge, from (i, j) to (i-1, j), is a deletion (D) of S 1 (i) from S 1. 3. A diagonal edge, from (i, j) to (i-1, j-1), is a match (M) if S 1 (i) = S 2 (j) and a substitution (R) if S 1 (i) ≠ S 2 (j).

15 The Traceback. Alternatively, in terms of alignment: 1. A horizontal edge specifies a space inserted into S 1. 2. A vertical edge specifies a space inserted into S 2. 3. A diagonal edge specifies either a match or a mismatch. For S 1 = vintner and S 2 = writers there are three traceback paths. They are identical from (7, 7) to (3, 3), where tner_ is aligned with t_ers, and differ only in how vin is aligned with wri:
    S 1: vintner_     v_intner_     _vintner_
    S 2: writ_ers     wri_t_ers     wri_t_ers
Once the table is filled, one traceback takes O(n + m) time.
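The traceback can be sketched as follows. This is an illustrative version, not the only valid one: when several predecessors are possible it prefers the diagonal edge, then the vertical edge, then the horizontal edge, so it returns just one of the optimal transcripts. The function name is an assumption.

```python
def optimal_transcript(s1: str, s2: str) -> str:
    """Fill the edit-distance table, then walk back from (n, m) to (0, 0)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:  # diagonal edge
                ops.append("M" if t == 0 else "R")
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:  # vertical edge
            ops.append("D")
            i -= 1
        else:                                     # horizontal edge
            ops.append("I")
            j -= 1
    return "".join(reversed(ops))  # operations were collected back to front


print(optimal_transcript("vintner", "writers"))  # RRRMDMMI
```

The loop takes at most n + m steps, since every step decreases i, j, or both.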

16 Edit Graphs. It is often useful to represent dynamic programming solutions of string problems in terms of a weighted edit graph – If |S 1 | = n and |S 2 | = m then the weighted edit graph has (n+1) × (m+1) nodes – Each edge has a weight. In the case of the edit distance problem, every horizontal and vertical edge has weight 1, and the diagonal edge into node (i, j) has weight t(i, j), i.e. 0 for a match and 1 for a mismatch. Any shortest path from (0, 0) to (n, m) specifies an optimal edit transcript.

17 Weighted Edit Distance. An easy but crucial generalization is to associate a weight (cost or score) with every edit operation, as well as with a match – Let the insertion or deletion weight be d – Let the substitution weight be r – Let the match weight be e, usually very small, often zero. Equivalently, in terms of operation-weight alignment – A mismatch costs r – A match costs e – A space costs d. Two types of weighted edit distance – Operation weight – Alphabet weight

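18 Operation-weight Edit Transcript. With d = 1, r = 1 and e = 0 we get the three optimal alignments of the unweighted problem. With d = 4, r = 2 and e = 1, the alignment
    writ_ers
    vintner_
has total weight 17 (3 mismatches, 3 matches and 2 spaces: 6 + 3 + 8). This is not the minimum for these weights: the gap-free alignment of writers with vintner has 6 mismatches and 1 match, for a total weight of 13. Modified recurrence relations: D(i, 0) = i × d, D(0, j) = j × d, D(i, j) = min[D(i-1, j) + d, D(i, j-1) + d, D(i-1, j-1) + t(i, j)], where t(i, j) = e if S 1 (i) = S 2 (j) and t(i, j) = r otherwise. It can also be represented as a shortest path problem on a weighted edit graph.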
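A short Python sketch of the operation-weight version, assuming the weights d (space), r (mismatch) and e (match) enter the recurrence in the usual way: D(i, 0) = i·d, D(0, j) = j·d, and the three terms of the minimum use d, d and either e or r. The helper `alignment_weight` is a hypothetical name added here so that a hand-built alignment can be scored and compared with the optimum.

```python
def weighted_edit_distance(s1: str, s2: str, d: int = 1, r: int = 1, e: int = 0) -> int:
    """Operation-weight edit distance: space costs d, mismatch r, match e."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * d
    for j in range(1, m + 1):
        D[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = e if s1[i - 1] == s2[j - 1] else r
            D[i][j] = min(D[i - 1][j] + d, D[i][j - 1] + d, D[i - 1][j - 1] + t)
    return D[n][m]


def alignment_weight(row1: str, row2: str, d: int, r: int, e: int, space: str = "_") -> int:
    """Total weight of one given alignment under the same three weights."""
    total = 0
    for a, b in zip(row1, row2):
        if a == space or b == space:
            total += d
        elif a == b:
            total += e
        else:
            total += r
    return total


print(weighted_edit_distance("vintner", "writers", d=1, r=1, e=0))  # 5
print(alignment_weight("writ_ers", "vintner_", d=4, r=2, e=1))      # 17
```

Comparing `alignment_weight` of a candidate alignment with `weighted_edit_distance` of the underlying strings is a quick way to check whether that alignment is optimal for a given choice of weights.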

19 Alphabet-weight Edit Distance. Assign a score/weight depending on the characters involved – For example, it may be more costly to replace an A with a T than with a G – Or the weight of a deletion / insertion may depend on exactly which character is deleted / inserted. Weighted edit distance usually means the alphabet-weight version. The dominant scoring matrices for amino acids are the PAM matrices and the newer BLOSUM scoring matrices – They are defined in terms of a maximization problem (string similarity) rather than edit distance.

20 String Similarity. While edit distance minimizes the total weight, string similarity maximizes it. For string similarity – Match scores are greater than or equal to zero – Mismatch scores are less than zero

21 Computing String Similarity. Let V(i, j) be the value of an optimal alignment of the prefixes S 1 [1..i] and S 2 [1..j]

22 End-space Free Variant. Any spaces at the beginning or end of the alignment have cost zero. This encourages one string to align in the interior of the other, or a suffix of one string to align with a prefix of the other. The shotgun sequence assembly problem (see section and 16.15) uses this variant; it can be a project.
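One common way to realise the end-space free variant, sketched here in Python, is to set the first row and first column of the similarity table to zero (leading spaces are free) and to take the best value found anywhere in the last row or last column (trailing spaces are free). The scores match = +1, mismatch = -1 and space = -1 are assumptions chosen only for the example, as is the function name.

```python
def end_space_free_score(s1: str, s2: str, match: int = 1,
                         mismatch: int = -1, space: int = -1) -> int:
    """Best similarity when spaces at either end of the alignment cost nothing."""
    n, m = len(s1), len(s2)
    # Row 0 and column 0 stay at 0: leading spaces are not charged.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + s,   # align S1(i) with S2(j)
                          V[i - 1][j] + space,   # S1(i) opposite a space
                          V[i][j - 1] + space)   # S2(j) opposite a space
    # Trailing spaces are not charged: stop anywhere in the last row or column.
    best_last_row = max(V[n])
    best_last_col = max(V[i][m] for i in range(n + 1))
    return max(best_last_row, best_last_col)


# The suffix TTT of the first string overlaps the prefix TTT of the second.
print(end_space_free_score("ACGTTT", "TTTGCA"))  # 3
```

The same function also rewards a short string that sits in the interior of a longer one, for example `end_space_free_score("AAATTTCCC", "TTT")`.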

23 Local vs. global alignment Global alignment: entire sequences Local alignment: segments of sequences Local alignment often the most relevant –Depends on biological assumptions

24 The Needleman-Wunsch and the Smith-Waterman algorithms for sequence alignment

25 Global Sequence Alignment. The Needleman–Wunsch algorithm performs a global alignment on two sequences. It is an example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison. It is suitable when the two sequences are of similar length, with a significant degree of similarity throughout. Aim: the best alignment over the entire length of two sequences.

26 Three steps in Needleman-Wunsch Algorithm: 1. Initialization 2. Scoring 3. Trace back (Alignment). Consider the two DNA sequences to be globally aligned: ATCG (x = 4, length of sequence 1) and TCG (y = 3, length of sequence 2).

27 Scoring Scheme. Match score = +1, mismatch score = -1, gap penalty = -1. Substitution matrix:
         A   C   G   T
    A    1  -1  -1  -1
    C   -1   1  -1  -1
    G   -1  -1   1  -1
    T   -1  -1  -1   1

28 Initialization Step. Create a matrix with X + 1 rows and Y + 1 columns. The 1st row and the 1st column of the score matrix are filled with multiples of the gap penalty:
             T   C   G
         0  -1  -2  -3
    A   -1
    T   -2
    C   -3
    G   -4

29 Scoring The score of any cell C(i, j) is the maximum of: scorediag = C(i-1, j-1) + S(i, j) scoreup = C(i-1, j) + g scoreleft = C(i, j-1) + g where S(i, j) is the substitution score for letters i and j, and g is the gap penalty

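30 Scoring …. Example: the calculation for the cell C(2, 2), i.e. A against T when the initialized row and column are counted as row 1 and column 1: scorediag = C(1, 1) + S(A, T) = 0 + (-1) = -1; scoreup = C(1, 2) + g = -1 + (-1) = -2; scoreleft = C(2, 1) + g = -1 + (-1) = -2. The maximum is -1, so C(2, 2) = -1.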

31 Scoring …. Final Scoring Matrix. Note: in a global alignment the optimal alignment score is always read from the last cell, here 2.
             T   C   G
         0  -1  -2  -3
    A   -1  -1  -2  -3
    T   -2   0  -1  -2
    C   -3  -1   1   0
    G   -4  -2   0   2
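The initialization and scoring steps can be sketched in Python as below. The function name and the default scores (+1, -1, -1, taken from the scoring scheme of this example) are illustrative; indices are 0-based, so row 0 and column 0 hold the initialized gap penalties.

```python
def needleman_wunsch_matrix(seq1: str, seq2: str, match: int = 1,
                            mismatch: int = -1, gap: int = -1):
    """Return the (x+1) x (y+1) global alignment score matrix."""
    x, y = len(seq1), len(seq2)
    C = [[0] * (y + 1) for _ in range(x + 1)]
    for i in range(1, x + 1):
        C[i][0] = i * gap  # first column: multiples of the gap penalty
    for j in range(1, y + 1):
        C[0][j] = j * gap  # first row: multiples of the gap penalty
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score_diag = C[i - 1][j - 1] + s
            score_up = C[i - 1][j] + gap
            score_left = C[i][j - 1] + gap
            C[i][j] = max(score_diag, score_up, score_left)
    return C


for row in needleman_wunsch_matrix("ATCG", "TCG"):
    print(row)
# [0, -1, -2, -3]
# [-1, -1, -2, -3]
# [-2, 0, -1, -2]
# [-3, -1, 1, 0]
# [-4, -2, 0, 2]
```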

32 Trace back. The trace back step determines the actual alignment(s) that result in the maximum score. There are likely to be multiple maximal alignments. Trace back starts from the last cell, i.e. position X, Y in the matrix, and gives the alignment in reverse order.

33 Trace back …. There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left. Trace back takes the current cell and looks to the neighbor cells that could be direct predecessors. With sequence #1 along the rows and sequence #2 along the columns, this means it looks to the neighbor to the left (gap in sequence #1), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #2). The algorithm for trace back chooses one of the possible predecessors as the next cell in the path.

34 Trace back …. Starting from the last cell (score 2), the only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of
    Seq 1: G
           |
    Seq 2: G

35 Trace back …. Final Trace back. Best alignment:
    A T C G
      | | |
    _ T C G
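A compact Python sketch of the full Needleman-Wunsch procedure, including the trace back, is given below. It is one possible implementation: ties between predecessors are broken in the order diagonal, up, left, and "_" marks a gap. The function name is an assumption.

```python
def needleman_wunsch(seq1: str, seq2: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_seq1, aligned_seq2) for one optimal global alignment."""
    x, y = len(seq1), len(seq2)
    C = [[0] * (y + 1) for _ in range(x + 1)]
    for i in range(1, x + 1):
        C[i][0] = i * gap
    for j in range(1, y + 1):
        C[0][j] = j * gap
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s, C[i - 1][j] + gap, C[i][j - 1] + gap)

    a1, a2 = [], []
    i, j = x, y  # trace back starts from the last cell
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if C[i][j] == C[i - 1][j - 1] + s:  # diagonal: match or mismatch
                a1.append(seq1[i - 1])
                a2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and C[i][j] == C[i - 1][j] + gap:  # up: gap in sequence 2
            a1.append(seq1[i - 1])
            a2.append("_")
            i -= 1
        else:                                       # left: gap in sequence 1
            a1.append("_")
            a2.append(seq2[j - 1])
            j -= 1
    # The alignment was collected in reverse order.
    return C[x][y], "".join(reversed(a1)), "".join(reversed(a2))


print(needleman_wunsch("ATCG", "TCG"))  # (2, 'ATCG', '_TCG')
```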

36 Local Sequence Alignment. The Smith-Waterman algorithm performs a local alignment on two sequences. It is an example of dynamic programming. It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. Aim: the best alignment over the conserved domain of two sequences.

37 Differences in Needleman-Wunsch and Smith-Waterman Algorithms: In Smith-Waterman, the first row and first column are all filled with 0s in the initialization stage. While filling the matrix, if a score becomes negative, 0 is put in instead. In the traceback, start with the cell that has the highest score and work back until a cell with a score of 0 is reached.

38 Three steps in Smith-Waterman Algorithm: 1. Initialization 2. Scoring 3. Trace back (Alignment). Consider the two DNA sequences to be locally aligned: ATCG (x = 4, length of sequence 1) and TCG (y = 3, length of sequence 2).

39 Scoring Scheme. Match score = +1, mismatch score = -1, gap penalty = -1. Substitution matrix:
         A   C   G   T
    A    1  -1  -1  -1
    C   -1   1  -1  -1
    G   -1  -1   1  -1
    T   -1  -1  -1   1

40 Initialization Step. Create a matrix with X + 1 rows and Y + 1 columns. The 1st row and the 1st column of the score matrix are filled with 0s:
            T  C  G
         0  0  0  0
    A    0
    T    0
    C    0
    G    0

41 Scoring The score of any cell C(i, j) is the maximum of: scorediag = C(i-1, j-1) + S(i, j) scoreup = C(i-1, j) + g scoreleft = C(i, j-1) + g And 0 (here S(i, j) is the substitution score for letters i and j, and g is the gap penalty)

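42 Scoring …. Example: the calculation for the cell C(2, 2), i.e. A against T when the initialized row and column are counted as row 1 and column 1: scorediag = C(1, 1) + S(A, T) = 0 + (-1) = -1; scoreup = C(1, 2) + g = 0 + (-1) = -1; scoreleft = C(2, 1) + g = 0 + (-1) = -1. All three are negative, so C(2, 2) = 0.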

43 Scoring …. Final Scoring Matrix. Note: it is not mandatory that the last cell has the maximum alignment score!
            T  C  G
         0  0  0  0
    A    0  0  0  0
    T    0  1  0  0
    C    0  0  2  1
    G    0  0  1  3
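The Smith-Waterman scoring step differs from the global case only in the zero initialization and the extra 0 in the maximum. A minimal Python sketch, with an assumed function name and the scores of this example as defaults:

```python
def smith_waterman_matrix(seq1: str, seq2: str, match: int = 1,
                          mismatch: int = -1, gap: int = -1):
    """Return the (x+1) x (y+1) local alignment score matrix."""
    x, y = len(seq1), len(seq2)
    # Row 0 and column 0 stay at 0.
    C = [[0] * (y + 1) for _ in range(x + 1)]
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(0,                    # never let a score go negative
                          C[i - 1][j - 1] + s,  # diagonal
                          C[i - 1][j] + gap,    # up
                          C[i][j - 1] + gap)    # left
    return C


for row in smith_waterman_matrix("ATCG", "TCG"):
    print(row)
# [0, 0, 0, 0]
# [0, 0, 0, 0]
# [0, 1, 0, 0]
# [0, 0, 2, 1]
# [0, 0, 1, 3]
```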

44 Trace back. The trace back step determines the actual alignment(s) that result in the maximum score. There are likely to be multiple maximal alignments. Trace back starts from the cell with the maximum value in the matrix and gives the alignment in reverse order.

45 Trace back …. There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left. Trace back takes the current cell and looks to the neighbor cells that could be direct predecessors. With sequence #1 along the rows and sequence #2 along the columns, this means it looks to the neighbor to the left (gap in sequence #1), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #2). The algorithm for trace back chooses one of the possible predecessors as the next cell in the path. This continues until a cell with value 0 is reached.

46 Trace back …. Starting from the cell with the maximum score (3), the only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of
    Seq 1: G
           |
    Seq 2: G

47 Trace back …. Final Trace back. Best alignment:
    T C G
    | | |
    T C G
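The whole Smith-Waterman procedure, including the trace back from the highest-scoring cell down to a cell with score 0, might look like this in Python. It is a sketch under simple assumptions: if several cells share the maximum, the first one in row order is used, and ties between predecessors are broken in the order diagonal, up, left.

```python
def smith_waterman(seq1: str, seq2: str, match: int = 1,
                   mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_part_of_seq1, aligned_part_of_seq2)."""
    x, y = len(seq1), len(seq2)
    C = [[0] * (y + 1) for _ in range(x + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(0, C[i - 1][j - 1] + s, C[i - 1][j] + gap, C[i][j - 1] + gap)
            if C[i][j] > best:
                best, best_i, best_j = C[i][j], i, j

    a1, a2 = [], []
    i, j = best_i, best_j  # start from the cell with the maximum value
    while i > 0 and j > 0 and C[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if C[i][j] == C[i - 1][j - 1] + s:    # diagonal: match or mismatch
            a1.append(seq1[i - 1])
            a2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif C[i][j] == C[i - 1][j] + gap:    # up: gap in sequence 2
            a1.append(seq1[i - 1])
            a2.append("_")
            i -= 1
        else:                                 # left: gap in sequence 1
            a1.append("_")
            a2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


print(smith_waterman("ATCG", "TCG"))  # (3, 'TCG', 'TCG')
```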


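49 Gaps. A gap is any maximal, consecutive run of spaces in a single string of a given alignment.
    c t t t a a c _ _ a _ a c
    c _ _ _ c a c c c a t _ c
This alignment has four gaps and seven spaces. The simplest objective function that includes gaps is to maximize [Σ s(S 1 '(i), S 2 '(i)) – k W g ], summed over the columns i of the alignment, where S 1 ' and S 2 ' are the two strings with their spaces inserted and 1. W g is a constant weight for each gap 2. k is the number of gaps 3. s(x, _) = s(_, x) = 0 for every character x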
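Counting gaps and spaces is easy to check mechanically. The sketch below uses Python's `re` module to find maximal runs of the space character in one row of an alignment; the function name and the "_" convention are assumptions.

```python
import re


def gaps_and_spaces(row: str, space: str = "_"):
    """Return (number of gaps, number of spaces) in one row of an alignment."""
    # Each match of the pattern is one maximal run of spaces, i.e. one gap.
    runs = re.findall(re.escape(space) + "+", row)
    return len(runs), sum(len(run) for run in runs)


top = "ctttaac__a_ac"
bottom = "c___cacccat_c"
print(gaps_and_spaces(top))     # (2, 3)
print(gaps_and_spaces(bottom))  # (2, 4)
```

Adding the two results gives the four gaps and seven spaces of the example.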

50 Why Gaps? The top row shows part of the RNA sequence of one strain of the HIV-1 virus. HIV mutates rapidly. Each of the three bottom rows shows a virus strain mutated from the original one. Dark regions are the matching portions; white regions represent gaps. Matching here means similarity, i.e. mismatches or spaces may be present, but only in a small percentage of the region.

51 cDNA Matching: A Concrete Example. cDNA means complementary DNA.

52 Connection between DNA and Protein (figure labels: exon, intron)

53 The cDNA. Each cell contains the same chromosomes, the same set of genes. Yet in each specialized cell (a liver cell, for example) only a small fraction of the genes are expressed. You want to find the location of the gene encoding a specific protein. Capture the mRNA for that protein in the cell after it leaves the cell nucleus. That mRNA is used to create a DNA string complementary to it, which is known as cDNA.

54 cDNA Problem (figure label: cDNA)

55 Why Gaps in the Objective Function. Without a gap term in the objective function, the optimal alignment is unlikely to contain long gaps, and you cannot shape the gaps to suit your own choice or the specific problem.

56 Choice of Gap Weights. Constant – Maximize [W m (# matches) – W ms (# mismatches) – W g (# gaps)]. Affine – Maximize [W m (# matches) – W ms (# mismatches) – W g (# gaps) – W s (# spaces)] – W g is the gap initiation cost, W s is the gap extension cost. Convex. Arbitrary.
    c t t t a a c _ _ a _ a c
    c _ _ _ c a c c c a t _ c
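For a given alignment, the constant and affine objective values can be evaluated directly from the four counts. The Python sketch below does this; the function names are illustrative and the weights used in the example calls (W m = 1, W ms = 1, W g = 2, W s = 1) are made-up values, not taken from the slides.

```python
def alignment_counts(row1: str, row2: str, space: str = "_"):
    """Return (# matches, # mismatches, # gaps, # spaces) of an alignment."""
    matches = mismatches = spaces = gaps = 0
    prev1 = prev2 = False  # was the previous column a space in row1 / row2?
    for a, b in zip(row1, row2):
        in1, in2 = a == space, b == space
        if in1 or in2:
            spaces += 1
            # A gap starts wherever a run of spaces begins in either row.
            if (in1 and not prev1) or (in2 and not prev2):
                gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
        prev1, prev2 = in1, in2
    return matches, mismatches, gaps, spaces


def affine_gap_score(row1: str, row2: str, w_m: float, w_ms: float,
                     w_g: float, w_s: float) -> float:
    """W_m(#matches) - W_ms(#mismatches) - W_g(#gaps) - W_s(#spaces)."""
    matches, mismatches, gaps, spaces = alignment_counts(row1, row2)
    return w_m * matches - w_ms * mismatches - w_g * gaps - w_s * spaces


top, bottom = "ctttaac__a_ac", "c___cacccat_c"
print(alignment_counts(top, bottom))              # (5, 1, 4, 7)
print(affine_gap_score(top, bottom, 1, 1, 2, 1))  # -11
print(affine_gap_score(top, bottom, 1, 1, 2, 0))  # -4, the constant gap weight case
```

Setting W s to zero reduces the affine objective to the constant one, which is why a single function covers both cases.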

57 Reference Chapter 10, 11: Algorithms on Strings, Trees and Sequences

