1
**Dynamic Programming: Sequence alignment**

CS 466 Saurabh Sinha

2
**DNA Sequence Comparison: First Success Story**

Finding sequence similarities with genes of known function is a common approach to inferring a newly sequenced gene’s function. In 1984, Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene. A normal growth gene switched on at the wrong time causes cancer!

3
**Cystic Fibrosis**

Cystic fibrosis (CF) is a chronic and frequently fatal genetic disease of the body's mucus glands. CF primarily affects the respiratory systems in children. The search for the CF gene was narrowed to ~1 Mbp, and the region was sequenced. A database was then scanned for matches to known genes. A segment in this region matched the gene for some ATP binding protein(s). These proteins are part of the ion transport channel, and CF involves sweat secretions with abnormal sodium content!

4
**Role for Bioinformatics**

Similarities between a gene of known function and a gene of unknown function alert biologists to some possibilities. Computing a similarity score between two genes tells how likely it is that they have similar functions. Dynamic programming is a technique for revealing similarities between genes.

5
**Motivating Dynamic Programming**

6
**Dynamic programming example: Manhattan Tourist Problem**

Imagine seeking a path from source to sink, traveling only eastward and southward, that passes the greatest number of attractions (*) in the Manhattan grid.

[Figure: Manhattan grid with attractions (*) scattered between Source and Sink.]

8
**Manhattan Tourist Problem: Formulation**

Goal: Find a “most weighted” path in a weighted grid.

Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink”.

Output: A “most weighted” path in G from “source” to “sink”.

9
**MTP: An Example**

[Figure: weighted grid with i and j coordinate axes, showing a path from source to sink whose running scores are 3, 5, 9, 13, 15, 19, 20, 23.]

10
**MTP: Greedy Algorithm Is Not Optimal**

[Figure: weighted grid on which the greedy choice is traced from source to sink. It makes a promising start, but leads to bad choices!]

11
**MTP: Simple Recursive Program**

    MT(n, m)
        if n = 0 or m = 0
            return the total length of the border edges from (0, 0) to (n, m)
        x ← MT(n-1, m) + length of the edge from (n-1, m) to (n, m)
        y ← MT(n, m-1) + length of the edge from (n, m-1) to (n, m)
        return max{x, y}

What’s wrong with this approach?

12
**Here’s what’s wrong**

MT(n, m) needs MT(n, m-1) and MT(n-1, m). Both of these need MT(n-1, m-1), so MT(n-1, m-1) will be computed at least twice. Dynamic programming: the same idea as this recursive algorithm, but keep all intermediate results in a table and reuse them.
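The recomputation can be made concrete with a small sketch. The grid below is hypothetical (its edge weights are invented for illustration and are not the ones on the slides), and the names `DOWN`, `RIGHT`, `mt_naive`, and `mt_memo` are assumptions of this example. The first function is the plain recursion; the second keeps intermediate results in a table through `functools.lru_cache`.

```python
from functools import lru_cache

# Hypothetical grid with 3 x 3 vertices; all weights are invented.
# DOWN[i][j]  is the weight of the edge (i, j) -> (i + 1, j).
# RIGHT[i][j] is the weight of the edge (i, j) -> (i, j + 1).
DOWN = [[1, 0, 2],
        [4, 6, 5]]
RIGHT = [[3, 2],
         [3, 2],
         [0, 7]]

calls = {"naive": 0, "memo": 0}  # how many times each function body runs


def mt_naive(n, m):
    """Plain recursion: shared subproblems are recomputed."""
    calls["naive"] += 1
    if n == 0 and m == 0:
        return 0
    best = float("-inf")
    if n > 0:  # arrive from the vertex above
        best = max(best, mt_naive(n - 1, m) + DOWN[n - 1][m])
    if m > 0:  # arrive from the vertex to the left
        best = max(best, mt_naive(n, m - 1) + RIGHT[n][m - 1])
    return best


@lru_cache(maxsize=None)
def mt_memo(n, m):
    """Same recursion, but each (n, m) is solved once and then reused."""
    calls["memo"] += 1
    if n == 0 and m == 0:
        return 0
    best = float("-inf")
    if n > 0:
        best = max(best, mt_memo(n - 1, m) + DOWN[n - 1][m])
    if m > 0:
        best = max(best, mt_memo(n, m - 1) + RIGHT[n][m - 1])
    return best


if __name__ == "__main__":
    print(mt_naive(2, 2), mt_memo(2, 2), calls)
```

On this tiny grid both functions return the same score, but the plain recursion runs its body 19 times against 9 for the cached version, and the gap widens quickly as the grid grows.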

13
**MTP: Dynamic Programming**

Calculate the optimal path score for each vertex in the graph. Each vertex’s score is the maximum, over its predecessor vertices, of the predecessor’s score plus the weight of the edge in between.

[Figure: first step on the example grid, giving $s_{0,1} = 1$ and $s_{1,0} = 5$.]

14
**MTP: Dynamic Programming (cont’d)**

[Figure: second step on the example grid, giving $s_{0,2} = 3$, $s_{1,1} = 4$, and $s_{2,0} = 8$.]

15
**MTP: Dynamic Programming (cont’d)**

[Figure: third step on the example grid, giving $s_{0,3} = 8$, $s_{1,2} = 13$, $s_{2,1} = 9$, and $s_{3,0} = 8$.]

16
**MTP: Dynamic Programming (cont’d)**

[Figure: fourth step on the example grid, giving $s_{1,3} = 8$, $s_{2,2} = 12$, and $s_{3,1} = 9$. The greedy algorithm fails!]

17
**MTP: Dynamic Programming (cont’d)**

[Figure: fifth step on the example grid, giving $s_{2,3} = 15$ and $s_{3,2} = 9$.]

18
**MTP: Dynamic Programming (cont’d)**

Done!

[Figure: completed example grid showing all back-traces, with $s_{3,3} = 16$.]

19
**MTP: Recurrence**

Computing the score for a point (i, j) by the recurrence relation:

$$s_{i,j} = \max \begin{cases} s_{i-1,j} + \text{weight of the edge between } (i-1, j) \text{ and } (i, j) \\ s_{i,j-1} + \text{weight of the edge between } (i, j-1) \text{ and } (i, j) \end{cases}$$

The running time is O(n x m) for an n by m grid (n = # of rows, m = # of columns).

20
**Traveling in the Grid**

By the time the vertex x is analyzed, the values $s_y$ for all its predecessors y should already be computed; otherwise we are in trouble. We need to traverse the vertices in some order. For a grid, we can traverse vertices row by row, column by column, or diagonal by diagonal.

21
**Traversing the Manhattan Grid**

3 different strategies:

- a) Column by column
- b) Row by row
- c) Along diagonals
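The three orders can be written down and checked in a few lines. This is only a sketch: the function names are hypothetical, and `is_valid_order` simply tests the requirement that the vertex above and the vertex to the left are visited before the vertex itself.

```python
def row_order(n, m):
    """Visit vertices (0..n) x (0..m) row by row."""
    return [(i, j) for i in range(n + 1) for j in range(m + 1)]


def column_order(n, m):
    """Visit vertices column by column."""
    return [(i, j) for j in range(m + 1) for i in range(n + 1)]


def diagonal_order(n, m):
    """Visit vertices along anti-diagonals i + j = d, for d = 0, 1, ..."""
    return [(i, d - i)
            for d in range(n + m + 1)
            for i in range(max(0, d - m), min(n, d) + 1)]


def is_valid_order(order):
    """True if every vertex comes after its upper and left neighbors."""
    seen = set()
    for i, j in order:
        if i > 0 and (i - 1, j) not in seen:
            return False
        if j > 0 and (i, j - 1) not in seen:
            return False
        seen.add((i, j))
    return True


if __name__ == "__main__":
    for make_order in (row_order, column_order, diagonal_order):
        print(make_order.__name__, is_valid_order(make_order(3, 4)))
```

Any order that passes this check can drive the table fill; the choice mostly affects memory access patterns and, for diagonals, which cells could be computed independently of one another.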

22
**Alignment**

23
**Aligning DNA Sequences**

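Alignment: a 2 x k matrix (k ≥ m, n)

V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)

Each column of the matrix is a match, a mismatch, a deletion, or an insertion. Insertions and deletions together are called indels.

[Figure: example alignment of V and W with its matches, mismatches, deletions, and insertions marked.]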

24
**Longest Common Subsequence (LCS) – Alignment without Mismatches**

Given two sequences $v = v_1 v_2 \ldots v_m$ and $w = w_1 w_2 \ldots w_n$, the LCS of v and w is a sequence of positions in v, $1 \le i_1 < i_2 < \ldots < i_t \le m$, and a sequence of positions in w, $1 \le j_1 < j_2 < \ldots < j_t \le n$, such that the $i_k$-th letter of v equals the $j_k$-th letter of w for every $1 \le k \le t$, and t is maximal.

25
**LCS: Example**

Every common subsequence is a path in a 2-D grid.

| i coords | 0 | 1 | 2 | 2 | 3 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| elements of v | | A | T | -- | C | -- | T | G | A | T | C |
| elements of w | | -- | T | G | C | A | T | -- | A | -- | C |
| j coords | 0 | 0 | 1 | 2 | 3 | 4 | 5 | 5 | 6 | 6 | 7 |

Path: (0,0) (1,0) (2,1) (2,2) (3,3) (3,4) (4,5) (5,5) (6,6) (7,6) (8,7)

Positions in v: 2 < 3 < 4 < 6 < 8
Positions in w: 1 < 3 < 5 < 6 < 7

Matches (shown in red in the figure): T, C, T, A, C.

26
**Computing LCS**

Let $v_i$ = prefix of v of length i: $v_1 \ldots v_i$, and $w_j$ = prefix of w of length j: $w_1 \ldots w_j$. The length of LCS($v_i$, $w_j$) is computed by:

$$s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1 & \text{if } v_i = w_j \end{cases}$$
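A minimal sketch of the LCS length computation follows, assuming the usual form of the recurrence in which a diagonal step from $s_{i-1,j-1}$ adds 1 when the two letters match. The function name `lcs_length` is hypothetical; the test strings are the two sequences from the LCS example slide.

```python
def lcs_length(v, w):
    """Length of a longest common subsequence of strings v and w."""
    rows, cols = len(v), len(w)
    # s[i][j] = LCS length of the prefixes v[:i] and w[:j]; row/column 0 stay 0.
    s = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])      # skip a letter
            if v[i - 1] == w[j - 1]:                      # diagonal edge
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
    return s[rows][cols]


if __name__ == "__main__":
    print(lcs_length("ATCTGATC", "TGCATAC"))
```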

27
**LCS Problem as Manhattan Tourist Problem**

[Figure: the LCS problem drawn as a Manhattan grid, with columns j = 1 to 8 and rows i = 1 to 7 labeled by the letters T, G, C, A, T, A, C.]

28
**Edit Graph for LCS Problem**

[Figure: edit graph for the LCS problem, with columns j = 1 to 8 and rows i = 1 to 7 labeled by the letters T, G, C, A, T, A, C.]

29
**Edit Graph for LCS Problem**

Every path is a common subsequence. Every diagonal edge adds an extra element to the common subsequence. LCS Problem: Find a path with the maximum number of diagonal edges.

[Figure: edit graph with columns j = 1 to 8 and rows i = 1 to 7 labeled by the letters T, G, C, A, T, A, C.]

30
**Backtracking**

$s_{i,j}$ allows us to compute the length of the LCS for $v_i$ and $w_j$, and $s_{m,n}$ gives us the length of the LCS for v and w. How do we print the actual LCS? At each step, we chose an optimal decision $s_{i,j} = \max(\ldots)$. Record which of the choices was made in order to obtain this max.

32
**Printing LCS: Backtracking**

    PrintLCS(b, v, i, j)
        if i = 0 or j = 0
            return
        if b_{i,j} = “↖”
            PrintLCS(b, v, i-1, j-1)
            print v_i
        else if b_{i,j} = “↑”
            PrintLCS(b, v, i-1, j)
        else
            PrintLCS(b, v, i, j-1)
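The backtracking idea can be sketched as follows. This is an illustrative version, not the slides' exact code: the backpointer labels `"diag"`, `"up"`, and `"left"` stand in for the arrows stored in b, the function names are hypothetical, and the letters are collected into a list instead of being printed.

```python
def lcs_backpointers(v, w):
    """Fill the LCS table and record which choice produced each maximum."""
    rows, cols = len(v), len(w)
    s = [[0] * (cols + 1) for _ in range(rows + 1)]
    b = [[None] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s[i][j], b[i][j] = s[i - 1][j], "up"
            if s[i][j - 1] > s[i][j]:
                s[i][j], b[i][j] = s[i][j - 1], "left"
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 > s[i][j]:
                s[i][j], b[i][j] = s[i - 1][j - 1] + 1, "diag"
    return b


def collect_lcs(b, v, i, j, out):
    """Follow the backpointers from (i, j), appending matched letters to out."""
    if i == 0 or j == 0:
        return
    if b[i][j] == "diag":
        collect_lcs(b, v, i - 1, j - 1, out)
        out.append(v[i - 1])  # emitted after the recursive call, so in order
    elif b[i][j] == "up":
        collect_lcs(b, v, i - 1, j, out)
    else:
        collect_lcs(b, v, i, j - 1, out)


def lcs(v, w):
    """Return one longest common subsequence of v and w as a string."""
    out = []
    collect_lcs(lcs_backpointers(v, w), v, len(v), len(w), out)
    return "".join(out)


if __name__ == "__main__":
    print(lcs("ATCTGATC", "TGCATAC"))
```

When several choices tie, a different tie-breaking rule may return a different subsequence of the same length. For long sequences an iterative walk avoids Python's recursion limit.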

33
**From LCS to Alignment**

The Longest Common Subsequence problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches). In the LCS Problem, we scored 1 for matches and 0 for indels. Consider penalizing indels and mismatches with negative scores. Simplest scoring scheme:

- +1 : match premium
- -μ : mismatch penalty
- -σ : indel penalty

34
**Simple Scoring**

When mismatches are penalized by –μ, indels are penalized by –σ, and matches are rewarded with +1, the resulting score is:

#matches – μ(#mismatches) – σ(#indels)
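As a small worked example of this formula, the sketch below scores an alignment given as two gapped rows of equal length. The function name `simple_score` and the use of `-` as the gap character are assumptions of the example; the rows used in the demo are the alignment from the LCS example slide, which has 5 matches, 0 mismatches, and 5 indels.

```python
def simple_score(row_v, row_w, mu, sigma):
    """Score = #matches - mu * #mismatches - sigma * #indels."""
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have the same length")
    matches = mismatches = indels = 0
    for a, b in zip(row_v, row_w):
        if a == "-" or b == "-":
            indels += 1          # a gap in either row is an indel column
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels


if __name__ == "__main__":
    print(simple_score("AT-C-TGATC", "-TGCAT-A-C", mu=1, sigma=1))
```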

35
**The Global Alignment Problem**

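Find the best alignment between two strings under a given scoring schema.

Input: Strings v and w and a scoring schema

Output: Alignment of maximum score

Edge weights in the edit graph: horizontal and vertical edges (indels) score -σ; diagonal edges score +1 for a match and -µ for a mismatch.

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j-1} - \mu & \text{if } v_i \ne w_j \\ s_{i-1,j} - \sigma \\ s_{i,j-1} - \sigma \end{cases}$$

µ : mismatch penalty
σ : indel penalty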
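A minimal sketch of global alignment under this scoring schema is given below. The function name `global_align` and the default penalties are hypothetical. The boundary values (a prefix aligned against nothing costs σ per letter) are the standard choice but are an assumption here, since the slide does not state them.

```python
def global_align(v, w, mu=1, sigma=1):
    """Return (score, aligned v row, aligned w row) for a best global alignment."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # assumed boundary: v prefix vs. gaps
        s[i][0] = s[i - 1][0] - sigma
    for j in range(1, m + 1):          # assumed boundary: w prefix vs. gaps
        s[0][j] = s[0][j - 1] - sigma
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(diag, s[i - 1][j] - sigma, s[i][j - 1] - sigma)

    # Trace back from (n, m), re-deriving which term gave each maximum.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + (
                1 if v[i - 1] == w[j - 1] else -mu):
            top.append(v[i - 1]); bottom.append(w[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or s[i][j] == s[i - 1][j] - sigma):
            top.append(v[i - 1]); bottom.append("-")   # letter of v vs. gap
            i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1])   # gap vs. letter of w
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(global_align("ACGT", "AGT"))
```

With µ = σ = 1, aligning `ACGT` against `AGT` yields three matches and one indel, for a score of 2.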

36
**Scoring Matrices**

To generalize scoring, consider a (4+1) x (4+1) scoring matrix δ. In the case of an amino acid sequence alignment, the scoring matrix would be of size (20+1) x (20+1). The addition of 1 is to include the score for comparison with the gap character “-”. This simplifies the algorithm as follows:

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
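To illustrate how a scoring matrix replaces the separate match, mismatch, and indel cases, here is a sketch in which δ is a dictionary keyed by pairs of characters, with `-` as the gap character. The helper `make_delta`, which rebuilds the simple +1 / -µ / -σ scheme as a matrix, and the function `global_score` are hypothetical names introduced for this example.

```python
def make_delta(alphabet, mu, sigma):
    """Build the simple scoring scheme as a lookup table over alphabet + gap."""
    delta = {}
    for a in alphabet:
        delta[(a, "-")] = -sigma      # letter aligned against a gap
        delta[("-", a)] = -sigma
        for b in alphabet:
            delta[(a, b)] = 1 if a == b else -mu
    return delta


def global_score(v, w, delta):
    """Global alignment score using only lookups in delta."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + delta[(v[i - 1], "-")]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + delta[("-", w[j - 1])]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + delta[(v[i - 1], w[j - 1])],
                          s[i - 1][j] + delta[(v[i - 1], "-")],
                          s[i][j - 1] + delta[("-", w[j - 1])])
    return s[n][m]


if __name__ == "__main__":
    dna_delta = make_delta("ACGT", mu=1, sigma=1)
    print(global_score("ACGT", "AGT", dna_delta))
```

Swapping in a different table, for example one derived from biological evidence, changes the scoring without touching the algorithm.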

37
**Making a Scoring Matrix**

Scoring matrices are created based on biological evidence. Alignments can be thought of as two sequences that differ due to mutations. Some of these mutations have little effect on the function of the sequence, so some penalties, $\delta(v_i, w_j)$, will be less harsh than others.

38
**Scoring Matrix: Example for protein sequences**

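| | A | R | N | K |
|---|---|---|---|---|
| A | 5 | | | |
| R | -2 | 7 | | |
| N | -1 | -1 | 7 | |
| K | -1 | 3 | 0 | 6 |

Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.

AKRANR
KAAANK

-1 + (-1) + (-2) + 5 + 7 + 3 = 11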
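The arithmetic for this example can be checked with a short sketch. The pairwise scores in `SCORES` are an assumed reading of the four-residue matrix (A, R, N, K); they are consistent with the stated total of 11 for the ungapped alignment of AKRANR with KAAANK. The names `SCORES`, `delta`, and `score_ungapped` are hypothetical.

```python
# Assumed lower-triangular scores for the residues A, R, N, K.
SCORES = {
    ("A", "A"): 5,
    ("R", "A"): -2, ("R", "R"): 7,
    ("N", "A"): -1, ("N", "R"): -1, ("N", "N"): 7,
    ("K", "A"): -1, ("K", "R"): 3, ("K", "N"): 0, ("K", "K"): 6,
}


def delta(a, b):
    """Symmetric lookup: the table stores each unordered pair once."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def score_ungapped(top, bottom):
    """Sum the column scores of an alignment without gaps."""
    if len(top) != len(bottom):
        raise ValueError("rows must have the same length")
    return sum(delta(a, b) for a, b in zip(top, bottom))


if __name__ == "__main__":
    print(score_ungapped("AKRANR", "KAAANK"))
```

The final column, R against K, contributes +3 even though the residues differ, which is the point the slide makes about conservative substitutions.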

39
**Conservation**

Amino acid changes that tend to preserve the physico-chemical properties of the original residue:

- Polar to polar: aspartate to glutamate
- Nonpolar to nonpolar: alanine to valine
- Similarly behaving residues: leucine to isoleucine

40
**Local vs. Global Alignment**

The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph. In an edit graph with negatively scored edges, a Local Alignment may score higher than the Global Alignment.

41
**Local Alignment: Example**

Compute a “mini” Global Alignment to get the Local Alignment.

[Figure: edit graph contrasting the local alignment path with the global alignment path.]

42
**Local Alignments: Why?**

Two genes in different species may be similar over short conserved regions and dissimilar over the remaining regions. Example: Homeobox genes have a short region called the homeodomain that is highly conserved between species. A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence.

43
**The Local Alignment Problem**

Goal: Find the best local alignment between two strings.

Input: Strings v, w and scoring matrix δ

Output: Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings

44
**The Problem Is …**

Long run time O($n^4$):

- In the grid of size n x n there are ~$n^2$ vertices (i,j) that may serve as a source.
- For each such vertex, computing alignments from (i,j) to (i’,j’) takes O($n^2$) time.

This can be remedied by allowing every point to be the starting point.

45
**The Local Alignment Recurrence**

The largest value of $s_{i,j}$ over the whole edit graph is the score of the best local alignment. The recurrence:

$$s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$

Notice that the 0 is the only change from the original recurrence of a Global Alignment.
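A minimal sketch of local alignment follows, assuming the standard form of the recurrence in which 0 is an extra option in the max, so that any vertex can act as a free starting point. The names `local_align` and `unit_delta` are hypothetical, and the scoring function (+1 match, -1 mismatch, -1 indel) and test strings are invented for illustration.

```python
def unit_delta(a, b):
    """Hypothetical scoring function: +1 match, -1 mismatch, -1 indel."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def local_align(v, w, delta):
    """Return (best local score, aligned v substring, aligned w substring)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]  # borders stay 0: free start
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(0,
                          s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                          s[i - 1][j] + delta(v[i - 1], "-"),
                          s[i][j - 1] + delta("-", w[j - 1]))
            if s[i][j] > best:                 # the answer is the table maximum
                best, best_cell = s[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    top, bottom = [], []
    i, j = best_cell
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]):
            top.append(v[i - 1]); bottom.append(w[j - 1])
            i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j] + delta(v[i - 1], "-"):
            top.append(v[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(local_align("GGGACGTCCC", "TTTACGTAAA", unit_delta))
```

The table is filled once in O(nm) time, compared with the O($n^4$) cost of running a separate global alignment from every possible source vertex.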
