# CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Global Sequence Alignment.


Evolution at the DNA level
…ACGGTGCAGTCACCA…
…ACGTTGC-GTCCACCA…
DNA evolutionary events (sequence edits): mutation, deletion, insertion

Sequence conservation implies function
[Figure: sequence changes passed on to the next generation, labeled "OK", "X", "X", and "Still OK?"]

Why sequence alignment?
- Conserved regions are more likely to be functional
  - Can be used for finding genes, regulatory elements, etc.
- Similar sequences often have similar origin and function
  - Can be used to predict functions for new genes / proteins
- Sequence alignment is one of the most widely used computational tools in biology

Global Sequence Alignment
S: AGGCTATCACCTGACCTCCAGGCCGATGCCC
T: TAGCTATCACGACCGCGGTCGATTTGCCCGAC
S': -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T': TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: An alignment of two strings S, T is a pair of strings S', T' (with spaces) such that
1. |S'| = |T'| (where |S| denotes the length of S), and
2. removing all spaces in S', T' leaves S, T

What is a good alignment?
- Alignment: the "best" way to match the letters of one sequence with those of the other
- How do we define "best"?

The score of aligning (characters or spaces) x and y is σ(x, y).
Score of an alignment: $\sum_{i=1}^{|S'|} \sigma(S'[i], T'[i])$
An optimal alignment: one with max score
S': -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T': TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Scoring Function
Sequence edits, starting from AGGCCTC:
- Mutations: AGGACTC
- Insertions: AGGGCCTC
- Deletions: AGG-CTC

Scoring function:
- Match: +m
- Mismatch: -s
- Gap (indel): -d

Example columns: `~~~AAC~~~` aligned over `~~~A-A~~~` (a match, a gap, and a mismatch)

Match = 2, mismatch = -1, gap = -1
Score = 3 × 2 - 2 × 1 - 1 × 1 = 3
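The arithmetic above can be reproduced with a small scoring routine. This is a minimal sketch: the function name `alignment_score`, the use of `-` for a space, and the example strings are assumptions made for illustration, chosen so that the alignment has 3 matches, 2 mismatches, and 1 gap.

```python
def alignment_score(s_aligned, t_aligned, match=2, mismatch=-1, gap=-1):
    """Sum the column scores of an alignment given as two equal-length strings."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(s_aligned, t_aligned):
        if a == "-" or b == "-":
            total += gap        # one character aligned to a space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

# Hypothetical alignment: columns are match, mismatch, match, match, gap, mismatch.
print(alignment_score("ACGTAC", "AGGT-A"))  # 3 * 2 - 2 * 1 - 1 * 1 = 3
```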

More complex scoring function
- Substitution matrix
  - The similarity score for matching two letters a, b should reflect the probability that a and b derive from the same ancestor
  - It is usually defined by a log-likelihood ratio (Durbin book)
  - Active research area, especially for proteins
  - Commonly used: PAM, BLOSUM

An example substitution matrix
Rows and columns: A, C, G, T. Entries by row: A: 3, -2, -2; C: 3; G: 3, -2; T: 3

How to find an optimal alignment?
A naïve algorithm:

    for all subseqs A of S, B of T s.t. |A| = |B| do
        align A[i] with B[i], 1 ≤ i ≤ |A|
        align all other chars to spaces
        compute its value
        retain the max
    end
    output the retained alignment

Example: S = abcd, T = wxyz, A = cd, B = xz, which gives alignments such as

    -abc-d    a-bc-d
    w--xyz    -w-xyz
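The naïve algorithm can be sketched directly for tiny inputs. This is an illustrative version only: the function name `naive_best_score` and the default scores (match = 1, mismatch = -1, gap = -1) are assumptions, and it returns just the best score, since the order in which unpaired characters are placed against spaces does not affect the score.

```python
from itertools import combinations

def naive_best_score(s, t, match=1, mismatch=-1, gap=-1):
    """Exhaustively score every way of pairing a subsequence of s with one of t."""
    best = None
    for k in range(min(len(s), len(t)) + 1):
        # Choose k positions in s and k positions in t, paired up in order.
        for a in combinations(range(len(s)), k):
            for b in combinations(range(len(t)), k):
                score = sum(match if s[i] == t[j] else mismatch
                            for i, j in zip(a, b))
                # Every unpaired character is aligned to a space.
                score += gap * ((len(s) - k) + (len(t) - k))
                if best is None or score > best:
                    best = score
    return best

print(naive_best_score("AGTA", "ATA"))   # 2
print(naive_best_score("abcd", "wxyz"))  # -4
```

The triple loop visits every (A, B) pair, so the running time grows exponentially with the sequence length.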

Analysis
- Assume |S| = |T| = n
- Cost of evaluating one alignment: ≥ n
- How many alignments are there: $\binom{2n}{n}$
  - pick n chars of S, T together
  - say k of them are in S
  - match these k to the k unpicked chars of T
- Total time: $\geq n \binom{2n}{n} > 2^{2n}$ for $n > 3$
- E.g., for n = 20, time is $> 2^{40} > 10^{12}$ operations
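The count can be checked numerically. The sketch below assumes that an alignment is identified by which k characters of S are paired with which k characters of T, so the total is the sum over k of C(n, k)², which equals C(2n, n). The function name `count_alignments` is hypothetical.

```python
from math import comb

def count_alignments(n):
    """Number of (A, B) subsequence pairs with |A| = |B| for two strings of length n."""
    return sum(comb(n, k) ** 2 for k in range(n + 1))

n = 20
total = count_alignments(n)
print(total == comb(2 * n, n))         # True
print(n * total > 2 ** 40 > 10 ** 12)  # True
```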

Intro to Dynamic Programming

Dynamic programming
- What is dynamic programming?
  - A method for solving problems exhibiting the properties of overlapping subproblems and optimal substructure
  - Key idea: tabulating sub-problem solutions rather than re-computing them repeatedly
- Two simple examples:
  - Computing Fibonacci numbers
  - Finding the special shortest path in a grid

Example 1: Fibonacci numbers
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …
F(0) = 1; F(1) = 1; F(n) = F(n-1) + F(n-2)
How to compute F(n)?

A recursive algorithm

    function fib(n)
        if (n == 0 or n == 1) return 1;
        else return fib(n-1) + fib(n-2);

[Figure: recursion tree for F(9): F(9) calls F(8) and F(7), F(8) calls F(7) and F(6), and so on, so the same values are computed many times.]

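Time complexity:
- Between $2^{n/2}$ and $2^n$
- $O(1.62^n)$, i.e. exponential

Why is the recursive Fib algorithm inefficient?
- Overlapping subproblems

[Figure: recursion tree whose shortest branch has depth n/2 and whose longest branch has depth n.]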
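The exponential growth is easy to observe by counting calls. This sketch uses the lecture's convention F(0) = F(1) = 1; the helper name `fib_calls` is an assumption made for illustration.

```python
def fib_calls(n):
    """Return (F(n), number of calls made by the naive recursion)."""
    if n <= 1:
        return 1, 1
    a, calls_a = fib_calls(n - 1)
    b, calls_b = fib_calls(n - 2)
    return a + b, calls_a + calls_b + 1

print(fib_calls(9))   # (55, 109)
print(fib_calls(20))  # (10946, 21891)
```

The call count grows at the same rate as F(n) itself, which is why tabulating the subproblem results pays off.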

An iterative algorithm

    function fib(n)
        F[0] = 1; F[1] = 1;
        for i = 2 to n
            F[i] = F[i-1] + F[i-2];
        return F[n];

Time complexity: time O(n), space O(n)
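A runnable version of the iterative algorithm might look as follows. The second function is an addition not on the slide: because each step only needs the previous two values, the table can be replaced by two variables, reducing space to O(1). Both function names are hypothetical.

```python
def fib_table(n):
    """Tabulated Fibonacci with F(0) = F(1) = 1: O(n) time, O(n) space."""
    f = [1] * (n + 1)
    for i in range(2, n + 1):
        f[i] = f[i - 1] + f[i - 2]
    return f[n]

def fib_two_vars(n):
    """Same recurrence keeping only the last two values: O(n) time, O(1) space."""
    prev, cur = 1, 1
    for _ in range(2, n + 1):
        prev, cur = cur, prev + cur
    return cur

print(fib_table(9), fib_two_vars(9))  # 55 55
```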

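Example 2: shortest path in a grid
[Figure: an m × n grid with start S at the upper-left corner and goal G at the lower-right corner.]
- Each edge has a length (cost).
- We need to get to G from S.
- Can only move right or down.
- Aim: find a path with the minimum total length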

Optimal substructures
- Naïve algorithm: enumerate all possible paths and compare costs
  - Exponential number of paths
- Key observation:
  - If a path P(S, G) is the shortest from S to G, then any sub-path P(S, x), where x is on P(S, G), is the shortest from S to x

Proof
- Proof by contradiction
  - Suppose the sub-path P(S, x) is not the shortest, i.e., there is a path P'(S, x) with P'(S, x) < P(S, x)
  - Construct a new path P'(S, G) = P'(S, x) + P(x, G)
  - Then P'(S, G) < P(S, G), so P(S, G) is not the shortest
  - Contradiction
  - Therefore, P(S, x) is the shortest

[Figure: a path from S to G passing through x.]

Recursive solution
- Index each intersection by two indices, (i, j)
- Let F(i, j) be the total length of the shortest path from (0, 0) to (i, j). Therefore, F(m, n) is the length of the shortest path we want.
- To compute F(m, n), we need to compute both F(m-1, n) and F(m, n-1):

$$F(m, n) = \min \begin{cases} F(m-1, n) + \mathrm{length}((m-1, n), (m, n)) \\ F(m, n-1) + \mathrm{length}((m, n-1), (m, n)) \end{cases}$$

Recursive solution
- But: if we use recursive calls, many sub-paths will be recomputed many times
- Strategy: pre-compute F values starting from the upper-left corner. Fill in row by row (what other order will also do?)

$$F(i, j) = \min \begin{cases} F(i-1, j) + \mathrm{length}((i-1, j), (i, j)) \\ F(i, j-1) + \mathrm{length}((i, j-1), (i, j)) \end{cases}$$

[Figure: the grid from (0,0) to (m, n), showing cell (i, j) and its neighbors (i-1, j) and (i, j-1).]

Dynamic programming illustration
[Figure: a grid from S to G with a length on every edge; each intersection is labeled with its F value.]

$$F(i, j) = \min \begin{cases} F(i-1, j) + \mathrm{length}((i-1, j), (i, j)) \\ F(i, j-1) + \mathrm{length}((i, j-1), (i, j)) \end{cases}$$

Traceback
[Figure: the same grid, with the shortest path recovered by tracing back from G to S.]
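The grid recurrence and its traceback can be sketched as below. The data layout is an assumption made for illustration: `right[i][j]` is the length of the edge from (i, j) to (i, j+1) and `down[i][j]` the length of the edge from (i, j) to (i+1, j). The small edge lengths are made up and are not the ones from the slide's figure.

```python
def grid_shortest_path(right, down):
    """Shortest right/down path from (0, 0) to the opposite corner of a grid."""
    rows = len(right)            # number of intersection rows
    cols = len(right[0]) + 1     # number of intersection columns
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):     # first row: reachable only from the left
        F[0][j] = F[0][j - 1] + right[0][j - 1]
    for i in range(1, rows):     # first column: reachable only from above
        F[i][0] = F[i - 1][0] + down[i - 1][0]
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = min(F[i - 1][j] + down[i - 1][j],
                          F[i][j - 1] + right[i][j - 1])
    # Traceback: at each cell, step to a neighbor that explains its F value.
    i, j = rows - 1, cols - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        if i > 0 and F[i][j] == F[i - 1][j] + down[i - 1][j]:
            i -= 1
        else:
            j -= 1
        path.append((i, j))
    path.reverse()
    return F[-1][-1], path

right = [[1, 4], [2, 1]]   # horizontal edge lengths, one list per intersection row
down = [[3, 1, 5]]         # vertical edge lengths between row 0 and row 1
print(grid_shortest_path(right, down))
# (3, [(0, 0), (0, 1), (1, 1), (1, 2)])
```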

Elements of dynamic programming
- Optimal sub-structures
  - Optimal solutions to the original problem contain optimal solutions to sub-problems
- Overlapping sub-problems
  - Some sub-problems appear in many solutions
- Memoization and reuse
  - Carefully choose the order in which sub-problems are solved

Dynamic Programming for sequence alignment
Suppose we wish to align $x_1 \ldots x_M$ and $y_1 \ldots y_N$.
Let F(i, j) = optimal score of aligning $x_1 \ldots x_i$ and $y_1 \ldots y_j$.
Scoring function:
- Match: +m
- Mismatch: -s
- Gap (indel): -d

Optimal substructure
- If x[i] is aligned to y[j] in the optimal alignment between x[1..M] and y[1..N], then the alignment between x[1..i] and y[1..j] is also optimal
- Easy to prove by contradiction

[Figure: x with positions 1, 2, …, i, …, M drawn above y with positions 1, 2, …, j, …, N.]

Recursive formula
Notice three possible cases:
1. $x_M$ aligns to $y_N$: $F(M, N) = F(M-1, N-1) + \begin{cases} m, & \text{if } x_M = y_N \\ -s, & \text{if not} \end{cases}$
2. $x_M$ aligns to a gap: $F(M, N) = F(M-1, N) - d$
3. $y_N$ aligns to a gap: $F(M, N) = F(M, N-1) - d$

Recursive formula
Generalize:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$

where $\sigma(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ otherwise.
Boundary conditions:
- F(0, 0) = 0
- F(0, j) = -jd: y[1..j] aligned to gaps
- F(i, 0) = -id: x[1..i] aligned to gaps
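The recurrence and boundary conditions translate almost line for line into a table-filling routine. In this sketch the function name `nw_table` is hypothetical, `m`, `s`, and `d` are the positive magnitudes used above, and `F[i][j]` is stored with i indexing x and j indexing y.

```python
def nw_table(x, y, m=1, s=1, d=1):
    """Fill the global-alignment score table F for sequences x and y."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d                 # x[1..i] aligned to gaps
    for j in range(1, N + 1):
        F[0][j] = -j * d                 # y[1..j] aligned to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,   # x_i aligned to y_j
                          F[i - 1][j] - d,           # x_i aligned to a gap
                          F[i][j - 1] - d)           # y_j aligned to a gap
    return F

F = nw_table("AGTA", "ATA")
print(F[4][3])  # 2
```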

What order to fill?
[Figure: the table from F(0,0) to F(M,N); cell F(i, j) is computed from F(i-1, j-1), F(i-1, j), and F(i, j-1), which must be filled in first.]

Example
x = AGTA, y = ATA; m = 1, s = 1, d = 1

Start with the boundary conditions:

| F(i, j) | i = 0 | 1: A | 2: G | 3: T | 4: A |
|---|---|---|---|---|---|
| j = 0 | 0 | -1 | -2 | -3 | -4 |
| 1: A | -1 | | | | |
| 2: T | -2 | | | | |
| 3: A | -3 | | | | |

Then fill in the rows j = 1, 2, 3 one at a time:

| F(i, j) | i = 0 | 1: A | 2: G | 3: T | 4: A |
|---|---|---|---|---|---|
| j = 0 | 0 | -1 | -2 | -3 | -4 |
| 1: A | -1 | 1 | 0 | -1 | -2 |
| 2: T | -2 | 0 | 0 | 1 | 0 |
| 3: A | -3 | -1 | -1 | 0 | 2 |

Optimal alignment: F(4,3) = 2
This only tells us the best score

Trace-back
x = AGTA, y = ATA; m = 1, s = 1, d = 1

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$

Starting from F(4,3), find at each step which case produced the value, and build the alignment from right to left:
1. F(4,3) = 2 = F(3,2) + σ(A, A): align A with A. Alignment so far: A over A
2. F(3,2) = 1 = F(2,1) + σ(T, T): align T with T. Alignment so far: TA over TA
3. F(2,1) = 0 = F(1,1) - d: align G with a gap. Alignment so far: GTA over -TA
4. F(1,1) = 1 = F(0,0) + σ(A, A): align A with A. Alignment: AGTA over A-TA

Optimal alignment: F(4,3) = 2
AGTA
A-TA
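A trace-back that stores no pointers can re-derive each step by testing which case of the recurrence explains the current cell. The sketch below is one way to do that; the function name `nw_align` is hypothetical, and when several cases tie it arbitrarily prefers the diagonal, so it returns one optimal alignment out of possibly many.

```python
def nw_align(x, y, m=1, s=1, d=1):
    """Global alignment by filling F, then tracing back by re-checking the recurrence."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma, F[i - 1][j] - d, F[i][j - 1] - d)

    ax, ay = [], []          # aligned strings, built right to left
    i, j = M, N
    while i > 0 or j > 0:
        sigma = m if i > 0 and j > 0 and x[i - 1] == y[j - 1] else -s
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sigma:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1      # x_i against a gap
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1      # y_j against a gap
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

print(nw_align("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```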

Using trace-back pointers
x = AGTA, y = ATA; m = 1, s = 1, d = 1
[Figure: the same table filled in step by step, with each cell also storing a pointer to the neighboring cell that produced its value; following the pointers from F(4,3) back to F(0,0) gives the alignment.]
Optimal alignment: F(4,3) = 2
AGTA
A-TA

The Needleman-Wunsch Algorithm
1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = -j × d
   c. F(i, 0) = -i × d
2. Main iteration. Filling in scores.
   a. For each i = 1……M, for each j = 1……N:

$$F(i, j) = \max \begin{cases} F(i-1, j) - d & \text{[case 1]} \\ F(i, j-1) - d & \text{[case 2]} \\ F(i-1, j-1) + \sigma(x_i, y_j) & \text{[case 3]} \end{cases}$$

$$Ptr(i, j) = \begin{cases} \text{UP}, & \text{if [case 1]} \\ \text{LEFT}, & \text{if [case 2]} \\ \text{DIAG}, & \text{if [case 3]} \end{cases}$$

3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back an optimal alignment.
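The three steps can be put together as follows. This is a sketch under stated assumptions rather than a reference implementation: the function name `needleman_wunsch` is hypothetical, pointers are stored as the strings "UP", "LEFT", and "DIAG" with i treated as the row index, boundary cells are given pointers so the trace-back can run all the way to (0, 0), and ties are broken in favor of DIAG.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment with explicit trace-back pointers."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # 1. Initialization.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    # 2. Main iteration: fill in scores and pointers.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            candidates = [(F[i - 1][j - 1] + sigma, "DIAG"),
                          (F[i - 1][j] - d, "UP"),
                          (F[i][j - 1] - d, "LEFT")]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # 3. Termination: follow the pointers from (M, N) back to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```

Storing pointers costs another M × N table but makes the trace-back a simple walk with no score comparisons.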

Performance Time: O(NM) Space: O(NM) Later we will cover more efficient methods

Equivalent graph problem
[Figure: a grid graph from (0,0) to (3,4), with S1 and S2 written along its two sides.]
- Edges: a step along one axis is a gap in the 2nd sequence, a step along the other axis is a gap in the 1st sequence, and a diagonal step is a match / mismatch
- Value on vertical/horizontal line: -d
- Value on diagonal: m or -s
- Number of steps: length of the alignment
- Path length: alignment score
- Optimal alignment: find the longest path from (0, 0) to (3, 4)
- The general longest path problem cannot be solved with DP. The longest path on this graph can be found by DP since no cycle is possible.

Question
If we change the scoring scheme, will the optimal alignment change?
- Old: match = 1, mismatch = gap = -1
- New: match = 2, mismatch = gap = 0
- New: match = 2, mismatch = gap = -2?

Question
What kind of alignment is represented by these paths?
[Figure: five paths through the alignment graph of A versus BC.]
The corresponding alignments: A- over BC, A-- over -BC, --A over BC-, -A- over B-C, -A over BC
Alternating gaps are impossible if -s > -2d

A variant of the basic algorithm
Scoring scheme: m = s = d = 1

    Seq1: CAGCA-CTTGGATTCTCGG
             || |:|||
    Seq2: ---CAGCGTGG--------

Score = -7

    Seq1: CAGCACTTGGATTCTCGG
          ||||     | |    ||
    Seq2: CAGC-----G-T----GG

Score = -2

The first alignment may be biologically more realistic

A variant of the basic algorithm
Maybe it is OK to have an unlimited # of gaps in the beginning and end:

    ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
    GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

Then, we don't want to penalize gaps at the ends

The Overlap Detection variant
Changes:
1. Initialization: for all i, j, F(i, 0) = 0 and F(0, j) = 0
2. Termination:

$$F_{OPT} = \max \begin{cases} \max_i F(i, N) \\ \max_j F(M, j) \end{cases}$$

[Figure: the table with $x_1 \ldots x_M$ along one side and $y_1 \ldots y_N$ along the other.]
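Only the initialization and the termination differ from the global algorithm, which a short sketch makes concrete. The function name `overlap_score` and the example strings are assumptions made for illustration; the routine returns the score only.

```python
def overlap_score(x, y, m=1, s=1, d=1):
    """Best alignment score when gaps at either end of x or y are free."""
    M, N = len(x), len(y)
    # Initialization: F(i, 0) = F(0, j) = 0, so leading gaps cost nothing.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma, F[i - 1][j] - d, F[i][j - 1] - d)
    # Termination: best value in the last column or last row (free trailing gaps).
    return max(max(F[i][N] for i in range(M + 1)),
               max(F[M][j] for j in range(N + 1)))

# The suffix TCAC of the first string overlaps the prefix TCAC of the second.
print(overlap_score("CTATCAC", "TCACGG"))  # 4
```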

Different types of overlaps
[Figure: different ways in which x and y can overlap.]

A non-bio variant
- Shell command diff: compare two text files
  - Given file1 and file2
  - Find the difference between file1 and file2
  - Similar to sequence alignment
  - How to score?
- Longest common subsequence (LCS)
  - Match has score 1
  - No mismatch penalty
  - No gap penalty

File1: A B C D E F
File2: G B C E F

Aligned:
File1: A B C D E F
File2: G B C - E F

    $ diff file1 file2
    1c1
    < A
    ---
    > G
    4c4
    < D
    ---
    > -

LCS = 4

The LCS variant
Changes:
1. Initialization: for all i, j, F(i, 0) = F(0, j) = 0
2. Filling in table:

$$F(i, j) = \max \begin{cases} F(i-1, j) \\ F(i, j-1) \\ F(i-1, j-1) + \sigma(x_i, y_j) \end{cases}$$

where $\sigma(x_i, y_j) = 1$ if $x_i = y_j$ and 0 otherwise.
3. Termination:

$$F_{OPT} = \max \begin{cases} \max_i F(i, N) \\ \max_j F(M, j) \end{cases}$$
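The LCS recurrence can be sketched as below. The function name `lcs_length` is hypothetical, and the inputs may be strings or lists of lines. The sketch returns F(M, N) directly: with no penalties, F never decreases as i or j grows, so the maximum over the last row and column is attained at F(M, N).

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    M, N = len(a), len(b)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = 1 if a[i - 1] == b[j - 1] else 0
            F[i][j] = max(F[i - 1][j], F[i][j - 1], F[i - 1][j - 1] + sigma)
    return F[M][N]

file1 = ["A", "B", "C", "D", "E", "F"]
file2 = ["G", "B", "C", "E", "F"]
print(lcs_length(file1, file2))  # 4
```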

More efficient algorithms
- What happens if you have 1 million lines of text in each file?
  - The O(mn) algorithm is too inefficient
  - Memory inefficient: 1 TB of memory to store the matrix
- Bounded DP
  - Maybe the majority of the two files are the same? (e.g., two versions of a software program)
- Linear-space algorithm
  - Same time complexity
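One step toward linear space is to notice that each row of the table depends only on the row before it. The sketch below (function name `nw_score_linear_space` is hypothetical) keeps two rows and computes the optimal global score in O(N) memory with the same O(MN) time. It returns the score only; recovering the alignment itself in linear space needs an additional divide-and-conquer step, as in Hirschberg's algorithm, which is not shown here.

```python
def nw_score_linear_space(x, y, m=1, s=1, d=1):
    """Optimal global alignment score using two rows of the DP table."""
    N = len(y)
    prev = [-j * d for j in range(N + 1)]        # row i = 0
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * N                 # F(i, 0) = -i * d
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev[j - 1] + sigma,    # diagonal
                         prev[j] - d,            # x_i against a gap
                         cur[j - 1] - d)         # y_j against a gap
        prev = cur
    return prev[N]

print(nw_score_linear_space("AGTA", "ATA"))  # 2
```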
