Dynamic Programming-- Longest Common Subsequence

Dynamic Programming-- Longest Common Subsequence

Subsequences A subsequence of a character string x0x1x2…xn-1 is a string of the form xi1xi2…xik, where ij < ij+1. Not the same as substring! Gaps can exist in subsequences Example String: ABCDEFGHIJK Subsequence: ACEGIJK Subsequence: DFGHK Not subsequence: DAGH Dynamic Programming

The Longest Common Subsequence (LCS) Problem
Given two strings X and Y find a longest subsequence common to both X and Y Has applications to DNA similarity testing (alphabet is {A,C,G,T}) disease identification, parental tests Example: ABCDEFG and XZACKDFWGH have ACDFG as a longest common subsequence Dynamic Programming

A Poor Approach to the LCS Problem
A Brute-force solution: Enumerate all subsequences of X Test which ones are also subsequences of Y Pick the longest one. Analysis: If X is of length n, then it has 2n subsequences Hint: a char can be presence/absent (binary) exponential-time algorithm! Dynamic Programming

Observations Can be viewed as string editing
Transform one string to another by keeping/adding/deleting characters Can also be viewed as aligning two strings Any ideas? -- T G C A

Observation: Pairs of indices
1 2 1 2 3 3 4 4 5 5 6 7 6 8 7 i index: elements of X -- T G C A T -- A -- C elements of Y A T -- C -- T G A T C j index:

1 2 1 2 3 3 4 4 5 5 6 7 6 8 7 i index: elements of X -- T G C A T -- A -- C elements of Y A T -- C -- T G A T C j index: (0,0) (0,1) (1,2) (2,2) (3,3) (4,3) (5,4) (5,5) (6,6) (6,7) (7,8) Matches shown in green

1 2 1 2 3 3 4 4 5 5 6 7 6 8 7 i index: elements of X -- T G C A T -- A -- C elements of Y A T -- C -- T G A T C j index: (0,0) (0,1) (1,2) (2,2) (3,3) (4,3) (5,4) (5,5) (6,6) (6,7) (7,8) Matches shown in green Pairs of indices correspond to a path in a 2-D grid

Edit Graph for LCS Problem
j 1 2 3 4 5 6 7 8 i T 1 G 2 C 3 A 4 T 5 A 6 C 7 (0,0) (0,1) (1,2) (2,2) (3,3) (4,3) (5,4) (5,5) (6,6) (6,7) (7,8)

1 2 1 2 3 3 4 4 5 5 6 7 6 8 7 i index: elements of X -- T G C A T -- A -- C elements of Y A T -- C -- T G A T C j index: Skipping a letter in Y (increment index j) (0,0) to (0,1) (5,4) to (5,5) (6,6) to (6,7)

Going right Increment the column index But not the row index Ie, skipping a col char T G C A 1 2 3 4 5 6 7 i 8 j

1 2 1 2 3 3 4 4 5 5 6 7 6 8 7 i index: elements of X -- T G C A T -- A -- C elements of Y A T -- C -- T G A T C j index: Skipping a letter in X (increment index i) (1,2) to (2,2) (3,3) to (4,3)

1 2 3 4 5 6 7 i 8 j Going down Increment the row index But not the column index Ie, skipping a row char

1 2 1 2 3 3 4 4 5 5 6 7 6 8 7 i index: elements of X -- T G C A T -- A -- C elements of Y A T -- C -- T G A T C j index: Matching a letter in X and Y (increment indices i and j) (0,1) to (1,2) (2,2) to (3,3) (4,3) to (5,4) …

1 2 3 4 5 6 7 i 8 j Going diagonal Increment the row index And the column index Ie, matching char

1 2 3 4 5 6 7 i 8 j Brown Path Skip A (column) Match T Skip G (row) Match C Skip A (row) ? 1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 i 8 j Brown Path Skip A (column) Match T Skip G (row) Match C Skip A (row) Skip G (column) Match A Skip T (column) 1 2 3 4 5 6 7 8 9 10

j 1 2 3 4 5 6 7 8 i Always can skip a row char (down edge) Always can skip a col char (right edge) Additional diagonal edges (matching) T 1 G 2 C 3 A 4 T 5 A 6 C 7

j 1 2 3 4 5 6 7 8 Every path is a common subsequence. Every diagonal edge adds an extra element to common subsequence LCS Problem: Find a path with the maximum number of diagonal edges i T 1 G 2 C 3 A 4 T 5 A 6 C 7

Let Li,j = length of common subsequence up to X[i] and Y[j]
Computing LCS Let Li,j = length of common subsequence up to X[i] and Y[j] i-1,j i-1,j -1 1 i,j -1 i,j Li-1,j skip char at Xi Li,j = MAX Li,j skip char at Yj Li-1,j , if Xi = Yj , lengthen the match

LCS length L(Xi,Yj) is computed by:
Computing LCS LCS length L(Xi,Yj) is computed by: Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj

Example for a mismatch LCS length L(Xi,Yj) is computed by: Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj … A 1 C ?

Example for a mismatch LCS length L(Xi,Yj) is computed by: Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj … A 1 C

Example for a match LCS length L(Xi,Yj) is computed by: Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj … A 1 ?

Example for a match LCS length L(Xi,Yj) is computed by: Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj … A 1

Dynamic Programming Example
T C X Y 1 2 3 4 5 6 7 For simplicity: Table index starts from zero String index starts from 1 Table index 1 corresponds to string index 1 as shown Initialize row 0 and col 0 to be zeros in the table A T G 1 2 3 4 5 6 7

T C X Y 1 2 3 4 5 6 7 A T G 1 1 2 Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj 3 4 5 6 7

Dynamic Programming: Backtracing
Arrows show where the score originated from. if from the left , skip Xi if from the top, skip Yj if Xi = Yj

T C X Y 1 2 3 4 5 6 7 A T G 1 1 1 2 Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj 3 4 5 6 7

T C X Y 1 2 3 4 5 6 7 A T G 1 1 1 1 2 Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj 3 4 5 6 7

T C X Y 1 2 3 4 5 6 7 A T G 1 1 1 1 1 2 Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj 3 4 5 6 7

T C X Y 1 2 3 4 5 6 7 A T G 1 1 1 1 1 1 2 Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj 3 4 5 6 7

T C X Y 1 2 3 4 5 6 7 A T G 1 1 1 1 1 1 1 2 Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj 3 4 5 6 7

T C X Y 1 2 3 4 5 6 7 A T G 1 1 1 1 1 1 1 1 2 Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj 3 4 5 6 7

T C X Y 1 2 3 4 5 6 7 1 A T G 1 2 Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj 3 4 5 6 7

T C X 1 2 3 4 5 6 7 Y A T G 1 1 1 1 1 1 1 1 Li, j = max Li-1, j skip Xi Li, j skip Yj Li-1, j if Xi = Yj 2 1 2 2 2 2 2 2 3 1 2 4 1 2 5 1 2 6 1 2 1 7 2

T C X 1 2 3 4 5 6 7 A T G Y Continuing with the dynamic programming algorithm gives this result. 1 1 1 1 1 1 1 1 2 1 2 2 2 2 2 2 3 1 2 2 3 3 3 3 4 1 2 2 3 4 4 4 5 1 2 2 3 4 4 4 6 1 2 2 3 4 5 5 1 7 2 2 3 4 5 5

LCS Algorithm { LCS(X,Y) for i  1 to n // zeros in column 0 Li,0  0
for j  1 to m // zeros in row 0 L0,j  0 for i  1 to n // assume string index starts from 1 for j  1 to m Li-1,j Li,j  max Li,j-1 Li-1,j-1 + 1, if Xi = Yj return (L) {

Length but not sequence yet
1 2 3 4 5 6 7 G A T C X Y LCS(X,Y) creates the alignment grid Now we need a way to read the best alignment of X and Y Follow the arrows backwards from sink

LCS: Backtracing without “arrows”
L i-1, j-1 L i-1, j L i, j-1 L i, j 5 6 D Pair of chars are not the same and L i-1, j >= L i, j-1 From which direction did we get L i, j? Dynamic Programming

LCS: Backtracing without “arrows”
getLCS(X,Y,L) i = m // assume string index from 1 to m (or n) j = n while (Lij > 0) if (Xi = Yj) // from diagonal S.insertFront(Xi ), decrement i, decrement j else if (Li-1,j >= Li,j-1) // cell above is larger decrement i // from above else decrement j // from left return S

Note on book’s implementation
G A T C X Y 1 2 3 4 5 6 7 For simplicity (slides): Table index starts from zero String index starts from 1 Table index 1 corresponds to string index 1 as shown Book: String index starts from zero Table index 1 corresponds to string index 0 A T G 1 2 3 4 5 6 7

Time Complexity Length of two strings: n and m ?

Time Complexity Length of two strings: n and m
O(nm) to build the table O(m + n) to backtrace O(nm)

Dynamic Programming Well, sounds like what technique we learned?
Algorithm for LCS is an example Key observations: Problem can be decomposed into smaller subproblems (“decomposition”) Solution for a problem can be composed from solutions for subproblems (“composition”) Smallest problems with known solutions (“base case”) Well, sounds like what technique we learned? Dynamic Programming

Dynamic Programming Well, sounds like recursion will do as well
Algorithm for LCS is an example Key observations: Problem can be decompose into smaller subproblems (“decomposition”) Solution for a problem can be composed from solutions for subproblems (“composition”) Smallest problems with known solutions (“base case”) Well, sounds like recursion will do as well Yes, but the subproblems are repeated Dynamic Programing remembers the solutions for subproblems Ie, reuse the solutions -> saving time Dynamic Programming

Reusing solutions I needs E, F, H F needs B, C, E H needs D, E, G
A B C D E F G H I I needs E, F, H F needs B, C, E H needs D, E, G Recursion E is a subproblem of I, F, and H E would be calculated 3 times Dynamic programming E would be calculated only once!! Dynamic Programming

The General Dynamic Programming Technique
Simple subproblems (decomposition): can be defined in terms of a few variables Subproblem optimality (composition): the global optimum value can be defined in terms of optimal subproblems Subproblem overlap (repeated) Construct solutions bottom-up (starting with “base cases”) Remembering solutions for subproblems Dynamic Programming

Skipping the rest Dynamic Programming

Another example of Dynamic Programming

Matrix Chain-Products
Dynamic Programming can be used. Review: Matrix Multiplication. C = A*B A is d × e and B is e × f O(def ) time A C B d f e i j i,j Dynamic Programming

Matrix Chain-Products
Compute A=A0*A1*…*An-1 Ai is di × di+1 Problem: How to parenthesize? Example B is 3 × 100 C is 100 × 5 D is 5 × 5 (B*C)*D takes = 1575 ops B*(C*D) takes = 4000 ops Dynamic Programming

A Brute-force Approach
Algorithm Try all possible ways to parenthesize A=A0*A1*…*An-1 Calculate number of ops for each one Pick the one that is best Running time: The number of paranethesizations is equal to the number of binary trees with n nodes This is exponential! It is called the Catalan number, and it is almost 4n. This is a terrible algorithm! Dynamic Programming

A Greedy Approach Idea #1: repeatedly select the product that uses the most operations. Counter-example: A is 10 × 5 B is 5 × 10 C is 10 × 5 D is 5 × 10 Greedy idea #1 gives (A*B)*(C*D), which takes = 2000 ops A*((B*C)*D) takes = 1000 ops Dynamic Programming

Another Greedy Approach
Idea #2: repeatedly select the product that uses the fewest operations. Counter-example: A is 101 × 11 B is 11 × 9 C is 9 × 100 D is 100 × 99 Greedy idea #2 gives A*((B*C)*D)), which takes = ops (A*B)*(C*D) takes = ops The greedy approach is not giving us the optimal value. Dynamic Programming

A “Recursive” Approach
Define subproblems: Find the best parenthesization of Ai*Ai+1*…*Aj. Let Ni,j denote the number of operations done by this subproblem. The optimal solution for the whole problem is N0,n-1. Dynamic Programming

Subproblem optimality: The optimal solution can be defined in terms of optimal subproblems: There has to be a final multiplication (root of the expression tree) for the optimal solution. final multiply is at index i: (A0*…*Ai)*(Ai+1*…*An-1). Dynamic Programming

Subproblem optimality: The optimal solution can be defined in terms of optimal subproblems: There has to be a final multiplication (root of the expression tree) for the optimal solution. final multiply is at index i: (A0*…*Ai)*(Ai+1*…*An-1). Then the optimal solution N0,n-1 is the sum of two optimal subproblems, N0,i and Ni+1,n-1 plus the time for the last multiply. Dynamic Programming

A Characterizing Equation
The global optimal has to be defined in terms of optimal subproblems Consider all possible places (k) for that final multiply: Recall that Ai is a di × di+1 dimensional matrix. a characterizing equation for Ni,j is: Note that subproblems are not independent--the subproblems overlap: Dynamic Programming! Operations in the left product Operations in the right product Operations in multiplying the 2 products Dynamic Programming

Example Matrices A is 10 × 5 B is 5 × 10 C is 10 × 5 D is 5 × 10
Dimensions: d0=10 d1=5 d2=10 d3=5 d4=10 N 1 2 3 Dynamic Programming

AB Matrices A is 10 × 5 B is 5 × 10 C is 10 × 5 D is 5 × 10
Dimensions: d0=10 d1=5 d2=10 d3=5 d4=10 N 1 2 3 500 Dynamic Programming

BC Matrices A is 10 × 5 B is 5 × 10 C is 10 × 5 D is 5 × 10
Dimensions: d0=10 d1=5 d2=10 d3=5 d4=10 N 1 2 3 500 250 Dynamic Programming

CD Matrices A is 10 × 5 B is 5 × 10 C is 10 × 5 D is 5 × 10
Dimensions: d0=10 d1=5 d2=10 d3=5 d4=10 N 1 2 3 500 250 Dynamic Programming

ABC: A(BC) vs (AB)C Matrices A is 10 × 5 B is 5 × 10 C is 10 × 5
D is 5 × 10 N[0,2] A(BC) N[0,0] + N[1,2] + 10*5*5 = 500 N 1 2 3 500 250 Dynamic Programming

ABC: A(BC) vs (AB)C Matrices A is 10 × 5 B is 5 × 10 C is 10 × 5
D is 5 × 10 N[0,2] A(BC) N[0,0] + N[1,2] + 10*5*5 = 500 N[0,2] (AB)C N[0,1] + N[2,2] + 10*10*5 = 1000 N 1 2 3 500 (k=0) 250 Dynamic Programming

BCD: B(CD) vs (BC)D Matrices A is 10 × 5 B is 5 × 10 C is 10 × 5
D is 5 × 10 N[1,3] B(CD) N[1,3] (BC)D N 1 2 3 500 (k=0) 250 ? (k=?) Dynamic Programming

BCD: B(CD) vs (BC)D Matrices A is 10 × 5 B is 5 × 10 C is 10 × 5
D is 5 × 10 N[1,3] B(CD) N[1,1] + N[2,3] + 5*10*10 = 1000 N[1,3] (BC)D N[1,2]+ N[3,3] + 5*5*10 = 500 N 1 2 3 500 (k=0) 250 (k=2) Dynamic Programming

ABCD: A(BCD) vs (AB)(CD) vs (ABC)D
Matrices A is 10 × 5 B is 5 × 10 C is 10 × 5 D is 5 × 10 N[0,3] A(BCD) N[0,0] + N[1,3] + 10*5*10 = 1000 N[0,3] (AB)(CD) N[0,1] + N[2,3] + 10*10*10 =2000 N[0,3] (ABC)D N[0,2]+ N[3,3] + 10*5*10 =1000 N 1 2 3 500 (k=0) 1000 250 (k=2) Dynamic Programming

ABCD Matrices A is 10 × 5 B is 5 × 10 C is 10 × 5 D is 5 × 10
N[0,3] --- ABCD k=0 A(BCD) N[1,3] BCD k=2 (BC)D A((BC)D) N 1 2 3 500 (k=0) 1000 250 (k=2) Dynamic Programming

Time Complexity answer N
Filling in each entry in the N table takes O(n) time. O(n2) entries Building the table: O(n3) O(n) for finding the parenthesis locations Total: O(n3) answer N 1 2 … n-1 j i Dynamic Programming

Keeping track of indices
Initialize diagonal i,j or i,i j=0 j=1 j=2 j=3 i=0 0,0 i=1 1,1 i=2 2,2 i=3 3,3 Dynamic Programming

b = # of products in subchain 1 2 matrices => 1 product (or length of subchain – 1) i = 0 to 2 in this example Generally 0 to n-b-1 [4-1-1] n-b combinations -1 for zero-based index j = 1 to 3 in this example Generally, j = i+b j=0 j=1 j=2 j=3 i=0 0,1 i=1 1,2 i=2 2,3 i=3 ABCD: AB, BC, CD Dynamic Programming

b = # of products in subchain 2 3 matrices => 2 products (or length of subchain – 1) i = 0 to 1 in this example 0 to n-b-1 [4-2-1] j = 2 to 3 in this example j = i+b j=0 j=1 j=2 j=3 i=0 0,2 i=1 1,3 i=2 i=3 ABCD: ABC, BCD Dynamic Programming

b = # of products in subchain 3 4 matrices => 3 products (or length of subchain – 1) i = 0 to 0 in this example 0 to n-b-1 [4-3-1] j = 3 to 3 in this example j = i+b j=0 j=1 j=2 j=3 i=0 0,3 i=1 i=2 i=3 ABCD: ABCD Dynamic Programming

A Dynamic Programming Algorithm
Algorithm matrixChain(S): Input: sequence S of n matrices to be multiplied Output: number of operations in an optimal paranethization of S for i  0 to n-1 do Ni,i  // diagonal entries for b  1 to n-1 do // number of products in subchain for i  0 to n-b-1 do // start of subchain (first matrix) j  i+b // end of subchain (last matrix) Ni,j  +infinity // initial value for min for k  i to j-1 do Ni,j  min{Ni,j , Ni,k +Nk+1,j +di dk+1 dj+1} Dynamic Programming

The General Dynamic Programming Technique
Applies to a problem that at first seems to require a lot of time (possibly exponential), provided we have: Simple subproblems (decomposition): the subproblems can be defined in terms of a few variables, such as j, k, l, m, and so on. Subproblem optimality (composition): the global optimum value can be defined in terms of optimal subproblems Subproblem overlap: the subproblems are not independent, but instead they overlap (hence, should be constructed bottom-up (base case)). Dynamic Programming

Dynamic Programming-- Longest Common Subsequence

Similar presentations

Presentation on theme: "Dynamic Programming-- Longest Common Subsequence"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Dynamic Programming-- Longest Common Subsequence

Similar presentations

Presentation on theme: "Dynamic Programming-- Longest Common Subsequence"— Presentation transcript:

Similar presentations

About project

Feedback