CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment.

CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment

Roadmap Review of last lecture –Biology –Dynamic programming Sequence alignment

PolymerMonomer DNADeoxyribonucleotides RNARibonucleotides ProteinAmino Acid

Carboxyl group Amino group Protein zoom-in Side chain R H2N RRRRR COOH N-terminal C-terminal … Composed of a chain of amino acids. R | H 2 N--C--COOH | H

Genome, Chromosome, Gene

DNA Replication The process of copying a double-stranded DNA molecule –Semi-conservative 5’-ACATGATAA-3’ 3’-TGTACTAT-5’  5’-ACATGATAA-3’ 3’-TGTACTATT-5’

Transcription (where genetic information is stored) (for making mRNA) Coding strand: 5’-ACGTAGACGTATAGAGCCTAG-3’ Template strand: 3’-TGCATCTGCATATCTCGGATC-5’ mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’ Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA. DNA-RNA pair: A=U, C=G T=A, G=C

The Genetic Code Third letter

Translation The sequence of codons is translated to a sequence of amino acids Gene: -GCT TGT TTA CGA ATT- mRNA: -GCU UGU UUA CGA AUU - Peptide: - Alu - Cys - Leu - Arg - Ile – Start codon: AUG –Also code Met –Stop codon: UGA, UAA, UAA

Dynamic programming What is dynamic programming? –Solve an optimization problem by tabulating sub-problem solutions (memorization) rather than re-computing them

Elements of dynamic programming Optimal sub-structures –Optimal solutions to the original problem contains optimal solutions to sub-problems –Solutions to sub-problems are independent Overlapping sub-problems –Some sub-problems appear in many solutions –We should not solve each sub-problem for more than once Memorization and reuse –Carefully choose the order that sub-problems are solved –Tabulate the solutions –Bottom-up

Example Find the shortest path in a grid s g 1 2 3 1 1 5 2 2 31 13 3 3 3 2 2 2 4 112 (0,0) (3,3)

Optimal substructure If a path P(s, g) is optimal, any sub-path, P(s,x), where x is on P(s,g), is also optimal Proof by contradiction –If the path between P(s,x) is not the shortest, i.e., P’(s,x) < P(s,x) –Construct a new path P’(s,g) = P’(s,x) + P(x, g) –P’(s,g) P(s,g) is not the shortest –Contradiction

Overlapping sub-problems Some sub-problems are used by many paths (0,0) -> (2,0) used by 3 paths

Memorization and reuse Easy to tabulate and reuse –Number of sub-problems ~ number of nodes –P(s, x), for x in all nodes except s and g Find an order such that no sub-problems need to be recomputed –First compute the smallest sub-problems –Use solutions of small sub-problems to solve large sub-problems

0 1 2 3 1 1 5 2 2 31 13 3 3 3 2 2 2 4 112 Example: shortest path

02 1 56 4 5 1 2 3 1 1 5 2 2 31 13 3 3 3 2 2 2 4 112

02 1 56 2 4 5 1 2 3 1 1 5 2 2 31 13 3 3 3 2 2 2 4 112

02 1 56 23 4 5 1 2 3 1 1 5 2 2 31 13 3 3 3 2 2 2 4 112

02 1 56 236 4 5 1 2 3 1 1 5 2 2 31 13 3 3 3 2 2 2 4 112

02 1 56 236 44 5 1 2 3 1 1 5 2 2 31 13 3 3 3 2 2 2 4 112

02 1 56 236 446 5 1 2 3 1 1 5 2 2 31 13 3 3 3 2 2 2 4 112

02 1 56 236 4468 5 1 2 3 1 1 5 2 2 31 13 3 3 3 2 2 2 4 112

02 1 56 236 4468 55 1 2 3 1 1 5 2 2 31 13 3 3 3 2 2 2 4 112

02 1 56 236 4468 557 1 2 3 1 1 5 2 2 31 13 3 3 3 2 2 2 4 112

02 1 56 236 4468 55710 1 2 3 1 1 5 2 2 31 13 3 3 3 2 2 2 4 112 Example: shortest path

Analysis For a nxn grid Enumeration: –number of paths = (2n!)/(n!)^2 –Each path has 2n steps –Total operation: 2n * (2n!) / (n!)^2 = O(2^(2n)) Recursive call: O(2^(2n)) DP: O(n^2)

EnumerationRecursionDP N=31206824 N=52,5201,03260 N=103,695,1201,048,576420

Example: Fibonacci Seq F(n) = F(n-1) + F(n-2), F(0) = F(1) = 1 Function fib(n) if (n == 0 or n == 1) return 1; else return fib(n-1) + fib(n-2);

Time complexity: O(1.62^n)

Example: Fibonacci Seq function fib(n) F[0] = 1;F[1] = 1; For i = 2 to n F[n] = F[n-1] + F[n-2]; End Return F[n];

11235813213455 Time: O(n), space: O(n)

What if it is not so easy to figure out an order to fill in the table? Exercise

Today’s lecture Sequence alignment –Global alignment

Why seq alignment? Similar sequences often have similar origin or function –Two genes are said to be homologous if they share a common evolutionary history. –Evolutionary history can tell us a lot about properties of a given gene –Homology can be inferred from similarity between the genes New protein sequences are always compared to sequence databases to search for proteins with same or similar functions Most widely used computational tools in biology

Evolution at the DNA level …ACGGTGCAGTCACCA… …ACGTTGC-GTCCACCA… C Sequence edits: Mutation, deletion, insertion

Evolutionary Rates OK X X Still OK? next generation

Sequence conservation implies function

Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition An alignment of two string S, T is a pair of strings S ’, T ’ (with spaces) s.t. (1) |S ’ | = |T ’ |, and (|S| = “ length of S ” ) (2) removing all spaces in S ’, T ’ leaves S, T AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

What is a good alignment? Alignment: The “ best ” way to match the letters of one sequence with those of the other How do we define “ best ” ?

The score of aligning (characters or spaces) x & y is σ (x,y). Score of an alignment: An optimal alignment: one with max score S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Scoring Function Sequence edits: AGGCCTC –Mutations AGGACTC –InsertionsAGGGCCTC –DeletionsAGG-CTC Scoring Function: Match: +m~~~AAC~~~ Mismatch: -s~~~A-A~~~ Gap (indel):-d

More complex scoring function Substitution matrix –Similarity score of matching two letters a, b should reflect the probability of a, b derived from same ancestor –It is usually defined by log likelihood ratio (Durbin book) –Active research area. Especially for proteins. –Commonly used: PAM, BLOSUM

An example substitution matrix ACGT A3-2-2 C3 G3-2 T3

Match = 2, mismatch = -1, gap = -1 Score = 3 x 2 – 2 x 1 – 1 x 1 = 3

How to find it? A naïve algorithm: for all subseqs A of S, B of T s.t. |A| = |B| do align A[i] with B[i], 1 ≤i ≤|A| align all other chars to spaces compute its value retain the max end output the retained alignment S = abcd A = cd T = wxyz B = xz -abc-d a-bc-d w--xyz -w-xyz

Analysis Assume |S| = |T| = n Cost of evaluating one alignment: ≥n How many alignments are there: –pick n chars of S,T together –say k of them are in S –match these k to the k unpicked chars of T Total time: E.g., for n = 20, time is > 2 40 >10 12 operations

Dynamic Programming We will now describe a dynamic programming algorithm Suppose we wish to align x 1 ……x M y 1 ……y N Let F(i,j) = optimal score of aligning x 1 ……x i y 1 ……y j

Dynamic Programming (cont ’ d) Notice three possible cases: 1.x M aligns to y N ~~~~~~~ x M ~~~~~~~ y N 2.x M aligns to a gap ~~~~~~~ x M ~~~~~~~ - 3.y N aligns to a gap ~~~~~~~ - ~~~~~~~ y N m, if x M = y N F(M,N) = F(M-1, N-1) + -s, if not F(M,N) = F(M-1, N) - d F(M,N) = F(M, N-1) - d

Therefore: F(M-1, N-1) +  (X M,Y N ) F(M,N) = max F(M-1, N) – d F(M, N-1) – d  (X M,Y N ) = m if X M = Y N, and –s otherwise Each sub-problem can be solved recursively

Generalize: F(i-1, j-1) +  (X i,Y j ) F(i,j) = max F(i-1, j) – d F(i, j-1) – d Be careful with the boundary conditions

Remember: –The recursive formula is for understanding the relationship between sub-problems –We cannot afford to really solve them recursively Number of sub-problems: –Each corresponds to calculating an F(i, j) –O(MN) of them –Solve all of them

What order to fill? F(0,0) F(M,N)

F(i-1, j-1) +  (X i,Y j ) F(i, j) = max F(i-1, j) – d F(i, j-1) – d F(i, j)F(i, j-1) F(i-1, j)F(i-1, j-1) [case 1] [case 2] [case 3] 1 2 3

What order to fill? F(0,0) F(M,N)

Example x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA A T A F(i,j) i = 0 1 2 3 4 j = 0 1 2 3

Example x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A T-2 A-3 j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

Example x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A10 -2 T A-3 j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

Example x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A10 -2 T 0010 A-3 j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

Example x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A10 -2 T 0010 A-3 02 j = 0 1 2 3 Optimal Alignment: F(4,3) = 2 F(i,j) i = 0 1 2 3 4

Example x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A10 -2 T 0010 A-3 02 j = 0 1 2 3 Optimal Alignment: F(4,3) = 2 This only tells us the best score F(i,j) i = 0 1 2 3 4

Trace-back x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A10 -2 T 0010 A-3 02 j = 0 1 2 3 F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d F(i,j) i = 0 1 2 3 4

Trace-back AGTA 0-2-3-4 A10 -2 T 0010 A-3 02 F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d x = AGTAm = 1 y = ATAs = -1 d = -1 j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

Trace-back x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A10 -2 T 0010 A-3 02 j = 0 1 2 3 F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d F(i,j) i = 0 1 2 3 4

Trace-back x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A10 -2 T 0010 A-3 02 j = 0 1 2 3 Optimal Alignment: F(4,3) = 2 AGTA A  TA F(i-1, j-1) +  (Xi,Yj) F(i,j) = max F(i-1, j) – d F(i, j-1) – d F(i,j) i = 0 1 2 3 4

In some cases, trace-back may be very time consuming Alternative solution: remember where you come from! –Trade-off: more memory

Using trace-back pointers x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A T-2 A-3 j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

Using trace-back pointers x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A10 -2 T A-3 j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

Using trace-back pointers x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A10 -2 T 0010 A-3 j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

Using trace-back pointers x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A10 -2 T 0010 A-3 02 j = 0 1 2 3 F(i,j) i = 0 1 2 3 4

Using trace-back pointers x = AGTAm = 1 y = ATAs = -1 d = -1 AGTA 0-2-3-4 A10 -2 T 0010 A-3 02 j = 0 1 2 3 Optimal Alignment: F(4,3) = 2 AGTA A  TA F(i,j) i = 0 1 2 3 4

The Needleman-Wunsch Algorithm 1.Initialization. a.F(0, 0) = 0 b.F(0, j) = - j  d c.F(i, 0)= - i  d 2.Main Iteration. Filling in scores a.For each i = 1……M For each j = 1……N F(i-1,j) – d [case 1] F(i, j) = max F(i, j-1) – d [case 2] F(i-1, j-1) + σ(x i, y j ) [case 3] UP, if [case 1] Ptr(i,j)= LEFTif [case 2] DIAGif [case 3] 3.Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

Performance Time: O(NM) Space: O(NM) Later we will cover more efficient methods

A variant of the basic algorithm: Maybe it is OK to have an unlimited # of gaps in the beginning and end: ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG-------------- Then, we don ’ t want to penalize gaps in the ends

The Overlap Detection variant Changes: 1.Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2.Termination max i F(i, N) F OPT = max max j F(M, j) x 1 ……………………………… x M y N ……………………………… y 1

Different types of overlaps x y x y

A non-bio variant Shell command “diff” in unix –Given file1 and file2 –Find the difference between file1 and file2 –Similar to sequence alignment –How to score? Longest common subsequence (LCS) Match has score 1 No mismatch penalty No gap penalty

File1 A B C D E F File2 G B C E F

File1 A B C D E F File2 G B C - E F $ diff file1 file2 1c1 < A --- > G 4c4 < D --- > - LCS = 4

The LCS variant Changes: 1.Initialization For all i, j, F(i, 0) = F(0, j) = 0 2.Filling in table F(i-1,j) F(i, j) = max F(i, j-1) F(i-1, j-1) + σ(x i, y j ) where σ(x i, y j ) = 1 if x i = y j and 0 otherwise. 3.Termination max i F(i, N) F OPT = max max j F(M, j)

What happens if you have 1 million lines of text in each file? Slow –What if the majority of the two files are the same? (e.g., two versions of a software) –Bounded DP Memory inefficient –At least 1000 GB memory –Linear-space algorithm, same time complexity

See you next week

CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment.

Similar presentations

Presentation on theme: "CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment.

Similar presentations

Presentation on theme: "CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment."— Presentation transcript:

Similar presentations

About project

Feedback