Dynamic-Programming Strategies for Analyzing Biomolecular Sequences.


1 Dynamic-Programming Strategies for Analyzing Biomolecular Sequences

2 2 Dynamic Programming Dynamic programming is a class of solution methods for solving sequential decision problems with a compositional cost structure. Richard Bellman was one of the principal founders of this approach.

3 3 Two key ingredients
Two key ingredients for an optimization problem to be suitable for a dynamic-programming solution:
1. Optimal substructures: each substructure is optimal (the principle of optimality).
2. Overlapping subproblems: subproblems are dependent (otherwise, a divide-and-conquer approach is the choice).

4 4 Three basic components
The development of a dynamic-programming algorithm has three basic components:
–The recurrence relation (for defining the value of an optimal solution);
–The tabular computation (for computing the value of an optimal solution);
–The traceback (for delivering an optimal solution).

5 5 Fibonacci numbers
The Fibonacci numbers are defined by the following recurrence:
$$F_0 = 0, \qquad F_1 = 1, \qquad F_i = F_{i-1} + F_{i-2} \ \text{ for } i > 1.$$

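6 6 How to compute $F_{10}$?
[Figure: recursion tree rooted at $F_{10}$, whose children are $F_9$ and $F_8$; $F_9$ expands to $F_8$ and $F_7$, $F_8$ expands to $F_7$ and $F_6$, and so on, so the same subproblems appear repeatedly.]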

7 7 Tabular computation
The tabular computation can avoid recomputation.
i     0  1  2  3  4  5  6   7   8   9  10
F_i   0  1  1  2  3  5  8  13  21  34  55
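A minimal sketch of the tabular computation, assuming a plain Python list as the table (the function name `fib_table` is illustrative and not from the slides). Each entry is computed once from the two entries before it, so no subproblem is recomputed.

```python
def fib_table(n):
    """Return the list [F_0, F_1, ..., F_n], filled from left to right."""
    table = [0, 1]
    for i in range(2, n + 1):
        # Recurrence from the slides: F_i = F_{i-1} + F_{i-2}
        table.append(table[i - 1] + table[i - 2])
    return table[: n + 1]


if __name__ == "__main__":
    print(fib_table(10))
```

The loop performs n - 1 additions, in contrast to the exponentially many calls made by the naive recursion.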

8 8 Maximum-sum interval
Given a sequence of real numbers $a_1 a_2 \ldots a_n$, find a consecutive subsequence with the maximum sum.
9 -3 1 7 -15 2 3 -4 2 -7 6 -2 8 4 -9
For each position, we can compute the maximum-sum interval starting at that position in $O(n)$ time. Therefore, a naive algorithm runs in $O(n^2)$ time.

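9 9 O-notation: an asymptotic upper bound
$f(n) = O(g(n))$ iff there exist two positive constants $c$ and $n_0$ such that $0 \le f(n) \le c\,g(n)$ for all $n \ge n_0$.
[Figure: the curves $f(n)$ and $c\,g(n)$, with $c\,g(n)$ lying above $f(n)$ to the right of $n_0$.]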

10 10 How do functions grow? (Assume one million operations per second.)
Functions compared: $30n$, $92n \log n$, $26n^2$, $0.68n^3$, $2^n$
n = 100: 0.003 sec., 0.0026 sec., 0.68 sec., $4 \times 10^{16}$ yr.
n = 100,000: 3.0 sec., 2.6 min., 3.0 days, 22 yr.
For large data sets, algorithms with a complexity greater than $O(n \log n)$ are often impractical!

11 11 Maximum-sum interval (The recurrence relation)
Define S(i) to be the maximum sum of the intervals ending at position i.
$$S(i) = a_i + \max\{S(i-1),\ 0\}$$
If S(i-1) < 0, concatenating $a_i$ with its previous interval gives a smaller sum than $a_i$ itself.

12 12 Maximum-sum interval (Tabular computation)
a_i    9  -3   1   7  -15   2   3  -4   2  -7   6  -2   8   4  -9
S(i)   9   6   7  14   -1   2   5   1   3  -4   6   4  12  16   7
The maximum sum is 16.

13 13 Maximum-sum interval (Traceback)
a_i    9  -3   1   7  -15   2   3  -4   2  -7   6  -2   8   4  -9
S(i)   9   6   7  14   -1   2   5   1   3  -4   6   4  12  16   7
The maximum-sum interval: 6 -2 8 4
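The three components for this problem can be sketched in a few lines. This is an illustrative version, assuming a non-empty input list; the name `max_sum_interval` and the choice to remember the start index during the scan (instead of storing the whole S table and tracing back afterwards) are assumptions made here for brevity.

```python
def max_sum_interval(a):
    """Return (best_sum, start, end) so that a[start:end + 1] has maximum sum.

    s holds S(i), the maximum sum of an interval ending at position i.
    Assumes a is non-empty.
    """
    best_sum, best_start, best_end = a[0], 0, 0
    s, start = a[0], 0
    for i in range(1, len(a)):
        if s < 0:
            # S(i-1) < 0: starting fresh at a[i] beats extending the interval.
            s, start = a[i], i
        else:
            s += a[i]
        if s > best_sum:
            best_sum, best_start, best_end = s, start, i
    return best_sum, best_start, best_end


if __name__ == "__main__":
    seq = [9, -3, 1, 7, -15, 2, 3, -4, 2, -7, 6, -2, 8, 4, -9]
    total, lo, hi = max_sum_interval(seq)
    print(total, seq[lo:hi + 1])
```

The scan takes O(n) time, compared with O(n^2) for the naive algorithm.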

14 14 Longest increasing subsequence (LIS)
The longest increasing subsequence problem is to find a longest increasing subsequence of a given sequence of distinct integers $a_1 a_2 \ldots a_n$.
e.g. 9 2 5 3 7 11 8 10 13 6
2 3 7 and 5 7 10 13 are increasing subsequences.
9 7 11 and 3 5 11 13 are not increasing subsequences.
We want to find a longest one.

15 15 A naive approach for LIS
Let L[i] be the length of a longest increasing subsequence ending at position i.
$$L[i] = 1 + \max_{j = 0..i-1} \{\, L[j] \mid a_j < a_i \,\}$$
(use a dummy $a_0$ = minimum, and L[0] = 0)
a_i    9  2  5  3  7  11  8  10  13  6
L[i]   1  1  2  2  3   4  ?

16 16 A naive approach for LIS
$$L[i] = 1 + \max_{j = 0..i-1} \{\, L[j] \mid a_j < a_i \,\}$$
a_i    9  2  5  3  7  11  8  10  13  6
L[i]   1  1  2  2  3   4  4   5   6  3
The maximum length is 6. The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence. This method runs in $O(n^2)$ time.
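A sketch of the quadratic method with a traceback. The `prev` array of back-pointers and the function name `lis_quadratic` are assumptions introduced here; the slides only give the recurrence for L[i]. When several longest subsequences exist, which one is returned depends on tie-breaking.

```python
def lis_quadratic(a):
    """Return one longest increasing subsequence of a in O(n^2) time."""
    if not a:
        return []
    n = len(a)
    length = [1] * n   # length[i] plays the role of L[i]
    prev = [-1] * n    # back-pointer used by the traceback
    for i in range(n):
        for j in range(i):
            if a[j] < a[i] and length[j] + 1 > length[i]:
                length[i] = length[j] + 1
                prev[i] = j
    # Traceback: start at a position with maximum L[i], follow back-pointers.
    end = max(range(n), key=lambda i: length[i])
    result = []
    while end != -1:
        result.append(a[end])
        end = prev[end]
    return result[::-1]


if __name__ == "__main__":
    print(lis_quadratic([9, 2, 5, 3, 7, 11, 8, 10, 13, 6]))
```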

17 17 Binary search
Given an ordered sequence $x_1 x_2 \ldots x_n$, where $x_1 < x_2 < \ldots < x_n$, and a number y, a binary search finds the largest $x_i$ such that $x_i < y$ in $O(\log n)$ time.
[Figure: the search range shrinks from n to n/2 to n/4, and so on.]

18 18 Binary search
How many steps does a binary search need to reduce the problem size to 1?
n, n/2, n/4, n/8, n/16, ..., 1
$O(\log n)$ steps.
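The halving argument can be seen directly in code. Below is a hand-written sketch (the name `largest_below` is illustrative); it returns a 0-based index, or -1 when no element is smaller than y. Python's standard `bisect.bisect_left` computes the same boundary.

```python
def largest_below(xs, y):
    """Index of the largest xs[i] < y in a strictly increasing list, or -1."""
    lo, hi = 0, len(xs)
    # Invariant: every element of xs[:lo] is < y, every element of xs[hi:] is >= y.
    while lo < hi:
        mid = (lo + hi) // 2
        if xs[mid] < y:
            lo = mid + 1   # discard the left half
        else:
            hi = mid       # discard the right half
    return lo - 1


if __name__ == "__main__":
    print(largest_below([2, 3, 7, 8, 10, 13], 6))
```

Each iteration halves the range `hi - lo`, so the loop runs O(log n) times.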

19 19 An O(n log n) method for LIS
Define BestEnd[k] to be the smallest ending number of an increasing subsequence of length k.
Each column shows BestEnd after the input element at its head has been processed:
Input        9   2   5   3   7  11   8  10  13   6
BestEnd[1]   9   2   2   2   2   2   2   2   2
BestEnd[2]           5   3   3   3   3   3   3
BestEnd[3]                   7   7   7   7   7
BestEnd[4]                      11   8   8   8
BestEnd[5]                              10  10
BestEnd[6]                                  13

20 20 An O(n log n) method for LIS
Define BestEnd[k] to be the smallest ending number of an increasing subsequence of length k.
Each column shows BestEnd after the input element at its head has been processed:
Input        9   2   5   3   7  11   8  10  13   6
BestEnd[1]   9   2   2   2   2   2   2   2   2   2
BestEnd[2]           5   3   3   3   3   3   3   3
BestEnd[3]                   7   7   7   7   7   6
BestEnd[4]                      11   8   8   8   8
BestEnd[5]                              10  10  10
BestEnd[6]                                  13  13
For each position, we perform a binary search to update BestEnd. Therefore, the running time is $O(n \log n)$.
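A compact sketch of the BestEnd method, using `bisect_left` from Python's standard `bisect` module for the binary search. The 0-based list `best_end` stands in for BestEnd[1..k], and the function name `lis_length` is an assumption. This version returns only the length and the final BestEnd array; recovering an actual subsequence would need extra back-pointers.

```python
from bisect import bisect_left


def lis_length(a):
    """Return (length of an LIS, final BestEnd list) in O(n log n) time."""
    best_end = []  # best_end[k - 1] corresponds to BestEnd[k]; always increasing
    for x in a:
        # First position whose value is >= x, i.e. one past the largest value < x.
        k = bisect_left(best_end, x)
        if k == len(best_end):
            best_end.append(x)   # x extends the longest subsequence found so far
        else:
            best_end[k] = x      # x is a smaller ending for length k + 1
    return len(best_end), best_end


if __name__ == "__main__":
    print(lis_length([9, 2, 5, 3, 7, 11, 8, 10, 13, 6]))
```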

21 21 Longest Common Subsequence (LCS) A subsequence of a sequence S is obtained by deleting zero or more symbols from S. For example, the following are all subsequences of “president”: pred, sdn, predent. The longest common subsequence problem is to find a maximum-length common subsequence between two sequences.

22 22 LCS
For instance,
Sequence 1: president
Sequence 2: providence
Their LCS is priden.

23 23 LCS
Another example:
Sequence 1: algorithm
Sequence 2: alignment
One of their LCSs is algm.

24 24 How to compute LCS?
Let $A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$.
len(i, j): the length of an LCS between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$
With proper initializations, len(i, j) can be computed as follows.
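The recurrence itself did not survive in this transcript, so the sketch below assumes the textbook LCS recurrence: len(i, j) = len(i-1, j-1) + 1 when $a_i = b_j$, otherwise max(len(i-1, j), len(i, j-1)), with len(i, 0) = len(0, j) = 0. The function name `lcs` and the tie-breaking in the traceback are choices made for this example.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # length[i][j] = len(i, j); row 0 and column 0 are the initializations.
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    # Traceback from the bottom-right corner of the table.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs("president", "providence"))
    print(len(lcs("algorithm", "alignment")))
```

Filling the table takes O(mn) time and the traceback takes O(m + n).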

29 29 Dot Matrix
Sequence A: CTTAACT
Sequence B: CGGATCAT
[Figure: dot matrix with Sequence B (C G G A T C A T) across the top and Sequence A (C T T A A C T) down the side.]

30 30 Pairwise Alignment
Sequence A: CTTAACT
Sequence B: CGGATCAT
An alignment of A and B:
C---TTAACT   (Sequence A)
CGGATCA--T   (Sequence B)

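31 31 Pairwise Alignment
Sequence A: CTTAACT
Sequence B: CGGATCAT
An alignment of A and B:
C---TTAACT
CGGATCA--T
Column types: match (e.g. C over C), mismatch (T over C), insertion gap (the run --- in A opposite GGA), deletion gap (the run -- in B opposite AC).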

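32 32 Alignment Graph
Sequence A: CTTAACT
Sequence B: CGGATCAT
[Figure: alignment graph with Sequence B (C G G A T C A T) across the top and Sequence A (C T T A A C T) down the side; a path through the graph corresponds to the alignment below.]
C---TTAACT
CGGATCA--T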

33 33 A simple scoring scheme
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12
Alignment score: +12
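Scoring a given alignment is a column-by-column sum. The sketch below uses the scoring scheme from the slide (+8, -5, -3); the function names `w` and `alignment_score` and the use of "-" as the gap symbol are conventions chosen for this example.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Column score: gap symbol against a letter, match, or mismatch."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def alignment_score(row_a, row_b):
    """Sum the column scores of two equal-length alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


if __name__ == "__main__":
    print(alignment_score("C---TTAACT", "CGGATCA--T"))
```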

34 34 An optimal alignment: the alignment of maximum score
Let $A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$.
$S_{i,j}$: the score of an optimal alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$
With proper initializations, $S_{i,j}$ can be computed as follows.

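35 35 Computing $S_{i,j}$
$$S_{i,j} = \max \begin{cases} S_{i-1,j} + w(a_i, -) \\ S_{i,j-1} + w(-, b_j) \\ S_{i-1,j-1} + w(a_i, b_j) \end{cases}$$
[Figure: cell (i, j) of the matrix receives three incoming edges weighted $w(a_i, -)$, $w(-, b_j)$ and $w(a_i, b_j)$; the optimal score is $S_{m,n}$ in the bottom-right corner.]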

36 36 Initializations
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3
T   -6
T   -9
A  -12
A  -15
C  -18
T  -21

37 37 $S_{3,5}$ = ?
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    ?
A  -12
A  -15
C  -18
T  -21

38 38 $S_{3,5}$ = 5
          C    G    G    A    T    C    A    T
     0   -3   -6   -9  -12  -15  -18  -21  -24
C   -3    8    5    2   -1   -4   -7  -10  -13
T   -6    5    3    0   -3    7    4    1   -2
T   -9    2    0   -2   -5    5    2   -1    9
A  -12   -1   -3   -5    6    3    0   10    7
A  -15   -4   -6   -8    3    1   -2    8    5
C  -18   -7   -9  -11    0   -2    9    6    3
T  -21  -10  -12  -14   -3    8    6    4   14
The optimal score is $S_{7,8}$ = 14.

39 39 Traceback
[Figure: the score matrix of the previous slide with the traceback path from $S_{7,8}$ = 14 back to $S_{0,0}$ highlighted.]
 C  T  T  A  A  C  -  T
 C  G  G  A  T  C  A  T
+8 -5 -5 +8 -5 +8 -3 +8 = 14
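The recurrence, tabular computation and traceback for global alignment can be put together as follows. This is a sketch under the simple scoring scheme of the slides (+8, -5, -3 per gap symbol); the names `global_align` and `w`, and the tie-breaking order in the traceback (diagonal first, then up, then left), are assumptions of this example, so other optimal alignments may exist.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Column score under the simple scoring scheme."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def global_align(a, b):
    """Return (optimal score, aligned a, aligned b) for a global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # first column: a[:i] against gaps
        S[i][0] = S[i - 1][0] + w(a[i - 1], "-")
    for j in range(1, n + 1):            # first row: gaps against b[:j]
        S[0][j] = S[0][j - 1] + w("-", b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                S[i - 1][j] + w(a[i - 1], "-"),
                S[i][j - 1] + w("-", b[j - 1]),
            )
    # Traceback from S[m][n] to S[0][0].
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + w(a[i - 1], "-"):
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, row_a, row_b = global_align("CTTAACT", "CGGATCAT")
    print(score)
    print(row_a)
    print(row_b)
```

Both time and space are O(mn) in this form.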

40 40 Global Alignment vs. Local Alignment global alignment: local alignment:

41 41 An optimal local alignment
$S_{i,j}$: the score of an optimal local alignment ending at $a_i$ and $b_j$
With proper initializations, $S_{i,j}$ can be computed as follows.

42 42 Local alignment
Match: 8   Mismatch: -5   Gap symbol: -3
        C   G   G   A   T   C   A   T
    0   0   0   0   0   0   0   0   0
C   0   8   5   2   0   0   8   5   2
T   0   5   3   0   0   8   5   3  13
T   0   2   0   0   0   8   5   2  11
A   0   0   0   0   8   5   3   ?
A   0
C   0
T   0

43 43 Local alignment
Match: 8   Mismatch: -5   Gap symbol: -3
        C   G   G   A   T   C   A   T
    0   0   0   0   0   0   0   0   0
C   0   8   5   2   0   0   8   5   2
T   0   5   3   0   0   8   5   3  13
T   0   2   0   0   0   8   5   2  11
A   0   0   0   0   8   5   3  13  10
A   0   0   0   0   8   5   2  11   8
C   0   8   5   2   5   3  13  10   7
T   0   5   3   0   2  13  10   8  18
The best score is 18 (bottom-right cell).

44 44 Local alignment (Traceback)
[Figure: the local-alignment matrix of the previous slide with the traceback path from the best score, 18, back to a zero entry highlighted.]
 A  -  C  -  T
 A  T  C  A  T
+8 -3 +8 -3 +8 = 18
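The local-alignment recurrence is not preserved in this transcript. The sketch below assumes the Smith-Waterman form cited later in the slides: the same three cases as the global recurrence plus a fourth option, 0, with the first row and column initialized to 0. The function name `local_align` and the tie-breaking order in the traceback are choices made here.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best score, aligned piece of a, aligned piece of b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0, S[i - 1][j - 1] + sub, S[i - 1][j] + gap, S[i][j - 1] + gap)
            if S[i][j] > best:
                best, bi, bj = S[i][j], i, j
    # Traceback from the best cell until a zero entry is reached.
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(local_align("CTTAACT", "CGGATCAT"))
```

Resetting negative prefixes to 0 is the same idea as in the maximum-sum interval recurrence: a prefix with negative score is never worth keeping.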

45 45 Affine gap penalties
Match: +8 (w(x, y) = 8, if x = y)
Mismatch: -5 (w(x, y) = -5, if x ≠ y)
Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
Each gap is charged an extra gap-open penalty: -4.
 C  -  -  -  T  T  A  A  C  T
 C  G  G  A  T  C  A  -  -  T
+8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12
The alignment contains two gaps, each charged -4, so the alignment score is 12 - 4 - 4 = 4.

46 46 Affine gap penalties
A gap of length k is penalized x + k·y, where x is the gap-open penalty and y is the gap-symbol penalty.
Three cases for alignment endings:
1. ...x over ...x : an aligned pair
2. ...x over ...- : a deletion
3. ...- over ...x : an insertion

47 47 Affine gap penalties
Let D(i, j) denote the maximum score of any alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$ ending with a deletion.
Let I(i, j) denote the maximum score of any alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$ ending with an insertion.
Let S(i, j) denote the maximum score of any alignment between $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.

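48 48 Affine gap penalties
$$D(i, j) = \max\{\, D(i-1, j) - y,\ S(i-1, j) - x - y \,\}$$
$$I(i, j) = \max\{\, I(i, j-1) - y,\ S(i, j-1) - x - y \,\}$$
$$S(i, j) = \max\{\, S(i-1, j-1) + w(a_i, b_j),\ D(i, j),\ I(i, j) \,\}$$

49 49 Affine gap penalties
[Figure: each cell holds three values S, I and D. Extending an existing gap costs -y, opening a new gap from S costs -x-y, and the diagonal edge into S carries $w(a_i, b_j)$.]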
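A sketch of the three-table computation for affine gap penalties, assuming the usual recurrences that match the edge labels on the slide: extending a gap costs y, opening one costs x + y, and S takes the best of an aligned pair, D and I. The default values x = 4 and y = 3 follow the earlier example; the function name and the boundary initialization are assumptions of this sketch, which returns the score only.

```python
NEG_INF = float("-inf")


def affine_align_score(a, b, x=4, y=3, match=8, mismatch=-5):
    """Optimal global alignment score when a gap of length k costs x + k*y."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):
        D[i][0] = -x - i * y        # one gap of length i
        S[i][0] = D[i][0]
    for j in range(1, n + 1):
        I[0][j] = -x - j * y        # one gap of length j
        S[0][j] = I[0][j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j] - y, S[i - 1][j] - x - y)
            I[i][j] = max(I[i][j - 1] - y, S[i][j - 1] - x - y)
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, D[i][j], I[i][j])
    return S[m][n]


if __name__ == "__main__":
    print(affine_align_score("CTTAACT", "CGGATCAT"))
    # With no gap-open penalty the score reduces to the simple scheme.
    print(affine_align_score("CTTAACT", "CGGATCAT", x=0))
```

Setting x = 0 makes D and I redundant, and the computation falls back to the simple per-symbol gap scoring used earlier.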

50 50 PAM (percent accepted mutation)
The scoring function is a critical element.
The simple scoring function above is not good enough.
Developing a new scoring function
–Based on a set of native protein sequences

51 51 20 x 20 substitution matrix

52 52 PAM-250

53 53 PAM – 1 The Dayhoff Matrix: Proteins evolve through a succession of independent point mutations, that are accepted in a population and subsequently can be observed in the sequence pool. (Dayhoff, M.O. et al. (1978) Atlas of Protein Sequence and Structure. Vol. 5, Suppl. 3 National Biomedical Research Foundation, Washington D.C. U.S.A). First step: Pair Exchange Frequencies A PAM (Percent Accepted Mutation) is one accepted point mutation on the path between two sequences, per 100 residues.

54 54 PAM – 2: Second step: Frequencies of occurrence of each amino acid
Computing the relative frequency of occurrence of amino acids over a large, sufficiently varied protein sequence set (i.e., occurrence in the sample space).
Amino acid   1978    1991
L            0.085   0.091
A            0.087   0.077
G            0.089   0.074
S            0.070   0.069
V            0.065   0.066
E            0.050   0.062
T            0.058   0.059
K            0.081   0.059
I            0.037   0.053
D            0.047   0.052
R            0.041   0.051
P            0.051   0.051
N            0.040   0.043
Q            0.038   0.041
F            0.040   0.040
Y            0.030   0.032
M            0.015   0.024
H            0.034   0.023
C            0.033   0.020
W            0.010   0.014

55 55 PAM – 3: Third step: Relative mutabilities
All values are taken relative to alanine, which is arbitrarily set at 100.
Amino acid   1978   1991
A            100    100
C            20     44
D            106    86
E            102    77
F            41     51
G            49     50
H            66     91
I            96     103
K            56     72
L            40     54
M            94     93
N            134    104
P            56     58
Q            93     84
R            65     83
S            120    117
T            97     107
V            74     98
W            18     25
Y            41     50

56 56 PAM – 4: Fourth step: Mutation Probability matrix The probability that an amino acid in row a of the matrix will replace the amino acid in column b : the mutability of amino acid b, multiplied by the pair exchange frequency for aa divided by the sum of all pair exchange frequencies for amino acid a: Matrix M is a Markov model of evolution

57 57 PAM – 5: Last step: the log-odds matrix
The logarithm is taken to base 10: a value of +1 means that the corresponding pair has been observed 10 times more frequently than expected by chance. The most commonly used matrix is the one from the 1978 edition of the Dayhoff atlas at PAM 250, frequently referred to as the PAM250 matrix. The k-PAM matrix is defined as

58 58 PAM-250

59 59 BLOSUM Matrices
–BLOSUM matrices are based on local alignments.
–BLOSUM 62 is a matrix calculated from aligned blocks in which sequences sharing 62% or more identity are clustered together, so it reflects comparisons of sequences that are less than 62% identical.
–All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
–BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

60 60 BLOSUM62 is the BLAST default

61 61 BLOSUM 62

62 62 Relationship between PAM & BLOSUM

63 63 Gap Penalty
Query Length   Substitution Matrix   Gap Costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (10,1)

64 64 k best local alignments
Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)
–linear-space version: sim (Huang and Miller, 1991)
–linear-space variants: sim2 (Chao et al., 1995); sim3 (Chao et al., 1997)
FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)
–linear-space band alignment (Chao et al., 1992)
BLAST (Altschul et al., 1990; Altschul et al., 1997)
–restricted affine gap penalties (Chao, 1999)

65 65 FASTA
1) Find runs of identities, and identify regions with the highest density of identities.
2) Re-score using the PAM matrix, and keep the top-scoring segments.
3) Eliminate segments that are unlikely to be part of the alignment.
4) Optimize the alignment in a band.

66 66 FASTA Step 1: Find runs of identities, and identify regions with the highest density of identities.

67 67 FASTA Step 2: Re-score using PAM matrix, and keep top scoring segments.

68 68 FASTA Step 3: Eliminate segments that are unlikely to be part of the alignment.

69 69 FASTA Step 4: Optimize the alignment in a band.

70 70 BLAST 1)Build the hash table for Sequence A. 2)Scan Sequence B for hits. 3)Extend hits.

71 71 BLAST Step 1: Build the hash table for Sequence A. (3-tuple example)
For DNA sequences: Seq. A = AGATCGAT (positions 1 to 8)
3-tuple   Positions in A
AAA
AAC
...
AGA       1
...
ATC       3
...
CGA       5
...
GAT       2, 6
...
TCG       4
...
TTT
For protein sequences: Seq. A = ELVIS
Add xyz to the hash table if Score(xyz, ELV) ≥ T;
Add xyz to the hash table if Score(xyz, LVI) ≥ T;
Add xyz to the hash table if Score(xyz, VIS) ≥ T;

72 72 BLAST Step 2: Scan sequence B for hits.
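The first two steps can be sketched for the DNA case, where a hit is an exact 3-tuple match. This is only an illustration of the filtering idea, not BLAST's actual implementation: the function names `build_kmer_index` and `find_hits` are hypothetical, a Python dictionary stands in for the hash table, and the protein case (neighboring words scoring at least T) is not covered.

```python
from collections import defaultdict


def build_kmer_index(seq_a, k=3):
    """Map each k-tuple of seq_a to the list of its 1-based start positions."""
    index = defaultdict(list)
    for start in range(len(seq_a) - k + 1):
        index[seq_a[start:start + k]].append(start + 1)
    return dict(index)


def find_hits(index, seq_b, k=3):
    """Scan seq_b and return (position in A, position in B) for every hit."""
    hits = []
    for start in range(len(seq_b) - k + 1):
        for pos_a in index.get(seq_b[start:start + k], []):
            hits.append((pos_a, start + 1))
    return hits


if __name__ == "__main__":
    index = build_kmer_index("AGATCGAT")
    print(index)
    print(find_hits(index, "TTGATC"))  # "TTGATC" is a made-up query for illustration
```

Only the k-tuples that actually occur in Sequence A are stored, so the table has at most len(A) - k + 1 entries.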

73 73 BLAST
Step 2: Scan sequence B for hits.
Step 3: Extend hits. Terminate if the score of the extension fades away.
BLAST 2.0 saves the time spent in extension, and considers gapped alignments.

74 74 Remarks Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments. The idea of filtration was used in both FASTA and BLAST.

75 75 Linear-space ideas Hirschberg, 1975; Myers and Miller, 1988 m/2

76 76 Two subproblems ½ original problem size m/2 m/4 3m/4

77 77 Four subproblems ¼ original problem size m/2 m/4 3m/4

78 78 Time and Space Complexity
Space: O(M+N)
Time: O(MN) × (1 + ½ + ¼ + …) = O(MN), since the series sums to at most 2.

79 79 Band Alignment Sequence B Sequence A

80 80 Band Alignment in Linear Space
The remaining subproblems are no longer only half of the original problem. In the worst case, this could cause an additional log n factor in time.

81 81 Band Alignment in Linear Space

82 82 Multiple sequence alignment (MSA)
The multiple sequence alignment problem is to simultaneously align more than two sequences.
Seq1: GCTC
Seq2: AC
Seq3: GATC
An alignment of the three sequences:
GC-TC
A---C
G-ATC

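83 83 How to score an MSA? Sum-of-Pairs (SP-score)
The SP-score of the alignment
GC-TC
A---C
G-ATC
is the sum of the scores of its three induced pairwise alignments:
Score = score(GC-TC, A---C) + score(GC-TC, G-ATC) + score(A---C, G-ATC)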
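A sketch of the SP-score. The slide does not say how the pairwise scores are computed, so this example assumes the simple scheme used earlier (+8, -5, -3) and additionally assumes that a column where both rows have a gap contributes 0; the function names are illustrative.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=8, mismatch=-5, gap=-3):
    """Score of the pairwise alignment induced by two rows of an MSA."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue               # assumption: gap-gap columns score 0
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total


def sp_score(rows):
    """Sum of the pairwise scores over all pairs of rows."""
    return sum(pair_score(r, s) for r, s in combinations(rows, 2))


if __name__ == "__main__":
    print(sp_score(["GC-TC", "A---C", "G-ATC"]))
```

For k rows of length L the sum has k(k-1)/2 terms, so evaluating the SP-score of a given alignment takes O(k^2 L) time.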

84 84 Defining scores for alignment columns
infocon [Stojanovic et al., 1999]
–Each column is assigned a score that measures its information content, based on the frequencies of the letters both within the column and within the alignment.
CGGATCAT-GGA
CTTAACATTGAA
GAGAACATAGTA

85 85 Defining scores (cont’d)
phylogen [Stojanovic et al., 1999]
–Columns are scored based on the evolutionary relationships among the sequences implied by a supplied phylogenetic tree.
[Figure: two example columns of T and C letters placed on a phylogenetic tree, one with Score = 1 and one with Score = 2.]

86 86 MSA for three sequences: an $O(n^3)$ algorithm

87 87 General MSA
For k sequences of length n: $O(n^k)$
NP-Complete (Wang and Jiang)
Exact multiple alignment algorithms are not feasible for many sequences.
Some approximation algorithms are given (e.g., an approximation ratio of 2 - l/k for any fixed l, by Bafna et al.).

88 88 Progressive alignment
A heuristic approach proposed by Feng and Doolittle. It iteratively merges the most similar pairs.
“Once a gap, always a gap”
[Figure: a tree over sequences A, B, C, D, E showing the order in which they are merged.]
The time for progressive alignment in most cases is roughly of the order of the time for computing all pairwise alignments, i.e., $O(k^2 n^2)$.

89 89 Concluding remarks
Three essential components of the dynamic-programming approach:
–the recurrence relation
–the tabular computation
–the traceback
The dynamic-programming approach has been used in a vast number of computational problems in bioinformatics.

