CS 5263 Bioinformatics Lecture 5: Local Sequence Alignment Algorithms
Poll: Who has learned and still remembers finite state machines/automata, regular grammars, and context-free grammars?
Roadmap: Review of last lecture. Local sequence alignment. Statistics of sequence alignment: substitution matrix, significance of alignment.
Bounded Dynamic Programming: restrict the DP to a band of width k around the diagonal of the matrix for x1…xM and y1…yN. O(kM) time. O(kM) memory, possibly O(M+k).
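Linear-space alignment: O(M+N) memory, 2MN time. Split at M/2; the optimal path crosses the split at k*, leaving two sub-problems of size M/2 by k* and M/2 by (N-k*).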
Graph representation of sequence alignment [figure: alignment graph from (0,0) to (3,4) with edges weighted by match, mismatch, and gap scores]. An optimal alignment is a longest path from (0, 0) to (m, n) on the alignment graph.
Question: If I change the scoring scheme, will it change the alignment? For example, from Match = 1, mismatch = gap = -2 to Match = 2, mismatch = gap = -1? Answer: Yes.
Proof: Let F1 be the score of an optimal alignment under the scoring scheme Match = m > 0, Mismatch = s < 0, Gap = d < 0. Let a1, b1, c1 be the numbers of matches, mismatches, and gaps in the alignment. Then F1 = a1*m + b1*s + c1*d.
Proof (cont'd): Let F2 be the score of a sub-optimal alignment under the same scoring scheme. Let a2, b2, c2 be the numbers of matches, mismatches, and gaps in that alignment, so F2 = a2*m + b2*s + c2*d. Let F1 = F2 + k, where k > 0.
Proof (cont'd): Now we change the scoring scheme, so that Match = m + 1, Mismatch = s + 1, Gap = d + 1.
Proof (cont'd): The new scores for the two alignments become: F1' = a1*(m+1) + b1*(s+1) + c1*(d+1) = a1*m + b1*s + c1*d + (a1+b1+c1) = F1 + L1, and F2' = a2*(m+1) + b2*(s+1) + c2*(d+1) = F2 + (a2+b2+c2) = F2 + L2, where L1 and L2 are the lengths (numbers of columns) of alignment 1 and alignment 2.
Proof (cont'd): F1' - F2' = F1 - F2 + (a1+b1+c1) - (a2+b2+c2) = k + L1 - L2. In order for F1' < F2', we need k + L1 - L2 < 0, i.e. L2 - L1 > k.
Proof (cont'd): This means that if F1 is greater than F2 by k under the original scoring scheme, but alignment 2 is at least (k+1) columns longer than alignment 1, then F2' will be greater than F1' under the new scoring scheme. We only need one example showing that two such alignments exist.
Example: alignment 1 has 2 matches and 3 mismatches, so F1 = 2m + 3s. Alignment 2 has 3 matches and 4 gaps, so F2 = 3m + 4d.
With m = 1, s = d = -2: F1 = 2 - 6 = -4 and F2 = 3 - 8 = -5, so F1 > F2.
With m = 2, s = d = -1: F1' = 4 - 3 = 1 and F2' = 6 - 4 = 2, so F2' > F1'.
Alignment 1 (F1 = 2x1 - 3x2 = -4, F1' = 2x2 - 3x1 = 1):
AACAG
| |
ATCGT
Alignment 2 (F2 = 3x1 - 4x2 = -5, F2' = 3x2 - 4x1 = 2):
AA-CAG-
 | | |
-ATC-GT
On the other hand, if we had doubled our scores, such that m' = 2m, s' = 2s, d' = 2d, then F1' = 2F1 and F2' = 2F2, and our alignment would not change.
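The counterexample can be checked numerically. The following is a minimal sketch, assuming a hypothetical helper `linear_score` (not from the lecture) that scores an already-gapped alignment column by column with a linear gap penalty.

```python
def linear_score(a, b, m, s, d):
    """Score a gapped alignment (two equal-length strings, '-' marks a gap).

    m, s, d are the match, mismatch, and per-position gap scores.
    """
    total = 0
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            total += d          # one gap position
        elif ca == cb:
            total += m          # match
        else:
            total += s          # mismatch
    return total

# The two alignments from the example above.
aln1 = ("AACAG", "ATCGT")        # 2 matches, 3 mismatches
aln2 = ("AA-CAG-", "-ATC-GT")    # 3 matches, 4 gaps

for m, s, d in [(1, -2, -2), (2, -1, -1), (2, -4, -4)]:
    f1 = linear_score(*aln1, m, s, d)
    f2 = linear_score(*aln2, m, s, d)
    print(f"m={m}, s={s}, d={d}: F1={f1}, F2={f2}")
```

Adding 1 to every score flips the ranking of the two alignments, while the doubled scheme (2, -4, -4) keeps alignment 1 ahead.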
Today How to model gaps more accurately? Local sequence alignment Statistics of alignment
What's a better alignment?
Alignment 1 (Score = 8 x m - 3 x d):
GACGCCGAACG
|||||   |||
GACGC---ACG
Alignment 2 (Score = 8 x m - 3 x d):
GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Both alignments get the same score. However, gaps usually occur in bunches. Examples: during evolution, chunks of DNA may be lost entirely; aligning genomic sequence vs. cDNA (reverse complementary to mRNA).
Model gaps more accurately. Current model: a gap of length n incurs penalty n x d. General: a convex gap function γ(n), in which each additional gap position costs no more than the previous one. E.g. γ(n) = c * sqrt(n).
General gap dynamic programming. Initialization: same. Iteration:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ \max_{k=0 \ldots i-1} F(k,j) - \gamma(i-k) \\ \max_{k=0 \ldots j-1} F(i,k) - \gamma(j-k) \end{cases}$$
Termination: same. Running time: O(N^2 M) (cubic). Space: O(NM) (the linear-space algorithm is not applicable).
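The cubic recurrence can be sketched directly. This is an illustrative sketch only: the function name `general_gap_score`, the match/mismatch defaults, and the boundary initialization (a single leading gap of length i or j) are assumptions, and `gamma` stands for the gap penalty function, passed in as a positive cost.

```python
import math

def general_gap_score(x, y, gamma, match=1, mismatch=-1):
    """Global alignment score with an arbitrary gap penalty gamma(n) > 0."""
    M, N = len(x), len(y)
    F = [[0.0] * (N + 1) for _ in range(M + 1)]
    # Assumed initialization: the boundary is one gap of the full length.
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            best = F[i - 1][j - 1] + s
            # A gap of length i-k ending at row i (x aligned to gaps).
            for k in range(i):
                best = max(best, F[k][j] - gamma(i - k))
            # A gap of length j-k ending at column j (y aligned to gaps).
            for k in range(j):
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[M][N]

# Square-root gap penalty with c = 1.
print(general_gap_score("AAAA", "AA", lambda n: math.sqrt(n)))
```

The two inner loops over k are what make each cell cost O(M + N) instead of O(1), which is where the cubic running time comes from.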
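Compromise: affine gaps. γ(n) = d + (n - 1)e, where d is the gap open penalty and e is the gap extension penalty (e < d). Example with Match: 2, Gap open: 5, Gap extension: 1.
Alignment 1 (score 8x2 - 5 - 2 = 9):
GACGCCGAACG
|||||   |||
GACGC---ACG
Alignment 2 (score 8x2 - 3x5 = 1):
GACGCCGAACG
|||| | | ||
GACG-C-A-CG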
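The two scores can be reproduced by scoring each gapped alignment column by column. This is a small sketch with a hypothetical helper `affine_alignment_score`; the mismatch score of -1 is an assumption (the example contains no mismatches, so it does not affect the result).

```python
def affine_alignment_score(a, b, match=2, mismatch=-1, d=5, e=1):
    """Score a gapped alignment under affine gaps: d to open, e to extend."""
    score = 0
    prev_gap = None  # which row ('a' or 'b') had a gap in the previous column
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            cur = "a" if ca == "-" else "b"
            # Extending a gap in the same row costs e; anything else opens one.
            score -= e if cur == prev_gap else d
            prev_gap = cur
        else:
            score += match if ca == cb else mismatch
            prev_gap = None
    return score

print(affine_alignment_score("GACGCCGAACG", "GACGC---ACG"))  # one gap of length 3
print(affine_alignment_score("GACGCCGAACG", "GACG-C-A-CG"))  # three gaps of length 1
```

Under the linear model both alignments tie; the affine model separates them because one long gap pays the opening cost only once.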
Additional states: The amount of state needed increases. In scoring a single entry in our matrix, we need to remember an extra piece of information: Are we continuing a gap in x? (If not, starting one is more expensive.) Are we continuing a gap in y? (If not, starting one is more expensive.) Are we continuing from a match between xi and yj?
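Finite State Automaton [figure]: three states: "xi and yj aligned", "xi aligned to a gap", and "yj aligned to a gap". Moving from the aligned state into either gap state costs d (gap open); staying in a gap state costs e (gap extension).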
Dynamic programming: We encode this information in three different matrices. For each element (i, j) we use three variables. F(i,j): score of the best alignment of x1..xi & y1..yj if xi aligns to yj. Ix(i,j): score of the best alignment of x1..xi & y1..yj if yj aligns to a gap. Iy(i,j): score of the best alignment of x1..xi & y1..yj if xi aligns to a gap.
$$F(i,j) = s(x_i, y_j) + \max \begin{cases} F(i-1,j-1) & \text{continuing alignment} \\ I_x(i-1,j-1) & \text{closing gaps in x} \\ I_y(i-1,j-1) & \text{closing gaps in y} \end{cases}$$
$$I_x(i,j) = \max \begin{cases} F(i,j-1) - d & \text{opening a gap in x} \\ I_y(i,j-1) - d & \text{opening a gap in x} \\ I_x(i,j-1) - e & \text{gap extension in x} \end{cases}$$
$$I_y(i,j) = \max \begin{cases} F(i-1,j) - d & \text{opening a gap in y} \\ I_x(i-1,j) - d & \text{opening a gap in y} \\ I_y(i-1,j) - e & \text{gap extension in y} \end{cases}$$
[Figure: the three matrices F, Ix, and Iy]
If we stack all three matrices, there is no cyclic dependency, so we can fill in all three matrices in order.
Algorithm:
for i = 1:M
  for j = 1:N
    Fill in F(i, j), Ix(i, j), Iy(i, j)
  end
end
Score = max (F(M, N), Ix(M, N), Iy(M, N))
Time: O(MN). Space: O(MN), or O(N) when combined with the linear-space algorithm.
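The three-matrix algorithm can be sketched as follows (score only, no traceback). The function name, the mismatch score of -1, and the boundary initialization are assumptions made for illustration; the matrix roles follow the definitions of F, Ix, and Iy given in the lecture.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, match=2, mismatch=-1, d=5, e=1):
    """Global alignment score with affine gaps (open d, extend e)."""
    M, N = len(x), len(y)
    F = [[NEG_INF] * (N + 1) for _ in range(M + 1)]   # xi aligned to yj
    Ix = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # yj aligned to a gap
    Iy = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # xi aligned to a gap
    F[0][0] = 0
    # Assumed boundaries: a single leading gap of length j (or i).
    for j in range(1, N + 1):
        Ix[0][j] = -d - (j - 1) * e
    for i in range(1, M + 1):
        Iy[i][0] = -d - (i - 1) * e
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = s + max(F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] - d, Iy[i][j - 1] - d, Ix[i][j - 1] - e)
            Iy[i][j] = max(F[i - 1][j] - d, Ix[i - 1][j] - d, Iy[i - 1][j] - e)
    return max(F[M][N], Ix[M][N], Iy[M][N])

# The affine-gap example sequences, with gaps removed from the second one.
print(affine_global_score("GACGCCGAACG", "GACGCACG"))
```

Each cell reads only cells with a smaller i or j, so a plain row-by-row fill respects all dependencies.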
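To simplify, merge Ix and Iy into a single matrix I:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ I(i-1,j-1) + s(x_i, y_j) \end{cases}$$
$$I(i,j) = \max \begin{cases} F(i,j-1) - d \\ I(i,j-1) - e \\ F(i-1,j) - d \\ I(i-1,j) - e \end{cases}$$
I(i, j): score of the best alignment between x1…xi and y1…yj if either xi or yj is aligned to a gap. This is possible when alternating gaps (a gap in x immediately followed by a gap in y, or vice versa) are not allowed.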
To summarize: Global alignment. Basic algorithm: Needleman-Wunsch. Variants (achieved by varying initial conditions or scoring): overlap detection, longest common subsequence. Bounded DP (pruning the search space). Linear space (divide-and-conquer). Affine gap penalty.
Local alignment
The local alignment problem: Given two strings X = x1……xM and Y = y1……yN, find substrings x', y' whose similarity (optimal global alignment value) is maximum. E.g. X = abcxdex, Y = xxxcde: X' = cxde, Y' = c-de.
Why local alignment? Conserved regions may be a small part of the whole: the "active site" of a protein, or scattered genes or exons among "junk". We may not have the whole sequence. Global alignment might miss such regions if flanking "junk" outweighs the similar regions.
Genes are shuffled between genomes [figure: genes A, B, C, D appearing in different orders in different genomes]
Naïve algorithm:
for all substrings X' of X and Y' of Y
  Align X' & Y' via dynamic programming
  Retain pair with max value
end
Output the retained pair
Time: O(M^2) choices for X', O(N^2) choices for Y', O(MN) for each DP, so O(M^3 N^3) total.
Reminder: the overlap detection algorithm. We do not penalize gaps at the ends (free end gaps).
Similarly here: the unaligned regions incur no penalty.
The big idea: Whenever we get to some bad region (negative score), we ignore the previous alignment and reset the score to zero.
The Smith-Waterman algorithm. Initialization: F(0, j) = F(i, 0) = 0. Iteration:
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j) - d \\ F(i,j-1) - d \\ F(i-1,j-1) + s(x_i, y_j) \end{cases}$$
The Smith-Waterman algorithm (cont'd). Termination: If we want the best local alignment, FOPT = max over i, j of F(i, j). If we want all local alignments scoring > t: for all i, j with F(i, j) > t, trace back.
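The initialization, iteration, and termination steps can be sketched as below. This is an illustrative sketch with a linear gap penalty; the function name and the tie-breaking order in the traceback are assumptions, so when several optimal local alignments exist it reports just one of them.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=1):
    """Return (best score, aligned x', aligned y') for one best local alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]  # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - gap, F[i][j - 1] - gap)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Trace back from the maximum entry until a zero is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

score, ax, ay = smith_waterman("abcxdex", "xxxcde")
print(score)
print(ax)
print(ay)
```

The usage lines run the lecture's example strings with Match: 2, Mismatch: -1, Gap: -1.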
The correctness of the algorithm can be proved by induction using the alignment graph.
Example (Match: 2, Mismatch: -1, Gap: -1) [figure: the DP matrix filled in cell by cell; the maximum entry is 5]
Trace back from the maximum entry gives two optimal local alignments:
cxde
| ||
c-de
and
x-de
| ||
xcde
Notes: There are no negative values in the local alignment DP array. An optimal local alignment will never have a gap on either end. Local alignment: "Smith-Waterman". Global alignment: "Needleman-Wunsch".
Analysis. Time: O(MN) for finding the best alignment; beyond that, it depends on the number of sub-optimal alignments reported. Memory: O(MN); O(M+N) is possible.
The statistics of alignment: Where does s(xi, yj) come from? Are two aligned sequences actually related?
Probabilistic model of alignments We’ll focus on protein alignments without gaps Given an alignment, we can consider two possibilities R: the sequences are related by evolution U: the sequences are unrelated How can we distinguish these possibilities? How is this view related to amino-acid substitution matrix?
Model for unrelated sequences: Assume each position of the alignment is independently sampled from some distribution of amino acids. ps: probability of amino acid s in the sequences. The probability of seeing an amino acid s aligned to an amino acid t by chance is Pr(s, t | U) = ps * pt. The probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn randomly is
$$\Pr(x, y \mid U) = \prod_{i=1}^{n} p_{x_i} \prod_{i=1}^{n} p_{y_i}$$
Model for related sequences: Assume each pair of aligned amino acids evolved from a common ancestor. Let qst be the probability that amino acid s in one sequence is related to t in another sequence. The probability of an alignment of x and y is given by
$$\Pr(x, y \mid R) = \prod_{i=1}^{n} q_{x_i y_i}$$
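Probabilistic model of alignments: How can we decide which possibility (U or R) is more likely? One principled way is to consider the relative likelihood of the two possibilities (the odds ratio):
$$\frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \frac{\prod_i q_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}} = \prod_{i=1}^{n} \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$
A higher ratio means that R is more likely than U.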
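Log odds ratio: Taking the log of the odds ratio, we get
$$\log \frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \sum_{i=1}^{n} \log \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$
Recall that the score of an alignment is given by
$$S = \sum_{i=1}^{n} s(x_i, y_i)$$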
Therefore, if we define, for amino acids a and b,
$$s(a, b) = \log \frac{q_{ab}}{p_a p_b}$$
we are actually defining the alignment score as the log odds ratio (log-likelihood ratio) between the two models R and U. This is indeed how biologists have defined the substitution matrices for proteins.
ps can be counted from the available protein sequences But how do we get qst? (the probability that s and t have a common ancestor) Counted from trusted alignments of related sequences
Protein Substitution Matrices: Two popular sets of matrices for protein sequences. PAM matrices [Dayhoff et al, 1978]: better for aligning closely related sequences. BLOSUM matrices [Henikoff & Henikoff, 1992]: for both closely and remotely related sequences.
Positive for chemically similar substitution Common amino acids get low weights Rare amino acids get high weights
BLOSUM-N matrices: Constructed from a database called BLOCKS, which contains many closely related sequences, so conserved amino acids may be over-counted. N = 62: the probabilities qst were computed using trusted alignments with no more than 62% identity (identity: % of matched columns). Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%).
If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with a higher N, say BLOSUM75. On the other hand, if you want to detect remote homology, you may want to use a lower N, say BLOSUM50. BLOSUM62 is the standard.
For DNAs No database of trusted alignments to start with Specify the percentage identity you would like to detect You can then get the substitution matrix by some calculation
For example: Suppose pA = pC = pT = pG = 0.25 and we want 88% identity. Then qAA = qCC = qTT = qGG = 0.88/4 = 0.22, and each of the remaining 12 pairs gets 0.12/12 = 0.01. Using the natural log, s(A, A) = s(C, C) = s(G, G) = s(T, T) = log (0.22 / (0.25*0.25)) = 1.26, and s(a, b) = log (0.01 / (0.25*0.25)) = -1.83 for a ≠ b.
Substitution matrix:
       A      C      G      T
A   1.26  -1.83  -1.83  -1.83
C  -1.83   1.26  -1.83  -1.83
G  -1.83  -1.83   1.26  -1.83
T  -1.83  -1.83  -1.83   1.26
Scale won't change the alignment. Multiply by 4 and then round off to get integers:
     A   C   G   T
A    5  -7  -7  -7
C   -7   5  -7  -7
G   -7  -7   5  -7
T   -7  -7  -7   5
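The calculation above can be reproduced in a few lines. This is a minimal sketch, assuming uniform base frequencies and a hypothetical helper `dna_log_odds` that spreads the target identity evenly over the 4 matching pairs and the remainder evenly over the 12 mismatching pairs.

```python
import math

def dna_log_odds(identity, p=0.25):
    """Return (match score, mismatch score) as natural-log odds ratios."""
    q_match = identity / 4             # each of the 4 identical pairs
    q_mismatch = (1 - identity) / 12   # each of the 12 non-identical pairs
    background = p * p                 # chance of any specific pair under model U
    return math.log(q_match / background), math.log(q_mismatch / background)

match, mismatch = dna_log_odds(0.88)
print(round(match, 2), round(mismatch, 2))        # raw log-odds scores
print(round(4 * match), round(4 * mismatch))      # scaled by 4 and rounded
```

Changing the target identity changes the ratio between the match and mismatch scores, which is what determines the kind of alignment the matrix favors.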
Arbitrary substitution matrix Say you have a substitution matrix provided by someone It’s important to know what you are actually looking for when you use the matrix
What's the difference? Which one should I use? Matrix 1 (over A, C, G, T): match = 1, mismatch = -2. Matrix 2 (over A, C, G, T): match = 5, mismatch = -4.
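We had $s(a, b) = \log \frac{q_{ab}}{p_a p_b}$. Scale it by a factor λ, so that $\lambda\, s(a, b) = \log \frac{q_{ab}}{p_a p_b}$. Reorganize: $q_{ab} = p_a p_b\, e^{\lambda s(a, b)}$.
Since all probabilities must sum to 1, we have $\sum_{a,b} p_a p_b\, e^{\lambda s(a, b)} = 1$. Suppose again $p_a = 0.25$ for any a. We know s(a, b) from the substitution matrix, so we can solve the equation for λ. Plug λ into $q_{ab} = p_a p_b\, e^{\lambda s(a, b)}$ to get $q_{ab}$.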
Matrix 1 (match = 1, mismatch = -2): λ = 1.33; qst = 0.24 for s = t, and 0.004 for s ≠ t. Translates to 95% identity.
Matrix 2 (match = 5, mismatch = -4): λ ≈ 0.19 (e^λ ≈ 1.21); qst = 0.16 for s = t, and 0.03 for s ≠ t. Translates to 65% identity.
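The scaling factor can be found numerically. The sketch below uses plain bisection on the normalization condition, assuming uniform base frequencies of 0.25 and a match/mismatch matrix whose expected score is negative (so a positive solution exists); the function names are hypothetical.

```python
import math

def solve_lambda(match, mismatch, p=0.25, lo=1e-6, hi=10.0, iters=200):
    """Solve sum over pairs of p*p*exp(lam*score) = 1 for the positive root."""
    def total(lam):
        # 4 matching pairs and 12 mismatching pairs over {A, C, G, T}.
        return 4 * p * p * math.exp(lam * match) + 12 * p * p * math.exp(lam * mismatch)
    # total(lam) is below 1 just above 0 and above 1 for large lam.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if total(mid) < 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def implied_identity(match, mismatch, p=0.25):
    """Target identity implied by the matrix: 4 * q for a matching pair."""
    lam = solve_lambda(match, mismatch, p)
    return 4 * p * p * math.exp(lam * match)

for m, s in [(1, -2), (5, -4)]:
    print(m, s, round(solve_lambda(m, s), 2), round(implied_identity(m, s), 2))
```

The value 0 is always a trivial solution of the equation, which is why the search starts slightly above it.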