Sparse Normalized Local Alignment Nadav Efraty Gad M. Landau.


1 Sparse Normalized Local Alignment Nadav Efraty Gad M. Landau

2 Background - Global similarity: LCS (Longest Common Subsequence)

An LCS dynamic-programming table for X = CABADEB (the rightmost column and the all-zero row are the boundary; the maximal entry, 4, is the length of the LCS):

      C  A  B  A  D  E  B
   0  0  0  0  0  0  0  0
C  1  0  0  0  0  0  0  0
B  1  1  1  1  1  1  1  0
A  2  2  2  2  1  1  1  0
D  2  2  2  2  2  1  1  0
A  3  3  3  3  2  1  1  0
C  4  3  3  3  2  1  1  0
B  4  4  4  3  2  1  1  0
D  4  4  4  3  2  1  1  0
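A table like the one above can be produced with the textbook LCS recurrence. A minimal Python sketch (code is not part of the original slides; the strings are illustrative, with the second string read off the table's rows):

```python
def lcs_length(x: str, y: str) -> int:
    """Classic O(|x|*|y|) dynamic program for the length of the
    longest common subsequence (LCS) of x and y."""
    m, n = len(y), len(x)
    # t[i][j] = LCS length of y[:i] and x[:j]
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if y[i - 1] == x[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t[m][n]

print(lcs_length("CABADEB", "DBCADABC"))  # 4, e.g. the subsequence CADB
```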

3 Background - LCS milestones

(1977) Hirschberg - Algorithms for the longest common subsequence problem.
(1977) Hunt, Szymanski - A fast algorithm for computing longest common subsequences.
(1987) Apostolico, Guerra - The longest common subsequence problem revisited.
(1992) Eppstein, Galil, Giancarlo, Italiano - Sparse dynamic programming I: linear cost functions.

4 Background - Global vs. local

Global alignment algorithms compute the similarity grade of the entire input strings, by computing the best path from the first to the last entry of the table. Local alignment algorithms report the most similar substring pair according to their scoring scheme.

5 Background - The Smith-Waterman algorithm (1981)

T(i,0) = T(0,j) = 0, for all i,j (1 ≤ i ≤ m; 1 ≤ j ≤ n)
T(i,j) = max{0, T(i-1,j-1) + S(Y_i,X_j), T(i-1,j) + D(Y_i), T(i,j-1) + I(X_j)}

Scoring scheme: D(Y_i) = I(X_j) = -0.4; S(Y_i,X_j) = 1 if Y_i = X_j, -0.3 if Y_i ≠ X_j.

Example table for X = CBBADECEA:

     C    B    B    A    D    E    C    E    A
     0    0    0    0    0    0    0    0    0    0
A    0    0.2  0.6  1    0    0    0.2  0.6  1    0
B    1.2  1.6  2    0.6  0    0    0.3  0.7  0.6  0
C    2.6  1.7  1.6  0.5  0.9  1.3  1.7  0.3  0.2  0
D    2.2  1.3  1.5  1.9  2.3  1.4  1.3  0    0    0
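The recurrence and scoring scheme above translate directly into code. A minimal sketch (not part of the slides) using the slide's parameters; the strings are illustrative:

```python
def smith_waterman(x: str, y: str,
                   match=1.0, mismatch=-0.3, gap=-0.4) -> float:
    """Best local-alignment score under the Smith-Waterman recurrence
    T(i,j) = max{0, T(i-1,j-1)+S, T(i-1,j)+gap, T(i,j-1)+gap}."""
    m, n = len(y), len(x)
    t = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if y[i - 1] == x[j - 1] else mismatch
            t[i][j] = max(0.0,
                          t[i - 1][j - 1] + s,   # substitution / match
                          t[i - 1][j] + gap,     # deletion
                          t[i][j - 1] + gap)     # insertion
            best = max(best, t[i][j])
    return best

# Aligning CBA against the CBBA inside X scores 1 + 1 - 0.4 + 1 = 2.6,
# the maximal entry of the table above.
print(smith_waterman("CBBADECEA", "DCBA"))
```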

6 Background - The weaknesses of the Smith-Waterman algorithm (according to Arslan, Eğecioğlu and Pevzner):

Maximal score vs. maximal degree of similarity: which reflects a higher level of similarity, 71 (score) over 10,000 (symbols), or 70 over 200?
Mosaic effect - lack of ability to discard poorly conserved intermediate segments: segments scoring 40, -30 and 31 combine into a single alignment scoring 41, outranking the well-conserved segment scoring 40 alone. This cannot be fixed by post-processing.
Shadow effect - short alignments may not be detected because they are overlapped by longer alignments (e.g. a 40/100 alignment shadowed by a 70/10000 one).
The sparsity of the essential data is not exploited.

7 Background - Normalized local alignment

The statistical significance of a local alignment depends on both its score and its length. Thus, the solution to these weaknesses is normalization: instead of maximizing S(X',Y'), maximize S(X',Y')/(|X'|+|Y'|). Under that scoring scheme a single match is always an optimal alignment, so a minimal-length or minimal-score constraint is needed.

8 Background - Normalized sequence alignment

The algorithm of Arslan, Eğecioğlu and Pevzner (2001) converges to the optimal normalized alignment value through iterations of the Smith-Waterman algorithm. They solve the problem of maximizing SCORE(X',Y')/(|X'|+|Y'|+L), where L is a constant that controls the amount of normalization: the ratio between L and |X'|+|Y'| determines the influence of L on the value of the alignment. The time complexity of their algorithm is O(n² log n).

9 Our approach

Maximize LCS(X',Y')/(|X'|+|Y'|). It can be viewed as a measure of the density of the matches. A minimal length or score constraint, M, must be enforced; we chose the score constraint (the value of LCS(X',Y')). The value of M is problem related.

10 The naïve O(rL log log n) normalized local LCS algorithm

11 Definitions

A chain is a sequence of matches that is strictly increasing in both components.
The length of a chain from match (i,j) to match (i',j') is (i'-i)+(j'-j)+2, i.e. the total length of the two substrings it spans.
A k-chain (i,j) is the shortest chain of k matches starting from (i,j).
The normalized value of k-chain (i,j) is k divided by its length.
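In code, a match is just a coordinate pair and the two quantities defined above are one-liners. A sketch (not from the slides; the +2 counts the two matched endpoint symbols, consistent with a perfect alignment having value 1/2 as stated on slide 28):

```python
from fractions import Fraction

def chain_length(first, last):
    """Length of a chain from match (i, j) to match (i2, j2):
    the total number of symbols of the two substrings it spans."""
    (i, j), (i2, j2) = first, last
    return (i2 - i) + (j2 - j) + 2

def normalized_value(k, first, last):
    """Normalized value of a k-chain: k divided by its length."""
    return Fraction(k, chain_length(first, last))

# A perfect alignment of two identical length-3 strings:
# matches (1,1),(2,2),(3,3) form a 3-chain of length 6, value 1/2.
print(normalized_value(3, (1, 1), (3, 3)))
```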

12 The naïve algorithm

For each match (i,j), construct k-chain (i,j) for 1 ≤ k ≤ L (L = LCS(X,Y)). Computing the best chains starting from each match guarantees that the optimal chain will not be missed.
Examine all the k-chains with k ≥ M of all matches and report either:
the k-chains with the highest normalized value, or
the k-chains whose normalized value exceeds a predefined threshold.
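A direct rendering of this scheme, as a sketch only (O(r²L) rather than the paper's O(rL log log n); all names are illustrative): build the shortest k-chain from every match by extending it with the shortest k-chain of a match strictly below and to its right, then report the best normalized value among chains of at least M matches.

```python
from fractions import Fraction

def best_normalized_lcs(x: str, y: str, m_min: int):
    """Naive normalized local LCS: for every match (i, j) (1-indexed,
    y[i-1] == x[j-1]) compute the shortest k-chain starting there,
    then maximise k / length over all chains with k >= m_min."""
    matches = [(i, j) for i in range(1, len(y) + 1)
                      for j in range(1, len(x) + 1) if y[i - 1] == x[j - 1]]
    # shortest[(i, j)][k] = length of the shortest k-chain from (i, j)
    shortest = {p: {1: 2} for p in matches}  # one match spans 2 symbols
    k = 1
    while True:
        grew = False
        for (i, j) in matches:
            # (k+1)-chain: this match, then the shortest k-chain of a
            # match strictly below and to the right of it.
            cands = [(i2 - i) + (j2 - j) + shortest[(i2, j2)][k]
                     for (i2, j2) in matches
                     if i2 > i and j2 > j and k in shortest[(i2, j2)]]
            if cands:
                shortest[(i, j)][k + 1] = min(cands)
                grew = True
        if not grew:
            break
        k += 1
    values = [Fraction(c, ln) for ch in shortest.values()
              for c, ln in ch.items() if c >= m_min]
    return max(values, default=None)

print(best_normalized_lcs("aab", "ab", 2))  # best 2-chain: (1,2)->(2,3)
```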

13 Problem: k-chain (i,j) is not necessarily the prefix of (k+1)-chain (i,j).

14 Solution: construct (k+1)-chain (i,j) by concatenating (i,j) to a k-chain (i',j').

15 Question: how do we find the proper k-chain (i',j')?

If there is only one candidate ((i,j) is in the range of a single match (i',j')), the choice is clear. But what if there are two candidates ((i,j) is in the mutual range of two matches)?

16 Lemma: a mutual range of two matches is owned completely by one of them.

17 We use the lemma in order to maintain L data structures. In the k-th data structure:
All the matches are the heads of k-chains.
Each match owns the range to its left.
Computing the (k+1)-chain of a match is done by concatenating it to the owner of the range it is in.

18 The algorithm

Preprocessing: create the list of matches of each row.
Process the matches row by row, from the bottom up. For the matches of row i:
Stage 1: construct k-chains, 1 ≤ k ≤ L.
Stage 2: update the data structures with the matches of row i and their k-chains; they will be used for the computation of the next rows.
Finally, examine all k-chains of all matches and report the ones with the highest normalized value.
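One way to sketch the per-k structure in code: extending a query match (i0,j0) by a head (i,j) whose shortest k-chain has length len costs (i-i0)+(j-j0)+len = (i+j+len)-(i0+j0), so heads compare by the query-independent key i+j+len, and keeping only heads whose keys increase with their column makes the owner of a match simply the nearest head to the right of its column. This is a simplified stand-in, not the paper's structure: a sorted list with `bisect` gives O(log r) successor queries where the Johnson trees give O(log log n); the class and its names are illustrative.

```python
import bisect

class KHeads:
    """Heads of k-chains for one value of k.  Invariant: columns are
    sorted and keys (i + j + chain length) strictly increase with the
    column, so the owner of column j0 is the nearest head to its right."""

    def __init__(self):
        self.cols, self.keys, self.heads = [], [], []

    def owner(self, j0):
        """The head owning column j0 (nearest head with column > j0)."""
        p = bisect.bisect_right(self.cols, j0)
        return self.heads[p] if p < len(self.heads) else None

    def insert(self, i, j, length):
        key = i + j + length
        p = bisect.bisect_left(self.cols, j)
        if p < len(self.cols) and self.keys[p] <= key:
            return  # a head to the right already owns this head's range
        while p > 0 and self.keys[p - 1] >= key:
            p -= 1  # the new head takes over the mutual range: evict
            del self.cols[p], self.keys[p], self.heads[p]
        self.cols.insert(p, j)
        self.keys.insert(p, key)
        self.heads.insert(p, (i, j, length))

s = KHeads()
s.insert(5, 6, 2)   # head (5,6), key 13
s.insert(4, 9, 2)   # head (4,9), key 15: kept, keys increase with column
s.insert(6, 8, 2)   # key 16, but the head at column 9 is better: skipped
print(s.owner(3))   # the head at column 6, nearest to the right of 3
```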

19 Complexity analysis

Preprocessing: O(n log|Σ_Y|).
Stage 1: for each of the r matches we construct at most L k-chains, with total complexity of O(rL log log n) when Johnson trees are used as our data structures.
Stage 2: each of the r matches is inserted into and extracted from each of the data structures at most once, and the total complexity is again O(rL log log n).

20 Complexity analysis

Checking all k-chains of all matches and reporting the best alignments consumes O(rL) time.
Total time complexity of this algorithm is O(n log|Σ_Y| + rL log log n).
Space complexity is O(rL + nL): r matches with (at most) L records each, plus the space of L Johnson trees of size n.

21 The O(rM log log n) normalized local LCS algorithm

22 The algorithm reports the best possible local alignment (value and substrings). This section is divided into:
1. Computing the highest normalized value.
2. Constructing the longest optimal alignment.

23 Computing the highest normalized value

Definition: a sub-chain of a k-chain is a path that contains a sequence of x ≤ k consecutive matches of the k-chain. It does not have to start or end at a match.

24 Computing the highest normalized value

Claim: when a k-chain is split into a number of non-overlapping consecutive sub-chains, the value of the k-chain is at most equal to the value of the best sub-chain.
Example: a 10-chain of length 40 split into sub-chains of 3, 2, 3 and 2 matches with lengths 14, 5, 12 and 9: 10/40 = (3+2+3+2)/(14+5+12+9), and the best sub-chain, 2/5, exceeds 10/40.
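The claim is an instance of the mediant inequality: if a = a1+…+at and b = b1+…+bt with all bi > 0, then a/b ≤ max_i (ai/bi). A quick numeric check (added here, not in the slides) with the split of the 10-chain of length 40:

```python
from fractions import Fraction

nums, dens = [3, 2, 3, 2], [14, 5, 12, 9]   # the four sub-chains
whole = Fraction(sum(nums), sum(dens))       # the 10-chain: 10/40
best_sub = max(Fraction(a, b) for a, b in zip(nums, dens))
print(whole, "<=", best_sub)                 # 1/4 <= 2/5
```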

25 Computing the highest normalized value

Result: any k-chain with k ≥ M may be split into non-overlapping consecutive sub-chains of M matches, followed by a last sub-chain of up to 2M-1 matches. The normalized value of the best sub-chain will be at least equal to that of the k-chain.
Example (M = 3): a 10-chain of length 40 splits as 10/40 = (3+3+4)/(12+14+14).

26 Computing the highest normalized value

A sub-chain of fewer than M matches may not be reported. Sub-chains of 2M matches or more can be split into shorter sub-chains of M to 2M-1 matches.
Question: is it sufficient to construct all the sub-chains of exactly M matches? No: for M = 3, a 4-chain of value 4/10 beats its best 3-match sub-chain of value 3/8, so chains of up to 2M-1 matches must be constructed.

27 Computing the highest normalized value

The algorithm: for each match, construct all the k-chains for k ≤ 2M-1. These chains are, in fact, the sub-chains of all the longer k-chains, and a longer chain cannot be better than its best sub-chain. The algorithm therefore reports the highest normalized value of a sub-chain, which is equal to the highest normalized value of a chain.
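Capping the construction at 2M-1 matches is a one-line change to a naive chain builder. A sketch (illustrative names, O(r²·k) construction rather than the paper's algorithm) that also checks, on a small example, that the cap does not change the reported optimum:

```python
from fractions import Fraction

def best_value(x, y, m_min, k_cap=None):
    """Best normalized value over k-chains with k >= m_min, building
    chains only up to k_cap matches (None = no cap)."""
    matches = [(i, j) for i in range(1, len(y) + 1)
                      for j in range(1, len(x) + 1) if y[i - 1] == x[j - 1]]
    shortest = {p: {1: 2} for p in matches}
    k = 1
    while k_cap is None or k < k_cap:
        grew = False
        for (i, j) in matches:
            cands = [(i2 - i) + (j2 - j) + shortest[(i2, j2)][k]
                     for (i2, j2) in matches
                     if i2 > i and j2 > j and k in shortest[(i2, j2)]]
            if cands:
                shortest[(i, j)][k + 1] = min(cands)
                grew = True
        if not grew:
            break
        k += 1
    vals = [Fraction(c, l) for d in shortest.values()
            for c, l in d.items() if c >= m_min]
    return max(vals, default=None)

M = 2
x, y = "abcabc", "acbacb"
capped = best_value(x, y, M, k_cap=2 * M - 1)
print(capped == best_value(x, y, M))  # the cap preserves the optimum
```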

28 Constructing the longest optimal alignment Definition: A perfect alignment is an alignment of two identical strings. Its normalized value is ½. Lemma: unless the optimal alignment is perfect, the longest optimal alignment has no more than 2M-1 matches.

29 Constructing the longest optimal alignment

Proof: assume there is a chain with more than 2M-1 matches whose normalized value is optimal; denote it LO. LO may be split into a number of sub-chains of M matches, followed by a single sub-chain of between M and 2M-1 matches. The normalized value of each such sub-chain must be equal to that of LO; otherwise, LO is not optimal. Each such sub-chain must start and end at a match; otherwise, the normalized value of the chain comprised of the same matches would be higher than that of LO (e.g. padding a 10/30 chain with match-free stretches of lengths 2 and 3 gives 10/35 < 10/30).

30 Constructing the longest optimal alignment

The tails and heads of the sub-chains from which LO is comprised must be next to each other. Now take a sub-chain of M matches and length S together with the first match of the next sub-chain: the resulting chain has M+1 matches and length S+2. Since LO is not perfect, M/S < 1/2, and therefore M/S < (M+1)/(S+2). Thus we found a chain of M+1 matches whose normalized value is higher than that of LO, in contradiction to the optimality of LO.

31 Closing remarks

32 The advantages of the new algorithm

Ideal for textual local comparison as well as for screening biological sequences.
Normalized, and thus does not suffer from the shadow and mosaic effects.
A straightforward approach to the minimal constraint.

33 The advantages of the new algorithm

The minimal constraint is problem related rather than input related. If we treat it as a constant, the complexity of the algorithm is O(r log log n). Since for textual comparison we can expect r << n², the complexity may be even better than that of the non-normalized local similarity algorithms.

34 The advantages of the new algorithm

The O(rM log log n) algorithm computes the optimal normalized alignments. The advantage of the O(rL log log n) algorithm is that it can report all the long alignments that exceed a predefined value, and not only the short optimal alignments.

35 Questions
