Presentation is loading. Please wait.

Presentation is loading. Please wait.

BWT-Transformation What is BWT-transformation? BWT string compression

Similar presentations


Presentation on theme: "BWT-Transformation What is BWT-transformation? BWT string compression"— Presentation transcript:

1 BWT-Transformation What is BWT-transformation? BWT string compression
BWT string matching - RankAll - BWT array construction

2 BWT transformation We use s to denote a string that we would like to transform. Assume that s terminates with a special character $, which does not appear elsewhere in s and is alphabetically prior to all other characters. In the case of DNA sequences, we have $ < A < C < G < T.

3 Fig. 1: Rotation of a string
BWT transformation As an example, consider s = acagaca$. We can rotate s consecutively to create eight different strings as shown in Fig. 1(a). Last column First column Fig. 1: Rotation of a string $ a c a g a c a a c a g a c a $ c a g a c a $ a a g a c a $ a c g a c a $ a c a a c a $ a c a g c a $ a c a g a a $ a c a g a c a1 $ c1 a1 c2 a3 a2 c1 $ a4 a3 g1 a4 c2 g1 a2 F L

4 BWT transformation By writing all these strings stacked vertically, we generate an n  n matrix, where n = |s| (see Fig. 1(a).) Here, special attention should be paid to the first column, denoted as F, and the last column, denoted as L. For them, the following equation, called the LF mapping, can be immediately observed: F[i] = L[i]’s predecessor, (1) where F[i] (L[i]) is the ith element of F (resp. L).

5 Fig. 2: Sorting the rows of the matrix
BWT transformation Now we sort the rows of the matrix alphabetically. We will get another matrix, called the Burrow-Wheeler Matrix and denoted as BWM(s). Especially, the last column of BWM(s), read from top to bottom, is called the BWT-transformation (or the BWT-array) and denoted as BWT(s). So for s = acagaca$, we have BWT(s) = acg$caaa. Fig. 2: Sorting the rows of the matrix $ a c a g a c a a $ a c a g a c c a $ a c a g a a c a $ a c a g g a c a $ a c a a g a c a $ a c c a g a c a $ a a c a g a c a $

6 Fig. 3: LF-mapping and tank-correspondence
BWT transformation By the BWM matrix, the LF-mapping is obviously not changed. Surprisingly, the rank correspondence also remains. Even though the ranks of different appearances of a certain character (in F or in L) may be different from before, their rank correspondences are not changed as shown in Fig. 2(b), in which a2 now appears in both F and L as the third element among all the a-characters, and c1 the second element among all the c-characters. Fig. 3: LF-mapping and tank-correspondence $ a4 a4 c2 c2 a3 a3 g1 g1 a2 a2 c1 c1 a1 a1 $ F L - 1 2 4 3 rkF By ranking the elements in F, each element in L is also ranked with the same number. rkL F$ = <$; 1, 1> Fa = <a; 2, 5> Fc = <c; 6, 7> Fg = <g; 8, 8>

7 BWT string compression
The first purpose of BWT(s) is for the string compression since same characters with similar right-contexts in s tend to be clustered together in BWT(s), as shown by the following example: BWT(tomorrow and tomorrow and tomorrow) = wwwdd nnoooaatttmmmrrrrrrooo $ooo

8 BWT string matching For the purpose of the string search, the character clustering in F has to be used. Especially, for any DNA sequence, the whole F can be divided into five or less segments: $-segment, A-segment, C-segment, G-segment, and T-segment, denoted as F$, FA, FC, FG, FT, respectively. In addition, for each segment in F, we will rank all its elements from top to bottom, as illustrated in Fig. 2(a). $ is not ranked since it appears only once. F$ = <$; 1, 1> Fa = <a; 2, 5> Fc = <c; 6, 7> Fg = <g; 8, 8>

9 BWT string matching From Fig. 2(a), we can see that the rank of a4, denoted as rkF(a4), is 1 since it is the first element in FA. For the same reason, we have rkF(a3) = 2, rkF(a1) = 3, rkF(a2) = 4, rkF(c2) = 1, rkF(c1) = 2, and rkF(g1) = 1. It can also be seen that each segment in F can be effectively represented as a triplet of the form: <; x, y>, where     {$}, and x, y are the positions of the first and last appearance of  in F, respectively. Now we consider j (the jth appearance of  in s). Assume that rkF(j) = i. Then, the position where j appears in F can be easily determined: F[x + i - 1] = j. (2)

10 BWT string matching In addition, if we rank all the elements in L top-down in such a way that an j is assigned i if it is the ith appearance among all the appearances of  in L. Then, we will have rkF(j) = rkL(j), (3) where rkL(j) is the rank assigned to j in L. With the ranks established, a string matching can be very efficiently conducted by using the formulas (2) and (3).

11 BWT string matching To see this, let’s consider a pattern string p = aca and try to find all its occurrences in s = acagaca$. First, we check p[3] = a in the pattern string p, and then figure out a segment in L, denoted as L, corresponding to Fa = <a; 2, 5>. So L = L[2 .. 5], as illustrated in Fig. 3(a), where we still use the non-compact F for ease of explanation. In the second step, we check p[2] = c, and then search within L to find the first and last c in L. We will find rkL(c2) = 1 and rkL(c1) = 2. By using (3), we will get rkF(c2) = 1 and rkF(c1) = 2. Then, by using (2), we will figure out a sub-segment F in F: F[xc xc ] = F[ ] = F[6 .. 7]. (Note that xc = 6. See Fig. 3(b).)

12 BWT string matching In the third step, we check p[1] = a, and find L = L[6 .. 7] corresponding to F = F[6 .. 7]. Repeating the above operation, we will find rkL(a3) = 2 and rkL(a1) = 3. See Fig. 3(c). Since now we have exhausted all the characters in p and F[xa + 2 – 1, xa + 3 – 1] = F[3, 4] contains only two elements, two occurrences of p in s are found, corresponding to a1 and a3 in s, respectively.

13 BWT string matching Fig. 3: Sample trace $ a4 1 a4 c2 1 c2 a3 2
a3 g1 1 g1 a2 4 a2 c1 2 c1 a1 3 a1 $ - F L To find the first c to find the last c to find the first a to find the last a rkL

14 Fig. 4: LF-mapping and rank-correspondence
RankAll The dominant cost of the above process is the searching of L in each step. However, this can be dramatically reduced by arranging || arrays each for a character    such that [i] (the ith entry in the array for ) is the number of appearances of  within L[1 .. i]. See Fig. 4(a) for illustration.respectively. Fig. 4: LF-mapping and rank-correspondence $ a4 a4 c2 c2 a3 a3 g1 g1 a2 a2 c1 c1 a1 a1 $ F L 1 $ 2 4 3 a c g t Aa Ac Ag At i 6 8 5 7 j For each  = 4 values in L, a rankAll value is stored.

15 RankAll Now, instead of scanning a certain segment L[x .. y] (x  y) to find a subrange for a certain   , we can simply look up the array for  to see whether [x - 1] = [y]. If it is the case, then  does not occur in L[x .. y]. Otherwise, [[x - 1] + 1, [y]] should be the found range. For example, to find the first and the last appearance of c in L[2 .. 5], we only need to find c[2 – 1] = c[1] = 0 and c[5] = 2. So the corresponding range is [c[2 - 1] + 1, c[4]] = [1, 2].

16 RankAll The problem of this method is the high space requirement, which can be mitigated by storing a compact array A for each   , in which, rather than for each L[i], only for some elements in L the number of their appearances will be stored. For example, we can divide L into a set of buckets of the same size and only for each bucket a value will be stored in A. 4 1 Aa 2 Ac Ag At i For each  = 4 values in L, a rankAll value is stored.

17 RankAll Obviously, doing so, more search will be required. In practice, the size  of a bucket (referred to as a compact factor) can be set to different values. For example, we can set  = 4, indicating that for each four contiguous elements in L a group of || integers (each in an A) will be stored. However, each [j] for    can be easily derived from A by using the following formulas: [j] = A[i] + r, (4) where i = j/ and r is the number of ’s appearances within L[i .. j], and [j] = A[i] - r, (5) where i = j/ and r is the number of ’s appearances within L[j .. i].

18 Construction of Arrays
As mentioned above, a string s = c0c1 ... cn−1 is always ended with $ (i.e., ci   for i = 0, …, n – 2, and cn−1 = $). Let s[i] = ci, i =0,1, …, n − 1, be the ith character of s, s[i.. j] = ci ... cj substring and si = s[i .. n − 1] a suffix of s. Suffix array A of s is a permutation of the integers 0, ..., n − 1 such that A[i] is the start position of the ith smallest suffix. The relationship between A and the BWT array L can be determined by the following formulas: L[i] = $, (6) if A[i] = 0; L[i] = s[A[i] – 1], otherwise. Once L is determined. F can be created immediately by using formula (1).


Download ppt "BWT-Transformation What is BWT-transformation? BWT string compression"

Similar presentations


Ads by Google